====== biomarkerFromLefse ======
===== Introduction =====
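This tool uses the fuzzywuzzy library to find the biomarkers reported by LEfSe in the corresponding abundance tables by fuzzy name matching. It works around the fact that LEfSe renames features while processing the data.
Reference paths: /TJPROJ5/META_ASS/script/yuxi/script/mediation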
/TJPROJ7/META_ASS/16s/yuxi/X101SC21124468-Z01/X101SC21124468-Z01-mediation
==== Software requirements ====
Python 3.10.0
fuzzywuzzy 0.18.0
python-Levenshtein 0.23.0 (optional. Without it, fuzzywuzzy falls back to the pure-Python SequenceMatcher, which can become slow on large datasets. python-Levenshtein is an optional dependency of fuzzywuzzy that provides a faster implementation.)
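As a rough illustration of what the similarity score measures, the sketch below computes a 0-100 score with the standard library's `difflib.SequenceMatcher`. The helper name `ratio_score` is hypothetical and is not part of fuzzywuzzy; it only approximates what `fuzz.ratio` returns on the pure-Python path, and the example strings are made up.

```python
from difflib import SequenceMatcher


def ratio_score(a: str, b: str) -> int:
    """Return a 0-100 similarity score for two strings.

    Hypothetical helper: an approximation of fuzz.ratio using only the
    standard library, not fuzzywuzzy's own code.
    """
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Identical strings score 100; a single extra character lowers the score.
print(ratio_score("Bacteroides", "Bacteroides"))
print(ratio_score("Bacteroides", "Bacteroides_"))
```

Scores from an accelerated backend may differ slightly in edge cases, so thresholds are best checked against real data.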
==== Usage ====
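Prepare the LDA.control-disease.draw.res file, which consists of five tab-separated columns:
Column 1: biomarker name;
Column 2: log10 of the largest mean abundance across the groups (a mean abundance below 10 is treated as 10);
Column 3: name of the group in which the differential taxon is enriched;
Column 4: LDA score;
Column 5: p-value of the Kruskal-Wallis rank-sum test, or "-" if the feature is not a biomarker.
The scripts filter these rows to obtain the biomarkers, compare each biomarker name against the relative-abundance table of its taxonomic level, and write the matching abundance rows to a single output file.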
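The selection step can be sketched in isolation as follows. This is a minimal sketch, not the scripts themselves: the function names `select_biomarkers` and `table_path` and the two demo rows are hypothetical, and it assumes the last dot-separated part of a feature name has the form `<level>__<name>`, as the scripts do.

```python
from typing import List, Tuple


def select_biomarkers(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (level, name) pairs for rows whose columns 2-5 are all filled."""
    selected = []
    for line in lines:
        columns = line.strip().split("\t")
        if len(columns) >= 5 and all(columns[1:5]):
            last = columns[0].split(".")[-1]      # last taxonomic level
            selected.append((last[0], last[3:]))  # level letter, bare name
    return selected


def table_path(directory: str, level: str) -> str:
    """Build the path of the relative-abundance table for one level."""
    return f"{directory}/featureTable.sample.{level}.relative.xls"


# Hypothetical rows: the first is a biomarker, the second is not
# (columns 3 and 4 are empty and column 5 is "-").
demo = [
    "k__Bacteria.p__Bacteroidetes.g__Bacteroides\t4.2\tdisease\t3.6\t0.01\n",
    "k__Bacteria.p__Firmicutes\t5.1\t\t\t-\n",
]

for level, name in select_biomarkers(demo):
    print(level, name, table_path("featureTable_Relative", level))
```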
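After setting the file names, run main.py (or python3 main_command_line.py res featureTable_Relative). cosSimilarity.py performs the comparison with cosine similarity instead, but it is insensitive to special characters and is not recommended.
**Note: if fewer biomarkers are selected than the res file contains, the similarity threshold used for the fuzzywuzzy comparison is too high. The default is 90, which covers most names affected by special characters and most fuzzy or incorrect matches; lower it as needed.**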
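To see how the threshold affects matching, the sketch below scores a made-up pair of names with the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`. The helpers `ratio_score` and `first_match` are hypothetical, and the example assumes, purely for illustration, that renaming replaced spaces with underscores.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def ratio_score(a: str, b: str) -> int:
    """Approximate a 0-100 similarity score (stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def first_match(name: str, candidates: Iterable[str],
                threshold: int = 90) -> Optional[str]:
    """Return the first candidate scoring at least `threshold`, else None."""
    for candidate in candidates:
        if ratio_score(name, candidate) >= threshold:
            return candidate
    return None


renamed = "Clostridium_sensu_stricto_1"        # hypothetical renamed feature
table_names = ["Clostridium sensu stricto 1"]  # hypothetical table entry

print(ratio_score(renamed, table_names[0]))
print(first_match(renamed, table_names, threshold=90))
print(first_match(renamed, table_names, threshold=85))
```

With this pair, three replaced characters out of 27 are enough to push the score just below 90, so the name is only found once the threshold is lowered. Lowering it too far increases the risk of matching the wrong taxon, so it is worth inspecting the matched names after any change.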
==== main.py ====
import glob
import sys

from fuzzywuzzy import fuzz

THRESHOLD = 90  # minimum fuzz.ratio score (0-100) accepted as a match

# Collect the biomarkers: rows of the res file whose columns 2-5 are all filled.
selectSpecies = []
with open('LDA.control-disease.draw.res', 'r') as file:
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) >= 5 and all(columns[1:5]):
            # Keep only the last taxonomic level, in the form <level>__<name>.
            species = columns[0].split('.')[-1]
            selectSpecies.append(species)
num = len(selectSpecies)
# print(*selectSpecies, sep='\n')

# Take the header line from any one of the relative-abundance tables.
files = glob.glob('featureTable_Relative/featureTable.sample.*.relative.xls')
if not files:
    sys.exit("No featureTable.sample.*.relative.xls file found in featureTable_Relative")
with open(files[0], 'r') as file:
    title = file.readline()

checknum = 0
with open("Biomaker.txt", "w") as out_file:
    out_file.write(title)
    for species in selectSpecies:
        level = species[0]  # one-letter level prefix
        name = species[3:]  # name without the "<level>__" prefix
        filename = f"featureTable_Relative/featureTable.sample.{level}.relative.xls"
        with open(filename, "r") as in_file:
            for line in in_file:
                name_to_compare = line.split("\t")[0]
                similarity = fuzz.ratio(name, name_to_compare)
                if similarity >= THRESHOLD:
                    # print(f"{name} and {name_to_compare} are similar, score: {similarity}")
                    out_file.write(f"{level}__{line}")
                    checknum += 1
                    break  # keep only the first match for this biomarker

if num != checknum:
    print(f"The number of biomarkers ({num}) differs from the number of matched taxa ({checknum}); consider lowering the similarity threshold.\n")
else:
    print("Biomaker.txt generated successfully!")
==== main_command_line.py ====
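import glob
import sys

from fuzzywuzzy import fuzz

THRESHOLD = 90  # minimum fuzz.ratio score (0-100) accepted as a match

if len(sys.argv) < 3:
    sys.exit("Usage: python3 main_command_line.py <res file> <featureTable_Relative directory>")
resfile = sys.argv[1]  # LEfSe res file
path = sys.argv[2]     # directory holding the relative-abundance tables

# Collect the biomarkers: rows of the res file whose columns 2-5 are all filled.
selectSpecies = []
with open(resfile, 'r') as file:
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) >= 5 and all(columns[1:5]):
            # Keep only the last taxonomic level, in the form <level>__<name>.
            species = columns[0].split('.')[-1]
            selectSpecies.append(species)
num = len(selectSpecies)
# print(*selectSpecies, sep='\n')

# Take the header line from any one of the relative-abundance tables.
files = glob.glob(f'{path}/featureTable.sample.*.relative.xls')
if not files:
    sys.exit(f"No featureTable.sample.*.relative.xls file found in {path}")
with open(files[0], 'r') as file:
    title = file.readline()

checknum = 0
with open("Biomaker.txt", "w") as out_file:
    out_file.write(title)
    for species in selectSpecies:
        level = species[0]  # one-letter level prefix
        name = species[3:]  # name without the "<level>__" prefix
        filename = f"{path}/featureTable.sample.{level}.relative.xls"
        with open(filename, "r") as in_file:
            for line in in_file:
                name_to_compare = line.split("\t")[0]
                similarity = fuzz.ratio(name, name_to_compare)
                if similarity >= THRESHOLD:
                    # print(f"{name} and {name_to_compare} are similar, score: {similarity}")
                    out_file.write(f"{level}__{line}")
                    checknum += 1
                    break  # keep only the first match for this biomarker

if num != checknum:
    print(f"The number of biomarkers ({num}) differs from the number of matched taxa ({checknum}); consider lowering the similarity threshold.\n")
else:
    print("Biomaker.txt generated successfully!")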
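Because the threshold may need tuning per dataset, one option is to expose it on the command line instead of editing the script. The sketch below is only a suggestion under that assumption: the `--threshold` option and the helper names `build_parser` and `parse_args` are hypothetical and are not part of main_command_line.py.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Describe the two positional inputs plus a hypothetical --threshold."""
    parser = argparse.ArgumentParser(
        description="Match LEfSe biomarkers against relative-abundance tables."
    )
    parser.add_argument("res", help="LEfSe res file")
    parser.add_argument("path", help="directory with the relative-abundance tables")
    parser.add_argument(
        "--threshold", type=int, default=90,
        help="minimum similarity score (0-100) accepted as a match",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or sys.argv[1:] when `argv` is None."""
    return build_parser().parse_args(argv)


# Example invocation with an explicit argument list.
args = parse_args(
    ["LDA.control-disease.draw.res", "featureTable_Relative", "--threshold", "85"]
)
print(args.res, args.path, args.threshold)
```

The parsed `args.threshold` would then replace the fixed value of 90 in the comparison loop.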