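Use the fuzzywuzzy library to look up the biomarkers reported by LEfSe in the corresponding abundance tables. This works around the fact that LEfSe renames features while processing its input, so the names in its output no longer match the tables exactly.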
Reference paths: /TJPROJ5/META_ASS/script/yuxi/script/mediation
/TJPROJ7/META_ASS/16s/yuxi/X101SC21124468-Z01/X101SC21124468-Z01-mediation
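Requires python 3.10.0
fuzzywuzzy 0.18.0
python-Levenshtein 0.23.0 (by default fuzzywuzzy uses the pure-Python SequenceMatcher, which can become slow on large inputs. python-Levenshtein is an optional dependency of fuzzywuzzy that provides a faster SequenceMatcher implementation, so installing it is recommended.)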
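To check which matcher fuzzywuzzy will end up using, it may help to test whether the optional extension is importable. The sketch below uses only the standard library; the helper name has_fast_backend is made up for illustration, and it assumes python-Levenshtein installs a top-level module named Levenshtein.

```python
import importlib.util


def has_fast_backend() -> bool:
    """Return True if a top-level module named `Levenshtein` can be imported.

    Assumption: python-Levenshtein exposes its code under that module name.
    """
    return importlib.util.find_spec("Levenshtein") is not None


if __name__ == "__main__":
    if has_fast_backend():
        print("python-Levenshtein found: fuzzywuzzy can use the faster matcher")
    else:
        print("python-Levenshtein missing: fuzzywuzzy falls back to pure-Python difflib")
```

If the module is missing, pip install python-Levenshtein==0.23.0 should match the version listed above.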
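Prepare the LDA.control-disease.draw.res file. It is tab-separated and has 5 columns:
Column 1: biomarker name;
Column 2: log10 of the highest group mean abundance (a mean abundance below 10 is treated as 10);
Column 3: name of the group in which the differential taxon is enriched;
Column 4: LDA score;
Column 5: p-value of the Kruskal-Wallis rank-sum test, shown as "-" if the feature is not a biomarker.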
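The column layout above is what the scripts rely on to pick out biomarkers: a row is kept only when columns 2 to 5 are all non-empty, and the last dot-separated field of column 1 is taken as the taxon. A minimal sketch of that filter follows; the function name select_biomarkers and the two sample rows are made up for illustration and are not real results.

```python
def select_biomarkers(lines):
    """Return the last taxonomic field of each row whose columns 2-5 are all non-empty."""
    selected = []
    for line in lines:
        columns = line.strip().split("\t")
        # Non-biomarker rows leave the group and LDA columns empty, so all() rejects them.
        if len(columns) >= 5 and all(columns[1:5]):
            selected.append(columns[0].split(".")[-1])
    return selected


# Hypothetical rows, for illustration only.
sample = [
    "k__Bacteria.p__Firmicutes\t5.21\tdisease\t4.02\t0.003\n",
    "k__Bacteria.p__Bacteroidetes\t5.40\t\t\t-\n",
]

if __name__ == "__main__":
    print(select_biomarkers(sample))  # only the first row passes the filter
```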
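The scripts extract the biomarkers from this file, compare each one against the relative-abundance table of its taxonomic level, look up the abundance row with the matching name, and write all matched rows to a single output file.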
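The lookup step can be sketched as follows. To keep the example runnable without third-party packages, difflib.SequenceMatcher stands in for fuzzywuzzy's fuzz.ratio; scores may differ slightly from the real library, especially when python-Levenshtein is installed. The helper names, the sample table rows, and the renamed taxon (special characters replaced by underscores) are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity score from 0 to 100; a stand-in for fuzzywuzzy's fuzz.ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def find_row(species: str, table_rows, threshold: int = 90):
    """Return the first table row whose name is similar enough, prefixed with the level."""
    # "g__Name" -> level "g" and name "Name" (one level letter plus two underscores).
    level, name = species[0], species[3:]
    for row in table_rows:
        if ratio(name, row.split("\t")[0]) >= threshold:
            return f"{level}__{row}"
    return None


# Hypothetical genus-level table rows and a hypothetical renamed biomarker.
rows = [
    "Bacteroides\t0.30\t0.25\n",
    "[Eubacterium]_hallii_group\t0.01\t0.02\n",
]

if __name__ == "__main__":
    print(find_row("g___Eubacterium__hallii_group", rows))
```

An exact string comparison would miss this row because of the bracket characters, which is the case fuzzy matching is meant to cover.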
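After setting the file names, run main.py (or python3 main_command_line.py res featureTable_Relative, where the first argument is the res file and the second is the directory holding the relative-abundance tables). cosSimilarity.py performs the comparison with cosine similarity instead, but it handles special characters poorly and is not recommended.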
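PS: if fewer biomarkers are selected than the res file contains, the fuzzywuzzy similarity threshold is set too high. The default is 90%, which covers most cases of special characters and fuzzy or incorrect matches; lower it slightly if needed.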
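When deciding how far to lower the threshold, one option is to list the best available score for every biomarker that fails at the current setting. The sketch below is a hypothetical helper, not part of the scripts; it uses difflib.SequenceMatcher as a stand-in for fuzz.ratio (scores can differ slightly from fuzzywuzzy), and the names are invented examples.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity score from 0 to 100; a stand-in for fuzzywuzzy's fuzz.ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_match(name, candidates):
    """Return (score, candidate) for the highest-scoring candidate."""
    return max((ratio(name, c), c) for c in candidates)


def below_threshold(names, candidates, threshold=90):
    """Map each name whose best score is under the threshold to its best match."""
    report = {}
    for name in names:
        score, candidate = best_match(name, candidates)
        if score < threshold:
            report[name] = (score, candidate)
    return report


# Hypothetical biomarker names and table names.
names = ["Escherichia_Shigella", "Clostridium_sensu_stricto_1"]
table_names = ["Escherichia-Shigella", "Clostridium sensu stricto 1"]

if __name__ == "__main__":
    for name, (score, candidate) in below_threshold(names, table_names).items():
        print(f"{name}: best score {score} against {candidate!r}")
```

Setting the threshold just below the lowest legitimate score keeps the cutoff as strict as the data allows, which limits false matches.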
# main.py: input file names are set directly in the script
import glob
import sys

from fuzzywuzzy import fuzz

THRESHOLD = 90  # minimum fuzz.ratio score (0-100) for a match

# Collect biomarkers: rows whose columns 2-5 are all non-empty.
selectSpecies = []
with open('LDA.control-disease.draw.res', 'r') as file:
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) >= 5 and all(columns[1:5]):
            # Keep only the last taxonomic field, e.g. "g__Name".
            selectSpecies.append(columns[0].split('.')[-1])
num = len(selectSpecies)

# Take the header line from any one of the relative-abundance tables.
files = glob.glob('featureTable_Relative/featureTable.sample.*.relative.xls')
if not files:
    sys.exit('No relative-abundance tables found in featureTable_Relative/')
with open(files[0], 'r') as file:
    title = file.readline()

checknum = 0
with open("Biomaker.txt", "w") as out_file:
    out_file.write(title)
    for species in selectSpecies:
        level = species[0]   # taxonomic level letter, e.g. "g"
        name = species[3:]   # name without the "g__" prefix
        filename = f"featureTable_Relative/featureTable.sample.{level}.relative.xls"
        with open(filename, "r") as in_file:
            for line in in_file:
                name_to_compare = line.split("\t")[0]
                if fuzz.ratio(name, name_to_compare) >= THRESHOLD:
                    out_file.write(f"{level}__{line}")
                    checknum += 1
                    break  # keep only the first sufficiently similar row

if num != checknum:
    print(f"Found {num} biomarkers but matched {checknum} taxa; "
          "consider lowering the similarity threshold.\n")
else:
    print("Biomaker.txt generated successfully!")
# main_command_line.py: usage: python3 main_command_line.py <res file> <table directory>
import glob
import sys

from fuzzywuzzy import fuzz

THRESHOLD = 90  # minimum fuzz.ratio score (0-100) for a match

resfile = sys.argv[1]
path = sys.argv[2]

# Collect biomarkers: rows whose columns 2-5 are all non-empty.
selectSpecies = []
with open(resfile, 'r') as file:
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) >= 5 and all(columns[1:5]):
            # Keep only the last taxonomic field, e.g. "g__Name".
            selectSpecies.append(columns[0].split('.')[-1])
num = len(selectSpecies)

# Take the header line from any one of the relative-abundance tables.
files = glob.glob(f'{path}/featureTable.sample.*.relative.xls')
if not files:
    sys.exit(f'No relative-abundance tables found in {path}/')
with open(files[0], 'r') as file:
    title = file.readline()

checknum = 0
with open("Biomaker.txt", "w") as out_file:
    out_file.write(title)
    for species in selectSpecies:
        level = species[0]   # taxonomic level letter, e.g. "g"
        name = species[3:]   # name without the "g__" prefix
        filename = f"{path}/featureTable.sample.{level}.relative.xls"
        with open(filename, "r") as in_file:
            for line in in_file:
                name_to_compare = line.split("\t")[0]
                if fuzz.ratio(name, name_to_compare) >= THRESHOLD:
                    out_file.write(f"{level}__{line}")
                    checknum += 1
                    break  # keep only the first sufficiently similar row

if num != checknum:
    print(f"Found {num} biomarkers but matched {checknum} taxa; "
          "consider lowering the similarity threshold.\n")
else:
    print("Biomaker.txt generated successfully!")