biomarkerFromLefse

Introduction

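Uses the fuzzywuzzy library to look up the biomarkers reported by LEfSe in the corresponding abundance tables by fuzzy matching. This works around the fact that LEfSe renames features while processing its input, so exact name lookups fail.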

Reference paths: /TJPROJ5/META_ASS/script/yuxi/script/mediation

/TJPROJ7/META_ASS/16s/yuxi/X101SC21124468-Z01/X101SC21124468-Z01-mediation

Software architecture

Python 3.10.0

fuzzywuzzy 0.18.0

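python-Levenshtein 0.23.0 (by default fuzzywuzzy uses a pure-Python SequenceMatcher, which can become slow on large inputs. python-Levenshtein is an optional dependency of fuzzywuzzy that provides a faster SequenceMatcher implementation, so installing it improves performance)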
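To make the similarity score concrete, here is a minimal sketch built on the standard library's difflib.SequenceMatcher, used as a stand-in for fuzz.ratio. The helper name `similarity` and the two taxon names are made up for illustration, and scores from fuzzywuzzy with python-Levenshtein installed may differ slightly from this approximation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical names: the same taxon before and after special characters
# were replaced with underscores.
renamed = "_Eubacterium__eligens_group"
original = "[Eubacterium]_eligens_group"

print(similarity(renamed, original))       # high score despite the renaming
print(similarity(renamed, "Bacteroides"))  # low score for an unrelated taxon
```

A score at or above the threshold is treated as "the same taxon", which is why names that differ only in a few special characters can still be matched.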

Usage

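Prepare the LDA.control-disease.draw.res file, which has five columns:

Column 1: biomarker name;

Column 2: log10 of the highest mean abundance across groups; a mean abundance below 10 is treated as 10;

Column 3: name of the group in which the differential taxon is enriched;

Column 4: LDA score;

Column 5: p-value of the Kruskal-Wallis rank-sum test; "-" if the feature is not a biomarker.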
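The column layout above can be turned into a small filter. The following is a sketch of the selection rule the scripts apply (a row is kept only when columns 2 to 5 are all non-empty); the function name `extract_biomarkers` and the two sample rows are invented for illustration and are not real LEfSe output.

```python
def extract_biomarkers(lines):
    """Return (level, name) pairs for rows whose columns 2-5 are all non-empty."""
    biomarkers = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 5 and all(columns[1:5]):
            # Keep only the deepest rank, e.g. "p__Firmicutes".
            taxon = columns[0].split(".")[-1]
            level, _, name = taxon.partition("__")
            biomarkers.append((level, name))
    return biomarkers


# Made-up rows: the first is a biomarker, the second has empty
# group and LDA columns and is therefore skipped.
sample = [
    "k__Bacteria.p__Firmicutes\t5.2\tdisease\t4.1\t0.003\n",
    "k__Bacteria.p__Bacteroidetes\t5.0\t\t\t-\n",
]
print(extract_biomarkers(sample))  # [('p', 'Firmicutes')]
```

The level letter selects which abundance table to search, and the name is what gets compared against that table's first column.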

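The biomarkers are extracted from these rows and compared against the relative-abundance table of each taxonomic level. For every biomarker, the abundance row with the matching name is looked up, and all matched rows are merged into a single output file.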
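The lookup step can be sketched as follows, assuming the abundance table is tab-separated with the taxon name in the first column. The helper names `similarity` and `find_row` and the table contents are hypothetical, and difflib stands in for fuzz.ratio so the example runs without extra packages.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """0-100 similarity score (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_row(name, table_lines, threshold=90):
    """Return the first row whose first column scores >= threshold, else None."""
    for line in table_lines:
        candidate = line.split("\t", 1)[0]
        if similarity(name, candidate) >= threshold:
            return line
    return None


# Made-up genus-level table: a header line plus two rows.
genus_table = [
    "Taxonomy\tS1\tS2\n",
    "Bacteroides\t0.31\t0.27\n",
    "[Eubacterium]_eligens_group\t0.02\t0.05\n",
]
row = find_row("_Eubacterium__eligens_group", genus_table)
print(row)
```

Like the scripts, this takes the first row that reaches the threshold rather than the best-scoring one, so a very low threshold can pick up an unrelated taxon that happens to come first.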

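After setting the file names, run main.py (or python3 main_command_line.py res featureTable_Relative). cosSimilarity.py performs the comparison with cosine similarity instead, but it is insensitive to special characters and is not recommended.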
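For the command-line variant, the two positional arguments could also be parsed with argparse. This is only a sketch: the scripts read sys.argv directly, and the `--threshold` option shown here is a hypothetical addition that the scripts do not provide.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Match LEfSe biomarkers against relative-abundance tables."
    )
    parser.add_argument("resfile", help="LEfSe res file, e.g. LDA.control-disease.draw.res")
    parser.add_argument("path", help="directory with featureTable.sample.<level>.relative.xls")
    # Hypothetical option: the scripts hard-code the threshold at 90.
    parser.add_argument("--threshold", type=int, default=90,
                        help="minimum similarity score (0-100)")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["LDA.control-disease.draw.res", "featureTable_Relative", "--threshold", "85"]
)
print(args.resfile, args.path, args.threshold)
```

Exposing the threshold this way would avoid editing the script each time a different cutoff is tried.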

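Note: if fewer biomarkers are selected than the res file contains, the fuzzywuzzy similarity threshold is set too high. The default is 90, which covers most special-character differences and fuzzy or mistaken matches; lower it slightly if needed.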
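When some biomarkers go unmatched, it helps to see how far below the threshold they fall before lowering it. The sketch below lists each unmatched name with its best candidate and score. All names are hypothetical, and difflib again stands in for fuzz.ratio, so exact scores may differ from what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """0-100 similarity score (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, candidates):
    """Return (best_candidate, score) for name among candidates."""
    return max(((c, similarity(name, c)) for c in candidates),
               key=lambda pair: pair[1])


def unmatched(names, candidates, threshold=90):
    """List (name, best_candidate, score) for names scoring below threshold."""
    report = []
    for name in names:
        candidate, score = best_match(name, candidates)
        if score < threshold:
            report.append((name, candidate, score))
    return report


# Made-up biomarker names and abundance-table names.
names = ["Escherichia_Shigella", "Clostridium_sensu_stricto_1"]
candidates = ["Escherichia-Shigella", "Clostridium sensu stricto 1"]

print(unmatched(names, candidates, threshold=90))  # one name falls just short
print(unmatched(names, candidates, threshold=85))  # everything matches
```

Names with several replaced characters lose a few points each, so a threshold slightly below the best scores of the unmatched names is usually enough; going much lower increases the risk of matching the wrong taxon.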

main.py

import glob
import sys

from fuzzywuzzy import fuzz

RES_FILE = 'LDA.control-disease.draw.res'
TABLE_DIR = 'featureTable_Relative'
THRESHOLD = 90  # minimum fuzz.ratio score (0-100) accepted as a match

# Collect biomarkers: rows whose columns 2-5 are all non-empty.
selectSpecies = []
with open(RES_FILE, 'r') as file:
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) >= 5 and all(columns[1:5]):
            # Keep only the deepest rank, e.g. "g__Blautia".
            selectSpecies.append(columns[0].split('.')[-1])
num = len(selectSpecies)

# Reuse the header line of one abundance table as the output header.
files = glob.glob(f'{TABLE_DIR}/featureTable.sample.*.relative.xls')
if not files:
    sys.exit(f'No featureTable.sample.*.relative.xls files found in {TABLE_DIR}')
with open(files[0], 'r') as file:
    title = file.readline()

checknum = 0
with open("Biomaker.txt", "w") as out_file:
    out_file.write(title)
    for species in selectSpecies:
        level = species[0]  # rank letter, e.g. "g"
        name = species[3:]  # taxon name without the "g__" prefix
        filename = f"{TABLE_DIR}/featureTable.sample.{level}.relative.xls"
        with open(filename, "r") as in_file:
            for line in in_file:
                name_to_compare = line.split("\t")[0]
                # Write the first row whose name is similar enough, then stop.
                if fuzz.ratio(name, name_to_compare) >= THRESHOLD:
                    out_file.write(f"{level}__{line}")
                    checknum += 1
                    break

if num != checknum:
    print(f"Found {num} biomarkers but matched only {checknum}; "
          "consider lowering the similarity threshold.\n")
else:
    print("Biomaker.txt generated successfully!")

main_command_line.py

import glob
import sys

from fuzzywuzzy import fuzz

THRESHOLD = 90  # minimum fuzz.ratio score (0-100) accepted as a match

if len(sys.argv) < 3:
    sys.exit(f'Usage: python3 {sys.argv[0]} <resfile> <featureTable_dir>')
resfile = sys.argv[1]
path = sys.argv[2]

# Collect biomarkers: rows whose columns 2-5 are all non-empty.
selectSpecies = []
with open(resfile, 'r') as file:
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) >= 5 and all(columns[1:5]):
            # Keep only the deepest rank, e.g. "g__Blautia".
            selectSpecies.append(columns[0].split('.')[-1])
num = len(selectSpecies)

# Reuse the header line of one abundance table as the output header.
files = glob.glob(f'{path}/featureTable.sample.*.relative.xls')
if not files:
    sys.exit(f'No featureTable.sample.*.relative.xls files found in {path}')
with open(files[0], 'r') as file:
    title = file.readline()

checknum = 0
with open("Biomaker.txt", "w") as out_file:
    out_file.write(title)
    for species in selectSpecies:
        level = species[0]  # rank letter, e.g. "g"
        name = species[3:]  # taxon name without the "g__" prefix
        filename = f"{path}/featureTable.sample.{level}.relative.xls"
        with open(filename, "r") as in_file:
            for line in in_file:
                name_to_compare = line.split("\t")[0]
                # Write the first row whose name is similar enough, then stop.
                if fuzz.ratio(name, name_to_compare) >= THRESHOLD:
                    out_file.write(f"{level}__{line}")
                    checknum += 1
                    break

if num != checknum:
    print(f"Found {num} biomarkers but matched only {checknum}; "
          "consider lowering the similarity threshold.\n")
else:
    print("Biomaker.txt generated successfully!")