
COBRA (Contig Overlap Based Re-Assembly)

COBRA (Contig Overlap Based Re-Assembly) is a bioinformatics tool for obtaining higher-quality viral genomes assembled from short paired-end metagenomic reads. COBRA is written in Python. So far, COBRA has only been tested on contigs and scaffolds assembled by metaSPAdes, IDBA_UD, and MEGAHIT.

Introduction

1. Why are metagenomic contigs fragmented? Genomes assembled from short paired-end metagenomic reads are usually fragmented owing to (i) repeats within a genome, (ii) regions shared between genomes, and (iii) within-population variation, because the widely used de Bruijn graph assemblers (e.g., metaSPAdes, IDBA_UD, and MEGAHIT) tend to break at points where multiple paths are available rather than risk an incorrect extension.

2. Contigs may share expected end overlaps. Because of how the above assemblers work, fragmented contigs carry end overlaps of a determined length: the max-kmer (maxK) used in the de novo assembly for metaSPAdes and MEGAHIT, and maxK-1 for IDBA_UD. We call this the "expected overlap length".
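
As a small illustration of this rule (the helper name expected_overlap_length is mine, not part of COBRA):

def expected_overlap_length(assembler: str, maxk: int) -> int:
    """Expected end-overlap length left between fragmented contigs."""
    if assembler in ("metaspades", "megahit"):
        return maxk          # metaSPAdes and MEGAHIT: maxK
    if assembler == "idba":
        return maxk - 1      # IDBA_UD: maxK - 1
    raise ValueError(f"COBRA has not been tested with assembler '{assembler}'")

print(expected_overlap_length("idba", 140))     # 139, matching the log example further below
print(expected_overlap_length("megahit", 141))  # 141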

How COBRA works

COBRA identifies all joins with the "expected overlap length" (in both forward and reverse-complement orientations) among the contigs of an assembly, and then searches for valid joining paths for each user-supplied query (which should be a small fraction of the whole assembly), based on features including contig coverage, contig overlap relationships, and contig continuity (assessed from paired-end read mapping).
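
The following is only a minimal sketch of the end-matching idea (not COBRA's actual implementation), assuming the contigs are held in an in-memory dict: a contig, or its reverse complement, can follow another contig in a joining path when its first "expected overlap length" bases equal the other contig's last bases.

from collections import defaultdict

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def candidate_joins(contigs: dict, overlap: int) -> dict:
    """For each contig, list the contigs (with orientation) that could follow it."""
    starts = defaultdict(list)
    for name, seq in contigs.items():
        starts[seq[:overlap]].append((name, "+"))           # contig used as assembled
        starts[revcomp(seq)[:overlap]].append((name, "-"))  # contig used reverse-complemented
    joins = {}
    for name, seq in contigs.items():
        hits = [hit for hit in starts.get(seq[-overlap:], []) if hit[0] != name]
        if hits:
            joins[name] = hits
    return joins

contigs = {"c1": "ATGCATGCAAGGTTCC", "c2": "GGTTCCTTAAGGCCAA"}
print(candidate_joins(contigs, overlap=6))  # {'c1': [('c2', '+')]}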

Note that scaffolds (e.g., from a metaSPAdes assembly) can be used as input for COBRA extension; however, we do not recommend using IDBA_UD scaffolds because of potential errors introduced in its scaffolding step. For IDBA_UD and MEGAHIT assemblies, contigs should therefore be used.

Given that COBRA has only been tested on contigs/scaffolds from IDBA_UD, metaSPAdes, and MEGAHIT, using it on contigs/scaffolds from any other assembler is risky.

Input files

(1) COBRA requires four files as input:

-f/--fasta: a fasta-format file containing all contigs of a single assembly. Note that IDBA_UD and MEGAHIT usually only keep contigs with a minimum length of 200 bp.

-c/--coverage: a two-column (tab-separated) file with the sequencing coverage of every contig in the -f/--fasta file, for example:

contig-140_1    42.1388
contig-140_2    14.6023
contig-140_3    15.4817
contig-140_4    41.2746
...

-q/--query: the query contigs that the user wants COBRA to extend, supplied either as a fasta-format file or as a one-column text file with the names of the query contigs. Make sure the names are written exactly as in the -f/--fasta file; otherwise COBRA may have trouble extending them (a quick name check is sketched after this list). For this test I used contigs shorter than 500 bp as the query sequences to be extended.

-m/--mapping: the paired-end read mapping file of all contigs in the -f/--fasta file, either in sam or bam format.
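
Because mismatched names are a common cause of failed extensions, a quick check like the one below (my own helper, assuming the query is given as a one-column name list and that names end at the first whitespace of the fasta header) can be run before starting COBRA:

def fasta_names(path: str) -> set:
    """Collect sequence names (text after '>' up to the first whitespace) from a fasta file."""
    with open(path) as fh:
        return {line[1:].split()[0] for line in fh if line.startswith(">")}

assembly_names = fasta_names("final.contigs.fa")
with open("query_names.txt") as fh:  # hypothetical one-column list of query contig names
    query_names = {line.strip() for line in fh if line.strip()}

missing = sorted(query_names - assembly_names)
if missing:
    print(f"{len(missing)} query names are absent from the assembly, e.g. {missing[:5]}")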

(2) and three parameters:

-a/--assembler: the name of the de novo assembler used; currently only "idba" (for IDBA_UD), "metaspades" (for metaSPAdes), and "megahit" (for MEGAHIT) are supported.

-maxk/--maxk: the largest kmer used in the de novo assembly.

-mink/--mink: the smallest kmer used in the de novo assembly.

(3) and optional flags:

-lm/--linkage_mismatch: the number of read-mapping mismatches allowed when determining whether two contigs are spanned by paired reads.

-o/--output: the name of the output folder; if not provided it defaults to "{-q/--query}.COBRA".

-t/--threads: the number of threads used for the BLASTn search.

How to obtain the mapping file

The mapping file can be generated with tools such as Bowtie2 and BBMap; see their manuals for details. Below is a general way to obtain a sorted sam/bam file, so you will also need samtools (available here: https://github.com/samtools/samtools).

For example,

contig file = “contigs.fasta”

first read file = “R1.fastq.gz”

second read file = “R2.fastq.gz”

(1) with Bowtie2 (https://github.com/BenLangmead/bowtie2)

bowtie2-build contigs.fasta contigs.fasta

bowtie2 -p 16 -x contigs.fasta -1 R1.fastq.gz -2 R2.fastq.gz -S output.sam && samtools view -bS output.sam | samtools sort -o sorted_output.bam -

(2) with BBMap (https://github.com/BioInfoTools/BBMap)

bbmap.sh ref=contigs.fasta in1=R1.fastq.gz in2=R2.fastq.gz threads=16 out=output.sam
samtools view -bS output.sam > output.bam
samtools sort -o sorted_output.bam output.bam

/PUBLIC/software/public/bowtie2-2.2.4/bowtie2-build final.contigs.fa final.contigs.fa
/PUBLIC/software/public/bowtie2-2.2.4/bowtie2 -p 16 -x final.contigs.fa -1 GCH1E62_350.nohost.fq1.gz -2 GCH1E62_350.nohost.fq2.gz -S output.sam
/PUBLIC/software/MICRO/share/MetaGenome_pipeline/MetaGenome_pipeline_V5.1/software/samtools/samtools-1.3/samtools view -bS output.sam -o output.bam
/PUBLIC/software/MICRO/share/MetaGenome_pipeline/MetaGenome_pipeline_V5.1/software/samtools/samtools-1.3/samtools sort -o sorted_output.bam output.bam 
rm output.bam 

How to obtain the coverage file

(1) with jgi_summarize_bam_contig_depths

Once the sorted sam or bam file is ready, the "jgi_summarize_bam_contig_depths" tool from MetaBAT (https://bitbucket.org/berkeleylab/metabat/src/master/) can be used to obtain the coverage file; the resulting profile should then be converted into a tab-separated two-column file.

jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *sam

jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *bam

The output file of jgi_summarize_bam_contig_depths can be converted into the tab-separated two-column file with the script provided in this study (coverage.transfer.py).

python coverage.transfer.py -i original.coverage.txt -o coverage.txt
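
coverage.transfer.py is distributed with COBRA; the sketch below merely illustrates the same conversion, assuming the usual jgi_summarize_bam_contig_depths layout in which the first three columns are contigName, contigLen, and totalAvgDepth:

import csv

# Keep only the contig name and totalAvgDepth from the jgi_summarize_bam_contig_depths table
# (assumed layout: contigName, contigLen, totalAvgDepth, per-sample columns ...).
with open("original.coverage.txt") as src, open("coverage.txt", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)  # skip the header line
    for row in reader:
        writer.writerow([row[0], row[2]])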

(2) CoverM

CoverM is a fast DNA read coverage and relative abundance calculator focused on metagenomics applications. Usage can be found here (https://github.com/wwood/CoverM).

(3) pyCoverM

pyCoverM is a simple Python interface to CoverM's fast coverage estimation functions, available here (https://github.com/apcamargo/pycoverm).
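
A minimal usage sketch, assuming the get_coverages_from_bam interface described in the pycoverm repository (the BAM file is the sorted one produced above):

import pycoverm

# Assumed interface: takes a list of sorted BAM files and returns (contig_names, coverage_matrix);
# check the pycoverm README for the exact signature and options.
names, coverages = pycoverm.get_coverages_from_bam(["sorted_output.bam"])
with open("coverage.txt", "w") as out:
    for name, cov in zip(names, coverages[:, 0]):
        out.write(f"{name}\t{cov:.4f}\n")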

source /TJPROJ1/META_ASS/soft/anaconda3/bin/activate metawrap-env
/TJPROJ1/META_ASS/soft/anaconda3/envs/metawrap-env/bin/jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *sam
/TJPROJ1/META_ASS/soft/anaconda3/envs/metawrap-env/bin/jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *bam

Running

cobra-meta -f input.fasta -q query.fasta -c coverage.txt -m mapping.sam -a idba -mink 20 -maxk 140
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a idba -mink 20 -maxk 140 -mm 2
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a metaspades -mink 21 -maxk 127 -mm 2
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a megahit -mink 21 -maxk 141 -mm 2

Python script for filtering contigs by length

def read_fasta(filename):
    """从FASTA文件中读取序列数据并返回为字典"""
    sequences = {}
    with open(filename, 'r') as file:
        header = None
        sequence = ''
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    sequences[header] = sequence
                header = line[1:]
                sequence = ''
            else:
                sequence += line
        if header is not None:
            sequences[header] = sequence
    return sequences

def write_fasta(sequences, filename):
    """将序列写入到FASTA文件中"""
    with open(filename, 'w') as file:
        for header, sequence in sequences.items():
            file.write(f'>{header}\n{sequence}\n')

def filter_sequences(input_file, output_file):
    """从输入的FASTA文件中过滤出长度小于200bp的序列并写入到输出文件中"""
    sequences = read_fasta(input_file)
    filtered_sequences = {header: sequence for header, sequence in sequences.items() if len(sequence) < 500}
    write_fasta(filtered_sequences, output_file)

if __name__ == "__main__":
    input_file = "final.contigs.fa"  # 输入的FASTA文件名
    output_file = "output.fasta"  # 输出的FASTA文件名
    filter_sequences(input_file, output_file)

The resulting output.fasta sequences are then used as the -q argument.

source /TJPROJ1/META_ASS/soft/anaconda3/bin/activate /TJPROJ1/META_ASS/soft/cobra
cobra-meta -q output.fasta  -f final.contigs.fa -a megahit -mink 21 -maxk 87 -m sorted_output.bam -c coverage.txt -o query.fasta.COBRA.out

Output files

Below is a general list of the files in the output folder:

COBRA_category_i_self_circular_queries_trimmed.fasta
COBRA_category_i_self_circular_queries_trimmed.fasta.summary.txt
COBRA_category_ii_extended_circular_unique (folder)
COBRA_category_ii_extended_circular_unique.fasta
COBRA_category_ii_extended_circular_unique.fasta.summary.txt
COBRA_category_ii_extended_circular_unique_joining_details.txt
COBRA_category_ii_extended_failed.fasta
COBRA_category_ii_extended_failed.fasta.summary.txt
COBRA_category_ii_extended_partial_unique (folder)
COBRA_category_ii_extended_partial_unique.fasta
COBRA_category_ii_extended_partial_unique.fasta.summary.txt
COBRA_category_ii_extended_partial_unique_joining_details.txt
COBRA_category_iii_orphan_end.fasta
COBRA_category_iii_orphan_end.fasta.summary.txt
COBRA_joining_status.txt
COBRA_joining_summary.txt
intermediate.files (folder)
log
debug.txt
contig.new.fa

COBRA assigns every query to a category according to its joining status (detailed in the COBRA_joining_status.txt file), namely:

"self_circular" - the query contig itself is a circular genome.
"extended_circular" - the query contig was joined and extended into a circular genome.
"extended_partial" - the query contig was joined and extended, but not into a circle.
"extended_failed" - the query contig could not be extended due to COBRA's rules.
"orphan_end" - neither end of the query contig shares the "expected overlap length" with any other contig.

For the joined and extended queries in each category, only the unique ones (*.fasta) are saved for downstream analyses, and their sequence information (e.g., length, coverage, GC, number of Ns) is summarized in the *fasta.summary.txt files. For the "extended_circular" and "extended_partial" categories, the joining details of each query are given in the corresponding folder and *joining_details.txt file, and summarized in the COBRA_joining_summary.txt file, for example:

QuerySeqID      QuerySeqLen     TotRetSeqs      TotRetLen       AssembledLen    ExtendedLen     Status
contig-140_100  47501   3       50379   49962   2461    Extended_circular
contig-140_112  45060   3       62549   62132   17072   Extended_circular
contig-140_114  44829   2       45342   45064   235     Extended_circular
contig-140_160  40329   2       41018   40740   411     Extended_circular
contig-140_188  38386   5       48986   48291   9905    Extended_circular
...
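
Given the columns shown above, the summary file is easy to post-process; for example, counting queries and total bases gained per status (a small helper of mine, assuming the file is tab-separated like the other COBRA tables):

import csv
from collections import defaultdict

counts, gained = defaultdict(int), defaultdict(int)
with open("COBRA_joining_summary.txt") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        counts[row["Status"]] += 1
        gained[row["Status"]] += int(row["ExtendedLen"])

for status in counts:
    print(f"{status}: {counts[status]} queries, {gained[status]:,} bp gained in total")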

Log file: the log file records every processing step, for example:

1. INPUT INFORMATION
# Assembler: IDBA_UD
# Min-kmer: 20
# Max-kmer: 140
# Overlap length: 139 bp
# Read mapping max mismatches for contig linkage: 2
# Query contigs: file-path
# Whole contig set: file-path
# Mapping file: file-path
# Coverage file: file-path
# Output folder: file-path

2. PROCESSING STEPS
[01/22] [2023/05/04 12:48:15] Reading contigs and getting the contig end sequences. A total of 311739 contigs were imported.
[02/22] [2023/05/04 12:48:33] Getting shared contig ends.
[03/22] [2023/05/04 12:48:40] Writing contig end joining pairs.
[04/22] [2023/05/04 12:48:41] Getting contig coverage information.
[05/22] [2023/05/04 12:48:42] Getting query contig list. A total of 2304 query contigs were imported.
[06/22] [2023/05/04 12:48:43] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/22] [2023/05/04 13:00:01] Detecting self_circular contigs.
[08/22] [2023/05/04 13:00:37] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[09/22] [2023/05/04 14:29:20] Saving potential joining paths.
[10/22] [2023/05/04 14:29:23] Checking for invalid joining: sharing queries.
[11/22] [2023/05/04 14:29:28] Getting initial joining status of each query contig.
[12/22] [2023/05/04 14:29:28] Getting final joining status of each query contig.
[13/22] [2023/05/04 14:29:28] Getting the joining order of contigs.
[14/22] [2023/05/04 14:29:28] Getting retrieved contigs.
[15/22] [2023/05/04 14:29:32] Saving joined seqeuences.
[16/22] [2023/05/04 14:29:35] Checking for invalid joining using BLASTn: close strains.
[17/22] [2023/05/04 14:30:11] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.
[18/22] [2023/05/04 14:30:12] Getting the joining details of unique "Extended_circular" and "Extended_partial" query contigs.
[19/22] [2023/05/04 14:30:12] Saving joining summary of "Extended_circular" and "Extended_partial" query contigs.
[20/22] [2023/05/04 14:30:15] Saving joining status of all query contigs.
[21/22] [2023/05/04 14:30:15] Saving self_circular contigs.
[22/22] [2023/05/04 14:30:16] Saving the new fasta file.


======================================================================================================================================================
3. RESULTS SUMMARY
# Total queries: 2304
# Category i   - Self_circular: 74
# Category ii  - Extended_circular: 120 (Unique: 82)
# Category ii  - Extended_partial: 1088 (Unique: 889)
# Category ii  - Extended_failed (due to COBRA rules): 245
# Category iii - Orphan end: 777
# Check "COBRA_joining_status.txt" for joining status of each query.
# Check "COBRA_joining_summary.txt" for joining details of "Extended_circular" and "Extended_partial" queries.
======================================================================================================================================================

Official repository: https://github.com/linxingchen/cobra
Reference: https://www.nature.com/articles/s41564-023-01598-2
