====== COBRA (Contig Overlap Based Re-Assembly) ======

COBRA (Contig Overlap Based Re-Assembly) is a bioinformatics tool for obtaining higher-quality viral genomes assembled from metagenomes built from short paired-end reads. COBRA is written in Python. To date, COBRA has only been tested on contigs and scaffolds assembled by metaSPAdes, IDBA_UD, and MEGAHIT.

===== Introduction =====

1. Why are metagenome-assembled contigs fragmented?

Genomes assembled from short paired-end metagenomic reads are usually fragmented due to (1) intra-genome repeats, (2) regions shared between genomes, and (3) within-population variation. The widely used de Bruijn graph based assemblers (e.g., metaSPAdes, IDBA_UD, and MEGAHIT) tend to break at such points when multiple extension paths are available, rather than making a risky extension.

{{:undefined:aaaa.png?400|}}

2. Contigs may join at their ends with an expected overlap

Given the principle of the assemblers above, fragmented contigs share end overlaps of a determined length: the max-kmer (maxK) used in the de novo assembly for metaSPAdes and MEGAHIT, and maxK-1 for IDBA_UD. We refer to this as the "expected overlap length".

{{::aaa1.png?400|}}

===== How COBRA works =====

COBRA identifies the "expected overlap length" of all contig joins in an assembly (in both forward and reverse-complement orientations), then looks for valid joining paths for each query supplied by the user (which should be a small subset of the whole assembly), based on a list of features including contig coverage, contig overlap relationships, and contig continuity (determined by paired-end read mapping).

Note that scaffolds (e.g., from a metaSPAdes assembly) can be used as input for COBRA extension; however, we recommend against using IDBA_UD scaffolds because of potential errors introduced in the scaffolding step. Thus, for IDBA_UD and MEGAHIT assemblies, contigs should be used.

Given that COBRA has only been tested on contigs/scaffolds from IDBA_UD, metaSPAdes, and MEGAHIT, using it on the contigs/scaffolds of any other assembler would be risky.

{{:aaa2.png?400|}}

===== Input files =====

(1) COBRA requires four files as input:

  * -f/--fasta: a fasta-format file containing all contigs of a single assembly. Note that IDBA_UD and MEGAHIT usually keep contigs with a minimum length of 200 bp.
  * -c/--coverage: a two-column (tab-separated) file of the sequencing coverage of all contigs in the -f/--fasta file, for example:

<code>
contig-140_1    42.1388
contig-140_2    14.6023
contig-140_3    15.4817
contig-140_4    41.2746
...
</code>
  * -q/--query: the query contigs that the user wants COBRA to extend, supplied either as a fasta-format file or as a single-column text file of query contig names. Make sure the names are in exactly the same format as in the -f/--fasta file; otherwise COBRA may have problems extending them. In this test, I used contig sequences shorter than 500 bp as the query sequences for extension.
  * -m/--mapping: the paired-end read mapping file of all contigs in the -f/--fasta file, either a sam or bam file.

(2) and three parameters:

  * -a/--assembler: the name of the de novo assembler used; currently only "idba" (for IDBA_UD), "metaspades" (for metaSPAdes), and "megahit" (for MEGAHIT).
  * -maxk/--maxk: the largest kmer used in the de novo assembly.
  * -mink/--mink: the smallest kmer used in the de novo assembly.

(3) Optional flags:

  * -lm/--linkage_mismatch: the number of read-mapping mismatches allowed when determining whether two contigs are spanned by paired reads.
  * -o/--output: the name of the output folder; defaults to "{-q/--query}.COBRA" if not provided.
  * -t/--threads: the number of threads used for the BLASTn search.

===== How to obtain the mapping file =====

The mapping file can be obtained with tools such as Bowtie2 and BBMap; please refer to their manuals for details. Below is a general way to obtain a sorted sam/bam file, for which you need samtools available (it can be obtained here - https://github.com/samtools/samtools).

For example,

<code>
contig file = "contigs.fasta"
first read file = "R1.fastq.gz"
second read file = "R2.fastq.gz"
</code>

(1) With Bowtie2 (https://github.com/BenLangmead/bowtie2)

<code bash>
bowtie2-build contigs.fasta contigs.fasta
bowtie2 -p 16 -x contigs.fasta -1 R1.fastq.gz -2 R2.fastq.gz -S output.sam && samtools view -bS output.sam | samtools sort -o sorted_output.bam -
</code>

(2) With BBMap (https://github.com/BioInfoTools/BBMap)

<code bash>
bbmap.sh ref=contigs.fasta in1=R1.fastq.gz in2=R2.fastq.gz threads=16 out=output.sam
samtools view -bS output.sam > output.bam
samtools sort -o sorted_output.bam output.bam
</code>

The commands used in this test:

<code bash>
/PUBLIC/software/public/bowtie2-2.2.4/bowtie2-build final.contigs.fa final.contigs.fa
/PUBLIC/software/public/bowtie2-2.2.4/bowtie2 -p 16 -x final.contigs.fa -1 GCH1E62_350.nohost.fq1.gz -2 GCH1E62_350.nohost.fq2.gz -S output.sam
/PUBLIC/software/MICRO/share/MetaGenome_pipeline/MetaGenome_pipeline_V5.1/software/samtools/samtools-1.3/samtools view -bS output.sam -o output.bam
/PUBLIC/software/MICRO/share/MetaGenome_pipeline/MetaGenome_pipeline_V5.1/software/samtools/samtools-1.3/samtools sort -o sorted_output.bam output.bam
rm output.bam
</code>

===== How to obtain the coverage file =====

(1) With jgi_summarize_bam_contig_depths
Once a sorted sam or bam file is ready, the "jgi_summarize_bam_contig_depths" tool from MetaBAT (https://bitbucket.org/berkeleylab/metabat/src/master/) can be used to obtain a coverage profile, which should then be converted into a two-column, tab-separated file.

<code bash>
jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *sam
jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *bam
</code>

The output file of jgi_summarize_bam_contig_depths can be converted to a two-column, tab-separated file using the script (coverage.transfer.py) provided in this study.

<code bash>
python coverage.transfer.py -i original.coverage.txt -o coverage.txt
</code>

(2) CoverM

CoverM is a fast DNA read coverage and relative abundance calculator focused on metagenomics applications. Usage can be found here (https://github.com/wwood/CoverM).

(3) pyCoverM

pyCoverM is a simple Python interface to CoverM's fast coverage estimation functions, which can be found here (https://github.com/apcamargo/pycoverm).

The commands used in this test:

<code bash>
source /TJPROJ1/META_ASS/soft/anaconda3/bin/activate metawrap-env
/TJPROJ1/META_ASS/soft/anaconda3/envs/metawrap-env/bin/jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *sam
/TJPROJ1/META_ASS/soft/anaconda3/envs/metawrap-env/bin/jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *bam
</code>

===== Running =====

<code bash>
cobra-meta -f input.fasta -q query.fasta -c coverage.txt -m mapping.sam -a idba -mink 20 -maxk 140
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a idba -mink 20 -maxk 140 -mm 2
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a metaspades -mink 21 -maxk 127 -mm 2
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a megahit -mink 21 -maxk 141 -mm 2
</code>

A Python script to filter contigs by length:

<code python>
def read_fasta(filename):
    """Read sequences from a FASTA file and return them as a dict."""
    sequences = {}
    with open(filename, 'r') as file:
        header = None
        sequence = ''
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    sequences[header] = sequence
                header = line[1:]
                sequence = ''
            else:
                sequence += line
        if header is not None:
            sequences[header] = sequence
    return sequences


def write_fasta(sequences, filename):
    """Write sequences to a FASTA file."""
    with open(filename, 'w') as file:
        for header, sequence in sequences.items():
            file.write(f'>{header}\n{sequence}\n')


def filter_sequences(input_file, output_file):
    """Filter sequences shorter than 500 bp from the input FASTA file and write them to the output file."""
    sequences = read_fasta(input_file)
    filtered_sequences = {header: sequence for header, sequence in sequences.items() if len(sequence) < 500}
    write_fasta(filtered_sequences, output_file)


if __name__ == "__main__":
    input_file = "final.contigs.fa"  # input FASTA file
    output_file = "output.fasta"     # output FASTA file
    filter_sequences(input_file, output_file)
</code>

The resulting output.fasta is then used as the -q argument:

<code bash>
source /TJPROJ1/META_ASS/soft/anaconda3/bin/activate /TJPROJ1/META_ASS/soft/cobra
cobra-meta -q output.fasta -f final.contigs.fa -a megahit -mink 21 -maxk 87 -m sorted_output.bam -c coverage.txt -o query.fasta.COBRA.out
</code>

===== Output files =====

Below is a general list of the output files in the output folder:

<code>
COBRA_category_i_self_circular_queries_trimmed.fasta
COBRA_category_i_self_circular_queries_trimmed.fasta.summary.txt
COBRA_category_ii_extended_circular_unique (folder)
COBRA_category_ii_extended_circular_unique.fasta
COBRA_category_ii_extended_circular_unique.fasta.summary.txt
COBRA_category_ii_extended_circular_unique_joining_details.txt
COBRA_category_ii_extended_failed.fasta
COBRA_category_ii_extended_failed.fasta.summary.txt
COBRA_category_ii_extended_partial_unique (folder)
COBRA_category_ii_extended_partial_unique.fasta
COBRA_category_ii_extended_partial_unique.fasta.summary.txt
COBRA_category_ii_extended_partial_unique_joining_details.txt
COBRA_category_iii_orphan_end.fasta
COBRA_category_iii_orphan_end.fasta.summary.txt
COBRA_joining_status.txt
COBRA_joining_summary.txt
intermediate.files (folder)
log
debug.txt
contig.new.fa
</code>

For all queries, COBRA assigns them to different categories based on their joining status (detailed in the COBRA_joining_status.txt file), namely:

  * "self_circular" - the query contig itself is a circular genome.
  * "extended_circular" - the query contig was joined and extended into a circular genome.
  * "extended_partial" - the query contig was joined and extended, but not into a circle.
  * "extended_failed" - the query contig could not be extended due to COBRA rules.
  * "orphan_end" - neither end of the query contig shares the "expected overlap length" with any other contig.
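To get a quick overview of how many queries fell into each category, the COBRA_joining_status.txt file can be tallied with a few lines of Python. This is a sketch only: the exact column layout of the file is not documented here, so the code assumes a tab-separated file with the query name in the first column and the status in the last column, and `count_joining_status` is an illustrative helper, not part of COBRA. Check the file against your COBRA version before relying on it.

```python
from collections import Counter

def count_joining_status(path):
    """Tally query contigs per joining status from COBRA_joining_status.txt.

    Assumption: a tab-separated file whose last column is the status
    (self_circular, extended_circular, extended_partial, ...). Adjust
    the column index if your file differs.
    """
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip('\n').split('\t')
            if len(fields) >= 2 and fields[-1].lower() != 'status':  # skip a possible header
                counts[fields[-1]] += 1
    return counts
```

A usage example: `count_joining_status("query.fasta.COBRA.out/COBRA_joining_status.txt")` returns a Counter mapping each status to the number of queries.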
For the joined and extended queries in each category, only unique queries are saved (*.fasta) for downstream analyses, and their sequence information (e.g., length, coverage, GC, number of Ns) is summarized in the *fasta.summary.txt files. For the "extended_circular" and "extended_partial" categories, the joining details of each query are included in the corresponding folder and *joining_details.txt file, and summarized in the COBRA_joining_summary.txt file, for example:

<code>
QuerySeqID      QuerySeqLen  TotRetSeqs  TotRetLen  AssembledLen  ExtendedLen  Status
contig-140_100  47501        3           50379      49962         2461         Extended_circular
contig-140_112  45060        3           62549      62132         17072        Extended_circular
contig-140_114  44829        2           45342      45064         235          Extended_circular
contig-140_160  40329        2           41018      40740         411          Extended_circular
contig-140_188  38386        5           48986      48291         9905         Extended_circular
...
</code>

Log file: the log file records each processing step, for example:

<code>
1. INPUT INFORMATION
# Assembler: IDBA_UD
# Min-kmer: 20
# Max-kmer: 140
# Overlap length: 139 bp
# Read mapping max mismatches for contig linkage: 2
# Query contigs: file-path
# Whole contig set: file-path
# Mapping file: file-path
# Coverage file: file-path
# Output folder: file-path

2. PROCESSING STEPS
[01/22] [2023/05/04 12:48:15] Reading contigs and getting the contig end sequences. A total of 311739 contigs were imported.
[02/22] [2023/05/04 12:48:33] Getting shared contig ends.
[03/22] [2023/05/04 12:48:40] Writing contig end joining pairs.
[04/22] [2023/05/04 12:48:41] Getting contig coverage information.
[05/22] [2023/05/04 12:48:42] Getting query contig list. A total of 2304 query contigs were imported.
[06/22] [2023/05/04 12:48:43] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/22] [2023/05/04 13:00:01] Detecting self_circular contigs.
[08/22] [2023/05/04 13:00:37] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[09/22] [2023/05/04 14:29:20] Saving potential joining paths.
[10/22] [2023/05/04 14:29:23] Checking for invalid joining: sharing queries.
[11/22] [2023/05/04 14:29:28] Getting initial joining status of each query contig.
[12/22] [2023/05/04 14:29:28] Getting final joining status of each query contig.
[13/22] [2023/05/04 14:29:28] Getting the joining order of contigs.
[14/22] [2023/05/04 14:29:28] Getting retrieved contigs.
[15/22] [2023/05/04 14:29:32] Saving joined seqeuences.
[16/22] [2023/05/04 14:29:35] Checking for invalid joining using BLASTn: close strains.
[17/22] [2023/05/04 14:30:11] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.
[18/22] [2023/05/04 14:30:12] Getting the joining details of unique "Extended_circular" and "Extended_partial" query contigs.
[19/22] [2023/05/04 14:30:12] Saving joining summary of "Extended_circular" and "Extended_partial" query contigs.
[20/22] [2023/05/04 14:30:15] Saving joining status of all query contigs.
[21/22] [2023/05/04 14:30:15] Saving self_circular contigs.
[22/22] [2023/05/04 14:30:16] Saving the new fasta file.
======================================================================================================================================================
3. RESULTS SUMMARY
# Total queries: 2304
# Category i   - Self_circular: 74
# Category ii  - Extended_circular: 120 (Unique: 82)
# Category ii  - Extended_partial: 1088 (Unique: 889)
# Category ii  - Extended_failed (due to COBRA rules): 245
# Category iii - Orphan end: 777
# Check "COBRA_joining_status.txt" for joining status of each query.
# Check "COBRA_joining_summary.txt" for joining details of "Extended_circular" and "Extended_partial" queries.
======================================================================================================================================================
</code>

Official repository: [[https://github.com/linxingchen/cobra]]

Paper: [[https://www.nature.com/articles/s41564-023-01598-2]]
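The "expected overlap length" rule described in the introduction (maxK for metaSPAdes and MEGAHIT, maxK-1 for IDBA_UD) can be sketched in a few lines for clarity; `expected_overlap` is an illustrative helper written for this page, not a COBRA function.

```python
def expected_overlap(assembler, maxk):
    """Expected end-overlap length between fragmented contigs:
    the max-kmer (maxK) for metaSPAdes and MEGAHIT, and maxK-1 for IDBA_UD."""
    if assembler in ("metaspades", "megahit"):
        return maxk
    if assembler == "idba":
        return maxk - 1
    raise ValueError(f"COBRA has not been tested on assembler: {assembler}")

# Matches the log example above: IDBA_UD with Max-kmer 140 gives a 139 bp overlap.
```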