====== COBRA (Contig Overlap Based Re-Assembly) ======
COBRA (Contig Overlap Based Re-Assembly) is a bioinformatics tool for obtaining higher-quality viral genomes assembled from metagenomes of short paired-end reads. COBRA is written in Python. So far, COBRA has only been tested on contigs and scaffolds assembled by metaSPAdes, IDBA_UD, and MEGAHIT.
===== Introduction =====
1. Why are metagenome-assembled contigs fragmented?
Genomes assembled from short paired-end metagenomic reads are usually fragmented due to (1) within-genome repeats, (2) regions shared between genomes, and (3) within-population variation, because the widely used de Bruijn graph assemblers (e.g., metaSPAdes, IDBA_UD, and MEGAHIT) stop at a breakpoint when multiple extension paths are available, rather than making a risky extension.
{{:undefined:aaaa.png?400|}}
2. Contigs may be joined at their expected ends
Given the assembler principles above, the fragmented contigs have end overlaps of a determined length: the max-kmer (maxK) used in the de novo assembly for metaSPAdes and MEGAHIT, and maxK-1 for IDBA_UD. We refer to this as the "expected overlap length".
{{::aaa1.png?400|}}
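As a toy illustration of the "expected overlap length" idea (the sequences and maxk value below are made up for demonstration and do not come from the COBRA codebase), two contigs are join candidates when the last maxK bases of one match the first maxK bases of the other, in either forward or reverse-complement orientation:

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = str.maketrans("ACGT", "TGCA")
    return seq.translate(comp)[::-1]

def shares_expected_overlap(a, b, maxk):
    """True if contig a's end overlaps contig b's start by maxk bp,
    checking b in both forward and reverse-complement orientation."""
    end = a[-maxk:]
    return end == b[:maxk] or end == revcomp(b)[:maxk]

# Toy example with maxk = 4 (real assemblies use e.g. maxk = 141 for MEGAHIT)
print(shares_expected_overlap("AAATTGCA", "TGCACCGT", 4))  # → True
```

COBRA searches for such shared ends across the whole assembly before evaluating which joins are valid.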
===== How COBRA works =====
COBRA identifies the "expected overlap length" joins among all contigs of the assembly (in both forward and reverse-complement orientations), then searches for valid joining paths for each user-provided query (which should be a fraction of the whole assembly), based on features including contig coverage, contig overlap relationships, and contig continuity (based on paired-end read mapping).
Note that scaffolds (e.g., from a metaSPAdes assembly) can be used as input for COBRA extension; however, we do not recommend using IDBA_UD scaffolds, owing to potential errors introduced in the scaffolding step. Thus, for IDBA_UD and MEGAHIT assemblies, contigs should be used.
Given that COBRA has only been tested on contigs/scaffolds from IDBA_UD, metaSPAdes, and MEGAHIT, using it on contigs/scaffolds from any other assembler would be risky.
{{:aaa2.png?400|}}
===== Input files =====
(1) COBRA requires four files as input:
-f/--fasta: a fasta-format file containing all contigs of a single assembly. Note that IDBA_UD and MEGAHIT usually save contigs with a minimum length of 200 bp.
-c/--coverage: a two-column (tab-separated) file giving the sequencing coverage of all contigs in the -f/--fasta file, for example:
contig-140_1 42.1388
contig-140_2 14.6023
contig-140_3 15.4817
contig-140_4 41.2746
...
-q/--query: the query contigs that the user wants COBRA to extend; can be provided either as a fasta-format file or as a single-column text file of query contig names. Make sure the names exactly match the format in the -f/--fasta file, otherwise COBRA may have problems extending them. For this test I used contig sequences shorter than 500 bp as the queries.
-m/--mapping: the paired-end read mapping file of all contigs in the -f/--fasta file; can be a sam or bam file.
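Because COBRA requires the contig names in the coverage file to match those in the fasta file, a quick consistency check can save a failed run. A minimal sketch (the file names in the example are hypothetical):

```python
def fasta_names(path):
    """Collect sequence names (text after '>', up to the first whitespace)."""
    names = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                names.add(line[1:].split()[0])
    return names

def coverage_names(path):
    """Collect contig names from a tab-separated two-column coverage file."""
    with open(path) as fh:
        return {line.split("\t")[0] for line in fh if line.strip()}

def missing_coverage(fasta_path, cov_path):
    """Return contigs present in the fasta file but absent from the coverage file."""
    return fasta_names(fasta_path) - coverage_names(cov_path)
```

An empty result from missing_coverage() means every contig in the -f/--fasta file has a coverage entry.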
(2) and three parameters:
-a/--assembler: the de novo assembler used; currently only "idba" (for IDBA_UD), "metaspades" (for metaSPAdes), and "megahit" (for MEGAHIT).
-maxk/--maxk: the largest kmer used in the de novo assembly.
-mink/--mink: the smallest kmer used in the de novo assembly.
(3) Optional flags:
-lm/--linkage_mismatch: the number of read-mapping mismatches allowed when determining whether two contigs are spanned by paired reads.
-o/--output: the name of the output folder; defaults to "{-q/--query}.COBRA" if not provided.
-t/--threads: the number of threads used for the BLASTn search.
===== How to obtain the mapping file =====
The mapping file can be obtained with tools such as Bowtie2 and BBMap; see each tool's manual for details. Below is a general way to obtain the sorted sam/bam file, so you will need samtools available (which can be obtained here - https://github.com/samtools/samtools).
For example,
contig file = "contigs.fasta"
first read file = "R1.fastq.gz"
second read file = "R2.fastq.gz"
(1) With Bowtie2 (https://github.com/BenLangmead/bowtie2)
bowtie2-build contigs.fasta contigs.fasta
bowtie2 -p 16 -x contigs.fasta -1 R1.fastq.gz -2 R2.fastq.gz -S output.sam && samtools view -bS output.sam | samtools sort -o sorted_output.bam -
(2) With BBMap (https://github.com/BioInfoTools/BBMap)
bbmap.sh ref=contigs.fasta in1=R1.fastq.gz in2=R2.fastq.gz threads=16 out=output.sam
samtools view -bS output.sam > output.bam
samtools sort -o sorted_output.bam output.bam
/PUBLIC/software/public/bowtie2-2.2.4/bowtie2-build final.contigs.fa final.contigs.fa
/PUBLIC/software/public/bowtie2-2.2.4/bowtie2 -p 16 -x final.contigs.fa -1 GCH1E62_350.nohost.fq1.gz -2 GCH1E62_350.nohost.fq2.gz -S output.sam
/PUBLIC/software/MICRO/share/MetaGenome_pipeline/MetaGenome_pipeline_V5.1/software/samtools/samtools-1.3/samtools view -bS output.sam -o output.bam
/PUBLIC/software/MICRO/share/MetaGenome_pipeline/MetaGenome_pipeline_V5.1/software/samtools/samtools-1.3/samtools sort -o sorted_output.bam output.bam
rm output.bam
===== How to obtain the coverage file =====
(1) With jgi_summarize_bam_contig_depths
Once the sorted sam or bam files are ready, the "jgi_summarize_bam_contig_depths" tool from MetaBAT (https://bitbucket.org/berkeleylab/metabat/src/master/) can be used to obtain the coverage profile, which should then be converted into a two-column tab-separated file.
jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *sam
jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *bam
The output file of jgi_summarize_bam_contig_depths can be converted into a two-column tab-separated file using the script provided in this study (coverage.transfer.py).
python coverage.transfer.py -i original.coverage.txt -o coverage.txt
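coverage.transfer.py is provided by the COBRA study; as a rough illustration of what the conversion involves, the sketch below assumes the usual jgi_summarize_bam_contig_depths layout (a header line, then columns contigName, contigLen, totalAvgDepth, followed by per-sample columns) and keeps only the name and totalAvgDepth:

```python
def jgi_to_two_columns(in_path, out_path):
    """Convert a jgi_summarize_bam_contig_depths depth table into the
    two-column (contig name, coverage) format COBRA expects.
    Assumes column 0 = contigName and column 2 = totalAvgDepth."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        fin.readline()  # skip the header line
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            fout.write(f"{fields[0]}\t{fields[2]}\n")
```

Prefer the provided coverage.transfer.py for real runs; this sketch only shows the shape of the transformation.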
(2) CoverM
CoverM is a fast DNA read coverage and relative abundance calculator focused on metagenomics applications. Usage can be found here (https://github.com/wwood/CoverM).
(3) pyCoverM
pyCoverM is a simple Python interface to CoverM's fast coverage estimation functions, which can be found here (https://github.com/apcamargo/pycoverm).
source /TJPROJ1/META_ASS/soft/anaconda3/bin/activate metawrap-env
/TJPROJ1/META_ASS/soft/anaconda3/envs/metawrap-env/bin/jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *sam
/TJPROJ1/META_ASS/soft/anaconda3/envs/metawrap-env/bin/jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *bam
===== Running =====
cobra-meta -f input.fasta -q query.fasta -c coverage.txt -m mapping.sam -a idba -mink 20 -maxk 140
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a idba -mink 20 -maxk 140 -mm 2
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a metaspades -mink 21 -maxk 127 -mm 2
cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a megahit -mink 21 -maxk 141 -mm 2
A Python script to filter contigs by length:
def read_fasta(filename):
    """Read sequences from a FASTA file and return them as a dict."""
    sequences = {}
    with open(filename, 'r') as file:
        header = None
        sequence = ''
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    sequences[header] = sequence
                header = line[1:]
                sequence = ''
            else:
                sequence += line
        if header is not None:
            sequences[header] = sequence
    return sequences

def write_fasta(sequences, filename):
    """Write sequences to a FASTA file."""
    with open(filename, 'w') as file:
        for header, sequence in sequences.items():
            file.write(f'>{header}\n{sequence}\n')

def filter_sequences(input_file, output_file):
    """Keep sequences shorter than 500 bp from the input FASTA file and write them to the output file."""
    sequences = read_fasta(input_file)
    filtered_sequences = {header: sequence for header, sequence in sequences.items() if len(sequence) < 500}
    write_fasta(filtered_sequences, output_file)

if __name__ == "__main__":
    input_file = "final.contigs.fa"  # input FASTA file name
    output_file = "output.fasta"  # output FASTA file name
    filter_sequences(input_file, output_file)
The resulting output.fasta is then used as the -q argument:
source /TJPROJ1/META_ASS/soft/anaconda3/bin/activate /TJPROJ1/META_ASS/soft/cobra
cobra-meta -q output.fasta -f final.contigs.fa -a megahit -mink 21 -maxk 87 -m sorted_output.bam -c coverage.txt -o query.fasta.COBRA.out
===== Output files =====
Below is a general list of the output files in the output folder:
COBRA_category_i_self_circular_queries_trimmed.fasta
COBRA_category_i_self_circular_queries_trimmed.fasta.summary.txt
COBRA_category_ii_extended_circular_unique (folder)
COBRA_category_ii_extended_circular_unique.fasta
COBRA_category_ii_extended_circular_unique.fasta.summary.txt
COBRA_category_ii_extended_circular_unique_joining_details.txt
COBRA_category_ii_extended_failed.fasta
COBRA_category_ii_extended_failed.fasta.summary.txt
COBRA_category_ii_extended_partial_unique (folder)
COBRA_category_ii_extended_partial_unique.fasta
COBRA_category_ii_extended_partial_unique.fasta.summary.txt
COBRA_category_ii_extended_partial_unique_joining_details.txt
COBRA_category_iii_orphan_end.fasta
COBRA_category_iii_orphan_end.fasta.summary.txt
COBRA_joining_status.txt
COBRA_joining_summary.txt
intermediate.files (folder)
log
debug.txt
contig.new.fa
For all queries, COBRA assigns them to different categories based on their joining status (detailed in the COBRA_joining_status.txt file), i.e.,
"self_circular" - the query contig itself is a circular genome.
"extended_circular" - the query contig was joined and extended into a circular genome.
"extended_partial" - the query contig was joined and extended, but not to circular.
"extended_failed" - the query contig could not be extended due to COBRA rules.
"orphan_end" - neither end of the query contig shares the "expected overlap length" with others.
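To get a quick per-category tally from COBRA_joining_status.txt, a small script like the one below can be used. The exact column layout of that file is an assumption here (tab-separated, with the status in the last column); adjust the indexing to match your actual output:

```python
from collections import Counter

def tally_status(path):
    """Count query contigs per joining status.
    Assumes a tab-separated file with the status in the last column."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                counts[fields[-1]] += 1
    return counts
```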
For the joined and extended queries in each category, only the unique ones (*.fasta) are saved for downstream analyses, and their sequence information (e.g., length, coverage, GC, number of Ns) is summarized in the *fasta.summary.txt files. For the "extended_circular" and "extended_partial" categories, the joining details of each query are recorded in the corresponding folder and *joining_details.txt file, and summarized in the COBRA_joining_summary.txt file; an example is shown below:
QuerySeqID QuerySeqLen TotRetSeqs TotRetLen AssembledLen ExtendedLen Status
contig-140_100 47501 3 50379 49962 2461 Extended_circular
contig-140_112 45060 3 62549 62132 17072 Extended_circular
contig-140_114 44829 2 45342 45064 235 Extended_circular
contig-140_160 40329 2 41018 40740 411 Extended_circular
contig-140_188 38386 5 48986 48291 9905 Extended_circular
...
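A consistency note on the example table: in every row shown, ExtendedLen equals AssembledLen minus QuerySeqLen. A quick check on the values above:

```python
# Rows taken from the example COBRA_joining_summary.txt above:
# (QuerySeqID, QuerySeqLen, AssembledLen, ExtendedLen)
rows = [
    ("contig-140_100", 47501, 49962, 2461),
    ("contig-140_112", 45060, 62132, 17072),
    ("contig-140_114", 44829, 45064, 235),
    ("contig-140_160", 40329, 40740, 411),
    ("contig-140_188", 38386, 48291, 9905),
]
# ExtendedLen = AssembledLen - QuerySeqLen holds for each example row
for name, query_len, assembled_len, extended_len in rows:
    assert assembled_len - query_len == extended_len
print("all rows consistent")
```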
Log file: the log file records each processing step; an example is shown below:
1. INPUT INFORMATION
# Assembler: IDBA_UD
# Min-kmer: 20
# Max-kmer: 140
# Overlap length: 139 bp
# Read mapping max mismatches for contig linkage: 2
# Query contigs: file-path
# Whole contig set: file-path
# Mapping file: file-path
# Coverage file: file-path
# Output folder: file-path
2. PROCESSING STEPS
[01/22] [2023/05/04 12:48:15] Reading contigs and getting the contig end sequences. A total of 311739 contigs were imported.
[02/22] [2023/05/04 12:48:33] Getting shared contig ends.
[03/22] [2023/05/04 12:48:40] Writing contig end joining pairs.
[04/22] [2023/05/04 12:48:41] Getting contig coverage information.
[05/22] [2023/05/04 12:48:42] Getting query contig list. A total of 2304 query contigs were imported.
[06/22] [2023/05/04 12:48:43] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/22] [2023/05/04 13:00:01] Detecting self_circular contigs.
[08/22] [2023/05/04 13:00:37] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[09/22] [2023/05/04 14:29:20] Saving potential joining paths.
[10/22] [2023/05/04 14:29:23] Checking for invalid joining: sharing queries.
[11/22] [2023/05/04 14:29:28] Getting initial joining status of each query contig.
[12/22] [2023/05/04 14:29:28] Getting final joining status of each query contig.
[13/22] [2023/05/04 14:29:28] Getting the joining order of contigs.
[14/22] [2023/05/04 14:29:28] Getting retrieved contigs.
[15/22] [2023/05/04 14:29:32] Saving joined seqeuences.
[16/22] [2023/05/04 14:29:35] Checking for invalid joining using BLASTn: close strains.
[17/22] [2023/05/04 14:30:11] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.
[18/22] [2023/05/04 14:30:12] Getting the joining details of unique "Extended_circular" and "Extended_partial" query contigs.
[19/22] [2023/05/04 14:30:12] Saving joining summary of "Extended_circular" and "Extended_partial" query contigs.
[20/22] [2023/05/04 14:30:15] Saving joining status of all query contigs.
[21/22] [2023/05/04 14:30:15] Saving self_circular contigs.
[22/22] [2023/05/04 14:30:16] Saving the new fasta file.
======================================================================================================================================================
3. RESULTS SUMMARY
# Total queries: 2304
# Category i - Self_circular: 74
# Category ii - Extended_circular: 120 (Unique: 82)
# Category ii - Extended_partial: 1088 (Unique: 889)
# Category ii - Extended_failed (due to COBRA rules): 245
# Category iii - Orphan end: 777
# Check "COBRA_joining_status.txt" for joining status of each query.
# Check "COBRA_joining_summary.txt" for joining details of "Extended_circular" and "Extended_partial" queries.
======================================================================================================================================================
Official repository: [[https://github.com/linxingchen/cobra]]
Publication: [[https://www.nature.com/articles/s41564-023-01598-2]]