Reference links:
https://arriba.readthedocs.io/en/latest/
https://github.com/suhrig/arriba
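Arriba is a command-line tool for detecting gene fusions from RNA-Seq data. It was developed for clinical research and combines short run times with high sensitivity. Arriba works on alignments produced by STAR and takes STAR's standard output files (Chimeric.out.sam or Aligned.out.bam) as input.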
1. Build the STAR index
Note that the STAR version must be >=2.7.10a; otherwise the downstream analysis fails with an error.
export PATH="/TJPROJ6/RNA_SH/personal_dir/xuyi/miniconda3/bin:$PATH"
source /TJPROJ6/RNA_SH/personal_dir/xuyi/miniconda3/bin/activate Arriba
STAR --runMode genomeGenerate \
    --genomeFastaFiles /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/ref/ref_genome.fa \
    --genomeDir /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/ref/ \
    --sjdbGTFfile /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/ref/ref_annot.gtf \
    --runThreadN 18 --sjdbOverhang 150 --genomeSAsparseD 1
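Because the version requirement is easy to overlook, it can be checked before any index is built. The following is a minimal sketch, not part of Arriba or STAR: the helper names are hypothetical, and it assumes that `STAR --version` prints a string such as `2.7.10a`.

```python
import re
import subprocess

MIN_STAR_VERSION = "2.7.10a"  # minimum version stated in this note


def parse_star_version(text):
    """Turn a string such as '2.7.10a' into a sortable tuple (2, 7, 10, 'a')."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)([a-z]?)", text)
    if match is None:
        raise ValueError(f"unrecognised STAR version string: {text!r}")
    major, minor, patch, suffix = match.groups()
    return int(major), int(minor), int(patch), suffix


def star_version_ok(version_text, minimum=MIN_STAR_VERSION):
    """Return True if version_text is at least `minimum`."""
    return parse_star_version(version_text) >= parse_star_version(minimum)


def installed_star_version(executable="STAR"):
    """Ask the STAR binary found on PATH for its version string (assumed output format)."""
    result = subprocess.run([executable, "--version"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

A typical use would be `star_version_ok(installed_star_version())` at the top of a pipeline script, aborting early if it returns False.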
2. Run STAR and Arriba
In the standard Arriba workflow, the output of STAR is piped directly into Arriba, which saves run time. For example:
export PATH="/TJPROJ6/RNA_SH/personal_dir/xuyi/miniconda3/bin:$PATH"
source /TJPROJ6/RNA_SH/personal_dir/xuyi/miniconda3/bin/activate Arriba
STAR \
    --runThreadN 8 \
    --genomeDir /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/ref --genomeLoad NoSharedMemory \
    --outFileNamePrefix /TJPROJ6/RNA_SH/shouhou/pip_example/Arriba/test/reuslt1/test. \
    --readFilesIn /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/rawdata/sh_3/sh_3_1.fq.gz /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/rawdata/sh_3/sh_3_2.fq.gz --readFilesCommand zcat \
    --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 0 \
    --outFilterMultimapNmax 50 --peOverlapNbasesMin 10 --alignSplicedMateMapLminOverLmate 0.5 --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentMin 10 --chimOutType WithinBAM HardClip --chimJunctionOverhangMin 10 --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --chimSegmentReadGapMax 3 --chimMultimapNmax 50 |
arriba \
    -x /dev/stdin \
    -o /TJPROJ6/RNA_SH/shouhou/pip_example/Arriba/test/reuslt1/sh_3_fusions.tsv -O /TJPROJ6/RNA_SH/shouhou/pip_example/Arriba/test/reuslt1/sh_3_fusions.discarded.tsv \
    -a /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/ref/ref_genome.fa -g /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/ref/ref_annot.gtf \
    -f blacklist
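A plain shell pipe reports only the exit status of the last command unless `pipefail` is set, so a failing STAR can go unnoticed. As a hedged sketch, the same STAR-to-Arriba pipe can be driven from Python so that both exit codes are checked. The function names are hypothetical; the flags are the ones used in the command of this section and nothing else.

```python
import subprocess


def build_star_cmd(genome_dir, fastq1, fastq2, prefix, threads=8):
    """STAR arguments as used in this note; alignments go to stdout as unsorted BAM."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir, "--genomeLoad", "NoSharedMemory",
        "--outFileNamePrefix", prefix,
        "--readFilesIn", fastq1, fastq2, "--readFilesCommand", "zcat",
        "--outStd", "BAM_Unsorted", "--outSAMtype", "BAM", "Unsorted",
        "--outSAMunmapped", "Within", "--outBAMcompression", "0",
        "--outFilterMultimapNmax", "50", "--peOverlapNbasesMin", "10",
        "--alignSplicedMateMapLminOverLmate", "0.5",
        "--alignSJstitchMismatchNmax", "5", "-1", "5", "5",
        "--chimSegmentMin", "10", "--chimOutType", "WithinBAM", "HardClip",
        "--chimJunctionOverhangMin", "10", "--chimScoreDropMax", "30",
        "--chimScoreJunctionNonGTAG", "0", "--chimScoreSeparation", "1",
        "--chimSegmentReadGapMax", "3", "--chimMultimapNmax", "50",
    ]


def build_arriba_cmd(fasta, gtf, fusions_tsv, discarded_tsv,
                     extra_args=("-f", "blacklist")):
    """Arriba arguments as used in this note; reads the alignments from stdin."""
    return ["arriba", "-x", "/dev/stdin",
            "-o", fusions_tsv, "-O", discarded_tsv,
            "-a", fasta, "-g", gtf, *extra_args]


def run_pipeline(star_cmd, arriba_cmd):
    """Run `STAR | arriba` and raise if either process exits non-zero."""
    star = subprocess.Popen(star_cmd, stdout=subprocess.PIPE)
    arriba = subprocess.Popen(arriba_cmd, stdin=star.stdout)
    star.stdout.close()  # lets STAR receive SIGPIPE if arriba exits early
    arriba_rc = arriba.wait()
    star_rc = star.wait()
    if star_rc != 0 or arriba_rc != 0:
        raise RuntimeError(f"STAR exited with {star_rc}, arriba with {arriba_rc}")
```

The `extra_args` default mirrors the `-f blacklist` used in the example; pass a different tuple when a blacklist file is available.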
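3. The blacklist option (-b)
The official documentation explains the purpose of the blacklist (supplied with the -b option): running without it greatly increases the false-positive rate. In our tests, omitting the blacklist greatly increased the number of reported fusions, and almost all of the additional calls had medium or low confidence.
Arriba ships blacklist files for the GRCh37, GRCh38, GRCm38 and GRCm39 reference genomes. Each file is essentially a list of fusions, compiled by the authors from a large amount of data, that are unrelated to cancer or other diseases. A reference genome from one of these four assemblies is therefore recommended when using Arriba; with any other reference, the blacklist filter must be disabled with the -f option (-f blacklist).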
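The choice between supplying a blacklist and disabling the filter can be written as a small helper. This is a sketch under assumptions: the function name and the `blacklist_files` mapping are hypothetical, and the actual blacklist file names depend on the installed Arriba release, so they are passed in rather than hard-coded.

```python
SUPPORTED_ASSEMBLIES = {"GRCh37", "GRCh38", "GRCm38", "GRCm39"}


def blacklist_args(assembly, blacklist_files):
    """Return the arriba arguments that handle the blacklist for `assembly`.

    blacklist_files maps an assembly name to the path of its blacklist file
    (hypothetical mapping supplied by the caller).
    """
    if assembly in SUPPORTED_ASSEMBLIES and assembly in blacklist_files:
        # A matching blacklist exists: pass it to arriba with -b.
        return ["-b", blacklist_files[assembly]]
    # No matching blacklist: disable the blacklist filter explicitly.
    return ["-f", "blacklist"]
```

The returned list can be appended to the rest of the arriba command line.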
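4. Arriba output
The main output of Arriba is the list of fusions detected in each sample, sample_fusions.tsv, which gives the names of the two fused genes, their positions, the breakpoints and related information.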
#gene1	gene2	strand1(gene/fusion)	strand2(gene/fusion)	breakpoint1	breakpoint2	site1	site2	type	split_reads1	split_reads2	discordant_mates	coverage1	coverage2	confidence	reading_frame	tags	retained_protein_domains	closest_genomic_breakpoint1	closest_genomic_breakpoint2	gene_id1	gene_id2	transcript_id1	transcript_id2	direction1	direction2	filters
NCOR1P2(112691),UBBP4(19766)	UBBP4	./+	+/+	chr17:22183229	chr17:22204087	intergenic	5'UTR/splice-site	deletion/read-through	109	77	17	270	283	high	.	.	.	.	.	.	ENSG00000263563.4	.	ENST00000584755.1	downstream	upstream	duplicates(31),mismappers(20),mismatches(3),multimappers(2)
BCR	ABL1	+/+	+/+	chr22:23290413	chr9:130854064	CDS/splice-site	CDS/splice-site	translocation	75	66	10	187	216	high	in-frame	.	.	.	.	ENSG00000186716.18	ENSG00000097007.16	ENST00000305877.11	ENST00000372348.5	downstream	upstream	duplicates(9),mismatches(1)
BCR	ABL1	+/+	+/+	chr22:23290413	chr9:130854067	CDS/splice-site	CDS/splice-site	translocation	1	0	10	187	216	high	in-frame	.	.	.	.	ENSG00000186716.18	ENSG00000097007.16	ENST00000305877.11	ENST00000372348.5	downstream	upstream	mismatches(1)
LA16c-352F7.1(65532),RP11-118F19.1(23646)	GSE1	./+	+/+	chr16:85556363	chr16:85633914	intergenic	CDS/splice-site	deletion/read-through	31	38	12	194	250	high	.	.	.	.	.	.	ENSG00000131149.16	.	ENST00000253458.10	downstream	upstream	duplicates(5),low_entropy(1),mismatches(3)
BAG6	SLC44A4	-/-	-/-	chr6:31651656	chr6:31865784	CDS/splice-site	CDS/splice-site	duplication	38	35	6	972	79	high	out-of-frame	.	.	.	.	ENSG00000204463.11	ENSG00000204385.9	ENST00000211379.8	ENST00000375562.7	upstream	downstream	duplicates(7)
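For downstream filtering, the fusion list can be read with Python's standard csv module. The sketch below assumes a tab-separated file whose header line starts with `#`, as in the excerpt above; the function names are hypothetical and only columns shown in the excerpt are used.

```python
import csv


def parse_fusions(lines):
    """Parse the lines of an Arriba fusions.tsv into a list of dicts keyed by column name."""
    reader = csv.DictReader(lines, delimiter="\t")
    # The header line starts with '#'; strip it so the first key is 'gene1'.
    reader.fieldnames = [name.lstrip("#") for name in (reader.fieldnames or [])]
    return list(reader)


def high_confidence(fusions):
    """Keep only the fusions that Arriba labelled as high confidence."""
    return [fusion for fusion in fusions if fusion["confidence"] == "high"]


def supporting_reads(fusion):
    """Total supporting reads: split reads on both sides plus discordant mates."""
    return (int(fusion["split_reads1"])
            + int(fusion["split_reads2"])
            + int(fusion["discordant_mates"]))
```

An open file handle can be passed directly, for example `with open("sample_fusions.tsv", newline="") as handle: fusions = parse_fusions(handle)`.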
In addition, Arriba comes with a plotting script, draw_fusions.R, for visualizing the detected fusions. Example usage:
export PATH="/TJPROJ6/RNA_SH/personal_dir/xuyi/miniconda3/bin:$PATH"
source /TJPROJ6/RNA_SH/personal_dir/xuyi/miniconda3/bin/activate Arriba
draw_fusions.R \
    --fusions=fusions.tsv \
    --output=fusions.pdf \
    --annotation=genome.gtf \
    --cytobands=database/cytobands.tsv \
    --proteinDomains=database/protein_domains.gff3
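Here, the files passed to --cytobands and --proteinDomains are shipped with the software; as with the blacklist, they are only available for GRCh37, GRCh38, GRCm38 and GRCm39. The resulting plot is shown in the figure below: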
5. Wrapper script
For convenience, the steps above have been combined into a wrapper script: /TJPROJ6/RNA_SH/personal_dir/xuyi/scripts/Arriba/use_Arriba.py
python /TJPROJ6/RNA_SH/personal_dir/xuyi/scripts/Arriba/use_Arriba.py \
    -f /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/ref/ref_genome.fa \
    -g /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/ref/ref_annot.gtf \
    -i /TJPROJ6/RNA_SH/shouhou/pip_example/STAR-seqr/X101SC24074806-Z01-J087/rawdata \
    -sn nc_1,nc_2,nc_3,sh_1,sh_2,sh_3 \
    -o /TJPROJ6/RNA_SH/shouhou/pip_example/Arriba/test/reuslt2 \
    -t GRCh38
The -t option specifies the reference genome assembly. If it is not one of GRCh37, GRCh38, GRCm38 or GRCm39, the blacklist filter is disabled.
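How the raw-data directory and the comma-separated sample list might be turned into per-sample FASTQ pairs can be sketched as follows. This is not the code of use_Arriba.py, whose internals are not shown here; it is a hypothetical helper that assumes the layout seen in the STAR command of section 2, namely <rawdata>/<sample>/<sample>_1.fq.gz and <sample>_2.fq.gz.

```python
from pathlib import Path


def fastq_pairs(rawdata_dir, sample_names):
    """Map each sample name to its (read 1, read 2) FASTQ paths.

    Assumes <rawdata_dir>/<sample>/<sample>_1.fq.gz and <sample>_2.fq.gz.
    `sample_names` is a comma-separated string such as "nc_1,nc_2,sh_3".
    """
    root = Path(rawdata_dir)
    pairs = {}
    for name in sample_names.split(","):
        name = name.strip()
        if not name:
            continue  # ignore empty entries such as a trailing comma
        sample_dir = root / name
        pairs[name] = (sample_dir / f"{name}_1.fq.gz",
                       sample_dir / f"{name}_2.fq.gz")
    return pairs
```

Each pair can then be handed to one STAR and Arriba run, one sample at a time.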
References:
Uhrig, Sebastian et al. "Accurate and efficient detection of gene fusions from RNA sequencing data." Genome Research vol. 31,3 (2021): 448-460. doi:10.1101/gr.257246.119 https://genome.cshlp.org/content/31/3/448.long