GATK多样本(群call)需要0/0基因型位点

GATK升级到4之后,HaplotypeCaller只能处理单样本文件,当有多样本时,官方建议使用HaplotypeCaller对单bam文件分别进行变异检测,生成GVCF文件,GVCF会记录每一个位点到情况,包括有无突变,VCF只记录突变位点情况,之后在下一步对GVCF文件进行合并。

前处理与流程一致,测试脚本如下:

/TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/bin/java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar ReorderSam -I=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/QC/bam/RNA_AS_2.bam -O=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder.bam -R=/TJPROJ6/GB_TR/reference_data/new_pip/Plant/Oryza_sativa/Oryza_sativa_UGA/genome/Sequence/WholeGenomeFasta/genome.fa --TMP_DIR=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/tmp --VALIDATION_STRINGENCY SILENT

/TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/bin/java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar AddOrReplaceReadGroups -I=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder.bam -O=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder_head.bam -SO=coordinate -LB=LB -PL=illumina -PU=PU -SM=RNA_AS_2 --TMP_DIR=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/tmp --VALIDATION_STRINGENCY SILENT

/TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/bin/java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar MarkDuplicates -I=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder_head.bam -O=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder_head_dedup.bam --VALIDATION_STRINGENCY=SILENT -M=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder_head_dedup_metrics.txt -MAX_FILE_HANDLES=4000 --CREATE_INDEX=true --REMOVE_DUPLICATES=true --TMP_DIR=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/tmp

/TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/bin/java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar SplitNCigarReads --skip-mapping-quality-transform -R=/TJPROJ6/GB_TR/reference_data/new_pip/Plant/Oryza_sativa/Oryza_sativa_UGA/genome/Sequence/WholeGenomeFasta/genome.fa -I=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder_head_dedup.bam -O=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder_head_dedup_split.bam --tmp-dir=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/tmp

call变异位点,加参数-ERC GVCF,输出GCVF文件

/TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/bin/java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar HaplotypeCaller -I=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2_reorder_head_dedup_split.bam -O=/TJPROJ6/NC_BG_SH/shouhou/202303/X101SC22072170-Z02-J001/SNP/RNA_AS_2/RNA_AS_2.vcf  -R=/TJPROJ6/GB_TR/reference_data/new_pip/Plant/Oryza_sativa/Oryza_sativa_UGA/genome/Sequence/WholeGenomeFasta/genome.fa --dont-use-soft-clipped-bases=true -ploidy=2 -stand-call-conf=20 --output-mode=EMIT_ALL_SITES -ERC GVCF

合并多样本GVCF文件,并获取发生变异的位点

#合并(限制内存4G最快,可以按染色体拆分或者再将染色体按bin拆分,节约时间,拆分脚本/TJPROJ6/RNA_SH/personal_dir/zhangxin/jiaoben/snp/split_vcf.py)
/TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/bin/java -Xmx4g -jar /TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar CombineGVCFs -R /TJPROJ6/GB_TR/reference_data/new_pip/Plant/Oryza_sativa/Oryza_sativa_UGA/genome/Sequence/WholeGenomeFasta/genome.fa -V RNA_AS_1/RNA_AS_1.vcf.gz -V RNA_AS_2/RNA_AS_2.vcf.gz -V RNA_FS_1/RNA_FS_1.vcf.gz -V RNA_FS_2/RNA_FS_2.vcf.gz -V RNA_YS_1/RNA_YS_1.vcf.gz -V RNA_YS_2/RNA_YS_2.vcf.gz  -O GATK_merge_all_sample.vcf
#获取变异位点
/TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/bin/java -Xmx25g -jar /TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar GenotypeGVCFs  -R /TJPROJ6/GB_TR/reference_data/new_pip/Plant/Oryza_sativa/Oryza_sativa_UGA/genome/Sequence/WholeGenomeFasta/genome.fa --variant GATK_merge_all_sample.vcf -O GET_GATK_merge_all_sample.vcf
若果有dbsnp,可加参数--dbsnp
/TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/bin/java -Xmx25g -jar /TJPROJ2/GB/PUBLIC/software/GB_TR/mRNA/miniconda3/envs/SNP/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar GenotypeGVCFs  -R /TJPROJ6/GB_TR/reference_data/new_pip/Plant/Oryza_sativa/Oryza_sativa_UGA/genome/Sequence/WholeGenomeFasta/genome.fa --variant GATK_merge_all_sample.vcf -O GET_GATK_merge_all_sample.vcf --dbsnp <dbsnp.vcf>

通过这种方法可获取,多样本的变异位点,保留基因型为0/0的位点,之后可对vcf文件继续筛选,注释等操作(未测试)

关于拆分脚本

  /TJPROJ6/RNA_SH/personal_dir/zhangxin/jiaoben/snp/split_vcf.py -h
  usage: split_vcf.py [-h] [-A VCF] [-C] [-L CHR_LIST] [-B BIN] [-O OUT]
                      [-F FAI]
  
  split vcf file
  
  optional arguments:
    -h, --help            show this help message and exit
    -A VCF, --vcf VCF     the vcf file
    -C, --chr             split chr
    -L CHR_LIST, --chr_list CHR_LIST
                          get chr list
    -B BIN, --bin BIN     the bin size
    -O OUT, --out OUT     the out dir
    -F FAI, --fai FAI     the fai file

例如仅按染色体拆分

  /TJPROJ6/RNA_SH/personal_dir/zhangxin/jiaoben/snp/split_vcf.py --vcf T4.vcf.gz --chr --out ./ --fai genome.fa.fai

按染色体bin拆分

  /TJPROJ6/RNA_SH/personal_dir/zhangxin/jiaoben/snp/split_vcf.py --vcf T4.vcf.gz --bin 10000000 --out ./ --fai genome.fa.fai