Predicting Secreted Proteins

#! Add cd-hit to the environment: append /PUBLIC/software/RNA/cd-hit-v4.6.1-2012-08-27 after PATH=

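Background:

Project P101SC17070737-01 (北林, technical service contract for transcriptome sequencing and analysis of 12 samples): secreted proteins are predicted following the approach described in the literature.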

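Characteristics of secreted proteins:

(1) An N-terminal signal peptide
(2) No transmembrane domain
(3) No GPI anchor site
(4) Not targeted to mitochondria or other intracellular organelles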

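Methods

1. TransDecoder v3.0.1 (http://transdecoder.github.io/): predict full-length ORFs on the unigenes, then split unigene.fasta.transdecoder.pep by ORF type; complete.fa is used for secreted-protein prediction.
2. SignalP v4.1 (http://www.cbs.dtu.dk/services/SignalP/, default parameters): identify signal peptides to obtain candidate secreted-protein sequences.
3. TMHMM v2.0 (http://www.cbs.dtu.dk/services/TMHMM/): analyze transmembrane domains; proteins with no transmembrane helix or a single transmembrane helix are provisionally classed as secreted (a single predicted helix near the N-terminus may be the signal peptide itself).
4. TargetP-2.0 (https://services.healthtech.dtu.dk/services/TargetP-2.0/): keep the proteins predicted as SP as the final set of secreted proteins. This step previously used TargetP v1.1 (http://www.cbs.dtu.dk/services/TargetP/), which predicts the subcellular location of eukaryotic proteins, keeping entries with Loc = S; it has been upgraded to version 2.0 and 1.1 is no longer used.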
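The filtering chain in steps 2 to 4 can be sketched as a single rule applied per protein. This is only an illustration of the logic, not the code in Transdecoder-signalp-tmhmm.py; the `Prediction` container, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Prediction:
    """Per-protein calls gathered from the three tools (hypothetical container)."""
    has_signal_peptide: bool  # SignalP call
    tm_helices: int           # number of helices predicted by TMHMM
    targetp_class: str        # TargetP-2.0 prediction, e.g. "SP", "mTP", "noTP"


def is_secreted(p: Prediction, max_tm_helices: int = 1) -> bool:
    """Apply the three filters in the order the methods list them."""
    return (
        p.has_signal_peptide
        and p.tm_helices <= max_tm_helices
        and p.targetp_class == "SP"
    )


def select_secreted(preds: Dict[str, Prediction]) -> List[str]:
    """Return the sorted IDs of proteins that pass every filter."""
    return sorted(pid for pid, p in preds.items() if is_secreted(p))
```

The default of one allowed helix mirrors step 3, where zero or one predicted transmembrane helix is accepted.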

Script

/TJPROJ1/RNA/shouhou/script_dir/noref/Transdecoder-signalp-tmhmm.py
Updated path: /TJPROJ6/RNA_SH/script_dir/signalp/Transdecoder-signalp-tmhmm.py

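Usage

python /TJPROJ1/RNA/shouhou/script_dir/noref/Transdecoder-signalp-tmhmm.py \
--outdir <output directory> \
--unigene /TJPROJ1/RNA/shouhou/shouhou_dir/P101SC17070737-01/shouhou_20170827/P101SC17070737-01-B1/raw/0.annot/unigene.fasta \
--sp {euk,gram+,gram-} \
--apart y

Options:
--outdir: output directory
--unigene: unigene sequences (FASTA)
--sp: organism group, one of euk, gram+, gram- (euk = eukaryote)
--apart: y to split unigene.fasta.transdecoder.pep by ORF type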

python /TJPROJ6/RNA_SH/script_dir/signalp/Transdecoder-signalp-tmhmm.py \
--outdir /TJPROJ6/RNA_SH/script_dir/signalp/ \
--unigene /TJPROJ6/RNA_SH/script_dir/signalp/raw/test_2.fasta \
--sp euk \
--apart y \
--plant pl \
--outformat short

Results

**transdecoder** prediction results:
*.unigene.fasta.transdecoder.bed: BED file
*.unigene.fasta.transdecoder.gff3: structural annotation file
*.unigene.fasta.transdecoder.cds: CDS sequences
*.unigene.fasta.transdecoder.pep: protein sequences
Classification results:
unigene.fasta.transdecoder.pep.5prime_partial.fa: protein sequences of 5prime_partial ORFs
unigene.fasta.transdecoder.pep.3prime_partial.fa: protein sequences of 3prime_partial ORFs
unigene.fasta.transdecoder.pep.internal.fa: protein sequences of internal ORFs
unigene.fasta.transdecoder.pep.complete.fa: full-length protein sequences
unigene.fasta.transdecoder.cds.5prime_partial.fa: CDS sequences of 5prime_partial ORFs
unigene.fasta.transdecoder.cds.3prime_partial.fa: CDS sequences of 3prime_partial ORFs
unigene.fasta.transdecoder.cds.internal.fa: CDS sequences of internal ORFs
unigene.fasta.transdecoder.cds.complete.fa: full-length CDS sequences
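The classification into complete, 5prime_partial, 3prime_partial and internal files can be illustrated with a small sketch. It assumes each TransDecoder FASTA header carries a `type:<orf_type>` token; the function name is hypothetical and this is not the script's actual implementation.

```python
import re
from collections import defaultdict

# Assumption: the header contains a token such as "type:complete".
ORF_TYPE = re.compile(r"type:(\S+)")


def split_by_orf_type(fasta_text):
    """Group FASTA records by the ORF type named in each header line."""
    groups = defaultdict(list)
    current = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = ORF_TYPE.search(line)
            current = match.group(1) if match else "unknown"
        # Skip blank lines and anything that precedes the first header.
        if current is not None and line.strip():
            groups[current].append(line)
    return {orf_type: "\n".join(lines) + "\n" for orf_type, lines in groups.items()}
```

Each value in the returned dictionary could then be written to a file such as unigene.fasta.transdecoder.pep.complete.fa.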

**signalp**:
signalp_summary_out.txt: SignalP 4.1 prediction results
signalp.secretory.pep.fa: secreted-protein sequences selected from the SignalP 4.1 predictions
Fields in signalp_summary_out.txt:
• C-score (raw cleavage site score)
The C-score is trained to be high at the position immediately after the cleavage site (the first residue in the mature protein).
• S-score (signal peptide score)
The output from the SP networks, which are trained to distinguish positions within signal peptides from positions in the mature part of the proteins and from proteins without signal peptides.
• Y-score (combined cleavage site score)
A combination (geometric average) of the C-score and the slope of the S-score, resulting in a better cleavage site prediction than the raw C-score alone. This is because multiple high-peaking C-scores can be found in one sequence, where only one is the true cleavage site. The Y-score distinguishes between C-score peaks by choosing the one where the slope of the S-score is steep.
• mean S
The average S-score of the possible signal peptide (from position 1 to the position immediately before the maximal Y-score).
• D-score (discrimination score)
A weighted average of the mean S and the max. Y scores. This is the score that is used to discriminate signal peptides from non-signal peptides.
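A minimal sketch of pulling the signal-peptide-positive IDs out of a SignalP summary, assuming signalp_summary_out.txt is in the SignalP 4.1 short format (whitespace-separated columns `name Cmax pos Ymax pos Smax pos Smean D ? Dmaxcut Networks-used`, with `#` comment lines). The function name is hypothetical.

```python
def signalp_positive_ids(summary_text):
    """Return IDs whose '?' column is 'Y', i.e. D-score above the cutoff."""
    ids = []
    for line in summary_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        # Assumed column order: 0 name, 7 Smean, 8 D, 9 '?' (Y/N), 10 Dmaxcut
        if len(fields) >= 10 and fields[9] == "Y":
            ids.append(fields[0])
    return ids
```

The Y/N call already reflects the comparison of the D-score with its cutoff, so the sketch does not recompute it.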

**tmhmm**
signalp_tmhmm.xls: transmembrane-domain predictions for the sequences in signalp.secretory.pep.fa
tmhmm.secretory.pep.fa: sequences kept after filtering, that is, proteins with no transmembrane helix or a single one, provisionally classed as secreted proteins
Fields in signalp_tmhmm.xls:

• Length: the length of the protein sequence.
• Number of predicted TMHs: The number of predicted transmembrane helices.
• Exp number of AAs in TMHs: The expected number of amino acids in transmembrane helices. If this number is larger than 18 it is very likely to be a transmembrane protein (OR have a signal peptide).
• Exp number, first 60 AAs: The expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein. If this number is more than a few, you should be warned that a predicted transmembrane helix in the N-term could be a signal peptide.
• Total prob of N-in: The total probability that the N-term is on the cytoplasmic side of the membrane.
• POSSIBLE N-term signal sequence: a warning that is produced when "Exp number, first 60 AAs" is larger than 10.
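The helix-count filter can be sketched as follows. The layout of signalp_tmhmm.xls is not documented here, so the sketch assumes TMHMM's one-line-per-protein short format (tab-separated `id`, `len=`, `ExpAA=`, `First60=`, `PredHel=`, `Topology=`); both function names are hypothetical.

```python
def parse_tmhmm_short(line):
    """Turn one short-format line into a dict of key/value strings."""
    fields = line.rstrip("\n").split("\t")
    record = {"id": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    return record


def keep_non_membrane(lines, max_helices=1):
    """Return IDs with at most `max_helices` predicted transmembrane helices."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = parse_tmhmm_short(line)
        if int(record["PredHel"]) <= max_helices:
            kept.append(record["id"])
    return kept
```

With `max_helices=1` this keeps proteins with zero or one predicted helix, matching the filter described for tmhmm.secretory.pep.fa.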

fasta_summary.targetp2:
• 1. The protein prediction (SP / mTP / cTP / luTP / noTP) and the associated likelihood probability.
• 2. The cleavage site position and associated likelihood probability. NOTE: if the cleavage site position is "?", the cleavage site is out of range, probably because the input is a protein fragment.
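Selecting the final SP set from the TargetP-2.0 summary can be sketched as below. It assumes fasta_summary.targetp2 is tab-separated with `#` header lines, the sequence ID in the first column and the prediction label in the second; the function name is hypothetical.

```python
def targetp_sp_ids(summary_text, wanted="SP"):
    """Return IDs whose prediction column equals `wanted`."""
    ids = []
    for line in summary_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' header lines
        fields = line.split("\t")
        # Assumed layout: column 0 = ID, column 1 = Prediction
        if len(fields) >= 2 and fields[1] == wanted:
            ids.append(fields[0])
    return ids
```

Passing a different `wanted` label (for example "mTP") would list proteins assigned to another compartment instead.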

References

分泌蛋白.pdf
