从芯片和组学中获得了大量的数据,如何从数据中识别那些表达量周期性波动变化,符合周期性信号的基因,对于研究节律(例如生物钟或细胞周期)非常重要。
目前有很多方法可以去做这样的分析,但是每种方法都有优点和缺点。例如,Lomb-Scargle 和 JTK_CYCLE 可以区分具有高噪声或低采样率的的周期性基因;Fisher’s G Test 和 ARSER,需要有完整的,均匀采样的时间序列数据。
MetaCycle这个R包,整和了最常用最可靠的几种节律分析方法,软件可以自己根据我们的数据选择最优的算法,使我们免于棘手的决定。
要求:时间间隔是整数,取样不间断的,非重复的。如果取样时间不标准会对后续的分析造成困扰。
本质是计算基因的波动是否符合周期性变化(弦函数),以及符合的程度,并依据此进行检验值的计算,比如P值,Q值等等。检验值越小,代表这个结果越可信,也就是说,越符合周期性分布,说明很可能参与到了细胞节律的调节中。
2D时间序列数据集:1、很常见的时间序列数据集,2数据集是二维矩阵。行是基因,列是样本 3D时间序列数据集:3D数据主要是用在医学和临床上的,因为实验室研究,都是控制变量,遗传背景简单,处理因素统一。但是人就完全不一样了,人的变异是非常大的。因此需要考虑个体之间的差异,这时候就需要额外的实验设计信息。除了一个矩阵存储来自所有样品的检测到的分子的实验值之外,需要另一个矩阵来存储每个样品的个体信息。
CycID JTK_pvalue JTK_BH.Q JTK_period JTK_adjphase JTK_amplitude meta2d_Base meta2d_AMP meta2d_rAMP
gene:pycom04g15340 0.0637445887445887 0.201682830093123 20 10 879.871955218145 5931.28880438903 1971.05843117507 0.332315369589915
gene:pycom12g17540 0.552445887445887 0.94039794575596 20 5 410.413331214158 5189.24057758241 646.167239170505 0.124520578591395
gene:pycom05g11090 1 1 20 5 77.9226908670302 4136.81178060999 105.656181704108 0.025540485597952
gene:pycom04g11710 0.552445887445887 0.94039794575596 20 5 622.305682423581 4011.10645952353 45.5116076132619 0.0113463973276511
[,1] CycID character transcript name [,2] ARS_pvalue numeric pvalue from ARS [,3] ARS_BH.Q numeric FDR from ARS [,4] ARS_period numeric period from ARS [,5] ARS_adjphase numeric adjusted phase from ARS [,6] ARS_amplitude numeric amplitude from ARS [,7] JTK_pvalue numeric pvalue from JTK [,8] JTK_BH.Q numeric FDR from JTK [,9] JTK_period numeric period from JTK [,10] JTK_adjphase numeric adjusted phase from JTK [,11] JTK_amplitude numeric amplitude from JTK [,12] LS_pvalue numeric pvalue from LS [,13] LS_BH.Q numeric FDR from JTK [,14] LS_period numeric period from LS [,15] LS_adjphase numeric adjusted phase from LS [,16] LS_amplitude numeric amplitude from LS [,17] meta2d_pvalue numeric integrated pvalue [,18] meta2d_BH.Q numeric FDR based on integrated pvalue [,19] meta2d_period numeric averaged period of three methods [,20] meta2d_phase numeric integrated phase [,21] meta2d_Base numeric baseline value given by meta2d [,22] meta2d_AMP numeric amplitude given by meta2d [,23] meta2d_rAMP numeric relative amplitude [,24:71] CT18 to CT65 numeric sampling time point
项目:P101SC17030478-01-16个小鼠转录组测序及分析软件RNAseq-Musmusculus-YC的开发
周期性过程在生物体内广泛存在,如昼夜节律与细胞周期,识别具有周期性表达的基因是研究这些过程及发现受其调控的其他生物学过程的关键。
周期性识别算法有COSOPOT,Fisher's G-test,JTK_CYCLE,ARSER,LSPR等 http://www.docin.com/p-1811805902.html
文献:JTK_CYCLE, designed to efficiently identify and characterize cycling variables in large data sets. Compared with COSOPT and the Fisher’s G test, two commonly used methods for detecting cycling transcripts
针对有生物学重复与无生物学重复,存在缺省值都可分析
/TJPROJ6/RNA_SH/BJPROJ/RNA/shouhou/script_dir/others/jtk/jtk.py
#!usr/bin/env python #coding=utf-8 import os import argparse parser = argparse.ArgumentParser(description="mRNA,protein ass ") parser.add_argument('--project',help='the name of project',required=True) parser.add_argument('--anno',help='the file of anno',required=True) parser.add_argument('--fpkm',help='the file of fpkm',required=True) parser.add_argument('--tt',help='the number of time',default=None) parser.add_argument('--rp',help='the number of repeat',default=None) parser.add_argument('--inter',help='the number of time_inter',default=None) parser.add_argument('--type',help='yes or no',default="yes") argv = vars(parser.parse_args()) project=argv['project'].strip() anno=argv['anno'].strip() fpkm=argv['fpkm'].strip() type=argv["type"].strip() if argv['tt']: tt=argv["tt"].strip() if argv['rp']: rp=argv["rp"].strip() if argv['inter']: inter=argv["inter"].strip() f=open("jtk.sh","w") if type=="yes": f.write("Rscript /BJPROJ/RNA/shouhou/script_dir/others/jtk/Run_JTK_CYCLE_1.R --project %s --anno %s --fpkm %s --tt %s --rp %s --inter %s " % (project,anno,fpkm,tt,rp,inter)) else: f.write("Rscript /BJPROJ/RNA/shouhou/script_dir/others/jtk/Run_JTK_CYCLE_2.R --project %s --anno %s --fpkm %s " % (project,anno,fpkm)) f.close() os.system("qsub -V -cwd -l vf=1g,p=1 jtk.sh")
其中Run_JTK_CYCLE_1.R脚本如下:
source("/BJPROJ/RNA/shouhou/script_dir/others/jtk/JTK_CYCLEv3.1.R") suppressMessages({ library(argparser)}) argv <- arg_parser('') argv <- add_argument(argv,"--project", help="the name of project ") argv <- add_argument(argv,"--anno", help="the file of anno") argv <- add_argument(argv,"--fpkm", help="the file of fpkm") argv <- add_argument(argv,"--tt", help="the number of time") argv <- add_argument(argv,"--rp", help="the number of repeat") argv <- add_argument(argv,"--inter", help="the number of time_inter") argv <- parse_args(argv) project <- argv$project anno <-argv$anno fpkm<-argv$fpkm tt<-argv$tt tt<-as.numeric(tt) rp<-argv$rp rp<-as.numeric(rp) inter<-argv$inter inter<-as.numeric(inter) project <- project options(stringsAsFactors=FALSE) annot <- read.delim(anno) data <- read.delim(fpkm) rownames(data) <- data[,1] data <- data[,-1] jtkdist(tt,rp) # 13 total time points, 2 replicates per time point time1=ceiling(20/inter) time2=ceiling(28/inter) periods <- time1:time2 # looking for rhythms between 20-28 hours (i.e. between 5 and 7 time points per cycle). jtk.init(periods,inter) # 4 is the number of hours between time points cat("JTK analysis started on",date(),"\n") flush.console() st <- system.time({ res <- apply(data,1,function(z) { jtkx(z) c(JTK.ADJP,JTK.PERIOD,JTK.LAG,JTK.AMP) }) res <- as.data.frame(t(res)) bhq <- p.adjust(unlist(res[,1]),"BH") res <- cbind(bhq,res) colnames(res) <- c("BH.Q","ADJ.P","PER","LAG","AMP") results <- cbind(annot,res,data) results <- results[order(res$ADJ.P,-res$AMP),] }) print(st) save(results,file=paste("JTK",project,"rda",sep=".")) write.table(results,file=paste("JTK",project,"txt",sep="."),row.names=F,col.names=T,quote=F,sep="\t")
其中Run_JTK_CYCLE_2.R的脚本如下:
source("/BJPROJ/RNA/shouhou/script_dir/others/jtk/JTK_CYCLEv3.1.R") suppressMessages({ library(argparser)}) argv <- arg_parser('') argv <- add_argument(argv,"--project", help="the name of project ") argv <- add_argument(argv,"--anno", help="the file of anno") argv <- add_argument(argv,"--fpkm", help="the file of fpkm") argv <- parse_args(argv) project <- argv$project anno <-argv$anno fpkm<-argv$fpkm project <- project options(stringsAsFactors=FALSE) annot <- read.delim(anno) data <- read.delim(fpkm) rownames(data) <- data[,1] data <- data[,-1] jtkdist(ncol(data)) periods <- 4:40 jtk.init(periods,1) cat("JTK analysis started on",date(),"\n") flush.console() st <- system.time({ res <- apply(data,1,function(z) { jtkx(z) c(JTK.ADJP,JTK.PERIOD,JTK.LAG,JTK.AMP) }) res <- as.data.frame(t(res)) bhq <- p.adjust(unlist(res[,1]),"BH") res <- cbind(bhq,res) colnames(res) <- c("BH.Q","ADJ.P","PER","LAG","AMP") results <- cbind(annot,res,data) results <- results[order(res$ADJ.P,-res$AMP),] }) print(st) save(results,file=paste("JTK",project,"rda",sep=".")) write.table(results,file=paste("JTK",project,"txt",sep="."),row.names=F,col.names=T,quote=F,sep="\t")
python /BJPROJ/RNA/shouhou/script_dir/others/jtk/jtk.py \ --project 项目名称 \ --anno gene_anno.txt 基因注释文件 若没有直接用基因list也可以\ --fpkm gene_fpkm.txt 基因表达量文件 \ --tt 6 时间点,采样共多少时间点 如 2,6,10,14,18,22 共6个时间点 (有生物学重复必填) \ --rp 3 组内生物学重复个数(有生物学重复必填)\ --inter 4 不同时间点间隔(有生物学重复必填)\ --type yes or no 是否为有生物学重复 \ 示例:/BJPROJ/RNA/shouhou/script_dir/others/jtk/run.sh