II. Analysis techniques at each level of bioinformatics: differentially expressed genes (zsci26, turku)

Discovery of differentially expressed genes by statistical methods
Esa Uusipaikka
Department of Statistics, University of Turku
Microarray Bioinformatics Seminar, DataCity Turku, May 6-7, 2003

Molecular portraits and the family tree of cancer

Overview
1. Statistical issues
2. Design of experiment
3. Low-level analysis
4. High-level analysis
   - fold-change with fixed cut-off
   - model for fold-change
   - standard statistical tests
   - permutation tests
   - multiple testing
   - False Discovery Rate (FDR)
   - time-series analysis

Statistical issues
1. Design of experiment
2. Low-level analysis
   - data cleaning
3. High-level analysis
   - select differentially expressed (DE) genes
   - find groups of genes whose expression profiles can reliably classify the different RNA sources into meaningful groups

Experimental design
Kerr, M. K., and Churchill, G. A. (2001). Experimental design for gene expression microarrays. Biostatistics 2, 183-201.
Glonek, G. F. V., and Solomon, P. J. (2002). Factorial designs for microarray experiments. Technical Report, Department of Applied Mathematics, University of Adelaide, Australia.
These papers apply ideas from optimal experimental design to suggest efficient designs for some of the common microarray experiments.

Pan, W., Lin, J. and Le, C. (2002). How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology 3(5): research0022.1-0022.10.
This paper considers sample size.

Speed, T. P., and Yang, Y. H. (2002). Direct versus indirect designs for cDNA microarray experiments. Technical Report 616, Department of Statistics, University of California, Berkeley.
This report examines the efficiency of using a reference sample as against direct comparison.

It is not possible to give universal recommendations appropriate for all situations, but the general principles of statistical experimental design apply to microarray experiments.
Churchill, G.A. Fundamentals of experimental design for cDNA microarrays. Nature Genet. 32, 490-495 (2002).
Yang, Y.H. & Speed, T. Design issues for cDNA microarray experiments. Nature Rev. Genet. 3, 579-588 (2002).

Image analysis and data cleaning
Yang, Y. H., Buckley, M. J., Dudoit, S., and Speed, T. P. (2002). Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics 11, 108-.
The authors compare various segmentation and background estimation methods.

Kerr, M. K., Martin, M., and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology 7, 819-837.
Wolfinger, R. D., Gibson, G., Wolfinger, E. D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C., and Paules, R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 8, 625-637.
Both groups have proposed the use of ANOVA models for normalization.

Quackenbush, J. Microarray data normalization and transformation. Nature Genet. 32, 496-501 (2002).

Selecting differentially expressed genes
1. Simply generating the data is not enough; one must be able to extract from it meaningful information about the system being studied.
2. There is no one-size-fits-all solution for the analysis and interpretation of genome-wide expression data.
3. Statistical methods for interpreting the data have proliferated.
4. There are now so many options available that choosing among them is challenging.
5. An understanding of both the biology and the computational methods is essential for tackling the associated data-mining tasks.

One of the core goals of microarray data analysis is to identify which of the genes show good evidence of being DE. This goal has two parts.
1. The first is to select a statistic that ranks the genes in order of evidence for differential expression, from strongest to weakest evidence.
2. The second is to choose a critical value for the ranking statistic above which any value is considered significant.

k-fold change
1. Differential expression is measured by the ratio of expression levels between two samples.
2. Genes with ratios above a fixed cut-off k, that is, those whose expression underwent a k-fold change, were said to be differentially expressed.
3. This is not a statistical test, and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed.
4. Replication is essential in experimental design because it allows an estimate of variability.
5. The ability to assess such variability allows identification of biologically reproducible changes in gene expression levels.

Model for fold-change
1. A model that accounts for random, array-specific and probe-specific noise.
2. Evaluation of whether the 90% confidence interval for each gene's fold-change excludes 1.0.
3. This method incorporates available information about variability in the gene-expression measurements.
4. It can suffer when the data set is either too small or too heterogeneous.
5. Data-derived estimates of variation.

Li, C. & Hung Wong, W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2, research0032 (2001).
Roberts, C.J. et al. Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science 287, 873-880 (2000).
Ideker, T., Thorsson, V., Siegel, A.F. & Hood, L.E. Testing for differentially expressed genes by maximum-likelihood analysis of microarray data. J. Comput. Biol. 7, 805-817 (2000).

Standard statistical tests
1. More typically, researchers now rely on variants of common statistical tests.
2. These generally involve two parts: calculating a test statistic and determining the significance of the observed statistic.
3. A standard statistical test for detecting significant change between repeated measurements of a variable in two groups is the t-test.
4. This can be generalized to multiple groups via the ANOVA F statistic.

Variations on the t-test statistic (often called t-like tests) for microarray analysis are abundant.
Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, 5116-5121 (2001).
Golub, T.R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537 (1999).
Model, F., Adorjan, P., Olek, A. & Piepenbrock, C. Feature selection for DNA methylation based cancer classification. Bioinformatics 17 Suppl 1, S157-S164 (2001).

The use of non-parametric rank-based statistics is also common, via both traditional statistical methods and ad hoc ones designed specifically for microarray data.
Zhan, F. et al. Global gene expression profiling of multiple myeloma, monoclonal gammopathy of undetermined significance, and normal bone marrow plasma cells. Blood 99, 1745-1757 (2002).
Ben-Dor, A., Friedman, N. & Yakhini, Z. Scoring genes for relevance. Technical Report 2000-38 (Institute of Computer Science, Hebrew University, Jerusalem, 2000).
Park, P.J., Pagano, M. & Bonetti, M. A nonparametric scoring algorithm for identifying informative genes from microarray data. Pac. Symp. Biocomput. 52-63 (2001).

1. For most practical cases, computing a standard t or F statistic is appropriate, although referring to the t or F distributions to determine significance is often not.
2. The main hazard in using such methods occurs when there are too few replicates to obtain an accurate estimate of experimental variances. In such cases, modeling methods that use pooled variance estimates may be helpful.

Xiangqin Cui and Gary A Churchill (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biology 4(4): 210.1-210.10.

1. Regardless of the test statistic used, one must determine its significance.
2. Standard interpretations of t-like tests assume that the data are sampled from normal populations with equal variances.
3. Expression data may fail to satisfy either or both of these constraints.
4. Although log transformation can improve normality and help equalize variances, ultimately the best estimates of the data's distribution come from the data themselves.

Quackenbush, J. Microarray data normalization and transformation. Nature Genet. 32, 496-501 (2002).

Permutation tests
Permutation tests, generally carried out by repeatedly scrambling the samples' class labels and computing t statistics for all genes in the scrambled data, best capture the unknown structure of the data.
Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, 5116-5121 (2001).
Golub, T.R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537 (1999).
Dudoit, S., Yang, Y.-H., Callow, M.J. & Speed, T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578 (Department of Statistics, University of California at Berkeley, Berkeley, CA, 2000).

Such permutation tests are ideal when the number of arrays is sufficient to offer the desired degree of confidence.

Multiple testing
1. One advantage of permutation methods is that they allow more reliable correction for multiple testing.
2. The issue of multiple tests is crucial, as microarrays typically monitor the expression levels of thousands of genes.
3. Standard Bonferroni correction (that is, multiplying the uncorrected p-value by the number of genes tested) is overly restrictive.

1. Step-down methods designed to minimize this overcorrection are little better for thousands of genes.
2. Both methods are overly strict because they are based on the assumption that each gene represents an independent test.
3. In fact, the correlation structure between gene-expression patterns is significant and complex.

Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65-70 (1979).

To capture this structure, Dudoit et al. propose a permutation-based approximation of Westfall and Young's method.
Dudoit, S., Yang, Y.-H., Callow, M.J. & Speed, T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578 (Department of Statistics, University of California at Berkeley, Berkeley, CA, 2000).
C code is available online.
A package of R functions for other techniques evaluated in Dudoit et al. is available at zarray/Software/smacode.html

Figure: The advantage of permutation-based adjustment for multiple testing. The x-axis shows unadjusted p-values derived from independent t tests for each gene to detect differential expression between sensitive and resistant cell lines. The y-axis shows the adjusted p-values using Bonferroni correction (black circles) and Westfall and Young's permutation-based method (blue squares). At the adjusted cutoff of 0.05, the permutation method finds 11 significantly changing genes (instead of 7 without permutation).

False discovery rate
1. All these approaches focus on determining the family-wise error rate, the overall chance that at least one gene is incorrectly identified as differentially expressed.
2. For microarray studies focusing on finding sets of predictive genes, it may instead be acceptable to bound the false discovery rate (FDR), the probability that a given gene identified as differentially expressed is a false positive.
3. A simple method for bounding the FDR is proposed by Benjamini and Hochberg.
4. While this, too, assumes that each gene is an independent test, a permutation-based approximation of this method is implemented in the SAM (Significance Analysis of Microarrays) program by Tusher et al.

Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 289-300 (1995).
Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, 5116-5121 (2001).
Efron, B., Storey, J. & Tibshirani, R. Microarrays, Empirical Bayes Methods, and False Discovery Rates. (2001).
Storey, J., Taylor, J. & Siegmund, D. Strong Control, Conservative Point Estimation, and Simultaneous Conservative Consistency of False Discovery Rates: A Unified Approach. (2003).

Comparison of SAM to conventional methods for analyzing microarrays
Figure: Falsely significant genes plotted against number of genes called significant. Of the 57 genes most highly ranked by the fold change method, 5 were included among the 46 genes most highly ranked by SAM. Of the 38 genes most highly ranked by the pairwise fold change method, 11 were included among the 46 genes most highly ranked by SAM. These results were consistent with the FDR of SAM compared to the FDRs of the fold change and pairwise fold change methods.

False discovery rate
5. A more permissive permutation-based approach to bounding the FDR appears in the Whitehead's GeneCluster software package.
Golub, T.R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537 (1999).

Although in some data sets even the lowest FDR may be prohibitively high, this can be a valuable approach to finding some valid leads when more stringent analyses find none.

Time series analysis
1. The canonical time-series data in the field come from two experiments following the yeast cell cycle.
2. Spellman's analysis incorporates a Fourier transform to test the periodicity of individual genes in three separate data sets, before combining these into a single significance score used to rank the genes.

Cho, R.J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2, 65-73 (1998).
Spellman, P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Bi
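
Worked sketch: k-fold change. The fixed cut-off rule described under "k-fold change" can be illustrated in a few lines of Python. This is a minimal sketch, not code from the seminar: the function name `fold_change_calls`, the gene names and the intensities are hypothetical, and treating a k-fold decrease the same way as a k-fold increase is an assumption added here.

```python
import math


def fold_change_calls(expr_a, expr_b, k=2.0):
    """Flag genes whose expression ratio between two samples is at least k-fold.

    Assumption: both up- and down-regulation count, so the test is applied
    to the absolute log2 ratio.
    """
    threshold = math.log2(k)
    calls = {}
    for gene, level_a in expr_a.items():
        log_ratio = math.log2(expr_b[gene] / level_a)
        calls[gene] = abs(log_ratio) >= threshold
    return calls


# Hypothetical intensities for three genes in two samples.
sample_a = {"geneA": 100.0, "geneB": 200.0, "geneC": 400.0}
sample_b = {"geneA": 250.0, "geneB": 210.0, "geneC": 100.0}
calls = fold_change_calls(sample_a, sample_b, k=2.0)
```

As the slides point out, the rule attaches no confidence level to a call: a gene measured once with a ratio of 2.5 is flagged in exactly the same way as one that is reproducibly changed across replicates.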
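
Worked sketch: permutation test. The label-scrambling procedure described under "Permutation tests" can be sketched for a single gene as follows. The slides describe computing t statistics for all genes in each scrambled data set; this simplified version handles one gene only, uses the unequal-variance (Welch) form of the t statistic as an assumption, and all names and data values are hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_statistic(x, y):
    """Two-sample t statistic with unequal variances (Welch form)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    diff = mean(x) - mean(y)
    if se == 0.0:
        # Degenerate case: no within-group variation at all.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value obtained by scrambling the class labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign the class labels at random
        if abs(t_statistic(pooled[:len(x)], pooled[len(x):])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical log-expression values for one gene in two groups of arrays.
sensitive = [2.1, 2.4, 1.9, 2.6, 2.2]
resistant = [3.9, 4.3, 4.1, 3.7, 4.4]
p_value = permutation_p_value(sensitive, resistant)
```

With five arrays per group there are only 252 distinct label assignments, which limits how small the p-value can become. This is one concrete reading of the remark that permutation tests need a sufficient number of arrays.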
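
Worked sketch: Bonferroni and step-down adjustment. The two family-wise corrections mentioned under "Multiple testing" can be compared on a small list of p-values. This is an illustrative sketch with hypothetical p-values; the step-down function follows Holm's sequentially rejective procedure cited above, and neither function implements the permutation-based Westfall and Young approach.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical unadjusted p-values for four genes.
p_values = [0.001, 0.02, 0.04, 0.3]
bonf = bonferroni(p_values)
holm_adj = holm(p_values)
```

Holm's adjusted values are never larger than the Bonferroni ones, but for thousands of genes the multipliers m, m - 1, m - 2, ... barely differ at the top of the list, which matches the observation that step-down methods are little better at that scale.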
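
Worked sketch: Benjamini-Hochberg adjustment. The method cited under "False discovery rate" can be written as a step-up adjustment of sorted p-values. The sketch below is a plain implementation under the independence assumption noted in the slides; it is not the permutation-based variant used in SAM, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Walk from the largest p-value down to the smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for step, i in enumerate(order):
        rank = m - step  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for five genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
bh = benjamini_hochberg(p_values)
called_at_10_percent = [i for i, q in enumerate(bh) if q <= 0.10]
```

On these values, bounding the FDR at 0.10 calls the first four genes, whereas a Bonferroni cut-off of 0.05 on the same list would keep only the first two (0.001 x 5 and 0.008 x 5).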
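
Worked sketch: Fourier periodicity score. The idea of using a Fourier transform to test a gene's periodicity, mentioned under "Time series analysis", can be illustrated with a single Fourier component. This is a heavily simplified sketch and not Spellman's actual procedure, which combined three data sets into one score; the function name, sampling times and period are hypothetical.

```python
import math


def periodicity_score(times, values, period):
    """Magnitude of the Fourier component of a centred profile at one period."""
    omega = 2.0 * math.pi / period
    centre = sum(values) / len(values)
    cos_part = sum((v - centre) * math.cos(omega * t) for t, v in zip(times, values))
    sin_part = sum((v - centre) * math.sin(omega * t) for t, v in zip(times, values))
    return math.hypot(cos_part, sin_part) / len(values)


# Hypothetical sampling every 10 minutes over two 60-minute cycles.
times = [10.0 * n for n in range(12)]
cycling = [math.sin(2.0 * math.pi * t / 60.0) for t in times]
alternating = [1.0 if n % 2 == 0 else -1.0 for n in range(12)]

cycling_score = periodicity_score(times, cycling, period=60.0)
alternating_score = periodicity_score(times, alternating, period=60.0)
```

A profile that oscillates at the assumed period scores about 0.5 here, while a profile that merely alternates between time points scores close to zero, so ranking genes by this score favours those that follow the cycle.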
