Bioinformatics Courseware, Original English Version (125)

Pathway Bioinformatics: Database, Software, and Discovery
Y. Tom Tang, Ph.D.
Bioinformatics R & D
Hyseq Pharmaceuticals, Inc.
Sunnyvale, CA, USA

Outline of the Talk
  • Introduction to Pathway Bioinformatics
  • Overview of Pathmetrics Technology and Products
  • Data Representation and SLIPR Format
  • Pathway Comparison and Pathway Database Searches
  • Pathway Predictions and Beyond

A Broad Definition of Bioinformatics
Informatics
  • Its carrier is a set of digital codes and a language.
  • In its manifestation in the space-time continuum, it has utility (e.g., decreasing the entropy of an open system).
Bioinformatics
  • The essence of life
    is information (i.e., from digital code to the emergent properties of biosystems).
  • Bioinformatics is the study of the information content of life.

Pathways
A pathway can be defined as a modular unit of interacting molecules that fulfills a cellular function.
It is usually represented by a 2-D diagram with characteristic
symbols linking the protein and non-protein entities.
  • A circle indicates a protein or a non-protein biomolecule.
  • A symbol in between indicates the nature of the molecule-molecule interaction.

An Example of a Pathway: EPO (erythropoietin) pathways

Pathway Database: Increasing Level of Complexity
The genome
  • 4 bases
  • 3 billion bp total
  • 3 billion bp/cell, identical
The proteome
  • 20 amino acids
  • 60K genes, 200K proteins
  • 10K proteins/cell; different cells and conditions have different expression
The pathome
  • 200K reactions
  • 20K pathways
  • 1K pathways/cell; different cells and conditions have different expression

Evolutionary Theory of Pathways: A New Field of Theoretical Studies
  • The most important assumption for sequence informatics is evolution.
  • The evolution principle also applies to pathway informatics:
      • From simple to complex
      • Duplication, diversification, and modular re-use
  • It will provide a new view of fundamental questions, toward a unified
    informatics theory of life:
      • What is life?
      • How does new function arise?
      • How does evolution work? (The pathway is the bridge between digital signal and emergent properties.)
      • When does life begin (what is the initial set of pathways)?

Data Representation in KEGG
  • Entity: a molecule or a gene
  • Binary relation: a relation between two entities
  • Network: a graph formed from a set of related entities
  • Pathway: metabolic pathway or regulatory pathway

Drosophila melanogaster Genes According to the KEGG Metabolic and Regulatory Pathways
Pathway Search by EC | Cpd | Gene | Seq
1st Level | 2nd Level | 3rd Level | Text Search
  Carbohydrate Metabolism
  Energy Metabolism
      2.1 Oxidative phosphorylation                      PATH:dme00190
      2.2 ATP Synthesis                                  PATH:dme00193
      2.4 Carbon fixation                                PATH:dme00710
      2.5 Reductive carboxylate cycle (CO2 fixation)     PATH:dme00720
      2.6 Methane metabolism                             PATH:dme00680
      2.7 Nitrogen metabolism                            PATH:dme00910
      2.8 Sulfur metabolism                              PATH:dme00920
  Lipid Metabolism
  Nucleotide Metabolism
  Amino Acid Metabolism
  Metabolism of Other Amino Acids
  Metabolism of Complex Carbohydrates
  Metabolism of Complex Lipids
  Metabolism of Cofactors and Vitamins

Introduction to GenMAPP
Gene MicroArray Pathway Profiler, by Bruce Conklin at the Gladstone
Institute, UCSF.
GenMAPP is a free computer application designed to visualize gene expression data on maps representing biological pathways and groupings of genes. The main features underlying GenMAPP version 1.0 are:
  • Draw pathways with easy-to-use graphics tools
  • Multiple-species gene databases
  • Color genes on MAPP files based on user-imported gene expression data

Two Main Challenges in the Post-genomic Age
Data integration: integrate diverse biological information
  • Scientific literature, the existing body of knowledge about cellular systems
  • Genomic sequences
  • Protein sequences, motifs, and structures
  • Expression data from microarray, dbEST, and RT-PCR
  • Protein-protein interaction data from large-scale screening
Functional discovery: assign functions to the 60K+ human genes
  • Only 5% of known genes have an assigned function
  • We have no clue about the function of the majority of discovered genes
  • Without understanding function, no drug discovery can be done, either in small molecules or in biopharmaceuticals
  • Will be the focus of the next 20 years of life-science research

Pathmetrics Provides Solutions For
Data integration
  • Establish a standard for pathway curation and pathway database design
  • Develop pathway databases using existing knowledge in the scientific literature
  • Utilize dbEST, microarray, and other types of expression data
  • Utilize genomic data such as promoter-region similarities
Functional studies
  • Assign proteins with unknown function to functional pathways
  • Determine in which cells those pathways work, and at what level
  • Be much more efficient than large-scale random screening
  • Discover the majority of pathways and protein functions
  • Deliver many tissue-specific pathways for the pharmaceutical industry

Basic Concepts
  • Node: a protein, peptide, or non-protein biomolecule.
  • Mode: the nature of the interaction between two nodes. Qualitative data.
  • Pathway: a linked list of interconnected nodes and modes. Represented in either 2-D or 1-D format.
  • Pathway Network: a network of cellular function and regulation involving interconnected pathways.

Curating Pathway Databases
  • SLIPR standard for pathway curation
  • Relational database design including diverse information about genes, proteins, expression, and tissues
  • Input in graphical format, and graphical output display

SLIPR Standard for Pathway Curation
SLIPR stands for Semi-LInear Pathway Representation. Like FastA, it is pronounced as SlipR or Slipir.
For linear comparison (homology) and display of the alignments, 2-D diagrams of pathways are converted to a 1-D format. We call the 2-D diagrams graph pathways, and the corresponding 1-D representations semi-linear pathways. One graph pathway may be transformed into multiple semi-linear pathways, but we prefer a one-to-one mapping between the 2-D graph and the SLIPR form.
The generation of 2-D graph pathways and the corresponding 1-D SLIPR form from scientific literature is called pathway curation. Pathways are curated by trained scientists with expertise in the relevant pathways. In addition to generating the 2-D and 1-D formats, they also have to generate
a pathway description file for each pathway they curate (pathway annotation), and a protein file that contains all the proteins in the pathway.

Mode Symbol Specifications
A mode is usually specified by two non-alphanumeric ASCII symbols.
  ->    Direct interaction with direction. Used when there is a known direct interaction between two nodes (reverse orientation: <-).
  -|    Direct inhibition with direction. Used when there is a direct inhibition from one node to the next. |- for reverse orientation.
  -     Association, indirect action. Used when there is uncertain interaction, indirect interaction, or simply co-expression.
  = =   Parallel members. The members can all serve the same function. Usually variants of the same gene, or members of the same family.
  <>    Clear interaction, but no direction of information flow (notice: no space within, and no letters either). This could happen when more than two proteins are involved
        to form a large complex.
  *     Bifurcating members. Usually appears only at the beginning or end of a pathway; it can occur in the middle of a pathway only when the pathway bifurcates and immediately folds back, e.g. A-B-*C-*E-F. If a pathway starts to bifurcate in the middle or at the end, one can use a *path_name to record this event, e.g. A-B-(xx)-C-D-*New_path_1-E-*New_path_2.
  ( )   Symbol for non-protein nodes. If the small molecule is uncertain, it can be omitted. If the small molecule is known, its name should be inserted in between, e.g. -(Ca), or (cAMP). All small molecules should be included inside a set of parentheses, e.g. A1-(Ca)-A2-(Cytidine_Diphosphate_Choline).
  < >   Symbol for another pathway. When linked to other pathways, the path_ids should be put inside the brackets, e.g. A1-<Ca_triggered_path1>, A1-<Gs_pathway>.
When an ID is given without ( ) or < >, it is a protein node.

SLIPR Format for Pathway Entries
The format is based on a common sequence format, FASTA. Nodes are linked by modes with no space between them. Bifurcating branches are specified later within the same entry with PATHsub_ID and content. E.g.:

    >PW_ID PW_name PW_annotation Source Curator Date Species
    Pr1-Pr2-(Ca)-Pr3=Pr4-*Pr5-*PATHsub_XX-Pr5-(Mg)ZZpr
    >PATHsub_XX
    AA1-AA2(SM1)-AA3AA4-AA5

  • PW_ID: ID for the pathway
  • PW_name: a name
  • PW_annotation: a brief description of the pathway
  • Source: where this pathway is taken from: article, KEGG, GenMAPP, etc.
  • Curator: the person who inputs
    the pathway
  • Date: date of curation

Pathway Database in Simplest Format
  • A SLIPR format pathway file
  • A FASTA format protein sequence file
  • A FASTA format non-protein molecule file
  • Flat-file tools to do basic database manipulations:
      • Index: generate an index file
      • Retrieval: component access on a log N scale
      • Insertion: cat to the end, then build a new index
      • Deletion: delete, then build a new index
      • Updating: delete, cat to the end, then build a new index

Relational Database Implementation: an example with only protein nodes
  Gene_Table:         gene_id, chromosome, start, stop
  Protein_Table:      seq_id, cellular location, seq_txt, gene_id
  Interaction_Table:  protein A, protein B, pathway_id, literature_id, info flow direction
  Pathway_Table:      pathway_id, pathway_name, description, species, curator, entry_data
  Protein_Motifs:     motif_id, seq_id
  Motif_Def_Table:    motif_id, description, regular expression, HMM_matrix
  Literature_Table:   literature_id, author, journal, pub_date, PDF_file
  Join keys labeled in the diagram: gene_id, protein = seq_id, pathway_id, seq_id, motif_id, literature_id

Pathway Search Engines
  • Compare two pathways in the SLIPR standard using a dynamic programming algorithm
  • Search a query pathway against a pathway database: this extends BLAST-type searches to the pathway level
  • Find orthologous, paralogous, and homologous pathways with alignments
  • Like BLAST, there are different types of searches:
      • Node-only search
      • Mode-only search
      • Node and mode search
  • In node-only searches, one can search on:
      • protein nodes only
      • non-protein nodes only
      • protein nodes and non-protein nodes

Alignment Scoring Matrices
Comparing protein nodes
  • identity mapping and orthologs (current status)
  • percent_identity
  • percent_positive (PAM/BLOSUM)
  • structural similarity
Comparing non-protein nodes
  • identity mapping
  • structural similarity
  • evolutionary linkage and functional similarity
Comparing modes
  • identity mapping
  • SCIM matrix (similarity coefficient of interacting modes): a matrix of positive and negative values between -1 and 1.

Protein Comparison vs. Pathway Comparison
              # of nodes   Node comparison                          Mode
  Protein:    20           BLOSUM/PAM matrices                      Peptide bond
  Pathway:    200K         Pct_identity, Pct_positive,              SCIM matrix,
                           structural similarity, identity_mapping  peptide bond (fused proteins)

Pathway Level Search Engine
  • Query: a pathway (associated query.pw and query.aa files)
  • DB: pathways (associated DB.pw and DB.aa files)
  • Search types:
      • Node only (protein node only, non-protein node only, or any node)
      • Mode only
      • Node and mode

PMsearch Documentation
PMsearch is a pathway comparison program. After a user specifies a query pathway and a search database, PMsearch compares the query pathway with each entry in the pathway database.
The query pathway is specified by two input files: a pathway file, query.pw, and a protein file, query.aa. The query.pw file contains the pathway information, in FASTA format. The query.aa file contains the proteins involved, in FASTA format.
The pathway database is also composed of two files, db.pw and db.aa, except that the database files contain more than one entry.
Once a job is submitted, the search engine (pm_search) performs the job and reports back all the homologous pathways that score above a user-specified threshold. The user can also
specify other parameters, which are given in the user manual.

Specifics for Pathway Alignment
  • It is a higher-level alignment, containing protein or structural alignment within.
  • Each element in the pathway can represent a node (protein or non-protein) or a mode.
  • The distance between nodes and modes, and between protein nodes and non-protein nodes, is infinite: you cannot align different types of elements.
  • In the simplest case, consider a pathway with only protein nodes. Given an alignment z, the score is given by

    $$ S(z) = \sum_{(x,y) \in z} s(x,y) \;-\; \alpha\, n_{gap} \;-\; \beta\, l_{gap} $$

    where $s(x,y)$ is the similarity of protein x and protein y, $n_{gap}$ is the number of gaps in z, $l_{gap}$ is the total length of the gaps, $\alpha$ is a parameter called the "gap opening penalty", and $\beta$ is a second parameter called the "gap extension penalty".
  • PMsearch uses a dynamic programming algorithm to find the alignment with the highest score.

How Alignments Are Determined And Scored
For the alignment to
get to (m,n), it must go through one of:
  • (m-1, n-1) (a_m and b_n are a match),
  • (m-1, n) (meaning (m,n) is in a gap in sequence 2),
  • (m, n-1) (meaning (m,n) is in a gap in sequence 1).
Recursion (α is the gap opening penalty, β the gap extension penalty):

    For i = 1 to m
      For j = 1 to n
        H(i,j) = max { H(i-1,j-1) + s(i,j), Hh(i,j), Hv(i,j) },
        where
          Hh(i,j) = max { Hh(i,j-1) - β, H(i,j-1) - α }
          Hv(i,j) = max { Hv(i-1,j) - β, H(i-1,j) - α }
      End
    End

Novel Pathway Prediction Engines
Predicting orthologous pathways across different organisms
  • A known pathway from some organism as the query
  • A protein database or genomic database for the organism of interest to search against
  • Output is the ortholog pathway in the organism of interest
Predicting homologous pathways for an organism of interest
  • A known pathway from some organism as the query
  • A protein database or genomic database for the organism
  • A protein-protein correlation matrix for protein expression
  • Output is a collection of homologous pathways

Open Questions for Pathway Comparison
Like extending points in R^n to function space, we need to generalize the theory of protein alignment to a higher level, where the component itself may have an alignment.
  • How to calculate a p-value in this pathway space?
  • How to design intelligent scores?
  • How to generate a meaningful comparison matrix for non-protein nodes that goes beyond identity mapping?
  • How to integrate multiple component types into the alignment theory?

HOMOLOGS, ORTHOLOGS, AND PARALOGS
  • Homologs: proteins with good alignment and similar function
  • Orthologs: proteins performing the same function in different species
  • Paralogs: homologous proteins in the same species
How to tell the unique ortholog
  • The ortholog should have a much higher similarity to the query protein than any other protein in its species, and usually higher than most of the paralogs.

PMortholog Documentation
PMortholog is a simple ortholog prediction
program for pathways.
  • Inputs:
      (1) a pathway (query.pw and query.aa files)
      (2) a protein database, e.g., SwissProt
  • Reports all apparent orthologous pathways
  • Most accurate for closely related organisms (e.g., human and mouse)
  • False matches can appear when organisms are too distant or, possibly, because of other
    paralogous pathways in the organism.

PMortholog sample output: hits

    PM_ORTHOLOG 0.1, Pathmetrics, Inc. Oct-20-2001 Build linux-x86
    Reference: US Patent Pending. Methods for Establishing Pathway Database
    and Perform Pathway Searches. Y. Yang, C. Piercy. February 20, 2001.
    Application number 60/269,711

    Query pathway= hsa00625 (5 proteins)
    Database: /u1/pub_db/sp_db/allspecies.aa 374855 proteins.

    Summary of ortholog pathways:
    Hit_nu  species                        score
    --------------------------------------------
     1:     Homo sapiens                  100.00
     2:     Mus musculus                   65.20
     3:     Rattus norvegicus              65.20
     4:     Caenorhabditis elegans         44.20
     5:     Drosophila melanogaster        37.80
     6:     Arabidopsis thaliana           37.00
     7:                                    31.80
     8:     Saccharomyces cerevisiae       26.60
     9:     Sinorhizobium meliloti         25.80
    10:     Mesorhizobium loti             24.80
    11:     Agrobacterium tumefaciens      24.80
    12:     Escherichia coli               22.60
    13:     Pseudomonas aeruginosa         22.40
    14:     Schizosaccharomyces pombe      18.80
    15:     Bacillus subtilis              15.00
    16:     Oryza sativa                   11.0

PMortholog sample output: alignments

    Hit 1: Ortholog pathway for: Homo sapiens. With score: 100.00
    Query: hsa:51144    hsa:2052     hsa:2053     hsa:51004    hsa:9420
    %_id:  |1.00|       |1.00|       |1.00|       |1.00|       |1.00|
    Sbjct: gi15082281   gi13097729   gi181395     gi4680659    gi13094303

    Hit 2: Ortholog pathway for: Mus musculus. With score: 65.20
    Query: hsa:51144    hsa:2052     hsa:2053     hsa:51004    hsa:9420
    %_id:  |0.85|       |0.88|       |0.81|       0            |0.72|
    Sbjct: gi3142702    gi12857870   gi12832382   -            gi12850151

    Hit 3: Ortholog pathway for: Rattus norvegicus. With score: 65.20
    Query: hsa:51144    hsa:2052     hsa:2053     hsa:51004    hsa:9420
    %_id:  |0.81|       |0.88|       |0.84|       0            |0.73|
    Sbjct: gi4098957    gi207689     gi55930      -            gi1226240

    Hit 4: Ortholog pathway for: Caenorhabditis elegans. With score: 44.20
    Query: hsa:51144    hsa:2052     hsa:2053     hsa:51004    hsa:9420
    %_id:  |0.48|       |0.56|       |0.42|       |0.44|       |0.31|
    Sbjct: gi726418     gi1465805    gi3876864    gi2088820    gi13775482

Homolog Pathway Prediction Engines
  • They are the crown jewels of the Pathmetrics software tools
  • Can predict many novel interactions
  • Use diverse input data, including sequence data, expression data, and known interaction data
  • Employ complex numerical algorithms such as dynamic programming and clustering

Example of Novel Pathway Prediction: predicting novel pathways homologous to the query pathway
  Query pathway:  Node1  (Mode1=1)  Node2  (Mode2=1)  Node3  (Mode3=1)  Node4
  Node1 hits:  candidate1_1, candidate1_2, ..., candidate1_l
  Node2 hits:  candidate2_1, candidate2_2, ..., candidate2_m
  Node3 hits:  candidate3_1, candidate3_2, ..., candidate3_n
  Node4 hits:  candidate4_1, candidate4_2, ..., candidate4_o

Pathway Searches and Pathway Predictions
  Query     Database     Mode                      Output
  Pathway   Pathway_db   SCIM*                     Homologous pathways
            Protein_db   None                      Orthologous pathways
            Protein_db   Promoter_simil. matrix    Predicted homol
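
Returning to the recursion under "How Alignments Are Determined And Scored": the three matrices H, Hh, and Hv can be filled in a few lines of code. The following is a minimal sketch, not PMsearch's implementation. It assumes a global (end-to-end) alignment, boundary rows that are pure gaps, pathway elements stored as (kind, name) tuples, and an identity-mapping similarity; the names `align_score` and `identity_similarity` are hypothetical. The "infinite distance" rule from "Specifics for Pathway Alignment" is modeled by returning negative infinity when two elements have different kinds. With the recursion as written on the slide, the first gap position is charged the opening penalty and each further position the extension penalty.

```python
NEG_INF = float("-inf")


def identity_similarity(x, y):
    """Identity mapping: elements are (kind, name) tuples (assumed layout)."""
    if x[0] != y[0]:
        return NEG_INF  # different element types can never be aligned
    return 1.0 if x[1] == y[1] else 0.0


def align_score(a, b, sim, gap_open, gap_extend):
    """Best global alignment score of element lists a and b.

    H  : best score of any alignment of a[:i] with b[:j]
    Hh : best score ending in a horizontal move (gap in sequence 1)
    Hv : best score ending in a vertical move (gap in sequence 2)
    """
    m, n = len(a), len(b)
    H = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Hh = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Hv = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    H[0][0] = 0.0
    # Boundary cells (an assumption): only gaps are possible here.
    for j in range(1, n + 1):
        Hh[0][j] = max(Hh[0][j - 1] - gap_extend, H[0][j - 1] - gap_open)
        H[0][j] = Hh[0][j]
    for i in range(1, m + 1):
        Hv[i][0] = max(Hv[i - 1][0] - gap_extend, H[i - 1][0] - gap_open)
        H[i][0] = Hv[i][0]
    # Interior cells follow the recursion from the slide.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            Hh[i][j] = max(Hh[i][j - 1] - gap_extend, H[i][j - 1] - gap_open)
            Hv[i][j] = max(Hv[i - 1][j] - gap_extend, H[i - 1][j] - gap_open)
            match = H[i - 1][j - 1] + sim(a[i - 1], b[j - 1])
            H[i][j] = max(match, Hh[i][j], Hv[i][j])
    return H[m][n]


if __name__ == "__main__":
    query = [("protein", "A"), ("mode", "->"), ("protein", "B"),
             ("mode", "->"), ("protein", "C")]
    subject = [("protein", "A"), ("mode", "->"), ("protein", "C")]
    # Three matches (3.0) minus one gap of length two (1.0 + 0.5).
    print(align_score(query, subject, identity_similarity, 1.0, 0.5))  # 1.5
```

Replacing `identity_similarity` with a percent-identity lookup for protein nodes, or a SCIM lookup for modes, would leave the recursion unchanged; only the `sim` callback differs.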

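
The PMortholog sample output does not say how the pathway score is derived from the per-node %_id values. One reading that reproduces the four alignments shown is: score = 100 × the mean of the per-node identities, with an unmatched node (shown as "-") counted as 0. The sketch below checks that reading against the sample output; it is an inference from the printed numbers, not documented PMortholog behavior, and `pathway_score` is a hypothetical name.

```python
def pathway_score(percent_ids):
    """100 x mean per-node identity; unmatched nodes are passed in as 0.0."""
    return 100.0 * sum(percent_ids) / len(percent_ids)


# Per-node %_id values copied from the sample alignments (Hits 1 to 4).
sample_hits = {
    "Homo sapiens": [1.00, 1.00, 1.00, 1.00, 1.00],
    "Mus musculus": [0.85, 0.88, 0.81, 0.0, 0.72],
    "Rattus norvegicus": [0.81, 0.88, 0.84, 0.0, 0.73],
    "Caenorhabditis elegans": [0.48, 0.56, 0.42, 0.44, 0.31],
}

if __name__ == "__main__":
    for species, ids in sample_hits.items():
        print(f"{species}: {pathway_score(ids):.2f}")
    # Homo sapiens: 100.00
    # Mus musculus: 65.20
    # Rattus norvegicus: 65.20
    # Caenorhabditis elegans: 44.20
```

Under this reading, the missing fourth protein in mouse and rat lowers the score by a full fifth of that node's weight, which is why both land at 65.20 despite slightly different per-node identities.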