Another Perspective of Bioinformatics
許聞廉 (Wen-Lian Hsu)
Institute of Information Science, Academia Sinica (中央研究院資訊所)

Outline
- DNA sequence analysis
  - Error-tolerant algorithms: mapping, assembly, repeat sequences
- DNA sequence alignment and search
- Sequence annotation
- Database construction
- Prediction of various structures
- Integrated Genomic/Proteomic Knowledge Base System

My own involvement in Bioinformatics
- Since 1980: research on algorithms.
- 1995: began studying mapping and assembly algorithms in collaboration with 常蘭陽 (Lan-Yang Chang), 白果能, and 孫以瀚 of the Institute of Biomedical Sciences (IBMS), Academia Sinica.
- Since 1989: research on natural language and knowledge representation. The parts related to bioinformatics are search, database research, and data mining.

DNA sequence analysis
- Cutting and reassembling gene sequences
- Gene sequence alignment
- Gene discovery
- Mechanisms of gene transcription
- Analysis of the three-dimensional structure of proteins
- Phylogenetic tree analysis

Algorithms for noisy data
- Mapping algorithms
- Assembly algorithms
- Algorithms for repeat sequences

Cutting and reassembling gene sequences
- Cut a DNA sequence into small pieces in different ways and reassemble them.
- The "small" pieces (called clones) are still too large to be sequenced completely.
- Biologically, probes (探針) are used to mark the clones:
  - each probe could mark several clones;
  - each clone could contain several probes.

Probe-Clone (0,1)-Matrix
- Each probe can be regarded as a column; each clone can be regarded as a row of probes.
- If each probe hits the DNA sequence only once (unique probe) and there is no error in the probe-clone matrix, then the consecutive ones test can be used to order the clones.

Consecutive Ones Property (C1P)
- Booth & Lueker 1976: linear time, on-line; made use of a data structure called PQ-trees.
- Hsu 1992: decomposition, off-line; did not use PQ-trees.
- However, these algorithms do not work on data that contain errors.
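
To make the consecutive ones test concrete, here is a minimal sketch in Python. It is a brute-force check over all column orders, not the linear-time PQ-tree algorithm of Booth & Lueker nor Hsu's decomposition method, and it is only practical for very small matrices. The function names are hypothetical. Rows stand for clones and columns for probes, as in the probe-clone matrix above.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """Return True if all 1s in the row form a single contiguous block."""
    positions = [i for i, value in enumerate(row) if value == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def find_c1p_order(matrix):
    """Return a column order that makes every row's 1s consecutive, or None.

    Brute force: tries every permutation of the columns (factorial time),
    so this is an illustration of the property, not a practical algorithm.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[j] for j in order]) for row in matrix):
            return order
    return None


# Three clones over four probes; a valid probe order exists.
clones_ok = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]

# Three clones pairwise sharing one probe in a cycle; no order works.
clones_bad = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]

if __name__ == "__main__":
    print(find_c1p_order(clones_ok))
    print(find_c1p_order(clones_bad))
```

A single wrong entry in the matrix can turn a matrix like `clones_ok` into one with no valid order, which is why error-free tests of this kind break down on experimental data.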

[Figure: a 50 x 50 probe-clone matrix with error rate 5%, shown in two panels. In the first panel some cells inside the runs of 1s are marked N and some isolated cells are marked P; in the second panel the marked cells inside the runs are shown as F and the isolated P cells are gone.]

[Figure: a 50 x 50 probe-clone matrix with error rate 10%, shown in the same two panels.]
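
Test matrices like the ones shown above can be simulated. The sketch below rests on an assumption that is not stated in the slides: each cell of an error-free matrix is flipped independently with probability equal to the error rate. The function names and the `max_len` parameter are hypothetical.

```python
import random


def make_interval_matrix(n_clones, n_probes, max_len, rng):
    """Build an error-free matrix: each clone (row) covers one contiguous
    run of probes (columns), so the matrix has the consecutive ones property."""
    matrix = []
    for _ in range(n_clones):
        length = rng.randint(1, max_len)
        start = rng.randint(0, n_probes - length)
        matrix.append([1 if start <= j < start + length else 0
                       for j in range(n_probes)])
    return matrix


def add_noise(matrix, error_rate, rng):
    """Flip each cell independently with probability error_rate (assumed model).
    A 1 flipped to 0 is a false negative; a 0 flipped to 1 is a false positive."""
    return [[1 - value if rng.random() < error_rate else value for value in row]
            for row in matrix]


def count_errors(clean, noisy):
    """Return (false_negatives, false_positives) between two matrices."""
    false_neg = false_pos = 0
    for clean_row, noisy_row in zip(clean, noisy):
        for c, n in zip(clean_row, noisy_row):
            if c == 1 and n == 0:
                false_neg += 1
            elif c == 0 and n == 1:
                false_pos += 1
    return false_neg, false_pos


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    clean = make_interval_matrix(50, 50, 20, rng)
    noisy = add_noise(clean, 0.05, rng)
    print(count_errors(clean, noisy))
```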

Assembly algorithms
[Figure: the interval graph models (vertex labels 1 2 3 4 5 and 4 1 2 3 5)]

Tandem Repeat (Satellite) DNA
- 15% of DNA is composed of short repetitive sequences that follow one another (tandem repeats).
- Satellite DNAs are classified into three major groups:
  - Satellites: very highly repetitive, organized as large clusters (up to 100 million bp).
  - Minisatellites: moderately repetitive, moderately sized repeats (9 to 100 bp, but usually about 15 bp).
  - Microsatellites: moderately repetitive, composed of arrays of short (2-6 bp) repeats.

Example of Satellite DNA
TAATCGCATTCCGATACGCAGCGGACGTTAATCGCATTCCGA
TAGCAGCGGACGTTAACCGCATTCCGATACGCAGCGGACGTCTACCGCGTCCGAGACGCACCGGAAGCTGACCGCATTCCGTACGCAGCGGACGTAAACCGCATTCCGATCGCAGCGGACGTTAACCGCATCCCGATACGCAGCGGAACTTAACCGCTGCCGATATGCAACGGACGTTAACCGCATTCCGATACGCATCGGACGTCAACCGCAGTCCGATACGCAGCGGACGTTAACCGCATTCTGATACGCAGCGGACGTTAACCGCTTGCGATATGCAGGGGACGTTAACGCATTCCG
ATACACAGCGACGTTAACGCATTCAGAGACGCAGCTGACGTTAACCGCATTCCGATACGCAACGCACGTTAACCGCATCCGAACGCAGCGGACGTTAACAGCATTCCGATACCGCGGACGTTAACCGCATTCCGATACCAGCGGACGT

Cutting the Satellite DNA
(one repeat unit per row, 28 columns; "_" marks a deleted base)
TAACCGCATTCCGATACGCAGCGGACGT  (consensus)
TAATCGCATTCCGATACGCAGCGGACGT
TAATCGCATTCCGATA_GCAGCGGACGT
TAACCGCATTCCGATACGCAGCGGACGT
CTACCGCGT_CCGAGACGCACCGGAAGC
TGACCGCATTCCG_TACGCAGCGGACGT
AAACCGCATTCCGAT_CGCAGCGGACGT
TAACCGCATCCCGATACGCAGCGGAACT
TAACCGC_TGCCGATATGCAACGGACGT
TAACCGCATTCCGATACGCATCGGACGT
CAACCGCAGTCCGATACGCAGCGGACGT
TAACCGCATTCTGATACGCAGCGGACGT
TAACCGC_TTGCGATATGCAGGGGACGT
TAA_CGCATTCCGATACACAGC_GACGT
TAA_CGCATTCAGAGACGCAGCTGACGT
TAACCGCATTCCGATACGCAACGCACGT
TAACCGCAT_CCGA_ACGCAGCGGACGT
TAACAGCATTCCGATAC_C_GCGGACGT
TAACCGCATTCCGATAC_CAGCGGACGT

Errors in Repetitive Sequences
- Substitution
  TAACCGCATTCCGATACGCAGCGGACGT
  TAACCGCATTCCGATACGCGGCGGACGT
- Deletion
  TAACCGCATTCCGATACGCAGCGGACGT
  TAACCGCATT_CGATACGCAGCGGACGT
- Insertion
  TAA_CCGCATTCCGATACGCAGCGGACGT
  TAATCCGCATTCCGATACGCAGCGGACGT
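
Once the repeat units have been cut and gap-aligned as above, a consensus can be read off column by column. The sketch below uses a plain majority vote per column, which is an assumption: the slides do not say how the consensus was derived. The function names are hypothetical, and the seven units are copied from the "Cutting the Satellite DNA" example.

```python
from collections import Counter

GAP = "_"

# Seven aligned repeat units taken from the cutting example (28 columns each).
UNITS = [
    "TAATCGCATTCCGATACGCAGCGGACGT",
    "TAATCGCATTCCGATA_GCAGCGGACGT",
    "TAACCGCATTCCGATACGCAGCGGACGT",
    "CTACCGCGT_CCGAGACGCACCGGAAGC",
    "TGACCGCATTCCG_TACGCAGCGGACGT",
    "AAACCGCATTCCGAT_CGCAGCGGACGT",
    "TAACCGCATCCCGATACGCAGCGGAACT",
]


def consensus(units):
    """Column-wise majority vote over equal-length, gap-aligned units."""
    if len({len(u) for u in units}) != 1:
        raise ValueError("units must be aligned to the same length")
    result = []
    for column in zip(*units):
        votes = Counter(base for base in column if base != GAP)
        # A column made only of gaps contributes nothing to the consensus.
        if votes:
            result.append(votes.most_common(1)[0][0])
    return "".join(result)


def mismatches(unit, reference):
    """Positions (0-based) where an aligned unit differs from the reference.
    A gap in the unit counts as a difference (a deletion)."""
    return [i for i, (u, r) in enumerate(zip(unit, reference)) if u != r]


if __name__ == "__main__":
    ref = consensus(UNITS)
    print(ref)
    for unit in UNITS:
        print(unit, mismatches(unit, ref))
```

For these seven units the vote reproduces the consensus row shown in the slide. Insertions lengthen a unit, so they need an alignment step before a column-wise vote like this can be applied.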

Gene sequence retrieval
- DNA sequence -> mRNA -> protein sequence -> 3D structure -> function
- Working through this chain experimentally is time-consuming and very costly.
- Now that the Genome Project is nearing completion, many complete sequences and partial gene information are available.
- The question is how to derive enough rules from the available information and combine them with biologists' knowledge, so that computational sequence analysis and prediction can speed up genetic research.

Common techniques
- Text: tagging of characters, words, and phrases.
- Events are analyzed through combinations of these tags.
- Many small events are combined into a larger script.

  Text analysis      DNA analysis
  approximation      sequence alignment
  similarity         homology
  meaning            3D structure, functionality

Treating Genomic/Proteomic data as a Language
An analogy of exons and introns:
OnlyaksjcbakamcnabddfkjsmallddkdfjwosperddtrudjfdksjascdcentagedkjfdkdfjgaofhumanzidkenkdjfDNAisbelskdfjactuallyofSnadkfjkjdmeandkfjdkslasdkingful
(The readable fragments play the role of exons and spell out a sentence about how little of human DNA is meaningful; the filler between them plays the role of introns.)

Decoding an unknown language
For proteomic data:
  Amino acid -> motif -> protein
  Alphabet -> word -> sentence
  Sentence meaning <-> protein structure
- Finding the interrelationships of data: data mining, knowledge discovery.

Matching by examples
- Existing sentences in the database (understood):
  - His old father gave me a book.
  - Joan loves Andy.
- Understanding a new sentence:
  - Mary's lovely daughter does not like John.

Techniques
- Corpus analysis
- Pattern discovery and matching: sequence, semantics (classification, transformation)
- Structure prediction

[Figures: Gene Splicing; Matching by templates; Boundaries of Splice Sites]

Prediction
- Gene finding: intron and exon boundary determination.
- Protein folding: 3-dimensional structure.
- Techniques used: data mining (machine learning, statistical hypothesis testing), probabilistic methods (HMM), neural networks, graph models.

Dynamic programming for gene structure prediction
- Constraints
  - Frame constraint
  - Junction constraint
- Statistical characteristics of junctions and segments
  - Junctional scoring function
  - Segmental scoring function

A Simple HMM
Hidden states: Exon (E), Intron (I)
Emission probabilities:
  State    A     C     G     T
  Exon     0.4   0.1   0.1   0.4
  Intron   0.05  0.4   0.5   0.05
Transition probabilities in the state diagram: 0.9, 0.1, 0.01, 0.99
Hidden:       E E E E I I I E E E
Observation:  A T C A A G G C G T
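
The two-state exon/intron HMM above can be decoded with the Viterbi algorithm, which finds the most probable hidden state sequence for an observed DNA string. The sketch below takes the emission probabilities from the slide. Two things are assumptions, because the extracted text does not say: the assignment of the four transition values to arcs (Exon->Exon 0.9, Exon->Intron 0.1, Intron->Exon 0.01, Intron->Intron 0.99) and the uniform start distribution.

```python
import math

STATES = ("E", "I")  # E = exon, I = intron

# Assumption: uniform start distribution (not given in the slide).
START = {"E": 0.5, "I": 0.5}

# Assumption: this mapping of the four slide values to arcs.
TRANS = {
    "E": {"E": 0.9, "I": 0.1},
    "I": {"E": 0.01, "I": 0.99},
}

# Emission probabilities as listed in the slide.
EMIT = {
    "E": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "I": {"A": 0.05, "C": 0.4, "G": 0.5, "T": 0.05},
}


def viterbi(observation):
    """Return the most probable state string for a DNA string (log-space)."""
    if not observation:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observation[0]])
             for s in STATES}
    backpointers = []
    for symbol in observation[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][symbol]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("ATCAAGGCGT"))
```

With a different assignment of the transition values the decoded path can change, so the output for the slide's observation should be read only as an illustration of the method.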

Goal
- Develop an integrated Genomic/Proteomic Knowledge Base System to serve as a bioinformatics framework.
- The system is designed to facilitate genome annotation, functional characterization of proteins, and disease and pathway studies.

Biological Knowledge Map
- We shall organize various biological data and their relationships into a MAP.
- This map will provide the following:
  - Basic knowledge inference ability
  - A biological question and answering system
  - A decision support system for biologists
- The map will speed up the exploration of genes, protein structures, and functions.

An Example of Information MAP
1. As an example, we build and store the data related to CRASA (Complexity Reduction Algorithm for Sequence Analysis) by Dr. 常蘭陽 (Lan-Yang Chang).
2. CRASA redefines a DNA sequence through layers of two-dimensional matrices. The two sides of each layer's matrix are indexed by the 16 dinucleotide combinations (AA, AG, AT, AC, GT, GG, GC, and so on). A DNA sequence can therefore be traced as a path on the pyramid formed by stacking these layers. The hope is that this mechanism reduces the complexity of analyzing DNA sequences.
3. The parts to be managed include:
   a. Resources (database: cDNAdb (HGI6.0 190M))
   b. Programs (CRASA query interface, procedure interface definition)
   c. Documents (CRASA: summary, methodology, program source code; cDNAdb: summary, specification)
4. Through Biology Knowledge Management, the CRASA query program can be reused, and can even serve as the basis for new sequence-analysis programs: linking a new program to the procedure interface definition is enough to gain the benefits of sharing and reuse. In addition, the management interface also records the CRASA-related documents (methodology, summary, etc.), which makes the system more convenient to use.

[Figure: IBMS Sinica information map. Labels include IBMS Sinica, Dr. Lan-Yang Chang, CRASA, Dr. Wen-chang Lin, cSNP, Information, Program, Source, Function Type, generic function, query interface, procedure interface definition, cDNAdb, summary, methodology, specification, and source code.]

[Figure: IBMS Sinica Map. Labels include Dr. Chang's laboratory and Dr. Lin's laboratory, each with laboratory methods (CRASA, cSNP), classification attributes, "what is a laboratory method", laboratory method conditions, and laboratory members (faculty, postdoctoral researchers, research assistants).]

5. The material in the documents is converted into natural-language-based knowledge, covering CRASA (summary, methodology, program source code) and cDNAdb (summary, specification). Every resource managed by Biology Knowledge Management can therefore be mapped to related natural-language-based knowledge.
6. The natural-language agent technology of the IASL laboratory is brought in to build a question-answering mechanism for frequently asked questions.
7. A new script language is introduced so that knowledge can be edited with simple scripts to produce new functions. Because the resources managed by Bio KM have already been converted into NL-based knowledge, new functions can be produced by editing, adding, and recombining knowledge, which makes it easier for biology researchers to move into information applications.

Procedure Automation: Protein structure prediction
Given a sequence, predict its structure automatically:
1. Find homologous (25%) sequences.
2. If we can find one whose structure is known, carry out automated homology modeling.
3. Otherwise, transform our sequence into another representation (secondary or super-secondary structure):
   IAMHSUWENLAI - HHHCCBBBB
4. Align the transformed sequence.
5. If none works, go back to the "ab initio" approach.
6. With a structure available, scan the catalytic fragments and ligand binding sites (this needs a 3D active-site database).

Biology Agent
- A query interface on biology KM
- User-friendly (人性化)
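
Steps 1 and 2 of the protein structure prediction procedure above depend on a sequence-identity cutoff. The sketch below reads the 25% figure as a lower bound on percent identity, which is an assumption because the comparison symbol is missing in the slide text. It also assumes the sequences are already pairwise aligned with "-" as the gap character. The function names and the candidate tuple layout are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the columns where neither
    sequence has a gap. Both inputs must come from one pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def pick_template(candidates, threshold=25.0):
    """Pick a homology-modeling template (steps 1 and 2).

    candidates: list of (name, aligned_query, aligned_candidate, structure_known).
    Returns the name of the most similar candidate that has a known structure
    and reaches the threshold, or None, in which case the procedure would
    fall through to steps 3 to 5.
    """
    best_name, best_identity = None, threshold
    for name, query, candidate, structure_known in candidates:
        identity = percent_identity(query, candidate)
        if structure_known and identity >= best_identity:
            best_name, best_identity = name, identity
    return best_name


if __name__ == "__main__":
    candidates = [
        ("t1", "ACDEFGHIKL", "ACDEFGHIKV", True),   # similar, structure known
        ("t2", "ACDEFGHIKL", "ACDEFGHIKL", False),  # identical, no structure
        ("t3", "ACDEFGHIKL", "WWWWWWWWWW", True),   # unrelated
    ]
    print(pick_template(candidates))
```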
