Another Perspective of Bioinformatics
許聞廉 (Wen-Lian Hsu)
Institute of Information Science, Academia Sinica

Outline
- DNA sequence analysis
- Error-tolerant algorithms: mapping, assembly, repeat sequences
- DNA sequence alignment and search
- Sequence annotation
- Database construction
- Prediction of various structures
- Integrated Genomic/Proteomic Knowledge Base System

My own involvement in Bioinformatics
- Since 1980: research on algorithms.
- 1995: began working on mapping and assembly algorithms in collaboration with 常蘭陽 (Lan-Yang Chang), 白果能, and 孫以瀚 of the Institute of Biomedical Sciences, Academia Sinica.
- Since 1989: research on natural language and knowledge representation. The parts relevant to bioinformatics are search, database research, and data mining.

DNA sequence analysis
- Cutting and assembling gene sequences
- Gene sequence alignment
- Gene discovery
- Mechanisms of gene sequence transcription
- Analysis of the three-dimensional structure of proteins
- Phylogenetic tree analysis

Algorithms for noisy data
- Mapping algorithms
- Assembly algorithms
- Algorithms for repeat sequences

Cutting and assembling gene sequences
- Cut a DNA sequence into small pieces in different ways and reassemble them.
- The "small" pieces (called clones) are still too large to sequence completely.
- Biologically, probes (探針) are used to mark the clones.
- Each probe can mark several clones.
- Each clone can contain several probes.

Probe-Clone (0,1)-Matrix
- Each probe can be regarded as a column; each clone can be regarded as a row of probes.
- If each probe hits the DNA sequence only once (a unique probe) and there is no error in the probe-clone matrix, then the consecutive ones test can be used to order the clones.

Consecutive Ones Property (C1P)
- Booth & Lueker 1976: linear time, on-line; made use of a data structure called PQ-trees.
- Hsu 1992: decomposition, off-line; did not use PQ-trees.
- However, these algorithms do not work on data that contain errors.
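
The consecutive ones test can be illustrated on a toy probe-clone matrix. The sketch below is a brute-force check that tries every probe (column) order; it is not the linear-time PQ-tree algorithm of Booth & Lueker, nor the decomposition algorithm of Hsu. The function names and both matrices are hypothetical examples, not data from the talk.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """Return True if all 1s in the row form one contiguous block."""
    positions = [i for i, value in enumerate(row) if value == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def find_c1p_order(matrix):
    """Return a column order that gives every row consecutive 1s, or None.

    Brute force over all column permutations, so only usable for tiny inputs.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[c] for c in order]) for row in matrix):
            return order
    return None


# Hypothetical toy data: 3 clones (rows) x 4 unique probes (columns).
clean = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]

# Same matrix with one assumed false positive in the third clone (column 0).
noisy = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
]

print(find_c1p_order(clean))  # (0, 2, 3, 1)
print(find_c1p_order(noisy))  # None
```

On the error-free matrix the search finds a probe order. With a single flipped entry no valid order exists, which is the failure mode that motivates error-tolerant algorithms.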

A 50 × 50 matrix with error rate 5%
[Figure: the 50 × 50 (0,1)-matrix, shown once with some entries marked N or P and once with some entries marked F]

A 50 × 50 matrix with error rate 10%
[Figure: the 50 × 50 (0,1)-matrix, shown once with some entries marked N or P and once with some entries marked F]

Assembly algorithms
[Figure: items labeled 1 to 5]

The interval graph models

Tandem Repeat (Satellite) DNA
- 15% of DNA is composed of short repetitive sequences that follow one another (tandem repeats).
- Satellite DNAs are classified into three major groups:
  - Satellites: very highly repetitive, organized as large clusters (up to 100 million bp).
  - Minisatellites: moderately repetitive, moderately sized repeats (9 to 100 bp, but usually about 15 bp).
  - Microsatellites: moderately repetitive, composed of arrays of short (2-6 bp) repeats.

Example of Satellite DNA
TAATCGCATTCCGATACGCAGCGGACGTTAATCGCATTCCGATAGCAGCGGACGTTAACCGCATTCCGATACGCAGCGGACGT
CTACCGCGTCCGAGACGCACCGGAAGCTGACCGCATTCCGTACGCAGCGGACGTAAACCGCATTCCGATCGCAGCGGACGT
TAACCGCATCCCGATACGCAGCGGAACTTAACCGCTGCCGATATGCAACGGACGTTAACCGCATTCCGATACGCATCGGACGT
CAACCGCAGTCCGATACGCAGCGGACGTTAACCGCATTCTGATACGCAGCGGACGTTAACCGCTTGCGATATGCAGGGGACGT
TAACGCATTCCGATACACAGCGACGTTAACGCATTCAGAGACGCAGCTGACGTTAACCGCATTCCGATACGCAACGCACGT
TAACCGCATCCGAACGCAGCGGACGTTAACAGCATTCCGATACCGCGGACGTTAACCGCATTCCGATACCAGCGGACGT

Cutting the Satellite DNA
TAACCGCATTCCGATACGCAGCGGACGT   (consensus)
TAATCGCATTCCGATACGCAGCGGACGT
TAATCGCATTCCGATA_GCAGCGGACGT
TAACCGCATTCCGATACGCAGCGGACGT
CTACCGCGT_CCGAGACGCACCGGAAGC
TGACCGCATTCCG_TACGCAGCGGACGT
AAACCGCATTCCGAT_CGCAGCGGACGT
TAACCGCATCCCGATACGCAGCGGAACT
TAACCGC_TGCCGATATGCAACGGACGT
TAACCGCATTCCGATACGCATCGGACGT
CAACCGCAGTCCGATACGCAGCGGACGT
TAACCGCATTCTGATACGCAGCGGACGT
TAACCGC_TTGCGATATGCAGGGGACGT
TAA_CGCATTCCGATACACAGC_GACGT
TAA_CGCATTCAGAGACGCAGCTGACGT
TAACCGCATTCCGATACGCAACGCACGT
TAACCGCAT_CCGA_ACGCAGCGGACGT
TAACAGCATTCCGATAC_C_GCGGACGT
TAACCGCATTCCGATAC_CAGCGGACGT

Errors in Repetitive Sequences
Substitution
TAACCGCATTCCGATACGCAGCGGACGT
TAACCGCATTCCGATACGCGGCGGACGT
Deletion
TAACCGCATTCCGATACGCAGCGGACGT
TAACCGCATT_CGATACGCAGCGGACGT
Insertion
TAA_CCGCATTCCGATACGCAGCGGACGT
TAATCCGCATTCCGATACGCAGCGGACGT

Gene sequence retrieval
- DNA sequence → mRNA → protein sequence → 3D structure → function
- This process is time-consuming and very costly.
- The Genome Project is now nearing completion, and many complete sequences and partial gene information are available.
- How can we use the existing information to induce sufficient rules and, together with biologists' knowledge, carry out computational sequence analysis and prediction to accelerate gene research?

Common techniques
- Text: tagging of characters, words, and phrases.
- Events are analyzed through combinations of these tags.
- Many small events are combined into larger scripts.

Text analysis      DNA analysis
approximation      sequence alignment
similarity         homology
meaning            3D structure, functionality

Treating Genomic/Proteomic data as a Language

An analogy of exons and introns
OnlyaksjcbakamcnabddfkjsmallddkdfjwosperddtrudjfdksjascdcentagedkjfdkdfjgaofhumanzidkenkdjfDNAisbelskdfjactuallyofSnadkfjkjdmeandkfjdkslasdkingful

Decoding an unknown language
For proteomic data:

Proteomic data       Language
amino acid           alphabet
motif                word
protein              sentence
protein structure    sentence meaning

Finding the interrelationships of data: data mining, knowledge discovery

Matching by examples
- Existing sentences in the database (understood): "His old father gave me a book." "Joan loves Andy."
- Understanding a new sentence: "Mary's lovely daughter does not like John."
- Techniques:
  - Corpus analysis
  - Pattern discovery and matching
  - Sequence, semantics (classification, transformation)
  - Structure prediction

Gene Splicing

Matching by templates

Boundaries of Splice Sites

Prediction
- Gene finding: intron and exon boundary determination
- Protein folding: 3-dimensional structure
- Techniques used: data mining (machine learning, statistical hypothesis testing), probabilistic methods (HMM), neural networks, graph models

Dynamic programming for gene structure prediction
- Constraints
  - Frame constraint
  - Junction constraint
- Statistical characteristics of junctions and segments
  - Junctional scoring function
  - Segmental scoring function

A Simple HMM
Emission probabilities:

State     A      C      G      T
Exon      0.4    0.1    0.1    0.4
Intron    0.05   0.4    0.5    0.05

Hidden:       E E E E I I I E E E
Observation:  A T C A A G G C G T
Transition probabilities shown in the figure: 0.9, 0.1, 0.01, 0.99

Goal
- Develop an integrated Genomic/Proteomic Knowledge Base System to serve as a bioinformatics framework.
- The system is designed to facilitate genome annotation, functional characterization of proteins, and disease and pathway studies.

Biological Knowledge Map
- We shall organize various biological data and their relationships into a MAP.
- This map will provide the following:
  - Basic knowledge inference ability
  - A biological question answering system
  - A decision support system for biologists
- The map will speed up the exploration of genes, protein structures, and functions.

An Example of Information MAP
1. The example is the construction and storage of data related to Dr. Lan-Yang Chang's CRASA (Complexity Reduction Algorithm for Sequence Analysis).
2. CRASA redefines a DNA sequence through successive layers of two-dimensional matrices. The two sides of each layer's matrix are indexed by the 16 two-letter combinations (AA, AG, AT, AC, GT, GG, GC, and so on). A DNA sequence can therefore be traced as a path on the pyramid formed by these stacked layers. The hope is that this mechanism reduces the complexity of analyzing DNA sequences.
3. The parts to be managed are:
   a. Resources (database: cDNAdb (HGI6.0, 190M))
   b. Programs (CRASA query interface, procedure interface definition)
   c. Documents (CRASA: summary, methodology, program source code; cDNAdb: summary, specification)
4. Through Biology Knowledge Management, the CRASA query program can be reused, and new sequence-analysis programs can even be built on top of it. Linking a new program to the procedure interface definition is enough to gain the benefits of sharing and reuse. In addition, the management interface also records the CRASA-related documents (methodology, summary, etc.), which makes the system more convenient to use.

[Figure: IBMS Sinica Map. Labels in the diagram: IBMS Sinica, query interface, cDNAdb, generic function, Function Type, Information, Program, Source, summary, methodology, specification, source code, procedure interface definition, CRASA, cSNP, Dr. Lan-Yang Chang, Dr. Wen-chang Lin. The map shows Prof. Chang's lab (lab method: CRASA) and Prof. Lin's lab (lab method: cSNP), each with classification attributes, "what is a lab method", lab method conditions, and lab members (professor, postdoctoral researcher, research assistant).]

5. The material in the documents is converted into natural-language-based knowledge, covering CRASA (summary, methodology, program source code) and cDNAdb (summary, specification). Every resource managed by Biology Knowledge Management can therefore be mapped to related natural-language-based knowledge.
6. The natural language agent technology of the IASL lab is introduced to build a question-answering mechanism for frequently asked questions.
7. A new script language technology is introduced so that knowledge can be edited with a simple script language to produce new functions. Because the resources managed by Bio KM have already been converted into NL-based knowledge, new functions can be produced by editing, adding, and recombining knowledge, which makes it easier for biology researchers to move into information applications.

Procedure Automation: Protein structure prediction
Given a sequence, predict its structure automatically.
1. Find homologous (25%) sequences.
2. If we can find one whose structure is known, then carry out automated homology modeling.
3. Otherwise, transform our sequence into another representation (secondary or super-secondary structure): IAMHSUWENLAI - HHHCCBBBB
4. Align the transformed sequence.
5. If none works, go back to the "ab initio" approach.
6. With a structure available, scan for catalytic fragments and ligand binding sites (needs a 3D active-site database).

Biology Agent
- A query interface on biology KM
- User-friendly
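
The first branch of the structure-prediction procedure above (steps 1 to 3) can be sketched as a small decision function. This is only an illustration under stated assumptions: the 25% figure is read here as a minimum percent identity, the hits are assumed to arrive already aligned from some homology search, and the names `percent_identity` and `choose_strategy` are hypothetical, not part of any tool mentioned in the talk.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned columns in which the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def choose_strategy(hits, threshold=25.0):
    """Pick the next step of the procedure for one query sequence.

    hits: list of (aligned_query, aligned_hit, structure_known) tuples,
    a hypothetical stand-in for the output of a homology search.
    threshold: assumed minimum percent identity for calling a hit homologous.
    """
    homologs = [h for h in hits if percent_identity(h[0], h[1]) >= threshold]
    if any(structure_known for _, _, structure_known in homologs):
        return "homology modeling"  # step 2
    # Steps 3 and 4; the ab initio fallback (step 5) applies if this also fails.
    return "secondary-structure alignment"


# Made-up aligned hits for illustration only.
hits = [
    ("ACDEFGHIKL", "ACDEFGHIKV", False),  # close homolog, structure unknown
    ("ACDE-GHIKL", "TCDEFGHWKL", True),   # homolog with a known structure
]
print(choose_strategy(hits))
print(choose_strategy([("ACDEFGHIKL", "WWWWWWWWWL", True)]))
```

The first call selects homology modeling because one sufficiently similar hit has a known structure. The second falls through to the transformed-sequence alignment because its only hit is below the assumed threshold.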