生物資訊相關(guān)演算法_第1頁
生物資訊相關(guān)演算法_第2頁
生物資訊相關(guān)演算法_第3頁
生物資訊相關(guān)演算法_第4頁
生物資訊相關(guān)演算法_第5頁
已閱讀5頁,還剩48頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

1、生物資訊相關(guān)演算法Algorithms in Bioinformatics呂學(xué)一 (中央研究院 資訊科學(xué)所).tw/hil/2003/10/211Algorithms in Bioinformatics, Lecture 5TodayApplications of suffix treesSubstring problem (暖身)“Exact string matching” revisitedLinearization of circular string (挪移乾坤)Longest common substring (異中求同)Intermission 小巨s magic showThe

2、 Lure of Queen (皇后的魅惑)2003/10/212Algorithms in Bioinformatics, Lecture 5Application OneSubstring Problem (recap as a warm-up)2003/10/213Algorithms in Bioinformatics, Lecture 5Substring ProblemInput: two strings P and S, where S is allowed to be preprocessed in O(|S|) time.Output: an occurrence of P

3、in S.Objective: done in O(|P|) time.2003/10/214Algorithms in Bioinformatics, Lecture 5 12345678 S=bbabbaab1,13,1,12,3,127,2,34,7,14,17,3,34,13,32003/10/215Algorithms in Bioinformatics, Lecture 5 12345678 S=bbabbaab1,17,2,34,7,4,7,3,34,3,3Q: Where are abba, baa, bb?2003/10/216Algorithms in Bioinforma

4、tics, Lecture 5 12345678 S=bbabbaab121,1347,2,34,57,4,67,3,34,3,33211Q: Where are abba, baa, bb?2003/10/217Algorithms in Bioinformatics, Lecture 5Application TwoExact String Matching2003/10/218Algorithms in Bioinformatics, Lecture 5Exact String MatchingInput: two strings P and S, where S is allowed

5、to be preprocessed in O(|S|) time.Output: all occurrences of P in S.Challenge: solving this in O(|P| + k) time, where k is the number of occurrences of P in S.2003/10/219Algorithms in Bioinformatics, Lecture 5IdeaEach internal node keeps the labels of all its descendant leaves.2003/10/2110Algorithms

6、 in Bioinformatics, Lecture 5 12345678 S=bbabbaab121,1347,2,34,57,4,67,3,34,3,3Q: Somethings missing? 6,35,24,15,2,4,1Q: How do we fix this problem?2003/10/2111Algorithms in Bioinformatics, Lecture 5 123456789 S=bbabbaab$121,1347,2,34,57,4,67,3,34,3,35,24,15,2,4,1,8179,4,45,89,99,6,3,7Q: Obtainable

7、in O(|S|) time?2003/10/2112Algorithms in Bioinformatics, Lecture 5Perhaps notS = a a a a a $1654321,2,3,4,51,2,3,41,2,31,22003/10/2113Algorithms in Bioinformatics, Lecture 5An observationConsider the sequence L of leaves from left to right. The descendant leaves of each internal node has to be conse

8、cutive in L.2003/10/2114Algorithms in Bioinformatics, Lecture 5 123456789 S=bbabbaab$121,1347,2,34,57,4,67,3,33,35,24,15,2,4,1,879,4,45,89,99,6,3,7 123456789L=6375241891,34,84,56,72003/10/2115Algorithms in Bioinformatics, Lecture 5Application ThreeCircular String Linearization (挪移乾坤)2003/10/2116Algo

9、rithms in Bioinformatics, Lecture 5NotationLet挪(S, i)denote the string Si|S| S1i 1.Si挪(S,i)2003/10/2117Algorithms in Bioinformatics, Lecture 5The problemInputa string S.Outputan index i that maximizes the alphabetical order of 挪(S, i). 1 2 3 4 5 6 7 8挪(S,1) = b b a b b a a b挪(S,2) = b a b b a a b b挪

10、(S,3) = a b b a a b b b挪(S,4) = b b a a b b b a挪(S,5) = b a a b b b a b挪(S,6) = a a b b b a b b挪(S,7) = a b b b a b b a挪(S,8) = b b b a b b a a2003/10/2118Algorithms in Bioinformatics, Lecture 5Nave algorithmlet j = 1;for i = 2 to |S| do if (挪(S,i) 挪(S,j) let j = i;output j;Time complexity?2003/10/2

11、119Algorithms in Bioinformatics, Lecture 5Q: Can we beat O(|S|2)? 1 2 3 4 5 6 7 8挪(S,1) = b b a b b a a b挪(S,2) = b a b b a a b b挪(S,3) = a b b a a b b b挪(S,4) = b b a a b b b a挪(S,5) = b a a b b b a b挪(S,6) = a a b b b a b b挪(S,7) = a b b b a b b a挪(S,8) = b b b a b b a a2003/10/2120Algorithms in B

12、ioinformatics, Lecture 5Linear-Time Algorithm via Suffix Tree2003/10/2121Algorithms in Bioinformatics, Lecture 5First attempt going right1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Q: How to fix the problem?2003/10/2122Algorithms in Bioinformatics, Lecture

13、5Second AttemptSuffix tree for SS2003/10/2123Algorithms in Bioinformatics, Lecture 5Key observationEach length-|S| substring of SS is a 挪(S, j) for some index j with 1 j |S|.Each 挪(S, j) with 1 j |S| is a length-|S| substring of SS.2003/10/2124Algorithms in Bioinformatics, Lecture 5 1234567890123456

14、SS=bbabbaabbbabbaab11,13,1,12,3,127,2,34,7,14,17,3,34,1210,4,56,110,2,23,323453,32003/10/2125Algorithms in Bioinformatics, Lecture 51,17,4,7,4,7,3,310,4,56,10,2,23,33,3Q: How to use this suffix tree? 1234567890123456SS=bbabbaabbbabbaab2003/10/2126Algorithms in Bioinformatics, Lecture 5Equivalently,

15、Output the index i such that SSi|SS| corresponds to the rightmost leaf of the suffix tree for SS.Clearly, this takes O(|S|) time.2003/10/2127Algorithms in Bioinformatics, Lecture 5Application FourLongest common substring (異中求同)2003/10/2128Algorithms in Bioinformatics, Lecture 5The problemInput: two

16、strings A and B.Output: a longest string C that occurs in both A and B.A = b b b a b b a a bB = b a a b b a b b a bC = b bC = b a a bC = a b b aC = b b a b b a2003/10/2129Algorithms in Bioinformatics, Lecture 5Nave algorithmbuild suffix tree for B;for L = |A| downto 1 do for i = 1 to |A|-L+1 do if A

17、ii+L-1 occurs in B output Aii+L-1 and halt; output “no common substring”;Time complexity?2003/10/2130Algorithms in Bioinformatics, Lecture 5O(|A|3+|B|)()3|12|1|1|)()1|(|AOAiiAOiOiAALALAL=+-=+-=The for-loop takes timeCan we do better than this?2003/10/2131Algorithms in Bioinformatics, Lecture 5A fast

18、er algorithmbuild suffix tree for B;for i = 1 to |A| do find the largest integer L(i) such that Aii+L(i)-1 occurs in B by binary search;output AiL(i) for the i with the largest L(i);Time complexity?2003/10/2132Algorithms in Bioinformatics, Lecture 5O(|A|2 log|A|+|B|)The for-loop takes O(|A|2 log|A|)

19、 time.Each binary search takes time O(|A| log |A|).There are overall O(|A|) binary searches.Can we do better than this?2003/10/2133Algorithms in Bioinformatics, Lecture 5Donald E. Knuth conjectured in 1970 that it is impossible to solve this longest common substring problem in O(|A|+|B|) time.2003/1

20、0/2134Algorithms in Bioinformatics, Lecture 5Longest Common Substringin O(|A|+|B|) time via suffix tree2003/10/2135Algorithms in Bioinformatics, Lecture 5IdeaConstruct a suffix tree T for A#B$, where # and $ are two characters not in A and B.There are exactly |A|+|B|+2 leaves in T, each leaf corresp

21、onds to a suffix of A#B$.A-leaf: with label in 1, 2, , |A|corresponds to an A-suffix.B-leaf: with label in |A|+2, ,|A|+|B|+1corresponds to a B-suffix.$#ABA-suffixB-suffix2003/10/2136Algorithms in Bioinformatics, Lecture 5ObservationLet v be an arbitrary position of T (i.e., v is not necessarily a no

22、de of T.)v has a descendant A-leaf if and only if v corresponds to a prefix of an A-suffix of A#B$.v has a descendant B-leaf if and only if v corresponds to a prefix of a B-suffix of A#B$.rootv2003/10/2137Algorithms in Bioinformatics, Lecture 5LemmaLet v be a position of T.v has descendant A-leaf an

23、d B-suffix if and only if v corresponds to a common substring of A and B.root$#ABA-suffixB-suffixv2003/10/2138Algorithms in Bioinformatics, Lecture 5QuestionDo we really need # to separate A and B in the concatenated string A#B$?root$ABA-suffixB-suffixv2003/10/2139Algorithms in Bioinformatics, Lectu

24、re 5The algorithmConstruct the suffix tree T of A#B$.Output the string corresponding to a deepest internal node v such that the subtree of T rooted at v contains both A-leaf and B-leaf.Q: why not checking leaves?Q: why not checking positions of T that are not internal nodes of T?2003/10/2140Algorith

25、ms in Bioinformatics, Lecture 5It suffices to check internal nodesIf the position v contains both kinds of descendant leaves, then so does its closest internal node below.rootv2003/10/2141Algorithms in Bioinformatics, Lecture 5Time = O(|A|+|B|)O(|A|+|B|) time for constructing T.O(|A|+|B|) time for m

26、arking the colors of each node, including each leaf and each internal nodesO(|A|+|B|) time for computing the depths of all nodesO(|A|+|B|) time to find a deepest internal node with both colors.2003/10/2142Algorithms in Bioinformatics, Lecture 5Space Complexity is also O(|A|+|B|).Q: Can we further im

27、prove the time and space complexity?“No” for the time complexity.“Yes” for the space complexity.2003/10/2143Algorithms in Bioinformatics, Lecture 5Reducing the space to O(|A|)2003/10/2144Algorithms in Bioinformatics, Lecture 5Longest Common SubstringInput: two strings A and B.Output: a longest strin

28、g C that occurs in both A and B.Objective: O(|A|+|B|) timeO(|A|) spaceIdea: Construct the suffix tree of A only.2003/10/2145Algorithms in Bioinformatics, Lecture 5The algorithmConstruct the suffix tree T of A, keeping all the suffix links.For i = 1 to |B| doFind the largest integer 深(i) such that Bi

29、i+深(i)1 occurs in B.Output Bii+深(i)1 where i is the index with maximum 深(i).2003/10/2146Algorithms in Bioinformatics, Lecture 5Navely, Finding 深(i) for each i takes O(深(i) +1) time by traversing T from the root.But all these |B| iterations would take O(|A|B|) time in total.depth = 深(i)2003/10/2147Al

30、gorithms in Bioinformatics, Lecture 5Observation深(i+1) 深(i) 1.深(i)深(i) 1What if this suffix link does not exist?2003/10/2148Algorithms in Bioinformatics, Lecture 5 12345678 12345678901 A=bbabbaab B=babaabbaaba121,1347,82,34,857,84,867,83,34,83,31Record: 深(1) = 31Record: 深(3) = 4Record: 深(5) = 62003/10/2149Algorithms in Bioinformatics, Lecture 5Time and spaceClearly, the space complexity is O(|A|).The time complexity still O(|A|+|B|).We first show that the time is O(|A|+|B|) without considering that of suffix link traversal. We then show that t

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論