生物資訊相關(guān)演算法_第1頁(yè)
生物資訊相關(guān)演算法_第2頁(yè)
生物資訊相關(guān)演算法_第3頁(yè)
生物資訊相關(guān)演算法_第4頁(yè)
生物資訊相關(guān)演算法_第5頁(yè)
已閱讀5頁(yè),還剩103頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、呂學(xué)一 (中央研究院 資訊科學(xué)所).tw/hil/nOur linear-time algorithm for computing all Z-values in last weeks slides might go wrong. So, please (1) Find out where of our algorithm might go wrong. (Hint: 右(i) could be i 1. Should be more careful about some boundary condition.) (2) Explain how

2、to fix the bug.nTwo possible methods: (Dont do both.) Hand it to our TAs (e.g., bring it to our classroom or try them at their offices), or Email your file to one of our TAs. (Dont send it to our broadcast address. Otherwise, you get 0 point for this homework.)nDue: (100%) 11:59pm, Oct. 14, 2003. (5

3、0%) 1:20pm, Oct. 21, 2003.nYou are welcome to discuss the problem with anybody. However, you have to write your solution by yourself. Copying solutions of others is prohibited. Anybody doing that will get zero point for ALL of his/her homeworks of this course.nList the names of those who have discus

4、sed with you before you write down your solutions.nSuffix tree an extremely powerful data structure for string algorithms.nIntermission 小巨s magic show: “Prediction Impossible”. Input: P and S. Output: All occurrences of P in S. Time: O(|P| + |S|) Technique: Z values of PS.nZ(i + |P|) |P| iff P = Sii

5、 + |P| 1. Pi+|P|i+|P|+d-1SnSolving the Exact String Matching problem in O(|S|) time under the assumption that P is known and already pre-processed in O(|P|) time? E.g., P is one of your private collection of DNA sequence.nAnswer: trivial, because |S| |P|).nSolving the Exact String Matching problem i

6、n O(|P|) time under the assumption that S is known and already pre-processed in O(|S|) time? E.g., S is a dictionary whose content does not change frequently.nAnswer: may not always be possible, e.g., S and P are both all-1 strings.The Substring Problem Input: P and S. Output: an occurrence of P in

7、S.nSolving the Substring problem in O(|S|) time under the assumption that P is known and already pre-processed in O(|P|) time?nAnswer: trivialnSolving the Substring problem in O(|P|) time under the assumption that S is known and already pre-processed in O(|S|) time?nAnswer: possible, but its not cle

8、ar how to do it using Gusfields Z-value algorithm.nPreprocessing P Gusfield Boyer-Moore Knuth-Morris-PrattnPreprocessing S Suffix treeS = b b a b b a a bS18= b b a b b a a bS28= b a b b a a bS38= a b b a a bS48= b b a a bS58= b a a bS68= a a bS78= a bS88= b 1st suffix2nd suffix3rd suffix4th suffix5t

9、h suffix6th suffix7th suffix8th suffixS = b b a b b a a bS18= b b a b b a a bS28= b a b b a a bS38= a b b a a bS48= b b a a bS58= b a a bS68= a a bS78= a bS88= b 1st suffix2nd suffix3rd suffix4th suffix5th suffix6th suffix7th suffix8th suffixb b a b b a a b b a b b a a b a b b a a b b b a a b b a a

10、b a a b a b b nThe following statements are equivalent. P occurrs in S. P is a prefix of a suffix of S. P corresponds to a path of T starting from the root of T.b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b P occurs in S!b b a b b a a b b a b b a a b a b b a a b b b a a b b

11、 a a b a a b a b b P doesnt occur in S!b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b P doesnt occur in S!844441111155522222676333731 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Output: 3nQuestion: If we are given a suffix trie of S,

12、what is the time complexity for finding an occurrence of P in S?nAnswer: O(|P|) time.Q: Time complexity for constructing the suffix trie T of S?Q(|S|)Q(|S| log |S|)Q(|S|2)Q(|S|3)844441111155522222676333731 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nHow to e

13、stablish a lower bound?nAnswer: Showing an example which takes W(|S|2) time for any algorithm.nSuffix trie is good in solving Substring Problem, but may require W(|S|2) time and space.nQuestion: is there a compact representation of suffix trie that needs only O(|S|) time and space?A compact represen

14、tation of suffix trienT has at most |S| leaves. Why?nT has at most |S| 1 branching nodes. Why?nKeeping leaves and branching nodes only.ncompact representation of edge labels5,85,85,85,84,81,12,23,35,85,85,85,84,81,12,23,31,12,34,87,84,87,84,87,83,33,33,31,13,32,37,84,87,84,87,84,83,31,13,32,37,84,87

15、,84,87,84,8nThe space complexity of suffix tree O(|S|) O(|S| log |S|) O(|S|2) O(|S|3)nWhy? Number of nodes = O(|S|). Number of edges = O(|S|). Space required by each edge = O(1).Constructing Suffix Tree in Linear TimenWeiner, IEEE FOCS 1973 Linear time but expensive in space. D. E. Knuth: “the algor

16、ithm of 1973”.nMcCreight, J. ACM 1976 Linear time and quadratic space.nUkkonen, Algorithmica 1995 Linear time and linear space. Much better readability.nLet T(k) denote the suffix tree of S1k.nFrameworkcompute T(1);for k = 2 to |S| do compute T(k) from T(k-1);S = b b a b b a a bS18= b b a b b a a bS

17、28= b a b b a a bS38= a b b a a bS48= b b a a bS58= b a a bS68= a a bS78= a bS88= b b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b S = b b a b b a a b1,12,31,23,33,31,12,43,43,42,53,53,52,63,63,63,37,74,73,37,74,72,37,74,77,84,87,84,87,84,8nObservation: The tree structure do

18、es not change very often, i.e., only O(|S|) times.nAt the beginning of the k-th iteration, there are exactly k “growing points”(生長(zhǎng)點(diǎn)), all with different heights.nThe k-th iteration iteratively grows k edges with label Sk at those k growing points.b b a b b a a b b a b b a a b a b b a a b b b a a b b

19、 a a b a a b a b b nThree cases while growing trie Case 1: growing an edge at a leaf. Case 2: growing a new branch of leaf. Case 3: does not change the tree structure.1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1: 長(zhǎng)此以往Case 2: 節(jié)外生枝Case 3: 若無(wú)其事nThose k s

20、teps in the k-th iteration have the following pattern: some (at least one) Case-1 steps, followed by some (could be zero) Case-2 steps, followed by some (could be zero) Case-3 steps.1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1: 長(zhǎng)此以往Case 2: 節(jié)外生枝Case 3:

21、 若無(wú)其事n在同一個(gè)iteration之中,長(zhǎng)此以往之前(下)的step一定也是長(zhǎng)此以往。nAt the end of each iteration, if a growing point is a leaf, then all previous (lower) growing points are also leaves. Why?b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nThe i-th growing point = the end of the i-th suffix Sik of

22、the current prefix S1k.nArgument: the i-th (i 1) growing point is a leaf. Neither Sika nor Si.kb is a substring of S1k. Neither Si1ka nor Si1.kb is a substring of S1k. The (i 1)-st growing point is a leaf.b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nAt the beginning of th

23、e current iteration, its corresponding growing point is a leaf.b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nAt the end of each iteration, if a growing point is a leaf, then all previous (lower) growing points are also leaves. nTherefore, in the next iteration, if a step i

24、s a “長(zhǎng)此以往”, then all previous steps are also “長(zhǎng)此以往”.n在同一個(gè)iteration之中,若無(wú)其事之後(上)的step一定也是若無(wú)其事。nAt the end of each iteration, if a growing point is an internal node, then all latter (higher) growing points are also internal. Why?b b a b b a b a b b a a b b a b b a b a a nThe i-th growing point = the en

25、d of the i-th suffix Sik of the current prefix S1k.nArgument: the i-th (i k) growing point is internal. Sika or Si.kb is a substring of S1k. Si+1ka or Si+1.kb is a substring of S1k. The (i+1)-st growing point is internal.b b a b b a b a b b a a b b a b b a b a a nAt the end of the current iteration,

26、 its corresponding growing point is still an internal node.b b a b b a b a b b a a b b a b b a b a a nAt the end of each iteration, if a growing point is an internal node, then all latter (higher) growing points are also internal.nTherefore, in “this” iteration, if a step is “若無(wú)其事”, then all its fol

27、lowing steps are also “若無(wú)其事”.nThose k steps in the k-th iteration have the following pattern: some (at least one) Case-1 steps, followed by some (could be zero) Case-2 steps, followed by some (could be zero) Case-3 steps.This is about the status within the same iteration葉子的宿命 一日為葉子, 終身為葉子b b a b b a

28、 a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1: 長(zhǎng)此以往Case 2: 節(jié)外生枝Case 3: 若無(wú)其事nOnly O(|S|) Case-2 steps throughout the execution.nWhy?n 葉子生長(zhǎng)點(diǎn)之所對(duì)應(yīng)的edge labeln 斷頭指標(biāo)n Case-1 StepnAlways occurs at a leaf (growing point).nAt the beginning of iteration k, the path above a leaf has label

29、 ?, k1.nEach Case-1 step in iteration k simply changes the label ?, k1 to ?, k.?, k1?, knUse ?, - to label the path above each leaf. Thus, no need to do anything for each case-1 step.?, -S = b b a b b a a b1,-2,-3,-3,-3,37,-4,-3,37,-4,-2,37,-4,-11,11 2111nWe can simply ignore all Case-1 steps.nRecal

30、l that the number of Case-2 steps is at most |S|.nQ: Is this good enough?Case 1: 長(zhǎng)此以往Case 2: 節(jié)外生枝Case 3: 若無(wú)其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case-2 Steps: 節(jié)外生枝nAlways occurs at a growing point that is not a leaf, which is not necessarily an inte

31、rnal node.nWhen the growing point is an internal node Label = k, -, where k is the current iteration index.k, -nWhen the growing point is not an internal node k, -ti, ji+t, ji, i+t-1Case-3 Steps: 若無(wú)其事nAlways occurs at a growing point that is not a leaf, which is not necessarily an internal node. nWh

32、en the new position of the growing point is an internal node ti, j0nWhen the new position of the growing point is not an internal node ti, jt + 1nGive a sequence S such that the number of Case-3 steps throughout the process of growing its suffix trie (or suffix tree) is W(|S|2).Completely ignore the

33、m以若無(wú)其事的態(tài)度處理若無(wú)其事的狀況ti, j0ti, jt + 11.For correctly maintaining the position of each growing point. (Why?)2.For correctly running Case-2 steps. (By what?)k, -ti, ji+t, ji, i+t-1Saving the book keeping efforts in all Case-3 steps Case 1: 長(zhǎng)此以往Case 2: 節(jié)外生枝Case 3: 若無(wú)其事1 2 3 4 5 6 7 8b b a b b a a b b a b

34、b a a b a b b a a b b b a a b b a a b a a b a b b nJust keep one current growing point throughout the execution.nDeriving the new position of the current growing point from its previous position (with the helpusing suffix links (斷頭指標(biāo))Case 2: 節(jié)外生枝Case 3: 若無(wú)其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a

35、 b a b b a a b b b a a b b a a b a a b a b b The challenges: How do we derive the position of the current growing point? Vertically (case 2) Horizontally (case 3)nQ: Which one is easier?nMoving from iteration k 1 to iteration k.nThe growing point does not move!nThis is the easier case.Case 2: 節(jié)外生枝Ca

36、se 3: 若無(wú)其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1: 長(zhǎng)此以往Case 2: 節(jié)外生枝Case 3: 若無(wú)其事nMoving from Step i to Step i+1 in the same iteration.nThe growing point moves d

37、ramatically.nThis is the tougher case.Case 2: 節(jié)外生枝Case 3: 若無(wú)其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1: 長(zhǎng)此以往Case 2: 節(jié)外生枝Case 3: 若無(wú)其事前人種樹(shù)後人涼的哲學(xué)n每次千辛萬(wàn)苦找到vertical

38、movement的目的時(shí), 把這個(gè)movement的起點(diǎn)與終點(diǎn)用一個(gè)link記錄下來(lái).n下回遇到這個(gè)起點(diǎn)時(shí), 就可以直接走到終點(diǎn)去,不用再重新找一次.n這些link就叫做suffix link (斷頭指標(biāo)).n終點(diǎn)所對(duì)應(yīng)的字串,是起點(diǎn)所對(duì)應(yīng)之字串的斷頭字串(second suffix)Case 2: 節(jié)外生枝Case 3: 若無(wú)其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b n每個(gè)斷頭指標(biāo)的起點(diǎn)一定是個(gè)internal node, 不會(huì)葉子 不會(huì)是某個(gè)suffix

39、tree edge的中間nWhy?Case 2: 節(jié)外生枝Case 3: 若無(wú)其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b n每個(gè)internal node一定是某個(gè)斷頭指標(biāo)的起點(diǎn)nWhy?Case 2: 節(jié)外生枝Case 3: 若無(wú)其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b S = b b a b b a a b1,-2,-3,-3,-3,37,-4,-3,37,-4,-2,37,-4,-11,112111nGoing up to a closest internal node (whose suffix link must be available). Suppose this upward traversal passes throu

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論