第三章序列比對(duì)_第1頁
第三章序列比對(duì)_第2頁
第三章序列比對(duì)_第3頁
第三章序列比對(duì)_第4頁
第三章序列比對(duì)_第5頁
已閱讀5頁,還剩79頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、第三章序列比對(duì)Sequence AlignmentContentsl I. Introductionl II. Pair-wise alignment via dynamic programmingl III. Multiple alignmentl IV. Genome assembling3.1l 序列比較的根本任務(wù)是: 發(fā)現(xiàn)序列之間的相似性 辨別序列之間的差異l 目的:相似序列 相似的結(jié)構(gòu),相似的功能判別序列之間的同源性推列之間的進(jìn)化關(guān)系序列相似性 同源(homology)- 具有共同的祖先 直向同源(Orthologous ) 共生同源(paralogous ) 相似(similari

2、ty)同源序列一般是相似的 相似序列不一定是同源的 進(jìn)化趨同(同功能)lll水平轉(zhuǎn)移進(jìn)化趨同直向同源(a1 in species I, a1 in species II)共生同源(a1 and a2 in species I)3.1l AlignmentA mutual arrangement of two or more sequences, it exhibits where the two or more sequences are similar, and where they differ.L G P S S K Q T G K G S - S R I W D NGlobal| L|

3、N - I T K S A G K G A I M R L G D Al Alignment T G K G Local| A G K G two sequencesl Alignmentmultiple sequencesl Databases searchl Phylogenetic analysisAlignmentN-YLS NKYLS N-F-S N-FLSNYLSNKYLSNFS-LYNFLS+KFNYLSWhy do we care about sequence alignment?It can tell us something about the evolution of o

4、rganisms.We can see which regions of a gene (or its derived protein) are susceptible to mutation and which can have one residue replaced by another without changing function.Homologous genes (genes with share evolutionary origin) have similar sequences.Orthologs are genes that are evolutionarily rel

5、ated, have a similar function, but now appear in different species.Paralogs are evolutionarily related (share an origin) but no longer have the same function.You can uncover either orthologs or paralogs through sequence alignment.llllll雙序列比對(duì) Dot matrix plot3.2(Gibbs and McIntyre,1970)例一:s: SSENTIALS

6、OFSEQUENCEANALYSISt:SSENTIALSOFSEQUENCEANALYSIS例2:s: ESSENTIALSOFSEQUENCEANALYSISt:ESSENTIALANALYSIS例3:s: ESSENTIALSOFSEQUENCEANALYSISt:APRIMERONHOWTOANALYZESEQUENCESANALYSI SSEQUENCEAANALYZESSEQUENCESWindow filtering techniquel s:15l t:15J. Bacteriol., 2003, 185:1018-1026.3.3 Pair-wise alignment vi

7、a DP(1).(2).(3).(4).(5).Edit Operations Optimal AlignmentDynamic ProgrammingScoring Scheme Exact Algorithms Global: N-W AlgorithmLocal:S-W Algorithm(6).Heuristic Methods Fasta: FAST AlignmentBlast:Basic Local Alignment Search ToolPairwise AlignmentSequence s: CTTAACTSequencet: CGGATCATAn alignment o

8、f s and t:MismatchMatchTT TDeletion gapInsertion gapAC-TA CAC CG-GAT(1). Edit operationss:t:AGCACACAACACACTAone-character edit operations that turn s into t :(a,a) (a,-)(a,b)(-,b)denotes a match (no change from s to t), denotes deletion of character a (in s),denotes replacement of a (in s) by b (in

9、t), where ab,denotes insertion of character b (in s).Since the problem is symmetric in s and t, a deletion ins can be seen as an insertion in t, and vice versa.An alignment of two sequences s and t is an arrangement of s and t by position, where s and t can be padded with gap symbols toachieve the s

10、ame length:s: AGCACACAAGCACACAort:ACACACTAACACACTAIf we read the alignment column-wise, we have a protocol of edit operations that lead from s to t.S:T:AGCACACAA CACACTAAGCACACAACACACTALeft:Match(A,A)Right : Match(A,A)Delete (G,-)Replace(G,C)Match Match Match Match Match InsertMatch(C,C)(A,A)(C,C)(A

11、,A)(C,C) (-,T) (A,A)Insert Match MatchMatch(-,A) (C,C)(A,A)(C,C)Replace(A,T)DeleteMatch(C,-)(A,A)(2). Optimal alignment-Important notion for sequence analysisThe total cost of an alignment of two sequences s and t is the sum of the costs of all the edit operations that lead from s to t .l An optimal

12、 alignment of s and t is an alignment which has minimal total cost among all possible alignments.Using the unit cost min our previous example,we obtain the following cost:s: AGCACACAAG CACACA ACACACT Acost : 4t:A CACACTAcost : 2orHere it is easily seen that the left-hand assignmentis optimal under t

13、he unit cost m the edit distance., and henceAt this point, you should improve your understanding by doing some exercises.Similarity scorel Similarity scorematch: w(a,a)=8mismatch: w(a,b)= -5for abeach gap symbol: w(a,-)=w (-, b) = -3(3). Dynamic Programming(R. Bellman 1950)” Dynamic Programming” is

14、a very general optimal algorithm. It is applicable when a large search space can be structured into a succession of stages, such that:The initial stage contains trivial solutions to sub-problems;Each partial solution in a later stage can be calculated by recurring on only a fixed number of partial s

15、olutions in an earlier stage;The final stage contains the overall solution.llll(3). Dynamic Programming (the primary principle)l Best alignment at (i, j) is the best alignment previous to (i, j) plus aligning these two.l Repeat the process until reaching the two sequences end.(3). Dynamic Programmin

16、g (a gamepicking money)CGGATCATCodeAge 305oA:C:G:T:Adult Child3336C T C A A CTGrandpa60Teenager 1036336Money in each cell612max :36$m=max / (old: young) Money on line 5 $Queue s: CGGATCAT Queue: t: CTCAACTf(3). Dynamic Programming (pathways)CGGATCATou Rule: only forward not backward; along line or c

17、ross diagonallyC T C A A CTu Caution: localumsand global optimal;u Compute before you walk;u How many pathways (mn) :(m = 7)(n = 8)n+m 2iN = 65280i=nu How many calculations:CN = 2i i = 915968i=nn+mf363361236363361236181236CalculationLet si,j the most money one can get at position (i,j)The money at o

18、rigin is: s0,0=0 The money on top line and left edge is:s0,j=j*5 (j=1.n)si,0=i*5 (i=1.m)CGGATCATljo10$CT Clli15A A CTsi,jcan be computed as follows:ls+ $line (=5$)i-1, j= max si, j -1 + $line (=5$)si, js+ $ diagi-1, j -1The most money one can get until f and the corresponding path lead to it are wha

19、t we want.l?$ f3633636$1236$363361236181236Computing Si,jjj-1i-1iSm,n$dia g$lin eSi,j$lineInitializationsCGGATCAT0510152025303540C TC36336512361015AA2025CT3633630123618123635Computing S1,10 C 1G 2 G 3 A 4 T 5 C 6 A7 T80C0510152025303540363365361T2C123610315A204A255C 6 T 73633630123618123635Computing

20、 S1,2 , S2,1 , S2,2C 1G 2 G 3 A 4 T 5 C 6 A7 T80C5101520253035403633636411T123641462C 3 A4A5C363366T71236181236Computing S3,5=?0 C 1G 2 G 3 A 4 T 5 C 6 A7 T80C051015202530354036336536411T12361041462C 3 A4A1520255C36336306T7123618123635Computing S3,5=?0 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530354036

21、336536414651561T12361041465158872C 3 A4A1520255C36336306T7123618123635Answer S3,5=920 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530354036336536414651561T2C 3 A4A51236104146515887152025C 6 T 73633630123618123635All computed0 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530354036336536414651566166711T12361041465

22、1588792972C 3 A4A102155C366T71236181236667992125156174205161Select the dot withum Money0 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530354036336536414651566166711T123610414651588792972C 3 A4A102155C366T71236181236667992125156174205161Trace back for the best path0 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530

23、354036336536414651566166711T123610414651588792972C 3 A4A102155C366T71236181236667992125156174205161(3). Dynamic Programming (Characteristics)l A progress can be parted to stagesl Forward compute trace back for optimall Reduce computation by recursionNP O(2m+nT ) P O(m nT )ComplexitySpace: O(mn)Time:

24、 O(mn)l Filling the matrix O(mn)l Back trace O(m+n)(3). Dynamic Programming (a pathway an alignment)CGGATCATto3633C T C A A C TsC-TCAACT CGGATCA-Ts:t:36336612fAn optimal alignment of s and t- the pathway of most money!l Let s=a1a2am and t=b1b2bn .l Si,j: the score of an optimal alignment betweena1a2

25、ai and b1b2bjl Initializations:s0,0=0;si,0=i*gap (i=1.m) ; s0,j=j*gap (j=1.n)l Si,j can be computed as follows:si-1, j+ w(ai , -)($line)($line) ($diag)=+ w(-, b )smax si, j -1i, jjs+ w(a , b)i-1, j -1ijComputing Si,jjj-1i-1iSm,nw(ai,bj)w(ai,-)w(-,bj)TracebackStart from the lower right corner and tra

26、ce back to the upper left.Each arrow introduces one character at the end of each aligned sequence.A horizontal move puts a gap in the left sequence.A vertical move puts a gap in the top sequence.A diagonal move uses one character from each sequence.lllllA simple scoring schemel Match: +8(w(x, y) = 8

27、, if x = y)l Mismatch: -5 (w(x, y) = -5, if x y)l Each gap symbol: -3 (w(-,x)=w(x,-)=-3)C C+8- G-3- G-3- A-3T T+8T C-5A A+8A-3C-3T T+8 =Alignment score+12Now try this example after classSequence A: CAATTGA Sequence B: GAATCTGCTheir optimal alignment ?GAATCTGCC A A T T GA0-3-6-9-12-15-18-21-24-3-5-8-

28、11-14-4-7-10-13-6-830-3-6-9-12-15-9-11011852-1-4-12-14-38191613107-15-11-651614242118-18-7-921311213229-21-101-1108182927C A A T - T G A G A A T C T G C-5 +8 +8+8 -3+8 +8-5 = 27Global Alignment vs. Local Alignmentl global alignmentl local alignmentAn optimal local alignmentl Si,j: the score of an op

29、timal local alignment ending at ai and bjl With proper initializations, Si,j can be computed as follows.0s+ w(a,-)i -1,ji=+ w(-,bj )si ,jmaxsi ,j -1s+ w(a,b )i -1,jijlocal alignmentCGGATCATMatch: 8Mismatch: -5Gap symbol: -3C T T A A CT000000000085200852053008531302000852110000853?000local alignmentM

30、atch: 8CGGATCATMismatch: -5Gap symbol: -3C T T A A CTThe best score00000000008520085205300853130200085211000085311808525313107053021310818A A TC C- AT T8-3+8-3+8=18CGGATCATC T T A A CTThe best score00000000008520085205300853130200085211000085311808525313107053021310818local alignmentMatch: 8Mismatch

31、: -5Gap symbol: -3C T T A A CTCGGATCATThe best score00000000008520085205300853130200085211000085311808525313107053021310818Now try this example in classSequence A: CAATTGA Sequence B: GAATCTGCTheir optimal local alignment?Did you get it right?GAATCTGCC A A T T GA0000000000000085280088553050085132421

32、217516131513233432A AA AT T CT TG G8+8+8-3+8+8=G37AATCTGCC A A T T GA0000000000000085280088553050085132421217516131513233432(4). Score Schemel For DNA: match, mismatchl For proteins: different amino acid pairs receive different scoreSubstitution MatrixTwo ways: chemically, statisticallyPAM(Dayhoff)BLUSOM (Henikoff & Henikoff )l Gap penaltyConstant gap penalties Affine gap penaltiesNucleic acid PAM matricesl PAM = percent accepted mutationl 1 PAM = 1% probability of mutation at each sequence position.l A uniform PAM1 matrixAGTCA0.990.003330.003330.00333G0.003330.990.003330

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論