polynomial-verifiable_第1頁
polynomial-verifiable_第2頁
polynomial-verifiable_第3頁
polynomial-verifiable_第4頁
polynomial-verifiable_第5頁
已閱讀5頁,還剩37頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

1、Class 26: Computing Genomes, Genomes ComputingDavid Evans/evanscs302: Theory of ComputationUniversity of Virginia Computer ScienceFinal Exam Sneak PreviewHandout available nowHonor policy: you may discuss these problems with others and use any resources you want until the FinalNo notes or other reso

2、urces may be used during the finalIntent is to give you an idea what to expect on the final and a chance to start thinking about some problemsDont attempt to memorize answers: need to understand things since the actual questions may be differentMenuComputing Genomes (PS6, Problem 6)Crash course in b

3、iologyBusy Beaver result! Computing with GenomesConclusionGenome Assembly ProblemIn order to assemble a genome, it is necessary to combine snippets from many reads into a single sequence. The input is a set of n genome snippets, each of which is a string of up to k symbols. The output is the smalles

4、t single string that contains all of the input snippets as substrings.DNASequence of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T)Two strands, A must attach to T and G must attach to CAGTCCentral Dogma of BiologyRNA makes copies of DNA segmentsRNA describes sequences of amino

5、acidsChains of amino acids make proteinsProteins make usDNATranscriptionRNATranslationProteinImage from http:/protein/Human Genome3 Billion Base PairsEach nucleotide is 2 bits (4 possibilities)3 B pairs * 1 byte/4 pairs = 750 MBEvery sequence of 3 base pairs one of 20 amino acids (or stop codon)21 p

6、ossible codons, but 43 = 64 possible So, really only 750MB * (21/64) 250 MBMuch of it ( 95%) is may be junk (doesnt encode proteins, but some might be important)1 CD 650 MBHuman Genome RaceUVa CLAS 1970Yale PhDTenured Professor at U. MichiganFrancis Collins(Director of public National Center for Hum

7、an Genome Research)(Picture from UVa Graduation 2001)Craig Venter(President of Celera Genomics)vs.San Mateo CollegeCourt-martialedDenied tenure at SUNY BuffaloReading the GenomeWhitehead Institute, MITGene Reading MachinesOne read: about 700 base pairsButdont know where they are on the chromosomeAGG

8、CATACCAGAATACCCGTGATCCAGAATAAGCActualGenomeACCAGAATACCRead 1TCCAGAATAARead 2TACCCGTGATCCARead 3Genome Assembly ProblemInput: Genome fragments (but without knowing where they are from)Ouput: The full genomeACCAGAATACCRead 1TCCAGAATAARead 2TACCCGTGATCCARead 3Decision ProblemGA = | where each xi is a s

9、tring and there is a string X of length m that includes all of the xi stringsas substrings If we had a decider for GA, can we find the length of the shortest common superstring?Yes. Try all possible m values from 1, 2, , |xi|.Is GA In Class NP?GA = | where each xi is a string and there is a string X

10、 of length m that includes all of the xi stringsas substrings Yes. The string X is a polynomial-verifiable certificate.- Check it includes each substring- Check its length is mIs GA NP-Complete?GA = | where each xi is a string and there is a string X of length m that includes all of the xi stringsas

11、 substrings NP-CompleteA language B is in NP-complete if:2. There is a polynomial-time reduction from every problem A NP to B.1. B NPBNPBNP?To Prove NP-HardnessPick some known NP-Complete problem X.Show that a polynomial-time solver for Y could be used to build a polynomial-time solver for X.This pr

12、oves that there is no polynomial-time solver for Y (unless P = NP).Possible ChoicesHAMPATHABCDE3SAT(a b c) (a b c) SUBSET-SUM3, 5, 12, 13, 17, 30By definition, all must work. Every NP-Complete problem can be reduced to every NP-Complete problem.In practice, some will work much more easily than other

13、s. Try to pick a problem “close” to the target problem.Busy Beaver ChallengeRuixin YangDoubly-Infinite Tape (Regular BB)One-Way Infinite Tape (Challenge)Winning (3, 2) MachineCodeBusyBeaver32.cppBusyBeaver.cppIs GA NP-Complete?GA = | where each xi is a string and there is a string X of length m that

14、 includes all of the xi stringsas substrings Reducing HAMPATH to GATake an arbitrary input to HAMPATH:HAMPATH = G, s, t | G represents a graph, and there is a path between s and t in G that includes every node in G exactly once Construct a corresponding input to GA such that the input is in GA if an

15、d only if the original input is in HAMPATH.So, we need to map the nodes and edges of G into thesubstrings input to GA.Consider Each NodeIncoming Edges(x, a)aOutgoing Edges(a, y)In a Hamiltonian path, for each node (except start and end), exactly one incoming edge, and one outgoing edge must be used.

16、Consider Each NodeIncoming Edges(x, a)aOutgoing Edges(a, y)We need GA substrings to represent each edge, butsuch that only 1 can be actually followed. But, the GA superstring needs to include all substrings.Idea: make it so the untaken edges can loop back!Simple NodesIf there is only one edge (a, b)

17、 out of a given node, thatedge must be used in the path. Add the substring: ababGenerating the SubstringsFor each edge (a, yi) add two substrings:ayiaThis is the “back” edge.yiayi+1 This connects the possible destinations.abcabaacabaccabPossible (length 4) alignments:aba aca bac cabIf there is a Ham

18、iltonian path, one of these must be used.Start and EndACDEThe start node must come first:Add string: A(where is unused elsewhere)Add string: E$(where $ is unused elsewhere)BWhat is m?ACDE A, ABA, BAC, ACA, CAB, BE, ED, DA, CBC, CDC, BCD, DBC, D$ BA ACA CAB ABA BAC CBC BCD CDC DCB BE ED D$m = Start a

19、nd end = 2 + one-edge nodes * 1 + multi-edge nodes * 2 for each edge + 2 for alignHuman Genome3 Billion base pairs600-700 bases per read8X coverage requiredSo, n 37 Million sequence fragmentsCelera used 27.2 Million reads (but could get more than 700 bases per read)How can we solve an NP-Complete pr

20、oblem for such large n?ApproachesHuman Genome Project (Collins)Start by producing a genome map (using biology, chemistry, etc) to have a framework for knowing where the fragments should goCelera Solution (Venter)Approximate: we cant guarantee finding the shortest possible, but we can develop clever

21、algorithms that get close most of the timeResult: DrawPresident Clinton announces Human Genome Sequenceessentially complete (with Venter and Collins), June 26, 2000But, Human Genome Project mostly adopted Venters approach.Genomes ComputingSolving HAMPATH with DNAMake up a two random 4-nucleotide seq

22、uences for each node:A: A1 = ACTT A2 = gcagB:B1 = TCGGB2 = actgC:C1 = GGCTC2 = atgtD:D1 = GATCD2 = tccaIf there is a link between two cities (XY), create a nucleotide sequence: X2Y1 ABgcagTCGGBAactgACTTBased on Fred Hapgoods notes on Adelmans talkEncoding The ProblemEach city nucleotide sequence bin

23、ds with its complement (A T, G C) :A: A1 = ACTT A2 = gcagA: TGAA cgtcB:TCGGactgB: AGCCtgacC:GGCTatgtC = CCGAtacaD:GATCtccaD = CTAGaggtMix up all the link and complement DNA strands they will bind to show a path!Path BindingABCDACTTgcagTCGGactgGATCtccaGGCTatgtATGAAcgtcgcagGGCTABBCCGAtacaatgtTCGG BCCA

24、GCCtgacDCTAGaggtactgGATC CDGetting the SolutionShake up all the DNA to get it to bindExtract DNA strands starting with A and ending with D Can do this with chemical binding on start/end tags: remove all strands that do not start with A, and then remove all strands that do not end with DWeigh remaining strands to find ones with the right weight (7 * 8 nucleotides)

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論