



版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
1、 基本的匹配計(jì)算主要內(nèi)容 關(guān)鍵詞查詢 結(jié)構(gòu)化查詢 字符串的匹配算法 允許出錯(cuò)的字符串的匹配算法關(guān)鍵詞查詢目前關(guān)鍵詞查詢是最常用的信息查詢方式。又可分為:1. 單個(gè)詞2. 多個(gè)詞組成的上下文 (高級(jí)檢索)3. 多個(gè)詞用and,or 或not 組成的句子4. 自然語(yǔ)言句子單個(gè)詞查詢 用一個(gè)最貼切的詞表示查詢的意思 考研 原理:文檔用詞組成的向量或文本。匹配變成是否文檔中含有查詢?cè)~。上下文查詢類型:類型: 短語(yǔ)Phrase e.g, Shandong University 近似句子 允許有拼寫錯(cuò)誤等 Shadong University高級(jí)查詢中組成的上下文: 書名作者,出版社, 發(fā)表時(shí)間,價(jià)格De
2、finitionA syntax (語(yǔ)法) composed of atoms that retrieve documents, and of Boolean operators which work on their operandse.g, translation AND (syntax OR syntactic) (布爾表達(dá)式)Fuzzy Boolean Retrieve documents appearing in some operands (The AND may require it to appear in more operands than the OR) Ranked h
3、igher which has a larger number of elements Boolean 查詢語(yǔ)法句法自然語(yǔ)言 Generalization of “fuzzy Boolean” A query is an enumeration of words and context queries All the documents matching a portion of the user query are retrieved Set a threshold so that the document with very low weight are not retrieved結(jié)構(gòu)查詢
4、 內(nèi)容與結(jié)構(gòu)混合查詢內(nèi)容與結(jié)構(gòu)混合查詢 - 給出匹配模板進(jìn)行匹配 三種結(jié)構(gòu)三種結(jié)構(gòu) - 固定結(jié)構(gòu) - 超鏈結(jié)構(gòu) - 層次結(jié)構(gòu)Fixed (固定)StructureDocument:a fixed set of fieldsEX: a mail has a sender, a receiver, a date, a subject and a body fieldSearch for the mails sent to a given person with “football” in the Subject fieldA hypertext is a directed graph where
5、nodes hold some text (text contents)the links represent connections between nodes or between positions inside nodes (structural connectivity)HypertextHierarchical StructureHierarchical Structure層次查詢的處理 從根到葉逐層限制的多次查詢 基于樹或圖的匹配算法String Matchingdetecting the occurrence of a particular substring (pattern
6、) in another string (text)A straightforward SolutionThe Knuth-Morris-Pratt AlgorithmStraightforward solution Algorithm: Simple string matching Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty. Output: The return value is the index in T where a c
7、opy of P begins, or -1 if no match for P is found.Definition 11.1 Notation for patterns and textP the pattern being search for;T the text in which P is sought;m the length of Pn the length of T, not known to the algorithm; m m ) /* m is the length of P*/ match = i; /match found. / success case / bre
8、ak; /*exit the loop here*/ if(tj = pk) j+; k+; else /Back up over matched characters. int backup=k-1; /從本次查詢點(diǎn)的下一個(gè)頂點(diǎn)開始從本次查詢點(diǎn)的下一個(gè)頂點(diǎn)開始/ j = j-backup; k = k-backup; /Slide pattern forward,start over. j+; i=j;return match;ikjPTAnalysis Worst-case complexity is in (mn)P=aaab T=aaaaaaaaaaaaaab Need to back
9、 up. However, it works quite well on average for natural language.The Knuth-Morris-Pratt Algorithm Pattern Matching with Finite Automata (自動(dòng)機(jī)) e.g. P = “AABC” start is the beginning index of T Idea: remembering the matched part by utilizing the prefix ofpattern P and do not consider T. However, it i
10、s not scalable for the size of term table.The Knuth-Morris-Pratt Flowchart (流程圖) Character labels are inside the nodes, not on the arcs. Each node has two arrows out to other nodes: success link, or fail link next character is read only after a success link A special node, node 0, called “get next c
11、har” which read in next text character. e.g. P = “ABABCB”T=ABABABCBConstruction of the KMP Flowchart Definition:Fail links We define failk as the largest r (with rk) such that p1,.pr-1 matches pk-r+1.pk-1.That is the (r-1) character prefix of P is identical to the one (r-1) character substring endin
12、g at index k-1. Thus the fail links are determined by repetition within P itself.P: |A B A B A B| C B | |T: |A B A B A B| x mismatchP: |A B A B | A B C B | |T: A B |A B A B| x A B A B A B A B A B A B xr = 5; k = 7, p1.p4 = p3p6Fail7 = 5注意:T的指針沒(méi)有改變函數(shù) fail 是從左到右計(jì)算的,即遞推方式進(jìn)行的。我們可以假設(shè) 在計(jì)算failk 時(shí),所有failt,
13、tk 已經(jīng)被計(jì)算了。fail1=0. | p1 ps-1 | ps p1 | pk-r-1 pk-2 | pk-1 pk pm (a) by definition of failk-1 (which is s) p1. pfails-1 |pfails | | p1 ps-1 | ps | p1 | pk-r-1 pk-2 | pk-1 | pk pm do these match? (b) looking back for a match for pk-1. failk = s+1 = Pfailk-1+1Algorithm: KMP flowchart construction Input
14、: P,a string of characters;m,the length of P. Output: fail,the array of failure links,defined for indexes 1,.,m.The array is passed in and the algorithm fills it. Step: void kmpSetup(char P, int m, int fail) int k,s 1. fail1=0; /start point/ 2. for(k=2;k=1) / p1,ps 與 pk-s+1,pk-1比較 5. if(ps=pk-1) /*就
15、是它!*/ 6. break; 7. s=fails; /*否則遞歸向下找*/ 8. failk=s+1; fail1 = 0 ; fail2-1 =fail1= s = 0; fail2 =1; k= 3; s=fail3-1= 1; p2 p1, s= fail1=0; fail3= s+1=1. k=4; s= fail3 =1; p1=p3; fail4=s+1=2; To compute fail8, s= fail7 = 5, but p7 p5, recompute s= fail5 = 3, but p7 p3 either, so re-compute s= fail3 =
16、1. still p7 p1. Finally, s= fail1 = 0, end the search, and fail8 is assigned s+1 = 1;P = A B A B A B C Bfail 0 1 1 2 3 4 5 1 index 1 2 3 4 5 6 7 8The Knuth-Morris-Pratt Scan Algorithm int kmpScan(char P,char T,int m,int fail) int match, j,k; match= -1; j=1; k=1; while(endText(T,j)=false) if(km) / su
17、ccess/ match = j-m; break; if(k= =0) /the point of T moves ahead, and rescan/ j+; k=1; else if(tj= =pk) / success at position k of P/ j+; k+; else /Follow fail arrow. k=failk; / fail and go back to the point of P / /continue loop. return match; 沒(méi)有使用變量iAnalysis Based on the similar method on analyzin
18、g the time complexity of algorithm KMP setup, The scan algorithm requires 2n character comparisons in the worst case Overall: worst case complexity is (n+m)RK 算法 輸入: Two n bit strings A(a1,a2, ,an) and B(b1,b2, ,bn) 輸出: whether A = B. 傳統(tǒng)方法:傳輸n位依次比較。 指紋機(jī)制: 定義n位整數(shù) 根據(jù)指紋函數(shù) Fp(x)=x mod p , p是一個(gè)素?cái)?shù)是一個(gè)素?cái)?shù) 比較
19、Fp(a)是否等于Fp(b),傳輸位數(shù)減小為O(logp)112iniiaa112iniibb設(shè) 代表字符集合,x , 定義函數(shù)ord(x), d = | |, ord(x): 0,1,2,d-1 對(duì)任意的模式P, |P| = m, 利用多項(xiàng)式指紋 Q(P) = ord(P1)dm-1+ord(P2)dm-2+ord(Pm-1)d+ord(Pm) 代表 P 同樣對(duì)文本T = T1,T2,.,Tn 從左到右計(jì)算長(zhǎng)度為m的連續(xù)子串的指紋,如Q(i)= ord(Ti)dm-1+ord(Ti+1)dm-2 +ord(Ti+m-2)d+ ord(T i+m-1)并和Q(P)相比較。若相同,則找到匹配的子
20、串。起始位置為i, 0i0, b- 1 aa 0*2+0=0, bb- 2*1+1=3, 0 3 ba- (3-2*1)*2+0 = 2 0 2 ab- (2-1*2)*2+ 1 = 1 ba- (1-0*2)*2+0 = 2 aa - (2-1*2)*2+0 = 0 find the position問(wèn)題是得到的整數(shù)無(wú)法表示了,過(guò)于大取素?cái)?shù) q, Q(i) (mod q) = Q(p) (mod q)Q(i+1) (mod q) = (Q(i) ord(Ti)d m-1)*d ) (mod q) + ord(Ti+m) 但這樣的話,當(dāng) Q(i) (mod q) = Q(p) (mod q),
21、 不一定 對(duì)應(yīng)的字符串相同,這時(shí)可以逐位進(jìn)行檢查,有人證明該算法的期望時(shí)間復(fù)雜性為O(m+n), 是較好的算法。 特點(diǎn):可以推廣到高維的字符串匹配,是否可以應(yīng)用到對(duì)2維圖像的匹配?應(yīng)用到對(duì)3維物體的匹配?計(jì)算具有一定誤差的匹配Elements of Dynamic Programming Constructing solution to a problem by building it up dynamically from solutions to smaller (or simpler) sub-problems sub-instances are combined to obtain s
22、ub-instances of increasing size, until finally arriving at the solution of the original instance. make a choice at each step, but the choice may depend on the solutions to sub-problemsPrinciple of optimalityQthe optimal solution to any nontrivial instance of a problem is a combination of optimal sol
23、utions to some of its sub-instances. Memorization (for overlapping sub-problems)Qavoid calculating the same thing twice, Qusually by keeping a table of know results that fills up as sub-instances are solved. Principle of optimalityQthe optimal solution to any nontrivial instance of a problem is a co
24、mbination of optimal solutions to some of its sub-instances. Memorization (for overlapping sub-problems)Qavoid calculating the same thing twice, Qusually by keeping a table of know results that fills up as sub-instances are solved. Memorization for Dynamic programming version of a recursive algorith
25、m e.g. Trade space for speed by storing solutions to sub-problems rather than re-computing them. As solutions are found for suproblems, they are recorded in a dictionary, Before any recursive call, say on subproblem Q, check the dictionary to see if a solution for Q has been stored. If no solution h
26、as been stored, go ahead with recursive call. If a solution has been stored for Q, retrieve the stored solution, and do not make the recursive call. Just before returning the solution, store it in the dictionary. Dynamic programming version of the fib.Development of a dynamic programming algorithm C
27、haracterize the structure of an optimal solution Breaking a problem into sub-problem whether principle of optimality apply Recursively define the value of an optimal solution define the value of an optimal solution based on value of solutions to sub-problems Compute the value of an optimal solution
28、in a bottom-up fashion compute in a bottom-up fashion and save the values along the way later steps use the save values of pervious steps Construct an optimal solution from computed information字符串的近似匹配(Approximate string matching) In many applications we cant expect an exact copy, we want to find a
29、approximating string match with at most k mistakes, e.g., a spelling corrector. We will develop a dynamic programming algorithm for the k-approximate match.Definition: Let k be a nonnegative integer. A k-approximate match is a match of P in T that has at most k differences. The differences can be an
30、y of the following three types, the name of the difference is the operation needed on T to bring it closer to P.Revise: The corresponding characters in P and T are different;Delete: T contains a character that is missing from P.Insert: T is missing a character that appears in P. 如何修改T中的子串,使其能匹配上e.g.
31、 3-approximate match P: u n n e c e s s a r i l y T: un e s c e s s a r a l y (made three spelling errors)Definition 11.6 Difference tableDij = the minimum number of difference between P1,Pi and a segment of T ending at tj. 1i m, 1j m.定義: D0j=0; Di,0 = i; There will be a k-approximate match ending a
32、t tj for any j such that Dmj k, so we can stop as soon as we find an entry less than or equal to k in the last row of D, which is the first k -approximate match.The rules for the computation of D Dij = Di-1j-1 if pi = tj /*no error*/ Dij = Di-1j-1+1 if pi tj and revise tj to pi and both i and j increase; Dij = Di-1j+1 if insert pi into T, only i increase. Dij = Dij-1+1 if delete tj from T and only j increase. Each entry requires only entries above it and to its left in the table0 0 0 0 012m
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 租賃車輛管理辦法暫緩
- 小區(qū)公攤物業(yè)管理辦法
- 管理人員職務(wù)管理辦法
- 省級(jí)人民醫(yī)院管理辦法
- 房屋簽約制度管理辦法
- 眼部瑜伽培訓(xùn)課件文案
- 腸胃細(xì)胞健康課件
- 腸癰的護(hù)理課件
- 人事管理培訓(xùn)課件
- 店長(zhǎng)培訓(xùn)內(nèi)容流程課件
- 我國(guó)醫(yī)療保險(xiǎn)制度的變遷
- 軍訓(xùn)服軍訓(xùn)服生產(chǎn)方案
- 廣東省深圳市福田區(qū)2024年數(shù)學(xué)八年級(jí)下冊(cè)期末綜合測(cè)試試題含解析
- GB/T 43803-2024科研機(jī)構(gòu)評(píng)估指南
- 國(guó)家工種目錄分類
- 2024年廣東惠州市交通投資集團(tuán)招聘筆試參考題庫(kù)含答案解析
- 南充市儀隴縣縣城學(xué)??颊{(diào)教師考試真題2022
- 國(guó)開液壓氣動(dòng)技術(shù)專題報(bào)告
- 《公安機(jī)關(guān)人民警察內(nèi)務(wù)條令》
- 生理學(xué)智慧樹知到答案章節(jié)測(cè)試2023年暨南大學(xué)
- 瀝青拌合站崗位職責(zé)
評(píng)論
0/150
提交評(píng)論