



版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報或認(rèn)領(lǐng)
文檔簡介
1、 基本的匹配計算主要內(nèi)容 關(guān)鍵詞查詢 結(jié)構(gòu)化查詢 字符串的匹配算法 允許出錯的字符串的匹配算法關(guān)鍵詞查詢目前關(guān)鍵詞查詢是最常用的信息查詢方式。又可分為:1. 單個詞2. 多個詞組成的上下文 (高級檢索)3. 多個詞用and,or 或not 組成的句子4. 自然語言句子單個詞查詢 用一個最貼切的詞表示查詢的意思 考研 原理:文檔用詞組成的向量或文本。匹配變成是否文檔中含有查詢詞。上下文查詢類型:類型: 短語Phrase e.g, Shandong University 近似句子 允許有拼寫錯誤等 Shadong University高級查詢中組成的上下文: 書名作者,出版社, 發(fā)表時間,價格De
2、finitionA syntax (語法) composed of atoms that retrieve documents, and of Boolean operators which work on their operandse.g, translation AND (syntax OR syntactic) (布爾表達(dá)式)Fuzzy Boolean Retrieve documents appearing in some operands (The AND may require it to appear in more operands than the OR) Ranked h
3、igher which has a larger number of elements Boolean 查詢語法句法自然語言 Generalization of “fuzzy Boolean” A query is an enumeration of words and context queries All the documents matching a portion of the user query are retrieved Set a threshold so that the document with very low weight are not retrieved結(jié)構(gòu)查詢
4、 內(nèi)容與結(jié)構(gòu)混合查詢內(nèi)容與結(jié)構(gòu)混合查詢 - 給出匹配模板進(jìn)行匹配 三種結(jié)構(gòu)三種結(jié)構(gòu) - 固定結(jié)構(gòu) - 超鏈結(jié)構(gòu) - 層次結(jié)構(gòu)Fixed (固定)StructureDocument:a fixed set of fieldsEX: a mail has a sender, a receiver, a date, a subject and a body fieldSearch for the mails sent to a given person with “football” in the Subject fieldA hypertext is a directed graph where
5、nodes hold some text (text contents)the links represent connections between nodes or between positions inside nodes (structural connectivity)HypertextHierarchical StructureHierarchical Structure層次查詢的處理 從根到葉逐層限制的多次查詢 基于樹或圖的匹配算法String Matchingdetecting the occurrence of a particular substring (pattern
6、) in another string (text)A straightforward SolutionThe Knuth-Morris-Pratt AlgorithmStraightforward solution Algorithm: Simple string matching Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty. Output: The return value is the index in T where a c
7、opy of P begins, or -1 if no match for P is found.Definition 11.1 Notation for patterns and textP the pattern being search for;T the text in which P is sought;m the length of Pn the length of T, not known to the algorithm; m m ) /* m is the length of P*/ match = i; /match found. / success case / bre
8、ak; /*exit the loop here*/ if(tj = pk) j+; k+; else /Back up over matched characters. int backup=k-1; /從本次查詢點的下一個頂點開始從本次查詢點的下一個頂點開始/ j = j-backup; k = k-backup; /Slide pattern forward,start over. j+; i=j;return match;ikjPTAnalysis Worst-case complexity is in (mn)P=aaab T=aaaaaaaaaaaaaab Need to back
9、 up. However, it works quite well on average for natural language.The Knuth-Morris-Pratt Algorithm Pattern Matching with Finite Automata (自動機(jī)) e.g. P = “AABC” start is the beginning index of T Idea: remembering the matched part by utilizing the prefix ofpattern P and do not consider T. However, it i
10、s not scalable for the size of term table.The Knuth-Morris-Pratt Flowchart (流程圖) Character labels are inside the nodes, not on the arcs. Each node has two arrows out to other nodes: success link, or fail link next character is read only after a success link A special node, node 0, called “get next c
11、har” which read in next text character. e.g. P = “ABABCB”T=ABABABCBConstruction of the KMP Flowchart Definition:Fail links We define failk as the largest r (with rk) such that p1,.pr-1 matches pk-r+1.pk-1.That is the (r-1) character prefix of P is identical to the one (r-1) character substring endin
12、g at index k-1. Thus the fail links are determined by repetition within P itself.P: |A B A B A B| C B | |T: |A B A B A B| x mismatchP: |A B A B | A B C B | |T: A B |A B A B| x A B A B A B A B A B A B xr = 5; k = 7, p1.p4 = p3p6Fail7 = 5注意:T的指針沒有改變函數(shù) fail 是從左到右計算的,即遞推方式進(jìn)行的。我們可以假設(shè) 在計算failk 時,所有failt,
13、tk 已經(jīng)被計算了。fail1=0. | p1 ps-1 | ps p1 | pk-r-1 pk-2 | pk-1 pk pm (a) by definition of failk-1 (which is s) p1. pfails-1 |pfails | | p1 ps-1 | ps | p1 | pk-r-1 pk-2 | pk-1 | pk pm do these match? (b) looking back for a match for pk-1. failk = s+1 = Pfailk-1+1Algorithm: KMP flowchart construction Input
14、: P,a string of characters;m,the length of P. Output: fail,the array of failure links,defined for indexes 1,.,m.The array is passed in and the algorithm fills it. Step: void kmpSetup(char P, int m, int fail) int k,s 1. fail1=0; /start point/ 2. for(k=2;k=1) / p1,ps 與 pk-s+1,pk-1比較 5. if(ps=pk-1) /*就
15、是它!*/ 6. break; 7. s=fails; /*否則遞歸向下找*/ 8. failk=s+1; fail1 = 0 ; fail2-1 =fail1= s = 0; fail2 =1; k= 3; s=fail3-1= 1; p2 p1, s= fail1=0; fail3= s+1=1. k=4; s= fail3 =1; p1=p3; fail4=s+1=2; To compute fail8, s= fail7 = 5, but p7 p5, recompute s= fail5 = 3, but p7 p3 either, so re-compute s= fail3 =
16、1. still p7 p1. Finally, s= fail1 = 0, end the search, and fail8 is assigned s+1 = 1;P = A B A B A B C Bfail 0 1 1 2 3 4 5 1 index 1 2 3 4 5 6 7 8The Knuth-Morris-Pratt Scan Algorithm int kmpScan(char P,char T,int m,int fail) int match, j,k; match= -1; j=1; k=1; while(endText(T,j)=false) if(km) / su
17、ccess/ match = j-m; break; if(k= =0) /the point of T moves ahead, and rescan/ j+; k=1; else if(tj= =pk) / success at position k of P/ j+; k+; else /Follow fail arrow. k=failk; / fail and go back to the point of P / /continue loop. return match; 沒有使用變量iAnalysis Based on the similar method on analyzin
18、g the time complexity of algorithm KMP setup, The scan algorithm requires 2n character comparisons in the worst case Overall: worst case complexity is (n+m)RK 算法 輸入: Two n bit strings A(a1,a2, ,an) and B(b1,b2, ,bn) 輸出: whether A = B. 傳統(tǒng)方法:傳輸n位依次比較。 指紋機(jī)制: 定義n位整數(shù) 根據(jù)指紋函數(shù) Fp(x)=x mod p , p是一個素數(shù)是一個素數(shù) 比較
19、Fp(a)是否等于Fp(b),傳輸位數(shù)減小為O(logp)112iniiaa112iniibb設(shè) 代表字符集合,x , 定義函數(shù)ord(x), d = | |, ord(x): 0,1,2,d-1 對任意的模式P, |P| = m, 利用多項式指紋 Q(P) = ord(P1)dm-1+ord(P2)dm-2+ord(Pm-1)d+ord(Pm) 代表 P 同樣對文本T = T1,T2,.,Tn 從左到右計算長度為m的連續(xù)子串的指紋,如Q(i)= ord(Ti)dm-1+ord(Ti+1)dm-2 +ord(Ti+m-2)d+ ord(T i+m-1)并和Q(P)相比較。若相同,則找到匹配的子
20、串。起始位置為i, 0i0, b- 1 aa 0*2+0=0, bb- 2*1+1=3, 0 3 ba- (3-2*1)*2+0 = 2 0 2 ab- (2-1*2)*2+ 1 = 1 ba- (1-0*2)*2+0 = 2 aa - (2-1*2)*2+0 = 0 find the position問題是得到的整數(shù)無法表示了,過于大取素數(shù) q, Q(i) (mod q) = Q(p) (mod q)Q(i+1) (mod q) = (Q(i) ord(Ti)d m-1)*d ) (mod q) + ord(Ti+m) 但這樣的話,當(dāng) Q(i) (mod q) = Q(p) (mod q),
21、 不一定 對應(yīng)的字符串相同,這時可以逐位進(jìn)行檢查,有人證明該算法的期望時間復(fù)雜性為O(m+n), 是較好的算法。 特點:可以推廣到高維的字符串匹配,是否可以應(yīng)用到對2維圖像的匹配?應(yīng)用到對3維物體的匹配?計算具有一定誤差的匹配Elements of Dynamic Programming Constructing solution to a problem by building it up dynamically from solutions to smaller (or simpler) sub-problems sub-instances are combined to obtain s
22、ub-instances of increasing size, until finally arriving at the solution of the original instance. make a choice at each step, but the choice may depend on the solutions to sub-problemsPrinciple of optimalityQthe optimal solution to any nontrivial instance of a problem is a combination of optimal sol
23、utions to some of its sub-instances. Memorization (for overlapping sub-problems)Qavoid calculating the same thing twice, Qusually by keeping a table of know results that fills up as sub-instances are solved. Principle of optimalityQthe optimal solution to any nontrivial instance of a problem is a co
24、mbination of optimal solutions to some of its sub-instances. Memorization (for overlapping sub-problems)Qavoid calculating the same thing twice, Qusually by keeping a table of know results that fills up as sub-instances are solved. Memorization for Dynamic programming version of a recursive algorith
25、m e.g. Trade space for speed by storing solutions to sub-problems rather than re-computing them. As solutions are found for suproblems, they are recorded in a dictionary, Before any recursive call, say on subproblem Q, check the dictionary to see if a solution for Q has been stored. If no solution h
26、as been stored, go ahead with recursive call. If a solution has been stored for Q, retrieve the stored solution, and do not make the recursive call. Just before returning the solution, store it in the dictionary. Dynamic programming version of the fib.Development of a dynamic programming algorithm C
27、haracterize the structure of an optimal solution Breaking a problem into sub-problem whether principle of optimality apply Recursively define the value of an optimal solution define the value of an optimal solution based on value of solutions to sub-problems Compute the value of an optimal solution
28、in a bottom-up fashion compute in a bottom-up fashion and save the values along the way later steps use the save values of pervious steps Construct an optimal solution from computed information字符串的近似匹配(Approximate string matching) In many applications we cant expect an exact copy, we want to find a
29、approximating string match with at most k mistakes, e.g., a spelling corrector. We will develop a dynamic programming algorithm for the k-approximate match.Definition: Let k be a nonnegative integer. A k-approximate match is a match of P in T that has at most k differences. The differences can be an
30、y of the following three types, the name of the difference is the operation needed on T to bring it closer to P.Revise: The corresponding characters in P and T are different;Delete: T contains a character that is missing from P.Insert: T is missing a character that appears in P. 如何修改T中的子串,使其能匹配上e.g.
31、 3-approximate match P: u n n e c e s s a r i l y T: un e s c e s s a r a l y (made three spelling errors)Definition 11.6 Difference tableDij = the minimum number of difference between P1,Pi and a segment of T ending at tj. 1i m, 1j m.定義: D0j=0; Di,0 = i; There will be a k-approximate match ending a
32、t tj for any j such that Dmj k, so we can stop as soon as we find an entry less than or equal to k in the last row of D, which is the first k -approximate match.The rules for the computation of D Dij = Di-1j-1 if pi = tj /*no error*/ Dij = Di-1j-1+1 if pi tj and revise tj to pi and both i and j increase; Dij = Di-1j+1 if insert pi into T, only i increase. Dij = Dij-1+1 if delete tj from T and only j increase. Each entry requires only entries above it and to its left in the table0 0 0 0 012m
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 油務(wù)工專業(yè)理論考試題及參考答案
- 專業(yè)攝影測量與遙感習(xí)題及答案
- 呼叫中心服務(wù)員-初級工考試題及參考答案
- 2025屆山西省部分學(xué)校高三4月模擬考試(省二模)生物試題(原卷版+解析版)
- 江蘇省南京市五校聯(lián)盟2024-2025學(xué)年高二下學(xué)期4月期中地理試題(原卷版+解析版)
- 批發(fā)業(yè)消費者行為分析與研究考核試卷
- 畜禽糞便處理與農(nóng)業(yè)廢棄物循環(huán)利用考卷考核試卷
- 租賃店鋪的顧客滿意度提升實踐考核試卷
- 聚苯并噻吩改性與加工技術(shù)考核試卷
- 聚合纖維的綠色生產(chǎn)與可持續(xù)發(fā)展考核試卷
- 2025陜西漢中漢源電力(集團(tuán))限公司招聘56人易考易錯模擬試題(共500題)試卷后附參考答案
- 年產(chǎn)30萬噸生物航煤項目可行性研究報告(僅供參考)
- 南京師范大學(xué)自主招生個人陳述范文與撰寫要點
- 鐵粉運輸合同協(xié)議
- 計算機(jī)網(wǎng)絡(luò)安全知識試題及答案2025年計算機(jī)二級考試
- 浙江省A9協(xié)作體2024-2025學(xué)年高二下學(xué)期4月期中聯(lián)考語文試卷(含答案 )
- 2025年初中學(xué)業(yè)水平考試地理模擬卷及答案:圖表解讀與地理學(xué)科創(chuàng)新試題
- (四調(diào))武漢市2025屆高中畢業(yè)生四月調(diào)研考試 語文試卷(含答案詳解)
- 廣州廣州市天河區(qū)華陽小學(xué)-畢業(yè)在即家校共話未來-六下期中家長會【課件】
- 第4單元 亮火蟲(教學(xué)設(shè)計)-2024-2025學(xué)年粵教花城版(2024)音樂一年級下冊
- 公司事故隱患內(nèi)部報告獎勵制度
評論
0/150
提交評論