現(xiàn)代信息檢索導(dǎo)論英文課件:10-ProbabilisticIrModels_第1頁
現(xiàn)代信息檢索導(dǎo)論英文課件:10-ProbabilisticIrModels_第2頁
現(xiàn)代信息檢索導(dǎo)論英文課件:10-ProbabilisticIrModels_第3頁
現(xiàn)代信息檢索導(dǎo)論英文課件:10-ProbabilisticIrModels_第4頁
現(xiàn)代信息檢索導(dǎo)論英文課件:10-ProbabilisticIrModels_第5頁
已閱讀5頁,還剩41頁未讀 繼續(xù)免費閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認(rèn)領(lǐng)

文檔簡介

1、An Introduction to IR10th CourseChapter 11:Probabilistic Information RetrievalRecap of IR rankingVSM: Term weighting in a vector space Similarity measure is cosine of vector anglesQuery expansion: Improving search resultsEspecially for high recall. E.g., searching for aircraft so it matches with pla

2、ne; thermodynamic with heatRelevance feedback: Options for improving resultsGlobal methodsQuery expansionThesauriAutomatic thesaurus generationGlobal indirect relevance feedbackLocal methodsRelevance feedbackPseudo relevance feedbackProbabilistic relevance feedbackRather than reweighting in a vector

3、 spaceIf user has told us some relevant and some irrelevant documents, then we can proceed to build a probabilistic classifier, such as a Naive Bayes model:P(tk|R) = |Drk| / |Dr|P(tk|NR) = |Dnrk| / |Dnr|tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr i

4、s the set of known irrelevant documents; Dnrk is the subset that contain tk.Why probabilities in IR?User Information NeedDocumentsDocumentRepresentationQueryRepresentationHow to match?In traditional IR systems, matching between each document andquery is attempted in a semantically imprecise space of

5、 index terms.Probabilities provide a principled foundation for uncertain reasoning.Can we use probabilities to quantify our uncertainties?Uncertain guess ofwhether document has relevant contentUnderstandingof user need isuncertainProbabilistic IR topicsClassical probabilistic retrieval modelProbabil

6、ity ranking principle, etc.(Nave) Bayesian Text Categorization Bayesian networks for text retrievalLanguage model approach to IRAn important emphasis in recent workProbabilistic methods are one of the oldest but also one of the currently hottest topics in IR.Traditionally: neat ideas, but theyve nev

7、er won on performance. It may be different now.The document ranking problemWe have a collection of documentsUser issues a queryA list of documents needs to be returnedRanking method is core of an IR system:In what order do we present documents to the user?We want the “best” document to be first, sec

8、ond best second, etc.Idea: Rank by probability of relevance of the document w.r.t. information needP(relevant|documenti, query)Recall a few probability basicsFor events a and b:Bayes RuleOdds:PosteriorPriorThe Probability Ranking Principle “If a reference retrieval systems response to each request i

9、s a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall eff

10、ectiveness of the system to its user will be the best that is obtainable on the basis of those data.”1960s/1970s S. Robertson, W.S. Cooper, M.E. Maron; vanRijsbergen(1979:113); Manning & Schtze (1999:538)Probability Ranking PrincipleLet x be a document in the collection. Let R represent relevance of

11、 a document w.r.t. given (fixed) query and let NR represent non-relevance.p(x|R), p(x|NR) - probability that if a relevant (non-relevant) document is retrieved, it is x.Need to find p(R|x) - probability that a document x is relevant.p(R),p(NR) - prior probabilityof retrieving a (non) relevantdocumen

12、tR=0,1 vs. NR/RProbability Ranking Principle (PRP)Simple case: no selection costs or other utility concerns that would differentially weight errorsBayes Optimal Decision Rulex is relevant iff p(R|x) p(NR|x)PRP in action: Rank all documents by p(R|x)Theorem:Using the PRP is optimal, in that it minimi

13、zes the loss (Bayes risk) under 1/0 lossProvable if all probabilities correct, etc. e.g., Ripley 1996Probability Ranking PrincipleMore complex case: retrieval costs.Let d be a documentC - cost of retrieval of relevant documentC - cost of retrieval of non-relevant documentProbability Ranking Principl

14、e: iffor all d not yet retrieved, then d is the next document to be retrievedWe wont further consider loss/utility from now onProbability Ranking PrincipleHow do we compute all those probabilities?Do not know exact probabilities, have to use estimates Binary Independence Retrieval (BIR) which we dis

15、cuss later today is the simplest modelQuestionable assumptions“Relevance” of each document is independent of relevance of other documents.Really, its bad to keep on returning duplicatesBoolean model of relevanceThat one has a single step information needSeeing a range of results might let user refin

16、e queryProbabilistic Retrieval StrategyEstimate how terms contribute to relevanceHow do things like tf, df, and length influence your judgments about document relevance? One answer is the Okapi formulae (S. Robertson)Combine to find document relevance probabilityOrder documents by decreasing probabi

17、lity Probabilistic RankingBasic concept:For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.By making assumptions about the distribution of terms and applying Bayes Theorem, it

18、 is possible to derive weights theoretically.Van RijsbergenBinary Independence ModelTraditionally used in conjunction with PRP“Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1): iff term i is present in document x.“Independence”: terms occur in documen

19、ts independently Different documents can be modeled as same vectorBernoulli Naive Bayes model (cf. text categorization!)Binary Independence ModelQueries: binary term incidence vectorsGiven query q, for each document d need to compute p(R|q,d).replace with computing p(R|q,x) where x is binary term in

20、cidence vector representing d Interested only in rankingWill use odds and Bayes Rule:Binary Independence Model Using Independence Assumption:Constant for a given queryNeeds estimationSo :Binary Independence Model Since xi is either 0 or 1: Let Assume, for all terms not occurring in the query (qi=0)T

21、hen.This can be changed (e.g., inrelevance feedback)All matching termsNon-matching query termsBinary Independence ModelAll matching termsAll query termsBinary Independence ModelConstant foreach queryOnly quantity to be estimated for rankings Retrieval Status Value:Binary Independence Model All boils

22、 down to computing RSV.So, how do we compute cis from our data ?Binary Independence Model Estimating RSV coefficients. For each term i look at this table of document counts: Estimates:For now,assume nozero terms.More nextlecture.Estimation key challengeIf non-relevant documents are approximated by t

23、he whole collection, then ri (prob. of occurrence in non-relevant documents for query) is n/N andlog (1 ri)/ri = log (N n)/n log N/n = IDF!pi (probability of occurrence in relevant documents) can be estimated in various ways:from relevant documents if know someRelevance weighting can be used in feed

24、back loopconstant (Croft and Harper combination match) then just get idf weighting of termsproportional to prob. of occurrence in collectionmore accurately, to log of this (Greiff, SIGIR 1998)Iteratively estimating piAssume that pi constant over all xi in querypi = 0.5 (even odds) for any given docD

25、etermine guess of relevant document set:V is fixed size set of highest ranked documents on this model (note: now a bit like tf.idf!)We need to improve our guesses for pi and ri, soUse distribution of xi in docs in V. Let Vi be set of documents containing xi pi = |Vi| / |V|Assume if not retrieved the

26、n not relevant ri = (ni |Vi|) / (N |V|)Go to 2. until converges then return ranking24Probabilistic Relevance FeedbackGuess a preliminary probabilistic description of R and use it to retrieve a first set of documents V, as above.Interact with the user to refine the description: learn some definite me

27、mbers of R and NRReestimate pi and ri on the basis of theseOr can combine new information with original guess (use Bayesian prior):Repeat, thus generating a succession of approximations to R. is priorweightPRP and BIRGetting reasonable approximations of probabilities is possible.Requires restrictive

28、 assumptions:term independenceterms not in query dont affect the outcomeboolean representation of documents/queries/relevancedocument relevance values are independentSome of these assumptions can be removedProblem: either require partial relevance information or only can derive somewhat inferior ter

29、m weightsRemoving term independenceIn general, index terms arent independentDependencies can be complexvan Rijsbergen (1979) proposed model of simple tree dependenciesExactly Friedman and Goldszmidts Tree Augmented Naive Bayes (AAAI 13, 1996)Each term dependent on one otherIn 1970s, estimation probl

30、ems held back success of this modelFood for thoughtThink through the differences between standard tf.idf and the probabilistic retrieval model in the first iterationThink through the differences between vector space (pseudo) relevance feedback and probabilistic (pseudo) relevance feedbackGood and Ba

31、d NewsStandard Vector Space ModelEmpirical for the most part; success measured by resultsFew properties provableProbabilistic Model AdvantagesBased on a firm theoretical foundationTheoretically justified optimal ranking schemeDisadvantagesMaking the initial guess to get VBinary word-in-doc weights (

32、not using term frequencies)Independence of terms (can be alleviated)Amount of computationHas never worked convincingly better in practiceBayesian Networks for Text Retrieval (Turtle and Croft 1990)Standard probabilistic model assumes you cant estimate P(R|D,Q)Instead assume independence and use P(D|

33、R)But maybe you can with a Bayesian network*What is a Bayesian network?A directed acyclic graphNodes Events or VariablesAssume values. For our purposes, all BooleanLinksmodel direct dependencies between nodesBayesian Networksabca,b,c - propositions (events).p(c|ab) for all values for a,b,cp(a)p(b) B

34、ayesian networks model causal relations between eventsInference in Bayesian Nets:Given probability distributionsfor roots and conditional probabilities can compute apriori probability of any instance Fixing assumptions (e.g., b was observed) will cause recomputation of probabilities Conditional depe

35、ndenceFor more information see:R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. 1999. Probabilistic Networks and Expert Systems. Springer Verlag. J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan-Kaufman.Toy ExampleGloom(g)Finals(

36、f)Project Due(d)No Sleep(n)Triple Latte(t)Independence Assumptions Independence assumption: P(t|g, f)=P(t|g) Joint probability P(f d n g t) =P(f) P(d) P(n|f) P(g|f d) P(t|g)Gloom(g)Finals(f)Project Due(d)No Sleep(n)Triple Latte(t)Chained inferenceEvidence - a node takes on some valueInference Comput

37、e belief (probabilities) of other nodesconditioned on the known evidenceTwo kinds of inference: Diagnostic and PredictiveComputational complexityGeneral network: NP-hardTree-like networks are easily tractableMuch other work on efficient exact and approximate Bayesian network inferenceClever dynamic

38、programmingApproximate inference (“l(fā)oopy belief propagation”)Model for Text RetrievalGoalGiven a users information need (evidence), find probability a doc satisfies needRetrieval modelModel docs in a document networkModel information need in a query networkBayesian Nets for IR: IdeaDocument NetworkQ

39、uery NetworkLarge, butCompute once for each document collectionSmall, compute once forevery queryd1dnd2t1t2tnr1r2r3rkdi -documentsti - document representationsri - “concepts”Iq2q1cmc2c1ci - query conceptsqi - high-level conceptsI - goal nodeBayesian Nets for IRConstruct Document Network (once !)For

40、each queryConstruct best Query Network Attach it to Document NetworkFind subset of dis which maximizes the probability value of node I (best subset).Retrieve these dis as the answer to query.Bayesian nets for text retrievald1d2r1r3c1c3q1q2ir2c2DocumentNetworkQueryNetworkDocumentsTerms/ConceptsConcep

41、tsQuery operators(AND/OR/NOT)Information needLink matrices and probabilitiesPrior doc probability P(d) = 1/nP(r|d)within-document term frequencytf idf - basedP(c|r)1-to-1thesaurusP(q|c): canonical forms of query operatorsAlways use things like AND and NOT never store a full CPT*conditional probabili

42、ty tableExample: “reason trouble two”HamletMacbethreasondoublereasontwoORNOTUser querytroubletroubleDocumentNetworkQueryNetworkExtensionsPrior probs dont have to be 1/n.“User information need” doesnt have to be a query - can be words typed, in docs read, any combination Phrases, inter-document links

43、Link matrices can be modified over time.User feedback.The promise of “personalization”Computational detailsDocument network built at indexing timeQuery network built/scored at query timeRepresentation:Link matrices from docs to any single term are like the postings entry for that termCanonical link

44、matrices are efficient to store and computeAttach evidence only at roots of networkCan do single pass from roots to leavesBayes Nets in IRFlexible ways of combining term weights, which can generalize previous approachesBoolean modelBinary independence modelProbabilistic models with weaker assumption

45、sEfficient large-scale implementationInQuery text retrieval system from U MassTurtle and Croft (1990) Commercial version defunct?Need approximations to avoid intractable inferenceNeed to estimate all the probabilities by some means (whether more or less ad hoc)Much new Bayes net technology yet to be applied?Resources

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

最新文檔

評論

0/150

提交評論