現(xiàn)代信息檢索導(dǎo)論英文課件:11-LanguageModels_第1頁
現(xiàn)代信息檢索導(dǎo)論英文課件:11-LanguageModels_第2頁
現(xiàn)代信息檢索導(dǎo)論英文課件:11-LanguageModels_第3頁
現(xiàn)代信息檢索導(dǎo)論英文課件:11-LanguageModels_第4頁
現(xiàn)代信息檢索導(dǎo)論英文課件:11-LanguageModels_第5頁
已閱讀5頁,還剩39頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認(rèn)領(lǐng)

文檔簡介

1、An Introduction to IR11th CourseChapter 12:Language Models for Information RetrievalLanguages, grammars, and language modelsA “(formal) language” is the set of strings (called sentences or clauses) generated by a grammar:G=(VT, VN, S; R) VT : terminal symbols (words, tokens);VN : non-terminal symbol

2、s (syntax nodes);SVN: start (top-most) symbolR = | ,(VTVN)* : the rule set.A “(probabilistic) language model” ML further gives each sentence of a language L a probability:ML = P() | L .Note: For any (VTVN)+, P(VTVN)+ ) can be computed from P(tVT) and P(AVN). Language models for IRThe general idea:Gi

3、ven a query q, rank the documents di by P(di | q). The rest is how to estimate P(d | q).dmax = argmaxd P(d |q) = argmaxd P(q|d)P(d), P(d) : prior probablity of doc d, e.g. PageRank(d) multiplied by f(ClickRate). Standard Probabilistic IRqueryd1d2dnInformation needdocument collectionmatchingIR based

4、on Language Model (LM)queryd1d2dnInformation needdocument collectiongenerationA common search heuristic is to use words that you expect to find in matching documents as your query why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!The LM a

5、pproach directly exploits that idea!Formal Language (Model)Traditional generative model: generates stringsFinite state machines or regular grammars, etc.Example: IwishI wishI wish I wishI wish I wish I wishI wish I wish I wish I wish*wish I wishStochastic Language ModelsModels probability of generat

6、ing strings in the language (commonly all strings over alphabet ) 0.2the0.1a0.01man0.01woman0.03said0.02likesthemanlikesthewoman0.20.010.020.20.01multiplyModel MP(s | M) = 0.00000008 Stochastic Language ModelsModel probability of generating any string0.2the0.01class0.0001sayst0.0001pleaseth0.0001yon

7、0.0005maiden0.01womanModel M1Model M2maidenclasspleasethyonthe0.00050.010.00010.00010.20.010.00010.020.10.2P(s|M2) P(s|M1)0.2the0.0001class0.03sayst0.02pleaseth0.1yon0.01maiden0.0001womanStochastic Language ModelsA statistical model for generating textProbability distribution over strings in a given

8、 languageMP ( | M )= P ( | M ) P ( | M, )P ( | M, )P ( | M, )Unigram and higher-order modelsUnigram Language ModelsBigram (generally, n-gram) Language ModelsOther Language ModelsGrammar-based models (PCFGs), etc.Probably not the first thing to try in IR= P ( )P ( | )P ( | )P ( | ) P ( ) P ( ) P ( )

9、P ( )P ( ) P ( ) P ( | ) P ( | ) P ( | )Easy.Effective!Using Language Models in IRTreat each document as the basis for a model (e.g., unigram sufficient statistics)Rank document d based on P(d | q)P(d | q) = P(q | d) x P(d) / P(q)P(q) is the same for all documents, so ignoreP(d) the prior is often t

10、reated as the same for all dBut we could use criteria like authority, length, genreP(q | d) is the probability of q given ds modelVery general formal approachThe fundamental problem of LMsUsually we dont know the model MBut have a sample of text representative of that modelEstimate a language model

11、from a sampleThen compute the observation probabilityP ( | M ( ) )MLanguage Models for IRLanguage Modeling ApproachesAttempt to model query generation processDocuments are ranked by the probability that a query would be observed as a random sample from the respective document modelMultinomial approa

12、chRetrieval based on probabilistic LMTreat the generation of queries as a random process.ApproachInfer a language model for each document.Estimate the probability of generating the query according to each of these models.Rank the documents according to these probabilities.Usually a unigram estimate

13、of words is usedSome work on bigrams, paralleling van RijsbergenRetrieval based on probabilistic LMIntuitionUsers Have a reasonable idea of terms that are likely to occur in documents of interest.They will choose query terms that distinguish these documents from others in the collection.Collection s

14、tatistics Are integral parts of the language model.Are not used heuristically as in many other approaches.In theory. In practice, theres usually some wiggle room for empirically set parametersQuery generation probability (1)Ranking formulaThe probability of producing the query given the language mod

15、el of document d using MLE is:Unigram assumption:Given a particular language model, the query terms occur independently : language model of document d : raw tf of term t in document d : total number of tokens in document dInsufficient dataZero probabilityMay not wish to assign a probability of zero

16、to a document that is missing one or more of the query terms gives conjunction semanticsGeneral approachA non-occurring term is possible, but no more likely than would be expected by chance in the collection.If , : raw count of term t in the collection : raw collection size(total number of tokens in

17、 the collection)Insufficient dataZero probabilities spell disasterWe need to smooth probabilitiesDiscount nonzero probabilitiesGive some probability mass to unseen thingsTheres a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, or to counts

18、, Dirichlet priors, discounting, and interpolationSee FSNLP ch. 6 or CS224N if you want moreA simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distributionMixture modelP(w|d) = Pmle(w|Md) + (1 )Pmle(w|Mc)Mixes the probability

19、 from the document with the general collection frequency of the word.Correctly setting is very importantA high value of lambda makes the search “conjunctive-like” suitable for short queriesA low value is more suitable for long queriesCan tune to optimize performancePerhaps make it dependent on docum

20、ent size (cf. Dirichlet prior or Witten-Bell smoothing)Basic mixture model summaryGeneral formulation of the LM for IRThe user has a document in mind, and generates the query from this document.The equation represents the probability that the document that the user had in mind was in fact this one.

21、P( t ) = PMLE( t | MCollection) general language modelindividual-document modelExampleDocument collection (2 documents)d1: Xerox reports a profit but revenue is downd2: Lucent narrows quarter loss but revenue decreases furtherModel: MLE unigram from documents; = Query: revenue downP(Q|d1) = (1/8 + 2

22、/16)/2 x (1/8 + 1/16)/2 = 1/8 x 3/32 = 3/256P(Q|d2) = (1/8 + 2/16)/2 x (0 + 1/16)/2 = 1/8 x 1/32 = 1/256Ranking: d1 d2Ponte and Croft ExperimentsDataTREC topics 202-250 on TREC disks 2 and 3Natural language queries consisting of one sentence eachTREC topics 51-100 on TREC disk 3 using the concept fi

23、eldsLists of good termsNumber: 054Domain: International EconomicsTopic: Satellite Launch ContractsDescription: Concept(s):Contract, agreementLaunch vehicle, rocket, payload, satelliteLaunch services, Precision/recall results 202-250Precision/recall results 51-100The main difference is whether “Relev

24、ance” figures explicitly in the model or notLM approach attempts to do away with modeling relevanceLM approach asssumes that documents and expressions of information problems are of the same typeComputationally tractable, intuitively appealingLM vs. Prob. Model for IRProblems of basic LM approachAss

25、umption of equivalence between document and information problem representation is unrealisticVery simple models of languageRelevance feedback is difficult to integrate, as are user preferences, and other general issues of relevanceCant easily accommodate phrases, passages, Boolean operatorsCurrent e

26、xtensions focus on putting relevance back into the model, etc.LM vs. Prob. Model for IRExtension: 3-level model3-level modelWhole collection model ( )Specific-topic model; relevant-documents model ( )Individual-document model ( )Relevance hypothesisA request(query; topic) is generated from a specifi

27、c-topic model , .Iff a document is relevant to the topic, the same model will apply to the document.It will replace part of the individual-document model in explaining the document.The probability of relevance of a documentThe probability that this model explains part of the documentThe probability

28、that the , , combination is better than the , combination3-level modelqueryd1d2dnInformation needdocument collectiongenerationAlternative Models of Text GenerationQuery ModelQueryDoc ModelDocSearcherWriterIs this the same model?Retrieval Using Language ModelsQuery ModelQueryDoc ModelDocRetrieval: Qu

29、ery likelihood (1), Document likelihood (2), Model comparison (3)123Query LikelihoodP(Q|Dm)Major issue is estimating document modeli.e. smoothing techniques instead of tf.idf weightsGood retrieval resultse.g. UMass, BBN, Twente, CMU Problems dealing with relevance feedback, query expansion, structur

30、ed queriesDocument LikelihoodRank by likelihood ratio P(D|R)/P(D|NR)treat as a generation problemP(w|R) is estimated by P(w|Qm)Qm is the query or relevance modelP(w|NR) is estimated by collection probabilities P(w)Issue is estimation of query modelTreat query as generated by mixture of topic and bac

31、kgroundEstimate relevance model from related documents (query expansion)Relevance feedback is easily incorporatedGood retrieval results e.g. UMass at SIGIR 01inconsistent with heterogeneous document collectionsModel ComparisonEstimate query and document models and compareSuitable measure is KL diver

32、gence D(Qm|Dm)equivalent to query-likelihood approach if simple empirical distribution used for query modelMore general risk minimization framework has been proposedZhai and Lafferty 2001Better results than query-likelihood or document-likelihood approachesTwo-stage smoothing:Another Reason for Smoo

33、thingQuery = “the algorithms for data mining”d1: 0.04 0.001 0.02 0.002 0.003 d2: 0.02 0.001 0.01 0.003 0.004p( “algorithms”|d1) = p(“algorithm”|d2)p( “data”|d1) p(“data”|d2)p( “mining”|d1) p(q|d2)!We should make p(“the”) and p(“for”) less different for all docs. Two-stage Smoothingc(w,d)|d|P(w|d) =+

34、p(w|C)+Stage-1 -Explain unseen words-Dirichlet prior(Bayesian)(1-)+ p(w|U)Stage-2 -Explain noise in query-2-component mixtureHow can one do relevance feedback if using language modeling approach?Introduce a query model & treat feedback as query model updatingRetrieval function: Query-likelihood = KL

35、-DivergenceFeedback: Expansion-based = Model-basedExpansion-based vs. Model-based Document DResultsFeedback DocsDoc modelDoc modelScoringScoringQuery QDocument DQuery QFeedback DocsResultsExpansion-basedFeedbackmodifymodifyModel-basedFeedbackQuery modelQuery likelihoodKL-divergenceFeedback as Model

36、InterpolationQuery QDocument DResultsFeedback Docs F=d1, d2 , , dnGenerative model=0No feedback=1Full feedbackTranslation model (Berger and Lafferty)Basic LMs do not address issues of synonymy.Or any deviation in expression of information need from language of documentsA translation model lets you g

37、enerate query words not in document via “translation” to synonyms etc.Or to do cross-language IR, or multimedia IR Basic LM TranslationNeed to learn a translation model (using a dictionary or via statistical machine translation)Language models: pro & conNovel way of looking at the problem of text re

38、trieval based on probabilistic language modelingConceptually simple and explanatoryFormal mathematical modelNatural use of collection statistics, not heuristics (almost)LMs provide effective retrieval and can be improved to the extent that the following conditions can be metOur language models are a

39、ccurate representations of the data.Users have some sense of term distribution.*Or we get more sophisticated with translation modelComparison With Vector SpaceTheres some relation to traditional tf.idf models:(unscaled) term frequency is directly in modelthe probabilities do length normalization of term frequenciesthe effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general coll

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論