LSA.303 Introduction to Computational Linguistics_第1頁
LSA.303 Introduction to Computational Linguistics_第2頁
LSA.303 Introduction to Computational Linguistics_第3頁
LSA.303 Introduction to Computational Linguistics_第4頁
LSA.303 Introduction to Computational Linguistics_第5頁
已閱讀5頁,還剩62頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領

文檔簡介

1、Natural Language ProcessingGiuseppe AttardiIntroduction to Probability, Language ModelingIP notice: some slides from: Dan Jurafsky, Jim Martin, Sandiway Fong, Dan KleinOutlineLanguage Modeling (N-grams)N-gram IntroThe Chain RuleThe Shannon Visualization MethodEvaluation:PerplexitySmoothing: Laplace

2、(Add-1)Add-priorLanguage ModelingWe want to compute P(w1,w2,w3,w4,w5wn) = P(W)= the probability of a sequenceAlternatively we want to compute P(w5|w1,w2,w3,w4)= the probability of a word given some previous wordsThe model that computes P(W) orP(wn|w1,w2wn-1)is called the language model.A better term

3、 for this would be “The Grammar”But “Language model” or LM is standardComputing P(W)How to compute this joint probability:P(“the” , ”other” , ”day” , ”I” , ”was” , ”walking” , ”along”, ”and”,”saw”,”a”,”lizard”)Intuition: lets rely on the Chain Rule of ProbabilityThe Chain RuleRecall the definition o

4、f conditional probabilitiesRewriting:More generallyP(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)In general P(x1,x2,x3,xn) = P(x1)P(x2|x1)P(x3|x1,x2)P(xn|x1xn-1)The Chain Rule applied to joint probability of words in sentenceP(“the big red dog was”) =P(the) * P(big|the) * P(red|the big) * P(dog|the big r

5、ed) * P(was|the big red dog)Obvious estimateHow to estimate?P(the | its water is so transparent that)P(the | its water is so transparent that) =C(its water is so transparent that the)_C(its water is so transparent that)UnfortunatelyThere are a lot of possible sentencesWell never be able to get enoug

6、h data to compute the statistics for those long prefixesP(lizard|the,other,day,I,was,walking,along,and,saw,a)orP(the|its water is so transparent that)Markov AssumptionMake the simplifying assumptionP(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|a)or maybeP(lizard|the,other,day,I,wa

7、s,walking,along,and,saw,a) = P(lizard|saw,a)So for each component in the product replace with the approximation (assuming a prefix of N) Bigram versionMarkov AssumptionEstimating bigram probabilitiesThe Maximum Likelihood EstimateAn example I am Sam Sam I am I do not like green eggs and ham This is

8、the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set|Model)Maximum Likelihood EstimatesThe maximum likelihood estimate of some parameter of a model M from a training set TIs the estimatethat maximizes the likelihood of the training set T given the model MSuppose the

9、word Chinese occurs 400 times in a corpus of a million words (Brown corpus)What is the probability that a random word from some other text will be “Chinese”MLE estimate is 400/1000000 = .004This may be a bad estimate for some other corpusBut it is the estimate that makes it most likely that “Chinese

10、” will occur 400 times in a million word corpus.The maximum likelihoodWe want to estimate the probability, p, that individuals are infected with a certain kind of parasite.Ind.:Infected:124563101011789100101The maximum likelihood method (discrete distribution): Write down the probability of each obs

11、ervation by using the model parametersWrite down the probability of all the dataFind the value parameter(s) that maximize this probabilityProbability of observation:p1-pppppp1-p1-p1-pThe maximum likelihoodWe want to estimate the probability, p, that individuals are infected with a certain kind of pa

12、rasite.Ind.:Infected:124563101011789100101- Find the value parameter(s) that maximize this probabilityProbability of observation:p1-pppppp1-p1-p1-pLikelihood function:More examples: Berkeley Restaurant Projectcan you tell me about any good cantonese restaurants close bymid priced thai food is what i

13、m looking fortell me about chez panissecan you give me a listing of the kinds of food that are availableim looking for a good place to eat breakfastwhen is caffe venezia open during the dayRaw bigram countsOut of 9222 sentencesRaw bigram probabilitiesNormalize by unigrams:Result:Bigram estimates of

14、sentence probabilitiesP( I want english food ) =P(i|) x P(want|I) x P(english|want) x P(food|english) x P(|food) =.000031What kinds of knowledge?P(english|want) = .0011P(chinese|want) = .0065P(to|want) = .66P(eat | to) = .28P(food | to) = 0P(want | spend) = 0P(i | ) = .25Jim Martin Shannons GameWhat

15、 if we turn these models around and use them to generate random sentences that are like the sentences from which the model was derived.The Shannon Visualization MethodGenerate random sentences:Choose a random bigram , w according to its probabilityNow choose a random bigram (w, x) according to its p

16、robabilityAnd so on until we choose Then string the words together I I want want to to eat eat Chinese Chinese food food Approximating Shakespeare Shakespeare as corpusN=884,647 tokens, V=29,066Shakespeare produced 300,000 bigram types out of V2= 844 million possible bigrams: so, 99.96% of the possi

17、ble bigrams were never seen (have zero entries in the table)Quadrigrams worse:Whats coming out looks like Shakespeare because it is ShakespeareThe Wall Street Journal is not Shakespeare (no offense) Lesson 1: the perils of overfittingN-grams only work well for word prediction if the test corpus look

18、s like the training corpusIn real life, it often doesntWe need to train robust models, adapt to test set, etc.Train and Test CorporaA language model must be trained on a large corpus of text to estimate good parameter values.Model can be evaluated based on its ability to predict a high probability f

19、or a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate).Ideally, the training (and test) corpus should be representative of the actual application data.May need to adapt a general model to a small amount of new (in-domain) data by adding hig

20、hly weighted small corpus to original training data.SmoothingSince there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so MLE incorrectly assigns zero to many parameters (aka sparse data).If a new combination occurs during

21、 testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).In practice, parameters are smoothed (aka regularized) to reassign some probability mass to unseen events.Adding probability mass to unseen events requires removing it from seen

22、ones (discounting) in order to maintain a joint distribution that sums to 1.Smoothing is like Robin Hood:Steal from the rich and give to the poor (in probability mass)Slide from Dan KleinLaplace smoothingAlso called add-one smoothingJust add one to all the counts!Very simpleMLE estimate:Laplace esti

23、mate:Reconstructed counts:Laplace smoothed bigram countsLaplace-smoothed bigramsReconstituted countsNote big change to countsC(want to) went from 608 to 238!P(to|want) from .66 to .26!Discount d = c*/cd for “chinese food” = .10A 10 x reduction!So in general, Laplace is a blunt instrumentBut Laplace

24、smoothing not used for N-grams, as we have much better methodsDespite its flaws Laplace (add-k) is however still used to smooth other probabilistic models in NLP, especiallyFor pilot studiesin domains where the number of zeros isnt so huge.Add-kAdd a small fraction instead of 1Even better: Bayesian

25、unigram prior smoothing for bigramsMaximum Likelihood EstimationLaplace SmoothingBayesian Prior SmoothingLesson 2: zeros or not?Zipfs Law:A small number of events occur with high frequencyA large number of events occur with low frequencyYou can quickly collect statistics on the high frequency events

26、You might have to wait an arbitrarily long time to get valid statistics on low frequency eventsResult:Our estimates are sparse! no counts at all for the vast bulk of things we want to estimate!Some of the zeroes in the table are really zeros But others are simply low frequency events you havent seen

27、 yet. After all, ANYTHING CAN HAPPEN!How to address?Answer:Estimate the likelihood of unseen N-grams!Slide adapted from Bonnie Dorr and Julia HirschbergZipfs law f 1/r (f proportional to 1/r) there is a constant k such that f r = kZipfs Law for the Brown Corpus Zipf law: interpretationPrinciple of l

28、east effort: both the speaker and the hearer in communication try to minimize effort:Speakers tend to use a small vocabulary of common (shorter) wordsHearers prefer a large vocabulary of rarer less ambiguous wordsZipfs law is the result of this compromiseOther laws Number of meanings m of a word obe

29、ys the law: m 1/fInverse relationship between frequency and lengthPractical IssuesWe do everything in log spaceAvoid underflow(also adding is faster than multiplying)Language Modeling ToolkitsSRILM/IRSTLMKen LMGoogle N-Gram ReleaseGoogle N-Gram Releaseserve as the incoming 92serve as the incubator 9

30、9serve as the independent 794serve as the index 223serve as the indication 72serve as the indicator 120serve as the indicators 45serve as the indispensable 111serve as the indispensible 40serve as the individual 234EvaluationTrain parameters of our model on a training set.How do we evaluate how well

31、 our model works?Look at the models performance on some new dataThis is what happens in the real world; we want to know how our model performs on data we havent seenUse a test set. A dataset which is different than our training setThen we need an evaluation metric to tell us how well our model is do

32、ing on the test set.One such metric is perplexityEvaluating N-gram modelsBest evaluation for an N-gramPut model A in a task (language identification, speech recognizer, machine translation system)Run the task, get an accuracy for A (how many langs identified correctly, or Word Error Rate, or etc)Put

33、 model B in task, get accuracy for BCompare accuracy for A and BExtrinsic evaluationLanguage Identification taskCreate an N-gram model for each languageCompute the probability of a given textPlang1(text)Plang2(text)Plang3(text)Select language with highest probabilitylang = argmaxl Pl(text)Difficulty

34、 of extrinsic (in-vivo) evaluation of N-gram modelsExtrinsic evaluationThis is really time-consumingCan take days to run an experimentSoAs a temporary solution, in order to run experimentsTo evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexityBut perplexity is a

35、poor approximation unless the test data looks just like the training dataSo is generally only useful in pilot experiments (generally is not sufficient to publish)PerplexityThe intuition behind perplexity as a measure is the notion of surprise.How surprised is the language model when it sees the test

36、 set?Where surprise is a measure of.Gee, I didnt see that coming.The more surprised the model is, the lower the probability it assigned to the test setThe higher the probability, the less surprised it wasPerplexityMeasures of how well a model “fits” the test data.Uses the probability that the model

37、assigns to the test corpus.Normalizes for the number of words in the test corpus and takes the inverse.Measures the weighted average branching factor in predicting the next word (lower is better).PerplexityPerplexity:Chain rule:For bigrams:Minimizing perplexity is the same as maximizing probabilityT

38、he best language model is one that best predicts an unseen test setPerplexity as branching factorHow hard is the task of recognizing digits 0,1,2,3,4,5,6,7,8,9Perplexity: 10Lower perplexity = better modelModel trained on 38 million words from the Wall Street Journal (WSJ) using a 19,979 word vocabul

39、ary.Evaluation on a disjoint set of 1.5 million WSJ words.Unknown WordsHow to handle words in the test corpus that did not occur in the training data, i.e. out of vocabulary (OOV) words?Train a model that includes an explicit symbol for an unknown word ():Choose a vocabulary in advance and replace o

40、ther words in the training corpus with , orReplace the first occurrence of each word in the training data with .Unknown Words handlingTraining of probabilitiesCreate a fixed lexicon L of size VAny training word not in L changed to Now we train its probabilities like a normal wordAt decoding timeIn t

41、ext input: use probabilities for any word not in trainingAdvanced LM stuffCurrent best smoothing algorithmKneser-Ney smoothingOther stuffInterpolationBackoffVariable-length n-gramsClass-based n-gramsClusteringHand-built classesCache LMsTopic-based LMsSentence mixture modelsSkipping LMsParser-based L

42、MsWord EmbeddingsBackoff and InterpolationIf we are estimating:Trigram P(z|xy) but C(xyz) is zeroUse info from:Bigram P(z|y)Or even:Unigram P(z)How to combine the trigram/bigram/unigram info?Backoff versus interpolationBackoff: use trigram if you have it, otherwise bigram, otherwise unigramInterpola

43、tion: mix all threeBackoffOnly use lower-order model when data for higher-order model is unavailableRecursively back-off to weaker models until data is availableWhere P* is a discounted probability estimate to reserve mass for unseen events and s are back-off weights (see book for details).Interpola

44、tionSimple interpolationLambdas conditional on context:How to set the lambdas?Use a held-out corpusChoose lambdas which maximize the probability of datai.e. fix the N-gram probabilitiesthen search for lambda values that,when plugged into previous equation,give largest probability for held-out setCan use EM (Expectation Maximization) to do this searchTraining DataHeld-Out DataTest DataIntuition of backoff+discountingHow much probability to assign

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
  • 4. 未經權益所有人同意不得將文件中的內容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
  • 6. 下載文件中如有侵權或不適當內容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論