Smoothing Methods in Language Models for Information Retrieval (Smoothing_Methods_for_LM_in_IR)
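Reducing Information Variation on Texts (Agata Savary and Christian Jacquemin)
Work on our QA Group, DFKI.

Keyword matching is not enough to find the best documents for a question. For example: "When was Albert Einstein born?"
- Albert Einstein was born in 1879.
- Born: 14 March 1879 in Ulm, Württemberg, Germany.
- Physics Nobel Prize: Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879.
- Died 18 Apr 1955 (born 14 Mar 1879), German-American physicist.

The same information can be found in several ways. 4 kinds of variation:
- Graphic: "14 Mar 1879"
- Morphological
- Syntactical: "German [...]"
- Semantic: "Albert Einstein", "German-American physicist"

Appropriateness:
- Precision: "1879" and "14 Mar 1879".
- Economy: "Physics Nobel Prize [...] born at Ulm" and "German-American physicist".

Language Models applied to Information Retrieval (Chengxiang Zhai and John Lafferty)

The problem is to estimate the probability $p(q \mid d)$ that a query $q = q_1 q_2 \ldots q_n$ was generated by a probabilistic model based on a document $d = d_1 d_2 \ldots d_m$.

\[ p(d \mid q) \propto p(q \mid d)\, p(d) \]

Unigram model:

\[ p(q \mid d) = \prod_{i=1}^{n} p(q_i \mid d), \qquad \log p(q \mid d) = \sum_{i=1}^{n} \log p(q_i \mid d) \]

The smoothing scheme makes use of two probabilities: $p_s(q_i \mid d)$ for words seen in the document and $p_u(q_i \mid d) = \alpha_d\, p(q_i \mid C)$ for unseen words, where $p(\cdot \mid C)$ is the collection model and $c(w; d)$ is the count of $w$ in $d$:

\[ \log p(q \mid d) = \sum_{i:\, c(q_i; d) > 0} \log \frac{p_s(q_i \mid d)}{\alpha_d\, p(q_i \mid C)} + n \log \alpha_d + \sum_{i=1}^{n} \log p(q_i \mid C) \]

Language Modeling

The role of smoothing is:
- To make the language model more accurate: adjust the MLE to compensate for data sparseness.
- To explain the non-informative words in the query.

Goal of the work:
- How sensitive is retrieval performance to the smoothing of a document LM?
- How should the model and the parameters be chosen?

Adjusting the MLE: the coefficient $\alpha_d$ is fixed by requiring the probabilities to sum to one over the vocabulary $V$:

\[ \alpha_d = \frac{1 - \sum_{w \in V:\, c(w; d) > 0} p_s(w \mid d)}{\sum_{w \in V:\, c(w; d) = 0} p(w \mid C)} \]

Smoothing: tackles the effect of the statistical variability of small training sets.
Discounting: the relative frequencies of seen events are discounted; the gained probability mass is then distributed over the unseen words.

Good-Turing: the probability of a term with frequency $tf$ is given by

\[ p_{GT}(t \mid d) = \frac{(tf + 1)\, S(N_{tf+1})}{S(N_{tf})\, N_d} \]

where $N_{tf}$ is the number of terms with frequency $tf$ in a document, $S(\cdot)$ is a smoothed value of that count, and $N_d$ is the total number of terms occurred in $d$.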
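The seen/unseen scheme with the coefficient α_d can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the multiplicative `discount` used for seen words, the toy collection model and the function names are all hypothetical and do not come from the slides. The sketch also assumes that at least one vocabulary word is unseen in the document, otherwise α_d is undefined.

```python
import math
from collections import Counter


def smoothed_model(doc_tokens, collection_prob, discount=0.9):
    """Return (p_seen, alpha_d) for one document.

    p_seen[w] is a discounted MLE for words seen in the document; the
    multiplicative `discount` is a hypothetical choice for illustration.
    alpha_d rescales the collection model on unseen words so that the
    document model sums to one over the vocabulary.
    """
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    p_seen = {w: discount * c / n for w, c in counts.items()}
    # Collection probability mass of the words that do not occur in the document.
    unseen_mass = sum(p for w, p in collection_prob.items() if w not in counts)
    alpha_d = (1.0 - sum(p_seen.values())) / unseen_mass
    return p_seen, alpha_d


def query_log_likelihood(query_tokens, p_seen, alpha_d, collection_prob):
    """log p(q|d) under the unigram model with seen/unseen probabilities."""
    total = 0.0
    for w in query_tokens:
        p = p_seen[w] if w in p_seen else alpha_d * collection_prob[w]
        total += math.log(p)
    return total


# Toy collection model (assumed values that sum to one).
collection_prob = {"einstein": 0.1, "born": 0.2, "ulm": 0.1,
                   "physics": 0.3, "prize": 0.3}
doc = ["einstein", "born", "ulm", "einstein"]

p_seen, alpha_d = smoothed_model(doc, collection_prob)
# Total probability mass of the smoothed document model over the vocabulary.
total_mass = sum(p_seen.values()) + alpha_d * sum(
    p for w, p in collection_prob.items() if w not in p_seen)
score = query_log_likelihood(["einstein", "physics"], p_seen, alpha_d,
                             collection_prob)
```

The query word "physics" does not occur in the document, yet it receives a non-zero probability through α_d p(w|C), so the query likelihood stays finite.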

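Jelinek-Mercer: the method involves a linear interpolation of the maximum likelihood model with the collection model.
Absolute discounting: decrease the probability of seen words by subtracting a constant from their counts.
The idea is to adjust the probabilities according to the query.

Summary of the methods, with $|d|$ the document length and $|d|_u$ the number of unique terms in $d$:
- Jelinek-Mercer: $p_s(w \mid d) = (1 - \lambda)\, p_{ml}(w \mid d) + \lambda\, p(w \mid C)$, with $\alpha_d = \lambda$.
- Dirichlet priors: $p_s(w \mid d) = \dfrac{c(w; d) + \mu\, p(w \mid C)}{|d| + \mu}$, with $\alpha_d = \dfrac{\mu}{|d| + \mu}$.
- Absolute discounting: $p_s(w \mid d) = \dfrac{\max(c(w; d) - \delta,\, 0)}{|d|} + \dfrac{\delta\, |d|_u}{|d|}\, p(w \mid C)$, with $\alpha_d = \dfrac{\delta\, |d|_u}{|d|}$.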
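As a rough sketch of how the smoothing methods named on these slides (Jelinek-Mercer, Dirichlet priors, absolute discounting) are usually written, the Python functions below compute the smoothed probability of a single word. The function names, the toy data and the default values of `mu` and `delta` are assumptions made for illustration; only λ = 0.1 is a value mentioned in the slides, and there only for title queries.

```python
from collections import Counter


def jelinek_mercer(w, counts, doc_len, p_coll, lam=0.1):
    """Linear interpolation of the ML estimate with the collection model."""
    p_ml = counts.get(w, 0) / doc_len
    return (1.0 - lam) * p_ml + lam * p_coll[w]


def dirichlet_prior(w, counts, doc_len, p_coll, mu=2000.0):
    """Add mu pseudo-counts distributed according to the collection model."""
    return (counts.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)


def absolute_discount(w, counts, doc_len, p_coll, delta=0.7):
    """Subtract a constant delta from every seen count and give the freed
    mass to the collection model."""
    unique_terms = len(counts)  # number of distinct terms in the document
    discounted = max(counts.get(w, 0) - delta, 0.0) / doc_len
    return discounted + delta * unique_terms / doc_len * p_coll[w]


# Toy data (assumed): a three-word vocabulary and a three-token document.
p_coll = {"a": 0.5, "b": 0.3, "c": 0.2}
doc = ["a", "a", "b"]
counts = Counter(doc)
doc_len = len(doc)

methods = {
    "jelinek_mercer": jelinek_mercer,
    "dirichlet_prior": dirichlet_prior,
    "absolute_discount": absolute_discount,
}
# Each smoothed model should remain a probability distribution.
totals = {
    name: sum(fn(w, counts, doc_len, p_coll) for w in p_coll)
    for name, fn in methods.items()
}
p_unseen_jm = jelinek_mercer("c", counts, doc_len, p_coll)
```

In all three cases the unseen word "c" gets a probability proportional to its collection probability, which is the role α_d plays in the general scheme.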

5 data sets:
- Financial Times on disk 4.
- FBIS on disk 5.
- Los Angeles Times on disk 5.
- Disk 4 and disk 5 minus Congressional Record.
- The TREC8 web data.

Queries:
- Topics 351-400 (TREC7 ad hoc task).
- Topics 401-450 (TREC8 ad hoc and web tasks).

Example topics:

Number: 384
Title: space station moon
Description: Identify documents that discuss the building of a space station with the intent of colonizing the moon.
Narrative: A relevant document will discuss the purpose of a space station, initiatives towards colonizing the moon, impediments which thus far have thwarted such a project, plans currently underway or in the planning stages for such a venture; cost, countries prepared to make a commitment of men, resources, facilities and money to accomplish such a feat.

Number: 414
Title: Cuba, sugar, exports
Description: How much sugar does Cuba export and which countries import it?
Narrative: A relevant document will provide information regarding Cuba's sugar trade. Sugar production statistics are not relevant unless exports are mentioned explicitly.

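Two different versions of each set of queries:
- Title only (2 or 3 words).
- A long version (title + description + narrative).

Each method is optimized by means of the non-interpolated average precision.

Term weights
- Jelinek-Mercer: $\log\left(1 + \dfrac{(1 - \lambda)\, p_{ml}(q_i \mid d)}{\lambda\, p(q_i \mid C)}\right)$
- Dirichlet priors: $\log\left(1 + \dfrac{c(q_i; d)}{\mu\, p(q_i \mid C)}\right)$
- Absolute discounting: $\log\left(1 + \dfrac{c(q_i; d) - \delta}{\delta\, |d|_u\, p(q_i \mid C)}\right)$. Here $\alpha_d$ is document-dependent: it is larger for a document with a flatter distribution of terms.

Conclusions

Jelinek-Mercer:
- The precision is much more sensitive to $\lambda$ for long queries than for title queries. Long queries need more smoothing, that is, less emphasis on the relative weighting of terms.
- In the web collection, performance was sensitive to smoothing for title queries too.
- For title queries, the retrieval performance tends to be optimized when $\lambda = 0.1$.

Dirichlet priors:
- Long queries compare favourably with title queries, especially when $\mu$ is large; for small $\mu$ it is the opposite.
- The optimal value of $\mu$ also differs between title queries and long queries.

Conclusions
- Stop-list.
- Porter stemmer.
- Large-span relationships in the language: the performance of the n-gram model has reached a plateau.
- The document prior $P(d)$.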
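The slides name non-interpolated average precision as the measure used to tune each smoothing method. A minimal sketch of that measure for a single query is given below, assuming the ranking is supplied as a list of relevance flags and the total number of relevant documents is known; the function name and the example ranking are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant):
    """Non-interpolated average precision for one ranked list.

    ranked_relevance: iterable of booleans, True where the document at that
    rank is relevant. num_relevant: total number of relevant documents for
    the query, including any that were not retrieved.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    return precision_sum / num_relevant if num_relevant else 0.0


# Relevant documents at ranks 1 and 3, two relevant documents in total.
ap_all_found = average_precision([True, False, True, False], 2)
# Same ranking, but a third relevant document was never retrieved.
ap_one_missed = average_precision([True, False, True, False], 3)
```

Tuning a smoothing parameter such as λ or μ then amounts to computing this value for every query at each candidate setting and comparing the means.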

Latent Semantic Analysis
- A low dimensional representation of the data (relation between features).
- PCA tries to find a low-rank approximation, where the quality of the approximation depends on how close the data is to lying in a subspace of the given dimensionality.
- Semantic information is extracted by means of the Singular Value Decomposition (SVD): $D = U \Sigma V^{T}$.
- LSI uses a reduction of the first $k$ columns of $U$.
- The eigenvectors for a set of documents can be viewed as concepts described by a linear combination of terms, chosen in such a way that the documents are described as accurately as possible using only $k$ such concepts.
- Terms that co-occur frequently will tend to align in the same eigenvectors.
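The claim that co-occurring terms align in the same eigenvector can be checked on a toy term-document matrix. The sketch below does not run a full SVD: to stay dependency-free it uses plain power iteration on D Dᵀ, which recovers only the first concept (the k = 1 case). The matrix, the term labels and the function name are assumptions made for illustration.

```python
import math


def top_concept(term_doc, iterations=200):
    """Leading left singular vector and singular value of a term-document
    matrix, found by power iteration on D D^T."""
    n_terms = len(term_doc)
    n_docs = len(term_doc[0])
    u = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iterations):
        # v = D^T u (project the term vector onto the documents)
        v = [sum(term_doc[i][j] * u[i] for i in range(n_terms))
             for j in range(n_docs)]
        # u = D v (map back to term space), then renormalize
        u_new = [sum(term_doc[i][j] * v[j] for j in range(n_docs))
                 for i in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in u_new))
        u = [x / norm for x in u_new]
    v = [sum(term_doc[i][j] * u[i] for i in range(n_terms))
         for j in range(n_docs)]
    sigma = math.sqrt(sum(x * x for x in v))
    return u, sigma


# Rows are terms, columns are documents (toy counts, assumed).
terms = ["car", "auto", "flower", "petal"]
term_doc = [
    [2, 1, 0, 0],  # car
    [1, 2, 0, 0],  # auto
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
]
concept, sigma = top_concept(term_doc)
```

In this example "car" and "auto" occur in the same documents and share the weight of the leading concept, while "flower" and "petal" contribute almost nothing to it. A library SVD routine would return all concepts at once; power iteration is used here only to keep the example short.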

SVD: what happens if we use smoothing?
- Problem: the classification of the documents in the new space when the collection is heterogeneous. If the documents belong to diverse topics, the classification of the ambiguous words in the new space is the [...]

Conclusions
- Smoothing methods are simple and efficient.
- They provide an elegant way to deal with the data sparseness problem.
- But they do not [...]
