Google Search and
Information Retrieval on the Internet
Zhi-Ming Ma
May 16, 2008
Email: mazm@
/member/mazhiming/index.html
About 626,000 results match the query for the Academy of Mathematics and Systems Science, Chinese Academy of Sciences; results 1-100 are shown below. (Search took 0.45 seconds.)
How can Google make a ranking of 626,000 pages in 0.45 seconds?
A main task of Internet (Web) Information Retrieval = Design and Analysis of Search Engine (SE) Algorithms, involving plenty of mathematics.
HITS, 1998: Jon Kleinberg, Cornell University
PageRank, 1998: Sergey Brin and Larry Page, Stanford University
Nevanlinna Prize (2006): Jon Kleinberg
One of Kleinberg's most important research achievements focuses on the internetwork structure of the World Wide Web. Prior to Kleinberg's work, search engines focused only on the content of web pages, not on the link structure. Kleinberg introduced the idea of "authorities" and "hubs": an authority is a web page that contains information on a particular topic, and a hub is a page that contains links to many authorities.
PageRank, the ranking system used by the Google search engine.
Query independent and content independent: it uses only the web graph structure.
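As a rough illustration of a ranking that uses only the web graph structure, the sketch below runs the standard power iteration for PageRank on a four-page toy graph. The graph `toy_web`, the function name `pagerank`, the damping factor 0.85 and the uniform redistribution of rank from dangling pages are assumptions made for this example, not details taken from the talk.

```python
def pagerank(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links[p])
        new = {p: (1.0 - alpha) / n + alpha * dangling / n for p in pages}
        # Every page passes a share of its rank along each out-link.
        for p in pages:
            for q in links[p]:
                new[q] += alpha * rank[p] / len(links[p])
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank


# Hypothetical toy web: page "d" has no in-links, page "c" collects most links.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

No query and no page content enter the computation; the scores depend only on who links to whom.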
PageRank as a Function of the Damping Factor
Paolo Boldi, Massimo Santini, Sebastiano Vigna
DSI, Università degli Studi di Milano
WWW 2005 paper
3 General Behaviour
3.1 Choosing the damping factor
3.2 Getting close to 1
Can we somehow characterise the properties of the limit of PageRank as the damping factor tends to 1? What makes it different from the other (infinitely many, if P is reducible) limit distributions of P?
Conjecture 1: the limit of PageRank as the damping factor tends to 1 is the limit distribution of P when the starting distribution is uniform.
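To get a feel for how PageRank depends on the damping factor, the following sketch computes the PageRank vector of one small reducible chain for several values of the damping factor, including values close to 1. The matrix `P`, the function name `pagerank_vector` and the chosen values are all assumptions for illustration; the example only displays the dependence and says nothing about whether the conjecture holds in general.

```python
def pagerank_vector(P, alpha, tol=1e-13, max_iter=200000):
    """Iterate r <- (1 - alpha)/n * 1 + alpha * r P until r stops changing."""
    n = len(P)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - alpha) / n] * n
        for i in range(n):
            for j in range(n):
                new[j] += alpha * r[i] * P[i][j]
        if sum(abs(a - b) for a, b in zip(new, r)) < tol:
            return new
        r = new
    return r


# Hypothetical reducible chain: {0, 1} and {3} are closed classes,
# page 2 is transient and leaks into both of them.
P = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]

by_alpha = {}
for alpha in (0.5, 0.85, 0.99, 0.999):
    by_alpha[alpha] = pagerank_vector(P, alpha)
    print(alpha, [round(x, 4) for x in by_alpha[alpha]])
```

In this toy chain the rank of the transient page 2 is (1 - alpha)/4 and shrinks to 0 as alpha approaches 1, while the closed classes absorb the remaining mass.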
Websites provide plenty of information:
Pages in the same website may share the same IP, run on the same web server and database server, and be authored/maintained by the same person or organization.
There might be high correlations between pages in the same website, in terms of content, page layout and hyperlinks.
Websites contain a higher density of hyperlinks inside them (about 75%) and a lower density of edges in between.
HostGraph loses much transition information
Can a surfer jump from page 5 of site 1 to a page in site 2?
From: s06-pc-chairs-email@ [mailto:s06-pc-chairs-
Sent: April 4, 2006, 8:36
To: Tie-Yan Liu; wangying@; fengg03@; ybao@; mazm@
Subject: [SIGIR2006] Your Paper #191
Title: AggregateRank: Bring Order to Web Sites
Congratulations!!
29th Annual International Conference on Research & Development on Information Retrieval (SIGIR'06, August 6–11, 2006, Seattle, Washington, USA).
Ranking Websites,
a Probabilistic View
Ying Bao, Gang Feng, Tie-Yan Liu, Zhi-Ming Ma, and Ying Wang
Internet Mathematics, Volume 3 (2007), Issue 3
We suggest evaluating the importance of a website with the mean frequency of visiting the website for the Markov chain on the Internet Graph describing random surfing.
We show that this mean frequency is equal to the sum of the PageRanks of all the webpages in that website (hence it is referred to as PageRankSum).
We propose a novel algorithm (AggregateRank Algorithm) based on the theory of stochastic complement to calculate the rank of a website. The AggregateRank Algorithm can approximate the PageRankSum accurately, while the corresponding computational complexity is much lower than that of PageRankSum.
By constructing return-time Markov chains restricted to each website, we also describe the probabilistic relation between PageRank and AggregateRank.
The complexity and the error bound of the AggregateRank Algorithm, with experiments on real data, are discussed at the end of the paper.
n webpages in N websites,
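Let $P(\alpha)$ denote the $n\times n$ transition matrix of the random surfer with damping factor $\alpha$, and let $P_{ij}(\alpha)$ denote its block of transition probabilities from the pages of website $S_i$ to the pages of website $S_j$.
The stationary distribution, known as the PageRank vector, is given by
$\pi(\alpha)P(\alpha)=\pi(\alpha)$, $\pi(\alpha)e=1$.
We may rewrite the stationary distribution as
$\pi(\alpha)=(\pi_1(\alpha),\pi_2(\alpha),\dots,\pi_N(\alpha))$
with $\pi_i(\alpha)$ as a row vector of length $n_i$.
We define the one-step transition probability from the website $S_i$ to the website $S_j$ by
$c_{ij}(\alpha)=\frac{\pi_i(\alpha)}{\|\pi_i(\alpha)\|_1}P_{ij}(\alpha)e$,
where $e$ is an $n_j$-dimensional column vector of all ones.
The $N\times N$ matrix $C(\alpha)=(c_{ij}(\alpha))$ is referred to as the coupling matrix, whose elements represent the transition probabilities between websites. It can be proved that $C(\alpha)$ is an irreducible stochastic matrix, so that it possesses a unique stationary probability vector. We use $\xi(\alpha)$ to denote this stationary probability, which can be obtained from
$\xi(\alpha)C(\alpha)=\xi(\alpha)$, $\xi(\alpha)e=1$.
Since
$\sum_{i=1}^{N}\|\pi_i(\alpha)\|_1\,c_{ij}(\alpha)=\sum_{i=1}^{N}\pi_i(\alpha)P_{ij}(\alpha)e=\|\pi_j(\alpha)\|_1$,
one can easily check that
$\xi(\alpha)=(\|\pi_1(\alpha)\|_1,\|\pi_2(\alpha)\|_1,\dots,\|\pi_N(\alpha)\|_1)$
is the unique solution to $\xi(\alpha)C(\alpha)=\xi(\alpha)$, $\xi(\alpha)e=1$.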
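The claim that the site-level sums of PageRank are stationary for the coupling matrix can be checked numerically on a toy example. The sketch below assumes that the coupling entry from site i to site j is the PageRank-weighted flow from the pages of site i into the pages of site j, divided by the total PageRank of site i; the five-page graph `P`, the split into two sites and all function names are hypothetical.

```python
def stationary(M, tol=1e-13, max_iter=100000):
    """Left stationary vector of a row-stochastic matrix by power iteration."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v


def damped_matrix(P, alpha):
    """Random-surfer matrix: alpha * P plus a uniform jump of weight 1 - alpha."""
    n = len(P)
    return [[alpha * P[i][j] + (1.0 - alpha) / n for j in range(n)]
            for i in range(n)]


def coupling_matrix(G, pi, sites):
    """Site-to-site transition probabilities weighted by page-level PageRank."""
    C = []
    for Si in sites:
        mass = sum(pi[k] for k in Si)  # total PageRank of site Si
        C.append([sum(pi[k] * G[k][l] for k in Si for l in Sj) / mass
                  for Sj in sites])
    return C


# Hypothetical web: pages 0-2 belong to site 0, pages 3-4 to site 1.
P = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
]
sites = [[0, 1, 2], [3, 4]]

G = damped_matrix(P, 0.85)
pi = stationary(G)                              # page-level PageRank
C = coupling_matrix(G, pi, sites)
xi = [sum(pi[k] for k in S) for S in sites]     # PageRank summed per site
xi_times_C = [sum(xi[i] * C[i][j] for i in range(len(sites)))
              for j in range(len(sites))]
print(xi)
print(xi_times_C)
```

The two printed vectors should agree up to numerical error, which is the stationarity property stated above.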
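We shall refer to $\xi(\alpha)$ as the AggregateRank. That is, the probability of visiting a website is equal to the sum of the PageRanks of all the pages in that website. This conclusion is consistent with our intuition.
The transition probability from $S_i$ to $S_j$ actually summarizes all the cases in which the random surfer jumps from any page in $S_i$ to any page in $S_j$ within a one-step transition. Therefore, the transition in this new HostGraph is in accordance with the real behavior of Web surfers. In this regard, the rank calculated from the coupling matrix $C(\alpha)$ will be more reasonable than those of previous works.
Let […] denote the number of visits to the website […] during the first n steps, that is, […] We have […]
Assume a starting state in website A, i.e. […] It is clear that all the variables […] are stopping times for X. We define […] and inductively […]
Let […] denote the transition matrix of the return-time Markov chain for site […] Similarly, we have […] Since […] Therefore […]
Suppose that AggregateRank, i.e. the stationary distribution of […], is […]
Based on the above discussions, the direct approach to computing the AggregateRank $\xi(\alpha)$ is to accumulate PageRank values (denoted by PageRankSum). However, this approach is infeasible because the computation of PageRank is not a trivial task when the number of webpages is as large as several billion. Therefore, efficient computation becomes a significant problem.
AggregateRank
1. Divide the $n\times n$ matrix $P(\alpha)$ (the page-level transition matrix) into $N\times N$ blocks according to the $N$ sites.
2. Construct the stochastic matrix $P^*_{ii}(\alpha)$ for $P_{ii}(\alpha)$ by changing the diagonal elements of $P_{ii}(\alpha)$ to make each row sum up to 1.
3. Determine $u_i(\alpha)$ from $u_i(\alpha)P^*_{ii}(\alpha)=u_i(\alpha)$, $u_i(\alpha)e=1$.
4. Form an approximation $C^*(\alpha)$ to the coupling matrix $C(\alpha)$, by evaluating $c^*_{ij}(\alpha)=u_i(\alpha)P_{ij}(\alpha)e$.
5. Determine the stationary distribution of $C^*(\alpha)$ and denote it $\xi^*(\alpha)$, i.e., $\xi^*(\alpha)C^*(\alpha)=\xi^*(\alpha)$, $\xi^*(\alpha)e=1$.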
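The five steps above can be sketched on a toy graph as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text of steps 3 and 4 is incomplete in the source, so the sketch assumes that step 3 takes the stationary vector of each adjusted diagonal block and that step 4 weights the row sums of each off-diagonal block by that vector. The matrix `P`, the site split, the damping factor 0.85 and all function names are hypothetical.

```python
def stationary(M, tol=1e-13, max_iter=100000):
    """Left stationary vector of a row-stochastic matrix by power iteration."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v


def damped_matrix(P, alpha):
    """Random-surfer matrix: alpha * P plus a uniform jump of weight 1 - alpha."""
    n = len(P)
    return [[alpha * P[i][j] + (1.0 - alpha) / n for j in range(n)]
            for i in range(n)]


def aggregate_rank(G, sites):
    """Approximate site ranks without computing page-level PageRank."""
    local = []
    for S in sites:
        # Steps 1-2: take the diagonal block of the site and raise each
        # diagonal entry so that every row of the block sums to 1.
        block = [[G[k][l] for l in S] for k in S]
        for r, row in enumerate(block):
            row[r] += 1.0 - sum(row)
        # Step 3 (assumed reading): stationary vector of the adjusted block.
        local.append(stationary(block))
    # Step 4 (assumed reading): approximate coupling matrix between sites.
    C_star = [[sum(local[i][a] * G[k][l]
                   for a, k in enumerate(Si) for l in Sj)
               for Sj in sites]
              for i, Si in enumerate(sites)]
    # Step 5: stationary distribution of the approximate coupling matrix.
    return stationary(C_star)


# Hypothetical web: pages 0-2 belong to site 0, pages 3-4 to site 1.
P = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
]
sites = [[0, 1, 2], [3, 4]]
G = damped_matrix(P, 0.85)

xi_star = aggregate_rank(G, sites)
pi = stationary(G)                                   # full PageRank, for comparison
pagerank_sum = [sum(pi[k] for k in S) for S in sites]
print("AggregateRank sketch:", xi_star)
print("PageRankSum:         ", pagerank_sum)
```

Only small per-site blocks and one site-level matrix are iterated in `aggregate_rank`; the full PageRank vector is computed here solely to compare the two results.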
Experiments
In our experiments, the data corpus is the benchmark data for the Web track of TREC 2003 and 2004, crawled from the .gov domain in the year of 2002. It contains 1,247,753 webpages. The largest website contains 137,103 webpages while the smallest one contains only 1 page.
Performance Evaluation of Ranking Algorithms based on Kendall's distance
Similarity between PageRankSum and the other three ranking results.
From: pcchairs@
Sent: Thursday, April 03, 2008 9:48 AM
Dear Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li
We are pleased to inform you that your paper
Title: BrowseRank: Letting Web Users Vote for Page Importance
has been accepted for oral presentation as a full paper and for publication as an eight-page paper in the proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development on Information Retrieval.
Congratulations!!
Building model
Properties of Q process:
Stationary distribution: […]
Jumping probability: […]
Embedded Markov chain: […] is a Markov chain with the transition probability matrix […]
Main conclusion 1
[…] is the mean of the staying time on page i. The more important a page is, the longer the staying time on it is.
[…] is the mean of the first re-visit time at page i. The more important a page is, the smaller the re-visit time is, and the larger the visit frequency is.
Main conclusion 2
[…] is the stationary distribution of […]
The stationary distribution of the discrete model is easy to compute
Power method for […]
Log data for […]
Further questions
How about inhomogeneous processes?
Statistical results show: different periods of time possess different visiting frequencies.
Poisson processes with different intensity.
Marked point process
Hyperlink is not reliable.
Users' real behavior should be considered.
Relevance Ranking
Many features for measuring relevance
Term distribution (anchor, URL, title, body, proximity, ...)
Recommendation & citation (PageRank, click-through data, ...)
Statistics or knowledge extracted from web data
Questions
What is the optimal ranking function to combine different features (or evidences)?
How to measure relevance?
Learning to Rank
What are the optimal weightings for combining the various features?
Use machine learning methods to learn the ranking function
Human relevance system (HRS)
Relevance verification tests (RVT)
Wei-Ying Ma, Microsoft Research Asia
Learning to Rank
[Diagram labels: Model, Learning System, Ranking System, min Loss]
Learning to Rank (Cont)
State-of-the-art algorithms for learning to rank take the pairwise approach
Ranking SVM
RankBoost
RankNet (employed at Live Search)
Breakdown
Learning to rank
The goal of learning to rank is to construct a real-valued function that can generate a ranking on the documents associated with the given query. The state-of-the-art methods transform the learning problem into that of classification and then perform the learning task:
For each query, it is assumed that there are two categories of documents: positive and negative (representing relevant and irrelevant with respect to the query). Then document pairs are constructed between positive documents and negative documents. In the training process, the query information is actually ignored.
[5] Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document retrieval. In Proc. of SIGIR'06, pages 186–193, 2006.
[11] T. Qin, T.-Y. Liu, M.-F. Tsai, X.-D. Zhang, and H. Li. Learning to search web pages with query-level loss functions. Technical Report MSR-TR-2006-156, 2006.
We re-represent the learning to rank problem by introducing the concepts of 'query' and 'distribution given query' into its mathematical formulation. More precisely, we assume that queries are drawn independently from a query space Q according to an (unknown) probability distribution […]
It should be noted that if […], then the bound makes sense. This condition can be satisfied in many practical cases.
As case studies, we investigate Ranking SVM and RankBoost. We show that after introducing query-level normalization to its objective function, Ranking SVM will have query-level stability. For RankBoost, the query-level stability can be achieved if we introduce both query-level normalization and regularization to its objective function. These analyses agree largely with our experiments and the experiments in [5] and [11].
Rank aggregation
Rank aggregation is to combine ranking results of entities from multiple ranking functions in order to generate a better one. The individual ranking functions are referred to as base rankers, or simply rankers.
Score-based aggregation
Rank aggregation can be classified into two categories [2]. In the first category, the entities in individual ranking lists are assigned scores and the rank aggregation function is assumed to use the scores (denoted as score-based aggregation) [11][18][28].
Order-based aggregation
In the second category, only the orders of the entities in individual ranking lists a