Bayesian Classification, KNN, Feature Selection, Evaluation
陳翀 (Ref: Peng Bo @ NC&IS @ PKU, WBIA lecture)
Lecture notes for the seminar "Statistical Machine Learning and Data Mining: Techniques and Methods"

Outline
- Text Categorization: problem definition
- Naive Bayes classifier
- K-Nearest Neighbor classifier
- Evaluation

Categorization: Definition
Given:
- A description of an instance, x ∈ X, where X is the instance language or instance space. (Issue: how to represent text documents.)
- A fixed set of categories C = {c1, c2, …, cn}.
Determine: the category of x, c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C. We want to know how to build such categorization functions ("classifiers").

Text Categorization Examples
Assign labels to each document or web page:
- Labels are most often topics, such as Yahoo categories: "finance", "sports", "news > world > asia > business"
- Labels may be genres: "editorials", "movie-reviews", "news"
- Labels may be opinion: "like", "hate", "neutral"
- Labels may be domain-specific binary: "interesting-to-me" vs. "not-interesting-to-me", "spam" vs. "not-spam", "contains adult language" vs. "doesn't"

Classification Methods
- Manual classification: used by Yahoo!, Looksmart, ODP, Medline. Accurate but expensive to scale.
- Automatic document classification:
  - Hand-coded rule-based systems (spam mail filters, …)
  - Supervised learning of a document-label assignment function. No free lunch: requires hand-classified training data.
- Note that many commercial systems use a mixture of methods.

Naive Bayes

Bernoulli trial: an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".

Binomial distribution: the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

Multinomial distribution: a generalization of the binomial distribution. Each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk, and there are n independent trials. We can use a random variable Xi to indicate the number of times outcome i was observed over the n trials.

Bayes' Rule
P(h|D) = P(D|h) P(h) / P(D)
(posterior = likelihood × prior / evidence)

Using Bayes' Rule to gamble: someone draws an envelope at random and offers to sell it to you. How much should you pay? Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it is red: now how much should you pay?

Prosecutor's fallacy: you win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials. At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely. (The fallacy: P(evidence | innocent) is confused with P(innocent | evidence).)

Likelihood
A likelihood function is a conditional probability function considered as a function of its second argument with its first argument held fixed. Given a parameterized family of probability density functions x ↦ f(x|θ), where θ is the parameter, the likelihood function is
L(θ|x) = f(x|θ),
where x is the observed outcome of an experiment. When f(x|θ) is viewed as a function of x with θ fixed, it is a probability density function; when viewed as a function of θ with x fixed, it is a likelihood function.
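The dual reading of f(x|θ) can be made concrete with a small numerical sketch (the data below are assumed for illustration, not from the slides): the same binomial formula, read as a likelihood in p and maximized over a grid.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for Binomial(n, p): a pmf in k when p is held fixed."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Observed outcome (assumed): 7 successes in 10 Bernoulli trials.
k, n = 7, 10

# The same function, now read as a likelihood L(p | k): a function of the
# parameter p with the observation k held fixed.
grid = [i / 100 for i in range(1, 100)]
likelihood = [binom_pmf(k, n, p) for p in grid]

# The grid maximizer matches the closed-form ML estimate p_hat = k / n.
p_hat = grid[likelihood.index(max(likelihood))]
print(p_hat)  # 0.7
```

This previews the maximum-likelihood estimation used to train Naive Bayes below: pick the parameter value under which the observed data are most probable.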
Maximum a posteriori (MAP) hypothesis
h_MAP = argmax_h P(h|D) = argmax_h P(D|h) P(h) / P(D) = argmax_h P(D|h) P(h), since P(D) is constant.

Maximum likelihood (ML) hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:
h_ML = argmax_h P(D|h).

Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values (x1, x2, …, xn), into one of the classes cj ∈ C:
c_MAP = argmax_{cj ∈ C} P(cj | x1, …, xn) = argmax_{cj ∈ C} P(x1, …, xn | cj) P(cj).

Naive Bayes Assumption
- P(cj) can be estimated from the frequency of classes in the training examples.
- P(x1, x2, …, xn | cj) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples were available.
- Conditional independence assumption: assume that the probability of observing the conjunction of attributes equals the product of the individual probabilities P(xi | cj). Example (Flu): attributes X1…X5 = fever, sinus, cough, runny nose, muscle-ache.

The Naive Bayes Classifier
With conditional independence, features detect term presence and are independent of each other given the class:
P(X1, …, X5 | C) = P(X1|C) · P(X2|C) · … · P(X5|C).
This model is appropriate for binary variables: the multivariate binomial (Bernoulli) model.

Learning the Model
First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:
P̂(cj) = N(C = cj) / N
P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

Problem with maximum likelihood: what if we have seen no training cases where a patient had no flu but did have muscle aches? Then the estimated P(muscle-ache | no-flu) is zero, and zero probabilities cannot be conditioned away, no matter the other evidence!

Smoothing to eliminate zeros: add-one (Laplace) smoothing,
P̂(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + k),
where k is the number of values of Xi. This acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.

Document Generative Model
Example document: "Love is patient, love is kind."
Basic representation: bag of words. Two generative models:
- A binary independence model (multivariate binomial generation): feature Xi is a term; the value Xi = 1 or 0 indicates whether term Xi is present in the document or not.
- A multinomial unigram language model (multinomial generation): feature Xi is a term position; the value of Xi is the term at position i; position independent.
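The multinomial model with Laplace smoothing can be sketched in a few lines. The four training documents below are assumed for illustration (in the style of the China/not-China worked example these notes reference); the classifier itself follows the estimates defined above.

```python
from collections import Counter, defaultdict
from math import log

# Toy training corpus (assumed): label = whether the doc is about China.
train = [("Chinese Beijing Chinese".split(), "China"),
         ("Chinese Chinese Shanghai".split(), "China"),
         ("Chinese Macao".split(), "China"),
         ("Tokyo Japan Chinese".split(), "not-China")]

vocab = {w for doc, _ in train for w in doc}
doc_count = Counter(c for _, c in train)   # for the prior P(c)
tokens = defaultdict(Counter)              # per-class term counts
for doc, c in train:
    tokens[c].update(doc)

def log_posterior(doc, c):
    """log P(c) + sum_i log P(w_i | c), with add-one (Laplace) smoothing."""
    total = sum(tokens[c].values())
    lp = log(doc_count[c] / len(train))
    for w in doc:
        lp += log((tokens[c][w] + 1) / (total + len(vocab)))
    return lp

test_doc = "Chinese Chinese Chinese Tokyo Japan".split()
pred = max(doc_count, key=lambda c: log_posterior(test_doc, c))
print(pred)  # China
```

Working in log space avoids floating-point underflow when documents are long: products of many small probabilities become sums of logs.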
Bernoulli Naive Bayes Classifiers (multivariate binomial model)
- One feature Xw for each word in the dictionary; Xw = true in document d if w appears in d.
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.

Multinomial Naive Bayes Classifiers
- Multinomial = class-conditional unigram model.
- One feature Xi for each word position in the document; a feature's values are all words in the dictionary; the value of Xi is the word in position i.
- Naive Bayes assumption: given the document's topic, the word in one position tells us nothing about the words in other positions. Still too many possibilities.
- Second assumption: classification is independent of the positions of the words, i.e., word appearance does not depend on position: P(Xi = w | c) = P(Xj = w | c) for all positions i, j, words w, and classes c. So there is just one multinomial feature predicting all words, with the same parameters for each position.

Parameter estimation
- Binomial (Bernoulli) model: P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears.
- Multinomial model: P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj. One can create a mega-document for topic j by concatenating all documents on this topic and use the frequency of w in the mega-document.

Naive Bayes algorithms for the multinomial and Bernoulli models differ only in these estimates and in whether absent terms contribute factors (pseudocode in IIR Ch. 13).

NB Example: c(5) = ?
Given four labeled training documents and test document 5:
- Multinomial NB classifier: estimate the feature likelihoods, compute the posterior. Result: c(5) = China.
- Bernoulli NB classifier: estimate the feature likelihoods, compute the posterior. Result: c(5) ≠ China.
Multinomial vs. multivariate binomial? The multinomial model is in general better.

Feature Selection

Feature selection: why?
- Text collections have a large number of features: 10,000 – 1,000,000 unique words, and more.
- May make using a particular classifier feasible: some classifiers can't deal with 100,000s of features.
- Reduces training time: training time for some methods is quadratic or worse in the number of features.
- Can improve generalization (performance): eliminates noise features, avoids overfitting.

Feature selection: how?
Two ideas:
- Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? (Chi-square test)
- Information theory: how much information does the value of one categorical variable give you about the value of another? (Mutual information)
They're similar, but χ² measures confidence in association (based on available statistics), while MI measures the extent of association (assuming perfect knowledge of the probabilities).

Pearson's chi-square test
Example: is a six-sided die "fair"? Null hypothesis: the die is fair; then each possible outcome has a theoretical (expected) frequency. Chi-square is calculated as
χ² = Σi (Oi − Ei)² / Ei,
where Oi is the observed frequency and Ei the theoretical frequency.

χ² statistic (CHI)
Null hypothesis: the events "term" and "class" are independent. Example (term = jaguar, class = auto); observed counts fo, with expected counts fe in parentheses:

                 Term = jaguar   Term ≠ jaguar
  Class = auto       2 (0.25)      500 (502)
  Class ≠ auto       3 (4.75)     9500 (9498)

χ² ≈ 12.9, so the null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence).

There is a simpler formula for a 2×2 χ². With A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D:
χ²(t, c) = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D)).
Example: the class "poultry" and the term "export" — count the number of documents with each of the four possible combinations.

Feature selection via Mutual Information
- In the training set, choose the k words which best discriminate (give most information on) the categories.
- MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c. The mutual information between a word t and a class c is:
I(t; c) = Σ_{et ∈ {0,1}} Σ_{ec ∈ {0,1}} P(et, ec) log [ P(et, ec) / (P(et) P(ec)) ].
- For each category we build a list of the k most discriminating terms. For example (on 20 Newsgroups):
  - sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
  - rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
- Greedy: does not account for correlations between terms. Why is that a problem?

Feature selection: comparison
- Mutual information: clear information-theoretic interpretation.
- Chi-square: statistical foundation; may select more rare terms than MI.
- Just use the commonest terms? No particular foundation, but in practice this is often 90% as good.

Feature selection for NB
- In general, feature selection is necessary for the binomial (Bernoulli) NB; otherwise you suffer from noise and multi-counting.
- "Feature selection" really means something different for multinomial NB: it means dictionary truncation. The multinomial NB model only has one feature.

K-Nearest Neighbors

Recall: vector space representation
- Each document is a vector, one component for each term (= word), normalized to unit length.
- High-dimensional vector space: terms are axes; 10,000+ dimensions, or even 100,000+; docs are vectors in this space.

Classification using vector spaces
- Each training doc is a point (vector) labeled by its topic (= class).
- Hypothesis: docs of the same class form a contiguous region of space, so we define surfaces to delineate classes. (Is this similarity hypothesis true in general?)
- Example: a test document is assigned "Government" if it falls in the Government region rather than Science or Arts.

k-Nearest Neighbor classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d.
- Count the number i of documents in N that belong to c.
- Estimate P(c|d) as i/k.
- Choose as class argmax_c P(c|d), i.e., the majority class. Example: k = 6 (6NN); what is P(science | test doc)?

Nearest-neighbor learning algorithm
- Learning is just storing the representations of the training examples in D.
- Testing instance x: compute the similarity between x and all examples in D; assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes. Also called: case-based learning, memory-based learning, lazy learning.

Why k?
Using only the closest example to determine the categorization is subject to errors due to a single atypical example, or noise (i.e., error) in the category label of a single training example. A more robust alternative is to find the k most-similar examples and return the majority category of these k examples. The value of k is typically odd to avoid ties; 3 and 5 are most common.

kNN decision boundaries are in principle arbitrary surfaces, but usually polyhedra.

Similarity metrics
- The nearest-neighbor method depends on a similarity (or distance) metric.
- Simplest for a continuous m-dimensional instance space: Euclidean distance.
- Simplest for an m-dimensional binary instance space: Hamming distance (the number of feature values that differ).
- For text, cosine similarity of tf.idf-weighted vectors is typically most effective.

Nearest neighbor with an inverted index
- Naively, finding nearest neighbors requires a linear search through the |D| documents in the collection.
- But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query against a database of training documents. Use standard vector-space inverted-index methods to find the k nearest neighbors.
- Testing time: O(B |Vt|), where B is the average number of training documents in which a test-document word appears. Typically B << |D|.

kNN: discussion
- No feature selection necessary; no training necessary.
- Scales well with a large number of classes: no need to train n classifiers for n classes.
- Classes can influence each other: small changes to one class can have a ripple effect.
- Scores can be hard to convert to probabilities.

Bias vs. variance: choosing the correct model capacity

kNN vs. Naive Bayes: the bias/variance tradeoff (variance ≈ capacity)
- kNN has high variance and low bias (infinite memory).
- NB has low variance and high bias (the decision surface has to be linear, a hyperplane).

Summary
- Categorization: training data; overfitting and generalization.
- Naive Bayes: Bayesian methods; Bernoulli NB classifier; multinomial NB classifier.
- K-Nearest Neighbor; bias vs. variance.
- Feature selection: chi-square test; mutual information.

Readings
[1] IIR Ch. 13, Ch. 14.2
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.

Classification Evaluation

Evaluation: classic Reuters data set
The most (over)used data set:
- 21,578 documents; 9,603 training and 3,299 test articles (ModApte split).
- 118 categories; an article can be in more than one category, so we learn 118 binary category distinctions.
- Average document: about 90 types, 200 tokens.
- Average number of classes assigned: 1.24 for docs with at least one category.
- Only about 10 of the 118 categories are large.
Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56).

A sample document from the Reuters text categorization dataset (Reuters-21578):

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE>CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT>
</REUTERS>

New Reuters: RCV1, 810,000 docs; see the top topics in Reuters RCV1.

北大天網(wǎng) (PKU Tianwang): Chinese web page classification
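A minimal sketch of pulling the label fields out of a Reuters-21578 record like the one above (a regex sketch over an abbreviated copy of the sample, assuming the standard field layout; a real parser would use an SGML-aware library):

```python
import re

# Abbreviated Reuters-21578 record (from the sample above; body elided).
record = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<BODY>The American Pork Congress kicks off tomorrow, March 3 ...</BODY>
</TEXT>
</REUTERS>"""

# Each <D>...</D> inside <TOPICS> is one category label; a document may
# carry several, which is why Reuters gives 118 *binary* decisions.
topics = re.findall(r"<D>(.*?)</D>", record)
title = re.search(r"<TITLE>(.*?)</TITLE>", record).group(1)
split = re.search(r'LEWISSPLIT="(\w+)"', record).group(1)
print(topics, split)  # ['livestock', 'hog'] TRAIN
```

The LEWISSPLIT attribute is what standard train/test partitions such as ModApte are derived from.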
By recruiting dozens of students from different majors, a large-scale Chinese web page sample set based on a hierarchical category model was constructed by hand. It comprises 12,336 training web page instances and 3,269 test instances, distributed over 12 top-level classes and 733 categories in total; each category has on average 17 training instances and 4.5 test instances. Tianwang provides the sample set free of charge to interested researchers (燕穹 product ID: YQ-WEBENCH-V0.8). Classification evaluations based on it are run at the Chinese Information Retrieval Forum and the national SEWM (Search Engine and Web Mining) workshop.

Table 11-1. Distribution of categories and instances in the sample set:

  No.  Class                    #categories  #train  #test
  1    Humanities & Arts             24        419     110
  2    News & Media                   7        125      19
  3    Business & Economy            48        839     214
  4    Entertainment & Leisure       88       1510     374
  5    Computers & Internet          58        925     238
  6    Education                     18        286      85
  7    Countries & Cultures          53        891     235
  8    Natural Sciences             113       1892     514
  9    Government & Politics         18        288      84
  10   Social Sciences              104       1765     479
  11   Health & Medicine            136       2295     616
  12   Society & Culture             66       1101     301
       Total                        733      12336    3269
Measuring classification: figures of merit
- Accuracy of classification: the main evaluation criterion in academia (more in a moment).
- Speed of training the statistical classifier: some methods are very cheap, some very costly.
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions are kNN and methods with complex preprocessing requirements.
- Effort in creating the training set / hand-built classifier: human hours per topic.

In the real world, the measures are economic. Your choices are:
- Do no classification: that has a cost (hard to compute).
- Do it all manually: has an easy-to-compute cost if you are already doing it that way.
- Do it all with an automatic classifier: mistakes have a cost.
- Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases.
Commonly the last method is most cost-efficient and is the one adopted.

Concept drift
Categories change over time. Example: "president of the united states" — in 1999 "clinton" is a great feature; in 2002 "clinton" is a bad feature. One measure of a text classification system is how well it protects against concept drift. Feature selection can hurt robustness against concept drift.

Per-class evaluation measures
Using a confusion matrix c over the classes (rows = actual class, columns = predicted class), with c_ij the number of documents of actual class i predicted as class j:
- Recall: fraction of docs in class i classified correctly: R_i = c_ii / Σ_j c_ij.
- Precision: fraction of docs assigned class i that are actually about class i: P_i = c_ii / Σ_j c_ji.
- "Correct rate" (1 − error rate): fraction of docs classified correctly: Σ_i c_ii / Σ_i Σ_j c_ij.

Measures of accuracy
- Evaluation must be done on test data that is independent of the training data.
- Overall error rate is not a good measure for small classes. Why? A classifier that never predicts a rare class can still achieve a low overall error rate while being useless on that class.
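The small-class problem can be seen in a worked confusion matrix (the counts below are assumed for illustration; class "C" is deliberately rare and never predicted):

```python
# Confusion matrix over three classes (rows = actual, cols = predicted).
labels = ["A", "B", "C"]
conf = [[90,  5, 0],   # actual A
        [ 4, 95, 0],   # actual B
        [ 3,  3, 0]]   # actual C: rare, and never predicted

N = sum(sum(row) for row in conf)
accuracy = sum(conf[i][i] for i in range(3)) / N

for i, name in enumerate(labels):
    row = sum(conf[i])                        # docs actually in class i
    col = sum(conf[j][i] for j in range(3))   # docs assigned class i
    recall = conf[i][i] / row
    precision = conf[i][i] / col if col else 0.0
    print(name, round(precision, 2), round(recall, 2))

print(round(accuracy, 3))  # 0.925 -- looks fine, yet recall on C is 0.0
```

Overall accuracy is 92.5%, but per-class recall exposes that class C is never recovered: exactly why per-class precision/recall (and macro-averages over them) are preferred for skewed class distributions.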