Bayesian Classification, KNN, Feature Selection, Evaluation
陳翀 (Chen Chong)
Ref: Peng Bo @ NC&IS @ PKU, WBIA lecture
Handout for the seminar "Statistical Machine Learning and Data Mining: Techniques and Methods"

Outline of This Lecture
- Text categorization: problem definition
- Naive Bayes classifier
- K-nearest neighbor classifier
- Evaluation

Categorization: Definition
- Given: a description of an instance, x ∈ X, where X is the instance language or instance space.
  (Issue: how to represent text documents.)
- A fixed set of categories: C = {c1, c2, ..., cn}.
- Determine: the category of x, c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
- We want to know how to build categorization functions ("classifiers").

Text Categorization Examples
Assign labels to each document or web page:
- Labels are most often topics, such as Yahoo! categories: e.g., "finance", "sports", "news > world > asia > business".
- Labels may be genres: e.g., "editorials", "movie-reviews", "news".
- Labels may be opinions: e.g., "like", "hate", "neutral".
- Labels may be domain-specific and binary: e.g., "interesting-to-me" vs. "not-interesting-to-me"; "spam" vs. "not-spam"; "contains adult language" vs. "doesn't".

Classification Methods
- Manual classification
  - Used by Yahoo!, Looksmart, ODP, Medline.
  - Accurate, but expensive to scale.
- Automatic document classification
  - Hand-coded rule-based systems: spam mail filters, ...
  - Supervised learning of a document-label assignment function. No free lunch: requires hand-classified training data.
- Note that many commercial systems use a mixture of methods.

Naive Bayes

Bernoulli Trial
A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".

Binomial Distribution
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

Multinomial Distribution
The multinomial distribution is a generalization of the binomial distribution: each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, ..., pk, and there are n independent trials.
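For reference, the two probability mass functions just described can be written out explicitly (standard textbook forms; the formulas on the original slides are not recoverable):

    P(X = k) = \binom{n}{k} p^k (1-p)^{n-k},  k = 0, 1, \ldots, n

    P(X_1 = n_1, \ldots, X_k = n_k) = \frac{n!}{n_1! \cdots n_k!} p_1^{n_1} \cdots p_k^{n_k},  n_1 + \cdots + n_k = n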

We can use a random variable Xi to indicate the number of times outcome number i was observed over the n trials.

Bayes' Rule
P(h | D) = P(D | h) P(h) / P(D),
where P(h) is the prior, P(D | h) is the likelihood, and P(h | D) is the posterior.

Use Bayes' Rule to Gamble
Someone draws an envelope at random and offers to sell it to you. How much should you pay? Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it is red: how much should you pay now?

Prosecutor's Fallacy
You win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials. At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely. (The fallacy: this confuses P(evidence | innocent), which is tiny, with P(innocent | evidence), which also depends on the prior probability of cheating.)

Likelihood
A likelihood function is a conditional probability function considered as a function of its second argument, with its first argument held fixed. Given a parameterized family of probability density functions f(x | θ), where θ is the parameter, the likelihood function is
L(θ | x) = f(x | θ),
where x is the observed outcome of an experiment. When f(x | θ) is viewed as a function of x with θ fixed, it is a probability density function; when viewed as a function of θ with x fixed, it is a likelihood function.
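The slides do not give the envelope contents, so the following sketch invents numbers purely to illustrate the Bayesian update (the $10 prize, the bead counts, and the 50/50 prior are all assumptions):

    # Worked Bayes-rule sketch for the envelope game above.
    # Assumed setup: half the envelopes hold a $10 prize, half hold nothing;
    # prize envelopes contain 3 red and 1 blue bead, empty ones 1 red and 3 blue.
    p_prize = 0.5                 # prior P(prize)
    p_red_given_prize = 3 / 4     # likelihood P(red | prize)
    p_red_given_empty = 1 / 4     # likelihood P(red | empty)

    # Evidence: total probability of drawing a red bead.
    p_red = p_red_given_prize * p_prize + p_red_given_empty * (1 - p_prize)

    # Bayes' rule: posterior P(prize | red).
    p_prize_given_red = p_red_given_prize * p_prize / p_red

    print(f"fair price before the draw:  ${10 * p_prize:.2f}")            # $5.00
    print(f"fair price after seeing red: ${10 * p_prize_given_red:.2f}")  # $7.50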

Maximum a Posteriori Hypothesis
h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D).
As P(D) is constant:
h_MAP = argmax_{h∈H} P(D | h) P(h).

Maximum Likelihood Hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D | h) term:
h_ML = argmax_{h∈H} P(D | h).

Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values D = <x1, x2, ..., xn>, into one of the classes cj ∈ C:
c_MAP = argmax_{cj∈C} P(cj | x1, ..., xn) = argmax_{cj∈C} P(x1, ..., xn | cj) P(cj).

Naive Bayes Assumption
- P(cj) can be estimated from the frequency of classes in the training examples.
- P(x1, x2, ..., xn | cj) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples was available.
- Naive Bayes conditional independence assumption: assume that the probability of observing the conjunction of attributes equals the product of the individual probabilities:
  P(x1, ..., xn | cj) = ∏i P(xi | cj).
- (Figure: class Flu with features X1 = fever, X2 = sinus, X3 = cough, X4 = runny nose, X5 = muscle-ache.)

The Naive Bayes Classifier
- Conditional independence assumption: features detect term presence and are independent of each other given the class.
- This model is appropriate for binary variables: the multivariate binomial model.

Learning the Model
- First attempt: maximum likelihood estimates; simply use the frequencies in the data:
  P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj).

Problem with Max Likelihood
- What if we have seen no training cases where a patient had flu but no muscle aches? Then the estimate of P(no muscle-ache | flu) is zero.
- Zero probabilities cannot be conditioned away, no matter the other evidence!

Smoothing to Eliminate Zeros
- Add-one smoothing (Laplace smoothing): add one to every count; the denominator grows by k, the number of values of X.
- This acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.
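In the standard add-one form (the exact notation on the original slide is not recoverable), the smoothed estimate is

    \hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}

where k is the number of values that X_i can take.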

Document Generative Model
"Love is patient, love is kind."
- Basic: bag of words, a binary independence model.
- Multivariate binomial generation: feature Xi is a term; the value Xi = 1 or 0 indicates whether term Xi is present in the doc or not.
- A multinomial unigram language model: feature Xi is a term position; the value of Xi is the term at position i; position independent.
- Vocabulary of the example: love, is, patient, kind.
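A minimal sketch of the two document representations just described, applied to the example sentence (the tokenization and the extra vocabulary word "envy" are illustrative assumptions):

    import re
    from collections import Counter

    doc = "Love is patient, love is kind."
    vocab = ["love", "is", "patient", "kind", "envy"]   # "envy" is absent on purpose

    tokens = re.findall(r"[a-z]+", doc.lower())

    # Multivariate binomial (Bernoulli) view: one 0/1 feature per dictionary term.
    binary = {w: int(w in set(tokens)) for w in vocab}

    # Multinomial unigram view: positions collapse into term counts.
    counts = Counter(tokens)

    print(tokens)   # ['love', 'is', 'patient', 'love', 'is', 'kind']
    print(binary)   # {'love': 1, 'is': 1, 'patient': 1, 'kind': 1, 'envy': 0}
    print(counts)   # Counter({'love': 2, 'is': 2, 'patient': 1, 'kind': 1})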

Bernoulli Naive Bayes Classifiers
- Multivariate binomial model: one feature Xw for each word in the dictionary; Xw = true in document d if w appears in d.
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.

Multinomial Naive Bayes Classifiers
- Multinomial = class-conditional unigram.
- One feature Xi for each word position in the document; the feature's values are all the words in the dictionary; the value of Xi is the word in position i.
- Naive Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the words in other positions.
- Still too many possibilities. Second assumption: word appearance does not depend on position:
  P(Xi = w | c) = P(Xj = w | c) for all positions i, j, words w, and classes c.
- Just have one multinomial feature predicting for all words; use the same parameters for each position.

Parameter Estimation
- Binomial model: P̂(Xw = t | cj) = the fraction of documents of topic cj in which word w appears.
- Multinomial model: P̂(Xi = w | cj) = the fraction of times word w appears across all documents of topic cj. One can create a mega-document for topic j by concatenating all documents on this topic, and use the frequency of w in the mega-document.

Naive Bayes Algorithm (Multinomial Model); Naive Bayes Algorithm (Bernoulli Model)
(The training and classification procedure is sketched in code below.)

NB Example
- Test document 5: c(5) = ?
- Multinomial NB classifier: from the feature likelihood estimates and the posterior, the result is c(5) = China.
- Bernoulli NB classifier: from the feature likelihood estimates and the posterior, the result is c(5) ≠ China.

Classification: Multinomial vs. Multivariate Binomial?
- Multinomial is in general better.
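The documents of the worked example are not recoverable from these slides; the sketch below uses the standard training/test set from IIR Ch. 13 (the assigned reading), which reproduces the stated multinomial result c(5) = China. The function and variable names are mine:

    import math
    from collections import Counter

    # Assumed data, following the IIR Ch. 13 worked example.
    train = [
        ("Chinese Beijing Chinese".split(), "China"),
        ("Chinese Chinese Shanghai".split(), "China"),
        ("Chinese Macao".split(), "China"),
        ("Tokyo Japan Chinese".split(), "not-China"),
    ]
    test = "Chinese Chinese Chinese Tokyo Japan".split()   # document 5

    vocab = {w for doc, _ in train for w in doc}
    classes = {c for _, c in train}

    def train_multinomial(train):
        prior, cond = {}, {}
        for c in classes:
            docs = [doc for doc, label in train if label == c]
            prior[c] = len(docs) / len(train)
            mega = Counter(w for doc in docs for w in doc)   # "mega-document" counts
            total = sum(mega.values())
            # Add-one (Laplace) smoothing over the vocabulary.
            cond[c] = {w: (mega[w] + 1) / (total + len(vocab)) for w in vocab}
        return prior, cond

    def classify(doc, prior, cond):
        scores = {c: math.log(prior[c])
                     + sum(math.log(cond[c][w]) for w in doc if w in vocab)
                  for c in classes}
        return max(scores, key=scores.get)

    prior, cond = train_multinomial(train)
    print(classify(test, prior, cond))   # -> China

The Bernoulli variant differs in estimating P(Xw = 1 | cj) per document rather than per token, and in multiplying in (1 − P) for absent vocabulary words; that is why it reaches the opposite verdict on this test document.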

Feature Selection

Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more.
- Selection may make using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features.
- It reduces training time: training time for some methods is quadratic or worse in the number of features.
- It can improve generalization (performance): eliminates noise features and avoids overfitting.

Feature Selection: How?
Two ideas:
- Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? (Chi-square test.)
- Information theory: how much information does the value of one categorical variable give you about the value of another? (Mutual information.)
They're similar, but χ² measures confidence in the association (based on available statistics), while MI measures the extent of the association (assuming perfect knowledge of the probabilities).

Pearson's Chi-Square Test
- Example: is a six-sided die "fair"?
- Null hypothesis: the die is fair; then there is a theoretical frequency for each possible outcome.
- Chi-square is calculated by
  χ² = Σi (Oi − Ei)² / Ei,
  where Oi is the observed frequency and Ei is the theoretical (expected) frequency.

χ² Statistic (CHI)
- Null hypothesis: the events "term" and "class" are independent.
- Example: term = jaguar, class = auto; observed counts fo, with expected counts fe in parentheses:

                  Term = jaguar   Term ≠ jaguar
  Class = auto        2 (0.25)       500 (502)
  Class ≠ auto        3 (4.75)      9500 (9498)

- The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the χ² value for .999 confidence).
- Another example: take the class poultry and the term export, and count the number of documents with each of the four possible combinations.
- There is a simpler formula for a 2×2 table. With A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D:
  χ²(t, c) = N (AD − CB)² / [(A + C)(B + D)(A + B)(C + D)].
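A minimal check of the jaguar/auto numbers above using the 2×2 shortcut (the slide rounds the result to 12.9):

    # chi-square for a 2x2 term/class contingency table
    A, B, C, D = 2, 3, 500, 9500   # (t,c), (t,~c), (~t,c), (~t,~c)
    N = A + B + C + D
    chi2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
    print(round(chi2, 2))   # 12.85 > 10.83, so reject independence at .999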

Feature Selection via Mutual Information
- In the training set, choose the k words which best discriminate (give most information on) the categories.
- MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c.
- The mutual information between a word (presence indicator e_t) and a class (membership indicator e_c) is:
  I(U; C) = Σ_{e_t∈{0,1}} Σ_{e_c∈{0,1}} P(e_t, e_c) log [ P(e_t, e_c) / (P(e_t) P(e_c)) ].
- Example: MI can be computed directly from the four term/class counts, as in the sketch after this list of slides.

Feature Selection via MI (contd.)
- For each category we build a list of the k most discriminating terms. For example (on 20 Newsgroups):
  - sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
  - rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
- Greedy: it does not account for correlations between terms. Why?

Feature Selection: Comparison
- Mutual information: clear information-theoretic interpretation.
- Chi-square: statistical foundation; may select more rare terms than MI.
- Just use the commonest terms? No particular foundation, but in practice this is often 90% as good.

Feature Selection for NB
- In general, feature selection is necessary for the binomial (Bernoulli) NB model; otherwise you suffer from noise and multi-counting.
- "Feature selection" really means something different for multinomial NB: it means dictionary truncation. The multinomial NB model only has one (multinomial) feature.
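A sketch of the MI score computed from the same four term/class counts used for χ² above (jaguar/auto), with maximum-likelihood probability estimates; the cell-to-marginal bookkeeping is the only thing added here:

    import math

    def mutual_information(A, B, C, D):
        """I(U;C) from 2x2 counts: A=(t,c), B=(t,~c), C=(~t,c), D=(~t,~c)."""
        N = A + B + C + D
        mi = 0.0
        # (cell count, term marginal, class marginal) for the four cells
        for n_tc, n_t, n_c in [(A, A + B, A + C), (B, A + B, B + D),
                               (C, C + D, A + C), (D, C + D, B + D)]:
            if n_tc:   # empty cells contribute 0 in the limit
                mi += (n_tc / N) * math.log2(n_tc * N / (n_t * n_c))
        return mi

    print(mutual_information(2, 3, 500, 9500))   # ~0.0004 bits: weakly informative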
K-Nearest Neighbors

Recall: Vector Space Representation
- Each document is a vector, one component for each term (= word); normalize to unit length.
- High-dimensional vector space: terms are axes; 10,000+ dimensions, or even 100,000+; docs are vectors in this space.

Classes in a Vector Space
(Figure: training documents plotted in term space, grouped into the classes Government, Science, and Arts.)

Classification Using Vector Spaces
- Each training doc is a point (vector) labeled by its topic (= class).
- Hypothesis: docs of the same class form a contiguous region of space.
- We define surfaces to delineate classes in space.
- (Figure: a test document falling in the Government region is labeled Government.) Is the similarity hypothesis true in general?

k-Nearest Neighbor Classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d.
- Count the number i of documents in N that belong to c.
- Estimate P(c | d) as i / k.
- Choose as class argmax_c P(c | d) (= the majority class).
- Example with k = 6 (6NN): estimate P(science | d) from the six neighbors of d.

Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in D.
- Testing instance x: compute the similarity between x and all examples in D; assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called: case-based learning, memory-based learning, lazy learning.

Why k?
- Using only the closest example to determine the categorization is subject to errors due to: a single atypical example; noise (i.e., error) in the category label of a single training example.
- The more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
- The value of k is typically odd to avoid ties; 3 and 5 are most common.

kNN Decision Boundaries
- Boundaries are in principle arbitrary surfaces, but usually polyhedra.

Similarity Metrics
- The nearest-neighbor method depends on a similarity (or distance) metric.
- The simplest for a continuous m-dimensional instance space is Euclidean distance.
- The simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
- For text, cosine similarity of tf.idf-weighted vectors is typically most effective.
- (Figure: illustration of 3-nearest-neighbor classification in a text vector space.)

Nearest Neighbor with Inverted Index
- Naively, finding nearest neighbors requires a linear search through the |D| documents in the collection.
- But determining the k nearest neighbors is the same as determining the k best retrievals, using the test document as a query against a database of training documents.
- Use standard vector-space inverted-index methods to find the k nearest neighbors.
- Testing time: O(B·|V_t|), where V_t is the vocabulary of the test document and B is the average number of training documents in which a test-document word appears; typically B << |D|.

kNN: Discussion
- No feature selection necessary; no training necessary.
- Scales well with a large number of classes: you don't need to train n classifiers for n classes.
- Classes can influence each other: small changes to one class can have a ripple effect.
- Scores can be hard to convert to probabilities.
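A minimal kNN sketch over unit-length term vectors, as described above. It uses raw term counts and brute-force search instead of tf.idf weights and an inverted index, and the tiny training set is invented for illustration:

    import math
    from collections import Counter

    def unit_vector(tokens):
        """Term-count vector, normalized to unit length."""
        v = Counter(tokens)
        norm = math.sqrt(sum(c * c for c in v.values()))
        return {t: c / norm for t, c in v.items()}

    def cosine(u, v):
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    def knn_classify(test_tokens, train, k=3):
        x = unit_vector(test_tokens)
        sims = sorted(((cosine(x, unit_vector(doc)), label) for doc, label in train),
                      reverse=True)
        top = [label for _, label in sims[:k]]
        return Counter(top).most_common(1)[0][0]   # majority class among the k nearest

    train = [
        ("tax bill senate vote".split(), "Government"),
        ("election president policy".split(), "Government"),
        ("quantum physics experiment".split(), "Science"),
        ("genome biology lab".split(), "Science"),
        ("painting gallery sculpture".split(), "Arts"),
    ]
    print(knn_classify("senate vote on election policy".split(), train, k=3))  # Government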

Bias vs. Variance: Choosing the Correct Model Capacity

kNN vs. Naive Bayes
- Bias/variance tradeoff; variance ≈ capacity.
- kNN has high variance and low bias (infinite memory).
- NB has low variance and high bias (the decision surface has to be linear, a hyperplane).

Summary
- Categorization: training data; over-fitting and generalization.
- Naive Bayes: Bayesian methods; Bernoulli NB classifier; multinomial NB classifier.
- K-nearest neighbor; bias vs. variance.
- Feature selection: chi-square test; mutual information.

Readings
[1] IIR, Ch. 13 and Ch. 14.2.
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.

Classification Evaluation

Evaluation: Classic Reuters Data Set
- The most (over)used data set: 21,578 documents.
- 9,603 training and 3,299 test articles (the ModApte split).
- 118 categories; an article can be in more than one category; learn 118 binary category distinctions.
- Average document: about 90 types, 200 tokens.
- Average number of classes assigned: 1.24 for docs with at least one category.
- Only about 10 of the 118 categories are large. Common categories (#train, #test):

- Earn (2877, 1087)
- Acquisitions (1650, 179)
- Money-fx (538, 179)
- Grain (433, 149)
- Crude (389, 189)
- Trade (369, 119)
- Interest (347, 131)
- Ship (197, 89)
- Wheat (212, 71)
- Corn (182, 56)

Reuters Text Categorization Dataset (Reuters-21578): A Sample Document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE>CHICAGO, March 2 -</DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three-day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter</BODY>
</TEXT>
</REUTERS>

New Reuters: RCV1 — 810,000 docs
(Figure: the top topics in Reuters RCV1.)

PKU Tianwang (北大天網(wǎng)): Chinese Web Page Classification

- By recruiting dozens of students from different majors, a large-scale Chinese web page sample set based on a hierarchical model was assembled by manual selection.
- It contains 12,336 training web page instances and 3,269 test web page instances, distributed over 12 top-level classes and 733 categories in total; each category has on average 17 training instances and 4.5 test instances.
- Tianwang provides the sample set free of charge to interested researchers; Yanqiong product number: YQ-WEBENCH-V0.8.
- It is used for classification evaluation at the Chinese Information Retrieval Forum and at SEWM, the national workshop on search engines and web mining.

Table 11-1: Distribution of categories and instances in the sample set

  No.  Top-level class            #Categories  #Train  #Test
  1    Humanities & Arts               24         419    110
  2    News & Media                     7         125     19
  3    Business & Economy              48         839    214
  4    Entertainment & Leisure         88        1510    374
  5    Computers & Internet            58         925    238
  6    Education                       18         286     85
  7    Countries & Cultures            53         891    235
  8    Natural Science                113        1892    514
  9    Government & Politics           18         288     84
  10   Social Science                 104        1765    479
  11   Health & Medicine              136        2295    616
  12   Society & Culture               66        1101    301
       Total                          733       12336   3269

Measuring Classification: Figures of Merit
- Accuracy of classification: the main evaluation criterion in academia (more in a moment).
- Speed of training a statistical classifier: some methods are very cheap, some very costly.
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions: kNN and methods with complex preprocessing requirements.
- Effort in creating the training set or hand-built classifier: human hours per topic.

Measuring Classification: Figures of Merit (contd.)
In the real world, economic measures. Your choices are:
- Do no classification: that has a cost (hard to compute).
- Do it all manually: an easy-to-compute cost, if you are doing it that way now.
- Do it all with an automatic classifier: mistakes have a cost.
- Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases.
Commonly the last method is the most cost-efficient and is adopted.

Concept Drift
- Categories change over time.
- Example: "president of the united states". In 1999, "clinton" is a great feature; in 2002, "clinton" is a bad feature.
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection can be bad at protecting against concept drift.

Per-Class Evaluation Measures
With a confusion matrix c over classes (say A, B, C), where c_ij counts the documents of actual class i that are predicted as class j:
- Recall: the fraction of docs in class i classified correctly: c_ii / Σ_j c_ij.
- Precision: the fraction of docs assigned class i that are actually about class i: c_ii / Σ_j c_ji.
- "Correct rate" (1 − error rate): the fraction of docs classified correctly: Σ_i c_ii / Σ_i Σ_j c_ij.
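A sketch of these per-class measures computed from a confusion matrix; the 3×3 counts are made up for illustration:

    # rows = actual class, columns = predicted class
    classes = ["A", "B", "C"]
    c = [[80, 10, 10],
         [ 5, 40,  5],
         [ 5,  5, 10]]

    total = sum(sum(row) for row in c)
    correct = sum(c[i][i] for i in range(len(classes)))
    print(f"correct rate: {correct / total:.3f}")   # 0.765

    for i, name in enumerate(classes):
        recall = c[i][i] / sum(c[i])                    # row sum: actual class i
        precision = c[i][i] / sum(row[i] for row in c)  # column sum: predicted class i
        print(f"{name}: recall={recall:.2f} precision={precision:.2f}")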

Measures of Accuracy
- Evaluation must be done on test data that is independent of the training data.
- The overall error rate is not a good measure for small classes. Why? (See the sketch below.)
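A minimal illustration of the answer (the prevalence numbers are assumed): a classifier that never predicts a rare class can still post an excellent overall error rate.

    # Suppose 20 of 1,000 test docs belong to the rare class "poultry".
    n, n_rare = 1000, 20

    # A trivial classifier that always says "not poultry" is useless for the
    # rare class, yet its overall error rate looks excellent:
    print(f"overall error rate: {n_rare / n:.1%}")   # 2.0%
    # Its recall on the rare class, however, is 0/20 = 0.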
