Bayesian Classification, KNN, Feature Selection, Evaluation
陳翀 (Ref: Peng Bo @ NC&IS @ PKU, WBIA lecture)
Lecture notes for the seminar "Statistical Machine Learning and Data Mining: Techniques and Methods"

Outline
- Text Categorization: problem definition
- Naive Bayes classifier
- K-Nearest Neighbor classifier
- Evaluation

Categorization: Definition
Given:
- A description of an instance, x ∈ X, where X is the instance language or instance space. (Issue: how to represent text documents.)
- A fixed set of categories C = {c1, c2, …, cn}.
Determine: the category of x, c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C. We want to know how to build such categorization functions ("classifiers").

Text Categorization Examples
Assign labels to each document or web page:
- Labels are most often topics, such as Yahoo categories: "finance", "sports", "news > world > asia > business"
- Labels may be genres: "editorials", "movie-reviews", "news"
- Labels may be opinion: "like", "hate", "neutral"
- Labels may be domain-specific binary: "interesting-to-me" vs. "not-interesting-to-me", "spam" vs. "not-spam", "contains adult language" vs. "doesn't"

Classification Methods
- Manual classification: used by Yahoo!, Looksmart, ODP, Medline. Accurate but expensive to scale.
- Automatic document classification:
  - Hand-coded rule-based systems (spam mail filters, …)
  - Supervised learning of a document-label assignment function. No free lunch: requires hand-classified training data.
- Note that many commercial systems use a mixture of methods.

Naive Bayes

Bernoulli trial: an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".

Binomial distribution: the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

Multinomial distribution: a generalization of the binomial distribution. Each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk, and there are n independent trials. We can use a random variable Xi to indicate the number of times outcome i was observed over the n trials.

Bayes' Rule
P(h|D) = P(D|h) P(h) / P(D)
(posterior = likelihood × prior / evidence)

Using Bayes' Rule to gamble: someone draws an envelope at random and offers to sell it to you. How much should you pay? Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it is red: now how much should you pay?

Prosecutor's fallacy: you win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials. At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely. (The fallacy: P(evidence | innocent) is confused with P(innocent | evidence).)

Likelihood
A likelihood function is a conditional probability function considered as a function of its second argument with its first argument held fixed. Given a parameterized family of probability density functions x ↦ f(x|θ), where θ is the parameter, the likelihood function is
L(θ|x) = f(x|θ),
where x is the observed outcome of an experiment. When f(x|θ) is viewed as a function of x with θ fixed, it is a probability density function; when viewed as a function of θ with x fixed, it is a likelihood function.
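The dual reading of f(x|θ) can be made concrete with a small numerical sketch (the data below are assumed for illustration, not from the slides): the same binomial formula, read as a likelihood in p and maximized over a grid.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for Binomial(n, p): a pmf in k when p is held fixed."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Observed outcome (assumed): 7 successes in 10 Bernoulli trials.
k, n = 7, 10

# The same function, now read as a likelihood L(p | k): a function of the
# parameter p with the observation k held fixed.
grid = [i / 100 for i in range(1, 100)]
likelihood = [binom_pmf(k, n, p) for p in grid]

# The grid maximizer matches the closed-form ML estimate p_hat = k / n.
p_hat = grid[likelihood.index(max(likelihood))]
print(p_hat)  # 0.7
```

This previews the maximum-likelihood estimation used to train Naive Bayes below: pick the parameter value under which the observed data are most probable.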
Maximum a posteriori (MAP) hypothesis
h_MAP = argmax_h P(h|D) = argmax_h P(D|h) P(h) / P(D) = argmax_h P(D|h) P(h), since P(D) is constant.

Maximum likelihood (ML) hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:
h_ML = argmax_h P(D|h).

Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values (x1, x2, …, xn), into one of the classes cj ∈ C:
c_MAP = argmax_{cj ∈ C} P(cj | x1, …, xn) = argmax_{cj ∈ C} P(x1, …, xn | cj) P(cj).

Naive Bayes Assumption
- P(cj) can be estimated from the frequency of classes in the training examples.
- P(x1, x2, …, xn | cj) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples were available.
- Conditional independence assumption: assume that the probability of observing the conjunction of attributes equals the product of the individual probabilities P(xi | cj). Example (Flu): attributes X1…X5 = fever, sinus, cough, runny nose, muscle-ache.

The Naive Bayes Classifier
With conditional independence, features detect term presence and are independent of each other given the class:
P(X1, …, X5 | C) = P(X1|C) · P(X2|C) · … · P(X5|C).
This model is appropriate for binary variables: the multivariate binomial (Bernoulli) model.

Learning the Model
First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:
P̂(cj) = N(C = cj) / N
P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

Problem with maximum likelihood: what if we have seen no training cases where a patient had no flu but did have muscle aches? Then the estimated P(muscle-ache | no-flu) is zero, and zero probabilities cannot be conditioned away, no matter the other evidence!

Smoothing to eliminate zeros: add-one (Laplace) smoothing,
P̂(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + k),
where k is the number of values of Xi. This acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.

Document Generative Model
Example document: "Love is patient, love is kind."
Basic representation: bag of words. Two generative models:
- A binary independence model (multivariate binomial generation): feature Xi is a term; the value Xi = 1 or 0 indicates whether term Xi is present in the document or not.
- A multinomial unigram language model (multinomial generation): feature Xi is a term position; the value of Xi is the term at position i; position independent.
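The multinomial model with Laplace smoothing can be sketched in a few lines. The four training documents below are assumed for illustration (in the style of the China/not-China worked example these notes reference); the classifier itself follows the estimates defined above.

```python
from collections import Counter, defaultdict
from math import log

# Toy training corpus (assumed): label = whether the doc is about China.
train = [("Chinese Beijing Chinese".split(), "China"),
         ("Chinese Chinese Shanghai".split(), "China"),
         ("Chinese Macao".split(), "China"),
         ("Tokyo Japan Chinese".split(), "not-China")]

vocab = {w for doc, _ in train for w in doc}
doc_count = Counter(c for _, c in train)   # for the prior P(c)
tokens = defaultdict(Counter)              # per-class term counts
for doc, c in train:
    tokens[c].update(doc)

def log_posterior(doc, c):
    """log P(c) + sum_i log P(w_i | c), with add-one (Laplace) smoothing."""
    total = sum(tokens[c].values())
    lp = log(doc_count[c] / len(train))
    for w in doc:
        lp += log((tokens[c][w] + 1) / (total + len(vocab)))
    return lp

test_doc = "Chinese Chinese Chinese Tokyo Japan".split()
pred = max(doc_count, key=lambda c: log_posterior(test_doc, c))
print(pred)  # China
```

Working in log space avoids floating-point underflow when documents are long: products of many small probabilities become sums of logs.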
Bernoulli Naive Bayes Classifiers (multivariate binomial model)
- One feature Xw for each word in the dictionary; Xw = true in document d if w appears in d.
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.

Multinomial Naive Bayes Classifiers
- Multinomial = class-conditional unigram model.
- One feature Xi for each word position in the document; a feature's values are all words in the dictionary; the value of Xi is the word in position i.
- Naive Bayes assumption: given the document's topic, the word in one position tells us nothing about the words in other positions. Still too many possibilities.
- Second assumption: classification is independent of the positions of the words, i.e., word appearance does not depend on position: P(Xi = w | c) = P(Xj = w | c) for all positions i, j, words w, and classes c. So there is just one multinomial feature predicting all words, with the same parameters for each position.

Parameter estimation
- Binomial (Bernoulli) model: P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears.
- Multinomial model: P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj. One can create a mega-document for topic j by concatenating all documents on this topic and use the frequency of w in the mega-document.

Naive Bayes algorithms for the multinomial and Bernoulli models differ only in these estimates and in whether absent terms contribute factors (pseudocode in IIR Ch. 13).

NB Example: c(5) = ?
Given four labeled training documents and test document 5:
- Multinomial NB classifier: estimate the feature likelihoods, compute the posterior. Result: c(5) = China.
- Bernoulli NB classifier: estimate the feature likelihoods, compute the posterior. Result: c(5) ≠ China.
Multinomial vs. multivariate binomial? The multinomial model is in general better.

Feature Selection

Feature selection: why?
- Text collections have a large number of features: 10,000 – 1,000,000 unique words, and more.
- May make using a particular classifier feasible: some classifiers can't deal with 100,000s of features.
- Reduces training time: training time for some methods is quadratic or worse in the number of features.
- Can improve generalization (performance): eliminates noise features, avoids overfitting.

Feature selection: how?
Two ideas:
- Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? (Chi-square test)
- Information theory: how much information does the value of one categorical variable give you about the value of another? (Mutual information)
They're similar, but χ² measures confidence in association (based on available statistics), while MI measures the extent of association (assuming perfect knowledge of the probabilities).

Pearson's chi-square test
Example: is a six-sided die "fair"? Null hypothesis: the die is fair; then each possible outcome has a theoretical (expected) frequency. Chi-square is calculated as
χ² = Σi (Oi − Ei)² / Ei,
where Oi is the observed frequency and Ei the theoretical frequency.

χ² statistic (CHI)
Null hypothesis: the events "term" and "class" are independent. Example (term = jaguar, class = auto); observed counts fo, with expected counts fe in parentheses:

                 Term = jaguar   Term ≠ jaguar
  Class = auto       2 (0.25)      500 (502)
  Class ≠ auto       3 (4.75)     9500 (9498)

χ² ≈ 12.9, so the null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence).

There is a simpler formula for a 2×2 χ². With A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D:
χ²(t, c) = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D)).
Example: the class "poultry" and the term "export" — count the number of documents with each of the four possible combinations.

Feature selection via Mutual Information
- In the training set, choose the k words which best discriminate (give most information on) the categories.
- MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c. The mutual information between a word t and a class c is:
I(t; c) = Σ_{et ∈ {0,1}} Σ_{ec ∈ {0,1}} P(et, ec) log [ P(et, ec) / (P(et) P(ec)) ].
- For each category we build a list of the k most discriminating terms. For example (on 20 Newsgroups):
  - sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
  - rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
- Greedy: does not account for correlations between terms. Why is that a problem?

Feature selection: comparison
- Mutual information: clear information-theoretic interpretation.
- Chi-square: statistical foundation; may select more rare terms than MI.
- Just use the commonest terms? No particular foundation, but in practice this is often 90% as good.

Feature selection for NB
- In general, feature selection is necessary for the binomial (Bernoulli) NB; otherwise you suffer from noise and multi-counting.
- "Feature selection" really means something different for multinomial NB: it means dictionary truncation. The multinomial NB model only has one feature.

K-Nearest Neighbors

Recall: vector space representation
- Each document is a vector, one component for each term (= word), normalized to unit length.
- High-dimensional vector space: terms are axes; 10,000+ dimensions, or even 100,000+; docs are vectors in this space.

Classification using vector spaces
- Each training doc is a point (vector) labeled by its topic (= class).
- Hypothesis: docs of the same class form a contiguous region of space, so we define surfaces to delineate classes. (Is this similarity hypothesis true in general?)
- Example: a test document is assigned "Government" if it falls in the Government region rather than Science or Arts.

k-Nearest Neighbor classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d.
- Count the number i of documents in N that belong to c.
- Estimate P(c|d) as i/k.
- Choose as class argmax_c P(c|d), i.e., the majority class. Example: k = 6 (6NN); what is P(science | test doc)?

Nearest-neighbor learning algorithm
- Learning is just storing the representations of the training examples in D.
- Testing instance x: compute the similarity between x and all examples in D; assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes. Also called: case-based learning, memory-based learning, lazy learning.

Why k?
Using only the closest example to determine the categorization is subject to errors due to a single atypical example, or noise (i.e., error) in the category label of a single training example. A more robust alternative is to find the k most-similar examples and return the majority category of these k examples. The value of k is typically odd to avoid ties; 3 and 5 are most common.

kNN decision boundaries are in principle arbitrary surfaces, but usually polyhedra.

Similarity metrics
- The nearest-neighbor method depends on a similarity (or distance) metric.
- Simplest for a continuous m-dimensional instance space: Euclidean distance.
- Simplest for an m-dimensional binary instance space: Hamming distance (the number of feature values that differ).
- For text, cosine similarity of tf.idf-weighted vectors is typically most effective.

Nearest neighbor with an inverted index
- Naively, finding nearest neighbors requires a linear search through the |D| documents in the collection.
- But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query against a database of training documents. Use standard vector-space inverted-index methods to find the k nearest neighbors.
- Testing time: O(B |Vt|), where B is the average number of training documents in which a test-document word appears. Typically B << |D|.

kNN: discussion
- No feature selection necessary; no training necessary.
- Scales well with a large number of classes: no need to train n classifiers for n classes.
- Classes can influence each other: small changes to one class can have a ripple effect.
- Scores can be hard to convert to probabilities.

Bias vs. variance: choosing the correct model capacity

kNN vs. Naive Bayes: the bias/variance tradeoff (variance ≈ capacity)
- kNN has high variance and low bias (infinite memory).
- NB has low variance and high bias (the decision surface has to be linear, a hyperplane).

Summary
- Categorization: training data; overfitting and generalization.
- Naive Bayes: Bayesian methods; Bernoulli NB classifier; multinomial NB classifier.
- K-Nearest Neighbor; bias vs. variance.
- Feature selection: chi-square test; mutual information.

Readings
[1] IIR Ch. 13, Ch. 14.2
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.

Classification Evaluation

Evaluation: classic Reuters data set
The most (over)used data set:
- 21,578 documents; 9,603 training and 3,299 test articles (ModApte split).
- 118 categories; an article can be in more than one category, so we learn 118 binary category distinctions.
- Average document: about 90 types, 200 tokens.
- Average number of classes assigned: 1.24 for docs with at least one category.
- Only about 10 of the 118 categories are large.
Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56).

A sample document from the Reuters text categorization dataset (Reuters-21578):

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE>CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT>
</REUTERS>

New Reuters: RCV1, 810,000 docs; see the top topics in Reuters RCV1.

北大天網(wǎng) (PKU Tianwang): Chinese web page classification
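A minimal sketch of pulling the label fields out of a Reuters-21578 record like the one above (a regex sketch over an abbreviated copy of the sample, assuming the standard field layout; a real parser would use an SGML-aware library):

```python
import re

# Abbreviated Reuters-21578 record (from the sample above; body elided).
record = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<BODY>The American Pork Congress kicks off tomorrow, March 3 ...</BODY>
</TEXT>
</REUTERS>"""

# Each <D>...</D> inside <TOPICS> is one category label; a document may
# carry several, which is why Reuters gives 118 *binary* decisions.
topics = re.findall(r"<D>(.*?)</D>", record)
title = re.search(r"<TITLE>(.*?)</TITLE>", record).group(1)
split = re.search(r'LEWISSPLIT="(\w+)"', record).group(1)
print(topics, split)  # ['livestock', 'hog'] TRAIN
```

The LEWISSPLIT attribute is what standard train/test partitions such as ModApte are derived from.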
By recruiting dozens of students from different majors, a large-scale Chinese web page sample set based on a hierarchical category model was constructed by hand. It comprises 12,336 training web page instances and 3,269 test instances, distributed over 12 top-level classes and 733 categories in total; each category has on average 17 training instances and 4.5 test instances. Tianwang provides the sample set free of charge to interested researchers (燕穹 product ID: YQ-WEBENCH-V0.8). Classification evaluations based on it are run at the Chinese Information Retrieval Forum and the national SEWM (Search Engine and Web Mining) workshop.

Table 11-1. Distribution of categories and instances in the sample set:

  No.  Class                    #categories  #train  #test
  1    Humanities & Arts             24        419     110
  2    News & Media                   7        125      19
  3    Business & Economy            48        839     214
  4    Entertainment & Leisure       88       1510     374
  5    Computers & Internet          58        925     238
  6    Education                     18        286      85
  7    Countries & Cultures          53        891     235
  8    Natural Sciences             113       1892     514
  9    Government & Politics         18        288      84
  10   Social Sciences              104       1765     479
  11   Health & Medicine            136       2295     616
  12   Society & Culture             66       1101     301
       Total                        733      12336    3269
Measuring classification: figures of merit
- Accuracy of classification: the main evaluation criterion in academia (more in a moment).
- Speed of training the statistical classifier: some methods are very cheap, some very costly.
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions are kNN and methods with complex preprocessing requirements.
- Effort in creating the training set / hand-built classifier: human hours per topic.

In the real world, the measures are economic. Your choices are:
- Do no classification: that has a cost (hard to compute).
- Do it all manually: has an easy-to-compute cost if you are already doing it that way.
- Do it all with an automatic classifier: mistakes have a cost.
- Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases.
Commonly the last method is most cost-efficient and is the one adopted.

Concept drift
Categories change over time. Example: "president of the united states" — in 1999 "clinton" is a great feature; in 2002 "clinton" is a bad feature. One measure of a text classification system is how well it protects against concept drift. Feature selection can hurt robustness against concept drift.

Per-class evaluation measures
Using a confusion matrix c over the classes (rows = actual class, columns = predicted class), with c_ij the number of documents of actual class i predicted as class j:
- Recall: fraction of docs in class i classified correctly: R_i = c_ii / Σ_j c_ij.
- Precision: fraction of docs assigned class i that are actually about class i: P_i = c_ii / Σ_j c_ji.
- "Correct rate" (1 − error rate): fraction of docs classified correctly: Σ_i c_ii / Σ_i Σ_j c_ij.

Measures of accuracy
- Evaluation must be done on test data that is independent of the training data.
- Overall error rate is not a good measure for small classes. Why? A classifier that never predicts a rare class can still achieve a low overall error rate while being useless on that class.
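The small-class problem can be seen in a worked confusion matrix (the counts below are assumed for illustration; class "C" is deliberately rare and never predicted):

```python
# Confusion matrix over three classes (rows = actual, cols = predicted).
labels = ["A", "B", "C"]
conf = [[90,  5, 0],   # actual A
        [ 4, 95, 0],   # actual B
        [ 3,  3, 0]]   # actual C: rare, and never predicted

N = sum(sum(row) for row in conf)
accuracy = sum(conf[i][i] for i in range(3)) / N

for i, name in enumerate(labels):
    row = sum(conf[i])                        # docs actually in class i
    col = sum(conf[j][i] for j in range(3))   # docs assigned class i
    recall = conf[i][i] / row
    precision = conf[i][i] / col if col else 0.0
    print(name, round(precision, 2), round(recall, 2))

print(round(accuracy, 3))  # 0.925 -- looks fine, yet recall on C is 0.0
```

Overall accuracy is 92.5%, but per-class recall exposes that class C is never recovered: exactly why per-class precision/recall (and macro-averages over them) are preferred for skewed class distributions.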