Foundations of Machine Learning
Decision Tree

Top 10 algorithms in data mining:
- C4.5
- K-Means
- SVM
- Apriori
- EM (Maximum Likelihood)
- PageRank
- AdaBoost
- KNN
- Naïve Bayes
- CART

Main classification methods:
- Logistic Regression
- Linear Discriminant Analysis
- Decision Tree Induction
- Nearest Neighbor
- Bayes Classification Methods
- Classification by Backpropagation
- Support Vector Machines
- Ensemble Methods
- ...
Illustrating Classification Task

Example of a Decision Tree

Another Example of Decision Tree

Decision Tree Classification Task

Apply Model to Test Data
Decision Tree Classification Task

Algorithm for Decision Tree Induction

Many algorithms:
- Hunt's Algorithm (one of the earliest)
- ID3 (Iterative Dichotomiser)
- C4.5
- CART (Classification and Regression Trees)
- SLIQ (Supervised Learning In Quest)
- SPRINT (Scalable PaRallelizable INduction of decision Trees)
- ...

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (continuous-valued attributes are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
- There are no samples left.
(A minimal code sketch of this recursive procedure is given at the end of this section.)

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion (i.e., split on the best attribute).

Issues:
- How to select the best attribute?
- How to split the records?
- How to determine the best split?
- When to stop splitting?

How to split the records? It depends on the attribute type (nominal, ordinal, continuous) and on the number of ways to split (2-way split or multi-way split).

Splitting based on nominal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting based on ordinal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting based on continuous attributes:
- Discretization to form an ordinal categorical attribute.
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. This can be more computation intensive.
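To make the basic top-down procedure concrete, here is a minimal sketch of greedy induction over categorical attributes. It is illustrative only, not any particular algorithm from the list above; TreeNode, majority_class, and the pluggable score_split measure are hypothetical names (attribute-selection measures such as information gain and the Gini index are defined later).

from collections import Counter

class TreeNode:
    def __init__(self, label=None, attribute=None):
        self.label = label          # class label for leaf nodes
        self.attribute = attribute  # attribute tested at internal nodes
        self.children = {}          # attribute value -> subtree

def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes, score_split):
    # Stopping condition 1: all samples belong to the same class
    if len(set(labels)) == 1:
        return TreeNode(label=labels[0])
    # Stopping condition 2: no attributes left -> majority voting
    if not attributes:
        return TreeNode(label=majority_class(labels))
    # Greedy step: pick the attribute whose split scores best
    best = max(attributes, key=lambda a: score_split(rows, labels, a))
    node = TreeNode(attribute=best)
    remaining = [a for a in attributes if a != best]
    # Partition the examples by the value of the chosen attribute
    for value in set(row[best] for row in rows):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node.children[value] = build_tree(list(sub_rows), list(sub_labels),
                                          remaining, score_split)
    return node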
How to Determine the Best Split

Before splitting (an example node to be split): 10 records of class C0 and 10 records of class C1.

Greedy approach: nodes with a homogeneous class distribution are preferred.
We need a measure of node impurity (purity).
Measures of Node Impurity
- Entropy
- Gini Index
- ...

Brief Review of Entropy
For a binary variable (m = 2), the entropy curve peaks at 1 bit when the two values are equally likely and falls to 0 when one value is certain.

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:
    Info(D) = - sum_{i=1..m} p_i * log2(p_i)

Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = sum_{j=1..v} (|D_j| / |D|) * Info(D_j)

Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)

Information gain is the difference between the entropy of the parent node and the weighted average of the child nodes' entropies.

Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples), so
    Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
The partition "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence
    Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
    Gain(age) = Info(D) - Info_age(D) = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
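As an illustration (not part of the original slides), the quantities above can be checked with a few lines of Python; the helper names info and info_after_split are hypothetical.

from math import log2

def info(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):
    """Weighted entropy after a split; partitions is a list of count lists."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

# buys_computer example: 9 "yes" and 5 "no" tuples overall
info_D = info([9, 5])                                   # 0.940
# Splitting on age gives partitions with (yes, no) counts (2,3), (4,0), (3,2)
info_age = info_after_split([[2, 3], [4, 0], [3, 2]])   # 0.694
gain_age = info_D - info_age                            # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))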
Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split point, and D2 is the set of tuples in D satisfying A > split point.

Gain Ratio for Attribute Selection (C4.5)

The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):
    SplitInfo_A(D) = - sum_{j=1..v} (|D_j| / |D|) * log2(|D_j| / |D|)
    GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex. gain_ratio(income) = 0.029 / 1.557 = 0.019. The attribute with the maximum gain ratio is selected as the splitting attribute.

Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index gini(D) is defined as
    gini(D) = 1 - sum_{j=1..n} p_j^2
where p_j is the relative frequency of class j in D. If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as
    gini_A(D) = (|D1| / |D|) * gini(D1) + (|D2| / |D|) * gini(D2)
Reduction in impurity:
    Δgini(A) = gini(D) - gini_A(D)
The attribute providing the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (for CART, all the possible splitting points for each attribute need to be enumerated).

Computation of Gini Index

Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":
    gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
    gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2)
Gini{low,high} is 0.458 and Gini{medium,high} is 0.450. Thus, split on {low, medium} (and {high}), since it has the lowest Gini index. (A worked computation is sketched in the code below.)

Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the other.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions.
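The Gini computation above can be sketched in a few lines. This is an illustration, not part of the original slides; the per-subset (yes, no) counts (7, 3) and (2, 2) are assumptions for the example, since the slides only give the subset sizes.

def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of a split; partitions is a list of count lists."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * gini(p) for p in partitions)

# Whole data set: 9 "yes" and 5 "no" tuples
print(round(gini([9, 5]), 3))                  # 0.459
# Candidate binary split on income: D1 = {low, medium} (10 tuples),
# D2 = {high} (4 tuples); assumed (yes, no) counts per subset
print(round(gini_split([[7, 3], [2, 2]]), 3))  # ~0.443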
Overfitting and Tree Pruning

Overfitting: an induced tree may overfit the training data.
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- Poor accuracy for unseen samples.

Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
- Why is decision tree induction popular?
  - Relatively fast learning speed (compared with other classification methods).
  - Convertible to simple and easy-to-understand classification rules.
  - Can use SQL queries for accessing databases.
  - Comparable classification accuracy with other methods.

Decision Tree classifier in sklearn

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
/pinard/p/6056319.html
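Not from the original slides: a brief sketch of how prepruning-style limits and cost-complexity postpruning can be expressed with DecisionTreeClassifier parameters (max_depth, min_samples_leaf, min_impurity_decrease, ccp_alpha). The parameter names reflect recent scikit-learn versions rather than the older signature shown above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via depth / leaf-size / impurity thresholds
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             min_impurity_decrease=0.01, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow fully, then prune back with cost-complexity pruning
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0)
post.fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))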
Example: The Iris Dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information.

The dataset uses features of iris flowers as its data source. It contains 150 samples divided into 3 classes of 50 samples each, and each sample has 4 attributes. It is a very commonly used training and test set in data mining and classification.

The three classes are: setosa, versicolor, virginica.

The following four attributes of each plant were measured:
- Sepal length
- Sepal width
- Petal length
- Petal width

The first step is visualization:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']

for t, marker, c in zip(range(3), ">ox", "rgb"):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t, 0], features[target == t, 1],
                marker=marker, c=c)
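As a small follow-up (not in the original), the plot can be labeled with the feature names and displayed:

plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.show()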
sklearn.tree.DecisionTreeClassifier

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])
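A brief addition (not in the original slides): the fitted tree can also be inspected as text rules or plotted; export_text and plot_tree are available in recent scikit-learn versions, and the feature names here are illustrative.

>>> from sklearn.tree import export_text, plot_tree
>>> print(export_text(clf, feature_names=["x0", "x1"]))
>>> plot_tree(clf)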
sklearn.tree.DecisionTreeClassifier

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([1., 0.93..., 0.86..., 0.93..., 0.93..., 0.93..., 0.93..., 1., 0.93..., 1.])
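A short follow-up (not in the original): the mean of the cross-validation scores gives a single accuracy estimate.

>>> scores = cross_val_score(clf, iris.data, iris.target, cv=10)
>>> scores.mean()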
Example: Internet Advertisements Data Set
/ml/datasets/Internet+Advertisements

Data Set Characteristics: Multivariate
Number of Instances: 3279
Area: Computer
Attribute Characteristics: Categorical, Integer, Real
Number of Attributes: 1558
Date Donated: 1998-07-01
Associated Tasks: Classification
Missing Values? Yes
Number of Web Hits: 307075

Decision Tree Regressor

A regression tree, as the name suggests, uses a tree model for a regression problem: each leaf outputs a prediction value. The prediction value is generally the mean of the training targets that fall in that leaf, i.e., c_m = ave(y_i | x_i ∈ leaf m).

CART in classification and regression problems, similarities and differences:
- Similarities: in both classification and regression, CART is a binary tree; every non-leaf node has exactly two children, and all inputs falling into the same leaf receive the same output.
- Differences: in classification, CART uses the Gini index as the criterion for choosing the feature and the split; in regression, CART uses MSE (mean squared error) or MAE (mean absolute error) as the criterion. In classification, each leaf of CART represents a class; in regression, each leaf represents a prediction value, which is continuous.
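Not part of the original slides: a minimal scikit-learn sketch of a regression tree on synthetic data; each leaf predicts the mean target of its training samples, as described above.

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> X = np.arange(0, 10, 0.5).reshape(-1, 1)   # one feature
>>> y = np.sin(X).ravel()                      # continuous target
>>> reg = DecisionTreeRegressor(max_depth=3, random_state=0)
>>> reg = reg.fit(X, y)
>>> reg.predict([[2.5]])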
Decision Tree Regressor

Given a data set D = {(x1, y1), (x2, y2), ..., (xi, yi), ..., (xn, yn)}, where each xi is an m-dimensional vector (i.e., xi has m features), the goal of regression is to construct a function f(x) that fits the elements of D so that the MSE is minimized:

    min (1/n) * sum_{i=1..n} (f(x_i) - y_i)^2

Suppose a fully built CART regression tree has M leaves. This means that CART partitions the input space x into M regions R1, R2, ..., RM, and that CART produces at most M distinct prediction values:

    f(x) = sum_{m=1..M} c_m * I(x ∈ R_m),  with  c_m = ave(y_i | x_i ∈ R_m)

CART minimizes the MSE

    sum_{m=1..M} sum_{x_i ∈ R_m} (y_i - c_m)^2

Decision Tree Regressor

At each split, when choosing the splitting variable and the splitting point (that is, when choosing the feature and the split that divides that feature space into two parts), ...
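To illustrate the split search this passage is describing, here is a brute-force sketch of choosing the splitting variable j and splitting point s, under the assumption that the criterion is squared error; it is not code from the original slides.

import numpy as np

def best_split(X, y):
    """Exhaustively search for the (feature, threshold) pair that minimizes
    the total squared error of the two resulting regions."""
    n, m = X.shape
    best = (None, None, np.inf)
    for j in range(m):                   # candidate splitting variable
        for s in np.unique(X[:, j]):     # candidate splitting point
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # each region predicts its mean; sum the squared errors
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best  # (feature index, threshold, squared error)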