Foundations of Machine Learning
Decision Tree

Top 10 algorithms in data mining:
- C4.5
- K-Means
- SVM
- Apriori
- EM (Maximum Likelihood)
- PageRank
- AdaBoost
- KNN
- Naïve Bayes
- CART

Main classification methods:
- Logistic Regression
- Linear Discriminant Analysis
- Decision Tree Induction
- Nearest Neighbor
- Bayes Classification Methods
- Classification by Backpropagation
- Support Vector Machines
- Ensemble Methods
- ...
Illustrating Classification Task

Example of a Decision Tree

Another Example of Decision Tree

Decision Tree Classification Task

Apply Model to Test Data
Decision Tree Classification Task

Algorithm for Decision Tree Induction

Many algorithms:
- Hunt's Algorithm (one of the earliest)
- ID3 (Iterative Dichotomiser)
- C4.5
- CART (Classification and Regression Trees)
- SLIQ (Supervised Learning In Quest)
- SPRINT (Scalable PaRallelizable INduction of decision Trees)
- ...

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (continuous-valued attributes are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
- There are no samples left.
(A minimal code sketch of this recursive procedure is given at the end of this section.)

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion (i.e., split on the best attribute).

Issues:
- How to select the best attribute?
- How to split the records?
- How to determine the best split?
- When to stop splitting?

How to split the records? It depends on the attribute type (nominal, ordinal, continuous) and on the number of ways to split (2-way split or multi-way split).

Splitting based on nominal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting based on ordinal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting based on continuous attributes:
- Discretization to form an ordinal categorical attribute.
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. This can be more computation intensive.
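To make the basic top-down procedure concrete, here is a minimal sketch of greedy induction over categorical attributes. It is illustrative only, not any particular algorithm from the list above; TreeNode, majority_class, and the pluggable score_split measure are hypothetical names (attribute-selection measures such as information gain and the Gini index are defined later).

from collections import Counter

class TreeNode:
    def __init__(self, label=None, attribute=None):
        self.label = label          # class label for leaf nodes
        self.attribute = attribute  # attribute tested at internal nodes
        self.children = {}          # attribute value -> subtree

def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes, score_split):
    # Stopping condition 1: all samples belong to the same class
    if len(set(labels)) == 1:
        return TreeNode(label=labels[0])
    # Stopping condition 2: no attributes left -> majority voting
    if not attributes:
        return TreeNode(label=majority_class(labels))
    # Greedy step: pick the attribute whose split scores best
    best = max(attributes, key=lambda a: score_split(rows, labels, a))
    node = TreeNode(attribute=best)
    remaining = [a for a in attributes if a != best]
    # Partition the examples by the value of the chosen attribute
    for value in set(row[best] for row in rows):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node.children[value] = build_tree(list(sub_rows), list(sub_labels),
                                          remaining, score_split)
    return node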
How to Determine the Best Split

Before splitting (an example node to be split): 10 records of class C0 and 10 records of class C1.

Greedy approach: nodes with a homogeneous class distribution are preferred.
We need a measure of node impurity (purity).
Measures of Node Impurity
- Entropy
- Gini Index
- ...

Brief Review of Entropy
For a binary variable (m = 2), the entropy curve peaks at 1 bit when the two values are equally likely and falls to 0 when one value is certain.

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:
    Info(D) = - sum_{i=1..m} p_i * log2(p_i)

Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = sum_{j=1..v} (|D_j| / |D|) * Info(D_j)

Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)

Information gain is the difference between the entropy of the parent node and the weighted average of the child nodes' entropies.

Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples), so
    Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
The partition "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence
    Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
    Gain(age) = Info(D) - Info_age(D) = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
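As an illustration (not part of the original slides), the quantities above can be checked with a few lines of Python; the helper names info and info_after_split are hypothetical.

from math import log2

def info(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):
    """Weighted entropy after a split; partitions is a list of count lists."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

# buys_computer example: 9 "yes" and 5 "no" tuples overall
info_D = info([9, 5])                                   # 0.940
# Splitting on age gives partitions with (yes, no) counts (2,3), (4,0), (3,2)
info_age = info_after_split([[2, 3], [4, 0], [3, 2]])   # 0.694
gain_age = info_D - info_age                            # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))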
Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split point, and D2 is the set of tuples in D satisfying A > split point.

Gain Ratio for Attribute Selection (C4.5)

The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):
    SplitInfo_A(D) = - sum_{j=1..v} (|D_j| / |D|) * log2(|D_j| / |D|)
    GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex. gain_ratio(income) = 0.029 / 1.557 = 0.019. The attribute with the maximum gain ratio is selected as the splitting attribute.

Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index gini(D) is defined as
    gini(D) = 1 - sum_{j=1..n} p_j^2
where p_j is the relative frequency of class j in D. If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as
    gini_A(D) = (|D1| / |D|) * gini(D1) + (|D2| / |D|) * gini(D2)
Reduction in impurity:
    Δgini(A) = gini(D) - gini_A(D)
The attribute providing the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (for CART, all the possible splitting points for each attribute need to be enumerated).

Computation of Gini Index

Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":
    gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
    gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2)
Gini{low,high} is 0.458 and Gini{medium,high} is 0.450. Thus, split on {low, medium} (and {high}), since it has the lowest Gini index. (A worked computation is sketched in the code below.)

Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the other.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions.
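The Gini computation above can be sketched in a few lines. This is an illustration, not part of the original slides; the per-subset (yes, no) counts (7, 3) and (2, 2) are assumptions for the example, since the slides only give the subset sizes.

def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of a split; partitions is a list of count lists."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * gini(p) for p in partitions)

# Whole data set: 9 "yes" and 5 "no" tuples
print(round(gini([9, 5]), 3))                  # 0.459
# Candidate binary split on income: D1 = {low, medium} (10 tuples),
# D2 = {high} (4 tuples); assumed (yes, no) counts per subset
print(round(gini_split([[7, 3], [2, 2]]), 3))  # ~0.443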
Overfitting and Tree Pruning

Overfitting: an induced tree may overfit the training data.
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- Poor accuracy for unseen samples.

Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
- Why is decision tree induction popular?
  - Relatively fast learning speed (compared with other classification methods).
  - Convertible to simple and easy-to-understand classification rules.
  - Can use SQL queries for accessing databases.
  - Comparable classification accuracy with other methods.

Decision Tree classifier in sklearn

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
/pinard/p/6056319.html
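Not from the original slides: a brief sketch of how prepruning-style limits and cost-complexity postpruning can be expressed with DecisionTreeClassifier parameters (max_depth, min_samples_leaf, min_impurity_decrease, ccp_alpha). The parameter names reflect recent scikit-learn versions rather than the older signature shown above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via depth / leaf-size / impurity thresholds
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             min_impurity_decrease=0.01, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow fully, then prune back with cost-complexity pruning
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0)
post.fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))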
Example: The Iris Dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information.

The dataset uses features of iris flowers as its data source. It contains 150 samples divided into 3 classes of 50 samples each, and each sample has 4 attributes. It is a very commonly used training and test set in data mining and classification.

The three classes are: setosa, versicolor, virginica.

The following four attributes of each plant were measured:
- Sepal length
- Sepal width
- Petal length
- Petal width

The first step is visualization:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']

for t, marker, c in zip(range(3), ">ox", "rgb"):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t, 0], features[target == t, 1],
                marker=marker, c=c)
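As a small follow-up (not in the original), the plot can be labeled with the feature names and displayed:

plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.show()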
sklearn.tree.DecisionTreeClassifier

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])
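A brief addition (not in the original slides): the fitted tree can also be inspected as text rules or plotted; export_text and plot_tree are available in recent scikit-learn versions, and the feature names here are illustrative.

>>> from sklearn.tree import export_text, plot_tree
>>> print(export_text(clf, feature_names=["x0", "x1"]))
>>> plot_tree(clf)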
sklearn.tree.DecisionTreeClassifier

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([1., 0.93..., 0.86..., 0.93..., 0.93..., 0.93..., 0.93..., 1., 0.93..., 1.])
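A short follow-up (not in the original): the mean of the cross-validation scores gives a single accuracy estimate.

>>> scores = cross_val_score(clf, iris.data, iris.target, cv=10)
>>> scores.mean()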
Example: Internet Advertisements Data Set
/ml/datasets/Internet+Advertisements

Data Set Characteristics: Multivariate
Number of Instances: 3279
Area: Computer
Attribute Characteristics: Categorical, Integer, Real
Number of Attributes: 1558
Date Donated: 1998-07-01
Associated Tasks: Classification
Missing Values? Yes
Number of Web Hits: 307075

Decision Tree Regressor

A regression tree, as the name suggests, uses a tree model for a regression problem: each leaf outputs a prediction value. The prediction value is generally the mean of the training targets that fall in that leaf, i.e., c_m = ave(y_i | x_i ∈ leaf m).

CART in classification and regression problems, similarities and differences:
- Similarities: in both classification and regression, CART is a binary tree; every non-leaf node has exactly two children, and all inputs falling into the same leaf receive the same output.
- Differences: in classification, CART uses the Gini index as the criterion for choosing the feature and the split; in regression, CART uses MSE (mean squared error) or MAE (mean absolute error) as the criterion. In classification, each leaf of CART represents a class; in regression, each leaf represents a prediction value, which is continuous.
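Not part of the original slides: a minimal scikit-learn sketch of a regression tree on synthetic data; each leaf predicts the mean target of its training samples, as described above.

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> X = np.arange(0, 10, 0.5).reshape(-1, 1)   # one feature
>>> y = np.sin(X).ravel()                      # continuous target
>>> reg = DecisionTreeRegressor(max_depth=3, random_state=0)
>>> reg = reg.fit(X, y)
>>> reg.predict([[2.5]])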
Decision Tree Regressor

Given a data set D = {(x1, y1), (x2, y2), ..., (xi, yi), ..., (xn, yn)}, where each xi is an m-dimensional vector (i.e., xi has m features), the goal of regression is to construct a function f(x) that fits the elements of D so that the MSE is minimized:

    min (1/n) * sum_{i=1..n} (f(x_i) - y_i)^2

Suppose a fully built CART regression tree has M leaves. This means that CART partitions the input space x into M regions R1, R2, ..., RM, and that CART produces at most M distinct prediction values:

    f(x) = sum_{m=1..M} c_m * I(x ∈ R_m),  with  c_m = ave(y_i | x_i ∈ R_m)

CART minimizes the MSE

    sum_{m=1..M} sum_{x_i ∈ R_m} (y_i - c_m)^2

Decision Tree Regressor

At each split, when choosing the splitting variable and the splitting point (that is, when choosing the feature and the split that divides that feature space into two parts), ...
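To illustrate the split search this passage is describing, here is a brute-force sketch of choosing the splitting variable j and splitting point s, under the assumption that the criterion is squared error; it is not code from the original slides.

import numpy as np

def best_split(X, y):
    """Exhaustively search for the (feature, threshold) pair that minimizes
    the total squared error of the two resulting regions."""
    n, m = X.shape
    best = (None, None, np.inf)
    for j in range(m):                   # candidate splitting variable
        for s in np.unique(X[:, j]):     # candidate splitting point
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # each region predicts its mean; sum the squared errors
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best  # (feature index, threshold, squared error)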