
Ch5 Data Mining — Classification: Alternative Techniques

Rule-Based Classifier
- Classify records by using a collection of "if…then" rules.
- Rule: (Condition) → y, where
  - Condition is a conjunction of attribute tests
  - y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent
- Examples of classification rules:
  (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Rule-Based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

Application of a Rule-Based Classifier
- A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
- With the rules R1–R5 above:
  - Rule R1 covers the hawk, so it is classified as a Bird.
  - Rule R3 covers the grizzly bear, so it is classified as a Mammal.

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

Rule Coverage and Accuracy
- Coverage of a rule: the fraction of records that satisfy the antecedent of the rule.
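The "covers" relation above can be sketched in a few lines of Python (a minimal illustration, not from the slides; rules are encoded here as attribute–value dictionaries):

```python
# Minimal sketch of rule coverage: a rule covers an instance when every
# attribute test in its antecedent is satisfied by the instance.
RULES = [
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),      # R1
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),     # R2
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),    # R3
    ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),   # R4
    ({"Live in Water": "sometimes"},                "Amphibians"), # R5
]

def covers(antecedent, instance):
    """True if the instance satisfies every condition of the antecedent."""
    return all(instance.get(attr) == value for attr, value in antecedent.items())

hawk = {"Blood Type": "warm", "Give Birth": "no", "Can Fly": "yes", "Live in Water": "no"}
grizzly = {"Blood Type": "warm", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "no"}

print([label for ant, label in RULES if covers(ant, hawk)])     # only R1 fires
print([label for ant, label in RULES if covers(ant, grizzly)])  # only R3 fires
```

Running the check over all rules (rather than stopping at the first match) makes it easy to see which records trigger several rules or none at all.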

- Accuracy of a rule: the fraction of records covered by the rule (i.e., satisfying its antecedent) that also satisfy its consequent.

Example: (Status=Single) → No
Coverage = 40%, Accuracy = 50%

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

How Does a Rule-Based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

- A lemur triggers rule R3, so it is classified as a mammal.
- A turtle triggers both R4 and R5.
- A dogfish shark triggers none of the rules.

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

Characteristics of Rule-Based Classifiers
- Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other, so every record is covered by at most one rule.
- Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values, so each record is covered by at least one rule.
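The coverage and accuracy figures for (Status=Single) → No can be reproduced directly from the 10-record training set above (a small sketch of the two definitions):

```python
# Coverage and accuracy of the rule (Status=Single) -> No on the
# 10-record training set: (Refund, Marital Status, Taxable Income, Class).
records = [
    ("Yes", "Single",   125, "No"),  ("No",  "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No",  "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No",  "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No",  "Single",   90, "Yes"),
]

covered = [r for r in records if r[1] == "Single"]  # antecedent holds
correct = [r for r in covered if r[3] == "No"]      # consequent also holds

coverage = len(covered) / len(records)  # 4 of 10 records are Single -> 0.4
accuracy = len(correct) / len(covered)  # 2 of those 4 have Class = No -> 0.5
print(coverage, accuracy)
```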

From Decision Trees to Rules
[Figure: decision tree — Refund? (Yes → NO; No → Marital Status? (Single, Divorced → Taxable Income? (< 80K → NO; > 80K → YES); Married → NO)).]

Classification rules:
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes
(Refund=No, Marital Status={Married}) => No

- The rules are mutually exclusive and exhaustive.
- The rule set contains as much information as the tree.

Rules Can Be Simplified
Initial rule:    (Refund=No) ∧ (Status=Married) → No
Simplified rule: (Status=Married) → No
(derived from the same training set shown above)

Effect of Rule Simplification
- The rules are no longer mutually exclusive: a record may trigger more than one rule. Solutions:
  - Ordered rule set.
  - Unordered rule set: use a voting scheme. Each triggered rule's consequent counts as one vote for the corresponding class, and the test record is assigned the class with the most votes; in some schemes the votes are weighted by rule accuracy.
    - Pros: less susceptible to errors caused by choosing the wrong rule; no order over the rules has to be maintained, so building the model is relatively cheap.
    - Cons: classification is expensive, because the attributes of each test record must be compared against the antecedent of every rule in the set.
- The rules are no longer exhaustive: a record may not trigger any rule. Solution: add a default rule ( ) → y that assigns a default class.
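The unordered, accuracy-weighted voting scheme can be sketched as follows; the rule accuracies used here are assumptions for illustration only, not values from the slides:

```python
# Sketch of an unordered rule set with accuracy-weighted voting.
# Each triggered rule votes for its class with a weight equal to its
# accuracy; the two accuracies below (0.90, 0.60) are assumed values.
from collections import defaultdict

RULES = [
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles",   0.90),  # R4, assumed accuracy
    ({"Live in Water": "sometimes"},        "Amphibians", 0.60),  # R5, assumed accuracy
]

def covers(ant, x):
    return all(x.get(a) == v for a, v in ant.items())

def vote(x, rules, default="Unknown"):
    votes = defaultdict(float)
    for ant, label, acc in rules:
        if covers(ant, x):
            votes[label] += acc  # weighted vote for the rule's class
    return max(votes, key=votes.get) if votes else default

turtle = {"Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
print(vote(turtle, RULES))  # both rules fire; the heavier vote wins
```

With unweighted votes the turtle would be a tie between Reptiles and Amphibians; weighting by accuracy breaks the tie in favor of the more reliable rule.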

Ordered Rule Set
- Rules are ranked in descending order according to their priority.
- Priority can be defined in several ways: based on accuracy, coverage, total description length, or the order in which the rules were generated.
- An ordered rule set is also known as a decision list.
- When a test record is presented to the classifier:
  - It is assigned the class label of the highest-ranked rule it triggers.
  - If none of the rules fire, it is assigned the default class.

Example: with the rules R1–R5 above, a turtle triggers both R4 and R5; under the ordering R1 > R2 > … > R5 it is assigned the class of R4 (Reptiles).

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?

Rule Ordering Schemes
- Rule-based ordering: individual rules are ranked by some measure of rule quality. This guarantees that every test record is classified by the "best" rule that covers it. A potential drawback is that lower-ranked rules are harder to interpret, because each rule implicitly assumes that all the rules preceding it have failed.
- Class-based ordering: rules that belong to the same class appear together. A lower-quality rule may fire first simply because its class has higher priority, causing a higher-quality rule to be ignored.

Class-based ordering:
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Married}) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes

Rule-based ordering:
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes
(Refund=No, Marital Status={Married}) => No
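The decision-list behavior (first triggered rule wins, default class otherwise) can be sketched as:

```python
# Sketch of an ordered rule set (decision list): scan the rules in
# priority order and return the class of the first rule that fires;
# fall back to a default class if none fires.
RULES = [  # ranked R1 > R2 > ... > R5
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),
    ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),
    ({"Live in Water": "sometimes"},                "Amphibians"),
]

def covers(ant, x):
    return all(x.get(a) == v for a, v in ant.items())

def classify(x, rules, default="Unknown"):
    for ant, label in rules:  # rules are scanned in priority order
        if covers(ant, x):
            return label      # first triggered rule wins
    return default            # no rule fired

turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
dogfish = {"Blood Type": "cold", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "yes"}
print(classify(turtle, RULES))   # R4 outranks R5 -> Reptiles
print(classify(dogfish, RULES))  # no rule fires -> default class
```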

Building Classification Rules
- Direct methods: extract rules directly from the data. Examples: RIPPER, CN2, Holte's 1R.
- Indirect methods: extract rules from other classification models (e.g., decision trees, neural networks). Example: C4.5rules.

Direct Method: Sequential Covering
1. Start from an empty rule.
2. Grow a rule using the Learn-One-Rule function.
3. Remove the training records covered by the rule.
4. Repeat steps (2) and (3) until the stopping criterion is met.

Sequential covering algorithm:
1.  Let E be the training records and A the set of attribute-value pairs {(Aj, vj)}.
2.  Let Y0 be an ordered set of classes {y1, y2, …, yk}.
3.  Let R = {} be the initial rule list.
4.  for each class y ∈ Y0 − {yk} do
5.      while the stopping condition is not met do
6.          r ← Learn-One-Rule(E, A, y)
7.          Remove the training records covered by r from E.
8.          Append r to the end of the rule list: R ← R ∨ r.
9.      end while
10. end for
11. Append the default rule { } → yk to the end of the rule list R.

Example of Sequential Covering
[Figure: (i) the original data; (ii) step 1 — rule R1 is grown; (iii) step 2 — the records covered by R1 are removed and R2 is grown; (iv) step 3.]
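The covering loop above can be written as a runnable skeleton. The `learn_one_rule` below is a deliberately crude stand-in (an assumption for illustration: it picks the single best attribute-value test by accuracy, whereas real implementations grow conjunctions greedily):

```python
# Runnable skeleton of sequential covering on a toy data set.
def covers(ant, x):
    return all(x.get(a) == v for a, v in ant.items())

def learn_one_rule(E, A, y):
    # Toy Learn-One-Rule: choose the attribute-value test whose covered
    # records have the highest accuracy for the target class y.
    best, best_acc = None, -1.0
    for attr, val in A:
        cov = [x for x in E if x.get(attr) == val]
        if not cov:
            continue
        acc = sum(x["class"] == y for x in cov) / len(cov)
        if acc > best_acc:
            best, best_acc = {attr: val}, acc
    return best

def sequential_covering(E, A, classes):
    R = []
    for y in classes[:-1]:                       # all classes but the default
        while any(x["class"] == y for x in E):   # toy stopping condition
            r = learn_one_rule(E, A, y)
            if r is None:
                break
            E = [x for x in E if not covers(r, x)]  # remove covered records
            R.append((r, y))
    R.append(({}, classes[-1]))                  # default rule {} -> yk
    return R

E = [
    {"Can Fly": "yes", "Give Birth": "no",  "class": "Birds"},
    {"Can Fly": "yes", "Give Birth": "no",  "class": "Birds"},
    {"Can Fly": "no",  "Give Birth": "yes", "class": "Mammals"},
]
A = [("Can Fly", "yes"), ("Can Fly", "no"), ("Give Birth", "yes"), ("Give Birth", "no")]
print(sequential_covering(E, A, ["Birds", "Mammals"]))
```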

Aspects of Sequential Covering
- Rule growing
- Instance elimination
- Rule evaluation
- Stopping criterion
- Rule pruning

Rule Growing
Two common strategies:
(a) General-to-specific: start from an initial rule r: { } → y, which has very poor quality because it covers all the examples, and add new conjuncts to improve its quality. Explore all candidate conjuncts and greedily choose the next one; continue until the stopping condition is met.
(b) Specific-to-general: randomly choose a positive example as the seed for rule growing. Each refinement step deletes one conjunct of the rule so that it covers more positive examples (generalization); continue until the stopping condition is met.
[Figure: a general-to-specific search starting from { } => y (Yes: 3, No: 4) with candidate conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, Income>80K; and a specific-to-general search generalizing seed rules such as (Refund=No, Status=Single, Income=90K) → Yes and (Refund=No, Status=Single, Income=85K) → Yes into (Refund=No, Status=Single) → Yes.]

Rule Growing (Examples)
CN2 algorithm:
- Start from an empty conjunct: { }.
- Add the conjunct that minimizes the entropy measure: {A}, {A,B}, …
- Determine the rule consequent by taking the majority class of the instances covered by the rule.

RIPPER algorithm:
- Start from an empty rule: { } => class.
- Add the conjunct that maximizes FOIL's information gain measure:
  R0: { } => class      (initial rule)
  R1: {A} => class      (rule after adding a conjunct)
  Gain(R0, R1) = t × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ], where
  t:  number of positive instances covered by both R0 and R1
  p0: number of positive instances covered by R0
  n0: number of negative instances covered by R0
  p1: number of positive instances covered by R1
  n1: number of negative instances covered by R1

Instance Elimination
- Why do we need to eliminate instances? Otherwise, the next rule learned would be identical to the previous rule.
- Why do we remove positive instances? To ensure that the next rule is different.
- Why do we remove negative instances? To prevent underestimating the accuracy of the next rule; compare rules R2 and R3 in the diagram.
[Figure: rules R1, R2, R3 covering overlapping regions of positive (class = +) and negative (class = −) examples.]
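FOIL's information gain, as defined above, is a one-liner; the counts in the example call are hypothetical, chosen only to exercise the formula:

```python
# FOIL's information gain for a candidate refinement R1 of rule R0,
# using the quantities defined above (t, p0, n0, p1, n1).
from math import log2

def foil_gain(p0, n0, p1, n1, t=None):
    """Gain(R0, R1) = t * [log2(p1/(p1+n1)) - log2(p0/(p0+n0))].
    By default t = p1, i.e. the positives covered by the refined rule R1
    are assumed to be a subset of those covered by R0."""
    if t is None:
        t = p1
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Hypothetical counts: R0 covers 100 pos / 400 neg; R1 covers 30 pos / 10 neg.
g = foil_gain(p0=100, n0=400, p1=30, n1=10)
print(round(g, 2))
```

A refinement that keeps many positives while sharply raising the fraction of positives among covered records yields a large gain, which is exactly the bias RIPPER wants when growing a rule.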

Rule Evaluation: Statistical Test
- For example, compute the following likelihood ratio statistic:
  R = 2 Σ_{i=1}^{k} f_i log( f_i / e_i )
  where k is the number of classes, f_i is the observed frequency of class i among the examples covered by the rule, and e_i is the expected frequency of class i for a rule that makes random predictions.
- Interpretation: the given rule is compared with a rule that makes random predictions. A large R value suggests that the number of correct predictions made by the rule is significantly larger than expected by chance.

Rule Evaluation: Statistical Test (Example)
- Consider a training set that contains 60 positive examples and 100 negative examples.
- Rule1 covers 50 positive examples and 5 negative examples.
- Rule2 covers 2 positive examples and no negative examples.
- Rule1 covers 55 examples, so the expected frequency for the positive class is e1 = 55 × 60/160 = 20.625, while the expected frequency for the negative class is e2 = 55 × 100/160 = 34.375.
  R(r1) = 2 × [ 50 × log2(50/20.625) + 5 × log2(5/34.375) ] = 99.9

Rule Evaluation: Metrics
- Accuracy   = nc / n
- Laplace    = (nc + 1) / (n + k)
- M-estimate = (nc + k·p) / (n + k)
  where
  n:  number of instances covered by the rule
  nc: number of positive instances covered by the rule
  k:  number of classes
  p:  prior probability of the positive class
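The worked example can be checked in a few lines (base-2 logs, as in the computation above). The slides give only Rule1's statistic; Rule2's value here is computed the same way, and the Laplace measure is applied to both rules for comparison:

```python
# Likelihood ratio statistic and Laplace measure for the two rules in
# the worked example (60 positive / 100 negative training examples).
from math import log2

def likelihood_ratio(f, e):
    """R = 2 * sum_i f_i * log2(f_i / e_i), skipping classes with f_i = 0."""
    return 2 * sum(fi * log2(fi / ei) for fi, ei in zip(f, e) if fi > 0)

P, N, TOTAL = 60, 100, 160

# Rule1 covers 50 positive and 5 negative examples (55 in total).
e_rule1 = [55 * P / TOTAL, 55 * N / TOTAL]  # [20.625, 34.375]
R1 = likelihood_ratio([50, 5], e_rule1)     # ~99.9

# Rule2 covers 2 positive and no negative examples.
e_rule2 = [2 * P / TOTAL, 2 * N / TOTAL]    # [0.75, 1.25]
R2 = likelihood_ratio([2, 0], e_rule2)      # much smaller than R1

print(round(R1, 1), round(R2, 2))

def laplace(nc, n, k=2):
    return (nc + 1) / (n + k)

# Laplace also prefers Rule1, despite Rule2's 100% raw accuracy.
print(round(laplace(50, 55), 3), round(laplace(2, 2), 3))
```

Both measures agree that Rule1 is the better rule: its predictions deviate far more from a random-guessing baseline, and its smoothed accuracy is higher than that of the tiny-coverage Rule2.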

Stopping Criterion and Rule Pruning
- Stopping criterion: compute the gain of the new rule; if the gain is not significant, discard the rule.
- Rule pruning: similar to the post-pruning of decision trees. Reduced error pruning: remove one of the conjuncts in the rule, compare the error rate on a validation set before and after pruning, and prune the conjunct if the error rate improves.

Summary of the Direct Method
1. Grow a single rule.
2. Remove the instances covered by the rule.
3. Prune the rule (if necessary).
4. Add the rule to the current rule set.
5. Repeat.

Direct Method: RIPPER
- For a 2-class problem, choose one of the classes as the positive class and the other as the negative class:
  - Learn rules for the positive class.
  - The negative class becomes the default class.
- For a multi-class problem:
  - Order the classes by increasing class prevalence (the fraction of instances that belong to each class).
  - Learn the rule set for the smallest class first, treating the rest as the negative class.
  - Repeat with the next smallest class as the positive class.

Direct Method: RIPPER — Growing a Rule
- Start from an empty rule.
- Add conjuncts as long as they improve FOIL's information gain.
- Stop when the rule no longer covers any negative examples.
- Prune the rule immediately using incremental reduced error pruning:
  - Pruning measure: v = (p − n) / (p + n), where p and n are the numbers of positive and negative examples covered by the rule in the validation set.
  - Pruning method: delete any final sequence of conditions that maximizes v.

Direct Method: RIPPER — Building a Rule Set
- Use the sequential covering algorithm: find the best rule that covers the current set of positive examples, then eliminate both the positive and the negative examples covered by the rule.
- Each time a rule is added to the rule set, compute the new description length; stop adding rules when the new description length is d bits longer than the smallest description length obtained so far.
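The incremental reduced error pruning step can be sketched as below. The rule and the validation records are hypothetical, chosen only to show a final condition being dropped; ties are resolved in favor of the shorter rule:

```python
# Sketch of RIPPER-style pruning: delete the final sequence of conditions
# that maximizes v = (p - n) / (p + n) on a validation set.
def covers(ant, x):
    return all(x.get(a) == v for a, v in ant.items())

def metric(conds, validation, pos):
    cov = [x for x in validation if covers(dict(conds), x)]
    p = sum(x["class"] == pos for x in cov)
    n = len(cov) - p
    return (p - n) / (p + n) if cov else float("-inf")

def prune(rule, validation, pos):
    """Try dropping each final suffix of conditions; keep the best prefix."""
    conds = list(rule.items())
    best = conds
    for cut in range(len(conds) - 1, 0, -1):  # shrinking prefixes
        if metric(conds[:cut], validation, pos) >= metric(best, validation, pos):
            best = conds[:cut]                # prefer the shorter rule on ties
    return dict(best)

rule = {"Refund": "No", "Status": "Single", "Income": "High"}  # hypothetical rule
validation = [
    {"Refund": "No", "Status": "Single",  "Income": "High", "class": "Yes"},
    {"Refund": "No", "Status": "Single",  "Income": "Low",  "class": "Yes"},
    {"Refund": "No", "Status": "Married", "Income": "Low",  "class": "No"},
]
print(prune(rule, validation, pos="Yes"))  # the final Income test is dropped
```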

Direct Method: RIPPER — Optimizing the Rule Set
- For each rule r in the rule set R, consider two alternative rules:
  - Replacement rule r*: grow a new rule from scratch.
  - Revised rule r′: add conjuncts to extend the rule r.
- Compare the rule set containing r against the rule sets containing r* and r′, and choose the one that minimizes the description length (MDL principle).
- Repeat rule generation and rule optimization for the remaining positive examples.

Indirect Methods
[Figure: decision tree — P? (No → Q? (No → −; Yes → +); Yes → R? (No → +; Yes → Q? (No → −; Yes → +))).]
Rule set:
r1: (P=No, Q=No) => −
r2: (P=No, Q=Yes) => +
r3: (P=Yes, R=No) => +
r4: (P=Yes, R=Yes, Q=No) => −
r5: (P=Yes, R=Yes, Q=Yes) => +

Indirect Method: C4.5rules
- Extract rules from an unpruned decision tree.
- For each rule r: A → y, consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A.
- Compare the pessimistic error rate of r against that of every r′; prune r if one of the r′ has a lower pessimistic error rate.
- Repeat until the generalization error can no longer be improved.

Indirect Method: C4.5rules (Class Ordering)
- Instead of ordering the rules individually, order subsets of rules grouped by class (class ordering): the class whose rule subset has the smallest description length is given the highest priority (Occam's Razor).

Occam's Razor: a principle urging one to select, among competing hypotheses, the one that makes the fewest assumptions.

Example

Name           Give Birth  Lay Eggs  Can Fly  Live in Water  Have Legs  Class
human          yes         no        no       no             yes        mammals
python         no          yes       no       no             no         reptiles
salmon         no          yes       no       yes            no         fishes
whale          yes         no        no       yes            no         mammals
frog           no          yes       no       sometimes      yes        amphibians
komodo         no          yes       no       no             yes        reptiles
bat            yes         no        yes      no             yes        mammals
pigeon         no          yes       yes      no             yes        birds
cat            yes         no        no       no             yes        mammals
leopard shark  yes         no        no       yes            no         fishes
turtle         no          yes       no       sometimes      yes        reptiles
penguin        no          yes       no       sometimes      yes        birds
porcupine      yes         no        no       no             yes        mammals
eel            no          yes       no       yes            no         fishes
salamander     no          yes       no       sometimes      yes        amphibians
gila monster   no          yes       no       no             yes        reptiles
platypus       no          yes       no       no             yes        mammals
owl            no          yes       yes      no             yes        birds
dolphin        yes         no        no       yes            no         mammals
eagle          no          yes       yes      no             yes        birds

C4.5 versus C4.5rules versus RIPPER
C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
( ) → Mammals

[Figure: the C4.5 decision tree — Give Birth? (Yes → Mammals; No → Live in Water? (Yes → Fishes; Sometimes → Amphibians; No → Can Fly? (Yes → Birds; No → Reptiles))).]

C4.5 versus C4.5rules versus RIPPER (Confusion Matrices)

C4.5 and C4.5rules:
                      PREDICTED CLASS
ACTUAL CLASS    Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians      0           0       0         0      2
Fishes          0           3       0         0      0
Reptiles        0           0       3         0      1
Birds           0           0       1         2      1
Mammals         0           2       1         0      4

RIPPER:
                      PREDICTED CLASS
ACTUAL CLASS    Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians      2           0       0         0      0
Fishes          0           2       0         0      1
Reptiles        1           0       3         0      0
Birds           1           0       0         3      0
Mammals         0           0       1         0      6

Advantages of Rule-Based Classifiers
- As highly expressive as decision trees.
- Easy to interpret and easy to generate.
- Can classify new instances rapidly.
- Performance comparable to decision trees.

Instance-Based Classifiers
- Store the training records and use them to predict the class labels of unseen cases.

Instance-Based Classifiers: Examples
- Rote-learner: memorizes the entire training data and classifies a record only if its attributes match one of the training examples exactly.
- Nearest neighbor: uses the k "closest" points (the nearest neighbors) to perform classification.

Nearest Neighbor Classifiers
- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
- Given a test record: compute its distance to the training records, then choose the k "nearest" records.
- Requires three things:
  - The set of stored records.
  - A distance metric to compute the distance between records.
  - The value of k, the number of nearest neighbors to retrieve.
- To classify an unknown record:
  - Compute its distance to the other training records.
  - Identify the k nearest neighbors.
  - Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
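The three-step procedure above (compute distances, pick the k nearest, vote) is short enough to sketch directly; the toy points are illustrative:

```python
# Minimal k-nearest-neighbor classifier: Euclidean distance, majority vote.
from collections import Counter
from math import dist  # Euclidean distance between points (Python 3.8+)

def knn_classify(x, training, k=3):
    # 1-2: sort stored records by distance to the unknown record and keep k.
    neighbors = sorted(training, key=lambda rec: dist(rec[0], x))[:k]
    # 3: take a majority vote over the k nearest class labels.
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.9, 5.1), "B"),
]
print(knn_classify((1.1, 1.0), training, k=3))  # the 3 nearest are all "A"
```

Choosing k odd for two-class problems avoids voting ties; the distance metric and the value of k are exactly the design decisions the slide lists as required inputs.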
