




Ch5 Data Mining Classification: Alternative Techniques

Rule-Based Classifier
- Classify records by using a collection of "if…then…" rules.
- Rule: (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label.
- LHS: rule antecedent or condition; RHS: rule consequent.
- Examples of classification rules:
  (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Vertebrate data set:

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

Application of Rule-Based Classifier
- A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
- Using rules R1–R5 above:
  R1 covers a hawk (warm-blooded, does not give birth, can fly, does not live in water) => Bird
  R3 covers a grizzly bear (warm-blooded, gives birth, cannot fly, does not live in water) => Mammal

Rule Coverage and Accuracy
- Coverage of a rule: fraction of records that satisfy the antecedent of the rule.
- Accuracy of a rule: fraction of records that satisfy both the antecedent and the consequent of the rule, out of the records that satisfy the antecedent.
- Example on the tax data set below: the rule (Status=Single) → No has Coverage = 40% and Accuracy = 50% (a small computational check follows the table).

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
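A minimal Python sketch of how coverage and accuracy can be computed for this rule on the tax records above; the list-of-dicts encoding, attribute names, and helper logic are illustrative assumptions, not part of the slides.

```python
# Coverage and accuracy of the rule (Status=Single) -> No on the tax data set.

records = [
    {"Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
    {"Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
    {"Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
    {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
    {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
]

antecedent = {"Status": "Single"}   # rule: (Status=Single) -> No
consequent = "No"

covered = [r for r in records if all(r[a] == v for a, v in antecedent.items())]
coverage = len(covered) / len(records)                               # records satisfying the antecedent
accuracy = sum(r["Class"] == consequent for r in covered) / len(covered)

print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")       # coverage = 40%, accuracy = 50%
```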
How does a Rule-Based Classifier Work?
Using rules R1–R5 above:
- A lemur (warm-blooded, gives birth, cannot fly, does not live in water) triggers rule R3, so it is classified as a mammal.
- A turtle (cold-blooded, does not give birth, cannot fly, sometimes lives in water) triggers both R4 and R5.
- A dogfish shark (cold-blooded, gives birth, cannot fly, lives in water) triggers none of the rules.

Characteristics of Rule-Based Classifiers
- Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other, so every record is covered by at most one rule.
- Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values, so every record is covered by at least one rule.

From Decision Trees To Rules
A decision tree on the tax data (split on Refund, then Marital Status, then Taxable Income < 80K / > 80K) yields the classification rules:
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes
(Refund=No, Marital Status={Married}) => No
These rules are mutually exclusive and exhaustive, and the rule set contains as much information as the tree.

Rules Can Be Simplified
Using the same tree and the same tax data as above:
Initial rule:    (Refund=No) ∧ (Status=Married) → No
Simplified rule: (Status=Married) → No
Effect of Rule Simplification
- Rules are no longer mutually exclusive: a record may trigger more than one rule.
  Solutions:
  - Ordered rule set.
  - Unordered rule set with a voting strategy: each triggered rule's consequent counts as one vote for the corresponding class, and the test record is assigned the class with the most votes; in some variants the votes are weighted by rule accuracy.
    Pros: less susceptible to errors caused by choosing the wrong rule; no rule ordering has to be maintained, so model building is relatively cheap.
    Cons: classification is expensive, because the attributes of a test record must be compared against the antecedent of every rule in the set.
- Rules are no longer exhaustive: a record may not trigger any rule.
  Solution: add a default rule {} → y that assigns a default class.

Ordered Rule Set
- Rules are ranked in descending order of priority. Priority can be defined in several ways, e.g. by accuracy, coverage, total description length, or the order in which the rules were generated.
- An ordered rule set is also known as a decision list.
- When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it triggers; if no rule fires, it is assigned the default class.
- Example: a turtle (cold-blooded, does not give birth, cannot fly, sometimes lives in water) triggers both R4 and R5 below, so the ordered rule set classifies it by the higher-ranked rule R4, i.e. as a reptile.
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
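A small Python sketch of classification with an ordered rule set (decision list), using the rules R1–R5 and attribute names above; the rule/record encoding and the choice of default class are illustrative assumptions, not taken from the slides.

```python
# Ordered rule set (decision list): return the class of the highest-ranked rule that fires,
# otherwise fall back to a default class. Rule and record encodings are illustrative only.

RULES = [  # listed in priority order R1..R5
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),
    ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),
    ({"Live in Water": "sometimes"},                "Amphibians"),
]

def classify(record, rules=RULES, default="Mammals"):   # default class chosen arbitrarily here
    for antecedent, label in rules:                     # rules are scanned in rank order
        if all(record.get(a) == v for a, v in antecedent.items()):
            return label                                # first (highest-ranked) triggered rule wins
    return default                                      # no rule fired -> default class

turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
print(classify(turtle))   # Reptiles: R4 outranks R5, which the turtle also triggers
```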
Rule Ordering Schemes
- Rule-based ordering: individual rules are ranked according to some measure of rule quality. This scheme guarantees that every test record is classified by the "best" rule that covers it. Its potential drawback is that lower-ranked rules are harder to interpret, because each rule implicitly assumes that all rules ranked above it have failed.
- Class-based ordering: rules that belong to the same class appear together. A lower-quality rule may fire first simply because its class has higher priority, causing a higher-quality rule to be ignored.

Class-based Ordering (example)
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Married}) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes

Rule-based Ordering (example)
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes
(Refund=No, Marital Status={Married}) => No
Building Classification Rules
- Direct method: extract rules directly from the data, e.g. RIPPER, CN2, Holte's 1R.
- Indirect method: extract rules from other classification models (decision trees, neural networks, etc.), e.g. C4.5rules.

Direct Method: Sequential Covering
1. Start from an empty rule.
2. Grow a rule using the Learn-One-Rule function.
3. Remove the training records covered by the rule.
4. Repeat steps (2) and (3) until a stopping criterion is met.

Sequential covering algorithm:
1.  Let E be the training records and A the set of attribute-value pairs {(Aj, vj)}.
2.  Let Y0 be an ordered set of classes {y1, y2, …, yk}.
3.  Let R = {} be the initial rule list.
4.  for each class y ∈ Y0 − {yk} do
5.      while the stopping condition is not met do
6.          r ← Learn-one-rule(E, A, y)
7.          remove the training records covered by r from E
8.          append r to the end of the rule list: R ← R ∨ r
9.      end while
10. end for
11. append the default rule {} → yk to the end of the rule list R
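A compact Python sketch of the sequential covering loop above. The learn_one_rule used here is a deliberately tiny stand-in that greedily picks the single attribute-value test with the best accuracy for the target class; all names, the rule representation, and the min_acc stopping condition are illustrative assumptions, not the algorithm from any specific system.

```python
# Sequential covering: repeatedly learn one rule for the target class, remove the
# records it covers, and append it to the decision list.

def learn_one_rule(records, attributes, target):
    """Greedy one-conjunct rule grower (toy stand-in for Learn-One-Rule)."""
    best, best_acc = None, -1.0
    for a in attributes:
        for v in {r[a] for r in records}:
            covered = [r for r in records if r[a] == v]
            acc = sum(r["Class"] == target for r in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = {a: v}, acc
    return best, best_acc

def sequential_covering(records, attributes, classes, min_acc=0.6):
    E, rule_list = list(records), []
    for y in classes[:-1]:                       # last class becomes the default
        while any(r["Class"] == y for r in E):
            antecedent, acc = learn_one_rule(E, attributes, y)
            if acc < min_acc:                    # stopping condition: rule quality too low
                break
            rule_list.append((antecedent, y))
            E = [r for r in E                    # remove records covered by the new rule
                 if not all(r[a] == v for a, v in antecedent.items())]
    rule_list.append(({}, classes[-1]))          # default rule {} -> yk
    return rule_list
```

RIPPER and CN2 differ mainly in how Learn-One-Rule grows and evaluates candidate conjuncts, which is what the following slides discuss.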
Example of Sequential Covering
[Figure: (i) original data; (ii) step 1 learns rule R1; (iii) step 2 removes the records covered by R1 and learns another rule; (iv) step 3 continues on the remaining records with R2.]

Aspects of Sequential Covering
- Rule growing
- Instance elimination
- Rule evaluation
- Stopping criterion
- Rule pruning

Rule Growing
Two common strategies:
(a) General-to-specific: start from an initial rule r: {} → y, which has very poor quality because it covers all examples, then add new conjuncts to improve the rule's quality. All possible candidate conjuncts are explored and the next conjunct is chosen greedily; the process continues until the stopping criterion is met.
    [Figure: on the tax data, candidate conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married and Income>80K are compared by their Yes/No counts, growing toward the rule (Refund=No, Status=Single) → Yes.]
(b) Specific-to-general: randomly choose one positive example as the seed of rule growing, e.g. (Refund=No, Status=Single, Income=90K) → Yes or (Refund=No, Status=Single, Income=85K) → Yes. In each refinement step, generalize the rule by deleting one of its conjuncts so that it covers more positive examples; continue until the stopping criterion is met.

Rule Growing (Examples)
- CN2 algorithm: start from an empty conjunct {}; add the conjunct that minimizes the entropy measure ({A}, {A,B}, …); determine the rule consequent by taking the majority class of the instances covered by the rule.
- RIPPER algorithm: start from an empty rule {} => class; add the conjunct that maximizes FOIL's information gain measure:
  R0: {} => class        (initial rule)
  R1: {A} => class       (rule after adding a conjunct)
  Gain(R0, R1) = t * [ log2(p1/(p1+n1)) - log2(p0/(p0+n0)) ]
  where
  t:  number of positive instances covered by both R0 and R1
  p0: number of positive instances covered by R0
  n0: number of negative instances covered by R0
  p1: number of positive instances covered by R1
  n1: number of negative instances covered by R1
Instance Elimination
- Why do we need to eliminate instances after a rule is learned? Otherwise the next rule would be identical to the previous rule.
- Why do we remove positive instances? To ensure that the next rule is different.
- Why do we remove negative instances? To prevent underestimating the accuracy of subsequent rules (compare rules R2 and R3 in the figure).
[Figure: positive (class = +) and negative (class = -) training instances, with the regions covered by rules R1, R2 and R3.]

Rule Evaluation: Statistical test
- Compute the following likelihood ratio statistic:
  R = 2 * sum_{i=1}^{k} f_i * log(f_i / e_i)
  where k is the number of classes, f_i is the observed frequency of class-i examples covered by the rule, and e_i is the expected frequency of class-i examples for a rule that makes random predictions.
- Interpretation: the statistic compares the given rule with a rule that makes random predictions. A large R value suggests that the number of correct predictions made by the rule is significantly larger than expected under random prediction.

Rule Evaluation: Statistical test (example)
- Consider a training set that contains 60 positive examples and 100 negative examples.
- Rule1 covers 50 positive examples and 5 negative examples.
- Rule2 covers 2 positive examples and no negative examples.
- Rule1 covers 55 examples, so the expected frequency for the positive class is e1 = 55*60/160 = 20.625 and the expected frequency for the negative class is e2 = 55*100/160 = 34.375. Then
  R(r1) = 2 * [ 50*log2(50/20.625) + 5*log2(5/34.375) ] = 99.9
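A small Python sketch reproducing the likelihood ratio computation above, using the base-2 logarithm as in the worked example; the function name and argument convention are illustrative assumptions.

```python
from math import log2

def likelihood_ratio(observed, totals):
    """observed[i]: examples of class i covered by the rule;
    totals[i]:   examples of class i in the whole training set."""
    n_covered = sum(observed)
    n_total = sum(totals)
    expected = [n_covered * t / n_total for t in totals]   # random-prediction baseline
    return 2 * sum(f * log2(f / e) for f, e in zip(observed, expected) if f > 0)

# Rule1 from the example: 50 of 60 positives and 5 of 100 negatives covered
print(round(likelihood_ratio([50, 5], [60, 100]), 1))   # ~99.9
```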
Rule Evaluation: Metrics
- Accuracy   = nc / n
- Laplace    = (nc + 1) / (n + k)
- M-estimate = (nc + k*p) / (n + k)
  where
  n:  number of instances covered by the rule
  nc: number of positive instances covered by the rule
  k:  number of classes
  p:  prior probability of the positive class
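A few one-line Python helpers for the evaluation metrics above; the function names and the numbers reused from the statistical-test example are purely illustrative.

```python
def rule_accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

# For a rule covering n=55 records of which nc=50 are positive, with k=2 classes
# and a positive-class prior p = 60/160 (numbers reused from the example above):
print(rule_accuracy(50, 55), laplace(50, 55, 2), m_estimate(50, 55, 2, 60 / 160))
```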
Stopping Criterion and Rule Pruning
- Stopping criterion: compute the gain of adding a conjunct; if the gain is not significant, discard the new rule.
- Rule pruning: similar to post-pruning of decision trees. Reduced error pruning: remove one of the conjuncts in the rule, compare the error rate on the validation set before and after pruning, and prune the conjunct if the error rate improves.

Summary of Direct Method
1. Grow a single rule.
2. Remove the instances covered by the rule.
3. Prune the rule (if necessary).
4. Add the rule to the current rule set.
5. Repeat.

Direct Method: RIPPER
- For a 2-class problem, choose one class as the positive class and the other as the negative class; learn rules for the positive class, and let the negative class be the default class.
- For a multi-class problem, order the classes according to increasing class prevalence (the fraction of instances that belong to each class). Learn the rule set for the smallest class first, treating the rest as the negative class, then repeat with the next smallest class as the positive class.

Direct Method: RIPPER
Growing a rule:
- Start from an empty rule and add conjuncts as long as they improve FOIL's information gain; stop when the rule no longer covers negative examples.
- Prune the rule immediately using incremental reduced error pruning. The pruning measure is v = (p - n) / (p + n), where p and n are the numbers of positive and negative examples covered by the rule in the validation set.
- Pruning method: delete any final sequence of conditions that maximizes v.
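A small Python sketch of the pruning step just described: evaluate v = (p - n)/(p + n) on a validation set after deleting each final sequence of conjuncts and keep the best candidate. The data structures and function names are illustrative assumptions, not RIPPER's actual implementation.

```python
def prune_metric(rule, positives, negatives):
    """v = (p - n) / (p + n) on the validation set for a conjunctive rule."""
    covers = lambda r, rec: all(rec.get(a) == val for a, val in r)
    p = sum(covers(rule, rec) for rec in positives)
    n = sum(covers(rule, rec) for rec in negatives)
    return (p - n) / (p + n) if p + n else float("-inf")

def prune_rule(rule, positives, negatives):
    """Delete the final sequence of conditions that maximizes v.
    rule is a list of (attribute, value) conjuncts in the order they were grown."""
    best_rule = rule
    best_v = prune_metric(rule, positives, negatives)
    for cut in range(len(rule) - 1, 0, -1):       # try dropping the last 1, 2, ... conjuncts
        candidate = rule[:cut]
        v = prune_metric(candidate, positives, negatives)
        if v >= best_v:                           # prefer the shorter rule on ties
            best_rule, best_v = candidate, v
    return best_rule
```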
Direct Method: RIPPER
Building a rule set:
- Use the sequential covering algorithm: find the best rule that covers the current set of positive examples, then eliminate both the positive and the negative examples covered by the rule.
- Each time a rule is added to the rule set, compute the new description length; stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far.

Direct Method: RIPPER
Optimizing the rule set: for each rule r in the rule set R,
- Consider two alternative rules: the replacement rule r* (grow a new rule from scratch) and the revised rule r' (add conjuncts to extend the rule r).
- Compare the rule set containing r against the rule sets containing r* and r', and choose the rule set that minimizes the MDL principle.
- Repeat rule generation and rule optimization for the remaining positive examples.

Indirect Methods
Example: a decision tree that splits on P at the root, on Q and R below it, and on Q again under R, yields the rule set:
r1: (P=No, Q=No) => -
r2: (P=No, Q=Yes) => +
r3: (P=Yes, R=No) => +
r4: (P=Yes, R=Yes, Q=No) => -
r5: (P=Yes, R=Yes, Q=Yes) => +
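A small Python sketch of the indirect idea, enumerating root-to-leaf paths of a tree and emitting one rule per leaf, using a toy nested-dict tree shaped like the P/Q/R example above; the tree encoding and function names are my own illustration, not C4.5's representation.

```python
# Indirect rule extraction: every root-to-leaf path of a decision tree becomes one rule.
# Tree encoding {attribute: {value: subtree-or-leaf-label}} is a toy illustration.

TREE = {"P": {
    "No":  {"Q": {"No": "-", "Yes": "+"}},
    "Yes": {"R": {"No": "+",
                  "Yes": {"Q": {"No": "-", "Yes": "+"}}}},
}}

def tree_to_rules(node, path=()):
    if not isinstance(node, dict):                 # leaf: emit (antecedent, class) pair
        return [(list(path), node)]
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, path + ((attr, value),))
    return rules

for antecedent, label in tree_to_rules(TREE):
    cond = ", ".join(f"{a}={v}" for a, v in antecedent)
    print(f"({cond}) => {label}")
# (P=No, Q=No) => -   ... and so on for the five leaves r1..r5
```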
Indirect Method: C4.5rules
- Extract rules from an unpruned decision tree.
- For each rule r: A → y, consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjuncts in A. Compare the pessimistic error rate of r against that of every r', and prune r if some r' has a lower pessimistic error rate. Repeat until the generalization error can no longer be improved.
- Instead of ordering the individual rules, order subsets of rules (class ordering): rules with the same consequent are grouped, and the class whose rule subset has the smallest description length is given the highest priority (Occam's Razor).
- [Lecture aside: recall the regression and classification problems; why can regularization prevent over-fitting, and why is the SVM such a popular classifier?]
- Occam's Razor: a principle urging one to select, among competing hypotheses, the one that makes the fewest assumptions.

Example

Name           Give Birth  Lay Eggs  Can Fly  Live in Water  Have Legs  Class
human          yes         no        no       no             yes        mammals
python         no          yes       no       no             no         reptiles
salmon         no          yes       no       yes            no         fishes
whale          yes         no        no       yes            no         mammals
frog           no          yes       no       sometimes      yes        amphibians
komodo         no          yes       no       no             yes        reptiles
bat            yes         no        yes      no             yes        mammals
pigeon         no          yes       yes      no             yes        birds
cat            yes         no        no       no             yes        mammals
leopard shark  yes         no        no       yes            no         fishes
turtle         no          yes       no       sometimes      yes        reptiles
penguin        no          yes       no       sometimes      yes        birds
porcupine      yes         no        no       no             yes        mammals
eel            no          yes       no       yes            no         fishes
salamander     no          yes       no       sometimes      yes        amphibians
gila monster   no          yes       no       no             yes        reptiles
platypus       no          yes       no       no             yes        mammals
owl            no          yes       yes      no             yes        birds
dolphin        yes         no        no       yes            no         mammals
eagle          no          yes       yes      no             yes        birds
C4.5 versus C4.5rules versus RIPPER (on the vertebrate data above)

C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

[Figure: the corresponding C4.5 decision tree splits on Give Birth (yes → Mammals), then Live in Water (yes → Fishes, sometimes → Amphibians), then Can Fly (yes → Birds, no → Reptiles).]

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
( ) → Mammals

Confusion matrices:

C4.5 and C4.5rules:
                               PREDICTED CLASS
                  Amphibians  Fishes  Reptiles  Birds  Mammals
ACTUAL Amphibians     0         0        0        0       2
CLASS  Fishes         0         3        0        0       0
       Reptiles       0         0        3        0       1
       Birds          0         0        1        2       1
       Mammals        0         2        1        0       4

RIPPER:
                               PREDICTED CLASS
                  Amphibians  Fishes  Reptiles  Birds  Mammals
ACTUAL Amphibians     2         0        0        0       0
CLASS  Fishes         0         2        0        0       1
       Reptiles       1         0        3        0       0
       Birds          1         0        0        3       0
       Mammals        0         0        1        0       6
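A quick Python check of overall accuracy derived from the two confusion matrices above (sum of the diagonal over all 20 test animals); the matrix encoding is illustrative, and the resulting accuracies are my own arithmetic rather than numbers stated on the slides.

```python
c45 = [
    [0, 0, 0, 0, 2],
    [0, 3, 0, 0, 0],
    [0, 0, 3, 0, 1],
    [0, 0, 1, 2, 1],
    [0, 2, 1, 0, 4],
]
ripper = [
    [2, 0, 0, 0, 0],
    [0, 2, 0, 0, 1],
    [1, 0, 3, 0, 0],
    [1, 0, 0, 3, 0],
    [0, 0, 1, 0, 6],
]

def accuracy(cm):
    correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal = correctly classified
    total = sum(sum(row) for row in cm)
    return correct / total

print(accuracy(c45), accuracy(ripper))   # 12/20 versus 16/20 on this data
```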
Advantages of Rule-Based Classifiers
- As highly expressive as decision trees.
- Easy to interpret and easy to generate.
- Can classify new instances rapidly.
- Performance comparable to decision trees.

Instance-Based Classifiers
- Set of stored cases: store the training records (attributes Atr1 … AtrN plus Class) and use them directly to predict the class label of unseen cases.
- Examples:
  - Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.
  - Nearest neighbor: uses the k "closest" points (nearest neighbors) to perform classification.

Nearest Neighbor Classifiers
- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck; the test record is compared against the training records and the k "nearest" records are chosen.
- Requires three things:
  - the set of stored records,
  - a distance metric to compute the distance between records,
  - the value of k, the number of nearest neighbors to retrieve.
- To classify an unknown record:
  - compute its distance to the training records,
  - identify the k nearest neighbors,
  - use the class labels of the nearest neighbors to determine the class label of the unknown record (see the sketch below).
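A minimal k-nearest-neighbor sketch in Python, assuming numeric attributes, Euclidean distance, and majority voting among the k neighbors; all names and the toy data are illustrative assumptions.

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(query, training, k=3):
    """training: list of (feature_vector, label) pairs; query: feature vector."""
    neighbors = sorted(training, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)      # majority vote among the k nearest
    return votes.most_common(1)[0][0]

# Toy usage: two numeric attributes, two classes
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_classify((1.1, 0.9), training, k=3))   # "A": its closest neighbors are class A
```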