




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡介
1、Ch5 Data MiningClassification: Alternative TechniquesRule-Based ClassifierClassify records by using a collection of “ifthen”rules(Condition) ® yRule:whereCondition is a conjunctions of attributesy is the class labelLHS: rule antecedent or conditionRHS: rule consequentExamples of classification
2、rules: (Blood Type=Warm) Ù (Lay Eggs=Yes) ® Birds (Taxable Income < 50K) Ù (Refund=Yes) ® Evade=NoRule-based Classifier (Example)R1: (Give Birth = no) Ù (Can Fly = yes) ® BirdsR2: (Give Birth = no) Ù (Live in Water = yes) ® FishesR3: (Give Birth = yes)
3、7; (Blood Type = warm) ® Mammals R4: (Give Birth = no) Ù (Can Fly = no) ® ReptilesR5: (Live in Water = sometimes) ® AmphibiansNameBlood TypeGive BirthCan FlyLive in WaterClasshuman python salmon whale frog komodo bat pigeon catleopard shark turtle penguin porcupineeelsander gila
4、monster platypus owldolphin eaglewarm cold cold warm cold cold warm warm warm cold cold warm warm cold cold cold warm warm warm warmyes no no yes no no yes no yes yes no no yes no no no no no yes nono no no no no no yes yes no no no no no no no no no yes no yesno no yes yessometimes nono no no yesso
5、metimes sometimes noyes sometimes nono no yes nomammals reptiles fishes mammals amphibians reptiles mammals birds mammals fishes reptiles birds mammals fishes amphibians reptiles mammals birds mammals birdsApplication of Rule-Based ClassifierA rule r covers an instance x if the attributes of the ins
6、tance satisfy the condition of the ruleR1: (Give Birth = no) Ù (Can Fly = yes) ® BirdsR2: (Give Birth = no) Ù (Live in Water = yes) ® FishesR3: (Give Birth = yes) Ù (Blood Type = warm) ® MammalsR4: (Give Birth = no) Ù (Can Fly = no) ® ReptilesR5: (Live in Wate
7、r = sometimes) ® AmphibiansThe rule R1 covers a hawk => BirdThe rule R3 covers the grizzly bear => MammalNameBlood TypeGive BirthCan FlyLive in WaterClasshawkwarmnoyesno?grizzly bearwarmyesnono?Rule Coverage and AccuracyCoverage of a rule: Fraction of records that satisfy the antecedent o
8、f a ruleAccuracy of a rule: Fraction of records that satisfy both the antecedent and consequent of a rule0(Status=Single) ® NoCoverage = 40%, Accuracy = 50%TidRefundMarital StatusTaxable IncomeClass12345678910Yes No No Yes No No Yes No NoNoSingle Married Single Married Divorced Married Divorced
9、 Single MarriedSingle125K100K70K120K95K60K220K85K75K90KNo No No No Yes No No Yes NoYesHow does Rule-based Classifier Work?R1: (Give Birth = no) Ù (Can Fly = yes) ® BirdsR2: (Give Birth = no) Ù (Live in Water = yes) ® FishesR3: (Give Birth = yes) Ù (Blood Type = warm) ®
10、Mammals R4: (Give Birth = no) Ù (Can Fly = no) ® ReptilesR5: (Live in Water = sometimes) ® AmphibiansA lemur triggers rule R3, so it is classified as a mammal A turtle triggers both R4 and R5A dogfish shark triggers none of the rulesNameBlood TypeGive BirthCan FlyLive in WaterClasslem
11、urwarmyesnono?turtlecoldnonosometimes?dogfish sharkcoldyesnoyes?Characteristics of Rule-Based ClassifierMutually exclusive rules Classifier contains mutually exclusive rules if the rules are independent of each other Every record is covered by at most one ruleExhaustive(無遺漏的、窮盡的) rules Classifier ha
12、s exhaustive coverage if it accounts for every possible combination of attribute values Each record is covered by at least one ruleFrom Decision Trees To RulesYesNoNOSingle,DivorcedMarriedNO< 80K> 80KNOYESRules are mutually exclusive and exhaustiveRule set contains as much information as the t
13、reeTaxable IncomeMarital StatusRefundClassification Rules(Refund=Yes) => No(Refund=No, Marital Status=Single,Divorced,Taxable Income<80K) => No(Refund=No, Marital Status=Single,Divorced,Taxable Income>80K) => Yes(Refund=No, Marital Status=Married) => NoRules Can Be SimplifiedYesNoN
14、OSingle,DivorcedMarriedNO< 80K> 80KNOYES0(Refund=No) Ù (Status=Married) ® No(Status=Married) ® NoInitial Rule:Simplified Rule:Taxable IncomeMarital StatusRefundTidRefundMarital StatusTaxable IncomeCheat12345678910Yes No No Yes No No Yes No NoNoSingle Married Single Married Divor
15、ced Married Divorced Single MarriedSingle125K100K70K120K95K60K220K85K75K90KNo No No No Yes No No Yes NoYesEffect of Rule SimplificationRules are no longer mutually exclusiveA record may trigger more than one ruleSolution?Ordered rule set(有序規(guī)則)Unordered rule set(無序規(guī)則)采用投票策略,把每條被觸發(fā)的后件看作是對(duì)相應(yīng)類的一次投票,然后計(jì)票
16、確定測試的類標(biāo)號(hào);在某些情況下,投票可以用規(guī)則的準(zhǔn)確率;利:不易受由于選擇不當(dāng)?shù)囊?guī)則而產(chǎn)生錯(cuò)誤的影響;不必維護(hù)規(guī)則的順序,建立模型的開銷也相對(duì)較??;弊:測試對(duì)測試的屬性要與規(guī)則集中的每一條規(guī)則的前件作比較,所以進(jìn)行分類是一件很繁重的任務(wù)。Rules are no longer exhaustive A record may not trigger any rules Solution?à yUse a default classOrdered Rule SetRules are ranked in descending order according to their priorit
17、y(按優(yōu)先級(jí)降序排列)優(yōu)先級(jí)的定義有多種方法:基于準(zhǔn)確率、覆蓋率、總描述長度或規(guī)則的產(chǎn)生順序有序的規(guī)則集也成為決策表( decision list)When a test record is presented to the classifier It is assigned to the class label of the highest ranked rule(排序最高的規(guī)則)it has triggered If none of the rules fired, it is assigned to the default classNameBlood TypeGive BirthCan
18、 FlyLive in WaterClassturtlecoldnonosometimes?R1: (Give Birth = no) Ù (Can Fly = yes) ® BirdsR2: (Give Birth = no) Ù (Live in Water = yes) ® FishesR3: (Give Birth = yes) Ù (Blood Type = warm) ® MammalsR4: (Give Birth = no) Ù (Can Fly = no) ® Reptiles R5: (Live
19、 in Water = sometimes) ® AmphibiansRule Ordering SchemesRule-based orderingIndividual rules are ranked based on their quality:依據(jù)規(guī)則質(zhì)量的某種度量對(duì)規(guī)則排序,該方案確保每一個(gè)測試都是由覆蓋它的“最好的”規(guī)則來分類該方案的潛在缺點(diǎn)是排在越低的規(guī)則越難解釋,因?yàn)槊織l規(guī)則都假設(shè)所有排在它前面的規(guī)則不成立Class-based orderingRules that belong to the same class appear together質(zhì)量較差的規(guī)則可能因?yàn)?/p>
20、某個(gè)類的優(yōu)先級(jí)較高排在前面而被觸發(fā),從而導(dǎo)致高質(zhì)量規(guī)則被忽略。Class-based Ordering(Refund=Yes) => No(Refund=No, Marital Status=Single,Divorced,Taxable Income<80K) => No(Refund=No, Marital Status=Married) => No(Refund=No, Marital Status=Single,Divorced, Taxable Income>80K) => YesRule-based Ordering(Refund=Yes) =&
21、gt; No(Refund=No, Marital Status=Single,Divorced,Taxable Income<80K) => No(Refund=No, Marital Status=Single,Divorced,Taxable Income>80K) => Yes(Refund=No, Marital Status=Married) => NoBuilding Classification RulesDirect Method:Extract rules directly from datae.g.: RIPPER, CN2, Holtes
22、1RIndirect Method:Extract rules from other classification mdecision trees, neural networks, etc).e.g: C4.5ruless (e.g.Direct Method: Sequential CoveringStart from an empty ruleGrow a rule using the Learn-One-Rule function Remove training records covered by the ruleRepeat Step (2) and (3) until stopp
23、ing criterion is met1.2.3.4.1. 令E是訓(xùn)練,A是屬性-值對(duì)的集合(Aj, vj)2. 令Y0是類的有序集y1, y1, , yk3. 令R=是初始規(guī)則列表4. for 每個(gè)類yÎY0 -yk dowhile 終止條件不滿足dor ß Learn-one-rule(E, A, y)從E中刪除被r覆蓋的訓(xùn)練追加r到規(guī)則列表尾部:R ß R v rend while5.6.7.8.9.10. end for11. 把默認(rèn)規(guī)則 à yk到規(guī)則列表R尾部Example of Sequential Covering(i) Origina
24、l Data(ii) Step 1Example of Sequential CoveringR1R1R2(iii) Step 2(iv) Step 3Aspects of Sequential CoveringRule GrowingInstance EliminationRule EvaluationStopping CriterionRule PruningRule GrowingTwo common strategiesYes: 3No: 4 .Income> 80KRefund= NoStatus = SingleStatus = DivorcedStatus = Marrie
25、dRefund=No, Status = Single (Class = Yes)Yes: 3Yes: 2No: 1Yes: 1No: 0Yes: 0No: 3Yes: 3No: 1No:4(b) Specific-to-general可以隨機(jī)選擇一個(gè)正例作為規(guī)則增長的初始(a) General-to-specific先建立一個(gè)初始規(guī)則r: à y (質(zhì)量很差, 覆蓋所有樣本)加入新的合取項(xiàng)提高規(guī)則的質(zhì)量。探查所有可能的候選,并貪心地選擇下一個(gè)合取項(xiàng)。繼續(xù)該過程,直至滿足終止條件為止求精步,通過刪除規(guī)則的一個(gè)合取項(xiàng),使其覆蓋的正例來泛化規(guī)則。繼續(xù)該過程,直至滿足終止條件為止Refun
26、d=No, Status=Single, Income=90K (Class=Yes)Refund=No, Status=Single, Income=85K (Class=Yes)Rule Growing (Examples)CN2 Algorithm:Start from an empty conjunct: Add conjuncts that minimizes the entropy measure:A, A,B, Determine the rule consequent by taking majority class of instances covered by the ru
27、leRIPPER Algorithm:Start from an empty rule: => classAdd conjuncts thatizes FOILs information gain measure:R0: => class(initial rule)R1: A => class (rule after adding conjunct)Gain(R0, R1) = t log (p1/(p1+n1) log (p0/(p0 + n0) wheret: number of positive instances covered by both R0 and R1p0
28、: number of positive instances covered by R0 n0: number of negative instances covered by R0 p1: number of positive instances covered by R1 n1: number of negative instances covered by R1Instance EliminationWhy do we need toeliminate instances? Otherwise, the next rule is identical to previous ruleWhy
29、 do we remove positive instances? Ensure that the next rule isdifferentWhy do we remove negative instances? Prevent underestimating accuracy of rule Compare rules R2 and R3 in the diagramR3R2R1+ +class = + +-class = -Rule Evaluation Statistical test-For example, compute the following likelihood rati
30、o statistic:R = 2å fi log( fi / ei )i=1kis the observed frequency of class i-Where k is the number of classes,fexamples that are covered by the rule, and eirule that makes random predictions.iis the expected frequency of a-Some explanation: compare the given rule with rule that makes randompred
31、ictions. A large R value suggests that the number of correct predictions make by the rule is significantly large.Rule Evaluation Statistical test-For example.-Consider a training set that contains 60 positive examples and 100 negative examples.-Rule1: covers 50 positive examples and 5 negative examp
32、les-Rule2: covers 2 positive examples and no negative examples-Rule1 covers 55 examples, the expected frequency for thepositive class is e1 = 55*60/160 = 20.625, while the expectedfrequency for the negative class is e2 = 55*100/160 = 34.375R(r1 ) = 2 ´50 ´log2 (50 / 20.625) + 5´log2 (
33、5 / 34.375) = 99.9Rule EvaluationMetrics: Accuracy= ncn= nc+ 1 Laplacen + kn : Number of instances covered by rulenc : Number of positive instancescovered by rulek : Number of classesp : Prior probability of positive instancesnc + kp M-estimate =n + kStopping Criterion and Rule PruningStopping crite
34、rion Compute the gain If gain is not significant, discard the new ruleRule Pruning Similar to post-pruning of decision trees Reduced Error Pruning:Remove one of the conjuncts in the ruleCompare error rate on validation set before and after pruningIf error improves, prune the conjunctSummary of Direc
35、t MethodGrow a single ruleRemove Instances from rulePrune the rule (if necessary)Add rule to Current Rule SetRepeatDirect Method: RIPPERFor 2-class problem, choose one of the classes as positiveclass, and the other as negative class Learn rules for positive class Negative class will be default class
36、 For multi-class problemOrder the classes according to increasing class prevalence(fraction of instances that belong to a particular class) Learn the rule set for smallest class first, treat the rest as negative classRepeat with next smallest class as positive classDirect Method: RIPPERGrowing a rul
37、e:Start from empty ruleAdd conjuncts as long as they improve FOILs information gain Stop when rule no longer covers negative examplesPrune the rule immediately using incremental reduced errorpruningMeasure for pruning:v = (p-n)/(p+n)p: number of positive examples covered by the rule inthe validation
38、 setn: number of negative examples covered by the rule inthe validation setPruning method: delete any final sequence of conditions thatizes vDirect Method: RIPPERBuilding a Rule Set: Use sequential covering algorithm Finds the best rule that covers the current set of positive examples Eliminate both
39、 positive and negative examples covered by the rule Each time a rule is added to the rule set, compute the new description length stop adding new rules when the new description length is d bits longer than the smallest description length obtained so farDirect Method: RIPPEROptimize the rule set: For
40、 each rule r in the rule set RConsider 2 alternative rules: Replacement rule (r*): grow new rule from scratch Revised rule(r): add conjuncts to extend the rule rCompare the rule set for r against the rule set for r* and rChoose rule set that minimizes MDL principle Repeat rule generation and rule op
41、timization for the remaining positive examplesIndirect MethodsPNoYesQRNoYesNoYes-+QNoYes-+Rule Setr1: (P=No,Q=No) => - r2: (P=No,Q=Yes) => + r3: (P=Yes,R=No) => +r4: (P=Yes,R=Yes,Q=No) => - r5: (P=Yes,R=Yes,Q=Yes) => +Indirect Method: C4.5rulesExtract rules from an unpruned decision t
42、reeFor each rule, r: A ® y, consider an alternative rule r: A ® y where A is obtained by removing one of the conjuncts in A Compare the pessimistic error rate for r against all rs Prune if one of the rs has lower pessimistic error rate Repeat until we can no longer improve generalization e
43、rrorIndirect Method: C4.5rulesInstead of ordering the rules, order subsets of rules (class ordering) E Recallthe regression problem and classificationr problem: C Regularization can prevent over-fitting,why?SVM is the most popular classifier, why?Tlength is given the highest priority(Occam'sRazo
44、r)Occams Razor: It is a principle urging one to select among competing hypotheses that which makes the fewest assumptionsExampleNameGive BirthLay EggsCan FlyLive in WaterHave LegsClasshumanyesnononoyesmammalspythonnoyesnononoreptilessalmonnoyesnoyesnofisheswhaleyesnonoyesnomammalsfrognoyesnosometime
45、syesamphibianskomodonoyesnonoyesreptilesbatyesnoyesnoyesmammalspigeonnoyesyesnoyesbirdscatyesnononoyesmammalsleopard sharkyesnonoyesnofishesturtlenoyesnosometimesyesreptilespenguinnoyesnosometimesyesbirdsporcupineyesnononoyesmammalseelnoyesnoyesnofishessandernoyesnosometimesyesamphibiansgila monster
46、noyesnonoyesreptilesplatypusnoyesnonoyesmammalsowlnoyesyesnoyesbirdsdolphinyesnonoyesnomammalseaglenoyesyesnoyesbirdsC4.5 versus C4.5rules versus RIPPERC4.5rules:(Give Birth=No, Can Fly=Yes) ® Birds(Give Birth=No, Live in Water=Yes) ® Fishes (Give Birth=Yes) ® Mammals(Give Birth=No, C
47、an Fly=No, Live in Water=No) ® Reptiles( ) ® AmphibiansGive Birth?YesNoLive In Water?MammalsRIPPER:(Live in Water=Yes) ® Fishes (Have Legs=No) ® ReptilesYesNoSometimes(Give Birth=No, Can Fly=No, Live In Water=No)® Reptiles(Can Fly=Yes,Give Birth=No) ® Birds() ® Mam
48、malsCan Fly?FishesAmphibiansNoYesBirdsReptilesC4.5 versus C4.5rules versus RIPPERC4.5 and C4.5rules:RIPPER:PREDICTED CLASSAmphibiansFishesReptilesBirdsMammalsACTUALAmphibians00002CLASSFishes03000Reptiles00301Birds00121Mammals02104PREDICTED CLASSAmphibiansFishesReptilesBirdsMammalsACTUALAmphibians200
49、00CLASSFishes02001Reptiles10300Birds10030Mammals00106Advantages of Rule-Based ClassifiersAs highly expressive as decision treesEasy to interpret Easy to generateCan classify new instances rapidlyPerformance comparable to decision treesInstance-Based ClassifiersSet of Stored Cases Store the training
50、records Use training records to predict the class label of unseen casesUnseen CaseAtr1.AtrNAtr1.AtrNClassABBCACBInstance Based ClassifiersExamples: Rote-learner Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly Nearest
51、neighbor Uses k “closest” points (nearest neighbors) for performing classificationNearest Neighbor ClassifiersBasic idea: If it walks like a duck, quacks like a duck, then its probably a duckComputeTestRecordDistanceTraining RecordsChoose k of the “nearest” recordsNearest-Neighbor ClassifiersUnknown
52、 recordl Requires three thingsThe set of stored recordsDistance Metric to compute distance between recordsThe value of k, the number of nearest neighbors to retrievel To classify an unknown record:Compute distance to other training recordsIdentify k nearest neighborsUse class labels of nearest neighbors to determine the class labe
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 《機(jī)械設(shè)計(jì)基礎(chǔ)》課件-第3章 平面連桿機(jī)構(gòu)
- 項(xiàng)鏈課件教學(xué)課件
- 農(nóng)村電商培訓(xùn):助力鄉(xiāng)村振興與農(nóng)業(yè)轉(zhuǎn)型
- 《旅行社經(jīng)營管理》課件-第一章 概 述
- xx河流排水防澇設(shè)施建設(shè)項(xiàng)目風(fēng)險(xiǎn)管理方案(范文模板)
- 2025年新型全液壓鉆機(jī)項(xiàng)目合作計(jì)劃書
- 2025年自動(dòng)酸雨采樣器及測定儀項(xiàng)目發(fā)展計(jì)劃
- 健康飲食產(chǎn)業(yè)園項(xiàng)目資金申請(qǐng)報(bào)告(范文模板)
- xx河流排水防澇設(shè)施建設(shè)項(xiàng)目招商引資報(bào)告
- 2025年解熱鎮(zhèn)痛類藥物項(xiàng)目發(fā)展計(jì)劃
- T/CQAP 3014-2024研究者發(fā)起的抗腫瘤體細(xì)胞臨床研究細(xì)胞制劑制備和質(zhì)量控制規(guī)范
- 初中體育教學(xué)中德育教育的現(xiàn)狀、問題與突破路徑探究
- 基層供銷社管理制度
- 農(nóng)業(yè)供應(yīng)鏈管理考試試題及答案
- 人行雨棚施工方案
- 2025-2030中國晶圓鍵合系統(tǒng)行業(yè)市場發(fā)展趨勢與前景展望戰(zhàn)略分析研究報(bào)告
- 從校園到職場:新員工角色轉(zhuǎn)換與職業(yè)化塑造
- 奶茶服務(wù)協(xié)議合同
- 學(xué)生食堂維修改造工程施工組織設(shè)計(jì)
- 2025年章魚小丸子項(xiàng)目可行性研究報(bào)告
- “中小學(xué)生每天至少2小時(shí)體育活動(dòng)”的價(jià)值追求與實(shí)現(xiàn)路徑研究
評(píng)論
0/150
提交評(píng)論