Data Mining Techniques

- Learn to read the literature: dig into a paper deeply, but be able to step back out of it, and read the original sources whenever possible. Know everything for something and know something for everything. Wherever there are people there is a community, and academia is a community of its own; you cannot get along in it without understanding how it works.
- Learn the basic skills: how to survey the literature, write papers, write grant proposals, and give academic talks. Become familiar with the basic workflow of scientific research.
- Cultivate an open mind and a habit of academic exchange: teamwork, appreciation, and tolerance.
- Keep your curiosity: broad interests and curiosity are the precondition for a rich imagination and the driving force of innovation. "There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits."

Classification vs. Prediction

Outline
- What are classification and prediction?
- Decision tree induction
- Bayesian classification
- Classification by artificial neural network
- KNN
- Some other methods
- Prediction
What is Classification
- The goal of data classification is to organize and categorize data into distinct classes.
- A model is first created based on the data distribution.
- The model is then used to classify new data.
- Given the model, a class can be predicted for new data.
- Example: given (color, weight), is the fruit sweet?

What is Prediction
- Prediction here means numeric prediction: the constructed models predict continuous-valued functions, i.e., they predict unknown or missing values.
- Example: a marketing manager would like to predict how much a given customer will spend during a sale of electronic products.

Classification vs. Prediction
- Similar: both build a model from training data and apply it to new data.
- Different in the values being predicted: categorical vs. continuous-valued.
- Different in the methods used to build their respective models.
- Typical applications: credit approval, target marketing, fraud detection, medical diagnosis.
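As a concrete illustration of the categorical-versus-continuous distinction, here is a minimal sketch (not from the slides): the same feature matrix feeds either a tree classifier for a categorical label or a tree regressor for a continuous target. It assumes scikit-learn is available, and the tiny (color, weight) data set is made up.

```python
# Sketch: classification (categorical label) vs. numeric prediction (continuous target).
# Assumes scikit-learn; the data below is invented for illustration only.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0.2, 150], [0.9, 300], [0.5, 200], [0.8, 120]]    # (color score, weight)
y_class = ["sweet", "not_sweet", "sweet", "not_sweet"]  # categorical label
y_value = [35.0, 80.0, 55.0, 30.0]                      # continuous target, e.g. spend

clf = DecisionTreeClassifier().fit(X, y_class)          # classification
reg = DecisionTreeRegressor().fit(X, y_value)           # numeric prediction

print(clf.predict([[0.3, 160]]))   # -> a class label
print(reg.predict([[0.3, 160]]))   # -> a number
```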
Classification: A Two-Step Process
- Model construction: find a relation between a set of variables (features) and target variables (labels) from finite examples. Given Y = {y1, y2, ..., yn} and X = {x1, x2, ..., xm}, find y = f(x) such that ∀ xi ∈ X, ∃! yj ∈ Y: yj = f(xi).
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.

Classification: A Two-Step Process
- Model usage: classifying future or unknown objects.
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model.
- The test set must be independent of the training set, otherwise over-fitting will occur.

Step 1: Model Construction
- Training data + classification algorithm → classifier (model).
- Example rule learned from the training data: IF rank = "professor" OR years > 6 THEN tenured = "yes".

Step 2: Using the Model in Prediction
- Apply the classifier to the testing data to estimate accuracy; here the error is 1/4 = 0.25. Is that acceptable?
- Then apply it to unseen data: for the tuple (Jeff, Professor, 5), the rule predicts Tenured = yes (see the sketch below).
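The two-step process can be mirrored in a few lines of Python. This is a minimal sketch, not the deck's code: the "model" is simply the tenure rule from Step 1, and all tuples other than (Jeff, Professor, 5) are illustrative, chosen so that the test error comes out to 1/4 = 0.25 as in the slide.

```python
# Step 1: the constructed classifier is just the learned rule.
def model(rank, years):
    return "yes" if rank == "professor" or years > 6 else "no"

# Step 2: estimate accuracy on an independent test set (names/values are made up).
test = [
    ("Tom",     "assistant prof", 2, "no"),
    ("Merlisa", "associate prof", 7, "no"),   # misclassified by the rule
    ("George",  "professor",      5, "yes"),
    ("Joseph",  "assistant prof", 7, "yes"),
]
correct = sum(model(rank, years) == label for _, rank, years, label in test)
print("accuracy:", correct / len(test))       # 3/4 = 0.75, i.e. error 1/4 = 0.25

# Then classify unseen data, e.g. the tuple (Jeff, Professor, 5).
print(model("professor", 5))                  # -> "yes"
```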
Issues: Data Preparation
In order to help improve accuracy, efficiency, and scalability:
- Data cleaning: preprocess the data in order to remove or reduce noise and treat missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize the data. For example, numeric values of the attribute income may be generalized to discrete ranges such as low, medium, and high, and the nominal-valued attribute street can be generalized to a higher-level concept such as city. Generalization compresses the original training data, so fewer input/output operations may be involved during learning.
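A small sketch of the transformation step described above: generalizing numeric income values into low/medium/high ranges, and min-max normalizing them. The thresholds and values are made up for illustration.

```python
# Sketch of data transformation: discretization (generalization) and normalization.
incomes = [23_000, 41_000, 78_000, 120_000]

def generalize_income(v, low=40_000, high=80_000):
    # Map a numeric income to a discrete range; cut points are illustrative.
    if v < low:
        return "low"
    return "medium" if v < high else "high"

def min_max_normalize(values):
    # Rescale numeric values into [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print([generalize_income(v) for v in incomes])   # ['low', 'medium', 'medium', 'high']
print(min_max_normalize(incomes))
```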
Measurements of Classification Methods (I)
- Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
- Speed: the computation costs involved in generating and using the model.
- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.

Measurements of Classification Methods (II)
- Scalability: the ability to construct the model efficiently given large amounts of data.
- Interpretability: the level of understanding and insight provided by the model.
- Simplicity: decision tree size, rule compactness.

Classification Methods
- Decision tree induction
- Bayesian classification
- Neural networks
- K-nearest neighbour
- Associative classifiers
- Support vector machines
- Case-based reasoning
- Genetic algorithms
- Rough set theory
- Fuzzy sets
- Etc.

Outline
- What are classification and prediction?
- Decision tree induction
- Bayesian classification
- Classification by artificial neural network
- KNN
- Associative classifiers
- Support vector machines
- Some other methods
- Prediction
Classification Tree
- Classification trees are methodologies that classify data into discrete classes using tree-structured algorithms.
- Breiman introduced the terminology in the early 1980s.
- The technique has found immense use in medicine, market research, statistics, marketing, and customer relationship management.

Decision Tree
- The main purpose of a decision tree is to expose the structural information contained in the data.
- It is a flow-chart-like tree structure: each internal node represents a decision (split) on an attribute, each branch corresponds to an attribute value or value range, and each leaf node assigns a class label.
- Example model built from the training data: the root tests Refund (Yes → class No); for Refund = No, the next node tests MarSt (Married → class No); for Single or Divorced, the node tests TaxInc (< 80K → class No, > 80K → class Yes).

Another Example of Decision Tree
- The same training data can also be fitted by a tree that tests MarSt first, then Refund, then TaxInc.
- There could be more than one tree that fits the same data!

Apply Model to Test Data
- Start from the root of the tree and, at each internal node, follow the branch that matches the test record's attribute value until a leaf is reached.
- For the test record in the example, the traversal ends at a leaf labelled No, so Cheat is assigned to "No" (a code sketch of this traversal follows).
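The traversal can be written down directly. The sketch below hand-codes a tree read off the figure (Refund, then MarSt, then TaxInc) and classifies one made-up test record; it is an illustration, not the slides' own code.

```python
# Sketch: apply the example decision tree to one test record by walking root -> leaf.
def classify(record):
    if record["Refund"] == "Yes":
        return "No"                        # leaf: Cheat = No
    if record["MarSt"] == "Married":
        return "No"                        # leaf: Cheat = No
    # Single or Divorced: test taxable income
    return "No" if record["TaxInc"] < 80_000 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}  # illustrative
print("Cheat =", classify(test_record))    # -> "No", as in the walkthrough above
```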
What Is A Decision Tree
- A decision tree is a predictive model that is used to make predictions through a classification process.
- The model is drawn as an upside-down tree: the root at the top (or on the left-hand side) and the leaves at the bottom (or on the right-hand side).
- A decision tree uses axis-parallel splits to recursively partition the labeled points, so that each region is as "pure" as possible in terms of the labels.
- Decision trees represent rules. By following the tree, you can decipher the rules and understand why a record is classified in a certain way; these rules can then be used to predict unknown records.

Why Are Decision Trees So Popular
- The construction of a decision tree does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
- Decision trees can handle high-dimensional data.
- They are intuitive and generally easy for humans to understand.
- The learning and classification steps of a decision tree are simple and fast.
Classification by Decision Tree Induction
- Decision tree generation consists of two phases:
  - Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.
  - Tree pruning: identify and remove branches that reflect noise or outliers.
- Use of the decision tree: to classify an unknown sample, test the attribute values of the sample against the decision tree.

Decision Tree Algorithms
- Many algorithms exist: Hunt's algorithm (one of the earliest), ID3 (information gain), C4.5 (gain ratio, handling of missing values), CART (Gini index).
- They differ in the approach used to select which attribute to test at each node in the tree.

Basic Algorithm (a greedy algorithm)
- Construct the tree in a top-down, recursive, divide-and-conquer manner:
  - take a given subset of the data,
  - evaluate all possible splits,
  - choose the best split to partition the data into two subsets,
  - stop when certain stopping conditions are met.
A sketch of this recursion is given below.
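This is a compact sketch of the greedy divide-and-conquer construction just described: it scores every candidate split with a Gini-style impurity, keeps the best one, recurses on the two subsets, and stops at pure nodes or a depth limit. All names and the four-row data set are illustrative, not a specific published algorithm.

```python
# Sketch of greedy, top-down, recursive decision-tree construction.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    best = None                                        # (score, feature, value, left, right)
    for f in range(len(rows[0])):                      # evaluate all possible splits
        for v in {r[f] for r in rows}:
            left  = [i for i, r in enumerate(rows) if r[f] <= v]
            right = [i for i, r in enumerate(rows) if r[f] >  v]
            if not left or not right:
                continue
            score = (len(left)  * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:        # keep the best split
                best = (score, f, v, left, right)
    return best

def build(rows, labels, depth=0, max_depth=3):
    if gini(labels) == 0 or depth == max_depth:        # stopping conditions
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, v, left, right = split                       # recurse on the two subsets
    return {"feature": f, "value": v,
            "left":  build([rows[i] for i in left],  [labels[i] for i in left],  depth + 1, max_depth),
            "right": build([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth)}

tree = build([[1, 75], [0, 90], [0, 60], [1, 120]], ["no", "yes", "no", "yes"])
print(tree)
```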
Issues About Decision Tree Algorithms
- Determine how to split the records: how to specify the attribute test condition? how to determine the best split?
- Determine when to stop splitting.

How to Specify the Attribute Test Condition?
- Depends on the attribute type: categorical or continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.

Splitting Based on Categorical Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g. Income → {low} | {medium} | {high}.
- Binary split: divide the values into two subsets and find the optimal partitioning, e.g. Income → {low, medium} | {high}, {low, high} | {medium}, or {low} | {medium, high}.

Splitting Based on Continuous Attributes
- Discretization to form an ordinal categorical attribute: static (discretize once at the beginning) or dynamic (ranges can be found by equal-interval bucketing, equal-frequency bucketing/percentiles, or clustering).
- Binary decision: (A < v) or (A ≥ v); consider the possible split points and choose the best one, e.g. TaxableIncome > 80K? Yes/No. This gives (i) a binary split, as opposed to (ii) a multi-way split into several income ranges.
A sketch of searching candidate thresholds follows.
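For the binary-decision option, candidate thresholds are typically taken as midpoints between adjacent sorted values. The sketch below scores each candidate with a weighted Gini impurity; the income values and labels are made up for illustration.

```python
# Sketch: find a binary split threshold v for a continuous attribute (A < v vs. A >= v).
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

incomes = [60, 70, 75, 85, 90, 95, 100, 120]            # illustrative values (in K)
cheat   = ["no", "no", "no", "yes", "yes", "yes", "no", "no"]

# Candidate thresholds: midpoints between adjacent distinct sorted values.
ordered = sorted(incomes)
candidates = sorted({(a + b) / 2 for a, b in zip(ordered, ordered[1:])})

scores = {}
for v in candidates:
    left  = [c for x, c in zip(incomes, cheat) if x < v]
    right = [c for x, c in zip(incomes, cheat) if x >= v]
    scores[v] = (len(left) * gini(left) + len(right) * gini(right)) / len(incomes)

best = min(scores, key=scores.get)
print("best threshold:", best, "weighted Gini:", round(scores[best], 3))
```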
Issues About Decision Tree Algorithms
- Determine how to split the records: how to specify the attribute test condition? how to determine the best split?
- Determine when to stop splitting.

Finding the Splits
- The central focus of the decision-tree growing algorithm is selecting which attribute to test at each node in the tree: which attribute can make the best split?
- The measure used to evaluate a potential split is purity. Low purity means that the set contains a representative distribution of classes; high purity means that members of a single class predominate.
- We want a measure that tells us the purity of a node, and how much additional purity we achieve by splitting the data at a non-pure node into daughter nodes based on the attribute values of a variable.
- (Figure: original data with low purity split into daughter nodes with high purity.)

Information Theory and Information Entropy
- Information theory is a branch of the mathematical theory of probability and mathematical statistics. It deals with the concepts of information and information entropy, communication systems, data transmission, rate-distortion theory, cryptography, signal-to-noise ratios, data compression, and related topics.
- Claude Shannon (1916-2001) is the father of information theory.
- Entropy originated in thermodynamics but later found its way into information theory. Thermodynamic entropy is a measure of the amount of energy in a physical system that cannot be used to do work; it is also a measure of the disorder present in the system. Information entropy is likewise a measure of the disorder present in a system.
Entropy (Binary Case)
- Assume there are two classes, P and N, and let the set of examples D contain p elements of class P and n elements of class N.
- The amount of information needed to decide whether an arbitrary example in D belongs to P or N is defined as
  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)).

Entropy (General Case)
- Suppose the class label attribute has m distinct values defining classes Ci (for i = 1, ..., m), and let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
- The expected information (entropy) needed to classify a tuple in D is
  Info(D) = -Σ_{i=1..m} pi log2(pi).
- A log function to base 2 is used because the information is encoded in bits.

Example of Entropy
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = -0 log2 0 - 1 log2 1 = 0 - 0 = 0.
- P(C1) = 1/6, P(C2) = 5/6: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65.
- P(C1) = 2/6, P(C2) = 4/6: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92.
- Entropy is maximal (log2 nc, for nc classes) when records are equally distributed among all classes, implying the least information; it is minimal (0.0) when all records belong to one class, implying the most information. These figures are checked in the sketch below.
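The entropy figures above can be verified with a few lines of Python (log base 2, with the usual convention 0·log 0 = 0):

```python
# Check the binary-case entropy examples above.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))   # 0.0  -> all records in one class
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
```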
Information Gain
- Let attribute A have v distinct values, splitting D into v partitions D1, ..., Dv. The information still needed (after using A to split D) to classify D is
  Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj).
- This answers: how much more information would we still need in order to arrive at an exact classification? The smaller the value, the greater the purity of the subset partitions.
- Gain(A) = Info(D) - Info_A(D) tells us how much information would be gained by branching on attribute A. The attribute A with the highest information gain is chosen as the splitting attribute at node N.

Information Gain
- Select the attribute with the highest information gain. This attribute minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness (impurity) in those partitions.

Example: Information Gain
- Class P: buys_computer = "yes"; class N: buys_computer = "no".
- The partition for age <= 30 has 5 out of 14 samples, with 2 yes's and 3 no's, so it contributes (5/14) × I(2, 3) to Info_age(D); the other age partitions contribute similarly.
- Gain(age) = Info(D) - Info_age(D), and the gains for the other attributes are computed in the same way (see the sketch below).
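A sketch of the information-gain computation for this example. Only the overall 9-yes/5-no class split and the "age <= 30: 2 yes, 3 no out of 5" counts come from the slides; the counts for the other two age partitions are illustrative placeholders.

```python
# Sketch: Gain(A) = Info(D) - Info_A(D), using log base 2.
from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

D = [9, 5]                                    # buys_computer: 9 "yes", 5 "no"
partitions = {                                # class counts per age partition
    "<=30":   [2, 3],                         # from the example above
    "31..40": [4, 0],                         # illustrative
    ">40":    [3, 2],                         # illustrative
}

info_D = info(D)
info_age = sum(sum(p) / sum(D) * info(p) for p in partitions.values())
print("Info(D)     =", round(info_D, 3))      # ~0.940
print("Info_age(D) =", round(info_age, 3))    # ~0.694 with these counts
print("Gain(age)   =", round(info_D - info_age, 3))
```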
Information Gain
- ID3 has been incorporated in a number of commercial rule-induction packages. Specific applications include medical diagnosis, credit-risk assessment of loan applications, diagnosing equipment malfunctions by their cause, classification of soybean diseases, and web-search classification.

Gain Ratio (C4.5)
- The information-gain measure is biased towards attributes with a large number of values.
- For instance, consider an attribute that acts as a unique identifier, such as customer_ID. A split on this attribute would result in a large number of partitions, each containing just one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be 0. Clearly, such a partitioning is useless for classification.

Gain Ratio (C4.5)
- C4.5 (a successor of ID3) uses the gain ratio to overcome this problem by normalizing the information gain:
  GainRatio(A) = Gain(A) / SplitInfo(A).
- Example: gain_ratio(income) = 0.029 / 0.926 = 0.031.
- The attribute with the maximum gain ratio is selected as the splitting attribute.
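A sketch of the gain-ratio normalization. SplitInfo here is the standard C4.5 split-information term; the partition sizes and gain value are made up to show how a customer_ID-style split gets penalized, and they are not the slide's 0.029/0.926 example.

```python
# Sketch: GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfo penalizes many-way splits.
from math import log2

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# The same gain looks much less attractive once a many-valued split is penalized:
print(round(gain_ratio(0.25, [7, 7]), 3))       # two balanced partitions
print(round(gain_ratio(0.25, [1] * 14), 3))     # customer_ID-style split: one tuple each
```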
Gini Index
- The Gini index is named after the Italian statistician and economist Corrado Gini.
- The Gini index of a node t is computed from the sum of the squares of the class proportions at that node, Gini(t) = 1 - Σ_i p(i|t)^2, and measures the impurity of the node.
- The basic idea is to choose a split at each node so that the data in each subset (child node) is "purer" than the data in the parent node.

Example of Gini Index
- The Gini index is maximal (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information; it is minimal (0.0) when all records belong to one class, implying the most interesting information.
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0.
- P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278.
- P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444.

Gini Index (CART, IBM IntelligentMiner)
- If a data set D is split on attribute A into two subsets D1 and D2, the Gini index of the split is
  gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2).
- Reduction in impurity: Δgini(A) = gini(D) - gini_A(D).
- The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node; all the possible splitting points for each attribute need to be enumerated.

Induction of a Decision Tree Using the Gini Index
- Example: D has 9 tuples with buys_computer = "yes" and 5 with "no". Compute the Gini index for each attribute.
- Start with the attribute income and consider each of the possible splitting subsets. The subset {medium, high} partitions D into 10 tuples in D1 and 4 tuples in D2, giving a Gini index of 0.450.
- Similarly, the Gini index values for splits on the remaining subsets are 0.490 ({low, high} and {medium}) and 0.475 ({low, medium} and {high}).
- The split on {medium, high} is therefore the best for income, since its Gini index (0.450) is the lowest. A code sketch of these formulas follows.
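The Gini formulas above translate directly into code. In this sketch, the overall 9-yes/5-no class counts come from the example, while the candidate binary split is illustrative.

```python
# Sketch: gini(D) and gini_A(D) for a binary split, plus the reduction in impurity.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    n1, n2 = sum(d1_counts), sum(d2_counts)
    n = n1 + n2
    return n1 / n * gini(d1_counts) + n2 / n * gini(d2_counts)

D = [9, 5]                                   # buys_computer: 9 yes, 5 no
print(round(gini(D), 3))                     # 1 - (9/14)^2 - (5/14)^2 ≈ 0.459

d1, d2 = [7, 3], [2, 2]                      # a candidate binary split (illustrative)
print(round(gini_split(d1, d2), 3))          # gini_A(D) for that split
print("reduction:", round(gini(D) - gini_split(d1, d2), 3))
```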
Comparing Attribute Selection Measures
- The three measures, in general, return good results, but each has biases:
- Information gain: all attributes are assumed to be categorical (it can be modified for continuous-valued attributes); it is biased towards multi-valued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: all attributes are assumed to be continuous-valued, with several possible split values per attribute (other tools, such as clustering, may be needed to get the candidate split values); it can be modified for categorical attributes; it is biased towards multi-valued attributes; it has difficulty when the number of classes is large; and it tends to favor tests that result in equal-sized partitions with purity in both partitions.

Issues About Decision Tree Algorithms
- Determine how to split the records: how to specify the attribute test condition? how to determine the best split?
- Determine when to stop splitting.

Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules: one rule is created for each path from the root to a leaf, each attribute-value pair along the path forms a conjunction, and the leaf node holds the class prediction.
- Rules are easier for humans to understand.
- Example rules (one per root-to-leaf path; see the sketch below):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
  (the remaining rules follow the other root-to-leaf paths of the tree)
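Rule extraction is simply a walk over every root-to-leaf path. The sketch below uses a small hand-written tree whose branches are illustrative, arranged to reproduce the two example rules above, and emits one IF-THEN rule per path.

```python
# Sketch: emit one IF-THEN rule for every root-to-leaf path of a (hand-written) tree.
tree = {
    "test": "age",
    "branches": {
        "<=30":   {"test": "student",
                   "branches": {"no": "buys_computer = no", "yes": "buys_computer = yes"}},
        "31..40": "buys_computer = yes",
        ">40":    {"test": "credit_rating",
                   "branches": {"excellent": "buys_computer = yes", "fair": "buys_computer = no"}},
    },
}

def rules(node, conditions=()):
    if isinstance(node, str):                 # leaf: the accumulated conditions form one rule
        yield "IF " + " AND ".join(conditions) + " THEN " + node
        return
    for value, child in node["branches"].items():
        yield from rules(child, conditions + (f'{node["test"]} = "{value}"',))

for r in rules(tree):
    print(r)
```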
Strengths and Weaknesses of Decision Tree Methods
- Strengths:
  - Generate understandable rules.
  - Perform classification without requiring much computation.
  - Handle both continuous and categorical variables.
  - Provide a clear indication of which fields are most important for prediction or classification.
- Weaknesses:
  - Less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
  - Prone to error