
数据挖掘作业 (毛卓, m080501447)

1. Based on your observation, describe another possible kind of knowledge that needs to be discovered by data mining methods but has not been listed in this chapter. Does it require a mining methodology that is quite different from those outlined in this chapter?
Answer: Scalable classification of disk-resident data, as performed by the SLIQ algorithm. SLIQ builds a sorted attribute list (an index) for each attribute, and only the class list and the current attribute list need to reside in memory. It performs decision-tree induction, but because the training data do not fit in memory, it requires a mining methodology quite different from the memory-resident methods outlined in this chapter.

2. Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
Answer: Step 1: partition the sorted data into nine equi-depth bins of three values each. Step 2: replace every value in a bin by the mean of that bin.
Bin 1: 14.67, 14.67, 14.67; Bin 2: 18.33, 18.33, 18.33; Bin 3: 21, 21, 21; Bin 4: 24, 24, 24; Bin 5: 26.67, 26.67, 26.67; Bin 6: 33.67, 33.67, 33.67; Bin 7: 35, 35, 35; Bin 8: 40.33, 40.33, 40.33; Bin 9: 56, 56, 56.
The technique smooths out small random fluctuations in the data; with a depth of only 3 the smoothing is mild, and the extreme value 70 still pulls up the mean of its bin.
(b) How might you determine outliers in the data?
Answer: Use the interquartile range IQR = Q3 - Q1, and flag as outliers the values falling below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
(c) What other methods are there for data smoothing?
Answer: Besides binning, regression and clustering can be used for data smoothing.
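The two steps in (a) and the 1.5 * IQR rule in (b) can be sketched in Python. The function names are illustrative, and the quartile convention follows Python's statistics.quantiles (other conventions shift Q1 and Q3 slightly):

```python
import statistics

# Age data from question 2, already sorted.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def smooth_by_bin_means(values, depth):
    """Equi-depth binning: replace each value by its bin's mean."""
    smoothed = []
    for i in range(0, len(values), depth):
        bin_ = values[i:i + depth]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([round(mean, 2)] * len(bin_))
    return smoothed

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print(smooth_by_bin_means(ages, 3)[:3])  # first bin -> [14.67, 14.67, 14.67]
print(iqr_outliers(ages))                # [70]
```

With this quartile convention Q1 = 20 and Q3 = 35, so the fences are [-2.5, 57.5] and only 70 is flagged.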

3. Using the data for age given in question 2, answer the following:
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
Answer: v' = (35 - 13) / (70 - 13) = 22/57 ≈ 0.386.
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
Answer: the mean of the 27 values is 809/27 ≈ 29.96, so v' = (35 - 29.96) / 12.94 ≈ 0.39.
(c) Use normalization by decimal scaling to transform the value 35 for age.
Answer: the smallest power of ten that maps every value into [-1, 1] is 100 (since the maximum absolute value is 70), so v' = 35/100 = 0.35.
(d) Comment on which method you would prefer to use for the given data, giving reasons as to why.
Answer: Min-max normalization is preferable here, because the minimum and maximum ages are known and it maps values exactly onto the required range [0.0, 1.0]. Z-score normalization is less sensitive to outliers such as 70 but does not produce a bounded range, and decimal scaling preserves only a coarse sense of magnitude.
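The three normalizations of question 3 can be checked with a short script. The function names are illustrative, and the standard deviation 12.94 is taken from the question rather than recomputed:

```python
# Age data from question 2.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def min_max(v, data, new_min=0.0, new_max=1.0):
    """Map v linearly from [min(data), max(data)] onto [new_min, new_max]."""
    lo, hi = min(data), max(data)
    return (v - lo) / (hi - lo) * (new_max - new_min) + new_min

def z_score(v, data, std):
    """Center by the mean and scale by the given standard deviation."""
    mean = sum(data) / len(data)
    return (v - mean) / std

def decimal_scaling(v, data):
    """Divide by the smallest power of 10 that brings all values into [-1, 1]."""
    j = 0
    while max(abs(x) for x in data) / 10 ** j >= 1:
        j += 1
    return v / 10 ** j

print(round(min_max(35, ages), 3))         # 0.386
print(round(z_score(35, ages, 12.94), 2))  # 0.39
print(decimal_scaling(35, ages))           # 0.35
```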

4. Suppose that a data warehouse for Big University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. At the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
Answer: Big University is modeled along the four dimensions semester, student, course, and instructor. The schema contains a central fact table holding foreign keys to each of the four dimension tables, together with the two measures count and avg_grade. (Figure 3.4: snowflake schema of a data warehouse for Big University; the diagram itself is not reproduced in this copy.)
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student?
Answer: Roll-up performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension (such as the total order semester < year) or by dimension reduction; slice performs a selection on one dimension of the given cube, resulting in a subcube, and dice selects on two or more dimensions. Here one would roll up on course from course_id to department, roll up on semester from semester to all, and then slice (or dice) for department = "CS".
(c) If each dimension has five levels (including all), such as student < major < status < university < all, how many cuboids will this cube contain (including the base and apex cuboids)?
Answer: with five levels in each of the four dimensions, the cube contains 5^4 = 625 cuboids.
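The count in (c) is a one-line product over the per-dimension level counts; a tiny sketch (function name illustrative):

```python
# Cuboid count for a cube whose i-th dimension has levels[i]
# conceptual levels, counting the "all" level.
def cuboid_count(levels_per_dimension):
    count = 1
    for levels in levels_per_dimension:
        count *= levels
    return count

print(cuboid_count([5, 5, 5, 5]))  # 625
```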

6. (The statement of this exercise is truncated in this copy; from the computations below, a transaction database contains 5000 transactions, of which 3000 contain hot dogs, 2500 contain hamburgers, and 2000 contain both, with min_sup = 25% and min_conf = 50%.)
(a) Is buys(X, hotdogs) ⇒ buys(X, hamburgers) a strong association rule?
Answer: support = 2000/5000 = 40% ≥ 25% and confidence = 2000/3000 ≈ 66.7% ≥ 50%, so the rule is a strong association rule.
(b) Based on the given data, is the purchase of hot dogs independent of the purchase of hamburgers? If not, what kind of correlation relationship exists between the two?
Answer: lift(hotdogs, hamburgers) = P(hotdogs ∧ hamburgers) / (P(hotdogs) × P(hamburgers)) = (2000/5000) / ((3000/5000) × (2500/5000)) = 0.40/0.30 ≈ 1.33. Since the lift value is greater than 1, the purchase of hot dogs and the purchase of hamburgers are not independent: they are positively correlated.
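The support, confidence, and lift computations above can be verified directly. Variable names are illustrative; the counts are those recovered in the problem statement:

```python
# 5000 transactions: 3000 with hot dogs, 2500 with hamburgers, 2000 with both.
n, hotdogs, hamburgers, both = 5000, 3000, 2500, 2000

support = both / n                                   # P(hotdogs and hamburgers)
confidence = both / hotdogs                          # P(hamburgers | hotdogs)
lift = support / ((hotdogs / n) * (hamburgers / n))  # > 1 means positive correlation

print(f"support={support:.0%} confidence={confidence:.1%} lift={lift:.2f}")
# support=40% confidence=66.7% lift=1.33
```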

7. Write an algorithm for k-nearest-neighbor classification given k and n, the number of attributes describing each tuple.
Answer: Denote an arbitrary instance x by the attribute vector <a1(x), ..., an(x)>, where ar(x) is the value of the r-th attribute of x, and define the distance between two instances xi and xj as
d(xi, xj) = sqrt( sum for r = 1..n of (ar(xi) - ar(xj))^2 ).
k-nearest-neighbor algorithm:
Training: build the set of training examples D.
Classification: given a query instance xq to be classified, let x1, ..., xk denote the k instances from D that are nearest to xq, and return the class
f(xq) = argmax over classes v of: sum for i = 1..k of delta(v, f(xi)),
where delta(a, b) = 1 if a = b, and delta(a, b) = 0 otherwise.

8. Show that accuracy is a function of sensitivity and specificity; that is, prove Equation (6.58).
Answer: Suppose a classifier has been trained to classify medical data tuples as either "cancer" or "not_cancer". An accuracy rate of, say, 90% may not be acceptable: the classifier could be correctly labeling only the "not_cancer" tuples. Instead, we would like to assess how well the classifier recognizes "cancer" tuples (the positive tuples) and how well it recognizes "not_cancer" tuples (the negative tuples). The sensitivity and specificity measures serve exactly this purpose: sensitivity is the true positive (recognition) rate (that is, the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (the proportion of negative tuples that are correctly identified). In addition, precision assesses the percentage of tuples labeled "cancer" that actually are "cancer" tuples. These measures are defined as
sensitivity = t_pos / pos, specificity = t_neg / neg, precision = t_pos / (t_pos + f_pos),
where t_pos is the number of true positives ("cancer" tuples that were correctly classified as such), pos the number of positive ("cancer") tuples, t_neg the number of true negatives ("not_cancer" tuples that were correctly classified as such), neg the number of negative ("not_cancer") tuples, and f_pos the number of false positives ("not_cancer" tuples that were incorrectly labeled as "cancer"). Accuracy is then a function of sensitivity and specificity:
accuracy = (t_pos + t_neg) / (pos + neg)
= (t_pos / pos) * (pos / (pos + neg)) + (t_neg / neg) * (neg / (pos + neg))
= sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg).
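The k-NN procedure of question 7 can be turned into a small runnable sketch. The training tuples and labels below are invented for illustration; the distance is the Euclidean d(xi, xj) defined above:

```python
import math
from collections import Counter

def knn_classify(train, labels, query, k):
    """Return the majority label among the k training tuples
    nearest to `query` under Euclidean distance."""
    by_distance = sorted(zip(train, labels),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up two-attribute training set for illustration.
train = [(1, 1), (2, 1), (8, 8), (9, 9)]
labels = ["a", "a", "b", "b"]
print(knn_classify(train, labels, (1.5, 1.0), 3))  # a
```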
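The accuracy identity proved in question 8 can also be checked numerically; the counts below are made up for illustration:

```python
# Hypothetical confusion-matrix counts: 100 positives, 200 negatives.
t_pos, pos = 90, 100   # true positives / total positives
t_neg, neg = 140, 200  # true negatives / total negatives

sensitivity = t_pos / pos
specificity = t_neg / neg
accuracy = (t_pos + t_neg) / (pos + neg)
identity = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

print(accuracy == identity)  # the two expressions agree
```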

9. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
Answer: sqrt((22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2) = sqrt(4 + 1 + 36 + 4) = sqrt(45) ≈ 6.71.
(b) Compute the Manhattan distance between the two objects.
Answer: |22-20| + |1-0| + |42-36| + |10-8| = 2 + 1 + 6 + 2 = 11.
(c) Compute the Minkowski distance between the two objects, using q = 3.
Answer: (2^3 + 1^3 + 6^3 + 2^3)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15.

10. Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9). The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only
(a) the three cluster centers after the first round of execution.
Answer: assign each remaining point to its nearest center:
A2: d(A1, A2) = 5, d(B1, A2) = 4.24, d(C1, A2) = 3.16, so A2 joins C1's cluster;
A3: d(A1, A3) = 8.49, d(B1, A3) = 5, d(C1, A3) = 7.28, so A3 joins B1's cluster;
B2: d(A1, B2) = 7.07, d(B1, B2) = 3.61, d(C1, B2) = 6.71, so B2 joins B1's cluster;
B3: d(A1, B3) = 7.21, d(B1, B3) = 4.12, d(C1, B3) = 5.39, so B3 joins B1's cluster;
C2: d(A1, C2) = 2.24, d(B1, C2) = 1.41, d(C1, C2) = 7.62, so C2 joins B1's cluster.
After the first round the clusters are cluster 1: {A1}, cluster 2: {A2, C1}, cluster 3: {A3, B1, B2, B3, C2}, and the three new centers are (2, 10), (1.5, 3.5), and (6, 6).
(b) the final three clusters.
Answer: repeat the assignment and update steps until the centers stop moving. In the second round C2 moves to cluster 1 (new centers (3, 9.5), (1.5, 3.5), (6.5, 5.25)); in the third round B1 also moves to cluster 1 (new centers (3.67, 9), (1.5, 3.5), (7, 4.33)); after that no point changes cluster. The final three clusters are {A1, B1, C2}, {A2, C1}, and {A3, B2, B3}.
