數(shù)據(jù)挖掘和分析29_W19A_第1頁
數(shù)據(jù)挖掘和分析29_W19A_第2頁
數(shù)據(jù)挖掘和分析29_W19A_第3頁
數(shù)據(jù)挖掘和分析29_W19A_第4頁
數(shù)據(jù)挖掘和分析29_W19A_第5頁
已閱讀5頁,還剩25頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

1、聚類分析2014.12.30 第十九周數(shù)據(jù)挖掘和分析簇的確認(rèn) Cluster Validity 如何驗(yàn)證和評價(jià)聚類分析的結(jié)果? “goodness” of the resulting clusters? But “clusters are in the eye of the beholder”! 為何評價(jià)聚類分析的結(jié)果? 避免發(fā)現(xiàn)噪聲產(chǎn)生的模式 比較不同的聚類算法 比較不同的簇集合 簇之間的比較隨機(jī)數(shù)據(jù)中被發(fā)現(xiàn)的簇 Clusters found in Random Data00.81xyRandom Points00.

2、1xyK-means00.81xyDBSCAN00.81xyComplete Link1.確定數(shù)據(jù)集的聚類趨勢 clustering tendency ,即是否存在非隨機(jī)結(jié)構(gòu) 2.確定正確的簇的個數(shù).3.評估聚類分析結(jié)果對數(shù)據(jù)的擬合情況 - Use only the data4.將聚類分析的結(jié)果跟已知的客觀結(jié)果(如,外部提供的類標(biāo)號)比較5.比較不同的聚類方法的優(yōu)劣.

3、1,2,3 非監(jiān)督3,4,5 進(jìn)一步區(qū)分是評估整個聚類還是單個簇 簇確認(rèn)的重要問題 Cluster Validation三種度量方式 外部指標(biāo) External Index: 監(jiān)督的 Used to measure the extent to which cluster labels match externally supplied class labels. Entropy 熵 內(nèi)部指標(biāo) Internal Index: 非監(jiān)督的 Used to measure the goodness of a clustering structure without respect to externa

4、l information. Sum of Squared Error (SSE) 相對指標(biāo) Relative Index: Used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g., SSE or entropySometimes these are referred to as criteria instead of indices However, sometimes criterion is the

5、general strategy and index is the numerical measure that implements the criterion.簇確認(rèn)的度量 Measures of Cluster Validity兩個矩陣 Two matrices 鄰近性矩陣 Proximity Matrix理想的鄰近性矩陣 “Incidence” Matrix每個數(shù)據(jù)點(diǎn)對應(yīng)一行一列矩陣中每項(xiàng)對應(yīng)的兩點(diǎn)如果是同簇,為1否則為 0 計(jì)算兩個矩陣的相關(guān)性Since the matrices are symmetric, only the correlation between n(n-1) /

6、 2 entries needs to be calculated.高相關(guān)-簇中的點(diǎn)相近. Not a good measure for some density or contiguity based clusters.非監(jiān)督簇評估:通過相關(guān) Via Correlation非監(jiān)督簇評估:通過相關(guān) Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets. 00.8

7、1xy00.81xyCorr = -0.9235Corr = -0.5810 根據(jù)簇對數(shù)據(jù)排序后的相似性矩陣 Order the similarity matrix with respect to cluster labels and inspect visually. 通過相似性矩陣00.81xyPointsPoints20406080100102030405060708090100Similarity00

8、.70.80.91通過相似性矩陣 Clusters in random data are not so crispPointsPoints20406080100102030405060708090100Similarity01DBSCAN00.81xyPointsPoints20406080100102030405060708090100Similarity01通過相似性矩陣 Clusters in random

9、 data are not so crispK-means00.81xy通過相似性矩陣 Clusters in random data are not so crisp00.81xyPointsPoints20406080100102030405060708090100Similarity01Complete Link通過相似性矩陣1 235647DBSCAN00.40.5

10、0.915001000150020002500300050010001500200025003000 Clusters in more complicated figures arent well separated內(nèi)部指標(biāo) Internal Index: 不需要外部信息Used to measure the goodness of a clustering structure without respect to external information SSE SSE 適合評估多個簇集或者多個簇 (average SSE). 也可以用來估計(jì)簇的個數(shù)非監(jiān)督的 Interna

11、l Measures: SSE2 5 1015202530012345678910KSSE51015-6-4-20246Internal Measures: SSE SSE curve for a more complicated data set1 235647SSE of clusters found using K-means需要框架來解釋度量. For example, if our measure of evaluation has the value, 10, is that good, fair, or poor?統(tǒng)計(jì)學(xué)角度聚類結(jié)果的分典型性意味著結(jié)果的正確性比較隨機(jī)數(shù)據(jù)和聚類后

12、的數(shù)據(jù)的某項(xiàng)指標(biāo)If the value of the index is unlikely, then the cluster results are validThese approaches are more complicated and harder to understand.如果比較兩個不同的簇集, 框架必要性降低.However, there is the question of whether the difference between two index values is significant簇確認(rèn)的框架 Framework for Cluster Validity E

13、xample Compare SSE of 0.005 against three clusters in random data Histogram shows SSE of three clusters in 500 sets of random data points of size 100 distributed over the range 0.2 0.8 for x and y valuesSSE的統(tǒng)計(jì)學(xué)框架 Statistical Framework for SSE0.0160.0180.020.0220.0240.0260.0280.030.0320.0340510152025

14、3035404550SSECount00.81xy Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets. Statistical Framework for Correlation00.81xy00.81xyC

15、orr = -0.9235Corr = -0.5810 簇凝聚度 Cluster Cohesion: Measures how closely related are objects in a cluster Example: SSE 簇分離度 Cluster Separation: Measure how distinct or well-separated a cluster is from other clusters Example: Squared Error Cohesion is measured by the within cluster sum of squares (SSE

16、) Separation is measured by the between cluster sum of squares Where |Ci| is the size of cluster i Internal Measures: Cohesion and SeparationiCxiimxWSS2)(iiimmCBSS2)(凝聚度和分離度Internal Measures: Cohesion and Separation Example: SSE BSS + WSS = constant12345 m1m2m10919) 35 . 4(2)5 . 13(21)5 . 45()5 . 44

17、()5 . 12()5 . 11 (222222TotalBSSWSSK=2 clusters:100100) 33(410) 35() 34() 32() 31 (22222TotalBSSWSSK=1 cluster:A proximity graph based approach can also be used for cohesion and separation. Cluster cohesion is the sum of the weight of all links within a cluster. Cluster separation is the sum of the

18、weights between nodes in the cluster and nodes outside the cluster.凝聚度和分離度Internal Measures: Cohesion and SeparationcohesionseparationSilhouette Coefficient combine ideas of both cohesion and separation, but for individual points, as well as clusters and clusteringsFor an individual point, i Calcula

19、te a = average distance of i to the points in its cluster Calculate b = min (average distance of i to points in another cluster) The silhouette coefficient for a point is then given by s = 1 a/b if a b, (or s = b/a - 1 if a b, not the usual case) Typically between 0 and 1. The closer to 1 the better

20、.Can calculate the Average Silhouette width for a cluster or a clustering輪廓系數(shù) Internal Measures: Silhouette Coefficientab外部指標(biāo) External Measures of Cluster Validity: Entropy and Purity確定正確的簇個數(shù) “The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Wit

21、hout a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”Algorithms for Clustering Data, Jain and DubesFinal Comment on Cluster Validity使用R完成Kmeans聚類 newiris - iris; newiris$Species - NULL; #對訓(xùn)練數(shù)據(jù)去

22、掉分類標(biāo)記 kc - kmeans(newiris, 3); #分類模型訓(xùn)練 fitted(kc); #查看具體分類情況 table(iris$Species, kc$cluster); #查看分類概括 #聚類結(jié)果可視化 plot(newirisc(Sepal.Length, Sepal.Width), col = kc$cluster, pch = eger(iris$Species); #不同的顏色代表不同的聚類結(jié)果,不同的形狀代表訓(xùn)練數(shù)據(jù)集的原始分類情況。 points(kc$centers,c(Sepal.Length, Sepal.Width), col = 1:3, p

23、ch = 8, cex=2);http:/ a 2-dimensional examplex - rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)colnames(x) - c(x, y)(cl - kmeans(x, 2)plot(x, col = cl$cluster)points(cl$centers, col = 1:2, pch = 8, cex = 2)# sum of squares# 其中scale函數(shù)提供數(shù)據(jù)中心化功能,所謂數(shù)據(jù)的中心化是指數(shù)據(jù)集中的各項(xiàng)數(shù)據(jù)減去數(shù)據(jù)集的均值,這個函數(shù)還提供數(shù)據(jù)的標(biāo)準(zhǔn)化功能,所謂數(shù)據(jù)的標(biāo)準(zhǔn)化是指中心化之后的數(shù)據(jù)在除以數(shù)

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

最新文檔

評論

0/150

提交評論