Cluster Analysis: Basic Concepts and Algorithms

What is Cluster Analysis?
- Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- A good clustering: intra-cluster distances are minimized, inter-cluster distances are maximized.

Applications of Cluster Analysis
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
- Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia).

Discovered Clusters (Industry Group)
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN -> Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN -> Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN -> Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP -> Oil-UP

Applications of Cluster Analysis
- Understanding: conceptually meaningful groups of objects that share common characteristics are potential classes.
  - Biology: a taxonomy of all living things; groups of genes with similar functions.
  - Information Retrieval: group the results of a search-engine query into a small number of clusters, each capturing a particular aspect of the query.
  - Climate: patterns in the atmosphere and ocean.
  - Psychology and Medicine: different subcategories of illnesses and conditions (such as depression); spatial and temporal distribution of a disease.
  - Business: customer grouping and marketing.
- Utility: finding the most representative cluster prototypes for further data analysis and processing.
  - Summarization: applying algorithms to a reduced data set consisting only of cluster prototypes.
  - Compression: each object is represented by the index of the prototype of the cluster to which it belongs.
  - Efficiently finding nearest neighbors: only the distances to objects in nearby clusters need to be computed, based on prototype distances.

Notion of a Cluster Can Be Ambiguous
- How many clusters? The same set of points can plausibly be grouped into two, four, or six clusters.
[Figure: the same points interpreted as two, four, and six clusters]
Related Techniques
- Classification: supervised learning from labeled data; clustering is sometimes called unsupervised classification.
- Segmentation: split an image into segments.
- Partitioning: divide a graph into subgraphs.

Types of Clusterings
- A clustering is a set of clusters.
- An important distinction is between hierarchical and partitional sets of clusters.
- Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
[Figure: original points and one partitional clustering of them]
[Figure: traditional and non-traditional hierarchical clusterings of points p1-p4, with their dendrograms]

Other Distinctions Between Sets of Clusters
- Exclusive versus non-exclusive: in non-exclusive clusterings, points may belong to multiple clusters; this can represent multiple classes or border points.
- Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights for a point must sum to 1; probabilistic clustering has similar characteristics.
- Partial versus complete: in some cases, we only want to cluster some of the data.
- Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.

Types of Clusters
- Well-separated clusters
- Center-based clusters
- Contiguous clusters
- Density-based clusters
- Property or conceptual clusters
- Clusters described by an objective function
Types of Clusters: Well-Separated
- A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters]

Types of Clusters: Center-Based
- A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster.
- The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
[Figure: 4 center-based clusters]
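To make the centroid/medoid distinction concrete, here is a minimal NumPy sketch (the toy array and variable names are illustrative, not from the slides): the centroid is the coordinate-wise mean and need not be a data point, while the medoid is the cluster member whose total distance to the other members is smallest.

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])  # toy cluster

# Centroid: the average of all points; it need not be an actual data point.
centroid = points.mean(axis=0)

# Medoid: the actual data point with the smallest total distance to the others.
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)   # [1.5 1.5]
print("medoid:  ", medoid)     # [1. 0.]
```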
Types of Clusters: Contiguity-Based
- Contiguous cluster (nearest-neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters]

Types of Clusters: Density-Based
- A cluster is a dense region of points, separated from other regions of high density by regions of low density.
- Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters]

Types of Clusters: Conceptual Clusters
- Shared property or conceptual clusters: finds clusters that share some common property or represent a particular concept.
[Figure: 2 overlapping circles]
Types of Clusters: Objective Function
- Clusters defined by an objective function: find the clusters that minimize or maximize an objective function.
- Enumerating all possible ways of dividing the points into clusters and evaluating the "goodness" of each potential set of clusters with the given objective function is NP-hard.
- Objectives can be global or local:
  - Hierarchical clustering algorithms typically have local objectives.
  - Partitional algorithms typically have global objectives.
- A variation of the global-objective approach is to fit the data to a parameterized model, with the model parameters determined from the data. Mixture models, for example, assume that the data is a "mixture" of a number of statistical distributions.

Types of Clusters: Objective Function (continued)
- Map the clustering problem to a different domain and solve a related problem in that domain.
- The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points.
- Clustering is then equivalent to breaking the graph into connected components, one for each cluster: we want to minimize the edge weight between clusters and maximize the edge weight within clusters.
[Figure: a weighted proximity graph over points C1-C5 and its proximity matrix]
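As an illustration of this graph view (a minimal sketch; the threshold value, function name, and toy data are my own, not from the slides), one can connect points whose distance falls below a threshold and read the clusters off as connected components of the resulting graph:

```python
import numpy as np

def threshold_graph_clusters(points, eps):
    """Connect points whose Euclidean distance is below eps, then
    return the connected components of that graph as cluster labels."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    adjacency = (dist < eps) & ~np.eye(n, dtype=bool)

    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]                # depth-first search over the graph
        labels[start] = current
        while stack:
            node = stack.pop()
            for neighbor in np.flatnonzero(adjacency[node]):
                if labels[neighbor] == -1:
                    labels[neighbor] = current
                    stack.append(neighbor)
        current += 1
    return labels

pts = np.array([[0, 0], [0.1, 0], [5, 5], [5.2, 5.1]], dtype=float)
print(threshold_graph_clusters(pts, eps=1.0))   # [0 0 1 1]
```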
Characteristics of the Input Data Are Important
- Type of proximity or density measure: a derived measure, but central to clustering.
- Sparseness: dictates the type of similarity and adds to efficiency.
- Attribute type: dictates the type of similarity.
- Type of data: dictates the type of similarity; other characteristics, e.g., autocorrelation.
- Dimensionality.
- Noise and outliers.
- Type of distribution.
Clustering Algorithms
- K-means and its variants
- Hierarchical clustering
- Density-based clustering

K-means Clustering
- Partitional clustering approach.
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.
- The basic algorithm is very simple (a runnable sketch appears after the details below).

K-means Clustering Details
- Initial centroids are often chosen randomly; the clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- "Closeness" is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often changed to "until relatively few points change clusters".
- Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.

[Figure: Two different K-means clusterings of the same points, one optimal and one sub-optimal]
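The following is a minimal NumPy sketch of the basic algorithm described above (random initial centroids, assignment to the closest centroid, recomputation of centroids until assignments stop changing). Function and variable names are mine; in practice a library implementation such as scikit-learn's KMeans would normally be preferred.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Basic K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids (chosen randomly here).
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)

    for _ in range(max_iters):
        # 2. Form K clusters by assigning every point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # assignments stopped changing
        labels = new_labels
        # 3. Recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):        # keep the old centroid if a cluster empties
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, assignment = kmeans(data, k=2)
```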
[Figure: Importance of Choosing Initial Centroids, iterations 1 through 6 of K-means from one initial seeding]
Evaluating K-means Clusters
- The most common measure is the Sum of Squared Error (SSE): for each point, the error is the distance to the nearest cluster center; to get the SSE, we square these errors and sum them:

  SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

  where x is a data point in cluster C_i and m_i is the representative point for cluster C_i. One can show that m_i corresponds to the center (mean) of the cluster.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce the SSE is to increase K, the number of clusters; however, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.
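A small sketch of computing the SSE for a clustering, together with the "multiple runs" idea discussed later (run K-means several times and keep the clustering with the lowest SSE). It reuses the hypothetical kmeans function and the data array from the previous sketch.

```python
def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

# Multiple random restarts: keep the run with the lowest SSE
# (one of the remedies for bad initial centroids discussed below).
best_centroids, best_labels = min(
    (kmeans(data, k=2, seed=s) for s in range(10)),
    key=lambda result: sse(data, *result),
)
```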
[Figure: Importance of Choosing Initial Centroids, iterations 1 through 5 of K-means from a second initial seeding]
Problems with Selecting Initial Points
- If there are K "real" clusters, the chance of selecting one initial centroid from each cluster is small, and it shrinks as K grows.
- If the clusters are all the same size, n, the probability of picking exactly one centroid from each cluster is P = K! n^K / (Kn)^K = K! / K^K. For example, if K = 10, then P = 10!/10^10 ≈ 0.00036.
- Sometimes the initial centroids will readjust themselves in the "right" way, and sometimes they don't.
- Consider an example of five pairs of clusters (figures below).
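As a quick numerical check of that 0.00036 figure (a throwaway computation, not part of the slides):

```python
from math import factorial

K = 10
p = factorial(K) / K ** K   # probability that K random centroids hit K equal-size clusters once each
print(p)                    # 0.00036288
```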
10 Clusters Example
[Figure: K-means iterations 1-4, starting with two initial centroids in one cluster of each pair of clusters]
[Figure: K-means iterations 1-4, starting with some pairs of clusters having three initial centroids while other pairs have only one]
Solutions to the Initial Centroids Problem
- Multiple runs: helps, but the probability is not on your side.
- Sample the data and use hierarchical clustering to determine the initial centroids.
- Select more than K initial centroids and then choose the most widely separated ones among them.
- Postprocessing: fix up the set of clusters produced.
- Bisecting K-means: not as susceptible to initialization issues.

Handling Empty Clusters
- The basic K-means algorithm can yield empty clusters.
- Several strategies for re-seeding an empty cluster (a sketch follows this list):
  - Choose the point that contributes most to the SSE, i.e., the point farthest away from any current centroid.
  - Choose a point from the cluster with the highest SSE; this typically splits that cluster.
  - If there are several empty clusters, the above can be repeated several times.
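A minimal sketch of the first strategy, re-seeding an empty cluster with the point that currently contributes most to the SSE (function and variable names are illustrative):

```python
import numpy as np

def replace_empty_centroid(points, centroids, labels, empty_cluster):
    """Re-seed an empty cluster with the point that contributes most to the SSE,
    i.e. the point farthest from the centroid it is currently assigned to."""
    errors = np.linalg.norm(points - centroids[labels], axis=1)
    farthest = errors.argmax()
    centroids[empty_cluster] = points[farthest]
    labels[farthest] = empty_cluster
    return centroids, labels
```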
Updating Centers Incrementally
- In the basic K-means algorithm, centroids are updated after all points have been assigned to a centroid.
- An alternative is to update the centroids after each assignment (an incremental approach):
  - Each assignment updates zero or two centroids.
  - More expensive, and introduces an order dependency.
  - Never produces an empty cluster.
  - Can use "weights" to change the impact of each point.
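For the incremental update itself, a running mean can be maintained in constant time per assignment. This sketch (my own notation, not from the slides) shows the update applied when a point x is reassigned from cluster a to cluster b, the "two centroids" case:

```python
def move_point(x, centroids, counts, a, b):
    """Incrementally update centroids when point x is reassigned
    from cluster a to cluster b."""
    # Remove x's contribution from the running mean of cluster a ...
    counts[a] -= 1
    if counts[a] > 0:
        centroids[a] += (centroids[a] - x) / counts[a]
    # ... and add it to the running mean of cluster b.
    counts[b] += 1
    centroids[b] += (x - centroids[b]) / counts[b]
```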
Pre-processing and Post-processing
- Pre-processing: normalize the data; eliminate outliers.
- Post-processing: eliminate small clusters that may represent outliers; split "loose" clusters, i.e., clusters with relatively high SSE; merge clusters that are close and have relatively low SSE.
- These steps can also be used during the clustering process (e.g., ISODATA).
Bisecting K-means
- A variant of K-means that can produce either a partitional or a hierarchical clustering: repeatedly pick a cluster and split it in two with K-means until K clusters are obtained (see the sketch below).
[Figure: Bisecting K-means example]
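A compact sketch of bisecting K-means under those assumptions; it reuses the hypothetical kmeans function from the earlier sketch, and the choice of which cluster to split (the one with the largest SSE) is one common criterion rather than the only possibility.

```python
import numpy as np

def cluster_sse(cluster):
    """Total squared distance of a set of points to its own mean."""
    return float(((cluster - cluster.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(points, k, trials=5):
    """Repeatedly bisect the cluster with the largest SSE until k clusters remain."""
    clusters = [points]
    while len(clusters) < k:
        # Pick the cluster contributing the most error and remove it from the list.
        worst = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
        target = clusters.pop(worst)
        # Try several 2-means splits and keep the pair with the lowest total SSE.
        splits = []
        for seed in range(trials):
            _, labels = kmeans(target, k=2, seed=seed)
            halves = [target[labels == 0], target[labels == 1]]
            splits.append((sum(cluster_sse(h) for h in halves if len(h)), halves))
        best_halves = min(splits, key=lambda s: s[0])[1]
        clusters.extend(h for h in best_halves if len(h))
    return clusters
```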
Limitations of K-means
- K-means has problems when clusters have differing sizes, differing densities, or non-globular shapes.
- K-means also has problems when the data contains outliers.
[Figure: differing sizes, original points versus K-means with 3 clusters]
[Figure: differing density, original points versus K-means with 3 clusters]
[Figure: non-globular shapes, original points versus K-means with 2 clusters]

Overcoming K-means Limitations
- One solution is to use many clusters: K-means then finds parts of the natural clusters, which must be put back together afterwards.
[Figure: original points and the many-cluster K-means results for the three problem cases above]
Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree.
- Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: a nested clustering of six points and the corresponding dendrogram with merge heights]

Strengths of Hierarchical Clustering
- No particular number of clusters has to be assumed in advance: any desired number of clusters can be obtained by cutting the dendrogram at the proper level.
- The nested clusters may correspond to meaningful taxonomies, e.g., in the biological sciences (the animal kingdom, phylogeny reconstruction, ...).
Hierarchical Clustering: Two Main Types
- Agglomerative:
  - Start with the points as individual clusters.
  - At each step, merge the closest pair of clusters, until only one cluster (or k clusters) is left.
- Divisive:
  - Start with one, all-inclusive cluster.
  - At each step, split a cluster, until each cluster contains a single point (or there are k clusters).
- Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Agglomerative Clustering Algorithm
- The more popular hierarchical clustering technique; the basic algorithm is straightforward:
  1. Compute the proximity matrix and let each data point be a cluster.
  2. Repeat: merge the two closest clusters and update the proximity matrix.
  3. Until only a single cluster remains.
- A library-based sketch of this procedure appears below.
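A minimal sketch of agglomerative clustering and its dendrogram using SciPy; the toy data and the choice of single-link proximity are mine for illustration, not prescribed by the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

points = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 4])

# Build the merge tree: each row of Z records one merge of the two closest clusters.
Z = linkage(points, method='single')    # single link = distance between closest members

# Cut the tree to obtain any desired number of clusters, here 2.
labels = fcluster(Z, t=2, criterion='maxclust')

# The dendrogram visualizes the full sequence of merges.
dendrogram(Z)
plt.show()
```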