
基于遺傳算法的K均值聚類分析 (K-Means Clustering Based on Genetic Algorithm)
計(jì)算機(jī)科學(xué) (Computer Science), 2003, Vol. 30, No. 2
WANG Chang, CHEN Zeng-Qiang, YUAN Zhu-Zhi
(College of Information Science and Technology, Nankai University, Tianjin 300071)
(Supported by the National Natural Science Foundation of China under grant 60174021.)

Abstract  This paper proposes a K-means clustering method based on a genetic algorithm. We compare our method with the traditional K-means method and with a clustering method based on a simple genetic algorithm. The comparison shows that our method achieves better results than the other two. The drawback of the method is a comparatively slower clustering speed.
Keywords  Data mining, Clustering, Genetic algorithm, K-means clustering

1 Introduction
Clustering analysis groups data objects into classes or clusters such that objects within the same cluster are highly similar, while objects in different clusters differ considerably. Clustering is now widely applied and has become a major research area in data mining. Through clustering, one can identify dense and sparse regions, discover the overall distribution pattern of the data, and find interesting relationships among the data. Many clustering algorithms, such as K-means and CURE, already exist and have been applied in practice. Here, addressing the shortcomings of the most widely used of these, the K-means method, we propose a K-means clustering method based on a genetic algorithm. Experiments show that the new method yields clustering results that are consistently better than those of the traditional K-means method and also better than clustering with a pure genetic algorithm; the only cost is that, because genetic operations are involved, clustering is somewhat slower than with K-means.

2 General description of the K-means method
K-means is a partition-based clustering method and is currently the most widely used method in clustering analysis. Its basic idea is as follows: for a given number of clusters K, first create a random initial partition, then iteratively move the cluster centers to try to improve the partition. To reach the true optimum, the method would in theory have to enumerate all possible partitions; in practice a heuristic is used instead, representing each cluster by the mean of its members. This greatly reduces the computational complexity and increases the speed, making it possible to process large data sets. At the same time, however, it makes the method strongly dependent on the initial values, so it usually terminates in a local optimum only, and it is also sensitive to outliers.
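To make the iteration described above concrete, here is a minimal sketch of a standard K-means loop in Python/NumPy; the function name and defaults are illustrative, since the paper itself gives no code:

```python
import numpy as np

def kmeans(points, k, max_iter=100, rng=None):
    """Minimal K-means: random initial centers, then alternate assignment and update."""
    rng = np.random.default_rng(rng)
    # Random initial partition: pick k distinct points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each cluster is represented by the mean of its members.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged, typically to a local optimum
        centers = new_centers
    return centers, labels
```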

3 Clustering based on a pure genetic algorithm
The genetic algorithm is a search method based on natural selection and the laws of genetics. It combines random selection with the survival-of-the-fittest principle: the stronger individuals in a population have a greater chance of passing their genes on to the next generation. The different solutions of a practical problem can simply be encoded as bit strings, the so-called individuals Xi(t); their fitness f(t) is evaluated, and individuals are then selected with a probability Ps(Xi(t)) proportional to their fitness and subjected to the genetic operations. The genetic operations consist of two steps, crossover and mutation, which are likewise applied with given probabilities (crossover probability Pc, mutation probability Pm). After a number of generations of such operations, the algorithm can be expected to find the optimal or a near-optimal solution. Because the genetic algorithm searches with a whole population, it can explore several regions of the search space at the same time. Besides giving the genetic algorithm self-organizing, adaptive, and self-learning abilities, the survival-of-the-fittest selection and the genetic operations free it from restrictive conditions on the search space and from the need for any other auxiliary information. These properties let the genetic algorithm achieve high efficiency while remaining simple, easy to apply, and general, and they have made it increasingly popular in many fields.

Genetic algorithms have already been tried for clustering. The basic idea is as follows: a chromosome is divided into N parts, each part holding the class label of one data element; for example, a value of 1 in the 10th part means that data element No. 10 belongs to class 1. The fitness function is defined in terms of the Euclidean distances between the data, the initial population is generated randomly, and the genetic operations are the same as in a traditional genetic algorithm. According to the theoretical foundations of genetic algorithms, this method can find the global optimum and is not affected by outliers, so it is a good method in itself. However, when the data set is large and many clusters are required, the running time becomes excessive, and the result is often even worse than that of the K-means method; the approach is therefore usually suitable only for small data sets with few clusters. Others have improved the encoding by coding the coordinates of the cluster centers of the classes into the chromosome, with the remaining genetic operations unchanged from the traditional genetic algorithm. This variant converges faster than the first one and is more widely applicable; however, because it still relies purely on the genetic algorithm for optimization, convergence remains slow, and the center-based encoding in turn makes the algorithm sensitive to outliers.
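The first, label-based encoding can be illustrated with a small sketch. The fitness below uses the negative total within-cluster Euclidean distance, which is one plausible reading of the distance-based fitness mentioned above; all names are illustrative rather than taken from the paper:

```python
import numpy as np

def label_chromosome_fitness(points, chromosome, k):
    """Fitness of a label-encoded chromosome: gene i is the cluster label of point i.
    Higher fitness corresponds to a smaller total within-cluster distance."""
    total = 0.0
    for j in range(k):
        members = points[chromosome == j]
        if len(members) == 0:
            continue
        center = members.mean(axis=0)
        total += np.linalg.norm(members - center, axis=1).sum()
    return -total  # smaller distance -> larger fitness

def random_label_population(n_points, k, pop_size, rng=None):
    """Random initial population: one label gene per data element."""
    rng = np.random.default_rng(rng)
    return rng.integers(0, k, size=(pop_size, n_points))
```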

4 K-means clustering analysis based on a genetic algorithm
Our algorithm is built on the ideas of both the genetic algorithm and the K-means algorithm: we bring the K-means method into the evolution of the genetic algorithm. First, the first generation of the genetic algorithm is produced at random and evolution begins. In every generation, each individual is further optimized with the K-means method. This amounts to computing, for every individual in every generation, the local optimum of the K-means problem that uses this individual as its initial value, replacing the original individual with that local optimum, and then continuing the evolution until the maximum number of generations is reached or the result meets the requirements. In this way the genetic algorithm is intended to guarantee that the global optimum can be reached, while the K-means method speeds up convergence. The flow of the algorithm is as follows:
(1) Initialization: choose the genetic parameters and generate the initial population.
(2) For every individual of the current population, use the K-means method to optimize it into the local optimum of the K-means problem initialized with that individual.
(3) Perform selection on these locally optimal individuals.
(4) Perform crossover with a dynamically changing crossover probability.
(5) Perform mutation with a dynamically changing mutation probability.
(6) If the maximum number of generations has been reached or the fitness requirement is satisfied, continue to (7); otherwise return to (2).
(7) Output the result.
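A compact sketch of steps (1)–(7) above, assuming that each individual encodes K cluster centers (as described in the next subsection) and that fitness is the reciprocal of the squared error; the crossover and mutation rates use simple placeholder schedules rather than the paper's exact dynamic rules:

```python
import numpy as np

def sse(points, centers):
    """Sum of squared errors of the points against their nearest centers."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def kmeans_refine(points, centers, iters=10):
    """A few K-means iterations started from `centers` (local optimization of one individual)."""
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(len(centers))])
    return centers

def ga_kmeans(points, k, pop_size=20, max_gens=20, pc=0.8, rng=None):
    """GA + K-means hybrid following steps (1)-(7) above."""
    rng = np.random.default_rng(rng)
    # (1) Initialization: each individual is a random set of k cluster centers.
    pop = [points[rng.choice(len(points), k, replace=False)].copy() for _ in range(pop_size)]
    for gen in range(max_gens):
        # (2) Replace every individual by the K-means local optimum started from it.
        pop = [kmeans_refine(points, ind) for ind in pop]
        fit = np.array([1.0 / (1e-12 + sse(points, ind)) for ind in pop])
        best = pop[int(fit.argmax())].copy()
        # (3) Fitness-proportional selection on the locally optimal individuals.
        idx = rng.choice(pop_size, size=pop_size, p=fit / fit.sum())
        pop = [pop[i].copy() for i in idx]
        # (4) One-point crossover on the flattened center chromosomes.
        for a in range(0, pop_size - 1, 2):
            if rng.random() < pc:
                x, y = pop[a].ravel().copy(), pop[a + 1].ravel().copy()
                cut = int(rng.integers(1, x.size))
                x[cut:], y[cut:] = y[cut:].copy(), x[cut:].copy()
                pop[a], pop[a + 1] = x.reshape(k, -1), y.reshape(k, -1)
        # (5) Mutation; this rate schedule is a placeholder, not the paper's adaptive rule.
        pm = 0.01 + 0.1 * gen / max_gens
        for ind in pop:
            mask = rng.random(ind.shape) < pm
            ind[mask] += rng.normal(scale=points.std(), size=int(mask.sum()))
        pop[0] = best  # keep the best individual (elitism)
        # (6) Stop at the maximum generation (or when a fitness requirement is met).
    # (7) Output: the best individual found, refined once more by K-means.
    return kmeans_refine(points, best)
```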

Chromosome encoding strategy  We encode the center coordinates of the individual classes as the chromosome. For example, for a clustering problem with 3 classes over a 5-dimensional data set, if the three initial cluster centers are (10, 20, 30, 40, 50), (20, 40, 60, 80, 100), and (30, 50, 70, 90, 110), then the chromosome is encoded as (10, 20, 30, 40, 50, 20, 40, 60, 80, 100, 30, 50, 70, 90, 110). This center-based encoding shortens the chromosome and greatly improves the convergence speed of the genetic algorithm, which works well for complex clustering problems over large amounts of data.

Fitness function  We define the fitness function from the sum-of-squared-errors value E = \sum_{i=1}^{K} \sum_{p \in C_i} |p - M_i|^2, where K is the number of clusters, p is a point of the problem space, and M_i is the mean of cluster C_i; both p and M_i are multi-dimensional. An individual's fitness is larger the smaller its value of E.
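The encoding and the fitness can be written down directly. The encode/decode helpers and the reciprocal form of the fitness are assumptions consistent with the description above (smaller error, higher fitness), and the example reuses the three 5-dimensional centers from the text:

```python
import numpy as np

def encode(centers):
    """Concatenate the k center coordinate vectors into one chromosome."""
    return np.asarray(centers, dtype=float).ravel()

def decode(chromosome, k, dim):
    """Recover the k centers (each of dimension `dim`) from a chromosome."""
    return np.asarray(chromosome, dtype=float).reshape(k, dim)

def sum_squared_error(points, centers):
    """E = sum over clusters C_i, sum over p in C_i of ||p - M_i||^2,
    where each point p is assigned to its nearest center M_i."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def fitness(points, chromosome, k, dim):
    """Higher fitness for smaller squared error (the reciprocal form is an assumption)."""
    return 1.0 / (1e-12 + sum_squared_error(points, decode(chromosome, k, dim)))

# The example from the text: three initial 5-dimensional centers.
chrom = encode([[10, 20, 30, 40, 50],
                [20, 40, 60, 80, 100],
                [30, 50, 70, 90, 110]])
# chrom == [10, 20, 30, 40, 50, 20, 40, 60, 80, 100, 30, 50, 70, 90, 110]
```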

Here we consider only the clustering of numerical attributes; non-numerical attributes can first be converted into numerical ones by a suitable method, after which the same approach applies.

Initial population  The initial population can be generated in several ways. To obtain the global optimum, our initial population is generated completely at random. Since large amounts of data have to be clustered, the population cannot be made very large; but to keep the individuals of the genetic algorithm diverse, the population size should be made as large as time and hardware allow.

Selection, crossover and mutation  To let the genetic algorithm reach the global optimum, we use random selection with preservation of the best individual, and one-point crossover. Our simulations for this problem show that mutation is the key to escaping local optima, and the mutation operator strongly influences whether the global optimum can finally be reached. We therefore vary the mutation rate dynamically: it starts small, and as the average fitness of a generation approaches the fitness of the best individual, the mutation rate is increased adaptively.
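A sketch of these operators: fitness-proportional selection with best-individual preservation, one-point crossover, and a mutation rate that grows as the generation's average fitness approaches the best individual's fitness. The excerpt does not give the exact adaptation formula, so the one below is only an illustrative choice:

```python
import numpy as np

def adaptive_mutation_rate(fitnesses, pm_min=0.01, pm_max=0.3):
    """Start with a small rate; as the generation's mean fitness approaches the best
    individual's fitness (population converging), raise the rate so the search can
    jump out of local optima. The exact formula here is illustrative."""
    f = np.asarray(fitnesses, dtype=float)
    closeness = f.mean() / (f.max() + 1e-12)      # tends to 1.0 as the population converges
    return pm_min + (pm_max - pm_min) * closeness

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the gene tails of two chromosomes after a random cut point."""
    cut = int(rng.integers(1, parent_a.size))
    child_a = np.concatenate([parent_a[:cut], parent_b[cut:]])
    child_b = np.concatenate([parent_b[:cut], parent_a[cut:]])
    return child_a, child_b

def roulette_select(population, fitnesses, rng):
    """Random fitness-proportional selection; the best individual is always kept."""
    f = np.asarray(fitnesses, dtype=float)
    idx = rng.choice(len(population), size=len(population), p=f / f.sum())
    chosen = [population[i].copy() for i in idx]
    chosen[0] = population[int(f.argmax())].copy()   # best-individual preservation
    return chosen
```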

Influence of outliers  Since the goal of the evolution in our method is to find the optimal cluster centers, it is in essence the same as the K-means method: a small number of outliers will inevitably shift the cluster centers and affect the overall clustering result. Because of the global search ability of the genetic algorithm, however, our algorithm is less sensitive to outliers than the plain K-means method.

5 Simulation experiments
We implemented the new algorithm in VC++. The experimental data are Gaussian-distributed points generated around the cluster centers. Using the K-means method, the clustering method based on the pure genetic algorithm, and the new method discussed in this paper, we clustered point sets of 300, 1000, 5000, and 10000 points with 5-dimensional features into 3 clusters (K = 3) and into 10 clusters (K = 10). One group of results comparing the new method with the K-means method is shown in the table below:
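Test data of this kind (Gaussian points scattered around chosen centers) can be generated along the following lines; the spread value is an assumption, not a parameter reported in the paper:

```python
import numpy as np

def make_gaussian_clusters(centers, points_per_cluster, spread=5.0, rng=None):
    """Generate Gaussian-distributed points around each of the given centers."""
    rng = np.random.default_rng(rng)
    centers = np.asarray(centers, dtype=float)
    data = [rng.normal(loc=c, scale=spread, size=(points_per_cluster, centers.shape[1]))
            for c in centers]
    return np.vstack(data)

# e.g. 3 clusters of 5-dimensional points, 100 points each -> 300 points in total
data = make_gaussian_clusters([[10, 20, 30, 40, 50],
                               [20, 40, 60, 80, 100],
                               [30, 50, 70, 90, 110]], points_per_cluster=100)
```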

Table 1  Simulation comparison (table entries are the root of the squared error; 20 generations of evolution). For both K = 3 and K = 10, the table reports, for point sets of 300, 1000, 5000, and 10000 points, the error of the K-means method, the error of the new method, and the resulting performance improvement of the new method over K-means.

