




基于遗传算法的K均值聚类分析
K-Means Clustering Based on Genetic Algorithm
计算机科学 (Computer Science), 2003, Vol. 30, No. 2
王敞 (WANG Chang), 陈增强 (CHEN Zeng-Qiang), 袁著祉 (YUAN Zhu-Zhi)
College of Information Science and Technology, Nankai University, Tianjin 300071

Abstract: This paper proposes a K-means clustering method based on a genetic algorithm. We compare our method with the traditional K-means method and with a clustering method based on a simple genetic algorithm. The comparison shows that our method achieves better results than the other two; its drawback is a comparatively slower clustering speed.

Keywords: Data mining, Clustering, Genetic algorithm, K-means clustering
1 Introduction

Cluster analysis groups data objects into classes, or clusters, such that objects within the same cluster are highly similar to one another while objects in different clusters differ considerably. Clustering is now widely applied and has become a major research area in data mining. Through clustering, one can identify dense and sparse regions, discover the overall distribution pattern of the data, and find interesting relationships among data items. Many clustering algorithms already exist, such as K-means and CURE, and they have been applied in practice. Here, targeting the weaknesses of the K-means method, which is the most widely used of these algorithms, we propose a K-means clustering method based on a genetic algorithm. Experiments show that the new method consistently produces better clustering results than the traditional K-means method and also outperforms clustering with a plain genetic algorithm; the only cost is that, because genetic operators are involved, clustering is somewhat slower than with K-means.
2 General Description of the K-Means Method

The K-means method is a partition-based clustering method and is currently the most widely applied method in cluster analysis. Its basic idea is as follows: for a given number of clusters K, first create a random initial partition, then use an iterative procedure that tries to improve the partition by repeatedly moving the cluster centers. To reach the optimum, such a method would in theory have to enumerate all possible partitions; in practice a heuristic is used instead, representing each cluster by the mean of its members. This greatly reduces the computational complexity and increases the speed, making it feasible to process large data sets. At the same time, however, it makes the method strongly dependent on the initial values, so it usually terminates in only a local optimum, and it is also sensitive to outliers. A minimal sketch of the standard iteration is given below.
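As a reference for the description above, here is a minimal Python sketch of the standard K-means iteration (illustrative only; the paper's own experiments used a VC++ implementation, and the function and parameter names here are not from the paper):

```python
import numpy as np

def kmeans(points, k, max_iters=100, rng=None):
    """Plain K-means: random initial centers, then alternate assignment
    and center-update steps until the assignment stops changing."""
    rng = np.random.default_rng(rng)
    points = np.asarray(points, dtype=float)
    # Random initial partition: pick k distinct points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for iteration in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if iteration > 0 and np.array_equal(new_labels, labels):
            break  # converged to a (possibly only local) optimum
        labels = new_labels
        # Update step: each cluster is represented by the mean of its members.
        for i in range(k):
            members = points[labels == i]
            if len(members):
                centers[i] = members.mean(axis=0)
    return centers, labels
```

Because the initial centers are random, different runs can end in different local optima, which is exactly the sensitivity to initial values noted above.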
5、規(guī)律的搜索方法.該方法是隨機選擇與適者生存理論的結(jié)合.群體中的強者擁有更大的機會將其基因傳給后代。我們可以簡單地將一個實際問題的不同解編碼成位串也就是所謂個體(Xi(t.并評價它們的適應(yīng)度(f(t,然后基于個體的適應(yīng)度,按一定比例(PsIXi(t選擇個體.進行遺傳操作.遺傳操作包括交叉和變異兩步,這兩步操作也是接一定概率(交叉概率Pc.變異概率Pm進行的.在進行若干代遺傳操作后,算法就根有希望-本文得到國束自然科學基金(60174021資助找到最優(yōu)解或近似最優(yōu)解.由于遺傳算法采用群體的方式進行搜索,這使得它可以同時搜索空間內(nèi)的多十區(qū)域。在賦予遺傳算法自組織,自適應(yīng)。自學習能力的同時t優(yōu)勝劣汰的
While endowing the genetic algorithm with self-organization, self-adaptation and self-learning capabilities, survival-of-the-fittest selection and the genetic operators also free it from restrictive conditions on the search space and from the need for any auxiliary information. These properties make genetic algorithms not only efficient but also simple, easy to use and general, and they are therefore being applied in ever more fields.

Clustering with a genetic algorithm has already been attempted. The basic idea is as follows: a chromosome is divided into N parts, each part holding the cluster label of one data element; for example, a value of 1 in the 10th part means that data element No. 10 belongs to cluster 1. The fitness function is defined in terms of the Euclidean distances between the data points, the initial population is generated at random, and the genetic operations are similar to those of the traditional genetic algorithm. According to the theoretical foundations of genetic algorithms, this method can find the global optimum and is not affected by outliers, so it is a perfectly sound approach. However, when the amount of data is large or many clusters are required, the running time becomes excessive, and the results are often even worse than those of the K-means method; this approach is therefore usually suitable only for small data sets with few clusters. A sketch of this label-based encoding is given below.
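A minimal sketch of this label-per-element chromosome encoding (the layout follows the description above; the concrete distance-based fitness shown is an illustrative assumption, since the text only states that fitness is defined via Euclidean distance):

```python
import numpy as np

def random_label_chromosome(n_points, k, rng):
    """One chromosome = one cluster label per data element (N parts)."""
    return rng.integers(0, k, size=n_points)

def label_fitness(chromosome, points, k):
    """Smaller total Euclidean distance of the points to their cluster
    means -> higher fitness (one possible concrete choice)."""
    points = np.asarray(points, dtype=float)
    total = 0.0
    for i in range(k):
        members = points[chromosome == i]
        if len(members):
            total += np.linalg.norm(members - members.mean(axis=0), axis=1).sum()
    return 1.0 / (1.0 + total)
```

Note that the chromosome length equals the number of data elements, which is why this encoding becomes expensive for large data sets.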
7、時,運算時間就顯得過長.而且往往效果還不如K均值方法.因此這種方法通常只適用于數(shù)量較小,類別數(shù)不大的情況.也有人改進了遺傳算法的編碼方式.把各類別的乘糞中心坐標編碼成染色體,其他遺傳操作與傳統(tǒng)遺傳算法一致.這種方法,收斂速度要好于第一種遺傳算法.具有更強的適用性.不過這種方法由于單純依靠遺傳算法尋優(yōu).因此收斂依然緩慢,而且采用這種編碼方式又導致了算法對孤立點變得敏感。4基于遺傳算法的K均值聚類分析我們的算法是在遺傳算法思想與K均值算法思想的基礎(chǔ)上提出來的.我們把K均值方法引入到遺傳算法的進化中.首先.隨機產(chǎn)生遺傳算法的第一代并開始進化.在每代進化中,我們都用K均值方法對每個個體進行進一步的優(yōu)化
8、.選相當于在每一代都要對所有個體計算以其為初始值的K 均值問題的局部最優(yōu)結(jié)果,并以這些局部最優(yōu)結(jié)果替換掉原來的個體并繼續(xù)進化,直到達到最大代數(shù)或者結(jié)果符臺要求為止。這種方法力圖通過遺傳算法來保證獲取全局最優(yōu)解.而用K均值方法提高算法的收斂速度.163算法流程圖如下:【I韌始化:確定遺傳參數(shù);產(chǎn)生初始群體。(2對當前群體的每個個體,用K均值方法將其優(yōu)化為以該個體為初始值的K均值問題的局部最優(yōu)結(jié)果。【3對這些局部最憂個體進行選擇.(4動態(tài)改變交叉概率進行交叉。(5動態(tài)改變變異概率進行變異。(6若到達最大代數(shù)或者滿足適應(yīng)度要求,繼續(xù)執(zhí)行。否則.轉(zhuǎn)。(7輸出結(jié)果。染色體蝙碼策略我們將各個類別的中心坐標
9、編碼為染色體。例如對于一個類別為3的聚類問題,假設(shè)數(shù)據(jù)集為5維的,韌始的三個聚類中心點為(10。zo,30.40.50,(20,40,60, 80,lOO,(30,50,70,90,110,則我們的染色體編碼就為(10。203040-50.20t40,60.80,lOO.30,50,70,90,110。這種基于聚類中心的編碼方式減少了染色體的長度,大大提高了遺傳算法的收斂速度,對于求解大量數(shù)據(jù)的復(fù)雜聚類問題效果較好。適應(yīng)度函數(shù)我們將適應(yīng)度函數(shù)定義成誤差平方報值丘=一(/藝【P一脅。其中。K為聚類的數(shù)目.P為問題空間的點,Mi是簇ci的平均值.其中P,Mi都是多維的。這里,我們只考慮了數(shù)值屬性的
10、聚類,對于非數(shù)值屬性,可以用適當方法轉(zhuǎn)化為數(shù)值屬性之后再使用該方法。初姑群體初始群體有多種生成辦法.為了獲得全局最憂解,這里徒們的初始群體完全隨機生成??紤]到要對大量數(shù)據(jù)進行聚類,我們不能將初始群體設(shè)得很大。但是為了使遺傳算法數(shù)據(jù)元素能夠多樣性,我們在時間和硬件條件允許的條件下應(yīng)該盡可能地提高群體的規(guī)模.選擇交叉與變異為了讓遺傳算法能夠獲得全局最優(yōu)解,我們采用了隨機選擇,最佳個體保留的方式。交叉則采用一點交叉方式.對本問題.我們經(jīng)過仿真試驗發(fā)現(xiàn),變異是能夠跳出局部最優(yōu)的關(guān)鍵.變異算子對最終能否獲得全局最優(yōu)解影響重大.為此。我們對變異率實行動態(tài)變化.開始時,變異率較小隨著整代的平均適應(yīng)度趨于最優(yōu)
11、個體的適應(yīng)度時.變異率就自適應(yīng)地增大。缸立點的影響由于本方法的進化目標是找到最優(yōu)的聚類中心,因此與K均值方法的本質(zhì)是一樣的。少量的孤立點勢必導致聚類中心的偏移.影響壘局的聚類效果.但由于遺傳算法的全局尋忱特性,本算法對孤立點的敏感性要小于單純的K均值法。5仿真實驗我們用VC¨實現(xiàn)了新算法.實驗數(shù)據(jù)采用在聚類中心附近產(chǎn)生的高斯分布數(shù)據(jù)。我們用文中所談及的K均值方法基于單純遺傳算法的聚類方法和新方法對于300.1000, 5000,10000個有5維特征的點集進行了3類(K#3聚類分析164和10類(K一10聚類分析。下面把一組用新方法和K均值方法進行比較的結(jié)果通過表格顯示如下:表13疑
One set of results comparing the new method with the K-means method is shown in the following table.

Table 1. Simulation analysis (table entries are the root of the squared error; evolution: 20 generations). For each data-set size (300, 1000, 5000 and 10000 points) the table lists the error of the K-means method, the error of the new method, and the resulting performance improvement, once for K = 3 and once for K = 10; the new method gives the lower error in every case. [The individual numeric entries are not legible in this copy.]

基于遗传算法的K均值聚类分析 (K-Means Clustering Based on Genetic Algorithm)
Authors: 王敞, 陈增强, 袁著祉
Affiliation: 南开大学信息技术科学学院 (College of Information Science and Technology, Nankai University), Tianjin 300071
Journal: 计算机科学 (COMPUTER SCIENCE), 2003, 30(2)
Times cited: 16

References (6):
1. Han Jiawei, Kamber M. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems.
2. 徐勇, 刘奕文, 陈贺新, 戴逸松. 一种基于自适应遗传算法的聚类分析方法 (A clustering analysis method based on an adaptive genetic algorithm). 系统工程与电子技术, 1997(09).
3. 王磊, 戚飞虎. 大矢量空间聚类的遗传k-均值算法 (A genetic k-means algorithm for clustering in large vector spaces). 上海交通大学学报, 1999(09).
4. 韩斌. 基于单亲遗传算法和模糊C-均值算法的混合聚类算法 (A hybrid clustering algorithm based on a partheno-genetic algorithm and the fuzzy C-means algorithm). 华东船舶工业学院学报, 2000(03).
5. Maulik U., Bandyopadhyay S. Genetic algorithm-based clustering technique. Pattern Recognition, 2000.
6. Cowgill M. C., Harvey R. J., Watson L. T. A genetic algorithm approach to cluster analysis. 1999(07).
相似文献 (Similar documents, 10 items):

1. [Conference paper] Eva Kovacs, Iosif Ignat. Clustering with Prototype Entity Selection in Data Mining. — This paper presents an original classification method used in data mining, namely clustering with prototype entity selection (abbreviated CPES). The paper describes its mathematical essence, presents the algorithm, and reports the experimental results obtained in comparison with other classification methods.

2. [Journal article] Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, Zengyou He. On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm. — This correspondence describes extensions to the k-modes algorithm for clustering categorical data. By modifying a simple matching dissimilarity measure for categorical objects, a heuristic approach was developed in (Z. He et al., 2005) and (O. San et al., 2004) which allows the use of the k-modes paradigm to obtain a cluster with strong intra-similarity and to efficiently cluster large categorical data sets. The main aim of this paper is to rigorously derive the updating formula of the k-modes clustering algorithm with the new dissimilarity measure and the convergence of the algorithm under the optimization framework.
3. [Conference paper] Jian Yin, Zhi-Fang Tan, Jiang-Tao Ren, Yi-Qun Chen. An efficient clustering algorithm for mixed type attributes in large dataset. — Clustering is a widely used technique in data mining. Many clustering algorithms exist at present, but most of them are either limited to a single attribute type or can handle both data types yet are not efficient when clustering large data sets, and few algorithms do both well. In this article we propose a clustering algorithm that can handle large datasets with mixed types of attributes. We first use a CF*-tree (similar to the CF-tree in BIRCH) to pre-cluster the datasets, after which the dense regions are stored in leaf nodes; we then treat every dense region as a single point and use an ameliorated k-prototype algorithm to cluster these dense regions. Experiments show that this algorithm is very efficient in clustering large datasets with mixed types of attributes.

4. [Conference paper] Xiao-Gao Yu, Yin Jian. A new clustering algorithm based on KNN and DENCLUE. — Clustering in data mining is used for identifying useful patterns and interesting distributions in the underlying data. Clustering techniques have been studied extensively in e-commerce, statistics, pattern recognition, and machine learning, which increases the need for efficient and effective analysis methods to make use of this information. The traditional DENCLUE is an important clustering algorithm, but it is difficult to make its two global parameters (σ, ξ) globally effective. A new algorithm based on KNN and DENCLUE is proposed in this paper, which offers DENCLUE appropriate and globally effective parameters. First, the window width (WW) of each data point is determined and the whole data set is partitioned into fuzzy clusters (FC) by KNN based on KDE. Then the local σ of each FC is determined in an unsupervised way according to entropy theory. Finally, each local σ is mapped to the global σ and each FC is clustered independently, which gives the global σ and ξ global validity. Analysis and experiments show that this clustering method achieves better clustering quality and that the results are not sensitive to the parameter k.
5. [Journal article] Huang J. Z., Ng M. K., Hongqiang Rong, Zichen Li. Automated variable weighting in k-means type clustering. — This paper proposes a k-means type clustering algorithm that can automatically calculate variable weights. A new step is introduced and a formula for weight calculation is proposed, and the convergence theorem of the new clustering process is given. The variable weights produced by the algorithm measure the importance of variables in clustering and can be used for variable selection in data mining applications where large and complex real data are often involved. Experimental results on both synthetic and real data show that the new algorithm outperforms the standard k-means type algorithms in recovering clusters in data.

6. [Conference paper] Zhijie Xu, Laisheng Wang, Jiancheng Luo, Jianqin Zhang. A modified clustering algorithm for data mining. — Clustering is a widely used technique for finding interesting patterns residing in a dataset that were not obviously known; it is a division of data into groups of similar objects. The clustering of large data sets has received a lot of attention in recent years, yet clustering is still a challenging task, since many cluster algorithms fail to scale with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. This paper describes a clustering method for the unsupervised classification of objects in large data sets. The new methodology combines the simulated annealing algorithm with CLARANS (clustering large applications based upon randomized search) in order to cluster large data sets efficiently. The method is tested on a generated data set; the results show that the approach is quicker than CLARANS and produces a similar division of the data.
7. [Conference paper] Wei-Di Dai, Yue-Xian Hou, Pi-Lian He, Xiao-Shen Zheng. A clustering algorithm based on building a density-tree. — A new kind of clustering algorithm called CABDET is presented in this paper. CABDET creates a tree structure for every cluster, in which the neighbourhood radius of the current object is calculated from the local density of its father node. Unprocessed objects in the neighbourhood of the current object are added to extend the tree structure until no new object is found, and each density-tree is regarded as one cluster. CABDET requires only one input parameter, the initial radius of the root node, and has no density-threshold limitation. Its other characteristics include the ability to discover clusters of arbitrary shape and to process noisy data. Our experiments demonstrate that CABDET is significantly more accurate than DBSCAN in discovering density-changeable clusters, and that it is less sensitive to its input parameters.

8. [Conference paper] Tout S., Sverdlik W., Junping Sun. Cluster Detection with the PYRAMID Algorithm. — As databases continue to grow in size, efficient and effective clustering algorithms play a paramount role in data mining applications. Practical clustering faces several challenges, including identifying clusters of arbitrary shapes, sensitivity to the order of input, dynamic determination of the number of clusters, outlier handling, processing speed on massive data sets, handling higher dimensions, and dependence on user-supplied parameters. Many studies have addressed one or more of these challenges. PYRAMID, or parallel hybrid clustering using genetic programming and multi-objective fitness with density, is an algorithm that we introduced in previous research and that addresses some of the above challenges. While leaving significant challenges, such as handling higher dimensions, for future work, PYRAMID employs a combination of data parallelism, a form of genetic programming, and a multi-objective density-based fitness function in the context of clustering. This study adds to our previous research by exploring the detection capability of PYRAMID on a challenging dataset and evaluating its independence of user-supplied parameters.
9. [Conference paper] Islam M. Z., Brankovic L. DETECTIVE: a decision tree based categorical value clustering and perturbation technique for preserving privacy in data mining. — Data mining is a powerful tool for information discovery from huge datasets. Various sectors, including the commercial, government, financial, medical, and scientific ones, apply data mining techniques to datasets that typically contain sensitive individual information. During this process the datasets are exposed to several parties, which can potentially lead to disclosure of sensitive information and thus to breaches of privacy. Several privacy-preserving data mining techniques have recently been proposed. In this paper we focus on data perturbation techniques, i.e. those that add noise to the data in order to prevent exact disclosure of confidential values. Some of these techniques were designed for datasets having only numerical non-class attributes and a categorical class attribute. However, natural datasets are more likely to have both numerical and categorical non-class attributes, and occasionally they contain only categorical attributes. Noise addition techniques developed for numerical attributes are not suitable for such datasets, owing to the absence of a natural ordering among categorical values. In this paper we propose a new method for adding noise to categorical values, which makes use of the clusters that exist among these values. We first discuss several existing categorical clustering methods and point out the limitations they exhibit in our context. Then we present a novel approach to clustering categorical values and use it to perturb data while maintaining the patterns in the dataset. Our clustering approach partitions the values of a given categorical attribute rather than the records of the dataset; additionally, it operates on the horizontally partitioned dataset, and it is possible for two values to belong to the same cluster in one horizontal partition of the dataset and to two distinct clusters in another partition. Finally, we provide some experimental results in order to evaluate our perturbation technique and to …

10. [Journal article] Ng R. T., Jiawei Han. CLARANS: a method for clustering objects for spatial data mining. — Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering …