Lecture 8: Mining Frequent Patterns, Associations and Correlations
Xiamen University, School of Software

Outline
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules (self-study)
- From association mining to correlation analysis
- Constraint-based association rules (self-study)
- Summary

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (itemset, subsequence, substructure, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS'93]
- Motivation: finding the associations hidden inside data
  - Which products are frequently purchased together? Beer and diapers?!
  - What related products does a customer often buy after buying a PC?
  - What kinds of DNA are sensitive to this new virus?
  - Can we automatically classify Web documents?
- Applications: market basket analysis, cross-selling, catalog design, clickstream analysis, DNA sequence analysis

Why Is Frequent Pattern Mining Important?
- Discloses intrinsic and important properties of data sets
- Frequent patterns are the foundation for many essential data mining tasks:
  - association, correlation, and causality analysis
  - sequential and structural (e.g., subgraph) pattern analysis
  - pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - classification
  - cluster analysis: frequent-pattern-based clustering
  - data warehousing: iceberg cubes
  - semantic data compression

Classification of Association Rules
- By the type of values handled in the rule:

  - Boolean association rules: consider only the presence or absence of items
  - Quantitative association rules: associations between quantitative items or attributes, e.g.
    age(X, "30..39") ∧ income(X, "42K..48K") ⇒ buys(X, "high_resolution_TV")
- By the number of data dimensions (predicates) involved:
  - Single-dimensional: buys(X, "computer") ⇒ buys(X, "financial_management_software")
  - Multidimensional: see the example above
- By the levels of abstraction involved: single-level vs. multilevel, e.g.
  age(X, "30..39") ⇒ buys(X, "laptop computer")
  age(X, "30..39") ⇒ buys(X, "computer")
- By the completeness of the patterns mined: the complete set of frequent itemsets; closed and maximal frequent itemsets; constrained frequent itemsets; approximate frequent itemsets
- By the kind of rules mined: association rules, correlation rules, strong gradient relationships, etc.
- By the kind of patterns mined: frequent itemset mining, sequential pattern mining, structured pattern mining

Basic Concepts
- Itemset: X = {x1, ..., xk}
- Goal: find all rules X ⇒ Y that satisfy a minimum support and a minimum confidence
  - support, s: the probability that a transaction contains X ∪ Y
  - confidence, c: the conditional probability that a transaction having X also contains Y
- Let sup_min = 50%, conf_min = 50%
- Frequent patterns: {A}:3, {B}:3, {D}:4, {E}:3, {A,D}:3
- Association rules: A ⇒ D (60%, 100%); D ⇒ A (60%, 75%)
- [Figure: Venn diagram of customers buying beer, diapers, or both]

  Transaction-id | Items bought
  10             | A, B, D
  20             | A, C, D
  30             | A, D, E
  40             | B, E, F
  50             | B, C, D, E, F

Association Rules: Formal Definition
- An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
- D = {t1, t2, ..., tn} is the set of transactions; each transaction tk = {i1, i2, ..., ip}, where each i is called an item
- I = {i1, i2, ..., im} is the set of all items
- The rule A ⇒ B holds in the data set D with support s and confidence c

Two Measures of Rule Interestingness
- Support: the percentage of transactions that contain A ∪ B; reflects the usefulness of the rule
  - support(A ⇒ B) = P(A ∪ B)
  - minimum support threshold: min_sup; support count
- Confidence: the percentage of transactions containing A that also contain B; reflects the certainty of the rule
  - confidence(A ⇒ B) = P(B | A)
  - minimum confidence threshold: min_conf
- Strong rules: rules satisfying both min_sup and min_conf
- Example: computer ⇒ financial_management_software [support = 2%, confidence = 60%]

Related Concepts
- Itemset: a set of items; k-itemset: an itemset containing k items
- Frequency of an itemset: the number of transactions that contain it
- Frequent itemset: an itemset whose support is no less than min_sup
- Mining association rules is a two-step process:
  1. Find all frequent itemsets (the central problem)
  2. Generate strong association rules from the frequent itemsets

Scalable Methods for Mining Frequent Patterns

- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods, three major approaches:
  - Apriori (Agrawal & Srikant, VLDB'94)
  - Frequent pattern growth (FP-growth; Han, Pei & Yin, SIGMOD'00)
  - Vertical data format (CHARM; Zaki & Hsiao, SDM'02)

Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant, VLDB'94; Mannila et al., KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example (database TDB, minimum support count = 2)

  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

- 1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 → L1 = {A}:2, {B}:3, {C}:3, {E}:3
- C2 = {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
- 2nd scan: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 → L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
- C3 = {B,C,E}; 3rd scan → L3 = {B,C,E}:2

The Apriori Algorithm (pseudocode)

  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;

Important Details of Apriori
- How to generate candidates?
  - Step 1: self-join Lk
  - Step 2: pruning (a candidate may still be infrequent; prune by the Apriori principle)
- Example of candidate generation: L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

How to Generate Candidates?
- Suppose the items in Lk-1 are listed in order
- Step 1: self-join Lk-1

  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning: delete every candidate that has an infrequent (k-1)-subset
- Rule generation: for each frequent itemset l and each nonempty subset s of l, output the rule s ⇒ (l − s) if its confidence meets min_conf

3. Challenges of Frequent Pattern Mining (self-study)
- Challenges:
  - multiple scans of the transaction database
  - a huge number of candidates
  - tedious workload of support counting for candidates

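The pseudocode and the candidate-generation steps above can be combined into a short runnable sketch; here the self-join and the Apriori prune are implemented with plain set operations rather than the ordered SQL-style join:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori: generate candidates C(k+1) from L(k) by
    self-join + prune, then count them in one pass over the database."""
    # L1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, n in counts.items() if n >= min_count}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Self-join: unite pairs of frequent k-itemsets that share k-1 items
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Apriori prune: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One database pass to count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

# TDB from the slide's example, minimum support count = 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, 2)
print(sorted("".join(sorted(s)) for s in result))
# ['A', 'AC', 'B', 'BC', 'BCE', 'BE', 'C', 'CE', 'E']
```

The output reproduces the example's L1, L2, and L3 exactly; {D} is dropped in the first scan and {A,B}, {A,E} in the second.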
- Improving Apriori: general ideas
  - reduce the number of passes over the transaction database
  - shrink the number of candidates
  - facilitate support counting of candidates

Partition: Scan the Database Only Twice (self-study)
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Scan 1: partition the database and find the local frequent patterns
- Scan 2: consolidate the global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95

DHP: Reduce the Number of Candidates (self-study)
- A k-itemset whose hashing bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}, ...
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the summed count of ab, ad, ae is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95

Sampling for Frequent Patterns (self-study)
- Select a sample of the original database; mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns need to be checked
  - Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find the missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. VLDB'96

DIC: Reduce the Number of Scans (self-study)
- [Figure: the itemset lattice over {A, B, C, D}]
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
- [Figure: counting timelines for 1-, 2-, and 3-itemsets under Apriori vs. DIC]
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97

Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100: number of scans: 100; number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?

4. Mining Frequent Itemsets Without Candidate Generation
- Frequent-pattern growth (FP-growth):
  - Compress the database into an FP-tree while retaining the itemset association information
  - Divide the compressed database into a set of conditional databases, one per frequent item, and mine each conditional database separately

Construct an FP-tree from a Transaction Database
- Scan the DB once to find the frequent 1-itemsets (single-item patterns)
- Sort the frequent items in descending order of frequency, giving the f-list
- Scan the DB again and construct the FP-tree
- [Figure: FP-tree construction example with f-list I2:7, I1:6, I3:6, I4:2, I5:2]

Benefits of the FP-tree Structure
- Completeness: preserves complete information for frequent pattern mining; never breaks a long pattern of any transaction
- Compactness:
  - irrelevant information is reduced: infrequent items are gone
  - items in frequency-descending order: the more frequently an item occurs, the more likely its node is to be shared
  - never larger than the original database (not counting node-links and count fields)

Mining Frequent Patterns with FP-trees
- Idea: frequent pattern growth; recursively grow frequent patterns by pattern and database partitioning
- Method:
  - For each frequent item, construct its conditional pattern base and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Stop when the resulting FP-tree is empty or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

Why Is FP-Growth the Winner?
- Divide and conquer: decompose both the mining task and the DB according to the frequent patterns obtained so far; this leads to focused searches of smaller databases
- Other factors:
  - no candidate generation, no candidate test
  - compressed database: the FP-tree structure
  - no repeated scans of the entire database
  - basic operations: counting local frequent items and building sub-FP-trees; no pattern search and matching

Mining Frequent Itemsets Using the Vertical Data Format (self-study)
- Vertical format: t(AB) = {T11, T25, ...}; a tid-list is the list of transaction ids containing an itemset
- Deriving closed patterns based on vertical intersections:
  - t(X) = t(Y): X and Y always occur together
  - t(X) ⊂ t(Y): a transaction having X always has Y
- Using diffsets to accelerate mining: keep track only of the differences between tid-lists
  - t(X) = {T1, T2, T3}, t(XY) = {T1, T3} ⇒ Diffset(XY, X) = {T2}

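The tid-list and diffset ideas can be sketched in a few lines; `t_Y` below is a hypothetical tid-list chosen only so that the intersection reproduces the slide's t(XY):

```python
# Tid-lists (vertical format).  t_X is the slide's example; t_Y is a
# hypothetical tid-list that makes t(XY) come out as in the slide.
t_X = {"T1", "T2", "T3"}
t_Y = {"T1", "T3", "T4"}

# Support of XY by tid-list intersection: support(XY) = |t(X) ∩ t(Y)|.
t_XY = t_X & t_Y
print(sorted(t_XY))  # ['T1', 'T3']

# Diffset(XY, X) stores only what XY loses relative to X, which is
# usually far smaller than the full tid-list on dense data.
diffset = t_X - t_XY
print(diffset)  # {'T2'}

# support(XY) can be recovered as support(X) - |Diffset(XY, X)|.
print(len(t_X) - len(diffset))  # 2
```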
Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (Pasquier et al., ICDT'99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (Bayardo, SIGMOD'98)

Closed Patterns and Max-Patterns: Exercise
- DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 1
- What is the set of closed itemsets? <a1, ..., a100>: 1; <a1, ..., a50>: 2
- What is the set of max-patterns? <a1, ..., a100>: 1
- What is the set of all patterns? Far too many!

MaxMiner: Mining Max-Patterns (self-study)
- 1st scan: find the frequent items: A, B, C, D, E
- 2nd scan: find the support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE; DE (the potential max-patterns)
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in later scans
- R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98

  Tid | Items
  10  | A, B, C, D, E
  20  | B, C, D, E
  30  | A, C, D, F

Mining Frequent Closed Patterns: CLOSET (self-study)
- F-list: the list of all frequent items in support-ascending order; here F-list = d-a-f-e-c
- Divide the search space: patterns having d; patterns having d but not a; etc.
- Find frequent closed patterns recursively: every transaction having d also has c, f, a, so cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00

  TID | Items            (min_sup = 2)
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f

CLOSET+: Mining Closed Itemsets by Pattern Growth (self-study)
- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set-enumeration tree can be pruned
- Hybrid tree projection: bottom-up physical tree projection; top-down pseudo tree projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, it can be pruned from the header tables at the higher levels
- Efficient subset checking

Mining Various Kinds of Association Rules
- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations

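The closed and max-pattern definitions above can be checked mechanically. The three nested transactions below are our own miniature of the exercise, not the slide's 100-item DB; as in the exercise, the closed set keeps the support information while the max set only keeps the longest frontier:

```python
from itertools import combinations

# Toy database in the spirit of the exercise (nested transactions).
db = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
min_sup = 1

def support(itemset):
    return sum(1 for t in db if itemset <= t)

items = set().union(*db)
candidates = [frozenset(s) for r in range(1, len(items) + 1)
              for s in combinations(items, r)]
frequent = [s for s in candidates if support(s) >= min_sup]

def is_closed(x):
    # Closed: no frequent proper superset with the same support.
    return all(not (x < y and support(y) == support(x)) for y in frequent)

def is_max(x):
    # Max-pattern: no frequent proper superset at all.
    return all(not x < y for y in frequent)

closed = {tuple(sorted(s)) for s in frequent if is_closed(s)}
maximal = {tuple(sorted(s)) for s in frequent if is_max(s)}
print(sorted(closed))   # [('a',), ('a', 'b'), ('a', 'b', 'c')]
print(sorted(maximal))  # [('a', 'b', 'c')]
```

All 7 itemsets are frequent, but only 3 are closed and only 1 is maximal, mirroring the compression the slides describe.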
Mining Multiple-Level Association Rules
- Items often form hierarchies
- Flexible support settings: items at lower levels are expected to have lower support
- Exploration of shared multilevel mining (Agrawal & Srikant, VLDB'95; Han & Fu, VLDB'95)
- Example: Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%]
  - Uniform support: level 1 min_sup = 5%, level 2 min_sup = 5%
  - Reduced support: level 1 min_sup = 5%, level 2 min_sup = 3%

Multilevel Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items
- Example:
  - milk ⇒ wheat bread [support = 8%, confidence = 70%]
  - 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second
- A rule is redundant if its support is close to the "expected" value derived from the rule's ancestor

Mining Multidimensional Associations
- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multidimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finitely many possible values, no ordering among the values
- Quantitative attributes: numeric values with an implicit ordering

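The redundancy test above can be sketched as a simple "close to expected" check. The 25% share of 2% milk within milk sales and the tolerance band are hypothetical numbers chosen to illustrate the idea, not values from the slides:

```python
def is_redundant(child_sup, ancestor_sup, child_share, tol=0.25):
    """A descendant rule is redundant when its support is close to the
    value expected from its ancestor: ancestor support scaled by the
    child item's share of the ancestor item's sales.
    `tol` is a hypothetical tolerance band (25% here)."""
    expected = ancestor_sup * child_share
    return abs(child_sup - expected) <= tol * expected

# Ancestor: milk => wheat bread, support 8%.
# Child:    2% milk => wheat bread, support 2%.
# Assume (hypothetically) that 2% milk is 25% of milk purchases, so the
# expected child support is 8% * 0.25 = 2%: the child adds nothing new.
print(is_redundant(child_sup=0.02, ancestor_sup=0.08, child_share=0.25))  # True
```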
Mining Quantitative Associations
- Techniques can be categorized by how numeric attributes, such as age or salary, are treated:
  1. Static discretization of quantitative attributes before mining
  2. Dynamic quantitative association rules (e.g., Agrawal & Srikant, SIGMOD'96)
  3. Clustering: distance-based associations (e.g., Yang & Miller, SIGMOD'97): one-dimensional clustering, then association
  4. Deviation analysis (e.g., Aumann & Lindell, KDD'99): sex = female ⇒ wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes
- Attributes are discretized prior to mining using a predefined concept hierarchy: numeric values are replaced by intervals, e.g., income divided into "0..20K", "21K..30K", etc.
- The Apriori algorithm is modified to find all frequent predicate sets
- Data relevant to the mining task can be stored in a data cube, e.g., the cuboids (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)

Mining Dynamic Quantitative Association Rules
- Example: age(X, "34-35") ∧ income(X, "30K-50K") ⇒ buys(X, "high resolution TV")
- Numeric attributes are "dynamically discretized" so that the confidence or compactness of the mined rules is maximized
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat, quantized over a grid

Distance-Based Association Rules
- Binning methods do not capture the semantics of interval data: they ignore the distances between data points or between intervals
- A distance-based partitioning considers both density (the number of points in an interval) and the "closeness" of the points within an interval, which helps produce a more meaningful discretization

Are Association Rules Always Interesting?
- Only the user can determine whether a rule is interesting, subjectively or objectively
- Objective interestingness measures can be used to prune uninteresting association rules
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence

             | Basketball | Not basketball | Sum (row)
  Cereal     | 2000       | 1750           | 3750
  Not cereal | 1000       | 250            | 1250
  Sum (col.) | 3000       | 2000           | 5000

Interestingness Measure: Correlation (Lift)
- A measure of dependent/correlated events: lift(A, B) = P(A ∪ B) / (P(A) · P(B))

Are Lift and χ² Good Measures of Correlation?
- "buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- So many interestingness measures have been proposed

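With the contingency table above, lift makes the hidden negative correlation explicit; a minimal computation:

```python
# 2x2 contingency counts from the slide (5000 students in total).
n = 5000
n_basketball = 3000   # column sum
n_cereal = 3750       # row sum
n_both = 2000         # basketball AND cereal

p_b = n_basketball / n
p_c = n_cereal / n
p_bc = n_both / n

# lift(A, B) = P(A ∪ B) / (P(A) * P(B)); < 1 means negative correlation.
lift = p_bc / (p_b * p_c)
print(round(lift, 3))  # 0.889 -> basketball and cereal are negatively
                       # correlated, despite the 66.7% confidence
```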
  (Tan, Kumar, Srivastava, KDD'02)

             | Milk   | No Milk | Sum (row)
  Coffee     | m, c   | ~m, c   | c
  No Coffee  | m, ~c  | ~m, ~c  | ~c
  Sum (col.) | m      | ~m      |

  DB | m, c | ~m, c | m, ~c | ~m, ~c  | lift | all-conf | coh  | χ²
  A1 | 1000 | 100   | 100   | 10,000  | 9.26 | 0.91     | 0.83 | 9055
  A2 | 100  | 1000  | 1000  | 100,000 | 8.44 | 0.09     | 0.05 | 670
  A3 | 1000 | 100   | 10000 | 100,000 | 9.18 | 0.09     | 0.09 | 8172
  A4 | 1000 | 1000  | 1000  | 1000    | 1    | 0.5      | 0.33 | 0

Constraint-Based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many, and unfocused
- Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining:
  - user flexibility: the user provides constraints on what is to be mined
  - system optimization: the system exploits these constraints for efficient mining

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint, using SQL-like queries: e.g., find product pairs sold together in Chicago stores in Dec.'02
- Dimension/level constraint: e.g., in relevance to region, price, brand, customer category
- Rule (or pattern) constraint: e.g., small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint: e.g., strong rules with min_support ≥ 3% and min_confidence ≥ 60%

Anti-Monotonicity in Constraint Pushing
- Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
  - sum(S.price) ≤ v is anti-monotone
  - sum(S.price) ≥ v is not anti-monotone
- Example: the constraint C: range(S.profit) ≤ 15 is anti-monotone
  - The itemset ab violates C, and so does every superset of ab

  TDB (min_sup = 2)
  TID | Transaction
  10  | a, b, c, d, f
  20  | b, c, d, f, g, h
  30  | a, c, d, e, f
  40  | c, e, f, g

  Item | Profit
  a    | 40
  b    | 0
  c    | -20
  d    | 10
  e    | -30
  f    | 30
  g    | 20
  h    | -10

Monotonicity in Constraint Pushing
- Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets

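The anti-monotone range constraint above can be checked directly against the slide's profit table; a small sketch (the helper name `satisfies` is ours):

```python
# Profit table from the slide.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def satisfies(itemset, bound=15):
    """Constraint C: range(S.profit) <= bound.  Anti-monotone: adding
    items can only widen the range, so once an itemset violates C,
    every superset violates it too and can be pruned."""
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals) <= bound

print(satisfies({"a", "b"}))       # False: range = 40 - 0 = 40 > 15,
                                   # so {a,b} and all its supersets are pruned
print(satisfies({"a", "b", "c"}))  # False: the superset indeed still violates C
print(satisfies({"a", "f"}))       # True:  range = 40 - 30 = 10
```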