Data Mining: Concepts and Techniques
Slides for Textbook Chapter 6
Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign

Slide 2: Chapter 6: Mining Association Rules in Large Databases
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Mining various kinds of association/correlation rules
- Constraint-based association mining
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary

Slide 3: What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?

Slide 4: Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.

Slide 5: Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, ..., xk}
- Find all rules X => Y with minimum confidence and support
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
- Example transaction database:
  Transaction-id 10: A, B, C
  Transaction-id 20: A, C
  Transaction-id 30: A, D
  Transaction-id 40: B, E, F
- Let min_support = 50%, min_conf = 50%:
  - A => C (support 50%, confidence 66.7%)
  - C => A (support 50%, confidence 100%)
- (Figure: customers who buy beer, customers who buy diapers, and customers who buy both.)

Slide 6: Mining Association Rules — an Example
- Min. support 50%, min. confidence 50% (same transaction database as above)
- Frequent patterns and their supports: {A} 75%, {B} 50%, {C} 50%, {A, C} 50%
- For rule A => C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.7%
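To make the numbers on these two slides concrete, here is a minimal Python sketch (my own illustration, not part of the original slides) that recomputes the support and confidence of A => C and C => A over the four example transactions.

```python
transactions = [
    {"A", "B", "C"},   # TID 10
    {"A", "C"},        # TID 20
    {"A", "D"},        # TID 30
    {"B", "E", "F"},   # TID 40
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

sup_ac = support({"A", "C"}, transactions)         # 0.50
conf_a_c = sup_ac / support({"A"}, transactions)   # 0.50 / 0.75 = 0.667
conf_c_a = sup_ac / support({"C"}, transactions)   # 0.50 / 0.50 = 1.0
print(f"A => C: support={sup_ac:.0%}, confidence={conf_a_c:.1%}")
print(f"C => A: support={sup_ac:.0%}, confidence={conf_c_a:.1%}")
```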

Slide 7: Chapter 6 outline (repeated) — next topic: algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases.

Slide 8: Apriori: A Candidate Generation-and-Test Approach
- Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
- Method:
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
- Performance studies show its efficiency and scalability
- Agrawal & Srikant 1994; Mannila, et al. 1994

Slide 9: The Apriori Algorithm — An Example (minimum support count 2, as the pruning of the tables implies)
- Database TDB:
  10: A, C, D
  20: B, C, E
  30: A, B, C, E
  40: B, E
- 1st scan — C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3  →  L1: {A}:2, {B}:3, {C}:3, {E}:3
- C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
- 2nd scan — C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2  →  L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
- C3: {B,C,E}
- 3rd scan — L3: {B,C,E}:2

Slide 10: The Apriori Algorithm (pseudo-code)
- Ck: candidate itemset of size k; Lk: frequent itemset of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
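Below is a runnable Python sketch of the level-wise loop on this slide. It is my own illustration under simplifying assumptions: candidates are generated by joining frequent k-itemsets and pruning with the Apriori property (detailed on the next slides), and support counting is a plain scan over the transactions rather than a hash-tree.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise frequent-itemset mining, following the pseudo-code above.
    db: list of item sets; min_sup: minimum support as an absolute count."""
    def frequent(candidates):
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in counts:
                if c <= t:                       # candidate contained in transaction t
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {i for t in db for i in t}
    L = frequent({frozenset([i]) for i in items})   # L1
    all_frequent = dict(L)
    k = 1
    while L:
        # C_{k+1}: join pairs of frequent k-itemsets; keep only those (k+1)-sets
        # whose every k-subset is frequent (the Apriori pruning principle).
        Ck1 = {a | b for a in L for b in L
               if len(a | b) == k + 1
               and all(frozenset(s) in L for s in combinations(a | b, k))}
        L = frequent(Ck1)
        all_frequent.update(L)
        k += 1
    return all_frequent

# The TDB from this slide; with min_sup = 2 the result includes {B, C, E} with support 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)
```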

Slide 11: Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

Slide 12: How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
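The SQL-style join and prune steps above translate into a short helper. This is a hedged sketch of my own (the name apriori_gen and the tuple representation are assumptions, not from the slides): (k-1)-itemsets are stored as item tuples in a fixed lexicographic order, joined on their first k-2 items, and pruned by the Apriori property.

```python
from itertools import combinations

def apriori_gen(Lk_minus_1):
    """Self-join L_{k-1} on the first k-2 items, then prune candidates that have
    an infrequent (k-1)-subset. L_{k-1}: set of tuples of items in sorted order."""
    Lset = set(Lk_minus_1)
    Ck = set()
    for p in Lk_minus_1:
        for q in Lk_minus_1:
            # join step: p and q agree on all but the last item, p's last item < q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune step: every (k-1)-subset of c must be frequent
                if all(s in Lset for s in combinations(c, len(c) - 1)):
                    Ck.add(c)
    return Ck

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a','b','c','d')}; 'acde' is pruned because 'ade' is not in L3
```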

Slide 13: How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
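As a point of reference, here is a brute-force Python sketch of the counting step (my own illustration): for each transaction, its k-subsets are enumerated and looked up in the candidate set. The hash-tree on this slide plays the role of this lookup, but avoids enumerating subsets that cannot reach any stored candidate.

```python
from itertools import combinations
from collections import Counter

def count_supports(db, Ck, k):
    """db: list of item sets; Ck: iterable of frozensets of size k."""
    candidates = set(Ck)
    counts = Counter()
    for t in db:
        for s in combinations(sorted(t), k):   # enumerate the k-subsets of t
            fs = frozenset(s)
            if fs in candidates:
                counts[fs] += 1
    return counts

db = [{1, 2, 3, 5, 6}, {1, 4, 5}, {2, 3, 5}]
C2 = [frozenset(p) for p in [(1, 2), (2, 3), (3, 5), (5, 6), (4, 5)]]
print(count_supports(db, C2, k=2))
```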

Slide 14: Example: Counting Supports of Candidates
- (Figure: a hash-tree over candidate 3-itemsets {2 3 4}, {5 6 7}, {1 4 5}, {1 3 6}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}; the subset function hashes items into the buckets {1, 4, 7}, {2, 5, 8}, {3, 6, 9}. The transaction 1 2 3 5 6 is matched by recursively expanding 1 + 2 3 5 6, 1 2 + 3 5 6, and 1 3 + 5 6.)

Slide 15: Efficient Implementation of Apriori in SQL
- Hard to get good performance out of pure SQL (SQL-92) based approaches alone
- Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
- Get orders-of-magnitude improvement
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98.

Slide 16: Challenges of Frequent Pattern Mining
- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori — general ideas:
  - Reduce the number of transaction-database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates

Slide 17: DIC: Reduce Number of Scans
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
- (Figure: the itemset lattice over ABCD; DIC starts counting 2-itemsets and 3-itemsets part-way through a scan, whereas Apriori spends one full pass per itemset length.)
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.

Slide 18: Partition: Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate the global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.

Slide 19: Sampling for Frequent Patterns
- Select a sample of the original database; mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
  - Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96.
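A rough sketch of the sampling idea, under assumptions of my own: `mine(db, min_sup)` can be any frequent-itemset miner (for example the Apriori sketch shown earlier), the sample threshold is lowered by a slack factor, and the negative-border check of Toivonen's actual algorithm is omitted.

```python
import random

def sample_then_verify(db, min_sup_frac, mine, sample_frac=0.1, slack=0.8):
    """Mine a random sample with a lowered threshold, then verify the resulting
    candidates against the full database in a single scan."""
    n = max(1, int(len(db) * sample_frac))
    sample = random.sample(db, n)
    lowered = max(1, int(n * min_sup_frac * slack))
    candidates = mine(sample, lowered)                 # frequent in the sample
    verified = {}
    for itemset in candidates:                         # one scan of the full DB
        count = sum(itemset <= t for t in db)
        if count >= min_sup_frac * len(db):
            verified[itemset] = count
    return verified
```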

27、re of frequent patterns are checkedExample: check abcd instead of ab, ac, , etc.Scan database again to find missed frequent patternsH. Toivonen. Sampling large databases for association rules. In VLDB96August 12, 2022Data Mining: Concepts and Techniques20DHP: Reduce the Number of CandidatesA k-items

28、et whose corresponding hashing bucket count is below the threshold cannot be frequentCandidates: a, b, c, d, eHash entries: ab, ad, ae bd, be, de Frequent 1-itemset: a, b, d, eab is not a candidate 2-itemset if the sum of count of ab, ad, ae is below support thresholdJ. Park, M. Chen, and P. Yu. An
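A small sketch of the DHP idea (my own simplification, not the published algorithm): while counting 1-itemsets in the first pass, every 2-itemset of every transaction is hashed into a small bucket table; a pair can become a candidate 2-itemset only if its bucket count reaches min_sup, since a bucket count can only over-estimate, never under-estimate, the count of any pair hashed to it.

```python
from itertools import combinations

def dhp_first_pass(db, num_buckets, min_sup):
    """First scan: count single items and hash all 2-itemsets into buckets."""
    item_count, bucket_count = {}, [0] * num_buckets
    for t in db:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for pair in combinations(sorted(t), 2):
            bucket_count[hash(pair) % num_buckets] += 1
    frequent_items = {i for i, n in item_count.items() if n >= min_sup}
    survives = lambda pair: bucket_count[hash(tuple(sorted(pair))) % num_buckets] >= min_sup
    return frequent_items, survives

tdb = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
items, survives = dhp_first_pass(tdb, num_buckets=8, min_sup=2)
C2 = [p for p in combinations(sorted(items), 2) if survives(p)]
print(C2)   # candidate pairs passing both the frequent-item and the bucket filter
```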

Slide 21: Eclat/MaxEclat and VIPER: Exploring the Vertical Data Format
- Use the tid-list: the list of transaction ids containing an itemset
- Compression of tid-lists
- Example:
  - Itemset A: {t1, t2, t3}, sup(A) = 3
  - Itemset B: {t2, t3, t4}, sup(B) = 3
  - Itemset AB: {t2, t3}, sup(AB) = 2
- Major operation: intersection of tid-lists
- M. Zaki et al. New algorithms for fast discovery of association rules. In KDD'97.
- P. Shenoy et al. Turbo-charging vertical mining of large databases. In SIGMOD'00.
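A minimal sketch of the vertical format on this slide (my own illustration): each itemset is represented by the set of transaction ids that contain it, and the support of a combined itemset is the size of the intersection of the tid-lists.

```python
tidlist = {
    "A": {1, 2, 3},    # sup(A) = 3
    "B": {2, 3, 4},    # sup(B) = 3
}
tid_AB = tidlist["A"] & tidlist["B"]   # {2, 3}
print("sup(AB) =", len(tid_AB))        # 2
```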

Slide 22: Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100: number of scans: 100; number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?

Slide 23: Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items
  - "abc" is a frequent pattern
  - Get all transactions having "abc": DB|abc
  - "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern

Slide 24: Construct FP-tree from a Transaction Database (min_support = 3)
- TID 100: items bought f, a, c, d, g, i, m, p — (ordered) frequent items f, c, a, m, p
- TID 200: items bought a, b, c, f, l, m, o — frequent items f, c, a, b, m
- TID 300: items bought b, f, h, j, o, w — frequent items f, b
- TID 400: items bought b, c, k, s, p — frequent items c, b, p
- TID 500: items bought a, f, c, e, l, p, m, n — frequent items f, c, a, m, p
- Steps:
  1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
  2. Sort the frequent items in frequency-descending order into the f-list: F-list = f-c-a-b-m-p
  3. Scan the DB again and construct the FP-tree
- Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
- (Figure: the resulting FP-tree — main path f:4 – c:3 – a:3 – m:2 – p:2, with a b:1 – m:1 branch under a:3, a b:1 branch under f:4, and c:1 – b:1 – p:1 directly under the root; header-table entries link to all nodes carrying the same item.)
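The construction on this slide can be sketched in a few dozen lines of Python. This is my own illustration under the slide's assumptions (two scans, items inserted in f-list order, node-links chained from a header table); the class and function names are hypothetical.

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: an item, a count, child links, and a node-link that the
    header table uses to chain all nodes carrying the same item."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}
        self.node_link = None

def build_fptree(db, min_sup):
    # Pass 1: count items and build the f-list (frequency-descending order)
    freq = defaultdict(int)
    for t in db:
        for i in t:
            freq[i] += 1
    flist = [i for i, n in sorted(freq.items(), key=lambda kv: -kv[1]) if n >= min_sup]
    order = {i: r for r, i in enumerate(flist)}
    root, header = FPNode(None), {i: None for i in flist}
    # Pass 2: insert each transaction's frequent items in f-list order
    for t in db:
        node = root
        for i in sorted((x for x in t if x in order), key=order.get):
            child = node.children.get(i)
            if child is None:
                child = FPNode(i, parent=node)
                node.children[i] = child
                child.node_link, header[i] = header[i], child   # prepend to node-link chain
            child.count += 1
            node = child
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fptree(db, min_sup=3)
print(flist)   # ['f', 'c', 'a', 'b', 'm', 'p'], up to ties among items of equal count
```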

Slide 25: Benefits of the FP-tree Structure
- Completeness: preserves complete information for frequent pattern mining; never breaks a long pattern of any transaction
- Compactness:
  - Reduces irrelevant information — infrequent items are gone
  - Items are in frequency-descending order: the more frequently an item occurs, the more likely its node is shared
  - Never larger than the original database (not counting node-links and the count field)
  - For the Connect-4 DB, the compression ratio can be over 100

Slide 26: Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list; F-list = f-c-a-b-m-p
  - Patterns containing p
  - Patterns having m but no p
  - ...
  - Patterns having c but no a, b, m, or p
  - Pattern f
- Completeness and non-redundancy

Slide 27: Find Patterns Having p From p's Conditional Database
- Starting at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
- Conditional pattern bases:
  - item c: f:3
  - item a: fc:3
  - item b: fca:1, f:1, c:1
  - item m: fca:2, fcab:1
  - item p: fcam:2, cb:1

Slide 28: From Conditional Pattern-bases to Conditional FP-trees
- For each pattern base: accumulate the count for each item in the base; construct the FP-tree for the frequent items of the pattern base
- m-conditional pattern base: fca:2, fcab:1 → m-conditional FP-tree: the single path f:3 – c:3 – a:3
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Slide 29: Recursion: Mining Each Conditional FP-tree
- Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: f:3 – c:3
- Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: f:3
- Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: f:3

Slide 30: A Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts: reduction of the single prefix path into one node, and concatenation of the mining results of the two parts
- (Figure: a tree whose prefix path a1:n1 – a2:n2 – a3:n3 leads into branches b1:m1, C1:k1, C2:k2, C3:k3 is split into the single-node prefix r1 and the branching part; r1 is then concatenated with the results mined from the branching part.)

Slide 31: Mining Frequent Patterns With FP-trees
- Idea: frequent pattern growth — recursively grow frequent patterns by pattern and database partition
- Method:
  - For each frequent item, construct its conditional pattern base and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty or contains only one path — a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
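The recursion described on this slide can be sketched compactly. The sketch below is my own simplification: each conditional database is kept as a list of (prefix, count) pairs instead of being compressed into an actual conditional FP-tree, which is what FP-growth proper would build; the function name pattern_growth is hypothetical. On the five ordered transactions of slide 24 it produces, among others, m, fm, cm, am, fcm, fam, cam, and fcam with support 3, matching slide 28.

```python
from collections import defaultdict

def pattern_growth(cond_db, suffix, min_sup, results):
    """cond_db: list of (items_tuple, count) pairs, items in f-list order.
    Recursively grows `suffix` by each locally frequent item, as in FP-growth,
    but over plain prefix lists rather than real conditional FP-trees."""
    local = defaultdict(int)
    for items, cnt in cond_db:
        for i in set(items):
            local[i] += cnt
    for item, cnt in local.items():
        if cnt < min_sup:
            continue
        pattern = suffix | {item}
        results[frozenset(pattern)] = cnt
        # conditional pattern base of `item`: the prefixes that precede it
        cond = []
        for items, c in cond_db:
            if item in items:
                prefix = items[:items.index(item)]
                if prefix:
                    cond.append((prefix, c))
        pattern_growth(cond, pattern, min_sup, results)

# transactions already reduced to their frequent items in f-list order (slide 24)
db = [("f","c","a","m","p"), ("f","c","a","b","m"), ("f","b"), ("c","b","p"), ("f","c","a","m","p")]
results = {}
pattern_growth([(t, 1) for t in db], set(), min_sup=3, results=results)
for pat, sup in sorted(results.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(pat)), sup)
```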

Slide 32: Scaling FP-growth by DB Projection
- What if the FP-tree cannot fit in memory? DB projection
- First partition the database into a set of projected DBs
- Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection techniques; parallel projection is space costly

Slide 33: Partition-based Projection
- Parallel projection needs a lot of disk space; partition projection saves it
- (Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is split into p-, m-, b-, a-, c- and f-projected DBs; e.g. m-proj DB = {fcab, fca, fca}, am-proj DB = {fc, fc, fc}, cm-proj DB = {f, f, f}.)

Slide 34: FP-Growth vs. Apriori: Scalability With the Support Threshold
- (Figure: run time vs. support threshold on data set T25I20D10K.)

Slide 35: FP-Growth vs. Tree-Projection: Scalability With the Support Threshold
- (Figure: run time vs. support threshold on data set T25I20D100K.)

Slide 36: Why Is FP-Growth the Winner?
- Divide-and-conquer: decompose both the mining task and the DB according to the frequent patterns obtained so far; leads to a focused search of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scan of the entire database
  - Basic operations: counting local frequent items and building sub-FP-trees — no pattern search and matching

Slide 37: Implications of the Methodology
- Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
- Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

Slide 38: Max-patterns
- A frequent pattern {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 frequent sub-patterns!
- Max-pattern: a frequent pattern without any proper frequent super-pattern
- Example (min_sup = 2):
  Tid 10: A, B, C, D, E
  Tid 20: B, C, D, E
  Tid 30: A, C, D, F
  BCDE and ACD are max-patterns; BCD is not a max-pattern
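A small helper (my own illustration, not from the slides) that applies the definition above: given all frequent itemsets with their supports, keep only the max-patterns, i.e. those with no proper frequent superset.

```python
def max_patterns(frequent):
    """frequent: dict mapping frozenset -> support count."""
    return {x: n for x, n in frequent.items()
            if not any(x < y for y in frequent)}   # x < y: proper subset

freq = {frozenset("BCDE"): 2, frozenset("BCD"): 2, frozenset("ACD"): 2, frozenset("CD"): 3}
print(max_patterns(freq))   # BCDE and ACD survive; BCD and CD have frequent supersets
```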

Slide 39: MaxMiner: Mining Max-patterns
- 1st scan: find the frequent items A, B, C, D, E
- 2nd scan: find the supports of the potential max-patterns AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE; DE
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in later scans
- (Same transaction database as the previous slide, min_sup = 2.)
- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98.

Slide 40: Frequent Closed Patterns
- For a frequent itemset X, if there exists no item y such that every transaction containing X also contains y, then X is a frequent closed pattern
- Example (min_sup = 2):
  TID 10: a, c, d, e, f
  TID 20: a, b, e
  TID 30: c, e, f
  TID 40: a, c, d, f
  TID 50: c, e, f
  conf(ac ⇒ d) = 100%, so record acd only; "acd" is a frequent closed pattern
- Closed patterns are a concise representation of frequent patterns: they reduce the number of patterns and rules
- N. Pasquier et al. In ICDT'99.

Slide 41: Mining Frequent Closed Patterns: CLOSET
- Flist: list of all frequent items in support-ascending order; here Flist = d-a-f-e-c
- Divide the search space: patterns having d; patterns having a but not d; etc.
- Find frequent closed patterns recursively: every transaction having d also has c, f, a ⇒ cfad is a frequent closed pattern
- (Same transaction database as above, min_sup = 2.)
- J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.
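Following the definition on slide 40, a frequent itemset X is closed exactly when no proper superset of X has the same support. A small illustration of my own (it checks closedness only among the itemsets supplied):

```python
def closed_patterns(frequent):
    """frequent: dict mapping frozenset -> support count."""
    return {x: n for x, n in frequent.items()
            if not any(x < y and m == n for y, m in frequent.items())}

freq = {frozenset("acd"): 2, frozenset("ac"): 2, frozenset("a"): 3, frozenset("c"): 4}
print(closed_patterns(freq))   # 'ac' is absorbed by 'acd' (same support); 'a', 'c', 'acd' remain
```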

Slide 42: Mining Frequent Closed Patterns: CHARM
- Use the vertical data format: t(AB) = {T1, T12, ...}
- Derive closed patterns based on vertical intersections:
  - t(X) = t(Y): X and Y always occur together
  - t(X) ⊂ t(Y): every transaction having X also has Y
- Use diffsets to accelerate mining: keep track only of the differences of tids
  - t(X) = {T1, T2, T3}, t(Xy) = {T1, T3} ⇒ Diffset(Xy, X) = {T2}
- M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule Mining. CS-TR99-10, Rensselaer Polytechnic Institute.
- M. Zaki. Fast Vertical Mining Using Diffsets. TR01-1, Department of Computer Science, Rensselaer Polytechnic Institute.
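A tiny sketch of the diffset idea on this slide (my own illustration): instead of storing t(Xy) in full, a CHARM-style vertical miner can store only the tids that X has but Xy lacks, and maintain supports incrementally.

```python
t_X  = {"T1", "T2", "T3"}        # t(X)
t_Xy = {"T1", "T3"}              # t(Xy)
diffset_Xy_X = t_X - t_Xy        # {"T2"}
# support maintained incrementally: sup(Xy) = sup(X) - |Diffset(Xy, X)|
print(diffset_Xy_X, len(t_X) - len(diffset_Xy_X))   # {'T2'} 2
```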

Slide 43: Visualization of Association Rules: Pane Graph
- (Figure only.)

Slide 44: Visualization of Association Rules: Rule Graph
- (Figure only.)

Slide 45: Chapter 6 outline (repeated) — next topic: mining various kinds of association/correlation rules.

Slide 46: Mining Various Kinds of Rules or Regularities
- Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns
