LightLDA-v6--P8自然語言處理課件_第1頁
LightLDA-v6--P8自然語言處理課件_第2頁
LightLDA-v6--P8自然語言處理課件_第3頁
LightLDA-v6--P8自然語言處理課件_第4頁
LightLDA-v6--P8自然語言處理課件_第5頁
已閱讀5頁,還剩46頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)

文檔簡介

1、 LightLDA: Big Topic Models on Modest Computer ClustersJinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-Yan Liu and Wei-Ying MaMicrosoft Research AsiaCarnegie Mellon UniversityTopic Models7/20/20222Statistical method to understand the latent topics in textsTopics:

2、 Distributions over wordsBag-of-Words Bag-of-TopicsCompact features for matchingfigure courtesy of David BleiLatent Dirichlet Allocation (LDA)37/20/2022Blei, et al. 2003Something more friendly200B10M1M400Industrial-Scale LDA ModelsRAM(TB)#parametersPeacockAliasLDA#cores#tokensPeacockAliasLDA7/20/202

3、241B Documents200 Tokens10M Words1M TopicsBig DataBig ModelSystem#Tokens#Words#Topics#CoresNote AliasLDA (CMU/Google)50B5M2,00060,000KDD/OSDI 2014Peacock (Tencent)5B0.2M100,0003,000ACM TIST 2014Our Work: LightLDA- Target: 1M topics; 1B docs; 10M words in vocabularyHighly scalable systemDevelop a par

4、allel machine learning platform to scale out computations using a modest cluster of machines7/20/20225Standard LDA Training and Inference67/20/2022Word-topic tableDocument-topic tablePer-token sampling complexity proportional to the number of topics: (), thus hard to scale up to large number of topi

5、cs.Reduce Sampling Complexity77/20/2022Reusability of Alias TableTermsAlias table constructionReused forAlias table for each word, reusable for all documentsAmortized O(1) complexity?YesYesNo87/20/2022SparseLDA and AliasLDA: Partial Amortization7/20/20229Amortizable O(1)Amortizable O(1)SparseLDA Yao

6、, et al. KDD 2009AliasLDA Li, et al. KDD 2014LightLDA: Full Amortization7/20/202210Amortizable O(1)Amortizable O(1)Multiplicative factorization to enable full amortization of time complexityLeverage sparsity to further reduce space complexityMultiplicative factorizationReducing Space ComplexityExper

7、imental Results (Single-core)11Time per iterationLikelihood vs. number of iterations3/16/2015LightLDA - Tie-Yan Liu CMUAlmost no loss on the per-iteration convergence qualityMuch less time for each iterationExperimental Results (Single-core)127/20/2022NYTimes Dataset300K documents99M tokensPubMed Da

8、taset8.2M documents738M tokensLightLDA achieves better log-likelihood than SparseLDA and AliasLDA in much shorter time!With a single core only, LightLDA uses 20 hours to train 10K topics from 1B tokens (PubMed). With a commodity machine of 20 cores, LightLDA can finish training in 1.5 hours. This si

9、ngle-machine capability is equivalent to (if not beyond) a medium-size cluster of SparseLDA or AliasLDA.Highly scalable systemDevelop a parallel machine learning platform to scale out computations using a modest cluster of machines7/20/202213Our Work: LightLDA- Target: 1M topics; 1B docs; 10M words

10、in vocabularySystem Architecture LightLDALightLDALightLDAModel SchedulingModel SchedulingModel SchedulingASP/SSPHybrid data structure and dual bufferHybrid data structure and dual bufferHybrid data structure and dual bufferDistributed Shared Model14Petuum parameter server based data parallelism to h

11、andle big dataA new model scheduling scheme to handle big modelsHybrid data structure to trade off memory usage and access speedIntensive system optimization (double buffer, pipelining, memory management, etc)7/20/2022Model Scheduling7/20/202215Model Parallelism in PeacockModel SchedulingModel is cu

12、t into slices, sending model slices to data instead of sending data to model.Model slices are pulled from server and trained in a round robin fashion, analogous to block coordinate descent (BCD). Data samples and intermediate results are locally stored.Lower communication cost since data model, and

13、model becomes sparse during trainingModel is cut into pieces and put onto different machines; sending data to model pieces. To finish training, intermediate results for every data sample need to be transferred between machines.Communication cost includes both the word-topic table (small) and doc-top

14、ic table (large) Hybrid Data StructureModel is simply too big (10 trillion parameters)Naive dense storage (array) requires prohibitive amounts of memorySparse storage (hash map) hurts sampling efficiency.Leverage power law distribution of wordsUse dense array to store frequent words to achieve high

15、throughput; Use sparse hash tables to store long-tail words, to reduce memory usage (100 times).167/20/2022Hybrid Data StructureModel is simply too big (10 trillion parameters)Naive dense storage (array) requires prohibitive amounts of memorySparse storage (hash map) hurts sampling efficiency.Levera

16、ge power law distribution of wordsUse dense array to store frequent words to achieve high throughput; Use sparse hash tables to store long-tail words, to reduce memory usage (100 times).177/20/2022RowTagSizePt.0dense1sparse202sparse10An array as the indexExperimental SettingsComputer cluster: 24 mac

17、hines (20 cores and 256GB RAM for each machine)Network connection: 1Gbps between machinesParameter server deployment: physically, each machine hosts both parameter server and workers (16 cores for workers and 4 cores for parameter server)Bing Web data: 1.2 billion documents, 200 billion tokens187/20

18、/2022System Throughput w.r.t multi-core and multi-nodeExperimental Results197/20/202220M words, 1M topics, 20T parametersNearly linear speed-up is achieved; after the first 10 iterations, the communication cost is almost masked by computations (further improvable if a high-speed network is used).Con

19、clusionsLightLDA is able to learn super large topic models on modest clustersWith 24 commodity machines, we trained one million topics (20 trillion parameters) from Bing web data (200 billion tokens) in just two days. Machine learning insights vs. brute-force parallelization:As compared to other sys

20、tems, we use much fewer cores to learn much more topics from much larger data! 207/20/2022RAM(TB)#parametersPeacockAliasLDA(CMU/Google)LightLDAModel size VS. RAM capacity#CPU cores#tokensPeacock(Tencent)AliasLDA(CMU/Google)LightLDAData size VS. CPU coresFuture WorkBeyond topic models!Multiplicative

21、factorization based fast sampling may have its general implication to many graphical model problems.Model scheduling scheme and other system innovations may be used to handle other large machine learning task as well. We are in the process of open-sourcing LightLDAStay tuned if you want to use it to

22、 train your own big topic models!7/20/202221Thanks!237/20/2022247/20/2022Alias Method Constructing Alias Table25Construction of Alias table: non-uniform distribution uniform distributionWith O(K) complexity in time and spaceReuse for future samplings:uniformly sample from alias tableO(1) timeWalker,

23、 19777/20/2022How to amortizing the O(K) initialization time through reusing? Constructing Alias Table7/20/202226Separate bins into two sets:L = C,G, for bins lower than the average value .H = A,T, for bins higher than the average value .Constructing Alias Table7/20/2022277/20/202228Looking Up Alias

24、 TableSystem Optimization7/20/202229Dual Model and Cross-thread Aggregation30Immutable model in RAM for compuationSampler in thread 4UpdatesData blockData partition for thread 1Data partition for thread 2Data partition for thread 3Data partition for thread 4Sampler in thread 3Sampler in thread 2Samp

25、ler in thread 1UpdatesUpdatesUpdatesLocal aggregator threadParameter ServerMutable model in RAM for sync-upSwitchRemove multi-thread contentionReduce communication costOverlap computations and communicationsNetworkNetworkPre-fetch7/20/2022Pipelining31Size of data block: 32GB with 4B tokens 2K second

26、s for disk reading (for 150MB I/O bandwidth).Sampling throughput: 2M tokens/worker/sec 2K seconds to sample 4B tokens.Size of model slice: 20GB, so as to be maskable by 2K seconds on a 1Gbps network.7/20/2022Memory Layout7/20/202232Pre-Allocated Memory LayoutIncrease efficiency by avoiding dynamic m

27、emory allocation33Serve Table(30GB)Data BlockOf Current Iteration(32GB)Data BlockOf Next Iteration(32GB)A Slice of Model for Current Iteration(36GB)A Slice of Model for NextIteration(36GB)Alias Table for Current Iteration(36GB)Delta ArraysFor AppendingUpdates from Sampler(10GB) Delta Table for Aggre

28、gating Local Updates of a Particular Model Slice(4GB)216GB7/20/2022Each Memory Quota is Delicately Allocated Store 1M*1M model with hybrid data structure requires 700GB RAM. Therefore, each machine needs 700/24= 29GB space to store server table.Each data block contains 4B tokens, each token requires

29、 8Bytes, therefore the size of each data block is 32GB.The number of 4B tokens is determined by sampling throughput, 2M tokens/machine/sec, so 4B tokens need 2000 seconds to complete the sampling.To sweep a data block, we needs 20 slices of model in model slicing. Each slice of model can sample 2000

30、/20 = 100 seconds.347/20/2022Each Memory Quota is Delicately Allocated In the initial stage, the whole model is 700GB, the maximum size of a model slice is 36GB, so we need split the whole model into 20 slices.When close to convergence, the model will be very sparse, it is much smaller than 700GB.We

31、 need to construct an alias table for each active slice of model. While the alias table is already sparsified, in the initial stage, it roughly has the same size with a slice of model.The maximum number of updates can not exceed 2 times of the number of sampled tokens in a particular slice. Therefor

32、e, the delta table and delta array are rather small.357/20/2022Details about Hybrid Structure7/20/202236A Hybrid Data StructureWith a hybrid data structure, we reduce memory usage by over 100 times, while maintaining an efficient random access.RowTagSizePt.0dense1sparse202sparse10An array as the ind

33、exA pre-allocated memory pool to reduce the overhead caused by dynamic memory allocation.7/20/202237Hybrid Data Structure387/20/2022Average Probes for a Given Load Factor39Load factor0.10.500.750.800.900.99SuccessfulChaining1.051.251.381.401.451.50Open AddressingLinear probing1.05.550.5Qua

34、dratic probing1.051.442.012.212.855.11Double hashing1.051.391.852.012.564.6UnsuccessfulChaining0.100.500.750.800.900.99Open AddressingLinear probing13505000Quadratic probing45.8111.4103.6Double hashing1.112.04.05.010100http:/www.augustana.ca/mohrj/courses/1999.fall/csc210/lectur

35、e_notes/hashing.html7/20/2022Out-of-Core and Fault Tolerance7/20/202240Out-of-CoreBinary serializationEfficient block-wise IO, high disk-r/w throughputSave the marshaling timeMake the token and its topic adjacent, (CPU cache friendly)Easy to estimate data size, 4B tokens requires 4B * 8Byte = 32GB41

36、w11t11w12t12w21t21w22t22Doc 1Doc 27/20/2022Zero-Cost Fault Tolerance by Warm-Start42Data block on diskData block in RAM (before sampling)Data block in RAM(after sampling)Data block on diskLoad into RAMGibbs SamplerStore into DiskAtomic File OverwriteWe dont need any checkpoint.We dont dump the model

37、, instead we dump (token,topic) data which is accomplished automatically by out-of-core module.Any fails, the system supports warm start.Atomic file overwrite can avoid file corruption in case system crashes while it is writing the file.7/20/2022Warm StartCold startIn cold-start, the topic of a toke

38、n is initialized by uniform random number generator. The system counts the (word, topic) occurrence and get the initial model.Warm startIn warm-start, the topic of a token is persistent from the previous training.The system counts the (word, topic) occurrence and the initial model. This is equivalent to failover by checkpoint.437/20/2022The Disadvantages of Dumping ModelIf we dump model in the checkpoint, we need an additional burn-in stage before continuing the training procedure based o

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
  • 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論