LightLDA: Big Topic Models on Modest Computer Clusters
Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-Yan Liu and Wei-Ying Ma
Microsoft Research Asia / Carnegie Mellon University

Topic Models
- Statistical method to understand the latent topics in texts.
- Topics are distributions over words.
- Bag-of-Words → Bag-of-Topics: compact features for matching.
- (figure courtesy of David Blei)

Latent Dirichlet Allocation (LDA)
- Blei, et al. 2003.
- Something more friendly.

Industrial-Scale LDA Models
- [Charts: model size (#parameters vs. RAM in TB) and data size (#tokens vs. #cores) for Peacock and AliasLDA]
- Big data: 1B documents, 200B tokens. Big model: 10M words, 1M topics.

System | #Tokens | #Words | #Topics | #Cores | Note
AliasLDA (CMU/Google) | 50B | 5M | 2,000 | 60,000 | KDD/OSDI 2014
Peacock (Tencent) | 5B | 0.2M | 100,000 | 3,000 | ACM TIST 2014

Our Work: LightLDA
- Target: 1M topics, 1B docs, 10M words in vocabulary.
- Highly scalable system: develop a parallel machine learning platform to scale out computations using a modest cluster of machines.
Standard LDA Training and Inference
- Collapsed Gibbs sampling maintains a word-topic table and a document-topic table.
- Per-token sampling complexity is proportional to the number of topics, O(K), and is thus hard to scale up to a large number of topics (a sketch of this standard step follows).
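The O(K) cost comes from materializing the full conditional over all K topics before drawing from it. A minimal sketch of this standard collapsed Gibbs step in plain Python, with dense count tables and illustrative names rather than the actual LightLDA code:

```python
import random

def resample_token(d, w, old_k, doc_topic, word_topic, topic_count,
                   alpha, beta, num_topics, vocab_size):
    """One collapsed Gibbs step for a single token: O(num_topics) work."""
    # Remove the token's current assignment from the count tables.
    doc_topic[d][old_k] -= 1
    word_topic[w][old_k] -= 1
    topic_count[old_k] -= 1

    # Materialize the full conditional p(k) ∝ (n_kd + α)(n_kw + β)/(n_k + Vβ).
    weights = [(doc_topic[d][k] + alpha)
               * (word_topic[w][k] + beta)
               / (topic_count[k] + vocab_size * beta)
               for k in range(num_topics)]          # the O(K) loop

    # Draw the new topic in proportion to the weights.
    new_k = random.choices(range(num_topics), weights=weights)[0]

    # Add the token back under its new topic.
    doc_topic[d][new_k] += 1
    word_topic[w][new_k] += 1
    topic_count[new_k] += 1
    return new_k
```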
Reduce Sampling Complexity
- Reusability of the alias table: an alias table is built per word and can be reused across all documents.
- [Table: for each term of the sampling formula, whether its alias table can be constructed and reused, and whether amortized O(1) complexity is achieved (Yes / Yes / No)]

SparseLDA and AliasLDA: Partial Amortization
- SparseLDA: Yao, et al., KDD 2009.
- AliasLDA: Li, et al., KDD 2014.
- In both, only part of the per-token sampling formula is amortizable to O(1).

LightLDA: Full Amortization
- Multiplicative factorization of the sampling formula enables full amortization of the time complexity: every factor becomes amortizable to O(1) (sketched below).
- Leverage sparsity to further reduce space complexity.
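A rough sketch of the factorized sampler in plain Python. It uses a simplified variant in which the token being resampled is removed from all counts before proposing; the paper alternates this doc-proposal with a word-proposal drawn from a (possibly stale) per-word alias table, and the names here are illustrative rather than the actual implementation:

```python
import random

def full_conditional(k, d, w, doc_topic, word_topic, topic_count,
                     alpha, beta, vocab_size):
    """Unnormalized target p(k) ∝ (n_kd + α)(n_kw + β)/(n_k + Vβ),
    with the token being resampled already removed from the counts."""
    return ((doc_topic[d].get(k, 0) + alpha)
            * (word_topic[w].get(k, 0) + beta)
            / (topic_count[k] + vocab_size * beta))

def doc_proposal(other_topics_in_doc, num_topics, alpha):
    """Draw k with probability ∝ (n_kd + α) in O(1), without an alias table:
    with prob n_d/(n_d + Kα) copy the topic of a random other token in the
    document, otherwise draw a topic uniformly (the symmetric-α part)."""
    n_d = len(other_topics_in_doc)
    u = random.uniform(0.0, n_d + num_topics * alpha)
    if u < n_d:
        return other_topics_in_doc[int(u)]
    return random.randrange(num_topics)

def mh_step_doc_proposal(s, d, w, other_topics_in_doc,
                         doc_topic, word_topic, topic_count,
                         alpha, beta, num_topics, vocab_size):
    """One Metropolis-Hastings step using the doc-proposal q_d(k) ∝ (n_kd + α).
    The word-proposal step is analogous but samples from a per-word alias
    table built from (n_kw + β)/(n_k + Vβ); cycling the two proposals
    targets the full conditional while each draw stays O(1)."""
    t = doc_proposal(other_topics_in_doc, num_topics, alpha)
    if t == s:
        return s
    p_t = full_conditional(t, d, w, doc_topic, word_topic, topic_count,
                           alpha, beta, vocab_size)
    p_s = full_conditional(s, d, w, doc_topic, word_topic, topic_count,
                           alpha, beta, vocab_size)
    q_t = doc_topic[d].get(t, 0) + alpha   # unnormalized q_d(t)
    q_s = doc_topic[d].get(s, 0) + alpha   # unnormalized q_d(s)
    accept = (p_t * q_s) / (p_s * q_t)
    return t if random.random() < accept else s
```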
Experimental Results (Single-core)
- [Plots: time per iteration; likelihood vs. number of iterations]
- Almost no loss in per-iteration convergence quality, and much less time per iteration.

Experimental Results (Single-core)
- NYTimes dataset: 300K documents, 99M tokens.
- PubMed dataset: 8.2M documents, 738M tokens.
- LightLDA achieves better log-likelihood than SparseLDA and AliasLDA in much shorter time.
- With a single core, LightLDA takes 20 hours to train 10K topics from 1B tokens (PubMed); with a commodity machine of 20 cores, LightLDA finishes training in 1.5 hours. This single-machine capability is equivalent to (if not beyond) a medium-size cluster running SparseLDA or AliasLDA.

Our Work: LightLDA
- Target: 1M topics, 1B docs, 10M words in vocabulary.
- Highly scalable system: develop a parallel machine learning platform to scale out computations using a modest cluster of machines.
System Architecture
- [Diagram: multiple LightLDA workers, each with model scheduling and a hybrid data structure with dual buffers, synchronized with a distributed shared model through ASP/SSP consistency]
- Petuum parameter-server-based data parallelism to handle big data.
- A new model scheduling scheme to handle big models.
- A hybrid data structure to trade off memory usage against access speed.
- Intensive system optimization (double buffering, pipelining, memory management, etc.).
Model Scheduling
- Model scheduling (LightLDA):
  - The model is cut into slices, and model slices are sent to the data instead of sending data to the model.
  - Model slices are pulled from the server and trained in a round-robin fashion, analogous to block coordinate descent (BCD); a worker-side sketch follows this list.
  - Data samples and intermediate results are stored locally.
  - Lower communication cost, since data ≫ model and the model becomes sparse during training.
- Model parallelism in Peacock:
  - The model is cut into pieces placed on different machines, and data is sent to the model pieces.
  - To finish training, intermediate results for every data sample need to be transferred between machines.
  - Communication cost includes both the word-topic table (small) and the doc-topic table (large).
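A hedged sketch of what the schedule looks like from a single worker's point of view; `ps.pull_slice`, `ps.push_updates`, and `sample_token` are hypothetical placeholders rather than the actual Petuum API:

```python
def sweep_block_with_slices(data_block, vocab_slices, ps, sample_token):
    """One pass over a local data block, cycling over model slices round-robin,
    so only one slice of the word-topic table is resident in RAM at a time.
    data_block: list of documents, each a list of [word_id, topic] pairs.
    vocab_slices: list of word-id sets that partition the vocabulary."""
    for slice_id, vocab in enumerate(vocab_slices):
        model_slice = ps.pull_slice(slice_id)      # hypothetical pull from the server
        deltas = {}                                # (word, topic) -> count delta
        for doc in data_block:
            for token in doc:
                word, old_topic = token
                if word not in vocab:
                    continue                       # handled when its own slice arrives
                new_topic = sample_token(doc, token, model_slice)
                if new_topic != old_topic:
                    token[1] = new_topic           # intermediate state stays local
                    deltas[(word, old_topic)] = deltas.get((word, old_topic), 0) - 1
                    deltas[(word, new_topic)] = deltas.get((word, new_topic), 0) + 1
        ps.push_updates(slice_id, deltas)          # send only this slice's deltas
```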
Hybrid Data Structure
- The model is simply too big (10 trillion parameters).
- Naive dense storage (an array per word) requires prohibitive amounts of memory; sparse storage (a hash map per word) hurts sampling efficiency.
- Leverage the power-law distribution of words: use dense arrays to store frequent words for high sampling throughput, and sparse hash tables to store long-tail words to reduce memory usage (by roughly 100 times). A sketch follows.
- An array serves as the index; each entry records the row's tag and size and points to its storage, e.g. Row 0: dense; Row 1: sparse, size 20; Row 2: sparse, size 10.
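A rough sketch of the hybrid rows in plain Python; the real implementation is a pre-allocated C++ structure, and the frequency cutoff below is an assumption for illustration:

```python
class HybridWordTopicTable:
    """Word-topic rows: dense arrays for frequent words, hash maps (dicts)
    for long-tail words, behind a single index keyed by word id."""

    def __init__(self, word_frequencies, num_topics, dense_cutoff=1000):
        # dense_cutoff is an assumed threshold separating hot and long-tail words.
        self.rows = [[0] * num_topics if freq >= dense_cutoff else {}
                     for freq in word_frequencies]     # indexed by word id

    def get(self, word, topic):
        row = self.rows[word]
        return row.get(topic, 0) if isinstance(row, dict) else row[topic]

    def increment(self, word, topic, delta=1):
        row = self.rows[word]
        if isinstance(row, dict):                      # long-tail word: sparse row
            row[topic] = row.get(topic, 0) + delta
            if row[topic] == 0:
                del row[topic]                         # keep the row sparse
        else:                                          # frequent word: dense row
            row[topic] += delta
```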
Experimental Settings
- Computer cluster: 24 machines, each with 20 cores and 256GB RAM.
- Network connection: 1Gbps between machines.
- Parameter server deployment: physically, each machine hosts both the parameter server and workers (16 cores for workers, 4 cores for the parameter server).
- Bing web data: 1.2 billion documents, 200 billion tokens.

Experimental Results
- System throughput w.r.t. multi-core and multi-node scaling, on 20M words, 1M topics, 20T parameters.
- Nearly linear speed-up is achieved; after the first 10 iterations, the communication cost is almost entirely masked by computation (further improvable if a higher-speed network is used).

Conclusions
- LightLDA is able to learn super-large topic models on modest clusters: with 24 commodity machines, we trained one million topics (20 trillion parameters) from Bing web data (200 billion tokens) in just two days.
- Machine learning insights vs. brute-force parallelization: compared to other systems, we use far fewer cores to learn many more topics from much larger data.
- [Charts: model size vs. RAM capacity and data size vs. CPU cores, comparing Peacock (Tencent), AliasLDA (CMU/Google), and LightLDA]
Future Work
- Beyond topic models!
- Multiplicative-factorization-based fast sampling may have general implications for many graphical model problems.
- The model scheduling scheme and other system innovations may be used to handle other large machine learning tasks as well.
- We are in the process of open-sourcing LightLDA. Stay tuned if you want to use it to train your own big topic models!

Thanks!

Alias Method

Constructing the Alias Table
- Construction of the alias table turns a non-uniform distribution into a uniform one over bins, with O(K) complexity in time and space (Walker, 1977).
- Reuse for future samplings: uniformly sample from the alias table in O(1) time per draw.
- How can the O(K) initialization time be amortized through reuse?
- Construction separates the bins into two sets: L = {C, G}, the bins lower than the average value, and H = {A, T}, the bins higher than the average value.

Looking Up the Alias Table
- Uniformly pick a bin, then sample within it in O(1) time; a compact reference implementation follows.
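A compact reference implementation of Walker's alias method in plain Python; `probs` is a normalized distribution over K outcomes:

```python
import random

def build_alias_table(probs):
    """O(K) construction: redistribute mass so that every bin holds at most
    two outcomes, its own share kept in `prob` and the surplus in `alias`."""
    K = len(probs)
    prob = [p * K for p in probs]        # rescale so the average bin height is 1
    alias = [0] * K
    small = [k for k, p in enumerate(prob) if p < 1.0]    # below-average bins (L)
    large = [k for k, p in enumerate(prob) if p >= 1.0]   # above-average bins (H)
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                      # top up bin s with mass taken from l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def sample_alias(prob, alias):
    """O(1) lookup: pick a bin uniformly, then its own outcome or its alias."""
    k = random.randrange(len(prob))
    return k if random.random() < prob[k] else alias[k]
```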
System Optimization

Dual Model and Cross-thread Aggregation
- [Diagram: a data block is partitioned across sampler threads 1 to 4; each thread reads an immutable model copy in RAM for computation and sends its updates to a local aggregator thread, which maintains a mutable model copy in RAM for sync-up with the parameter server over the network; the two copies are then switched, and the next model is pre-fetched.]
- Removes multi-thread contention.
- Reduces communication cost.
- Overlaps computation and communication.
Pipelining
- Size of a data block: 32GB, holding 4B tokens, which takes about 2K seconds of disk reading (for 150MB I/O bandwidth).
- Sampling throughput: 2M tokens/worker/sec, so about 2K seconds to sample 4B tokens.
- Size of a model slice: 20GB, chosen so that its transfer can be masked by the 2K seconds of sampling on a 1Gbps network (the overlap structure is sketched below).
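A minimal sketch of the double-buffered pipeline with a small background pool; `load_block`, `pull_slice`, `sample_block`, and `push_updates` are hypothetical callables standing in for the disk, network, and sampling stages:

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_sweep(block_ids, slice_ids,
                    load_block, pull_slice, sample_block, push_updates):
    """While one data block / model slice is being sampled, the next block is
    already being read from disk and the next slice pulled over the network,
    so I/O and communication hide behind computation."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_block = pool.submit(load_block, block_ids[0])       # prefetch first block
        for b, _ in enumerate(block_ids):
            block = next_block.result()
            if b + 1 < len(block_ids):
                next_block = pool.submit(load_block, block_ids[b + 1])
            next_slice = pool.submit(pull_slice, slice_ids[0])   # prefetch first slice
            for s, slice_id in enumerate(slice_ids):
                model_slice = next_slice.result()
                if s + 1 < len(slice_ids):
                    next_slice = pool.submit(pull_slice, slice_ids[s + 1])
                updates = sample_block(block, model_slice)       # the compute stage
                push_updates(slice_id, updates)
```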
Memory Layout
- Pre-allocated memory layout: increase efficiency by avoiding dynamic memory allocation.
- Per-machine budget (216GB in total):
  - Server table: 30GB
  - Data block of the current iteration: 32GB
  - Data block of the next iteration: 32GB
  - Model slice for the current iteration: 36GB
  - Model slice for the next iteration: 36GB
  - Alias table for the current iteration: 36GB
  - Delta arrays for appending updates from the samplers: 10GB
  - Delta table for aggregating local updates of a particular model slice: 4GB

Each Memory Quota Is Carefully Allocated
- Storing the 1M x 1M model with the hybrid data structure requires about 700GB of RAM, so each machine needs 700/24 ≈ 29GB to store its share of the server table (these figures are reproduced in the small script below).
- Each data block contains 4B tokens and each token requires 8 bytes, so each data block is 32GB.
- The 4B-token block size is determined by the sampling throughput: at 2M tokens/machine/sec, 4B tokens need 2,000 seconds to sample.
- Sweeping one data block uses 20 model slices, so each model slice is sampled for about 2,000/20 = 100 seconds.
- In the initial stage the whole model is about 700GB and the maximum size of a model slice is 36GB, so the model is split into 20 slices; close to convergence the model becomes very sparse and is much smaller than 700GB.
- An alias table must be constructed for each active model slice; although the alias table is also sparsified, in the initial stage it has roughly the same size as a model slice.
- The number of updates cannot exceed twice the number of tokens sampled against a particular slice, so the delta table and delta arrays are rather small.
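A small sanity-check script for the figures above, using exactly the numbers stated on the slides:

```python
import math

machines         = 24
model_ram_gb     = 700        # 1M words x 1M topics in the hybrid structure
tokens_per_block = 4e9
bytes_per_token  = 8          # (word id, topic id), 4 bytes each
sampling_rate    = 2e6        # tokens per machine per second
max_slice_gb     = 36

server_table_gb   = model_ram_gb / machines                    # ~29 GB per machine
block_gb          = tokens_per_block * bytes_per_token / 1e9   # 32 GB per data block
block_seconds     = tokens_per_block / sampling_rate           # 2,000 s per block
num_slices        = math.ceil(model_ram_gb / max_slice_gb)     # 20 model slices
seconds_per_slice = block_seconds / num_slices                 # 100 s per slice

print(server_table_gb, block_gb, block_seconds, num_slices, seconds_per_slice)
```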
Details about the Hybrid Structure
- With the hybrid data structure, we reduce memory usage by over 100 times while maintaining efficient random access.
- An array serves as the index, e.g. Row 0: dense; Row 1: sparse, size 20; Row 2: sparse, size 10.
- A pre-allocated memory pool reduces the overhead caused by dynamic memory allocation.

Average Probes for a Given Load Factor
Load factor | 0.10 | 0.50 | 0.75 | 0.80 | 0.90 | 0.99
Successful search, chaining | 1.05 | 1.25 | 1.38 | 1.40 | 1.45 | 1.50
Successful search, linear probing | 1.05 | … | … | … | … | 50.5
Successful search, quadratic probing | 1.05 | 1.44 | 2.01 | 2.21 | 2.85 | 5.11
Successful search, double hashing | 1.05 | 1.39 | 1.85 | 2.01 | 2.56 | 4.6
Unsuccessful search, chaining | 0.10 | 0.50 | 0.75 | 0.80 | 0.90 | 0.99
Unsuccessful search, linear probing | … | … | … | … | … | 5000
Unsuccessful search, quadratic probing | … | … | … | … | 11.4 | 103.6
Unsuccessful search, double hashing | 1.11 | 2.0 | 4.0 | 5.0 | 10 | 100
(source: http://www.augustana.ca/mohrj/courses/1999.fall/csc210/lecture_notes/hashing.html)

Out-of-Core and Fault Tolerance

Out-of-Core
- Binary serialization: efficient block-wise I/O with high disk read/write throughput, and no marshaling time.
- Each token and its topic are stored adjacently (CPU-cache friendly), per document: Doc 1: w11 t11 w12 t12; Doc 2: w21 t21 w22 t22. A serialization sketch follows.
- Data size is easy to estimate: 4B tokens require 4B × 8 bytes = 32GB.
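A sketch of one such block layout using Python's struct packing; the exact on-disk format is an assumption, not the actual LightLDA format:

```python
import struct

def write_block(path, docs):
    """Serialize a data block as a flat stream of (word_id, topic_id) uint32
    pairs, preceded by a per-document token count, so each token and its topic
    are adjacent and the block size is easy to predict (8 bytes per token)."""
    with open(path, "wb") as f:
        for tokens in docs:                       # tokens: list of (word, topic)
            f.write(struct.pack("<I", len(tokens)))
            for word, topic in tokens:
                f.write(struct.pack("<II", word, topic))

def read_block(path):
    """Read the block back into a list of documents."""
    docs = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (n,) = struct.unpack("<I", header)
            flat = struct.unpack("<%dI" % (2 * n), f.read(8 * n))
            docs.append(list(zip(flat[0::2], flat[1::2])))
    return docs
```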
Zero-Cost Fault Tolerance by Warm Start
- [Diagram: data block on disk → loaded into RAM → Gibbs sampler → stored back to disk with an atomic file overwrite]
- No checkpoints are needed: instead of dumping the model, we dump the (token, topic) data, which the out-of-core module already does automatically.
- If anything fails, the system supports a warm start.
- Atomic file overwrite avoids file corruption if the system crashes while the file is being written (a sketch of the usual write-then-rename pattern follows).
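The usual way to get this behavior is to write the new block to a temporary file and rename it over the old one; a sketch, not the actual LightLDA code:

```python
import os

def atomic_overwrite(path, write_fn):
    """Write to a temp file, fsync, then rename over the old file, so a crash
    mid-write leaves either the old or the new block on disk, never a mix
    (rename is atomic on POSIX filesystems)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        write_fn(f)             # caller-provided serializer that writes the block into f
        f.flush()
        os.fsync(f.fileno())    # ensure the bytes hit disk before the rename
    os.replace(tmp_path, path)  # atomic swap
```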
Warm Start
- Cold start: the topic of each token is initialized by a uniform random number generator; the system then counts the (word, topic) occurrences to get the initial model.
- Warm start: the topic of each token is persisted from the previous training run; the system counts the (word, topic) occurrences to get the initial model. This is equivalent to failover by checkpoint.

The Disadvantages of Dumping the Model
- If we dump the model in the checkpoint, we need an additional burn-in stage before continuing the training procedure based o