
Social Network Platforms and Technologies in the Cloud Computing Era
Edward Chang (張智威)
Deputy Director, Research Institute, Google China; Professor, Department of Electrical Engineering, University of California

The China opportunity (China vs. U.S.):
- Internet population: 180 million (+25%) vs. 208 million (+3%)
- Broadband users: 60 million (+90%) vs. 60 million (+29%)
- Mobile phones: 500 million vs. 180 million
- Engineering graduates: 600 k

Memory-based (Resnick et al. 1994; Konstan et al. 1997)
- Pros: simplicity; avoids the model-building stage
- Cons: memory- and time-consuming, since the entire database is used every time a prediction is made; cannot make a prediction if the user has no items in common with other users
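As a concrete illustration of these trade-offs, here is a minimal user-based memory CF sketch. It is not the algorithm from the talk; the dense toy rating matrix, the cosine similarity, and the name `predict_rating` are assumptions for illustration only.

```python
import numpy as np

def predict_rating(R, user, item, k=5):
    """Predict R[user, item] from the k most similar users (cosine).
    Scans the whole matrix on every call, which is exactly the
    memory/time cost noted above. R is a dense (users x items)
    array where 0 means unrated (a toy assumption)."""
    raters = np.where(R[:, item] > 0)[0]
    raters = raters[raters != user]
    if raters.size == 0:
        return None  # nobody rated this item: no prediction possible
    u = R[user]
    sims = R[raters] @ u / (
        np.linalg.norm(R[raters], axis=1) * np.linalg.norm(u) + 1e-12)
    order = np.argsort(sims)[-k:]        # indices of the k nearest raters
    w = np.clip(sims[order], 0, None)    # ignore negatively similar users
    if w.sum() == 0:
        return None  # nothing in common with any rater
    return float(w @ R[raters[order], item] / w.sum())
```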

Model-based (Breese et al. 1998; Hofmann 1999; Blei et al. 2004)
- Pros: scalability, since the model is much smaller than the actual dataset; faster prediction, since queries go to the model instead of the entire dataset
- Cons: model building takes time

Algorithm selection criteria
- Near-real-time recommendation
- Scalable training; incremental training is desirable
- Can deal with data scarcity
- Cloud computing!

Model-based prior work
- Latent semantic analysis (LSA)
- Probabilistic LSA (pLSA)
- Latent Dirichlet allocation (LDA)

Latent semantic analysis (LSA) (Deerwester et al. 1990)
- Maps high-dimensional count vectors to a lower-dimensional representation, called the latent semantic space, via the SVD decomposition A = U Σ V^T, where A is the w × d word-document co-occurrence matrix, U is w × t, Σ is t × t, and V^T is t × d (w words, d documents, t topics)
- U_ij: how likely word i belongs to topic j
- Σ_jj: how significant topic j is
- (V^T)_ij: how likely topic i belongs to document j

Latent semantic analysis (cont.)
- LSA keeps the k largest singular values, a low-rank approximation A_k = U_k Σ_k V_k^T (dimensions w × k, k × k, k × d) to the original matrix
- Saves space, de-noises, and reduces sparsity
- Makes recommendations using word-word similarity A_k A_k^T, doc-doc similarity A_k^T A_k, and the word-doc relationship A_k itself
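A minimal numpy sketch of the rank-k reconstruction described above; the toy count matrix and the choice k = 3 are assumptions for illustration.

```python
import numpy as np

# Toy word-document count matrix A (w x d): rows = words, columns = docs
rng = np.random.default_rng(0)
A = rng.poisson(0.5, size=(12, 8)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3                                  # keep the k largest singular values
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # low-rank approximation U_k S_k V_k^T

word_word = A_k @ A_k.T                # word-word similarity
doc_doc = A_k.T @ A_k                  # doc-doc similarity
word_doc = A_k                         # word-doc relationship
```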

Probabilistic latent semantic analysis (pLSA) (Hofmann 1999; Hofmann 2004)
- A document is viewed as a bag of words
- A latent semantic layer z is constructed between documents and words: p(w, d) = p(d) p(w|d) = p(d) Σ_z p(w|z) p(z|d)
- Probability delivers explicit meaning: p(w|w), p(d|d), p(d, w)
- Model learning via the EM algorithm
- [Graphical model: d → z → w, with factors p(d), p(z|d), p(w|z)]
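A compact EM sketch for the pLSA factorization above. The tempering, sparse-matrix handling, and initialization details of the cited systems are omitted; the array shapes and the name `plsa_em` are assumptions.

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0, eps=1e-12):
    """EM for pLSA on a (docs x words) count matrix N.
    Returns p(w|z) as a (K x W) array and p(z|d) as a (D x K) array."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities p(z|d,w) proportional to p(z|d) p(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]      # D x K x W
        post /= post.sum(1, keepdims=True) + eps
        # M-step: re-estimate parameters from expected counts n(d,w) p(z|d,w)
        nz = N[:, None, :] * post                         # D x K x W
        p_w_z = nz.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + eps
        p_z_d = nz.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + eps
    return p_w_z, p_z_d
```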

pLSA extensions
- PHITS (Cohn & Chang 2000): models document-citation co-occurrence
- A linear combination of pLSA and PHITS (Cohn & Hofmann 2001): models contents (words) and the inter-connectivity of documents
- LDA (Blei et al. 2003): provides a complete generative model with a Dirichlet prior
- AT (Griffiths & Steyvers 2004): includes authorship information; a document is categorized by authors and topics
- ART (McCallum 2004): includes email recipients as additional information; an email is categorized by author, recipients, and topics

Combinational collaborative filtering (CCF)
- Fuses multiple sources of information to alleviate the information sparsity problem
- Hybrid training scheme: Gibbs sampling as initialization for the EM algorithm
- Parallelization: achieves linear speedup with the number of machines

Notations
Given a collection of co-occurrence data:
- Communities: C = {c_1, c_2, …, c_N}
- Users: U = {u_1, u_2, …, u_M}
- Descriptions: D = {d_1, d_2, …, d_V}
- Latent aspects: Z = {z_1, z_2, …, z_K}
Models:
- Baseline models: the community-user (C-U) model and the community-description (C-D) model
- CCF: combinational collaborative filtering, which combines both baseline models

Baseline models
Community-user (C-U) model:
- A community is viewed as a bag of users
- C and U are rendered conditionally independent by introducing Z
- Generative process, for each user u:
  1. A community c is chosen uniformly
  2. A topic z is selected from p(z|c)
  3. A user u is generated from p(u|z)
Community-description (C-D) model:
- A community is viewed as a bag of words
- C and D are rendered conditionally independent by introducing Z
- Generative process, for each word d:
  1. A community c is chosen uniformly
  2. A topic z is selected from p(z|c)
  3. A word d is generated from p(d|z)
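A toy sketch of the C-U generative process listed above. The sizes and the Dirichlet-sampled parameter tables are assumptions; the talk does not specify how p(z|c) and p(u|z) are parameterized here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_comm, n_topics, n_users = 4, 3, 10                    # toy sizes
p_z_c = rng.dirichlet(np.ones(n_topics), size=n_comm)   # rows: p(z|c)
p_u_z = rng.dirichlet(np.ones(n_users), size=n_topics)  # rows: p(u|z)

def sample_membership():
    """Draw one community-user observation per the C-U generative story."""
    c = rng.integers(n_comm)               # 1. community chosen uniformly
    z = rng.choice(n_topics, p=p_z_c[c])   # 2. topic from p(z|c)
    u = rng.choice(n_users, p=p_u_z[z])    # 3. user from p(u|z)
    return c, u

pairs = [sample_membership() for _ in range(1000)]
```

The C-D model is identical with words in place of users.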

Baseline models (cont.)
Community-user (C-U) model:
- Pros: 1. personalized community suggestion
- Cons: 1. the C-U matrix is sparse, so it may suffer from the information sparsity problem; 2. cannot take advantage of the content similarity between communities
Community-description (C-D) model:
- Pros: 1. clusters communities based on community content (description words)
- Cons: 1. no personalized recommendation; 2. does not consider the users who overlap between communities

CCF model
- The combinational collaborative filtering (CCF) model combines both baseline models: a community is viewed as a bag of users and a bag of words
- By adding C-U, CCF can perform personalized recommendation, which C-D alone cannot
- By adding C-D, CCF can perform better personalized recommendation than C-U alone, which may suffer from sparsity
- Things CCF can do that C-U and C-D cannot: p(d|u), relating users to words, which is useful for targeting ads to users
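A sketch of how a fitted model of this family could relate users to words. The Bayes inversion p(z|u) ∝ p(u|z) p(z) and the random stand-in parameter tables are assumptions, not the CCF inference procedure itself (which the talk omits).

```python
import numpy as np

# Stand-ins for learned tables: p(u|z) (K x M), p(d|z) (K x V), p(z) (K,)
rng = np.random.default_rng(1)
K, M, V = 3, 5, 8
p_u_z = rng.dirichlet(np.ones(M), size=K)
p_d_z = rng.dirichlet(np.ones(V), size=K)
p_z = rng.dirichlet(np.ones(K))

# p(z|u) by Bayes' rule, then p(d|u) = sum_z p(d|z) p(z|u)
p_z_u = p_u_z * p_z[:, None]             # K x M, unnormalized
p_z_u /= p_z_u.sum(0, keepdims=True)
p_d_u = p_d_z.T @ p_z_u                  # V x M: word relevance per user
```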

Algorithm requirements
- Near-real-time recommendation
- Scalable training; incremental training is desirable

Parallelizing CCF
- Details omitted

Industry trend: the arrival of the cloud computing era
(1) Data in the cloud: no fear of loss, no need for backups
(2) Software in the cloud: no downloads, automatic upgrades
(3) Ubiquitous cloud computing: log in from any device and it is yours
(4) Infinitely powerful cloud computing: unlimited storage, unlimited speed

Experiments on the Orkut dataset
- Data description: collected on July 26, 2007; two types of data were extracted, community-user and community-description
- 312,385 users; 109,987 communities; 191,034 unique English words
- Evaluated: community recommendation, community similarity/clustering, user similarity, speedup

Community recommendation: evaluation method
- No ground truth and no user clicks available
- Leave-one-out: randomly delete one joined community for each user and test whether the deleted community can be recovered
- Evaluation metrics: precision and recall
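A minimal scorer for this leave-one-out protocol. With exactly one held-out community per user, precision@k reduces to hit/k and recall@k to the hit indicator; the function and variable names are assumptions.

```python
def precision_recall_at_k(ranked, held_out, k=20):
    """ranked: recommended community ids, best first;
    held_out: the one community deleted for this user."""
    hit = 1.0 if held_out in ranked[:k] else 0.0
    return hit / k, hit      # (precision@k, recall@k)

# Average the per-user pairs over all users to get the reported curves.
```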

Results
- CCF outperforms C-U
- For the top 20 recommendations, the precision and recall of CCF are twice those of C-U
- The more communities a user has joined, the better CCF and C-U can predict

Runtime speedup
- The Orkut dataset enjoys linear speedup when the number of machines is up to 100
- Training time is reduced from one day to less than 14 minutes
- But what makes the speedup slow down beyond 100 machines?

Runtime speedup (cont.)
- Training time consists of two parts: computation time (comp) and communication time (comm)
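A back-of-the-envelope model of why the curve flattens: per-machine computation shrinks as 1/m while communication overhead grows with m. The functional form and the constants below are assumptions, not measurements from the talk.

```python
def speedup(m, t_comp=86400.0, t_comm_per_machine=1.0):
    """Speedup on m machines for a job taking t_comp seconds on one
    machine, with communication cost growing linearly in m (assumed)."""
    return t_comp / (t_comp / m + t_comm_per_machine * m)

for m in (10, 50, 100, 200, 500):
    print(m, round(speedup(m), 1))  # flattens, then degrades, past ~100
```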

CCF summary
- Combinational collaborative filtering fuses bag-of-words and bag-of-users information
- Hybrid training provides better initializations for EM than random seeding
- Parallelized to handle large-scale datasets

China's contributions to cloud computing
- Parallel CCF
- Parallel SVMs (kernel machines)
- Parallel SVD
- Parallel spectral clustering
- Parallel expectation maximization
- Parallel association mining
- Parallel LDA

Speeding up SVMs (NIPS 2007)
- Approximate matrix factorization; parallelization
- Open source: 350+ downloads since December 2007
- A task that takes 7 days on 1 machine takes 1 hour on 500 machines

Incomplete Cholesky factorization (ICF)
- An n × n matrix is approximated by the product of an n × p factor and its p × n transpose, with p << n, to conserve storage

Matrix product
- (p × n) × (n × p) = (p × p): subsequent computations work on the small p × p product instead of the n × n matrix
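A sketch of pivoted incomplete Cholesky, assuming a symmetric positive semi-definite kernel matrix K; the greedy largest-diagonal pivoting rule and the name `icf` are assumptions, not necessarily the exact variant used in the cited system.

```python
import numpy as np

def icf(K, p, tol=1e-8):
    """Return H (n x p) with K approximately H @ H.T; storage drops
    from n*n to n*p. Greedy pivot on the largest residual diagonal."""
    n = K.shape[0]
    H = np.zeros((n, p))
    d = np.diag(K).astype(float).copy()   # residual diagonal of K - H H^T
    for j in range(p):
        i = int(np.argmax(d))             # pivot index
        if d[i] <= tol:                   # residual negligible: stop early
            return H[:, :j]
        H[:, j] = (K[:, i] - H @ H[i]) / np.sqrt(d[i])
        d -= H[:, j] ** 2
    return H

# Toy check: RBF kernel on random points, p << n
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
H = icf(K, p=20)
print(np.abs(K - H @ H.T).max())          # error shrinks as p grows
```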

Organizing the world's information, socially
- Social platform (社區平臺)
- Cloud computing (雲運算)
- Concluding remarks (結論與前瞻)

[Figure: the Web with people: .htm, .jpg, .doc, .xls, and .msg documents interlinked with people]

What next for web search?

Personalization
- Return query results considering personal preferences
- Example: disambiguate an ambiguous term like "Fuji"
Oops: several have tried; the problem is hard
- Training data are difficult to collect in sufficient quantity (for collaborative filtering)
- Supporting personalization is computationally intensive (e.g., for personalizing PageRank)
- User profiles may be incomplete or erroneous

Personal search, intelligent search: a search for "富士" (Fuji) can return 富士山 (Mount Fuji), 富士蘋果 (Fuji apples), or 富士相機 (Fuji cameras)

Organizing the world's information, socially
- The Web is a collection of documents and people
- Recommendation is a personalized, push model of search
- Collaborative filtering requires dense information to be effective
- Cloud computing is essential

References
[1] Alexa Internet. http:/
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the 17th International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial Intelligence, pages 289-296, 1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.
[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems, 8:374-384, 2005.
[20] L. Adamic and E. Adar. How to search a social network. 2004.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. ReferralWeb: combining social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Record, 22:207-216, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. of the 14th Conference on Uncertainty in Artificial Intelligence, 1998.
[25] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1):143-177, 2004.
[26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th International World Wide Web Conference, pages 285-295, 2001.
