
Dumbo in the Cloud Series, Report 2
Cloud Group

Hadoop in SIGMOD 2011

Outline
  • Introduction
  • Nova: Continuous Pig/Hadoop Workflows
  • Apache Hadoop Goes Realtime at Facebook
  • Emerging Trends in the Enterprise Data Analytics
  • A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

Industrial Session in SIGMOD 2011
  • Data Management for Feeds and Streams (2)
  • Dynamic Optimization and Unstructured Content (4)
  • Business Analytics (2)
  • Support for Business Analytics and Warehousing (4)
  • Applying Hadoop (4)

Nova: Continuous Pig/Hadoop Workflows
By Yahoo!

Nova Overview
  • Scenarios
      - Ingesting and analyzing user behavior logs
      - Building and updating a search index from a stream of crawled web pages
      - Processing semi-structured data
  • Two-layer programming model (Nova over Pig)
      - Continuous processing
      - Independent scheduling
      - Cross-module optimization
      - Manageability features

Workflow Model
  • Workflow
      - Two kinds of vertices: tasks (processing steps) and channels (data containers)
      - Edges connect tasks to channels and channels to tasks
  • Four common patterns of processing
      - Non-incremental (template detection)
      - Stateless incremental (shingling)
      - Stateless incremental with lookup table (template tagging)
      - Stateful incremental (de-duping)

Workflow Model (Cont.)
Data and Update Model
  • Blocks: a channel's data is divided into blocks
  • Base block
      - Contains a complete snapshot of data on a channel as of some point in time
      - Base blocks are assigned increasing sequence numbers (B0, B1, B2, ..., Bn)
  • Delta block
      - Used in conjunction with incremental processing
      - Contains instructions for transforming a base block into a new base block (Δi→j)

Workflow Model (Cont.)
Task/Data Interface
  • Consumption mode: all or new
  • Production mode: B or Δ

Workflow Model (Cont.)
Workflow Programming and Scheduling
  • Data-based trigger
  • Time-based trigger
  • Cascade trigger
Data Compaction and Garbage Collection
  • If a channel has blocks B0, Δ0→1, Δ1→2, Δ2→3, the compaction operation computes B3 and adds it to the channel.
  • Once compaction has added B3 to the channel and the current cursor is at sequence number 2, B0, Δ0→1, and Δ1→2 can be garbage-collected.

Nova System Architecture

Apache Hadoop Goes Realtime at Facebook
By Facebook

Workload Types
  • Facebook Messaging
      - High Write Throughput
      - Large Tables
      - Data Migration
  • Facebook Insights
      - Realtime Analytics
      - High Throughput Increments
  • Facebook Metrics System (ODS)
      - Automatic Sharding
      - Fast Reads of Recent Data and Table Scans

Why Hadoop & HBase
  • Requirements
      - Elasticity
      - High write throughput
      - Efficient and low-latency strong consistency semantics within a data center
      - Efficient random reads from disk
      - High Availability and Disaster Recovery
      - Fault Isolation
      - Atomic read-modify-write primitives
      - Range Scans
  • Non-requirements
      - Tolerance of network partitions within a single data center
      - Zero Downtime in case of individual data center failure
      - Active-active serving capability across different data centers

Realtime HDFS
  • High Availability: AvatarNode

Realtime HDFS (Cont.)
  • Hadoop RPC compatibility
  • Block Availability: Placement Policy
      - A pluggable block placement policy

Realtime HDFS (Cont.)
  • Performance Improvements for a Realtime Workload
      - RPC Timeout
      - Reads from Local Replicas
  • New Features
      - HDFS sync
      - Concurrent Readers

Production HBase
  • ACID Compliance (RWCC: Read Write Consistency Control)
      - Atomicity (WALEdit)
      - Consistency
  • Availability Improvements
      - HBase Master rewrite; region assignment: in memory → ZooKeeper
      - Online Upgrades
      - Distributed Log Splitting
  • Performance Improvements
      - Compaction (minor and major)
      - Read Optimizations

Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse
By IBM

Motivation
  1. Increasing volumes of data
  2. Hadoop-based solutions in conjunction with data warehouses

A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses
By Teradata

Motivation
  • ETL (Extraction, Transformation, Loading) is a critical part of a data warehouse
  • While data are partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node (bottleneck)
  • Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
      - More disk space can be easily added
      - Use as an intermediate storage
      - MapReduce for transformation
      - Load data in parallel

Block Assignment Problem
  • HDFS file F on a cluster of P nodes (each node is uniquely identified with an integer i where 1 ≤ i ≤ P)
  • The problem is defined by: assignment(X, Y, n, m, k, r)
      - X is the set of n blocks (X = {1, ..., n}) of F
      - Y is the set of m nodes running PDBMS (called PDBMS nodes) (Y ⊆ {1, ..., P})
      - Each block has k copies; there are m PDBMS nodes
      - r is the mapping recording the replicated block locations of each block. r(i) returns the set of nodes which have a copy of block i.

Block Assignment Problem (Cont.)
  • An assignment g from the blocks in X to the nodes in Y is denoted by a mapping from X = {1, ..., n} to Y, where g(i) = j (i ∈ X, j ∈ Y) means that block i is assigned to node j. An eve
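
The notation above (blocks X, PDBMS nodes Y, replica map r, assignment g) can be made concrete with a small sketch. The greedy heuristic and the names greedy_assignment and local_blocks below are hypothetical and chosen only for illustration; this is not the algorithm from the Teradata paper, and the toy cluster is made-up data. The sketch prefers a node that already holds a replica of a block and breaks ties by current load.

```python
def greedy_assignment(X, Y, r):
    """Build a mapping g: X -> Y (block -> PDBMS node).

    Hypothetical heuristic for illustration only: prefer a PDBMS node
    that already stores a replica of the block (a node in r[i] that is
    also in Y), and among the candidates pick the least-loaded node.
    """
    load = {j: 0 for j in Y}          # blocks assigned to each node so far
    g = {}
    for i in sorted(X):
        local = [j for j in sorted(Y) if j in r[i]]
        candidates = local if local else sorted(Y)
        j = min(candidates, key=lambda node: load[node])
        g[i] = j
        load[j] += 1
    return g


def local_blocks(g, r):
    """Count blocks i whose assigned node g[i] holds a replica of i."""
    return sum(1 for i, j in g.items() if j in r[i])


# Toy instance (assumed data): P = 4 HDFS nodes, PDBMS runs on nodes
# 1 and 2 (m = 2), n = 4 blocks, k = 2 copies of each block.
X = {1, 2, 3, 4}
Y = {1, 2}
r = {1: {1, 3}, 2: {1, 2}, 3: {2, 4}, 4: {3, 4}}

g = greedy_assignment(X, Y, r)
print(g)                   # mapping block -> node
print(local_blocks(g, r))  # blocks readable from a local replica
```

In this toy instance block 4 has no replica on any PDBMS node, so whichever node receives it must fetch it remotely; counting such blocks is one plausible way to compare assignments.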
