2022 Data Summit, Data Lake: Modern Data Lake Essentials

Building The Open Data Lakehouse

Mark Lyons

Confidential - Do Not Share or Distribute

Mark

Product @ Dremio

Product Management @ Vertica

Product Manager @ EnerNOC


Agenda

● Brief history of data platforms

● Open Data Architecture

● Table formats

Asked to do the impossible

What You Want to Accomplish

● Scale & Performance
● Data Democratization
● Support New Initiatives and Projects
● Fast Time to Value
● Security & Governance

But It's Getting Harder by the Day

● Data Is Rapidly Increasing
● Access Requests Are Rapidly Increasing
● Your Budget Is Flat
● Data Talent Is Scarce

Why data lakes & cloud?


Past & Present

● Schema on read
● Flexible with data files/types
● Cost-effective storage
● Expensive to maintain/upgrade
● Specialized talent

● Schema on write
● JSON data type
● Consumption pricing
● Separate storage & compute
● Managed


Open Data Architecture Evolution


Data Lake 1.0

Open Compute

● SQL: Dremio, Presto, etc.
● Streaming: Databricks, Flink
● Spark: Databricks, EMR, etc.

Open Data

● Metastore: Hive Metastore, AWS Glue
● File Format: Apache Parquet
● Storage: S3, ADLS, GCS

● The data tier of the data lake was very basic, requiring significant engineering work from customers

● Analytics/access performance suffered

● Maintenance and engineering costs became overwhelming

Making the Lakehouse

Users & Applications

Interfaces: Arrow Flight, ODBC/JDBC

Data Lake Engines: Dremio, Spark, Hive, Athena

Open Table Formats: Iceberg, Delta Lake, Hudi

File Formats: Parquet, ORC, csv, json

Data Lake Storage: S3, HDFS, ADLS


Going further than the EDW

Users & Applications

Interfaces: Arrow Flight, ODBC/JDBC

Data Lake Engines: Dremio, Spark, Hive, Athena

Modern Metastore (new): Nessie

Open Table Formats: Iceberg, Delta Lake, Hudi

File Formats: Parquet, ORC, csv, json

Data Lake Storage: S3, HDFS, ADLS


Transactional Tables on the Data Lake

Record-level data mutations with SQL DML

INSERT INTO t1 ...
UPDATE t1 SET col1 = ...
DELETE FROM t1 WHERE state = 'CA'

Automatic partitioning

CREATE TABLE t1 PARTITIONED BY (month(date), bucket[1000](user))

Instant schema and partition evolution

ALTER TABLE t1 ADD/DROP/MODIFY/RENAME COLUMN c1 ...
ALTER TABLE t1 ADD/DROP PARTITION FIELD ...

Time travel

SELECT * FROM t1 AT/BEFORE <timestamp>

Apache Iceberg

● Created by Netflix, Apple and other big tech
● INSERT/UPDATE/DELETE with any engine
● Strong momentum in the OSS community

Delta Lake

● Created by Databricks
● INSERT/UPDATE with Spark, SELECT with any engine
● Primarily used in conjunction with Databricks
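The automatic partitioning example on this slide derives partition values from column values through transforms: month(date) and bucket[1000](user). A minimal Python sketch of that idea follows. The function names are hypothetical, and `zlib.crc32` is only a stand-in hash (the Apache Iceberg specification uses a 32-bit Murmur3 hash, so real bucket numbers will differ).

```python
import zlib
from datetime import date


def month_transform(d: date) -> int:
    """Months elapsed since 1970-01, so every day in a month shares one value."""
    return (d.year - 1970) * 12 + (d.month - 1)


def bucket_transform(value: str, num_buckets: int) -> int:
    """Stable hash of the value, reduced to a bucket in [0, num_buckets).

    Assumption: crc32 is a stand-in and is not the hash Iceberg uses.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def partition_key(row: dict) -> tuple:
    """Partition tuple for PARTITIONED BY (month(date), bucket[1000](user))."""
    return (month_transform(row["date"]), bucket_transform(row["user"], 1000))


rows = [
    {"date": date(2022, 3, 1), "user": "alice"},
    {"date": date(2022, 3, 31), "user": "alice"},
    {"date": date(2022, 4, 1), "user": "alice"},
]
keys = [partition_key(r) for r in rows]
```

The writer never supplies a partition column by hand: rows from the same month and user land in the same partition, and a filter on date can be mapped to partition values by the engine.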



Nessie: A modern metastore

Data Branching

Transactions on steroids

– Multi-table consistency/transactions
– Experimentation (isolated dev/test)
– Pre-prod verification (stage → prod)

CREATE BRANCH etl
[Kafka] Data Ingestion
[Spark] Transformation 1
[Spark] Transformation 2
[Dremio] Reflection Refresh 1
[Dremio] Reflection Refresh 2
USE BRANCH main
MERGE BRANCH etl
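The branching workflow above can be pictured as a catalog that maps each branch to a set of table pointers. The sketch below is a toy in-memory model under that assumption, not Nessie's API: the `Catalog` class and its method names are hypothetical, and conflict detection on merge is left out.

```python
class Catalog:
    """Toy metastore: each branch maps table names to snapshot ids."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A branch starts as a cheap copy of the source branch's pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        # Writers only move pointers on their own branch.
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        # Build the new target state first, then swap it in with one
        # assignment, so readers see all of the source's tables or none.
        merged = dict(self.branches[target])
        merged.update(self.branches[source])
        self.branches[target] = merged

    def read(self, table, branch="main"):
        return self.branches[branch].get(table)


catalog = Catalog()
catalog.commit("main", "t1", "snap-1")
catalog.create_branch("etl")
catalog.commit("etl", "t1", "snap-2")   # e.g. [Spark] Transformation 1
catalog.commit("etl", "t2", "snap-3")   # e.g. [Spark] Transformation 2
before_merge = catalog.read("t1")        # main is untouched so far
catalog.merge("etl")
after_merge = (catalog.read("t1"), catalog.read("t2"))
```

Because every engine in the pipeline writes to the `etl` branch, readers of `main` keep seeing the old state of both tables until the single merge step publishes them together.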

Data Version Control

Time travel on steroids

– Reproducibility
– Compliance
– Historical comparisons

USE BRANCH 'main'
SELECT * FROM t1 -- main implicit
SELECT * FROM t1@etl
SELECT * FROM t1 AT '2020-10-26'
SELECT * FROM t1@etl AT '2020-10-26'
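The four queries above combine two lookups: which branch to read, and which point in that branch's history. A rough Python sketch of that resolution is shown here. `BranchHistory` is a hypothetical toy, it assumes commits arrive in timestamp order, and each branch keeps its own independent log (a real metastore would share history between a branch and its parent).

```python
from bisect import bisect_right


class BranchHistory:
    """Toy commit log per branch: (timestamp, {table: snapshot_id})."""

    def __init__(self):
        self.log = {}  # branch name -> list of (timestamp, state), oldest first

    def commit(self, branch, timestamp, table, snapshot_id):
        history = self.log.setdefault(branch, [])
        state = dict(history[-1][1]) if history else {}
        state[table] = snapshot_id
        history.append((timestamp, state))

    def read(self, table, branch="main", at=None):
        history = self.log.get(branch, [])
        if at is None:
            idx = len(history)          # latest commit on the branch
        else:
            # Last commit whose timestamp is <= the requested time.
            idx = bisect_right([ts for ts, _ in history], at)
        if idx == 0:
            return None                 # nothing committed yet at that time
        return history[idx - 1][1].get(table)


h = BranchHistory()
h.commit("main", "2020-10-01", "t1", "snap-1")
h.commit("main", "2020-11-01", "t1", "snap-4")
h.commit("etl", "2020-10-20", "t1", "snap-2")
h.commit("etl", "2020-10-30", "t1", "snap-3")

latest_main = h.read("t1")                      # SELECT * FROM t1
latest_etl = h.read("t1", "etl")                # SELECT * FROM t1@etl
main_at = h.read("t1", at="2020-10-26")         # SELECT * FROM t1 AT '2020-10-26'
etl_at = h.read("t1", "etl", at="2020-10-26")   # SELECT * FROM t1@etl AT '2020-10-26'
```

ISO-formatted date strings sort in chronological order, which is why plain string comparison is enough in this sketch.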


Data Warehouse Transactions vs. Real-World Transactions

Data Warehouse Transactions

One session
One user
SQL-only
Built for application developers

BEGIN TRANSACTION etl;
INSERT INTO t1 ...
UPDATE t1 ...
UPDATE t1 ...
COMMIT;

Real-World Transactions

Multiple sessions
Multiple users
Multiple engines/services
Built for data engineers

CREATE BRANCH etl
[Kafka] Ingest data source 1
[RDBMS] Ingest data source 2
[Spark] Transformation step 1
[Dremio] Transformation step 2
[Spark] Insert into table
[Dremio] Build reflections
[Spark] Verify data
USE BRANCH main
MERGE BRANCH etl

Safe, real-world transactions

CREATE BRANCH etl_1897914;
USE BRANCH etl_1897914;

MERGE INTO orders o USING orders_stage o_s
ON o.order_id = o_s.order_id
WHEN MATCHED THEN
UPDATE ...
WHEN NOT MATCHED THEN
INSERT *;

MERGE INTO lineitem li USING li_stage li_s
ON li.order_id = li_s.order_id
AND li.line_number = li_s.line_number
WHEN MATCHED THEN
UPDATE ...
WHEN NOT MATCHED THEN
INSERT *;

SELECT SUM(CASE WHEN order_amount < 0 THEN 1 ELSE 0 END) FROM orders;
SELECT SUM(CASE WHEN quantity < 0 THEN 1 ELSE 0 END) FROM lineitem;

-- additional data quality checks by non-SQL tools

-- if checks pass:
REFRESH REFLECTION sales_dashboard_main;

-- final quality checks
USE BRANCH production;
MERGE BRANCH etl_1897914;

-- if checks don't pass:
-- don't merge, alert
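The "if checks pass / if checks don't pass" gate in this workflow can be sketched in Python as below. This is an illustration under stated assumptions: an in-memory SQLite database stands in for the tables on the ETL branch, the `merge` and `alert` callbacks are hypothetical placeholders for MERGE BRANCH and an alerting hook, and COALESCE is added so that an empty table counts as zero violations.

```python
import sqlite3


def count_negative(conn, table, column):
    """Run the slide's check: number of rows where the column is negative."""
    # Identifiers are fixed by the caller; never interpolate untrusted input.
    sql = (
        f"SELECT COALESCE(SUM(CASE WHEN {column} < 0 THEN 1 ELSE 0 END), 0) "
        f"FROM {table}"
    )
    return conn.execute(sql).fetchone()[0]


def checks_pass(conn):
    return (count_negative(conn, "orders", "order_amount") == 0
            and count_negative(conn, "lineitem", "quantity") == 0)


def publish_if_clean(conn, merge, alert):
    """Call merge() only when every check passes; otherwise call alert()."""
    if checks_pass(conn):
        merge()
        return True
    alert()
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_amount REAL)")
conn.execute(
    "CREATE TABLE lineitem (order_id INTEGER, line_number INTEGER, quantity INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (2, 35.5)])
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?)", [(1, 1, 2), (2, 1, -4)])

events = []
first = publish_if_clean(
    conn, lambda: events.append("merge"), lambda: events.append("alert")
)
conn.execute("UPDATE lineitem SET quantity = 4 WHERE quantity < 0")
second = publish_if_clean(
    conn, lambda: events.append("merge"), lambda: events.append("alert")
)
```

The first attempt finds a negative quantity and only alerts; after the bad row is fixed on the branch, the second attempt merges.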

More on table formats


Hive table format

The de facto standard

A table is an 'ls' of 1 or more directories

Pros

● Single, central answer to "what data is in this table" for the whole ecosystem

● Works with basically every engine since it's been the de facto standard

Cons

● If everything in a directory is a table's contents, how do I update and delete data?

● If I need to add multiple files as a single operation, how do I make sure a consumer doesn't see only some of my additions?

● All of the directory listings needed for large tables take too long from <start listing> to <end listing>

○ Planning time

○ Consistency problems can occur: multiple calls to list contents of the set of partitions

How can we resolve these issues?

→ We need a new table format

Iceberg's goals

● ACID transactions on S3

● Table evolution

● Table correctness/consistency

● Faster query planning and execution

● Decouple partitioning from layout

● Accomplish all of these at scale

Old way

dir → F, F, F

A table is an 'ls' of a directory

New way

Manifest (list of files) → F, F, F

A table is a canonical list of files
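The difference between the two definitions shows up when a writer adds several files at once. The following Python sketch contrasts them with two toy classes. Both classes are hypothetical illustrations, not any engine's implementation: the directory-style table exposes each file as soon as it lands, while the manifest-style table exposes files only when a new manifest is committed.

```python
class DirectoryTable:
    """Old way: the table is whatever files are in the directory right now."""

    def __init__(self):
        self.files = []

    def add_file(self, name):
        self.files.append(name)      # visible to readers immediately

    def scan(self):
        return list(self.files)


class ManifestTable:
    """New way: the table is the file list in the current manifest."""

    def __init__(self):
        self.manifests = [()]        # manifest 0 is the empty table
        self.current = 0             # pointer that readers follow

    def commit(self, new_files):
        # Write a complete new manifest, then move the pointer in one step.
        manifest = self.manifests[self.current] + tuple(new_files)
        self.manifests.append(manifest)
        self.current = len(self.manifests) - 1

    def scan(self):
        return list(self.manifests[self.current])


old = DirectoryTable()
old.add_file("f1.parquet")
partial_view = old.scan()            # a reader arriving mid-write sees 1 of 2 files
old.add_file("f2.parquet")

new = ManifestTable()
staged = ["f1.parquet", "f2.parquet"]   # data files written but not yet committed
view_before = new.scan()
new.commit(staged)
view_after = new.scan()
```

Older manifests stay in the list after a commit, which is also what makes reading an earlier version of the table possible in this model.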


High-level comparison to Databricks Delta Lake

SIMILARITIES

● Extensive integration in Spark

● Very scalable

● Metadata stored alongside data in the data lake

● Immediate data freshness

● Fast planning, data pruning via stats

DIFFERENCES

Table format
Iceberg: Hierarchical with pointer files; snapshot creation on write
Delta Lake: Non-hierarchical; sequentially log each change on write, inline snapshot creation every 10 changes

Who can read and write
Iceberg: Anyone can read, anyone can write
Delta Lake: Anyone can read, Databricks chooses who can write (Spark and probably SQL analytics)

Write behavior
Iceberg: V1: copy on write; V2: copy on write + merge on read
Delta Lake: Copy on write

File formats
Iceberg: Parquet, ORC, and Avro, and is intentionally extensible
Delta Lake: Parquet only

OSS & governance
Iceberg: Fully open source, Apache governance
Delta Lake: Mostly open source, but key capab
