Hadoop Data Collection and Ingestion: Flume and Sqoop

Outline
- Background
- Hadoop data collection systems
- Synchronizing data between traditional databases and Hadoop
- Summary

Background
- Hadoop provides a centralized storage system, which facilitates centralized data analysis and data sharing.
- Hadoop imposes no requirements on storage format: user access logs, product information, web pages, and so on.
- How do we get data into Hadoop?
  - The data is scattered across many discrete devices.
  - The data is held in traditional storage devices and systems.

Two common kinds of data sources
- Scattered sources: machine-generated data, user access logs, user purchase logs.
- Data in traditional systems: relational databases (MySQL, Oracle, etc.), disk arrays, tape.

Basic requirements for collecting and ingesting data into Hadoop
- Distributed: data sources are diverse and scattered.
- Reliability guarantees: no data loss, or, for some applications, tolerating partial loss.
- Scalability: the number of data sources may keep growing; performance is improved through parallelism.

Common collection and ingestion systems for Hadoop
- Data collection: Flume, Kafka, Scribe.
- Synchronization between traditional databases and Hadoop: Sqoop.

Outline
- Background
- Hadoop data collection systems
- Synchronizing data between traditional databases and Hadoop
- Summary

Hadoop data collection systems: Flume
- Flume OG ("Original Generation"): versions 0.9.x / CDH3 and earlier; built from agent, collector, and master components.
- Flume NG ("Next/New Generation"): versions 1.x / CDH4 and later; built from Agent, Client, and related components.
- Why NG? Leaner code and a simplified architecture.

Flume OG basic architecture
(figure: Flume OG architecture)

Flume OG basic architecture
(figure: Flume OG architecture, continued)

Agent
- Collects data; agents sit where the data streams originate.
- An agent usually consists of two parts, a source and a sink:
  - the source acquires data (from text files, syslog, HTTP, etc.);
  - the sink forwards the data the source obtained to the downstream collector.
- Flume ships with many source and sink implementations, for example:
  syslogTcp(5140) | agentSink("localhost", 35853)
  tail("/etc/services") | agentSink("localhost", 35853)

Collector
- Aggregates the results of multiple agents.
- Loads the aggregated results into a backend store such as HDFS or HBase.
- Flume ships with many collector implementations, for example:
  collectorSource(35853) | console
  collectorSource(35853) | collectorSink("file:///tmp/flume/collected", "syslog");
  collectorSource(35853) | collectorSink("hdfs://namenode/user/flume/", "syslog");

Mapping between agents and collectors
(figure: agent-to-collector mapping)
- The mapping can be specified manually or matched automatically.
- With automatic matching, the master balances the load across collectors.

Question: why introduce the collector at all?
- It aggregates agent data, avoiding the creation of many small files.
- It avoids the pressure that many direct agent connections would put on Hadoop.
- It acts as middleware, shielding the heterogeneity between agents and Hadoop.

Master
- Manages and coordinates the configuration of agents and collectors.
- Acts as the controller of the Flume cluster.
- Tracks the final acknowledgement of each data flow and notifies the agents.
- Multiple masters are usually configured to avoid a single point of failure; ZooKeeper is used to manage the multi-master setup.

Fault tolerance
(figure: fault-tolerance mechanism)

Three reliability levels
- agentE2ESink[("machine"[, port])]: the agent considers a send successful only after receiving an acknowledgement; otherwise it retries.
- agentDFOSink[("machine"[, port])]: when the agent detects that an operation on the collector has failed, it writes the data to local disk and resends it once the collector recovers.
- agentBESink[("machine"[, port])]: best efficiency; the agent stores nothing locally, and if the collector fails to process a message, the message is simply dropped.

Building a Flume-based data collection system
- Both agents and collectors can be configured dynamically, via the command line or the web UI.
- Command-line configuration: on a running master node, enter in turn:
  flume shell
  connect localhost
  then execute, for example:
  exec config a1 'tailDir("/data/logfile")' 'agentSink'
- Web UI: select a node and fill in its source, sink, and other settings.

Common architecture example: topology 1
  agentA: tail("/ngnix/logs") | agentSink("collector", 35853);
  agentB: tail("/ngnix/logs") | agentSink("collector", 35853);
  agentC: tail("/ngnix/logs") | agentSink("collector", 35853);
  agentD: tail("/ngnix/logs") | agentSink("collector", 35853);
  agentE: tail("/ngnix/logs") | agentSink("collector", 35853);
  agentF: tail("/ngnix/logs") | agentSink("collector", 35853);
  collector: collectorSource(35853) | collectorSink("hdfs://namenode/flume/", "srcdata");

Common architecture example: topology 2
  agentA: src | agentE2ESink("collectorA", 35853);
  agentB: src | agentE2ESink("collectorA", 35853);
  agentC: src | agentE2ESink("collectorB", 35853);
  agentD: src | agentE2ESink("collectorB", 35853);
  agentE: src | agentE2ESink("collectorC", 35853);
  agentF: src | agentE2ESink("collectorC", 35853);
  collectorA: collectorSource(35853) | collectorSink("hdfs://...", "src");
  collectorB: collectorSource(35853) | collectorSink("hdfs://...", "src");
  collectorC: collectorSource(35853) | collectorSink("hdfs://...", "src");

Common architecture example: topology 3
  agentA: src | agentE2EChain("collectorA:35853", "collectorB:35853");
  agentB: src | agentE2EChain("collectorA:35853", "collectorC:35853");
  agentC: src | agentE2EChain("collectorB:35853", "collectorA:35853");
  agentD: src | agentE2EChain("collectorB:35853", "collectorC:35853");
  agentE: src | agentE2EChain("collectorC:35853", "collectorA:35853");
  agentF: src | agentE2EChain("collectorC:35853", "collectorB:35853");
  collectorA: collectorSource(35853) | collectorSink("hdfs://...", "src");
  collectorB: collectorSource(35853) | collectorSink("hdfs://...", "src");
  collectorC: collectorSource(35853) | collectorSink("hdfs://...", "src");
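Topology 3 pairs each agent with a primary collector and a backup via agentE2EChain. The failover behaviour can be sketched in Python; this is an illustrative model with a stubbed transport function, not Flume code:

```python
def send_with_failover(event, collectors, send):
    """Try each collector in order (primary first, then backups),
    as an agentE2EChain("primary", "backup") configuration would.
    Returns the collector that accepted the event, or raises."""
    last_error = None
    for collector in collectors:
        try:
            send(collector, event)  # e.g. an RPC to collectorX:35853
            return collector
        except ConnectionError as err:
            last_error = err  # primary is down: fall through to the backup
    raise RuntimeError("all collectors failed") from last_error

# Stub transport: collectorA is down, collectorB accepts everything.
def fake_send(collector, event):
    if collector == "collectorA:35853":
        raise ConnectionError("collectorA unreachable")

print(send_with_failover({"log": "x"}, ["collectorA:35853", "collectorB:35853"], fake_send))
# → collectorB:35853
```

Rotating which collector is primary across agents, as the topology above does, spreads the load while still giving every agent a fallback.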

Outline
- Background
- Hadoop data collection systems
- Synchronizing data between traditional databases and Hadoop
- Summary

What is Sqoop?
- Sqoop: SQL-to-Hadoop.
- A bridge connecting traditional relational databases and Hadoop:
  - it imports relational data into Hadoop systems (HDFS, HBase, Hive);
  - it extracts data from Hadoop and exports it to relational databases.
- It uses MapReduce to speed up data transfer.
- Data is transferred in batch mode.

Sqoop advantages
- Efficient, controllable use of resources: task parallelism, timeouts, and so on.
- Data-type mapping and conversion can be done automatically, and users can also customize them.
- Supports many databases: MySQL, Oracle, PostgreSQL.

Sqoop1 architecture
(figure: Sqoop1 architecture)

Sqoop2 architecture
(figure: Sqoop2 architecture)

Sqoop import
- Imports data from a relational database into Hadoop.
- Step 1: Sqoop communicates with the database server and retrieves the table's metadata.
- Step 2: Sqoop launches a map-only MapReduce job that uses the metadata to write the data into Hadoop in parallel.

Sqoop import: usage
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities
- --connect: the JDBC URL of the database.
- --username / --password: the MySQL credentials.
- --table: the database table to read.
Inspecting the result:
  bin/hadoop fs -cat cities/part-m-*
  1,USA,Palo Alto
  2,Czech Republic,Brno
  3,USA,Sunnyvale

Sqoop import: examples
Import into a specific target directory:
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --target-dir /etl/input/cities
Import only the rows matching a condition:
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --where "country = 'USA'"

Sqoop import: examples
Store the result as SequenceFiles:
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --as-sequencefile
Control the degree of parallelism:
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --num-mappers 10
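With --num-mappers, Sqoop divides the work of the map-only import job across tasks by partitioning the range of a split column. A simplified Python sketch of one such partitioning scheme (evenly splitting the inclusive [min, max] range of an integer key; illustrative only, not Sqoop's internal code):

```python
def split_ranges(min_id, max_id, num_mappers):
    """Split the inclusive key range [min_id, max_id] into contiguous
    sub-ranges, one per mapper, similar to how a split column is used
    to assign each map task its own slice of the table."""
    total = max_id - min_id + 1
    size = total // num_mappers
    extra = total % num_mappers  # the first `extra` mappers get one more row
    ranges, lo = [], min_id
    for i in range(num_mappers):
        hi = lo + size - 1 + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Each (lo, hi) tuple becomes one mapper's WHERE clause: lo <= id AND id <= hi.
print(split_ranges(1, 100, 4))
# → [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Because the ranges are disjoint and cover the whole key space, the mappers can read from the database concurrently without duplicating or missing rows.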

Sqoop import: importing from multiple tables
A free-form query with a join, instead of a single --table:
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --query 'SELECT normcities.id, \
             countries.country, \
             normcities.city \
             FROM normcities \
             JOIN countries USING (country_id) \
             WHERE $CONDITIONS' \
    --split-by id \
    --target-dir cities

Sqoop import: incremental import (1)
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table visits \
    --incremental append \
    --check-column id \
    --last-value 1
- Suitable when new rows are only ever appended to the table and existing rows never change.
- Imports only the records whose id column value is greater than 1.
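The append-mode selection above can be sketched as follows. This is an illustrative Python model that treats rows as dicts; it is not Sqoop's implementation:

```python
def incremental_append(rows, check_column, last_value):
    """Keep only the rows whose check column exceeds last_value, and
    report the new high-water mark to use as --last-value next time."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last

visits = [{"id": 1}, {"id": 2}, {"id": 3}]
rows, last = incremental_append(visits, "id", 1)
print(rows, last)  # only the rows with id > 1; the new last-value is 3
```

Feeding the returned high-water mark back in as the next run's last value is exactly what makes repeated incremental imports skip already-imported rows.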

Sqoop import: incremental import (2)
Save the incremental import as a reusable Sqoop job:
  sqoop job \
    --create visits \
    -- import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table visits \
    --incremental append \
    --check-column id \
    --last-value 0
Run the Sqoop job:
  sqoop job --exec visits
- After each successful run, Sqoop saves the id value of the last imported record in its metastore, to be used as the starting point of the next run.
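The metastore behaviour described above can be mimicked with a small state file. A hypothetical sketch of the save/restore cycle (not Sqoop's actual metastore format):

```python
import json
import os
import tempfile

def run_incremental_job(state_path, fetch_rows):
    """Restore the saved last-value from the state file, import only the
    newer rows, then persist the new high-water mark for the next run."""
    last = 0  # matches the job's initial --last-value 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            last = json.load(f)["last_value"]
    rows = [r for r in fetch_rows() if r["id"] > last]
    if rows:
        last = max(r["id"] for r in rows)
    with open(state_path, "w") as f:
        json.dump({"last_value": last}, f)
    return rows

state = os.path.join(tempfile.mkdtemp(), "visits.state")
table = [{"id": 1}, {"id": 2}]
print(len(run_incremental_job(state, lambda: table)))  # first run imports 2 rows
table.append({"id": 3})
print(len(run_incremental_job(state, lambda: table)))  # second run imports only the new row
```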

Sqoop import: incremental import (3)
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table visits \
    --incremental lastmodified \
    --check-column last_update_date \
    --last-value "2013-05-22 01:01:01"
- The table has a last_update_date column recording when each row was last modified.
- Sqoop imports into Hadoop only the rows modified after the given timestamp.

Sqoop export
- Exports data from Hadoop into a relational database.
- Step 1: Sqoop communicates with the database server and retrieves the table's metadata.
- Step 2: the data is exported in parallel: the files on Hadoop are divided into splits, and each split is handled by one map task.

Sqoop export: usage
  sqoop export \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --export-dir cities
- --connect: the JDBC URL of the database.
- --username / --password: the MySQL credentials.
- --table: the database table to export into.
- --export-dir: the HDFS directory holding the data.

Sqoop export: guaranteeing atomicity
  sqoop export \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --staging-table staging_cities
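A staging table lets the parallel export tasks write into a side table first, so the target table is only touched once everything has succeeded. A rough SQLite-backed Python sketch of that pattern (illustrative of the idea, not of Sqoop's implementation):

```python
import sqlite3

def export_atomically(conn, rows):
    """Load rows into a staging table, then publish them to the target
    table in a single transaction, so readers of the target table never
    observe a partially completed export."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS cities (id INTEGER, country TEXT, city TEXT)")
    cur.execute("CREATE TABLE IF NOT EXISTS staging_cities (id INTEGER, country TEXT, city TEXT)")
    # In the real system, many map tasks would write here concurrently.
    cur.executemany("INSERT INTO staging_cities VALUES (?, ?, ?)", rows)
    with conn:  # atomic publish: move the staged rows into the target table
        cur.execute("INSERT INTO cities SELECT * FROM staging_cities")
        cur.execute("DELETE FROM staging_cities")

conn = sqlite3.connect(":memory:")
export_atomically(conn, [(1, "USA", "Palo Alto"), (2, "Czech Republic", "Brno")])
print(conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0])  # → 2
```

If any task fails before the publish step, the target table is untouched and the staging table can simply be cleared and the export retried.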

Sqoop export: updating existing data
Update the rows that already exist in the target table:
  sqoop export \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --update-key id
Update existing rows and insert new ones:
  sqoop export \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --update-key id \
    --update-mode allowinsert
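With --update-key alone, each exported row becomes an UPDATE; adding --update-mode allowinsert turns it into an upsert. A hypothetical sketch of the kind of statement one row maps to, using the MySQL upsert dialect (illustrative only, not the SQL Sqoop actually emits):

```python
def export_statement(table, row, update_key, allowinsert):
    """Build the SQL one exported row maps to, mirroring the semantics of
    --update-key and --update-mode allowinsert (MySQL upsert dialect)."""
    cols = list(row)
    if allowinsert:
        # Upsert: insert the row, or update it if the key already exists.
        updates = ", ".join(f"{c} = VALUES({c})" for c in cols if c != update_key)
        return (f"INSERT INTO {table} ({', '.join(cols)}) "
                f"VALUES ({', '.join(':' + c for c in cols)}) "
                f"ON DUPLICATE KEY UPDATE {updates}")
    # Update-only mode: rows without a matching key are skipped.
    sets = ", ".join(f"{c} = :{c}" for c in cols if c != update_key)
    return f"UPDATE {table} SET {sets} WHERE {update_key} = :{update_key}"

row = {"id": 1, "country": "USA", "city": "Palo Alto"}
print(export_statement("cities", row, "id", allowinsert=False))
# → UPDATE cities SET country = :country, city = :city WHERE id = :id
```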

Sqoop export: inserting selected columns
  sqoop export \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --columns country,city

Combining Sqoop with other systems
- Sqoop can be combined with systems such as Oozie, Hive, and HBase.
- Users need to add environment variables such as HBASE_HOME and HIVE_HOME to sqoop-env.sh.

Sqoop with Hive
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --hive-import

Sqoop with HBase
  sqoop import \
    --connect jdbc:mysql:///sqoop \
    --username sqoop \
    --password sqoop \
    --table cities \
    --hbase-table cities \
    --column-family world
