Enterprise Big Data Analytics Platform: Practical Case Studies

Enterprise big data solution components:
- Enterprise Hadoop high-availability HDFS cluster
- Hive: enterprise big data analytics platform
- HBase: enterprise big data warehouse
- Flume: enterprise real-time data-stream ingestion tool
- Sqoop: enterprise relational-database migration tool. Sqoop automatically generates a class file from the database table definition and submits it to the MapReduce framework to run.

Common tools in the big data ecosystem:
- pig: a compact but powerful data cleaning and transformation tool
- spark: an in-memory streaming data analytics engine with a built-in machine-learning library
- oozie: a job scheduling and workflow automation tool
- kafka: a cross-platform data transport tool supporting multiple protocols, encryption, and compression
- impala: a Hive-like analytics tool with SQL query support and faster execution
- tez: a computation framework that optimizes MapReduce execution paths
- kudu: a faster data analytics platform
- solr: an enterprise search engine
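As a concrete illustration of the Sqoop workflow just described, here is a minimal import sketch; the MySQL host, database, credentials, and table name are placeholders, not values from the course environment:

$ sqoop import \
    --connect jdbc:mysql://dbhost/salesdb \
    --username report --password secret \
    --table orders \
    --target-dir /user/hadoop/orders \
    --num-mappers 1

Sqoop reads the orders table definition, generates and compiles a Java record class for it, and runs the import as a MapReduce job whose output files land in the HDFS target directory.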
Preparing to build an enterprise Hadoop cluster
- Prepare at least three networked hosts, each with at least 4 GB of RAM and 10 GB of free disk space.
- Install the Hanwate_Bigdata_OS_7_Firefly operating system on each host; it bundles the common big data ecosystem tools.
- Following the guide in the next sections, install the required software components on each host, configure its role, and start the corresponding services.

Host roles in a distributed Hadoop cluster:
- master1 (192.168.X.3), role master: NameNode, DataNode, NodeManager
- master2 (192.168.X.4), role master: SecondaryNameNode, ResourceManager, DataNode, NodeManager
- slave1 (192.168.X.5), role slave: JobHistoryServer, DataNode, NodeManager
Host roles in a pseudo-distributed Hadoop cluster (all roles on one machine):
- master1 (192.168.X.3), role master: NameNode, DataNode, NodeManager
- master2 (192.168.X.3), role master: SecondaryNameNode, ResourceManager, DataNode, NodeManager
- slave1 (192.168.X.3), role slave: JobHistoryServer, DataNode, NodeManager
Set the IP address and hostname
On each host, set its IP address and hostname. For example, on master1, edit the NIC configuration file /etc/sysconfig/network-scripts/ifcfg-xxx:
BOOTPROTO=none
IPADDR=192.168.X.3
GATEWAY=192.168.X.1
NETMASK=
The hostname is set in /etc/hostname; on master1 the file simply contains:
master1

Configure hostname mapping
On every node, append the mappings to /etc/hosts:
192.168.X.3 master1
192.168.X.4 master2
192.168.X.5 slave1
For the pseudo-distributed cluster the mapping is a single line:
192.168.X.3 master1 master2 slave1

Apply the configuration
Synchronize /etc/hosts to all hosts (one way to do this is sketched below), then restart the servers:
# reboot
Check that each hostname resolves and the other nodes are reachable:
# for host in master1 master2 slave1; do ping -c1 $host; done
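A minimal sketch of pushing the same /etc/hosts file to the other nodes, assuming it was edited on master1 and that root SSH access to the other hosts is available:

# for host in master2 slave1; do scp /etc/hosts $host:/etc/hosts; done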
Configure the JAVA environment variables
Append to the end of /etc/profile:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-51-1.b12.el7_4.x86_64/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin
Save the file, then make it take effect:
# source /etc/profile

Building the cluster
- Install the software on each node according to its role.
- Install and configure the services on each node according to the cluster plan.
- The cluster configuration files live in /etc/hadoop/conf; every node must have the same configuration.

Configure the cluster slaves file
Add the slave nodes:
# vim /etc/hadoop/conf/slaves
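In this cluster every host runs a DataNode and NodeManager, so a sketch of /etc/hadoop/conf/slaves based on the role table above is simply the three hostnames:
master1
master2
slave1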
Configure core-site.xml:
- fs.defaultFS = hdfs://master1:8020 (the HDFS access entry point)
- hadoop.tmp.dir = /usr/hdp/tmp (base directory the Hadoop filesystem depends on)

Configure hdfs-site.xml:
- dfs.namenode.name.dir = /hadoop/hdfs/name (where NameNode metadata is stored)
- dfs.datanode.data.dir = /hadoop/hdfs/data (where DataNode blocks are stored)
- dfs.replication = 3 (HDFS replication factor; a pseudo-distributed cluster can only use 1)
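The tables above list name/value pairs; in the files themselves each pair becomes a <property> element. A minimal sketch of core-site.xml with the two entries above (hdfs-site.xml follows the same pattern):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master1:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hdp/tmp</value>
  </property>
</configuration>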
Create the Linux directories HDFS needs:
# mkdir -p /usr/hdp/tmp
# mkdir -p /hadoop/hdfs/{data,name}
# chown -R hdfs:hadoop /hadoop
# chown -R hdfs:hadoop /usr/hdp/tmp

Initialize the HDFS filesystem
On master1, format HDFS as the hdfs user:
# sudo -u hdfs hdfs namenode -format

Start the HDFS filesystem
Start the services on master1:
# systemctl start hadoop-hdfs-namenode
# systemctl start hadoop-hdfs-datanode
Start the services on master2:
# systemctl start hadoop-hdfs-secondarynamenode
# systemctl start hadoop-hdfs-datanode
Start the service on slave1:
# systemctl start hadoop-hdfs-datanode
Verification
- Check service state with systemctl status <service>.
- Run jps as root on master1, master2, and slave1 and confirm that the daemons expected for each node are present (a sketch follows below).
- Open the NameNode web UI, e.g. http://master1:50070.
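At this point only the HDFS services have been started, so the daemons expected on each node (and therefore what jps should report) are, per the role table:
- master1: NameNode, DataNode
- master2: SecondaryNameNode, DataNode
- slave1: DataNode
The ResourceManager, NodeManager, and JobHistoryServer processes appear only after the YARN steps later in this guide.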

Create an HDFS working directory for a regular user
Create the new user as the Linux administrator:
# useradd hadoop
Switch to the HDFS superuser:
# su - hdfs
Create the user's directory:
$ hadoop fs -mkdir -p /user/hadoop
$ hadoop fs -chown hadoop /user/hadoop
$ exit
Verify the working directory as the regular user:
# su - hadoop
$ hadoop fs -mkdir input
$ hadoop fs -ls

Troubleshooting: where the logs are
- HDFS components: /var/log/hadoop-hdfs/*.log and /var/log/hadoop-hdfs/*.out
- YARN components: /var/log/hadoop-yarn/*.log and /var/log/hadoop-yarn/*.out
- MapReduce components: /var/log/hadoop-mapreduce/*.log and /var/log/hadoop-mapreduce/*.out
Prepare the directories needed to run distributed jobs:
# su hdfs
$ hadoop fs -mkdir /tmp
$ hadoop fs -chmod 1777 /tmp
$ hadoop fs -mkdir -p /var/log/hadoop-yarn
$ hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
$ hadoop fs -mkdir /user/history
$ hadoop fs -chmod 1777 /user/history
$ hadoop fs -chown mapred:hadoop /user/history
Configure yarn-site.xml:
- yarn.resourcemanager.hostname = master2 (entry point of the YARN resource manager)
- yarn.nodemanager.aux-services = mapreduce_shuffle (provides the YARN shuffle service MapReduce needs)
- yarn.nodemanager.local-dirs = file:///hadoop/yarn/local (local scratch space for NodeManager tasks)
- yarn.nodemanager.log-dirs = /var/log/hadoop-yarn/containers (NodeManager log output)
- yarn.nodemanager.remote-app-log-dir = /var/log/hadoop-yarn/apps (output location for remote application logs)
- yarn.log-aggregation-enable = true (enables log aggregation)
Configure yarn-site.xml (continued):
- yarn.scheduler.minimum-allocation-mb = 511 (minimum memory a single task may request)
- yarn.scheduler.maximum-allocation-mb = 2049 (maximum memory a single task may request)
- yarn.nodemanager.vmem-pmem-ratio = 4 (virtual memory allowed per 1 MB of physical memory)
- yarn.nodemanager.vmem-check-enabled = false (switch for the virtual-memory monitor)
Note: these memory settings are a reference point for when MapReduce tasks crash because of memory limits; the exact values are not fixed.
Configure yarn-site.xml (continued):
- yarn.application.classpath (the classpath/environment YARN needs at run time):
  $HADOOP_CONF_DIR,
  /usr/hdp/2.6.3.0-235/hadoop/*,
  /usr/hdp/2.6.3.0-235/hadoop/lib/*,
  /usr/hdp/2.6.3.0-235/hadoop-hdfs/*,
  /usr/hdp/2.6.3.0-235/hadoop-hdfs/lib/*,
  /usr/hdp/2.6.3.0-235/hadoop-yarn/*,
  /usr/hdp/2.6.3.0-235/hadoop-yarn/lib/*,
  /usr/hdp/2.6.3.0-235/hadoop-mapreduce/*,
  /usr/hdp/2.6.3.0-235/hadoop-mapreduce/lib/*,
  /usr/hdp/2.6.3.0-235/hadoop-httpfs/*,
  /usr/hdp/2.6.3.0-235/hadoop-httpfs/lib/*
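As with core-site.xml and hdfs-site.xml, these entries are written as <property> elements in /etc/hadoop/conf/yarn-site.xml; a minimal sketch with two of the values above:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master2</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>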
Configure mapred-site.xml:
- mapreduce.framework.name = yarn (runs MapReduce on the YARN framework)
- mapreduce.jobhistory.address = slave1:10020 (JobHistoryServer address)
- mapreduce.jobhistory.webapp.address = slave1:19888 (JobHistoryServer web port)
- yarn.app.mapreduce.am.staging-dir = /user (staging directory for YARN job output)
- mapreduce.application.classpath = (the same long list as yarn.application.classpath above; the classpath MapReduce needs)
Configure mapred-site.xml (continued):
- mapreduce.map.java.opts = -Xmx1024M (JVM options for map tasks)
- mapreduce.map.memory.mb = 31 (maximum memory available to the map task container)
- mapreduce.reduce.java.opts = -Xmx1024M (JVM options for reduce tasks)
- mapreduce.reduce.memory.mb = 63 (maximum memory available to the reduce task container)
Note: these memory settings are a reference point for when MapReduce tasks hang because of memory limits; the exact values are not fixed.
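These also go into /etc/hadoop/conf/mapred-site.xml as <property> elements; a minimal sketch with two of the entries above:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>slave1:10020</value>
  </property>
</configuration>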
Configure the local Linux directories for YARN:
# touch /etc/hadoop/conf/yarn-env.sh
# mkdir -p /hadoop/yarn/local
# chown yarn:yarn -R /hadoop/yarn/local
Start the services
On master2, start the ResourceManager:
# systemctl start hadoop-yarn-resourcemanager
(Web UI: http://master2:8088. If the hostname cannot be resolved from a Windows machine, add the host mapping to the Windows hosts file.)
On slave1, start the JobHistoryServer:
# systemctl start hadoop-mapreduce-historyserver
(Web UI: http://slave1:19888)
On every DataNode host, start the NodeManager:
# systemctl start hadoop-yarn-nodemanager
Test a MapReduce job:
# su - hdfs
$ cd /etc/hadoop/conf
$ hadoop fs -mkdir -p /user/hdfs/input
$ hadoop fs -put * /user/hdfs/input
$ hadoop jar /usr/hdp/2.6.3.0-235/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.3.0-235.jar grep input output value
Check the console output of a successful run and the job status in the YARN web UI.

Configuring Flume

Synchronize the cluster time
Configure cluster time synchronization with chrony. On one server, edit /etc/chrony.conf and enable the allow 192.168/16 and bindcmdaddress directives. On the other servers, configure /etc/chrony.conf with a "server ... iburst" line pointing at that machine. Every node then needs chronyd restarted:
# systemctl restart chronyd
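Putting that together, a minimal sketch of the two /etc/chrony.conf variants; using master1 as the time source is an assumption here, any node or an external NTP server can fill that role:

On the time server (/etc/chrony.conf):
allow 192.168/16

On every other node (/etc/chrony.conf):
server master1 iburst

Then restart chronyd on each node:
# systemctl restart chronyd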
Configure an agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent
Save the configuration above as /etc/flume/conf/flume.conf, then start the agent:
# flume-ng agent --conf /etc/flume/conf --conf-file /etc/flume/conf/flume.conf --name a1 -Dflume.root.logger=INFO,console
Once the agent reports that it is listening on port 44444, open another terminal to test:
# nc localhost 44444
Each line you type is answered with OK, and at the same time the agent's terminal receives the message and logs it.
Deploying Flume

Flume sources:
Avro, Thrift, Exec, JMS, Spooling Directory, Taildir, Kafka, Netcat (TCP and UDP), Syslog, HTTP, StressSource, Scribe

Avro source configuration:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind =
a1.sources.r1.port = 4141

Thrift source:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind =
a1.sources.r1.port = 4141
Exec source:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

Netcat source:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind =
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
Syslog source:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

HTTP source:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props
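The Spooling Directory source appears in the list above but has no example; a minimal sketch, where /var/spool/flume is an assumed directory into which completed log files are dropped:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/spool/flume
a1.sources.r1.channels = c1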
Flume channels:
Memory, JDBC, Kafka, File

Memory channel:
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
File channel:
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

Flume sinks:
HDFS, Hive, Logger, Avro, Thrift, IRC, File Roll, Null, HBaseSink, AsyncHBaseSink, MorphlineSolrSink, ElasticSearchSink, Kafka, HTTP
HDFS sink:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
Hive sink:
a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.hive.partition = asia,%{country},%y-%m-%d-%H-%M
Hive sink (continued):
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames = id,msg
Logger sink:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Avro sink:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 0
a1.sinks.k1.port = 4545
HTTP sink:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = http
a1.sinks.k1.channel = c1
a1.sinks.k1.endpoint = http://localhost:8080/someuri
a1.sinks.k1.connectTimeout = 2000
a1.sinks.k1.requestTimeout = 2000
a1.sinks.k1.acceptHeader = application/json
a1.sinks.k1.contentTypeHeader = application/json

HTTP sink (continued):
a1.sinks.k1.defaultBackoff = true
a1.sinks.k1.defaultRollback = true
a1.sinks.k1.defaultIncrementMetrics = false
a1.sinks.k1.backoff.4XX = false
a1.sinks.k1.rollback.4XX = false
a1.sinks.k1.incrementMetrics.4XX = true
a1.sinks.k1.backoff.200 = false
a1.sinks.k1.rollback.200 = false
a1.sinks.k1.incrementMetrics.200 = true
Flume usage summary
Guiding principles:
- The sink is responsible for writing; before it writes, the destination must be reachable.
- The channel is responsible for transport; unless there is an audit requirement, a memory channel is used.
- The source is responsible for reading; the data it reads keeps changing.
Case study: collect nginx server logs into HDFS with Flume
Reference configuration file /etc/flume/conf/flume.conf:
# Configure the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.deserializer.outputCharset = UTF-8
# The log file to monitor
a1.sources.r1.command = tail -F /var/log/nginx/access.log
# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = hdfs://master1/user/hdf
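To run the case end to end, a brief sketch; the flume-ng invocation is the same one used earlier, and the files appear under whatever hdfs.path is set to above:

# flume-ng agent --conf /etc/flume/conf --conf-file /etc/flume/conf/flume.conf --name a1

Generate a few requests against nginx (for example with curl http://localhost/ if nginx runs locally), then list the target area as the hdfs user to confirm that Flume is writing files:

# sudo -u hdfs hadoop fs -ls -R /user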
