Parallel Processing of Massive Data with MapReduce, ch. 04


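Ch. 4. Hadoop MapReduce Basic Architecture

Department of Computer Science and Technology, Nanjing University
Lecturer: Huang Yihua (黃宜華)
Spring semester, 2011
Course: Parallel Processing of Massive Data with MapReduce
Acknowledgment: this course is funded by the Featured Course Program of the China University Relations department of Google (Beijing).

Outline
1. The Hadoop Distributed File System (HDFS)
2. Basic working principles of Hadoop MapReduce
3. HBase, a distributed structured data table

Basic features of HDFS
- Designed and implemented after Google GFS.
- Stores very large amounts of information (terabytes or petabytes) by spreading the data over a large number of nodes; supports very large individual files.
- Provides high data reliability and fault tolerance: when one or several nodes stop working, the system keeps running and the data remains available. Reliability and failure recovery are ensured by keeping a certain number of replicas of the data.
- Provides fast access to data and scales well: capacity grows quickly by simply adding more servers, which also lets the system serve more clients.
- Like GFS, HDFS is the underlying storage layer for MapReduce, and it lets data be accessed and processed according to its locality wherever possible.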

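1. The Hadoop Distributed File System (HDFS)

Basic features of HDFS (continued)
- HDFS is optimized for sequential reads and supports fast sequential streaming of large volumes of data; the price is that random access is expensive.
- Data is written once and read many times; data that has been written cannot be updated.
- Data is not cached locally (files are very large, and sequential reads have no locality).
- File storage is block based, with a default block size of 64 MB. Large blocks
  - reduce the amount of metadata;
  - favor sequential reads and writes (data is laid out sequentially on disk).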
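To make the metadata argument concrete, the sketch below counts how many blocks a file occupies for a given block size. It is only an illustration: the class and method names are hypothetical, the 64 MB figure is the default quoted above, and the 4 KB comparison value is an assumption chosen for contrast.

```java
// Hypothetical helper, not part of Hadoop: counts the blocks a file occupies.
final class BlockMathSketch {
    // 64 MB, the default block size quoted in the slides.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    // Ceiling division: a partially filled last block still counts as one block.
    static long blockCount(long fileBytes, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        return (fileBytes + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        // With 64 MB blocks the NameNode tracks 16 block entries for a 1 GiB file.
        System.out.println(blockCount(oneGiB, BLOCK_SIZE)); // prints 16
        // With 4 KB blocks (assumed comparison value) it would track 262144 entries.
        System.out.println(blockCount(oneGiB, 4096));       // prints 262144
    }
}
```

Fewer, larger blocks mean fewer entries for the metadata server to hold, which is the trade-off the feature list refers to.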

Data is stored as replicated blocks; the storage nodes for each block are chosen at random, and the default number of replicas is 3.

HDFS basic architecture
[Figure: HDFS architecture. An application calls the HDFS client, which sends a file name or block number to the HDFS NameNode (the counterpart of the GFS Master) and gets back a block number and block location; the data itself is held by the DataNodes (the counterpart of the GFS ChunkServer).]
[Figure: HDFS basic implementation architecture]

HDFS data distribution design
- Data is stored as replicated blocks; storage nodes are chosen at random for each block.
- The default number of replicas is 3.
[Figure: HDFS data distribution]

HDFS reliability and failure recovery
- DataNode failure detection
  - Heartbeat: the NameNode continuously checks whether each DataNode is alive.
  - If a DataNode has failed, a new node is found to replace it and the failed node's data is redistributed.
- Cluster load balancing
- Data integrity: checksums
- Failure of the master node's metadata
  - Multiple copies of FsImage and EditLog
  - Checkpoint

HDFS design points
- Namespace
- Replica selection: rack awareness
- Safe mode
  - Right after startup, the NameNode waits for every DataNode to report in.
  - Block replication starts only after safe mode is left.
- The NameNode keeps its own FsImage and EditLog: the former holds the file system state, the latter holds the changes that have not yet been merged into it.

Installing and starting HDFS
- Download hadoop-0.20.1.tar.gz (or the latest version, 0.21).
- Run tar zxvf hadoop-0.20.1.tar.gz. After unpacking, the whole Hadoop system, including HDFS and all configuration files, sits in the chosen directory.
- Do the necessary system configuration under Linux.
- Set the Java runtime environment variables that Hadoop needs.
- Start the Java virtual machine.
- Start Hadoop; Hadoop and the HDFS file system are then running.

HDFS file system commands
- Create your own user directory. User directories live under /user and must be created explicitly.
- Use -put to copy data files from the Linux file system into HDFS.
- -put is equivalent to -copyFromLocal.

    someone@anynode:hadoop$ bin/hadoop dfs -ls
    someone@anynode:hadoop$
    someone@anynode:hadoop$ bin/hadoop dfs -ls /
    Found 2 items
    drwxr-xr-x   - hadoop supergroup          0 2008-09-20 19:40 /hadoop
    drwxr-xr-x   - hadoop supergroup          0 2008-09-20 20:08 /tmp
    someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user
    someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone
    someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

-put can also upload a whole directory:

    someone@anynode:hadoop$ bin/hadoop dfs -put source-directory destination

How -put resolves its destination:

    Command:  bin/hadoop dfs -put foo bar
    Assuming: No file/directory named /user/$USER/bar exists in HDFS
    Outcome:  Uploads local file foo to a file named /user/$USER/bar

    Command:  bin/hadoop dfs -put foo bar
    Assuming: /user/$USER/bar is a directory
    Outcome:  Uploads local file foo to a file named /user/$USER/bar/foo

    Command:  bin/hadoop dfs -put foo somedir/somefile
    Assuming: /user/$USER/somedir does not exist in HDFS
    Outcome:  Uploads local file foo to a file named /user/$USER/somedir/somefile, creating the missing directory

    Command:  bin/hadoop dfs -put foo bar
    Assuming: /user/$USER/bar is already a file in HDFS
    Outcome:  No change in HDFS, and an error is returned to the user.

Command reference:

    -ls path
        Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
    -lsr path
        Behaves like -ls, but recursively displays entries in all subdirectories of path.
    -du path
        Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
    -dus path
        Like -du, but prints a summary of disk usage of all files/directories in the path.
    -mv src dest
        Moves the file or directory indicated by src to dest, within HDFS.
    -cp src dest
        Copies the file or directory identified by src to dest, within HDFS.
    -rm path
        Removes the file or empty directory identified by path.
    -rmr path
        Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
    -put localSrc dest
        Copies the file or directory from the local file system identified by localSrc to dest within HDFS.
    -copyFromLocal localSrc dest
        Identical to -put.
    -moveFromLocal localSrc dest
        Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
    -get [-crc] src localDest
        Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
    -getmerge src localDest [addnl]
        Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
    -cat filename
        Displays the contents of filename on stdout.
    -copyToLocal [-crc] src localDest
        Identical to -get.
    -moveToLocal [-crc] src localDest
        Works like -get, but deletes the HDFS copy on success.
    -mkdir path
        Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).
    -setrep [-R] [-w] rep path
        Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
    -touchz path
        Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
    -test -[ezd] path
        Returns 1 if path exists; has zero length; or is a directory, or 0 otherwise.
    -stat [format] path
        Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
    -tail [-f] file
        Shows the last 1KB of file on stdout.
    -chmod [-R] mode,mode,... path...
        Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
    -chown [-R] [owner][:[group]] path...
        Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
    -chgrp [-R] group path...
        Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.
    -help cmd
        Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.

HDFS admin commands
- Get the overall status of HDFS:
    bin/hadoop dfsadmin -report
- Record what the state of the NameNode's metadata is:
    bin/hadoop dfsadmin -metasave filename
- Safemode: an HDFS state in which the file system is mounted read-only; no replication is performed, nor can files be created or deleted.
    bin/hadoop dfsadmin -safemode enter/leave/get/wait
- Changing HDFS membership
- Upgrading the HDFS version:
    bin/start-dfs.sh -upgrade    (use the first time the new version is run)
    bin/hadoop dfsadmin -upgradeProgress status
    bin/hadoop dfsadmin -upgradeProgress details
    bin/hadoop dfsadmin -upgradeProgress force    (at your own risk!)
    bin/start-dfs.sh -rollback    (use after reinstalling the old version; at your own risk!)
- Help:
    bin/hadoop dfsadmin -help

Load balancing
- Steps for adding a new node:
  - Configure the Hadoop programs on the new node.
  - Add the new slave node to the slaves file on the master.
  - Start the DataNode on the slave node; it contacts the NameNode automatically and joins the cluster.
- The Balancer class performs load balancing; the default balancing threshold is 10%.
    bin/start-balancer.sh -threshold 5
    bin/stop-balancer.sh    (stops the balancing work at any time)

Using HDFS in MapReduce programs
- Through the configuration options, a Hadoop MapReduce program automatically obtains file information from the NameNode.
- The HDFS interfaces include:
  - the command-line interface;
  - the implicit input of a Hadoop MapReduce job;
  - direct manipulation from Java programs;
  - libhdfs, for C/C++ programs.

HDFS permission control and security features
- The security features resemble POSIX.
- They are incomplete and mainly guard against operator mistakes.
- This is not a strong security model and cannot guarantee that operations are fully secure.
- Commands: bin/hadoop dfs -chmod, -chown, -chgrp
- User: the name of the currently logged-in user; HDFS reuses the user and group concepts defined by Linux itself.
- Superuser: The username which was used to start the Hadoop process (i.e., the username who actually ran bin/start-all.sh or bin/start-dfs.sh) is acknowledged to be the superuser for HDFS. If this user interacts with HDFS, he does so with a special username superuser. If Hadoop is shut down and restarted under a different username, that username is then bound to the superuser account.
- Superuser group: set by the configuration parameter dfs.permissions.supergroup.

2. Basic working principles of Hadoop MapReduce

Hadoop MapReduce basic architecture and workflow
[Figure: cluster layout. The job submission node runs the JobTracker (the counterpart of the Master in Google MapReduce); the namenode runs the namenode daemon; each slave node runs a TaskTracker (the counterpart of the Worker in Google MapReduce) and a datanode daemon on top of the Linux file system.]

[Figure: Hadoop MapReduce basic architecture and workflow: data storage and compute node architecture]
[Figure: Hadoop MapReduce basic workflow]

Main components of Hadoop MapReduce

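File input format: InputFormat
- An InputFormat defines how data files are split and read.
- It provides the following functions:
  - selects the files or other objects to be used as input;
  - defines the InputSplits that break a file into tasks;
  - provides a factory for the RecordReader objects that read the file.
- FileInputFormat is an abstract class; the file-based input format classes all inherit their functionality and properties from it.
- When a Hadoop job starts, the directory that holds the input files is handed to the FileInputFormat object. FileInputFormat reads all files in that directory and divides them into one or more InputSplits.
- The input format is chosen by calling JobConf.setInputFormat on the JobConf object.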

File input format: InputFormat (standard formats)

    InputFormat:  TextInputFormat
    Description:  Default format; reads lines of text files
    Key:          The byte offset of the line
    Value:        The line contents

    InputFormat:  KeyValueTextInputFormat
    Description:  Parses lines into key-val pairs
    Key:          Everything up to the first tab character
    Value:        The remainder of the line

    InputFormat:  SequenceFileInputFormat
    Description:  A Hadoop-specific high-performance binary format
    Key:          user-defined
    Value:        user-defined
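The key/value rule listed for KeyValueTextInputFormat can be sketched in plain Java as follows. This is not the Hadoop class itself: the class and method names are hypothetical, and treating a line without a tab as a key with an empty value is an assumption of the sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical stand-alone sketch of the "split at the first tab" rule.
final class KeyValueLineSketch {
    // Key = everything up to the first tab; value = the remainder of the line.
    // Assumption of this sketch: a line without any tab becomes (line, "").
    static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = parse("user42\tclicked\thome");
        // Only the first tab separates key from value; later tabs stay in the value.
        System.out.println("key=" + kv.getKey());
        System.out.println("value=" + kv.getValue());
    }
}
```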

Input data splitting: InputSplits
- An InputSplit defines the input data handed to a single Map task.
- A MapReduce program as a whole is called a Job, and it may consist of hundreds of tasks.
- By default the input files are divided into splits of 64 MB.
  - The mapred.min.split.size parameter in the configuration file hadoop-site.xml controls this size.
- mapred.tasktracker.map.tasks.maximum controls the maximum number of map tasks that run at the same time on one node.
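As a rough picture of what splitting means, the sketch below cuts a file of a given length into fixed-size byte ranges, one per Map task. It is a simplification under stated assumptions: Hadoop's real split computation also takes block boundaries and the configured minimum size into account, and the class, the `Split` type and the method name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fixed-size splitting of a single file.
final class SplitSketch {
    // One split is the byte range [offset, offset + length) of the input file.
    static final class Split {
        final long offset;
        final long length;

        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Cuts the file into splits of splitBytes; the last split holds the remainder.
    static List<Split> computeSplits(long fileBytes, long splitBytes) {
        if (splitBytes <= 0) {
            throw new IllegalArgumentException("splitBytes must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long off = 0; off < fileBytes; off += splitBytes) {
            splits.add(new Split(off, Math.min(splitBytes, fileBytes - off)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        List<Split> splits = computeSplits(200 * mb, 64 * mb);
        // A 200 MB file with 64 MB splits yields 4 splits: 64, 64, 64 and 8 MB.
        System.out.println(splits.size());                    // prints 4
        System.out.println(splits.get(3).length / mb + " MB"); // prints 8 MB
    }
}
```

Each split then becomes the unit of work for one Map task, which is why the split size directly determines how many Map tasks a job has.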

Record reading: RecordReader
- An InputSplit defines the size of a unit of work, but not how its data is read.
- The RecordReader defines in detail how the data is turned into (key, value) pairs, and it delivers those pairs to the Mapper class.
- TextInputFormat provides LineRecordReader.
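The following sketch shows the kind of conversion a line-oriented record reader performs: raw bytes in, (byte offset, line text) pairs out. It is a plain-Java illustration, not Hadoop's LineRecordReader; the names are hypothetical, only "\n" line endings are handled, and records that straddle split boundaries are ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of turning raw bytes into (offset, line) records.
final class LineRecordSketch {
    static final class Record {
        final long offset;  // key: byte offset where the line starts
        final String line;  // value: the line contents without the newline

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    static List<Record> readRecords(byte[] data) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            // Emit a record at each newline, and for a final line that lacks one.
            if (atEnd ? start < i : data[i] == '\n') {
                records.add(new Record(start,
                        new String(data, start, i - start, StandardCharsets.UTF_8)));
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nworld\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : readRecords(data)) {
            System.out.println(r.offset + " -> " + r.line);
        }
        // prints:
        // 0 -> hello
        // 6 -> world
    }
}
```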

Mapper
- Each instance of the Mapper class runs in its own Java process and works on one InputSplit.
- The map function takes two additional parameters, OutputCollector and Reporter: the former collects the intermediate results, the latter is used to obtain environment parameters and to set the current execution status.
- In the current API each Mapper function is given a Mapper.Context, which provides the functionality of both objects.
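A word-count style map function illustrates the role of the collector. This is a sketch in plain Java rather than the Hadoop API: the `Collector` interface is a hypothetical stand-in for what OutputCollector (or Mapper.Context) does, and word counting is an assumed example task.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a map function that emits intermediate pairs.
final class MapSketch {
    // Stand-in for the collector role: it receives each intermediate (key, value) pair.
    interface Collector {
        void collect(String key, int value);
    }

    // Emits (word, 1) for every whitespace-separated word in the line.
    // The offset key is unused here, as is typical for word counting.
    static void map(long offset, String line, Collector out) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.collect(word, 1);
            }
        }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        map(0L, "to be or not to be", (key, value) -> emitted.add(key + "=" + value));
        System.out.println(emitted); // prints [to=1, be=1, or=1, not=1, to=1, be=1]
    }
}
```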

Combiner
- Merges key-value pairs that share the same key, which reduces the data communication cost at the partitioning stage.
- conf.setCombinerClass(Reduce.class);
- It is a Reducer that runs locally, and it can be used only when certain conditions are met.
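The effect of a combiner can be sketched as a local pre-aggregation of one map task's output. The example assumes a counting job, where the reduce operation is a sum; summation is associative and commutative, so applying it early does not change the final result. All names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: combine one map task's output before it is sent on.
final class CombinerSketch {
    static final class Pair {
        final String key;
        final int value;

        Pair(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Sums the values of pairs that share a key, so fewer pairs cross the network.
    static Map<String, Integer> combine(List<Pair> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Pair pair : mapOutput) {
            combined.merge(pair.key, pair.value, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Pair> mapOutput = Arrays.asList(
                new Pair("to", 1), new Pair("be", 1), new Pair("to", 1));
        // Three pairs shrink to two before the shuffle.
        System.out.println(combine(mapOutput)); // prints {be=1, to=2}
    }
}
```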

Partitioner & Shuffle
- After the Map work completes, each Map function sends its results to the node that hosts the corresponding Reducer. At this point the user can supply a Partitioner class that decides where a given (key, value) pair is sent.

Sort
- All (key, value) pairs received by the Reduce functions on a node are sorted automatically by Hadoop (that is, the results produced by Map are sorted automatically when they are delivered to a node).
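A minimal sketch of both steps follows. The `partition` method is modeled on the formula used by Hadoop's default HashPartitioner (hash code with the sign bit masked off, modulo the number of reducers); the surrounding class, the `route` method and the use of one value per key are simplifying assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of partitioning followed by per-reducer sorting.
final class PartitionSketch {
    // Masking with Integer.MAX_VALUE keeps the result non-negative for any hash code.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Routes each pair to its reducer; a TreeMap keeps every reducer's keys sorted.
    static List<TreeMap<String, Integer>> route(Map<String, Integer> pairs, int numReduceTasks) {
        List<TreeMap<String, Integer>> perReducer = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            perReducer.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> e : pairs.entrySet()) {
            perReducer.get(partition(e.getKey(), numReduceTasks)).put(e.getKey(), e.getValue());
        }
        return perReducer;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("e", 5);
        counts.put("a", 1);
        counts.put("b", 2);
        // "a" (hash 97) and "e" (hash 101) land on reducer 1, "b" (hash 98) on reducer 2.
        System.out.println(route(counts, 4)); // prints [{}, {a=1, e=5}, {b=2}, {}]
    }
}
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer, whichever Map task produced it.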

Reducer
- Performs the user-defined Reduce operation.
- Receives an OutputCollector object to which it writes its output.
- The newest programming interface is Reducer.Context.

File output format: OutputFormat
- Every OutputFormat that writes to HDFS inherits from FileOutputFormat.
- Each Reducer writes one file into a common output directory. The file is named part-nnnnn, where nnnnn is the number associated with that reducer (its partition id).
- FileOutputFormat.setOutputPath()
- JobConf.setOutputFormat()
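The naming scheme can be sketched as a zero-padded partition id appended to the output directory. The helper below is hypothetical and uses plain string handling rather than Hadoop's path classes; the directory name in the example is an assumption.

```java
import java.util.Locale;

// Hypothetical sketch of the part-nnnnn naming scheme.
final class OutputNameSketch {
    // Pads the partition id to five digits, e.g. 3 -> part-00003.
    static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    // Joins the shared output directory and the reducer's file name.
    static String outputPath(String outputDir, int partition) {
        return outputDir + "/" + partFileName(partition);
    }

    public static void main(String[] args) {
        // "/user/someone/out" is an assumed example directory.
        System.out.println(outputPath("/user/someone/out", 0));  // prints /user/someone/out/part-00000
        System.out.println(outputPath("/user/someone/out", 12)); // prints /user/someone/out/part-00012
    }
}
```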

File output format: OutputFormat (RecordWriter)
- TextOutputFormat implements the default LineRecordWriter, which writes each result as one line in "key\tvalue" form.

    OutputFormat:  TextOutputFormat
    Description:   Default; writes lines in "key \t value" form

    OutputFormat:  SequenceFileOutputFormat
    Description:   Writes binary files suitable for reading into subsequent MapReduce jobs

    OutputFormat:  NullOutputFormat
    Description:   Disregards its inputs

Fault tolerance and performance optimization
- Fault tolerance is handled by the Hadoop system itself.
- The main method is to re-execute failed tasks.
- Each TaskTracker reports its status to the JobTracker, and the JobTracker decides which tasks to re-execute.
- To speed up execution, Hadoop also runs duplicate copies of the same task automatically and takes the result of whichever copy succeeds first (speculative execution).
  - mapred.map.tasks.speculative.execution
  - mapred.red
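The re-execution strategy described above can be sketched as a simple retry loop. This is an illustration only: in Hadoop the decision is made by the JobTracker across machines, whereas the sketch retries in one process, and the class name, method name and attempt limit are all assumptions.

```java
import java.util.function.Supplier;

// Hypothetical sketch of "re-execute the failed task".
final class RetrySketch {
    // Runs the task and re-executes it after a failure, up to maxAttempts tries in total.
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the task again
            }
        }
        throw new IllegalStateException("task failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The simulated task fails twice, then succeeds on its third execution.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated task failure");
            }
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints done after 3 attempts
    }
}
```

Re-execution is safe here because a task's output depends only on its input split, so running it again produces the same result.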
