Hadoop distributed storage platform: foreign-language literature translation

Source: Borthakur D. The Hadoop Distributed File System: Architecture and Design [J]. Hadoop Project Website, 2007, 11(11): 1-10.

English original

Hadoop Distributed File System: Architecture and Design
Dhruba Borthakur

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has a lot in common with existing distributed file systems, yet it also differs from them significantly. HDFS is a highly fault-tolerant system that is suitable for deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some POSIX constraints to enable streaming reads of file system data. HDFS was originally developed as infrastructure for the Apache Nutch search engine project. It is part of the Apache Hadoop Core project.

Prerequisites and design goals

Hardware errors

Hardware errors are the norm, not the exception. An HDFS instance may consist of hundreds or thousands of servers, each of which stores part of the file system's data. The number of components that make up the system is huge, and any component can fail, so some portion of the HDFS components is always out of service. Error detection and rapid, automatic recovery are therefore the core architectural goals of HDFS.

Streaming data access

Applications running on HDFS differ from ordinary applications in that they need streaming access to their data sets. HDFS is designed for batch processing more than for interactive use. High throughput of data access matters more than low latency. Many of the hard constraints imposed by the POSIX standard are not needed by HDFS applications. To improve data throughput, POSIX semantics have been changed in a few key areas.

Large data sets

Applications running on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. A single HDFS instance should be able to support tens of millions of files.

Simple consistency model

HDFS applications need a "write once, read many" file access model. A file that has been created, written, and closed does not need to change. This assumption simplifies data consistency issues and makes high-throughput data access possible. Map/Reduce applications and web crawler applications fit this model well. There are plans to extend the model in the future so that it supports appending writes to files.

"Moving computation is cheaper than moving data"

A computation requested by an application is more efficient the closer it runs to the data it operates on, especially when the data set is massive. Running close to the data reduces network congestion and increases the overall throughput of the system. Moving the computation closer to the data is clearly better than moving the data to where the application runs. HDFS provides interfaces that let applications move themselves closer to the data.

Portability across heterogeneous hardware and software platforms

HDFS was designed with platform portability in mind. This makes it easier to adopt HDFS as a platform for large-scale data applications.

Namenode and Datanode

HDFS uses a master/slave architecture. An HDFS cluster consists of one Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and client access to files. There is generally one Datanode per node in the cluster, and it manages the storage attached to the node it runs on. HDFS exposes a file system namespace and lets users store data in it as files. Internally, a file is split into one or more data blocks, which are stored on a set of Datanodes. The Namenode performs namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of data blocks to specific Datanodes. The Datanodes serve read and write requests from file system clients, and they create, delete, and replicate data blocks as directed by the Namenode.

The Namenode and Datanode are designed to run on ordinary commodity machines, which typically run the GNU/Linux operating system (OS). HDFS is written in Java, so any machine that supports Java can run the Namenode or a Datanode. Because Java is highly portable, HDFS can be deployed on many types of machines. In a typical deployment, one machine runs only the Namenode, and each of the other machines in the cluster runs one Datanode instance. The architecture does not rule out running multiple Datanodes on one machine, but this is relatively rare.

Having a single Namenode in a cluster greatly simplifies the architecture of the system. The Namenode is the arbiter and manager of all HDFS metadata, and user data never flows through the Namenode.

File system namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside them. The namespace hierarchy is similar to that of most existing file systems: users can create, delete, move, or rename files. Currently, HDFS does not support user disk quotas or access permissions, nor does it support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

The Namenode maintains the file system namespace, and it records any change to the namespace or its properties. An application can set the number of replicas of a file that HDFS should keep. This number is called the replication factor of the file, and it is also stored by the Namenode.

Data replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of data blocks; all blocks except the last are the same size. For fault tolerance, every block of a file is replicated. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file, either when the file is created or later. Files in HDFS are written once, and there is strictly one writer at any time.

The Namenode makes all decisions about block replication. It periodically receives a heartbeat and a block status report (Blockreport) from each Datanode in the cluster. Receiving a heartbeat means that the Datanode is working properly. A block status report contains a list of all data blocks on that Datanode.

Replica placement: the first step

The placement of replicas is critical to HDFS reliability and performance. Optimized replica placement is an important feature that distinguishes HDFS from most other distributed file systems. This feature needs a great deal of tuning and experience. HDFS uses a policy called rack awareness to improve data reliability, availability, and network bandwidth utilization. The current replica placement policy is only a first step in this direction. The short-term goals of implementing it are to validate it in production environments, observe its behavior, and lay a foundation for testing and researching more advanced policies.

Large HDFS instances typically run on clusters of computers that span multiple racks. Communication between two machines on different racks has to go through switches. In most cases, the bandwidth between two machines in the same rack is greater than the bandwidth between two machines in different racks.

Through a rack-awareness process, the Namenode determines the ID of the rack to which each Datanode belongs. A simple but non-optimal policy is to place replicas on different racks. This prevents data loss when an entire rack fails and allows reads to use the bandwidth of multiple racks. The policy distributes replicas evenly across the cluster, which helps load balancing when a component fails. However, because a write under this policy has to transfer blocks to multiple racks, it increases the cost of writing.

In the common case, where the replication factor is 3, the HDFS placement policy is to put one replica on a node in the local rack, another on a different node in the same rack, and the last on a node in a different rack. This policy reduces data transfer between racks, which improves write performance. Rack failures are far rarer than node failures, so the policy does not affect data reliability and availability. At the same time, because the blocks are placed on only two (not three) different racks, it reduces the aggregate network bandwidth needed when reading data. Under this policy, replicas are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the remaining replicas are spread evenly across the other racks. The policy improves write performance without compromising data reliability or read performance. Currently, the default replica placement policy described here is still under development.

Replica selection

To reduce overall bandwidth consumption and read latency, HDFS tries to let a reader read the replica closest to it. If a replica exists on the same rack as the reader, that replica is read. If an HDFS cluster spans multiple data centers, the client likewise reads a replica in the local data center first.

Safe mode

After the Namenode starts up, it enters a special state called safe mode. While in safe mode, the Namenode does not replicate data blocks. It receives heartbeats and block status reports from the Datanodes. A block status report lists the data blocks that a Datanode holds. Each data block has a specified minimum number of replicas. A block is considered safely replicated once the Namenode has confirmed that its replica count has reached that minimum. After a configurable percentage of data blocks has been confirmed as safely replicated (plus an additional 30 seconds of waiting time), the Namenode exits safe mode. It then determines which data blocks still have fewer than the specified number of replicas and replicates those blocks to other Datanodes.

Persistence of file system metadata

The Namenode stores the HDFS namespace. It records every operation that modifies file system metadata in a transaction log called the EditLog. For example, creating a file in HDFS causes the Namenode to insert a record into the EditLog; likewise, changing the replication factor of a file inserts a record into the EditLog. The Namenode stores the EditLog in the file system of its local operating system. The entire file system namespace, including the mapping of data blocks to files and the file attributes, is stored in a file called the FsImage, which is also kept on the Namenode's local file system.

The Namenode keeps an image of the entire file system namespace and the file block map (Blockmap) in memory. This key metadata structure is compact enough that a Namenode with 4 GB of memory can support a large number of files and directories. When the Namenode starts up, it reads the EditLog and FsImage from disk, applies all EditLog transactions to the in-memory FsImage, saves the new version of the FsImage from memory to local disk, and then deletes the old EditLog, because its transactions have already been applied to the FsImage. This process is called a checkpoint. In the current implementation, a checkpoint occurs only when the Namenode starts up; periodic checkpoints will be supported in the near future.

The Datanode stores HDFS data as files in its local file system. It has no knowledge of HDFS files. It stores each HDFS data block in a separate file on the local file system. The Datanode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories when appropriate. Creating all local files in the same directory is not optimal, because the local file system may not support a large number of files in a single directory efficiently. When a Datanode starts up, it scans its local file system, generates a list of all HDFS data blocks that correspond to these local files, and sends the list to the Namenode as a report. This report is the block status report.

Protocol

All HDFS communication protocols are built on top of TCP/IP. A client connects to a configurable TCP port on the Namenode and talks to it using the ClientProtocol. The Datanodes talk to the Namenode using the DatanodeProtocol. A remote procedure call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the Namenode never initiates RPCs; it only responds to RPC requests from clients or Datanodes.
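
The checkpoint process described in the metadata persistence section above can be illustrated with a minimal Python sketch. It is not the Namenode's actual implementation: the dictionary-based namespace, the operation names ("create", "set_replication", "delete"), and the functions apply_edit and checkpoint are all assumptions made for illustration. It only shows the idea of replaying a transaction log onto an image and then starting a fresh log.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace (hypothetical format)."""
    op = edit["op"]
    if op == "create":
        namespace[edit["path"]] = {"replication": edit["replication"]}
    elif op == "set_replication":
        namespace[edit["path"]]["replication"] = edit["replication"]
    elif op == "delete":
        namespace.pop(edit["path"], None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, editlog):
    """Replay every edit onto a copy of the image; return the new image and an empty log."""
    namespace = copy.deepcopy(fsimage)  # leave the old image untouched
    for edit in editlog:
        apply_edit(namespace, edit)
    # The old log can be discarded because its edits are now part of the image.
    return namespace, []


# Illustrative data only: one existing file and three logged transactions.
old_image = {"/data/a.log": {"replication": 3}}
edits = [
    {"op": "create", "path": "/data/b.log", "replication": 3},
    {"op": "set_replication", "path": "/data/a.log", "replication": 2},
    {"op": "delete", "path": "/data/b.log"},
]
new_image, new_log = checkpoint(old_image, edits)
```

Keeping the image and the log separate means that each metadata change costs only one appended record, while the cost of rewriting the full image is paid only at checkpoint time.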
Robustness

The main goal of HDFS is to store data reliably even in the presence of errors. The three common error conditions are Namenode errors, Datanode errors, and network partitions.

Disk data errors, heartbeat detection, and re-replication

Each Datanode periodically sends a heartbeat to the Namenode. A network partition can cause a subset of the Datanodes to lose contact with the Namenode. The Namenode detects this by the absence of heartbeats, marks Datanodes that have not sent a heartbeat recently as down, and stops sending new IO requests to them. Any data stored on a down Datanode is no longer available to HDFS. The loss of a Datanode may cause the replication factor of some data blocks to fall below the specified value. The Namenode constantly tracks which blocks need to be replicated and starts replication as soon as it finds them. Re-replication may be needed in the following cases: a Datanode fails, a replica is corrupted, a hard disk on a Datanode fails, or the replication factor of a file is increased.

Cluster rebalancing

The HDFS architecture supports data rebalancing policies. If the free space on a Datanode falls below a certain threshold, such a policy would automatically move data from that Datanode to other Datanodes with spare space. When demand for a file suddenly increases, a policy might also create additional replicas of the file and rebalance other data in the cluster. These rebalancing policies have not yet been implemented.

Data integrity

A data block fetched from a Datanode may arrive corrupted. The corruption may be caused by a fault in the Datanode's storage device, a network error, or a software bug. The HDFS client software implements checksum checking of HDFS file contents. When a client creates an HDFS file, it computes a checksum for each data block of the file and stores the checksums in a separate hidden file in the same HDFS namespace. When a client retrieves the contents of a file, it checks whether the data obtained from each Datanode matches the checksum in the corresponding checksum file. If it does not match, the client can choose to fetch that block from another Datanode that holds a replica.

Metadata disk errors

The FsImage and the EditLog are the core data structures of HDFS. If these files are corrupted, the entire HDFS instance will fail. For this reason, the Namenode can be configured to maintain multiple copies of the FsImage and EditLog. Any change to the FsImage or EditLog is synchronized to every copy. This synchronization may reduce the number of namespace transactions per second that the Namenode can process. The cost is acceptable, however, because HDFS applications are data-intensive, not metadata-intensive. When the Namenode restarts, it selects the latest consistent FsImage and EditLog to use.

The Namenode is a single point of failure in an HDFS cluster. If the Namenode machine fails, manual intervention is required. Currently, automatic restart of the Namenode or failover to another machine has not been implemented.

Snapshots

Snapshots support storing a copy of data as of a particular moment. With snapshots, HDFS can be restored to a previous known-good point in time when data is corrupted. HDFS does not currently support snapshots but plans to support them in a future release.

Data organization

Data blocks

HDFS is designed to support large files and suits applications that deal with large data sets. These applications write their data only once but read it one or more times, and they need reads to keep up with streaming speeds. HDFS supports "write once, read many" semantics for files. A typical data block size is 64 MB. A file in HDFS is therefore divided into 64 MB blocks, and each block is stored on a different Datanode where possible.

Staging

A client's request to create a file is not sent to the Namenode immediately. Instead, the HDFS client initially caches the file data in a temporary local file. Application writes are transparently redirected to this temporary file. When the data accumulated in the temporary file exceeds the size of one data block, the client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy and allocates a data block for it. It then returns the identifier of the Datanode and the destination data block to the client. The client then uploads the block of data from the local temporary file to the specified Datanode. When the file is closed, the data remaining in the temporary file that has not yet been uploaded is also transferred to the specified Datanode. The client then tells the Namenode that the file is closed. At this point, the Namenode commits the file creation operation to its persistent log. If the Namenode crashes before the file is closed, the file is lost.

This approach was adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. Without client-side caching, network speed and network congestion would have a large impact on throughput. The approach is not without precedent: earlier file systems, such as AFS, used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher data upload efficiency.

Pipeline replication

When a client writes data to an HDFS file, the data is first written to a local temporary file, as described above. Suppose the replication factor of the file is 3. When the local temporary file accumulates a full data block, the client obtains from the Namenode a list of the Datanodes that will store the replicas. The client then starts transferring the data to the first Datanode. The first Datanode receives the data in small portions (4 KB), writes each portion to its local repository, and forwards that portion to the second Datanode in the list. The second Datanode does the same: it receives each small portion, writes it to its local repository, and passes it on to the third Datanode. Finally, the third Datanode receives the data and stores it locally. A Datanode can thus receive data from the previous node in the pipeline while forwarding it to the next node at the same time, so the data is copied from one Datanode to the next in a pipelined manner.

Accessibility

HDFS gives applications several ways to access it. Users can access it through the Java API or through a C language wrapper for that API, and they can also browse the files in HDFS with a web browser. Access through the WebDAV protocol is under development.

Browser interface

A typical HDFS installation runs a web server on a configurable TCP port to expose the HDFS namespace. Users can browse the HDFS namespace and view the contents of files in a browser.

Storage space reclamation

File deletion and recovery

When a user or an application deletes a file, the file is not immediately removed from HDFS. Instead, HDFS renames it into the /trash directory. As long as the file remains in /trash, it can be restored quickly. The time a file is kept in /trash is configurable. When this time has passed, the Namenode deletes the file from the namespace. Deleting the file frees the data blocks associated with it. Note that there may be a delay between the time a user deletes a file and the time the free space in HDFS increases.

As long as a deleted file is still in the /trash directory, the user can recover it by browsing /trash and retrieving the file. The /trash directory holds only the latest copy of a deleted file. The /trash directory is like any other directory except that HDFS applies a special policy to it that deletes files automatically. The current default policy is to delete files that have been in /trash for more than 6 hours. In the future, this policy will be configurable through a well-defined interface.

Reducing the replication factor

When the replication factor of a file is reduced, the Namenode selects the excess replicas to delete. This information is passed to the Datanode with the next heartbeat. The Datanode then removes the corresponding data blocks, and the free space in the cluster increases. Again, there may be a delay between the completion of the setReplication API call and the increase in free space in the cluster.
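
The /trash retention policy described in the file deletion and recovery section can be sketched in a few lines of Python. This is a hedged illustration, not HDFS code: the purge_trash function, the dictionary that maps paths to deletion times, and the sample paths are hypothetical; only the 6-hour default comes from the text.

```python
from datetime import datetime, timedelta

# Default retention period stated in the text; everything else is illustrative.
TRASH_RETENTION = timedelta(hours=6)


def purge_trash(trash, now, retention=TRASH_RETENTION):
    """Split trash entries into those still recoverable and those past retention.

    trash maps a path under /trash to the time the file was moved there.
    Returns (kept, purged): kept is a dict of surviving entries,
    purged is a sorted list of paths whose blocks can now be freed.
    """
    kept = {}
    purged = []
    for path, deleted_at in trash.items():
        if now - deleted_at > retention:
            purged.append(path)  # retained for more than 6 hours: remove
        else:
            kept[path] = deleted_at  # still recoverable by the user
    return kept, sorted(purged)


# Hypothetical sample: one file deleted 7 hours ago, one deleted 1 hour ago.
now = datetime(2007, 1, 1, 12, 0)
trash = {
    "/trash/old-report.txt": now - timedelta(hours=7),
    "/trash/recent-log.txt": now - timedelta(hours=1),
}
kept, purged = purge_trash(trash, now)
```

Because space is reclaimed only when an entry leaves /trash, this also shows why free space lags behind the user's delete request.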