Distributed Machine Learning with Python
Accelerating model training and serving with distributed systems
Guanhua Wang
BIRMINGHAM—MUMBAI
Distributed Machine Learning with Python
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editors: Roshan Kumar, Nathanya Diaz
Content Development Editors: Tazeen Shaikh, Shreya Moharir
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Alishon Mendonca
Marketing Coordinators: Abeer Riyaz Dawe, Shifa Ansari
First published: May 2022
Production reference: 1040422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80181-569-7
To my parents, Ying Han and Xin Wang.
To my girlfriend, Jing Yuan.
– Guanhua Wang
Contributors
About the author
Guanhua Wang is a final-year computer science Ph.D. student in the RISELab at UC Berkeley, advised by Professor Ion Stoica. His research lies primarily in the machine learning systems area, including fast collective communication, efficient in-parallel model training, and real-time model serving. His research has attracted significant attention from both academia and industry. He has been invited to give talks at top-tier universities (MIT, Stanford, CMU, and Princeton) and big tech companies (Facebook/Meta and Microsoft). He received his master's degree from HKUST and his bachelor's degree from Southeast University in China. He has also done some cool research on wireless networks. He likes playing soccer and has run multiple half-marathons in the Bay Area of California.
About the reviewers
Jamshaid Sohail is passionate about data science, machine learning, computer vision, and natural language processing, and has more than 2 years of experience in the industry. He previously worked as a data scientist at FunnelBeam, a Silicon Valley-based start-up whose founders are from Stanford University. Currently, he is working as a data scientist at Systems Limited. He has completed over 66 online courses from different platforms. He authored the book Data Wrangling with Python 3.X for Packt Publishing and has reviewed multiple books and courses. He is also developing a comprehensive course on data science at Educative and is in the process of writing books for multiple publishers.
Hitesh Hinduja is an ardent AI enthusiast working as a senior manager in AI at Ola Electric, where he leads a team of 20+ people in the areas of ML, statistics, CV, NLP, and reinforcement learning. He has filed 14+ patents in India and the US and has numerous research publications to his name. Hitesh has been involved in research roles at India's top business schools: the Indian School of Business, Hyderabad, and the Indian Institute of Management, Ahmedabad. He is also actively involved in training and mentoring and has been invited to be a guest speaker by various corporations and associations across the globe.
Table of Contents

Preface

Section 1 – Data Parallelism

Chapter 1: Splitting Input Data
  Single-node training is too slow
    The mismatch between data loading bandwidth and model training bandwidth
    Single-node training time on popular datasets
    Accelerating the training process with data parallelism
  Data parallelism – the high-level bits
    Stochastic gradient descent
    Model synchronization
  Hyperparameter tuning
    Global batch size
    Learning rate adjustment
    Model synchronization schemes
  Summary
Chapter 2: Parameter Server and All-Reduce
  Technical requirements
  Parameter server architecture
    Communication bottleneck in the parameter server architecture
    Sharding the model among parameter servers
  Implementing the parameter server
    Defining model layers
    Defining the parameter server
    Defining the worker
    Passing data between the parameter server and worker
  Issues with the parameter server
    The parameter server architecture introduces a high coding complexity for practitioners
  All-Reduce architecture
    Reduce
    All-Reduce
    Ring All-Reduce
  Collective communication
    Broadcast
    Gather
    All-Gather
  Summary
Chapter 3: Building a Data Parallel Training and Serving Pipeline
  Technical requirements
  The data parallel training pipeline in a nutshell
    Input pre-processing
    Input data partition
    Data loading
    Training
    Model synchronization
    Model update
  Single-machine multi-GPUs and multi-machine multi-GPUs
    Single-machine multi-GPU
    Multi-machine multi-GPU
  Checkpointing and fault tolerance
    Model checkpointing
    Load model checkpoints
  Model evaluation and hyperparameter tuning
  Model serving in data parallelism
  Summary

Chapter 4: Bottlenecks and Solutions
  Communication bottlenecks in data parallel training
    Analyzing the communication workloads
    Parameter server architecture
    The All-Reduce architecture
    The inefficiency of state-of-the-art communication schemes
  Leveraging idle links and host resources
    Tree All-Reduce
    Hybrid data transfer over PCIe and NVLink
  On-device memory bottlenecks
  Recomputation and quantization
    Recomputation
    Quantization
  Summary
Section 2 – Model Parallelism

Chapter 5: Splitting the Model
  Technical requirements
  Single-node training error – out of memory
    Fine-tuning BERT on a single GPU
    Trying to pack a giant model inside one state-of-the-art GPU
  ELMo, BERT, and GPT
    Basic concepts
    RNN
    ELMo
    BERT
    GPT
  Pre-training and fine-tuning
  State-of-the-art hardware
    P100, V100, and DGX-1
    NVLink
    A100 and DGX-2
    NVSwitch
  Summary

Chapter 6: Pipeline Input and Layer Split
  Vanilla model parallelism is inefficient
    Forward propagation
    Backward propagation
    GPU idle time between forward and backward propagation
  Pipeline input
  Pros and cons of pipeline parallelism
    Advantages of pipeline parallelism
    Disadvantages of pipeline parallelism
  Layer split
  Notes on intra-layer model parallelism
  Summary
Chapter 7: Implementing Model Parallel Training and Serving Workflows
  Technical requirements
  Wrapping up the whole model parallelism pipeline
    A model parallel training overview
    Implementing a model parallel training pipeline
    Specifying communication protocol among GPUs
    Model parallel serving
  Fine-tuning transformers
  Hyperparameter tuning in model parallelism
    Balancing the workload among GPUs
    Enabling/disabling pipeline parallelism
  NLP model serving
  Summary
Chapter 8: Achieving Higher Throughput and Lower Latency
  Technical requirements
  Freezing layers
    Freezing layers during forward propagation
    Reducing computation cost during forward propagation
    Freezing layers during backward propagation
  Exploring memory and storage resources
  Understanding model decomposition and distillation
    Model decomposition
    Model distillation
  Reducing bits in hardware
  Summary
Section 3 – Advanced Parallelism Paradigms

Chapter 9: A Hybrid of Data and Model Parallelism
  Technical requirements
  Case study of Megatron-LM
    Layer split for model parallelism
    Row-wise trial-and-error approach
    Column-wise trial-and-error approach
    Cross-machine for data parallelism
  Implementation of Megatron-LM
  Case study of Mesh-TensorFlow
  Implementation of Mesh-TensorFlow
  Pros and cons of Megatron-LM and Mesh-TensorFlow
  Summary
Chapter 10: Federated Learning and Edge Devices
  Technical requirements
  Sharing knowledge without sharing data
    Recapping the traditional data parallel model training paradigm
    No input sharing among workers
    Communicating gradients for collaborative learning
  Case study: TensorFlow Federated
  Running edge devices with TinyML
    Case study: TensorFlow Lite
  Summary
Chapter 11: Elastic Model Training and Serving
  Technical requirements
  Introducing adaptive model training
    Traditional data parallel training
    Adaptive model training in data parallelism
    Adaptive model training (AllReduce-based)
    Adaptive model training (parameter server-based)
    Traditional model-parallel model training paradigm
    Adaptive model training in model parallelism
  Implementing adaptive model training in the cloud
  Elasticity in model inference
  Serverless
  Summary
Chapter 12: Advanced Techniques for Further Speed-Ups
  Technical requirements
  Debugging and performance analytics
    General concepts in the profiling results
    Communication results analysis
    Computation results analysis
  Job migration and multiplexing
    Job migration
    Job multiplexing
  Model training in a heterogeneous environment
  Summary

Index

Other Books You May Enjoy
Preface
Reducing time costs in machine learning leads to a shorter waiting time for model training and a faster model updating cycle. Distributed machine learning enables machine learning practitioners to shorten model training and inference time by orders of magnitude. With the help of this practical guide, you'll be able to put your Python development knowledge to work to get up and running with the implementation of distributed machine learning, including multi-node machine learning systems, in no time.
You'll begin by exploring how distributed systems work in the machine learning area and how distributed machine learning is applied to state-of-the-art deep learning models. As you advance, you'll see how to use distributed systems to enhance machine learning model training and serving speed. You'll also get to grips with applying data parallel and model parallel approaches before optimizing the in-parallel model training and serving pipeline in local clusters or cloud environments.
By the end of this book, you'll have gained the knowledge and skills needed to build and deploy an efficient data processing pipeline for machine learning model training and inference in a distributed manner.
Who this book is for
This book is for data scientists, machine learning engineers, and machine learning practitioners in both academia and industry. A fundamental understanding of machine learning concepts and working knowledge of Python programming are assumed. Prior experience implementing machine learning/deep learning models with TensorFlow or PyTorch will be beneficial. You'll find this book useful if you are interested in using distributed systems to boost machine learning model training and serving speed.
What this book covers
Chapter 1, Splitting Input Data, shows how to distribute the machine learning training or serving workload on the input data dimension, which is called data parallelism.
Chapter 2, Parameter Server and All-Reduce, describes two widely adopted model synchronization schemes in the data parallel training process.
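To give a first taste of what these chapters cover, here is a minimal sketch of All-Reduce-style data parallel training using PyTorch's DistributedDataParallel wrapper. This is not an example from the book; the model, batch, and hyperparameters are placeholder assumptions.

# Minimal data parallel training sketch (illustrative placeholders only).
# Launch one process per GPU, for example with torchrun, which sets
# MASTER_ADDR/MASTER_PORT and the rank environment variables.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Each worker joins the process group; NCCL is the usual GPU backend.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = nn.Linear(1024, 10).to(rank)
    # DDP performs All-Reduce-based gradient synchronization during backward().
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    # In real training, each rank would load its own partition of the dataset.
    inputs = torch.randn(32, 1024).to(rank)
    labels = torch.randint(0, 10, (32,)).to(rank)
    loss = nn.functional.cross_entropy(ddp_model(inputs), labels)
    loss.backward()  # gradients are averaged across all workers here
    optimizer.step()
    dist.destroy_process_group()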
Chapter 3, Building a Data Parallel Training and Serving Pipeline, illustrates how to implement data parallel training and the serving workflow.
Chapter 4, Bottlenecks and Solutions, describes how to improve data parallelism performance with advanced techniques, such as more efficient communication protocols and a reduced memory footprint.
Chapter 5, Splitting the Model, introduces the vanilla model parallel approach in general.
Chapter 6, Pipeline Input and Layer Split, shows how to improve system efficiency with pipeline parallelism.
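For a concrete picture of the vanilla model parallel approach before pipelining, the following is a minimal PyTorch sketch that places the two halves of a toy network on different GPUs. The layer sizes and device IDs are illustrative assumptions rather than code from the book.

# Vanilla model parallelism sketch: two halves on two GPUs (illustrative).
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each half of the network lives on its own device.
        self.part1 = nn.Linear(1024, 512).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        # The activation produced on cuda:0 is copied to cuda:1, so the
        # forward pass crosses the GPU boundary exactly once.
        x = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
output = model(torch.randn(32, 1024))  # output tensor lives on cuda:1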
Chapter 7, Implementing Model Parallel Training and Serving Workflows, discusses how to implement model parallel training and serving in detail.
Chapter 8, Achieving Higher Throughput and Lower Latency, covers advanced schemes to reduce computation and memory consumption in model parallelism.
Chapter 9, A Hybrid of Data and Model Parallelism, combines data and model parallelism together as an advanced in-parallel model training/serving scheme.
Chapter 10, Federated Learning and Edge Devices, talks about federated learning and how edge devices are involved in this process.
Chapter 11, Elastic Model Training and Serving, describes a more efficient scheme that can change the number of accelerators used on the fly.
Chapter 12, Advanced Techniques for Further Speed-Ups, summarizes several useful tools, such as a performance debugging tool, job multiplexing, and heterogeneous model training.
To get the most out of this book
You will need to install PyTorch/TensorFlow successfully on your system. For distributed workloads, we suggest you have at least four GPUs in hand.
We assume you have Linux/Ubuntu as your operating system. We assume you use NVIDIA GPUs and have installed the proper NVIDIA driver as well. We also assume you have basic knowledge about machine learning in general and are familiar with popular deep learning models.
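A short snippet along the following lines can verify this assumed setup, checking that PyTorch sees CUDA and counting the visible GPUs; the four-GPU threshold simply mirrors the suggestion above.

# Sanity check for the assumed environment: CUDA-enabled PyTorch
# and, ideally, at least four visible NVIDIA GPUs.
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU or driver found"
num_gpus = torch.cuda.device_count()
print(f"GPUs visible: {num_gpus}")
if num_gpus < 4:
    print("Fewer than 4 GPUs; multi-GPU examples may need adjusting")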
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: /downloads/9781801815697_ColorImages.pdf
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Replace YOUR_API_KEY_HERE with the subscription key of your Cognitive Services resource. Leave the quotation marks!"
A block of code is set as follows:
# Connect to the API through the subscription key and endpoint
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

subscription_key = "<your-subscription-key>"
endpoint = "https://<your-cognitive-service>.cognitiveservices.azure.com/"

# Authenticate
credential = AzureKeyCredential(subscription_key)
cog_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)
Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select Review + Create."
Tips or Important Notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.