Deep Learning / AI Lifecycle with Dell EMC and Bitfusion
Bhavesh Patel, Dell EMC Server Advanced Engineering

Abstract
This talk gives an overview of the end-to-end application life cycle of deep learning in the enterprise, along with numerous use cases, and summarizes studies done by Bitfusion and Dell on a high-performance heterogeneous elastic rack of Dell EMC PowerEdge C4130s with Nvidia GPUs. Use cases discussed in detail include the ability to bring on-demand GPU acceleration beyond the rack and across the enterprise with easily attachable elastic GPUs for deep learning development, as well as the creation of a cost-effective, software-defined, high-performance elastic multi-GPU system that combines multiple Dell EMC C4130 servers at runtime for deep learning training.
Deep Learning and AI are being adopted across a wide range of market segments

Industry/Function and AI revolution:
- ROBOTICS: Computer Vision & Speech, Drones, Droids
- ENTERTAINMENT: Interactive Virtual & Mixed Reality
- AUTOMOTIVE: Self-Driving Cars, Co-Pilot Advisor
- FINANCE: Predictive Price Analysis, Dynamic Decision Support
- PHARMA: Drug Discovery, Protein Simulation
- HEALTHCARE: Predictive Diagnosis, Wearable Intelligence
- ENERGY: Geo-Seismic Resource Discovery
- EDUCATION: Adaptive Learning Courses
- SALES: Adaptive Product Recommendations
- SUPPLY CHAIN: Dynamic Routing Optimization
- CUSTOMER SERVICE: Bots and Fully-Automated Service
- MAINTENANCE: Dynamic Risk Mitigation and Yield Optimization

...but few people have the time, knowledge, and resources to even get started.
PROBLEM 1: HARDWARE INFRASTRUCTURE LIMITATIONS
- Increased cost with dense servers
- TOR bottleneck, limited scalability
- Limited multi-tenancy on GPU servers (limited CPU and memory per user)
- Limited to 8-GPU applications
- Does not support GPU apps with high storage, CPU, and memory requirements
PROBLEM 2: SOFTWARE COMPLEXITY OVERLOAD
- Software Management: GPU driver management, framework & library installation, deep learning framework configuration, package manager, Jupyter server or IDE setup
- Data Management: data uploader, shared local file system, data volume management, data integrations & pipelining
- Model Management: code version management, hyperparameter optimization, experiment tracking, deployment automation, deployment continuous integration
- Workload Management: job scheduler, log management, user & group management, inference autoscaling
- Infrastructure Management: cloud or server orchestration, GPU hardware setup, GPU resource allocation, container orchestration, networking direct bypass, MPI/RDMA/RPI/gRPC, monitoring

Need to simplify and scale.
SOLUTION 1/2: CONVERGED RACK SOLUTION
- Composable compute bundle
- Up to 64 GPUs per application
- GPU applications with varied storage, memory, and CPU requirements
- 30-50% less cost per GPU
- More {cores, memory} per GPU
- Much greater intra-rack networking bandwidth
- Less inter-rack load
- Composable: add as you go
SOLUTION 2/2: COMPLETE, STREAMLINED AI DEVELOPMENT

1. Develop. Develop on pre-installed, quick-start deep learning containers. Get to work quickly with workspaces that provide optimized, pre-configured drivers, frameworks, libraries, and notebooks. Start with CPUs and attach Elastic GPUs on demand. All your code and data is saved automatically and is sharable with others.

2. Train. Transition from development to training with multiple GPUs. Seamlessly scale out to more GPUs on a shared training cluster to train larger models quickly and cost-effectively. Support and manage multiple users, teams, and projects. Train multiple models in parallel for massive productivity improvements.

3. Deploy. Push trained, finalized models into production. Deploy a trained neural network into production and perform real-time inference across different hardware. Manage multiple AI applications and inference endpoints corresponding to different trained models.
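The deck does not show the Bitfusion tooling itself, so as a minimal sketch of what "start with CPUs and attach Elastic GPUs on demand" means from the application side, the snippet below (using CuPy, an assumption not named in the deck) falls back to plain NumPy in a CPU-only workspace and shards work across however many CUDA devices happen to be visible once GPUs are attached. The same script runs unmodified in both cases.

```python
# Minimal sketch (not Bitfusion's API): the same script runs on a CPU-only
# development workspace and on a workspace with elastic GPUs attached.
# CuPy is assumed to be installed in the GPU containers.
import numpy as np

def visible_gpu_count():
    """Return the number of CUDA devices currently visible, 0 if none."""
    try:
        import cupy as cp
        return cp.cuda.runtime.getDeviceCount()
    except Exception:
        return 0

def forward(batch, weights):
    n_gpus = visible_gpu_count()
    if n_gpus == 0:
        # Development mode: plain NumPy on the CPU.
        return np.tanh(batch @ weights)
    # Training mode: shard the batch across every attached GPU.
    import cupy as cp
    shards = np.array_split(batch, n_gpus)
    outputs = []
    for dev_id, shard in enumerate(shards):
        with cp.cuda.Device(dev_id):
            y = cp.tanh(cp.asarray(shard) @ cp.asarray(weights))
            outputs.append(cp.asnumpy(y))
    return np.concatenate(outputs)

if __name__ == "__main__":
    data = np.random.rand(1024, 256).astype(np.float32)
    w = np.random.rand(256, 128).astype(np.float32)
    print("GPUs visible:", visible_gpu_count())
    print("output shape:", forward(data, w).shape)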
Dell EMC Deep Learning Optimized servers
Solution stack: vertical segment, applications, open source frameworks, optimized libraries, operating system, processor/accelerator, compute platform.
Compute platforms: C4130, R730, and C6320P in C6300; accelerators: GPU, KNL Phi (in the C6320P sled), and NVLink GPUs (in the C4130).
C4130 DEEP LEARNING Server
- Front: (optional) redundant power supplies, dual SSD boot drives
- Back: iDRAC NIC, 2x 1Gb NIC
- Front (internal): power supplies, GPU accelerators (4), CPU sockets (under heatsinks), 8 fans
GPU DEEP LEARNING RACK SOLUTION - Configuration Details

Feature        | R730                   | C4130
CPU            | E5-2669 v3 @ 2.1 GHz   | E5-2630 v3 @ 2.4 GHz
Memory         | 1 TB/node; 64 GB DIMMs | 1 TB/node; 64 GB DIMMs
Storage        | Intel PCIe NVMe        | Intel PCIe NVMe
Networking IO  | CX3 FDR InfiniBand     | CX3 FDR InfiniBand
GPU            | NA                     | M40-24GB
TOR switch     | Mellanox SX6036 FDR switch
Cables         | FDR 56G DCA cables
GPU DEEP LEARNING RACK SOLUTION
- Pre-built app containers
- GPU and workspace management
- Elastic GPUs across the datacenter
- Software-defined, scaled-out GPU servers
- End-to-end deep learning application life cycle: 1 Develop, 2 Train, 3 Deploy

Rack layout: four C4130 GPU nodes (#1-#4) and two R730 CPU nodes (#1-#2) connected through an InfiniBand switch.

...but wait, 'converged compute' requires network-attached GPUs...
BITFUSION CORE VIRTUALIZATION
GPU device virtualization: allows dynamic GPU attach on a per-application basis.

Features:
- APIs: CUDA, OpenCL
- Distribution: scale out to remote GPUs
- Pooling: oversubscribe GPUs
- Resource provisioning: fractional vGPUs
- High availability: automatic DMR
- Manageability: remote nvidia-smi
- Distributed CUDA Unified Memory
- Native support for IB, GPUDirect RDMA
- Feature complete with CUDA 8.0
PUTTING IT ALL TOGETHER
Architecture: a client server connected to multiple GPU servers.
- Bitfusion Flex, managed containers
- Bitfusion Service Daemon
- Bitfusion Client Library
NATIVE VS. REMOTE GPUs
Native: one CPU with GPU 0 and GPU 1 attached over PCIe.
Remote: the client CPU has GPU 0 and an HCA on PCIe, while a second node's CPU exposes GPU 1 through its own HCA over PCIe.
Completely transparent: all CUDA apps see local and remote GPUs as if they were directly connected.
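Because the attach is transparent at the CUDA level, an unmodified application just enumerates devices as usual and has no notion of local versus remote. The short sketch below (CuPy is assumed purely as a convenient CUDA runtime binding; the deck does not prescribe a language or library) lists whatever devices are currently visible.

```python
# Minimal sketch of the transparency claim: an unmodified CUDA application
# simply enumerates devices; it cannot tell whether a device index is backed
# by a local PCIe GPU or a network-attached one. CuPy is an assumption.
import cupy as cp

def list_devices():
    count = cp.cuda.runtime.getDeviceCount()
    for dev_id in range(count):
        props = cp.cuda.runtime.getDeviceProperties(dev_id)
        name = props["name"]
        if isinstance(name, bytes):
            name = name.decode()
        mem_gib = props["totalGlobalMem"] / 2**30
        print(f"device {dev_id}: {name}, {mem_gib:.1f} GiB")

if __name__ == "__main__":
    list_devices()
```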
Results

REMOTE GPUs - LATENCY AND BANDWIDTH
- Data movement overhead is the primary scaling limiter.
- Measurements were done at the application level, via cudaMemcpy.
- Local GPU copies are fast; intranode copies go over PCIe.

16-GPU virtual system, naive implementation with TCP/IP (four C4130 nodes):
- Fast local GPU copies and intranode copies via PCIe, but low-bandwidth, high-latency remote copies.
- OS bypass is needed to avoid the primary TCP/IP overheads.
- AI apps are very latency sensitive.
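The deck does not include the measurement code; as a rough sketch of an application-level cudaMemcpy-style measurement of the kind described (CuPy assumed, CUDA events for timing, not the original test harness), the snippet below times host-to-device copies at a small and a large size so that the per-copy latency floor and the bandwidth plateau can be read off separately.

```python
# Rough sketch of an application-level copy benchmark in the spirit of the
# slide's cudaMemcpy measurements (CuPy assumed; not the original test code).
# Small transfers expose per-copy latency; large transfers expose bandwidth.
import numpy as np
import cupy as cp

def time_h2d_copy(nbytes, iters=100):
    host = np.ones(nbytes, dtype=np.uint8)
    dev = cp.empty(nbytes, dtype=cp.uint8)
    start, stop = cp.cuda.Event(), cp.cuda.Event()
    dev.set(host)                      # warm-up copy
    start.record()
    for _ in range(iters):
        dev.set(host)                  # host-to-device copy (cudaMemcpy under the hood)
    stop.record()
    stop.synchronize()
    ms = cp.cuda.get_elapsed_time(start, stop) / iters
    return ms, nbytes / (ms * 1e-3) / 1e9   # per-copy time (ms), GB/s

if __name__ == "__main__":
    for size in (4 * 1024, 256 * 1024 * 1024):      # 4 KiB vs. 256 MiB
        ms, gbps = time_h2d_copy(size)
        print(f"{size:>12} bytes: {ms:8.3f} ms/copy, {gbps:6.2f} GB/s")
```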
16-GPU virtual system: Bitfusion optimized transport and runtime
- Same FDR x4 transport, but drop IPoIB.
- Replace remote calls with native IB verbs.
- Runtime selection of intranode RDMA vs. cudaMemcpy.
- Multi-rail communications where available.
- Remote ≈ native local GPUs.
- Runtime optimizations: pipelining, speculative execution, distributed caching & event coalescing, minimal NUMA effects, ...
SLICE & DICE - MORE THAN ONE WAY TO GET 4 GPUs
- Workloads: Caffe GoogleNet and TensorFlow Pixel-CNN on R730 and C4130; run-time comparison (lower is better).
- Native GPU performance with network-attached GPUs.
- Multiple ways to create a virtual 4-GPU node, with native efficiency (seconds to train Caffe GoogleNet, batch size: 128).
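The slicing itself is done by the Bitfusion tooling, which the deck does not show. As a generic stand-in for the idea of handing a job exactly four GPUs out of a larger pool, the sketch below launches each job with CUDA_VISIBLE_DEVICES restricted to its slice; the command name train.py and its flags are placeholders, not anything from the deck.

```python
# Generic illustration of 'slice and dice' (not Bitfusion's mechanism): give
# each job a 4-GPU slice of a larger pool by restricting CUDA_VISIBLE_DEVICES.
# 'python train.py' is a placeholder training command.
import os
import subprocess

def launch_on_slice(gpu_ids, job_args):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    # Inside the child process the slice appears as devices 0..len(gpu_ids)-1.
    return subprocess.Popen(["python", "train.py", *job_args], env=env)

if __name__ == "__main__":
    # Two concurrent jobs, each seeing a different virtual 4-GPU node.
    job_a = launch_on_slice([0, 1, 2, 3], ["--batch-size", "128"])
    job_b = launch_on_slice([4, 5, 6, 7], ["--batch-size", "128"])
    job_a.wait()
    job_b.wait()
```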
TRAINING PERFORMANCE
- Continued strong scaling: Caffe GoogleNet.
- Weak scaling: accelerate hyperparameter optimization (Caffe GoogleNet, TensorFlow 1.0 with Pixel-CNN).
[Chart: native vs. remote scaling on R730 + C4130 at 1, 2, 4, 8, and 16 GPUs; efficiency labels 74%, 73%, 55%, 53%, 86%; PCIe host bridge limit.]
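The weak-scaling case, where extra GPUs accelerate hyperparameter search rather than a single model, needs no Bitfusion-specific code to sketch: each GPU trains its own independent model with a different hyperparameter value. The toy example below (CuPy assumed; a quadratic stand-in for the Caffe GoogleNet / Pixel-CNN training in the deck, and it assumes at least as many attached GPUs as learning rates) runs one process per GPU.

```python
# Sketch of weak scaling for hyperparameter search: each GPU trains its own
# copy of a toy model with a different learning rate, fully independently.
# The quadratic 'model' is only a stand-in for real network training.
import multiprocessing as mp

def train_one(gpu_id, lr, steps=200):
    import cupy as cp
    with cp.cuda.Device(gpu_id):
        rng = cp.random.default_rng(gpu_id)
        target = rng.standard_normal(128, dtype=cp.float32)
        w = cp.zeros(128, dtype=cp.float32)
        for _ in range(steps):
            grad = 2 * (w - target)        # gradient of ||w - target||^2
            w -= lr * grad                 # plain SGD step
        loss = float(cp.sum((w - target) ** 2))
    print(f"GPU {gpu_id}: lr={lr:g}, final loss={loss:.4f}")

if __name__ == "__main__":
    lrs = [0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001, 0.00003]
    procs = [mp.Process(target=train_one, args=(gpu, lr))
             for gpu, lr in enumerate(lrs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```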
Other PCIe GPU configurations are available; currently testing Config 'G'.

Further reading:
/techcenter/high-performance-computing/b/general_hpc/archive/2016/11/11/deep-learning-performance-with-p100-gpus
http:///techcenter/high-performance-computing/b/general_hpc/archive/2017/03/22/deep-learning-inference-on-p40-gpus

NVLink configuration - Config 'K':
- 4x P100-16GB SXM2 GPUs (SXM2 #1-#4), 2 CPUs, PCIe switch, 1 PCIe slot - EDR IB

NVLink configuration - Config 'L':
- 4x P100-16GB SXM2 GPUs (SXM2 #1-#4), 2 CPUs, PCIe switch, 1 PCIe slot - EDR IB
- Memory: 256 GB with 16 GB DIMMs @ 2133; OS: Ubuntu 16.04; CUDA: 8.1
Software Solutions

Overview – Bright ML
Dell EMC has partnered with Bright Computing to offer their Bright ML package as the software stack on the Dell EMC Deep Learning hardware solution.

Bright ML Overview
Machine Learning in Seismic Imaging Using KNL + FPGA – Project #1
Bhavesh Patel – Server Advanced Engineering
Robert Dildy – Product Technologist Sr. Consultant, Engineering Solutions

Abstract
Solutions36AbstractThis
paper
is
focused
on
how
to
apply
Machine
Learning
to
seismic
imaging
with
the
use
of
FPGA
as
aco-accelerator.It
will
cover
2
hardware
technologies:
1)
Intel
KNL
Phi
2)
FPGA
and
also
address
how
to
use
Machine
learningforseismic
imaging.There
are
different
types
of
accelerators
like
GPU,
Intel
Phi
but
we
are
choosing
to
study
how
we
can
use
i-ABRAplatform
on
KNL
+
FPGA
to
train
the
neural
network
using
Seismic
Imaging
data
and
then
doing
the
inference.Machine
learning
in
a
broader
sense
can
be
divided
into
2
parts
namely
:
Training
and
Inference.37BackgroundSeismic
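The training/inference split the abstract refers to can be made concrete with a tiny example that has nothing to do with the i-ABRA platform or seismic data: a NumPy logistic-regression stub (purely illustrative) with a training phase that fits weights on labelled data and an inference phase that only applies the frozen weights to new inputs.

```python
# Toy illustration of the training vs. inference split (NumPy only; not the
# i-ABRA platform or the paper's actual model): training fits the weights,
# inference only applies the frozen weights to new data.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(x, y, lr=0.1, epochs=200):
    """Training phase: iteratively update weights from labelled examples."""
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        pred = sigmoid(x @ w)
        w -= lr * x.T @ (pred - y) / len(y)   # gradient step on logistic loss
    return w

def infer(x, w):
    """Inference phase: a single forward pass with fixed weights."""
    return (sigmoid(x @ w) > 0.5).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(512, 8))
    y = (x[:, 0] + x[:, 1] > 0).astype(float)   # synthetic labels
    w = train(x, y)
    print("accuracy:", (infer(x, w) == y).mean())
```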
Background
Seismic imaging is a standard data-processing technique used to create an image of subsurface structures of the Earth from measurements recorded at the surface via seismic wave propagation captured from various sound energy sources.

There are certain challenges with seismic data interpretation, such as 3D starting to replace 2D for seismic interpretation.

There has been rapid growth in the use of computer vision technology, and several companies are developing image recognition platforms. This technology is being used for automatic photo tagging and classification.