
Deep Learning/AI Lifecycle with Dell EMC and Bitfusion
Bhavesh Patel, Dell EMC Server Advanced Engineering

Abstract

This talk gives an overview of the end-to-end application life cycle of deep learning in the enterprise, along with numerous use cases, and summarizes studies done by Bitfusion and Dell on a high-performance heterogeneous elastic rack of Dell EMC PowerEdge C4130s with Nvidia GPUs.

Some of the use cases that will be discussed in detail are the ability to bring on-demand GPU acceleration beyond the rack and across the enterprise with easily attachable elastic GPUs for deep learning development, as well as the creation of a cost-effective, software-defined, high-performance elastic multi-GPU system that combines multiple Dell EMC C4130 servers at runtime for deep learning training.

Deep Learning and AI are being adopted across a wide range of market segments

Industry/Function - AI Revolution:
- ROBOTICS: Computer Vision & Speech, Drones, Droids
- ENTERTAINMENT: Interactive Virtual & Mixed Reality
- AUTOMOTIVE: Self-Driving Cars, Co-Pilot Advisor
- FINANCE: Predictive Price Analysis, Dynamic Decision Support
- PHARMA: Drug Discovery, Protein Simulation
- HEALTHCARE: Predictive Diagnosis, Wearable Intelligence
- ENERGY: Geo-Seismic Resource Discovery
- EDUCATION: Adaptive Learning Courses
- SALES: Adaptive Product Recommendations
- SUPPLY CHAIN: Dynamic Routing Optimization
- CUSTOMER SERVICE: Bots and Fully-Automated Service
- MAINTENANCE: Dynamic Risk Mitigation and Yield Optimization

...but few people have the time, knowledge, or resources to even get started.

PROBLEM 1: HARDWARE INFRASTRUCTURE LIMITATIONS
- Increased cost with dense servers
- TOR bottleneck, limited scalability
- Limited multi-tenancy on GPU servers (limited CPU and memory per user)
- Limited to 8-GPU applications
- Does not support GPU apps with high storage, CPU, and memory requirements

PROBLEM 2: SOFTWARE COMPLEXITY OVERLOAD
- Software Management: GPU driver management; framework & library installation; deep learning framework configuration; package manager; Jupyter server or IDE setup
- Data Management: data uploader; shared local file system; data volume management; data integrations & pipelining
- Model Management: code version management; hyperparameter optimization; experiment tracking; deployment automation; deployment continuous integration
- Workload Management: job scheduler; log management; user & group management; inference autoscaling
- Infrastructure Management: cloud or server orchestration; GPU hardware setup; GPU resource allocation; container orchestration; networking direct bypass; MPI/RDMA/RPI/gRPC; monitoring

Need to Simplify and Scale

SOLUTION 1/2: CONVERGED RACK SOLUTION
- Composable compute bundle
- Up to 64 GPUs per application
- GPU applications with varied storage, memory, and CPU requirements
- 30-50% less cost per GPU
- More {cores, memory} per GPU
- Much greater intra-rack networking bandwidth
- Less inter-rack load
- Composable: add-as-you-go

SOLUTION 2/2: COMPLETE, STREAMLINED AI DEVELOPMENT
- Develop on pre-installed, quickstart deep learning containers.

- Get to work quickly with workspaces with optimized, pre-configured drivers, frameworks, libraries, and notebooks.
- Start with CPUs, and attach Elastic GPUs on-demand.
- All your code and data is saved automatically and is sharable with others.
- Transition from development to training with multiple GPUs.
- Seamlessly scale out to more GPUs on a shared training cluster to train larger models quickly and cost-effectively.
- Support and manage multiple users, teams, and projects.
- Train multiple models in parallel for massive productivity improvements.
- Push trained, finalized models into production.
- Deploy a trained neural network into production and perform real-time inference across different hardware.
- Manage multiple AI applications and inference endpoints corresponding to different trained models.
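The deploy stage above amounts to routing inference requests to whichever trained model backs a named endpoint. A minimal Python sketch of that idea (the class, endpoint names, and toy models are illustrative only, not part of any Dell EMC or Bitfusion product):

```python
# Hypothetical sketch: managing multiple inference endpoints, each backed by
# a different trained model, as the "Deploy" stage describes.

class InferenceRegistry:
    """Maps endpoint names to trained-model callables."""

    def __init__(self):
        self._endpoints = {}

    def deploy(self, name, model_fn):
        # "Push trained, finalized models into production."
        self._endpoints[name] = model_fn

    def infer(self, name, payload):
        # Real-time inference against whichever model backs the endpoint.
        if name not in self._endpoints:
            raise KeyError(f"no such endpoint: {name}")
        return self._endpoints[name](payload)

registry = InferenceRegistry()
registry.deploy("sentiment-v1", lambda text: "pos" if "good" in text else "neg")
registry.deploy("doubler", lambda x: 2 * x)

print(registry.infer("sentiment-v1", "a good result"))  # pos
print(registry.infer("doubler", 21))                    # 42
```

A real deployment would put an HTTP or gRPC server in front of the registry and autoscale the workers behind each endpoint, per the workload-management list above.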

Dell EMC Deep Learning Optimized Servers

Solution stack (top to bottom): vertical segment applications; open source frameworks; optimized libraries; operating system; processor/accelerator; compute platform. Compute platforms: C4130, R730, and C6320P in C6300; accelerators: GPUs, KNL Phi in the C6320P sled, and NVLink GPUs in the C4130.

C4130 DEEP LEARNING SERVER
- Front: (optional) redundant power supplies; dual SSD boot drives
- Back: iDRAC NIC; 2x 1Gb NIC
- Internal: power supplies; GPU accelerators (4); CPU sockets (under heat sinks); 8 fans

GPU DEEP LEARNING RACK SOLUTION

Configuration details:

Feature        | R730                | C4130
CPU            | E5-2669 v3 @ 2.1GHz | E5-2630 v3 @ 2.4GHz
Memory         | 4GB                 | 1TB/node; 64GB DIMMs
Storage        | Intel PCIe NVMe     | Intel PCIe NVMe
Networking IO  | CX3 FDR InfiniBand  | CX3 FDR InfiniBand
GPU            | NA                  | M40-24GB
TOR Switch     | Mellanox SX6036 FDR switch
Cables         | FDR 56G DCA cables

Features:
- Pre-built app containers
- GPU and workspace management
- Elastic GPUs across the datacenter
- Software-defined, scaled-out GPU servers

End-to-End Deep Learning Application Life Cycle: 1. Develop, 2. Train, 3. Deploy

Rack layout: four C4130 GPU nodes (#1-#4) and two R730 CPU nodes (#1-#2) connected through an InfiniBand switch.

...but wait, "converged compute" requires network-attached GPUs...

BITFUSION CORE VIRTUALIZATION

GPU device virtualization allows dynamic GPU attach on a per-application basis.

Features:
- APIs: CUDA, OpenCL
- Distribution: scale-out to remote GPUs
- Pooling: oversubscribe GPUs
- Resource provisioning: fractional vGPUs
- High availability: automatic DMR
- Manageability: remote nvidia-smi
- Distributed CUDA Unified Memory
- Native support for IB, GPUDirect RDMA
- Feature complete with CUDA 8.0
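Pooling, oversubscription, and fractional vGPUs are, at heart, bookkeeping over a shared pool of physical devices. A toy Python sketch of that accounting (a conceptual illustration only, not the Bitfusion implementation or API):

```python
# Conceptual sketch: a pool that hands out fractional shares of physical GPUs,
# allowing oversubscription up to a configurable factor per device.

class GPUPool:
    def __init__(self, num_gpus, oversubscription=2.0):
        # Each physical GPU can be shared up to `oversubscription` worth of vGPUs.
        self.capacity = [oversubscription] * num_gpus
        self.allocations = {}  # app -> list of (gpu_index, fraction)

    def attach(self, app, fraction=1.0, count=1):
        """Attach `count` vGPUs of size `fraction` to an application."""
        grants = []
        for idx, free in enumerate(self.capacity):
            if len(grants) == count:
                break
            if free >= fraction:
                self.capacity[idx] -= fraction
                grants.append((idx, fraction))
        if len(grants) < count:
            # Roll back partial grants and fail the request.
            for idx, frac in grants:
                self.capacity[idx] += frac
            raise RuntimeError("not enough GPU capacity")
        self.allocations.setdefault(app, []).extend(grants)
        return grants

    def detach(self, app):
        for idx, frac in self.allocations.pop(app, []):
            self.capacity[idx] += frac

pool = GPUPool(num_gpus=4, oversubscription=2.0)
pool.attach("train-job", fraction=1.0, count=4)  # whole GPUs
pool.attach("notebook", fraction=0.5, count=2)   # fractional vGPUs on the same hardware
print(pool.capacity)  # [0.5, 0.5, 1.0, 1.0]
```

The point of the sketch is the per-application attach/detach lifecycle: the same physical devices serve a training job and a notebook simultaneously, which is what makes on-demand elastic GPUs possible.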

PUTTING IT ALL TOGETHER

A client server plus multiple GPU servers: Bitfusion Flex provides managed containers; applications use the Bitfusion Client Library, which talks to the Bitfusion Service Daemon on each GPU server.

NATIVE VS. REMOTE GPUs

Native: GPU 0 and GPU 1 sit on the CPU's local PCIe. Remote: GPU 0 is local over PCIe, while GPU 1 sits behind an HCA on another node.

Completely transparent: all CUDA apps see local and remote GPUs as if directly connected.
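The "completely transparent" claim rests on interposing at the API layer: the application calls the same functions either way, and a client library decides whether the backing device is local or remote. A Python sketch of that interposition pattern (all class and method names here are hypothetical; no real CUDA is involved):

```python
# Illustrative sketch of API interposition: a proxy forwards every call to a
# backend, so callers cannot tell a local device from a remote one.

class LocalBackend:
    def memcpy(self, data):
        return list(data)  # stand-in for a local device copy

class RemoteBackend:
    def __init__(self, host):
        self.host = host
    def memcpy(self, data):
        # A real client library would serialize this call over InfiniBand or
        # TCP to a service daemon; here we just forward it.
        return list(data)

class TransparentDevice:
    """Callers use one object; the transport behind it is interchangeable."""
    def __init__(self, backend):
        self._backend = backend
    def __getattr__(self, name):
        return getattr(self._backend, name)

local = TransparentDevice(LocalBackend())
remote = TransparentDevice(RemoteBackend("gpuserver-01"))
assert local.memcpy([1, 2, 3]) == remote.memcpy([1, 2, 3])  # same interface, same result
```

The design choice this illustrates: because interception happens below the application's API, unmodified CUDA programs can run against remote GPUs with no code changes.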

Results

REMOTE GPUs: LATENCY AND BANDWIDTH
- Data movement overhead is the primary scaling limiter.
- Measurements were done at the application level, via cudaMemcpy.
- Fast local GPU copies; intranode copies over PCIe.
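Why small copies suffer follows from the usual first-order cost model, t = latency + size/bandwidth: below a certain message size the fixed latency dominates and effective bandwidth collapses. A quick Python illustration with assumed (not measured) link parameters:

```python
# Back-of-envelope copy-cost model. The latency/bandwidth numbers below are
# illustrative stand-ins for a TCP/IP path vs. an RDMA-style OS-bypass path,
# not the measured Dell/Bitfusion results.

def transfer_time(size_bytes, latency_s, bandwidth_Bps):
    return latency_s + size_bytes / bandwidth_Bps

def effective_bandwidth(size_bytes, latency_s, bandwidth_Bps):
    return size_bytes / transfer_time(size_bytes, latency_s, bandwidth_Bps)

GB = 1e9
tcp  = dict(latency_s=50e-6, bandwidth_Bps=4 * GB)  # high-latency software stack
rdma = dict(latency_s=2e-6,  bandwidth_Bps=6 * GB)  # OS bypass, low latency

for size in (4 * 1024, 1024 * 1024, 256 * 1024 * 1024):
    bw_tcp = effective_bandwidth(size, **tcp) / GB
    bw_rdma = effective_bandwidth(size, **rdma) / GB
    print(f"{size:>10} B  tcp {bw_tcp:6.3f} GB/s   rdma {bw_rdma:6.3f} GB/s")
```

With these assumed numbers, a 4 KB copy over the TCP-like path reaches only a few percent of link bandwidth while large copies approach line rate, which is exactly why latency-sensitive AI apps need OS bypass.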

16-GPU virtual system: naive implementation with TCP/IP on the C4130
- Fast local GPU copies
- Intranode copies via PCIe
- Low-bandwidth, high-latency remote copies
- OS bypass is needed to avoid the primary TCP/IP overheads
- AI apps are very latency sensitive

16-GPU virtual system: Bitfusion optimized transport and runtime
- Same FDR x4 transport, but drop IPoIB
- Replace remote calls with native IB verbs
- Runtime selection of intranode RDMA vs. cudaMemcpy
- Multi-rail communications where available
- Remote =~ native local GPUs
- Minimal NUMA effects
- Runtime optimizations: pipelining, speculative execution, distributed caching & event coalescing, ...

SLICE & DICE: MORE THAN ONE WAY TO GET 4 GPUs

Benchmarks: Caffe GoogleNet and TensorFlow Pixel-CNN on the R730 and C4130.

- Native GPU performance with network-attached GPUs
- Run-time comparison (lower is better)
- Multiple ways to create a virtual 4-GPU node with native efficiency (seconds to train Caffe GoogleNet, batch size: 128)

TRAINING PERFORMANCE

- Continued strong scaling: Caffe GoogleNet
- Weak scaling
- Accelerated hyperparameter optimization: Caffe GoogleNet; TensorFlow 1.0 with Pixel-CNN
- Reported efficiencies: 74%, 73%, 55%, 53%, 86%
- (Charts compare native vs. remote GPU runs at 1, 2, 4, 8, and 16 GPUs on the R730 and C4130, annotated with the PCIe host bridge limit.)
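Efficiency percentages like those above are derived from measured runtimes. A short Python sketch of the arithmetic for strong and weak scaling, using invented timings rather than the Dell/Bitfusion measurements:

```python
# How scaling-efficiency percentages on a chart like this are typically
# computed. All times below are made up for illustration.

def strong_scaling_efficiency(t1, tn, n):
    # Fixed total problem size: ideal time on n GPUs is t1 / n.
    return (t1 / tn) / n

def weak_scaling_efficiency(t1, tn):
    # Problem size grows with n: ideal time stays t1.
    return t1 / tn

t1 = 1000.0                                     # seconds on 1 GPU (illustrative)
times = {2: 540.0, 4: 290.0, 8: 160.0, 16: 95.0}
for n, tn in times.items():
    eff = strong_scaling_efficiency(t1, tn, n)
    print(f"{n:>2} GPUs: speedup {t1/tn:5.2f}x, efficiency {eff:.0%}")
```

The same arithmetic applies to hyperparameter sweeps: running k independent trials on k GPUs is embarrassingly parallel, so its efficiency stays near the weak-scaling bound.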

Other PCIe GPU configurations are available; Config 'G' is currently being tested.

Further reading:
/techcenter/high-performance-computing/b/general_hpc/archive/2016/11/11/deep-learning-performance-with-p100-gpus
http:///techcenter/high-performance-computing/b/general_hpc/archive/2017/03/22/deep-learning-inference-on-p40-gpus

NvLink Configurations
- Config 'K': 4x P100-16GB SXM2 GPUs (SXM2 #1-#4), 2 CPUs, PCIe switch, 1 PCIe slot with EDR IB
- Config 'L': 4x P100-16GB SXM2 GPUs (SXM2 #1-#4), 2 CPUs, PCIe switch, 1 PCIe slot with EDR IB
- Memory: 256GB with 16GB DIMMs @ 2133; OS: Ubuntu 16.04; CUDA: 8.1

Software Solutions

Overview: Bright ML

Dell EMC has partnered with Bright Computing to offer their Bright ML package as the software stack on the Dell EMC deep learning hardware solution.

Bright ML Overview

Machine Learning in Seismic Imaging Using KNL + FPGA - Project #1
Bhavesh Patel, Server Advanced Engineering
Robert Dildy, Product Technologist Sr. Consultant, Engineering Solutions

Abstract

This paper focuses on how to apply machine learning to seismic imaging using an FPGA as a co-accelerator. It covers two hardware technologies, 1) the Intel KNL Phi and 2) FPGAs, and also addresses how to use machine learning for seismic imaging. There are different types of accelerators, such as GPUs and the Intel Phi, but we chose to study how the i-ABRA platform on KNL + FPGA can be used to train a neural network on seismic imaging data and then perform inference. Machine learning in a broader sense can be divided into two parts: training and inference.
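The training/inference split can be made concrete with a toy model: training is the iterative, compute-heavy fit; inference is a single cheap forward pass with frozen weights. A minimal NumPy sketch (nothing here is specific to KNL, FPGA, or the i-ABRA platform):

```python
# Toy logistic regression illustrating the two phases of machine learning:
# an iterative training loop, then lightweight inference with fixed weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable toy labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- Training: repeatedly update weights by gradient descent ---
w = np.zeros(2)
for _ in range(500):
    p = sigmoid(X @ w)
    w -= 0.1 * X.T @ (p - y) / len(y)

# --- Inference: one cheap forward pass with frozen weights ---
def predict(x):
    return sigmoid(x @ w) > 0.5

acc = (predict(X) == y.astype(bool)).mean()
print(f"training accuracy: {acc:.2f}")
```

This asymmetry is why the two phases are often mapped to different hardware: training wants maximum throughput, while inference wants low latency per sample.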

Background

Seismic imaging is a standard data processing technique used to create an image of subsurface structures of the Earth from measurements recorded at the surface, via seismic wave propagation captured from various sound energy sources.

There are certain challenges with seismic data interpretation; for example, 3D is starting to replace 2D for seismic interpretation. There has been rapid growth in the use of computer vision technology, and several companies are developing image recognition platforms. This technology is being used for automatic photo tagging and classification.
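Treating seismic interpretation as image recognition typically starts by cutting a 2D section into patches a classifier can consume. A hypothetical preprocessing sketch in NumPy (the patch size and stride are arbitrary choices for illustration, not values from the paper):

```python
# Hypothetical preprocessing: slide a window over a 2D seismic section and
# stack the crops into a batch for an image-classification network.
import numpy as np

def extract_patches(section, patch=32, stride=16):
    """Cut a 2D array into overlapping square patches."""
    h, w = section.shape
    patches = []
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            patches.append(section[i:i + patch, j:j + patch])
    return np.stack(patches)

section = np.random.default_rng(1).normal(size=(128, 128))  # stand-in for real traces
batch = extract_patches(section)
print(batch.shape)  # (49, 32, 32): 7 x 7 window positions
```

Each patch could then be labeled (for example, fault vs. no fault) and fed to a network for training, mirroring how photo-tagging classifiers are built.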
