Intelligent Edge Computing: Making Intelligence Ubiquitous

Computing paradigm shifts
- The pendulum swings between centralized and distributed computing: mainframe → personal computing → intelligent cloud → intelligent cloud + intelligent edge.

Distributed devices and data
- [Figure: data generated at the edge every day — a person ~1.5 GB/day; 20B IoT smart devices; smart home; smart city; autonomous vehicle ~5 TB/day; connected factory ~1 PB/day; stadium ~200 TB per game; smart office ~150 GB/day; per-day volumes of 50 GB and 250 PB also appear in the callouts.]

The call for intelligence (DL) on the edge
- Data explosion from fast-growing edge devices
  - E.g., smart surveillance cameras, self-driving cars
- Strong needs for on-device intelligence
  - Low latency
  - High availability and reliability
  - Strong privacy protection
  - Low cost

Intelligent cloud, intelligent edge
- Edge devices are becoming increasingly powerful
- Emerging high-performance, low-power, low-cost AI ASICs
- Empower every app and device with AI/DL

AI-empowered diverse devices and applications everywhere
- Affordable AI models tailored for diverse devices
- Highly optimized software stack and efficient hardware for AI
- Security and privacy, model protection, explainable AI, debugging
- On-device, continuous, collaborative learning (hardware in the loop)

Innovations of the on-device DL stack
- Efficient neural network (NN) design
- Edge NN frameworks
- AI chips: Edge TPU, KPU, VPU, HPU, NPU

NN design and deployment
- NN design: manual design, NAS, pruning; design space: # of layers, op structure, channels, ...; constraints (e.g., FLOPs)
- Model deployment: quantization (re-quantize, dequantization), Conv/BN/ReLU graph transformations, targeting CPU, GPU, DSP, TPU, NPU, ...
- There is a gap between design and deployment: current NN design does not consider platform features.

Do fewer FLOPs mean less latency?
- On the Edge TPU, MobileNetEdgeTPU (990 MFLOPs) runs in 3.6 ms with 75.6% accuracy, while MobileNetV3 (209 MFLOPs) takes 4 ms with 74.7% accuracy.
- Fewer FLOPs do not imply lower latency, but chasing FLOPs can harm model accuracy.

Does a fast model run fast on every hardware?
- MobileNetV3 is 25% faster than MobileNetV2 on a Cortex-A76 CPU, yet MobileNetV2 is 71% faster than MobileNetV3 on a VPU.

To Bridge Neural Network Design and Real-World Performance: A Behavior Study for Neural Networks
- Paper published at MLSys 2021.

Goal
- A measurement study to answer the following three questions:
  1. What are the behavior characteristics that show an inconsistent latency response to changes in the OPs and memory accesses of a configuration in the design space?
  2. What are the root causes of these unexpected characteristics?
  3. What are the implications of these characteristics for efficient-NN design?

Methodology
- Profiling on 7 edge AI platforms: Cortex CPU (TFLite), Adreno GPU (TFLite), Edge TPU (TFLite), DSP (SNPE), VPU (OpenVINO), NPU (RKNN), KPU (NNCASE).
- Measurement tool: generate a single-block model in TF → convert it to the target graph and precision → collect timing results on the target device → profile.

Covered design dimensions
- The scaling of each NN design dimension:
  - Operator/block type: normal operators (Conv, FC, ...), elementwise ops (Add, Pooling, ...), activations (ReLU, Sigmoid, Swish, ...), blocks (MobileNet/ShuffleNet blocks, ...)
  - Kernel size: {1, 3, 5, 7}
  - Stride: {1, 2}
  - Height/width: {3, ..., 224}
  - # of Conv channels (C_in / C_out): {3, ..., 1000}
  - Precision: INT8, FP16, ...

Do more Conv channels increase latency?
- Finding 1: the latency of Conv increases in a step pattern, rather than linearly, with the number of output channels.
- [Figure: latency (Y axis) vs. output channel number (X axis); input feature map 28x28, input channels 320, kernel 3x3, stride 1.]

Do more Conv channels increase latency?
- Cause: the input tensors are padded to fully utilize the hardware's data-level parallelism (SIMD units on the CPU, vector units on the DSP, SIMT on the GPU, etc.).
- In the matrix-multiplication implementation of Conv, the K²·C_in by H·W matrices and the output channels are padded to a multiple of the hardware block size (pad to 8·n for the [8,1] x [1,8] basic block of the CPU SIMD units).
- [Figure: input feature map, convolution kernel, and output feature map being padded before the matrix multiplication.]
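
The step pattern can be reproduced with a back-of-the-envelope calculation. Below is a minimal sketch (not from the paper) that assumes the backend pads the output-channel dimension to a multiple of the vector width (8 in the slide's example) and that latency is roughly proportional to the padded amount of work; real backends may use a different padding rule.

    import math

    def padded_channels(c_out: int, vector_width: int = 8) -> int:
        # Channels after padding to a multiple of the hardware vector width.
        return math.ceil(c_out / vector_width) * vector_width

    def relative_conv_cost(c_out: int, vector_width: int = 8) -> int:
        # The work actually executed is set by the padded channel count,
        # so every channel count inside the same step costs the same.
        return padded_channels(c_out, vector_width)

    if __name__ == "__main__":
        for c in range(118, 131):
            print(c, "->", relative_conv_cost(c))
        # 118-120 all map to 120, 121-128 all map to 128: a step, not a line.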

Do more Conv channels increase latency?
- Implication: for potentially higher accuracy, it is encouraged to keep only the largest number of channels within each latency step of the NN design space and skip the other choices.
- Previous channel-number choices: ..., 6, 8, 10, 12, 14, 16, 18, 20, ...; the reduced choices keep only the channel counts at the top of each latency step.
- E.g., MetaPruning: the channel search space shrinks from 30^14 to 4^14 (14 layers, each layer with 30 channel candidates). A quick computation of this reduction is shown below.
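
To see how much this shrinks the channel search space, here is a one-line computation for the MetaPruning example above (14 layers, 30 candidates per layer reduced to 4); the candidate counts are the ones quoted on the slide.

    layers = 14
    full_space = 30 ** layers      # all 30 channel choices per layer
    reduced_space = 4 ** layers    # only the step-boundary choices per layer
    print(f"{full_space:.2e} -> {reduced_space:.2e}, "
          f"reduction factor {full_space / reduced_space:.1e}")
    # roughly 4.78e+20 -> 2.68e+08, about 1.8e+12x smaller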

Does a building block have similar relative latency on different NN platforms?
- Finding 2: the relative latency of a building block varies greatly across platforms.
- [Figure: latency of DenseBlock, MobileNetV2 block + SE, MobileNetV2 block, and ShuffleNetV2 block versus their FLOPs on CPU, GPU, VPU, DSP, TPU, and KPU.]

Does a building block have similar relative latency on different NN platforms?
- Cause 1: the mismatch between computation and memory bandwidth is severe.
- Data-reuse rate: DenseBlock 44.51, MobileNetV2 block + SE 7.58, MobileNetV2 block 4.73, ShuffleNetV2 block 0.81; on a Snapdragon 855 (Mi 9), the GPU delivers 508 GFLOP/s and the CPU 22.7 GFLOP/s, while memory bandwidth is only 23 GFloat/s.
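
The data-reuse rate on this slide is essentially arithmetic intensity (FLOPs per float moved). A minimal sketch of the comparison, reusing the slide's Snapdragon 855 numbers; the framing is the usual roofline argument, which is an assumption about how the slide's numbers are meant to be read: a block whose reuse rate is far below the compute/bandwidth ratio is memory-bound.

    # Peak numbers quoted on the slide for the Snapdragon 855 (Mi 9).
    gpu_gflops = 508.0        # GFLOP/s
    cpu_gflops = 22.7         # GFLOP/s
    mem_bw_gfloat = 23.0      # GFloat/s of memory bandwidth

    # Data-reuse rate (FLOPs per float moved) of each block, from the slide.
    reuse = {
        "DenseBlock": 44.51,
        "MobileNetV2Block+SE": 7.58,
        "MobileNetV2Block": 4.73,
        "ShuffleNetV2Block": 0.81,
    }

    for name, r in reuse.items():
        # Attainable GFLOP/s on the GPU is capped by min(peak, reuse * bandwidth).
        attainable = min(gpu_gflops, r * mem_bw_gfloat)
        bound = "compute-bound" if attainable == gpu_gflops else "memory-bound"
        print(f"{name:22s} attainable ~{attainable:6.1f} GFLOP/s on GPU ({bound})")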

Does a building block have similar relative latency on different NN platforms?
- Cause 2: support for non-Conv operators is weak on the NN platforms, except on the CPU.
- Global pooling is inefficient in the MobileNetV2 + SE block on the GPU: pooling accounts for < 5% of the OPs but > 70% of the time (71.7% of block latency).
- [Figure: OPs (%) vs. latency (%) of the operators in the Squeeze-and-Excitation block — 3x3 DWConv/BN/ReLU6, global pooling, FC, ReLU, FC, Sigmoid, Multiply.]

Does a building block have similar relative latency on different NN platforms?
- Implication: it is encouraged to customize the set of candidate blocks in the NN design space for each platform.
- [Figure: a customized search space of candidate modules for each platform (CPU, GPU, DSP).]

Summary of major findings
- # of channels: the latency of Conv increases in a step pattern with the number of output channels.
- Block: the relative latency of an NN block varies greatly across platforms.
- Activation function: activation functions can have a big impact on latency, particularly Swish and HardSwish.
- Kernel size: Conv latency increases much less with kernel size on AI accelerators than on the CPU.
- Quantization: INT8 achieves > 11x speedup on the NPU, while the CPU achieves only < 3.6x; INT8 can dramatically decrease the inference accuracy of various models.
- General: considering operator support, accuracy, and latency, the CPU is still a good choice for inference.

forinferenceEfficient

NNdesign

must

considerhardware

characteristics.Howto

getagoodmodel?Efficient

NNdesignfordiverse

edgehardwareProfiling

andmodelingModeldeploymentNNDesignEdgeTPUEdgeTPUModelsVPUHPUNPUKPUVPUHPUNPUKPUManualDesignDesignSpace:#of

layers,opstructure,channel,

…constraints(e.g.,

FLOPs)latency,

energyHW-specificpredictorsof

latencyNASandenergyPruningnn-Meter:

nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices
- Targets Cortex CPU, Adreno GPU, and VPU.
- Paper published at MobiSys 2021 (Best Paper Award).

Existing work on latency prediction
- FLOPs-based prediction
  - Pros: very simple
  - Cons: not a direct metric of inference latency
- Operator-level prediction
  - Pros: stable primitive operators (conv2d, pooling, activations, ...)
  - Cons: unaware of graph-level optimizations
- Model-level prediction
  - Pros: learns graph-level optimizations automatically
  - Cons: cannot generalize to unseen model structures

nn-Meter: build an accurate latency predictor
- Take graph-level optimizations into consideration
- Generalization ability

Challenge: framework optimizations
- Backend-independent optimizations: constant folding, common subexpression elimination, ...
- Backend-dependent optimizations: operator fusion, ...
- A designed model passes through both kinds of optimization before reaching a backend: CPU backend 1 (e.g., Eigen lib), CPU backend 2 (e.g., NNPACK lib), GPU backend (e.g., OpenCL), Movidius backend, ...

Impact of operator fusion
- Operator fusion has a great impact on inference latency.
- Model graph (Conv → Activation) vs. backend implementation: without fusion the backend runs two kernels and materializes the intermediate tensor; with fusion a single kernel does both.

    // Unfused backend kernels: Conv followed by a separate activation pass.
    _kernel conv_2d_1x1() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (cout = 0; cout < out.chan; cout++)
                    for (cin = 0; cin < in.chan; cin++)
                        out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
    }
    _kernel active() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (c = 0; c < out.chan; c++)
                    out[i][j][c] = active(in[i][j][c]);
    }

    // Operator fusion: Conv + activation in one kernel, no intermediate tensor.
    _kernel conv_2d_1x1_active() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (cout = 0; cout < out.chan; cout++) {
                    for (cin = 0; cin < in.chan; cin++)
                        out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
                    out[i][j][cout] = active(out[i][j][cout]);
                }
    }

- [Figure: MobileNetV2 as an example model graph.]

nn-Meter: kernel-level latency prediction
- Kernel: the basic execution unit on a device; it can be a single operator or a fusion of multiple operators.
- Divide a whole model into kernels and conduct kernel-level prediction.
- Model latency is the sum of all kernel latencies: model → kernel detector → kernels → kernel latency predictors → sum of kernel latencies.
- Problems:
  1. How to detect kernels? (kernel detection)
  2. How to predict accurately for each kernel? (adaptive data sampling)
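
A minimal sketch of the kernel-level prediction idea: split a model into kernels with a detector, predict each kernel's latency with a per-kernel regressor, and sum. The detect_kernels and predict_kernel_latency helpers below are hypothetical placeholders, not nn-Meter's API.

    from typing import Callable, Iterable

    def predict_model_latency(
        model_graph,
        detect_kernels: Callable[[object], Iterable[dict]],
        predict_kernel_latency: Callable[[dict], float],
    ) -> float:
        # Kernel-level prediction: model latency = sum of kernel latencies.
        kernels = detect_kernels(model_graph)     # e.g. Conv+BN+ReLU fused kernels
        return sum(predict_kernel_latency(k) for k in kernels)

    if __name__ == "__main__":
        # Toy usage with stand-in callables and made-up per-kernel latencies (ms).
        toy_kernels = [{"type": "conv-bn-relu"},
                       {"type": "dwconv-bn-relu"},
                       {"type": "global-pool"}]
        latency = predict_model_latency(
            model_graph=None,
            detect_kernels=lambda g: toy_kernels,
            predict_kernel_latency=lambda k: {"conv-bn-relu": 1.2,
                                              "dwconv-bn-relu": 0.6,
                                              "global-pool": 0.1}[k["type"]],
        )
        print(f"predicted model latency: {latency:.1f} ms")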

nn-Meter tech #1: automatic kernel detector
- Fusion rule detection for black-box devices:
  - For every two operators op1 and op2, generate 3 test graphs — op1 alone, op2 alone, and op1 followed by op2 — and measure their latencies T_op1, T_op2, and T_(op1,op2).
  - Compare the latency difference: op1 and op2 are fusible if T_op1 + T_op2 - T_(op1,op2) > α · min(T_op1, T_op2).
- Kernel search by the fusion rules:
  - Apply the detected fusion rules to find the maximal sets of fused operators in the target model (e.g., a ResNet18 block).
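
A minimal sketch of the fusion-rule test described above: benchmark the three generated graphs on the black-box device and apply the inequality. The alpha threshold value and the measure_latency callable are assumptions standing in for the real measurement harness.

    def ops_are_fused(t_op1: float, t_op2: float, t_op1_op2: float,
                      alpha: float = 0.5) -> bool:
        # op1 and op2 are judged fusible if running them back-to-back saves
        # clearly more time than the cheaper of the two single-op graphs.
        saving = t_op1 + t_op2 - t_op1_op2
        return saving > alpha * min(t_op1, t_op2)

    def detect_fusion_rule(op1: str, op2: str, measure_latency) -> bool:
        # Generate the 3 test graphs and compare the measured latencies.
        t1 = measure_latency([op1])
        t2 = measure_latency([op2])
        t12 = measure_latency([op1, op2])
        return ops_are_fused(t1, t2, t12)

    if __name__ == "__main__":
        # Toy measurements (ms): conv+relu runs almost as fast as conv alone.
        fake = {("conv",): 2.0, ("relu",): 0.4, ("conv", "relu"): 2.1}
        print(detect_fusion_rule("conv", "relu",
                                 lambda ops: fake[tuple(ops)]))  # -> True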

Kernel-latency prediction: challenges
- Large sample space: across 24 widely used CNN models collected from the PyTorch model zoo, Conv alone has an enormous number of configurations to sample.
- Non-linear latency on edge devices: random sampling misses crucial data points.

nn-Meter tech #2: adaptive data sampler
- Sample the most beneficial data (kernel configurations) instead of sampling at random.
- Sample configurations that are likely to be considered in model design: a prior probability distribution learned from the model zoo.
- Fine-grained sampling around data points with inaccurate predictions.
- [Figure: (1) configurations considered in model design are drawn from the prior probability distribution; (2) a fine-grained data sampler revisits data with large errors; the sampled data and measured latencies train the regression model.]
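
A minimal sketch of the adaptive sampling loop: draw kernel configurations from a prior fitted to the model zoo, train a regressor, then sample more finely around configurations with large prediction error. Everything below — the Gaussian prior, the error threshold, the nearest-neighbour regressor, and the stand-in "device" — is an illustrative assumption, not nn-Meter's exact procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    def true_latency(c_out):
        # Stand-in device: latency steps every 8 output channels.
        return np.ceil(c_out / 8.0) * 8.0 * 0.01

    def fit_predictor(x, y):
        # Toy regressor: piecewise-constant nearest-neighbour lookup.
        order = np.argsort(x)
        xs, ys = x[order], y[order]
        return lambda q: ys[np.abs(xs[None, :] - q[:, None]).argmin(axis=1)]

    # 1) Sample configurations from a prior centred on channel counts that
    #    actually occur in a model zoo (assumed Gaussian here).
    x = np.clip(rng.normal(loc=128, scale=64, size=60), 3, 1000).round()
    y = true_latency(x)

    for _ in range(3):
        predict = fit_predictor(x, y)
        # 2) Find configurations where the current predictor is inaccurate ...
        test = rng.integers(3, 1000, size=200).astype(float)
        err = np.abs(predict(test) - true_latency(test)) / true_latency(test)
        worst = test[err > 0.1]
        # 3) ... and sample finely around them, then measure and retrain.
        new = np.clip(worst[:, None] + np.array([-2, -1, 1, 2]), 3, 1000).ravel()
        x, y = np.concatenate([x, new]), np.concatenate([y, true_latency(new)])

    print(f"final training set size: {len(x)}")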

nn-Meter evaluation
- Prediction accuracy: 99.0% (CPU), 99.1% (Adreno 640 GPU), 99.0% (Adreno 630 GPU), and 83.4% (Intel VPU).
- Generalization performance on unseen model graphs:
  - Comparison baselines: FLOPs, FLOPs+MAC, BRP-NAS (GCN).
  - On average, nn-Meter achieves 89.2%, significantly better than FLOPs (22.1%), FLOPs+MAC (17.1%), and BRP-NAS (8.5%).

Efficient NN design for diverse edge hardware
- [Figure: the same design pipeline as before — hardware-specific predictors of latency and energy guide NAS and pruning, followed by deployment to Edge TPU, VPU, HPU, NPU, KPU, ...]

We got a good model. How does it run on real devices?

Are computing resources fully utilized?
- [Figure: Adreno GPU ALU utilization and ARM CPU utilization for CNN inference; big-core and little-core utilization are far from full, with values around 30%, 84%, and 90% shown.]
- Low hardware utilization results in poor inference speed.

AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs
- Paper published at MobiCom 2021.

Why is utilization low on the CPU?
- Unbalanced task distribution by the OS, both across and within core clusters.
- [Figure: CPU utilization for CNN inference — computation tasks spread unevenly over the big-core cluster (B0-B3) and the little-core cluster (L0-L3), with utilization around 90% on one cluster and 30% on the other.]

Why is the distribution unbalanced on the CPU?
- Execution flow of matrix multiplication:
  1. Block partition for parallelism: the M x K parameter matrix and the K x N feature-map matrix are split into mc x kc and kc x nc blocks, one task per block.
  2. Copy blocks into contiguous memory space.
  3. Schedule tasks to the thread queues (Q0, Q1, ..., Q#) of the intra-op thread pool.
- Problems: the partition ignores hardware asymmetry and resource constraints, the copy is redundant, and the scheduling ignores hardware asymmetry, data locality, and the interference-prone environment.

AsyMo: optimize DL inference on big.LITTLE CPUs
- Accelerate edge DL inference with lower energy cost.
- One-run initialization: cost-model-directed block partition; prearranged memory layout for parameters.
- Inference: asymmetry-aware scheduling of tasks (by task thread ID) onto the intra-op thread pool, for CNN/RNN models.
- Efficient frequency: data-reuse-based CPU frequency setting.

Cost-model-based block partition
- Cost for a task = computation cost + memory-access cost.
- Cost_seq: the cost of one sequential unit (one task on one core).
- Cost for the parallel part: Cost_par = (number of parallel tasks x Cost_seq) / degree of parallelism.
- Other cost: the unparallelized part + task scheduling + framework overhead.
- Total cost: Cost_total = Cost_par + Cost_other; the block partition is chosen to minimize it.
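
A minimal sketch of how a cost model of this shape can pick a block size: estimate per-task compute and memory cost from the block dimensions, then choose the (mc, kc) that minimizes total cost for a given degree of parallelism. The constants, candidate block sizes, and cost terms below are placeholders; AsyMo's actual model is calibrated per platform.

    import math

    def task_cost(mc, kc, nc, flops_per_cycle=8.0, bytes_per_cycle=4.0):
        # Cost (in cycles) of one mc x kc x nc block: computation + memory access.
        compute = (2.0 * mc * kc * nc) / flops_per_cycle
        memory = 4.0 * (mc * kc + kc * nc + mc * nc) / bytes_per_cycle  # float32 tiles
        return compute + memory

    def total_cost(M, K, N, mc, kc, parallelism, other_cost_per_task=500.0):
        # Total = parallel part (tasks * Cost_seq / parallelism) + other overheads.
        nc = N  # keep the full N dimension per task for simplicity
        n_tasks = math.ceil(M / mc) * math.ceil(K / kc)
        cost_seq = task_cost(mc, kc, nc)
        cost_par = n_tasks * cost_seq / parallelism
        cost_other = n_tasks * other_cost_per_task
        return cost_par + cost_other

    if __name__ == "__main__":
        M, K, N = 1024, 1024, 256
        candidates = [(32, 32), (64, 64), (128, 128), (256, 256)]
        best = min(candidates, key=lambda b: total_cost(M, K, N, *b, parallelism=4))
        print("chosen (mc, kc):", best)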

Optimized execution flow of matrix multiplication
- One-run initialization: block partition and parameter layout are done once, ahead of inference.
- Inference run: copy features, schedule the tasks, and run with each thread pinned to a core.
- Blocks are sized separately for the big-core and the little-core cluster: better data locality, and no work stealing from big cores to little cores.

Total performance and energy improvement
- AsyMo vs. TensorFlow on a Kirin 970 with Android 9 Pie.
- [Figure: performance and energy-efficiency bars; AsyMo improves both, with factors around 1.6x-1.9x shown, and pre-copying parameters enables the parallel implementation.]
- Compared settings: TensorFlow at the OS frequency setting, AsyMo at its picked efficient CPU frequency, and both at the maximum CPU frequency.

Sparseflow: unleash the full potential of sparsity in deep learning
- Joint work with Chen Zhang et al.

Today's DNN models are huge
- GPT-3: 175B parameters, ~$12M training cost.
- MT-NLG: 530B parameters, trained on 560 DGX A100 servers.

Computation is the engine behind AI's success, and we still need more
- [Figure: performance (op/s) from 1960 to 2019 — ENIAC ~5 Kops, Moore's-law CPU scaling, Xeon E5 ~500 Gops, GPUs (V100 125 Tops), and dedicated hardware (TPUv1 90 Tops, TPUv3 360 Tops).]

Piling up hardware is not sustainable: the energy-efficiency wall
- [Figure: energy efficiency from 1995 to 2020 on a log scale; CPU, GPU, and TPU each hit their own energy-efficiency wall.]

Sparsity is the key to the human brain's efficiency
- We do not look at everything in our visual scope.
- Simple geometric shapes are enough for us to recognize a cat.

Weight pruning
- Prune away small weights: MxV → SpMxV.
- Unstructured sparse matrices are difficult to accelerate.
- Han, Song, et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS'15.

Accuracy and speedup trade-off
- Fine-grained / irregular sparsity
  - Pros: high model accuracy, high compression ratio
  - Cons: irregular pattern, difficult to accelerate
- Coarse-grained / regular sparsity
  - Pros: regular pattern, easy to accelerate
  - Cons: low model accuracy, low compression ratio

How to achieve both?
- Model accuracy: add few constraints on the sparsity pattern.
- Speedup: matrix partitioning for parallel computing; eliminating irregular computation and memory access.
- S. Cao et al., "Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity", FPGA'19.

Bank-Balanced Pruning
- Split each dense matrix row into equal-sized banks and traverse all rows.
- Apply fine-grained pruning inside each bank, with a threshold percentage chosen so that every bank ends up with an identical sparsity ratio.
- Example row (elements 0-15): [0.8, -0.1, 0.2, 1.5 | 1.0, 0.3, -0.4, -1.4 | 0.7, 2.0, 0.9, -0.5 | 1.2, -1.3, 2.1, 0.2] → the BBS row keeps [0.8, 1.5 | 1.0, -1.4 | 2.0, 0.9 | -1.3, 2.1], the two largest-magnitude values of each bank.
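
A minimal numpy sketch of bank-balanced pruning on the example row above: split the row into equal banks, and inside each bank keep only the largest-magnitude values so that every bank has the same sparsity ratio. A bank size of 4 and 50% sparsity reproduce the slide's example; these parameters are otherwise free choices.

    import numpy as np

    def bank_balanced_prune(row: np.ndarray, num_banks: int, sparsity: float) -> np.ndarray:
        # Keep the top-(1 - sparsity) fraction of weights, by magnitude, in every bank.
        banks = row.reshape(num_banks, -1)
        keep = int(round(banks.shape[1] * (1.0 - sparsity)))
        pruned = np.zeros_like(banks)
        for b, bank in enumerate(banks):
            idx = np.argsort(np.abs(bank))[-keep:]   # indices of largest magnitudes
            pruned[b, idx] = bank[idx]
        return pruned.reshape(-1)

    row = np.array([0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4,
                    0.7, 2.0, 0.9, -0.5, 1.2, -1.3, 2.1, 0.2])
    print(bank_balanced_prune(row, num_banks=4, sparsity=0.5))
    # non-zeros per bank: [0.8, 1.5], [1.0, -1.4], [2.0, 0.9], [-1.3, 2.1]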

Bank-Balanced Sparsity (BBS)
- Bank partitioning for parallel computing.
- Fine-grained pruning inside each bank for maintaining accuracy.

Sparse matrix-vector multiplication (SpMxV) with BBS
- Both inter-row and inter-bank parallelism.
- Load balancing across rows and banks.
- Conflict-free accesses to the dense vector.
- [Figure: two matrix rows split into banks 0-3 (non-zeros A-P) multiplied against a dense vector V0-V11, with each matrix bank served by its own vector bank.]

Our CSB (Compressed Sparse Banks) format
- Data rearrangement for inter-bank parallelization: the k-th non-zero of every bank is stored side by side, so all banks can be processed in lockstep.
- Stored arrays: VALUES (e.g., A C E G B D F H I K M O J L N P) and BANK-INTERNAL INDICES, which map directly to physical BRAM addresses.
- Specifically designed for BBS to eliminate decoding overheads.
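
A minimal sketch of the CSB idea for one row: interleave the kept values bank by bank (so the k-th non-zero of every bank sits next to its peers and can feed parallel lanes), and store only each value's index inside its own bank. This follows the description above; the real hardware format additionally maps the indices onto physical BRAM addresses.

    import numpy as np

    def to_csb(row: np.ndarray, num_banks: int):
        # Encode one bank-balanced row: values interleaved across banks,
        # plus each value's index inside its own bank.
        banks = row.reshape(num_banks, -1)
        nz_idx = [np.flatnonzero(b) for b in banks]   # same count per bank (BBS)
        nnz_per_bank = len(nz_idx[0])
        values, indices = [], []
        for k in range(nnz_per_bank):                 # k-th non-zero of every bank
            for b in range(num_banks):
                values.append(banks[b, nz_idx[b][k]])
                indices.append(int(nz_idx[b][k]))     # bank-internal index
        return np.array(values), np.array(indices)

    # Reusing the pruned row from the bank-balanced pruning sketch earlier:
    pruned = np.array([0.8, 0, 0, 1.5, 1.0, 0, 0, -1.4,
                       0, 2.0, 0.9, 0, 0, -1.3, 2.1, 0])
    vals, idxs = to_csb(pruned, num_banks=4)
    print(vals)   # [ 0.8  1.0  2.0 -1.3  1.5 -1.4  0.9  2.1]
    print(idxs)   # [0 0 1 1 3 3 2 2]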

Accelerator overview
- [Figure: FPGA design — host server connected over PCIe; controller and instruction buffer; SpMxV PEs with private vector buffers, multipliers, adder trees, and activation units; matrix memory and vector memory fed by DMA from off-chip DRAM.]

Results
- Model accuracy: very close to the dense baseline on a language model (PTB dataset) and on speech recognition (TIMIT dataset).
- Hardware efficiency: improvements of roughly 7x and 34x.

SeerNet: Predicting CNN Feature-Map Sparsity through Low-Bit Quantization
- S. Cao et al., "SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization", CVPR'19.
- Accelerate model inference by exploiting feature-map sparsity:
  - ReLU: y = max(0, x) zeroes out every negative activation.
  - Max-pooling: y = max(x_i | i = {1, 2, ...}) keeps only the maximum of each window.
- [Figure: a convolution output followed by ReLU or max-pooling, showing how many of the computed values are zeroed or discarded.]
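
A minimal numpy sketch of the underlying idea: run a cheap low-bit version of the convolution to predict which outputs will survive ReLU, and spend full-precision compute only on those positions. The quantization scheme, layer shape, and thresholding below are illustrative assumptions, not SeerNet's exact kernels.

    import numpy as np

    rng = np.random.default_rng(0)

    def quantize(x, bits=4):
        # Uniform symmetric quantization to `bits` bits (illustrative only).
        scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
        return np.round(x / scale), scale

    # Toy 1x1 convolution: output[c_out, i] = sum_c_in W[c_out, c_in] * X[c_in, i]
    W = rng.standard_normal((8, 16))
    X = rng.standard_normal((16, 100))

    # 1) Cheap low-bit pass to predict the post-ReLU sparsity mask.
    Wq, _ = quantize(W)
    Xq, _ = quantize(X)
    predicted_mask = (Wq @ Xq) > 0        # positions predicted to survive ReLU

    # 2) Full-precision pass, kept only where the mask says so.
    #    (A real implementation would compute only the masked positions.)
    full = np.where(predicted_mask, W @ X, 0.0)

    true_mask = (W @ X) > 0
    print(f"mask agreement with full precision: {(predicted_mask == true_mask).mean():.1%}")
    print(f"predicted feature-map sparsity: {1 - predicted_mask.mean():.1%}")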
