Intelligent Edge Computing: Making Intelligence Ubiquitous

Computing paradigm shifts
- The pendulum swings between centralized and distributed computing: mainframe → personal computing → intelligent cloud → intelligent cloud + intelligent edge.

Distributed devices and data
- [Figure: data generated at the edge every day — a person ~1.5 GB/day; 20B IoT smart devices; smart home; smart city; autonomous vehicle ~5 TB/day; connected factory ~1 PB/day; stadium ~200 TB per game; smart office ~150 GB/day; per-day volumes of 50 GB and 250 PB also appear in the callouts.]

The call for intelligence (DL) on the edge
- Data explosion from fast-growing edge devices
  - E.g., smart surveillance cameras, self-driving cars
- Strong needs for on-device intelligence
  - Low latency
  - High availability and reliability
  - Strong privacy protection
  - Low cost

Intelligent cloud, intelligent edge
- Edge devices are becoming increasingly powerful
- Emerging high-performance, low-power, low-cost AI ASICs
- Empower every app and device with AI/DL

AI-empowered diverse devices and applications everywhere
- Affordable AI models tailored for diverse devices
- Highly optimized software stack and efficient hardware for AI
- Security and privacy, model protection, explainable AI, debugging
- On-device, continuous, collaborative learning (hardware in the loop)

Innovations of the on-device DL stack
- Efficient neural network (NN) design
- Edge NN frameworks
- AI chips: Edge TPU, KPU, VPU, HPU, NPU

NN design and deployment
- NN design: manual design, NAS, pruning; design space: # of layers, op structure, channels, ...; constraints (e.g., FLOPs)
- Model deployment: quantization (re-quantize, dequantization), Conv/BN/ReLU graph transformations, targeting CPU, GPU, DSP, TPU, NPU, ...
- There is a gap between design and deployment: current NN design does not consider platform features.

Do fewer FLOPs mean less latency?
- On the Edge TPU, MobileNetEdgeTPU (990 MFLOPs) runs in 3.6 ms with 75.6% accuracy, while MobileNetV3 (209 MFLOPs) takes 4 ms with 74.7% accuracy.
- Fewer FLOPs do not imply lower latency, but chasing FLOPs can harm model accuracy.

Does a fast model run fast on every hardware?
- MobileNetV3 is 25% faster than MobileNetV2 on a Cortex-A76 CPU, yet MobileNetV2 is 71% faster than MobileNetV3 on a VPU.

To Bridge Neural Network Design and Real-World Performance: A Behavior Study for Neural Networks
- Paper published at MLSys 2021.

Goal
- A measurement study to answer the following three questions:
  1. What are the behavior characteristics that show an inconsistent latency response to changes in the OPs and memory accesses of a configuration in the design space?
  2. What are the root causes of these unexpected characteristics?
  3. What are the implications of these characteristics for efficient-NN design?

Methodology
- Profiling on 7 edge AI platforms: Cortex CPU (TFLite), Adreno GPU (TFLite), Edge TPU (TFLite), DSP (SNPE), VPU (OpenVINO), NPU (RKNN), KPU (NNCASE).
- Measurement tool: generate a single-block model in TF → convert it to the target graph and precision → collect timing results on the target device → profile.

Covered design dimensions
- The scaling of each NN design dimension:
  - Operator/block type: normal operators (Conv, FC, ...), elementwise ops (Add, Pooling, ...), activations (ReLU, Sigmoid, Swish, ...), blocks (MobileNet/ShuffleNet blocks, ...)
  - Kernel size: {1, 3, 5, 7}
  - Stride: {1, 2}
  - Height/width: {3, ..., 224}
  - # of Conv channels (C_in / C_out): {3, ..., 1000}
  - Precision: INT8, FP16, ...

Do more Conv channels increase latency?
- Finding 1: the latency of Conv increases in a step pattern, rather than linearly, with the number of output channels.
- [Figure: latency (Y axis) vs. output channel number (X axis); input feature map 28x28, input channels 320, kernel 3x3, stride 1.]

Do more Conv channels increase latency?
- Cause: the input tensors are padded to fully utilize the hardware's data-level parallelism (SIMD units on the CPU, vector units on the DSP, SIMT on the GPU, etc.).
- In the matrix-multiplication implementation of Conv, the K²·C_in by H·W matrices and the output channels are padded to a multiple of the hardware block size (pad to 8·n for the [8,1] x [1,8] basic block of the CPU SIMD units).
- [Figure: input feature map, convolution kernel, and output feature map being padded before the matrix multiplication.]
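
The step pattern can be reproduced with a back-of-the-envelope calculation. Below is a minimal sketch (not from the paper) that assumes the backend pads the output-channel dimension to a multiple of the vector width (8 in the slide's example) and that latency is roughly proportional to the padded amount of work; real backends may use a different padding rule.

    import math

    def padded_channels(c_out: int, vector_width: int = 8) -> int:
        # Channels after padding to a multiple of the hardware vector width.
        return math.ceil(c_out / vector_width) * vector_width

    def relative_conv_cost(c_out: int, vector_width: int = 8) -> int:
        # The work actually executed is set by the padded channel count,
        # so every channel count inside the same step costs the same.
        return padded_channels(c_out, vector_width)

    if __name__ == "__main__":
        for c in range(118, 131):
            print(c, "->", relative_conv_cost(c))
        # 118-120 all map to 120, 121-128 all map to 128: a step, not a line.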

Do more Conv channels increase latency?
- Implication: for potentially higher accuracy, it is encouraged to keep only the largest number of channels within each latency step of the NN design space and skip the other choices.
- Previous channel-number choices: ..., 6, 8, 10, 12, 14, 16, 18, 20, ...; the reduced choices keep only the channel counts at the top of each latency step.
- E.g., MetaPruning: the channel search space shrinks from 30^14 to 4^14 (14 layers, each layer with 30 channel candidates). A quick computation of this reduction is shown below.
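
To see how much this shrinks the channel search space, here is a one-line computation for the MetaPruning example above (14 layers, 30 candidates per layer reduced to 4); the candidate counts are the ones quoted on the slide.

    layers = 14
    full_space = 30 ** layers      # all 30 channel choices per layer
    reduced_space = 4 ** layers    # only the step-boundary choices per layer
    print(f"{full_space:.2e} -> {reduced_space:.2e}, "
          f"reduction factor {full_space / reduced_space:.1e}")
    # roughly 4.78e+20 -> 2.68e+08, about 1.8e+12x smaller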

Does a building block have similar relative latency on different NN platforms?
- Finding 2: the relative latency of a building block varies greatly across platforms.
- [Figure: latency of DenseBlock, MobileNetV2 block + SE, MobileNetV2 block, and ShuffleNetV2 block versus their FLOPs on CPU, GPU, VPU, DSP, TPU, and KPU.]

Does a building block have similar relative latency on different NN platforms?
- Cause 1: the mismatch between computation and memory bandwidth is severe.
- Data-reuse rate: DenseBlock 44.51, MobileNetV2 block + SE 7.58, MobileNetV2 block 4.73, ShuffleNetV2 block 0.81; on a Snapdragon 855 (Mi 9), the GPU delivers 508 GFLOP/s and the CPU 22.7 GFLOP/s, while memory bandwidth is only 23 GFloat/s.
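
The data-reuse rate on this slide is essentially arithmetic intensity (FLOPs per float moved). A minimal sketch of the comparison, reusing the slide's Snapdragon 855 numbers; the framing is the usual roofline argument, which is an assumption about how the slide's numbers are meant to be read: a block whose reuse rate is far below the compute/bandwidth ratio is memory-bound.

    # Peak numbers quoted on the slide for the Snapdragon 855 (Mi 9).
    gpu_gflops = 508.0        # GFLOP/s
    cpu_gflops = 22.7         # GFLOP/s
    mem_bw_gfloat = 23.0      # GFloat/s of memory bandwidth

    # Data-reuse rate (FLOPs per float moved) of each block, from the slide.
    reuse = {
        "DenseBlock": 44.51,
        "MobileNetV2Block+SE": 7.58,
        "MobileNetV2Block": 4.73,
        "ShuffleNetV2Block": 0.81,
    }

    for name, r in reuse.items():
        # Attainable GFLOP/s on the GPU is capped by min(peak, reuse * bandwidth).
        attainable = min(gpu_gflops, r * mem_bw_gfloat)
        bound = "compute-bound" if attainable == gpu_gflops else "memory-bound"
        print(f"{name:22s} attainable ~{attainable:6.1f} GFLOP/s on GPU ({bound})")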

Does a building block have similar relative latency on different NN platforms?
- Cause 2: support for non-Conv operators is weak on the NN platforms, except on the CPU.
- Global pooling is inefficient in the MobileNetV2 + SE block on the GPU: pooling accounts for < 5% of the OPs but > 70% of the time (71.7% of block latency).
- [Figure: OPs (%) vs. latency (%) of the operators in the Squeeze-and-Excitation block — 3x3 DWConv/BN/ReLU6, global pooling, FC, ReLU, FC, Sigmoid, Multiply.]

Does a building block have similar relative latency on different NN platforms?
- Implication: it is encouraged to customize the set of candidate blocks in the NN design space for each platform.
- [Figure: a customized search space of candidate modules for each platform (CPU, GPU, DSP).]

Summary of major findings
- # of channels: the latency of Conv increases in a step pattern with the number of output channels.
- Block: the relative latency of an NN block varies greatly across platforms.
- Activation function: activation functions can have a big impact on latency, particularly Swish and HardSwish.
- Kernel size: Conv latency increases much less with kernel size on AI accelerators than on the CPU.
- Quantization: INT8 achieves > 11x speedup on the NPU, while the CPU achieves only < 3.6x; INT8 can dramatically decrease the inference accuracy of various models.
- General: considering operator support, accuracy, and latency, the CPU is still a good choice for inference.

forinferenceEfficient

NNdesign

must

considerhardware

characteristics.Howto

getagoodmodel?Efficient

NNdesignfordiverse

edgehardwareProfiling

andmodelingModeldeploymentNNDesignEdgeTPUEdgeTPUModelsVPUHPUNPUKPUVPUHPUNPUKPUManualDesignDesignSpace:#of

layers,opstructure,channel,

…constraints(e.g.,

FLOPs)latency,

energyHW-specificpredictorsof

latencyNASandenergyPruningnn-Meter:

nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices
- Targets Cortex CPU, Adreno GPU, and VPU.
- Paper published at MobiSys 2021 (Best Paper Award).

Existing work on latency prediction
- FLOPs-based prediction
  - Pros: very simple
  - Cons: not a direct metric of inference latency
- Operator-level prediction
  - Pros: stable primitive operators (conv2d, pooling, activations, ...)
  - Cons: unaware of graph-level optimizations
- Model-level prediction
  - Pros: learns graph-level optimizations automatically
  - Cons: cannot generalize to unseen model structures

nn-Meter: build an accurate latency predictor
- Take graph-level optimizations into consideration
- Generalization ability

Challenge: framework optimizations
- Backend-independent optimizations: constant folding, common subexpression elimination, ...
- Backend-dependent optimizations: operator fusion, ...
- A designed model passes through both kinds of optimization before reaching a backend: CPU backend 1 (e.g., Eigen lib), CPU backend 2 (e.g., NNPACK lib), GPU backend (e.g., OpenCL), Movidius backend, ...

Impact of operator fusion
- Operator fusion has a great impact on inference latency.
- Model graph (Conv → Activation) vs. backend implementation: without fusion the backend runs two kernels and materializes the intermediate tensor; with fusion a single kernel does both.

    // Unfused backend kernels: Conv followed by a separate activation pass.
    _kernel conv_2d_1x1() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (cout = 0; cout < out.chan; cout++)
                    for (cin = 0; cin < in.chan; cin++)
                        out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
    }
    _kernel active() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (c = 0; c < out.chan; c++)
                    out[i][j][c] = active(in[i][j][c]);
    }

    // Operator fusion: Conv + activation in one kernel, no intermediate tensor.
    _kernel conv_2d_1x1_active() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (cout = 0; cout < out.chan; cout++) {
                    for (cin = 0; cin < in.chan; cin++)
                        out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
                    out[i][j][cout] = active(out[i][j][cout]);
                }
    }

- [Figure: MobileNetV2 as an example model graph.]

nn-Meter: kernel-level latency prediction
- Kernel: the basic execution unit on a device; it can be a single operator or a fusion of multiple operators.
- Divide a whole model into kernels and conduct kernel-level prediction.
- Model latency is the sum of all kernel latencies: model → kernel detector → kernels → kernel latency predictors → sum of kernel latencies.
- Problems:
  1. How to detect kernels? (kernel detection)
  2. How to predict accurately for each kernel? (adaptive data sampling)
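
A minimal sketch of the kernel-level prediction idea: split a model into kernels with a detector, predict each kernel's latency with a per-kernel regressor, and sum. The detect_kernels and predict_kernel_latency helpers below are hypothetical placeholders, not nn-Meter's API.

    from typing import Callable, Iterable

    def predict_model_latency(
        model_graph,
        detect_kernels: Callable[[object], Iterable[dict]],
        predict_kernel_latency: Callable[[dict], float],
    ) -> float:
        # Kernel-level prediction: model latency = sum of kernel latencies.
        kernels = detect_kernels(model_graph)     # e.g. Conv+BN+ReLU fused kernels
        return sum(predict_kernel_latency(k) for k in kernels)

    if __name__ == "__main__":
        # Toy usage with stand-in callables and made-up per-kernel latencies (ms).
        toy_kernels = [{"type": "conv-bn-relu"},
                       {"type": "dwconv-bn-relu"},
                       {"type": "global-pool"}]
        latency = predict_model_latency(
            model_graph=None,
            detect_kernels=lambda g: toy_kernels,
            predict_kernel_latency=lambda k: {"conv-bn-relu": 1.2,
                                              "dwconv-bn-relu": 0.6,
                                              "global-pool": 0.1}[k["type"]],
        )
        print(f"predicted model latency: {latency:.1f} ms")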

nn-Meter tech #1: automatic kernel detector
- Fusion rule detection for black-box devices:
  - For every two operators op1 and op2, generate 3 test graphs — op1 alone, op2 alone, and op1 followed by op2 — and measure their latencies T_op1, T_op2, and T_(op1,op2).
  - Compare the latency difference: op1 and op2 are fusible if T_op1 + T_op2 - T_(op1,op2) > α · min(T_op1, T_op2).
- Kernel search by the fusion rules:
  - Apply the detected fusion rules to find the maximal sets of fused operators in the target model (e.g., a ResNet18 block).
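
A minimal sketch of the fusion-rule test described above: benchmark the three generated graphs on the black-box device and apply the inequality. The alpha threshold value and the measure_latency callable are assumptions standing in for the real measurement harness.

    def ops_are_fused(t_op1: float, t_op2: float, t_op1_op2: float,
                      alpha: float = 0.5) -> bool:
        # op1 and op2 are judged fusible if running them back-to-back saves
        # clearly more time than the cheaper of the two single-op graphs.
        saving = t_op1 + t_op2 - t_op1_op2
        return saving > alpha * min(t_op1, t_op2)

    def detect_fusion_rule(op1: str, op2: str, measure_latency) -> bool:
        # Generate the 3 test graphs and compare the measured latencies.
        t1 = measure_latency([op1])
        t2 = measure_latency([op2])
        t12 = measure_latency([op1, op2])
        return ops_are_fused(t1, t2, t12)

    if __name__ == "__main__":
        # Toy measurements (ms): conv+relu runs almost as fast as conv alone.
        fake = {("conv",): 2.0, ("relu",): 0.4, ("conv", "relu"): 2.1}
        print(detect_fusion_rule("conv", "relu",
                                 lambda ops: fake[tuple(ops)]))  # -> True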

Kernel-latency prediction: challenges
- Large sample space: across 24 widely used CNN models collected from the PyTorch model zoo, Conv alone has an enormous number of configurations to sample.
- Non-linear latency on edge devices: random sampling misses crucial data points.

nn-Meter tech #2: adaptive data sampler
- Sample the most beneficial data (kernel configurations) instead of sampling at random.
- Sample configurations that are likely to be considered in model design: a prior probability distribution learned from the model zoo.
- Fine-grained sampling around data points with inaccurate predictions.
- [Figure: (1) configurations considered in model design are drawn from the prior probability distribution; (2) a fine-grained data sampler revisits data with large errors; the sampled data and measured latencies train the regression model.]
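
A minimal sketch of the adaptive sampling loop: draw kernel configurations from a prior fitted to the model zoo, train a regressor, then sample more finely around configurations with large prediction error. Everything below — the Gaussian prior, the error threshold, the nearest-neighbour regressor, and the stand-in "device" — is an illustrative assumption, not nn-Meter's exact procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    def true_latency(c_out):
        # Stand-in device: latency steps every 8 output channels.
        return np.ceil(c_out / 8.0) * 8.0 * 0.01

    def fit_predictor(x, y):
        # Toy regressor: piecewise-constant nearest-neighbour lookup.
        order = np.argsort(x)
        xs, ys = x[order], y[order]
        return lambda q: ys[np.abs(xs[None, :] - q[:, None]).argmin(axis=1)]

    # 1) Sample configurations from a prior centred on channel counts that
    #    actually occur in a model zoo (assumed Gaussian here).
    x = np.clip(rng.normal(loc=128, scale=64, size=60), 3, 1000).round()
    y = true_latency(x)

    for _ in range(3):
        predict = fit_predictor(x, y)
        # 2) Find configurations where the current predictor is inaccurate ...
        test = rng.integers(3, 1000, size=200).astype(float)
        err = np.abs(predict(test) - true_latency(test)) / true_latency(test)
        worst = test[err > 0.1]
        # 3) ... and sample finely around them, then measure and retrain.
        new = np.clip(worst[:, None] + np.array([-2, -1, 1, 2]), 3, 1000).ravel()
        x, y = np.concatenate([x, new]), np.concatenate([y, true_latency(new)])

    print(f"final training set size: {len(x)}")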

nn-Meter evaluation
- Prediction accuracy: 99.0% (CPU), 99.1% (Adreno 640 GPU), 99.0% (Adreno 630 GPU), and 83.4% (Intel VPU).
- Generalization performance on unseen model graphs:
  - Comparison baselines: FLOPs, FLOPs+MAC, BRP-NAS (GCN).
  - On average, nn-Meter achieves 89.2%, significantly better than FLOPs (22.1%), FLOPs+MAC (17.1%), and BRP-NAS (8.5%).

Efficient NN design for diverse edge hardware
- [Figure: the same design pipeline as before — hardware-specific predictors of latency and energy guide NAS and pruning, followed by deployment to Edge TPU, VPU, HPU, NPU, KPU, ...]

We got a good model. How does it run on real devices?

Are computing resources fully utilized?
- [Figure: Adreno GPU ALU utilization and ARM CPU utilization for CNN inference; big-core and little-core utilization are far from full, with values around 30%, 84%, and 90% shown.]
- Low hardware utilization results in poor inference speed.

AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs
- Paper published at MobiCom 2021.

Why is utilization low on the CPU?
- Unbalanced task distribution by the OS, both across and within core clusters.
- [Figure: CPU utilization for CNN inference — computation tasks spread unevenly over the big-core cluster (B0-B3) and the little-core cluster (L0-L3), with utilization around 90% on one cluster and 30% on the other.]

Why is the distribution unbalanced on the CPU?
- Execution flow of matrix multiplication:
  1. Block partition for parallelism: the M x K parameter matrix and the K x N feature-map matrix are split into mc x kc and kc x nc blocks, one task per block.
  2. Copy blocks into contiguous memory space.
  3. Schedule tasks to the thread queues (Q0, Q1, ..., Q#) of the intra-op thread pool.
- Problems: the partition ignores hardware asymmetry and resource constraints, the copy is redundant, and the scheduling ignores hardware asymmetry, data locality, and the interference-prone environment.

AsyMo: optimize DL inference on big.LITTLE CPUs
- Accelerate edge DL inference with lower energy cost.
- One-run initialization: cost-model-directed block partition; prearranged memory layout for parameters.
- Inference: asymmetry-aware scheduling of tasks (by task thread ID) onto the intra-op thread pool, for CNN/RNN models.
- Efficient frequency: data-reuse-based CPU frequency setting.

Cost-model-based block partition
- Cost for a task = computation cost + memory-access cost.
- Cost_seq: the cost of one sequential unit (one task on one core).
- Cost for the parallel part: Cost_par = (number of parallel tasks x Cost_seq) / degree of parallelism.
- Other cost: the unparallelized part + task scheduling + framework overhead.
- Total cost: Cost_total = Cost_par + Cost_other; the block partition is chosen to minimize it.
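
A minimal sketch of how a cost model of this shape can pick a block size: estimate per-task compute and memory cost from the block dimensions, then choose the (mc, kc) that minimizes total cost for a given degree of parallelism. The constants, candidate block sizes, and cost terms below are placeholders; AsyMo's actual model is calibrated per platform.

    import math

    def task_cost(mc, kc, nc, flops_per_cycle=8.0, bytes_per_cycle=4.0):
        # Cost (in cycles) of one mc x kc x nc block: computation + memory access.
        compute = (2.0 * mc * kc * nc) / flops_per_cycle
        memory = 4.0 * (mc * kc + kc * nc + mc * nc) / bytes_per_cycle  # float32 tiles
        return compute + memory

    def total_cost(M, K, N, mc, kc, parallelism, other_cost_per_task=500.0):
        # Total = parallel part (tasks * Cost_seq / parallelism) + other overheads.
        nc = N  # keep the full N dimension per task for simplicity
        n_tasks = math.ceil(M / mc) * math.ceil(K / kc)
        cost_seq = task_cost(mc, kc, nc)
        cost_par = n_tasks * cost_seq / parallelism
        cost_other = n_tasks * other_cost_per_task
        return cost_par + cost_other

    if __name__ == "__main__":
        M, K, N = 1024, 1024, 256
        candidates = [(32, 32), (64, 64), (128, 128), (256, 256)]
        best = min(candidates, key=lambda b: total_cost(M, K, N, *b, parallelism=4))
        print("chosen (mc, kc):", best)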

Optimized execution flow of matrix multiplication
- One-run initialization: block partition and parameter layout are done once, ahead of inference.
- Inference run: copy features, schedule the tasks, and run with each thread pinned to a core.
- Blocks are sized separately for the big-core and the little-core cluster: better data locality, and no work stealing from big cores to little cores.

Total performance and energy improvement
- AsyMo vs. TensorFlow on a Kirin 970 with Android 9 Pie.
- [Figure: performance and energy-efficiency bars; AsyMo improves both, with factors around 1.6x-1.9x shown, and pre-copying parameters enables the parallel implementation.]
- Compared settings: TensorFlow at the OS frequency setting, AsyMo at its picked efficient CPU frequency, and both at the maximum CPU frequency.

Sparseflow: unleash the full potential of sparsity in deep learning
- Joint work with Chen Zhang et al.

Today's DNN models are huge
- GPT-3: 175B parameters, ~$12M training cost.
- MT-NLG: 530B parameters, trained on 560 DGX A100 servers.

Computation is the engine behind AI's success, and we still need more
- [Figure: performance (op/s) from 1960 to 2019 — ENIAC ~5 Kops, Moore's-law CPU scaling, Xeon E5 ~500 Gops, GPUs (V100 125 Tops), and dedicated hardware (TPUv1 90 Tops, TPUv3 360 Tops).]

Piling up hardware is not sustainable: the energy-efficiency wall
- [Figure: energy efficiency from 1995 to 2020 on a log scale; CPU, GPU, and TPU each hit their own energy-efficiency wall.]

Sparsity is the key to the human brain's efficiency
- We do not look at everything in our visual scope.
- Simple geometric shapes are enough for us to recognize a cat.

Weight pruning
- Prune away small weights: MxV → SpMxV.
- Unstructured sparse matrices are difficult to accelerate.
- Han, Song, et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS'15.

Accuracy and speedup trade-off
- Fine-grained / irregular sparsity
  - Pros: high model accuracy, high compression ratio
  - Cons: irregular pattern, difficult to accelerate
- Coarse-grained / regular sparsity
  - Pros: regular pattern, easy to accelerate
  - Cons: low model accuracy, low compression ratio

How to achieve both?
- Model accuracy: add few constraints on the sparsity pattern.
- Speedup: matrix partitioning for parallel computing; eliminating irregular computation and memory access.
- S. Cao et al., "Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity", FPGA'19.

Bank-Balanced Pruning
- Split each dense matrix row into equal-sized banks and traverse all rows.
- Apply fine-grained pruning inside each bank, with a threshold percentage chosen so that every bank ends up with an identical sparsity ratio.
- Example row (elements 0-15): [0.8, -0.1, 0.2, 1.5 | 1.0, 0.3, -0.4, -1.4 | 0.7, 2.0, 0.9, -0.5 | 1.2, -1.3, 2.1, 0.2] → the BBS row keeps [0.8, 1.5 | 1.0, -1.4 | 2.0, 0.9 | -1.3, 2.1], the two largest-magnitude values of each bank.
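
A minimal numpy sketch of bank-balanced pruning on the example row above: split the row into equal banks, and inside each bank keep only the largest-magnitude values so that every bank has the same sparsity ratio. A bank size of 4 and 50% sparsity reproduce the slide's example; these parameters are otherwise free choices.

    import numpy as np

    def bank_balanced_prune(row: np.ndarray, num_banks: int, sparsity: float) -> np.ndarray:
        # Keep the top-(1 - sparsity) fraction of weights, by magnitude, in every bank.
        banks = row.reshape(num_banks, -1)
        keep = int(round(banks.shape[1] * (1.0 - sparsity)))
        pruned = np.zeros_like(banks)
        for b, bank in enumerate(banks):
            idx = np.argsort(np.abs(bank))[-keep:]   # indices of largest magnitudes
            pruned[b, idx] = bank[idx]
        return pruned.reshape(-1)

    row = np.array([0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4,
                    0.7, 2.0, 0.9, -0.5, 1.2, -1.3, 2.1, 0.2])
    print(bank_balanced_prune(row, num_banks=4, sparsity=0.5))
    # non-zeros per bank: [0.8, 1.5], [1.0, -1.4], [2.0, 0.9], [-1.3, 2.1]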

Bank-Balanced Sparsity (BBS)
- Bank partitioning for parallel computing.
- Fine-grained pruning inside each bank for maintaining accuracy.

Sparse matrix-vector multiplication (SpMxV) with BBS
- Both inter-row and inter-bank parallelism.
- Load balancing across rows and banks.
- Conflict-free accesses to the dense vector.
- [Figure: two matrix rows split into banks 0-3 (non-zeros A-P) multiplied against a dense vector V0-V11, with each matrix bank served by its own vector bank.]

Our CSB (Compressed Sparse Banks) format
- Data rearrangement for inter-bank parallelization: the k-th non-zero of every bank is stored side by side, so all banks can be processed in lockstep.
- Stored arrays: VALUES (e.g., A C E G B D F H I K M O J L N P) and BANK-INTERNAL INDICES, which map directly to physical BRAM addresses.
- Specifically designed for BBS to eliminate decoding overheads.
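
A minimal sketch of the CSB idea for one row: interleave the kept values bank by bank (so the k-th non-zero of every bank sits next to its peers and can feed parallel lanes), and store only each value's index inside its own bank. This follows the description above; the real hardware format additionally maps the indices onto physical BRAM addresses.

    import numpy as np

    def to_csb(row: np.ndarray, num_banks: int):
        # Encode one bank-balanced row: values interleaved across banks,
        # plus each value's index inside its own bank.
        banks = row.reshape(num_banks, -1)
        nz_idx = [np.flatnonzero(b) for b in banks]   # same count per bank (BBS)
        nnz_per_bank = len(nz_idx[0])
        values, indices = [], []
        for k in range(nnz_per_bank):                 # k-th non-zero of every bank
            for b in range(num_banks):
                values.append(banks[b, nz_idx[b][k]])
                indices.append(int(nz_idx[b][k]))     # bank-internal index
        return np.array(values), np.array(indices)

    # Reusing the pruned row from the bank-balanced pruning sketch earlier:
    pruned = np.array([0.8, 0, 0, 1.5, 1.0, 0, 0, -1.4,
                       0, 2.0, 0.9, 0, 0, -1.3, 2.1, 0])
    vals, idxs = to_csb(pruned, num_banks=4)
    print(vals)   # [ 0.8  1.0  2.0 -1.3  1.5 -1.4  0.9  2.1]
    print(idxs)   # [0 0 1 1 3 3 2 2]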

Accelerator overview
- [Figure: FPGA design — host server connected over PCIe; controller and instruction buffer; SpMxV PEs with private vector buffers, multipliers, adder trees, and activation units; matrix memory and vector memory fed by DMA from off-chip DRAM.]

Results
- Model accuracy: very close to the dense baseline on a language model (PTB dataset) and on speech recognition (TIMIT dataset).
- Hardware efficiency: improvements of roughly 7x and 34x.

SeerNet: Predicting CNN Feature-Map Sparsity through Low-Bit Quantization
- S. Cao et al., "SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization", CVPR'19.
- Accelerate model inference by exploiting feature-map sparsity:
  - ReLU: y = max(0, x) zeroes out every negative activation.
  - Max-pooling: y = max(x_i | i = {1, 2, ...}) keeps only the maximum of each window.
- [Figure: a convolution output followed by ReLU or max-pooling, showing how many of the computed values are zeroed or discarded.]
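
A minimal numpy sketch of the underlying idea: run a cheap low-bit version of the convolution to predict which outputs will survive ReLU, and spend full-precision compute only on those positions. The quantization scheme, layer shape, and thresholding below are illustrative assumptions, not SeerNet's exact kernels.

    import numpy as np

    rng = np.random.default_rng(0)

    def quantize(x, bits=4):
        # Uniform symmetric quantization to `bits` bits (illustrative only).
        scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
        return np.round(x / scale), scale

    # Toy 1x1 convolution: output[c_out, i] = sum_c_in W[c_out, c_in] * X[c_in, i]
    W = rng.standard_normal((8, 16))
    X = rng.standard_normal((16, 100))

    # 1) Cheap low-bit pass to predict the post-ReLU sparsity mask.
    Wq, _ = quantize(W)
    Xq, _ = quantize(X)
    predicted_mask = (Wq @ Xq) > 0        # positions predicted to survive ReLU

    # 2) Full-precision pass, kept only where the mask says so.
    #    (A real implementation would compute only the masked positions.)
    full = np.where(predicted_mask, W @ X, 0.0)

    true_mask = (W @ X) > 0
    print(f"mask agreement with full precision: {(predicted_mask == true_mask).mean():.1%}")
    print(f"predicted feature-map sparsity: {1 - predicted_mask.mean():.1%}")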
