Intelligent Edge Computing: Making Intelligence Ubiquitous

Computing paradigm shifts
- Centralized → distributed, again and again: mainframe → personal computing → intelligent cloud → intelligent cloud + edge

Distributed devices and data
- People: 1.5 GB per day each
- Smart home: 20B IoT devices
- Smart city: 250 PB per day
- Smart devices: 50 GB per day
- Autonomous vehicle: 5 TB per day
- Connected factory: 1 PB per day
- Stadium: 200 TB per game
- Smart office: 150 GB per day

The call for intelligence (DL) on the edge
- Data explosion from fast-growing edge devices
  - E.g., smart surveillance cameras, self-driving cars
- Strong needs for on-device intelligence
  - Low latency
  - High availability and reliability
  - Strong privacy protection
  - Low cost
Intelligent Cloud → Intelligent Edge
- Edge devices becoming increasingly powerful
- Emerging high-performance, low-power, low-cost AI ASICs
- Empower every app & device with AI/DL

AI-empowered diverse devices and applications everywhere
- Affordable AI models tailored for diverse devices
- Highly-optimized software stack & efficient hardware for AI
- Security & privacy, model protection, explainable AI, debugging
- On-device, continuous, collaborative learning with hardware in the loop

Innovations of the on-device DL stack
- Efficient neural network (NN) design
- Edge NN frameworks
- AI chips: EdgeTPU, KPU, VPU, HPU, NPU
NN design and deployment
- NN design: manual design or NAS over a design space (# of layers, op structure, channels, ...) under constraints (e.g., FLOPs); pruning, quantization
- Model deployment: re-quantization, BN folding, dequantization for target backends (CPU, GPU, DSP, TPU, NPU, ...)
- The gap: current NN design does not consider platform features

Do fewer FLOPs mean less latency?
- MobileNetEdgeTPU: 990 MFLOPs, latency 3.6 ms, model accuracy 75.6% (on EdgeTPU)
- MobileNetV3: 209 MFLOPs, latency 4 ms, model accuracy 74.7% (on EdgeTPU)
- Fewer FLOPs do not imply less latency, but can harm model accuracy

Does a fast model run fast on every hardware?
- MobileNetV3 is 25% faster than MobileNetV2 on a Cortex-A76 CPU
- MobileNetV2 is 71% faster than MobileNetV3 on a VPU
To Bridge Neural Network Design and Real-World Performance: A Behavior Study for Neural Networks
Paper published at MLSys 2021

Goal
- A measurement study to answer the following 3 questions:
  1. What are the behavior characteristics that show an inconsistent latency response to changes in the OPs and memory accesses of a configuration in the design space?
  2. What are the root causes of these unexpected characteristics?
  3. What are the implications of these characteristics for efficient-NN design?

Methodology
- Profiling on 7 edge AI platforms: TFLite (Cortex CPU, Adreno GPU), EdgeTPU, SNPE (DSP), OpenVINO (VPU), RKNN (NPU), NNCASE (KPU)
- Measurement tool: generate a single-block model in TF → convert to the target graph and precision → collect timing results on the target device
Covered design dimensions
- The scaling of each NN design dimension:
  - Operator/block type: normal operators (Conv, FC, ...), elementwise (Add, Pooling, ...), activations (ReLU, Sigmoid, Swish, ...), blocks (MobileNet/ShuffleNet block, ...)
  - Kernel size: {1, 3, 5, 7}
  - Stride: {1, 2}
  - Height/width: {3, ..., 224}
  - # of Conv channels (Cin/Cout): {3, ..., 1000}
  - Precision: INT8, FP16, ...
Do more Conv channels increase latency?
- Finding 1: The latency of Conv increases in a step pattern, rather than linearly, with the number of output channels
  (Figure: X axis = output channel number, Y axis = latency. Input feature map: 28x28; input channels: 320; kernel: 3x3; stride: 1)
- Cause: The input tensors are padded to fully utilize the hardware's data-level parallelism
  - SIMD units on the CPU, vector units on the DSP, SIMT on the GPU, etc.
  - E.g., in the matrix-multiplication implementation of Conv with [8,1]x[1,8] basic blocks on the CPU, the K^2 x Cin and H x W dimensions of the input feature map and kernel are padded to multiples of 8
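The step pattern follows directly from the padding. A minimal sketch, assuming a vector width of 8 (the actual width is hardware- and backend-specific):

```python
import math

SIMD_WIDTH = 8  # assumed hardware vector width; varies by platform

def padded_channels(c: int) -> int:
    """Channel count after padding up to a multiple of the vector width."""
    return math.ceil(c / SIMD_WIDTH) * SIMD_WIDTH

# Channels 9..16 all pad to 16, so their Conv cost (and hence latency) is
# roughly identical: a step pattern rather than linear growth.
for c in (6, 8, 9, 12, 16, 17):
    print(c, "->", padded_channels(c))
```

Only the largest channel count in each step adds model capacity at no extra latency.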
Do more Conv channels increase latency?
- Implication: For potentially higher accuracy, keep the largest channel number in each latency step of the NN design space and skip the other ones
  - Previous channel-number choices: ..., 6, 8, 10, 12, 14, 16, 18, 20, ...
  - Reduced channel-number choices: only the largest count in each latency step, e.g., ..., 8, 16, ...
  - E.g., MetaPruning: the channel search space shrinks from 30^14 to 4^14 (14 layers, each with 30 channel candidates)
Does a building block have similar relative latency on different NN platforms?
- Finding 2: The relative latency of a building block varies greatly across platforms
  (Figure: latency of DenseBlock, MobileNetV2 block + SE, MobileNetV2 block, and ShuffleNetV2 block on CPU, GPU, VPU, DSP, TPU, and KPU)
Does a building block have similar relative latency on different NN platforms?
- Cause 1: The mismatch between computation and memory bandwidth is severe
  (Measured on Snapdragon 855 / Mi 9: GPU 508 GFLOP/s, CPU 22.7 GFLOP/s, memory bandwidth 23 GFloat/s. Data reuse rate: DenseBlock 44.51, MobileNetV2 block + SE 7.58, MobileNetV2 block 4.73, ShuffleNet block 0.81)
Does a building block have similar relative latency on different NN platforms?
- Cause 2: Support for non-Conv operators is weak on the NN platforms, except the CPU
  - GlobalPooling is inefficient in the MobileNetV2+SE block on the GPU (3x3 DWConv, BN, ReLU6, then the SE block)
  - In the Squeeze-and-Excitation block (GlobalPooling → FC → ReLU → FC → Sigmoid → Multiply), pooling takes < 5% of the block's OPs but > 70% of its time (71.7% of block latency)
Does a building block have similar relative latency on different NN platforms?
- Implication: Customize the set of candidate blocks in the NN design space for each platform
  (A customized search space of candidate modules per target: CPU, GPU, DSP, ...)
Summary of major findings
- # of channels: The latency of Conv increases in a step pattern with the number of output channels
- Block: The relative latency of an NN block varies greatly across platforms
- Activation function: Activation functions can have a big impact on latency, particularly Swish and HardSwish
- Kernel size: Conv latency increases much less with kernel size on AI accelerators than on the CPU
- Quantization:
  - INT8 on the NPU achieves > 11x speedup, while the CPU only achieves < 3.6x
  - INT8 can dramatically decrease the inference accuracy of various models
- General: Considering general support, accuracy, and latency, the CPU is still a good choice for inference

Efficient NN design must consider hardware characteristics.

How to get a good model?
Efficient NN design for diverse edge hardware
- NN design: manual design or NAS/pruning over a design space (# of layers, op structure, channels, ...) under constraints (e.g., FLOPs, latency, energy)
- Profiling and modeling: HW-specific predictors of latency and energy
- Model deployment: deploy models to diverse edge hardware (EdgeTPU, VPU, HPU, NPU, KPU, ...)

nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices
Paper published at MobiSys 2021 (Best Paper Award)
Targets: Cortex CPU, Adreno GPU, VPU
Existing work on latency prediction
- FLOPs-based prediction
  - Pros: very simple
  - Cons: FLOPs is not a direct metric of inference latency
- Operator-level prediction
  - Pros: stable primitive operators (conv2d, pooling, activations, ...)
  - Cons: unaware of graph-level optimizations
- Model-level prediction
  - Pros: learns graph-level optimizations automatically
  - Cons: cannot generalize to unseen model structures
- nn-Meter: build an accurate latency predictor
  - Take graph-level optimizations into consideration
  - Generalize to unseen model structures
Challenge: framework optimizations
- Backend-independent optimizations
  - Constant folding
  - Common-subexpression elimination
  - ...
- Backend-dependent optimizations
  - Operator fusion
  - ...
- Backends: CPU backend 1 (e.g., Eigen lib.), CPU backend 2 (e.g., NNPACK lib.), GPU backend 1 (e.g., OpenCL), Movidius backend, ...
Impact of operator fusion
- Operator fusion has a great impact on inference latency
- Model graph (e.g., from MobileNetV2): Conv → Activation

  Separate backend kernels:

    _kernel conv_2d_1x1() {
      for (i = 0; i < out.row; i++)
        for (j = 0; j < out.col; j++)
          for (cout = 0; cout < out.chan; cout++)
            for (cin = 0; cin < in.chan; cin++)
              out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
    }

    _kernel active() {
      for (i = 0; i < out.row; i++)
        for (j = 0; j < out.col; j++)
          for (c = 0; c < out.chan; c++)
            out[i][j][c] = active(in[i][j][c]);
    }

  After operator fusion (Conv + Active in one kernel, no intermediate tensor):

    _kernel conv_2d_1x1_active() {
      for (i = 0; i < out.row; i++)
        for (j = 0; j < out.col; j++)
          for (cout = 0; cout < out.chan; cout++) {
            for (cin = 0; cin < in.chan; cin++)
              out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
            out[i][j][cout] = active(out[i][j][cout]);
          }
    }
nn-Meter: kernel-level latency prediction
- Kernel: the basic execution unit on a device
  - Can be a single operator or a fusion of multiple operators
- Divide a whole model into kernels and conduct kernel-level prediction
- Model latency is the sum of all kernel latencies: model → kernel detector → kernels → kernel latency predictor → sum of kernel latencies
- Problems:
  1. How to detect kernels? (Kernel detection)
  2. How to predict accurately for each kernel? (Adaptive data sampling)
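The pipeline above can be sketched as follows; `kernel_predictors` stands in for nn-Meter's per-kernel-type regressors (the names and shapes here are illustrative, not the real nn-Meter API).

```python
def predict_model_latency(model_kernels, kernel_predictors):
    """Kernel-level prediction sketch: model latency = sum of kernel latencies."""
    total = 0.0
    for kernel_type, config in model_kernels:
        predictor = kernel_predictors[kernel_type]  # one predictor per kernel type
        total += predictor(config)
    return total

# Toy stand-ins; the real predictors are regression models trained per kernel.
predictors = {"conv-relu": lambda cfg: 0.1 * cfg["cout"], "pool": lambda cfg: 0.5}
kernels = [("conv-relu", {"cout": 32}), ("pool", {})]
print(predict_model_latency(kernels, predictors))
```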
nn-Meter tech #1: automatic kernel detector
- Fusion rule detection for black-box devices
  - A set of test cases: for every two operators op1 and op2, generate 3 graphs (op1 alone, op2 alone, op1 → op2) and measure their latencies lat_op1, lat_op2, lat_(op1,op2)
  - Compare the latency difference: op1 and op2 are fused if
      lat_op1 + lat_op2 − lat_(op1,op2) > α · min(lat_op1, lat_op2)
- Kernel search by the fusion rules
  - Apply the fusion rules to search for maximal fused operators in the target model (e.g., a ResNet-18 block)
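The fusion test can be written down directly from the rule; the latencies are measured on the black-box device, and α is a small tolerance factor (the value 0.5 below is only a placeholder).

```python
def is_fused(lat_op1: float, lat_op2: float, lat_pair: float, alpha: float = 0.5) -> bool:
    """Two operators are judged fused on a black-box device if running them
    back-to-back is clearly cheaper than the sum of running each alone."""
    return lat_op1 + lat_op2 - lat_pair > alpha * min(lat_op1, lat_op2)

# Conv takes 2.0 ms alone, ReLU 0.4 ms alone.
print(is_fused(2.0, 0.4, 2.1))  # True: together they take only 2.1 ms
print(is_fused(2.0, 0.4, 2.4))  # False: no saving, so not fused
```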
Kernel-latency prediction: challenges
- Large sample space: e.g., collected from 24 widely used CNN models in the PyTorch model zoo, Conv has an enormous number of configurations to sample
- Non-linear latency on edge devices
- Random sampling misses crucial data points
nn-Meter tech #2: adaptive data sampler
- Sample the most beneficial data (kernel configurations) instead of random sampling
- Sample configurations that are likely to be considered in model design
  - Prior probability distribution: learned from the model zoo
- Fine-grained sampling around data with inaccurate predictions
  1. Sample data with configurations considered in model design (prior distribution)
  2. Train a regression model on the data and measured latencies; apply the fine-grained sampler around data with large errors, and iterate
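A minimal sketch of that sampling loop, with placeholder callables for the profiler and regressor; the real sampler works over multi-dimensional kernel configurations rather than the toy 1-D neighborhood used here.

```python
def adaptive_sample(prior_configs, measure, train, predict, rounds=2, err_thresh=0.1):
    """Adaptive sampling sketch (illustrative, not the real nn-Meter API):
    start from configs drawn from a prior (model-zoo) distribution, then
    sample more finely around configurations the regressor predicts badly."""
    data = [(c, measure(c)) for c in prior_configs]
    for _ in range(rounds):
        model = train(data)
        # configurations with large relative prediction error
        bad = [c for c, lat in data if abs(predict(model, c) - lat) / lat > err_thresh]
        for c in bad:
            for neighbor in (c - 1, c + 1):  # toy 1-D "fine-grained" neighborhood
                data.append((neighbor, measure(neighbor)))
    return data
```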
nn-Meter evaluation
- Prediction accuracy: 99.0% (CPU), 99.1% (Adreno 640 GPU), 99.0% (Adreno 630 GPU), and 83.4% (Intel VPU)
- Generalization performance on unseen model graphs
  - Comparison baselines: FLOPs, FLOPs+MAC, BRP-NAS (GCN)
  - On average, nn-Meter achieves 89.2%, significantly better than FLOPs (22.1%), FLOPs+MAC (17.1%), and BRP-NAS (8.5%)
Efficient NN design for diverse edge hardware (as above: NN design, profiling and modeling, model deployment)

We got a good model. How does it run on real devices? Are computing resources fully utilized?
(Figure: Adreno GPU ALU utilization % for CNN and ARM CPU utilization % for CNN, big core vs. little core; e.g., 30%, 90%, 84%)
Low hardware utilization results in poor inference speed.
AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs
Paper published at MobiCom 2021

Why is utilization low on the CPU?
- Unbalanced task distribution by the OS, both across and within core clusters
  (Figure: computation tasks spread unevenly over big-core cluster B0-B3 and little-core cluster L0-L3)

Why is distribution unbalanced on the CPU?
Execution flow of matrix multiplication
1. Block partition for parallelism (M x K params, K x N feature map, split into mc x kc and kc x nc blocks)
2. Copy blocks into continuous memory space
3. Schedule tasks to thread-pool queues (Q0, Q1, ..., Q#)
Problems: it ignores hardware asymmetry, resource constraints, data locality, and the interference-prone environment, and it performs redundant data copies
AsyMo: optimize DL inference on big.LITTLE CPUs
- Accelerate edge DL inference with lower energy cost
- One-run initialization: cost-model-directed block partition; prearranged memory layout for params
- Inference: asymmetry-aware scheduling (task → thread ID, intra-op thread pool); data-reuse-based frequency setting
Cost-model-based block partition
- Cost for a task: computation + memory-access cost
- Cost for a sequential unit: the computation and memory-access cost of one block
- Cost for parallel calculation: (number of parallel tasks / degree of parallelism) x Cost_seq
- Other cost: unparallelized work + task scheduling + framework overhead
- Total cost: parallel cost + other cost
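Under the cost structure above, a minimal sketch (the `ceil()` rounding and all constants are illustrative choices, not AsyMo's exact model):

```python
import math

def total_cost(n_tasks: int, cost_seq: float, parallelism: int, cost_other: float) -> float:
    """AsyMo-style cost sketch: parallel tasks share the cores of a cluster;
    unparallelized work, task scheduling, and framework overhead are added on top."""
    cost_parallel = math.ceil(n_tasks / parallelism) * cost_seq
    return cost_parallel + cost_other

# 16 tasks on 4 big cores, 1.0 ms per sequential unit, 0.5 ms fixed overhead
print(total_cost(16, 1.0, 4, 0.5))
```

Minimizing this total over candidate block sizes is what picks the partition.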
Optimized execution flow of matrix multiplication
- One-run initialization: block partition; params layout
- Inference run: copy features; schedule and run tasks; pin each thread on a core
- Better data locality; no work stealing from the big cluster to the little cluster (big cluster gets N_big columns, little cluster N_little)

Total performance and energy improvement
- AsyMo vs. TensorFlow on Kirin 970 + Android 9 Pie
  (Figure: performance and energy-efficiency improvements on the big and little clusters, e.g., ~1.72x and ~1.63x performance and ~1.85x energy-efficiency gains)
- Pre-copying params enables a parallel implementation
- Compared settings: TensorFlow @ OS frequency setting; AsyMo @ picked efficient CPU frequency; both @ max CPU frequency
SparseFlow: unleash the full potential of sparsity in deep learning
Joint work with Chen Zhang et al.

Today's DNN models are huge
- GPT-3: 175B parameters, ~$12M training cost
- MT-NLG: 530B parameters, trained on 560 DGX A100 servers

Computation is the engine behind AI's success, and we still need more
(Figure: performance (Op/s) over time, 1960-2019: ENIAC 5 Kops → CPU following Moore's law → Xeon E5 ~500 Gops → GPU V100 125 Tops → dedicated hardware: TPUv1 90 Tops, TPUv3 360 Tops)

Piling up hardware is not sustainable: the energy-efficiency wall
(Figure, 1995-2020: CPU, GPU, and TPU each hit their own energy-efficiency wall)
Sparsity is the key to the human brain's efficiency
- We do not look at everything in our visual scope
- Simple geometric shapes are enough for us to recognize a cat
Weight pruning
- Prune away small weights: MxV → SpMxV
- Unstructured sparse matrices are difficult to accelerate
  (Han, Song, et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS'15)
Accuracy and speedup trade-off
- Fine-grained/irregular sparsity
  - Pros: high model accuracy; high compression ratio
  - Cons: irregular pattern; difficult to accelerate
- Coarse-grained/regular sparsity
  - Pros: regular pattern; easy to accelerate
  - Cons: low model accuracy; low compression ratio

How to achieve both?
- Model accuracy: add few constraints on the sparsity pattern
- Speedup:
  - Matrix partitioning for parallel computing
  - Eliminating irregular computation and memory access
(S. Cao et al., "Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity",
FPGA'19.)

Bank-balanced pruning
- Split each dense matrix row into equal-sized banks and traverse all rows
- Fine-grained pruning inside each bank, with a per-bank threshold percentage to obtain an identical sparsity ratio among banks
- Example (one row of 16 values, 4 banks, 50% sparsity):
  Dense row: 0.8 -0.1 0.2 1.5 | 1.0 0.3 -0.4 -1.4 | 0.7 2.0 0.9 -0.5 | 1.2 -1.3 2.1 0.2
  BBS row:   0.8          1.5 | 1.0          -1.4 |     2.0 0.9      |     -1.3 2.1
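The pruning step can be sketched as: split each row into banks and keep the top-k magnitudes per bank. Run on the example row, it reproduces the BBS row above.

```python
def bank_balanced_prune(row, n_banks, keep_per_bank):
    """Bank-balanced pruning sketch: keep only the `keep_per_bank`
    largest-magnitude weights in each bank; zero out the rest."""
    bank_size = len(row) // n_banks
    pruned = [0.0] * len(row)
    for b in range(n_banks):
        bank = list(range(b * bank_size, (b + 1) * bank_size))
        # indices of the largest-magnitude entries within this bank
        keep = sorted(bank, key=lambda i: abs(row[i]), reverse=True)[:keep_per_bank]
        for i in keep:
            pruned[i] = row[i]
    return pruned

row = [0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4,
       0.7, 2.0, 0.9, -0.5, 1.2, -1.3, 2.1, 0.2]
print(bank_balanced_prune(row, 4, 2))
# -> [0.8, 0.0, 0.0, 1.5, 1.0, 0.0, 0.0, -1.4, 0.0, 2.0, 0.9, 0.0, 0.0, -1.3, 2.1, 0.0]
```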
Bank-Balanced Sparsity (BBS)
- Bank partitioning for parallel computing
- Fine-grained pruning inside each bank to maintain accuracy

Sparse matrix-vector multiplication (SpMxV) with BBS
- Exploits both inter-row and inter-bank parallelism
- Load balancing across rows and banks
- Conflict-free vector accesses: each bank of a matrix row reads only its own bank of the dense vector
  (Example: rows 0 and 1 with 4 banks each, nonzeros A-H and I-P, dense vector banks V0-V11)
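A sequential sketch of SpMxV under BBS; in hardware the row loop and bank loop run in parallel, and the equal per-bank nonzero counts are what keep the lanes in lockstep. Zeros are multiplied here for clarity; a real kernel iterates only the stored nonzeros.

```python
def bbs_spmv(rows, vector, n_banks):
    """SpMxV sketch over bank-balanced rows."""
    bank_size = len(vector) // n_banks
    result = []
    for row in rows:                   # inter-row parallelism in hardware
        acc = 0.0
        for b in range(n_banks):       # inter-bank parallelism in hardware
            for i in range(bank_size): # bank b touches only vector bank b: conflict-free
                acc += row[b * bank_size + i] * vector[b * bank_size + i]
        result.append(acc)
    return result

print(bbs_spmv([[1, 0, 0, 2]], [1, 2, 3, 4], 2))  # [9.0]
```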
Our CSB (Compressed Sparse Banks) format
- Data rearrangement for inter-bank parallelization
- Stores VALUES and BANK-INTERNAL INDICES; the indices map directly to physical BRAM addresses
  Example (rows 0 and 1 above):
  VALUES:  A C E G B D F H | I K M O J L N P
  INDICES: 0 0 0 1 2 2 3 2 | 0 0 1 3 1 2 3 1
- Specifically designed for BBS to eliminate decoding overheads
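A sketch of the CSB encoding: nonzeros are emitted interleaved across banks (the first nonzero of every bank, then the second of every bank, ...), each tagged with its position inside its bank. On a row shaped like row 0 above, it yields the same index pattern.

```python
def to_csb(rows, n_banks):
    """Encode a bank-balanced sparse matrix into CSB-like (values, indices)."""
    values, indices = [], []
    for row in rows:
        bank_size = len(row) // n_banks
        # collect (value, bank-internal index) pairs per bank
        per_bank = []
        for b in range(n_banks):
            bank = row[b * bank_size:(b + 1) * bank_size]
            per_bank.append([(v, i) for i, v in enumerate(bank) if v != 0])
        # interleave across banks: k-th nonzero of bank 0, 1, 2, ...
        for k in range(len(per_bank[0])):  # BBS guarantees equal counts per bank
            for b in range(n_banks):
                v, i = per_bank[b][k]
                values.append(v)
                indices.append(i)
    return values, indices
```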
Accelerator overview
- FPGA design: controller, instruction buffer, SpMxV PEs (multipliers, adder tree, ACT), private vector buffers, matrix memory, vector memory, DMA, off-chip DRAM; host server attached over PCIe

Results
- Model accuracy: very close to the dense baseline on a PTB language model and TIMIT speech recognition
- Hardware efficiency: ~7x and ~34x improvements
SeerNet: Predicting CNN Feature-Map Sparsity through Low-Bit Quantization
(S. Cao et al., "SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization", CVPR'19)

Accelerate model inference via feature-map sparsity
- ReLU: y = max(0, x) zeroes out every negative convolution output
- Max-pooling: y = max(xi | i = {1, 2, ...}) keeps only the maximum of each window
- (Example: a convolution output map passed through ReLU or max-pooling leaves most entries zero or discarded)
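SeerNet's idea can be sketched as: run a cheap low-bit version of the convolution first, use its sign pattern to predict which outputs ReLU will zero out, and then compute the full-precision convolution only at the surviving positions. The quantization below is a crude illustration, not the paper's exact scheme.

```python
def quantize(x, scale=0.25):
    """Crude low-bit quantization: snap to a coarse grid (illustrative only)."""
    return round(x / scale) * scale

def predict_relu_mask(inputs, weights):
    """Predict ReLU sparsity from a quantized dot product: an output is
    predicted to survive ReLU only if the low-bit accumulation is positive."""
    mask = []
    for row in inputs:
        acc = sum(quantize(x) * quantize(w) for x, w in zip(row, weights))
        mask.append(acc > 0)
    return mask

print(predict_relu_mask([[1.0, -2.0], [2.0, 1.0]], [1.0, 1.0]))  # [False, True]
```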