April 6, 2021 | @QCOMResearch
Intelligence at scale through AI model efficiency
Qualcomm Technologies, Inc.

Agenda
- Why efficient machine learning is necessary for AI to proliferate
- Our latest research to make AI models more efficient
- Our open-source projects to scale efficient AI
AI is being used all around us
Increasing productivity, enhancing collaboration, and transforming industries: video monitoring, extended reality, smart cities, smart factories, autonomous vehicles, video conferencing, smart homes, and smartphones.

AI video analysis is on the rise
Trend toward more cameras, higher resolution, and increased frame rate across devices.
Deep neural networks are energy hungry and growing fast
AI is being powered by the explosive growth of deep neural networks, yet the energy efficiency of a brain is 100x better than current hardware. Will we have reached the capacity of the human brain by 2025? (Source: Welling)

[Chart: weight parameter count over time, log scale from 10^0 to 10^14]
- 1943: first neural network (N ~ 10)
- 1988: NetTalk (N ~ 20K)
- 2009: Hinton's Deep Belief Net (N ~ 10M)
- 2013: Google/Y! (N ~ 1B)
- 2017: very large neural networks (N = 137B)
- 2021: extremely large neural networks (N = 1.6T)
- 2025 (projected): N = 100T = 10^14
Power and thermal efficiency are essential for on-device AI
The challenge of AI workloads:
- Very compute intensive: large, complicated neural network models; complex concurrencies; always-on; real-time
- Constrained mobile environment: must be thermally efficient for sleek, ultra-light designs; storage/memory bandwidth limitations; requires long battery life for all-day use
Holistic model efficiency research
Multiple axes to shrink AI models and efficiently run them on hardware:
- Quantization: learning to reduce bit-precision while keeping desired accuracy
- Compression: learning to prune the model while keeping desired accuracy
- Compilation: learning to compile AI models for efficient hardware execution
- Neural architecture search: learning to design smaller neural networks that are on par with or outperform hand-designed architectures on real hardware
Leading research to efficiently quantize AI models
Automated reduction in the precision of weights and activations while maintaining accuracy. Promising results show that low-precision integer inference can become widespread: virtually the same accuracy between an FP32 and a quantized AI model(1), achieved through:
- Automated, data-free, post-training methods
- An automated training-based mixed-precision method

Models are trained at high precision (32-bit floating point, e.g. 3452.3194), but inference can run at lower precision, with significant increases in performance per watt from savings in memory and compute(1):
- 16-bit integer: up to 4X
- 8-bit integer (values 0-255): up to 16X
- 4-bit integer (values 0-15): up to 64X

1: FP32 model compared to quantized model.
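To make the precision tradeoff concrete, here is a minimal sketch of uniform affine quantization on a min-max grid, the basic scheme these savings come from. It illustrates the general idea only, not Qualcomm's implementation, and assumes 8 or fewer bits:

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Map a float array onto an integer grid spanning [x.min(), x.max()]."""
    qmax = 2 ** num_bits - 1
    scale = (x.max() - x.min()) / qmax           # float step between integer levels
    zero_point = round(float(-x.min() / scale))  # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer codes."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(1000).astype(np.float32)
q, s, z = affine_quantize(weights, num_bits=8)
print("max round-trip error at 8 bits:", np.abs(weights - dequantize(q, s, z)).max())
```

Storing and computing with the `uint8` codes instead of 32-bit floats is where the memory and compute savings in the figures above come from.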
Pushing the limits of what's possible with quantization

Data-free quantization
How can we make quantization as simple as possible? We created an automated method that addresses bias and imbalance in weight ranges:
- No training
- Data free
Result: <1% accuracy drop for MobileNet V2 against the FP32 model.
(Data-Free Quantization Through Weight Equalization and Bias Correction; Nagel, van Baalen, et al., ICCV 2019)

AdaRound
Is rounding to the nearest value the best approach for quantization? We created an automated method for finding the best rounding choice:
- No training
- Minimal unlabeled data
- SOTA 8-bit results, making 8-bit weight quantization ubiquitous
Result: <2.5% accuracy drop for MobileNet V2 against the FP32 model.
(Up or Down? Adaptive Rounding for Post-Training Quantization; Nagel, Amjad, et al., ICML 2020)

Bayesian Bits
Can we quantize layers to different bit widths based on precision sensitivity? We created a novel method to learn mixed-precision quantization:
- Training required
- Training data required
- Jointly learns bit-width precision and pruning
- SOTA mixed-precision and 4-bit weight results, making 4-bit weight quantization ubiquitous
Automating mixed-precision quantization enables the tradeoff between accuracy and kernel bit-width: <1% accuracy drop for MobileNet V2 against the FP32 model, for a mixed-precision model with computational complexity equivalent to a 4-bit weight model.
(Bayesian Bits: Unifying Quantization and Pruning; van Baalen, Louizos, et al., NeurIPS 2020)

SOTA: state-of-the-art
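The weight-equalization idea behind data-free quantization fits in a few lines. The following is a minimal NumPy sketch of the published idea (not Qualcomm's implementation), assuming two consecutive fully connected layers with a ReLU between them:

```python
import numpy as np

def equalize_pair(w1, b1, w2):
    """Rescale per-channel so both layers get balanced weight ranges.

    w1: (out, in) weights of layer 1; b1: (out,) bias; w2: (out2, out) weights
    of layer 2. f(x) = w2 @ relu(w1 @ x + b1) is unchanged because
    relu(s * x) = s * relu(x) for s > 0, and the scale cancels in layer 2.
    """
    r1 = np.abs(w1).max(axis=1)        # per-output-channel range of layer 1
    r2 = np.abs(w2).max(axis=0)        # per-input-channel range of layer 2
    s = np.sqrt(r1 * r2) / r1          # scale that equalizes the two ranges
    return w1 * s[:, None], b1 * s, w2 / s[None, :]

# Verify the network function is preserved on random data.
rng = np.random.default_rng(0)
w1, b1, w2 = rng.normal(size=(16, 8)), rng.normal(size=16), rng.normal(size=(4, 16))
w1e, b1e, w2e = equalize_pair(w1, b1, w2)
x = rng.normal(size=8)
y  = w2  @ np.maximum(w1  @ x + b1,  0)
ye = w2e @ np.maximum(w1e @ x + b1e, 0)
assert np.allclose(y, ye)
```

Because the scales cancel between the two layers, the model computes exactly the same function while each weight tensor spans a more uniform range, which is what makes a single per-tensor quantization grid fit well without any training data.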
Optimizing and deploying state-of-the-art AI models for diverse scenarios at scale is challenging
- Neural network complexity: many state-of-the-art neural network solutions are large, complex, and do not run efficiently on target hardware
- Neural network diversity: for different tasks and use cases, many different neural networks are required
- Device diversity: deploying neural networks to many different devices with different configurations and changing software is required
- Cost: compute and engineering resources for training plus evaluation are too costly and time consuming
NAS: neural architecture search
An automated way to learn a network topology that can achieve the best performance on a certain task. Its three ingredients (sketched in code below):
- Search space: the set of operations and how they can be connected to form valid network architectures
- Search algorithm: a method for sampling a population of good network architecture candidates
- Evaluation strategy: a method to estimate the performance of sampled network architectures
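An illustrative rendering of those three ingredients, with all names and values invented for the sketch: a tiny search space, random sampling as the search algorithm, and a stub evaluation strategy standing in for actual training:

```python
import random

SEARCH_SPACE = {                       # search space: per-layer choices
    "kernel": [3, 5, 7],
    "expansion": [2, 3, 4, 6],
    "depth": [1, 2, 3, 4],
}

def sample_architecture():
    """Search algorithm: here, plain random sampling."""
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch):
    """Evaluation strategy: a toy proxy score; in practice this is the
    expensive part (train the candidate, then measure accuracy)."""
    return -abs(arch["kernel"] - 5) - abs(arch["expansion"] - 4)

best = max((sample_architecture() for _ in range(100)), key=evaluate)
print("best sampled candidate:", best)
```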
Existing NAS solutions do not address all the challenges
- High cost: brute-force search is expensive, >40,000 epochs per platform
- Lack of diverse search: hard to search in diverse spaces with different block types, attention, and activations
- Do not scale: repeated training phase for every new scenario and every new device
- Unreliable hardware models: require differentiable cost functions
Introducing new AI research: DONNA
Distilling Optimal Neural Network Architectures: efficient NAS with hardware-aware optimization. A scalable method that finds Pareto-optimal network architectures, in terms of accuracy and latency, for any hardware platform at low cost. DONNA starts from an oversized pretrained reference architecture and produces a set of Pareto-optimal network architectures.
- Low cost: low start-up cost of 1000-4000 epochs, equivalent to training 2-10 networks from scratch
- Diverse search to find the best models: supports diverse spaces with different cell types, attention, and activation functions (ReLU, Swish, etc.)
- Scalable: scales to many hardware devices at minimal cost
- Reliable hardware measurements: uses direct hardware measurements instead of a potentially inaccurate hardware model
(Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces; Moons, Bert, et al., arXiv 2020)
DONNA 4-step process, step A: define reference and search space once
Objective: build an accuracy model of the search space once, then deploy to many scenarios.
A diverse search space is essential for finding optimal architectures with higher accuracy. Varying parameters: kernel size, expansion factors, network depth, network width, attention/activation, and different efficient layer types; the backbone (fixed channels, head, and stem) stays fixed.
- Select the reference architecture: the largest model in the search space
- Chop the NN into blocks: fix the STEM (Conv3x3 s2, DWConv, ch=32), the HEAD (Conv1x1 to ch=1536, Avg, FC), the number of blocks (1-5), the strides (s=2, except s=1 for block 4), and the channels at block edges (32, 64, 96, 128, 196, 256)
- Choose a diverse, factorized, hierarchical search space, including variable kernel size (3, 5, 7), expansion rate (2, 3, 4, 6), depth (1, 2, 3, 4), width scale (0.5x, 1.0x), cell type (grouped, DW, ...), activation (ReLU/Swish), and attention (SE, no SE)
Ch: channel; SE: Squeeze-and-Excitation
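To see how large this factorized space is, here is an illustrative encoding; the choice values come from the slide above, while the code structure and names are assumptions for the sketch, not DONNA's actual code:

```python
BLOCK_CHOICES = {
    "kernel_size": [3, 5, 7],
    "expansion": [2, 3, 4, 6],
    "depth": [1, 2, 3, 4],
    "width_scale": [0.5, 1.0],
    "attention": ["none", "SE"],
    "activation": ["relu", "swish"],
    "cell_type": ["grouped", "depthwise"],
}
NUM_BLOCKS = 5  # STEM, HEAD, strides, and block-edge channels stay fixed

options_per_block = 1
for values in BLOCK_CHOICES.values():
    options_per_block *= len(values)   # 768 combinations per block

print(f"{options_per_block} choices per block, "
      f"~{float(options_per_block ** NUM_BLOCKS):.1e} networks in the space")
```

Because blocks vary independently between fixed edges, the space factorizes per block, which is exactly what the blockwise accuracy model in step B exploits.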
DONNA 4-step process, step B: build accuracy model via knowledge distillation (KD) once
Objective: build an accuracy model of the search space once, then deploy to many scenarios.
Approximate ideal projections of the reference model through KD: each candidate block (1-5) is trained to mimic the output of the corresponding reference block, and the quality of these blockwise approximations (a per-block MSE) is then used to build the accuracy model.
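A minimal PyTorch sketch of that blockwise knowledge distillation step; the teacher/student modules and the feature loader are placeholders, not DONNA's actual code:

```python
import torch
import torch.nn as nn

def train_block(student_block: nn.Module, teacher_block: nn.Module,
                feature_loader, epochs: int = 1, lr: float = 1e-3) -> float:
    """Train one candidate block to reproduce the reference block's outputs."""
    opt = torch.optim.Adam(student_block.parameters(), lr=lr)
    mse = nn.MSELoss()
    teacher_block.eval()
    loss = torch.tensor(0.0)
    for _ in range(epochs):
        for x in feature_loader:              # block-input feature maps
            with torch.no_grad():
                target = teacher_block(x)     # the "ideal projection" target
            loss = mse(student_block(x), target)
            opt.zero_grad(); loss.backward(); opt.step()
    # The final MSE doubles as the block quality metric fed to the predictor.
    return float(loss)
```

Since every block trains against fixed reference features, the whole block library can be pretrained in parallel, which is what keeps the start-up cost at 1000-4000 epochs.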
Build the accuracy predictor via BKD once: a low-cost, hardware-agnostic training phase
- Pretrain all blocks in the search space through blockwise knowledge distillation, producing a block library (block pretrained weights, block quality metrics): fast block training, trivially parallelized, broad search space
- Quickly finetune a representative set of sampled networks, producing an architecture library (finetuned architectures): fast network training; only 20-30 NNs required
- Fit the accuracy predictor as a linear regression model (regularized ridge regression): accurate predictions
BKD: blockwise knowledge distillation
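The predictor itself is small. A sketch using scikit-learn's `Ridge`, with synthetic numbers standing in for the 20-30 finetuned architectures:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Features: per-block quality metrics (e.g. the 5 blockwise MSEs from BKD).
block_mse = rng.uniform(0.0, 1.0, size=(25, 5))
# Targets: finetuned top-1 accuracy of those sampled architectures (synthetic).
accuracy = 80.0 - 6.0 * block_mse.sum(axis=1) + rng.normal(0, 0.2, 25)

predictor = Ridge(alpha=1.0).fit(block_mse, accuracy)

# Once fitted, any architecture in the space is scored instantly from its
# cheap block metrics, with no per-candidate training.
candidate = rng.uniform(0.0, 1.0, size=(1, 5))
print(f"predicted top-1: {predictor.predict(candidate)[0]:.2f}%")
```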
DONNA 4-step process, step C: evolutionary search in 24h
Objective: build an accuracy model of the search space once, then deploy to many scenarios.
Scenario-specific search: reusing the accuracy model from step B, trade off predicted accuracy against HW latency for each deployment scenario (e.g., different compiler versions, different image sizes).
Evolutionary search with real hardware measurements
Scenario-specific search allows users to select optimal architectures for real-life deployments. An NSGA-II sampling algorithm proposes end-to-end models; the task accuracy predictor scores them, while latency is measured directly on the target hardware.
- Quick turnaround time: results in ~1 day using one measurement device
- Accurate scenario-specific search: captures all intricacies of the hardware platform and software, e.g., run-time version or devices
NSGA: non-dominated sorting genetic algorithm
Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
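A simplified sketch of the shape of that search loop: a plain Pareto-front evolution stands in for NSGA-II, and both `predicted_accuracy` (the ridge predictor) and `measured_latency` (the on-device measurement) are stubs:

```python
import random

def predicted_accuracy(arch):        # stand-in for the accuracy predictor
    return -sum((g - 3) ** 2 for g in arch)

def measured_latency(arch):          # stand-in for an on-device measurement
    return sum(arch)

def mutate(arch):
    i = random.randrange(len(arch))
    return arch[:i] + (random.choice([1, 2, 3, 4, 6]),) + arch[i + 1:]

def pareto_front(pop):
    """Keep candidates not dominated in (accuracy up, latency down)."""
    scored = [(predicted_accuracy(a), measured_latency(a), a) for a in set(pop)]
    return [a for acc, lat, a in scored
            if not any(acc2 >= acc and lat2 <= lat and (acc2, lat2) != (acc, lat)
                       for acc2, lat2, _ in scored)]

population = [tuple(random.choice([1, 2, 3, 4, 6]) for _ in range(5))
              for _ in range(20)]
for _ in range(50):                  # evolve by mutating Pareto survivors
    front = pareto_front(population)
    population = front + [mutate(random.choice(front))
                          for _ in range(max(0, 20 - len(front)))]
print("Pareto-optimal candidates:", pareto_front(population)[:5])
```

Because latency is measured rather than modeled, the loop automatically absorbs compiler and runtime quirks that a differentiable cost model would miss.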
DONNA 4-step process, step D: sample and finetune
Objective: build an accuracy model of the search space once, then deploy to many scenarios.
Use the KD-initialized blocks from step B to finetune any network in the search space in 15-50 epochs instead of 450.
Quickly finetune predicted Pareto-optimal architectures
Finetune to reach full accuracy and complete hardware-aware optimization for on-device AI deployments. Starting from the block-pretrained weights of the BKD-reference network, the selected architectures are trained with soft distillation on the teacher logits plus the ground-truth labels, turning predicted accuracy and HW latency into confirmed accuracy and HW latency.
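A sketch of that finetuning loss, combining hard labels with soft distillation on the teacher logits; the temperature and mixing weight are illustrative hyperparameters, not values from the paper:

```python
import torch
import torch.nn.functional as F

def finetune_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Cross-entropy on ground-truth labels + KL to softened teacher logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T   # standard T^2 rescaling
    return alpha * hard + (1.0 - alpha) * soft

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
finetune_loss(student_logits, teacher_logits, labels).backward()
```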
DONNA finds state-of-the-art networks for on-device scenarios
Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter.
[Charts: top-1 validation accuracy (%) versus number of parameters (M), FLOPS, desktop GPU latency (FPS), and mobile SoC latency (FPS) on the Adreno 660 GPU(1) and the Hexagon 780 processor(2); 224x224 images, 672x672 for the Hexagon chart. In each scenario, DONNA models are about 20% faster at similar accuracy.]
1: Qualcomm Adreno 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21. 2: Qualcomm Hexagon 780 processor in the Snapdragon 888 running on the Samsung Galaxy S21. Qualcomm Adreno and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
DONNA provides MnasNet-level diversity at 100x lower cost*
DONNA efficiently finds optimal models over diverse scenarios; the cost of training is a handful of architectures.

| Method  | Granularity | Macro-diversity | Search cost, 1 scenario [epochs] | Cost/scenario, 4 scenarios [epochs] | Cost/scenario, ∞ scenarios [epochs] |
|---------|-------------|-----------------|----------------------------------|-------------------------------------|-------------------------------------|
| OFA     | Layer-level | Fixed           | 1200 + 10×[25-75]                | 550-1050                            | 250-750                             |
| DNA     | Layer-level | Fixed           | 770 + 10×450                     | 4700                                | 4500                                |
| MnasNet | Block-level | Variable        | 40000 + 10×450                   | 44500                               | 44500                               |
| DONNA   | Block-level | Variable        | 4000 + 10×50                     | 1500                                | 500                                 |

*Training 1 model from scratch = 450 epochs
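The per-scenario numbers follow from amortizing the one-time start-up cost over the scenarios it serves; a quick check in Python, with epoch counts taken from the DONNA row of the table (the other rows work the same way):

```python
def cost_per_scenario(startup_epochs, per_scenario_epochs, num_scenarios):
    """Amortize a one-time start-up cost across scenarios."""
    return startup_epochs / num_scenarios + per_scenario_epochs

# DONNA: 4000-epoch start-up, then ~10 networks x 50 epochs per scenario.
for n in (1, 4, 1_000_000):
    print(n, cost_per_scenario(4000, 10 * 50, n))
# -> 4500, 1500, and ~500 epochs/scenario, matching the table's DONNA row.
```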
DONNA applies directly to downstream tasks and non-CNN neural architectures without conceptual code changes
Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter.
[Charts: predicted top-1 accuracy versus multiply-accumulate operations (FLOPS) for mobile models, ResNet-50, and vision transformers (ViT-B, DeiT-B); object detection results reported as val mAP (%).]
User perspective for DONNA
Build an accuracy model of the search space once, then deploy to many scenarios.
IN: an oversized pretrained reference architecture.
- A: define a search space of smaller, faster network architectures
- B: create an accuracy predictor for all network architectures (roughly equal to training 2-10 nets from scratch)
- C: execute predictor-driven evolutionary search on-device (about 1 day per use case)
- D: finetune the best searched models, 9-30x faster than regular training (50 GPU-hrs/net)
OUT: a set of Pareto-optimal network architectures; deploy the best models @ 3ms, 6ms, 9ms, ... on different chips.
DONNA conclusions
- DONNA shrinks big networks in a hardware-efficient way
- DONNA can be rerun for any new device or setting within a day
- DONNA works on many different tasks out of the box
- DONNA enables scalability and allows models to be easily updated after small changes rather than starting from scratch
Leading AI research and fast commercialization
Driving the industry towards integer inference and power-efficient AI.
- Quantization research: Relaxed Quantization (ICLR 2019), Data-free Quantization (ICCV 2019), AdaRound (ICML 2020), Bayesian Bits (NeurIPS 2020)
- Quantization open-sourcing: AI Model Efficiency Toolkit (AIMET), AIMET Model Zoo
AIMET and AIMET Model Zoo are products of Qualcomm Innovation Center, Inc.

AIMET & AIMET Model Zoo
Open-source projects to scale model-efficient AI to the masses.
AIMET makes AI models small
An open-sourced GitHub project that includes state-of-the-art quantization and compression techniques from Qualcomm AI Research. A trained AI model (TensorFlow or PyTorch) goes through AIMET's quantization and compression to produce an optimized AI model, ready for deployment.
Features:
- State-of-the-art network compression tools
- State-of-the-art quantization tools
- Support for both TensorFlow and PyTorch
- Benchmarks and tests for many models
- Developed by professional software developers
If interested, please join the AIMET GitHub project: /quic/aimet
AIMET: providing advanced model efficiency features and benefits
Benefits: lower memory bandwidth, lower power, lower storage, higher performance, maintained model accuracy, simple ease of use.
Features:
- Quantization: state-of-the-art INT8 and INT4 performance; post-training quantization methods, including Data-Free Quantization and Adaptive Rounding (AdaRound, coming soon); quantization-aware training; quantization simulation
- Compression: efficient tensor decomposition and removal of redundant channels in convolution layers; spatial singular value decomposition (SVD); channel pruning
- Visualization: analysis tools for drawing insights for quantization and compression, such as weight ranges and per-layer compression sensitivity
AIMET features and APIs are easy to use
Designed to fit naturally in the AI model development workflow for researchers, developers, and ISVs. APIs are invoked directly from the pipeline, with support for TensorFlow and PyTorch: direct algorithm APIs and framework-specific APIs sit on top of the model optimization library (techniques to compress & quantize models), with extensions for other frameworks.
User-friendly APIs:

```python
compress_model(model,
               eval_callback=obj_det_eval,
               compress_scheme=Scheme.spatial_svd,
               ...)
equalize_model(model, ...)
```
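For instance, the `equalize_model` call above maps to AIMET's PyTorch package roughly as follows; signatures can change between AIMET releases, so treat this as a sketch and check the /quic/aimet documentation:

```python
# Usage sketch: cross-layer equalization on a torchvision model via aimet_torch.
import torch
from torchvision.models import mobilenet_v2
from aimet_torch.cross_layer_equalization import equalize_model

model = mobilenet_v2(pretrained=True).eval()
# Rebalances weight ranges across consecutive layers in place, so a
# subsequent INT8 quantization loses less accuracy (the DFQ recipe).
equalize_model(model, input_shapes=(1, 3, 224, 224))
```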
Data-Free Quantization results in AIMET
DFQ is a post-training technique enabling INT8 inference with very minimal loss in accuracy. The pipeline:
1. Cross-layer equalization: equalize weight ranges
2. Bias absorption: avoid high activation ranges
3. Weight quantization
4. Bias correction: measure and correct the shift in layer outputs
5. Activation range estimation: estimate activation ranges for quantization

DFQ example results, % reduction in accuracy between FP32 and INT8:
- MobileNet-v2 (top-1 accuracy): <1%
- ResNet-50 (top-1 accuracy): <1%
- DeepLabv3 (mean intersection over union): <1%
AdaRound is coming soon to AIMET
A post-training technique that makes INT8 quantization more accurate and INT4 quantization possible: <1% reduction in accuracy between FP32 and INT8 with AdaRound quantization.

| Bitwidth                    | Mean AP (mAP) |
|-----------------------------|---------------|
| FP32                        | 82.20         |
| INT8, baseline quantization | 49.85         |
| INT8, AdaRound quantization | 81.21         |

AP: average precision
AIMET Model Zoo
Accurate pre-trained 8-bit quantized models for image classification, semantic segmentation, pose estimation, speech recognition, super resolution, and object detection.

AIMET Model Zoo includes popular quantized AI models; accuracy is maintained for INT8 models, with less than 1% loss.*

TensorFlow models:
| Model             | Metric          | FP32   | INT8   |
|-------------------|-----------------|--------|--------|
| ResNet-50 (v1)    | Top-1 accuracy* | 75.21% | 74.96% |
| MobileNet-v2-1.4  | Top-1 accuracy* | 75%    | 74.21% |
| EfficientNet Lite | Top-1 accuracy* | 74.93% | 74.99% |
| SSD MobileNet-v2  | mAP*            | 0.2469 | 0.2456 |
| RetinaNet         | mAP*            | 0.35   | 0.349  |
| Pose estimation   | mAP*            | 0.383  | 0.379  |
| SRGAN             | PSNR*           | 25.45  | 24.78  |

PyTorch models:
| Model                | Metric          | FP32   | INT8   |
|----------------------|-----------------|--------|--------|
| MobileNetV2          | Top-1 accuracy* | 71.67% | 71.14% |
| EfficientNet-lite0   | Top-1 accuracy* | 75.42% | 74.44% |
| DeepLabV3+           | mIoU*           | 72.62% | 72.22% |
| MobileNetV2-SSD-Lite | mAP*            | 68.7%  | 68.6%  |
| Pose estimation      | mAP*            | 0.364  | 0.359  |
| SRGAN                | PSNR*           | 25.51  | 25.5   |
| DeepSpeech2          | WER*            | 9.92%  | 10.22% |

*: Comparison between the FP32 model and the INT8 model quantized with AIMET. For further details, check out: /quic/aimet-model-zoo/

<1% loss in accuracy*
- Baseline quantization: post-training quantization using a min-max based quantization grid
- AIMET quantization: model fine-tuned using quantization-aware training