DeepSeek-V3 Technical Report

DeepSeek-AI
research@deepseek.com

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at /deepseek-ai/DeepSeek-V3.
Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts. [Chart: accuracy/percentile (%) of DeepSeek-V3, DeepSeek-V2.5, Qwen2.5-72B-Inst, Llama-3.1-405B-Inst, GPT-4o-0513, and Claude-3.5-Sonnet-1022 on Codeforces (Percentile), SWE-bench Verified (Resolved), MATH-500 (EM), AIME 2024 (Pass@1), MMLU-Pro (EM), and GPQA-Diamond (Pass@1).]
Contents

1 Introduction
2 Architecture
  2.1 Basic Architecture
    2.1.1 Multi-Head Latent Attention
    2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
  2.2 Multi-Token Prediction
3 Infrastructures
  3.1 Compute Clusters
  3.2 Training Framework
    3.2.1 DualPipe and Computation-Communication Overlap
    3.2.2 Efficient Implementation of Cross-Node All-to-All Communication
    3.2.3 Extremely Memory Saving with Minimal Overhead
  3.3 FP8 Training
    3.3.1 Mixed Precision Framework
    3.3.2 Improved Precision from Quantization and Multiplication
    3.3.3 Low-Precision Storage and Communication
  3.4 Inference and Deployment
    3.4.1 Prefilling
    3.4.2 Decoding
  3.5 Suggestions on Hardware Design
    3.5.1 Communication Hardware
    3.5.2 Compute Hardware
4 Pre-Training
  4.1 Data Construction
  4.2 Hyper-Parameters
  4.3 Long Context Extension
  4.4 Evaluations
    4.4.1 Evaluation Benchmarks
    4.4.2 Evaluation Results
  4.5 Discussion
    4.5.1 Ablation Studies for Multi-Token Prediction
    4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy
    4.5.3 Batch-Wise Load Balance VS. Sequence-Wise Load Balance
5 Post-Training
  5.1 Supervised Fine-Tuning
  5.2 Reinforcement Learning
    5.2.1 Reward Model
    5.2.2 Group Relative Policy Optimization
  5.3 Evaluations
    5.3.1 Evaluation Settings
    5.3.2 Standard Evaluation
    5.3.3 Open-Ended Evaluation
    5.3.4 DeepSeek-V3 as a Generative Reward Model
  5.4 Discussion
    5.4.1 Distillation from DeepSeek-R1
    5.4.2 Self-Rewarding
    5.4.3 Multi-Token Prediction Evaluation
6 Conclusion, Limitations, and Future Directions
A Contributions and Acknowledgments
B Ablation Studies for Low-Precision Training
  B.1 FP8 v.s. BF16 Training
  B.2 Discussion About Block-Wise Quantization
C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models
1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.
With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.
In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.
During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.
Training Costs         Pre-Training   Context Extension   Post-Training   Total
in H800 GPU Hours      2664K          119K                5K              2788K
in USD                 $5.328M        $0.238M             $0.01M          $5.576M

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.
We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.
Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
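As a quick sanity check, the following minimal Python snippet reproduces the arithmetic above. The cluster size, per-stage GPU-hour counts, and the $2/hour rental assumption are taken directly from the text; everything else is simple arithmetic.

```python
# Reproduce the training-cost arithmetic reported in Table 1 and the text above.
GPUS = 2048                      # H800 GPUs in the cluster (from the text)
RATE_USD_PER_GPU_HOUR = 2.0      # assumed rental price of an H800

gpu_hours = {
    "pre-training": 2_664_000,   # 14.8T tokens at ~180K GPU hours per trillion tokens
    "context-extension": 119_000,
    "post-training": 5_000,
}

# Wall-clock days to process one trillion tokens on the full cluster.
days_per_trillion_tokens = 180_000 / GPUS / 24
print(f"~{days_per_trillion_tokens:.1f} days per trillion tokens")    # ~3.7 days

total_hours = sum(gpu_hours.values())
print(f"total: {total_hours / 1e6:.3f}M GPU hours")                   # 2.788M
print(f"cost:  ${total_hours * RATE_USD_PER_GPU_HOUR / 1e6:.3f}M")    # $5.576M
```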
Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective

• On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
• We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.
Pre-Training: Towards Ultimate Training Efficiency

• We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
• Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
• At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.
Post-Training: Knowledge Distillation from DeepSeek-R1

• We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.
Summary of Core Evaluation Results

• Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
• Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design. Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).
2. Architecture

We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).
2.1. Basic Architecture

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.
Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training. [Diagram: a Transformer block ×L with RMSNorm, Multi-Head Latent Attention, and a DeepSeekMoE feed-forward layer whose router selects the top-K_r routed experts alongside shared experts; in the MLA panel, the compressed latents and the decoupled RoPE key are the only tensors cached during inference.]
2.1.1. Multi-Head Latent Attention

For attention, DeepSeek-V3 adopts the MLA architecture. Let d denote the embedding dimension, n_h denote the number of attention heads, d_h denote the dimension per head, and h_t ∈ R^d denote the attention input for the t-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce the Key-Value (KV) cache during inference:

c_t^{KV} = W^{DKV} h_t,  (1)
[k_{t,1}^{C}; k_{t,2}^{C}; \dots; k_{t,n_h}^{C}] = k_t^{C} = W^{UK} c_t^{KV},  (2)
k_t^{R} = \mathrm{RoPE}(W^{KR} h_t),  (3)
k_{t,i} = [k_{t,i}^{C}; k_t^{R}],  (4)
[v_{t,1}^{C}; v_{t,2}^{C}; \dots; v_{t,n_h}^{C}] = v_t^{C} = W^{UV} c_t^{KV},  (5)

where c_t^{KV} ∈ R^{d_c} is the compressed latent vector for keys and values; d_c (≪ d_h n_h) indicates the KV compression dimension; W^{DKV} ∈ R^{d_c × d} denotes the down-projection matrix; W^{UK}, W^{UV} ∈ R^{d_h n_h × d_c} are the up-projection matrices for keys and values, respectively; W^{KR} ∈ R^{d_h^R × d} is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); RoPE(·) denotes the operation that applies RoPE matrices; and [·;·] denotes concatenation. Note that for MLA, only the blue-boxed vectors (i.e., c_t^{KV} and k_t^{R}) need to be cached during generation, which results in significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).
For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:

c_t^{Q} = W^{DQ} h_t,  (6)
[q_{t,1}^{C}; q_{t,2}^{C}; \dots; q_{t,n_h}^{C}] = q_t^{C} = W^{UQ} c_t^{Q},  (7)
[q_{t,1}^{R}; q_{t,2}^{R}; \dots; q_{t,n_h}^{R}] = q_t^{R} = \mathrm{RoPE}(W^{QR} c_t^{Q}),  (8)
q_{t,i} = [q_{t,i}^{C}; q_{t,i}^{R}],  (9)

where c_t^{Q} ∈ R^{d_c'} is the compressed latent vector for queries; d_c' (≪ d_h n_h) denotes the query compression dimension; W^{DQ} ∈ R^{d_c' × d} and W^{UQ} ∈ R^{d_h n_h × d_c'} are the down-projection and up-projection matrices for queries, respectively; and W^{QR} ∈ R^{d_h^R n_h × d_c'} is the matrix to produce the decoupled queries that carry RoPE.
Ultimately, the attention queries (q_{t,i}), keys (k_{j,i}), and values (v_{j,i}^{C}) are combined to yield the final attention output u_t:

o_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{q_{t,i}^{T} k_{j,i}}{\sqrt{d_h + d_h^{R}}}\right) v_{j,i}^{C},  (10)
u_t = W^{O} [o_{t,1}; o_{t,2}; \dots; o_{t,n_h}],  (11)

where W^{O} ∈ R^{d × d_h n_h} denotes the output projection matrix.
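To make the shape bookkeeping concrete, below is a minimal, non-optimized PyTorch-style sketch of the MLA computation in Equations (1)-(11) for a single sequence. The toy dimensions, module and variable names, and the identity RoPE stub are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only (the real model uses much larger values).
d, n_h, d_h, d_hR, d_c, d_cq = 256, 4, 32, 16, 64, 96

def rope(x):
    # Placeholder: real MLA applies rotary position embeddings here; we keep the
    # identity to focus on the latent-compression structure (an assumption of this sketch).
    return x

class MLA(nn.Module):
    """Simplified Multi-head Latent Attention following Eqs. (1)-(11)."""
    def __init__(self):
        super().__init__()
        self.W_DKV = nn.Linear(d, d_c, bias=False)            # Eq. (1): KV down-projection
        self.W_UK  = nn.Linear(d_c, n_h * d_h, bias=False)    # Eq. (2): key up-projection
        self.W_UV  = nn.Linear(d_c, n_h * d_h, bias=False)    # Eq. (5): value up-projection
        self.W_KR  = nn.Linear(d, d_hR, bias=False)           # Eq. (3): decoupled RoPE key
        self.W_DQ  = nn.Linear(d, d_cq, bias=False)           # Eq. (6): query down-projection
        self.W_UQ  = nn.Linear(d_cq, n_h * d_h, bias=False)   # Eq. (7)
        self.W_QR  = nn.Linear(d_cq, n_h * d_hR, bias=False)  # Eq. (8)
        self.W_O   = nn.Linear(n_h * d_h, d, bias=False)      # Eq. (11): output projection

    def forward(self, h):                       # h: [T, d] hidden states of one sequence
        T = h.shape[0]
        # --- the only tensors that need caching during generation ---
        c_kv = self.W_DKV(h)                    # [T, d_c]
        k_r  = rope(self.W_KR(h))               # [T, d_hR], shared across heads
        # --- keys / values reconstructed from the latent ---
        k_c = self.W_UK(c_kv).view(T, n_h, d_h)
        v_c = self.W_UV(c_kv).view(T, n_h, d_h)
        k   = torch.cat([k_c, k_r.unsqueeze(1).expand(T, n_h, d_hR)], dim=-1)   # Eq. (4)
        # --- queries (also low-rank compressed) ---
        c_q = self.W_DQ(h)
        q_c = self.W_UQ(c_q).view(T, n_h, d_h)
        q_r = rope(self.W_QR(c_q).view(T, n_h, d_hR))
        q   = torch.cat([q_c, q_r], dim=-1)                                     # Eq. (9)
        # --- causal attention, Eqs. (10)-(11) ---
        scores = torch.einsum("qhd,khd->hqk", q, k) / (d_h + d_hR) ** 0.5
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        o = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v_c)
        return self.W_O(o.reshape(T, n_h * d_h))

out = MLA()(torch.randn(10, d))
print(out.shape)   # torch.Size([10, 256])
```

The point of the structure is visible in the forward pass: only c_kv (size d_c) and the small decoupled key k_r need to be cached per token, rather than full per-head keys and values.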
2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let u_t denote the FFN input of the t-th token; we compute the FFN output h_t' as follows:

h_t' = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(u_t) + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}_i^{(r)}(u_t),  (12)
g_{i,t} = \frac{g'_{i,t}}{\sum_{j=1}^{N_r} g'_{j,t}},  (13)
g'_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{Topk}(\{s_{j,t} \mid 1 \le j \le N_r\}, K_r), \\ 0, & \text{otherwise}, \end{cases}  (14)
s_{i,t} = \mathrm{Sigmoid}(u_t^{T} e_i),  (15)

where N_s and N_r denote the numbers of shared experts and routed experts, respectively; FFN_i^{(s)}(·) and FFN_i^{(r)}(·) denote the i-th shared expert and the i-th routed expert, respectively; K_r denotes the number of activated routed experts; g_{i,t} is the gating value for the i-th expert; s_{i,t} is the token-to-expert affinity; e_i is the centroid vector of the i-th routed expert; and Topk(·, K) denotes the set comprising the K highest scores among the affinity scores calculated for the t-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
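For concreteness, a small sketch of the gating computation in Equations (13)-(15) might look as follows; the function and variable names, shapes, and toy sizes are our own illustration:

```python
import torch

def moe_gate(u_t, E, K_r):
    """Sigmoid affinity + top-K selection + normalization over the selected scores,
    following Eqs. (13)-(15). u_t: [d] token hidden state; E: [N_r, d] routed-expert
    centroids (the e_i vectors); K_r: number of activated routed experts."""
    s = torch.sigmoid(E @ u_t)                      # Eq. (15): affinities s_{i,t}
    topk_vals, topk_idx = torch.topk(s, K_r)        # Eq. (14): keep the K_r highest scores
    g_prime = torch.zeros_like(s).scatter(0, topk_idx, topk_vals)
    g = g_prime / g_prime.sum()                     # Eq. (13): normalize selected scores
    return g, topk_idx                              # gate values and chosen expert indices

# Toy usage: 16 routed experts, hidden size 8, activate 4 experts per token.
g, idx = moe_gate(torch.randn(8), torch.randn(16, 8), K_r=4)
print(idx.tolist(), round(float(g.sum()), 4))       # 4 expert ids, gates sum to 1.0
```

The full FFN output of Eq. (12) would then add the shared experts' outputs and the gate-weighted outputs of the K_r selected routed experts back onto u_t.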
Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term b_i for each expert and add it to the corresponding affinity scores s_{i,t} to determine the top-K routing:

g'_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} + b_i \in \mathrm{Topk}(\{s_{j,t} + b_j \mid 1 \le j \le N_r\}, K_r), \\ 0, & \text{otherwise}. \end{cases}  (16)

Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score s_{i,t}. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we will decrease the bias term by γ if its corresponding expert is overloaded, and increase it by γ if its corresponding expert is underloaded, where γ is a hyper-parameter called the bias update speed. Through the dynamic adjustment, DeepSeek-V3 keeps a balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.
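A minimal sketch of this bias update, assuming a per-expert token count is collected over the step; the "load above the batch mean" criterion for overload is our own simplification, since the text only distinguishes overloaded from underloaded experts:

```python
import torch

def update_routing_bias(bias, expert_load, gamma):
    """Aux-loss-free balancing: after each training step, nudge each expert's routing
    bias down if it was overloaded and up if it was underloaded.
    bias, expert_load: [N_r] tensors; gamma is the bias update speed."""
    mean_load = expert_load.float().mean()
    overloaded = expert_load > mean_load           # simplified notion of "overloaded"
    bias = bias - gamma * overloaded.float()       # discourage overloaded experts
    bias = bias + gamma * (~overloaded).float()    # encourage underloaded experts
    return bias

# Routing then uses s_{i,t} + b_i only for the Top-K selection of Eq. (16); the gate
# value applied to the expert output still comes from the raw affinity s_{i,t}.
bias = update_routing_bias(torch.zeros(16), torch.randint(0, 100, (16,)), gamma=0.001)
```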
Complementary Sequence-Wise Auxiliary Loss. Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:

\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i,  (17)
f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}\big(s_{i,t} \in \mathrm{Topk}(\{s_{j,t} \mid 1 \le j \le N_r\}, K_r)\big),  (18)
s'_{i,t} = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}},  (19)
P_i = \frac{1}{T} \sum_{t=1}^{T} s'_{i,t},  (20)

where the balance factor α is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; 1(·) denotes the indicator function; and T denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.
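A short sketch of how this sequence-wise loss can be computed from the affinity scores of one sequence, following Equations (17)-(20); the tensor layout and function name are illustrative:

```python
import torch

def sequence_balance_loss(s, K_r, alpha):
    """Sequence-wise balance loss, Eqs. (17)-(20).
    s: [T, N_r] affinity scores of one sequence's tokens for all routed experts."""
    T, N_r = s.shape
    topk_idx = s.topk(K_r, dim=-1).indices                      # per-token selected experts
    selected = torch.zeros_like(s).scatter(-1, topk_idx, 1.0)   # indicator 1(...)
    f = (N_r / (K_r * T)) * selected.sum(dim=0)                 # Eq. (18): load fraction f_i
    s_norm = s / s.sum(dim=-1, keepdim=True)                    # Eq. (19): normalized affinities
    P = s_norm.mean(dim=0)                                      # Eq. (20): mean probability P_i
    return alpha * (f * P).sum()                                # Eq. (17)

loss = sequence_balance_loss(torch.rand(128, 16), K_r=4, alpha=1e-4)
```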
Figure 3 | Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth. [Diagram: the main model (next-token prediction) and MTP Modules 1 and 2 (next-2- and next-3-token prediction), each built from RMSNorm, a linear projection over the concatenated inputs, a Transformer block, and an output head, with per-depth cross-entropy losses; the embedding layer and output head are shared across the main model and the MTP modules.]
Node-Limited Routing. Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training. In short, we ensure that each token will be sent to at most M nodes, which are selected according to the sum of the highest K_r/M affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.
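A rough sketch of such node-limited expert selection, under the assumption (ours) that nodes are ranked by the summed affinities of their top K_r/M experts and the final top-K_r selection is then restricted to the best M nodes:

```python
import torch

def node_limited_routing(s, node_of_expert, M, K_r, num_nodes):
    """Restrict routing so that a token's K_r experts come from at most M nodes.
    s: [N_r] affinities of one token; node_of_expert: [N_r] node id of each expert."""
    per_node_score = torch.zeros(num_nodes)
    for n in range(num_nodes):
        vals = s[node_of_expert == n]
        # Rank each node by the sum of its highest K_r/M expert affinities.
        per_node_score[n] = vals.topk(min(K_r // M, len(vals))).values.sum()
    allowed_nodes = per_node_score.topk(M).indices                 # keep the best M nodes
    mask = torch.isin(node_of_expert, allowed_nodes)               # experts on those nodes
    masked = torch.where(mask, s, torch.tensor(float("-inf")))
    return masked.topk(K_r).indices                                # final K_r experts

# Toy usage: 16 routed experts spread over 4 nodes, route each token to at most 2 nodes.
idx = node_limited_routing(torch.rand(16), torch.arange(16) % 4, M=2, K_r=4, num_nodes=4)
```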
No Token-Dropping. Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.
2.2. Multi-Token Prediction

Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts D additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.

MTP Modules. To be specific, our MTP implementation uses D sequential modules to predict D additional tokens. The k-th MTP module consists of a shared embedding layer Emb(·), a shared output head OutHead(·), a Transformer block TRM_k(·), and a projection matrix M_k ∈ R^{d×2d}. For the i-th input token t_i, at the k-th prediction depth