
DeepSeek-V3 Technical Report

DeepSeek-AI

research@deepseek.com

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts (DeepSeek-V2.5, Qwen2.5-72B-Inst, Llama-3.1-405B-Inst, GPT-4o-0513, Claude-3.5-Sonnet-1022). The bar chart reports Accuracy/Percentile (%) on Codeforces (Percentile), SWE-bench Verified (Resolved), MATH-500 (EM), AIME 2024 (Pass@1), MMLU-Pro (EM), and GPQA-Diamond (Pass@1).


Contents

1 Introduction
2 Architecture
  2.1 Basic Architecture
    2.1.1 Multi-Head Latent Attention
    2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
  2.2 Multi-Token Prediction
3 Infrastructures
  3.1 Compute Clusters
  3.2 Training Framework
    3.2.1 DualPipe and Computation-Communication Overlap
    3.2.2 Efficient Implementation of Cross-Node All-to-All Communication
    3.2.3 Extremely Memory Saving with Minimal Overhead
  3.3 FP8 Training
    3.3.1 Mixed Precision Framework
    3.3.2 Improved Precision from Quantization and Multiplication
    3.3.3 Low-Precision Storage and Communication
  3.4 Inference and Deployment
    3.4.1 Prefilling
    3.4.2 Decoding
  3.5 Suggestions on Hardware Design
    3.5.1 Communication Hardware
    3.5.2 Compute Hardware
4 Pre-Training
  4.1 Data Construction
  4.2 Hyper-Parameters
  4.3 Long Context Extension
  4.4 Evaluations
    4.4.1 Evaluation Benchmarks
    4.4.2 Evaluation Results
  4.5 Discussion
    4.5.1 Ablation Studies for Multi-Token Prediction
    4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy
    4.5.3 Batch-Wise Load Balance VS. Sequence-Wise Load Balance
5 Post-Training
  5.1 Supervised Fine-Tuning
  5.2 Reinforcement Learning
    5.2.1 Reward Model
    5.2.2 Group Relative Policy Optimization
  5.3 Evaluations
    5.3.1 Evaluation Settings
    5.3.2 Standard Evaluation
    5.3.3 Open-Ended Evaluation
    5.3.4 DeepSeek-V3 as a Generative Reward Model
  5.4 Discussion
    5.4.1 Distillation from DeepSeek-R1
    5.4.2 Self-Rewarding
    5.4.3 Multi-Token Prediction Evaluation
6 Conclusion, Limitations, and Future Directions
A Contributions and Acknowledgments
B Ablation Studies for Low-Precision Training
  B.1 FP8 v.s. BF16 Training
  B.2 Discussion About Block-Wise Quantization
C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models


1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.

In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), with its evolution closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.

Training Costs        Pre-Training    Context Extension    Post-Training    Total
in H800 GPU Hours     2664K           119K                 5K               2788K
in USD                $5.328M         $0.238M              $0.01M           $5.576M

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
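As a quick sanity check, these figures are mutually consistent:

$$\frac{180{,}000\ \text{GPU hours}}{2048\ \text{GPUs}} \approx 87.9\ \text{hours} \approx 3.7\ \text{days}, \qquad 14.8 \times 180\text{K} = 2664\text{K}, \qquad (2664\text{K} + 119\text{K} + 5\text{K}) \times \$2 = \$5.576\text{M}.$$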

Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective

• On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.

• We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

Pre-Training: Towards Ultimate Training Efficiency

• We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.

• Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.

• At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

Post-Training: Knowledge Distillation from DeepSeek-R1

• We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

Summary of Core Evaluation Results

• Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.

• Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.

In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design (Section 3). Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).

2. Architecture

We first introduce the basic architecture of DeepSeek-V3, which features Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

2.1. Basic Architecture

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.


Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training. (Diagram: a Transformer block ×L combining RMSNorm, Multi-Head Latent Attention with latent vectors that are cached during inference, and a DeepSeekMoE feed-forward network with a router, N_r routed experts, and N_s shared experts.)


2.1.1. Multi-Head Latent Attention

For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_h$ denote the number of attention heads, $d_h$ denote the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^d$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce the Key-Value (KV) cache during inference:

$$\mathbf{c}_t^{KV} = W^{DKV}\mathbf{h}_t, \tag{1}$$
$$[\mathbf{k}_{t,1}^{C};\,\mathbf{k}_{t,2}^{C};\,\ldots;\,\mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_t^{C} = W^{UK}\mathbf{c}_t^{KV}, \tag{2}$$
$$\mathbf{k}_t^{R} = \operatorname{RoPE}(W^{KR}\mathbf{h}_t), \tag{3}$$
$$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C};\,\mathbf{k}_t^{R}], \tag{4}$$
$$[\mathbf{v}_{t,1}^{C};\,\mathbf{v}_{t,2}^{C};\,\ldots;\,\mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_t^{C} = W^{UV}\mathbf{c}_t^{KV}, \tag{5}$$


where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c\ (\ll d_h n_h)$ indicates the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ denotes the down-projection matrix; $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively; $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024), with $d_h^R$ denoting the per-head dimension of the decoupled queries and keys; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes concatenation. Note that for MLA, only the compressed latent vector $\mathbf{c}_t^{KV}$ and the decoupled key $\mathbf{k}_t^{R}$ need to be cached during generation, which results in a significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).

For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:

$$\mathbf{c}_t^{Q} = W^{DQ}\mathbf{h}_t, \tag{6}$$
$$[\mathbf{q}_{t,1}^{C};\,\mathbf{q}_{t,2}^{C};\,\ldots;\,\mathbf{q}_{t,n_h}^{C}] = \mathbf{q}_t^{C} = W^{UQ}\mathbf{c}_t^{Q}, \tag{7}$$
$$[\mathbf{q}_{t,1}^{R};\,\mathbf{q}_{t,2}^{R};\,\ldots;\,\mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \operatorname{RoPE}(W^{QR}\mathbf{c}_t^{Q}), \tag{8}$$
$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C};\,\mathbf{q}_{t,i}^{R}], \tag{9}$$

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c'\ (\ll d_h n_h)$ denotes the query compression dimension; $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively; and $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ is the matrix to produce the decoupled queries that carry RoPE.

Ultimately, the attention queries ($\mathbf{q}_{t,i}$), keys ($\mathbf{k}_{j,i}$), and values ($\mathbf{v}_{j,i}^{C}$) are combined to yield the final attention output $\mathbf{u}_t$:

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T}\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^{R}}}\right)\mathbf{v}_{j,i}^{C}, \tag{10}$$
$$\mathbf{u}_t = W^{O}[\mathbf{o}_{t,1};\,\mathbf{o}_{t,2};\,\ldots;\,\mathbf{o}_{t,n_h}], \tag{11}$$

where $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix.
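To make the shapes concrete, here is a minimal NumPy sketch of the KV side of this compression; the toy dimensions, random weights, and simplified `rope` helper are our own illustrative assumptions, not the report's implementation:

```python
import numpy as np

# Toy dimensions (assumptions for illustration; far smaller than DeepSeek-V3's).
d, n_h, d_h, d_c, d_hR = 64, 4, 16, 8, 4

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d)) * 0.02          # down-projection (Eq. 1)
W_UK  = rng.standard_normal((n_h * d_h, d_c)) * 0.02  # key up-projection (Eq. 2)
W_KR  = rng.standard_normal((d_hR, d)) * 0.02         # decoupled RoPE key (Eq. 3)
W_UV  = rng.standard_normal((n_h * d_h, d_c)) * 0.02  # value up-projection (Eq. 5)

def rope(x, t):
    """Simplified rotary embedding: rotate consecutive pairs by angles that depend on t."""
    half = x.shape[-1] // 2
    freqs = t / (10000 ** (np.arange(half) / half))
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def mla_kv(h_t, t):
    """Return per-head keys/values for token t, plus the vectors that must be cached."""
    c_kv = W_DKV @ h_t                                 # (d_c,)  -- cached
    k_C = (W_UK @ c_kv).reshape(n_h, d_h)              # per-head compressed keys
    k_R = rope(W_KR @ h_t, t)                          # (d_hR,) -- cached, shared across heads
    k = np.concatenate([k_C, np.tile(k_R, (n_h, 1))], axis=-1)   # Eq. 4
    v = (W_UV @ c_kv).reshape(n_h, d_h)                # Eq. 5
    return k, v, (c_kv, k_R)

h_t = rng.standard_normal(d)
k, v, cache = mla_kv(h_t, t=0)
print(k.shape, v.shape, cache[0].shape, cache[1].shape)  # (4, 20) (4, 16) (8,) (4,)
```

The point of the sketch is the returned cache: per token, only $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^{R}$ ($d_c + d_h^R$ values) are stored, rather than the $2\,n_h d_h$ values that standard MHA would keep.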

2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:

$$\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s}\operatorname{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t}\operatorname{FFN}_i^{(r)}(\mathbf{u}_t), \tag{12}$$
$$g_{i,t} = \frac{g_{i,t}'}{\sum_{j=1}^{N_r} g_{j,t}'}, \tag{13}$$
$$g_{i,t}' = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \le j \le N_r\},\, K_r), \\ 0, & \text{otherwise}, \end{cases} \tag{14}$$
$$s_{i,t} = \operatorname{Sigmoid}(\mathbf{u}_t^{T}\mathbf{e}_i), \tag{15}$$


where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\operatorname{FFN}_i^{(s)}(\cdot)$ and $\operatorname{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gating value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid vector of the $i$-th routed expert; and $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
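A minimal sketch of this sigmoid-affinity routing (Eqs. 13-15); the dimensions and weights below are toy values chosen purely for illustration:

```python
import numpy as np

N_r, K_r, d = 8, 2, 16          # toy numbers of routed experts, active experts, hidden size
rng = np.random.default_rng(1)
E = rng.standard_normal((N_r, d)) * 0.02   # expert centroid vectors e_i

def gating(u_t):
    """Gating values g_{i,t}: sigmoid affinities, top-K_r selection, then normalization."""
    s = 1.0 / (1.0 + np.exp(-(E @ u_t)))   # s_{i,t} = Sigmoid(u_t^T e_i)         (Eq. 15)
    top = np.argsort(s)[-K_r:]             # indices of the K_r highest affinities (Eq. 14)
    g = np.zeros(N_r)
    g[top] = s[top] / s[top].sum()         # normalize over the selected scores    (Eq. 13)
    return g

u_t = rng.standard_normal(d)
print(gating(u_t))   # sparse vector: only K_r nonzero entries, summing to 1
```

The resulting sparse vector of gating values is what weights the outputs of the selected routed experts in Eq. 12.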

Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores $s_{i,t}$ to determine the top-K routing:

$$g_{i,t}' = \begin{cases} s_{i,t}, & s_{i,t} + b_i \in \operatorname{Topk}(\{s_{j,t} + b_j \mid 1 \le j \le N_r\},\, K_r), \\ 0, & \text{otherwise}. \end{cases} \tag{16}$$

Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score $s_{i,t}$. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we will decrease the bias term by $\gamma$ if its corresponding expert is overloaded, and increase it by $\gamma$ if its corresponding expert is underloaded, where $\gamma$ is a hyper-parameter called the bias update speed. Through this dynamic adjustment, DeepSeek-V3 keeps a balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.
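A hedged sketch of the bias-adjusted routing and per-step bias update described above; the overload test (comparing each expert's load to the mean) is a simplifying assumption built from the text, not the report's exact bookkeeping:

```python
import numpy as np

N_r, K_r = 8, 2
gamma = 0.001                      # bias update speed (hyper-parameter)
b = np.zeros(N_r)                  # one bias per routed expert, used only for routing

def route(S):
    """S: (num_tokens, N_r) sigmoid affinities. Returns gating values and per-expert token counts."""
    top = np.argsort(S + b, axis=1)[:, -K_r:]            # top-K_r on biased scores (Eq. 16)
    G = np.zeros_like(S)
    rows = np.arange(S.shape[0])[:, None]
    sel = S[rows, top]
    G[rows, top] = sel / sel.sum(axis=1, keepdims=True)  # gating still uses the raw s_{i,t}
    load = np.bincount(top.ravel(), minlength=N_r)       # tokens routed to each expert this step
    return G, load

def update_bias(load):
    """End-of-step adjustment: push overloaded experts down, underloaded experts up."""
    global b
    mean = load.mean()
    b -= gamma * (load > mean)     # decrease bias of overloaded experts
    b += gamma * (load < mean)     # increase bias of underloaded experts

rng = np.random.default_rng(2)
S = 1.0 / (1.0 + np.exp(-rng.standard_normal((32, N_r))))
G, load = route(S)
update_bias(load)
print(load, b)
```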

Complementary Sequence-Wise Auxiliary Loss. Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:

$$\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i, \tag{17}$$
$$f_i = \frac{N_r}{K_r T}\sum_{t=1}^{T} \mathbb{1}\big(s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \le j \le N_r\},\, K_r)\big), \tag{18}$$
$$s_{i,t}' = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}, \tag{19}$$
$$P_i = \frac{1}{T}\sum_{t=1}^{T} s_{i,t}', \tag{20}$$

where the balance factor $\alpha$ is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; $\mathbb{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.
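For concreteness, a small sketch of how Eqs. 17-20 could be evaluated for a single sequence (toy sizes, arbitrary $\alpha$):

```python
import numpy as np

N_r, K_r, T = 8, 2, 16
alpha = 0.0001                      # balance factor (assigned an extremely small value)
rng = np.random.default_rng(3)
S = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, N_r))))    # affinities s_{i,t} for one sequence

topk = np.argsort(S, axis=1)[:, -K_r:]                      # selected experts per token
indicator = np.zeros_like(S)
indicator[np.arange(T)[:, None], topk] = 1.0
f = (N_r / (K_r * T)) * indicator.sum(axis=0)               # Eq. 18: routed fraction per expert
s_norm = S / S.sum(axis=1, keepdims=True)                   # Eq. 19: normalized affinities
P = s_norm.mean(axis=0)                                     # Eq. 20: mean normalized affinity
L_bal = alpha * np.dot(f, P)                                # Eq. 17
print(round(float(L_bal), 6))
```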


Figure 3 | Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth. (Diagram: the main model performs next-token prediction while MTP Module 1 and MTP Module 2 perform next-2-token and next-3-token prediction; the embedding layer and output head are shared across modules, each module applies RMSNorm, concatenation, and a linear projection before its Transformer block, and each depth has its own cross-entropy loss.)

Node-Limited Routing. Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training. In short, we ensure that each token will be sent to at most $M$ nodes, which are selected according to the sum of the highest $\frac{K_r}{M}$ affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.
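One way such node-limited selection can be realized is sketched below; the expert-to-node layout, sizes, and tie-breaking are illustrative assumptions, not the report's exact procedure:

```python
import numpy as np

N_r, K_r, M = 16, 4, 2             # routed experts, active experts per token, node limit
experts_per_node = 4
node_of = np.arange(N_r) // experts_per_node           # which node hosts each expert

def node_limited_topk(s):
    """s: (N_r,) affinity scores for one token. Pick experts only from the M best nodes."""
    per_node = K_r // M
    num_nodes = N_r // experts_per_node
    # Score each node by the sum of its top (K_r / M) expert affinities.
    node_scores = np.array([
        np.sort(s[node_of == n])[-per_node:].sum() for n in range(num_nodes)
    ])
    allowed_nodes = np.argsort(node_scores)[-M:]        # keep the M highest-scoring nodes
    masked = np.where(np.isin(node_of, allowed_nodes), s, -np.inf)
    return np.argsort(masked)[-K_r:]                    # top-K_r experts within those nodes

rng = np.random.default_rng(4)
s = 1.0 / (1.0 + np.exp(-rng.standard_normal(N_r)))
chosen = node_limited_topk(s)
print(sorted(chosen), "on nodes", sorted(set(node_of[chosen])))
```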

No Token-Dropping. Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.

2.2. Multi-Token Prediction

Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which predicts $D$ additional tokens in parallel using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.
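As a rough illustration of this sequential scheme, ahead of the module details below, the following toy sketch chains each prediction depth off the previous depth's hidden states; the RMSNorm-concatenate-project combination step, the stand-in Transformer block, and all dimensions are assumptions patterned on Figure 3, not the report's exact formulation:

```python
import numpy as np

# Toy sketch of sequential multi-token prediction (D extra tokens per position).
d, vocab, D, seq_len = 32, 100, 2, 8
rng = np.random.default_rng(5)
Emb = rng.standard_normal((vocab, d)) * 0.02        # shared embedding layer
OutHead = rng.standard_normal((d, vocab)) * 0.02    # shared output head
M = rng.standard_normal((D, d, 2 * d)) * 0.02       # per-depth projection in R^{d x 2d}

def rms_norm(x):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)

def trm_block(x):
    # Stand-in for a per-depth Transformer block (toy residual nonlinearity).
    return x + 0.1 * np.tanh(x)

tokens = rng.integers(0, vocab, size=seq_len)
h = trm_block(Emb[tokens])                          # main model states: predict token i+1 at position i

for k in range(1, D + 1):                           # depth k predicts token i+k+1 at position i
    valid = seq_len - k                             # positions that still have an input token in range
    merged = np.concatenate([rms_norm(h[:valid]), rms_norm(Emb[tokens[k:]])], axis=-1)
    h = trm_block(merged @ M[k - 1].T)              # project 2d -> d, then the depth-k block
    logits = h @ OutHead                            # shared output head, one CE loss per depth
    print(f"depth {k}: predictions for {valid} positions, logits shape {logits.shape}")
```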

MTP Modules. To be specific, our MTP implementation uses $D$ sequential modules to predict $D$ additional tokens. The $k$-th MTP module consists of a shared embedding layer $\operatorname{Emb}(\cdot)$, a shared output head $\operatorname{OutHead}(\cdot)$, a Transformer block $\operatorname{TRM}_k(\cdot)$, and a projection matrix $M_k \in \mathbb{R}^{d \times 2d}$. For the $i$-th input token $t_i$, at the $k$-th prediction depth
