rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Xinyu Guan* Li Lyna Zhang*
Yifei Liu
Ning Shang Youran Sun Yi Zhu Fan Yang Mao Yang
arXiv:2501.04519v1 [cs.CL] 8 Jan 2025
Microsoft Research Asia
Abstract
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids na?ve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at /microsoft/rStar.
Task (pass@1 Acc)      | MATH | AIME 2024 | Olympiad Bench | College Math | Omni-Math
rStar-Math (Qwen-7B)   | 90.0 | 53.3      | 65.6           | 60.5         | 50.5
rStar-Math (Qwen-1.5B) | 88.6 | 46.7      | 64.6           | 59.3         | 48.5
rStar-Math (Phi3-mini) | 86.4 | 43.3      | 60.3           | 59.1         | 46.0
OpenAI o1-preview      | 85.5 | 44.6      | -              | -            | 52.5
OpenAI o1-mini         | 90.0 | 56.7      | 65.3           | 57.8         | 60.5
QWQ 32B-preview        | 90.6 | 50.0      | 61.2           | 55.8         | 49.6
GPT-4o                 | 76.6 | 9.3       | 43.3           | 48.5         | 30.5
DeepSeek-V3            | 90.2 | 39.2      | 55.4           | 58.9         | 35.9
Table 1: rStar-Math enables frontier math reasoning in SLMs via deep thinking over 64 trajectories.
1 Introduction
Recent studies have demonstrated that large language models (LLMs) are capable of tackling mathematical problems [Team, 2024a, Yang et al., 2024, OpenAI, 2024, Liu et al., 2024]. However, the conventional approach of having LLMs generate complete solutions in a single inference, akin to System 1 thinking [Daniel, 2011], often yields fast but error-prone results [Valmeekam et al., 2023, OpenAI, 2023]. In response, test-time compute scaling [Snell et al., 2024, Qi et al., 2024] suggests a paradigm shift toward System 2-style thinking, which emulates human reasoning through a slower and deeper thought process. In this paradigm, an LLM serves as a policy model to generate multiple math reasoning steps, which are then evaluated by another LLM acting as a reward model [OpenAI, 2024]. The steps and solutions deemed more likely to be correct are selected. The process repeats iteratively and ultimately derives the final answer.
*Equal contribution.
Project leader; correspondence to lzhani@
§ Xinyu Guan and Youran Sun did this work during the internship at MSRA. Xinyu Guan (2001gxy@) is with Peking University; Youran Sun is with Tsinghua University.
[Figure 1: The overview of rStar-Math. (a) A step-by-step verified reasoning trajectory: MCTS-driven deep thinking over a question by the policy SLM, with verifiers (PPM/Python execution) scoring steps; correct answer steps receive positive scores and wrong answer steps negative scores. (b) Construction of per-step preference pairs based on Q-values, filtered from full solutions. (c) Four rounds of self-evolution producing SLM-r1 through SLM-r4 and PPM-r2 through PPM-r4, using terminal-guided MCTS in the first two rounds and PPM-augmented MCTS in the last two.]
In the test-time compute paradigm, the key is to train a powerful policy model that generates promising solution steps and a reliable reward model that accurately evaluates them, both of which depend on high-quality training data. Unfortunately, it is well-known that off-the-shelf high-quality math reasoning data is scarce, and synthesizing high-quality math data faces fundamental challenges. For the policy model, it is challenging to distinguish erroneous reasoning steps from the correct ones, complicating the elimination of low-quality data. It is worth noting that in math reasoning, a correct final answer does not ensure the correctness of the entire reasoning trace [Lanham et al., 2023]. Incorrect intermediate steps significantly decrease data quality. As for the reward model, process reward modeling (PRM) shows great potential by providing fine-grained feedback on intermediate steps [Lightman et al., 2023]. However, training data is even scarcer in this regard: accurate step-by-step feedback requires intense human labeling effort and is impractical to scale, while automatic annotation attempts show limited gains due to noisy reward scores [Luo et al., 2024, Wang et al., 2024c, Chen et al., 2024]. Due to the above challenges, existing distillation-based data synthesis approaches to training policy models, e.g., scaling up GPT-4-distilled CoT data [Tang et al., 2024, Huang et al., 2024], have shown diminishing returns and cannot exceed the capability of their teacher model; meanwhile, as of today, training reliable PRMs for math reasoning remains an open question.
In this work, we introduce rStar-Math, a self-evolvable System 2-style reasoning approach that achieves state-of-the-art math reasoning, rivaling and sometimes even surpassing OpenAI o1 on challenging math competition benchmarks with a model size as small as 7 billion. Unlike solutions relying on superior LLMs for data synthesis, rStar-Math leverages smaller language models (SLMs) with Monte Carlo Tree Search (MCTS) to establish a self-evolutionary process, iteratively generating higher-quality training data. To achieve self-evolution, rStar-Math introduces three key innovations. First, a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories with self-annotated MCTS Q-values. Specifically, math problem-solving is decomposed into multi-step generation within MCTS. At each step, the SLM serving as the policy model samples candidate nodes, each generating a one-step CoT and the corresponding Python code. To verify the generation quality, only nodes with successful Python code execution are retained, thus mitigating errors in intermediate steps. Moreover, extensive MCTS rollouts automatically assign a Q-value to each intermediate step based on its contribution: steps contributing to more trajectories that lead to the correct answer are given higher Q-values and considered higher quality. This ensures that the reasoning trajectories generated by SLMs consist of correct, high-quality intermediate steps.
Second, a novel method that trains an SLM acting as a process preference model, i.e., a PPM to implement the desired PRM, that reliably predicts a reward label for each math reasoning step. The PPM leverages the fact that, although Q-values are still not precise enough to score each reasoning step despite using extensive MCTS rollouts, the Q-values can reliably distinguish positive (correct) steps from negative (irrelevant/incorrect) ones. Thus the training method constructs preference pairs for each step based on Q-values and uses a pairwise ranking loss [Ouyang et al., 2022] to optimize the PPM's score prediction for each reasoning step, achieving reliable labeling. This approach avoids conventional methods that directly use Q-values as reward labels [Luo et al., 2024, Chen et al., 2024], which are inherently noisy and imprecise in stepwise reward assignment.
Finally, a four-round self-evolution recipe that progressively builds both a frontier policy model and a PPM from scratch. We begin by curating a dataset of 747k math word problems from publicly available sources. In each round, we use the latest policy model and PPM to perform MCTS, generating increasingly high-quality training data using the above two methods to train a stronger policy model and PPM for the next round. Each round achieves progressive refinement: (1) a stronger policy SLM, (2) a more reliable PPM, (3) better reasoning trajectories generated via PPM-augmented MCTS, and (4) improved training data coverage to tackle more challenging and even competition-level math problems.
Extensive experiments across four SLMs (1.5B-7B) and seven math reasoning tasks demonstrate the effectiveness of rStar-Math. Remarkably, rStar-Math improves all four SLMs, matching or even surpassing OpenAI o1 on challenging math benchmarks. On the MATH benchmark, with 8 search trajectories, rStar-Math boosts Qwen2.5-Math-7B from 58.8% to 89.4% and Qwen2.5-Math-1.5B from 51.2% to 87.8%. With 64 trajectories, the scores rise to 90.0% and 88.4%, outperforming o1-preview by 4.5% and 2.6% and matching o1-mini's 90%. On the Olympiad-level AIME 2024, rStar-Math solves on average 53.3% (8/15) of the problems, exceeding o1-preview by 8.7% and all other open-sourced LLMs. We further conduct comprehensive experiments to verify the superiority of step-by-step verified reasoning trajectories over state-of-the-art data synthesis baselines, as well as the PPM's effectiveness compared to outcome reward models and Q-value-based PRMs. Finally, we present key findings from rStar-Math deep thinking, including the intrinsic self-reflection capability and the PPM's preference for theorem-application intermediate steps.
2 Related Works
Math Data Synthesis. Advancements in LLM math reasoning have largely relied on curating high-quality CoT data, with most leading approaches being GPT-distilled, using frontier models like GPT-4 for synthesis [Wang et al., 2024b, Gou et al., 2023, Luo et al., 2023]. Notable works include NuminaMath [Jia LI and Polu, 2024a] and MetaMath [Yu et al., 2023b]. While effective, this limits reasoning to the capabilities of the teacher LLM. Hard problems that the teacher LLM cannot solve are excluded from the training set. Even solvable problems may contain error-prone intermediate steps, which are hard to detect. Although rejection sampling methods [Yuan et al., 2023, Brown et al., 2024] can improve data quality, they do not guarantee correct intermediate steps. As a result, scaling up CoT data has diminishing returns, with gains nearing saturation; e.g., OpenMathInstruct-2 [Toshniwal et al., 2024] only sees a 3.9% boost on MATH despite an 8× increase in dataset size.
Scaling Test-time Compute has introduced new scaling laws, allowing LLMs to improve performance by generating multiple samples and using reward models for best-solution selection [Snell et al., 2024, Wu et al., 2024, Brown et al., 2024]. Various test-time search methods have been proposed [Kang et al., 2024, Wang et al., 2024a], including random sampling [Wang et al., 2023] and tree-search methods [Yao et al., 2024, Hao et al., 2023, Zhang et al., 2024b, Qi et al., 2024] like MCTS. However, open-source methods for scaling test-time computation have shown limited gains in math reasoning, often due to policy LLM or reward model limitations. rStar-Math addresses this by iteratively evolving the policy LLM and reward model, achieving System 2 mathematical reasoning performance comparable to OpenAI o1 [OpenAI, 2024].
Reward Models are crucial for effective System 2 reasoning but are challenging to obtain. Recent works include LLM-as-a-Judge for verification [Zheng et al., 2023, Qi et al., 2024] and specialized reward models like the Outcome Reward Model [Yang et al., 2024, Yu et al., 2023a] and the Process Reward Model (PRM) [Lightman et al., 2024]. While PRMs offer promising dense, step-level reward signals for complex reasoning [Luo et al., 2024, Wang et al., 2024c], collecting step-level annotations remains an obstacle. While Kang et al. [2024], Wang et al. [2024a] rely on costly human-annotated datasets like PRM800k [Lightman et al., 2024], recent approaches [Wang et al., 2024c, Luo et al., 2024] explore automated annotation via Monte Carlo Sampling or MCTS. However, they struggle to generate precise reward scores, which limits performance gains. rStar-Math introduces a novel process preference model (PPM) that eliminates the need for accurate step-level reward score annotation.
3 Methodology
3.1 Design Choices
MCTS for Effective System 2 Reasoning. We aim to train a math policy SLM and a process reward model (PRM), and integrate both within Monte Carlo Tree Search (MCTS) for System 2 deep thinking. MCTS is chosen for two key reasons. First, it breaks down complex math problems into simpler single-step generation tasks, reducing the difficulty for the policy SLM compared to other System 2 methods like Best-of-N [Brown et al., 2024] or self-consistency [Wang et al., 2023], which require generating full solutions in one inference. Second, the step-by-step generation in MCTS naturally yields step-level training data for both models. Standard MCTS rollouts automatically assign a Q-value to each step based on its contribution to the final correct answer, obviating the need for human-generated step-level annotations for process reward model training.
Ideally, advanced LLMs such as GPT-4 could be integrated within MCTS to generate training data. However, this approach faces two key challenges. First, even these powerful models struggle to consistently solve difficult problems, such as Olympiad-level mathematics. Consequently, the resulting training data would primarily consist of simpler solvable problems, limiting its diversity and quality. Second, annotating per-step Q-values demands extensive MCTS rollouts; insufficient tree exploration can lead to spurious Q-value assignments, such as overestimating suboptimal steps. Given that each rollout involves multiple single-step generations and these models are computationally expensive, increasing rollouts significantly raises inference costs.
Overview. To this end, we explore using two 7B SLMs (a policy SLM and a PRM) to generate higher-quality training data, with their smaller size allowing for extensive MCTS rollouts on accessible hardware (e.g., 4×40GB A100 GPUs). However, self-generating data presents greater challenges for SLMs, due to their weaker capabilities. SLMs frequently fail to generate correct solutions, and even when the final answer is correct, the intermediate steps are often flawed or of poor quality. Moreover, SLMs solve fewer challenging problems compared to advanced models like GPT-4.
This section introduces our methodology, as illustrated in Fig. 1. To mitigate errors and low-quality intermediate steps, we introduce a code-augmented CoT synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories annotated with Q-values. To further improve SLM performance on challenging problems, we introduce a four-round self-evolution recipe. In each round, both the policy SLM and the reward model are updated to stronger versions, progressively tackling more difficult problems and generating higher-quality training data. Finally, we present a novel process reward model training approach that eliminates the need for precise per-step reward annotations, yielding the more effective process preference model (PPM).
3.2 Step-by-Step Verified Reasoning Trajectory
We start by introducing our method for generating step-by-step verified reasoning trajectories with per-step Q-value annotations. Given a problem $x$ and a policy model $M$, we run standard MCTS to incrementally construct a search tree for step-by-step solution exploration. As shown in Fig. 1(a), the root node represents the question $x$, while child nodes correspond to intermediate steps $s$ generated by $M$. A root-to-leaf path ending at a terminal node $s_d$ forms a trajectory $t = x \oplus s_1 \oplus s_2 \oplus \ldots \oplus s_d$, with each step $s_i$ assigned a Q-value $Q(s_i)$. From the search tree, we extract solution trajectories $\mathbb{T} = \{t_1, t_2, \ldots, t_n\}$ ($n \geq 1$). Our goal is to select high-quality trajectories from $\mathbb{T}$ to construct the training set. For this purpose, we introduce a code-augmented CoT synthesis method to filter out low-quality generations and perform extensive rollouts to improve the reliability of the Q-values.
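To make the tree and trajectory notation concrete, the following is a minimal sketch of the data structures implied above; the field names and the extract_trajectories helper are our own illustration under these definitions, not the paper's released code:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        text: str                 # question x at the root; one-step NL CoT (as a comment) plus Python code otherwise
        q_value: float = 0.0      # accumulated q(s) from back-propagation; Q(s) = q_value / visit_count
        visit_count: int = 0      # N(s)
        is_terminal: bool = False # terminal (answer) step s_d
        children: list = field(default_factory=list)

    def extract_trajectories(node, prefix=None):
        """Enumerate root-to-leaf paths t = x ⊕ s_1 ⊕ ... ⊕ s_d from the search tree."""
        prefix = (prefix or []) + [node]
        if not node.children:
            return [prefix]
        trajectories = []
        for child in node.children:
            trajectories.extend(extract_trajectories(child, prefix))
        return trajectories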
Code-augmented CoT Generation. Prior MCTS approaches primarily generate natural language (NL) CoTs [Qi et al., 2024, Zhang et al., 2024a]. However, LLMs often suffer from hallucination, producing incorrect or irrelevant steps yet still arriving at the correct answer by chance [Lanham et al., 2023]. These flawed steps are challenging to detect and eliminate. To address this, we propose a novel code-execution-augmented CoT. As shown in Fig. 2, the policy model generates a one-step NL CoT alongside its corresponding Python code, where the NL CoT is embedded as a Python comment. Only generations with successfully executed Python code are retained as valid candidates.
Question: Bill walks $\frac{1}{2}$ mile south, then $\frac{3}{4}$ mile east, and finally $\frac{1}{2}$ mile south. How many miles is he, in a direct line, from his starting point? Express your answer as a decimal to the nearest hundredth.

    # Step 1: Calculate the total distance walked south
    total_south = 1/2 + 1/2
    # Step 2: Calculate the total distance walked east
    total_east = 3/4
    # Step 3: Use the Pythagorean theorem to find the direct distance from the starting point
    import math
    direct_distance = math.sqrt(total_south**2 + total_east**2)
    # Step 4: Round the direct distance to the nearest hundredth
    direct_distance_rounded = round(direct_distance, 2)

From the result, we can see that the direct distance from the starting point is \boxed{1.25} miles.

The NL CoT is embedded as a Python comment in each step. To verify a step, its code is executed together with the code of all preceding steps: the Python code execution for step 1 runs Step 1's code alone, the execution for step 2 runs Step 1's and Step 2's code concatenated, and so on.

Figure 2: An example of Code-augmented CoT.
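The right-hand side of Figure 2 corresponds to the per-step validity check detailed in the next paragraph: each candidate's code is concatenated with the code of all previous steps and executed, and the candidate is kept only if execution succeeds. A minimal sketch of this check, assuming each step is a plain Python snippet; exec() stands in for a sandboxed executor, which in practice would also need a timeout and isolation:

    def step_is_valid(prefix_code_steps, candidate_code):
        """Keep candidate step s_{i,j} only if s_1 ⊕ ... ⊕ s_{i-1} ⊕ s_{i,j} executes without error."""
        program = "\n".join(prefix_code_steps + [candidate_code])
        try:
            exec(program, {})  # illustration only; use a sandboxed interpreter with a timeout in practice
            return True
        except Exception:
            return False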
Specifically, starting from the initial root node $x$, we perform multiple MCTS iterations through selection, expansion, rollout, and back-propagation. At step $i$, we collect the latest reasoning trajectory $x \oplus s_1 \oplus s_2 \oplus \ldots \oplus s_{i-1}$ as the current state. Based on this state, we prompt (see Appendix A.3) the policy model to generate $n$ candidates $s_{i,0}, \ldots, s_{i,n-1}$ for step $i$. Python code execution is then employed to filter valid nodes. As shown in Fig. 2, each generation $s_{i,j}$ is concatenated with the code from all previous steps, forming $s_1 \oplus s_2 \oplus \ldots \oplus s_{i-1} \oplus s_{i,j}$. Candidates that execute successfully are retained as valid nodes and scored by the PPM, which assigns a Q-value $q(s_i)$. Then, we use the well-known Upper Confidence bounds for Trees (UCT) [Kocsis and Szepesvári, 2006] to select the best node among the $n$ candidates. This selection process is mathematically represented as:

$\mathrm{UCT}(s) = Q(s) + c\sqrt{\frac{\ln N_{\mathrm{parent}}(s)}{N(s)}}, \quad \text{where } Q(s) = \frac{q(s)}{N(s)} \quad (1)$
where $N(s)$ denotes the number of visits to node $s$, and $N_{\mathrm{parent}}(s)$ is the visit count of $s$'s parent node. The predicted reward $q(s)$ is provided by the PPM and will be updated through back-propagation. $c$ is a constant that balances exploitation and exploration.
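A direct reading of Eq. 1 as code, reusing the node sketch above; giving unvisited candidates infinite priority and the default value of $c$ are conventions we assume here, not details stated in the text:

    import math

    def uct_score(node, parent_visits, c=2.0):
        """UCT(s) = Q(s) + c * sqrt(ln N_parent(s) / N(s)), with Q(s) = q(s) / N(s)."""
        if node.visit_count == 0:
            return float("inf")  # explore unvisited candidates first (assumed convention)
        exploitation = node.q_value / node.visit_count
        exploration = c * math.sqrt(math.log(parent_visits) / node.visit_count)
        return exploitation + exploration

    def select_best_child(parent):
        return max(parent.children, key=lambda s: uct_score(s, parent.visit_count))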
Extensive Rollouts for Q-value Annotation. Accurate Q-value $Q(s)$ annotation in Eq. 1 is crucial for guiding MCTS node selection towards correct problem-solving paths and identifying high-quality steps within trajectories. To improve Q-value reliability, we draw inspiration from Go players, who retrospectively evaluate the reward of each move based on game outcomes. Although initial estimates may be imprecise, repeated gameplay refines these evaluations over time. Similarly, in each rollout, we update the Q-value of each step based on its contribution to achieving the correct final answer. After extensive MCTS rollouts, steps consistently leading to correct answers achieve higher Q-values, occasional successes yield moderate Q-values, and consistently incorrect steps receive low Q-values. Specifically, we introduce two self-annotation methods to obtain these step-level Q-values. Fig. 1(c) shows the detailed setting in the four rounds of self-evolution.
Terminal-guided annotation. During the first two rounds, when the PPM is unavailable or insufficiently accurate, we use terminal-guided annotation. Formally, let $q(s_i)^k$ denote the q value for step $s_i$ after back-propagation in the $k$-th rollout. Following AlphaGo [Silver et al., 2017] and rStar [Qi et al., 2024], we score each intermediate node based on its contribution to the final correct answer:

$q(s_i)^k = q(s_i)^{k-1} + q(s_d)^k \quad (2)$

where the initial q value $q(s_i)^0 = 0$ in the first rollout. If this step frequently leads to a correct answer, its q value will increase; otherwise, it decreases. Terminal nodes are scored as $q(s_d) = 1$ for correct answers and $q(s_d) = -1$ otherwise, as shown in Fig. 1.
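A sketch of the terminal-guided back-propagation in Eq. 2, applied to one completed rollout; it assumes the Node fields from the earlier sketch, with trajectory_nodes holding the nodes from step 1 to the terminal step:

    def backpropagate_terminal(trajectory_nodes, answer_is_correct):
        """Update q(s_i)^k = q(s_i)^{k-1} + q(s_d)^k along one rollout (Eq. 2)."""
        terminal_reward = 1.0 if answer_is_correct else -1.0
        for node in trajectory_nodes:
            node.q_value += terminal_reward  # accumulate the terminal score into each step's q value
            node.visit_count += 1            # visit count feeds the UCT term in Eq. 1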
PRM-augmented annotation. Starting from the third round, we use the PPM to score each step for more effective generation. Compared to terminal-guided annotation, which requires multiple rollouts for a meaningful q value, the PPM directly predicts a non-zero initial q value. PPM-augmented MCTS also helps the policy model generate higher-quality steps, guiding solutions towards correct paths. Formally, for step $s_i$, the PPM predicts an initial $q(s_i)^0$ value based on the partial trajectory:

$q(s_i)^0 = \mathrm{PPM}(x \oplus s_1 \oplus s_2 \oplus \ldots \oplus s_{i-1} \oplus s_i) \quad (3)$

This q value will be updated based on the terminal node's $q(s_d)$ value through MCTS back-propagation in Eq. 2. For terminal node $s_d$, we do not use the PPM for scoring during training data generation. Instead, we assign a more accurate score based on ground-truth labels, as in terminal-guided rewarding.
3.3 Process Preference Model
Process reward models, which provide granular step-level reward signals, are highly desirable for solving challenging math problems. However, obtaining high-quality step-level training data remains an open challenge. Existing methods rely on human annotations [Lightman et al., 2023] or MCTS-generated scores [Zhang et al., 2024a, Chen et al., 2024] to assign a score to each step. These scores then serve as training targets, with methods such as MSE loss [Chen et al., 2024] or pointwise loss [Wang et al., 2024c, Luo et al., 2024, Zhang et al., 2024a] used to minimize the difference between predicted and labeled scores. As a result, the precision of these annotated step-level reward scores directly determines the effectiveness of the resulting process reward model.
Unfortunately, precise per-step scoring remains an unsolved challenge. Although our extensive MCTS rollouts improve the reliability of Q-values, precisely evaluating fine-grained step quality presents a major obstacle. For instance, among a set of correct steps, it is difficult to rank them as best, second-best, or average and then assign precise scores. Similarly, among incorrect steps, differentiating the worst from moderately poor steps poses analogous challenges. Even expert human annotation struggles with consistency, particularly at scale, leading to inherent noise in training labels.
We introduce a novel training method that trains a process preference model (PPM) by constructing step-level positive-negative preference pairs. As shown in Fig. 1(b), instead of using Q-values as direct reward labels, we use them to select steps from the MCTS tree for preference pair construction. For each step, we select the two candidates with the highest Q-values as positive steps and the two with the lowest as negative steps. Critically, the selected positive steps must lead to a correct final answer, while negative steps must lead to incorrect answers. For intermediate steps (except the final answer step), the positive and negative pairs share the same preceding steps. For the final answer step, where identical reasoning trajectories rarely yield different final answers, we relax this restriction. We select two correct trajectories with the highest average Q-values as positive examples and two incorrect trajectories with the lowest average Q-values as negative examples. Following [Ouyang et al., 2022], we define our loss function using the standard Bradley-Terry model with a pairwise ranking loss:
$\mathcal{L}_{\mathrm{ppm}}(\theta) = -\mathbb{E}_{(x, y_i^{\mathrm{pos}}, y_i^{\mathrm{neg}}) \in \mathbb{D}}\left[\log\left(\sigma\left(r_\theta(x, y_i^{\mathrm{pos}}) - r_\theta(x, y_i^{\mathrm{neg}})\right)\right)\right] \quad (4)$

When $i$ is not the final answer step, $y_i^{\mathrm{pos}} = s_1 \oplus \ldots \oplus s_{i-1} \oplus s_i^{\mathrm{pos}}$; $y_i^{\mathrm{neg}} = s_1 \oplus \ldots \oplus s_{i-1} \oplus s_i^{\mathrm{neg}} \quad (5)$

Here, $r_\theta(x, y_i)$ denotes the output of the PPM, where $x$ is the problem and $y_i$ is the trajectory from the first step to the $i$-th step.
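A minimal PyTorch-style sketch of the Bradley-Terry objective in Eq. 4; ppm_score stands for $r_\theta(x, y)$ (a scalar reward head over the partial trajectory) and is an assumed interface for illustration, not the released implementation:

    import torch
    import torch.nn.functional as F

    def ppm_pairwise_loss(ppm_score, problem, pos_trajs, neg_trajs):
        """L_ppm = -E[ log sigmoid( r_theta(x, y_pos) - r_theta(x, y_neg) ) ]  (Eq. 4).

        pos_trajs / neg_trajs: partial trajectories s_1 ⊕ ... ⊕ s_i ending in a positive
        (high-Q, correct-answer) or negative (low-Q, wrong-answer) step for the same prefix.
        """
        losses = []
        for y_pos, y_neg in zip(pos_trajs, neg_trajs):
            r_pos = ppm_score(problem, y_pos)  # scalar tensor: reward of the preferred step
            r_neg = ppm_score(problem, y_neg)  # scalar tensor: reward of the dispreferred step
            losses.append(-F.logsigmoid(r_pos - r_neg))
        return torch.stack(losses).mean()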
3.4 Self-Evolved Deep Thinking
3.4.1 Training with Step-by-Step Verified Reasoning Trajectory
Math Problems Collection. We collect a large dataset of 747k math word problems with final-answer ground-truth labels, primarily from NuminaMath [Jia LI and Polu, 2024a] and MetaMath [Yu et al., 2023b]. Notably, only competition-level problems (e.g., Olympiads and AIME/AMC) from NuminaMath are included, as we observe that grade-school-level problems do not significantly improve LLM complex math reasoning. To augment the limited competition-level problems, we follow [Li et al., 2024] and use GPT-4 to synthesize new problems based on the seed problems in the 7.5k MATH train set and the 3.6k AMC-AIME training split. However, GPT-4 often generated unsolvable problems or incorrect solutions for challenging seed problems. To filter these, we prompt GPT-4 to generate 10 solutions per problem, retaining only those with at least 3 consistent solutions.
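A sketch of this consistency filter for GPT-4-synthesized problems (keep a problem only if at least 3 of its 10 sampled solutions agree on the final answer); the boxed-answer parsing is our own assumption about how agreement is checked:

    import re
    from collections import Counter

    def extract_final_answer(solution_text):
        """Read the last \\boxed{...} from a solution as its final answer (assumed convention)."""
        matches = re.findall(r"\\boxed\{([^}]*)\}", solution_text)
        return matches[-1] if matches else None

    def keep_synthesized_problem(solutions, min_consistent=3):
        """solutions: the 10 GPT-4 solutions sampled for one synthesized problem."""
        answers = [a for a in (extract_final_answer(s) for s in solutions) if a is not None]
        if not answers:
            return False
        _, count = Counter(answers).most_common(1)[0]
        return count >= min_consistent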
Reasoning Trajectories Collection. Instead of using the original solutions in the 747k math dataset, we conduct extensive MCTS rollouts (Sec. 3.2) to generate higher-quality step-by-step verified reasoning trajectories. In each self-evolution round, we perform 16 rollouts per math problem, which leads to 16 reasoning trajectories. Problems are then categorized by difficulty based on the correct ratio of the generated trajectories: easy (all solutions are correct), medium (a mix of correct and incorrect solutions), and hard (all solutions are incorrect). For hard problems with no correct trajectories, an additional MCTS with 16 rollouts is performed. After that, all step-by-step trajectories and their annotated Q-values are collected and filtered to train the policy SLM and the process preference model.
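A sketch of the difficulty bucketing described above, based on the correct ratio over the 16 rollouts per problem; the trajectory attribute name is illustrative:

    def categorize_problem(trajectories):
        """trajectories: the 16 rollout trajectories for one problem, each with .answer_is_correct."""
        num_correct = sum(1 for t in trajectories if t.answer_is_correct)
        if num_correct == len(trajectories):
            return "easy"    # all solutions correct
        if num_correct == 0:
            return "hard"    # no correct solution: run an additional MCTS with 16 rollouts
        return "medium"      # a mix of correct and incorrect solutions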
Supervised Fine-tuning the Policy SLM.