A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang (onecall@), Hangzhou Innovation Institute, Beihang University
Jiahui Zhu (zhujh224@), Hangzhou Innovation Institute, Beihang University
Minjie Ren (rmj_rmj@), Hangzhou Innovation Institute, Beihang University
Zeming Liu (zmliu@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Shiwei Li (shiweili93@), Hangzhou Innovation Institute, Beihang University
Zongye Zhang (zhangzongye@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Chenkai Zhang (zhangchenkai@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Xiaoyu Wu (zf2306113@), Hangzhou Innovation Institute, Beihang University
Qiqi Zhan (zhanqiqi@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Qingjie Liu (qingjie.liu@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Yunhong Wang (yhwang@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

arXiv:2410.12896v1 [cs.CL] 16 Oct 2024
Abstract
The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.
1 Introduction
In recent years, large language models (LLMs) have demonstrated unparalleled capabilities across a wide array of tasks [9, 68, 166], firmly establishing themselves as the backbone of general artificial intelligence (AI) systems. These models achieve significant improvements in natural language processing [234, 262, 264], computer vision [100, 207, 239], and other research fields [36, 163, 229], consistently pushing the boundaries of what AI can achieve. The success of LLMs is largely attributed to their ability to capture intricate patterns and relationships within vast amounts of data, allowing them to perform complex tasks such as natural language inference [39, 134], visual question answering [151, 158], and vision-and-language navigation [125, 178] with remarkable proficiency.
However, the performance of LLMs is highly dependent on the quality and volume of the data they are trained on [2, 57, 58]. With the exponential growth in model size, now reaching billions or even trillions of parameters [105, 168, 268], there is an increasing demand for large-scale, diverse, and high-quality data to ensure robust generalization across various tasks and domains. Obtaining such data poses significant challenges due to the high costs of data collection and the problems introduced by privacy concerns. Additionally, the growth rate of high-quality data lags far behind the rapidly increasing size of training datasets. If this trend continues, the available data will eventually be depleted, implying that without significant improvements in data efficiency or the discovery of new data sources, the growth of LLMs may slow down considerably. Given these impending limitations, data synthesis and augmentation techniques become essential to extending the lifespan and generalization of LLMs. Traditional data synthesis and augmentation techniques [34, 98, 135, 194], such as image rotation, cropping, flipping, and rule-based natural language generation, have been widely used to address these data limitations. Although these approaches improve data diversity and address data scarcity to some extent, they still struggle to fully capture the complexities of real-world data [55], generate data at scale [233], and defend against adversarial examples [162], limiting their effectiveness for training LLMs.

Figure 1: Statistics of the publications related to LLM-oriented data synthesis and augmentation technologies, grouped by the publication year and venue.
To overcome these challenges, researchers have increasingly turned to LLM-oriented data synthesis and augmentation techniques, recognizing the ability of LLMs to model complex patterns from large datasets and generate synthetic data that closely mirrors real-world distributions while introducing valuable variations [37, 175, 260]. These studies reduce the reliance on manually curated datasets and enable the generation of high-quality, diverse data that meets the evolving demands of LLMs throughout their lifecycle and functions. To capture the breadth of these efforts, we collected papers related to LLM-oriented data synthesis and augmentation by searching Google Scholar using keywords such as "data synthesis," "data augmentation," and "large models." Figure 1 illustrates the publication trends by year and venue, reflecting the increasing interest in this field. As of October 2024, we identified 250 unique publications covering diverse research topics and venues. Summarizing these efforts provides critical insights into the progress and challenges that remain, offering a foundation for future research.

Despite these advancements, several key challenges remain in LLM-oriented data synthesis and augmentation. The misuse of synthetic data poses risks, particularly in spreading misinformation and raising ethical concerns around manipulating public opinion. Additionally, synthetic data often introduces ambiguity when aligning AI models with human values, potentially leading to biased outcomes. Evaluating models trained on synthetic data is also complex, as traditional benchmarks may not fully capture the nuances of this data. Ensuring reliability is another concern, as biases and inaccuracies from original datasets can persist in synthetic data, limiting its generalization across domains. Moreover, the computational demands of LLMs, along with challenges in handling less common languages or novel instructions, complicate broader applications. Finally, the lack of a unified framework for organizing and comparing the methods proposed in both academia and industry remains a barrier for researchers navigating this rapidly evolving field.
This survey aims to address these gaps by providing a comprehensive overview of LLM-oriented data synthesis and augmentation techniques. As shown in Figure 2, unlike previous surveys [43, 140, 147, 214, 271], which primarily focus on applying these methods to support specific downstream tasks or particular stages of LLMs, our work emphasizes the direct role of LLM-oriented techniques in improving the overall performance of LLMs across various stages of their lifecycle and core functions. In contrast to the work [137], which focuses on practices for synthetic data generation to address challenges like data scarcity and privacy, our survey extends beyond practical guidance by categorizing methods aimed at improving LLM performance holistically. We examine not only data generation but also how these techniques enhance LLMs across all stages and functions, offering a more integrated, data-centric framework for advancing LLMs. Specifically, we systematically review and categorize existing research from two key perspectives: the lifecycle of LLMs (from pre-training to fine-tuning and application) and their core functions (understanding, logic, memory, and generation). By framing the discussion around these dual perspectives, we offer clearer insights into the development, interconnections, and practical applications of different approaches. Moreover, we identify critical challenges, explore emerging research directions, and highlight potential breakthroughs that could further drive advancements in LLM performance through data-centric methods.
The contributions of this survey are summarized as follows:
• First Survey: To our knowledge, we present the first comprehensive survey focused on advancing LLMs through data synthesis and augmentation, systematically covering the entire lifecycle stages and core functions of LLMs. This survey provides an in-depth analysis of current methodologies and highlights the unique challenges at each stage.
• New taxonomy: We introduce an innovative organizational framework that categorizes existing research from two key perspectives: the lifecycle stages of LLMs and their core functions. This taxonomy offers a clearer understanding of the progression, interconnections, and applicability of different approaches, providing valuable insights into both developmental and functional aspects of LLM-oriented data synthesis and augmentation.
• New frontiers: We identify critical challenges, explore emerging research directions, and highlight potential breakthroughs in LLM-oriented data synthesis and augmentation. This discussion aims to inspire future research and guide developments in data-centric techniques for LLM advancement.
• Abundant resources: We organize and maintain a dedicated repository to support ongoing research and collaboration in LLM-oriented data synthesis and augmentation. This resource includes a curated collection of related papers, multiple leaderboards tracking the latest advancements, and regular updates to foster innovation, guide future research directions, and accelerate breakthroughs in the field.

Figure 2: A comparison between existing surveys on data synthesis and augmentation techniques and our work. Previous surveys primarily focus on LLM-based data synthesis and augmentation methods aimed at supporting downstream tasks. In contrast, our work emphasizes LLM-oriented data synthesis and augmentation, systematically covering the full lifecycle of LLMs, from data preparation to applications, and addressing core LLM functions such as understanding and generation, with the ultimate goal of improving LLMs themselves through data-centric techniques.
By offering a comprehensive overview of LLM-oriented data synthesis and augmentation approaches, this survey aims to clarify the current state of the field and inspire future research directions that can further enhance LLM capabilities through data synthesis and augmentation methodologies.
We organize the remainder of this survey as follows: Section 2 categorizes the primary areas of LLM-oriented data synthesis and augmentation, providing an overview of the foundational techniques. Section 3 discusses the current LLM-oriented data synthesis and augmentation methods from the perspective of the full lifecycle of LLMs, detailing how these techniques are employed at different stages of model development. In Section 4, we review these methods from the viewpoint of core LLM functions, exploring how data synthesis and augmentation enhance key capabilities such as understanding, logic, memory, and generation. Section 5 delves into the evaluation strategies for LLM-oriented data synthesis and augmentation, addressing benchmarks, evaluation metrics, and leaderboards used to assess and compare the effectiveness of existing approaches. Finally, Section 6 provides insights into challenges and emerging trends in LLM-oriented data synthesis and augmentation, offering recommendations for future research directions that can contribute to the continued advancement of LLMs through data synthesis and augmentation methodologies.
2 Taxonomy
Data generation methods play a pivotal role in addressing data scarcity and imbalance, thereby improving model performance and generalization. As shown in Fig. 4, we summarize the development and evolution of data augmentation and synthesis techniques in recent years. This section primarily introduces the current classification of data generation methods, distinguishing between data augmentation, which enhances existing data samples through transformations, and data synthesis, which creates entirely new samples from scratch or based on generative models. Both methods differ in their approach to acquiring data but aim to expand datasets. Furthermore, data augmentation and synthesis methods can be categorized into subclasses from multiple dimensions. Each approach has unique strengths and applications, enabling researchers to tailor their data generation strategies to specific needs and goals.
2.1 Data Augmentation
Figure 3: The main content flow and categorization of this survey.

Data augmentation, a type of generation approach from data to data, generally involves manipulating the original data to increase its diversity and quantity without significantly altering its essential characteristics. Techniques used in data augmentation are designed to enhance the richness of existing data samples through transformations or perturbations. Across different modalities, data augmentation techniques often exhibit similarities. For instance, in image data, augmentation operations encompass mosaic [90], flipping [184], copy-pasting [61], adding noise [149], pairing [84], and so forth. Similarly, in text data, augmentation operations involve synonym replacement [95], copy-pasting [185], etc. Moreover, to cater to the demands of multimodal learning, existing research has addressed cross-modal information alignment during data augmentation. MixGen [75] generates new training samples by linearly interpolating images and concatenating text sequences from two existing image-text pairs. The semantic relationship within the newly generated image-text pair remains consistent and matched. Recently, in the rapidly advancing landscape of LLMs, data augmentation has emerged as a cornerstone for bolstering model performance through the diversification of training exemplars, circumventing the necessity for extensive additional data gathering. From a data-centric perspective, we systematically categorize existing research on data augmentation into three distinct categories: data labeling [3, 63, 94, 136, 198, 275], data reformation [45, 51, 143, 237], and co-annotation [11, 43, 116].
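To make the multimodal case concrete, the sketch below illustrates a MixGen-style mixing step on toy data: two images are linearly interpolated and their captions concatenated so the new image-text pair stays semantically matched. The array shapes, the mixing ratio `lam`, and the plain string concatenation are illustrative assumptions, not the original implementation.

```python
# A minimal sketch of MixGen-style augmentation: interpolate two images and
# concatenate their captions to form a new, matched image-text pair.
import numpy as np

def mixgen(image_a: np.ndarray, text_a: str,
           image_b: np.ndarray, text_b: str,
           lam: float = 0.5) -> tuple[np.ndarray, str]:
    """Return a new (image, text) pair mixed from two existing pairs."""
    mixed_image = lam * image_a + (1.0 - lam) * image_b  # pixel-wise interpolation
    mixed_text = f"{text_a} {text_b}"                    # simple caption concatenation
    return mixed_image, mixed_text

# Usage with toy data: two 4x4 grayscale "images" and their captions.
img_a, img_b = np.zeros((4, 4)), np.ones((4, 4))
new_img, new_txt = mixgen(img_a, "a black square", img_b, "a white square", lam=0.3)
print(new_img.mean(), new_txt)
```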
2.1.1 Data Labeling. Data labeling endeavors to leverage the comprehensive language understanding capabilities of LLMs to annotate vast unlabeled datasets. This methodology is particularly beneficial in fields that possess a substantial unlabeled data corpus, encompassing domains such as cross-lingual processing and multimodal learning [3, 63, 275], where the automation of annotation can significantly expedite the data preparation process. Recent research studies the zero-shot annotation ability of LLMs, such as GPT-4 on labeling political Twitter data [198]. Moreover, Khan et al. [94] focus on visual question answering (VQA) tasks by generating pseudo-label data from unlabeled images by utilizing the SelTDA framework.
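As a concrete illustration of this kind of labeling pipeline, the sketch below prompts a model to assign one label from a fixed set to each unlabeled text and guards against free-form answers. `call_llm` is a hypothetical stand-in for whatever chat or completion API is actually used, and the label set is an assumption made for the example.

```python
# A minimal sketch of LLM-based data labeling over a fixed label set.
LABELS = ["positive", "negative", "neutral"]

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real LLM API call.
    return "neutral"

def label_texts(texts: list[str]) -> list[dict]:
    labeled = []
    for text in texts:
        prompt = (
            "Classify the sentiment of the following text as one of "
            f"{', '.join(LABELS)}. Answer with the label only.\n\nText: {text}"
        )
        answer = call_llm(prompt).strip().lower()
        label = answer if answer in LABELS else "unknown"  # guard against free-form output
        labeled.append({"text": text, "label": label})
    return labeled

print(label_texts(["The new release is fantastic.", "It crashes constantly."]))
```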
2.1.2 Data Reformation. Data reformation involves transforming and restructuring existing data into a broader spectrum of variations, thereby facilitating more fine-grained data augmentation [45, 51]. This approach aims to enrich the training landscape with diverse yet pertinent examples, enhancing the model's robustness and generalization capabilities. Classic methods such as rotation [92], color channel transformation [64], and synonym replacement [95] are commonly used. Recently, approaches utilizing LLMs have also emerged. For example, Chen et al. [27] propose Disco, an approach that harnesses LLMs to produce large-scale, high-quality counterfactual data.
2.1.3 Co-Annotation. Co-annotation designates the collaborative effort between human annotators and LLMs in the annotation process [11]. By integrating the strengths of both annotation methodologies, co-annotation not only mitigates annotation costs but also concurrently enhances annotation performance, fostering a more efficient and effective approach to data annotation. Li et al. [116] introduce CoAnnotating, a framework that strategically assigns data points for annotation either to humans or to LLMs, based on an assessment of the LLM's annotation uncertainty.
2.2 Data Synthesis

Data synthesis, on the other hand, aims to create entirely new data, either from scratch or with generative models, whose distribution resembles that of real data. In recent years, with the explosion and advancements in generative AI [13, 41, 42, 78, 139, 161, 169], there have been significant strides in the quality and generation efficiency of synthetic data. Based on the requirements of LMs, this paper categorizes data synthesis methods into three main types: general model distillation [22, 53, 120, 263, 266], domain model distillation [108, 145, 146, 215], and model self-improvement [54, 150, 210, 248].
2.2.1 General Model Distillation. Among these, general model distillation involves leveraging powerful general models, typically featuring larger parameter counts and superior performance, such as StableVicuna, ChatGPT, and GPT-4, to generate datasets that can enhance the capabilities of weaker models. There are various ways to employ these powerful models, such as using predefined templates to generate tiny stories [53] and leveraging the LLMs themselves to evaluate the quality of the generated data. Phi-1 and its series [67, 120] have demonstrated that a small amount of high-quality data can also train a powerful model, by leveraging the comprehensive generation of textbooks and exercises from GPT-3.5. Some other methods have also achieved performance improvements by generating instruction datasets and fine-tuning models after improving the quality of these datasets [22, 80, 196].
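As a rough illustration of the template-plus-quality-filter recipe described above, the sketch below builds TinyStories-like prompts from a small word list, asks a teacher model to generate a story, and keeps only generations the teacher itself scores highly. `teacher_generate`, `teacher_score`, the word list, and the score threshold are hypothetical placeholders standing in for calls to a strong general model.

```python
# A minimal sketch of general model distillation with an LLM-based quality filter.
import random

BASIC_WORDS = ["dog", "ball", "tree", "river", "moon", "cake"]

def teacher_generate(prompt: str) -> str:
    return "Once upon a time..."      # placeholder for a real teacher-model call

def teacher_score(text: str) -> float:
    return 0.9                        # placeholder quality score in [0, 1]

def distill_dataset(n: int, min_score: float = 0.7, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        words = rng.sample(BASIC_WORDS, 3)
        prompt = f"Write a short story for children that uses the words: {', '.join(words)}."
        story = teacher_generate(prompt)
        if teacher_score(story) >= min_score:   # keep only high-quality generations
            samples.append(story)
    return samples

print(len(distill_dataset(10)))
```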
2.2.2 Domain Model Distillation. Domain model distillation pertains to the utilization of models that are tailored to generate data within a particular domain. This approach is often necessary when general models fail to meet the specific needs of industry applications. For instance, in the context of code programming, domain model distillation can be employed to generate instructional data tailored to specific coding tasks [146, 215]. In the realm of mathematics, methods such as Minerva [108] and DeepSeekMath [220] are designed to generate solutions to mathematical problems while ensuring their accuracy and diversity. Additionally, industry data often presents barriers, such as limited data scales and the inaccessibility of data held by specific enterprises in the domain. These factors necessitate the adoption of domain-specific models that can effectively address the unique challenges posed by these scenarios.
2.2.3 Model Self-Improvement. Model self-improvement refers to the process where a model generates higher-quality data to enhance its own capabilities. For instance, leveraging existing instructions to adjust the model and prompting it to paraphrase documents on the web in specific styles, such as Wikipedia-style or QA-style, can be used to jointly pre-train LLMs for both authentic and synthetic paraphrasing tasks [150]. Self-Instruct [210] enhances LMs themselves by autogenerating and refining instructional data, boosting performance with minimal human intervention.
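The sketch below outlines a Self-Instruct-style self-improvement loop at a very high level: the model proposes new instructions from a small seed pool, near-duplicates are filtered with a simple overlap heuristic, and the surviving pool is used to fine-tune the same model. `model_generate`, `fine_tune`, and the Jaccard threshold are hypothetical placeholders for the real generation, filtering, and training steps.

```python
# A minimal sketch of a Self-Instruct-style self-improvement loop.
def model_generate(prompt: str) -> str:
    return "Summarize the following paragraph in one sentence."  # placeholder generation

def too_similar(a: str, b: str, threshold: float = 0.7) -> bool:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1) > threshold       # Jaccard word overlap

def fine_tune(model_name: str, instructions: list[str]) -> str:
    return model_name + "-improved"                              # placeholder training run

def self_improve(model_name: str, seed_instructions: list[str], rounds: int = 2):
    pool = list(seed_instructions)
    for _ in range(rounds):
        prompt = ("Here are some task instructions:\n"
                  + "\n".join(pool[-3:])
                  + "\nWrite one new, different instruction.")
        candidate = model_generate(prompt)
        if not any(too_similar(candidate, old) for old in pool):
            pool.append(candidate)            # keep only sufficiently novel instructions
    return fine_tune(model_name, pool), pool  # train the same model on its own data

improved, instructions = self_improve(
    "base-model",
    ["Translate the sentence into French.", "List three uses of a paperclip."],
)
print(improved, len(instructions))
```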
3 Data Synthesis and Augmentation in the Full Lifecycle of LLM

From the perspective of the full lifecycle of LLM, we divide the existing investigations into six stages, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. The present section introduces relevant research in each stage.
3.1 Data Preparation

In the data preparation phase, data synthesis and augmentation aim to generate diverse and high-quality datasets for the training of LLMs, addressing the challenge of the scarcity of real-world data. According to the taxonomy discussed in Section 2, we divide the present subsection into general model distillation and data augmentation.
Figure 4: Illustration of the evolutionary steps in the development of data synthesis and augmentation techniques for large models.
3.1.1 General Model Distillation. This approach aims to leverage the powerful capabilities of general LLMs to distill high-quality data. According to the approach and data modality, we further divide general model distillation into five categories: synthesize from seeds, synthesize reasoning steps, synthesize with controllability, synthesize from scratch, and synthesize multimodal data.
Synthesize from Seeds. To synthesize datasets for specific tasks, prompting LLMs with a small number of relevant examples can effectively produce high-quality datasets at a low cost. For instance, to investigate "how small can an LLM be to achieve certain capabilities," TinyStories [53] is constructed by instructing an LLM to generate stories that combine three words randomly chosen from 1500 basic words, and it can be used to train and evaluate language models. Based on the collected large-scale functions, Case2Code [180] incorporates LLMs to generate suitable inputs for these functions and utilizes the code interpreter to calculate the