A Survey of Meta-Reinforcement Learning
arXiv [cs.LG], 19 Jan 2023

Jacob Beck* (jacob.beck@cs.ox.ac.uk), University of Oxford
Risto Vuorio* (risto.vuorio@cs.ox.ac.uk), University of Oxford
Evan Zheran Liu (evanliu@), Stanford University
Zheng Xiong (zheng.xiong@cs.ox.ac.uk), University of Oxford
Luisa Zintgraf† (zintgraf@), University of Oxford
Chelsea Finn (cbfinn@), Stanford University
Shimon Whiteson (shimon.whiteson@cs.ox.ac.uk), University of Oxford

Abstract

While deep reinforcement learning (RL) has fueled multiple high-profile successes [...], it is held back from more widespread adoption by [...]

1 Introduction

Meta-reinforcement learning (meta-RL) is a family of machine learning (ML) methods that learn to reinforcement learn. That is, meta-RL uses sample-inefficient ML to learn sample-efficient RL algorithms [...]. [...] has been treated as a machine learning problem for a significant period of time. Intriguingly, [...]

2 Background

2.1 Reinforcement learning

[... the Markov decision process (MDP) is referred] to as the agent's environment. An MDP is defined by a tuple $M = \langle S, A, P, P_0, R, \gamma, T \rangle$, where $S$ is the set of states, $A$ the set of actions, $P(s_{t+1} \mid s_t, a_t): S \times A \times S \to \mathbb{R}_+$ the probability of transitioning from state $s_t$ to state $s_{t+1}$ after taking action $a_t$, $P_0(s_0): S \to \mathbb{R}_+$ a distribution over initial states, $R$ the reward function, $\gamma$ the discount factor, and $T$ the horizon. A policy is a function $\pi(a \mid s): S \times A \to \mathbb{R}_+$ that maps states to action probabilities. This way, a policy, together with the MDP, induces a distribution over trajectories,

$$P(\tau) = P_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t), \tag{1}$$

and the RL objective is the expected discounted return,

$$J(\pi) = \mathbb{E}_{\tau \sim P(\tau)} \Big[ \sum_{t=0}^{T-1} \gamma^t r_t \Big]. \tag{2}$$

[...] multiple episodes are gathered. If $H$ episodes have been gathered, then $D = \{\tau_h\}_{h=0}^{H-1}$ is all of the data [collected]. We define an RL algorithm as the function $f(D): ((S \times A \times \mathbb{R})^T)^H \to \Phi$. In practice, the data may include [...]

2.2 Meta-RL definition

[The idea of meta-RL] is instead to learn (parts of) an algorithm $f$ using machine learning. Where RL learns a policy, [meta-RL learns the algorithm] $f$, [freeing] the human from directly designing and implementing the RL algorithms [...]. [The parameters $\theta$ of $f_\theta$ are trained] to maximize a meta-RL objective. Hence, $f_\theta$ outputs the parameters of $\pi_\phi$ directly: $\phi = f_\theta(D)$. We refer to the policy $\pi_\phi$ as the base policy with base parameters $\phi$. Here, $D$ is a meta-trajectory [...]. Accordingly, we may call $\theta$ the outer-loop parameters and [$\phi$ the inner-loop parameters]. [...] [The definition can b]e supported by any set of tasks. However, [it is common for the state and action spaces] $S$ and $A$ to be shared between all of the tasks, and for the tasks to only [differ in their rewards and dynamics].

[Figure: schematic of the inner loop and outer loop of meta-RL.]

The meta-RL objective is

$$\mathcal{J}(\theta) = \mathbb{E}_{M_i \sim p(M)} \Big[ \mathbb{E}_D \Big[ \sum_{\tau \in D_{K:H}} G(\tau) \;\Big|\; f_\theta, M_i \Big] \Big], \tag{3}$$

where $G(\tau)$ is the discounted return in the MDP $M_i$ and $H$ is the length of the trial, or the task-horizon. [...] the inner loop $f_\theta(D)$.

2.3 Example algorithms

[Two canonical examples are Model-Agnostic] Meta-Learning (MAML), which uses meta-gradients, and Fast RL via Slow RL (RL$^2$), which uses recurrent neural networks [46, 239]. Many meta-RL algorithms build on [ideas] similar to those used in MAML and RL$^2$, which makes them excellent [introductory examples] (minimal code sketches of both are given at the end of Section 2).

MAML. Many designs of the inner-loop algorithm $f_\theta$ build on existing RL algorithms and use meta-learning to improve them. MAML [55] is an influential design following this pattern. Its [inner loop adapts the policy parameters] with gradient descent, [and the initialization is meta-learned] to be a good starting point for learning on tasks from the task distribution. When adapting to a new task, MAML collects data [and takes a policy gradient] step for a task $M_i \sim p(M)$:

$$\phi = f(D, \phi_0) = \phi_0 + \alpha \nabla_{\phi_0} \hat{J}(D, \pi_{\phi_0}),$$

where $\hat{J}(D, \pi_{\phi_0})$ is an estimate of the returns of the policy $\pi_{\phi_0}$ for the task $M_i$ and $\alpha$ is the [learning rate]. [The outer loop updates the initialization as]

$$\phi_0 \leftarrow \phi_0 + \beta \nabla_{\phi_0} \mathbb{E}_{M_i \sim p(M)} \big[ \hat{J}(D_1^i, \pi_{\phi_1^i}) \big],$$

where $\pi_{\phi_1^i}$ is the policy for task $i$ updated once by the inner loop and $\beta$ is a learning rate. [...] policy for variance reduction [...] higher values of $K$ [...] in general with $K$, up to differences in the discounting. To optimize [...] the RNN. However, MAML cannot trivially [...]

[Table: example algorithms by problem setting, e.g., RL$^2$ [46, 239] and MAML [55] for the multi-task few-shot setting, and LPG and MetaGenRL for the many-shot setting.]

2.4 Problem Categories

While the given problem setting applies to all of meta-RL, distinct clusters in the literature have [emerged. The most common is the few-shot] multi-task setting. In this setting, an agent must quickly [adapt to a new task ...] during training. [...] Methods for this many-shot single-task setting tend to [...]

[Figure: meta-RL problem categories. Few-shot multi-task meta-learning over multiple (similar) tasks (MDP1, MDP2, MDP3): zero-shot methods (adaptation goal: perform well from the start; methods: RL$^2$, L2RL, VariBAD) and few-shot methods (adaptation goal: learn new tasks within a few steps/episodes after a free exploration phase; methods: MAML, DREAM). Many-shot multi-task meta-learning over multiple (diverse) tasks (adaptation goal: learn new tasks better than standard RL algorithms; methods: LPG, MetaGenRL). Many-shot single-task meta-learning over windows in a single task, with no reset (adaptation goal: accelerate standard RL algorithms; methods: STACX, FRODO).]
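To make the MAML inner/outer loop of Section 2.3 concrete, the following is a minimal sketch, assuming a toy linear softmax policy, a REINFORCE-style surrogate return, and a hypothetical `sample_task_data` stand-in for collecting trajectories in a sampled task. It is a sketch of the idea, not the published implementation.

```python
# Minimal sketch of a MAML-style inner/outer loop (Section 2.3):
# inner step  phi = phi_0 + alpha * grad J_hat(D, pi_phi0)
# outer step  phi_0 <- phi_0 + beta * grad E_{M_i}[J_hat(D_1, pi_phi1)]
import torch

def surrogate_return(phi, data):
    """Hypothetical differentiable estimate of J(pi_phi) from sampled data."""
    states, actions, returns = data
    logp = torch.log_softmax(states @ phi, dim=-1)          # linear policy over discrete actions
    chosen = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    return (chosen * returns).mean()                         # REINFORCE-style surrogate

def sample_task_data(n=32, s_dim=4, n_actions=2):
    """Hypothetical stand-in for collecting trajectories in a sampled task M_i."""
    return torch.randn(n, s_dim), torch.randint(n_actions, (n,)), torch.randn(n)

def inner_update(phi0, data, alpha=0.1):
    """One policy-gradient-style inner-loop step."""
    J_hat = surrogate_return(phi0, data)
    (grad,) = torch.autograd.grad(J_hat, phi0, create_graph=True)  # keep graph for the meta-gradient
    return phi0 + alpha * grad

phi0 = torch.zeros(4, 2, requires_grad=True)                 # meta-learned initialization
meta_opt = torch.optim.SGD([phi0], lr=0.01)                   # beta in the outer-loop update

for meta_iter in range(100):
    meta_opt.zero_grad()
    for _ in range(8):                                        # tasks M_i ~ p(M)
        pre_data = sample_task_data()                         # D used by the inner loop
        phi1 = inner_update(phi0, pre_data)                   # adapted parameters phi_1^i
        post_data = sample_task_data()                        # fresh data from the adapted policy
        outer_loss = -surrogate_return(phi1, post_data)       # maximize post-adaptation return
        outer_loss.backward()                                 # meta-gradient through the inner step
    meta_opt.step()
```

The `create_graph=True` flag is what lets the outer loss differentiate through the inner update; dropping it would reduce the sketch to a first-order approximation.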
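Similarly, the following is a minimal sketch of an RL$^2$-style recurrent inner loop, where a GRU hidden state carried across the episodes of a trial plays the role of the learning algorithm. The environment interaction and the outer-loop policy-gradient update are hypothetical placeholders, not the published implementation.

```python
# Minimal sketch of an RL^2-style recurrent inner loop (Section 2.3): the hidden state
# carries information across episodes of a trial, so "learning" happens inside the RNN.
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    def __init__(self, s_dim, n_actions, hidden=64):
        super().__init__()
        # Input is the current state plus the previous action (one-hot) and reward.
        self.gru = nn.GRUCell(s_dim + n_actions + 1, hidden)
        self.pi = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions

    def step(self, state, prev_action, prev_reward, h):
        a_onehot = torch.nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([state, a_onehot, prev_reward.unsqueeze(-1)], dim=-1)
        h = self.gru(x, h)                      # hidden state = the agent's memory of the task
        return torch.distributions.Categorical(logits=self.pi(h)), h

policy = RL2Policy(s_dim=4, n_actions=2)
h = torch.zeros(1, 64)                           # reset once per trial, NOT per episode
state = torch.randn(1, 4)                        # hypothetical environment observation
prev_a, prev_r = torch.zeros(1, dtype=torch.long), torch.zeros(1)
for t in range(10):
    dist, h = policy.step(state, prev_a, prev_r, h)
    action = dist.sample()
    # state, reward = env.step(action)  # hypothetical environment; the outer loop would
    # train `policy` end-to-end with a standard policy gradient on the trial return.
    prev_a, prev_r = action, torch.randn(1)      # placeholder reward
    state = torch.randn(1, 4)                    # placeholder next state
```

Note that the hidden state is reset per trial rather than per episode, which is what allows information gathered in early episodes to change behavior in later ones.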
[Table: overview of few-shot meta-RL methods by inner-loop parameterization.]
Parameterized policy gradients / MAML-like: Finn et al. [55], Li et al. [124], Sung et al. [219], Vuorio et al. [235], Zintgraf et al. [...]
Parameterized policy gradients / Distributional MAML: [...]
Parameterized policy gradients / Meta-gradient estimation: Foerster et al. [60], Al-Shedivat et al. [207], Stadie et al. [216], Liu et al. [133], Mao et al. [139], Fallah et al. [52], Tang [222], and Vuorio et al. [234]
Black box / Recurrent inner loop: Heess et al. [88], Duan et al. [46], Wang et al. [239], Humplik et al. [95], Fakoor et al. [51], Yan et al. [256], Zintgraf et al. [281], Liu et al. [130], and Zintgraf et al. [282]
Black box / Attention: Mishra et al. [150], Fortunato et al. [62], Emukpere et al. [49], Ritter et al. [190], Wang et al. [240], and Melo [141]
Black box / Hypernetworks: Xian et al. [250] and Beck et al. [17]
Task inference / Multi-task pre-training: Humplik et al. [95], Kamienny et al. [104], Raileanu et al. [182], Liu et al. [130], and Peng et al. [174]
Task inference / Latent-variable models: Zhou et al. [278], Raileanu et al. [182], Zintgraf et al. [281], Zhang et al. [268], Zintgraf et al. [282], Beck et al. [17], He et al. [86], and Imagawa et al. [97]
Task inference / Contrastive learning: Fu et al. [64]

3 Few-Shot Meta-RL

[Consider, for example, a robot chef that must learn to coo]k in home kitchens. Training a new [...] to cook in it. However, training such an agent with meta-RL involves unique [challenges specific to the] few-shot setting. Recall that meta-RL itself learns a learning algorithm $f_\theta$. This places unique [demands on how $f_\theta$ is parameterized]:

- Parameterized policy gradient methods build the structure of existing policy gradient [algorithms into the inner loop ...]
[...]

[Figure: comparison of PPG and black box methods: PPG methods encode inductive bias in the structure of the inner loop, while black box methods acquire inductive bias from data, with consequences for generalization.]

[...] challenges. One such [challenge concerns] supervision: in the standard meta-RL problem setting, rewards are available during both meta-[training and meta-testing]. [For ex]ample, it may be difficult to manually design an informative task distribution for meta-training [...]

3.1 Parameterized Policy Gradient Methods

Meta-RL learns a learning algorithm $f_\theta$, the inner loop. We call the parameterization of $f_\theta$ the [inner-loop structure. In this] section we discuss one way of parameterizing the inner loop that builds in the structure of existing standard RL algorithms. Parameterized policy gradients (PPG) [take the general form]

$$\phi_{j+1} = f_\theta(D_j, \phi_j) = \phi_j + \alpha_\theta \nabla_{\phi_j} \hat{J}_\theta(D_j, \pi_{\phi_j}),$$

[...] [for example, some methods additionally meta-learn a preconditioning matrix $M_\theta$, giving]

$$\phi_{j+1} = \phi_j + \alpha_\theta M_\theta \nabla_{\phi_j} \hat{J}_\theta(D_j, \pi_{\phi_j})$$

[255, 170, 58]. While a value-based method could be used [...] can be updated with back-propagation in a PPG method, or by a neural network in a black box [method]. [Some meth]ods learn a full distribution over initial policy parameters, $p(\phi_0)$ [82, 260, 242, 285, 73]. This [distribu]tion [can be] fit via variational inference [82, 73]. Moreover, the distribution itself can be updated in the [inner loop ...]. [Other methods adapt only the] weights and biases of the last layer of the policy [181], while leaving the rest of the parameters [fixed]. [... a context ve]ctor [on which the policy is con]ditioned; in this case, the input to the policy itself parameterizes a [...]

Meta-gradient estimation in outer-loop optimization. Estimating gradients for the outer loop is [complicated by the fact that the inner loop itself contains a gradient step]. Therefore optimizing the outer loop requires taking the gradient of a gradient, or a meta-gradient, which involves [second-order derivatives ...]. [The distribution] of data used by the inner loop [depends] on prior [policies, which is often not accoun]ted [for] by [the gradient computation] in the outer loop. Still, these prior policies do affect the distribution of data sampled in $D$, used later by the inner-loop learning algorithm. Thus, ignoring the gradient terms in the policy [used to sample that data biases the meta-gradient estimate ...]. [Alternatively, a] method may use a first-order approximation [63], or use gradient-free optimization to opti[mize the outer loop].

Outer-loop algorithms. While most PPG methods use a policy-gradient algorithm in the outer [loop, ...]. Additionally, one can train task-specific experts and then use these for imitation learning in the [inner loop ...], [matching the expert's trajec]tory behavior by optimizing Equation [...]. [...] they can [...] over PPG [methods].
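As a hedged illustration of the PPG updates above, the sketch below meta-learns the inner-loop step size $\alpha_\theta$ and a diagonal preconditioner $M_\theta$ alongside the initialization $\phi_0$. The surrogate return and data sampler are the same kind of hypothetical stand-ins used in the MAML sketch (redefined here so the example is self-contained); a full matrix $M_\theta$ or per-layer structure would be equally valid.

```python
# Minimal sketch of a parameterized policy gradient (PPG) inner loop (Section 3.1):
# phi_{j+1} = phi_j + alpha_theta * M_theta * grad J_hat(D_j, pi_phi_j),
# with alpha_theta a learned scalar and M_theta a learned diagonal preconditioner.
import torch
import torch.nn.functional as F

phi0      = torch.zeros(4, 2, requires_grad=True)           # meta-learned initialization
log_alpha = torch.tensor(-2.3, requires_grad=True)           # meta-learned inner step size
m_diag    = torch.zeros(4, 2, requires_grad=True)            # meta-learned diagonal preconditioner

def surrogate_return(phi, data):
    states, actions, returns = data
    logp = F.log_softmax(states @ phi, dim=-1)
    return (logp.gather(1, actions.unsqueeze(1)).squeeze(1) * returns).mean()

def sample_task_data(n=32):
    return torch.randn(n, 4), torch.randint(2, (n,)), torch.randn(n)

def ppg_inner_step(phi, data):
    """One learned inner-loop update with meta-learned alpha_theta and M_theta."""
    (g,) = torch.autograd.grad(surrogate_return(phi, data), phi, create_graph=True)
    return phi + log_alpha.exp() * F.softplus(m_diag) * g    # elementwise preconditioning

meta_params = [phi0, log_alpha, m_diag]                       # theta: everything the outer loop trains
meta_opt = torch.optim.Adam(meta_params, lr=1e-3)
for _ in range(100):
    meta_opt.zero_grad()
    for _ in range(8):                                        # tasks M_i ~ p(M)
        phi1 = ppg_inner_step(phi0, sample_task_data())       # D_j collected in task M_i
        (-surrogate_return(phi1, sample_task_data())).backward()
    meta_opt.step()
```

Anything added to `meta_params` becomes part of $\theta$; the inner update itself remains a policy-gradient step, which is the structural inductive bias that distinguishes PPG from black box methods.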
3.2 Black Box Methods

[In contrast, black box methods represent the inner loop with] a universal function approximator. This places fewer constraints on the function $f_\theta$ than [PPG methods, which are constrain]ed by structure. By conditioning a policy on a context vector, all of the weights and biases of [the base policy] must generalize between all tasks. However, when significantly distinct policies are required for different tasks, [it can be preferable to produce the poli]cy directly: the inner loop may produce all of the parameters of a feed[-forward policy ...].

Inner-loop representation. While many black box methods use recurrent neural networks [88, ...], [other architectures are possible, such as attenti]on mecha[nisms ...].

Outer-loop algorithms. While many black box methods use on-policy algorithms in the outer loop [46, 239, 281], it is straightforward to use off-policy algorithms [185, 51, 130], which bring [improved sample efficiency ...].

Black box trade-offs. One key benefit of black box methods is that they can rapidly alter their [policy with new information ...], [but they ca]n often struggle to generalize outside of $p(M)$ [..., 252]. Consider the robot chef: while it [...]. [When us]ing a fully black-box method, the policy or inner loop can be fine-tuned with policy gradients at [meta-test time ...].

3.3 Task Inference Methods

[...] training for each task, with no planning required. In fact, training a policy over a distribution of tasks, with access to the true task, can be taken as the definition of multi-task RL [263]. In the[se settings, some metho]ds map the task directly to [the] weights [of the] policy [...].

Task inference with privileged information. A straightforward method for inferring the task is to [... use privileged task information] $c_M$ [...]

Task inference with multi-task training. Some research uses the multi-task setting to improve [task inference ...] a [represen]tation that encodes the task [such that] it contains only this information [95, 130]. After this, $g_\theta(c_M)$ can be inferred in meta-learning [...]. [... multi-]task RL may be [...]. [... exploration] is needed for the meta-RL policy to identify the task. In this case, instead of only inferring the [task ... su]fficiently many exploratory [episodes ...]. [... as the diversity of] the task[s grows,] sharing policies becomes less feasible. Often, intrinsic rewards are [used ...].

Task inference without privileged information. Other task inference methods do not rely on [privileged information]. For instance, a task can be [identified by its reward] or transition function [278, 281, 268, 280, 86], and task inference can use contrastive learning [64].

[Figure: illustration of a trial of H episodes, with free exploration in the first K episodes (yellow) followed by exploitation (white).]

[...] distribution using a variational information bottleneck [...]

3.4 Exploration and Meta-Exploration

[Standard RL exploration] should work for any MDP and may consist of random on-policy exploration, epsilon-greedy ex[ploration, ...]. [Th]is type of exploration still occurs in the [inner loop ...]; additionally, [there] exists exploration in the [outer loop ...] (Zhou et al. [278], Gurumurthy et al. [83], Fu et al. [64], Liu et al. [130], and Zhang et al. [268]) [...] $p(M)$. To enable sample-efficient adaptation, [... the task] distribution. Recall that in the few-shot adaptation setting, on each trial, the agent is placed into a new task [and must focus] on solving the task in the next few episodes (i.e., over the $H-K$ episodes in Equation 3). [...] during [... po]tentially even beyond the initial few shots, [... the agent must balance exploring] with exploiting what it already knows to achieve high rewards. It is always optimal to explore in the first $K$ episodes, since no [reward from those episodes counts toward the meta-RL objective]. [When $H-K$ is large, sacrif]icing short-term rewards to learn a better policy for higher later returns pays dividends, while when $H-K$ is small, the agent must exploit more to obtain any reward it can, optimally [trading off exploration and exploitation].

End-to-end optimization. Perhaps the simplest approach is to learn to explore and exploit end-to-end by directly maximizing the meta-RL objective (Equation 3), as done by black box meta-RL approaches [46, 239, 150, 216, 26]. Approaches in this category implicitly learn to explore, as they directly optimize the meta-RL objective, whose maximization requires exploration. More specifically, the returns in the later episodes, $\sum_{\tau \in D_{K:H}} G(\tau)$, can only be maximized if the policy appropriately explores in the first $K$ episodes, so maximizing the meta-RL objective can yield optimal exploration in principle. This approach works well when complicated exploration strategies are not needed. For example, if attempting several tasks in the distribution of tasks is a reasonable form of exploration for a particular task distribution, then end-to-end optimization may work well. [However, the robot chef only receives reward for investigating the kitchen and] ingredients (i.e., explore) if doing so results in a cooked meal. Hence, it is challenging to learn [...].

Posterior sampling. To circumvent the challenge of implicitly learning to explore, Rakelly et al. [propose maintaining a distribution over] what the identity of the task is, and then iteratively refining this distribution by interacting with [the environment ...]. [...] its initial position [...] the dynamics and reward function [...], information gain over the task distribution [64, 130], or a reduction in uncertainty of the poste[rior ...]. [... the exploration policy explores for the] first $K$ episodes, and then the exploitation policy exploits for the remaining $H-K$ [episodes ...]. [... some behaviors reveal] information about the task dynamics, but are irrelevant for a robot chef trying to cook a meal.
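The posterior-sampling approach above, like the task-inference methods of Section 3.3, rests on turning collected context into a task representation. As a minimal sketch of that ingredient, assuming small made-up dimensions, mean pooling, and a point estimate of the task embedding (rather than a full posterior), the following is in the spirit of latent-variable task-inference methods without reproducing any cited algorithm.

```python
# Minimal sketch of task inference (Section 3.3): an encoder maps context transitions
# (s, a, r, s') collected in the current task to a latent task embedding z, and the
# base policy conditions on [state, z]. Sizes and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """Mean-pools per-transition encodings so the embedding is order-invariant."""
    def __init__(self, s_dim, a_dim, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * s_dim + a_dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, z_dim))

    def forward(self, s, a, r, s_next):
        x = torch.cat([s, a, r.unsqueeze(-1), s_next], dim=-1)   # (N, 2*s_dim + a_dim + 1)
        return self.net(x).mean(dim=0)                            # (z_dim,) task embedding

class ConditionedPolicy(nn.Module):
    def __init__(self, s_dim, z_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + z_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, state, z):
        logits = self.net(torch.cat([state, z.expand(state.shape[0], -1)], dim=-1))
        return torch.distributions.Categorical(logits=logits)

# Usage: infer z from the exploration episodes, then act with the conditioned policy.
s_dim, a_dim, n_actions = 4, 2, 2
encoder, policy = TaskEncoder(s_dim, a_dim), ConditionedPolicy(s_dim, 8, n_actions)
ctx_s, ctx_a = torch.randn(50, s_dim), torch.randn(50, a_dim)     # hypothetical context D
ctx_r, ctx_s2 = torch.randn(50), torch.randn(50, s_dim)
z = encoder(ctx_s, ctx_a, ctx_r, ctx_s2)
action = policy(torch.randn(1, s_dim), z).sample()                # act in the current state
```

In a posterior-sampling or variational method, `z` would be a sample from an inferred distribution rather than a point estimate, but the conditioning pattern stays the same.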
[Figure: comparison of optimal zero-shot exploration, posterior sampling, and irrelevant exploration over episodes 1 to 3; the caption contrasts optimal exploration and posterior sampling, and the third row [...]]

[...] considering that this intrinsic reward can be used to train a policy exclusively [from] off-policy data [...]. For example, using random network distillation [29], a reward may add an incentive for novelty [282], or add an incentive for getting data where the TD-error is high [77]. Many of these rewards [...]

3.5 Bayes-Adaptive Optimality

[... an agent should not explore merely to reduce unce]rtainty. Instead, optimal exploration only reduces uncertainty [insofar as doing so improves returns, since the ti]me for exploration is limited. Therefore [...]. [In this section we d]iscuss [approx]imate Bayes-optimal policies and analyze the behavior of [...].

Bayes-adaptive Markov decision processes. To determine the optimal exploration strategy, we [must reason about the agent's uncertainty over the dynam]ics and reward function. From a high level, the Bayes-adaptive Markov deci[sion process] (BAMDP) [formalizes a policy that] maximizes returns when placed into an unknown MDP. Crucially, the dynamics of the [BAMDP depend on a belief $b_t$ that ch]aracterizes the current uncertainty as a distribution over potential [rewards and transi]tions [given the trajectory observed] so far, and the initial belief $b_0$ is a prior $p(R, P)$. Then, the states of [the BAMDP are pairs $(s_t, b_t)$, and its rewards and dynamics are expectations under the current uncertai]nty. Specifically, the BAMDP reward is

$$R^+(s_t, b_t, a_t) = \mathbb{E}_{R \sim b_t}\big[R(s_t, a_t)\big], \tag{4}$$

[and the transition function of t]he BAMDP is

$$P^+(s_{t+1}, b_{t+1} \mid s_t, b_t, a_t) = \mathbb{E}_{R, P \sim b_t}\big[P(s_{t+1} \mid s_t, a_t)\, \delta\big(b_{t+1} = p(R, P \mid \tau_{:t+1})\big)\big]. \tag{5}$$

Learning an approximate Bayes-optimal policy. Directly computing Bayes-optimal policies [is generally intractable ...]. [...] and the latent variables $m$ can be learned by rolling out the policy to obtain [...]. [Even when it is not tractable to le]arn Bayes-adaptive optimal policies, the framework of BAMDPs can still offer a helpful [perspective]. First, black box meta-RL algorithms such as RL$^2$ learn a recurrent policy that not only conditions on the current state $s_t$, but on the history of observed states, actions, and rewards $\tau_{:t}$ [...]. [Since this history suffices for co]mputing the belief state, [black box m]eta-RL algorithms can in principle learn Bayes-adaptive [optimal policies]. [In practice, however, some] meta-RL algorithms struggle to learn [this behavior because the optimization] is challenging. Liu et al. highlight one such optimization challenge for black box meta-RL [...], where the agent is given a few "free" episodes to explore, and the objective is to maximize the [returns in the remaining episodes rather than the retu]rns beginning from the first time step. These [...] result in using less suitable utensils or ingredients, though, especially when optimized at lower [...]. [... a representation of t]he current task, which is equivalent to the belief state. Then, explo[ration ...]. [... such a poli]cy can be sufficient for optimally solving the meta-RL problem, even if it does not make use of all of this state.

3.6 Supervision

In this section, we discuss most of the different types of supervision considered in meta-RL ([e.g., access to] expert trajectories or other privileged information during meta-training and/or testing). Each of [...]

[Table: supervision settings, including standard meta-RL, meta-RL with [...], meta-RL via imitation, and HYPER[...].]
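As a sketch of the intrinsic-reward idea mentioned above, the following implements a random-network-distillation style novelty bonus: a predictor network is trained to match a fixed, randomly initialized target network, and its prediction error on a state serves as an intrinsic reward. The network sizes and the way the bonus is consumed are assumptions, not a cited implementation.

```python
# Minimal sketch of an RND-style novelty bonus (cf. the intrinsic rewards discussed above).
import torch
import torch.nn as nn

s_dim, feat_dim = 4, 16
target = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
predictor = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)                      # the target stays fixed and random
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(states):
    """Per-state novelty bonus: squared error between predictor and frozen target."""
    with torch.no_grad():
        return (predictor(states) - target(states)).pow(2).mean(dim=-1)  # high for rare states

def update_predictor(states):
    """Train the predictor on visited states so familiar states stop looking novel."""
    loss = (predictor(states) - target(states)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

batch = torch.randn(32, s_dim)                    # hypothetical batch of visited states
bonus = intrinsic_reward(batch)                   # add (scaled) to the environment reward
update_predictor(batch)
```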
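Finally, to ground the BAMDP quantities of Section 3.5, here is a worked toy example (an assumption chosen for tractability, not taken from the survey): a two-armed Bernoulli bandit in which the belief over each arm's success probability is a Beta distribution, so the BAMDP reward of Equation 4 is a posterior mean and the belief update inside Equation 5 is an exact conjugate update.

```python
# Worked example of the BAMDP quantities (Equations 4 and 5) for a two-armed Bernoulli
# bandit with a Beta belief over each arm's unknown success probability.
from dataclasses import dataclass

@dataclass
class Belief:
    # Beta(alpha, beta) parameters for each arm's success probability.
    alpha: tuple = (1.0, 1.0)
    beta: tuple = (1.0, 1.0)

def bamdp_reward(b: Belief, action: int) -> float:
    """Equation 4: R+(s, b, a) = E_{R ~ b}[R(s, a)], here the posterior mean of the arm."""
    return b.alpha[action] / (b.alpha[action] + b.beta[action])

def belief_update(b: Belief, action: int, reward: int) -> Belief:
    """The belief part of Equation 5: b_{t+1} = p(R, P | tau_{:t+1}), an exact Beta update."""
    alpha, beta = list(b.alpha), list(b.beta)
    alpha[action] += reward          # observed success
    beta[action] += 1 - reward       # observed failure
    return Belief(tuple(alpha), tuple(beta))

# A Bayes-adaptive agent plans in belief space: pulling an arm both earns expected reward
# (Eq. 4) and moves the belief (Eq. 5), which is exactly where the value of exploration
# comes from.
b = Belief()
print(bamdp_reward(b, 0))            # 0.5 under the uniform prior
b = belief_update(b, action=0, reward=1)
print(bamdp_reward(b, 0))            # 2/3 after one observed success on arm 0
```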
