SHANGHAI JIAO TONG UNIVERSITY

Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning
Group Number: g-07
Group Members: Wang Wenqing 116032910080, Gao Xiaoning 116032910032, Qian Chen 116032910073

Contents
1 Introduction
2 Deep Q-learning Network
2.1 Q-learning
2.1.1 Reinforcement Learning Problem
2.1.2 Q-learning Formulation [6]
2.2 Deep Q-learning Network
2.3 Input Pre-processing
2.4 Experience Replay and Stability
2.5 DQN Architecture and Algorithm
3 Experiments
3.1 Parameters Settings
3.2 Results Analysis
4 Conclusion
References

Playing the Game of Flappy Bird with Deep Reinforcement Learning

Abstract

Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of the game, updating its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game Flappy bird as the input of DQN, which guarantees scalability to other games. After training with some tricks, DQN can greatly outperform human beings.

1 Introduction

Flappy bird has been a popular game around the world in recent years. The goal of the player is to guide the bird on screen through the gap between two pipes by tapping the screen. If the player taps the screen, the bird jumps up; if the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.

Figure 1: (a) normal flight state (b) crash state (c) passing state

Our goal in this paper is to design an agent that plays Flappy bird automatically from the same input a human player receives, which means that we use only raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.

In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images, so it is natural to ask whether deep learning can also be used in reinforcement learning. However, there are four challenges in doing so. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data, whereas RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Thirdly, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.

This paper will demonstrate that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy bird. The network is trained with a variant of the Q-learning algorithm [6]. Using the Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in the game Flappy bird based solely on consecutive raw images.

2 Deep Q-learning Network

Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and efficiently updates its parameters using stochastic gradient descent. In the following sections, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.

2.1 Q-learning

2.1.1 Reinforcement Learning Problem

Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time t, the agent receives a state s_t and a reward r_t. It then chooses an action a_t from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state s_{t+1}, and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined [4].

Figure 2: Traditional reinforcement learning scenario

The goal of the agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with an action might be negative.

2.1.2 Q-learning Formulation [6]

In Q-learning, the set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{n-1}, a_{n-1}, r_n, s_n

Here s_i represents the state, a_i is the action and r_{i+1} is the reward received after performing action a_i. The episode ends with the terminal state s_n. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point t onward as:

R_t = r_t + r_{t+1} + ... + r_{n-1} + r_n    (1)

In order to ensure convergence and to balance the immediate reward against future rewards, the total reward must use a discounted future reward:

R_t = r_t + γ r_{t+1} + ... + γ^{n-t-1} r_{n-1} + γ^{n-t} r_n = Σ_{i=t}^{n} γ^{i-t} r_i    (2)

Here γ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. Transforming equation (2) gives:

R_t = r_t + γ R_{t+1}    (3)
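To make the discounting concrete, the short Python sketch below (our own illustration, not part of the original report; the reward sequence and γ are placeholders) computes R_t both by the explicit sum in equation (2) and by the recursion in equation (3), and checks that the two agree.

# Illustrative sketch: discounted return computed two ways (equations (2) and (3)).
gamma = 0.9
rewards = [0.1, 0.1, 1.0, 0.1, -1.0]   # hypothetical per-step rewards r_t, ..., r_n

def discounted_return_sum(rewards, t, gamma):
    """Equation (2): R_t = sum_{i=t}^{n} gamma^(i-t) * r_i."""
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))

def discounted_return_recursive(rewards, t, gamma):
    """Equation (3): R_t = r_t + gamma * R_{t+1}."""
    if t == len(rewards) - 1:
        return rewards[t]
    return rewards[t] + gamma * discounted_return_recursive(rewards, t + 1, gamma)

for t in range(len(rewards)):
    assert abs(discounted_return_sum(rewards, t, gamma)
               - discounted_return_recursive(rewards, t, gamma)) < 1e-12
print(discounted_return_sum(rewards, 0, gamma))   # R_0 for this reward sequence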

In Q-learning, we define a function Q(s_t, a_t) representing the maximum discounted future reward obtainable when we perform action a_t in state s_t:

Q(s_t, a_t) = max R_{t+1}    (4)

It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent is therefore to always choose the action that maximizes the discounted future reward:

π(s_t) = argmax_a Q(s_t, a)    (5)

Here π represents the policy, the rule by which we choose an action in each state. Given a transition (s_t, a_t, s_{t+1}), equations (3) and (4) yield the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

Q(s_t, a_t) = r_t + γ max_{a'} Q(s_{t+1}, a')    (6)

The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function Q(s_t, a_t), which is stored as a table. Here is the overall algorithm:

Algorithm 1: Q-learning
  Initialize Q[num_states, num_actions] arbitrarily
  Observe initial state s_0
  Repeat
    Select and carry out an action a
    Observe reward r and new state s'
    Q(s, a) := Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
    s := s'
  Until terminated
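The following Python sketch shows one way Algorithm 1 could be implemented for a small discrete environment. It is a minimal illustration, not code from the report: the environment interface (reset()/step()) and the hyperparameter values are hypothetical, and the Q-table is a dictionary keyed by (state, action).

import random
from collections import defaultdict

def tabular_q_learning(env, actions, episodes=1000,
                       alpha=0.1, gamma=0.9, epsilon=0.1):
    """Minimal sketch of Algorithm 1 (tabular Q-learning).

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); this interface is
    an assumption, not part of the original report.
    """
    Q = defaultdict(float)                      # Q[(state, action)], initialized to 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (how "select an action" is usually realized).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)

            # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q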

2.2 Deep Q-learning Network

In Q-learning, the state space is often too big to fit into main memory. A game frame of 80x80 binary pixels has 2^6400 possible states, which is impossible to represent with a Q-table. What is more, when tabular Q-learning encounters a state it has never seen during training, it can only perform a random action, meaning that it does not generalize. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called Deep Q-learning Network (DQN) [9][10]. After training, the multilayer neural network approximates the traditional optimal Q-table as follows:

Q(s_t, a_t; θ) ≈ Q*(s_t, a_t)    (7)

For playing Flappy bird, the screenshot s_t is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3.

Figure 3: In DQN, the CNN's input is the raw game image while its outputs are the Q-values Q(s, a), one output neuron corresponding to one action's Q-value.

To update the CNN's weights, we define the cost function and the gradient update rule as [9][10]:

L = 1/2 [r + γ max_{a'} Q(s_{t+1}, a'; θ^-) − Q(s_t, a_t; θ)]^2    (8)

∇_θ L = [r + γ max_{a'} Q(s_{t+1}, a'; θ^-) − Q(s_t, a_t; θ)] ∇_θ Q(s_t, a_t; θ)    (9)

θ := θ + η ∇_θ L    (10)

Here θ are the DQN parameters that get trained, θ^- are the periodically frozen (non-updated) parameters used for the target Q-value, and η is the learning rate. During training, equation (9) is used to update the weights of the CNN. Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting past experience. The ε-greedy approach achieves this: during training, a random action is selected with probability ε, and otherwise the optimal action a_t = argmax_a Q(s_t, a; θ) is chosen. The value of ε anneals linearly towards zero as the number of updates increases.
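A minimal sketch of this ε-greedy rule with linear annealing is shown below (illustrative Python, not the report's code). Here q_values stands for the CNN's output Q(s_t, ·; θ) for the current state, and the annealing endpoints and horizon are placeholders (Table 2 later gives 0.1 and 0.001 for this project).

import random

def epsilon_by_step(step, eps_start=0.1, eps_end=0.0, anneal_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps updates."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step):
    """Epsilon-greedy: random action with probability epsilon, else argmax_a Q(s_t, a)."""
    eps = epsilon_by_step(step)
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])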

2.3 Input Pre-processing

Working directly with the raw game frames, which are 288x512 pixel RGB images, can be computationally demanding, so we apply a basic pre-processing step aimed at reducing the input dimensionality.

Figure 4: Pre-processing of game frames. First convert the frames to gray-scale images and downsample them to a fixed size, then convert them to binary images, and finally stack up the last 4 frames as a state.

To improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to reduce noise. As Figure 4 shows, the raw game frames are pre-processed by first converting their RGB representation to gray-scale and down-sampling them to an 80x80 image, and then converting the gray image to a binary image. In addition, the last 4 game frames are stacked up as one state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity decreases as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
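A possible implementation of this pre-processing pipeline is sketched below using OpenCV and NumPy. This is our own illustrative code, not the report's: it assumes the frame arrives as an RGB array, uses a binarization threshold of 1 (the report does not state the exact threshold), and omits the game-specific background-removal step.

import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """Convert a 288x512 RGB frame to an 80x80 binary image (values 0/1)."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)         # RGB -> gray-scale
    small = cv2.resize(gray, (80, 80))                          # downsample to 80x80
    _, binary = cv2.threshold(small, 1, 1, cv2.THRESH_BINARY)   # gray -> binary
    return binary.astype(np.float32)

def make_state(frame_rgb, prev_state=None):
    """Stack the last 4 pre-processed frames into an 80x80x4 state."""
    frame = preprocess_frame(frame_rgb)
    if prev_state is None:
        # At the start of an episode, repeat the first frame 4 times.
        return np.stack([frame] * 4, axis=-1)
    # Drop the oldest frame and append the newest one.
    return np.concatenate([prev_state[:, :, 1:], frame[:, :, None]], axis=-1)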

2.4 Experience Replay and Stability

By now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network. However, the approximation of Q-values with non-linear functions is not very stable. In Q-learning, the experiences recorded in a sequential manner are highly correlated. If we use them sequentially to update the DQN parameters, the training process might get stuck in a local minimum or even diverge. To ensure the stability of DQN training, we use a technique called experience replay. During game play, a certain number of experiences (s_t, a_t, r_{t+1}, s_{t+1}) are stored in a replay memory. When training the network, random mini-batches drawn from the replay memory are used instead of the most recent transitions. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. As a result of this randomness in the choice of mini-batch, the data used to update the DQN parameters are likely to be de-correlated. Furthermore, to improve the stability of the convergence of the loss function, we use a clone of the DQN model with parameters θ^-. The parameters θ^- are updated to θ after every C updates of the DQN.
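A replay memory of this kind can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code; the capacity of 50,000 and batch size of 32 match Table 2 later in the report, while the rest of the interface is our assumption.

import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (s_t, a_t, r_{t+1}, s_{t+1}, done) transitions."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Return a random mini-batch, breaking the temporal correlation of experiences."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)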

2.5 DQN Architecture and Algorithm

As shown in Figure 5, we first grab a Flappy bird game frame and, after the pre-processing described in Section 2.3, stack up the last 4 frames as a state. This state is fed as a raw image stack into the CNN, whose outputs are the qualities of the possible actions in the given state. With probability 1 − ε the agent performs an action according to the policy π(s_t) = argmax_a Q(s_t, a); otherwise it performs a random action. The current experience is stored in the replay memory, and a random mini-batch of experiences is sampled from the memory and used to perform a gradient descent step on the CNN's parameters. This interactive process continues until some stopping criterion is satisfied.

Figure 5: DQN's training architecture: the upper data flow shows the training process, while the lower data flow displays the interaction between the agent and the environment.

The complete DQN training process is shown in Algorithm 2. Note that the factor ε is set to zero during testing, while during training we use a decaying ε value to balance exploration and exploitation.

Algorithm 2: Deep Q-learning Network
  Initialize replay memory D to capacity N
  Initialize the CNN with random weights θ
  Initialize θ^- := θ
  for games = 1 : maxGames do
    for snapShots = 1 : T do
      With probability ε select a random action a_t,
        otherwise select a_t := argmax_a Q(s_t, a; θ)
      Execute a_t and observe r_{t+1} and the next state s_{t+1}
      Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
      Sample a mini-batch of transitions from D
      for j = 1 : batchSize do
        if the game terminates at the next state then
          Q_pred := r_j
        else
          Q_pred := r_j + γ max_a Q(s_{j+1}, a; θ^-)
        end if
        Perform a gradient descent step on L = (Q_pred − Q(s_j, a_j; θ))^2 according to equations (9) and (10)
      end for
      Every C steps reset θ^- := θ
    end for
  end for
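The inner loop of Algorithm 2 (target computation, loss, gradient step, and the periodic θ^- sync) could look roughly as follows in PyTorch. This is our own sketch under the report's equations (8)-(10), not the authors' code: q_net and target_net are assumed to be two copies of the CNN from Section 3.1, the mini-batch tensors are assumed to come from the replay memory, and the discount value is a placeholder.

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step of Algorithm 2 on a sampled mini-batch.

    batch: (states, actions, rewards, next_states, dones) tensors, with
    states/next_states shaped [B, 4, 80, 80] and actions/rewards/dones shaped [B].
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s_j, a_j; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Target: r_j if terminal, else r_j + gamma * max_a Q(s_{j+1}, a; theta^-)
    with torch.no_grad():
        next_max = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * next_max * (1.0 - dones.float())

    loss = F.mse_loss(q_values, q_target)      # equation (8), up to the 1/2 factor
    optimizer.zero_grad()
    loss.backward()                            # gradient of equation (9)
    optimizer.step()                           # parameter update of equation (10), here via Adam
    return loss.item()

def sync_target(q_net, target_net):
    """Every C steps: theta^- := theta."""
    target_net.load_state_dict(q_net.state_dict())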

3 Experiments

This section describes our algorithm's parameter settings and the analysis of the experimental results.

3.1 Parameters Settings

Figure 6 illustrates our CNN's layer settings. The neural network has 3 convolutional hidden layers followed by 2 fully connected hidden layers. Table 1 shows the detailed parameters of every layer. We use max pooling only after the first convolutional layer, and we use the ReLU activation function to produce the neural outputs.

Figure 6: The layer setting of the CNN: this CNN has 3 convolutional layers followed by 2 fully connected layers.

For training, we use the Adam optimizer to update the CNN's parameters.

Table 1: The detailed layer settings of the CNN

Layer        conv1      max_pool   conv2      conv3     fc4       fc5
Input        80x80x4    20x20x32   10x10x32   5x5x64    5x5x64    512
Filter size  8x8        2x2        4x4        3x3       -         -
Stride       4          2          2          1         -         -
Num filters  32         -          64         64        512       2
Activation   ReLU       -          ReLU       ReLU      ReLU      Linear
Output       20x20x32   10x10x32   5x5x64     5x5x64    512       2
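In PyTorch, a network with the layer sizes of Table 1 could be defined as below. This is an illustrative sketch rather than the authors' code; in particular, the padding values (2, 1, 1) are our assumption, chosen so that the output sizes match the 20x20, 5x5 and 5x5 feature maps in the table, and the final layer has 2 outputs, one Q-value per action.

import torch
import torch.nn as nn

class FlappyBirdDQN(nn.Module):
    """CNN from Table 1: 3 conv layers (+ one max pool) and 2 fully connected layers."""

    def __init__(self, num_actions=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4, padding=2),   # 80x80x4 -> 20x20x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                  # 20x20x32 -> 10x10x32
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 10x10x32 -> 5x5x64
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),  # 5x5x64 -> 5x5x64
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),                      # 5*5*64 = 1600
            nn.Linear(5 * 5 * 64, 512),        # fc4, ReLU activation
            nn.ReLU(),
            nn.Linear(512, num_actions),       # fc5, linear output: one Q-value per action
        )

    def forward(self, x):                      # x: [batch, 4, 80, 80]
        return self.head(self.features(x))

# Sanity check of the shapes in Table 1:
# q = FlappyBirdDQN()(torch.zeros(1, 4, 80, 80))   # -> torch.Size([1, 2])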

Table 2 lists all the training parameters of DQN. We use a decayed ε ranging from 0.1 down to 0.001 to balance exploration and exploitation. In addition, the mini-batch stochastic gradient descent optimizer is Adam with a batch size of 32. Finally, we also allocate a large replay memory.

Table 2: The training parameters of DQN

Parameter                 Value
Observe steps             100000
Explore steps             3000000
Initial epsilon           0.1
Final epsilon             0.001
Replay memory             50000
Batch size                32
Learning rate             0.000001
FPS                       30
Optimization algorithm    Adam
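The report does not spell out how "Observe steps" and "Explore steps" are used. A common reading, and the one assumed in this purely illustrative sketch, is that the agent first acts for 100000 observe steps to fill the replay memory without training, and then anneals ε linearly from 0.1 to 0.001 over the 3000000 explore steps.

# Hypothetical use of the Table 2 parameters; the observe/explore semantics are assumed.
OBSERVE_STEPS = 100_000      # collect experience only, no gradient updates
EXPLORE_STEPS = 3_000_000    # steps over which epsilon is annealed
INITIAL_EPSILON = 0.1
FINAL_EPSILON = 0.001

def epsilon_at(step):
    """Piecewise schedule: fixed during observation, linear decay during exploration."""
    if step < OBSERVE_STEPS:
        return INITIAL_EPSILON
    progress = min((step - OBSERVE_STEPS) / EXPLORE_STEPS, 1.0)
    return INITIAL_EPSILON + progress * (FINAL_EPSILON - INITIAL_EPSILON)

def should_train(step):
    """Gradient updates start only after the observation phase."""
    return step >= OBSERVE_STEPS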

3.2 Results Analysis

We trained our model for about 4 million epochs. Figure 7 shows the weights and biases of the CNN's first hidden layer. The weights and biases eventually concentrate around 0 with low variance, which directly stabilizes the CNN's output Q-values Q(s_t, a_t) and reduces the probability of random actions. The stability of the CNN's parameters leads to obtaining the optimal policy.

Figure 7: The left (right) figure is the histogram of the weights (biases) of the CNN's first hidden layer.

Figure 8 shows the cost value of DQN during training. The cost function has a slow downward trend and is close to 0 after 3.5 million epochs. This means that DQN has learned the most common state subspace and will perform the optimal action when coming across a known state. In a word, DQN has obtained its best action policy.

Figure 8: DQN's cost function: the plot shows the training progress of DQN. We trained our model for about 4 million epochs.

When playing Flappy bird, if the bird gets through a pipe we give a reward of 1, if it dies we give -1, and otherwise we give 0.1. Figure 9 shows the average reward returned by the environment. The stability in the final training stage means that the agent can automatically choose the best action, and the environment in turn gives the best reward. The agent and the environment have entered a friendly interaction, guaranteeing the maximal total reward.

Figure 9: The average reward returned by the environment. We average the returned reward every 1000 epochs.

As Figure 10 shows, the predicted max Q-value from the CNN converges and stabilizes at a fixed value after about 100000 epochs. This means that the CNN can accurately predict the quality of actions in a specific state, and we can steadily perform the action with the max Q-value. The convergence of the max Q-values indicates that the CNN has explored the state space widely and approximates the environment well.

Figure 10: The average max Q-value obtained from the CNN's output. We average the max Q-value every 1000 epochs.

Figure 11 illustrates DQN's action strategy. If the predicted max Q-value is very high, we are confident that the bird will get through the gap when performing the action with the max Q-value, as at points A and C. If the max Q-value is relatively low and we perform the action, we might hit the pipe, as at point B. In the final stage of training, the max Q-value is dramatically high, meaning that we are confident to get through the gaps when performing the actions with the max Q-value.

Figure 11: The leftmost plot shows the CNN's predicted max Q-value for a 100-frame segment of the game Flappy bird. The three screenshots correspond to the frames labeled A, B, and C respectively.

4 Conclusion

We successfully use DQN to play Flappy bird, and it can outperform human beings. DQN can automatically learn knowledge from the environment using only raw images to play the game, without any prior knowledge. This feature gives DQN the power to play similar games.

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論