高級(jí)人工智能第十章_第1頁(yè)
高級(jí)人工智能第十章_第2頁(yè)
高級(jí)人工智能第十章_第3頁(yè)
高級(jí)人工智能第十章_第4頁(yè)
高級(jí)人工智能第十章_第5頁(yè)
已閱讀5頁(yè),還剩74頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、高級(jí)人工智能 第十章史忠植中國(guó)科學(xué)院計(jì)算技術(shù)研究所強(qiáng)化學(xué)習(xí)2022/9/291強(qiáng)化學(xué)習(xí) 史忠植內(nèi)容提要引言強(qiáng)化學(xué)習(xí)模型動(dòng)態(tài)規(guī)劃蒙特卡羅方法時(shí)序差分學(xué)習(xí)Q學(xué)習(xí)強(qiáng)化學(xué)習(xí)中的函數(shù)估計(jì)應(yīng)用2022/9/292強(qiáng)化學(xué)習(xí) 史忠植引言 人類通常從與外界環(huán)境的交互中學(xué)習(xí)。所謂強(qiáng)化(reinforcement)學(xué)習(xí)是指從環(huán)境狀態(tài)到行為映射的學(xué)習(xí),以使系統(tǒng)行為從環(huán)境中獲得的累積獎(jiǎng)勵(lì)值最大。在強(qiáng)化學(xué)習(xí)中,我們?cè)O(shè)計(jì)算法來(lái)把外界環(huán)境轉(zhuǎn)化為最大化獎(jiǎng)勵(lì)量的方式的動(dòng)作。我們并沒(méi)有直接告訴主體要做什么或者要采取哪個(gè)動(dòng)作,而是主體通過(guò)看哪個(gè)動(dòng)作得到了最多的獎(jiǎng)勵(lì)來(lái)自己發(fā)現(xiàn)。主體的動(dòng)作的影響不只是立即得到的獎(jiǎng)勵(lì),而且還影響接下來(lái)的動(dòng)

2、作和最終的獎(jiǎng)勵(lì)。試錯(cuò)搜索(trial-and-error search)和延期強(qiáng)化(delayed reinforcement)這兩個(gè)特性是強(qiáng)化學(xué)習(xí)中兩個(gè)最重要的特性。 2022/9/293強(qiáng)化學(xué)習(xí) 史忠植引言 強(qiáng)化學(xué)習(xí)技術(shù)是從控制理論、統(tǒng)計(jì)學(xué)、心理學(xué)等相關(guān)學(xué)科發(fā)展而來(lái),最早可以追溯到巴甫洛夫的條件反射實(shí)驗(yàn)。但直到上世紀(jì)八十年代末、九十年代初強(qiáng)化學(xué)習(xí)技術(shù)才在人工智能、機(jī)器學(xué)習(xí)和自動(dòng)控制等領(lǐng)域中得到廣泛研究和應(yīng)用,并被認(rèn)為是設(shè)計(jì)智能系統(tǒng)的核心技術(shù)之一。特別是隨著強(qiáng)化學(xué)習(xí)的數(shù)學(xué)基礎(chǔ)研究取得突破性進(jìn)展后,對(duì)強(qiáng)化學(xué)習(xí)的研究和應(yīng)用日益開(kāi)展起來(lái),成為目前機(jī)器學(xué)習(xí)領(lǐng)域的研究熱點(diǎn)之一。2022/9/294強(qiáng)化

3、學(xué)習(xí) 史忠植引言強(qiáng)化思想最先來(lái)源于心理學(xué)的研究。1911年Thorndike提出了效果律(Law of Effect):一定情景下讓動(dòng)物感到舒服的行為,就會(huì)與此情景增強(qiáng)聯(lián)系(強(qiáng)化),當(dāng)此情景再現(xiàn)時(shí),動(dòng)物的這種行為也更易再現(xiàn);相反,讓動(dòng)物感覺(jué)不舒服的行為,會(huì)減弱與情景的聯(lián)系,此情景再現(xiàn)時(shí),此行為將很難再現(xiàn)。換個(gè)說(shuō)法,哪種行為會(huì)“記住”,會(huì)與刺激建立聯(lián)系,取決于行為產(chǎn)生的效果。動(dòng)物的試錯(cuò)學(xué)習(xí),包含兩個(gè)含義:選擇(selectional)和聯(lián)系(associative),對(duì)應(yīng)計(jì)算上的搜索和記憶。所以,1954年,Minsky在他的博士論文中實(shí)現(xiàn)了計(jì)算上的試錯(cuò)學(xué)習(xí)。同年,F(xiàn)arley和Clark也在計(jì)

4、算上對(duì)它進(jìn)行了研究。強(qiáng)化學(xué)習(xí)一詞最早出現(xiàn)于科技文獻(xiàn)是1961年Minsky 的論文“Steps Toward Artificial Intelligence”,此后開(kāi)始廣泛使用。1969年, Minsky因在人工智能方面的貢獻(xiàn)而獲得計(jì)算機(jī)圖靈獎(jiǎng)。2022/9/295強(qiáng)化學(xué)習(xí) 史忠植引言1953到1957年,Bellman提出了求解最優(yōu)控制問(wèn)題的一個(gè)有效方法:動(dòng)態(tài)規(guī)劃(dynamic programming) Bellman于 1957年還提出了最優(yōu)控制問(wèn)題的隨機(jī)離散版本,就是著名的馬爾可夫決策過(guò)程(MDP, Markov decision processe),1960年Howard提出馬爾可夫

5、決策過(guò)程的策略迭代方法,這些都成為現(xiàn)代強(qiáng)化學(xué)習(xí)的理論基礎(chǔ)。1972年,Klopf把試錯(cuò)學(xué)習(xí)和時(shí)序差分結(jié)合在一起。1978年開(kāi)始,Sutton、Barto、 Moore,包括Klopf等對(duì)這兩者結(jié)合開(kāi)始進(jìn)行深入研究。 1989年Watkins提出了Q-學(xué)習(xí)Watkins 1989,也把強(qiáng)化學(xué)習(xí)的三條主線扭在了一起。1992年,Tesauro用強(qiáng)化學(xué)習(xí)成功了應(yīng)用到西洋雙陸棋(backgammon)中,稱為TD-Gammon 。2022/9/296強(qiáng)化學(xué)習(xí) 史忠植內(nèi)容提要引言強(qiáng)化學(xué)習(xí)模型動(dòng)態(tài)規(guī)劃蒙特卡羅方法時(shí)序差分學(xué)習(xí)Q學(xué)習(xí)強(qiáng)化學(xué)習(xí)中的函數(shù)估計(jì)應(yīng)用2022/9/297強(qiáng)化學(xué)習(xí) 史忠植主體強(qiáng)化學(xué)習(xí)模型

6、i: inputr: reward s: statea: action狀態(tài) sisi+1ri+1獎(jiǎng)勵(lì) ri環(huán)境動(dòng)作 aia0a1a2s0s1s2s32022/9/298強(qiáng)化學(xué)習(xí) 史忠植描述一個(gè)環(huán)境(問(wèn)題)Accessible vs. inaccessibleDeterministic vs. non-deterministicEpisodic vs. non-episodicStatic vs. dynamicDiscrete vs. continuousThe most complex general class of environments are inaccessible, non-d

7、eterministic, non-episodic, dynamic, and continuous.2022/9/299強(qiáng)化學(xué)習(xí) 史忠植強(qiáng)化學(xué)習(xí)問(wèn)題Agent-environment interactionStates, Actions, RewardsTo define a finite MDPstate and action sets : S and Aone-step “dynamics” defined by transition probabilities (Markov Property):reward probabilities:Environmentactionstater

8、ewardRLAgent2022/9/2910強(qiáng)化學(xué)習(xí) 史忠植與監(jiān)督學(xué)習(xí)對(duì)比Reinforcement Learning Learn from interactionlearn from its own experience, and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.RLSyst

9、emInputsOutputs (“actions”)Training Info = evaluations (“rewards” / “penalties”)Supervised Learning Learn from examples provided by a knowledgable external supervisor.2022/9/2911強(qiáng)化學(xué)習(xí) 史忠植強(qiáng)化學(xué)習(xí)要素Policy: stochastic rule for selecting actionsReturn/Reward: the function of future rewards agent tries to ma

10、ximizeValue: what is good because it predicts rewardModel: what follows whatPolicyRewardValueModel ofenvironmentIs unknownIs my goalIs I can getIs my method2022/9/2912強(qiáng)化學(xué)習(xí) 史忠植在策略下的Bellman公式The basic idea: So: Or, without the expectation operator: is the discount rate2022/9/2913強(qiáng)化學(xué)習(xí) 史忠植Bellman最優(yōu)策略公式2

11、022/9/2914強(qiáng)化學(xué)習(xí) 史忠植MARKOV DECISION PROCESS k-armed bandit gives immediate reward DELAYED REWARD?Characteristics of MDP:a set of states : Sa set of actions : Aa reward function :R : S x A RA state transition function:T: S x A ( S) T (s,a,s): probability of transition from s to s using action a2022/9/2

12、915強(qiáng)化學(xué)習(xí) 史忠植MDP EXAMPLE:TransitionfunctionStates and rewardsBellman Equation:(Greedy policy selection)2022/9/2916強(qiáng)化學(xué)習(xí) 史忠植MDP Graphical Representation, : T (s, action, s )Similarity to Hidden Markov Models (HMMs)2022/9/2917強(qiáng)化學(xué)習(xí) 史忠植動(dòng)態(tài)規(guī)劃Dynamic Programming - ProblemA discrete-time dynamic systemStates 1

13、, , n + termination state 0Control U(i)Transition Probability pij(u)Accumulative cost structurePolicies2022/9/2918強(qiáng)化學(xué)習(xí) 史忠植Finite Horizon ProblemInfinite Horizon ProblemValue Iteration動(dòng)態(tài)規(guī)劃Dynamic Programming Iterative Solution 2022/9/2919強(qiáng)化學(xué)習(xí) 史忠植動(dòng)態(tài)規(guī)劃中的策略迭代/值迭代 policy evaluationpolicy improvement“gree

14、dification”Policy IterationValue Iteration2022/9/2920強(qiáng)化學(xué)習(xí) 史忠植動(dòng)態(tài)規(guī)劃方法TTTTTTTTTTTTT2022/9/2921強(qiáng)化學(xué)習(xí) 史忠植自適應(yīng)動(dòng)態(tài)規(guī)劃(ADP)Idea: use the constraints (state transition probabilities) between states to speed learning.Solve = value determination.No maximization over actions because agent is passive unlike in value

15、 iteration.using DPLarge state spacee.g. Backgammon: 1050 equations in 1050 variables2022/9/2922強(qiáng)化學(xué)習(xí) 史忠植Value Iteration AlgorithmAN ALTERNATIVE ITERATION: (Singh,1993)(Important for model free learning)Stop Iteration when V(s) differs less than .Policy difference ratio = 2 / (1- ) ( Williams & Baird

16、 1993b)2022/9/2923強(qiáng)化學(xué)習(xí) 史忠植Policy Iteration Algorithm Policies converge faster than values.Why faster convergence? 2022/9/2924強(qiáng)化學(xué)習(xí) 史忠植Reinforcement Learning Deterministic transitionsStochastic transitionsis the probability to reaching state j when taking action a in state istart3211234+1-1A simple en

17、vironment that presents the agent with a sequential decision problem:Move cost = 0.04(Temporal) credit assignment problem sparse reinforcement problemOffline alg: action sequences determined ex anteOnline alg: action sequences is conditional on observations along the way; Important in stochastic env

18、ironment (e.g. jet flying)2022/9/2925強(qiáng)化學(xué)習(xí) 史忠植Reinforcement Learning M = 0.8 in direction you want to go 0.2 in perpendicular 0.1 left0.1 rightPolicy: mapping from states to actions3211234+1-10.7053211234+1-1 0.8120.762 0.868 0.912 0.660 0.655 0.611 0.388An optimal policy for the stochastic environme

19、nt:utilities of states:EnvironmentObservable (accessible): percept identifies the statePartially observableMarkov property: Transition probabilities depend on state only, not on the path to the state.Markov decision problem (MDP).Partially observable MDP (POMDP): percepts does not have enough info t

20、o identify transition probabilities.2022/9/2926強(qiáng)化學(xué)習(xí) 史忠植Model Free MethodsModels of the environment:T: S x A ( S) and R : S x A RDo we know them? Do we have to know them?Monte Carlo MethodsAdaptive Heuristic CriticQ Learning2022/9/2927強(qiáng)化學(xué)習(xí) 史忠植Monte Carlo策略評(píng)價(jià)Goal: learn Vp(s) under P and R are unknown

21、 in advanceGiven: some number of episodes under p which contain sIdea: Average returns observed after visits to sEvery-Visit MC: average returns for every time s is visited in an episodeFirst-visit MC: average returns only for first time s is visited in an episodeBoth converge asymptotically12345202

22、2/9/2928強(qiáng)化學(xué)習(xí) 史忠植蒙特卡羅方法 Monte Carlo Methods Idea: Hold statistics about rewards for each state Take the average This is the V(s)Based only on experience Assumes episodic tasks (Experience is divided into episodes and all episodes will terminate regardless of the actions selected.) Incremental in epis

23、ode-by-episode sense not step-by-step sense. 2022/9/2929強(qiáng)化學(xué)習(xí) 史忠植Problem: Unvisited pairs(problem of maintaining exploration)For every make sure that:P( selected as a start state and action) 0 (Assumption of exploring starts ) 蒙特卡羅方法 2022/9/2930強(qiáng)化學(xué)習(xí) 史忠植Monte Carlo方法TTTTTTTTTTTTTTTTTTTT2022/9/2931強(qiáng)化學(xué)習(xí)

24、 史忠植蒙特卡羅控制How to select Policies:(Similar to policy evaluation) MC policy iteration: Policy evaluation using MC methods followed by policy improvement Policy improvement step: greedify with respect to value (or action-value) function2022/9/2932強(qiáng)化學(xué)習(xí) 史忠植時(shí)序差分學(xué)習(xí) Temporal-Differencetarget: the actual ret

25、urn after time ttarget: an estimate of the return2022/9/2933強(qiáng)化學(xué)習(xí) 史忠植時(shí)序差分學(xué)習(xí) (TD)Idea: Do ADP backups on a per move basis, not for the whole state space.Theorem: Average value of U(i) converges to the correct value.Theorem: If is appropriately decreased as a function of times a state is visited (=Ni),

26、 then U(i) itself converges to the correct value2022/9/2934強(qiáng)化學(xué)習(xí) 史忠植時(shí)序差分學(xué)習(xí) TDTTTTTTTTTTTTTTTTTTTT2022/9/2935強(qiáng)化學(xué)習(xí) 史忠植TD(l) A Forward ViewTD(l) is a method for averaging all n-step backups weight by ln-1 (time since visitation)l-return: Backup using l-return:2022/9/2936強(qiáng)化學(xué)習(xí) 史忠植時(shí)序差分學(xué)習(xí)算法 TD()Idea: update

27、 from the whole epoch, not just on state transition.Special cases:=1: Least-mean-square (LMS), Mont Carlo=0: TDIntermediate choice of (between 0 and 1) is best. Interplay with 2022/9/2937強(qiáng)化學(xué)習(xí) 史忠植時(shí)序差分學(xué)習(xí)算法2022/9/2938強(qiáng)化學(xué)習(xí) 史忠植時(shí)序差分學(xué)習(xí)算法收斂性TD()Theorem: Converges w.p. 1 under certain boundaries conditions.D

28、ecrease i(t) s.t. In practice, often a fixed is used for all i and t.2022/9/2939強(qiáng)化學(xué)習(xí) 史忠植時(shí)序差分學(xué)習(xí) TD2022/9/2940強(qiáng)化學(xué)習(xí) 史忠植Q-LearningWatkins,1989Estimate the Q-function using some approximator (for example, linear regression or neural networks or decision trees etc.).Derive the estimated policy as an argum

29、ent of the maximum of the estimated Q-function.Allow different parameter vectors at different time points.Let us illustrate the algorithm with linear regression as the approximator, and of course, squared error as the appropriate loss function.2022/9/2941強(qiáng)化學(xué)習(xí) 史忠植Q-learningQ (a,i)Direct approach (ADP

30、) would require learning a model .Q-learning does not:Do this update after each state transition:2022/9/2942強(qiáng)化學(xué)習(xí) 史忠植ExplorationTradeoff between exploitation (control) and exploration (identification) Extremes: greedy vs. random acting(n-armed bandit models)Q-learning converges to optimal Q-values if

31、* Every state is visited infinitely often (due to exploration),* The action selection becomes greedy as time approaches infinity, and* The learning rate a is decreased fast enough but not too fast (as we discussed in TD learning)2022/9/2943強(qiáng)化學(xué)習(xí) 史忠植Common exploration methodsIn value iteration in an A

32、DP agent: Optimistic estimate of utility U+(i)-greedy methodNongreedy actions Greedy actionBoltzmann explorationExploration funcR+ if nNu o.w.2022/9/2944強(qiáng)化學(xué)習(xí) 史忠植Q-Learning AlgorithmSetForThe estimated policy satisfies2022/9/2945強(qiáng)化學(xué)習(xí) 史忠植What is the intuition?Bellman equation gives If and the training

33、 set were infinite, then Q-learning minimizes which is equivalent to minimizing2022/9/2946強(qiáng)化學(xué)習(xí) 史忠植A-Learning Murphy, 2003 and Robins, 2004Estimate the A-function (advantages) using some approximator, as in Q-learning.Derive the estimated policy as an argument of the maximum of the estimated A-functi

34、on.Allow different parameter vectors at different time points.Let us illustrate the algorithm with linear regression as the approximator, and of course, squared error as the appropriate loss function.2022/9/2947強(qiáng)化學(xué)習(xí) 史忠植A-Learning Algorithm (Inefficient Version)ForThe estimated policy satisfies2022/9

35、/2948強(qiáng)化學(xué)習(xí) 史忠植Differences between Q and A-learningQ-learningAt time t we model the main effects of the history, (St,At-1) and the action At and their interactionOur Yt-1 is affected by how we modeled the main effect of the history in time t, (St,At-1) A-learningAt time t we only model the effects of

36、At and its interaction with (St,At-1) Our Yt-1 does not depend on a model of the main effect of the history in time t, (St,At-1) 2022/9/2949強(qiáng)化學(xué)習(xí) 史忠植Q-Learning Vs. A-LearningRelative merits and demerits are not completely known till now.Q-learning has low variance but high bias.A-learning has high va

37、riance but low bias. Comparison of Q-learning with A-learning involves a bias-variance trade-off.2022/9/2950強(qiáng)化學(xué)習(xí) 史忠植POMDP部分感知馬氏決策過(guò)程 Rather than observing the state we observe some function of the state.Ob Observable functiona random variable for each states.Problem : different states may look simila

38、rThe optimal strategy might need to consider the history.2022/9/2951強(qiáng)化學(xué)習(xí) 史忠植Framework of POMDPPOMDP由六元組定義。其中定義了環(huán)境潛在的馬爾可夫決策模型上,是觀察的集合,即系統(tǒng)可以感知的世界狀態(tài)集合,觀察函數(shù):SAPD()。系統(tǒng)在采取動(dòng)作a轉(zhuǎn)移到狀態(tài)s時(shí),觀察函數(shù)確定其在可能觀察上的概率分布。記為(s, a, o)。1 可以是S的子集,也可以與S無(wú)關(guān)2022/9/2952強(qiáng)化學(xué)習(xí) 史忠植POMDPsWhat if state information (from sensors) is noisy?M

39、ostly the case!MDP techniques are suboptimal!Two halls are not the same.2022/9/2953強(qiáng)化學(xué)習(xí) 史忠植POMDPs A Solution StrategySE: Belief State Estimator (Can be based on HMM): MDP Techniques2022/9/2954強(qiáng)化學(xué)習(xí) 史忠植POMDP_信度狀態(tài)方法Idea: Given a history of actions and observable value, we compute a posterior distributi

40、on for the state we are in (belief state)The belief-state MDPStates: distribution over S (states of the POMDP)Actions: as in POMDPTransition: the posterior distribution (given the observation)Open Problem : How to deal with the continuous distribution? 2022/9/2955強(qiáng)化學(xué)習(xí) 史忠植The Learning Process of Beli

41、ef MDP2022/9/2956強(qiáng)化學(xué)習(xí) 史忠植Major Methods to Solve POMDP 算法名稱基本思想學(xué)習(xí)值函數(shù)Memoryless policies直接采用標(biāo)準(zhǔn)的強(qiáng)化學(xué)習(xí)算法Simple memory based approaches使用k個(gè)歷史觀察表示當(dāng)前狀態(tài)UDM(Utile Distinction Memory)分解狀態(tài),構(gòu)建有限狀態(tài)機(jī)模型NSM(Nearest Sequence Memory)存儲(chǔ)狀態(tài)歷史,進(jìn)行距離度量USM(Utile Suffix Memory)綜合UDM和NSM兩種方法Recurrent-Q使用循環(huán)神經(jīng)網(wǎng)絡(luò)進(jìn)行狀態(tài)預(yù)測(cè)策略搜索Evoluti

42、onary algorithms使用遺傳算法直接進(jìn)行策略搜索Gradient ascent method使用梯度下降(上升)法搜索2022/9/2957強(qiáng)化學(xué)習(xí) 史忠植強(qiáng)化學(xué)習(xí)中的函數(shù)估計(jì)RLFASubset of statesValue estimate as targetsV (s)Generalization of the value function to the entire state spaceis the TD operator.is the function approximation operator.2022/9/2958強(qiáng)化學(xué)習(xí) 史忠植并行兩個(gè)迭代過(guò)程值函數(shù)迭代過(guò)程值函

43、數(shù)逼近過(guò)程How to construct the M function? Using state cluster, interpolation, decision tree or neural network?2022/9/2959強(qiáng)化學(xué)習(xí) 史忠植Function Approximator: V( s) = f( s, w)Update: Gradient-descent Sarsa: w w + a rt+1 + g Q(st+1,at+1)- Q(st,at) w f(st,at,w)weight vectorStandard gradienttarget valueestimated

44、valueOpen Problem : How to design the non-liner FA system which can converge with the incremental instances? 2022/9/2960強(qiáng)化學(xué)習(xí) 史忠植Semi-MDPDiscrete timeHomogeneous discountContinuous timeDiscrete eventsInterval-dependent discountDiscrete timeDiscrete eventsInterval-dependent discountA discrete-time SMD

45、P overlaid on an MDPCan be analyzed at either level. One approach to Temporal Hierarchical RL2022/9/2961強(qiáng)化學(xué)習(xí) 史忠植The equations2022/9/2962強(qiáng)化學(xué)習(xí) 史忠植Multi-agent MDPDistributed RLMarkov GameBest ResponseEnvironmentactionstaterewardRLAgentRLAgent2022/9/2963強(qiáng)化學(xué)習(xí) 史忠植三種觀點(diǎn)問(wèn)題空間主要方法算法準(zhǔn)則合作多agent強(qiáng)化學(xué)習(xí)分布、同構(gòu)、合作環(huán)境交換狀態(tài)

46、提高學(xué)習(xí)收斂速度交換經(jīng)驗(yàn)交換策略交換建議基于平衡解多agent強(qiáng)化學(xué)習(xí)同構(gòu)或異構(gòu)、合作或競(jìng)爭(zhēng)環(huán)境極小極大-Q理性和收斂性NASH-QCE-QWoLF最佳響應(yīng)多agent強(qiáng)化學(xué)習(xí)異構(gòu)、競(jìng)爭(zhēng)環(huán)境PHC收斂性和不遺憾性IGAGIGAGIGA-WoLF2022/9/2964強(qiáng)化學(xué)習(xí) 史忠植馬爾可夫?qū)Σ咴趎個(gè)agent的系統(tǒng)中,定義離散的狀態(tài)集S(即對(duì)策集合G),agent動(dòng)作集Ai的集合A, 聯(lián)合獎(jiǎng)賞函數(shù)Ri:SA1An和狀態(tài)轉(zhuǎn)移函數(shù)P:SA1AnPD(S)。 2022/9/2965強(qiáng)化學(xué)習(xí) 史忠植基于平衡解方法的強(qiáng)化學(xué)習(xí)Open Problem : Nash equilibrium or other

47、 equilibrium is enough? The optimal policy in single game is Nash equilibrium.2022/9/2966強(qiáng)化學(xué)習(xí) 史忠植Applications of RLCheckers Samuel 59TD-Gammon Tesauro 92Worlds best downpeak elevator dispatcher Crites at al 95Inventory management Bertsekas et al 9510-15% better than industry standardDynamic channel

48、assignment Singh & Bertsekas, Nie&Haykin 95Outperforms best heuristics in the literatureCart-pole Michie&Chambers 68- with bang-bang controlRobotic manipulation Grupen et al. 93-Path planningRobot docking Lin 93ParkingFootball Stone98TetrisMultiagent RL Tan 93, Sandholm&Crites 95, Sen 94-, Carmel&Ma

49、rkovitch 95-, lots of work sinceCombinatorial optimization: maintenance & repairControl of reasoning Zhang & Dietterich IJCAI-952022/9/2967強(qiáng)化學(xué)習(xí) 史忠植仿真機(jī)器人足球 應(yīng)用Q學(xué)習(xí)算法進(jìn)行仿真機(jī)器人足球2 對(duì)1 訓(xùn)練,訓(xùn)練的目的是試圖使主體學(xué)習(xí)獲得到一種戰(zhàn)略上的意識(shí),能夠在進(jìn)攻中進(jìn)行配合宋志偉, 2003 2022/9/2968強(qiáng)化學(xué)習(xí) 史忠植仿真機(jī)器人足球 前鋒A控球,并且在可射門的區(qū)域內(nèi),但是A已經(jīng)沒(méi)有射門角度了;隊(duì)友B也處于射門區(qū)域,并且B具有良好的

50、射門角度。A傳球給B,射門由B來(lái)完成,那么這次進(jìn)攻配合就會(huì)很成功。通過(guò)Q學(xué)習(xí)的方法來(lái)進(jìn)行2 對(duì)1的射門訓(xùn)練,讓A掌握在這種狀態(tài)情況下傳球給B的動(dòng)作是最優(yōu)的策略;主體通過(guò)大量的學(xué)習(xí)訓(xùn)練(大數(shù)量級(jí)的狀態(tài)量和重復(fù)相同狀態(tài))來(lái)獲得策略,因此更具有適應(yīng)性 。2022/9/2969強(qiáng)化學(xué)習(xí) 史忠植仿真機(jī)器人足球 狀態(tài)描述,將進(jìn)攻禁區(qū)劃分為個(gè)小區(qū)域,每個(gè)小區(qū)域是邊長(zhǎng)為2m的正方形,一個(gè)二維數(shù)組()便可描述這個(gè)區(qū)域。使用三個(gè)Agent的位置來(lái)描述2 對(duì) 1進(jìn)攻時(shí)的環(huán)境狀態(tài),利用圖10.11所示的劃分來(lái)泛化狀態(tài)。可認(rèn)為主體位于同一戰(zhàn)略區(qū)域?yàn)橄嗨茽顟B(tài),這樣對(duì)狀態(tài)的描述雖然不精確,但設(shè)計(jì)所需的是一種戰(zhàn)略層次的描述,

51、可認(rèn)為Agent在戰(zhàn)略區(qū)域內(nèi)是積極跑動(dòng)的,這種方法滿足了需求。如此,便描述了一個(gè)特定的狀態(tài);其中,是進(jìn)攻隊(duì)員A的區(qū)域編號(hào),是進(jìn)攻隊(duì)員B的區(qū)域編號(hào),是守門員的區(qū)域編號(hào)。區(qū)域編號(hào)計(jì)算公式為:。相應(yīng)的,所保存的狀態(tài)值為三個(gè)區(qū)域編號(hào)組成的對(duì)。前鋒A控球,并且在可射門的區(qū)域內(nèi),但是A已經(jīng)沒(méi)有射門角度了;隊(duì)友B也處于射門區(qū)域,并且B具有良好的射門角度。A傳球給B,射門由B來(lái)完成,那么這次進(jìn)攻配合就會(huì)很成功。通過(guò)Q學(xué)習(xí)的方法來(lái)進(jìn)行2 對(duì)1的射門訓(xùn)練,讓A掌握在這種狀態(tài)情況下傳球給B的動(dòng)作是最優(yōu)的策略;主體通過(guò)大量的學(xué)習(xí)訓(xùn)練(大數(shù)量級(jí)的狀態(tài)量和重復(fù)相同狀態(tài))來(lái)獲得策略,因此更具有適應(yīng)性 。19圖10.11 進(jìn)攻禁區(qū)內(nèi)的位置劃分072022/9/2970強(qiáng)化學(xué)習(xí) 史忠植仿真機(jī)器人足球可選動(dòng)作集確定為的策略通過(guò)基于概率的射門訓(xùn)練的學(xué)習(xí)來(lái)得到。的策略是,始終向受到威脅小,并且射門成功率高的區(qū)域帶球。為了實(shí)現(xiàn)這一策略目標(biāo),可劃分進(jìn)攻區(qū)域?yàn)槎鄠€(gè)戰(zhàn)略區(qū),在每個(gè)戰(zhàn)略區(qū)進(jìn)行射門評(píng)價(jià),記錄每個(gè)區(qū)域的射門成功率。Pass策略很簡(jiǎn)單,只需在兩個(gè)Agent間進(jìn)行傳球,即

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論