Advanced Artificial Intelligence
Chapter 10: Reinforcement Learning

Outline
- Introduction
- The reinforcement learning model
- Dynamic programming
- Monte Carlo methods
- Temporal-difference learning
- Q-learning
- Function approximation in reinforcement learning
- Applications

Introduction
- Humans usually learn from interaction with the external environment. Reinforcement learning is the learning of a mapping from environment states to actions, so that the cumulative reward the system receives from the environment is maximized. In reinforcement learning we design algorithms that turn the signals coming from the environment into actions that maximize the amount of reward obtained. The agent is not told directly what to do or which action to take; it discovers by itself which actions yield the most reward. An action affects not only the immediate reward but also the subsequent actions and the eventual reward. Trial-and-error search and delayed reinforcement are the two most important characteristics of reinforcement learning.
- Reinforcement learning techniques grew out of control theory, statistics, psychology, and related disciplines, and can be traced back to Pavlov's conditioning experiments. Only in the late 1980s and early 1990s did reinforcement learning become widely studied and applied in artificial intelligence, machine learning, and automatic control, where it is now regarded as one of the core techniques for designing intelligent systems. After breakthroughs in its mathematical foundations, research on and applications of reinforcement learning expanded rapidly, making it one of the current hot topics in machine learning.
- The idea of reinforcement originated in psychology. In 1911 Thorndike proposed the Law of Effect: a behavior that makes an animal feel comfortable in a given situation becomes more strongly associated with that situation (is reinforced), so that when the situation recurs the behavior is more likely to recur; conversely, a behavior that makes the animal uncomfortable weakens its association with the situation and is less likely to recur when the situation reappears. In other words, which behavior is "remembered" and associated with a stimulus depends on the effect the behavior produces. Animal trial-and-error learning thus has two aspects, selection and association, corresponding computationally to search and memory. In 1954 Minsky implemented computational trial-and-error learning in his doctoral thesis; in the same year Farley and Clark also studied it computationally. The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence", and came into wide use thereafter. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.
- From 1953 to 1957 Bellman developed an effective method for solving optimal control problems: dynamic programming. In 1957 Bellman also formulated the stochastic discrete-time version of the optimal control problem, the well-known Markov decision process (MDP), and in 1960 Howard proposed the policy iteration method for MDPs; together these form the theoretical foundation of modern reinforcement learning. In 1972 Klopf combined trial-and-error learning with temporal differences, and from 1978 onward Sutton, Barto, Moore, Klopf, and others studied this combination in depth. In 1989 Watkins proposed Q-learning [Watkins 1989], tying the three main threads of reinforcement learning together. In 1992 Tesauro successfully applied reinforcement learning to backgammon, producing TD-Gammon.

The reinforcement learning model
- (Figure: the agent-environment interaction loop. In state s_i the agent takes action a_i, receives reward r_i from the environment, and moves to state s_{i+1}, producing the trajectory s_0, a_0, s_1, a_1, s_2, a_2, s_3, ...; i: input, r: reward, s: state, a: action.)

Characterizing an environment (problem)
- Accessible vs. inaccessible
- Deterministic vs. non-deterministic
- Episodic vs. non-episodic
- Static vs. dynamic
- Discrete vs. continuous
- The most complex general class of environments is inaccessible, non-deterministic, non-episodic, dynamic, and continuous.

The reinforcement learning problem
- Agent-environment interaction: states, actions, rewards.
- To define a finite MDP: state and action sets S and A; one-step dynamics defined by the transition probabilities P(s' | s, a) (Markov property) and the reward probabilities/expectations R(s, a, s').
- (Figure: the RL agent sends an action to the environment and receives the next state and a reward in return.)

Comparison with supervised learning
- Reinforcement learning: learn from interaction, that is, from the system's own experience; the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. (RL system: inputs to outputs ("actions"); training info = evaluations ("rewards" / "penalties").)
- Supervised learning: learn from examples provided by a knowledgeable external supervisor.

Elements of reinforcement learning
- Policy: a stochastic rule for selecting actions.
- Return/Reward: the function of future rewards that the agent tries to maximize.
- Value: what is good because it predicts reward.
- Model: what follows what.
- Informally: the reward is my goal, the value is what I can get, the policy is my method, and the model of the environment is what is (initially) unknown.

The Bellman equation under a policy π
- The basic idea: V^π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s ].
- So: V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ].
- Or, without the expectation operator: V^π(s) = Σ_a π(s, a) Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^π(s') ], where γ is the discount rate.

The Bellman optimality equation
- V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ].

Markov decision process
- A k-armed bandit gives immediate reward; what about DELAYED REWARD?
- Characteristics of an MDP:
  - a set of states S
  - a set of actions A
  - a reward function R: S × A → ℝ
  - a state transition function T: S × A → Π(S), where T(s, a, s') is the probability of a transition from s to s' when taking action a.

MDP example
- (The slide shows a small example with its transition function, its states and rewards, and the Bellman equation solved with greedy policy selection.)

MDP graphical representation
- Transitions are drawn as arcs labelled with T(s, action, s'); note the similarity to hidden Markov models (HMMs).

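The Bellman equation above can be turned directly into an iterative policy-evaluation procedure. The following is a minimal sketch in Python; the dictionary-based encodings of T, R, and the policy are illustrative assumptions, not structures taken from the slides.

```python
def policy_evaluation(states, actions, T, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation via repeated Bellman backups for a fixed policy.

    T[s][a] is a dict {s2: probability}, R[s][a][s2] is the expected reward for
    that transition, and policy[s] is a dict {a: probability of choosing a in s}.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # V(s) <- sum_a pi(s,a) sum_s' T(s,a,s') [ R(s,a,s') + gamma V(s') ]
            v_new = sum(policy[s].get(a, 0.0) *
                        sum(p * (R[s][a][s2] + gamma * V[s2])
                            for s2, p in T[s][a].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```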
Dynamic programming: the problem
- A discrete-time dynamic system with states 1, ..., n plus a termination state 0.
- Controls U(i), transition probabilities p_ij(u), an accumulative cost structure, and policies.
- Finite-horizon problem, infinite-horizon problem, value iteration.

Dynamic programming, iterative solution: policy iteration / value iteration
- Policy evaluation and policy improvement ("greedification").
- Policy iteration alternates the two steps; value iteration interleaves them in a single backup.

Dynamic programming methods
- (Backup diagram: full one-step backups that branch over all actions and all successor states.)

Adaptive dynamic programming (ADP)
- Idea: use the constraints (state transition probabilities) between states to speed up learning.
- Solving the resulting system of equations is value determination; there is no maximization over actions because the agent is passive, unlike in value iteration.
- Using DP directly is hard for large state spaces, e.g. backgammon: 10^50 equations in 10^50 variables.

Value iteration algorithm
- An alternative iteration (Singh, 1993), important for model-free learning.
- Stop the iteration when V(s) changes by less than ε; the policy difference ratio is then bounded by 2εγ/(1−γ) (Williams & Baird 1993b).

Policy iteration algorithm
- Policies converge faster than values. Why the faster convergence?

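As a concrete illustration of the value-iteration backup and the stopping criterion described above, here is a minimal tabular sketch in Python; the dict-based T and R structures follow the policy-evaluation sketch earlier and are assumptions for illustration.

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Value iteration: repeatedly apply the Bellman optimality backup.

    Iteration stops when the largest change in V(s) falls below eps.
    Returns the value function and the greedy policy extracted from it.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * (R[s][a][s2] + gamma * V[s2])
                           for s2, p in T[s][a].items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    greedy = {s: max(actions,
                     key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                       for s2, p in T[s][a].items()))
              for s in states}
    return V, greedy
```

Policy iteration would instead alternate a full policy-evaluation step (as sketched earlier) with this greedification step, which is why its policies typically stabilize in fewer sweeps.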
Reinforcement learning in a grid world
- Deterministic vs. stochastic transitions: M^a_ij is the probability of reaching state j when taking action a in state i.
- (Figure: a 4×3 grid world with a start cell, a +1 terminal cell, and a −1 terminal cell.)
- A simple environment that presents the agent with a sequential decision problem; each move costs 0.04.
- (Temporal) credit assignment problem; sparse reinforcement problem.
- Offline algorithms: action sequences are determined ex ante. Online algorithms: action sequences are conditional on observations along the way; this matters in stochastic environments (e.g. flying a jet).
- Transition model M: the agent moves in the intended direction with probability 0.8 and perpendicular to it with probability 0.2 (0.1 left, 0.1 right).
- Policy: a mapping from states to actions.
- (Figure: an optimal policy for the stochastic environment and the utilities of the states, e.g. 0.812, 0.868, 0.912 along the top row, 0.705 at the start state, down to 0.388 next to the −1 cell.)

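Below is a sketch of the stochastic 4×3 grid world just described (0.8 in the intended direction, 0.1 to each side, move cost 0.04), built in the same dictionary format and solved with the value_iteration sketch given earlier. The (column, row) layout is an assumption, and the computed values only approximate the utilities quoted above because the terminal reward can be accounted for in slightly different ways.

```python
WALL = {(1, 1)}                                  # the blocked cell
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}         # goal and trap cells
states = [(x, y) for x in range(4) for y in range(3) if (x, y) not in WALL]
actions = ['up', 'down', 'left', 'right']
MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
SIDEWAYS = {'up': ('left', 'right'), 'down': ('left', 'right'),
            'left': ('up', 'down'), 'right': ('up', 'down')}

def step(s, direction):
    """Intended successor cell; bumping into the wall or the edge leaves s unchanged."""
    s2 = (s[0] + MOVES[direction][0], s[1] + MOVES[direction][1])
    return s2 if s2 in states else s

T, R = {}, {}
for s in states:
    T[s], R[s] = {}, {}
    for a in actions:
        if s in TERMINALS:                       # terminals are absorbing
            T[s][a], R[s][a] = {s: 1.0}, {s: 0.0}
            continue
        probs = {}
        for d, p in [(a, 0.8), (SIDEWAYS[a][0], 0.1), (SIDEWAYS[a][1], 0.1)]:
            s2 = step(s, d)
            probs[s2] = probs.get(s2, 0.0) + p
        T[s][a] = probs
        R[s][a] = {s2: -0.04 + TERMINALS.get(s2, 0.0) for s2 in probs}

V, greedy = value_iteration(states, actions, T, R, gamma=1.0)
print(V[(0, 0)])   # utility of the start state, roughly the 0.705 shown above
```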
Observability and the Markov property
- Observable (accessible) environment: the percept identifies the state; otherwise the environment is only partially observable.
- Markov property: transition probabilities depend only on the current state, not on the path taken to reach it; this gives a Markov decision problem (MDP).
- Partially observable MDP (POMDP): the percepts do not carry enough information to identify the state or the transition probabilities.

Model-free methods
- The models of the environment are T: S × A → Π(S) and R: S × A → ℝ. Do we know them? Do we have to know them?
- Monte Carlo methods, the adaptive heuristic critic, and Q-learning are model-free.

Monte Carlo policy evaluation
- Goal: learn V^π(s) when P and R are unknown in advance.
- Given: some number of episodes generated under π which contain s.
- Idea: average the returns observed after visits to s.
- Every-visit MC: average the returns for every time s is visited in an episode.
- First-visit MC: average the returns only for the first time s is visited in an episode.
- Both converge asymptotically.

Monte Carlo methods
- Idea: hold statistics about the rewards for each state and take the average; this is V(s).
- Based only on experience; assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected).
- Incremental in an episode-by-episode sense, not a step-by-step sense.
- Problem: unvisited ⟨s, a⟩ pairs (the problem of maintaining exploration). For every ⟨s, a⟩, make sure that P(⟨s, a⟩ is selected as the start state and action) > 0 (the assumption of exploring starts).
- (Backup diagram: a Monte Carlo backup follows a complete sample episode from the state to a terminal state.)

Monte Carlo control
- How to select policies (similar to policy evaluation): MC policy iteration, i.e. policy evaluation using MC methods followed by policy improvement.
- Policy improvement step: greedify with respect to the value (or action-value) function.

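A minimal first-visit Monte Carlo evaluation sketch in Python follows; the episode representation (a list of (state, reward) pairs per episode) is an assumption for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo estimation of V(s) from sample episodes.

    Each episode is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`; V(s) is the average of the returns
    observed after the first visit to s in each episode.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # returns-to-go, computed backwards: G_t = r_t + gamma * G_{t+1}
        G = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            G[t] = episode[t][1] + gamma * G[t + 1]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:              # only the first visit in this episode
                seen.add(s)
                returns[s].append(G[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```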
Temporal-difference learning
- Monte Carlo target: the actual return observed after time t. TD target: an estimate of the return, r_{t+1} + γ V(s_{t+1}).
- Idea: do ADP-style backups on a per-move basis, not for the whole state space.
- Theorem: the average value of U(i) converges to the correct value.
- Theorem: if α is decreased appropriately as a function of the number of times a state is visited (α = f(N_i)), then U(i) itself converges to the correct value.
- (Backup diagram: a TD backup samples a single successor state, one step deep.)

TD(λ): a forward view
- TD(λ) is a method for averaging all n-step backups, weighted by λ^{n−1} (time since visitation); the λ-return is R_t^λ = (1 − λ) Σ_{n≥1} λ^{n−1} R_t^{(n)}, and the backup uses the λ-return.

The TD(λ) algorithm
- Idea: update from the whole epoch, not just on a single state transition.
- Special cases: λ = 1 gives least-mean-square (LMS), i.e. Monte Carlo; λ = 0 gives TD(0). An intermediate choice of λ (between 0 and 1) is best; there is an interplay with α.

Convergence of TD(λ)
- Theorem: TD(λ) converges with probability 1 under certain boundary conditions, namely decreasing the step sizes α_i(t) so that Σ_t α_i(t) = ∞ and Σ_t α_i(t)² < ∞. In practice a fixed α is often used for all i and t.

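A minimal sketch of a backward-view tabular TD(λ) step with accumulating eligibility traces is given below; with lam=0 it reduces to the plain TD(0) backup, and the dictionary-based tables are assumptions for illustration.

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.0):
    """One backward-view TD(lambda) step; lam=0 reduces to the plain TD(0) backup.

    V maps states to value estimates and e maps states to eligibility traces
    (both plain dicts, updated in place). The trace of the current state is
    incremented (accumulating traces) and all traces then decay by gamma * lam.
    """
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # the TD error
    e[s] = e.get(s, 0.0) + 1.0
    for x in list(e):
        V[x] = V.get(x, 0.0) + alpha * delta * e[x]
        e[x] *= gamma * lam
    return delta
```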
Q-learning (Watkins, 1989)
- Estimate the Q-function using some approximator (for example, linear regression, neural networks, or decision trees).
- Derive the estimated policy as an argmax of the estimated Q-function.
- Allow different parameter vectors at different time points.
- Here the algorithm is illustrated with linear regression as the approximator and, of course, squared error as the loss function.

The Q-learning update
- A direct approach (ADP) would require learning a model of the environment; Q-learning does not. After each state transition from i to j under action a with reward r, do the update Q(a, i) ← Q(a, i) + α ( r + γ max_{a'} Q(a', j) − Q(a, i) ).

Exploration
- There is a tradeoff between exploitation (control) and exploration (identification); the extremes are greedy and random acting (n-armed bandit models).
- Q-learning converges to the optimal Q-values if: every state is visited infinitely often (due to exploration); the action selection becomes greedy as time approaches infinity; and the learning rate α is decreased fast enough, but not too fast (as discussed for TD learning).

Common exploration methods
- In value iteration in an ADP agent: an optimistic estimate of the utility U+(i), obtained with an exploration function that returns an optimistic reward R+ as long as a state has been tried fewer than N times, and the ordinary utility estimate u otherwise.
- The ε-greedy method: choose a nongreedy (random) action with probability ε and the greedy action otherwise.
- Boltzmann exploration: choose actions with probabilities proportional to e^{Q/T}.

Q-learning algorithm (regression version)
- At each time point, fit the Q-function approximator by least squares; the estimated policy is the argmax of the estimated Q-function.
- What is the intuition? The Bellman equation characterizes the optimal Q-function; if the approximation were exact and the training set infinite, Q-learning would minimize a squared Bellman (temporal-difference) error criterion, which is equivalent to minimizing the distance to the optimal Q-function.

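Here is a tabular sketch of the Q-learning update together with ε-greedy and Boltzmann action selection; the slides also discuss regression-based (function-approximation) variants, which this sketch does not cover.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def boltzmann(Q, s, actions, temperature=1.0):
    """Boltzmann (softmax) exploration: P(a) proportional to exp(Q(s,a)/T)."""
    weights = [math.exp(Q.get((s, a), 0.0) / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning backup after observing the transition (s, a, r, s_next)."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
```

In an episode loop one would call epsilon_greedy (or boltzmann) to pick the action, execute it in the environment, and then apply q_learning_update to the observed transition.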
A-learning (Murphy, 2003 and Robins, 2004)
- Estimate the A-function (advantages) using some approximator, as in Q-learning.
- Derive the estimated policy as an argmax of the estimated A-function.
- Allow different parameter vectors at different time points.
- The algorithm is again illustrated with linear regression as the approximator and, of course, squared error as the loss function.

A-learning algorithm (inefficient version)
- (The slide gives the regression steps for each time point; the estimated policy is the argmax of the estimated A-function.)

Differences between Q-learning and A-learning
- Q-learning: at time t we model the main effects of the history (S_t, A_{t−1}) and of the action A_t, together with their interaction; our Y_{t−1} is affected by how we modeled the main effect of the history at time t, (S_t, A_{t−1}).
- A-learning: at time t we only model the effects of A_t and its interaction with (S_t, A_{t−1}); our Y_{t−1} does not depend on a model of the main effect of the history at time t, (S_t, A_{t−1}).

Q-learning vs. A-learning
- The relative merits and demerits are not yet completely understood.
- Q-learning has low variance but high bias; A-learning has high variance but low bias.
- Comparing Q-learning with A-learning therefore involves a bias-variance trade-off.

POMDP: partially observable Markov decision processes
- Rather than observing the state, we observe some function of the state: an observable function that defines a random variable for each state.
- Problem: different states may look similar, so the optimal strategy may need to take the history into account.

Framework of a POMDP
- A POMDP is defined by a six-tuple ⟨S, A, T, R, Ω, O⟩, in which ⟨S, A, T, R⟩ is the underlying Markov decision model of the environment, Ω is the set of observations, i.e. the set of world states the system can perceive, and the observation function O: S × A → PD(Ω) determines, when action a leads to state s', a probability distribution over the possible observations, written O(s', a, o).
- Note: Ω may be a subset of S, or unrelated to S.

POMDPs
- What if the state information coming from the sensors is noisy? This is mostly the case; plain MDP techniques are then suboptimal, because two halls that look the same are not the same state.

POMDPs: a solution strategy
- (Figure: observations feed a belief-state estimator SE, which can be based on an HMM; the resulting belief state is fed to a policy π obtained with MDP techniques.)

The belief-state approach to POMDPs
- Idea: given the history of actions and observations, compute a posterior distribution over the state we are in, the belief state.
- The belief-state MDP: its states are distributions over S (the states of the POMDP), its actions are those of the POMDP, and its transition is the posterior distribution (given the observation).
- Open problem: how to deal with the resulting continuous distribution?

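A minimal sketch of the belief-state update performed by the belief-state estimator is given below; passing T and O as callables and representing the belief as a dict over states are assumptions for illustration.

```python
def belief_update(b, a, o, T, O, states):
    """Bayesian belief update: b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') b(s).

    b: dict mapping states to probabilities, T(s, a, s2): transition probability,
    O(s2, a, o): probability of observing o after reaching s2 via action a.
    """
    b_new = {}
    for s2 in states:
        predicted = sum(T(s, a, s2) * b.get(s, 0.0) for s in states)
        b_new[s2] = O(s2, a, o) * predicted
    z = sum(b_new.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s2: p / z for s2, p in b_new.items()}
```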
The learning process of the belief MDP
- (Figure: the cycle of acting, observing, updating the belief state, and learning on the belief-state MDP.)

Major methods for solving POMDPs
- Learning a value function:
  - Memoryless policies: directly apply standard reinforcement learning algorithms.
  - Simple memory-based approaches: represent the current state by the last k observations.
  - UDM (Utile Distinction Memory): split states and build a finite-state-machine model.
  - NSM (Nearest Sequence Memory): store state histories and compare them with a distance measure.
  - USM (Utile Suffix Memory): combines the UDM and NSM approaches.
  - Recurrent-Q: use a recurrent neural network to predict the state.
- Policy search:
  - Evolutionary algorithms: search the policy space directly with genetic algorithms.
  - Gradient ascent methods: search with gradient descent (ascent).

Function approximation in reinforcement learning
- RL + FA: value estimates on a subset of states are used as targets, and the approximation V(s) generalizes the value function to the entire state space; one operator performs the TD (value-iteration) backup, the other the function approximation (the M function below).

Two parallel iterative processes
- The value-function iteration process and the value-function approximation process run in parallel.
- How should the M function be constructed: using state clustering, interpolation, decision trees, or neural networks?

Function approximator
- V(s) = f(s, w), with weight vector w.
- Gradient-descent Sarsa update: w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_w f(s_t, a_t, w), i.e. a standard gradient step from the estimated value toward the target value.
- Open problem: how to design a nonlinear function-approximation system that converges with incremental instances?

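The gradient-descent Sarsa update above can be sketched as follows for a linear approximator Q(s, a) = w · φ(s, a); the feature function phi is an assumed ingredient, and for a linear model the gradient of Q with respect to w is simply φ(s, a).

```python
import numpy as np

def sarsa_linear_update(w, phi, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """Gradient-descent Sarsa step for a linear approximator Q(s, a) = w . phi(s, a).

    phi(s, a) returns a feature vector (np.ndarray); the weight vector w is
    modified in place by a gradient step scaled by the TD error.
    """
    q = float(np.dot(w, phi(s, a)))
    q_next = float(np.dot(w, phi(s_next, a_next)))
    td_error = r + gamma * q_next - q          # target value minus estimated value
    w += alpha * td_error * phi(s, a)          # standard gradient step
    return td_error
```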
Semi-MDPs
- MDP: discrete time, homogeneous discounting.
- SMDP: continuous time, discrete events, interval-dependent discounting.
- A discrete-time SMDP overlaid on an MDP: discrete time, discrete events, interval-dependent discounting; it can be analyzed at either level, and is one approach to temporal/hierarchical RL.
- (The corresponding Bellman equations are given on the original slide.)

Multi-agent MDPs
- Distributed RL, Markov games, best response.
- (Figure: two RL agents act on a shared environment and each receives its own state and reward signals.)

Three viewpoints on multi-agent reinforcement learning
- Cooperative multi-agent RL. Problem space: distributed, homogeneous, cooperative environments. Main methods: exchanging states, experience, policies, or advice. Criterion: improving the speed of learning convergence.
- Equilibrium-based multi-agent RL. Problem space: homogeneous or heterogeneous, cooperative or competitive environments. Main methods: minimax-Q, Nash-Q, CE-Q, WoLF. Criteria: rationality and convergence.
- Best-response multi-agent RL. Problem space: heterogeneous, competitive environments. Main methods: PHC, IGA, GIGA, GIGA-WoLF. Criteria: convergence and no-regret.

Markov games
- In a system of n agents, define a discrete state set S (i.e. the set of stage games G), the collection A of the agents' action sets A_i, joint reward functions R_i: S × A_1 × ... × A_n → ℝ, and a state transition function P: S × A_1 × ... × A_n → PD(S).

Equilibrium-based reinforcement learning
- Open problem: is a Nash equilibrium (or some other equilibrium) enough? The optimal policy in a single-stage game is a Nash equilibrium.

Applications of RL
- Checkers [Samuel 59].
- TD-Gammon [Tesauro 92].
- The world's best down-peak elevator dispatcher [Crites et al. 95].
- Inventory management [Bertsekas et al. 95]: 10-15% better than the industry standard.
- Dynamic channel assignment [Singh & Bertsekas 95; Nie & Haykin 95]: outperforms the best heuristics in the literature.
- Cart-pole [Michie & Chambers 68-] with bang-bang control.
- Robotic manipulation [Grupen et al. 93-].
- Path planning; robot docking [Lin 93]; parking.
- Football [Stone 98]; Tetris.
- Multi-agent RL [Tan 93; Sandholm & Crites 95; Sen 94-; Carmel & Markovitch 95-], and lots of work since.
- Combinatorial optimization: maintenance and repair.
- Control of reasoning [Zhang & Dietterich, IJCAI-95].

Simulated robot soccer
- The Q-learning algorithm is applied to 2-vs-1 training in simulated robot soccer; the aim of the training is for the agent to acquire a kind of strategic awareness so that it can cooperate during an attack [宋志偉, 2003].
- Scenario: forward A has the ball and is inside the shooting area but no longer has a shooting angle, while teammate B is also in the shooting area and has a good shooting angle. If A passes to B and B takes the shot, the attacking combination succeeds. Q-learning on this 2-vs-1 shooting drill lets A learn that, in this state, passing to B is the optimal action; the agent acquires the policy through a large amount of training (a very large number of states, with the same states visited repeatedly), which makes the learned behavior more adaptive.
- State description: the attacking penalty area is divided into small square regions, each a 2 m × 2 m square, so that a two-dimensional array describes the area. The positions of the three agents describe the environment state of the 2-vs-1 attack, and the partition shown in Figure 10.11 is used to generalize over states. Agents located in the same strategic region are treated as being in similar states; this state description is not precise, but what the design needs is a strategy-level description, and since an agent can be assumed to keep moving actively within its strategic region, the method meets the requirement.

- A particular state is thus described by three region indices: the region number of attacker A, the region number of attacker B, and the region number of the goalkeeper, each computed from the grid coordinates by a fixed formula; the stored state value is the triple of these three region numbers.
- (Figure 10.11: the partition of the attacking penalty area into numbered strategic regions.)
- The set of selectable actions is fixed in advance. The shooting action's policy is obtained by learning from probability-based shooting training; the dribbling action's policy is always to carry the ball toward a region that is under little threat and has a high shooting success rate; to realize this, the attacking area is divided into several strategic regions, shooting is evaluated in each region, and each region's shooting success rate is recorded. The Pass strategy is simple ...
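A purely illustrative sketch of how the strategic-region state encoding described above might be implemented is given below; the grid origin, the number of columns, and the action name are hypothetical placeholders, since the original region-numbering formula and region layout are not reproduced in this extract.

```python
CELL = 2.0   # each strategic region is a 2 m x 2 m square

def region_index(x, y, x0=0.0, y0=0.0, cols=5):
    """Map a pitch position to a region number inside the attacking area.

    x0, y0 is the corner of the discretized area and `cols` the number of
    columns of 2 m cells; both are placeholder values, not the original grid.
    """
    col = int((x - x0) // CELL)
    row = int((y - y0) // CELL)
    return row * cols + col

def soccer_state(pos_a, pos_b, pos_goalie):
    """State = (region of attacker A, region of attacker B, region of the goalkeeper)."""
    return (region_index(*pos_a), region_index(*pos_b), region_index(*pos_goalie))

# Q-values would then be stored in a table keyed by (state, action), e.g.
# Q[(soccer_state(a, b, g), 'pass_to_B')], and updated with the tabular
# Q-learning rule sketched earlier ('pass_to_B' is a hypothetical action name).
```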
