Advanced Artificial Intelligence, Chapter 10: Reinforcement Learning
Shi Zhongzhi, Institute of Computing Technology, Chinese Academy of Sciences

Outline
Introduction
The reinforcement learning model
Dynamic programming
Monte Carlo methods
Temporal-difference learning
Q-learning
Function approximation in reinforcement learning
Applications

Introduction
Humans usually learn from interaction with the environment. Reinforcement learning is the learning of a mapping from environment states to actions, so that the cumulative reward the system receives from the environment is maximized. In reinforcement learning we design algorithms that turn the situation presented by the environment into actions that maximize the amount of reward received. The agent is not told directly what to do or which action to take; it discovers by itself which actions yield the most reward by trying them. An action affects not only the immediate reward but also the subsequent actions and, through them, the eventual reward. Trial-and-error search and delayed reinforcement are the two most important characteristics of reinforcement learning.

Introduction
Reinforcement learning techniques grew out of control theory, statistics, psychology and related disciplines, and can be traced back to Pavlov's conditioning experiments. It was not until the late 1980s and early 1990s, however, that reinforcement learning was widely studied and applied in artificial intelligence, machine learning and automatic control, and came to be regarded as one of the core techniques for building intelligent systems. After breakthroughs in its mathematical foundations, research and applications expanded rapidly, and reinforcement learning is now one of the most active topics in machine learning.

Introduction
The idea of reinforcement comes from psychology. In 1911 Thorndike stated the Law of Effect: a behavior that makes an animal feel comfortable in a given situation becomes more strongly associated with that situation (is reinforced) and is more likely to recur when the situation recurs; a behavior that makes the animal uncomfortable weakens the association and is less likely to recur. In other words, which behavior is "remembered" and linked to a stimulus depends on the effect the behavior produces. Trial-and-error learning in animals therefore has two aspects, selectional and associative, corresponding computationally to search and memory. In 1954 Minsky implemented computational trial-and-error learning in his doctoral thesis; in the same year Farley and Clark also studied it computationally. The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence", after which it came into wide use. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.

Introduction
Between 1953 and 1957 Bellman proposed an effective method for solving optimal control problems: dynamic programming. In 1957 Bellman also formulated the stochastic, discrete version of the optimal control problem, the well-known Markov decision process (MDP), and in 1960 Howard proposed the policy iteration method for MDPs. These results form the theoretical foundation of modern reinforcement learning. In 1972 Klopf brought trial-and-error learning and temporal differences together.
From 1978 on, Sutton, Barto, Moore and Klopf studied this combination in depth. In 1989 Watkins proposed Q-learning (Watkins 1989), which tied the three main threads of reinforcement learning together. In 1992 Tesauro successfully applied reinforcement learning to backgammon, producing TD-Gammon.

The reinforcement learning model
The agent interacts with the environment in a loop: at each step the environment is in state s_i, the agent takes action a_i, and the environment returns a reward r_i and the next state s_{i+1}, producing a trajectory s_0, a_0, s_1, a_1, s_2, a_2, s_3, ... (i: input, r: reward, s: state, a: action).
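The interaction loop above can be written down directly. Below is a minimal sketch in Python, assuming a hypothetical Environment object with reset() and step(action) methods and an agent given by a policy(state) function; these interfaces are illustrative and not part of the original slides.

# Minimal agent-environment interaction loop (sketch; the Environment/policy
# interfaces are assumed for illustration, not taken from the slides).
def run_episode(env, policy, max_steps=1000):
    trajectory = []               # stores (state, action, reward) triples
    state = env.reset()           # s_0
    for _ in range(max_steps):
        action = policy(state)                        # a_i chosen from the current state
        next_state, reward, done = env.step(action)   # environment returns r and s_{i+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                      # episode terminates
            break
    return trajectory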
Describing an environment (problem)
Accessible vs. inaccessible
Deterministic vs. non-deterministic
Episodic vs. non-episodic
Static vs. dynamic
Discrete vs. continuous
The most complex general class of environments is inaccessible, non-deterministic, non-episodic, dynamic, and continuous.

The reinforcement learning problem
Agent-environment interaction: states, actions, rewards.
To define a finite MDP:
state and action sets S and A;
one-step dynamics given by the transition probabilities (Markov property) P(s' | s, a) = Pr{s_{t+1} = s' | s_t = s, a_t = a};
reward probabilities R(s, a, s') = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'}.

Comparison with supervised learning
Reinforcement learning: learn from interaction. The learner learns from its own experience, and the objective is to get as much reward as possible. It is not told which actions to take, but must discover which actions yield the most reward by trying them. The RL system receives inputs, produces outputs ("actions"), and its training information consists of evaluations ("rewards" / "penalties").
Supervised learning: learn from examples provided by a knowledgeable external supervisor.

Elements of reinforcement learning
Policy: a stochastic rule for selecting actions.
Return / reward: the function of future rewards that the agent tries to maximize.
Value: what is good in the long run, because it predicts reward.
Model of the environment: what follows what.
(In the agent's view: the policy is its method, the reward is its goal, the value is what it can get, and the model of the environment is initially unknown.)
The Bellman equation for a policy π
The basic idea: the return is R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ..., so R_t = r_{t+1} + γ R_{t+1}. Hence V^π(s) = E_π{R_t | s_t = s} = E_π{r_{t+1} + γ V^π(s_{t+1}) | s_t = s}, or, without the expectation operator,
V^π(s) = Σ_a π(s, a) Σ_s' P(s' | s, a) [R(s, a, s') + γ V^π(s')],
where γ is the discount rate.

The Bellman optimality equation
V*(s) = max_a Σ_s' P(s' | s, a) [R(s, a, s') + γ V*(s')],
where V* is the optimal state-value function, S the set of environment states, R the reward function, P the state-transition probability function, and γ the discount factor.

Markov decision process (MDP)
An MDP is defined by a four-tuple (S, A, R, P):
the set of environment states S;
the set of system actions A;
the reward function R: S × A → R;
the state-transition function P: S × A → PD(S).
R(s, a, s') denotes the immediate reward obtained when the system takes action a in state s and the environment moves to state s'; P(s, a, s') denotes the probability that taking action a in state s moves the environment to state s'.

Markov decision process (MDP)
The essence of an MDP is that the probability of moving from the current state to the next state, and the reward received, depend only on the current state and the chosen action, not on the history of states and actions. When the state-transition function P and the reward function R are known (a model of the environment), dynamic programming can be used to compute the optimal policy. Reinforcement learning focuses on how the system can learn an optimal policy when P and R are unknown.

Characteristics of an MDP:
a set of states S;
a set of actions A;
a reward function R: S × A → R;
a state-transition function T: S × A → Π(S), where T(s, a, s') is the probability of a transition from s to s' using action a.

MDP example
(Slide showing a small MDP: its transition function, its states and rewards, and the Bellman equation with greedy policy selection.)
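As a concrete illustration of the four-tuple (S, A, R, P), here is a minimal sketch of a two-state MDP stored as plain Python dictionaries; the particular states, actions and numbers are made up for illustration and are not the example on the original slide.

# A tiny MDP (S, A, R, P) as plain dictionaries; states, actions and numbers
# are illustrative only.
states = ["s1", "s2"]
actions = ["stay", "go"]

# P[(s, a)] is a list of (next_state, probability) pairs: the function S x A -> PD(S).
P = {
    ("s1", "stay"): [("s1", 0.9), ("s2", 0.1)],
    ("s1", "go"):   [("s2", 1.0)],
    ("s2", "stay"): [("s2", 1.0)],
    ("s2", "go"):   [("s1", 0.8), ("s2", 0.2)],
}

# R[(s, a, s_next)] is the immediate reward R(s, a, s').
R = {
    ("s1", "stay", "s1"): 0.0, ("s1", "stay", "s2"): 1.0,
    ("s1", "go", "s2"):   1.0,
    ("s2", "stay", "s2"): 0.0,
    ("s2", "go", "s1"):  -1.0, ("s2", "go", "s2"): 0.0,
}

def expected_reward(s, a):
    # One-step expected reward: sum over s' of P(s, a, s') * R(s, a, s').
    return sum(p * R[(s, a, s2)] for s2, p in P[(s, a)])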
MDP graphical representation
The arcs of the graph are labeled with the transition probabilities T(s, action, s'). Note the similarity to hidden Markov models (HMMs).

Reinforcement learning
Deterministic transitions vs. stochastic transitions: M^a_{ij} is the probability of reaching state j when taking action a in state i. A simple 4 × 3 grid environment presents the agent with a sequential decision problem: the agent starts in the lower-left cell, the upper-right cell is a terminal state with reward +1, and the cell below it is a terminal state with reward −1. Each move costs 0.04. This is a (temporal) credit-assignment problem and a sparse-reinforcement problem. An offline algorithm determines the action sequence ex ante; an online algorithm makes the action sequence conditional on the observations along the way, which is important in a stochastic environment (e.g. flying a jet).

Reinforcement learning
The transition model M: with probability 0.8 the agent moves in the direction it intends, and with probability 0.1 each it slips to the left or right of that direction. A policy is a mapping from states to actions. An optimal policy for this stochastic environment yields state utilities of roughly 0.812, 0.868, 0.912 in the top row next to the +1 cell, 0.762 and 0.660 in the middle row next to the −1 cell (one cell there is blocked), and 0.705, 0.655, 0.611, 0.388 in the bottom row.

Environment
Observable (accessible): the percept identifies the state; otherwise the environment is partially observable.
Markov property: the transition probabilities depend only on the state, not on the path to the state; this gives a Markov decision problem (MDP).
Partially observable MDP (POMDP): the percepts do not carry enough information to identify the state or the transition probabilities.
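To make the 0.8 / 0.1 / 0.1 transition model concrete, here is a small sketch assuming the standard 4 × 3 layout described above, with a blocked cell at (2, 2); the coordinate convention and helper names are assumptions for illustration.

# Sketch of the stochastic transition model for the 4x3 grid world:
# 0.8 in the intended direction, 0.1 to each side. The blocked cell and the
# (column, row) coordinate convention are assumptions for illustration.
BLOCKED = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # terminal cells and their rewards
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

def apply_move(state, direction):
    # Move one cell; bumping into a wall or the blocked cell leaves the state unchanged.
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in BLOCKED or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def transitions(state, action):
    # Returns a dict {next_state: probability} implementing the 0.8/0.1/0.1 model.
    if state in TERMINALS:
        return {}                 # no transitions out of a terminal state
    side1, side2 = SIDEWAYS[action]
    dist = {}
    for direction, prob in [(action, 0.8), (side1, 0.1), (side2, 0.1)]:
        nxt = apply_move(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist

For example, transitions((1, 1), "up") gives {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}, because slipping left from the start cell bumps into the outer wall.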
Dynamic programming
Dynamic programming (DP) computes value functions by backing values up from successor states to predecessor states; it updates states one after another, based on a model of the distribution of next states. The DP approach to reinforcement learning rests on the fact that for any policy π and any state s the consistency equation
V^π(s) = Σ_a π(s, a) Σ_s' P(s' | s, a) [R(s, a, s') + γ V^π(s')]        (10.9)
holds, where π(s, a) is the probability of action a in state s under the stochastic policy π, and P(s' | s, a) is the probability that action a moves the state from s to s'. This is Bellman's (1957) equation for V^π.

Dynamic programming: the problem
A discrete-time dynamic system with states 1, ..., n plus a termination state 0; controls U(i); transition probabilities p_ij(u); an accumulative cost structure; and policies. One can pose a finite-horizon or an infinite-horizon problem and solve it iteratively, for example by value iteration.

Policy iteration / value iteration in dynamic programming
Policy iteration alternates policy evaluation with policy improvement ("greedification"); value iteration folds the two together.

Dynamic programming methods
(Backup diagram: full one-step backups over all actions and all successor states.)

Adaptive dynamic programming (ADP)
Idea: use the constraints (state-transition probabilities) between states to speed learning. Solving the resulting system of equations is value determination; there is no maximization over actions because the agent is passive, unlike in value iteration. Solving it with DP breaks down for large state spaces, e.g. backgammon would require 10^50 equations in 10^50 variables.

Value iteration algorithm
An alternative iteration (Singh, 1993) is important for model-free learning. Stop the iteration when no V(s) changes by more than ε; the loss of the resulting greedy policy is then bounded by 2γε/(1 − γ) (Williams & Baird, 1993b).

Policy iteration algorithm
Policies converge faster than values. Why the faster convergence?

Dynamic programming
The classical dynamic-programming model is of limited use, because for many problems it is hard to provide a complete model of the environment. Simulated robot soccer is such a problem; it can be attacked with real-time dynamic programming, which needs no environment model in advance but acquires one by testing continually in the real environment. A back-propagation neural network can be used to generalize over states: the network's input is the environment state s and its output is the evaluation V(s) of that state.
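A minimal value-iteration sketch over a generic MDP, using the same dictionary representation P[(s, a)] -> [(s', prob)] and R[(s, a, s')] as in the earlier sketch; the discount and stopping threshold are illustrative values.

# Value iteration for an MDP given as dictionaries
# P[(s, a)] = [(s_next, prob), ...] and R[(s, a, s_next)] = reward (sketch).
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backups = []
            for a in actions:
                if (s, a) not in P:
                    continue
                # One-step backup: sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
                backups.append(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                   for s2, p in P[(s, a)]))
            if backups:
                new_v = max(backups)                 # Bellman optimality backup
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
        if delta < eps:                              # stop when values change less than eps
            return V

def greedy_policy(states, actions, P, R, V, gamma=0.9):
    # Greedy ("greedified") policy extraction with respect to V.
    pi = {}
    for s in states:
        candidates = [(sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)]), a)
                      for a in actions if (s, a) in P]
        if candidates:
            pi[s] = max(candidates)[1]
    return pi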
Model-free methods
The models of the environment are T: S × A → Π(S) and R: S × A → R. Do we know them? Do we have to know them? Model-free approaches include Monte Carlo methods, the adaptive heuristic critic, and Q-learning.

Monte Carlo methods
Monte Carlo methods do not need a complete model. Instead they sample entire trajectories of states and update the value function based on the final outcome of each sample. Monte Carlo methods require only experience: sampled sequences of states, actions and rewards from online or simulated interaction with the environment. Learning from online experience is attractive because it requires no prior knowledge of the environment yet can still reach optimal behavior. Learning from simulated experience is also powerful: it requires a model, but the model only needs to be generative rather than analytic; it must be able to generate trajectories, but need not compute explicit probabilities. Thus it does not require the complete probability distributions over all possible transitions that dynamic programming needs.

Monte Carlo methods
(Backup diagram: each backup runs from a state all the way to the end of the episode.)
Idea: keep statistics of the rewards obtained for each state and take their average; this is V(s). The estimate is based only on experience, and it assumes episodic tasks: experience is divided into episodes, and every episode terminates regardless of the actions selected. The method is incremental in an episode-by-episode sense, not in a step-by-step sense.

Monte Carlo policy evaluation
Goal: learn V^π(s) when P and R are unknown in advance. Given: some number of episodes under π that contain s. Idea: average the returns observed after visits to s.
Every-visit MC: average the returns for every time s is visited in an episode.
First-visit MC: average the returns only for the first time s is visited in an episode.
Both converge asymptotically.

Monte Carlo methods: maintaining exploration
Problem: unvisited state-action pairs (the problem of maintaining exploration). For every pair (s, a), make sure that P((s, a) is selected as the starting state and action) > 0 (the assumption of exploring starts).

Monte Carlo control
How to select policies (similar to policy evaluation): MC policy iteration is policy evaluation using MC methods followed by policy improvement, where the improvement step greedifies with respect to the value (or action-value) function.
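A minimal first-visit Monte Carlo policy-evaluation sketch; it assumes episodes are given as lists of (state, action, reward) triples, as produced by the interaction-loop sketch earlier, and the discount value is illustrative.

from collections import defaultdict

# First-visit Monte Carlo policy evaluation (sketch).
# Each episode is a list of (state, action, reward) triples.
def first_visit_mc(episodes, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t following each time step, working backwards.
        G = 0.0
        returns = []
        for _, _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        # Average only the return after the FIRST visit to each state.
        seen = set()
        for (state, _, _), G_t in zip(episode, returns):
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += G_t
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}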
Temporal-difference learning
Temporal-difference (TD) learning uses no environment model and learns directly from experience; it updates at every step and does not have to wait for the task to finish. As a control method built on a prediction model, it uses past information to predict future inputs and outputs, emphasizing the function of the model rather than its structure. Like Monte Carlo methods, TD methods sample the immediate reward obtained in each learning step; like dynamic programming, they bootstrap, estimating a state's value from the estimated values of its successors. Through repeated iterations the estimates approach the true state-value function.

Temporal-difference learning
(Backup diagram: a TD backup uses one sampled transition and the current estimate of the successor state.)
The Monte Carlo target is the actual return after time t; the TD target is an estimate of that return.

Temporal-difference learning (TD)
Idea: do ADP-style backups on a per-move basis, rather than for the whole state space.
Theorem: the average value of U(i) converges to the correct value.
Theorem: if α is decreased appropriately as a function of the number of times a state has been visited (N_i), then U(i) itself converges to the correct value.

TD(λ): a forward view
TD(λ) averages all n-step backups, weighting the n-step backup by λ^{n−1} (the time since visitation). The λ-return is R_t^λ = (1 − λ) Σ_{n≥1} λ^{n−1} R_t^{(n)}, where R_t^{(n)} is the n-step return, and the backup is ΔV_t(s_t) = α [R_t^λ − V_t(s_t)].

The TD(λ) algorithm
Idea: update from the whole epoch, not just on a single state transition. Special cases: λ = 1 gives the least-mean-square (LMS), Monte Carlo case; λ = 0 gives TD(0). An intermediate choice of λ (between 0 and 1) is best, and it interacts with the step size α.

The TD(λ) algorithm
Algorithm 10.1: TD(0) learning.
Initialize V(s) arbitrarily, and let π be the policy to be evaluated.
Repeat (for each episode):
  Initialize s.
  Repeat (for each step of the episode):
    Choose a from s using the policy derived from V (e.g., ε-greedy).
    Take action a, observe r and s'.
    V(s) ← V(s) + α [r + γ V(s') − V(s)].
    s ← s'.
  Until s is terminal.

Convergence of TD(λ)
Theorem: TD(λ) converges with probability 1 under certain boundary conditions, provided the step sizes α_i(t) are decreased so that Σ_t α_i(t) = ∞ and Σ_t α_i(t)^2 < ∞. In practice a fixed α is often used for all i and t.
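A tabular TD(0) sketch that follows Algorithm 10.1 above; the environment and policy interfaces are the same assumed ones as in the earlier interaction-loop sketch, and the step-size and discount values are illustrative.

from collections import defaultdict

# Tabular TD(0) policy evaluation (sketch of Algorithm 10.1).
def td0(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                 # V(s), initialized arbitrarily (here: 0)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return dict(V)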
Q-learning (Watkins, 1989)
In Q-learning the backup starts from an action node and maximizes over all possible actions of the next state and their rewards. In the fully recursive definition of Q-learning, the leaves of the backup tree are all the terminal nodes reachable by a sequence of actions starting from the root action, together with the rewards of their successor actions. Online Q-learning expands forward from the possible actions and does not need to build a complete world model; Q-learning can also be run offline. As we shall see, Q-learning is a temporal-difference method.

Q-learning
In Q-learning, Q is a function from state-action pairs to learned values; for all states and actions, Q: (state × action) → value. One step of Q-learning is
Q(s_t, a_t) ← (1 − c) Q(s_t, a_t) + c [r_{t+1} + γ max_a Q(s_{t+1}, a)],        (10.15)
where c and γ are at most 1 and r_{t+1} is the reward received on reaching state s_{t+1}.

Q-learning with function approximation
Estimate the Q-function using some approximator (for example, linear regression, neural networks, or decision trees). Derive the estimated policy as an argument of the maximum of the estimated Q-function. Allow different parameter vectors at different time points. We illustrate the algorithm with linear regression as the approximator and, naturally, squared error as the loss function.

Q-learning
Q(a, i): the direct approach (ADP) would require learning a model; Q-learning does not. It performs its update after each state transition.

Exploration
There is a trade-off between exploitation (control) and exploration (identification); the extremes are greedy and random acting (the n-armed bandit models). Q-learning converges to the optimal Q-values if every state is visited infinitely often (thanks to exploration), the action selection becomes greedy as time approaches infinity, and the learning rate α is decreased fast enough, but not too fast (as discussed for TD learning).

Common exploration methods
In value iteration in an ADP agent: use an optimistic estimate of the utility U+(i); the ε-greedy method (occasionally take a nongreedy action instead of the greedy one); Boltzmann exploration; or an exploration function f(u, n) that returns an optimistic value R+ while the visit count n is below a threshold N and returns u otherwise.

The Q-learning algorithm
Initialize Q(s, a) arbitrarily.
Repeat (for each episode):
  Initialize s.
  Repeat (for each step of the episode):
    Choose a from s using the policy derived from Q (e.g., ε-greedy).
    Take action a, observe r and s'.
    Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)].
    s ← s'.
  Until s is terminal.

The Q-learning algorithm with function approximation
Set the parameters, fit the approximate Q-function at each time point (with linear regression and squared-error loss in our illustration), and take the estimated policy to be the argmax of the estimated Q-function.
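A tabular Q-learning sketch with ε-greedy exploration, following the algorithm above; the environment interface is the same assumed one as before, and the values of ε, α, and γ are illustrative.

import random
from collections import defaultdict

# Tabular Q-learning with epsilon-greedy exploration (sketch).
def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                        # Q[(s, a)], initialized arbitrarily
    def epsilon_greedy(state):
        if random.random() < epsilon:             # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])   # exploit
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            # Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    # Greedy policy derived from the learned Q-function.
    return Q, lambda s: max(actions, key=lambda a: Q[(s, a)])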
What is the intuition?
The Bellman equation expresses the Q-function at time t in terms of the expected reward plus the discounted Q-function at time t + 1. Roughly speaking, if the approximator could represent the Q-function exactly and the training set were infinite, then Q-learning would minimize the resulting squared Bellman error, which is equivalent to minimizing the corresponding least-squares criterion for the Q-function itself.

A-learning (Murphy 2003; Robins 2004)
Estimate the A-function (the advantages) using some approximator, as in Q-learning. Derive the estimated policy as an argument of the maximum of the estimated A-function. Allow different parameter vectors at different time points. As before, we illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

A-learning algorithm (inefficient version)
For each time point, fit the advantage approximator (here by least-squares regression) and take the estimated policy to be the argmax of the estimated A-function.

Differences between Q-learning and A-learning
Q-learning: at time t we model the main effects of the history (S_t, A_{t−1}) and of the action A_t, and their interaction; our Y_{t−1} is affected by how we modeled the main effect of the history at time t, (S_t, A_{t−1}).
A-learning: at time t we model only the effect of A_t and its interaction with (S_t, A_{t−1}); our Y_{t−1} does not depend on a model of the main effect of the history at time t, (S_t, A_{t−1}).

Q-learning vs. A-learning
Their relative merits and demerits are not yet completely known. Q-learning has low variance but high bias; A-learning has high variance but low bias.
Comparing Q-learning with A-learning therefore involves a bias-variance trade-off.

POMDP: partially observable Markov decision processes
Rather than observing the state, we observe some function of the state: Ob is the observable function, a random variable for each state. The problem is that different states may look similar, so the optimal strategy might need to consider the history.

Framework of a POMDP
A POMDP is defined by a six-tuple ⟨S, A, R, P, Ω, Ο⟩, where ⟨S, A, R, P⟩ defines the underlying Markov decision model of the environment, Ω is the set of observations, i.e. the world states the system can perceive, and the observation function Ο: S × A → PD(Ω) gives, when the system takes action a and moves to state s', a probability distribution over the possible observations, written Ο(s', a, o). Ω may be a subset of S, or unrelated to S.

POMDPs
What if the state information coming from the sensors is noisy? It mostly is! MDP techniques are then suboptimal: two hallways that look alike are not the same state.

POMDPs: a solution strategy
SE: a belief-state estimator (which can be based on an HMM); π: MDP techniques applied to the belief state.

POMDP: the belief-state approach
Idea: given the history of actions and observed values, compute a posterior distribution over the state we are in (the belief state). This yields the belief-state MDP. States: distributions over S (the states of the POMDP). Actions: as in the POMDP. Transitions: the posterior distribution given the observation. Open problem: how to deal with the resulting continuous distribution?

The learning process of the belief MDP
(Diagram of the belief-MDP learning loop.)
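The core of the belief-state approach is the Bayesian belief update. Below is a minimal sketch, assuming the transition function is stored as T[(s, a)] -> {s': prob} and the observation function as O[(s', a)] -> {o: prob}; these dictionary structures are illustrative and not defined on the original slides.

# Belief-state update b'(s') proportional to O(s', a, o) * sum_s T(s, a, s') * b(s)  (sketch).
# T[(s, a)] is a dict {s_next: prob}; O[(s_next, a)] is a dict {obs: prob}.
def update_belief(belief, action, observation, T, O):
    next_states = {s2 for (s, a), dist in T.items() if a == action for s2 in dist}
    new_belief = {}
    for s_next in next_states:
        prior = sum(b * T.get((s, action), {}).get(s_next, 0.0)
                    for s, b in belief.items())
        likelihood = O.get((s_next, action), {}).get(observation, 0.0)
        new_belief[s_next] = likelihood * prior
    total = sum(new_belief.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in new_belief.items()}   # normalize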
Major methods for solving POMDPs
Learning a value function:
  Memoryless policies: apply a standard reinforcement learning algorithm directly.
  Simple memory-based approaches: represent the current state by the last k observations.
  UDM (Utile Distinction Memory): split states and build a finite-state-machine model.
  NSM (Nearest Sequence Memory): store state histories and compare them with a distance measure.
  USM (Utile Suffix Memory): combine the UDM and NSM approaches.
  Recurrent-Q: use a recurrent neural network to predict the state.
Policy search:
  Evolutionary algorithms: search the policy space directly with a genetic algorithm.
  Gradient ascent methods: search with gradient descent (ascent).
Function approximation in reinforcement learning
RL is combined with a function approximator (FA): value estimates computed on a subset of states serve as training targets, and the approximator generalizes the value function to the entire state space. One operator is the TD (value-backup) operator; M is the function-approximation operator.

Two parallel iterative processes
The value-function iteration process and the value-function approximation process run in parallel. How should the M function be constructed: using state clustering, interpolation, decision trees, or neural networks?

Function approximator: V(s) = f(s, w)
Update by gradient-descent Sarsa:
w ← w + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)] ∇_w f(s_t, a_t, w),
where w is the weight vector, the gradient is the standard gradient, r_{t+1} + γ Q(s_{t+1}, a_{t+1}) is the target value and Q(s_t, a_t) the estimated value. Open problem: how do we design a nonlinear FA system that converges as instances arrive incrementally?
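A sketch of the gradient-descent Sarsa update with a linear approximator Q(s, a) = w · φ(s, a); the feature function phi and the step size are assumptions for illustration (for a linear approximator the gradient ∇_w f is simply the feature vector).

# Gradient-descent Sarsa update for a linear approximator Q(s,a) = w . phi(s,a).
# phi(s, a) must return a list of feature values; w is the weight vector.
def q_value(w, phi, s, a):
    return sum(wi * xi for wi, xi in zip(w, phi(s, a)))

def sarsa_update(w, phi, s, a, reward, s_next, a_next, alpha=0.01, gamma=0.9):
    target = reward + gamma * q_value(w, phi, s_next, a_next)   # r + gamma Q(s',a')
    estimate = q_value(w, phi, s, a)                            # Q(s,a)
    td_error = target - estimate
    features = phi(s, a)            # for a linear FA, this is the gradient of Q wrt w
    # w <- w + alpha * td_error * grad_w Q(s,a)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, features)]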
Semi-MDP
A semi-Markov decision process can be viewed at several levels: discrete time with a homogeneous discount; continuous time with discrete events and an interval-dependent discount; or discrete time with discrete events and an interval-dependent discount. A discrete-time SMDP overlaid on an MDP can be analyzed at either level; this is one approach to temporally hierarchical reinforcement learning.

The equations
(Slide of the corresponding SMDP Bellman equations.)

Multi-agent MDP
Distributed reinforcement learning, Markov games, and best response: several RL agents act in the same environment, each receiving states and rewards and emitting actions.

Three viewpoints on multi-agent reinforcement learning
Cooperative multi-agent RL. Problem space: distributed, homogeneous, cooperative environments. Main methods: exchanging states, exchanging experience, exchanging policies, exchanging advice. Criterion: improving the convergence speed of learning.
Equilibrium-based multi-agent RL. Problem space: homogeneous or heterogeneous, cooperative or competitive environments. Main methods: minimax-Q, Nash-Q, CE-Q, WoLF. Criteria: rationality and convergence.
Best-response multi-agent RL. Problem space: heterogeneous, competitive environments. Main methods: PHC, IGA, GIGA, GIGA-WoLF. Criteria: convergence and no-regret.

Markov games
In a system of n agents, define a discrete state set S (the set of games G), the set A of the agents' action sets A_i, joint reward functions R_i: S × A_1 × ... × A_n → R, and a state-transition function P: S × A_1 × ... × A_n → PD(S).

Equilibrium-based reinforcement learning
Open problem: is a Nash equilibrium (or some other equilibrium) enough? The optimal policy in a single game is a Nash equilibrium.
Applications of RL
Checkers (Samuel 1959); TD-Gammon (Tesauro 1992); the world's best down-peak elevator dispatcher (Crites et al. 1995); inventory management (Bertsekas et al. 1995), 10-15% better than the industry standard; dynamic channel assignment (Singh & Bertsekas, Nie & Haykin 1995), outperforming the best heuristics in the literature; cart-pole with bang-bang control (Michie & Chambers 1968); robotic manipulation (Grupen et al. 1993-); path planning; robot docking (Lin 1993); parking; football (Stone 1998); Tetris; multi-agent RL (Tan 1993, Sandholm & Crites 1995, Sen 1994-, Carmel & Markovitch 1995-, and much work since); combinatorial optimization, e.g. maintenance and repair; control of reasoning (Zhang & Dietterich, IJCAI-95).

Simulated robot soccer
Q-learning is applied to 2-versus-1 training in simulated robot soccer. The aim of the training is for the agent to acquire a kind of strategic awareness, so that it can cooperate with a teammate during an attack.
Simulated robot soccer
Forward A has the ball and is inside the shooting area, but no longer has a shooting angle; teammate B is also in the shooting area and has a good shooting angle. If A passes to B and B takes the shot, the attack succeeds. Through Q-learning on this 2-versus-1 shooting drill, A learns that in this situation passing to B is the optimal action. The agent acquires the policy through a large amount of training (a very large number of states, with the same states occurring repeatedly), which makes the behavior more adaptive.

Simulated robot soccer
State description: the attacking penalty area is divided into small regions, each a square of side 2 m, so a two-dimensional array of region indices describes the area. The positions of the three agents describe the environment state in the 2-versus-1 attack, and the partition shown in Figure 10.11 is used to generalize states: agents located in the same strategic region are treated as being in similar states. This description is not precise, but what the design needs is a description at the strategic level, and the agents can be assumed to be moving actively within a strategic region, so the method meets the requirement. A particular state is therefore described by a triple of region numbers (i, j, k), where i is the region number of attacker A, j is the region number of attacker B, and k is the region number of the goalkeeper; the region number is computed from the agent's position by the region-numbering formula, and the stored state value is the triple of region numbers.
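As an illustration of this kind of state generalization, here is a hypothetical sketch of mapping continuous pitch coordinates to 2 m-square region numbers and forming the (i, j, k) state; the penalty-area origin, its width in regions, and the row-major numbering are assumptions, since the slides do not reproduce the actual formula or Figure 10.11.

# Hypothetical state encoding for the 2-vs-1 drill: positions (x, y) in metres
# are mapped to 2 m x 2 m region numbers; the origin, the area width and the
# row-major numbering are illustrative assumptions, not the formula on the slides.
CELL = 2.0            # side length of a region in metres
REGIONS_PER_ROW = 8   # assumed width of the penalty area in regions

def region_number(x, y, x0=0.0, y0=0.0):
    col = int((x - x0) // CELL)
    row = int((y - y0) // CELL)
    return row * REGIONS_PER_ROW + col          # row-major region index

def soccer_state(pos_a, pos_b, pos_goalkeeper):
    # State (i, j, k): region numbers of attacker A, attacker B and the goalkeeper.
    return (region_number(*pos_a),
            region_number(*pos_b),
            region_number(*pos_goalkeeper))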