Discrete Event Dynamic Systems: Markov Chains (lecture slides)

Course overview
- Course type: non-degree course
- Hours/credits: 32/2
- Hours per week: 4 (adjusted later in the term)
- Format: lectures; literature reports and free discussion
- Application areas: network system analysis, mobile robots, intelligent transportation, production automation and supply chain management, agent systems, networked control and optimization, machine learning, queueing networks, system reliability analysis, and other problems involving decision optimization, control, and intelligent learning
- Prerequisites: calculus, probability theory, linear algebra
- Assessment: coursework (course summary and literature report)

Course contents
1. Discrete event dynamic systems: basic concepts, classification, and research methods
2. Basic simulation techniques for stochastic discrete event dynamic systems
3. Markov decision processes (including Markov chains and semi-Markov decision processes): fundamentals

4. Dynamic programming and simulation-based optimization: the Bellman optimality equation, policy iteration, and value iteration
5. Reinforcement learning techniques: Monte-Carlo methods, TD learning, Q-learning, and SARSA
6. Neuro-dynamic (approximate dynamic) programming
7. Multi-agent learning
8. Case studies

Chapter 1: Discrete Event Dynamic Systems: Basic Concepts, Classification, and Research Methods

Basic concepts
- With the rapid development of high technology, a large number of complex man-made systems have emerged in the real world, such as computer networks, communication networks, flexible manufacturing systems, CIMS, traffic management systems, and military command systems.

- The common feature of these systems is that their evolution cannot be described by the usual physical laws; instead it obeys complex man-made rules and is driven by a sequence of interacting discrete events.
- Such man-made systems are commonly described as discrete event dynamic systems (DEDS).
- Event: an action or occurrence that changes the state of a DEDS.

Difference between DEDS and ordinary dynamic systems
- A conventional continuous-variable dynamic system (CVDS) obeys physical laws and can be described by differential or difference equations.

- For example, the equations of motion of a particle in classical mechanics take the form
  differential equation: ẋ(t) = f(x(t), u(t), t), with the linear case ẋ(t) = A x(t) + B u(t)
  difference equation: x(k+1) = f(x(k), u(k)), with the linear case x(k+1) = A x(k) + B u(k)
- Basic DEDS concept: a class of dynamic systems composed of interacting discrete events whose occurrence triggers the state transitions (evolution); the events occur discretely in both time and space.

Example 1: Flexible manufacturing system (figure: workstations 1..M with buffers of unprocessed and processed parts, an intelligent warehouse, and automated guided vehicles)

Example 2: Robotic assembly line

Example 3: Open queueing network (figure: three service stations with buffers, external arrivals, and routing probabilities q_ij)

Admission control in communication systems

Basic classification and research methods
Three modeling levels of DEDS:
- Logical level models (deterministic): formal languages, finite automata, Markov chains, Petri nets, etc. (untimed): the models carry no timing information, and only the ordering of the symbols characterizing system behavior is considered
- Algebraic level models (deterministic): max-plus algebra, finitely recursive processes, etc. (timed)
- Statistical level models (stochastic): Markov processes, semi-Markov or generalized semi-Markov processes, and various types of queueing networks (timed; studied by simulation)

Research at the statistical performance level of DEDS
Since the 1990s, research at the statistical performance level has become an important part of the DEDS field, with two main directions:
- Performance analysis of systems, mainly sensitivity analysis
- Optimization theory and applications: Markov control (decision) process methods and the associated optimization problems have become a notable hot topic in the DEDS field, and they are the main subject of this course. Extensions: SMDP, POMDP, HMM, and HDS modeling.

Chapter 2: Basic Simulation Techniques for Stochastic Discrete Event Dynamic Systems

Random variables
- Random variable: roughly speaking, a quantity that can take different values
- Non-random (a deterministic value that never changes): the number of suns in the solar system

- Random: the number of phone calls a person receives in a day, which differs from day to day

Probability
- Experiment: an exam, rolling a die, a ball game, tossing a coin
- Each experiment yields an output X; the output is treated as a random variable that can take several values:
  (pass, fail), (1, 2, 3, 4, 5, 6), (win, lose), (heads, tails)
- Event: rolling a die and getting a 2, or getting an even number
- Probability of an event: the chance or likelihood that the event occurs. If event A occurs n times in m experiments, then P(A) = lim_{m→∞} n/m ∈ [0, 1]

Addition law
- Mutually exclusive events: events that cannot occur together

- Compound event: composed of mutually exclusive events. For example, when rolling a die repeatedly, the event "even number" consists of the mutually exclusive events 2, 4, and 6. If a compound event E consists of A_1, ..., A_n, then P(E) = P(A_1) + ... + P(A_n)
- Complex event: not necessarily composed of mutually exclusive events. For example, when rolling a die, the event "prime (2, 3, 5) or even (2, 4, 6)" satisfies P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Multiplication law
- Independent events: the occurrence of one event does not depend on the other; otherwise the events are dependent. In coin tossing, the event A "heads on the first toss" is independent of the event B "tails on the second toss".

  If E denotes "heads on the first toss and tails on the second", then P(E) = P(A ∩ B) = P(A)·P(B).
- For mutually exclusive events, dependence is not an issue; non-exclusive events may be independent, in which case P(A ∩ B) = P(A)·P(B).
- If the events are neither mutually exclusive nor independent, then P(A ∩ B) = P(A)·P(B|A) = P(B)·P(A|B), where P(B|A) and P(A|B) are conditional probabilities. (What do these become if A and B are independent?)

Probability distributions: discrete variables
A random variable may take discrete values, such as 1, 4.5, 18, 1969, or continuous values, e.g. in the interval [0, 10]. Consider discrete variables first.

- Discrete random variable: in a dice game the output is X ∈ {1, 2, 3, 4, 5, 6}, and the probability that X equals 1 is written P(X=1) = 1/6. In general, P(X=k) = l defines a probability mass function (PMF) f(x), which gives P(X=x).
- P(X ≤ k) = l gives the probability that X does not exceed k; this function is the cumulative distribution function (CDF, sometimes simply called the distribution function), written F(k) or F(x), and it satisfies
  F(k) = Σ_{x=a}^{k} f(x)   (the sum of the PMF values from the lowest possible value a of X up to k)
- The PMF and the CDF determine each other.

Probability distributions: continuous variables
Continuous random variable: for example, the time between two consecutively received phone calls.

- Probability density function (PDF, the continuous counterpart of the PMF), still written f(x). The distribution function satisfies
  F(k) = ∫_a^k f(x) dx,   F(b) = ∫_a^b f(x) dx = 1,   f(x) = 0 for x < a or x > b,   f(x) = dF(x)/dx

Expected value and standard deviation of a random variable
- Expected (mean/average) value of a discrete random variable: E(X) = Σ_x x f(x)
- Expected value of a continuous random variable: E(X) = ∫ x f(x) dx
- The mean alone does not characterize a random variable; it is informative only together with the standard deviation. The randomness is fully captured by the PDF, PMF, or CDF.

- Standard deviation: an estimate of the average deviation of a random variable from its mean, defined as
  σ(X) ≈ sqrt( Σ_{i=1}^{n} (x_i - x̄)² / (n - 1) )  if x̄ is the average of n experiments (the n deviations are then not independent)
  σ(X) ≈ sqrt( Σ_{i=1}^{k} (x_i - m)² / k )  if the mean m of the random variable X is known (the k deviations are then independent)
- The square of the standard deviation, σ²(X), is called the variance.

Limit theorems
- Strong law of large numbers: let X_1, X_2, ... be a sequence of independent random variables with a common distribution and mean E(X); then with probability 1 the sample mean (X_1 + ... + X_n)/n converges to the expectation.
- Central limit theorem: let X_1, X_2, ... be a sequence of i.i.d. random variables with mean μ and variance σ², and define Y_n = (X_1 + ... + X_n - nμ)/(σ√n); then as n → ∞, whatever the original distribution, the distribution of Y_n tends to the normal distribution with mean 0 and variance 1.
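As a quick illustration of the two standard-deviation estimators above, the following Python sketch estimates the mean and standard deviation of simulated die rolls; the function name and the sample size are our own illustrative choices, not something specified in the slides.

```python
import random
import math

def sample_stats(samples, known_mean=None):
    """Estimate mean and standard deviation of a list of samples.

    If the true mean is known, divide by k as in the second formula;
    otherwise use the sample mean and divide by (n - 1)."""
    n = len(samples)
    if known_mean is None:
        m = sum(samples) / n
        var = sum((x - m) ** 2 for x in samples) / (n - 1)
    else:
        m = known_mean
        var = sum((x - m) ** 2 for x in samples) / n
    return m, math.sqrt(var)

# Die rolls: true mean 3.5, true standard deviation sqrt(35/12) ~ 1.71
rolls = [random.randint(1, 6) for _ in range(10000)]
print(sample_stats(rolls))                  # sample-mean version
print(sample_stats(rolls, known_mean=3.5))  # known-mean version
```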

Basic concepts in simulation
- Discrete event simulation mainly involves random number generation and simulation models of stochastic systems
- Dynamic system: a system whose behavior changes over time
- State: the physical quantities that describe the system behavior over time, e.g. the queue length of a queueing system, an inventory level, or bandwidth utilization
- Governing (control) variables: the variables that govern or drive the behavior of the dynamic system, e.g. the service times and the interarrival times of successive customers in a queueing system
- Stochastic system: a system whose governing variables are random variables, so that its behavior is driven by randomness

Models
- Physical model: a small-scale aircraft model, a mock-up of a system

- Abstract (mathematical) model: equations, functions, inequalities, computer programs, and so on, which help us understand, analyze, and predict system behavior
- Modeling is generally based on assumptions about the system structure and the distributions of the governing variables, e.g. exponential service and interarrival times in a queueing system
- To study large-scale complex stochastic systems, a computer program can mimic the system behavior (by generating random numbers for the governing random variables); such a program is called a simulation model
- A simulation model can also be used for optimization, especially when a mathematical model cannot, or can hardly, be built; this gives rise to simulation-based optimization

Why study stochastic systems
- Many real systems are stochastic, e.g. communication networks
- By studying them, we can modify these systems so that they run more efficiently (or at a lower operating cost)

Simulation models of stochastic systems
- The first step in modeling a stochastic system is to find the distributions of the governing random variables

- Role of the distributions: in a mathematical model they are used to build expressions; in a simulation model they are used to generate random numbers so that the computer can mimic the system behavior, i.e. reconstruct the events occurring in the real system (generate values of the governing random variables)
- Obtaining the distributions: collect data from the real system and then fit distribution functions

Random number generation: the uniform distribution (random numbers are generated artificially!)
Linear congruential generator (LCG)
- I_{j+1} = a·I_j mod m, the remainder of a·I_j divided by m, where a and m are positive integers and I_0 ≤ m; the sequence I_j ∈ (0, m) is the random sequence. For example, with a = 2, m = 20, I_0 = 12 we get 12, 4, 8, 16, 12, 4, 8, 16, 12, ...
- With a and m chosen properly, all integers between 0 and m are obtained (m - 1 of them); the i-th integer x_i determines the i-th random number y_i = x_i / m, and each y_i is equally likely (each x_i appears exactly once per period). The larger m is, the closer the y_i are to natural random numbers uniformly distributed on (0, 1).

- I_0 is the seed; the maximum number of random numbers that can be generated is m - 1. For m = 2^31 - 1 this is 2147483646.
- In practice, if the period is short (m small), the random numbers repeat and the results deteriorate (the numbers are no longer independent nor uniformly distributed).
- In I_{j+1} = a·I_j mod m, the product a·I_j may exceed the machine's integer range.
- Schrage's approximate factorization avoids computing a·I_j directly: with positive integers q and r, let Q = a·(I_j mod q) - r·⌊I_j / q⌋; then I_{j+1} = Q if Q ≥ 0, and I_{j+1} = Q + m otherwise.
- For a uniform distribution on any interval (a, b), a random number x can be obtained from a (0, 1) uniform random number y by the mapping x = a + (b - a)·y.
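A minimal Python sketch of a linear congruential generator with Schrage's factorization is given below; the multiplier a = 16807 with m = 2^31 - 1 is the widely used "minimal standard" choice and is our assumption here, not something fixed by the slides.

```python
class LCG:
    """Linear congruential generator I_{j+1} = a * I_j mod m, implemented
    with Schrage's factorization so a * I_j never overflows 32-bit range."""

    def __init__(self, seed=12345, a=16807, m=2**31 - 1):
        self.a, self.m = a, m
        self.q, self.r = m // a, m % a   # m = a*q + r
        self.state = seed

    def next_int(self):
        # Schrage: a*I mod m = a*(I mod q) - r*(I // q), plus m if negative
        hi, lo = divmod(self.state, self.q)
        t = self.a * lo - self.r * hi
        if t < 0:
            t += self.m
        self.state = t
        return t

    def uniform(self, lo=0.0, hi=1.0):
        # Map the integer to (0, 1), then to (lo, hi) via x = lo + (hi - lo) * y
        y = self.next_int() / self.m
        return lo + (hi - lo) * y

gen = LCG(seed=12)
print([round(gen.uniform(), 4) for _ in range(5)])
```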

Random number generation: other distributions (the inverse transform method)
- The cumulative distribution function of the exponential distribution is F(x) = 1 - e^{-λx}, x ≥ 0.
1. Generate a random number y uniformly distributed on (0, 1) and set it equal to the value of the exponential CDF, i.e. F(x) = y.
2. Solve for x: e^{-λx} = 1 - y, hence x = -ln(1 - y)/λ, or equivalently x = -ln(y)/λ (since y and 1 - y have the same distribution).
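The inverse transform step can be coded directly; this sketch uses Python's built-in uniform generator for the (0, 1) draws (the LCG above could be substituted) and an arbitrary illustrative rate.

```python
import math
import random

def exponential_sample(uniform_y, rate):
    """Inverse transform: solve F(x) = 1 - exp(-rate * x) = y for x."""
    return -math.log(1.0 - uniform_y) / rate

# Example: interarrival times with mean 1/rate = 2.0
rate = 0.5
samples = [exponential_sample(random.random(), rate) for _ in range(10000)]
print(sum(samples) / len(samples))   # should be close to 2.0
```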

Event reconstruction using random numbers
- Take a single-server queue as an example; there are two governing variables:
  - the interarrival time t_a between successive customers
  - the service time t_s the server spends on one customer
- Generate two random sequences from their respective distributions, e.g. t_a = 10.1, 2.3, 1, 0.9, 3.5 and t_s = 0.1, 3.2, 1.19, 4.9, 1.1
- Two kinds of events can then be reconstructed on a time line: arrivals (from t_a) and departures (from the arrival times and t_s)
- From the reconstructed time line one can read off the idle periods of the server (e.g. between the first departure at 10.1 + 0.1 and the second arrival at 10.1 + 2.3), the utilization (1 - 12.3/22.79 in the slide's numerical example), and long-run performance measures

19、統(tǒng)2022-1-630/ 51第三章Markov決策過(guò)程基本知識(shí)2022-1-631/ 51EXAMPLES THE DETERMINISTIC SHORTEST PATH PROBLEM nTransition from one city to the next one is deterministic:Each control (or action) at a given city leads to a unique and certain successor citynThe objective is to find a path among all possible paths, wh

20、ich has the minimum costnThis problem can be solved effectively by dynamic programmingTermination cityInitial cityFig.1Path programming for a traveling sales man 2022-1-632/ 51Fig.2 : The shortest path problem *min,: The optimal cost from city to the termination,: The cost for the transition from to

Fig. 2: The shortest path problem (the numbers in the figure are the edge costs). Bellman equation (backward recursion, starting from the terminal node K):
  J*(i) = min_j [ g(i, j) + J*(j) ]
where J*(i) is the optimal cost from city i to the termination and g(i, j) is the cost of the transition from i to j.
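A short Python sketch of this backward recursion on a small acyclic graph follows; the graph and its edge costs are invented for illustration (the actual costs of Fig. 2 are not recoverable from the extracted text), and the sweep below simply repeats the Bellman update until the values J*(i) stop changing.

```python
def shortest_path(successors, terminal):
    """Bellman recursion J*(i) = min_j [ g(i,j) + J*(j) ].

    successors[i] is a dict {j: g(i, j)}; terminal is the termination node.
    The repeated sweep reproduces the backward-recursion values on an
    acyclic graph without needing an explicit topological order."""
    J = {terminal: 0.0}
    policy = {}
    changed = True
    while changed:
        changed = False
        for i, edges in successors.items():
            if all(j in J for j in edges):
                best_j = min(edges, key=lambda j: edges[j] + J[j])
                cost = edges[best_j] + J[best_j]
                if J.get(i) != cost:
                    J[i], policy[i] = cost, best_j
                    changed = True
    return J, policy

# Hypothetical layered graph: A -> {B, C}, B -> {C, K}, C -> K
graph = {"A": {"B": 3, "C": 2}, "B": {"K": 4, "C": 1}, "C": {"K": 7}}
J, policy = shortest_path(graph, "K")
print(J["A"], policy)   # optimal cost from A and the greedy successor map
```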

Examples: the stochastic shortest path (SSP) problem
- Transition from one state to the next is stochastic; that is, each action at a given state may lead to several possible successor states, and each transition, e.g. from state C to state F, generates a cost, which may depend on the action
Fig. 3: Path programming for a signal in communication systems (from an initial state to a termination state)

Examples: transitions of a signal in the SSP problem
- Taking action d in state C, the possible successors are E, F, and G, with transition probabilities P(E | C, d), P(F | C, d), P(G | C, d) satisfying P(E | C, d) + P(F | C, d) + P(G | C, d) = 1, and generated costs f(C, d, E), f(C, d, F), f(C, d, G)
- The essence of the problem is how to reach the termination state with minimum expected cost

Mathematical models for Markov chains: system evolution
- At decision epoch t the system is in state X_t, an action d_t is taken, a cost f_t(X_t, d_t) is incurred, and the system moves to X_{t+1} at epoch t+1 with probability P(X_{t+1} | X_t, d_t); this produces a state sequence, an action sequence, and a cost sequence
- Markov property: the state transition is independent of the history, i.e. the transition from X_t to X_{t+1} is determined only by the current state X_t and the selected action d_t

Mathematical models for Markov chains: basic model parameters
A controlled Markov chain consists of:
- Decision epochs: {0, 1, 2, ..., N} or {0, 1, 2, ...}
- State space S (generic state i)
- Action set D (generic action d ∈ D)
- Rewards (costs) f(i, d) or f(i, d, j)
- Transition probabilities p(j | i, d), also written p_{ij}(d)
A model is called stationary if the rewards and transition probabilities are independent of time.

Mathematical models for Markov chains: classification of policies
- History of a controlled Markov chain evolving from state X_0 to X_n: H_n = (X_0, d_0, X_1, d_1, ..., X_{n-1}, d_{n-1}, X_n)
- Stochastic policy: a sequence π = (F_0, F_1, ...), where each F_n is a distribution over the action set given the history: F_n(d | H_n) ≥ 0 and Σ_{d∈D} F_n(d | H_n) = 1 (action d is selected with probability F_n(d | H_n))

27、ic policacHistor y y nnnnnnnXXXdX dXdXHV LLFtion set if history is given: ( |)0 1, ( |)=1 ( is selected deterministic policywith prob. ( |)A is a mapping : (Given a history, a special action is nnnnnnnd DnnddddHDFFFF01 selected w.p.1, ( |)1 for a fixed action d): ( |)( |) (Deterministic) Markov poli

- (Deterministic) stationary policy: d_n = v(X_n) with a fixed map v: S → D. Denote a stationary policy by v and the set of stationary policies by M; for a finite state space write v = (v(1), ..., v(M)), a state-action map (look-up table).

Mathematical models for Markov chains: performance criteria
Finite horizon problem (cost accumulates over a finite number of stages):
- Discounted criterion: η_v^N(i) = E[ Σ_{n=0}^{N-1} β^n f(X_n, v(X_n)) | X_0 = i ], with discount factor 0 < β ≤ 1
- Average criterion: η_v^N(i) = (1/N) E[ Σ_{n=0}^{N-1} f(X_n, v(X_n)) | X_0 = i ]

Infinite horizon problem (cost accumulates indefinitely):
- Discounted criterion (infinite horizon expected total discounted cost): η_v^β(i) = E[ Σ_{n=0}^{∞} β^n f(X_n, v(X_n)) | X_0 = i ], 0 < β < 1

- Expected total cost: β = 1 is also possible, e.g. for the stochastic shortest path problem
- Average criterion (expected cost per stage): η_v(i) = lim_{N→∞} (1/N) E[ Σ_{n=0}^{N-1} f(X_n, v(X_n)) | X_0 = i ]

Mathematical models for Markov chains: the optimization problem
- A controlled Markov chain can be denoted by (X, D, P_v, f_v), with
  transition matrix P_v = [ p(j | i, v(i)) ] and
  performance function f_v = ( f(1, v(1)), f(2, v(2)), ..., f(M, v(M)) )
- The optimization objective is to select a policy minimizing the chosen performance criterion, that is v* = argmin_v η_v (or argmin_v η_v^β for the discounted criterion)
- Note that so far the cost is assumed to be received immediately when an action is taken.

- What happens if the cost accumulates with time before the process jumps to the next state?

Semi-Markov decision processes (SMDPs): from Markov chain to SMDP
- If an action is taken at state X_n and the generated cost accrues with time over the period [T_n, T_{n+1}), then the sojourn time T_{n+1} - T_n must be taken into account.

- In that case f(X_n, v(X_n)) represents the cost received per unit time.
- Under a policy v, if the underlying (embedded) chain X_0, X_1, ... is Markov and the distribution of T_{n+1} - T_n depends only on X_n, X_{n+1}, and v(X_n), the model is a semi-Markov decision process (SMDP); if, in particular, that distribution is exponential, the model is a continuous-time MDP.

System evolution of an SMDP (one simulation run / sample path):
- Decision epochs T_0 < T_1 < ... < T_n < T_{n+1} < ..., with sojourn times T_{n+1} - T_n
- States X_0, X_1, X_2, ..., X_n, X_{n+1}, ...
- At epoch T_n the action v(X_n) is taken and the cost rate f(X_n, v(X_n)) applies until T_{n+1}

35、nsition matrix of the , nInfinitesimal ge( , ( ), nerator (nvi jvijvDvvv MXXXPp i v ijAa v ,00( )(, ()|,( ) satisfies ( (1), (2), ()Average-cost performance crite ria (infinite horizon) under polic y 1limNi jvvsTvttNNviEf Xv Xdt Xi iiAdiagMPIT00,If underlying Markov chain for policy is unichain, the

- If the underlying Markov chain for policy v is unichain, then η_v(i) ≡ η_v, independent of the initial state i
- Discounted-cost performance criterion:
  η_v^α(i) = lim_{N→∞} E[ ∫_0^{T_N} e^{-αt} f(X_t, v(X_t)) dt | X_0 = i ],  α > 0
- Note that the conservative matrix (the generator A_v) depends on the policy v.

Problem formulation: the optimization objective
- For a continuous-time MDP or an SMDP, the objective is again to find v* = argmin_v η_v or argmin_v η_v^α
- Stationary distribution of states: π_v = ( π_v(1), π_v(2), ..., π_v(M) )
- Balance equations:
  Markov chain: π_v P_v = π_v,  π_v e = 1
  Continuous-time MDP: π_v A_v = 0,  π_v e = 1
  where e = (1, 1, ..., 1)^T

Potential-based optimization via numerical computation (1): performance potentials
- Definition of the performance potential via the Poisson equation (Cao 1997):
  continuous-time MDP: -A_v g_v + η_v e = f_v
  Markov chain: (I - P_v) g_v + η_v e = f_v
- Sample-path interpretation: g_v(i) = E[ Σ_{n=0}^{∞} ( f(X_n, v(X_n)) - η_v ) | X_0 = i ]
- Performance-potential-based Bellman optimality equation:
  MDP: min_{v∈M} { f_v + A_v g* - η* e } = 0
  MC: min_{v∈M} { f_v + (P_v - I) g* - η* e } = 0, equivalently min_{v∈M} { f_v + P_v g* - g* - η* e } = 0
- Optimization based on potentials has two advantages: the optimization of SMDPs, MDPs, and Markov chains can be unified through potentials, and the optimization algorithms for both the average and the discounted criteria can be unified, whether they are realized by numerical computation or by simulation.
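The following Python/NumPy sketch evaluates a fixed policy of a small discrete-time chain by solving the Poisson equation in the equivalent fundamental-matrix form (I - P + e·π) g = f; the two-state transition matrix and cost vector are invented for illustration.

```python
import numpy as np

def evaluate_policy(P, f):
    """Average cost eta and potentials g of an ergodic chain (P, f).

    pi solves pi P = pi, pi e = 1; g solves the Poisson equation
    (I - P) g + eta e = f, made unique here by the normalization pi g = eta."""
    n = P.shape[0]
    # stationary distribution: solve pi (P - I) = 0 together with sum(pi) = 1
    A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    eta = float(pi @ f)
    # potentials via the fundamental-matrix form g = (I - P + e pi)^{-1} f
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), f)
    return eta, g

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # hypothetical transition matrix
f = np.array([2.0, 10.0])           # hypothetical one-step costs
eta, g = evaluate_policy(P, f)
print(eta, g)
```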

Reinforcement learning of potentials
- A continuous-time MDP (or SMDP) can be treated as a uniformized Markov chain (UMC)
- Definition of the performance potential of the UMC via a sample path:
  g_v(i) = E[ Σ_{n=0}^{∞} ( f(X_n, v(X_n)) - η_v ) | X_0 = i ]

- TD(λ) learning of the performance potential, with η_n the running estimate of the average cost:
  temporal difference: d_n := f(X_n, v(X_n)) - η_n + g_n(X_{n+1}) - g_n(X_n)   (unified temporal difference)
  eligibility trace: z_n(i) := λ z_{n-1}(i) + 1 if i = X_n, and z_n(i) := λ z_{n-1}(i) otherwise
  update: g_{n+1}(i) := g_n(i) + γ_n d_n z_n(i) for every state i
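A compact Python sketch of this TD(λ) potential-learning loop on a simulated two-state chain follows; the chain, the costs, the step sizes, λ, and the running-average update for η are all illustrative choices and only one common variant of the scheme sketched on the slide.

```python
import random

# Hypothetical two-state chain under a fixed policy, with one-step costs.
P = {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.4), (1, 0.6)]}
f = {0: 2.0, 1: 10.0}

def td_lambda_potentials(steps=200000, lam=0.7, seed=0):
    rng = random.Random(seed)
    g = {0: 0.0, 1: 0.0}        # potential estimates
    z = {0: 0.0, 1: 0.0}        # eligibility traces
    eta = 0.0                    # running estimate of the average cost
    x = 0
    for n in range(1, steps + 1):
        # sample the next state from P(. | x)
        r, acc, x_next = rng.random(), 0.0, None
        for j, p in P[x]:
            acc += p
            if r <= acc:
                x_next = j
                break
        d = f[x] - eta + g[x_next] - g[x]           # temporal difference
        for i in z:                                  # decay traces, bump visited state
            z[i] = lam * z[i] + (1.0 if i == x else 0.0)
        for i in z:                                  # TD(lambda) potential update
            g[i] += 0.01 * d * z[i]
        eta += (1.0 / n) * (f[x] - eta)              # average-cost estimate
        x = x_next
    return eta, g

print(td_lambda_potentials())
```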

Semi-Markov decision processes (SMDPs): relations among the models
- In many cases an SMDP is studied by transforming it into a controlled Markov chain, provided the model is known
- E.g. the preventive maintenance problem discussed in the book Simulation-based Optimization
Fig. 5: Relations among the models: SMDP, continuous-time MDP, and discrete-time MDP (Markov chain)

Optimization methods & difficult problems: overview of different optimization methods
- Numerical computation: value iteration, policy iteration, (non)linear programming; for finite horizon models, backward induction (dynamic programming)
  Disadvantages: the model needs to be known, and computing matrix inverses is difficult for large-scale problems!
- Simulation methods: Monte-Carlo methods, reinforcement learning, neuro-dynamic programming
  Is the model known? Yes: TD learning (model-based); no: Q-learning (model-free)

Optimization methods & difficult problems: some variants on the basic model
- The basic and simplest models: Markov chains, where the state space and action set are both finite and the stochastic process is ergodic

- Many harder problems arise in practice:
  - Decisions may be made in continuous time: SMDP
  - There may be a continuum of states or actions (e.g. compact sets)
  - Model parameters may not be known, or may be uncertain: robust decision/simulation methods
  - The system state may not be observable: POMDP or HMM

Chapter 4: Dynamic Programming

Dynamic programming is a branch of operations research and a mathematical method for optimizing multi-stage decision processes. In the early 1950s the American mathematician R.E. Bellman and his colleagues, while studying the optimization of multi-stage decision processes, proposed the famous principle of optimality, which converts a multi-stage process into a series of single-stage problems solved one by one; this created dynamic programming as a new method for this class of problems.

- Multi-stage decision process: a special class of processes that can be decomposed, in time order, into several interrelated stages; a decision must be made at each stage, and the decisions over the whole process form a decision sequence.
- Principle of optimality: an optimal policy for the whole process has the property that, whatever the earlier states and decisions were, the remaining decisions must constitute an optimal sub-policy with respect to the state resulting from the earlier decisions. In other words, every sub-policy of an optimal policy is itself optimal.

Model classification
- By the nature of time: discrete and continuous.
- By whether the information is certain: deterministic and stochastic.
- By the number of objective functions: single-objective and multi-objective.

Deterministic problems
Fig.: The shortest path problem. Bellman equation (backward recursion):
  J*(i) = min_j [ g(i, j) + J*(j) ]

Stochastic problems
Bellman equation (backward recursion):
  J*(i) = min_u E_j [ f(i, u, j) + J*(j) ] = min_u Σ_j p(j | i, u) [ f(i, u, j) + J*(j) ]

My previous work: potential-based policy iteration
- Policy evaluation: obtain the potential g_{v_k} of the current policy v_k by solving the Poisson equation, or by estimation/learning via simulation
- Policy improvement: obtain the next policy by solving an easy subproblem,
  v_{k+1} = argmin_v [ f_v + A_v g_{v_k} ]  or  v_{k+1} = argmin_v [ f_v + P_v g_{v_k} ]
Fig. 6: Illustration of the policy iteration loop
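Finally, a minimal Python/NumPy sketch of this potential-based policy iteration loop for an average-cost model; the two-state, two-action data are invented for illustration, and policy evaluation reuses the Poisson-equation solve shown earlier.

```python
import numpy as np

# Hypothetical model: P[d] and f[d] give the transition matrix and costs under
# action d, applied row by row through the policy v (a look-up table of actions).
P = {0: np.array([[0.9, 0.1], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.7, 0.3]])}
f = {0: np.array([2.0, 10.0]), 1: np.array([5.0, 6.0])}

def build(v):
    Pv = np.array([P[v[i]][i] for i in range(len(v))])
    fv = np.array([f[v[i]][i] for i in range(len(v))])
    return Pv, fv

def evaluate(Pv, fv):
    """Average cost eta and potentials g via (I - Pv + e pi) g = fv."""
    n = Pv.shape[0]
    A = np.vstack([(Pv - np.eye(n)).T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    g = np.linalg.solve(np.eye(n) - Pv + np.outer(np.ones(n), pi), fv)
    return float(pi @ fv), g

def policy_iteration(v):
    while True:
        Pv, fv = build(v)
        eta, g = evaluate(Pv, fv)
        # Policy improvement: v_{k+1}(i) = argmin_d [ f(i,d) + sum_j p(j|i,d) g(j) ]
        v_new = [min(P, key=lambda d: f[d][i] + P[d][i] @ g) for i in range(len(v))]
        if v_new == v:
            return v, eta, g
        v = v_new

print(policy_iteration([0, 0]))
```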
