Multiple Futures Prediction

Yichuan Tang (yichuan_tang), Ruslan Salakhutdinov (rsalakhutdinov)

Abstract

Temporal prediction is critical for making intelligent and robust decisions in complex dynamic environments. Motion prediction needs to model the inherently uncertain future, which often contains multiple potential outcomes due to multi-agent interactions and the latent goals of others. Towards these goals, we introduce a probabilistic framework that efficiently learns latent variables to jointly model the multi-step future motions of agents in a scene. Our framework is data-driven and learns semantically meaningful latent variables to represent the multimodal future, without requiring explicit labels. Using a dynamic attention-based state encoder, we learn to encode the past as well as the future interactions among agents, efficiently scaling to any number of agents. Finally, our model can be used for planning via computing a conditional probability density over the trajectories of other agents given a hypothetical rollout of the self agent. We demonstrate our algorithms by predicting vehicle trajectories on both simulated and real data, achieving state-of-the-art results on several vehicle trajectory datasets.

1 Introduction

The ability to make good predictions lies at the heart of robust and safe decision making. It is especially critical to be able to predict the future motions of all relevant agents in complex and dynamic environments. For example, in the autonomous driving domain, motion prediction is central both to the ability to make high level decisions, such as when to perform maneuvers, and to low level path planning optimizations [34, 28].

Motion prediction is a challenging problem due to the various needs of a good predictive model. The varying objectives, goals, and behavioral characteristics of different agents can lead to multiple possible futures or modes. Agents' states do not evolve independently from one another; rather, they interact with each other. As an illustration, we provide some examples in Fig. 1. In Fig. 1(a), there are a few different possible futures for the blue vehicle approaching an intersection. It can either turn left, go straight, or turn right, forming different modes in trajectory space. In Fig. 1(b), interactions between the two vehicles during a merge scenario show that their trajectories influence each other, depending on who yields to whom.

Besides multimodal interactions, prediction needs to scale efficiently with an arbitrary number of agents in a scene and take into account auxiliary and contextual information, such as map and road information. Additionally, the ability to measure uncertainty by computing probability over likely future trajectories of all agents in closed form (as opposed to Monte Carlo sampling) is of practical importance.

Despite a large body of work in temporal motion prediction [24, 7, 13, 26, 16, 2, 30, 8, 39], existing state-of-the-art methods often capture only a subset of the aforementioned features. For example, algorithms are either deterministic, not multimodal, or do not fully capture both past and future interactions. Multimodal techniques often require the explicit labeling of modes prior to training. Models which perform joint prediction often assume the number of agents present to be fixed [36, 31].

We tackle these challenges by proposing a unifying framework that captures all of the desirable features mentioned earlier. Our framework, which we call Multiple Futures Predictor (MFP), is a sequential probabilistic latent variable generative model that learns directly from multi-agent trajectory data. Training maximizes a variational lower bound on the log-likelihood of the data. MFP learns to model multimodal interactive futures jointly for all agents, while using a novel factorization technique to remain scalable to an arbitrary number of agents. After training, MFP can compute both conditional and unconditional trajectory probabilities in closed form, without requiring any Monte Carlo sampling.

Figure 1: Examples illustrating the need for multimodal interactive predictions. (a) Multiple possible future trajectories: there are a few possible modes for the blue vehicle. (b) Scenario A: green yields to blue. (c) Scenario B: blue yields to green. Panels (b) and (c) are time-lapsed visualizations of how interactions between agents influence each other's trajectories.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

MFP builds on the Seq2seq [32] encoder-decoder framework by introducing latent variables and using a set of parallel RNNs (with shared weights) to represent the set of agents in a scene. Each RNN takes on the point-of-view of its agent and aggregates historical information for sequential temporal prediction for that agent. Discrete latent variables, one per RNN, automatically learn semantically meaningful modes to capture multimodality without explicit labeling. MFP can further be trained efficiently and jointly, end-to-end, for all agents in the scene.

To summarize, we make the following contributions with the proposed MFP. First, semantically meaningful latent variables are automatically learned from trajectory data without labels. This addresses the multimodality problem. Second, interactive and parallel step-wise rollouts are performed for all agents in the scene. This addresses the modeling of interactions between actors during future prediction, see Sec. 3.1. We further propose a dynamic attentional encoding which captures both the relationships between agents and the scene context, see Sec. 3.1. Finally, MFP is capable of performing hypothetical inference: evaluating the conditional probability of agents' trajectories conditioned on fixing the trajectory of one or more agents, see Sec. 3.2.

2 Related Work

The problem of predicting future motion for dynamic agents has been well studied in the literature. The bulk of classical methods focus on using physics-based dynamic or kinematic models [38, 21, 25]. These approaches include Kalman filters and maneuver-based methods, which compute the future motion of agents by propagating their current state forward in time. While these methods perform well for short time horizons, longer horizons suffer due to the lack of interaction and context modeling.

The success of machine learning and deep learning ushered in a variety of data-driven recurrent neural network (RNN) based methods [24, 7, 13, 26, 16, 2]. These models often combine RNN variants, such as LSTMs or GRUs, with encoder-decoder architectures such as conditional variational autoencoders (CVAEs). These methods eschew physics-based dynamic models in favor of learning generic sequential predictors (e.g. RNNs) directly from data. Converting raw input data to input features can also be learned, often by encoding rasterized inputs using CNNs [7, 13].

Methods that can learn multiple future modes have been proposed in [16, 24, 13]. However, [16] explicitly labels six maneuvers/modes and learns to separately classify these modes. [24, 13] do not require mode labeling, but they also do not train in an end-to-end fashion by maximizing the data log-likelihood of the model. Most of the methods in the literature encode the past interactions of agents in a scene; however, prediction is often an independent rollout of a decoder RNN, independent of other future predicted trajectories [16, 29]. Encoding of spatial relationships is often done by placing other agents in a fixed and spatially discretized grid [16, 24].

In contrast, MFP proposes a unifying framework which exhibits the aforementioned features. To summarize, we present a feature comparison of MFP with some of the recent methods in the supplementary materials.

Figure 2: Graphical model and computation graph of the MFP. See text for details. Best viewed in color. (a) Graphical model of the MFP. Solid nodes denote observed variables. Cross-agent interaction edges are shaded for clarity. $x_t$ denotes both the state and contextual information from timesteps 1 to $t$. (b) Architecture of the proposed MFP. The circular world node contains the world state and positions of all agents. Diamond nodes are deterministic, while the circular $z^n$ are discrete latent random variables.

3 Multiple Futures Prediction

We tackle motion prediction by formulating a probabilistic framework for a continuous-space but discrete-time system with a finite (but variable) number $N$ of interacting agents. We represent the joint state of all $N$ agents at time $t$ as $X_t = \{x_t^1, x_t^2, \ldots, x_t^N\} \in \mathbb{R}^{N \times d}$, where $d$ is the dimensionality of each state (see Footnote 1) and $x_t^n \in \mathbb{R}^d$ is the state of the $n$-th agent at time $t$. With a slight abuse of notation, we use the superscripted $X^n = \{x_{t-\tau}^n, x_{t-\tau+1}^n, \ldots, x_t^n\}$ to denote the past states of the $n$-th agent and $X = X_{t-\tau:t}^{1:N}$ to denote the joint agent states from time $t-\tau$ to $t$, where $\tau$ is the number of past history steps. The future state at time $\delta$ of all agents is denoted by $Y_\delta = \{y_\delta^1, y_\delta^2, \ldots, y_\delta^N\}$, and the future trajectory of agent $n$, from time $t$ to time $T$, is denoted by $Y^n = \{y_t^n, y_{t+1}^n, \ldots, y_T^n\}$. $Y = Y_{t:t+T}^{1:N}$ denotes the joint state of all agents for the future timesteps. Contextual scene information, e.g. a rasterized image in $\mathbb{R}^{h \times w \times 3}$ of the map, could be useful by providing important cues. We use $I_t$ to represent any contextual information at time $t$.

Footnote 1: We assume states are fully observable and are the agents' $(x, y)$ coordinates on the ground plane ($d = 2$).

The goal of motion prediction is then to accurately model $p(Y \mid X, I_t)$. As in most sequential modeling tasks, it is both inefficient and intractable to model $p(Y \mid X, I_t)$ jointly. RNNs are typically employed to sequentially model the distribution in a cascade form. However, there are two major challenges specific to our multi-agent prediction framework. (1) Multimodality: optimizing vanilla RNNs via backpropagation through time will lead to mode-averaging, since the map from $X$ to $Y$ is not a function but rather a one-to-many map. In other words, multimodality means that for a given $X$ there could be multiple distinct modes, each placing significant probability on a different sequence $Y$. (2) Variable-Agents: the number of agents $N$ is variable and unknown, and therefore we cannot simply vectorize $X_t$ as the input to a standard RNN at time $t$.

For multimodality, we introduce a set of stochastic latent variables $z^n \sim \mathrm{Multinoulli}(K)$, one per agent, where $z^n$ can take on $K$ discrete values. The intuition here is that $z^n$ would learn to represent intentions (left/right/straight) and/or behavior modes (aggressive/conservative). Learning maximizes the marginalized distribution, where $z$ is free to learn any latent behavior so long as it helps to improve the data log-likelihood. Each $z$ is conditioned on $X$ at the current time (before future prediction) and will influence the distribution over future states $Y$. A key feature of the MFP is that $z^n$ is sampled only once at time $t$ and must be consistent for the next $T$ time steps. Compared to sampling $z^n$ at every timestep, this leads to better tractability and more realistic intention/goal modeling, as we will discuss in more detail later. We now arrive at the following distribution:

$$\log p(Y \mid X, I) = \log\Big(\sum_{Z} p(Y, Z \mid X, I)\Big) = \log\Big(\sum_{Z} p(Y \mid Z, X, I)\, p(Z \mid X, I)\Big), \tag{1}$$

where $Z$ denotes the joint latent variables of all agents. Naively optimizing for Eq. 1 is prohibitively expensive and not scalable, as the number of agents and timesteps may become large. In addition, the maximum number of possible modes is exponential: $O(K^N)$. We first make the model more tractable by factorizing across time, followed by factorization across agents. The joint future distribution $Y$ assumes the form of a product of conditional distributions:

$$p(Y \mid Z, X, I) = \prod_{\delta=t+1}^{T} p(Y_\delta \mid Y_{t:\delta-1}, Z, X, I), \tag{2}$$

$$p(Y_\delta \mid Y_{t:\delta-1}, Z, X, I) = \prod_{n=1}^{N} p(y_\delta^n \mid Y_{t:\delta-1}, z^n, X, I). \tag{3}$$

The second factorization is sensible because the factorial component conditions on the joint states of all agents in the immediately preceding timestep, where the typical temporal delta is very short (e.g. 100 ms). Also note that the future distribution of the $n$-th agent is explicitly dependent on its own mode $z^n$ but only implicitly dependent on the latent modes of other agents, by re-encoding the other agents' predicted states $y_\delta^m$ (please see the discussion later and also Sec. 3.1). Conditioning explicitly only on an agent's own latent mode is both more scalable computationally and more realistic: agents in the real world can only infer other agents' latent goals/intentions by observing their states. Finally, our overall objective from Eq. 1 can be written as:

$$\log\Big(\sum_{Z} p(Y \mid Z, X, I)\, p(Z \mid X, I)\Big) = \log\Big(\sum_{Z} \Big[\prod_{\delta=t+1}^{T} \prod_{n=1}^{N} p(y_\delta^n \mid Y_{t:\delta-1}, z^n, X, I)\Big] \prod_{n=1}^{N} p(z^n \mid X, I)\Big) \tag{4}$$

$$= \log\Big(\sum_{Z} \prod_{n=1}^{N} p(z^n \mid X, I) \prod_{\delta=t+1}^{T} p(y_\delta^n \mid Y_{t:\delta-1}, z^n, X, I)\Big). \tag{5}$$

The graphical model of the MFP is illustrated in Fig. 2a. While we show only three agents for simplicity, MFP can easily scale to any number of agents. Nonlinear interactions among agents make $p(y_\delta^n \mid Y_{t:\delta-1}, z^n, X, I)$ complicated to model. Recurrent neural networks are powerful and flexible models that can efficiently capture and represent long-term dependencies in sequential data. At a high level, RNNs introduce deterministic hidden units $h_t$ at every timestep $t$, which act as features or embeddings that summarize all of the observations up until time $t$. At time step $t$, an RNN takes as its input the observation $x_t$ and the previous hidden representation $h_{t-1}$, and computes the update $h_t = f_{rnn}(x_t, h_{t-1})$. The prediction $y_t$ is computed from the decoding layer of the RNN, $y_t = f_{dec}(h_t)$. $f_{rnn}$ and $f_{dec}$ are recursively applied at every timestep of the sequence.

Fig. 2b shows the computation graph of the MFP. A point-of-view (PoV) transformation $\varphi^n(X_t)$ is first used to transform the past states to each agent's own reference frame by translation and rotation, such that the +x-axis aligns with the agent's heading. We then instantiate an encoding and a decoding RNN (see Footnote 2) per agent. Each encoding RNN is responsible for encoding the past observations $x_{t-\tau:t}$ into a feature vector. Scene context is transformed via a convolutional neural network into its own feature. The features are combined via a dynamic attention encoder, detailed in Sec. 3.1, to provide inputs both to the latent variables and to the ensuing decoding RNNs. During predictive rollouts, the decoding RNN will predict its own agent's state at every timestep. The predictions will be aggregated and subsequently transformed via $\varphi^n(\cdot)$, providing inputs to every agent/RNN for the next timestep. Latent variables $Z$ provide extra inputs to the decoding RNNs to enable multimodality. Finally, the output $y_t^n$ consists of a 5-dimensional vector governing a bivariate normal distribution: $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and correlation coefficient $\rho$.

Footnote 2: We use GRUs [10]. LSTMs and GRUs perform similarly, but GRUs were slightly faster computationally.

While we instantiate two RNNs per agent, these RNNs share the same parameters across agents, which means we can efficiently perform joint predictions by combining inputs in a minibatch, allowing us to scale to an arbitrary number of agents. Making $Z$ discrete and having only one set of latent variables influencing subsequent predictions is also a deliberate choice. We would like $Z$ to model modes generated by high level intentions such as left/right lane changes or conservative/aggressive modes of agent behavior. These latent behavior modes also tend to stay consistent over the time horizon which is typical of motion prediction (e.g. 5 seconds).

Learning

Given a set of training trajectory data $\mathcal{D} = \{(X^{(i)}, Y^{(i)}), \ldots\}_{i=1,2,\ldots,|\mathcal{D}|}$, we use maximum likelihood estimation (MLE) to estimate the parameters $\theta^* = \arg\max_\theta \mathcal{L}(\theta, \mathcal{D})$ that achieve the maximum marginal data log-likelihood:

$$\mathcal{L}(\theta, \mathcal{D}) = \log p(Y \mid X; \theta) = \log\Big(\sum_{Z} p(Y, Z \mid X; \theta)\Big) = \sum_{Z} p(Z \mid Y, X; \theta) \log \frac{p(Y, Z \mid X; \theta)}{p(Z \mid Y, X; \theta)}. \tag{6}$$

Optimizing for Eq. 6 directly is non-trivial, as the posterior distribution is not only hard to compute but also varies with $\theta$. We can, however, decompose the log-likelihood into the sum of the evidence lower bound (ELBO) and the KL-divergence between the true posterior and an approximating posterior $q(Z)$ [27]:

$$\log p(Y \mid X; \theta) = \sum_{Z} q(Z \mid Y, X) \log \frac{p(Y, Z \mid X; \theta)}{q(Z \mid Y, X)} + D_{KL}(q \,\|\, p) \;\geq\; \sum_{Z} q(Z \mid Y, X) \log p(Y, Z \mid X; \theta) + H(q), \tag{7}$$

where Jensen's inequality is used to arrive at the lower bound, $H$ is the entropy function, and $D_{KL}(q \,\|\, p)$ is the KL-divergence between the true and approximating posterior. We learn by maximizing the variational lower bound on the data log-likelihood, first using the true posterior at the current $\theta'$ as the approximating posterior: $q(Z \mid Y, X) = p(Z \mid Y, X; \theta')$. We can then fix the approximate posterior and optimize the model parameters for the following function:

$$Q(\theta, \theta') = \sum_{Z} p(Z \mid Y, X; \theta') \log p(Y, Z \mid X; \theta) = \sum_{Z} p(Z \mid Y, X; \theta') \big\{ \log p(Y \mid Z, X; \theta_{rnn}) + \log p(Z \mid X; \theta_Z) \big\}, \tag{8}$$

where $\theta = \{\theta_{rnn}, \theta_Z\}$ denotes the parameters of the RNNs and the parameters of the network layers for predicting $Z$. As our latent variables $Z$ are discrete and have small cardinality (e.g. < 10), we can compute the posterior exactly for a given $\theta'$. The RNN parameter gradients are computed from $\partial Q(\theta, \theta') / \partial \theta_{rnn}$, and the gradient for $\theta_Z$ is $\partial\, KL\big(p(Z \mid Y, X; \theta') \,\|\, p(Z \mid X; \theta_Z)\big) / \partial \theta_Z$.

Our learning algorithm is a form of the EM algorithm [14], where for the M-step we optimize the RNN parameters using stochastic gradient descent. By integrating out the latent variable $Z$, MFP learns directly from trajectory data, without requiring any annotations or weak supervision for latent modes. We provide detailed training algorithm pseudocode in the supplementary materials.

Classmates-forcing

Teacher forcing is a standard technique (albeit biased) to accelerate RNN and sequence-to-sequence training by using
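As an illustration of the learning step in Eqs. 6 to 8, the following is a minimal NumPy sketch for a single agent with K discrete modes. It is not the authors' implementation: the function names (`e_step`, `q_function`), the toy per-mode log-likelihoods, and the prior logits are all assumptions made for the example. It computes the exact posterior over modes, which is feasible because K is small, and checks numerically that the lower bound of Eq. 7 is tight when q is the true posterior.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = values.max()
    return m + np.log(np.exp(values - m).sum())


def e_step(log_lik, prior_logits):
    """Exact posterior p(z | Y, X) over K modes for one agent.

    log_lik[k]      : log p(Y | z=k, X), e.g. summed over future timesteps
    prior_logits[k] : unnormalized log p(z=k | X)
    """
    log_joint = log_lik + log_softmax(prior_logits)  # log p(Y, z=k | X)
    log_marginal = logsumexp(log_joint)              # log p(Y | X)
    posterior = np.exp(log_joint - log_marginal)     # p(z=k | Y, X)
    return posterior, log_joint, log_marginal


def q_function(posterior, log_joint):
    """Q = sum_z p(z | Y, X; theta') * log p(Y, z | X; theta), cf. Eq. 8."""
    return float(np.sum(posterior * log_joint))


# Toy values (hypothetical): mode 1 explains the observed trajectory best.
log_lik = np.array([-12.0, -3.5, -7.0])
prior_logits = np.array([0.2, -0.1, 0.4])

posterior, log_joint, log_marginal = e_step(log_lik, prior_logits)
entropy = float(-np.sum(posterior * np.log(posterior)))
elbo = q_function(posterior, log_joint) + entropy  # right-hand side of Eq. 7
```

In an actual M-step, `posterior` would be held fixed while `q_function` is maximized with respect to the network parameters by stochastic gradient descent; here the quantities are only evaluated, not optimized.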

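Section 3 states that each decoder output is a 5-dimensional vector governing a bivariate normal distribution. A hedged sketch of how such an output could be scored against an observed position is given below. The `constrain` step (exponential for the standard deviations, tanh for the correlation) is an assumption introduced here to keep the parameters valid; the text above does not specify how the raw outputs are constrained.

```python
import numpy as np


def constrain(raw):
    """Map 5 unconstrained outputs to (mu_x, mu_y, sigma_x, sigma_y, rho).

    The exp/tanh squashing is an illustrative assumption, not taken from
    the paper: it only guarantees sigma > 0 and -1 < rho < 1.
    """
    mu_x, mu_y, s_x, s_y, r = raw
    return mu_x, mu_y, np.exp(s_x), np.exp(s_y), np.tanh(r)


def bivariate_normal_logpdf(xy, params):
    """Log-density of a 2-D Gaussian given means, std devs, and correlation."""
    mu_x, mu_y, sigma_x, sigma_y, rho = params
    dx = (xy[0] - mu_x) / sigma_x
    dy = (xy[1] - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho ** 2
    quad = (dx ** 2 - 2.0 * rho * dx * dy + dy ** 2) / one_minus_rho2
    log_norm = np.log(2.0 * np.pi * sigma_x * sigma_y) + 0.5 * np.log(one_minus_rho2)
    return float(-0.5 * quad - log_norm)


# Hypothetical raw decoder output and observed (x, y) position for one step.
raw_output = np.array([1.0, -0.5, 0.1, -0.3, 0.8])
params = constrain(raw_output)
observed_xy = np.array([1.4, -0.2])
log_p = bivariate_normal_logpdf(observed_xy, params)
```

Summing such per-timestep log-densities over the future horizon for a fixed mode gives the kind of per-mode trajectory log-likelihood that enters Eq. 5.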