Title: A Game-Theoretic Framework for Joint Forecasting and Planning
URL Source: https://arxiv.org/html/2308.06137
Markdown Content: Kushal Kedia¹, Prithwish Dan¹, Sanjiban Choudhury¹ — ¹The authors are affiliated with Cornell University, Ithaca, New York, USA. {kk837, pd337, sc2582}@cornell.edu
Abstract
Planning safe robot motions in the presence of humans requires reliable forecasts of future human motion. However, simply predicting the most likely motion from prior interactions does not guarantee safety. Such forecasts fail to model the long tail of possible events, which are rarely observed in limited datasets. On the other hand, planning for worst-case motions leads to overly conservative behavior and a “frozen robot”. Instead, we aim to learn forecasts that predict counterfactuals that humans guard against. We propose a novel game-theoretic framework for joint planning and forecasting, with the payoff being the performance of the planner against the demonstrator, and present practical algorithms to train models in an end-to-end fashion. We demonstrate that our proposed algorithm results in safer plans in a crowd navigation simulator and on real-world datasets of pedestrian motion. We release our code at https://github.com/portal-cornell/Game-Theoretic-Forecasting-Planning.
I Introduction
One of the greatest challenges in robotics and AI is reasoning about interaction with other agents in the world. The ability to forecast how other agents behave in response to a robot’s decisions is key to enabling safe, interpretable, and responsive behavior. Consider a self-driving car negotiating a left turn at a busy intersection, or a personal robot collaboratively cooking with a human in a kitchen. The robot has to both yield to the human at times, and show intent to go ahead at other times. To do so, it must rely on forecasts that are conservative enough to predict rare but risky events, but not so conservative that the robot stays frozen in place.
Current forecasting approaches are primarily based on Maximum Likelihood Estimation (MLE). For instance, in self-driving, state-of-the-art forecasters [1, 2] are typically trained with an L2 loss between the observed future motions of actors and the predicted motions, on data collected off-policy. A planner then uses the forecast to compute a safe, collision-free path. However, a forecaster trained purely on observed data fails to predict rare but risky events. The distribution of motions exhibits a long tail, and accurately modeling these diverse rare events would require exorbitant amounts of data.
Consider the example in Fig. 1 of a self-driving car driving alongside a cyclist. We observe humans leave their lane to give space to a cyclist, even while actively occupying the lane of an oncoming car. However, an MLE forecaster will likely predict the cyclist staying in their lane. This results in plans that fail to guard against possible rare events, such as the cyclist accidentally coming into the lane. An alternate approach is to reason about the worst-case outcome given the reachability of the cyclist [3, 4]. But this can lead to overly conservative behavior where the robot stays perpetually behind the cyclist.
We propose a new objective for training forecasters. We argue that MLE loss on observed motions from a finite dataset is fundamentally insufficient due to lack of coverage in the dataset. Instead, we view the problem through the lens of imitation learning. Our key insight is that humans don’t just plan for things that are likely to happen, but plan contingencies for counterfactuals that could possibly happen. We aim to learn forecasts that enable a planner to guard as well as the demonstrator against any possible rare event. We propose a game-theoretic framework for jointly training planners and forecasters, and use no-regret learning to solve for the approximate equilibrium of the game. Our key contributions are summarized as follows:
- A novel, game-theoretic framework for joint forecasting and planning that guarantees performance with respect to demonstrations.
- Practical algorithms and architectures for joint forecasting and planning for multi-agent navigation.
- Empirical evaluation on a crowd navigation simulator and real-world pedestrian datasets.
Figure 1: MLE forecasts fail to predict rare, hazardous events, like a bicyclist suddenly veering into a car’s lane. We propose to learn adversarial forecasts that enable a planner to guard against such events.
Figure 2: Overview of our game-theoretic framework for joint forecasting and planning. The forecaster maximizes the performance difference between the generated plans and the observed plans. This results in counterfactual forecasts for the cyclist veering into the vehicle’s lane that encourages the planner to guard by nudging away from the cyclist.
II Related Work
We focus on the problem of planning for robot decisions in the presence of humans in the workspace. The intended motions of the humans are unknown to the robot as it plans a sequence of actions that maximize the reward function for a given task. We look at various clusters of related work.
Multi-agent games Multi-agent games are formulated as either Stackelberg or Nash equilibrium-finding problems. [5] models human action as a function of the robot’s action and the system’s current state. [6] extends this model to distinguish between different types of human actors (attentive and distracted). [7] solves for the global Nash equilibrium by using the Hamilton-Jacobi-Bellman equation of optimal control. Trajectory optimization utilized by [8] includes a sensitivity term that allows the vehicle to reason about how much the other vehicles will yield to avoid collisions. To capture the problem’s constraints effectively, [9] enforces collision avoidance through augmented Lagrangian constraints instead of imposing a large penalty on collision while constructing the objective functions. However, unlike our work, solving these games requires full information about the cost of each agent.
Autonomous Driving In the autonomous driving domain, this problem is broken down into game-theoretic interactions to represent lane-merging, intersection crossing, pedestrian management, etc. Works in this domain have modeled the game by a Stackelberg equilibrium [10, 6, 5] where the behavior of a leader is fixed and best-response strategy is learned for the follower, or by a Nash equilibrium [9, 7, 11, 8], where agents follow strategies such that their objectives cannot be improved upon unilateral deviation. In this paper, we focus on the unstructured domain of pedestrian navigation, where there is large variability in human behavior and the space of possible motions is larger.
Forecasting Human Motion Real-world human movements constitute a broad multimodal distribution containing inherent uncertainty and noise. Prior works have focused on effective model architectures and appropriate representations of interactions between agents. Human-robot and human-human interactions can be effectively modeled with a self-attention mechanism [12] for robot planning. Multimodal deep generative models [13] for trajectory forecasting can effectively represent diverse human behavior. Trajectron++ [14] produces dynamically-feasible predictions by incorporating dynamics constraints into learned multi-agent trajectory forecasting. Representation learning [15] of safer motion representations can be facilitated by contrastive estimation from simulated negative behavior. The problem of transfer-learning forecasting models across different datasets has been tackled by explicitly modeling styles and noise confounders [16]. However, the focus of the forecasting literature has been largely restricted to task-agnostic accuracy-based metrics such as average displacement error (ADE) and final displacement error (FDE). In domains such as self-driving, however, it is important to use task-specific metrics that prioritize the safety of the planner that consumes the forecasts [17]. This has motivated the community to rethink appropriate metrics for planner-centric evaluation of forecasting [18]. In this work, we are ultimately interested in generating forecasts that optimize the final performance of our downstream planner.
III Problem Formulation
We model the forecasting problem as a multi-agent Contextual Markov Decision Process (CMDP), where one of the agents is the robot and the rest are humans. Let $\phi$ denote the context – a history of past states and actions for all the agents and the current scene. We assume contexts are drawn from a distribution $P(\phi)$. Let $\xi_R = (s^R_1, a^R_1, s^R_2, a^R_2, \dots, s^R_T, a^R_T)$ be the $T$-horizon trajectory of the robot.

Let $\xi_H = (s^H_1, a^H_1, s^H_2, a^H_2, \dots, s^H_T, a^H_T)$ be the trajectory of all the other human agents. Let $c(\xi_R, \xi_H)$ denote the cost of a robot-human trajectory pair. This captures terms like safety, burden, and deviation from the nominal path.
We assume access to demonstrations of human and robot trajectories drawn from a joint distribution $P(\xi_R, \xi_H | \phi)$. We aim to learn a conditional distribution over robot trajectories $P_\psi(\xi_R | \xi_H, \phi)$, where $\psi$ are the learned parameters, that minimizes the average performance difference with respect to the demonstrator:

$$\mathbb{E}_{\phi}\Big[\underset{\substack{\xi_H \sim P(\cdot|\phi) \\ \hat{\xi}_R \sim P_\psi(\cdot|\xi_H,\phi)}}{\mathbb{E}} c(\hat{\xi}_R, \xi_H) \;-\; \underset{\xi_H, \xi_R \sim P(\cdot|\phi)}{\mathbb{E}} c(\xi_R, \xi_H)\Big] \tag{1}$$
III-A Pitfalls of Planning with MLE forecast
A template for solving Eq. 1 is to first train a forecaster to approximate the human trajectory distribution $P_\theta(\xi_H|\phi) \approx P(\xi_H|\phi)$, where $\theta$ are learned parameters. A common way is to train a Maximum Likelihood Estimator (MLE).
Definition 1 (MLE-Forecaster)
Given a dataset $\mathcal{D} = \{(\phi, \xi_H)\}$ of contexts and human trajectories, the goal of the MLE forecaster is to maximize the likelihood of observed trajectories:

$$\max_\theta \; \mathbb{E}_{\phi, \xi_H} \log P_\theta(\xi_H|\phi) \tag{2}$$
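As a concrete illustration of Eq. 2 (not the paper's architecture), consider fitting a single Gaussian to observed human displacements by maximum likelihood. The dataset below is synthetic and purely illustrative: a common in-lane mode plus a rare veering mode. The closed-form MLE concentrates near the common mode and under-weights the rare, high-cost event, previewing the pitfall discussed next:

```python
import numpy as np

# Toy MLE forecaster: model the human's next displacement as a single
# Gaussian fit by maximum likelihood (Eq. 2). Real forecasters are deep
# networks conditioned on the context phi; this sketch only illustrates
# the objective on synthetic, assumed data.
rng = np.random.default_rng(0)

# Mostly straight motion, with a rare lane-veering mode.
common = rng.normal(loc=0.0, scale=0.1, size=990)  # stays in lane
rare = rng.normal(loc=1.0, scale=0.1, size=10)     # veers into lane
data = np.concatenate([common, rare])

# Closed-form MLE for a Gaussian: sample mean and variance.
mu_mle = data.mean()
var_mle = data.var()

# The MLE forecast concentrates near the common mode (~0), assigning
# low likelihood to the rare, high-cost veering event near 1.0.
print(mu_mle, var_mle)
```

Because the rare mode holds only 1% of the probability mass, it barely moves the fitted mean, yet it is exactly the event a safe planner must guard against.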
A nominal approach to planning is to minimize cost with respect to this learnt forecast.
Definition 2 (Nom-Planner)
Given access to an MLE-Forecaster $P_\theta(\xi_H|\phi)$, a nominal planner minimizes the expected cost w.r.t. the forecasts:

$$\min_\psi \; \mathbb{E}_\phi \underset{\substack{\hat{\xi}_H \sim P_\theta(\cdot|\phi) \\ \hat{\xi}_R \sim P_\psi(\cdot|\hat{\xi}_H,\phi)}}{\mathbb{E}} c(\hat{\xi}_R, \hat{\xi}_H) \tag{3}$$
While MLE-Forecaster+Nom-Planner is industry standard, the framework has two fundamental problems:
- Failure to predict rare but risky events: The MLE loss is dominated by events that occur frequently in the data. It fails to predict events $\xi_H$ that occur rarely in the data. However, these events can be quite risky, i.e., even if $P(\xi_H|\phi)$ is small, the cost $c(\xi_R, \xi_H)$ of the planner can be very large.
- Small forecasting errors lead to large planning errors: The MLE loss optimizes the KL-divergence $KL(P(\xi_H|\phi)\,||\,P_\theta(\xi_H|\phi))$, which is mismatched with the performance difference in (1). Formally, a bounded KL divergence implies a Total Variation (TV) distance bound of $\epsilon$ (Pinsker's inequality). However, a small error in forecasting could result in an approximation error of $C_{max}\epsilon$ in the downstream planner's performance, where $C_{max}$ is the maximum cost of a trajectory.
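The second pitfall can be checked numerically. In the toy two-outcome example below (all numbers are illustrative assumptions), the learned forecast is within TV distance $\epsilon$ of the truth, yet the expected planner cost differs by exactly $C_{max}\epsilon$:

```python
import numpy as np

# Numeric check of the C_max * eps amplification: two forecast
# distributions over {safe, risky} human motions that are eps-close in
# Total Variation can differ in expected planner cost by C_max * eps.
eps = 0.01           # TV distance between true and learned forecasts
c_max = 100.0        # assumed maximum trajectory cost (a collision)

costs = np.array([0.0, c_max])       # cost of [safe, risky] outcomes
p_true = np.array([1.0 - eps, eps])  # truth: rare risky event exists
p_mle = np.array([1.0, 0.0])         # MLE forecast: ignores the tail

tv = 0.5 * np.abs(p_true - p_mle).sum()
gap = abs(costs @ p_true - costs @ p_mle)

print(tv, gap)  # TV = eps, expected-cost gap = c_max * eps
```

A forecaster that looks nearly perfect by distributional metrics can thus be maximally misleading for the downstream planner when the tail event carries the maximum cost.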
IV Approach
We present a novel game-theoretic framework for joint forecasting and planning. We also present a concrete application of the framework in a multi-agent navigation setting.
Figure 3: Model architecture for joint forecasting and planning. Agents in the scene are represented by nodes in the graph. Each agent has two outputs: a forecast and a plan. We first apply self-attention over the encoded contexts for each node, which are decoded into adversarial forecasts; these are then used by the planner to generate safe plans for each agent.
IV-A Game-Theoretic Framework for Forecasting / Planning
In Section III-A, we discussed two fundamental problems with the MLE approach: (1) failure to predict rare events, and (2) a loss mismatched with the performance difference (Eq. 1). We now propose an approach that addresses both problems. Our key insight is that humans don’t just plan for things that are likely to happen, but plan contingencies for counterfactuals that could possibly happen. For example, in Fig. 1, the human plans a path $\xi_R$ that guards against the counterfactual that the bicyclist may accidentally veer into their lane. We aim to learn forecasts that don’t just predict likely motions, but predict counterfactuals that humans guard against.
We view the problem through the lens of inverse optimal control (IOC) [19]. IOC aims to learn a cost function that explains demonstrated behavior. Here, we aim to learn a forecast (which in turn defines the cost function) that explains demonstrated behavior. IOC in this setting is best understood as a two-player zero-sum game [20] between a forecaster and a planner. The forecaster generates forecasts $\hat{\xi}_H$ that increase the performance difference between the planner and the demonstrator. The planner generates plans $\hat{\xi}_R$ that decrease the performance difference. Finally, to ensure that the forecasts are not completely unrealistic, we constrain them to be within a $\delta$-ball of the observed distribution:
$$\max_{\theta}\,\min_{\psi}\; \mathbb{E}_{\phi}\Big[\underset{\substack{\hat{\xi}_H \sim P_\theta(\cdot|\phi) \\ \hat{\xi}_R \sim P_\psi(\cdot|\hat{\xi}_H,\phi)}}{\mathbb{E}} c(\hat{\xi}_R, \hat{\xi}_H) \;-\; \underset{\substack{\hat{\xi}_H \sim P_\theta(\cdot|\phi) \\ \xi_R \sim P(\cdot|\phi)}}{\mathbb{E}} c(\xi_R, \hat{\xi}_H)\Big]$$

$$\text{s.t.}\quad \mathbb{E}_{\phi}\big[KL\big(P_\theta(\hat{\xi}_H|\phi)\,\|\,P(\xi_H|\phi)\big)\big] \leq \delta$$
We aim to compute a (near-optimal) $\epsilon$-equilibrium of the game above, which would result in a planner $P_\psi$ that bounds the original objective (1) by $\epsilon$ as well. Following the arguments in [20], since the game is bilinear in both $P_\theta$ and $P_\psi$, playing a no-regret strategy for both the forecaster and the planner guarantees finding the $\epsilon$-equilibrium.
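The no-regret argument can be visualized on a minimal stand-in: the scalar bilinear zero-sum game $f(x, y) = xy$ on $[-1,1]^2$, whose unique equilibrium is $(0, 0)$. Running projected online gradient descent for both players drives the *averaged* iterates toward the equilibrium, mirroring the structure (though not the actual losses) of the forecaster/planner game:

```python
import numpy as np

# No-regret play in the bilinear game f(x, y) = x * y on [-1, 1]^2.
# The min player controls x, the max player controls y; both run
# projected online gradient descent with a standard 1/sqrt(t) step.
# Last iterates cycle, but the averaged iterates approach (0, 0).
def project(z):
    return float(np.clip(z, -1.0, 1.0))

x, y = 1.0, -1.0
xs, ys = [], []
for t in range(1, 5001):
    eta = 1.0 / np.sqrt(t)      # no-regret step size
    gx, gy = y, x               # grad_x f = y, grad_y f = x
    x = project(x - eta * gx)   # min player descends
    y = project(y + eta * gy)   # max player ascends
    xs.append(x)
    ys.append(y)

x_avg, y_avg = np.mean(xs), np.mean(ys)
print(x_avg, y_avg)  # both near the equilibrium value 0
```

In the paper's game the players are distributions parameterized by $\theta$ and $\psi$ rather than scalars, but the same averaged-iterate guarantee underlies Algorithm 1.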
Input: Dataset $\mathcal{D} = \{(\phi, \xi_R, \xi_H)\}$
Output: Adv-Forecaster $P_\theta(\cdot|\phi)$, Safe-Planner $P_\psi(\cdot|\phi)$

1. Initialize $\theta_1$ with MLE-Forecaster
2. Initialize $\psi_1$ with Nom-Planner
3. For $i = 1 \dots N$:
   - Invoke the current forecaster $P_{\theta_i}$ and planner $P_{\psi_i}$ on $\mathcal{D}$ to create a dataset $\{(\phi, \xi_H, \xi_R, \hat{\xi}^i_H, \hat{\xi}^i_R)\}$
   - Update planner: $\psi_{i+1} \leftarrow \psi_i - \nabla_\psi \ell^i(P_\psi)$
   - Update forecaster: $\theta_{i+1} \leftarrow \theta_i - \nabla_\theta \ell^i(P_\theta)$
4. Return $P_{\theta_N}(\cdot|\phi)$, $P_{\psi_N}(\cdot|\phi)$

Algorithm 1: Adv-Forecaster / Safe-Planner
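To make the structure of Algorithm 1 concrete, here is a deliberately tiny 1-D sketch with scalar parameters: `theta` is the forecasted human position, `psi` the planned robot position. The hinge collision cost, the demonstration values (`xi_h`, `xi_r`), and the Gaussian negative-log-likelihood surrogate for the $\lambda \log P_\theta$ term are all illustrative assumptions, not the paper's models; only the alternating no-regret updates are the point:

```python
# 1-D toy of Algorithm 1: alternating gradient updates between an
# adversarial forecaster (theta) and a planner (psi). All quantities
# are assumed stand-ins for the paper's trajectory distributions.
xi_h = 0.0   # demonstrated human position
xi_r = 1.0   # demonstrator's robot keeps a 1.0 gap from the human
lam, margin, eta = 0.1, 0.5, 0.05

def cost(r, h):
    # Soft collision penalty: positive when robot and human are close.
    d = abs(r - h)
    return max(0.0, margin - d) ** 2

def grad(f, z, h=1e-4):
    # Central finite differences suffice for scalar toy parameters.
    return (f(z + h) - f(z - h)) / (2 * h)

theta, psi = 0.0, 0.2   # init: MLE-like forecast, a risky nominal plan
for _ in range(2000):
    # Planner step: minimize the plan's cost against the current forecast.
    psi -= eta * grad(lambda p: cost(p, theta), psi)
    # Forecaster step: descend a Definition-3-style loss, making the plan
    # costly relative to the demonstrator, with lam * (t - xi_h)^2 as a
    # Gaussian-NLL surrogate anchoring the forecast to the observed data.
    theta -= eta * grad(
        lambda t: cost(xi_r, t) - cost(psi, t) + lam * (t - xi_h) ** 2,
        theta)

# The plan ends up guarding against the adversarial forecast: it keeps
# at least `margin` clearance, so its collision cost vanishes.
print(theta, psi, abs(psi - theta))
```

The forecast initially chases the plan (the counterfactual veering behavior), forcing the planner to retreat beyond the safety margin; the likelihood anchor then pulls the forecast back toward the data.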
We define the forecaster trained in this adversarial fashion as Adv-Forecaster, and the planner trained to be robust against such an adversarial forecaster as Safe-Planner. We describe the overall approach in Algorithm 1. We set up an online learning game that lasts $N$ rounds. In every round, both the forecaster and the planner receive a loss function and play a no-regret update (we use online gradient descent). We define the loss functions for both players below:
Definition 3 (Adv-Forecaster)
At round $i$, the forecaster receives a dataset $\{(\phi, \xi_H, \xi_R, \hat{\xi}^i_R)\}$ of context, human demonstration, robot demonstration, and the planned trajectory, respectively. We define the loss for this round $\ell^i(P_\theta)$ as:

$$\ell^i(P_\theta) = \underset{\phi, \xi_H, \xi_R, \hat{\xi}^i_R}{\mathbb{E}}\Big[\underset{\hat{\xi}_H \sim P_\theta(\cdot|\phi)}{\mathbb{E}}\big[c(\xi_R, \hat{\xi}_H) - c(\hat{\xi}^i_R, \hat{\xi}_H)\big] - \lambda \log P_\theta(\xi_H|\phi)\Big]$$
where the first term is the difference between the costs of the demonstrated and the planned robot trajectory, and the second term is the log-likelihood of observed human trajectories multiplied by a Lagrange multiplier.
Definition 4 (Safe-Planner)
At round $i$, the planner receives a dataset $\{(\phi, \xi_R, \hat{\xi}^i_H)\}$ of context, robot demonstration, and the adversarial forecast, respectively. We define the loss for this round $\ell^i(P_\psi)$ as:

$$\ell^i(P_\psi) = \underset{\phi, \xi_R, \hat{\xi}^i_H}{\mathbb{E}}\Big[\underset{\hat{\xi}_R \sim P_\psi(\cdot|\hat{\xi}^i_H,\phi)}{\mathbb{E}}\big[c(\hat{\xi}_R, \hat{\xi}^i_H) - c(\xi_R, \hat{\xi}^i_H)\big]\Big]$$
where the innermost term is the difference between the costs of the planned and the demonstrated robot trajectory against the current forecast.
IV-B Application for Multi-Agent Navigation
We define a multi-agent navigation scene to contain N 𝑁 N italic_N agents interacting with each other. We can consider each agent, in turn, to act as the “robot” interacting with the other “humans” in the scene. For every agent, we have to plan a T 𝑇 T italic_T-horizon trajectory, considering the future motions of the other agents in the scene. We need to encode goal-reaching and collision avoidance to define a cost function for the navigation problem. However, while forecasting motions for agents, we do not always have access to their intended goal locations. From a dataset of prior interactions between agents, given the context of the scene, we can infer the future motions using maximum-likelihood estimation. Additionally, we enforce collision avoidance using the following obstacle cost function [21] between plans and forecasts:
$$c(\xi_R, \xi_H) = \sum_{i=1}^{T} COL(s_i^R, s_i^H) \qquad (7)$$

$$COL(s_i^R, s_i^H) = \begin{cases} -\,dist(s_i^R, s_i^H) + \tfrac{1}{2}\epsilon, & dist(s_i^R, s_i^H) < 0 \\ \tfrac{1}{2\epsilon}\left(dist(s_i^R, s_i^H) - \epsilon\right)^2, & 0 < dist(s_i^R, s_i^H) < \epsilon \\ 0, & \text{otherwise} \end{cases}$$
We sum the cost over all robot and human states in the predicted T 𝑇 T italic_T-horizon planner and forecaster trajectories. When a robot interacts with multiple humans, the human trajectory with the largest cost is considered.
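For concreteness, the obstacle cost of Eq. 7 can be sketched in Python as follows. The margin `eps` and the combined agent `radius` used to convert center distance into signed clearance are illustrative assumptions, not the paper's values:

```python
import numpy as np

def col_cost(dist: float, eps: float) -> float:
    """Piecewise obstacle cost from Eq. 7 (CHOMP [21]).

    `dist` is the signed clearance between a robot state and a human
    state (negative means penetration); `eps` is the safety margin.
    """
    if dist < 0:
        return -dist + 0.5 * eps              # linear penalty inside the obstacle
    elif dist < eps:
        return (dist - eps) ** 2 / (2 * eps)  # quadratic penalty inside the margin
    else:
        return 0.0                            # free space

def trajectory_cost(robot_traj, human_traj, eps=0.2, radius=0.3):
    """Sum the per-timestep obstacle cost over a T-horizon trajectory pair.

    Trajectories are (T, 2) arrays of planar positions; `radius` is an
    assumed combined agent radius converting center distance to clearance.
    """
    robot_traj, human_traj = np.asarray(robot_traj), np.asarray(human_traj)
    clearances = np.linalg.norm(robot_traj - human_traj, axis=1) - radius
    return sum(col_cost(d, eps) for d in clearances)
```

Note that the cost is continuous at the case boundaries: both the linear and quadratic branches evaluate to $\epsilon/2$ at zero clearance.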
In our model architecture (Fig. 3), each agent in the scene is represented by a node in a graph. Neighboring agents are connected by edges. Each node in the graph takes in its individual context, including state history and other relevant information, such as a local map representation of the scene. To encode interactions with other agents, self-attention is applied across neighboring nodes. Each node has two output heads: a forecaster and a planner. The predicted forecasts can be fed as input to the planning module. While the inputs to the forecaster are restricted to the shared context of the scene, the planner can additionally take in information private to an agent, such as its goal location.
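A minimal numpy sketch of this shared-encoder design follows. The layer sizes, the single attention layer, and all parameter names here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class JointForecastPlanNet:
    """Sketch of the architecture in Fig. 3: each agent is a node, self-attention
    mixes features across agents, and two heads decode forecasts and plans.
    """

    def __init__(self, d_ctx, d_model, d_priv, horizon):
        s = 1.0 / np.sqrt(d_model)
        self.W_in = rng.normal(0, s, (d_ctx, d_model))
        self.Wq = rng.normal(0, s, (d_model, d_model))
        self.Wk = rng.normal(0, s, (d_model, d_model))
        self.Wv = rng.normal(0, s, (d_model, d_model))
        # forecaster head: sees only the shared scene context
        self.W_f = rng.normal(0, s, (d_model, horizon * 2))
        # planner head: additionally consumes private info (e.g. the goal)
        self.W_p = rng.normal(0, s, (d_model + d_priv, horizon * 2))
        self.horizon = horizon

    def forward(self, ctx, priv):
        """ctx: (N, d_ctx) per-agent contexts; priv: (N, d_priv) private info."""
        h = ctx @ self.W_in                            # per-node embedding
        q, k, v = h @ self.Wq, h @ self.Wk, h @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(k.shape[1]))  # agent-agent attention
        z = attn @ v                                   # interaction-aware features
        forecasts = (z @ self.W_f).reshape(-1, self.horizon, 2)
        plans = (np.concatenate([z, priv], axis=1) @ self.W_p)
        return forecasts, plans.reshape(-1, self.horizon, 2)
```

The key structural point the sketch captures is the information asymmetry: the forecaster head reads only the shared attention features, while the planner head concatenates in agent-private inputs.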
We first train the forecasting and planning heads to maximize only the likelihood of the agents' future motion in the scene (Eq. 2 and Eq. 3). Since this amounts to simply maximizing the likelihood of the futures observed in the dataset, we call the resulting models the MLE-Forecaster and Nom-Planner. Beyond matching the ground truth, we wish to encode collision avoidance when measuring a plan's performance against the forecasts. To incorporate this cost function, we solve the min-max game defined by Eq. 4. For every minibatch of data, we update the forecasting model (Adv-Forecaster) to adversarially maximize the difference in cost between the planner and the ground truth. In response, the planner (Safe-Planner) is updated to minimize its cost against the Adv-Forecaster. We keep the predictions from deviating too far from the ground truth by continuing to optimize the likelihood of the observed future motion.
V Evaluation
Figure 4: Evaluation of CHOMP costs (Eq. 7) and collision rates for different planner and forecaster combinations. (a) CrowdNav: CHOMP cost (left) and collision rates (right). (b) ETH-UCY: CHOMP cost (left) and collision rates (right).
V-A Setup
We evaluate our algorithm on a crowd navigation simulator and real-world pedestrian datasets.
V-A.1 CrowdNav [22]
This is an open-source simulator in which a robot must move forward in the presence of other humans. The humans in the scene move toward their respective goal locations and are simulated with the ORCA [23] algorithm. We are interested in the non-compliant setting of the simulator, in which the five humans in the scene are unresponsive, i.e., they ignore the robot's motion. To navigate the scene, the robot must plan around the humans' future movements. To generate expert trajectories, we use the reinforcement learning (RL) agent (SARL) provided by [22], which uses a self-attention module to encode human-human and human-robot interactions. The SARL agent is trained with a reward function that manually encodes collision avoidance and goal-reaching behavior.
We collect a dataset of 5000 episodes of human-robot navigation using this SARL policy. Our models are trained on 50% of the dataset and evaluated on the rest. While the SARL architecture considers only the current state of the robot and humans to predict the robot's immediate action, we additionally use an LSTM module to encode a history of 8 timesteps (2 seconds) for each agent. The predictions produced by the forecaster are given as input to the planning module along with the context and the robot's goal location. We output actions for each agent over a horizon of 8 timesteps.
V-A.2 The ETH-UCY Benchmark
The ETH [24] and UCY [25] benchmark comprises five datasets of real-world pedestrian movements. The scenarios showcase a wide range of human-human interactions and are a standard benchmark in the field. The data is captured at 2.5 Hz (0.4 s timestep). For the forecasting task, 8 timesteps of history (3.2 s) are given, and 12 timesteps of the future are to be predicted for each agent. For evaluation, our model is trained on 4 of the 5 datasets and evaluated on the held-out dataset.
We implement the Trajectron++ [14] model for the base configuration of our planner and forecaster. It is a state-of-the-art multimodal conditional variational autoencoder (CVAE) generative model that can produce dynamically feasible trajectories. While the original model produces a distribution of trajectories, we use deterministic forecasts for each agent for simplicity. To do this, we restrict the outputs of the model to a unimodal distribution for each agent’s future motion. To calculate the collision avoidance cost function, we use the mean of this distribution.
V-B Results and Analysis
O1. Adv-Forecaster predicts more severe hazards than MLE-Forecaster. In Fig. 5, we show examples of forecasts produced by the Adv-Forecaster that lead to collisions with the plans generated by the Nom-Planner. This is expected, as the Adv-Forecaster is trained to increase the cost difference between generated plans and the observed trajectories in the dataset. The MLE-Forecaster, on the other hand, maximizes the likelihood of observed motions and is unable to generate the potential hazards that render the generated plans unsafe. Fig. 4 shows that the cost (Eq. 7) of plans is significantly higher when evaluated against the adversarial forecasts than against the MLE-Forecasts or the observed futures, in both the CrowdNav environment and the ETH-UCY benchmark.
Figure 5: The MLE-Forecaster predicts the most likely futures for each human. Nom-Planner avoids collisions with the MLE-Forecaster but not with the Adv-Forecaster. Collisions are marked in red.
Figure 6: The Safe-Planner plans around the Adv-Forecasts leading to safe motions, whereas the Nom-Planner collides with the adversarial forecasts. Collisions are marked in red.
O2. Safe-Planner guards against rare events better than Nom-Planner. Fig. 6 shows scenarios where the Nom-Planner collides with the forecasts produced by the Adv-Forecaster because it does not consider the possibility of adverse events. Since the Safe-Planner is trained to minimize the cost difference with the adversarial forecasts, its plans are safe with respect to the Adv-Forecaster; it naturally encodes behavior that guards against rare events in the dataset. Fig. 4 shows that the Safe-Planner has lower costs than the Nom-Planner when evaluated against the observed futures of humans. We also observe lower costs and collision rates for the Safe-Planner than for the Nom-Planner when tested against the MLE-Forecasts and the Adv-Forecaster in the CrowdNav simulator.
O3. Safe-Planner and Adv-Forecaster produce plausible trajectories even with higher tracking errors. Both the Safe-Planner and the Adv-Forecaster are trained with the primary objective of optimizing the cost difference, but to do so they must deviate from ground-truth observations. Table I shows that their average displacement error (ADE) and final displacement error (FDE) are slightly higher. However, the trajectories generated by the models remain generally plausible: Fig. 7 shows scenarios in the CrowdNav simulator where both deviate from ground-truth trajectories yet remain plausible counterfactuals that the robot should guard against.
TABLE I: Average Displacement Error (ADE) and Final Displacement Error (FDE) of the trajectories predicted by our different planners/forecasters on the testing splits of the ETH-UCY benchmark and the CrowdNav simulator.
| Model | ETH-UCY ADE | ETH-UCY FDE | CrowdNav ADE | CrowdNav FDE |
|---|---|---|---|---|
| MLE-Forecaster | 0.387 | 0.947 | 0.268 | 0.371 |
| Adv-Forecaster | 0.405 | 0.950 | 0.274 | 0.383 |
| Nom-Planner | 0.387 | 0.947 | 0.184 | 0.268 |
| Safe-Planner | 0.391 | 0.956 | 0.193 | 0.283 |
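For reference, the ADE and FDE metrics reported in Table I can be computed as follows (a minimal sketch over a single predicted/ground-truth trajectory pair):

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error between a predicted and a
    ground-truth trajectory, each a (T, 2) array of planar positions.

    ADE averages the per-timestep Euclidean error over the horizon;
    FDE is the error at the final timestep only.
    """
    errs = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return errs.mean(), errs[-1]
```

In practice these are averaged over all agents and test scenes; the sketch shows only the per-trajectory computation.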
Figure 7: (CrowdNav) When the Adv-Forecaster and Safe-Planner deviate from the ground truth, they produce alternate plausible trajectories. The forecasts produced by the Adv-Forecaster represent risky futures; the Safe-Planner conservatively guards against possible rare events.
VI Discussion and Limitations
This paper introduces a novel game-theoretic framework for joint forecasting and planning. We discuss the pitfalls of MLE forecasting, which focuses only on maximizing the likelihood of observed human motion. Instead, we produce adversarial counterfactuals by optimizing the performance difference between generated plans and observed demonstrations under the predictions made by our learned forecaster. In response, our framework guards against rare but risky events by generating plans that are safe with respect to the adversary.
There are some limitations to our approach. We observed larger error ranges on the ETH-UCY dataset. This is likely because real-world pedestrian datasets contain significant noise in estimation and a wide variety of behaviors, making it difficult to model human behavior accurately. In future work, we will extend our framework to consider multi-modal distributions of plans and forecasts. On-policy evaluation of our framework in scenarios where humans suddenly change their goals is another promising direction.
VII Acknowledgements
This work was partially funded by NSF RI (#2312956).
References
- [1] J. Ngiam, B. Caine, V. Vasudevan, Z. Zhang, H.-T. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal et al., “Scene transformer: A unified multi-task model for behavior prediction and planning,” arXiv preprint, 2021.
- [2] L. L. Li, B. Yang, M. Liang, W. Zeng, M. Ren, S. Segal, and R. Urtasun, “End-to-end contextual perception and prediction with interaction transformer,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020.
- [3] A. Bajcsy, S. Bansal, E. Ratner, C. J. Tomlin, and A. D. Dragan, “A robust control framework for human motion prediction,” IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. 24–31, 2020.
- [4] K. Leung, A. Bajcsy, E. Schmerling, and M. Pavone, “Towards data-driven synthesis of autonomous vehicle safety concepts,” arXiv preprint arXiv:2107.14412, 2021.
- [5] D. Sadigh, S. S. Sastry, S. A. Seshia, and A. D. Dragan, “Planning for autonomous cars that leverage effects on human actions,” in Robotics: Science and Systems, 2016.
- [6] D. Sadigh, S. S. Sastry, A. Seshia, and A. D. Dragan, “Information gathering actions over human internal state,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 66–73.
- [7] D. Fridovich-Keil, E. Ratner, L. Peters, A. D. Dragan, and C. J. Tomlin, “Efficient iterative linear-quadratic approximations for nonlinear multi-player general-sum differential games,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1475–1481.
- [8] M. Wang, Z. Wang, J. Talbot, J. C. Gerdes, and M. Schwager, “Game-theoretic planning for self-driving cars in multivehicle competitive scenarios,” IEEE Transactions on Robotics, vol. 37, pp. 1313–1325, 2021.
- [9] S. L. Cleac’h, M. Schwager, and Z. Manchester, “ALGAMES: A fast augmented Lagrangian solver for constrained dynamic games,” Autonomous Robots, vol. 46, pp. 201–215, 2022.
- [10] A. Liniger and J. Lygeros, “A noncooperative game approach to autonomous racing,” IEEE Transactions on Control Systems Technology, vol. 28, pp. 884–897, 2020.
- [11] L. Peters, D. Fridovich-Keil, V. Rubies-Royo, C. J. Tomlin, and C. Stachniss, “Inferring objectives in continuous dynamic games from noise-corrupted partial state observations,” arXiv preprint arXiv:2106.03611, 2021.
- [12] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, “Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 6015–6022.
- [13] B. Ivanovic, K. Leung, E. Schmerling, and M. Pavone, “Multimodal deep generative models for trajectory prediction: A conditional variational autoencoder approach,” IEEE Robotics and Automation Letters, vol. 6, pp. 295–302, 2020.
- [14] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, “Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data,” in European Conference on Computer Vision, 2020.
- [15] Y. Liu, Q. Yan, and A. Alahi, “Social NCE: Contrastive learning of socially-aware motion representations,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 15098–15109.
- [16] Y. Liu, R. Cadei, J. Schweizer, S. Bahmani, and A. Alahi, “Towards robust and adaptive motion forecasting: A causal representation perspective,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 17060–17071.
- [17] J. Philion, A. Kar, and S. Fidler, “Learning to evaluate perception models using planner-centric metrics,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [18] B. Ivanovic and M. Pavone, “Rethinking trajectory forecasting evaluation,” arXiv preprint arXiv:2107.10297, 2021.
- [19] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in AAAI, 2008.
- [20] G. Swamy, S. Choudhury, J. A. Bagnell, and S. Wu, “Of moments and matching: A game-theoretic framework for closing the imitation gap,” in International Conference on Machine Learning. PMLR, 2021, pp. 10022–10032.
- [21] M. Zucker, N. D. Ratliff, A. D. Dragan, M. Pivtoraiko, M. Klingensmith, C. M. Dellin, J. A. Bagnell, and S. S. Srinivasa, “CHOMP: Covariant Hamiltonian optimization for motion planning,” The International Journal of Robotics Research, vol. 32, pp. 1164–1193, 2013.
- [22] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, “Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6015–6022.
- [23] J. P. van den Berg, S. J. Guy, M. C. Lin, and D. Manocha, “Reciprocal n-body collision avoidance,” in International Symposium of Robotics Research, 2011.
- [24] S. Pellegrini, A. Ess, K. Schindler, and L. V. Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in 2009 IEEE 12th International Conference on Computer Vision, 2009.
- [25] A. Lerner, Y. Chrysanthou, and D. Lischinski, “Crowds by example,” Computer Graphics Forum, vol. 26, 2007.