Title: Physics-based Motion Retargeting from Sparse Inputs

URL Source: https://arxiv.org/html/2307.01938

Authors: Jungdam Won (Seoul National University, South Korea), Yuting Ye (Reality Labs Research, Meta, United States of America), Michiel van de Panne (University of British Columbia, Canada), and Alexander Winkler (winklera@meta.com, Reality Labs Research, Meta, United States of America)

Abstract.

Avatars are important to create interactive and immersive experiences in virtual worlds. One challenge in animating these characters to mimic a user’s motion is that commercial AR/VR products consist only of a headset and controllers, providing very limited sensor data of the user’s pose. Another challenge is that an avatar might have a different skeleton structure than a human and the mapping between them is unclear. In this work we address both of these challenges. We introduce a method to retarget motions in real-time from sparse human sensor data to characters of various morphologies. Our method uses reinforcement learning to train a policy to control characters in a physics simulator. We only require human motion capture data for training, without relying on artist-generated animations for each avatar. This allows us to use large motion capture datasets to train general policies that can track unseen users from real and sparse data in real-time. We demonstrate the feasibility of our approach on three characters with different skeleton structure: a dinosaur, a mouse-like creature and a human. We show that the avatar poses often match the user surprisingly well, despite having no sensor information of the lower body available. We discuss and ablate the important components in our framework, specifically the kinematic retargeting step, the imitation, contact and action reward as well as our asymmetric actor-critic observations. We further explore the robustness of our method in a variety of settings including unbalancing, dancing and sports motions.

Keywords: retargeting, reinforcement learning, physics-based simulation, computer animation

Published in: Proceedings of the ACM on Computer Graphics and Interactive Techniques (PACMCGIT), Vol. 6, No. 2, August 2023. DOI: 10.1145/3606928. Submission ID: 3607.

Figure 1. Our method uses only a headset and controller pose as input to generate a physically-valid pose for a variety of characters in real-time.

1. Introduction

Augmented and Virtual Reality (AR/VR) has the potential to provide rich forms of self-expression. Using human characters makes it easier to accurately reflect the motions of a user. However, many users might want to portray themselves via non-human characters. Games with non-human player characters already demonstrate the great appeal of this type of embodiment, albeit one that works within the limited immersion afforded by current gaming input devices and displays. How can we best allow users to embody themselves in non-human characters using current AR/VR systems? Our work seeks to make progress on this question. This entails multiple challenges, in particular: (a) AR/VR systems provide only sparse information regarding the pose of the user, obtained from a head-mounted device (HMD) and two controllers; (b) the target character may have significantly different dimensions and body types, as shown in Figure 1; and (c) kinematic animation, including that resulting from kinematic retargeting, often lacks physical plausibility, producing movements that lack a feeling of weight.

We propose a method to address these challenges. In particular, we develop an imitation-based reinforcement learning (RL) method that uses the sparse sensor input of a user to drive a physics-based simulation of the target character. This directly takes into account the physical properties of the given character, such as the heavy tail of a dinosaur or the short legs of a mouse character, as shown in Figure 1. We only require human motion capture data for training, without relying on artist-generated animations for each avatar. This allows us to use large motion capture datasets to train general policies that can track unseen users from real and sparse data in real-time. We identify several ingredients as important to successful retargeting in this setting, including foot contact rewards, sparse mapping of key features for retargeting, and suitable reward terms that offer further style control. Many of the pieces that we rely on exist elsewhere in the literature. Our primary contribution lies in bringing them together in a way that enables a new retargeting capability well-suited to current AR/VR systems. We are the first to show a framework that works with real data from sparse sensors in real time while producing high-quality motions for non-human characters. We validate our design choices through a variety of ablations.

2. Related Work


Figure 2. Overview of our system. The policy $\pi$ receives the Quest sensor input $o_{t,user}$ and the current state of the simulated character $o_{t,sim}$ as observation and computes torques $a_t$ to apply in a physics simulator. During training, we use human motion capture data $s_{t,gt}$ to estimate a rough pose $s_{t,kin}$ of the simulated character ("kinematic retargeting"). The reward encourages the simulated character $s_{t,sim}$ to imitate this rough kinematic pose $s_{t,kin}$ as closely as possible while respecting the physical constraints imposed by the simulator. After the policy is trained, full-body data and kinematic retargeting are no longer required, and the simulated character can be driven purely by the sparse HMD and controller sensors.

In this literature review we focus on the most relevant works in motion tracking, retargeting, and physics-based control.

2.1. Human Motion Tracking

Many solutions exist for full-body tracking of human motion, varying in their choice of sensors, the number of sensors, and their placement. Optical marker-based systems with external cameras remain the most common choice for applications requiring high accuracy, e.g., (Vicon, 2022). Markerless and vision-based approaches rely on cameras alone to generate full-body poses. Common approaches leverage human body models such as SMPL as a pose prior (Loper et al., 2015; Kanazawa et al., 2019; Xu et al., 2019; Rong et al., 2021), use extracted keypoints or correspondences from the images (Güler et al., 2018; Cao et al., 2019), or use physics-based priors, e.g., (Rempe et al., 2021). Wearable sensors are another common choice, relying on sensors attached to the user's body, such as Inertial Measurement Unit (IMU) devices, e.g., (von Marcard et al., 2017; Huang et al., 2018; Jiang et al., 2022).

When using AR/VR devices, systems are further limited by the sparse sensors available. Most commonly, available units comprise three tracked devices: a head-mounted device (HMD) and two controllers, one for each hand. As human motion tracking devices, these are handicapped by the lack of sensory information regarding the lower body and legs, which is essential to synthesizing believable full-body motion. Multiple methods have been proposed to address this, using transformers (Jiang et al., 2022; Vaswani et al., 2017), VAEs (Dittadi et al., 2021), and normalizing-flow generative models (Aliakbarian et al., 2022). Being kinematic approaches, however, these methods do not enforce physical properties and thus suffer from motion artifacts such as foot skating and jitter. Physics-based approaches have also recently been proposed (Winkler et al., 2022; Ye et al., 2022). Both make use of reinforcement learning and physics to learn general and robust policies that drive full-body avatars, conditioned on input from a VR device. These are closest to the work we present in this paper and hold great promise, although they come with their own limitations. The Neural3Points method (Ye et al., 2022) is specific to a single user and uses auxiliary losses and an intermediate full-body pose predictor. Relatedly, Winkler et al. (2022) propose a more direct approach that controls a simulated human avatar and generalizes to users of different heights and multiple types of motion. Our work generalizes the method of Winkler et al. (2022) in two important ways: (1) we learn physics-based retargeting to characters with different morphologies, and (2) we enable real-time retargeting.

2.2. Retargeting Motions

The motion retargeting problem is that of remapping motion from a source character or skeleton, often driven by motion capture data, to another character of possibly different dimensions. This is a long-standing problem for which many solutions have been proposed. Arguably the most challenging version of this problem arises when the source and target characters may differ significantly in terms of their morphology and skeleton, as is also the case for our work.

Kinematic retargeting methods often approach the problem by allowing the user to specify directly, or alternatively to learn via examples, a model for source-to-target pose correspondences, e.g., (Monzani et al., 2000; Yamane et al., 2010; Seol et al., 2013). This creates a puppetry system, where target motions can be further cleaned to respect contacts with the help of inverse kinematics. Kinematic motion deformation approaches can be used to adapt multiple characters' trajectories for motions involving coordination, such as moving boxes (Kim et al., 2021). Recent work proposes a kinematic method to learn how to retarget without requiring any explicit pairing between motions (Aberman et al., 2020), and this is also demonstrated to work on skeletons with very different proportions. Other recent work examines how to learn efficient kinematic motion retargeting for human-like skeletons while preserving contact constraints, such as when hands and arms have self-contact with the body (Villegas et al., 2021).

Physics-based retargeting methods aim to produce a physics-based simulation of the output motion, which results in crisp contacts and physically-plausible motion of the target character. An offline approach to motion retargeting using spacetime trajectory optimization is presented in Al Borno et al. (2018); the final output uses LQR trees, and thus the resulting motions can cope with some perturbations. A method was recently proposed for using interactive human motion to drive the motion of a quadruped robot (Kim et al., 2022). A curated dataset of matching pairs of human-and-robot motions is used to develop relevant kinematic mappings for particular motions or tasks. A deep-RL policy is then learned that can track the target kinematic motions in real time, enabling a form of real-time human-to-real-robot puppetry. In our setting, we assume significantly sparser user input and motion specifications.

2.3. Physics-based Character Simulation

Controllers for physics-based characters have been extensively explored. The ability to imitate reference motions was first demonstrated to varying extents in a number of papers over the past 15 years, e.g., (Yin et al., 2007; Lee et al., 2010; Ye and Liu, 2010; Coros et al., 2010; Liu et al., 2010; Geijtenbeek et al., 2012). These methods often incorporated some iterative optimization to adapt to a specific motion and used a simple control law to provide robust balance feedback. Some of these methods were also adapted to produce motions for non-human characters, e.g., (Geijtenbeek et al., 2013; Wampler et al., 2014).

Neural network policies, trained via deep reinforcement learning (RL), provide new capabilities to learn new skills from scratch, or to imitate artist-provided motions or motion capture clips, e.g., (Peng et al., 2017; Won et al., 2017; Peng et al., 2018a), including demonstrations for non-human characters. More recent methods provide more flexibility in sequencing motions for basketball (Park et al., 2019) or, more generally, in tracking online streams of motion capture data (Chentanez et al., 2018; Bergamin et al., 2019; Won et al., 2020; Fussell et al., 2021). Control policies have also been learned that are conditioned not only on the desired motion but also on the specific morphology of a simulated character, which can even be changed at run time (Won and Lee, 2019). We further refer the reader to a recent survey of RL-related animation methods (Kwiatkowski et al., 2022). We build on the foundations above for our specific problem, namely how to retarget from sparse (and therefore potentially highly ambiguous) input data to a non-human physics-based character with very different dimensions and proportions.

3. Method

An overview of our system is shown in Figure 2. We use reinforcement learning to learn a policy that generates torques for a physics simulator. During training, we use human motion capture data both to synthesize HMD and controller data for the policy and to build a reward training signal. In the following we give an overview of reinforcement learning and then describe each component in detail.

3.1. Reinforcement Learning

We use deep reinforcement learning (RL) to learn a retargeting policy for each character. In RL, at each time step $t$, the control policy reacts to an environment state $s_t$ by performing an action $a_t$. Based on the action performed, the policy receives a reward signal $r_t = r(s_t, a_t)$. In deep RL, the control policy $\pi_\theta(a|s)$ is a neural network. The goal of deep RL is to find the network parameters $\theta$ that maximize the expected return, defined as follows:

(1)  $J_{RL}(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$

where $\gamma \in [0, 1)$ is the discount factor. Tuning $\gamma$ affects the importance we give to future states. We solve this optimization problem using the proximal policy optimization (PPO) algorithm (Schulman et al., 2017), a policy-gradient actor-critic algorithm. A review of the PPO algorithm is provided in Appendix B.
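As a concrete illustration of Equation 1, the discounted return of a single finite rollout can be computed as follows (a minimal sketch; the function name is ours, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = sum_t gamma^t * r_t for one finite rollout (Eq. 1).

    Accumulated backwards so each step is a single multiply-add.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With a smaller $\gamma$, rewards far in the future contribute less to the return, which is what makes the discount factor a knob for how far ahead the policy should plan.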

3.2. Characters


Figure 3. We demonstrate our retargeting solution on three different characters (from left to right): a mouse-like creature named Oppy, a human named Jesse, and a dinosaur we call Dino.

We demonstrate our retargeting solution on three characters with unique features: Oppy (Meta, 2023) is a mouse with a short lower body, a big head, big ears, and a tail; Dino is a tall dinosaur with a long, heavy tail and head and short arms; Jesse is a human-like cartoon character with a skeleton structure similar to the mocap data. Figure 3 shows a visual representation of the characters and Table 1 details the structure of their skeletons.

Table 1. Character details.

3.3. Observations

The observation contains two parts: simulated character data $o_{t,sim}$ and the user's sparse sensor data $o_{t,user}$.

(2)  $o_t = [o_{t,sim},\ o_{t-1,user},\ o_{t,user}]$

(3)  $o_{t,sim} = [o_{sim,q},\ o_{sim,\dot{q}},\ o_{sim,x},\ o_{sim,R}]$

(4)  $o_{t,user} = [h_t,\ l_t,\ r_t,\ R_{h,t},\ R_{l,t},\ R_{r,t}]$

The simulated character's state is fully observable in the simulation. Therefore, even though the sensor signal is sparse, the policy can still rely on the full state of the simulated character. This observation consists of joint angles $o_{sim,q} \in \mathbb{R}^j$ and joint angle velocities $o_{sim,\dot{q}} \in \mathbb{R}^j$ of all degrees of freedom $j$ of the character. We also provide Cartesian positions $o_{sim,x} \in \mathbb{R}^{l \times 3}$ and orientations $o_{sim,R} \in \mathbb{R}^{l \times 6}$ of a subset $l$ of the character's links. The orientations consist of the first two columns of their rotation matrices. All positions and orientations are expressed with respect to a coordinate frame located on the floor below the character, which rotates with the character's heading direction. This makes the controller agnostic to the heading direction.

The sensor data, either coming from the real device or synthetically generated from the training data (described in subsection 3.4), consists of the position and orientation of the HMD $h$, the left controller $l$, and the right controller $r$. Positions and orientations are expressed in the same coordinate system as the simulated character observations. To allow the policy to infer velocities, we provide it two consecutive sensor observations $[o_{t-1,user},\ o_{t,user}]$.
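The observation layout of Equations 2-4 can be sketched as a simple concatenation; the 6D orientation encoding (first two columns of each rotation matrix) follows the description above, while all function and argument names are illustrative:

```python
import numpy as np

def rot_to_6d(R):
    """6D orientation: first two columns of a 3x3 rotation matrix,
    flattened column by column."""
    return R[:, :2].reshape(-1, order="F")

def build_observation(q, qdot, link_pos, link_rot, user_prev, user_curr):
    """Concatenate Eq. (2)-(4): full simulated-character state plus two
    consecutive sparse sensor frames (so the policy can infer velocities)."""
    o_sim = np.concatenate([
        q,                                            # joint angles
        qdot,                                         # joint velocities
        link_pos.reshape(-1),                         # link positions (heading frame)
        np.concatenate([rot_to_6d(R) for R in link_rot]),  # link orientations, 6D
    ])
    return np.concatenate([o_sim, user_prev, user_curr])
```

The 6D encoding avoids the discontinuities of Euler angles and quaternions while remaining a fixed-size vector per link.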

Inspired by Pinto et al. (2018), we use asymmetric observations. At training time we augment the value-function observation with the full human mocap pose and future human mocap state information. This complete view of the state allows the value function to better estimate the returns, and the better the return estimate, the easier it is for the policy to learn. We can provide this mocap state information because the value function is required only during training. Real-time inference still relies only on the policy, which uses the sparse sensor input. We ablate this in subsection 5.3 and find that it is essential for sparse real-time retargeting.

3.4. Synthetic Training Data

During training, we require HMD and controller data for the observation, paired with kinematic poses $s_{t,kin}$ for each character from which the reward $r_t$ is computed. To synthetically generate the HMD and controller data, we offset the mocap head and wrist joints to emulate the position and orientation of the HMD and the left and right controllers, as if the subjects were equipped with an AR/VR device.
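The synthetic sensor generation described above amounts to applying a fixed rigid offset to each mocap head and wrist joint. A minimal sketch (the function name and the offset values are our own; the actual calibration constants are not given in the paper):

```python
import numpy as np

def synth_tracker(joint_pos, joint_rot, offset_pos, offset_rot):
    """Emulate one tracker (HMD or controller) from a mocap joint by
    applying a fixed rigid offset expressed in the joint's local frame."""
    pos = np.asarray(joint_pos) + np.asarray(joint_rot) @ np.asarray(offset_pos)
    rot = np.asarray(joint_rot) @ np.asarray(offset_rot)
    return pos, rot
```

Applied to the head joint this yields a synthetic HMD pose; applied to each wrist it yields a synthetic controller pose.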

Importantly, our system does not require artist-generated animations for each specific character as training data, which would be infeasible to create with the diversity and quantity we require. Instead we reuse existing human motion capture data $s_{gt}$ and perform a rough kinematic retargeting $s_{kin}$ to the morphology of the simulated character (Figure 4). In this step, we manually match selected joint angles of the human to conceptually similar joints of the creature. Joints for which no correspondence can be found (e.g., ears and tails) are simply set to their default pose. This provides a rough estimate of the creature's motion. However, this motion has many artifacts, such as feet sliding due to different leg lengths, self-collisions, floor collisions, and no motion of the tail and ears. Nonetheless, we can still use it as a reward signal to train our simulated character; the physical constraints imposed by the simulation then remove the remaining artifacts. Importantly, after the simulated character is trained, it is driven only by a headset and controllers, without requiring any full-body information of the user or any kinematic retargeting.


Figure 4. Training data is generated through kinematic retargeting. The left character is the human mocap data. The middle character shows a rough kinematic retargeting by matching selected joint angles. This pose has many artifacts, such as feet sliding due to different leg lengths, self-collisions, floor collisions, and no motion of the tail and ears. The right character is the closest simulated pose that also respects all physical constraints. Notice how the head does not perfectly follow the human: it is heavier and takes more time to react, having access only to past and present information, not future information.

3.5. Reward

The goal for the simulated character is to imitate the human motion as closely as possible, while respecting all the constraints imposed by physics. Our reward function includes components for imitation, contact, and action regularization:

(5)  $r_t = r_t(\text{imitation}) + r_t(\text{contact}) + r_t(\text{action})$

(6)  $r_t(\text{imitation}) = r_t(q) + r_t(\dot{q}) + r_t(x) + r_t(\dot{x}) + r_t(\text{orientation})$

(7)  $r_t(\text{action}) = r_t(\text{action diff}) + r_t(\text{action min}).$

Each of the reward terms is expressed using a weighted Gaussian kernel:

(8)  $r_t(s) = w_s\, e^{-k_s\, d(s_{t,sim},\, s_{t,kin})}$

where for each term only the relevant component of the state $s$ is considered, $d(s_{sim}, s_{kin})$ is the distance metric between the simulated and kinematic components of the state, $k$ is the sensitivity of the Gaussian kernel, and $w$ is the weight of the reward component. Parameter values and details of the distance metrics for each term are provided in Appendix A.
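Each term of Equation 8 is simply a weighted exponential of a distance; a minimal sketch (names illustrative):

```python
import math

def reward_term(d, w, k):
    """One reward term of Eq. (8): a weighted Gaussian kernel of the
    distance d between simulated and kinematic state components.
    Maximal (= w) at d = 0, decaying at a rate set by the sensitivity k."""
    return w * math.exp(-k * d)
```

The sensitivity $k$ controls how sharply the reward falls off as the simulated state deviates, while $w$ sets the relative importance of that term in the total reward.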

3.5.1. Imitation Reward

This reward matches the available information between the simulated character $s_{sim}$ and the kinematically retargeted ground-truth pose $s_{kin}$. The five terms are a weighted sum of the differences between the matching joint angles ($q$), joint angle velocities ($\dot{q}$), Cartesian positions ($x$) and velocities ($\dot{x}$), and orientations. The imitation reward captures the degree of supervision we want to transfer between the human motion data and the simulated character. For clarity, Equation 6 is the general form that includes all possible terms; the terms actually used differ per character. The less supervision the imitation term provides, the more we rely on physics and the other components to generate a sensible motion.

Depending on the quality of our kinematically retargeted pose, we can choose which aspects of the pose we want the simulated character to imitate more closely. The least amount of supervision consists of tracking only the root position, which in our experiments does not produce high-quality motions. At the other extreme, we also do not want to track every aspect of the kinematically retargeted pose. For example, there is no tail motion in the human mocap data, so the kinematically retargeted pose sets the tail to a stiff default pose; a simulated character, however, may want to move its tail to achieve balance and smoother motion. We therefore do not require these parts of the skeleton to imitate the kinematic pose.

Orientations are skeleton-independent, so we rely on the actual human mocap data, not the kinematically retargeted pose, to formulate the orientation rewards. We always formulate a reward that matches the character's root with the human mocap root, as well as the character's head orientation with the human head orientation. Ablations without these terms are provided in subsection 4.3.

3.5.2. Contact Reward

The contact reward is a boolean value that checks whether the simulated character's foot contacts coincide with the human's foot contacts. We estimate contact in the mocap data based on velocity and height thresholds. For the simulated character, we directly access contact forces from the simulator and threshold those. In most cases the kinematically retargeted leg motion has a variety of artifacts, such as feet sliding or penetrating the ground, so imitating this pose exactly is not physically valid. Since this reward does not depend on the skeleton structure, it can be used for all bipedal characters equally and computed directly from human mocap. The contact reward is important for providing further training supervision and generating the high-quality motions shown. Ablations are provided.
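A minimal sketch of the mocap contact estimate and the match reward described above; the threshold values are our assumptions, not taken from the paper:

```python
import numpy as np

def mocap_foot_contact(foot_pos, foot_vel, h_thresh=0.05, v_thresh=0.2):
    """Estimate a boolean foot contact from mocap: the foot is in contact
    when it is low (height below h_thresh, in meters) and nearly
    stationary (speed below v_thresh, in m/s)."""
    return (foot_pos[2] < h_thresh) and (np.linalg.norm(foot_vel) < v_thresh)

def contact_reward(sim_contact, mocap_contact, w=1.0):
    """Boolean contact-match reward: w when the simulated and mocap
    contact states coincide, else 0."""
    return w if sim_contact == mocap_contact else 0.0
```

On the simulated side, `sim_contact` would instead come from thresholding the contact forces reported by the simulator.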

3.5.3. Action Reward

The action reward is a regularization term that minimizes the total amount of energy consumed by the character. It consists of two terms: one minimizes the difference in torque between two subsequent actions, and the other minimizes the absolute action value. They are defined as:

(9)  $r_t(\text{action diff}) = \frac{1}{N}\sum_{i}^{N}(a_{t-1,i} - a_{t,i})^2$

(10)  $r_t(\text{action min}) = \frac{1}{N}\sum_{i}^{N} a_{t,i}^2$

where $N$ is the total number of action values the policy outputs. These components incentivize lower-energy movements overall and suppress twitching in favor of smoother transitions between poses.
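The two regularization terms above can be computed directly from consecutive action vectors; this pure-Python sketch is illustrative (the function name and list-based action representation are ours):

```python
def action_rewards(a_prev, a_curr):
    """Compute the two regularization terms of Eqs. (9) and (10):
    mean squared difference between consecutive actions, and mean
    squared action magnitude. Both are costs the reward weighting
    later turns into penalties."""
    n = len(a_curr)
    action_diff = sum((p - c) ** 2 for p, c in zip(a_prev, a_curr)) / n
    action_min = sum(c ** 2 for c in a_curr) / n
    return action_diff, action_min
```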

3.6. Termination

As noted in multiple previous works (Peng et al., 2018b; Reda et al., 2020), early termination techniques are important for learning complex motions through reinforcement learning. We reset the environment when one of two termination conditions is satisfied: the character enters an unrecoverable state, which we define as falling and touching the ground with the upper body, or the character root position is more than 30 cm away from the scaled root of the motion capture data. Furthermore, to mitigate the imbalance of visiting and learning to retarget only the early parts of the motion trajectories, we reset the character every 500 steps. We then randomly sample a pose from the human data and initialize the character to the corresponding kinematically retargeted pose.
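A sketch of the combined termination check, assuming the upper-body contact flag and the scaled reference root position are available from the simulator and the mocap clip (names and signature are illustrative):

```python
def should_reset(upper_body_ground_contact, root_sim, root_ref,
                 step_count, max_dist=0.30, max_steps=500):
    """Episode ends on (a) upper-body ground contact (unrecoverable fall),
    (b) the root drifting more than 30 cm from the scaled mocap root, or
    (c) hitting the 500-step cap used to balance trajectory coverage."""
    dist = sum((s - r) ** 2 for s, r in zip(root_sim, root_ref)) ** 0.5
    return upper_body_ground_contact or dist > max_dist or step_count >= max_steps
```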

3.7. Learning Control Policies

The policy for each simulated character outputs torque values in the range $[-1, 1]$, which are then rescaled according to per-joint minimum and maximum torque values (provided in Appendix D). We find this to perform better, and to be conceptually simpler, than outputting PD target angles, consistent with previous work (Reda et al., 2020). We train the policy with PPO using PyTorch auto-differentiation (Schulman et al., 2017; Paszke et al., 2019) and simulate physics with the NVIDIA PhysX Isaac Gym simulator (Makoviychuk et al., 2021). A complete set of hyperparameters for reproducibility is summarized in Appendix C.
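The rescaling from normalized actions to per-joint torques can be sketched as a standard linear mapping (the function name and the defensive clamping are our choices, not the paper's code):

```python
def rescale_torques(actions, torque_min, torque_max):
    """Map policy outputs in [-1, 1] to each joint's torque range
    via a linear interpolation between the per-joint bounds."""
    torques = []
    for a, lo, hi in zip(actions, torque_min, torque_max):
        a = max(-1.0, min(1.0, a))  # clamp out-of-range network outputs
        torques.append(lo + 0.5 * (a + 1.0) * (hi - lo))
    return torques
```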

4. Results


Figure 5. If the character size matches the user, joint angles and foot contacts between the character and the user are more similar (left). If the simulated character has a very different morphology (e.g., here much smaller), the kinematically retargeted pose is less accurate and is mostly ignored by the simulated character in order to generate a physically valid motion. Here the character has to take many steps for a single human step to match the root translation.


Figure 6. Right Dino has the orientation reward, while Left Dino does not. As the user turns their head, Right Dino follows more closely.

Figure 7. Sequence of frames showing all three characters being controlled in real time with sparse sensory input. Lower-body motion matches that of the user, and foot contacts are correctly estimated. Watch the accompanying video for more results.


All experiments are performed on a single 12-core machine with one NVIDIA RTX 2070 GPU. All models are trained for 24 hours which translates to approximately 6 billion environment steps.

We demonstrate comparable results with two different motion capture datasets. Our in-house mocap data consists of 4 hours of motion clips from 120 subjects; specifically, the dataset contains 130 minutes of walking and 110 minutes of jogging. We also demonstrate robust and general results with the Ubisoft La Forge Animation (LaFAN1) dataset (Harvey et al., 2020), an open-source motion capture dataset containing 5 subjects and 77 sequences. For this work, we only consider actions themed Walk and Run, which consist of a total of 15 sequences and 74 minutes of data. We note that these motions are very different from those in our in-house dataset, containing diverse and challenging behaviors and gait styles. At inference, we provide input to the policy from a Meta Quest headset and controllers.

4.1. Real-time Retargeting with Headset and Controller

We thank the QuestSim (Winkler et al., 2022) authors for providing us with testing data and video references. With our method, we are able to control different characters in real time with only headset and controller information. Importantly, we are able to estimate the lower-body pose of the user from only three points on the upper body and correctly match the user's action while transferring it to a character with a different morphology. Our virtual characters respect physical constraints and do not suffer from jittering, foot sliding, or ground penetration. Moreover, we generalize to users not present in the training set and to users of different heights. In Figure 7 we show a sequence in which all three characters are controlled by an unseen user.

4.2. Retargeting using only Headset

Some VR systems provide only a head-mounted device (HMD), without the two controllers. This is an even more challenging setting, requiring the policy to predict a full-body pose and control a virtual character from a one-point input. Nonetheless, our trained models are robust to the missing controller signal and can retarget real-time data from unseen users even from this extremely sparse input, albeit at lower quality than before. We invite the reader to watch the video available in the supplementary material.

4.3. Reward component ablations

Some reward components are essential for producing good motions. Here we examine a few instructive examples.

4.3.1. Contact Reward

The contact reward shapes the gait style of the character: both Oppy and Dino display different locomotion behaviors when this reward component is used. Its effect also interacts with the character's size. In Figure 5 we show Oppy at two different sizes. When Oppy's size matches the user, it reproduces the user's gait style and travel distance; when it is smaller, it either travels a shorter distance while matching the gait style, or adopts a faster gait to keep up with the user, depending on the weighting of the reward components. Similarly, in Figure 7, the different frames show the matching gaits between the three characters and the user.

4.3.2. Orientation Reward

Providing a signal for mimicking head and root orientation is essential for tracking the user's head and overall movements with higher fidelity. Figure 6 shows how Dino without the head-orientation component fails to move its head in the same way as the user. As shown in the supplementary video, both Oppy and Dino without the head-orientation reward exhibit heads wobbling left and right while walking; these characters have heavy heads that require learned control.

5. Discussion

We discuss different capabilities and components of our system.

5.1. Physics-based control


Figure 8. Left Dino’s tail has 2 active joints and the remaining 6 passive; Center Dino’s tail has 2 active joints and the remaining 6 fixed; Right Dino’s tail is completely passive.

Physics acts as a powerful helper in driving the motion of components with missing pose information, with the skeleton description as the underlying prior. For Dino's tail, the simulator affords several stylization options: we can allow more joint mobility and actuate it passively through a PD controller with a fixed target as secondary motion, or we can let the policy make active control decisions. Figure 8 shows three examples in which Dino's tail is fixed, passively actuated, or controlled by the learned policy. Oppy's tail and ears are all treated as secondary dynamics. This stylization would not be possible in a kinematic retargeting setting.
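The passive actuation option can be sketched as a standard PD control law with a constant target pose (a generic formulation; the gains, names, and scalar-joint simplification are illustrative, not the paper's values):

```python
def pd_torque(q, qd, q_target, kp, kd):
    """PD torque toward a target joint angle. With a constant q_target
    this passively actuates a joint (e.g. a tail) as secondary motion:
    the spring term pulls toward the target, the damping term resists
    velocity, and external disturbances produce natural follow-through."""
    return kp * (q_target - q) - kd * qd
```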

5.2. Controlling the style

Our method is robust to different sets of parameters: most parameter changes still yield a reasonable motion controller, with a different style. As described in subsection 4.3, the contact reward shapes the gait style of the character, and modifying it together with the character's size produces different gait styles.

The kinematic retargeting described in subsection 3.4 only needs to be rough to produce sensible motions, as the physics dynamics correct the artifacts. Moreover, tuning the key joints of the kinematically retargeted motion modifies the overall style. For example, it is possible to give Dino a more horizontal posture, with the tail held straight behind the back and off the ground, by tuning the spine parameter to be more bent over. This style is illustrated in Figure 4 and in the supplementary video; a noticeable difference can be observed compared to Figure 8.

5.3. Importance of asymmetric observations

During training we provide a richer observation to the value function than to the policy. While at inference the controller receives only real-time sparse information (i.e., no future frames and no full-body pose), there is no need to constrain the value function, since these signals are available at training time. In our experiments, training with a value function that likewise receives no future frames and no full-body pose yields an overall less robust policy: it can retarget easy walking examples from the training data, but it fails on harder motions such as running and cannot generalize to real data from an unseen user.
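The asymmetric observation construction amounts to concatenating privileged training-time signals onto the critic's input only; a minimal sketch (the exact observation contents are simplified, and names are ours):

```python
def build_observations(sensor_obs, full_body_pose, future_ref):
    """Actor sees only the sparse, causally available sensor stream;
    the critic additionally sees privileged signals available at
    training time (full-body ground truth and future reference frames)."""
    actor_obs = list(sensor_obs)
    critic_obs = list(sensor_obs) + list(full_body_pose) + list(future_ref)
    return actor_obs, critic_obs
```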

5.4. Quality of open-source datasets

We test our method with two different datasets: a 4-hour in-house dataset and a 74-minute open-source dataset. While we notice that a larger and more diverse dataset improves the quality of the final motions, models trained on either dataset are robust and capable. Both generalize to unseen users and run in real time, even with headset-only sensory input.

6. Conclusions

We have presented a method to retarget a user’s motion to simulated characters, in a challenging setting: the target characters can differ significantly in size and body morphology; we require a real-time remapping; and the mapping needs to be driven by the sparse motion data coming from an AR/VR device. We show that physics-based simulations, driven by asymmetric actor-critic RL policies, allow for effective retargeting in this difficult setting. The motions generated by the policies track those of the user while also being appropriate to the physics of the target character. We introduce a general reward description which allows for tuning of the degree of supervision and adapts to a range of character morphologies. Numerous ablations allow us to understand the impact of various parameters and design choices, including varying degrees of available tracking information, the impact of contact rewards, choices related to the secondary motion of tails and ears, and more.

Our work still has a number of limitations. Our controller fails to track challenging motion sequences in which the user performs fast and dynamic movements or uncorrelated upper/lower-body motions. In these scenarios, a kinematic controller acting directly in pose space will still produce a motion, albeit of low quality, and can catch up once parts of the motion become easier by "teleporting" between poses without physical consistency. Our controller, in contrast, must produce a correct sequence of joint torques to control the character and may suffer from compounding tracking errors until it fails. A two-stage pipeline, similar to Ye et al. (2022), in which a network first predicts the full-body pose and a high-frequency controller then outputs torques, could regain the advantages of kinematic systems when needed. While our framework allows richer forms of self-expression for users, empowering them to control different kinds of characters, we are only scratching the surface of the complexities that arise from different target skeletons; our characters are still bipeds. Increased character complexity might be handled by supplying skeleton information to the policy (Won and Lee, 2019), by using graph neural networks to learn a flexible policy as in Wang et al. (2018), or by training an auxiliary network to map between source and target skeletons.

References

  • Aberman et al. (2020) Kfir Aberman, Peizhuo Li, Dani Lischinski, Olga Sorkine-Hornung, Daniel Cohen-Or, and Baoquan Chen. 2020. Skeleton-aware networks for deep motion retargeting. ACM Transactions on Graphics (TOG) 39, 4 (2020), 62–1.
  • Al Borno et al. (2018) Mazen Al Borno, Ludovic Righetti, Michael J Black, Scott L Delp, Eugene Fiume, and Javier Romero. 2018. Robust Physics-based Motion Retargeting with Realistic Body Shapes. In Computer Graphics Forum, Vol.37. Wiley Online Library, 81–92.
  • Aliakbarian et al. (2022) Sadegh Aliakbarian, Pashmina Cameron, Federica Bogo, Andrew Fitzgibbon, and Tom Cashman. 2022. FLAG: Flow-based 3D Avatar Generation from Sparse Observations. In 2022 Computer Vision and Pattern Recognition. https://www.microsoft.com/en-us/research/publication/flag-flow-based-3d-avatar-generation-from-sparse-observations/
  • Bergamin et al. (2019) Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. 2019. DReCon: data-driven responsive control of physics-based characters. ACM Transactions On Graphics (TOG) 38, 6 (2019), 1–11.
  • Cao et al. (2019) Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y.A. Sheikh. 2019. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
  • Chentanez et al. (2018) Nuttapong Chentanez, Matthias Müller, Miles Macklin, Viktor Makoviychuk, and Stefan Jeschke. 2018. Physics-based motion capture imitation with deep reinforcement learning. In Proceedings of the 11th annual international conference on motion, interaction, and games. 1–10.
  • Coros et al. (2010) Stelian Coros, Philippe Beaudoin, and Michiel Van de Panne. 2010. Generalized biped walking control. ACM Transactions On Graphics (TOG) 29, 4 (2010), 1–9.
  • Dittadi et al. (2021) Andrea Dittadi, Sebastian Dziadzio, Darren Cosker, Ben Lundell, Tom Cashman, and Jamie Shotton. 2021. Full-Body Motion From a Single Head-Mounted Device: Generating SMPL Poses From Partial Observations. In International Conference on Computer Vision 2021.
  • Fussell et al. (2021) Levi Fussell, Kevin Bergamin, and Daniel Holden. 2021. SuperTrack: motion tracking for physically simulated characters using supervised learning. ACM Transactions on Graphics (TOG) 40, 6 (2021), 1–13.
  • Geijtenbeek et al. (2012) Thomas Geijtenbeek, Nicolas Pronost, and Frank van der Stappen. 2012. Simple data-driven control for simulated bipeds. In Eurographics/ACM SIGGRAPH Symposium on Computer Animation (SCA).
  • Geijtenbeek et al. (2013) Thomas Geijtenbeek, Michiel van de Panne, and A Frank Van Der Stappen. 2013. Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on Graphics (TOG) 32, 6 (2013), 1–11.
  • Güler et al. (2018) Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7297–7306.
  • Harvey et al. (2020) Félix G. Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. 2020. Robust Motion In-Betweening. 39, 4 (2020).
  • Huang et al. (2018) Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J. Black, Otmar Hilliges, and Gerard Pons-Moll. 2018. Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time. ACM TOG 37, 6 (12 2018).
  • Jiang et al. (2022) Yifeng Jiang, Yuting Ye, Deepak Gopinath, Jungdam Won, Alexander W Winkler, and C Karen Liu. 2022. Transformer Inertial Poser: Real-time Human Motion Reconstruction from Sparse IMUs with Simultaneous Terrain Generation. ACM Trans. Graph. (2022).
  • Kanazawa et al. (2019) Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, and Jitendra Malik. 2019. Learning 3D Human Dynamics from Video. In Computer Vision and Pattern Recognition (CVPR).
  • Kim et al. (2021) Jongmin Kim, Yeongho Seol, and Taesoo Kwon. 2021. Interactive multi-character motion retargeting. Computer Animation and Virtual Worlds 32, 3-4 (2021), e2015.
  • Kim et al. (2022) Sunwoo Kim, Maks Sorokin, Jehee Lee, and Sehoon Ha. 2022. Human Motion Control of Quadrupedal Robots using Deep Reinforcement Learning. In Proceedings of Robotics: Science and Systems. New York, USA.
  • Kwiatkowski et al. (2022) Ariel Kwiatkowski, Eduardo Alvarado, Vicky Kalogeiton, C Karen Liu, Julien Pettré, Michiel van de Panne, and Marie-Paule Cani. 2022. A survey on reinforcement learning methods in character animation. In Computer Graphics Forum, Vol.41. Wiley Online Library, 613–639.
  • Lee et al. (2010) Yoonsang Lee, Sungeun Kim, and Jehee Lee. 2010. Data-driven biped control. In ACM SIGGRAPH 2010 papers. 1–8.
  • Liu et al. (2010) Libin Liu, KangKang Yin, Michiel van de Panne, Tianjia Shao, and Weiwei Xu. 2010. Sampling-based contact-rich motion control. In ACM SIGGRAPH 2010 papers. 1–10.
  • Loper et al. (2015) Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM TOG 34, 6 (Oct. 2015), 248:1–248:16.
  • Makoviychuk et al. (2021) Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. 2021. Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning. https://doi.org/10.48550/ARXIV.2108.10470
  • Meta (2023) Meta. 2023. The World Beyond. https://github.com/oculus-samples/Unity-TheWorldBeyond.
  • Monzani et al. (2000) Jean-Sébastien Monzani, Paolo Baerlocher, Ronan Boulic, and Daniel Thalmann. 2000. Using an intermediate skeleton and inverse kinematics for motion retargeting. In Computer Graphics Forum, Vol.19. Wiley Online Library, 11–19.
  • Park et al. (2019) Soohwan Park, Hoseok Ryu, Seyoung Lee, Sunmin Lee, and Jehee Lee. 2019. Learning predict-and-simulate policies from unorganized human motion data. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–11.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019), 8026–8037.
  • Peng et al. (2018a) Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018a. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–14.
  • Peng et al. (2018b) Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018b. DeepMimic: Example-guided Deep Reinforcement Learning of Physics-based Character Skills. ACM Trans. Graph. 37, 4, Article 143 (July 2018), 143:1–143:14 pages.
  • Peng et al. (2017) Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. 2017. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–13.
  • Pinto et al. (2018) Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. 2018. Asymmetric Actor Critic for Image-Based Robot Learning. In Robotics: Science and Systems (RSS 2018), Hadas Kress-Gazit, Siddhartha S. Srinivasa, Tom Howard, and Nikolay Atanasov (Eds.). MIT Press Journals. https://doi.org/10.15607/RSS.2018.XIV.008
  • Reda et al. (2020) Daniele Reda, Tianxin Tao, and Michiel van de Panne. 2020. Learning to Locomote: Understanding How Environment Design Matters for Deep Reinforcement Learning. In Proc. ACM SIGGRAPH Conference on Motion, Interaction and Games.
  • Rempe et al. (2021) Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J Guibas. 2021. Humor: 3d human motion model for robust pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11488–11499.
  • Rong et al. (2021) Yu Rong, Takaaki Shiratori, and Hanbyul Joo. 2021. FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration. In IEEE International Conference on Computer Vision Workshops.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. https://doi.org/10.48550/ARXIV.1707.06347
  • Seol et al. (2013) Yeongho Seol, Carol O’Sullivan, and Jehee Lee. 2013. Creature features: online motion puppetry for non-human characters. In Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 213–221.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, Vol.30.
  • Vicon (2022) Vicon. 2022. Vicon Motion Systems. https://www.vicon.com/.
  • Villegas et al. (2021) Ruben Villegas, Duygu Ceylan, Aaron Hertzmann, Jimei Yang, and Jun Saito. 2021. Contact-Aware Retargeting of Skinned Motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9720–9729.
  • von Marcard et al. (2017) Timo von Marcard, Bodo Rosenhahn, Michael Black, and Gerard Pons-Moll. 2017. Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs. Computer Graphics Forum 36(2), Proceedings of the 38th Annual Conference of the European Association for Computer Graphics (Eurographics) (2017), 349–360.
  • Wampler et al. (2014) Kevin Wampler, Zoran Popović, and Jovan Popović. 2014. Generalizing locomotion style to new animals with inverse optimal regression. ACM Transactions on Graphics (TOG) 33, 4 (2014), 1–11.
  • Wang et al. (2018) Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. 2018. Nervenet: Learning structured policy with graph neural networks. In International conference on learning representations.
  • Winkler et al. (2022) Alexander Winkler, Jungdam Won, and Yuting Ye. 2022. QuestSim: Human Motion Tracking from Sparse Sensors with Simulated Avatars. In SIGGRAPH Asia 2022 Conference Papers. 1–8.
  • Won et al. (2020) Jungdam Won, Deepak Gopinath, and Jessica Hodgins. 2020. A scalable approach to control diverse behaviors for physically simulated characters. ACM Transactions on Graphics (TOG) 39, 4 (2020), 33–1.
  • Won and Lee (2019) Jungdam Won and Jehee Lee. 2019. Learning body shape variation in physics-based characters. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–12.
  • Won et al. (2017) Jungdam Won, Jongho Park, Kwanyu Kim, and Jehee Lee. 2017. How to train your dragon: example-guided control of flapping flight. ACM Transactions on Graphics (TOG) 36, 6 (2017), 1–13.
  • Xu et al. (2019) Yuanlu Xu, Song-Chun Zhu, and Tony Tung. 2019. Denserac: Joint 3d pose and shape estimation by dense render-and-compare. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7760–7770.
  • Yamane et al. (2010) Katsu Yamane, Yuka Ariki, and Jessica Hodgins. 2010. Animating Non-Humanoid Characters with Human Motion Data. In Eurographics/ACM SIGGRAPH Symposium on Computer Animation, Zoran Popovic and Miguel Otaduy (Eds.). The Eurographics Association. https://doi.org/10.2312/SCA/SCA10/169-178
  • Ye and Liu (2010) Yuting Ye and C Karen Liu. 2010. Optimal feedback control for character animation using an abstract model. In ACM SIGGRAPH 2010 papers. 1–9.
  • Ye et al. (2022) Yongjing Ye, Libin Liu, Lei Hu, and Shihong Xia. 2022. Neural3Points: Learning to Generate Physically Realistic Full-body Motion for Virtual Reality Users. Computer Graphics Forum (2022). https://doi.org/10.1111/cgf.14634
  • Yin et al. (2007) KangKang Yin, Kevin Loken, and Michiel Van de Panne. 2007. Simbicon: Simple biped locomotion control. ACM Transactions on Graphics (TOG) 26, 3 (2007), 105–es.

Appendix A Reward Details

Parameter values for each term of Equation 5 and Equation 8 are provided in Table 2.

Table 2. Reward parameters for each character.

| Parameter | Oppy | Dino | Jesse |
|---|---|---|---|
| $w_{q}$ | 1 | 1 | 4 |
| $k_{q}$ | 20 | 20 | 25 |
| $w_{\dot{q}}$ | 0 | 0 | 0.5 |
| $k_{\dot{q}}$ | - | - | 1 |
| $w_{x}$ | 1 | 1 | 2.5 |
| $k_{x}$ | 6 | 6 | 6 |
| $w_{\dot{x}}$ | 0 | 0 | 0.7 |
| $k_{\dot{x}}$ | - | - | 2 |
| $w_{\text{contact}}$ | 1.5 | 1.5 | 0 |
| $k_{\text{contact}}$ | 1 | 1 | - |
| $w_{\text{orientation}}$ | 1 | 1 | 2 |
| $k_{\text{orientation}}$ | 3 | 3 | 3 |
| $w_{\text{action diff}}$ | 2 | 2 | 1.5 |
| $k_{\text{action diff}}$ | 150 | 150 | 10 |
| $w_{\text{action min}}$ | 0.2 | 0.2 | 0.5 |
| $k_{\text{action min}}$ | 25 | 25 | 25 |
| fail reward | -5 | -5 | -5 |

Given the state of the simulated character and the ground truth pose coming from the motion capture dataset, the distance metric for the different imitation reward components is formulated as a weighted sum of the Euclidean distance between the two values:

(11) $d(x_{sim},x_{gt})=\sum_{i}w_{i}\lVert q_{x,sim}-q_{x,gt}\rVert_{2}^{2}$

where $i$ ranges over joint angles or link positions, and the weights vary by character. As described in subsection 3.5, the imitation reward defines the degree of supervision: the more alike the two characters are, the more we can rely on this reward. For Jesse, in fact, all joint weights are equal to 1. For Oppy and Dino, whose lower-body size differs substantially from a human's, we rely more on the style reward and decrease the weight of all lower-body joints to 0.3. For link weights, Oppy and Dino have all weights set to zero except for the root, which is set to 1; for Jesse we also track end effectors.
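Per component, Equation 11 reduces to a weighted sum of squared differences; a scalar-per-joint sketch follows (the actual metric operates on joint-angle and link-position vectors, and the function name is ours):

```python
def imitation_distance(q_sim, q_gt, weights):
    """Weighted sum of squared differences over joints/links (Eq. 11).
    Down-weighting lower-body entries (e.g. 0.3 for Oppy and Dino)
    loosens the imitation constraint where morphologies differ."""
    return sum(w * (s - g) ** 2 for w, s, g in zip(weights, q_sim, q_gt))
```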

The contact distance metric is likewise computed as the Euclidean distance between ground-truth human motion data and simulated character data. We define a human foot to be in contact if its height is less than 20 cm above the ground and the norm of its velocity is less than 0.4 m/s. For the simulated character, a force threshold of 1 N is set on the foot links.

The orientation distance metric, given two orientations as quaternions, first composes the ground-truth quaternion with the inverse of the simulated quaternion, then takes the norm of the axis-angle representation of the result.
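A sketch of this quaternion distance, assuming unit quaternions in (w, x, y, z) order (the ordering convention is our assumption; the source does not specify it):

```python
import math

def quat_distance(q_gt, q_sim):
    """Angle of the relative rotation q_gt * conj(q_sim): compose the
    ground truth with the inverse simulated orientation, then take the
    norm of the axis-angle representation, i.e. the rotation angle."""
    w1, x1, y1, z1 = q_gt
    # conjugate of a unit quaternion is its inverse
    w2, x2, y2, z2 = q_sim[0], -q_sim[1], -q_sim[2], -q_sim[3]
    # only the scalar part of the Hamilton product is needed for the angle
    w = w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2
    w = max(-1.0, min(1.0, w))          # guard against rounding error
    return 2.0 * math.acos(abs(w))      # rotation angle in radians
```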

Appendix B Proximal Policy Optimization

Let an experience tuple be $e_{t}=(o_{t},a_{t},o_{t+1},r_{t})$ and a trajectory be $\tau=\{e_{0},e_{1},\dots,e_{T}\}$. We episodically collect trajectories for a fixed number of environment transitions and use this data to train the controller and value-function networks. The value-function network approximates the expected future return of each state, and is defined for a policy $\pi$ as

$V^{\pi}(o)=E_{o_{0}=o,\,a_{t}\sim\pi(\cdot|o_{t})}\left[\sum_{t=0}^{\infty}\gamma^{t}r(o_{t},a_{t})\right].$

This function can be optimized using supervised learning due to its recursive nature:

$V^{\pi_{\theta}}(o_{t})=\gamma V^{\pi_{\theta}}(o_{t+1})+r_{t},$

where

$V^{\pi_{\theta}}(o_{T})=r_{T}+\gamma V^{\pi_{\theta_{old}}}(o_{T+1}).$
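The recursive targets above can be computed with a single backward pass over an episode, bootstrapping the terminal state with the old value estimate; a minimal sketch with illustrative names:

```python
def value_targets(rewards, bootstrap_value, gamma=0.99):
    """Compute discounted value-function regression targets backward
    through an episode, seeding the recursion with the old value
    function's estimate of the state after the last step."""
    targets = []
    v = bootstrap_value
    for r in reversed(rewards):
        v = r + gamma * v
        targets.append(v)
    return list(reversed(targets))
```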

In PPO, the value function is used for computing the advantage

$$A_{t}=V^{\pi_{\theta}}-V^{\pi_{\theta_{old}}},$$

which is then used for training the policy by maximizing:

$$L_{\pi}(\theta)=\frac{1}{T}\sum_{t=1}^{T}\min\!\left(\rho_{t}\hat{A}_{t},\;\mathrm{clip}(\rho_{t},1-\epsilon,1+\epsilon)\,\hat{A}_{t}\right),$$

where $\rho_{t}=\pi_{\theta}(a_{t}|o_{t})\,/\,\pi_{\theta_{old}}(a_{t}|o_{t})$ is an importance sampling ratio used to evaluate the expectation under the old policy $\pi_{\theta_{old}}$.
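The clipped surrogate objective can be sketched in a few lines; this is an illustrative stand-alone function (working in log-probabilities for numerical stability), not the authors' code:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L_pi(theta), to be maximized."""
    rho = np.exp(logp_new - logp_old)  # importance ratio rho_t
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # min(...) removes the incentive to move rho_t outside [1-eps, 1+eps]
    return np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies coincide, $\rho_t = 1$ and the objective reduces to the mean advantage; clipping only activates once the policies diverge.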

Appendix C Training parameters

Table 3. Training details. We use the same parameters for every character.

Appendix D Torque Limits

Table 4. Torque-limit scale values for each character's joints. Joints not listed are not scaled, and the minimum limit of each joint is the negative of its maximum. Oppy's tail and ears have no torque values because they are passively actuated, as are the final six joints of Dino's tail.

| Parameter | Oppy | Dino | Jesse |
|-----------|------|------|-------|
| Shoulder  | 0.2  | 0.2  | 0.2   |
| Elbow     | 0.1  | 0.1  | 0.1   |
| Head      | 0.1  | 0.1  | 0.1   |
| Neck      | 0.1  | 0.1  | 0.1   |
| Spine0    | 1    | 1    | 0.25  |
| Spine1    | 1    | 1    | 0.25  |
| Spine2    | 1    | 1    | 0.25  |
| Spine3    | 1    | 1    | 0.25  |
| Tail0     | -    | 0.5  | -     |
| Tail1     | -    | 0.4  | -     |
