diff --git "a/app/src/content/article.mdx" "b/app/src/content/article.mdx" --- "a/app/src/content/article.mdx" +++ "b/app/src/content/article.mdx" @@ -55,11 +55,11 @@ Nonetheless, we also hold that the wealth of research from both academia and ind This tutorial... -- Does *not* aim to be a comprehensive guide to general field of robotics, manipulation or underactuated systems: @sicilianoSpringerHandbookRobotics2016 and @tedrakeRoboticManipulationPerception, @tedrakeUnderactuatedRoboticsAlgorithms do this better than we ever could. +- Does *not* aim to be a comprehensive guide to general field of robotics, manipulation or underactuated systems:@sicilianoSpringerHandbookRobotics2016 and@tedrakeRoboticManipulationPerception, @tedrakeUnderactuatedRoboticsAlgorithms do this better than we ever could. -- Does *not* aim to be an introduction to statistical or deep learning: @shalev-shwartzUnderstandingMachineLearning2014 and @prince2023understanding cover these subjects better than we ever could. +- Does *not* aim to be an introduction to statistical or deep learning:@shalev-shwartzUnderstandingMachineLearning2014 and@prince2023understanding cover these subjects better than we ever could. -- Does *not* aim to be a deep dive into Reinforcement Learning, Diffusion Models, or Flow Matching: invaluable works such as @suttonReinforcementLearningIntroduction2018, @nakkiranStepbyStepDiffusionElementary2024, and @lipmanFlowMatchingGuide2024 do this better than we ever could. +- Does *not* aim to be a deep dive into Reinforcement Learning, Diffusion Models, or Flow Matching: invaluable works such as@suttonReinforcementLearningIntroduction2018,@nakkiranStepbyStepDiffusionElementary2024, and@lipmanFlowMatchingGuide2024 do this better than we ever could. Instead, our goal here is to provide an intuitive explanation as per why these disparate ideas have converged to form the exciting field of modern robot learning, driving the unprecedented progress we see today. In this spirit, we follow the adage: "a jack of all trades is a master of none, *but oftentimes better than a master of one*." @@ -83,29 +83,29 @@ The frontier of robotics research is indeed increasingly moving away from classi Moreover, since end-to-end learning on ever-growing collections of text and image data has historically been at the core of the development of *foundation models* capable of semantic reasoning across multiple modalities (images, text, audio, etc.), deriving robotics methods grounded in learning appears particularly consequential, especially as the number of openly available datasets continues to grow. -Robotics is, at its core, an inherently multidisciplinary field, requiring a wide range of expertise in both *software* and *hardware*. The integration of learning-based techniques further broadens this spectrum of skills, raising the bar for both research and practical applications. `lerobot` is an open-source library designed to integrate end-to-end with the entire robotics stack. With a strong focus on accessible, real-world robots (1) `lerobot` supports many, openly available, robotic platforms for manipulation, locomotion and even whole-body control. `lerobot`also implements a (2) unified, low-level approach to reading/writing robot configurations to extend support for other robot platforms with relatively low effort. The library introduces `LeRobotDataset`, (3) a native robotics dataset’s format currently being used by the community to efficiently record and share datasets. `lerobot` also supports many state-of-the-art (SOTA) algorithms in robot learning--mainly based on Reinforcement Learning (RL) and Behavioral Cloning (BC) techniques--with efficient implementations in Pytorch, and extended support to experimentation and experiments tracking. Lastly, `lerobot` defines a custom, optimized inference stack for robotic policies decoupling action planning from action execution, proving effective in guaranteeing more adaptability at runtime. +Robotics is, at its core, an inherently multidisciplinary field, requiring a wide range of expertise in both *software* and *hardware*. The integration of learning-based techniques further broadens this spectrum of skills, raising the bar for both research and practical applications. `lerobot`is an open-source library designed to integrate end-to-end with the entire robotics stack. With a strong focus on accessible, real-world robots (1) `lerobot`supports many, openly available, robotic platforms for manipulation, locomotion and even whole-body control. `lerobot`also implements a (2) unified, low-level approach to reading/writing robot configurations to extend support for other robot platforms with relatively low effort. The library introduces `LeRobotDataset`, (3) a native robotics dataset’s format currently being used by the community to efficiently record and share datasets. `lerobot`also supports many state-of-the-art (SOTA) algorithms in robot learning--mainly based on Reinforcement Learning (RL) and Behavioral Cloning (BC) techniques--with efficient implementations in Pytorch, and extended support to experimentation and experiments tracking. Lastly, `lerobot`defines a custom, optimized inference stack for robotic policies decoupling action planning from action execution, proving effective in guaranteeing more adaptability at runtime. This tutorial serves the double purpose of providing useful references for the Science behind--and practical use of--common robot learning techniques. To this aim, we strike to provide a rigorous yet concise overview of the core concepts behind the techniques presented, paired with practical examples of how to use such techniques concretely, with code examples in `lerobot`, for researchers and practitioners interested in the field of robot learning. This tutorial is structured as follows: -- Section 2 reviews classical robotics foundations, introducing the limitations of dynamics-based approaches to robotics. +- Section2 reviews classical robotics foundations, introducing the limitations of dynamics-based approaches to robotics. -- Section 3 elaborates on the limitations of dynamics-based methods, and introduce RL as a practical approach to solve robotics problems, considering its upsides and potential limitations. +- Section3 elaborates on the limitations of dynamics-based methods, and introduce RL as a practical approach to solve robotics problems, considering its upsides and potential limitations. -- Section 4 further describes robot learning techniques that aim at solving single-tasks learning, leveraging BC techniques to autonomously reproduce specific expert demonstrations. +- Section4 further describes robot learning techniques that aim at solving single-tasks learning, leveraging BC techniques to autonomously reproduce specific expert demonstrations. -- Section 5 presents recent contributions on developing generalist models for robotics applications, by learning from large corpora of multi-task  multi-robot data (*robotics foundation models*). +- Section5 presents recent contributions on developing generalist models for robotics applications, by learning from large corpora of multi-task multi-robot data (*robotics foundation models*). Our goal with this tutorial is to provide an intuitive explanation of the reasons various disparate ideas from Machine Learning (ML) have converged and are powering the current evolution of Robotics, driving the unprecedented progress we see today. We complement our presentation of the most common and recent approaches in robot learning with practical code implementations using `lerobot`, and start here by presenting the dataset format introduced with `lerobot`. ## `LeRobotDataset` -`LeRobotDataset` is a standardized dataset format designed to address the specific needs of robot learning research, and it provides a unified and convenient access to robotics data across modalities, including sensorimotor readings, multiple camera feeds and teleoperation status. `LeRobotDataset` also accommodates for storing general information regarding the data being collected, including textual descriptions of the task being performed by the teleoperator, the kind of robot used, and relevant measurement specifics like the frames per second at which the recording of both image and robot state’s streams are proceeding. +`LeRobotDataset`is a standardized dataset format designed to address the specific needs of robot learning research, and it provides a unified and convenient access to robotics data across modalities, including sensorimotor readings, multiple camera feeds and teleoperation status. `LeRobotDataset`also accommodates for storing general information regarding the data being collected, including textual descriptions of the task being performed by the teleoperator, the kind of robot used, and relevant measurement specifics like the frames per second at which the recording of both image and robot state’s streams are proceeding. -In this, `LeRobotDataset` provides a unified interface for handling multi-modal, time-series data, and it is designed to seamlessly integrate with the PyTorch and Hugging Face ecosystems. `LeRobotDataset` can be easily extended by users and it is highly customizable by users, and it already supports openly available data coming from a variety of embodiments supported in `lerobot`, ranging from manipulator platforms like the SO-100 arm and ALOHA-2 setup, to real-world humanoid arm and hands, as well as entirely simulation-based datasets, and self-driving cars. This dataset format is built to be both efficient for training and flexible enough to accommodate the diverse data types encountered in robotics, while promoting reproducibility and ease of use for users. +In this, `LeRobotDataset`provides a unified interface for handling multi-modal, time-series data, and it is designed to seamlessly integrate with the PyTorch and Hugging Face ecosystems. `LeRobotDataset`can be easily extended by users and it is highly customizable by users, and it already supports openly available data coming from a variety of embodiments supported in `lerobot`, ranging from manipulator platforms like the SO-100 arm and ALOHA-2 setup, to real-world humanoid arm and hands, as well as entirely simulation-based datasets, and self-driving cars. This dataset format is built to be both efficient for training and flexible enough to accommodate the diverse data types encountered in robotics, while promoting reproducibility and ease of use for users. ### The dataset class design -A core design choice behind `LeRobotDataset` is separating the underlying data storage from the user-facing API. This allows for efficient storage while presenting the data in an intuitive, ready-to-use format. +A core design choice behind `LeRobotDataset`is separating the underlying data storage from the user-facing API. This allows for efficient storage while presenting the data in an intuitive, ready-to-use format. Datasets are always organized into three main components: @@ -115,7 +115,7 @@ Datasets are always organized into three main components: - **Metadata** A collection of JSON files which describes the dataset’s structure in terms of its metadata, serving as the relational counterpart to both the tabular and visual dimensions of data. Metadata include the different feature schema, frame rates, normalization statistics, and episode boundaries. -For scalability, and to support datasets with potentially millions of trajectories (resulting in hundreds of millions or billions of individual camera frames), we merge data from different episodes into the same high-level structure. Concretely, this means that any given tabular collection and video will not typically contain information about one episode only, but rather a concatenation of the information available in multiple episodes. This keeps the pressure on the file system limited, both locally and on remote storage providers like Hugging Face, though at the expense of leveraging more heavily relational-like, metadata parts of the dataset, which are used to reconstruct information such as at which position, in a given file, an episode starts or ends. An example struture for a given `LeRobotDataset` would appear as follows: +For scalability, and to support datasets with potentially millions of trajectories (resulting in hundreds of millions or billions of individual camera frames), we merge data from different episodes into the same high-level structure. Concretely, this means that any given tabular collection and video will not typically contain information about one episode only, but rather a concatenation of the information available in multiple episodes. This keeps the pressure on the file system limited, both locally and on remote storage providers like Hugging Face, though at the expense of leveraging more heavily relational-like, metadata parts of the dataset, which are used to reconstruct information such as at which position, in a given file, an episode starts or ends. An example struture for a given `LeRobotDataset`would appear as follows: - `meta/info.json`: This metadata is a central metadata file. It contains the complete dataset schema, defining all features (e.g., `observation.state`, `action`), their shapes, and data types. It also stores crucial information like the dataset’s frames-per-second (`fps`), `lerobot`’s version at the time of capture, and the path templates used to locate data and video files. @@ -131,11 +131,11 @@ For scalability, and to support datasets with potentially millions of trajectori ## Code Example: Batching a (Streaming) Dataset -This section provides an overview of how to access datasets hosted on Hugging Face using the `LeRobotDataset` class. Every dataset on the Hugging Face Hub containing the three main pillars presented above (Tabular, Visual and relational Metadata), and can be assessed with a single instruction. +This section provides an overview of how to access datasets hosted on Hugging Face using the `LeRobotDataset`class. Every dataset on the Hugging Face Hub containing the three main pillars presented above (Tabular, Visual and relational Metadata), and can be assessed with a single instruction. -In practice, most reinforcement learning (RL) and behavioral cloning (BC) algorithms tend to operate on stack of observation and actions. For the sake of brevity, we will refer to joint spaces, and camera frames with the single term of *frame*. For instance, RL algorithms may use a history of previous frames $o_{t-H_o:t}$ to mitigate partial observability, and BC algorithms are in practice trained to regress chunks of multiple actions ($a_{t+t+H_a}$) rather than single controls. To accommodate for these specifics of robot learning training, `LeRobotDataset` provides a native windowing operation, whereby users can define the *seconds* of a given window (before and after) around any given frame, by using the `delta_timestemps` functionality. Unavailable frames are opportunely padded, and a padding mask is also returned to filter out the padded frames. Notably, this all happens within the `LeRobotDataset`, and is entirely transparent to higher level wrappers commonly used in training ML models such as `torch.utils.data.DataLoader`. +In practice, most reinforcement learning (RL) and behavioral cloning (BC) algorithms tend to operate on stack of observation and actions. For the sake of brevity, we will refer to joint spaces, and camera frames with the single term of *frame*. For instance, RL algorithms may use a history of previous frames $o_{t-H_o:t}$ to mitigate partial observability, and BC algorithms are in practice trained to regress chunks of multiple actions ($a_{t+t+H_a}$) rather than single controls. To accommodate for these specifics of robot learning training, `LeRobotDataset`provides a native windowing operation, whereby users can define the *seconds* of a given window (before and after) around any given frame, by using the `delta_timestemps` functionality. Unavailable frames are opportunely padded, and a padding mask is also returned to filter out the padded frames. Notably, this all happens within the `LeRobotDataset`, and is entirely transparent to higher level wrappers commonly used in training ML models such as `torch.utils.data.DataLoader`. -Conveniently, by using `LeRobotDataset` with a Pytorch `DataLoader` one can automatically collate the individual sample dictionaries from the dataset into a single dictionary of batched tensors for downstream training or inference. `LeRobotDataset` also natively supports streaming mode for datasets. Users can stream data of a large dataset hosted on the Hugging Face Hub, with a one-line change in their implementation. Streaming datasets supports high-performance batch processing (ca. 80-100 it/s, varying on connectivity) and high levels of frames randomization, key features for practical BC algorithms which otherwise may be slow or operating on highly non-i.i.d. data. This feature is designed to improve on accessibility so that large datasets can be processed by users without requiring large amounts of memory and storage. +Conveniently, by using `LeRobotDataset`with a Pytorch `DataLoader` one can automatically collate the individual sample dictionaries from the dataset into a single dictionary of batched tensors for downstream training or inference. `LeRobotDataset`also natively supports streaming mode for datasets. Users can stream data of a large dataset hosted on the Hugging Face Hub, with a one-line change in their implementation. Streaming datasets supports high-performance batch processing (ca. 80-100 it/s, varying on connectivity) and high levels of frames randomization, key features for practical BC algorithms which otherwise may be slow or operating on highly non-i.i.d. data. This feature is designed to improve on accessibility so that large datasets can be processed by users without requiring large amounts of memory and storage.
@@ -218,9 +218,9 @@ TL;DR Learning-based approaches to robotics are motivated by the need to (1) gen caption={'Overview of methods to generate motion (clearly non-exhausitve, see @bekrisStateRobotMotion2024). The different methods can be grouped based on whether they explicitly (dynamics-based) or implicitly (learning-based) model robot-environment interactions.'} /> -Robotics is concerned with producing artificial motion in the physical world in useful, reliable and safe fashion. Thus, robotics is an inherently multi-disciplinar domain: producing autonomous motion in the physical world requires, to the very least, interfacing different software (motion planners) and hardware (motion executioners) components. Further, knowledge of mechanical, electrical, and software engineering, as well as rigid-body mechanics and control theory have therefore proven quintessential in robotics since the field first developed in the 1950s. More recently, Machine Learning (ML) has also proved effective in robotics, complementing these more traditional disciplines @connellRobotLearning1993. As a direct consequence of its multi-disciplinar nature, robotics has developed as a rather wide array of methods, all concerned with the main purpose of producing artificial motion in the physical world. +Robotics is concerned with producing artificial motion in the physical world in useful, reliable and safe fashion. Thus, robotics is an inherently multi-disciplinar domain: producing autonomous motion in the physical world requires, to the very least, interfacing different software (motion planners) and hardware (motion executioners) components. Further, knowledge of mechanical, electrical, and software engineering, as well as rigid-body mechanics and control theory have therefore proven quintessential in robotics since the field first developed in the 1950s. More recently, Machine Learning (ML) has also proved effective in robotics, complementing these more traditional disciplines@connellRobotLearning1993. As a direct consequence of its multi-disciplinar nature, robotics has developed as a rather wide array of methods, all concerned with the main purpose of producing artificial motion in the physical world. -Methods to produce robotics motion range from traditional *explicit* models--dynamics-based[^1] methods, leveraging precise descriptions of the mechanics of robots’ rigid bodies and their interactions with eventual obstacles in the environment--to *implicit* models--learning-based methods, treating artificial motion as a statistical pattern to learn given multiple sensorimotor readings @agrawalComputationalSensorimotorLearning, @bekrisStateRobotMotion2024. A variety of methods have been developed between these two extrema. For instance,  @hansenTemporalDifferenceLearning2022 show how learning-based systems can benefit from information on the physics of problems, complementing a traditional learning method such as Temporal Difference (TD)-learning @suttonReinforcementLearningIntroduction2018 with Model-Predictive Control (MPC). Conversely, as explicit models may be relying on assumptions proving overly simplistic--or even unrealistic--in practice, learning can prove effective to improve modeling of complex phenomena or complement perception @mccormacSemanticFusionDense3D2016. Such examples aim at demonstrating the richness of approaches to robotics, and Figure 2 graphically illustrates some of the most relevant techniques. Such a list is clearly far from being exhaustive, and we refer to @bekrisStateRobotMotion2024 for a more comprehensive overview of both general and application-specific methods for motion generation. In this section, we wish to introduce the inherent benefits of learning-based approaches to robotics--the core focus on this tutorial. +Methods to produce robotics motion range from traditional *explicit* models--dynamics-based[^1] methods, leveraging precise descriptions of the mechanics of robots’ rigid bodies and their interactions with eventual obstacles in the environment--to *implicit* models--learning-based methods, treating artificial motion as a statistical pattern to learn given multiple sensorimotor readings@agrawalComputationalSensorimotorLearning, @bekrisStateRobotMotion2024. A variety of methods have been developed between these two extrema. For instance, @hansenTemporalDifferenceLearning2022 show how learning-based systems can benefit from information on the physics of problems, complementing a traditional learning method such as Temporal Difference (TD)-learning@suttonReinforcementLearningIntroduction2018 with Model-Predictive Control (MPC). Conversely, as explicit models may be relying on assumptions proving overly simplistic--or even unrealistic--in practice, learning can prove effective to improve modeling of complex phenomena or complement perception@mccormacSemanticFusionDense3D2016. Such examples aim at demonstrating the richness of approaches to robotics, and Figure2 graphically illustrates some of the most relevant techniques. Such a list is clearly far from being exhaustive, and we refer to@bekrisStateRobotMotion2024 for a more comprehensive overview of both general and application-specific methods for motion generation. In this section, we wish to introduce the inherent benefits of learning-based approaches to robotics--the core focus on this tutorial. ## Different Types of Motion @@ -234,17 +234,17 @@ Methods to produce robotics motion range from traditional *explicit* models-- -In the vast majority of instances, robotics deals with producing motion via actuating joints connecting nearly entirely-rigid links. A key distinction between focus areas in robotics is based on whether the generated motion modifies (1) the absolute state of the environment (via dexterity), (2) the relative state of the robot with respect to its environment (exercising mobility skills), or (3) a combination of the two (Figure 3). +In the vast majority of instances, robotics deals with producing motion via actuating joints connecting nearly entirely-rigid links. A key distinction between focus areas in robotics is based on whether the generated motion modifies (1) the absolute state of the environment (via dexterity), (2) the relative state of the robot with respect to its environment (exercising mobility skills), or (3) a combination of the two (Figure3). Effects such as (1) are typically achieved *through* the robot, i.e. generating motion to perform an action inducing a desirable modification, effectively *manipulating* the environment (manipulation). Motions like (2) may result in changes in the robot’s physical location within its environment. Generally, modifications to a robot’s location within its environment may be considered instances of the general *locomotion* problem, further specified as *wheeled* or *legged* locomotion based on whenever a robot makes use of wheels or leg(s) to move in the environment. Lastly, an increased level of dynamism in the robot-environment interactions can be obtained combining (1) and (2), thus designing systems capable to interact with *and* move within their environment. This category is problems is typically termed *mobile manipulation*, and is characterized by a typically much larger set of control variables compared to either locomotion or manipulation alone. -The traditional body of work developed since the very inception of robotics is increasingly complemented by learning-based approaches. ML has indeed proven particularly transformative across the entire robotics stack, first empowering planning-based techniques with improved state estimation used for traditional planning @tangPerceptionNavigationAutonomous2023 and then end-to-end replacing controllers, effectively yielding perception-to-action methods @koberReinforcementLearningRobotics. Work in producing robots capable of navigating a diverse set of terrains demonstrated the premise of both dynamics and learning-based approaches for locomotion @griffinWalkingStabilizationUsing2017, @jiDribbleBotDynamicLegged2023, @leeLearningQuadrupedalLocomotion2020, @margolisRapidLocomotionReinforcement2022, and recent works on whole-body control indicated the premise of learning-based approaches to generate rich motion on complex robots, including humanoids @zhangWoCoCoLearningWholeBody2024, @bjorckGR00TN1Open2025. Manipulation has also been widely studied, particularly considering its relevance for many impactful use-cases ranging from high-risk applications for humans @fujitaDevelopmentRobotsNuclear2020, @alizadehComprehensiveSurveySpace2024 to manufacturing @sannemanStateIndustrialRobotics2020. While explicit models have proven fundamental in achieving important milestones towards the development of modern robotics, recent works leveraging implicit models proved particularly promising in surpassing scalability and applicability challenges via learning @koberReinforcementLearningRobotics. +The traditional body of work developed since the very inception of robotics is increasingly complemented by learning-based approaches. ML has indeed proven particularly transformative across the entire robotics stack, first empowering planning-based techniques with improved state estimation used for traditional planning@tangPerceptionNavigationAutonomous2023 and then end-to-end replacing controllers, effectively yielding perception-to-action methods@koberReinforcementLearningRobotics. Work in producing robots capable of navigating a diverse set of terrains demonstrated the premise of both dynamics and learning-based approaches for locomotion@griffinWalkingStabilizationUsing2017, @jiDribbleBotDynamicLegged2023, @leeLearningQuadrupedalLocomotion2020, @margolisRapidLocomotionReinforcement2022, and recent works on whole-body control indicated the premise of learning-based approaches to generate rich motion on complex robots, including humanoids@zhangWoCoCoLearningWholeBody2024, @bjorckGR00TN1Open2025. Manipulation has also been widely studied, particularly considering its relevance for many impactful use-cases ranging from high-risk applications for humans@fujitaDevelopmentRobotsNuclear2020, @alizadehComprehensiveSurveySpace2024 to manufacturing@sannemanStateIndustrialRobotics2020. While explicit models have proven fundamental in achieving important milestones towards the development of modern robotics, recent works leveraging implicit models proved particularly promising in surpassing scalability and applicability challenges via learning@koberReinforcementLearningRobotics. ## Example: Planar Manipulation Robot manipulators typically consist of a series of links and joints, articulated in a chain finally connected to an *end-effector*. Actuated joints are considered responsible for generating motion of the links, while the end effector is instead used to perform specific actions at the target location (e.g., grasping/releasing objects via closing/opening a gripper end-effector, using a specialized tool like a screwdriver, etc.). -Recently, the development of low-cost manipulators like the ALOHA @zhaoLearningFineGrainedBimanual2023 ALOHA-2 @aldacoALOHA2Enhanced and SO-100/SO-101 @knightStandardOpenSO100 platforms significantly lowered the barrier to entry to robotics, considering the increased accessibility of these robots compared to more traditional platforms like the Franka Emika Panda arm (Figure 4). +Recently, the development of low-cost manipulators like the ALOHA@zhaoLearningFineGrainedBimanual2023 ALOHA-2@aldacoALOHA2Enhanced and SO-100/SO-101@knightStandardOpenSO100 platforms significantly lowered the barrier to entry to robotics, considering the increased accessibility of these robots compared to more traditional platforms like the Franka Emika Panda arm (Figure4). -Deriving an intuition as per why learning-based approaches are gaining popularity in the robotics community requires briefly analyzing traditional approaches for manipulation, leveraging tools like forward and inverse kinematics (FK, IK) and control theory. Providing a detailed overview of these methods falls (well) out of the scope of this tutorial, and we refer the reader to works including @sicilianoSpringerHandbookRobotics2016, @lynchModernRoboticsMechanics2017, @tedrakeRoboticManipulationPerception, @tedrakeUnderactuatedRoboticsAlgorithms for a much more comprehensive description of these techniques. Here, we mostly wish to highlight the benefits of ML over these traditional techniques +Deriving an intuition as per why learning-based approaches are gaining popularity in the robotics community requires briefly analyzing traditional approaches for manipulation, leveraging tools like forward and inverse kinematics (FK, IK) and control theory. Providing a detailed overview of these methods falls (well) out of the scope of this tutorial, and we refer the reader to works including@sicilianoSpringerHandbookRobotics2016, @lynchModernRoboticsMechanics2017, @tedrakeRoboticManipulationPerception, @tedrakeUnderactuatedRoboticsAlgorithms for a much more comprehensive description of these techniques. Here, we mostly wish to highlight the benefits of ML over these traditional techniques -Consider the (simple) case where a SO-100 is restrained from actuating (1) the shoulder pane and (2) the wrist flex and roll motors. This effectively reduces the degrees of freedom of the SO-100 from the original 5+1 (5 joints + 1 gripper) to 2+1 (shoulder lift, elbow flex + gripper). As the end-effector does not impact motion in this model, the SO-100 is effectively reduced to the planar manipulator robot presented in Figure 5, where spheres represent actuators, and solid lines indicate length-$l$ links from the base of the SO-100 to the end-effector (*ee*). +Consider the (simple) case where a SO-100 is restrained from actuating (1) the shoulder pane and (2) the wrist flex and roll motors. This effectively reduces the degrees of freedom of the SO-100 from the original 5+1 (5 joints + 1 gripper) to 2+1 (shoulder lift, elbow flex + gripper). As the end-effector does not impact motion in this model, the SO-100 is effectively reduced to the planar manipulator robot presented in Figure5, where spheres represent actuators, and solid lines indicate length-$l$ links from the base of the SO-100 to the end-effector (*ee*). Further, let us make the simplifying assumption that actuators can produce rotations up to $2 \pi$ radians. In practice, this is seldom the case due to movement obstructions caused by the robot body itself (for instance, the shoulder lift cannot produce counter-clockwise movement due to the presence of the robot’s base used to secure the SO-100 to its support and host the robot bus), but we will introduce movement obstruction at a later stage. -All these simplifying assumptions leave us with the planar manipulator of Figure 6, free of moving its end-effector by controlling the angles $\theta_1$ and $\theta_2$, jointly referred to as the robot’s *configuration*, and indicated with $q = [\theta_1, \theta_2 ] \in [-\pi, +\pi]^2$. The axis attached to the joints indicate the associated reference frame, whereas circular arrows indicate the maximal feasible rotation allowed at each joint. In this tutorial, we do not cover topics related to spatial algebra, and we instead refer the reader to and for excellent explanations of the mechanics and theoretical foundations of producing motion on rigid bodies. +All these simplifying assumptions leave us with the planar manipulator of Figure6, free of moving its end-effector by controlling the angles $\theta_1$ and $\theta_2$, jointly referred to as the robot’s *configuration*, and indicated with $q = [\theta_1, \theta_2 ] \in [-\pi, +\pi]^2$. The axis attached to the joints indicate the associated reference frame, whereas circular arrows indicate the maximal feasible rotation allowed at each joint. In this tutorial, we do not cover topics related to spatial algebra, and we instead refer the reader to and for excellent explanations of the mechanics and theoretical foundations of producing motion on rigid bodies.
Planar, 2-dof schematic representation of the SO-100 manipulator under diverse deployment settings. From left to right: completely free of moving; constrained by the presence of the surface; constrained by the surface and presence of obstacles. Circular arrows around each joint indicate the maximal rotation feasible at that joint.
-Considering the (toy) example presented in Figure 6, then we can analytically write the end-effector’s position $p \in \mathbb R^2$ as a function of the robot’s configuration, $p = p(q), p: \mathcal Q \mapsto \mathbb R^2$. In particular, we have: +Considering the (toy) example presented in Figure6, then we can analytically write the end-effector’s position $p \in \mathbb R^2$ as a function of the robot’s configuration, $p = p(q), p: \mathcal Q \mapsto \mathbb R^2$. In particular, we have: $$ `p(q) = \begin @@ -325,19 +325,17 @@ Deriving the end-effector’s *pose*--position *and* orientation--in some $m$-di In the simplified case here considered (for which $\boldsymbol{p} \equiv p$, as the orientation of the end-effector is disregarded for simplicity), one can solve the problem of controlling the end-effector’s location to reach a goal position $p^*$ by solving analytically for $q: p(q) = f_{\text{FK}}(q) = p^*$. However, in the general case, one might not be able to solve this problem analytically, and can typically resort to iterative optimization methods comparing candidate solutions using a loss function (in the simplest case, $\Vert p(q) - p^* \Vert_2^2$ is a natural candidate), yielding: -$\min_{q \in \mathcal Q} \Vert p(q) - p^* \Vert_2^2 \, . -$ +$\min_{q \in \mathcal Q} \Vert p(q) - p^* \Vert_2^2 \, . $ Exact analytical solutions to IK are even less appealing when one considers the presence of obstacles in the robot’s workspace, resulting in constraints on the possible values of $q \in \mathcal Q \subseteq [-\pi, +\pi]^n \subset \mathbb R^n$ in the general case of $n$-links robots. -For instance, the robot in Figure 7 is (very naturally) obstacled by the presence of the surface upon which it rests: $\theta_1$ can now exclusively vary within $[0, \pi]$, while possible variations in $\theta_2$ depend on $\theta_1$ (when $\theta_1 \to 0$ or $\theta_1 \to \pi$, further downwards movements are restricted). Even for a simplified kinematic model, developing techniques to solve eq. [eq:ik_problem] is in general non-trivial in the presence of constraints, particularly considering that the feasible set of solutions $\mathcal Q$ may change across problems. Figure 9 provides an example of how the environment influences the feasible set considered, with a new set of constraints deriving from the position of a new obstacle. +For instance, the robot in Figure7 is (very naturally) obstacled by the presence of the surface upon which it rests: $\theta_1$ can now exclusively vary within $[0, \pi]$, while possible variations in $\theta_2$ depend on $\theta_1$ (when $\theta_1 \to 0$ or $\theta_1 \to \pi$, further downwards movements are restricted). Even for a simplified kinematic model, developing techniques to solveeq.[eq:ik_problem] is in general non-trivial in the presence of constraints, particularly considering that the feasible set of solutions $\mathcal Q$ may change across problems. Figure9 provides an example of how the environment influences the feasible set considered, with a new set of constraints deriving from the position of a new obstacle. -However, IK--solving eq. [eq:ik_problem] for a feasible $q$--only proves useful in determining information regarding the robot’s configuration in the goal pose, and crucially does not provide information on the *trajectory* to follow over time to reach a target pose. Expert-defined trajectories obviate to this problem providing a length-$K$ succession of goal poses $\tau_K = [p^*_0, p^*_1, \dots p^*_K]$ for tracking. In practice, trajectories can also be obtained automatically through *motion planning* algorithms, thus avoiding expensive trajectory definition from human experts. However, tracking $\tau_K$ via IK can prove prohibitively expensive, as tracking would require $K$ resolutions of eq. [eq:ik_problem] (one for each target pose). *Differential* inverse kinematics (diff-IK) complements IK via closed-form solution of a variant of eq. [eq:ik_problem]. Let $J(q)$ denote the Jacobian matrix of (partial) derivatives of the FK-function $f_\text{FK}: \mathcal Q \mapsto \mathcal P$, such that $J(q) = \frac{\partial f_{FK}(q)}{\partial q }$. Then, one can apply the chain rule to any $p(q) = f_{\text{FK}}(q)$, deriving $\dot p = J(q) \dot q$, and thus finally relating variations in the robot configurations to variations in pose, thereby providing a platform for control. +However, IK--solving eq.[eq:ik_problem] for a feasible $q$--only proves useful in determining information regarding the robot’s configuration in the goal pose, and crucially does not provide information on the *trajectory* to follow over time to reach a target pose. Expert-defined trajectories obviate to this problem providing a length-$K$ succession of goal poses $\tau_K = [p^*_0, p^*_1, \dots p^*_K]$ for tracking. In practice, trajectories can also be obtained automatically through *motion planning* algorithms, thus avoiding expensive trajectory definition from human experts. However, tracking $\tau_K$ via IK can prove prohibitively expensive, as tracking would require $K$ resolutions of eq.[eq:ik_problem] (one for each target pose). *Differential* inverse kinematics (diff-IK) complements IK via closed-form solution of a variant of eq.[eq:ik_problem]. Let $J(q)$ denote the Jacobian matrix of (partial) derivatives of the FK-function $f_\text{FK}: \mathcal Q \mapsto \mathcal P$, such that $J(q) = \frac{\partial f_{FK}(q)}{\partial q }$. Then, one can apply the chain rule to any $p(q) = f_{\text{FK}}(q)$, deriving $\dot p = J(q) \dot q$, and thus finally relating variations in the robot configurations to variations in pose, thereby providing a platform for control. -Given a desired end-effector trajectory $\dot {p}^*(t)$ (1) indicating anchor regions in space and (2) how much time to spend in each region, diff-IK finds $\dot q(t)$ solving for joints’ *velocities* instead of *configurations*, $\dot q(t) = \arg\min_\nu \; \lVert J(q(t)) \nu - \dot {p}^*(t) \rVert_2^2 -$ +Given a desired end-effector trajectory $\dot {p}^*(t)$ (1) indicating anchor regions in space and (2) how much time to spend in each region, diff-IK finds $\dot q(t)$ solving for joints’ *velocities* instead of *configurations*, $\dot q(t) = \arg\min_\nu \; \lVert J(q(t)) \nu - \dot {p}^*(t) \rVert_2^2 $ -Unlike eq. [eq:ik_problem], solving for $\dot q$ is much less dependent on the environment (typically, variations in velocity are constrained by physical limits on the actuators). Conveniently, eq. [eq:reg_ik_velocity] also often admits the closed-form solution $\dot q = J(q)^+ \dot {p}^*$, where $J^+(q)$ denotes the Moore-Penrose pseudo-inverse of $J(q)$. Finally, discrete-time joint configurations $q$ can be reconstructed from joint velocities $\dot q$ using forward-integration on the continuous-time joint velocity , $q_{t+1} = q_t + \Delta t\,\dot q_t$ for a given $\Delta t$, resulting in tracking via diff-IK. +Unlikeeq.[eq:ik_problem], solving for $\dot q$ is much less dependent on the environment (typically, variations in velocity are constrained by physical limits on the actuators). Conveniently, eq.[eq:reg_ik_velocity] also often admits the closed-form solution $\dot q = J(q)^+ \dot {p}^*$, where $J^+(q)$ denotes the Moore-Penrose pseudo-inverse of $J(q)$. Finally, discrete-time joint configurations $q$ can be reconstructed from joint velocities $\dot q$ using forward-integration on the continuous-time joint velocity , $q_{t+1} = q_t + \Delta t\,\dot q_t$ for a given $\Delta t$, resulting in tracking via diff-IK. Following trajectories with diff-IK is a valid option in well-controlled and static environments (e.g., industrial manipulators in controlled manufacturing settings), and relies on the ability to define a set of target velocities to track $[\dot {p}^*_0, \dot {p}^*_1, \dots, \dot {p}^*_k ]$--an error-prone task largely requiring human expertise. Furthermore, diff-IK relies on the ability to (1) access $J(q) \, \forall q \in \mathcal Q$ and (2) compute its pseudo-inverse at every iteration of a given control cycle--a challenging assumption in highly dynamical settings, or for complex kinematic chains. @@ -356,7 +354,7 @@ r0.3
-One such case is presented in Figure [fig:planar-manipulator-box-velocity], where another rigid body other than the manipulator is moving in the environment along the horizontal axis, with velocity $\dot x_B$. Accounting analytically for the presence of this disturbance--for instance, to prevent the midpoint of the link from ever colliding with the object--requires access to $\dot x_B$ at least, to derive the equation characterizing the motion of the environment. +One such case is presented in Figure[fig:planar-manipulator-box-velocity], where another rigid body other than the manipulator is moving in the environment along the horizontal axis, with velocity $\dot x_B$. Accounting analytically for the presence of this disturbance--for instance, to prevent the midpoint of the link from ever colliding with the object--requires access to $\dot x_B$ at least, to derive the equation characterizing the motion of the environment. Less predictable disturbances however (e.g., $\dot x_B \leftarrow \dot x_B + {\varepsilon}, {\varepsilon}\sim N(0,1)$) may prove challenging to model analytically, and one could attain the same result of preventing link-object collision by adding a condition on the distance between the midpoint of $l$ and $x_B$, enforced through a feedback loop on the position of the robot and object at each control cycle. @@ -364,7 +362,7 @@ To mitigate the effect of modeling errors, sensing noise and other disturbances, More advanced techniques for control consisting in feedback linearization, PID control, Linear Quatratic Regulator (LQR) or Model-Predictive Control (MPC) can be employed to stabilize tracking and reject moderate perturbations, and we refer to for in-detail explanation of these concepts, or for a simple, intuitive example in the case of a point-mass system. Nonetheless, feedback control presents its challenges as well: tuning gains remains laborious and system-specific. Further, manipulation tasks present intermittent contacts inducing hybrid dynamics (mode switches) and discontinuities in the Jacobian, challenging the stability guarantees of the controller and thus often necessitating rather conservative gains and substantial hand-tuning. -We point the interested reader to , , and  for extended coverage of FK, IK, diff-IK and control for (diff-)IK. +We point the interested reader to, , and for extended coverage of FK, IK, diff-IK and control for (diff-)IK. ## Limitations of Dynamics-based Robotics @@ -386,9 +384,9 @@ Moreover, classical planners operate on compact, assumed-sufficient state repres Setting aside integration and scalability challenges: developing accurate modeling of contact, friction, and compliance for complicated systems remains difficult. Rigid-body approximations are often insufficient in the presence of deformable objects, and relying on approximated models hinders real-world applicability of the methods developed. In the case of complex, time-dependent and/or non-linear dynamics, even moderate mismatches in parameters, unmodeled evolutions, or grasp-induced couplings can qualitatively affect the observed dynamics. -Lastly, dynamics-based methods (naturally) overlook the rather recent increase in availability of openly-available robotics datasets. The curation of academic datasets by large centralized groups of human experts in robotics @collaborationOpenXEmbodimentRobotic2025, @khazatskyDROIDLargeScaleInTheWild2025 is now increasingly complemented by a growing number of robotics datasets contributed in a decentralized fashion by individuals with varied expertise. If not tangentially, dynamics-based approaches are not posed to maximally benefit from this trend, which holds the premise of allowing generalization in the space of tasks and embodiments, like data was the cornerstone for advancements in vision @alayracFlamingoVisualLanguage2022 and natural-language understanding @brownLanguageModelsAre2020. +Lastly, dynamics-based methods (naturally) overlook the rather recent increase in availability of openly-available robotics datasets. The curation of academic datasets by large centralized groups of human experts in robotics@collaborationOpenXEmbodimentRobotic2025, @khazatskyDROIDLargeScaleInTheWild2025 is now increasingly complemented by a growing number of robotics datasets contributed in a decentralized fashion by individuals with varied expertise. If not tangentially, dynamics-based approaches are not posed to maximally benefit from this trend, which holds the premise of allowing generalization in the space of tasks and embodiments, like data was the cornerstone for advancements in vision@alayracFlamingoVisualLanguage2022 and natural-language understanding@brownLanguageModelsAre2020. -Taken together, these limitations (Figure 10) motivate the exploration of learning-based approaches that can (1) integrate perception and control more tightly, (2) adapt across tasks and embodiments with reduced expert modeling interventions and (3) scale gracefully in performance as more robotics data becomes available. +Taken together, these limitations (Figure10) motivate the exploration of learning-based approaches that can (1) integrate perception and control more tightly, (2) adapt across tasks and embodiments with reduced expert modeling interventions and (3) scale gracefully in performance as more robotics data becomes available. # Robot (Reinforcement) Learning @@ -414,9 +412,9 @@ TL;DR The need for expensive high-fidelity simulators can be obviated by learnin caption={'Learning-based robotics streamlines perception-to-action by learning a (1) unified high-level controller capable to take (2) high-dimensional, unstructured sensorimotor information. Learning (3) does not require a dynamics model and instead focuses on interaction data, and (4) empirically correlates with the scale of the data used.'} /> -Learning-based techniques for robotics naturally address the limitations presented in 2 (Figure 11). Learning-based techniques typically rely on prediction-to-action (*visuomotor policies*), thereby directly mapping sensorimotor inputs to predicted actions, streamlining control policies by removing the need to interface multiple components. Mapping sensorimotor inputs to actions directly also allows to add diverse input modalities, leveraging the automatic feature extraction characteristic of most modern learning systems. Further, learning-based approaches can in principle entirely bypass modeling efforts and instead rely exclusively on interactions data, proving transformative when dynamics are challenging to model or even entirely unknown. Lastly, learning for robotics (*robot learning*) is naturally well posed to leverage the growing amount of robotics data openly available, just as computer vision first and natural language processing later did historically benefit from large scale corpora of (possibly non curated) data, in great part overlooked by dynamics-based approaches. +Learning-based techniques for robotics naturally address the limitations presented in2 (Figure11). Learning-based techniques typically rely on prediction-to-action (*visuomotor policies*), thereby directly mapping sensorimotor inputs to predicted actions, streamlining control policies by removing the need to interface multiple components. Mapping sensorimotor inputs to actions directly also allows to add diverse input modalities, leveraging the automatic feature extraction characteristic of most modern learning systems. Further, learning-based approaches can in principle entirely bypass modeling efforts and instead rely exclusively on interactions data, proving transformative when dynamics are challenging to model or even entirely unknown. Lastly, learning for robotics (*robot learning*) is naturally well posed to leverage the growing amount of robotics data openly available, just as computer vision first and natural language processing later did historically benefit from large scale corpora of (possibly non curated) data, in great part overlooked by dynamics-based approaches. -Being a field at its relative nascent stages, no prevalent technique(s) proved distinctly better better in robot learning. Still, two major classes of methods gained prominence: reinforcement learning (RL) and Behavioral Cloning (BC) (Figure 12). In this section, we provide a conceptual overview of applications of the former to robotics, as well as introduce practical examples of how to use RL within `lerobot`. We then introduce the major limitations RL suffers from, to introduce BC techniques in the next sections ([sec:learning-bc-single, sec:learning-bc-generalist]). +Being a field at its relative nascent stages, no prevalent technique(s) proved distinctly better better in robot learning. Still, two major classes of methods gained prominence: reinforcement learning (RL) and Behavioral Cloning (BC) (Figure12). In this section, we provide a conceptual overview of applications of the former to robotics, as well as introduce practical examples of how to use RL within `lerobot`. We then introduce the major limitations RL suffers from, to introduce BC techniques in the next sections ([sec:learning-bc-single, sec:learning-bc-generalist]). -In Figure 12 we decided to include generalist robot models @blackp0VisionLanguageActionFlow2024, @shukorSmolVLAVisionLanguageActionModel2025 alongside task-specific BC methods. While significant different in spirit--*generalist* models are language-conditioned and use instructions to generate motion valid across many tasks, while *task-specific* models are typically not language-conditioned and used to perform a single task--foundation models are largely trained to reproduce trajectories contained in a large training set of input demonstrations. Thus, we argue generalist policies can indeed be grouped alongside other task-specific BC methods, as they both leverage similar training data and schemas. +In Figure12 we decided to include generalist robot models@blackp0VisionLanguageActionFlow2024, @shukorSmolVLAVisionLanguageActionModel2025 alongside task-specific BC methods. While significant different in spirit--*generalist* models are language-conditioned and use instructions to generate motion valid across many tasks, while *task-specific* models are typically not language-conditioned and used to perform a single task--foundation models are largely trained to reproduce trajectories contained in a large training set of input demonstrations. Thus, we argue generalist policies can indeed be grouped alongside other task-specific BC methods, as they both leverage similar training data and schemas. -Figure 12 illustrates this categorization graphically, explicitly listing all the robot learning policies currently available in `lerobot`: Action Chunking with Transformers (ACT) @zhaoLearningFineGrainedBimanual2023, Diffusion Policy @chiDiffusionPolicyVisuomotor2024, Vector-Quantized Behavior Transformer (VQ-BeT) @leeBehaviorGenerationLatent2024, $\pi_0$ @blackp0VisionLanguageActionFlow2024, SmolVLA @shukorSmolVLAVisionLanguageActionModel2025, Human-in-the-loop Sample-efficient RL (HIL-SERL) @luoPreciseDexterousRobotic2024 and TD-MPC @hansenTemporalDifferenceLearning2022. +Figure12 illustrates this categorization graphically, explicitly listing all the robot learning policies currently available in `lerobot`: Action Chunking with Transformers (ACT)@zhaoLearningFineGrainedBimanual2023, Diffusion Policy@chiDiffusionPolicyVisuomotor2024, Vector-Quantized Behavior Transformer (VQ-BeT)@leeBehaviorGenerationLatent2024, $\pi_0$@blackp0VisionLanguageActionFlow2024, SmolVLA@shukorSmolVLAVisionLanguageActionModel2025, Human-in-the-loop Sample-efficient RL (HIL-SERL)@luoPreciseDexterousRobotic2024 and TD-MPC@hansenTemporalDifferenceLearning2022. -Applications of RL to robotics have been long studied, to the point the relationship between these two disciplines has been compared to that between physics and matematics @koberReinforcementLearningRobotics. Indeed, due to their interactive and sequential nature, many robotics problems can be directly mapped to RL problems. Figure 13 depicts two of such cases. Reaching for an object to move somewhere else in the scene is an indeed sequential problem where at each cycle the controller needs to adjust the position of the robotic arm based on their current configuration and the (possibly varying) position of the object. Figure 13 also shows an example of a locomotion problem, where sequentiality is inherent in the problem formulation. While sliding to the side, the controller has to constantly keep adjusting to the robot’s propioperception to avoid failure (falling). +Applications of RL to robotics have been long studied, to the point the relationship between these two disciplines has been compared to that between physics and matematics@koberReinforcementLearningRobotics. Indeed, due to their interactive and sequential nature, many robotics problems can be directly mapped to RL problems. Figure13 depicts two of such cases. Reaching for an object to move somewhere else in the scene is an indeed sequential problem where at each cycle the controller needs to adjust the position of the robotic arm based on their current configuration and the (possibly varying) position of the object. Figure13 also shows an example of a locomotion problem, where sequentiality is inherent in the problem formulation. While sliding to the side, the controller has to constantly keep adjusting to the robot’s propioperception to avoid failure (falling). ## A (Concise) Introduction to RL -The RL framework @suttonReinforcementLearningIntroduction2018, which we briefly introduce here, has often been used to model robotics problems @koberReinforcementLearningRobotics. RL is a subfield within ML fundamentally concerned with the development of autonomous systems (*agents*) learning how to *continuously behave* in an evolving environment, developing (ideally, well-performing) control strategies (*policies*). Crucially for robotics, RL agents can improve via trial-and-error only, thus entirely bypassing the need to develop explicit models of the problem dynamics, and rather exploiting interaction data only. In RL, this feedback loop (Figure 14) between actions and outcomes is established through the agent sensing a scalar quantity (*reward*). +The RL framework@suttonReinforcementLearningIntroduction2018, which we briefly introduce here, has often been used to model robotics problems@koberReinforcementLearningRobotics. RL is a subfield within ML fundamentally concerned with the development of autonomous systems (*agents*) learning how to *continuously behave* in an evolving environment, developing (ideally, well-performing) control strategies (*policies*). Crucially for robotics, RL agents can improve via trial-and-error only, thus entirely bypassing the need to develop explicit models of the problem dynamics, and rather exploiting interaction data only. In RL, this feedback loop (Figure14) between actions and outcomes is established through the agent sensing a scalar quantity (*reward*). -Formally, interactions between an agent and its environment are typically modeled via a Markov Decision Process (MDP) @bellmanMarkovianDecisionProcess1957. Representing robotics problems via MDPs offers several advantages, including (1) incorporating uncertainty through MDP’s inherently stochastic formulation and (2) providing a theoretically sound framework for learning *without* an explicit dynamic model. While accommodating also a continuous time formulation, MDPs are typically considered in discrete time in RL, thus assuming interactions to atomically take place over the course of discrete *timestep* $t=0,1,2,3, \dots, T$. MDPs allowing for an unbounded number of interactions ( $T \to + \infty$ ) are typically termed *infinite-horizon*, and opposed to *finite-horizon* MDPs in which $T$ cannot grow unbounded. Unless diversely specified, we will only be referring to discrete-time finite-horizon (*episodic*) MDPs here. +Formally, interactions between an agent and its environment are typically modeled via a Markov Decision Process (MDP)@bellmanMarkovianDecisionProcess1957. Representing robotics problems via MDPs offers several advantages, including (1) incorporating uncertainty through MDP’s inherently stochastic formulation and (2) providing a theoretically sound framework for learning *without* an explicit dynamic model. While accommodating also a continuous time formulation, MDPs are typically considered in discrete time in RL, thus assuming interactions to atomically take place over the course of discrete *timestep* $t=0,1,2,3, \dots, T$. MDPs allowing for an unbounded number of interactions ( $T \to + \infty$ ) are typically termed *infinite-horizon*, and opposed to *finite-horizon* MDPs in which $T$ cannot grow unbounded. Unless diversely specified, we will only be referring to discrete-time finite-horizon (*episodic*) MDPs here. Formally, a lenght-$T$ Markov Decision Process (MDP) is a tuple $\mathcal M = \langle \mathcal S, \mathcal A, \mathcal D, r, \gamma, \rho, T \rangle$, where: @@ -466,9 +464,9 @@ Formally, a lenght-$T$ Markov Decision Process (MDP) is a tuple $\mathcal M = \l - $\mathcal A$ is the *action space*; $a_t\in \mathcal A$ may represent joint torques, joint velocities, or even end-effector commands. In general, actions correspond to commands intervenings on the configuration of the robot. -- $\mathcal D$ represents the (possibly non-deterministic) environment dynamics, with $\mathcal D: \mathcal S\times \mathcal A\times \mathcal S\mapsto [0, 1]$ corresponding to $\mathcal D\, (s_t, a_t, s_{t+1})= \mathbb P (s_{t+1}\vert s_t, a_t)$. For instance, for a planar manipulator dynamics could be considered deterministic when the environment is fully described (Figure 6), and stochastic when unmodeled disturbances depending on non-observable parameters intervene (Figure [fig:planar-manipulator-box-velocity]). +- $\mathcal D$ represents the (possibly non-deterministic) environment dynamics, with $\mathcal D: \mathcal S\times \mathcal A\times \mathcal S\mapsto [0, 1]$ corresponding to $\mathcal D\, (s_t, a_t, s_{t+1})= \mathbb P (s_{t+1}\vert s_t, a_t)$. For instance, for a planar manipulator dynamics could be considered deterministic when the environment is fully described (Figure6), and stochastic when unmodeled disturbances depending on non-observable parameters intervene (Figure[fig:planar-manipulator-box-velocity]). -- $r: \mathcal S\times \mathcal A\times \mathcal S\to \mathbb R$ is the *reward function*, weighing the transition $(s_t, a_t, s_{t+1})$ in the context of the achievement of an arbitrary goal. For instance, a simple reward function for quickly moving the along the $x$ axis in 3D-space (Figure 13) could be based on the absolute position of the robot along the $x$ axis ($p_x$), present negative penalties for falling over (measured from $p_z$) and a introduce bonuses $\dot p_x$ for speed, $r (s_t, a_t, s_{t+1})\equiv r(s_t) = p_{x_t} \cdot \dot p_{x_t} - \tfrac{1}{p_{z_t}}$. +- $r: \mathcal S\times \mathcal A\times \mathcal S\to \mathbb R$ is the *reward function*, weighing the transition $(s_t, a_t, s_{t+1})$ in the context of the achievement of an arbitrary goal. For instance, a simple reward function for quickly moving the along the $x$ axis in 3D-space (Figure13) could be based on the absolute position of the robot along the $x$ axis($p_x$), present negative penalties for falling over (measured from $p_z$) and a introduce bonuses $\dot p_x$ for speed, $r (s_t, a_t, s_{t+1})\equiv r(s_t) = p_{x_t} \cdot \dot p_{x_t} - \tfrac{1}{p_{z_t}}$. Lastly, $\gamma \in [0,1]$ represent the discount factor regulating preference for immediate versus long-term reward (with an effective horizon equal to $\tfrac{1}{1-\gamma}$), and $\rho$ is the distribution, defined over $\mathcal S$, the MDP’s *initial* state is sampled from, $s_0 \sim \rho$. @@ -480,11 +478,7 @@ A length-$T$ *trajectory* is the (random) sequence \end{equation} ``` with per-step rewards defined as $r_t = r (s_t, a_t, s_{t+1})$ for ease of notation.Interestingly, assuming both the environment dynamics and conditional distribution over actions given states--the *policy*--to be *Markovian*: -$$ -`\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb P (s_{t+1}\vert s_t, a_t)\\ - \mathbb P(a_t\vert s_t, a_{t-1}, s_{t-1}, s_0, a_0) = \mathbb P(a_t\vert s_t) ` -$$ - The probability of observing a given trajectory $\tau$ factorizes into +$$\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb P (s_{t+1}\vert s_t, a_t)\\ \mathbb P(a_t\vert s_t, a_{t-1}, s_{t-1}, s_0, a_0) = \mathbb P(a_t\vert s_t) $$The probability of observing a given trajectory$\tau$ factorizes into ``` math \begin{equation} @@ -492,18 +486,14 @@ $$ \end{equation} ``` -Policies $\mathbb P(a_t\vert s_t)$ are typically indicated as $\pi(a_t\vert s_t)$, and often parametrized via $\theta$, yielding $\pi_\theta (a_t\vert s_t)$. Policies are trained optimizing the (discounted) *return* associated to a given $\tau$, i.e. the (random) sum of measured rewards over trajectory: -``` math -G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t. -``` -In that, agents seek to learn control strategies (*policies*, $\pi_\theta$) maximizing the expected return $\mathbb E_{\tau \sim \pi_\theta} G(\tau)$. For a given dynamics $\mathcal D$--i.e., for a given problem--taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies $\Pi$, yielding the (maximization) target $J : \Pi \mapsto \mathbb R$ +Policies $\mathbb P(a_t\vert s_t)$ are typically indicated as $\pi(a_t\vert s_t)$, and often parametrized via $\theta$, yielding $\pi_\theta (a_t\vert s_t)$. Policies are trained optimizing the (discounted) *return* associated to a given $\tau$, i.e. the (random) sum of measured rewards over trajectory: ``` math G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t. ``` In that, agents seek to learn control strategies (*policies*,$\pi_\theta$) maximizing the expected return $\mathbb E_{\tau \sim \pi_\theta} G(\tau)$. For a given dynamics $\mathcal D$--i.e., for a given problem--taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies $\Pi$, yielding the (maximization) target $J : \Pi \mapsto \mathbb R$ $$ `J(\pi_\theta) = \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} [G(\tau)],\\ \mathbb P_{\theta; \mathcal D} (\tau) = \rho \prod_{t=0}^{T-1} \mathcal D (s_t, a_t, s_{t+1})\ \pi_\theta (a_t\vert s_t).` $$ -Because in the RL framework the agent is assumed to only be able to observe the environment dynamics and not to intervene on them, [eq:RL-j-function] varies exclusively with the policy followed. In turn, MDPs naturally provide a framework to optimize over the space of the possible behaviors an agent might enact ($\pi \in \Pi$), searching for the *optimal policy* $\pi^* = \arg \max_{\theta} J(\pi_\theta)$, where $\theta$ is the parametrization adopted by the policy set $\Pi: \pi_\theta \in \Pi, \ \forall \theta$. Other than providing a target for policy search, $G(\tau)$ can also be used as a target to discriminate between states and state-action pairs. Given any state $s \in \mathcal S$--e.g., a given configuration of the robot--the *state-value* function +Because in the RL framework the agent is assumed to only be able to observe the environment dynamics and not to intervene on them,[eq:RL-j-function] varies exclusively with the policy followed. In turn, MDPs naturally provide a framework to optimize over the space of the possible behaviors an agent might enact ($\pi \in \Pi$), searching for the *optimal policy* $\pi^* = \arg \max_{\theta} J(\pi_\theta)$, where $\theta$ is the parametrization adopted by the policy set $\Pi: \pi_\theta \in \Pi, \ \forall \theta$. Other than providing a target for policy search, $G(\tau)$ can also be used as a target to discriminate between states and state-action pairs. Given any state $s \in \mathcal S$--e.g., a given configuration of the robot--the *state-value* function ``` math V_\pi(s) = \mathbb E_{\tau \sim \pi} [G(\tau) \big \vert s_0 = s] ``` @@ -512,12 +502,7 @@ can be used to discriminate between desirable and undesirable state in terms of Q_\pi(s,a) = \mathbb E_{\tau \sim \pi} [G (\tau) \big \vert s_0 = s, a_0=a] ``` Crucially, value functions are interrelated: -$$ -`Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma V_\pi(s_{t+1})]\\ - V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)] -` -$$ - Inducing an ordering over states and state-action pairs under $\pi$, value functions are central to most RL algorithms. A variety of methods have been developed in RL as standalone attemps to find (approximate) solutions to the problem of maximizing cumulative reward (Figure 15). +$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma V_\pi(s_{t+1})]\\ V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)] $$Inducing an ordering over states and state-action pairs under$\pi$, value functions are central to most RL algorithms. A variety of methods have been developed in RL as standalone attemps to find (approximate) solutions to the problem of maximizing cumulative reward (Figure15). -Popular approaches to continuous state and action space--such as those studied within robotics--include @schulmanTrustRegionPolicy2017, @schulmanProximalPolicyOptimization2017, @haarnojaSoftActorCriticOffPolicy2018. Across manipulation @akkayaSolvingRubiksCube2019 and locomotion @leeLearningQuadrupedalLocomotion2020 problems, RL proved extremely effective in providing a platform to (1) adopt a unified, streamlined perception-to-action pipeline, (2) natively integrate propioperception with multi-modal high-dimensional sensor streams (3) disregard a description of the environment dynamics, by focusing on observed interaction data rather than modeling, and (4) anchor policies in the experience collected and stored in datasets. For a more complete survey of applications of RL to robotics, we refer the reader to @koberReinforcementLearningRobotics, @tangDeepReinforcementLearning2024. +Popular approaches to continuous state and action space--such as those studied within robotics--include@schulmanTrustRegionPolicy2017, @schulmanProximalPolicyOptimization2017, @haarnojaSoftActorCriticOffPolicy2018. Across manipulation@akkayaSolvingRubiksCube2019 and locomotion@leeLearningQuadrupedalLocomotion2020 problems, RL proved extremely effective in providing a platform to (1) adopt a unified, streamlined perception-to-action pipeline, (2) natively integrate propioperception with multi-modal high-dimensional sensor streams (3) disregard a description of the environment dynamics, by focusing on observed interaction data rather than modeling, and (4) anchor policies in the experience collected and stored in datasets. For a more complete survey of applications of RL to robotics, we refer the reader to@koberReinforcementLearningRobotics, @tangDeepReinforcementLearning2024. ## Real-world RL for Robotics @@ -537,7 +522,7 @@ Streamlined end-to-end control pipelines, data-driven feature extraction and a d First, especially early in training, actions are typically explorative, and thus erractic. On physical systems, untrained policies may command high velocities, self-collisiding configurations, or torques exceeding joint limits, leading to wear and potential hardware damage. Mitigating these risks requires external safeguards (e.g., watchdogs, safety monitors, emergency stops), often incuring in a high degree of human supervision. Further, in the typical episodic setting considered in most robotics problems, experimentation is substantially slowed down by the need to manually reset the environment over the course of training, a time-consuming and brittle process. -Second, learning with a limited number of samples remains problematic in RL, limiting the applicability of RL in real-world robotics due to consequently prohibitive timescales of training. Even strong algorithms such as SAC @haarnojaSoftActorCriticOffPolicy2018 typically require a large numbers of transitions $\{ (s_t, a_t, r_t, s_{t+1})\}_{t=1}^N$. On hardware, generating these data is time-consuming and can even be prohibitive. +Second, learning with a limited number of samples remains problematic in RL, limiting the applicability of RL in real-world robotics due to consequently prohibitive timescales of training. Even strong algorithms such as SAC@haarnojaSoftActorCriticOffPolicy2018 typically require a large numbers of transitions $\{ (s_t, a_t, r_t, s_{t+1})\}_{t=1}^N$. On hardware, generating these data is time-consuming and can even be prohibitive. -Training RL policies in simulation @tobinDomainRandomizationTransferring2017 addresses both issues: it eliminates physical risk and dramatically increases throughput. Yet, simulators require significant modeling effort, and rely on assumptions (simplified physical modeling, instantaneous actuation, static environmental conditions, etc.) limiting transferring policies learned in simulation due the discrepancy between real and simulated environments (*reality gap*, Figure 16). *Domain randomization* (DR) is a popular technique to overcome the reality gap, consisting in randomizing parameters of the simulated environment during training, to induce robustness to specific disturbances. In turn, DR is employed to increase the diversity of scenarios over the course of training, improving on the chances sim-to-real transfer @akkayaSolvingRubiksCube2019, @antonovaReinforcementLearningPivoting2017, @jiDribbleBotDynamicLegged2023. In practice, DR is performed further parametrizing the *simulator*’s dynamics $\mathcal D \equiv \mathcal D_\xi$ with a *dynamics* (random) vector $\xi$ drawn an arbitrary distribution, $\xi \sim \Xi$. Over the course of training--typically at each episode’s reset--a new $\xi$ is drawn, and used to specify the environment’s dynamics for that episode. For instance, one could decide to randomize the friction coefficient of the surface in a locomotion task (Figure 17), or the center of mass of an object for a manipulation task. +Training RL policies in simulation@tobinDomainRandomizationTransferring2017 addresses both issues: it eliminates physical risk and dramatically increases throughput. Yet, simulators require significant modeling effort, and rely on assumptions (simplified physical modeling, instantaneous actuation, static environmental conditions, etc.) limiting transferring policies learned in simulation due the discrepancy between real and simulated environments (*reality gap*, Figure16). *Domain randomization* (DR) is a popular technique to overcome the reality gap, consisting in randomizing parameters of the simulated environment during training, to induce robustness to specific disturbances. In turn, DR is employed to increase the diversity of scenarios over the course of training, improving on the chances sim-to-real transfer@akkayaSolvingRubiksCube2019, @antonovaReinforcementLearningPivoting2017, @jiDribbleBotDynamicLegged2023. In practice, DR is performed further parametrizing the *simulator*’s dynamics $\mathcal D \equiv \mathcal D_\xi$ with a *dynamics* (random) vector $\xi$ drawn an arbitrary distribution, $\xi \sim \Xi$. Over the course of training--typically at each episode’s reset--a new $\xi$ is drawn, and used to specify the environment’s dynamics for that episode. For instance, one could decide to randomize the friction coefficient of the surface in a locomotion task (Figure17), or the center of mass of an object for a manipulation task. -While effective in transfering policies across the reality gap in real-world robotics @tobinDomainRandomizationTransferring2017, @akkayaSolvingRubiksCube2019, @jiDribbleBotDynamicLegged2023, @tiboniDomainRandomizationEntropy2024, DR often requires extensive manual engineering. First, identifying which parameters to randomize--i.e., the *support* $\text{supp} (\Xi)$ of $\Xi$--is an inherently task specific process. When locomoting over different terrains, choosing to randomize the friction coefficient is a reasonable choice, yet not completely resolutive as other factors (lightning conditions, external temperature, joints’ fatigue, etc.) may prove just as important, making selecting these parameters yet another source of brittlness. +While effective in transfering policies across the reality gap in real-world robotics@tobinDomainRandomizationTransferring2017, @akkayaSolvingRubiksCube2019, @jiDribbleBotDynamicLegged2023, @tiboniDomainRandomizationEntropy2024, DR often requires extensive manual engineering. First, identifying which parameters to randomize--i.e., the *support* $\text{supp} (\Xi)$ of $\Xi$--is an inherently task specific process. When locomoting over different terrains, choosing to randomize the friction coefficient is a reasonable choice, yet not completely resolutive as other factors (lightning conditions, external temperature, joints’ fatigue, etc.) may prove just as important, making selecting these parameters yet another source of brittlness. -Selecting the dynamics distribution $\Xi$ is also non-trivial. On the one hand, distributions with low entropy might risk to cause failure at transfer time, due to the limited robustness induced over the course of training. On the other hand, excessive randomization may cause over-regularization and hinder performance. Consequently, the research community investigated approaches to automatically select the randomization distribution $\Xi$, using signals from the training process or tuning it to reproduce observed real-world trajectories.  @akkayaSolvingRubiksCube2019 use a parametric uniform distribution $\mathcal U(a, b)$ as $\Xi$, widening the bounds as training progresses and the agent’s performance improves (AutoDR). While effective, AutoDR requires significant tuning--the bounds are widened by a fixed, pre-specified amount $\Delta$--and may disregard data when performance *does not* improve after a distribution update @tiboniDomainRandomizationEntropy2024.  @tiboniDomainRandomizationEntropy2024 propose a similar method to AutoDR (DORAEMON) to evolve $\Xi$ based on training signal, but with the key difference of explicitly maximizing the entropy of parametric Beta distributions, inherently more flexible than uniform distributions. DORAEMON proves particularly effective at dynamically increasing the entropy levels of the training distribution by employing a max-entropy objective, under performance constraints formulation. Other approaches to automatic DR consist in specifically tuning $\Xi$ to align as much as possible the simulation and real-world domains. For instance,  @chebotar2019closing interleave in-simulation policy training with repeated real-world policy rollouts used to adjust $\Xi$ based on real-world data, while  @tiboniDROPOSimtoRealTransfer2023 leverage a single, pre-collected set of real-world trajectories and tune $\Xi$ under a simple likelihood objective. +Selecting the dynamics distribution $\Xi$ is also non-trivial. On the one hand, distributions with low entropy might risk to cause failure at transfer time, due to the limited robustness induced over the course of training. On the other hand, excessive randomization may cause over-regularization and hinder performance. Consequently, the research community investigated approaches to automatically select the randomization distribution $\Xi$, using signals from the training process or tuning it to reproduce observed real-world trajectories. @akkayaSolvingRubiksCube2019 use a parametric uniform distribution $\mathcal U(a, b)$ as $\Xi$, widening the bounds as training progresses and the agent’s performance improves (AutoDR). While effective, AutoDR requires significant tuning--the bounds are widened by a fixed, pre-specified amount $\Delta$--and may disregard data when performance *does not* improve after a distribution update@tiboniDomainRandomizationEntropy2024. @tiboniDomainRandomizationEntropy2024 propose a similar method to AutoDR (DORAEMON) to evolve $\Xi$ based on training signal, but with the key difference of explicitly maximizing the entropy of parametric Beta distributions, inherently more flexible than uniform distributions. DORAEMON proves particularly effective at dynamically increasing the entropy levels of the training distribution by employing a max-entropy objective, under performance constraints formulation. Other approaches to automatic DR consist in specifically tuning $\Xi$ to align as much as possible the simulation and real-world domains. For instance, @chebotar2019closing interleave in-simulation policy training with repeated real-world policy rollouts used to adjust $\Xi$ based on real-world data, while @tiboniDROPOSimtoRealTransfer2023 leverage a single, pre-collected set of real-world trajectories and tune $\Xi$ under a simple likelihood objective. While DR has shown promise, it does not address the main limitation that, even under the assumption that an ideal distribution $\Xi$ to sample from was indeed available, many robotics problems cannot be simulated with high-enough fidelity under practical computational constraints in the first place. Simulating contact-rich manipulation of possibly deformable or soft materials--i.e., *folding a piece of clothing*--can be costly and even time-intensive, limiting the benefits of in-simulation training. @@ -571,11 +556,11 @@ A perhaps more foundamental limitation of RL for robotics is the general unavail To make the most of (1) the growing number of openly available datasets and (2) relatively inexpensive robots like the SO-100, RL could (1) be anchored in already-collected trajectories--limiting erratic and dangerous exploration--and (2) train in the real-world directly--bypassing the aforementioned issues with low-fidelity simulations. In such a context, sample-efficient learning is also paramount, as training on the real-world is inherently time-bottlenecked. -Off-policy algorithms like Soft Actor-Critic (SAC) @haarnojaSoftActorCriticOffPolicy2018 tend to be more sample efficient then their on-policy counterpart @schulmanProximalPolicyOptimization2017, due to the presence a *replay buffer* used over the course of the training. Other than allowing to re-use transitions $(s_t, a_t, r_t, s_{t+1})$ over the course of training, the replay buffer can also accomodate for the injection of previously-collected data in the training process @ballEfficientOnlineReinforcement2023. Using expert demonstrations to guide learning together with learned rewards, RL training can effectively be carried out in the real-world @luoSERLSoftwareSuite2025. Interestingly, when completed with in-training human interventions, real-world RL agents have been shown to learn policies with near-perfect success rates on challenging manipulation tasks in 1-2 hours @luoPreciseDexterousRobotic2024. +Off-policy algorithms like Soft Actor-Critic (SAC)@haarnojaSoftActorCriticOffPolicy2018 tend to be more sample efficient then their on-policy counterpart@schulmanProximalPolicyOptimization2017, due to the presence a *replay buffer* used over the course of the training. Other than allowing to re-use transitions $(s_t, a_t, r_t, s_{t+1})$ over the course of training, the replay buffer can also accomodate for the injection of previously-collected data in the training process@ballEfficientOnlineReinforcement2023. Using expert demonstrations to guide learning together with learned rewards, RL training can effectively be carried out in the real-world@luoSERLSoftwareSuite2025. Interestingly, when completed with in-training human interventions, real-world RL agents have been shown to learn policies with near-perfect success rates on challenging manipulation tasks in 1-2 hours@luoPreciseDexterousRobotic2024. #### Sample-efficient RL -In an MDP, the optimal policy $\pi^*$ can be derived from its associated $Q$-function, $Q_{\pi^*}$, and in particular the optimal action(s) $\mu(s_t)$ can be selected maximizing the optimal $Q$-function over the action space, +In an MDP, the optimal policy $\pi^*$ can be derived from its associated $Q$-function, $Q_{\pi^*}$, and in particular the optimal action(s) $\mu(s_t)$ can be selected maximizing the optimal $Q$-functionover the action space, ``` math \mu(s_t) = \max_{a_t\in \mathcal A} Q_{\pi^*}(s_t, a_t). ``` @@ -586,30 +571,29 @@ Interestingly, the $Q^*$-function satisfies a recursive relationship (*Bellman e > Q^*(s_t, a_t) = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q^*(s_{t+1}, a_{t+1}) \big\vert s_t, a_t] > ``` -In turn, the optimal $Q$-function  is guaranteed to be self-consistent by definition. *Value-iteration* methods exploit this relationship (and/or its state-value counterpart, $V^*(s_t)$ ) by iteratively updating an initial estimate of $Q^*$, $Q_k$ using the Bellman equation as update rule (*Q-learning*): +In turn, the optimal $Q$-function is guaranteed to be self-consistent by definition. *Value-iteration* methods exploit this relationship (and/or its state-value counterpart, $V^*(s_t)$ ) by iteratively updating an initial estimate of $Q^*$, $Q_k$ using the Bellman equation as update rule (*Q-learning*): ``` math Q_{i+1}(s_t, a_t) \leftarrow \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q_i (s_{t+1}, a_{t+1}) \big\vert s_t, a_t], \quad i=0,1,2,\dots,K ``` Then, one can derive the (ideally, near-optimal) policy by explicitly maximizing over the action space the final (ideally, near-optimal) estimate $Q_K \approx Q^*$ at each timestep. In fact, under certain assumptions on the MDP considered, $Q_K \to Q^* \, \text{as } K \to \infty$. -Effective in its early applications to small-scale discrete problems and theoretically sound, vanilla Q-learning was found complicated to scale to large $\mathcal S\times \mathcal A$ problems, in which the storing of $Q : \mathcal S\times \mathcal A\mapsto \mathbb R$ alone might result prohibitive. Also, vanilla Q-learning is not directly usable for *continuous*, unstructured state-action space MPDs, such as those considered in robotics. In their seminal work on *Deep Q-Learning* (DQN), @mnihPlayingAtariDeep2013 propose learning Q-values using deep convolutional neural networks, thereby accomodating for large and even unstructured *state* spaces. DQN parametrizes the Q-function using a neural network with parameters $\theta$, updating the parameters by sequentially minimizing the expected squared temporal-difference error (TD-error, $\delta_i$): +Effective in its early applications to small-scale discrete problems and theoretically sound, vanilla Q-learning was found complicated to scale to large $\mathcal S\times \mathcal A$ problems, in which the storing of $Q : \mathcal S\times \mathcal A\mapsto \mathbb R$ alone might result prohibitive. Also, vanilla Q-learning is not directly usable for *continuous*, unstructured state-action space MPDs, such as those considered in robotics. In their seminal work on *Deep Q-Learning* (DQN),@mnihPlayingAtariDeep2013 propose learning Q-values using deep convolutional neural networks, thereby accomodating for large and even unstructured *state* spaces. DQN parametrizes the Q-function using a neural network with parameters $\theta$, updating the parameters by sequentially minimizing the expected squared temporal-difference error (TD-error, $\delta_i$): $$ `\mathcal L(\theta_i) = \mathbb E_{(s_t, a_t) \sim \chi(\bullet)} \big[ (\underbrace{y_i - Q_{\theta_i}(s_t, a_t)}_{\delta_i})^2 \big],\\ y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma \max_{a_t\in \mathcal A} Q_{\theta_{i-1}} (s_{t+1}, a_{t+1}) \big], ` -$$ - Where $\chi$ represents a behavior distribution over state-action pairs. Crucially, $\chi$ can in principle be different from the policy being followed, effectively allowing to reuse prior data stored in a *replay buffer* in the form of $(s_t, a_t, r_t, s_{t+1})$ transitions, used to form the TD-target $y_i$, TD-error $\delta_i$ and loss function [eq:dqn-loss] via Monte-Carlo (MC) estimates. +$$Where$\chi$ represents a behavior distribution over state-action pairs. Crucially, $\chi$ can in principle be different from the policy being followed, effectively allowing to reuse prior data stored in a *replay buffer* in the form of $(s_t, a_t, r_t, s_{t+1})$ transitions, used to form the TD-target $y_i$, TD-error $\delta_i$ and loss function[eq:dqn-loss] via Monte-Carlo (MC) estimates. -While effective in handling large, unstructured state spaces for discrete action-space problems, DQN application’s to continous control problems proved challenging. Indeed, in the case of high-capacity function approximators such as neural networks, solving $\max_{a_t \in \mathcal A} Q_\theta(s_t, a_t)$ at each timestep is simply unfeasible due to the (1) continous nature of the action space ($\mathcal A\subset \mathbb R^n$ for some $n$) and (2) impossibility to express the find a cheap (ideally, closed-form) solution to $Q_\theta$.  @silverDeterministicPolicyGradient2014 tackle this fundamental challenge by using a *deterministic* function of the state $s_t$ as policy, $\mu_\phi(s_t) = a_t$, parametrized by $\phi$. Thus, policies can be iteratively refined updating $\phi$ along the direction: +While effective in handling large, unstructured state spaces for discrete action-space problems, DQN application’s to continous control problems proved challenging. Indeed, in the case of high-capacity function approximators such as neural networks, solving $\max_{a_t \in \mathcal A} Q_\theta(s_t, a_t)$ at each timestep is simply unfeasible due to the (1) continous nature of the action space ($\mathcal A\subset \mathbb R^n$ for some $n$) and (2) impossibility to express the find a cheap (ideally, closed-form) solution to $Q_\theta$. @silverDeterministicPolicyGradient2014 tackle this fundamental challenge by using a *deterministic* function of the state $s_t$ as policy, $\mu_\phi(s_t) = a_t$, parametrized by $\phi$. Thus, policies can be iteratively refined updating $\phi$ along the direction: ``` math \begin{equation} d_\phi = \mathbb E_{s_t \sim \mathbb P (\bullet)} [\nabla_\phi Q(s_t, a_t)\vert_{a_t = \mu_\phi(s_t)}] = \mathbb E_{s_t \sim \mathbb P(\bullet)} [\nabla_{a_t} Q(s_t, a_t) \vert_{a_t = \mu_\phi(s_t)} \cdot \nabla_\phi \mu(s_t)] \end{equation} ``` -Provably, [eq:deterministic-pg] is the *deterministic policy gradient* (DPG) of the policy $\mu_\phi$ @silverDeterministicPolicyGradient2014, so that updates $\phi_{k+1}\leftarrow \phi_k + \alpha d_\phi$ are guaranteed to increase the (deterministic) cumulative discounted reward, $J(\mu_\phi)$.  @lillicrapContinuousControlDeep2019 extended DPG to the case of (1) high-dimensional unstructured observations and (2) continuous action spaces, introducing Deep Deterministic Policy Gradient (DDPG), an important algorithm RL and its applications to robotics. DDPG adopts a modified TD-target compared to the one defined in [eq:TD-target], by maintaining a policy network used to select actions, yielding +Provably, [eq:deterministic-pg] is the *deterministic policy gradient* (DPG) of the policy $\mu_\phi$@silverDeterministicPolicyGradient2014, so that updates $\phi_{k+1}\leftarrow \phi_k + \alpha d_\phi$ are guaranteed to increase the (deterministic) cumulative discounted reward, $J(\mu_\phi)$. @lillicrapContinuousControlDeep2019 extended DPG to the case of (1) high-dimensional unstructured observations and (2) continuous action spaces, introducing Deep Deterministic Policy Gradient (DDPG), an important algorithm RL and its applications to robotics. DDPG adopts a modified TD-target compared to the one defined in[eq:TD-target], by maintaining a policy network used to select actions, yielding ``` math \begin{equation} @@ -618,7 +602,7 @@ y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \ga ``` Similarily to DQN, DDPG also employs the same replay buffer mechanism, to reuse past transitions over training for increased sample efficiency and estimate the loss function via MC-estimates. -Soft Actor-Critic (SAC) @haarnojaSoftActorCriticOffPolicy2018 is a derivation of DDPG in the max-entropy (MaxEnt) RL framework, in which RL agents are tasked with maximizing the discounted cumulative reward, while acting as randomly as possible. MaxEnt RL @haarnojaReinforcementLearningDeep2017 has proven particularly robust thanks to the development of diverse behaviors, incentivized by its entropy-regularization formulation. In that, MaxEnt revisits the RL objective $J (\pi)$ to specifically account for the policy entropy, $J(\pi) = \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \chi} [r_t + \alpha \mathcal H(\pi (\bullet \vert s_t))] $ This modified objective results in the *soft* TD-target: +Soft Actor-Critic (SAC)@haarnojaSoftActorCriticOffPolicy2018 is a derivation of DDPG in the max-entropy (MaxEnt) RL framework, in which RL agents are tasked with maximizing the discounted cumulative reward, while acting as randomly as possible. MaxEnt RL@haarnojaReinforcementLearningDeep2017 has proven particularly robust thanks to the development of diverse behaviors, incentivized by its entropy-regularization formulation. In that, MaxEnt revisits the RL objective $J (\pi)$ to specifically account for the policy entropy, $J(\pi) = \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \chi} [r_t + \alpha \mathcal H(\pi (\bullet \vert s_t))] $ This modified objective results in the *soft* TD-target: ``` math \begin{equation} @@ -632,19 +616,19 @@ Similarily to DDPG, SAC also maintains an explicit policy, trained under the sam \pi_{k+1} \leftarrow \arg\min_{\pi^\prime \in \Pi} \text{D}_{\text{KL}}\left(\pi^\prime (\bullet \vert s_t) \bigg\Vert \frac{\exp(Q_{\pi_k}(s_t, \bullet))}{Z_{\pi_k}(s_t)} \right) \end{equation} ``` -The update rule provided in [eq:sac-policy-update] optimizes the policy while projecting it on a set $\Pi$ of tractable distributions (e.g., Gaussians, @haarnojaReinforcementLearningDeep2017). +The update rule provided in [eq:sac-policy-update] optimizes the policy while projecting it on a set $\Pi$ of tractable distributions (e.g., Gaussians,@haarnojaReinforcementLearningDeep2017). #### Sample-efficient, data-driven RL Importantly, sampling $(s_t, a_t, r_t, s_{t+1})$ from the replay buffer $D$ conveniently allows to approximate the previously introduced expectations for TD-target and TD-error through Monte-Carlo (MC) estimates. The replay buffer $D$ also proves extremely useful in maintaining a history of previous transitions and using it for training, improving on sample efficiency. Furthermore, it also naturally provides an entry point to inject offline trajectories recorded, for instance, by a human demonstrator, into the training process. -Reinforcement Learning with Prior Data (RLPD) @ballEfficientOnlineReinforcement2023 is an Offline-to-Online RL algorithm leveraging prior data to effectively accelerate the training of a SAC agent. Unlike previous works on Offline-to-Online RL, RLPD avoids any pre-training and instead uses the available offline data $D_\text{offline}$ to improve online-learning from scratch. During each training step, transitions from both the offline and online replay buffers are sampled in equal proportion, and used in the underlying SAC routine. +Reinforcement Learning with Prior Data (RLPD)@ballEfficientOnlineReinforcement2023 is an Offline-to-Online RL algorithm leveraging prior data to effectively accelerate the training of a SAC agent. Unlike previous works on Offline-to-Online RL, RLPD avoids any pre-training and instead uses the available offline data $D_\text{offline}$ to improve online-learning from scratch. During each training step, transitions from both the offline and online replay buffers are sampled in equal proportion, and used in the underlying SAC routine. #### Sample-efficient, data-driven, real-world RL -Despite the possibility to leverage offline data for learning, the effectiveness of real-world RL training is still limited by the need to define a task-specific, hard-to-define reward function. Further, even assuming to have access to a well-defined reward function, typical robotics pipelines rely mostly on propioperceptive inputs augmented by camera streams of the environment. As such, even well-defined rewards would need to be derived from processed representations of unstructured observations, introducing brittleness. In their technical report, @luoSERLSoftwareSuite2025 empirically address the needs (1) to define a reward function and (2) to use it on image observations, by introducing a series of tools to allow for streamlined training of *reward classifiers* $c$, as well as jointly learn forward-backward controllers to speed up real-world RL. Reward classifiers are particularly useful in treating complex tasks--e.g., folding a t-shirt--for which a precise reward formulation is arbitrarily complex to obtain, or that do require significant shaping and are more easily learned directly from demonstrations of success ($e^+$) or failure ($e^-$) states, $s \in \mathcal S$, with a natural choice for the state-conditioned reward function being $r \mathcal S \mapsto \mathbb R$ being $r(s) = \log c(e^+ \ vert s )$. Further, @luoSERLSoftwareSuite2025 demonstrate the benefits of learning *forward* (executing the task from initial state to completion) and *backward* (resetting the environment to the initial state from completion) controllers, parametrized by separate policies. +Despite the possibility to leverage offline data for learning, the effectiveness of real-world RL training is still limited by the need to define a task-specific, hard-to-define reward function. Further, even assuming to have access to a well-defined reward function, typical robotics pipelines rely mostly on propioperceptive inputs augmented by camera streams of the environment. As such, even well-defined rewards would need to be derived from processed representations of unstructured observations, introducing brittleness. In their technical report,@luoSERLSoftwareSuite2025 empirically address the needs (1) to define a reward function and (2) to use it on image observations, by introducing a series of tools to allow for streamlined training of *reward classifiers* $c$, as well as jointly learn forward-backward controllers to speed up real-world RL. Reward classifiers are particularly useful in treating complex tasks--e.g., folding a t-shirt--for which a precise reward formulation is arbitrarily complex to obtain, or that do require significant shaping and are more easily learned directly from demonstrations of success ($e^+$) or failure ($e^-$) states, $s \in \mathcal S$, with a natural choice for the state-conditioned reward function being $r \mathcal S \mapsto \mathbb R$ being $r(s) = \log c(e^+ \ vert s )$. Further,@luoSERLSoftwareSuite2025 demonstrate the benefits of learning *forward* (executing the task from initial state to completion) and *backward* (resetting the environment to the initial state from completion) controllers, parametrized by separate policies. -Lastly, in order to improve on the robustness of their approach to different goals while maintaing practical scalability, @luoSERLSoftwareSuite2025 introduced a modified state and action space, expressing proprioperceptive configurations $q$ and actions $\dot q$ in the frame of end-effector pose at $t=0$. Randomizing the initial pose of the end-effector ($s_0$),@luoSERLSoftwareSuite2025 achieved a similar result to that of having to manually randomize the environment at every timestep, but with the benefit of maintaining the environment in the same condition across multiple training episodes, achieving higher scalability of their method thanks to the increased practicality of their approach. +Lastly, in order to improve on the robustness of their approach to different goals while maintaing practical scalability,@luoSERLSoftwareSuite2025 introduced a modified state and action space, expressing proprioperceptive configurations $q$ and actions $\dot q$ in the frame of end-effector pose at $t=0$. Randomizing the initial pose of the end-effector ($s_0$),@luoSERLSoftwareSuite2025 achieved a similar result to that of having to manually randomize the environment at every timestep, but with the benefit of maintaining the environment in the same condition across multiple training episodes, achieving higher scalability of their method thanks to the increased practicality of their approach. -Building on off-policy deep Q-learning with replay buffers, entropy regularization for better exploration and performance, expert demonstrations to guide learning, and a series of tools and recommendations for real-world training using reward classifiers (Figure 18), @luoPreciseDexterousRobotic2024 introduce human interactions during training, learning near-optimal policies in challenging real-world manipulation tasks in 1-2 hours. +Building on off-policy deep Q-learning with replay buffers, entropy regularization for better exploration and performance, expert demonstrations to guide learning, and a series of tools and recommendations for real-world training using reward classifiers (Figure18),@luoPreciseDexterousRobotic2024 introduce human interactions during training, learning near-optimal policies in challenging real-world manipulation tasks in 1-2 hours. -Human in the Loop Sample Efficient Robot reinforcement Learning (HIL-SERL) @luoPreciseDexterousRobotic2024 augments offline-to-online RL with targeted human corrections during training, and employs prior data to (1) train a reward classifier and (2) bootstrap RL training on expert trajectories. While demonstrations provide the initial dataset seeding learning and constraining early exploration, interactive corrections allow a human supervisor to intervene on failure modes and supply targeted interventions to aid the learning process. Crucially, human interventions are stored in both the offline and online replay buffers, differently from the autonomous transitions generated at training time and stored in the online buffer only. Consequently, given an intervention timestep $k \in (0, T)$, length-$K$ human intervention data $\{ s^{\text{human}}_k, a^{\text{human}}_k, r^{\text{human}}_k, s^{\text{human}}_{k+1},\}_{k=1}^K$ is more likely to be sampled for off-policy learning than the data generated online during training, providing stronger supervision to the agent while still allowing for autonomous learning. Empirically, HIL-SERL attains near-perfect success rates on diverse manipulation tasks within 1-2 hours of training @luoPreciseDexterousRobotic2024, underscoring how offline datasets with online RL can markedly improve stability and data efficiency, and ultimately even allow real-world RL-training. +Human in the Loop Sample Efficient Robot reinforcement Learning (HIL-SERL)@luoPreciseDexterousRobotic2024 augments offline-to-online RL with targeted human corrections during training, and employs prior data to (1) train a reward classifier and (2) bootstrap RL training on expert trajectories. While demonstrations provide the initial dataset seeding learning and constraining early exploration, interactive corrections allow a human supervisor to intervene on failure modes and supply targeted interventions to aid the learning process. Crucially, human interventions are stored in both the offline and online replay buffers, differently from the autonomous transitions generated at training time and stored in the online buffer only. Consequently, given an intervention timestep $k \in (0, T)$, length-$K$ human intervention data $\{ s^{\text{human}}_k, a^{\text{human}}_k, r^{\text{human}}_k, s^{\text{human}}_{k+1},\}_{k=1}^K$ is more likely to be sampled for off-policy learning than the data generated online during training, providing stronger supervision to the agent while still allowing for autonomous learning. Empirically, HIL-SERL attains near-perfect success rates on diverse manipulation tasks within 1-2 hours of training@luoPreciseDexterousRobotic2024, underscoring how offline datasets with online RL can markedly improve stability and data efficiency, and ultimately even allow real-world RL-training. ### Code Example: Real-world RL @@ -668,10 +652,119 @@ Human in the Loop Sample Efficient Robot reinforcement Learning (HIL-SERL) @luo Despite the advancements in real-world RL training, solving robotics training RL agents in the real world still suffers from the following limitations: -- In those instances where real-world training experience is prohibitively expensive to gather @degraveMagneticControlTokamak2022, @bellemareAutonomousNavigationStratospheric2020, in-simulation training is often the only option. However, high-fidelity simulators for real-world problems can be difficult to build and maintain, especially for contact-rich manipulation and tasks involving deformable or soft materials. +- In those instances where real-world training experience is prohibitively expensive to gather@degraveMagneticControlTokamak2022, @bellemareAutonomousNavigationStratospheric2020, in-simulation training is often the only option. However, high-fidelity simulators for real-world problems can be difficult to build and maintain, especially for contact-rich manipulation and tasks involving deformable or soft materials. - Reward design poses an additional source of brittleness. Dense shaping terms are often required to guide exploration in long-horizon problems, but poorly tuned terms can lead to specification gaming or local optima. Sparse rewards avoid shaping but exacerbate credit assignment and slow down learning. In practice, complex behaviors require efforts shaping rewards: a britlle and error prone process. Advances in Behavioral Cloning (BC) from corpora of human demonstrations address both of these concerns. By learning in a supervised fashion to reproduce expert demonstrations, BC methods prove competitive while bypassing the need for simulated environments and hard-to-define reward functions. # Robot (Imitation) Learning + +
+ +*The best material model for a cat is another, or preferably the same cat* + +Norbert Wiener + +
+
+ +TL;DR Behavioral Cloning provides a natural platform to learn from real-world interactions without the need to design any reward function, and generative models prove more effective than point-wise policies at dealing with multimodal demonstration datasets. + +
+ + +Learning from human demonstrations provides a pragmatic alternative to the reinforcement-learning pipeline discussed in Section3. Indeed, in real-world robotics online exploration is typically costly and potentially unsafe, and designing (dense) reward signals is a brittle and task-specific process. In general, success detection itself may often require bespoke instrumentation, while episodic training demands reliable resets--all factors complicating training RL algorithms on hardware at scale. Behavioral Cloning (BC) sidesteps these constraints by casting control an imitation learning problem, leveraging previously collected expert demonstrations. Most notably, by learning to imitate autonomous systems naturally adhere to the objectives, preferences, and success criteria implicitly encoded in the data, which obviates reduces early-stage exploratory failures and obviates hand-crafted reward shaping altogether. + +Formally, let $\mathcal D = \{ \tau^{(i)} \}_{i=1}^N$ be a set of expert trajectories, with $\tau^{(i)} = \{(o_t^{(i)}, a_t^{(i)})\}_{t=0}^{T_i}$ representing the $i$-th trajectory in $\mathcal D$, $o_t \in \mathcal O$ denoting observations (e.g., images and proprioception altogether), and $a_t \in \mathcal A$ the expert actions. Typically, observations $o \in \mathcal O$ consist of both image and proprioperceptive information, while actions $a \in \mathcal A$ represent control specifications for the robot to execute, e.g. a joint configuration. Note that differently from Section3, in the imitation learning context $\mathcal D$ denotes an offline dataset collecting $N$ length-$T_i$ reward-free (expert) human trajectories $\tau^{(i)}$, and *not* the environment dynamics. Similarily, in this section $\tau^{(i)}$ represent a length-$T_i$ trajectory of observation-action pairs, which crucially *omits entirely any reward* information. Figure19 graphically shows trajectories in terms of the average evolution of the actuation on the 6 joints over a group of teleoperated episodes for the SO-100 manipulator. Notice how proprioperceptive states are captured jointly with camera frames over the course of the recorded episodes, providing a unified high-frame rate collection of teleoperation data. Figure20 shows $(o_t, a_t)$-pairs for the same dataset, with the actions performed by the human expert illustrated just alongside the corresponding observation. In principle, (expert) trajectories $\tau^{(i)}$ can have different lengths since demonstrations might exhibit multi-modal strategies to attain the same goal, resulting in possibly multiple, different behaviors. + + + +Behavioral Cloning (BC)@pomerleauALVINNAutonomousLand1988a aims at synthetizing synthetic behaviors by learning the mapping from observations to actions, and in its most natural formulation can be effectively tackled as a *supevised* learning problem, consisting of learning the (deterministic) mapping $f: \mathcal O\mapsto \mathcal A, \ a_t = f(o_t)$ by solving +``` math +\begin{equation} + + \min_{f} \mathbb{E}_{(o_t, a_t) \sim p(\bullet)} \mathcal L(a_t, f(o_t)), +\end{equation} +``` +for a given risk function $\mathcal L: \mathcal A \times \mathcal A \mapsto \mathbb{R}, \ \mathcal L (a, a^\prime)$. + +Typically, the expert’s joint observation-action distribution $p: \mathcal O\times \mathcal A\mapsto [0,1]$ such that $(o,a) \sim p(\bullet)$ is assumed to be unknown, in keeping with a classic Supervised Learning (SL) framework[^3]. However, differently from standard SL’s assumptions, the samples collected in $\mathcal D$, correspoding to observations of the underlying $p$ are *not* i.i.d., as expert demonstrations are collected *sequentially* in trajectories. In practice, this aspect can be partially mitigated by considering pairs in a non-sequential order--*shuffling* the samples in $\mathcal D$--so that the expected risk under $p$ can be approximated using MC estimates, although estimates may in general be less accurate. Another strategy to mitigate the impact of regressing over non-i.i.d. samples relies on the possibility of interleaving BC and data collection@rossReductionImitationLearning2011, aggregating multiple datasets iteratively. However, because we only consider the case where a single offline dataset $\mathcal D$ of (expert) trajectories is already available, dataset aggregation falls out of scope. + +Despite the inherent challenges of learning on non-i.i.d. data, the BC formulation affords several operational advantages in robotics. First, training happens offline and typically uses expert human demonstration data, hereby severily limiting exploration risks by preventing the robot from performing dangerous actions altogether. Second, reward design is entirely unnecessary in BC, as demonstrations already reflect human intent and task completion. This also mitigates the risk of misalignment and specification gaming (*reward hacking*), otherwise inherent in purely reward-based RL@heessEmergenceLocomotionBehaviours2017. Third, because expert trajectories encode terminal conditions, success detection and resets are implicit in the dataset. Finally, BC scales naturally with growing corpora of demonstrations collected across tasks, embodiments, and environments. However, BC can in principle only learn behaviors that are, at most, as good as the one exhibited by the demonstrator, and thus critically provides no mitigation for the suboptimal decision making that might be enaced by humans. Still, while problematic in sequential-decision making problems for which expert demonstrations are not generally available--data migth be expensive to collect, or human performance may be inherently suboptimal--many robotics applications benefit from relative cheap pipelines to acquire high-quality trajectories generated by humans, thus justifying BC approaches. + + + +While conceptually elegant, point-estimate policies $f : \mathcal O\mapsto \mathcal A$ learned by solving [eq:loss-minimization-SL] have been observed to suffer from (1) compounding errors@rossReductionImitationLearning2011 and (2) poor fit to multimodal distributions@florenceImplicitBehavioralCloning2022, @keGraspingChopsticksCombating2020. Figure21 illustrates these two key issues related to learning *explicit policies*@florenceImplicitBehavioralCloning2022. Besides sequentiality in $\mathcal D$, compounding errors due to *covariate shift* may also prove catastrophic, as even small $\epsilon$-prediction errors $0 < \Vert \mu(o_t) - a_t \Vert \leq \epsilon$ can quickly drive the policy into out-of-distribution states, incuring in less confident generations and thus errors compounding (Figure21, left).Moreover, point-estimate policies typically fail to learn *multimodal* targets, which are very common in human demonstrations solving robotics problems, since multiple trajectories can be equally as good towards the accomplishment of a goal (e.g., symmetric grasps, Figure21, right). In particular, unimodal regressors tend to average across modes, yielding indecisive or even unsafe commands@florenceImplicitBehavioralCloning2022. To address poor multimodal fitting,@florenceImplicitBehavioralCloning2022 propose learning the generative model $p(o, a)$ underlying the samples in $\mathcal D$, rather than an explicitly learning a prediction function $f(o) = a$. + +## A (Concise) Introduction to Generative Models + +Generative Models (GMs) aim to learn the stochastic process underlying the very generation of the data collected, and typically do so by fitting a probability distribution that approximates the unknown *data distribution*, $p$. In the case of BC, this unknown data distribution $p$ represents the expert’s joint distribution over $(o, a)$-pairs. Thus, given a finite set of $N$ pairs $\mathcal D = \{ (o,a)_i \}_{i=0}^N$ used as an imitation learning target (and thus assumed to be i.i.d.), GM seeks to learn a *parametric* distribution $p_\theta(o,a)$ such that (1) new samples $(o,a) \sim p_\theta(\bullet)$ resemble those stored in $\mathcal D$, and (2) high likelihood is assigned to the observed regions of the unobservable $p$. Likelihood-based learning provides a principled training objective to achieve both objectives, and it is thus extensively used in GM@prince2023understanding. + +### Variational Auto-Encoders + + + +A common inductive bias used in GM posits samples $(o,a)$ are influenced from an unobservable latent variable $z \in Z$, resulting in +``` math +\begin + + p (o,a) = \int_{\text{supp}({Z})} p(o,a \vert z) p(z) +\end{equation} +``` +Intuitively, in the case of observation-action pairs $(o, a)$ for a robotics application, $z$ could be some high level representation of the underlying task being performed by the human demonstrator. In such case, treating $p(o,a)$ as a marginalization over $\text{supp}({Z})$ of the complete joint distribution $p(o,a,z)$ natively captures the effect different tasks have on the likelihood of observation-action pairs. Figure22 graphically illustrates this concept in the case of a (A) picking and (B) pushing task, for which, nearing the target object, the likelihood of actions resulting in opening the gripper--the higher $q_6$, the wider the gripper’s opening--should intuitively be (A) high or (B) low, depending on the task performed. While the latent space $Z$ typically has a much richer structure than the set of all actual tasks performed,[eq:BC-latent-variable] still provides a solid framework to learn joint distribution conditioned on unobservable yet relevant factors. Figure23 represents this framework of latent-variable for a robotics application: the true, $z$-conditioned generative process on assigns *likelihood* $p((o,a) \vert z)$ to the single $(o,a)$-pair. Using Bayes’ theorem, one can reconstruct the *posterior* distribution on $\text{supp}({Z})$, $q_\theta(z \vert o,a)$ from the likelihood $p_\theta(o,a \vert z)$, *prior* $p_\theta(z)$ and *evidence* $p_\theta(o,a)$. VAEs approximate the latent variable model presented in[eq:BC-latent-variable]) using an *approximate posterior* $q_\phi(z \vert o,a)$ while regressing parameters for a parametric likelihood, $p_\theta(o,a \vert z)$ (Figure23). + + + +Given a dataset $\mathcal D$ consisting of $N$ i.i.d. observation-action pairs, the log-likelihood of all datapoints under $\theta$ (in Bayesian terms, the *evidence* $p_\theta(\mathcal D)$) can thus be written as: +$$ +`\log p_\theta(\mathcal D) = \log \sum_{i=0}^N p_\theta ((o,a)_i)\\ + = \log \sum_{i=0}^N \int_{\text{supp}({Z})} p_\theta((o,a)_i \vert z) p(z)\\ + = \log \sum_{i=0}^N \int_{\text{supp}({Z})} \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z) p(z)\\ + = \log \sum_{i=0}^N \mathbb E_{z \sim p_\theta(\bullet \vert (o,a)_i)} [\frac{p(z)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z)], ` +$$ + where we used[eq:BC-latent-variable] in[eq:evidence-definition-1], multiplied by $1 = \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)}$ in[eq:evidence-definition-2], and used the definition of expected value in[eq:evidence-definition]. + +In the special case where one assumes distributions to be tractable, $p_\theta (\mathcal D)$ is typically tractable too, and $\max_\theta \log p_\theta(\mathcal D)$ provides a natural target for (point-wise) infering the unknown parameters $\theta$ of the generative model. Unfortunately,[eq:evidence-definition] is rarely tractable when the distribution $p$ is modeled with approximators such as neural networks, especially for high-dimensional, unstructured data. + +In their seminal work on Variational Auto-Encoders (VAEs),@kingmaAutoEncodingVariationalBayes2022 present two major contributions to learn complex latent-variable GMs on unstructured data, proposing (1) a tractable, variational lower-bound to [eq:evidence-definition] as an optimization target to jointly learn likelihood and posterior and (2) high-capacity function approximators to model the likelihood $p_\theta(o,a\vert z)$ and (approximate) posterior distribution $q_\phi(z \vert o,a) \approx q_\theta(z \vert o,a)$.