few additions & improvements
#16
by imstevenpmwork HF Staff - opened
- app/src/content/chapters/folding/01-hero.mdx +5 -2
- app/src/content/chapters/folding/03-hardware.mdx +5 -4
- app/src/content/chapters/folding/04-data-collection.mdx +9 -6
- app/src/content/chapters/folding/06-training.mdx +2 -2
- app/src/content/chapters/folding/08-ablations.mdx +4 -4
- app/src/content/chapters/folding/09-learnings.mdx +21 -7
- app/src/content/chapters/folding/12-references.mdx +4 -4
app/src/content/chapters/folding/01-hero.mdx
CHANGED
|
@@ -24,10 +24,13 @@ To get there we used **8 bimanual robot setups**, spent **~131 hours** collectin
|
|
| 24 |
- **Evaluation** — what metrics give good signal and are reliable enough
|
| 25 |
- **Takeaways** — what we learned and what we'd do differently next time
|
| 26 |
|
| 27 |
-
This post aims to serve as a **blueprint for
|
| 28 |
|
| 29 |
-
|
|
|
|
|
|
|
| 30 |
|
|
|
|
| 31 |
| Resource | Link |
|
| 32 |
|:---|:---|
|
| 33 |
| **Model** | [HF Hub](https://huggingface.co/lerobot-data-collection/folding_final) |
|
|
|
|
| 24 |
- **Evaluation** — what metrics give good signal and are reliable enough
|
| 25 |
- **Takeaways** — what we learned and what we'd do differently next time
|
| 26 |
|
| 27 |
+
This post aims to serve as a **blueprint for building real-world robotic learning systems** with current open-source tools. Whether you are in industry, research, or your garage, this guide will help you move beyond simple toy projects.
|
| 28 |
|
| 29 |
+
<Note variant="success">
|
| 30 |
+
This project was made possible entirely by [LeRobot](https://github.com/huggingface/lerobot). Everything discussed in this blog is available in [LeRobot v0.5.1](https://github.com/huggingface/lerobot/releases/tag/v0.5.1) and ready for the community to use.
|
| 31 |
+
</Note>
|
| 32 |
|
| 33 |
+
All resources from this project are open-source:
|
| 34 |
| Resource | Link |
|
| 35 |
|:---|:---|
|
| 36 |
| **Model** | [HF Hub](https://huggingface.co/lerobot-data-collection/folding_final) |
|
app/src/content/chapters/folding/03-hardware.mdx
CHANGED
|
@@ -14,7 +14,7 @@ LeRobot handles the software stack, but you still need the physical hardware. Be
|
|
| 14 |
|
| 15 |
### The Robot: Bimanual OpenArm
|
| 16 |
|
| 17 |
-
For starters, the robot. We used the **bimanual [OpenArm](https://huggingface.co/docs/lerobot/openarm)**. These are open-source, human-like robot arms developed by [Enactic](https://openarm.dev) and sold by multiple vendors like [WowRobo](https://shop.wowrobo.com). Two reasons drove this choice:
|
| 18 |
|
| 19 |
1. **Smaller teleop gap.** When the robot's kinematics match a human arm, the teleoperator's motions transfer more naturally, meaning less mental remapping and faster learning. The humanoid form factor also aligns with where the ecosystem is heading: more human-form robots means more transferable data.
|
| 20 |
2. **Open source, good specs.** Solid payload, good reach, and fully open hardware. We extended the upper arm (the bicep segment) by **+5 cm** to increase reach since our setup doesn't have a hip or torso to provide additional workspace.
|
|
@@ -27,7 +27,7 @@ Everything is mounted on **aluminum extrusion profiles**, which let us quickly i
|
|
| 27 |
|
| 28 |
Next, we need a way to control the robot. We started with full-size OpenArm as leader arms for teleoperation. They seemed like the natural choice: same kinematics as the follower arms, one-to-one mapping.
|
| 29 |
|
| 30 |
-
However, we quickly realized we needed a teleoperator arm with less inertia, that allows for fast and precise manipulation, and more adaptability to different human morphologies. This led us to develop the **OpenArm Mini**: small, Feetech-based, 3D-printed leader arms based on the [SO-101](https://github.com/TheRobotStudio/SO-ARM100) design. These gave us:
|
| 31 |
1. **Less inertia** for quicker and more deliberate motions that cloth folding demands
|
| 32 |
2. **Arm-length agnostic** and adaptable to any human operator size
|
| 33 |
3. **Incredibly cheap** (~120 EUR per arm) making it easy to scale to multiple stations
|
|
@@ -51,9 +51,9 @@ Another feature that made a surprisingly big difference: when both your hands ar
|
|
| 51 |
|
| 52 |
<br/>
|
| 53 |
|
| 54 |
-
###
|
| 55 |
|
| 56 |
-
We use **three cameras**, each serving a purpose. The **base camera** is mounted between the arms and provides a wide-angle overview of the full workspace, the model's primary understanding of the task state. The two **wrist cameras** are mounted directly on the end-effectors. Because they move with the grippers, they provide a natural depth signal and give a close-up view for precise manipulation. They also act as a proxy for touch, being so close to the grippers, they capture contact details: grip quality, slip, that humans normally sense through their fingers. More cameras could help,
|
| 57 |
|
| 58 |
<Note variant="info" emoji="🔗">
|
| 59 |
Camera links: <a href="https://www.amazon.fr/-/en/Fafeicy-Camera-Module-Million-Conferencing/dp/B08GLSPTXY" target="_blank">Base camera (Fafeicy OV2710)</a> / <a href="https://www.arducam.com/12mp-imx708-usb-uvc-102-wide-angle-fixed-focus-camera-module-3.html" target="_blank">Wrist cameras (Arducam IMX708)</a>
|
|
@@ -63,6 +63,7 @@ We use **three cameras**, each serving a purpose. The **base camera** is mounted
|
|
| 63 |
The base camera has a slight fisheye effect which is totally fine, as the model learns to handle it.
|
| 64 |
</Sidenote>
|
| 65 |
|
|
|
|
| 66 |
|
| 67 |
### LeRobot Integration
|
| 68 |
|
|
|
|
| 14 |
|
| 15 |
### The Robot: Bimanual OpenArm
|
| 16 |
|
| 17 |
+
For starters, the robot. We used the **bimanual [OpenArm](https://huggingface.co/docs/lerobot/v0.5.1/openarm)**. These are open-source, human-like robot arms developed by [Enactic](https://openarm.dev) and sold by multiple vendors like [WowRobo](https://shop.wowrobo.com). Two reasons drove this choice:
|
| 18 |
|
| 19 |
1. **Smaller teleop gap.** When the robot's kinematics match a human arm, the teleoperator's motions transfer more naturally, meaning less mental remapping and faster learning. The humanoid form factor also aligns with where the ecosystem is heading: more human-form robots means more transferable data.
|
| 20 |
2. **Open source, good specs.** Solid payload, good reach, and fully open hardware. We extended the upper arm (the bicep segment) by **+5 cm** to increase reach since our setup doesn't have a hip or torso to provide additional workspace.
|
|
|
|
| 27 |
|
| 28 |
Next, we need a way to control the robot. We started with full-size OpenArm as leader arms for teleoperation. They seemed like the natural choice: same kinematics as the follower arms, one-to-one mapping.
|
| 29 |
|
| 30 |
+
However, we quickly realized we needed a teleoperator arm with less inertia, one that allows fast and precise manipulation and adapts more easily to different human morphologies. This led us to experiment with friction and gravity compensation, which improved the operator's experience, but ultimately we decided to develop the **OpenArm Mini**: small, Feetech-based, 3D-printed leader arms based on the [SO-101](https://github.com/TheRobotStudio/SO-ARM100) design. These gave us:
|
| 31 |
1. **Less inertia** for quicker and more deliberate motions that cloth folding demands
|
| 32 |
2. **Arm-length agnostic** and adaptable to any human operator size
|
| 33 |
3. **Incredibly cheap** (~120 EUR per arm) making it easy to scale to multiple stations
|
|
|
|
| 51 |
|
| 52 |
<br/>
|
| 53 |
|
| 54 |
+
### Sensors
|
| 55 |
|
| 56 |
+
We use **three cameras**, each serving a purpose. The **base camera** is mounted between the arms and provides a wide-angle overview of the full workspace, the model's primary understanding of the task state. The two **wrist cameras** are mounted directly on the end-effectors. Because they move with the grippers, they provide a natural depth signal and give a close-up view for precise manipulation. They also act as a proxy for touch: being so close to the grippers, they capture contact details (grip quality, slip) that humans normally sense through their fingers. More details about the resolution and FPS of each come later on. While more cameras could help, every additional image stream requires more compute and time to process; three was the tradeoff we settled on.
|
| 57 |
|
| 58 |
<Note variant="info" emoji="🔗">
|
| 59 |
Camera links: <a href="https://www.amazon.fr/-/en/Fafeicy-Camera-Module-Million-Conferencing/dp/B08GLSPTXY" target="_blank">Base camera (Fafeicy OV2710)</a> / <a href="https://www.arducam.com/12mp-imx708-usb-uvc-102-wide-angle-fixed-focus-camera-module-3.html" target="_blank">Wrist cameras (Arducam IMX708)</a>
|
|
|
|
| 63 |
The base camera has a slight fisheye effect which is totally fine, as the model learns to handle it.
|
| 64 |
</Sidenote>
|
| 65 |
|
| 66 |
+
Beyond RGB cameras, the only other sensor data we used were the **joint encoders** from the arm's motors. No torque sensors, no force/touch sensing, no audio, no IMUs, no depth cameras. All the results in this blog were achieved with just camera images and joint positions. While additional sensing modalities could potentially improve performance, the overhead of integrating, calibrating, and maintaining extra sensors adds real engineering complexity. Given how well this minimal setup performed, it's hard to argue the tradeoff is worth it for this application.
|
| 67 |
|
| 68 |
### LeRobot Integration
|
| 69 |
|
app/src/content/chapters/folding/04-data-collection.mdx
CHANGED
|
@@ -8,10 +8,12 @@ import diversityGridImg from "../../assets/image/lerobot-data-collection_level12
|
|
| 8 |
|
| 9 |
Data collection was the longest phase of this project, and arguably the most important. No amount of compute can compensate for bad demonstrations.
|
| 10 |
|
| 11 |
-
T-shirt folding involves deformable objects with complex contact dynamics: simulation can't faithfully reproduce how fabric crumples and slides, world models can't yet predict cloth deformation reliably, and RL from scratch on real hardware is too sample-inefficient for a task this long-horizon (though RL *on top of* a pretrained VLA is a promising future direction). That leaves **real-world teleoperation** as the practical path. We chose **leader-follower arms** because they match the robot's kinematics exactly (what the operator does is what the robot does), giving low latency, high precision, and support for [DAgger](https://github.com/huggingface/lerobot/
|
| 12 |
|
| 13 |
We ran **8 setups** in parallel, optimizing for **maximum diversity**: 25+ different t-shirts, 8 different backgrounds, and varying camera and robot heights between sessions. We structured collection into two task levels: **Level 1** (fold a laid-out shirt) and **Level 2** (spread a messy shirt, fold it, place it aside).
|
| 14 |
|
|
|
|
|
|
|
| 15 |
### Learning to Teleoperate
|
| 16 |
|
| 17 |
Here's an honest truth: **early data collection is worse than the final data**. Teleoperating a bimanual robot is a genuine skill, and it takes practice. The first episodes are slow, not deliberate, and full of failed attempts. Over hours of practice, operators get dramatically better, with smoother motions, faster execution, and more consistent grasps.
|
|
@@ -24,11 +26,12 @@ Aligning on a common strategy across operators was equally important. Folding is
|
|
| 24 |
|
| 25 |
After weeks of collecting data across operators, these are the guidelines we found most useful:
|
| 26 |
|
| 27 |
-
1. **
|
| 28 |
-
2. **
|
| 29 |
-
3. **
|
| 30 |
-
4. **
|
| 31 |
-
5. **
|
|
|
|
| 32 |
|
| 33 |
### What we ended up with
|
| 34 |
|
|
|
|
| 8 |
|
| 9 |
Data collection was the longest phase of this project, and arguably the most important. No amount of compute can compensate for bad demonstrations.
|
| 10 |
|
| 11 |
+
T-shirt folding involves deformable objects with complex contact dynamics: simulation can't faithfully reproduce how fabric crumples and slides, world models can't yet predict cloth deformation reliably, and RL from scratch on real hardware is too sample-inefficient for a task this long-horizon (though RL *on top of* a pretrained VLA is a promising future direction). That leaves **real-world teleoperation** as the practical path. We chose **leader-follower arms** because they match the robot's kinematics exactly (what the operator does is what the robot does), giving low latency, high precision, and support for [DAgger](https://github.com/huggingface/lerobot/pull/2833), where the teleop setup takes over from a running policy to collect corrections.
|
| 12 |
|
| 13 |
We ran **8 setups** in parallel, optimizing for **maximum diversity**: 25+ different t-shirts, 8 different backgrounds, and varying camera and robot heights between sessions. We structured collection into two task levels: **Level 1** (fold a laid-out shirt) and **Level 2** (spread a messy shirt, fold it, place it aside).
|
| 14 |
|
| 15 |
+
We never strongly controlled lighting conditions during data collection or evaluation. The only requirement was having enough light for the cameras. All work happened in an open coworking space throughout different times of day — sometimes daylight, sometimes artificial light at night, sometimes a mix. No special lighting rigs, no effort to homogenize conditions across episodes or rollouts.
|
| 16 |
+
|
| 17 |
### Learning to Teleoperate
|
| 18 |
|
| 19 |
Here's an honest truth: **early data collection is worse than the final data**. Teleoperating a bimanual robot is a genuine skill, and it takes practice. The first episodes are slow, not deliberate, and full of failed attempts. Over hours of practice, operators get dramatically better, with smoother motions, faster execution, and more consistent grasps.
|
|
|
|
| 26 |
|
| 27 |
After weeks of collecting data across operators, these are the guidelines we found most useful:
|
| 28 |
|
| 29 |
+
1. **Watch your setup, not just your data.** The physical rig should feel solid and stable. If it vibrates, wobbles, or frustrates operators, fix that first. Pay attention to what causes idle time and operator fatigue.
|
| 30 |
+
2. **Practice before you record.** Consistent, deliberate demonstrations are more valuable than hesitant or inconsistent ones.
|
| 31 |
+
3. **Quality over speed.** A careful, high-quality execution is more valuable than a fast, sloppy one.
|
| 32 |
+
4. **Be consistent within episodes.** The model learns a coherent strategy more easily than movements that vary wildly each time.
|
| 33 |
+
5. **Start small, then extend.** Train a quick model, see what fails, then add diversity. Don't try to collect the perfect dataset on day one.
|
| 34 |
+
6. **Speed after quality.** Once you've dialed in quality and a consistent strategy, optimize for speed. But never sacrifice quality for it.
|
| 35 |
|
| 36 |
### What we ended up with
|
| 37 |
|
app/src/content/chapters/folding/06-training.mdx
CHANGED
|
@@ -10,7 +10,7 @@ Before diving into experiments, let's establish the model we're training and how
|
|
| 10 |
|
| 11 |
### Model Architecture
|
| 12 |
|
| 13 |
-
The dominant approach in robot learning today is to train a **Vision-Language-Action (VLA)** model: a single neural network that takes in camera images and a task description and outputs motor commands. The recipe most labs follow is to **pretrain** a VLA on large, multi-robot, multi-task datasets to learn general visual and manipulation priors, then **fine-tune** it on data from your specific robot and task. Several strong pretrained VLAs exist: [π0](https://huggingface.co/docs/lerobot/pi0), [π0.5](https://huggingface.co/docs/lerobot/pi05), [GR00T](https://developer.nvidia.com/isaac/gr00t), [SmolVLA](https://huggingface.co/docs/lerobot/smolvla), among others. We chose to use **π0 and π0.5** because they showed the strongest performance in our early experiments, likely due to the scale of their pretraining data. We start from the **pretrained checkpoints** and fine-tune on our folding data.
|
| 14 |
|
| 15 |
<Wide>
|
| 16 |
<HtmlEmbed
|
|
@@ -29,7 +29,7 @@ The model takes in three things: **camera images** (3 views: base + 2 wrist), **
|
|
| 29 |
Flow matching is closely related to diffusion models but uses a simpler, more direct interpolation path between noise and data.
|
| 30 |
</Sidenote>
|
| 31 |
|
| 32 |
-
#### [Real-Time Chunking (RTC)](https://huggingface.co/docs/lerobot/rtc)
|
| 33 |
|
| 34 |
The model outputs 30 actions at once (a "chunk"), but generating that chunk takes ~100..200ms, during which the robot is still moving. Without RTC, the robot would **pause** between chunks while waiting for the next prediction, producing jerky stop-and-go motion. RTC solves this by generating the next chunk **while the current one is still executing**. The key idea: by the time the new chunk is ready, several actions from the old chunk have already been executed and can't be changed, so RTC **freezes** those committed actions and **inpaints** the rest of the new chunk to be consistent with them. This produces smooth, continuous motion with no pauses.
|
| 35 |
|
|
|
|
| 10 |
|
| 11 |
### Model Architecture
|
| 12 |
|
| 13 |
+
The dominant approach in robot learning today is to train a **Vision-Language-Action (VLA)** model: a single neural network that takes in camera images and a task description and outputs motor commands. The recipe most labs follow is to **pretrain** a VLA on large, multi-robot, multi-task datasets to learn general visual and manipulation priors, then **fine-tune** it on data from your specific robot and task. Several strong pretrained VLAs exist: [π0](https://huggingface.co/docs/lerobot/v0.5.1/pi0), [π0.5](https://huggingface.co/docs/lerobot/v0.5.1/pi05), [GR00T](https://developer.nvidia.com/isaac/gr00t), [SmolVLA](https://huggingface.co/docs/lerobot/v0.5.1/smolvla), among others. We chose to use **π0 and π0.5** because they showed the strongest performance in our early experiments, likely due to the scale of their pretraining data. We start from the **pretrained checkpoints** and fine-tune on our folding data.
|
| 14 |
|
| 15 |
<Wide>
|
| 16 |
<HtmlEmbed
|
|
|
|
| 29 |
Flow matching is closely related to diffusion models but uses a simpler, more direct interpolation path between noise and data.
|
| 30 |
</Sidenote>
|
| 31 |
|
| 32 |
+
#### [Real-Time Chunking (RTC)](https://huggingface.co/docs/lerobot/v0.5.1/rtc)
|
| 33 |
|
| 34 |
The model outputs 30 actions at once (a "chunk"), but generating that chunk takes ~100..200ms, during which the robot is still moving. Without RTC, the robot would **pause** between chunks while waiting for the next prediction, producing jerky stop-and-go motion. RTC solves this by generating the next chunk **while the current one is still executing**. The key idea: by the time the new chunk is ready, several actions from the old chunk have already been executed and can't be changed, so RTC **freezes** those committed actions and **inpaints** the rest of the new chunk to be consistent with them. This produces smooth, continuous motion with no pauses.
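To get a feel for the numbers, here is a minimal sketch of how many actions from the old chunk are already committed by the time a new chunk arrives. It assumes actions are executed at 30 Hz; the function name `committed_actions` is ours, for illustration only, and is not part of LeRobot's RTC implementation.

```python
CHUNK_SIZE = 30  # actions per chunk, as stated above


def committed_actions(latency_ms: int, control_hz: int = 30) -> int:
    """Actions executed while the next chunk is being generated.

    Rounds up, because an action that has started executing can no longer
    be changed. Integer arithmetic avoids floating-point rounding surprises.
    Assumption: one action is executed every 1 / control_hz seconds.
    """
    return -(-latency_ms * control_hz // 1000)  # ceiling division


for latency_ms in (100, 150, 200):
    frozen = committed_actions(latency_ms)
    free = CHUNK_SIZE - frozen
    print(f"{latency_ms} ms latency: {frozen} frozen, {free} left to inpaint")
```

Under these assumptions, between 3 and 6 of the 30 actions are frozen, so most of each new chunk remains free to adapt to the latest observation.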
|
| 35 |
|
app/src/content/chapters/folding/08-ablations.mdx
CHANGED
|
@@ -44,7 +44,7 @@ We filtered episodes in two ways:
|
|
| 44 |
|
| 45 |
#### Stage-Aware Reward Modeling (SARM)
|
| 46 |
|
| 47 |
-
To go beyond binary keep/discard filtering, we trained a reward model: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). The core problem SARM solves: how do you measure "progress" in a long, multi-stage task like t-shirt folding, where demonstrations vary wildly in length and strategy? You can't just use elapsed time, a shirt that's fully flattened might happen at frame 200 in one demo and frame 800 in another. SARM instead learns a semantic notion of progress that generalizes across demonstrations.
|
| 48 |
|
| 49 |
**How it works.** SARM is a **vision-language** reward model built on a frozen [CLIP](https://openai.com/index/clip/) backbone. It takes **8 RGB frames** (sampled 1 second apart from the base camera), a **task description** ("fold the t-shirt"), and the robot's **joint state**. The language conditioning means the same model could in principle score different tasks; it's not hard-coded to folding.
|
| 50 |
|
|
@@ -92,7 +92,7 @@ Following the [UMI](https://arxiv.org/abs/2402.10329) approach, we switched from
|
|
| 92 |
desc="Relative trajectory (blue) references all actions to the current state. Delta (yellow) chains each action to the previous one, accumulating error. Absolute (red) requires a global coordinate frame. Diagram adapted from UMI (Chi et al., 2024)."
|
| 93 |
/>
|
| 94 |
|
| 95 |
-
See the [LeRobot action representations docs](https://huggingface.co/docs/lerobot/
|
| 96 |
|
| 97 |
The result: experiment 1.3 (π0.5 + relative actions) jumped to **35% total success rate and 70% Level 1**, up from 1.2's 20% total and 40% Level 1.
|
| 98 |
|
|
@@ -102,7 +102,7 @@ The result: experiment 1.3 (π0.5 + relative actions) jumped to **35% total succ
|
|
| 102 |
|
| 103 |
#### RABC (Reward-Advantage-Based Conditioning)
|
| 104 |
|
| 105 |
-
Next, we put SARM's per-timestep scores to work during training with **[RABC](https://huggingface.co/docs/lerobot/sarm)** (Reward-Aligned Behavior Cloning). Standard behavior cloning treats every action in the dataset equally. RABC instead computes a per-timestep "progress delta" from SARM (the change in predicted task progress over one action chunk) and uses it to weight the training loss. A threshold κ controls how selective the weighting is: actions with progress above κ get full weight, those below are softly down-weighted, and negative-progress actions are clipped to zero weight entirely. The result is that the policy focuses on the best moments in each demonstration without discarding entire episodes.
|
| 106 |
|
| 107 |
<HtmlEmbed
|
| 108 |
id="rabc-explainer"
|
|
@@ -156,7 +156,7 @@ We then **fine-tuned** the best initial-training checkpoints on this curated dat
|
|
| 156 |
| 2.2 | 1.3 | HQ + RABC + Relative | **75%** (+40) | **100%** (+30) | **50%** (+50) |
|
| 157 |
| 2.5 | 1.7 | HQ + RABC + Relative | **90%** (+50) | **100%** (+20) | **80%** (+80) |
|
| 158 |
|
| 159 |
-
The jump was dramatic. Experiment 2.5 reached **90% total success rate**: 100% Level 1, 80% Level 2, up from 40% with initial training. The same architecture, the same training recipe, just better data.
|
| 160 |
|
| 161 |
Both 2.2 and 2.5 used the same recipe (HQ + RABC + Relative Actions), but 2.5 fine-tuned from 1.7 (the stronger base with relative actions + RABC already baked in) while 2.2 fine-tuned from 1.3. The difference (75% → 90%) likely reflects this stronger starting point. Data quality was the single biggest lever, and RABC's effect was strongest on **Level 2**, the longer, harder task where emphasizing the best demonstrations mattered most.
|
| 162 |
|
|
|
|
| 44 |
|
| 45 |
#### Stage-Aware Reward Modeling (SARM)
|
| 46 |
|
| 47 |
+
To go beyond binary keep/discard filtering, we trained a reward model: **[SARM](https://huggingface.co/docs/lerobot/v0.5.1/sarm)** (Stage-Aware Reward Modeling). The core problem SARM solves: how do you measure "progress" in a long, multi-stage task like t-shirt folding, where demonstrations vary wildly in length and strategy? You can't just use elapsed time, a shirt that's fully flattened might happen at frame 200 in one demo and frame 800 in another. SARM instead learns a semantic notion of progress that generalizes across demonstrations.
|
| 48 |
|
| 49 |
**How it works.** SARM is a **vision-language** reward model built on a frozen [CLIP](https://openai.com/index/clip/) backbone. It takes **8 RGB frames** (sampled 1 second apart from the base camera), a **task description** ("fold the t-shirt"), and the robot's **joint state**. The language conditioning means the same model could in principle score different tasks; it's not hard-coded to folding.
|
| 50 |
|
|
|
|
| 92 |
desc="Relative trajectory (blue) references all actions to the current state. Delta (yellow) chains each action to the previous one, accumulating error. Absolute (red) requires a global coordinate frame. Diagram adapted from UMI (Chi et al., 2024)."
|
| 93 |
/>
|
| 94 |
|
| 95 |
+
See the [LeRobot action representations docs](https://huggingface.co/docs/lerobot/v0.5.1/action_representations) for a full guide.
|
| 96 |
|
| 97 |
The result: experiment 1.3 (π0.5 + relative actions) jumped to **35% total success rate and 70% Level 1**, up from 1.2's 20% total and 40% Level 1.
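The difference between the representations in the diagram can be sketched in a few lines. This is an illustrative toy using plain lists of joint positions; the helper names `to_relative` and `to_delta` are hypothetical and are not LeRobot functions.

```python
from typing import List

Vector = List[float]


def to_relative(chunk: List[Vector], current_state: Vector) -> List[Vector]:
    """Relative trajectory: every action is an offset from the CURRENT state."""
    return [[a - s for a, s in zip(action, current_state)] for action in chunk]


def to_delta(chunk: List[Vector], current_state: Vector) -> List[Vector]:
    """Delta: every action is an offset from the PREVIOUS action.

    Recovering step k requires summing k predicted deltas, so prediction
    errors accumulate along the chunk.
    """
    deltas = []
    previous = current_state
    for action in chunk:
        deltas.append([a - p for a, p in zip(action, previous)])
        previous = action
    return deltas


# One joint, three absolute targets, robot currently at 10.0
state = [10.0]
absolute = [[11.0], [13.0], [16.0]]
print(to_relative(absolute, state))  # offsets from the current state
print(to_delta(absolute, state))     # offsets from the previous action
```

In the relative form, each target depends only on the current state and one predicted offset, which is why it does not chain errors the way the delta form does.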
|
| 98 |
|
|
|
|
| 102 |
|
| 103 |
#### RABC (Reward-Advantage-Based Conditioning)
|
| 104 |
|
| 105 |
+
Next, we put SARM's per-timestep scores to work during training with **[RABC](https://huggingface.co/docs/lerobot/v0.5.1/sarm)** (Reward-Aligned Behavior Cloning). Standard behavior cloning treats every action in the dataset equally. RABC instead computes a per-timestep "progress delta" from SARM (the change in predicted task progress over one action chunk) and uses it to weight the training loss. A threshold κ controls how selective the weighting is: actions with progress above κ get full weight, those below are softly down-weighted, and negative-progress actions are clipped to zero weight entirely. The result is that the policy focuses on the best moments in each demonstration without discarding entire episodes.
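A minimal sketch of this weighting follows. It assumes a linear ramp between zero and κ for the "soft" down-weighting and a loss normalized by the sum of weights; both are our assumptions, and the exact shape in LeRobot's implementation may differ. `rabc_weight` and `weighted_loss` are illustrative names, not library API.

```python
from typing import Sequence


def rabc_weight(progress_delta: float, kappa: float) -> float:
    """Map a SARM progress delta to a loss weight in [0, 1].

    - progress <= 0     -> weight 0 (clipped)
    - progress >= kappa -> weight 1 (full weight)
    - in between        -> linear ramp (assumed shape of the soft weighting)
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    if progress_delta <= 0.0:
        return 0.0
    if progress_delta >= kappa:
        return 1.0
    return progress_delta / kappa


def weighted_loss(
    losses: Sequence[float], progress_deltas: Sequence[float], kappa: float
) -> float:
    """Weighted mean of per-sample losses (normalization is an assumption)."""
    weights = [rabc_weight(d, kappa) for d in progress_deltas]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no sample made progress; nothing to learn from
    return sum(w * l for w, l in zip(weights, losses)) / total


# Three samples: regressing, slow progress, strong progress
print(weighted_loss([1.0, 2.0, 5.0], [-0.1, 0.1, 0.3], kappa=0.2))
```

With these inputs the regressing sample contributes nothing, the slow one counts at half weight, and the strong one at full weight.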
|
| 106 |
|
| 107 |
<HtmlEmbed
|
| 108 |
id="rabc-explainer"
|
|
|
|
| 156 |
| 2.2 | 1.3 | HQ + RABC + Relative | **75%** (+40) | **100%** (+30) | **50%** (+50) |
|
| 157 |
| 2.5 | 1.7 | HQ + RABC + Relative | **90%** (+50) | **100%** (+20) | **80%** (+80) |
|
| 158 |
|
| 159 |
+
The jump was dramatic. Experiment 2.5 reached **90% total success rate**: 100% Level 1, 80% Level 2, up from 40% with initial training. The same architecture, the same training recipe, just better data and more fine-tuning.
|
| 160 |
|
| 161 |
Both 2.2 and 2.5 used the same recipe (HQ + RABC + Relative Actions), but 2.5 fine-tuned from 1.7 (the stronger base with relative actions + RABC already baked in) while 2.2 fine-tuned from 1.3. The difference (75% → 90%) likely reflects this stronger starting point. Data quality was the single biggest lever, and RABC's effect was strongest on **Level 2**, the longer, harder task where emphasizing the best demonstrations mattered most.
|
| 162 |
|
app/src/content/chapters/folding/09-learnings.mdx
CHANGED
|
@@ -8,19 +8,33 @@ Running all these experiments taught us a lot. Some expected, some not. Here's w
|
|
| 8 |
|
| 9 |
Beyond the experiment findings above, several practical insights stood out:
|
| 10 |
|
| 11 |
-
- **Train a reward model.** [SARM](https://huggingface.co/docs/lerobot/sarm) gave us data scoring, advantage conditioning, and curation in one package.
|
| 12 |
- **Invest in recording quality early.** More time upfront on clean and consistent recordings pays off more than extra volume.
|
| 13 |
- **Record/Teleoperate at higher frequency.** We'd record at 50 fps instead of 30 fps if we had to do it again. Folding is dynamic and higher record rates capture transitions better.
|
| 14 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
### For the community: the order of operations
|
| 16 |
|
| 17 |
If you're training a policy for a new manipulation task with LeRobot, **here's the sequence we'd recommend**:
|
| 18 |
|
| 19 |
1. **Define your task protocol first.** Before collecting a single episode, define exactly how the task should be performed.
|
| 20 |
2. **Collect 30-50 clean demonstrations per item/background** Quality over volume. Consistent technique, deliberate motions. This is your foundation, everything else builds on it.
|
| 21 |
-
3. **Train a reward model.** Use [SARM](https://huggingface.co/docs/lerobot/sarm) to score your episodes and enable RABC during training. This allows the policy to focus on the best demonstrations, which is crucial for longer tasks.
|
| 22 |
4. **Train a baseline and watch it fail.** Film the rollouts. Understanding *how* and *where* it breaks tells you exactly what kind of data to collect next.
|
| 23 |
-
5. **Enable action interpolation and [RTC](https://huggingface.co/docs/lerobot/rtc).** This smooths transitions and speeds up execution. Action interpolation upsamples the policy's 30 Hz output to your robot's control frequency (e.g. 90 Hz), and RTC overlaps inference with execution. Both features are available at inference time with the corresponding flags:
|
| 24 |
|
| 25 |
```bash
|
| 26 |
python examples/rtc/eval_with_real_robot.py \
|
|
@@ -31,7 +45,7 @@ python examples/rtc/eval_with_real_robot.py \
|
|
| 31 |
--interpolation_multiplier=3
|
| 32 |
```
|
| 33 |
|
| 34 |
-
6. **Find the right [action representation](https://huggingface.co/docs/lerobot/
|
| 35 |
|
| 36 |
```bash
|
| 37 |
# Precompute relative action stats
|
|
@@ -50,10 +64,10 @@ lerobot-train \
|
|
| 50 |
|
| 51 |
7. **Use [DAgger](https://github.com/huggingface/lerobot/tree/main/examples/hil) for targeted improvement.** Once you have a model that mostly works, collect correction data for its specific failure modes using LeRobot's Human-in-the-Loop scripts.
|
| 52 |
|
| 53 |
-
8. **
|
| 54 |
|
| 55 |
-
<Note variant="
|
| 56 |
-
All the innovations from this project [SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm),
|
| 57 |
</Note>
|
| 58 |
|
| 59 |
### The bigger picture
|
|
|
|
| 8 |
|
| 9 |
Beyond the experiment findings above, several practical insights stood out:
|
| 10 |
|
| 11 |
+
- **Train a reward model.** [SARM](https://huggingface.co/docs/lerobot/v0.5.1/sarm) gave us data scoring, advantage conditioning, and curation in one package.
|
| 12 |
- **Invest in recording quality early.** More time upfront on clean and consistent recordings pays off more than extra volume.
|
| 13 |
- **Record/Teleoperate at higher frequency.** We'd record at 50 fps instead of 30 fps if we had to do it again. Folding is dynamic and higher record rates capture transitions better.
|
| 14 |
|
| 15 |
+
### The real cost of an experiment like this
|
| 16 |
+
|
| 17 |
+
Flashy results can obscure how much effort goes into producing them. Here's a rough breakdown of what this project actually cost:
|
| 18 |
+
|
| 19 |
+
| Category | Estimate |
|
| 20 |
+
|:---|:---|
|
| 21 |
+
| **Hardware (per setup)** | ~TODO EUR (arms, grippers, cameras, extrusion, cabling) |
|
| 22 |
+
| **Number of setups** | 8 |
|
| 23 |
+
| **Operator hours** | ~TODO h across all operators (~150 hours worth of data) |
|
| 24 |
+
| **Engineering hours** | ~1,920 hours (1 FTE Robotics AI Engineer) |
|
| 25 |
+
| **GPU training hours** | ~TODO h of H100 (across all experimental runs) |
|
| 26 |
+
|
| 27 |
+
TODO: commentary on what this means / what surprised us / what the takeaway is for others attempting similar projects. Some ideas: the hidden cost of over-committing to HW setups early on. What percentage of the operators' data was actually used. It's often a balance between overhead investment and speed of execution. Simulation was not explored, but for other types of tasks it could help bring these numbers down.
|
| 28 |
+
|
| 29 |
### For the community: the order of operations
|
| 30 |
|
| 31 |
If you're training a policy for a new manipulation task with LeRobot, **here's the sequence we'd recommend**:
|
| 32 |
|
| 33 |
1. **Define your task protocol first.** Before collecting a single episode, define exactly how the task should be performed.
|
| 34 |
2. **Collect 30-50 clean demonstrations per item/background** Quality over volume. Consistent technique, deliberate motions. This is your foundation, everything else builds on it.
|
| 35 |
+
3. **Train a reward model.** Use [SARM](https://huggingface.co/docs/lerobot/v0.5.1/sarm) to score your episodes and enable RABC during training. This allows the policy to focus on the best demonstrations, which is crucial for longer tasks.
|
| 36 |
4. **Train a baseline and watch it fail.** Film the rollouts. Understanding *how* and *where* it breaks tells you exactly what kind of data to collect next.
|
| 37 |
+
5. **Enable action interpolation and [RTC](https://huggingface.co/docs/lerobot/v0.5.1/rtc).** This smooths transitions and speeds up execution. Action interpolation upsamples the policy's 30 Hz output to your robot's control frequency (e.g. 90 Hz), and RTC overlaps inference with execution. Both features are available at inference time with the corresponding flags:
|
| 38 |
|
| 39 |
```bash
|
| 40 |
python examples/rtc/eval_with_real_robot.py \
|
|
|
|
| 45 |
--interpolation_multiplier=3
|
| 46 |
```
|
| 47 |
|
| 48 |
+
6. **Find the right [action representation](https://huggingface.co/docs/lerobot/v0.5.1/action_representations).** LeRobot uses absolute actions by default. Switching to a relative trajectory representation was one of our key improvements and unlocked consistency with π0.5 pretraining. To enable relative actions for π0/π0.5 using LeRobot, first precompute the relative action statistics for your dataset, then train with the flag enabled:
|
| 49 |
|
| 50 |
```bash
|
| 51 |
# Precompute relative action stats
|
|
|
|
| 64 |
|
| 65 |
7. **Use [DAgger](https://github.com/huggingface/lerobot/tree/main/examples/hil) for targeted improvement.** Once you have a model that mostly works, collect correction data for its specific failure modes using LeRobot's Human-in-the-Loop scripts.
|
| 66 |
|
| 67 |
+
8. **Record every evaluation.** Metrics alone won't tell the full story. Video reveals subtle failure modes that success rate misses, and lets you score quality.
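
To make the action interpolation in step 5 concrete, here is a minimal sketch of upsampling a policy's action chunk by an integer multiplier (for example 30 Hz to 90 Hz with a multiplier of 3). It assumes plain linear interpolation between consecutive actions; the function name `interpolate_actions` is hypothetical, and this is not necessarily how LeRobot implements `--interpolation_multiplier`.

```python
from typing import List, Sequence


def interpolate_actions(
    chunk: Sequence[Sequence[float]], multiplier: int
) -> List[List[float]]:
    """Linearly upsample a chunk of actions by an integer multiplier.

    Hypothetical helper for illustration only. Each original step is split
    into `multiplier` sub-steps; the last action is kept unchanged, so a
    chunk of N actions becomes (N - 1) * multiplier + 1 actions.
    """
    if multiplier < 1:
        raise ValueError("multiplier must be >= 1")
    if not chunk:
        return []

    upsampled: List[List[float]] = []
    for current, following in zip(chunk, chunk[1:]):
        for k in range(multiplier):
            t = k / multiplier  # fraction of the way from current to following
            upsampled.append(
                [c + t * (f - c) for c, f in zip(current, following)]
            )
    upsampled.append(list(chunk[-1]))
    return upsampled


if __name__ == "__main__":
    # Two joints, three policy steps at 30 Hz, upsampled for a 90 Hz controller.
    policy_chunk = [[0.0, 10.0], [3.0, 4.0], [6.0, 4.0]]
    for action in interpolate_actions(policy_chunk, multiplier=3):
        print(action)
```

The robot then consumes the upsampled actions at its own control rate, so each commanded step is smaller than the step between two raw policy outputs.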
|
| 68 |
|
| 69 |
+
<Note variant="success">
|
| 70 |
+
All the innovations from this project ([SARM](https://huggingface.co/docs/lerobot/v0.5.1/sarm), [RTC](https://huggingface.co/docs/lerobot/v0.5.1/rtc), [DAgger](https://github.com/huggingface/lerobot/pull/2833), [OpenArm](https://huggingface.co/docs/lerobot/v0.5.1/openarm), [OpenArm Mini](https://github.com/pkooij/open-arms-mini) and [Dataset Tooling](https://huggingface.co/docs/lerobot/v0.5.1/using_dataset_tools)) are available in [LeRobot v0.5.1](https://github.com/huggingface/lerobot/releases/tag/v0.5.1). You can use our full pipeline as a starting point and swap in your own task.
|
| 71 |
</Note>
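
As a rough illustration of the absolute versus relative action representations from step 6, the sketch below expresses every action in a chunk as an offset from the robot state at the start of the chunk, and computes per-dimension statistics over those offsets. This is an assumption about what "relative" means here; the helper names are hypothetical, and LeRobot's actual implementation may differ (for instance in how gripper dimensions are handled).

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def to_relative(state: Vector, chunk: Sequence[Vector]) -> List[List[float]]:
    """Express each absolute action as an offset from the chunk's start state."""
    return [[a - s for a, s in zip(action, state)] for action in chunk]


def to_absolute(state: Vector, rel_chunk: Sequence[Vector]) -> List[List[float]]:
    """Invert to_relative: add the start state back onto every offset."""
    return [[d + s for d, s in zip(delta, state)] for delta in rel_chunk]


def per_dim_stats(
    rel_chunks: Sequence[Sequence[Vector]],
) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population std over all relative actions.

    Statistics like these would be used to normalize relative actions,
    which is why they must be recomputed when the representation changes.
    """
    rows = [row for chunk in rel_chunks for row in chunk]
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


if __name__ == "__main__":
    state = [1.0, 2.0]                      # joint positions at chunk start
    absolute_chunk = [[1.5, 2.0], [2.0, 1.0]]
    relative_chunk = to_relative(state, absolute_chunk)
    print(relative_chunk)
    print(per_dim_stats([relative_chunk]))
```

At inference time the policy's relative outputs would be converted back with `to_absolute` using the latest observed state before being sent to the robot.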
|
| 72 |
|
| 73 |
### The bigger picture
|
app/src/content/chapters/folding/12-references.mdx
CHANGED
|
@@ -29,13 +29,13 @@ We also want to thank the [Enactic](https://github.com/enactic/openarm) team and
|
|
| 29 |
|
| 30 |
#### Models we trained
|
| 31 |
|
| 32 |
-
- **π0** Black et al. (2024). *π0: A Vision-Language-Action Flow Model for General Robot Control.* [arxiv.org/abs/2410.24164](https://arxiv.org/abs/2410.24164) · [LeRobot docs](https://huggingface.co/docs/lerobot/pi0)
|
| 33 |
-
- **π0.5** Black et al. (2025). *π0.5: A Vision-Language-Action Model with Open-World Generalization.* [pi.website/blog/pi05](https://www.pi.website/blog/pi05) · [LeRobot docs](https://huggingface.co/docs/lerobot/pi05)
|
| 34 |
|
| 35 |
#### Techniques we used
|
| 36 |
|
| 37 |
-
- **RTC** Black, Galliker & Levine (2025). *Real-Time Execution of Action Chunking Flow Policies.* [pi.website/research/real_time_chunking](https://www.pi.website/research/real_time_chunking) · [LeRobot docs](https://huggingface.co/docs/lerobot/rtc)
|
| 38 |
-
- **SARM** Chen et al. (2025). *Stage-Aware Reward Modeling for Long Horizon Robot Manipulation.* [arxiv.org/abs/2509.25358](https://arxiv.org/abs/2509.25358) · [LeRobot docs](https://huggingface.co/docs/lerobot/sarm)
|
| 39 |
- **UMI** Chi et al. (2024). *Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots.* [arxiv.org/abs/2402.10329](https://arxiv.org/abs/2402.10329)
|
| 40 |
- **DAgger** Ross, Gordon & Bagnell (2011). *A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.* [arxiv.org/abs/1011.0686](https://arxiv.org/abs/1011.0686)
|
| 41 |
- **HG-DAgger** Kelly et al. (2019). *HG-DAgger: Interactive Imitation Learning with Human Experts.* [arxiv.org/abs/1810.02890](https://arxiv.org/abs/1810.02890)
|
|
|
|
| 29 |
|
| 30 |
#### Models we trained
|
| 31 |
|
| 32 |
+
- **π0** Black et al. (2024). *π0: A Vision-Language-Action Flow Model for General Robot Control.* [arxiv.org/abs/2410.24164](https://arxiv.org/abs/2410.24164) · [LeRobot docs](https://huggingface.co/docs/lerobot/v0.5.1/pi0)
|
| 33 |
+
- **π0.5** Black et al. (2025). *π0.5: A Vision-Language-Action Model with Open-World Generalization.* [pi.website/blog/pi05](https://www.pi.website/blog/pi05) · [LeRobot docs](https://huggingface.co/docs/lerobot/v0.5.1/pi05)
|
| 34 |
|
| 35 |
#### Techniques we used
|
| 36 |
|
| 37 |
+
- **RTC** Black, Galliker & Levine (2025). *Real-Time Execution of Action Chunking Flow Policies.* [pi.website/research/real_time_chunking](https://www.pi.website/research/real_time_chunking) · [LeRobot docs](https://huggingface.co/docs/lerobot/v0.5.1/rtc)
|
| 38 |
+
- **SARM** Chen et al. (2025). *Stage-Aware Reward Modeling for Long Horizon Robot Manipulation.* [arxiv.org/abs/2509.25358](https://arxiv.org/abs/2509.25358) · [LeRobot docs](https://huggingface.co/docs/lerobot/v0.5.1/sarm)
|
| 39 |
- **UMI** Chi et al. (2024). *Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots.* [arxiv.org/abs/2402.10329](https://arxiv.org/abs/2402.10329)
|
| 40 |
- **DAgger** Ross, Gordon & Bagnell (2011). *A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.* [arxiv.org/abs/1011.0686](https://arxiv.org/abs/1011.0686)
|
| 41 |
- **HG-DAgger** Kelly et al. (2019). *HG-DAgger: Interactive Imitation Learning with Human Experts.* [arxiv.org/abs/1810.02890](https://arxiv.org/abs/1810.02890)
|