chore(typos): fixing typos, syntax and grammar issues
#10
by CarolinePascal HF Staff - opened
- app/src/content/chapters/folding/01-hero.mdx +12 -11
- app/src/content/chapters/folding/02-results.mdx +6 -8
- app/src/content/chapters/folding/03-hardware.mdx +17 -18
- app/src/content/chapters/folding/04-data-collection.mdx +3 -3
- app/src/content/chapters/folding/05-data-diversity.mdx +5 -5
- app/src/content/chapters/folding/06-training.mdx +9 -9
- app/src/content/chapters/folding/07-evaluation.mdx +9 -10
- app/src/content/chapters/folding/08-ablations.mdx +20 -17
- app/src/content/chapters/folding/09-learnings.mdx +13 -13
- app/src/styles/_base.css +27 -2
app/src/content/chapters/folding/01-hero.mdx
CHANGED
|
@@ -3,7 +3,7 @@ import Note from "../../../components/Note.astro";
|
|
| 3 |
import Wide from "../../../components/Wide.astro";
|
| 4 |
import Stack from "../../../components/Stack.astro";
|
| 5 |
|
| 6 |
-
We trained an open-source bimanual robot to fold t-shirts autonomously, reaching 90% success rate. The biggest lever was data quality, not the model, not the architecture.
|
| 7 |
|
| 8 |
<Sidenote>
|
| 9 |
Read time: ~30 minutes. Each section stands on its own — feel free to skip to what interests you most.
|
|
@@ -11,23 +11,24 @@ We trained an open-source bimanual robot to fold t-shirts autonomously, reaching
|
|
| 11 |
|
| 12 |
This post walks through the complete journey: hardware choices, data collection, training recipes, and different experiments that show what actually matters. We cover the mistakes and dead ends alongside the things that worked, because the messy middle is where most of the learning happens.
|
| 13 |
|
| 14 |
-
Some of what we found: cheap 3D-printed leader arms outperformed the expensive ones for teleoperation. Early data collection was more wasteful than expected. A trained reward model turned out to be essential for separating useful demonstrations from harmful ones. And curating a small, high-quality dataset did more than algorithmic improvement on the full dataset.
|
| 15 |
|
| 16 |
By sharing this we hope to contribute to our bigger vision: **democratize robotics and robot learning**. By open-sourcing every piece (tools, data, models, and knowledge), we want to enable a community that pushes this technology further. We've tried to avoid just listing what we did in favor of telling the story of how we got here. We hope being this open will help close the gap between closed-lab demos and what the open-source community can achieve.
|
| 17 |
|
| 18 |
-
Everything we built for this project [SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm)
|
| 19 |
|
| 20 |
-
|
| 21 |
|
| 22 |
#### Links
|
| 23 |
|
| 24 |
-
<Stack layout="4-column" gap="small">
|
| 25 |
-
<a href="https://huggingface.co/lerobot-data-collection/folding_final" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">
|
| 26 |
-
<a href="https://huggingface.co/lerobot-data-collection/folding_sarm_reward" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">
|
| 27 |
-
<a href="https://huggingface.co/datasets/lerobot/high_quality_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">
|
| 28 |
-
<a href="https://huggingface.co/datasets/lerobot/full_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">
|
| 29 |
-
<a href="
|
| 30 |
-
<a href="https://
|
|
|
|
| 31 |
</Stack>
|
| 32 |
|
| 33 |
<Sidenote>
|
|
|
|
| 3 |
import Wide from "../../../components/Wide.astro";
|
| 4 |
import Stack from "../../../components/Stack.astro";
|
| 5 |
|
| 6 |
+
> We trained an open-source bimanual robot to fold t-shirts autonomously, reaching a 90% success rate. The biggest lever was data quality, not the model, not the architecture.
|
| 7 |
|
| 8 |
<Sidenote>
|
| 9 |
Read time: ~30 minutes. Each section stands on its own — feel free to skip to what interests you most.
|
|
|
|
| 11 |
|
| 12 |
This post walks through the complete journey: hardware choices, data collection, training recipes, and different experiments that show what actually matters. We cover the mistakes and dead ends alongside the things that worked, because the messy middle is where most of the learning happens.
|
| 13 |
|
| 14 |
+
Some of what we found: cheap 3D-printed leader arms outperformed the expensive ones for teleoperation. Early data collection was more wasteful than expected. A trained reward model turned out to be essential for separating useful demonstrations from harmful ones. And curating a small, high-quality dataset did more than any algorithmic improvement on the full dataset.
|
| 15 |
|
| 16 |
By sharing this we hope to contribute to our bigger vision: **democratize robotics and robot learning**. By open-sourcing every piece (tools, data, models, and knowledge), we want to enable a community that pushes this technology further. We've tried to avoid just listing what we did in favor of telling the story of how we got here. We hope being this open will help close the gap between closed-lab demos and what the open-source community can achieve.
|
| 17 |
|
| 18 |
+
Everything we built for this project ([SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm), and [OpenArm Mini](http://github.com/pkooij/open-arms-mini)) is now merged into [LeRobot](https://github.com/huggingface/lerobot) and ready for the community to use.
|
| 19 |
|
| 20 |
+
_Let's start with the results: does it actually work?_
|
| 21 |
|
| 22 |
#### Links
|
| 23 |
|
| 24 |
+
<Stack layout="4-column" gap="small" class="links-centered">
|
| 25 |
+
<a href="https://huggingface.co/lerobot-data-collection/folding_final" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>Model</strong><br/>HF Hub</a>
|
| 26 |
+
<a href="https://huggingface.co/lerobot-data-collection/folding_sarm_reward" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>SARM Reward</strong><br/>HF Hub</a>
|
| 27 |
+
<a href="https://huggingface.co/datasets/lerobot/high_quality_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>HQ Dataset</strong><br/>HF Hub</a>
|
| 28 |
+
<a href="https://huggingface.co/datasets/lerobot/full_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>Full Dataset</strong><br/>HF Hub</a>
|
| 29 |
+
<a href="http://github.com/pkooij/open-arms-mini" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>OpenArm Mini</strong><br/>Repo</a>
|
| 30 |
+
<a href="https://github.com/huggingface/lerobot" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>LeRobot</strong><br/>Code</a>
|
| 31 |
+
<a href="https://huggingface.co/docs/lerobot/index" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>LeRobot</strong><br/>Documentation</a>
|
| 32 |
</Stack>
|
| 33 |
|
| 34 |
<Sidenote>
|
app/src/content/chapters/folding/02-results.mdx
CHANGED
|
@@ -7,16 +7,14 @@ import Video from "../../../components/Video.astro";
|
|
| 7 |
|
| 8 |
## Results
|
| 9 |
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
**Level 1: Fold a laid-out t-shirt** (15 min continuous folding)
|
| 13 |
|
|
|
|
| 14 |
<Wide>
|
| 15 |
<Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level1.mp4" />
|
| 16 |
</Wide>
|
| 17 |
|
| 18 |
-
|
| 19 |
-
|
| 20 |
<Wide>
|
| 21 |
<Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level2.mp4" />
|
| 22 |
</Wide>
|
|
@@ -25,14 +23,14 @@ Below are two **uncut, full-length** runs from our best model. No human interven
|
|
| 25 |
|
| 26 |
How well does it actually work? We evaluated our best model (Experiment 2.5) across 20 rollouts.
|
| 27 |
|
| 28 |
-
| Task | Success Rate | Avg. Completion Time |
|
| 29 |
|:---|:---:|:---:|
|
| 30 |
| **Level 1** Laid-out to Fold | **100%** | **40.8 s** |
|
| 31 |
| **Level 2** Messy to Spread to Fold to Place aside | **80%** | **95.9 s** |
|
| 32 |
| **Combined** (Total SR) | **90%** | |
|
| 33 |
|
| 34 |
<Sidenote>
|
| 35 |
-
All evaluations filmed and scored from video. 20 rollouts per experiment (10 per level). Full methodology in the Evaluation section.
|
| 36 |
</Sidenote>
|
| 37 |
|
| 38 |
-
These numbers are the result of 11 experiments, each testing a different combination of model, data, and training strategies. The full breakdown is in the [Experiments](#experiments) section.
|
|
|
|
| 7 |
|
| 8 |
## Results
|
| 9 |
|
| 10 |
+
No cherry-picked clips. Here are two **uncut, full-length** runs from our best model, with no human intervention.
|
|
|
|
|
|
|
| 11 |
|
| 12 |
+
### Level 1: Fold a laid-out t-shirt <span style="font-weight: 400; font-size: 0.8em; opacity: 0.6;">(15 min continuous folding)</span>
|
| 13 |
<Wide>
|
| 14 |
<Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level1.mp4" />
|
| 15 |
</Wide>
|
| 16 |
|
| 17 |
+
### Level 2: Untangle, spread, fold, and place aside <span style="font-weight: 400; font-size: 0.8em; opacity: 0.6;">(5 shirts back-to-back)</span>
|
|
|
|
| 18 |
<Wide>
|
| 19 |
<Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level2.mp4" />
|
| 20 |
</Wide>
|
|
|
|
| 23 |
|
| 24 |
How well does it actually work? We evaluated our best model (Experiment 2.5) across 20 rollouts.
|
| 25 |
|
| 26 |
+
| Task | Success Rate (SR) | Avg. Completion Time |
|
| 27 |
|:---|:---:|:---:|
|
| 28 |
| **Level 1** Laid-out to Fold | **100%** | **40.8 s** |
|
| 29 |
| **Level 2** Messy to Spread to Fold to Place aside | **80%** | **95.9 s** |
|
| 30 |
| **Combined** (Total SR) | **90%** | |
|
| 31 |
|
| 32 |
<Sidenote>
|
| 33 |
+
All evaluations filmed and scored from video. 20 rollouts per experiment (10 per level). Full methodology in the [Evaluation](#evaluation) section.
|
| 34 |
</Sidenote>
|
| 35 |
|
| 36 |
+
These numbers are the result of 11 experiments, each testing a different combination of model, data, and training strategies. The full breakdown is in the [Experiments](#experiments) section. But let's start from the beginning: the hardware.
|
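As a quick check on the results table, the combined figure follows directly from the per-level counts. This is a minimal sketch, assuming 10 rollouts per level as stated in the sidenote; the success counts (10 and 8) are inferred from the reported percentages rather than quoted from the post, and `success_rate` is a helper name introduced here for illustration.

```python
def success_rate(successes: int, rollouts: int) -> float:
    """Return the success rate as a percentage."""
    if rollouts <= 0:
        raise ValueError("rollouts must be positive")
    return 100.0 * successes / rollouts


# Assumption: counts inferred from the reported 100% and 80% over 10 rollouts each.
level1_successes, level2_successes = 10, 8
rollouts_per_level = 10

level1_sr = success_rate(level1_successes, rollouts_per_level)
level2_sr = success_rate(level2_successes, rollouts_per_level)
# The combined rate pools both levels: 18 successes out of 20 rollouts.
combined_sr = success_rate(
    level1_successes + level2_successes, 2 * rollouts_per_level
)
print(level1_sr, level2_sr, combined_sr)
```

With only 10 rollouts per level, a single failure moves a per-level rate by 10 percentage points, which is worth keeping in mind when comparing experiments.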
app/src/content/chapters/folding/03-hardware.mdx
CHANGED
|
@@ -10,14 +10,14 @@ import openArmMini2 from "../../assets/image/openarm-mini2.jpg";
|
|
| 10 |
|
| 11 |
## Hardware
|
| 12 |
|
| 13 |
-
LeRobot takes care of the entire robot learning stack
|
| 14 |
|
| 15 |
### The Robot: Bimanual OpenArm
|
| 16 |
|
| 17 |
-
For starters,
|
| 18 |
|
| 19 |
1. **The humanoid trend.** We're seeing a wave of human-like robots. More human-form robots means more human-form data in the ecosystem. Building on this form factor positions our work for a future where human-like manipulation data is transferable.
|
| 20 |
-
2. **Smaller teleop gap.** When the robot's kinematics match a human arm, the teleoperator's motions transfer more naturally less mental remapping
|
| 21 |
3. **Open source, good specs.** Solid payload, good reach, and fully open hardware. We extended the upper arm by **+5 cm** to increase reach since our setup doesn't have a hip or torso to provide additional workspace.
|
| 22 |
|
| 23 |
Everything is mounted on **aluminum extrusion profiles**, which let us quickly iterate on the physical arrangement and adjust both teleop and robot height between sessions to increase data diversity.
|
|
@@ -26,24 +26,22 @@ Everything is mounted on **aluminum extrusion profiles**, which let us quickly i
|
|
| 26 |
|
| 27 |
### Custom Grippers
|
| 28 |
|
| 29 |
-
We designed **custom grippers with a larger surface area**, giving the robot a broader contact patch to grip, pinch, and slide fabric reliably.
|
| 30 |
|
| 31 |
### Teleop Arms: OpenArm Mini
|
| 32 |
|
| 33 |
-
Next, we need a way to
|
| 34 |
|
| 35 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
-
|
| 38 |
-
- **Less inertia** operators could make quicker and more deliberate motions that cloth folding demands
|
| 39 |
-
- **Arm-length agnostic** works for teleoperators of any size
|
| 40 |
-
- **Incredibly cheap** ~120 EUR per arm, making it very cheap to set up multiple stations
|
| 41 |
-
- **Still support DAgger** lightweight, but strong enough to move during human-in-the-loop correction data collection
|
| 42 |
-
|
| 43 |
-
One detail turned out to be critical: the **wrist strap**. Without it, wrist rotations were imprecise. With the strap, operators get locked-in wrist control, which is essential for cloth manipulation.
|
| 44 |
|
| 45 |
<Note variant="info" emoji="🔗">
|
| 46 |
-
OpenArm Mini repo (3D print files, BOM, LeRobot integration)
|
| 47 |
</Note>
|
| 48 |
|
| 49 |
<div style="display: flex; gap: 8px; max-width: 70%; margin: 0 auto;">
|
|
@@ -54,11 +52,13 @@ One detail turned out to be critical: the **wrist strap**. Without it, wrist rot
|
|
| 54 |
</div>
|
| 55 |
</div>
|
| 56 |
|
| 57 |
-
|
|
|
|
|
|
|
| 58 |
|
| 59 |
### Cameras
|
| 60 |
|
| 61 |
-
|
| 62 |
|
| 63 |
| Camera | Position | Notes |
|
| 64 |
|:---|:---|:---|
|
|
@@ -77,7 +77,6 @@ The robot needs to see what it's doing: for this purpose we use **three cameras*
|
|
| 77 |
|
| 78 |
### LeRobot Integration
|
| 79 |
|
| 80 |
-
Integrating OpenArm into LeRobot required adding **CAN-bus protocol** support for the arm's motors
|
| 81 |
-
|
| 82 |
|
| 83 |
With the hardware in place, the next step was the hardest and most time-consuming part of the entire project: collecting good data. And "good" is much harder to define than it sounds.
|
|
|
|
| 10 |
|
| 11 |
## Hardware
|
| 12 |
|
| 13 |
+
LeRobot takes care of the entire robot learning stack, but you still need the physical hardware. Here's an overview of every piece we used.
|
| 14 |
|
| 15 |
### The Robot: Bimanual OpenArm
|
| 16 |
|
| 17 |
+
For starters, the robot. We used the **bimanual [OpenArm](https://huggingface.co/docs/lerobot/openarm)**, open-source, human-like robot arms developed by [Enactic](https://openarm.dev) and built by [WowRobo](https://shop.wowrobo.com). Three reasons drove this choice:
|
| 18 |
|
| 19 |
1. **The humanoid trend.** We're seeing a wave of human-like robots. More human-form robots means more human-form data in the ecosystem. Building on this form factor positions our work for a future where human-like manipulation data is transferable.
|
| 20 |
+
2. **Smaller teleop gap.** When the robot's kinematics match a human arm, the teleoperator's motions transfer more naturally, meaning less mental remapping and faster learning.
|
| 21 |
3. **Open source, good specs.** Solid payload, good reach, and fully open hardware. We extended the upper arm by **+5 cm** to increase reach since our setup doesn't have a hip or torso to provide additional workspace.
|
| 22 |
|
| 23 |
Everything is mounted on **aluminum extrusion profiles**, which let us quickly iterate on the physical arrangement and adjust both teleop and robot height between sessions to increase data diversity.
|
|
|
|
| 26 |
|
| 27 |
### Custom Grippers
|
| 28 |
|
| 29 |
+
We designed **custom grippers with a larger surface area**, giving the robot a broader contact patch to grip, pinch, and slide fabric reliably. We also added a small polymer patch on one side of the gripper to reduce slippage and make the grasping of fabric easier.
|
| 30 |
|
| 31 |
### Teleop Arms: OpenArm Mini
|
| 32 |
|
| 33 |
+
Next, we need a way to control the robot. We started with full-size OpenArm as leader arms for teleoperation. They seemed like the natural choice: same kinematics as the follower arms, one-to-one mapping.
|
| 34 |
|
| 35 |
+
However, we quickly realized we needed a teleoperator with less inertia, to allow for fast and precise manipulation, and more adaptability to different human morphologies. This led us to develop the **OpenArm Mini**: small, Feetech-based, 3D-printed leader arms based on the [SO-101](https://github.com/TheRobotStudio/SO-ARM100) design. These gave us:
|
| 36 |
+
1. **Less inertia** for the quicker, more deliberate motions that cloth folding demands
|
| 37 |
+
2. **Arm-length agnostic** and adaptable to any human operator size
|
| 38 |
+
3. **Incredibly cheap** (~120 EUR per arm), making it easy to scale to multiple stations
|
| 39 |
+
4. **Still supports DAgger**: lightweight, but strong enough to move during human-in-the-loop corrective data collection
|
| 40 |
|
| 41 |
+
One small detail mattered more than expected: the **wrist strap**. It locks the wrist to the leader arm, providing the precise rotational control essential for cloth manipulation.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 42 |
|
| 43 |
<Note variant="info" emoji="🔗">
|
| 44 |
+
[OpenArm Mini repo (3D print files, BOM, LeRobot integration)](https://github.com/pkooij/open-arms-mini)
|
| 45 |
</Note>
|
| 46 |
|
| 47 |
<div style="display: flex; gap: 8px; max-width: 70%; margin: 0 auto;">
|
|
|
|
| 52 |
</div>
|
| 53 |
</div>
|
| 54 |
|
| 55 |
+
<br/>
|
| 56 |
+
|
| 57 |
+
Another feature that made a surprisingly big difference: when both your hands are on the leader arms, you need a hands-free way to **start and stop episode recording**. USB foot pedals solved this elegantly.
|
| 58 |
|
| 59 |
### Cameras
|
| 60 |
|
| 61 |
+
Finally, the robot needs to see what it is doing. We used **three cameras**, each serving a distinct purpose:
|
| 62 |
|
| 63 |
| Camera | Position | Notes |
|
| 64 |
|:---|:---|:---|
|
|
|
|
| 77 |
|
| 78 |
### LeRobot Integration
|
| 79 |
|
| 80 |
+
Integrating OpenArm into LeRobot required adding **CAN-bus protocol** support for the arm's motors. It can now be found in the [LeRobot repository](https://github.com/huggingface/lerobot). We also created a UI for non-technical robot operators, so the CLI doesn't need to be used to start and stop episodes.
|
|
|
|
| 81 |
|
| 82 |
With the hardware in place, the next step was the hardest and most time-consuming part of the entire project: collecting good data. And "good" is much harder to define than it sounds.
|
app/src/content/chapters/folding/04-data-collection.mdx
CHANGED
|
@@ -10,11 +10,11 @@ We ran **8 setups** in parallel, optimizing for **maximum diversity**: 25+ diffe
|
|
| 10 |
|
| 11 |
### Learning to Teleoperate
|
| 12 |
|
| 13 |
-
Here's an honest truth: **early data is worse than the final data**. Teleoperating a bimanual robot is a genuine skill, and it takes practice. The first episodes are slow, not deliberate, and full of failed attempts. Over hours of practice, operators get dramatically better smoother motions, faster execution, and more consistent grasps.
|
| 14 |
|
| 15 |
-
This creates one of the most important practical decisions of the project: **when do you start recording data for the final model?** Too early and you pollute the dataset with low-quality demonstrations that the model will faithfully reproduce, hesitations
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
### Tips for Good Data Collection
|
| 20 |
|
|
|
|
| 10 |
|
| 11 |
### Learning to Teleoperate
|
| 12 |
|
| 13 |
+
Here's an honest truth: **early data is worse than the final data**. Teleoperating a bimanual robot is a genuine skill, and it takes practice. The first episodes are slow, not deliberate, and full of failed attempts. Over hours of practice, operators get dramatically better and smoother motions, faster execution, and more consistent grasps.
|
| 14 |
|
| 15 |
+
This creates one of the most important practical decisions of the project: **when do you start recording data for the final model?** Too early and you pollute the dataset with low-quality demonstrations that the model will faithfully reproduce, including hesitations and fumbles. Too late and you've wasted precious time.
|
| 16 |
|
| 17 |
+
Aligning on a common strategy across operators was equally important. Folding is a very multi-modal task (there are many valid ways to fold a t-shirt) and the model learns better from a consistent strategy. Before each recording sprint, we held brief alignment sessions: experimenting with different techniques, sharing our learnings and then converging on the most efficient fold sequence.
|
| 18 |
|
| 19 |
### Tips for Good Data Collection
|
| 20 |
|
app/src/content/chapters/folding/05-data-diversity.mdx
CHANGED
|
@@ -8,7 +8,7 @@ import diversityGridImg from "../../assets/image/lerobot-data-collection_level12
|
|
| 8 |
|
| 9 |
Raw episodes are only the beginning. What you do with them before training determines whether your model learns to fold or learns to fumble.
|
| 10 |
|
| 11 |
-
We collected two datasets: a
|
| 12 |
|
| 13 |
### Dataset Statistics
|
| 14 |
|
|
@@ -35,10 +35,10 @@ The grid below shows one frame from each of 100 different episodes. Notice the v
|
|
| 35 |
|
| 36 |
We filtered episodes in two ways:
|
| 37 |
|
| 38 |
-
1. **End-state image filtering** discard episodes where the final frame doesn't show a properly folded shirt. If the end result isn't good, the demonstration isn't useful.
|
| 39 |
2. **Length-based filtering** using the LeRobot data visualizer to remove outliers. Episodes that are suspiciously short tend to be low quality.
|
| 40 |
|
| 41 |
-
The [LeRobot Data Visualizer](https://huggingface.co/spaces/lerobot/visualize_dataset) was invaluable for inspecting the dataset, spotting outliers, and understanding distributions.
|
| 42 |
|
| 43 |
<Wide>
|
| 44 |
<div className="card" style="overflow: hidden; border-radius: 10px;">
|
|
@@ -46,6 +46,6 @@ The [LeRobot Data Visualizer](https://huggingface.co/spaces/lerobot/visualize_da
|
|
| 46 |
</div>
|
| 47 |
</Wide>
|
| 48 |
|
| 49 |
-
#### SARM Annotation
|
| 50 |
|
| 51 |
-
We also annotated every episode using our trained **[SARM](https://huggingface.co/docs/lerobot/sarm)** reward model. This gave us continuous scores we could
|
|
|
|
| 8 |
|
| 9 |
Raw episodes are only the beginning. What you do with them before training determines whether your model learns to fold or learns to fumble.
|
| 10 |
|
| 11 |
+
We collected two datasets: a **full dataset** containing every episode, and a **curated dataset** built by selecting the best episodes from the full set and supplementing them with additional high-quality recordings.
|
| 12 |
|
| 13 |
### Dataset Statistics
|
| 14 |
|
|
|
|
| 35 |
|
| 36 |
We filtered episodes in two ways:
|
| 37 |
|
| 38 |
+
1. **End-state image filtering** to discard episodes where the final frame doesn't show a properly folded shirt. If the end result isn't good, the demonstration isn't useful.
|
| 39 |
2. **Length-based filtering** using the LeRobot data visualizer to remove outliers. Episodes that are suspiciously short tend to be low quality.
|
| 40 |
|
| 41 |
+
The [LeRobot Data Visualizer](https://huggingface.co/spaces/lerobot/visualize_dataset) was invaluable for inspecting the dataset, spotting outliers, and understanding distributions. Try it right here with our dataset:
|
| 42 |
|
| 43 |
<Wide>
|
| 44 |
<div className="card" style="overflow: hidden; border-radius: 10px;">
|
|
|
|
| 46 |
</div>
|
| 47 |
</Wide>
|
| 48 |
|
| 49 |
+
#### SARM Annotation
|
| 50 |
|
| 51 |
+
We also annotated every episode using our trained **[Stage-Aware Reward Modeling (SARM)](https://huggingface.co/docs/lerobot/sarm)** reward model. This gave us continuous scores we could use as weights at training time. More details in [SARM: Our Reward Model](#sarm-our-reward-model).
|
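The length-based filter described above was done by eye in the LeRobot data visualizer. As a rough stand-in, the idea can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the `filter_short_episodes` helper, the rule "keep episodes at least half as long as the median", and the frame counts are not taken from the post or from the LeRobot API.

```python
from statistics import median


def filter_short_episodes(
    lengths: dict[str, int], min_fraction: float = 0.5
) -> list[str]:
    """Keep episode IDs whose frame count is at least min_fraction * median."""
    if not lengths:
        return []
    # Hypothetical rule: anything far below the typical length is suspicious.
    cutoff = min_fraction * median(lengths.values())
    return [episode for episode, frames in lengths.items() if frames >= cutoff]


# Made-up frame counts, for illustration only.
episode_lengths = {"ep_000": 1200, "ep_001": 1150, "ep_002": 300, "ep_003": 1300}
kept = filter_short_episodes(episode_lengths)
print(kept)
```

A median-based cutoff is used here because a few very short episodes would drag a mean-based threshold down; the right fraction depends on the dataset and is best chosen after looking at the length distribution.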
app/src/content/chapters/folding/06-training.mdx
CHANGED
|
@@ -6,11 +6,11 @@ import HtmlEmbed from "../../../components/HtmlEmbed.astro";
|
|
| 6 |
|
| 7 |
## Training
|
| 8 |
|
| 9 |
-
Before
|
| 10 |
|
| 11 |
### Model Architecture
|
| 12 |
|
| 13 |
-
At its core, the model is a **Vision-Language-Action (VLA)** model. It
|
| 14 |
|
| 15 |
<Wide>
|
| 16 |
<HtmlEmbed
|
|
@@ -22,7 +22,7 @@ At its core, the model is a **Vision-Language-Action (VLA)** model. It sees the
|
|
| 22 |
/>
|
| 23 |
</Wide>
|
| 24 |
|
| 25 |
-
The model generates actions through **flow matching** a generative approach that transforms random noise into coherent action sequences, conditioned on what the cameras see and what the
|
| 26 |
|
| 27 |
<Sidenote>
|
| 28 |
Flow matching is closely related to diffusion models but uses a simpler, more direct interpolation path between noise and data.
|
|
@@ -30,7 +30,7 @@ The model generates actions through **flow matching** a generative approach that
|
|
| 30 |
|
| 31 |
#### [Real-Time Chunking (RTC)](https://huggingface.co/docs/lerobot/rtc)
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
```mermaid
|
| 36 |
sequenceDiagram
|
|
@@ -45,12 +45,12 @@ sequenceDiagram
|
|
| 45 |
|
| 46 |
### Models
|
| 47 |
|
| 48 |
-
We initially trained multiple architectures supported in LeRobot, but we ended up
|
| 49 |
|
| 50 |
- **π0**: the base flow-matching VLA, trained with standard imitation learning
|
| 51 |
-
- **[π0.5](https://huggingface.co/docs/lerobot/pi05)** an improved variant with more pretraining and
|
| 52 |
|
| 53 |
-
Both are finetuned from pretrained checkpoints. Starting from this pretrained foundation
|
| 54 |
|
| 55 |
### Training Setup
|
| 56 |
|
|
@@ -64,7 +64,7 @@ Both are finetuned from pretrained checkpoints. Starting from this pretrained fo
|
|
| 64 |
| Training steps | **200k** (Series 1) / **100k** (Series 2 fine-tune) |
|
| 65 |
|
| 66 |
<Sidenote>
|
| 67 |
-
Multi-GPU training with
|
| 68 |
</Sidenote>
|
| 69 |
|
| 70 |
### Loss Curves
|
|
@@ -79,4 +79,4 @@ Both are finetuned from pretrained checkpoints. Starting from this pretrained fo
|
|
| 79 |
/>
|
| 80 |
</Wide>
|
| 81 |
|
| 82 |
-
Our training followed two phases: **Series 1** trained from pretrained base checkpoints on the full dataset for 200k steps
|
|
|
|
| 6 |
|
| 7 |
## Training
|
| 8 |
|
| 9 |
+
Before talking about hyperparameters, one needs to understand what the trained model actually *is*: what it takes in, what it produces, and why those choices matter for cloth folding.
|
| 10 |
|
| 11 |
### Model Architecture
|
| 12 |
|
| 13 |
+
At its core, the model is a **Vision-Language-Action (VLA)** model. It takes in camera images and a task description, and outputs actions (joint angle targets and gripper commands) for the next second, at a frequency of 30 Hz.
|
| 14 |
|
| 15 |
<Wide>
|
| 16 |
<HtmlEmbed
|
|
|
|
| 22 |
/>
|
| 23 |
</Wide>
|
| 24 |
|
| 25 |
+
The model generates actions through **flow matching**, a generative approach that transforms random noise into coherent action sequences, conditioned on what the cameras see and what the motors are doing. This allows the model to represent **multi-modal action distributions**: when there are multiple valid ways to grasp a sleeve or start a fold, the model can capture that ambiguity rather than averaging to a meaningless middle ground.
|
| 26 |
|
| 27 |
<Sidenote>
|
| 28 |
Flow matching is closely related to diffusion models but uses a simpler, more direct interpolation path between noise and data.
|
|
|
|
| 30 |
|
| 31 |
#### [Real-Time Chunking (RTC)](https://huggingface.co/docs/lerobot/rtc)
|
| 32 |
|
| 33 |
+
RTC was crucial for real-world deployment. Instead of waiting for the predicted action chunk to finish before generating the next, RTC generates the next chunk while executing the current one. It "freezes" actions that are already committed and "inpaints" the remaining ones, producing smooth asynchronous motion. In practice, this sped up our rollouts by at least a factor of 2.
|
| 34 |
|
| 35 |
```mermaid
|
| 36 |
sequenceDiagram
|
|
|
|
| 45 |
|
| 46 |
### Models
|
| 47 |
|
| 48 |
+
We initially trained multiple architectures supported in LeRobot, but we ended up focusing on two VLA architectures for our cloth folding data:
|
| 49 |
|
| 50 |
- **π0**: the base flow-matching VLA, trained with standard imitation learning
|
| 51 |
+
- **[π0.5](https://huggingface.co/docs/lerobot/pi05)**: an improved variant with more pretraining and several improvements to the flow matching denoising process
|
| 52 |
|
| 53 |
+
Both are finetuned from pretrained checkpoints. Starting from this pretrained foundation rather than training from scratch gives the model a head start on visual understanding and basic manipulation concepts.
|
| 54 |
|
| 55 |
### Training Setup
|
| 56 |
|
|
|
|
| 64 |
| Training steps | **200k** (Series 1) / **100k** (Series 2 fine-tune) |
|
| 65 |
|
| 66 |
<Sidenote>
|
| 67 |
+
Multi-GPU training with 8xH100 and gradient accumulation was necessary to fit the large batch sizes needed for stable VLA training.
|
| 68 |
</Sidenote>
|
| 69 |
|
| 70 |
### Loss Curves
|
|
|
|
| 79 |
/>
|
| 80 |
</Wide>
|
| 81 |
|
| 82 |
+
Our training followed two phases: **Series 1** trained from pretrained base checkpoints on the full dataset for 200k steps. **Series 2** fine-tuned the best Series 1 checkpoint on curated high-quality data for 100k steps.
|
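To make the flow matching description in this chapter concrete, here is a toy sketch of the sampling step: start from random noise and integrate a velocity field until it lands on an action. It is not the model's implementation. In the real policy the velocity is predicted by a neural network conditioned on camera images and robot state, and the output is a whole action chunk; here the "action" is a single scalar and the velocity comes from a closed-form oracle for one known target. The convention that t=0 is noise and t=1 is data, and the name `sample_action`, are assumptions made for this example.

```python
import random


def sample_action(target: float, steps: int = 10, seed: int = 0) -> float:
    """Euler-integrate dx/dt = (target - x) / (1 - t) from t=0 (noise) to t=1."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # start from random noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1.0, so the division below is safe
        # Oracle velocity for the straight path x_t = (1 - t) * noise + t * target.
        velocity = (target - x) / (1.0 - t)
        x += dt * velocity
    return x


action = sample_action(target=0.3)
print(action)
```

Because the interpolation path is a straight line, a handful of Euler steps is enough to reach the target in this toy case, which hints at why flow matching can sample with few integration steps.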
app/src/content/chapters/folding/07-evaluation.mdx
CHANGED
|
@@ -5,27 +5,27 @@ import HtmlEmbed from "../../../components/HtmlEmbed.astro";
|
|
| 5 |
|
| 6 |
## Evaluation
|
| 7 |
|
| 8 |
-
**Evaluation is as hard as training.** In robotics
|
| 9 |
|
| 10 |
### Protocol
|
| 11 |
|
| 12 |
-
For every experiment we evaluate on:
|
| 13 |
|
| 14 |
- **5 different t-shirts for Level 1** (laid-out to fold)
|
| 15 |
- **5 different t-shirts for Level 2** (messy to spread to fold, then place aside)
|
| 16 |
|
| 17 |
-
Each t-shirt is attempted **twice consecutively**, giving **10 rollouts per level** and **20 rollouts total per experiment**. Every evaluation is filmed and scored from video afterward, so judgment is decoupled from execution.
|
| 18 |
|
| 19 |
<Note>
|
| 20 |
-
The
|
| 21 |
</Note>
|
| 22 |
|
| 23 |
### Metrics
|
| 24 |
|
| 25 |
We report four complementary metrics:
|
| 26 |
|
| 27 |
-
|
| 28 |
-
|
| 29 |
|
| 30 |
<Accordion title="Scoring rubric Level 1 and Level 2">
|
| 31 |
|
|
@@ -52,13 +52,12 @@ We report four complementary metrics:
|
|
| 52 |
| Rotation + Place aside | +10 |
|
| 53 |
| **Maximum per rollout** | **100** |
|
| 54 |
|
| 55 |
-
Scores are summed across all rollouts in an experiment. With 10
|
| 56 |
|
| 57 |
</Accordion>
|
| 58 |
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
**4. Completion time** Seconds to complete Level 1/Level 2, averaged across successful rollouts.
|
| 62 |
|
| 63 |
### Statistical uncertainty
|
| 64 |
|
|
|
|
| 5 |
|
| 6 |
## Evaluation
|
| 7 |
|
| 8 |
+
**Evaluation is as hard as training.** In robotics and with real hardware, no standardized benchmarks exist. If your evaluation protocol is inconsistent, every downstream decision will be wrong.
|
| 9 |
|
| 10 |
### Protocol
|
| 11 |
|
| 12 |
+
For every experiment, we evaluate the model on:
|
| 13 |
|
| 14 |
- **5 different t-shirts for Level 1** (laid-out to fold)
|
| 15 |
- **5 different t-shirts for Level 2** (messy to spread to fold, then place aside)
|
| 16 |
|
| 17 |
+
Each t-shirt fold is attempted **twice consecutively**, giving **10 rollouts per level** and **20 rollouts total per experiment**. Every evaluation is filmed and scored from video afterward, so judgment is decoupled from execution.
|
| 18 |
|
| 19 |
<Note>
|
| 20 |
+
The evaluation protocol (t-shirts, attempt count, scoring rubric, and filming setup) is identical across every experiment.
|
| 21 |
</Note>
|
| 22 |
|
| 23 |
### Metrics
|
| 24 |
|
| 25 |
We report four complementary metrics:
|
| 26 |
|
| 27 |
+
1. **Success Rate**: Binary pass/fail per rollout.
|
| 28 |
+
2. **Score**: Partial credit based on subtasks completed. This distinguishes a model that consistently reaches Fold 3 from one that fails at Unfold, even if neither achieves full success.
|
| 29 |
|
| 30 |
<Accordion title="Scoring rubric Level 1 and Level 2">
|
| 31 |
|
|
|
|
| 52 |
| Rotation + Place aside | +10 |
|
| 53 |
| **Maximum per rollout** | **100** |
|
| 54 |
|
| 55 |
+
Scores are summed across all the rollouts in an experiment. With 10 Level 1 rollouts (max 50 points each) and 10 Level 2 rollouts (max 100 points each), the **maximum total score per experiment is 1,500 points**.
|
| 56 |
|
| 57 |
</Accordion>
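As a quick check of the arithmetic above, here is a minimal Python sketch of how per-rollout scores add up to the per-experiment maximum. The constant names and the `total_score` helper are ours, for illustration only; they do not come from the project's code.

```python
from typing import Dict, List

# Hypothetical constants mirroring the protocol and rubric (names are illustrative).
TSHIRTS_PER_LEVEL = 5
ATTEMPTS_PER_TSHIRT = 2
ROLLOUTS_PER_LEVEL = TSHIRTS_PER_LEVEL * ATTEMPTS_PER_TSHIRT  # 10
MAX_POINTS = {"level_1": 50, "level_2": 100}


def max_total_score() -> int:
    """Upper bound on the summed score for one experiment."""
    return sum(ROLLOUTS_PER_LEVEL * points for points in MAX_POINTS.values())


def total_score(rollout_scores: Dict[str, List[int]]) -> int:
    """Sum partial-credit scores over all rollouts, checking each against its cap."""
    total = 0
    for level, scores in rollout_scores.items():
        cap = MAX_POINTS[level]
        for score in scores:
            if not 0 <= score <= cap:
                raise ValueError(f"{level} score {score} is outside 0..{cap}")
            total += score
    return total


print(max_total_score())  # 1500
```

Reporting the summed score next to the binary success rate keeps partial progress visible even when a rollout does not fully succeed.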
|
| 58 |
|
| 59 |
+
3. **Fold quality** A 1–5 rating of the final fold appearance, averaged across successful rollouts.
|
| 60 |
+
4. **Completion time** Seconds to complete Level 1/Level 2, averaged across successful rollouts.
|
|
|
|
| 61 |
|
| 62 |
### Statistical uncertainty
|
| 63 |
|
app/src/content/chapters/folding/08-ablations.mdx
CHANGED
|
@@ -10,7 +10,7 @@ import Stack from "../../../components/Stack.astro";
|
|
| 10 |
|
| 11 |
## Experiments
|
| 12 |
|
| 13 |
-
We ran 11 experiments to understand what *actually* matters. **Series 1** trains from pretrained base checkpoints on the full dataset. **Series 2** finetunes Series 1 checkpoints on curated high-quality data (2.1–2.4 from 1.3, 2.5 from 1.7). One early lesson: **undertraining makes the policy shaky** make sure your model has converged before drawing conclusions.
|
| 14 |
|
| 15 |
<Wide>
|
| 16 |
|
|
@@ -41,13 +41,15 @@ policy_cfg.rtc_config = RTCConfig(
|
|
| 41 |
)
|
| 42 |
```
|
| 43 |
|
| 44 |
-
|
|
|
|
|
|
|
| 45 |
|
| 46 |
### SARM: Our Reward Model
|
| 47 |
|
| 48 |
Before diving into the experiments further, let's introduce a key ingredient: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). SARM is a trained reward model that scores trajectories based on how well the robot is progressing toward task completion, it acts as a learned "critic" that predicts whether things are going well or badly.
|
| 49 |
|
| 50 |
-
SARM is trained on our demonstration data to predict 0-1 task progression. The
|
| 51 |
|
| 52 |
<Wide>
|
| 53 |
<Stack layout="3-column" gap="small">
|
|
@@ -57,13 +59,13 @@ SARM is trained on our demonstration data to predict 0-1 task progression. The k
|
|
| 57 |
</Stack>
|
| 58 |
</Wide>
|
| 59 |
|
| 60 |
-
We use SARM exclusively for **RABC** (Reward-Advantage-Based Conditioning):
|
| 61 |
|
| 62 |
---
|
| 63 |
|
| 64 |
### Results Overview
|
| 65 |
|
| 66 |
-
Now let's look at how each experiment actually performed. The charts below show success rates, scores, completion times, and failure modes across all 11 experiments. The pattern is consistent: **Series 2 dominates Series 1**, and within each series, RABC combined with relative actions produces the best results.
|
| 67 |
|
| 68 |
<HtmlEmbed
|
| 69 |
id="success-rates"
|
|
@@ -72,7 +74,7 @@ Now let's look at how each experiment actually performed. The charts below show
|
|
| 72 |
desc="Success rates (Total, Level 1, Level 2) across all experiments. Series 1 trains from scratch on full data; Series 2 finetunes the best Series 1 checkpoint on curated high-quality data."
|
| 73 |
/>
|
| 74 |
|
| 75 |
-
The gap between Series 1 and Series 2 is immediately visible. Experiment 2.5 reaches 90% total success rate (100%
|
| 76 |
|
| 77 |
<HtmlEmbed
|
| 78 |
id="total-score"
|
|
@@ -103,7 +105,7 @@ The heatmap shows where time is spent. Series 1 experiments are slow across the
|
|
| 103 |
|
| 104 |
### Where the policies fail
|
| 105 |
|
| 106 |
-
Before interpreting success rates,
|
| 107 |
|
| 108 |
<HtmlEmbed
|
| 109 |
id="failure-analysis"
|
|
@@ -129,7 +131,7 @@ With 20 rollouts per experiment, not every visible gap is real. We run **Barnard
|
|
| 129 |
|
| 130 |
#### 1. Data quality matters most
|
| 131 |
|
| 132 |
-
This is the finding we're most confident in it held regardless of which confidence level or correction method we used. The best Series 1 result (1.7) achieves 40% total
|
| 133 |
|
| 134 |
We hypothesise that the root cause is the difference in **multi-modality** between the high-quality and full dataset. The full dataset contains demonstrations with some inconsistent strategies: different grips, unfolding sequences, and timing, while the high-quality dataset enforces a more unified, consistent protocol.
|
| 135 |
|
|
@@ -165,36 +167,37 @@ lerobot-train \
|
|
| 165 |
--policy.use_relative_actions=true
|
| 166 |
```
|
| 167 |
|
| 168 |
-
Comparing π0.5 without relative actions (1.2: 20% total SR, 40%
|
| 169 |
|
| 170 |
-
|
| 171 |
|
| 172 |
#### 3. RABC helps especially on long tasks like level 2
|
| 173 |
|
| 174 |
-
RABC on high-quality data
|
|
|
|
| 175 |
#### 4. Fine-tuning from a strong checkpoint is the winning recipe
|
| 176 |
|
| 177 |
The best results share the same recipe: fine-tune a Series 1 checkpoint on curated high-quality data with RABC and relative actions.
|
| 178 |
|
| 179 |
-
| Experiment | Total SR |
|
| 180 |
|:---:|:---:|:---:|:---:|:---|
|
| 181 |
| 2.5 | **90%** | **100%** | **80%** | 1.7 → HQ + RABC, 100k steps |
|
| 182 |
| 2.2 | 75% | 100% | 50% | 1.3 → HQ + RABC, 100k steps |
|
| 183 |
| 1.7 | 40% | 80% | 0% | All data, Relative Actions + RABC + QUANTILES |
|
| 184 |
|
| 185 |
-
The jump from Series 1 to Series 2 is unambiguous in the statistical analysis — 2.5 and 2.2 clearly
|
| 186 |
|
| 187 |
Both 2.2 and 2.5 were trained for 100k steps. 2.2 fine-tunes from 1.3 while 2.5 fine-tunes from 1.7 (the stronger base). The difference (75% → 90%) likely reflects this stronger starting point. They don't separate from each other in the pairwise tests, suggesting the recipe itself (HQ + RABC + Relative Actions) is the key ingredient, with the base checkpoint providing an additional boost.
|
| 188 |
|
| 189 |
#### 5. Level 2 requires everything to be right simultaneously
|
| 190 |
|
| 191 |
-
Every Series 1 experiment achieves exactly **0% Level 2 success**. Level 2 only becomes tractable in Series 2, and only with RABC on high-quality data (2.2: 50% L2, 2.5: 80% L2). The 0% → 50–80% jump is as clean a signal as you'll find in a 20-rollout experiment. Level 2 is genuinely harder
|
| 192 |
|
| 193 |
#### 6. Speed and fold quality both track data quality
|
| 194 |
|
| 195 |
Series 1 completes Level 1 in **78–122s**; Series 2 does it in **41–73s**. Fold quality (1–5 scale) hits a ceiling around 2.8 in Series 1, breaking past 3.0 only with high-quality data.
|
| 196 |
|
| 197 |
-
| Experiment |
|
| 198 |
|:---:|:---:|:---:|:---:|
|
| 199 |
| 1.1 (π0, all data) | 121.5s | 80% | 2.70 |
|
| 200 |
| 1.7 (best S1) | 99.5s | 80% | 2.30 |
|
|
@@ -202,8 +205,8 @@ Series 1 completes Level 1 in **78–122s**; Series 2 does it in **41–73s**. F
|
|
| 202 |
| 2.2 (HQ + RABC) | 43.2s | 100% | 3.30 |
|
| 203 |
| 2.5 (best overall) | **40.8s** | **100%** | **4.10** |
|
| 204 |
|
| 205 |
-
Policies trained on the full dataset learned hesitant motions; the high-quality dataset enforces deliberate
|
| 206 |
|
| 207 |
#### 8. What did not work
|
| 208 |
|
| 209 |
-
|
|
|
|
| 10 |
|
| 11 |
## Experiments
|
| 12 |
|
| 13 |
+
We ran 11 experiments to understand what *actually* matters. **Series 1** trains from pretrained base checkpoints on the full dataset. **Series 2** fine-tunes Series 1 checkpoints on curated high-quality data (2.1–2.4 from 1.3, 2.5 from 1.7). One early lesson: **undertraining makes the policy shaky**, so make sure your model has converged before drawing conclusions.
|
| 14 |
|
| 15 |
<Wide>
|
| 16 |
|
|
|
|
| 41 |
)
|
| 42 |
```
|
| 43 |
|
| 44 |
+
with an action queue size of 30 and a maximum action horizon of 20.
|
| 45 |
+
|
| 46 |
+
RTC gave us a ~2x speedup (sometimes even 2.5x), and action interpolation made the robot much quieter and smoother. Both are now available on [LeRobot main](https://github.com/huggingface/lerobot).
|
| 47 |
|
| 48 |
### SARM: Our Reward Model
|
| 49 |
|
| 50 |
Before diving into the experiments further, let's introduce a key ingredient: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). SARM is a trained reward model that scores trajectories based on how well the robot is progressing toward task completion. It acts as a learned "critic" that predicts whether things are going well or badly.
|
| 51 |
|
| 52 |
+
SARM is trained on our demonstration data to predict task progression on a 0–1 scale. The takeaway: it correctly identifies **mistakes** (drops in value) and **progress** (increases) in real time.
|
| 53 |
|
| 54 |
<Wide>
|
| 55 |
<Stack layout="3-column" gap="small">
|
|
|
|
| 59 |
</Stack>
|
| 60 |
</Wide>
|
| 61 |
|
| 62 |
+
We use SARM exclusively for **RABC** (Reward-Advantage-Based Conditioning): every episode is scored with a per-timestep quality signal, and during training, actions are weighted by their contribution to progress. High-reward actions contribute more to the loss, low-reward ones contribute less. Negative progress is clipped to 0. Unlike binary success/fail labels, SARM provides a continuous signal at every timestep.
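To make the weighting concrete, here is a minimal sketch in plain Python. It assumes the per-timestep weight is the clipped difference between consecutive progress predictions, and that the loss is a weighted mean; the actual RABC implementation in LeRobot may compute and normalize weights differently. Both function names are ours.

```python
from typing import List, Sequence


def progress_weights(progress: Sequence[float]) -> List[float]:
    """Turn a 0-1 progress curve into per-timestep weights.

    Assumption: weight_t = max(progress[t + 1] - progress[t], 0), so steps
    where progress drops get weight 0 (negative progress is clipped).
    """
    return [max(nxt - cur, 0.0) for cur, nxt in zip(progress, progress[1:])]


def weighted_loss(per_step_loss: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-timestep losses; 0.0 if every weight is 0."""
    if len(per_step_loss) != len(weights):
        raise ValueError("need exactly one weight per timestep loss")
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(l * w for l, w in zip(per_step_loss, weights)) / total_weight


# Toy episode: progress rises, dips (a mistake), then recovers.
progress = [0.0, 0.2, 0.1, 0.5]
weights = progress_weights(progress)  # roughly [0.2, 0.0, 0.4]
loss = weighted_loss([1.0, 5.0, 2.0], weights)
```

In this toy episode the middle step, where progress drops, receives zero weight, so its large loss does not influence the result.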
|
| 63 |
|
| 64 |
---
|
| 65 |
|
| 66 |
### Results Overview
|
| 67 |
|
| 68 |
+
Now let's look at how each experiment actually performed. The charts below show success rates, scores, completion times, and failure modes across all 11 experiments. The pattern is consistent: **Series 2 dominates Series 1**, and within each series, RABC combined with relative actions produces the best results. We break down the key findings below.
|
| 69 |
|
| 70 |
<HtmlEmbed
|
| 71 |
id="success-rates"
|
|
|
|
| 74 |
desc="Success rates (Total, Level 1, Level 2) across all experiments. Series 1 trains from pretrained base checkpoints on the full dataset; Series 2 fine-tunes Series 1 checkpoints on curated high-quality data."
|
| 75 |
/>
|
| 76 |
|
| 77 |
+
The gap between Series 1 and Series 2 is immediately visible. Experiment 2.5 reaches 90% total success rate (100% Level 1, 80% Level 2), while the best Series 1 result tops out at 40%. No Series 1 experiment achieves a single Level 2 success.
|
| 78 |
|
| 79 |
<HtmlEmbed
|
| 80 |
id="total-score"
|
|
|
|
| 105 |
|
| 106 |
### Where the policies fail
|
| 107 |
|
| 108 |
+
Before interpreting success rates, understanding *how* each experiment fails — not just whether it fails — is essential.
|
| 109 |
|
| 110 |
<HtmlEmbed
|
| 111 |
id="failure-analysis"
|
|
|
|
| 131 |
|
| 132 |
#### 1. Data quality matters most
|
| 133 |
|
| 134 |
+
This is the finding we're most confident in: it held regardless of which confidence level or correction method we used. The best Series 1 result (1.7) achieves 40% total success rate. The best Series 2 result (2.5) achieves 90% using the *same architecture*. The pairwise tests cleanly separate these two groups, and no amount of algorithmic tuning within Series 1 came close to closing the gap.
|
| 135 |
|
| 136 |
We hypothesise that the root cause is the difference in **multi-modality** between the high-quality and full datasets. The full dataset contains demonstrations with inconsistent strategies (different grips, unfolding sequences, and timing), while the high-quality dataset enforces a more unified, consistent protocol.
|
| 137 |
|
|
|
|
| 167 |
--policy.use_relative_actions=true
|
| 168 |
```
|
| 169 |
|
| 170 |
+
Comparing π0.5 without relative actions (1.2: 20% total SR, 40% Level 1) to π0.5 with relative actions and quantile normalization (1.3: 35% total SR, 70% Level 1), and then to the full combination in 1.7 (40% total SR, 80% Level 1), shows that training with relative actions consistently improves performance. The trend is clear and shows up in every comparison we made.
|
| 171 |
|
| 172 |
+
With only 20 rollouts, the exact gap between experiments is hard to pin down — but the improvement is consistent across every comparison. **Caveat:** π0.5 is likely pretrained with relative actions, so 1.3 and 1.7 fine-tune in a regime consistent with pretraining, while 1.2 fine-tunes against it.
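To see why 20 rollouts leave wide error bars, here is a small standard-library sketch that computes a 95% Wilson score interval for a success rate. This is an illustration we are adding, not the pairwise analysis used in this post, and the function name is ours.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Total success rates from the text: 1.2 at 20% (4/20) and 1.7 at 40% (8/20).
for name, successes in [("1.2", 4), ("1.7", 8)]:
    low, high = wilson_interval(successes, 20)
    print(f"{name}: {successes}/20 -> [{low:.2f}, {high:.2f}]")
```

The two intervals overlap, which is consistent with the caution above: the direction of the effect is visible, but its size is not well determined at this sample size.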
|
| 173 |
|
| 174 |
#### 3. RABC helps especially on long tasks like Level 2
|
| 175 |
|
| 176 |
+
RABC on high-quality data yields the two best results overall: experiments 2.2 and 2.5 clearly outperform every experiment that lacks it. The effect is strongest on **Level 2**, the longer and harder task — 2.2 reaches 50% Level 2 SR and 2.5 reaches 80%, while every experiment without RABC on clean data stays at 0%.
|
| 177 |
+
|
| 178 |
#### 4. Fine-tuning from a strong checkpoint is the winning recipe
|
| 179 |
|
| 180 |
The best results share the same recipe: fine-tune a Series 1 checkpoint on curated high-quality data with RABC and relative actions.
|
| 181 |
|
| 182 |
+
| Experiment | Total SR | Level 1 SR | Level 2 SR | Recipe |
|
| 183 |
|:---:|:---:|:---:|:---:|:---|
|
| 184 |
| 2.5 | **90%** | **100%** | **80%** | 1.7 → HQ + RABC, 100k steps |
|
| 185 |
| 2.2 | 75% | 100% | 50% | 1.3 → HQ + RABC, 100k steps |
|
| 186 |
| 1.7 | 40% | 80% | 0% | All data, Relative Actions + RABC + QUANTILES |
|
| 187 |
|
| 188 |
+
The jump from Series 1 to Series 2 is unambiguous in the statistical analysis — 2.5 and 2.2 clearly stand out from the Series 1 group. The Series 1 checkpoint already knows how to fold shirts in general, the high-quality data teaches the correct protocol, and RABC emphasizes the best demonstrations within an already clean dataset.
|
| 189 |
|
| 190 |
Both 2.2 and 2.5 were trained for 100k steps. 2.2 fine-tunes from 1.3 while 2.5 fine-tunes from 1.7 (the stronger base). The difference (75% → 90%) likely reflects this stronger starting point. They don't separate from each other in the pairwise tests, suggesting the recipe itself (HQ + RABC + Relative Actions) is the key ingredient, with the base checkpoint providing an additional boost.
|
| 191 |
|
| 192 |
#### 5. Level 2 requires everything to be right simultaneously
|
| 193 |
|
| 194 |
+
Every Series 1 experiment achieves exactly **0% Level 2 success**. Level 2 only becomes tractable in Series 2, and only with RABC on high-quality data (2.2: 50% L2, 2.5: 80% L2). The 0% → 50–80% jump is as clean a signal as you'll find in a 20-rollout experiment. Level 2 is genuinely harder. It requires the policy to have seen consistent, high-quality demonstrations of the full task, because without a reliable starting state after unfolding, the subsequent folds can't succeed.
|
| 195 |
|
| 196 |
#### 6. Speed and fold quality both track data quality
|
| 197 |
|
| 198 |
Series 1 completes Level 1 in **78–122s**; Series 2 does it in **41–73s**. Fold quality (1–5 scale) hits a ceiling around 2.8 in Series 1, breaking past 3.0 only with high-quality data.
|
| 199 |
|
| 200 |
+
| Experiment | Level 1 Time | Level 1 SR | Quality |
|
| 201 |
|:---:|:---:|:---:|:---:|
|
| 202 |
| 1.1 (π0, all data) | 121.5s | 80% | 2.70 |
|
| 203 |
| 1.7 (best S1) | 99.5s | 80% | 2.30 |
|
|
|
|
| 205 |
| 2.2 (HQ + RABC) | 43.2s | 100% | 3.30 |
|
| 206 |
| 2.5 (best overall) | **40.8s** | **100%** | **4.10** |
|
| 207 |
|
| 208 |
+
Policies trained on the full dataset learned hesitant motions; the high-quality dataset enforces deliberate, progress-oriented actions. Faster completion isn't a separate goal from quality; it's a consequence of a clear, unambiguous strategy.
|
| 209 |
|
| 210 |
#### 8. What did not work
|
| 211 |
|
| 212 |
+
**Mirroring augmentation** (2.3): only 5% total SR. Mirroring doubles multimodality, making convergence much harder even at 100k steps.
|
app/src/content/chapters/folding/09-learnings.mdx
CHANGED
|
@@ -3,25 +3,25 @@ import Sidenote from "../../../components/Sidenote.astro";
|
|
| 3 |
|
| 4 |
## Learnings
|
| 5 |
|
| 6 |
-
Running all these experiments taught us a lot
|
| 7 |
|
| 8 |
### What mattered most
|
| 9 |
|
| 10 |
Beyond the experiment findings above, several practical insights stood out:
|
| 11 |
|
| 12 |
- **Train a reward model.** [SARM](https://huggingface.co/docs/lerobot/sarm) gave us data scoring, advantage conditioning, and curation in one package. We recommend it even for tasks where you think manual filtering would suffice.
|
| 13 |
-
- **Invest in recording quality early.** More time upfront on clean
|
| 14 |
-
- **Record at higher frequency.** We'd record at 50 fps if we
|
| 15 |
-
- **DAgger is promising.**
|
| 16 |
|
| 17 |
### For the community: the order of operations
|
| 18 |
|
| 19 |
-
If you're training a policy for a new manipulation task with LeRobot, here's the sequence we'd recommend
|
| 20 |
|
| 21 |
-
1. **Define your task protocol first.** Before collecting a single episode,
|
| 22 |
2. **Collect 50–100 clean demonstrations.** Quality over volume. Consistent technique, good camera angles, deliberate motions. This is your foundation, everything else builds on it.
|
| 23 |
-
3. **Train a reward model.** Use [SARM](https://huggingface.co/docs/lerobot/sarm) to score your episodes and enable RABC during training. This
|
| 24 |
-
4. **Train a baseline and watch it fail.** Film the rollouts. Understanding *how* and *where* it breaks tells you exactly what data to collect next.
|
| 25 |
5. **Use DAgger for targeted improvement.** Once you have a model that mostly works, collect correction data for its specific failure modes. LeRobot's [HIL scripts](https://github.com/huggingface/lerobot/tree/main/examples/hil) handle the full loop, the operator watches the policy run, pauses on failure, teleoperates a recovery, and hands control back:
|
| 26 |
|
| 27 |
```bash
|
|
@@ -34,7 +34,7 @@ python examples/hil/hil_data_collection.py \
|
|
| 34 |
--rtc.execution_horizon=20
|
| 35 |
```
|
| 36 |
|
| 37 |
-
6. **Enable action interpolation and RTC.** This smooths transitions and speeds up execution. Action interpolation upsamples the policy's 30 Hz output to your robot's control frequency (e.g. 90 Hz), and RTC overlaps inference with execution. Both are flags on `lerobot-eval`:
|
| 38 |
|
| 39 |
```bash
|
| 40 |
lerobot-eval \
|
|
@@ -49,7 +49,7 @@ lerobot-eval \
|
|
| 49 |
7. **Film every evaluation.** Metrics alone won't tell the full story. Video reveals subtle failure modes that success rate misses, and lets you score quality.
|
| 50 |
|
| 51 |
<Note variant="info">
|
| 52 |
-
All the innovations from this project [SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm), and OpenArm Mini are merged into [LeRobot
|
| 53 |
</Note>
|
| 54 |
|
| 55 |
### What's next
|
|
@@ -58,8 +58,8 @@ This project is far from done. We're releasing the final model, full dataset, an
|
|
| 58 |
|
| 59 |
- **Massive-scale training.** We want LeRobot and LeRobotDataset to support 10–100x the data we used here, with billions of frames, powered by the new [HF Buckets](https://huggingface.co/docs/hub/en/storage-buckets) for storage and streaming at scale.
|
| 60 |
- **More robots, teleoperators, VLAs, and reward models.** We're continuing to expand the ecosystem of supported hardware, teleoperation setups, and model architectures in LeRobot.
|
| 61 |
-
- **
|
| 62 |
-
- **Better data interpretability tools.** Our biggest lever was data quality, but finding *which* demonstrations help and which hurt is still a hard problem. The field needs tools that provide understanding of what makes a trajectory useful before you train on it. If you're
|
| 63 |
- **Democratize robot learning.** Continue to lower the barrier to entry and share every insight, tool, and method with the community.
|
| 64 |
|
| 65 |
-
We also encourage you to use our dataset directly. Train your own policies, try new architectures, experiment with different training recipes. If you find something promising,
|
|
|
|
| 3 |
|
| 4 |
## Learnings
|
| 5 |
|
| 6 |
+
Running all these experiments taught us a lot. Some expected, some not. Here's what stuck.
|
| 7 |
|
| 8 |
### What mattered most
|
| 9 |
|
| 10 |
Beyond the experiment findings above, several practical insights stood out:
|
| 11 |
|
| 12 |
- **Train a reward model.** [SARM](https://huggingface.co/docs/lerobot/sarm) gave us data scoring, advantage conditioning, and curation in one package. We recommend it even for tasks where you think manual filtering would suffice.
|
| 13 |
+
- **Invest in recording quality early.** More time upfront on clean and consistent recordings pays off more than extra volume.
|
| 14 |
+
- **Record at higher frequency.** We'd record at 50 fps if we had to do it again. Folding is dynamic, and higher recording rates capture transitions better.
|
| 15 |
+
- **DAgger is promising.** Corrections targeting the model's actual failure modes should be highly effective at pushing success rates higher. This infrastructure is ready and now also merged into LeRobot.
|
| 16 |
|
| 17 |
### For the community: the order of operations
|
| 18 |
|
| 19 |
+
If you're training a policy for a new manipulation task with LeRobot, **here's the sequence we'd recommend**:
|
| 20 |
|
| 21 |
+
1. **Define your task protocol first.** Before collecting a single episode, define exactly how the task should be performed.
|
| 22 |
2. **Collect 50–100 clean demonstrations.** Quality over volume. Consistent technique, good camera angles, deliberate motions. This is your foundation; everything else builds on it.
|
| 23 |
+
3. **Train a reward model.** Use [SARM](https://huggingface.co/docs/lerobot/sarm) to score your episodes and enable RABC during training. This allows the policy to focus on the best demonstrations, which is crucial for longer tasks.
|
| 24 |
+
4. **Train a baseline and watch it fail.** Film the rollouts. Understanding *how* and *where* it breaks tells you exactly what kind of data to collect next.
|
| 25 |
5. **Use DAgger for targeted improvement.** Once you have a model that mostly works, collect correction data for its specific failure modes. LeRobot's [HIL scripts](https://github.com/huggingface/lerobot/tree/main/examples/hil) handle the full loop. The operator watches the policy run, pauses on failure, teleoperates a recovery, and hands control back:
|
| 26 |
|
| 27 |
```bash
|
|
|
|
| 34 |
--rtc.execution_horizon=20
|
| 35 |
```
|
| 36 |
|
| 37 |
+
6. **Enable action interpolation and RTC.** This smooths transitions and speeds up execution. Action interpolation upsamples the policy's 30 Hz output to your robot's control frequency (e.g. 90 Hz), and RTC overlaps inference with execution. Both features are flags on `lerobot-eval`:
|
| 38 |
|
| 39 |
```bash
|
| 40 |
lerobot-eval \
|
|
|
|
| 49 |
7. **Film every evaluation.** Metrics alone won't tell the full story. Video reveals subtle failure modes that success rate misses, and lets you score quality.
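Step 6 mentions upsampling the policy's 30 Hz output to a 90 Hz control loop. Here is a minimal sketch of that idea, assuming plain linear interpolation between consecutive actions; LeRobot's actual interpolation may differ, and `upsample_actions` is a hypothetical helper, not a LeRobot function.

```python
from typing import List, Sequence


def upsample_actions(actions: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Linearly interpolate between consecutive actions.

    `factor` is control_hz / policy_hz (3 for 30 Hz -> 90 Hz). The result has
    (len(actions) - 1) * factor + 1 entries and ends on the last original action.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not actions:
        return []
    out: List[List[float]] = []
    for cur, nxt in zip(actions, actions[1:]):
        for k in range(factor):
            t = k / factor  # fraction of the way from cur to nxt
            out.append([c + t * (n - c) for c, n in zip(cur, nxt)])
    out.append(list(actions[-1]))
    return out


# Two joint targets per action, upsampled from 30 Hz to 90 Hz.
smooth = upsample_actions([[0.0, 3.0], [3.0, 0.0], [3.0, 3.0]], factor=3)
```

Each original action is preserved and two intermediate targets are inserted between neighbors, so the controller receives smaller steps at the higher rate.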
|
| 50 |
|
| 51 |
<Note variant="info">
|
| 52 |
+
All the innovations from this project ([SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm), and OpenArm Mini) are merged into the [LeRobot repository](https://github.com/huggingface/lerobot). You can use our full pipeline as a starting point and swap in your own task.
|
| 53 |
</Note>
|
| 54 |
|
| 55 |
### What's next
|
|
|
|
| 58 |
|
| 59 |
- **Massive-scale training.** We want LeRobot and LeRobotDataset to support 10–100x the data we used here, with billions of frames, powered by the new [HF Buckets](https://huggingface.co/docs/hub/en/storage-buckets) for storage and streaming at scale.
|
| 60 |
- **More robots, teleoperators, VLAs, and reward models.** We're continuing to expand the ecosystem of supported hardware, teleoperation setups, and model architectures in LeRobot.
|
| 61 |
+
- **Reinforcement Learning support.** Extending LeRobot with new reinforcement learning methods and all the infrastructure needed to train policies online, not just from offline demonstrations.
|
| 62 |
+
- **Better data interpretability tools.** Our biggest lever was data quality, but finding *which* demonstrations help and which hurt is still a hard problem. The field needs tools that provide understanding of what makes a trajectory useful before you train on it. If you're exploring data curation or interpretability for robotics, we'd love to hear from you — [let's talk](https://huggingface.co/lerobot).
|
| 63 |
- **Democratize robot learning.** Continue to lower the barrier to entry and share every insight, tool, and method with the community.
|
| 64 |
|
| 65 |
+
We also encourage you to use our dataset directly. Train your own policies, try new architectures, experiment with different training recipes. If you find something promising, we'd be happy to run your models on our physical setups and share the results back!
|
app/src/styles/_base.css
CHANGED
|
@@ -104,9 +104,27 @@ html {
|
|
| 104 |
margin-bottom: 0;
|
| 105 |
}
|
| 106 |
|
| 107 |
.content-grid main blockquote {
|
| 108 |
-
border-left:
|
| 109 |
-
padding-left:
|
|
|
|
|
|
|
| 110 |
font-style: italic;
|
| 111 |
color: var(--muted-color);
|
| 112 |
margin: var(--spacing-4) 0;
|
|
@@ -182,4 +200,11 @@ html {
|
|
| 182 |
background: none;
|
| 183 |
border: none;
|
| 184 |
opacity: 0.4;
|
| 185 |
}
|
|
|
|
| 104 |
margin-bottom: 0;
|
| 105 |
}
|
| 106 |
|
| 107 |
+
.links-centered {
|
| 108 |
+
display: flex !important;
|
| 109 |
+
flex-wrap: wrap;
|
| 110 |
+
justify-content: center;
|
| 111 |
+
}
|
| 112 |
+
|
| 113 |
+
.links-centered > * {
|
| 114 |
+
flex: 0 0 calc(25% - 0.375rem);
|
| 115 |
+
}
|
| 116 |
+
|
| 117 |
+
@media (max-width: 768px) {
|
| 118 |
+
.links-centered > * {
|
| 119 |
+
flex: 0 0 calc(50% - 0.25rem);
|
| 120 |
+
}
|
| 121 |
+
}
|
| 122 |
+
|
| 123 |
.content-grid main blockquote {
|
| 124 |
+
border-left: none;
|
| 125 |
+
padding-left: 0;
|
| 126 |
+
font-size: clamp(16px, 1.8vw, 19px);
|
| 127 |
+
line-height: 1.6;
|
| 128 |
font-style: italic;
|
| 129 |
color: var(--muted-color);
|
| 130 |
margin: var(--spacing-4) 0;
|
|
|
|
| 200 |
background: none;
|
| 201 |
border: none;
|
| 202 |
opacity: 0.4;
|
| 203 |
+
}
|
| 204 |
+
|
| 205 |
+
.lead-paragraph {
|
| 206 |
+
font-size: clamp(18px, 2.2vw, 22px);
|
| 207 |
+
line-height: 1.55;
|
| 208 |
+
letter-spacing: -0.01em;
|
| 209 |
+
margin-bottom: var(--spacing-5);
|
| 210 |
}
|