chore(typos): fixing typos, syntax and grammar issues

#10
by CarolinePascal HF Staff - opened
app/src/content/chapters/folding/01-hero.mdx CHANGED
@@ -3,7 +3,7 @@ import Note from "../../../components/Note.astro";
3
  import Wide from "../../../components/Wide.astro";
4
  import Stack from "../../../components/Stack.astro";
5
 
6
- We trained an open-source bimanual robot to fold t-shirts autonomously, reaching 90% success rate. The biggest lever was data quality, not the model, not the architecture.
7
 
8
  <Sidenote>
9
  Read time: ~30 minutes. Each section stands on its own — feel free to skip to what interests you most.
@@ -11,23 +11,24 @@ We trained an open-source bimanual robot to fold t-shirts autonomously, reaching
11
 
12
  This post walks through the complete journey: hardware choices, data collection, training recipes, and different experiments that show what actually matters. We cover the mistakes and dead ends alongside the things that worked, because the messy middle is where most of the learning happens.
13
 
14
- Some of what we found: cheap 3D-printed leader arms outperformed the expensive ones for teleoperation. Early data collection was more wasteful than expected. A trained reward model turned out to be essential for separating useful demonstrations from harmful ones. And curating a small, high-quality dataset did more than algorithmic improvement on the full dataset.
15
 
16
  By sharing this we hope to contribute to our bigger vision: **democratize robotics and robot learning**. By open-sourcing every piece (tools, data, models, and knowledge), we want to enable a community that pushes this technology further. We've tried to avoid just listing what we did in favor of telling the story of how we got here. We hope being this open will help close the gap between closed-lab demos and what the open-source community can achieve.
17
 
18
- Everything we built for this project [SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm), and OpenArm Mini is now merged into [LeRobot](https://github.com/huggingface/lerobot) and ready for the community to use.
19
 
20
- Let's start with the results, does it actually work?
21
 
22
  #### Links
23
 
24
- <Stack layout="4-column" gap="small">
25
- <a href="https://huggingface.co/lerobot-data-collection/folding_final" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">**Model** HF Hub</a>
26
- <a href="https://huggingface.co/lerobot-data-collection/folding_sarm_reward" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">**SARM Reward** HF Hub</a>
27
- <a href="https://huggingface.co/datasets/lerobot/high_quality_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">**HQ Dataset** HF Hub</a>
28
- <a href="https://huggingface.co/datasets/lerobot/full_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">**Full Dataset** HF Hub</a>
29
- <a href="https://github.com/huggingface/lerobot" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">**Code** LeRobot</a>
30
- <a href="https://huggingface.co/docs/lerobot/openarm" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;">**OpenArm Mini** Repo</a>
 
31
  </Stack>
32
 
33
  <Sidenote>
 
3
  import Wide from "../../../components/Wide.astro";
4
  import Stack from "../../../components/Stack.astro";
5
 
6
+ > We trained an open-source bimanual robot to fold t-shirts autonomously, reaching a 90% success rate. The biggest lever was data quality, not the model, not the architecture.
7
 
8
  <Sidenote>
9
  Read time: ~30 minutes. Each section stands on its own — feel free to skip to what interests you most.
 
11
 
12
  This post walks through the complete journey: hardware choices, data collection, training recipes, and different experiments that show what actually matters. We cover the mistakes and dead ends alongside the things that worked, because the messy middle is where most of the learning happens.
13
 
14
+ Some of what we found: cheap 3D-printed leader arms outperformed the expensive ones for teleoperation. Early data collection was more wasteful than expected. A trained reward model turned out to be essential for separating useful demonstrations from harmful ones. And curating a small, high-quality dataset did more than any algorithmic improvement on the full dataset.
15
 
16
  By sharing this we hope to contribute to our bigger vision: **democratize robotics and robot learning**. By open-sourcing every piece (tools, data, models, and knowledge), we want to enable a community that pushes this technology further. We've tried to avoid just listing what we did in favor of telling the story of how we got here. We hope being this open will help close the gap between closed-lab demos and what the open-source community can achieve.
17
 
18
+ Everything we built for this project ([SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm), and [OpenArm Mini](http://github.com/pkooij/open-arms-mini)) is now merged into [LeRobot](https://github.com/huggingface/lerobot) and ready for the community to use.
19
 
20
+ _Let's start with the results: does it actually work?_
21
 
22
  #### Links
23
 
24
+ <Stack layout="4-column" gap="small" class="links-centered">
25
+ <a href="https://huggingface.co/lerobot-data-collection/folding_final" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>Model</strong><br/>HF Hub</a>
26
+ <a href="https://huggingface.co/lerobot-data-collection/folding_sarm_reward" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>SARM Reward</strong><br/>HF Hub</a>
27
+ <a href="https://huggingface.co/datasets/lerobot/high_quality_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>HQ Dataset</strong><br/>HF Hub</a>
28
+ <a href="https://huggingface.co/datasets/lerobot/full_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>Full Dataset</strong><br/>HF Hub</a>
29
+ <a href="http://github.com/pkooij/open-arms-mini" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>OpenArm Mini</strong><br/>Repo</a>
30
+ <a href="https://github.com/huggingface/lerobot" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>LeRobot</strong><br/>Code</a>
31
+ <a href="https://huggingface.co/docs/lerobot/index" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>LeRobot</strong><br/>Documentation</a>
32
  </Stack>
33
 
34
  <Sidenote>
app/src/content/chapters/folding/02-results.mdx CHANGED
@@ -7,16 +7,14 @@ import Video from "../../../components/Video.astro";
7
 
8
  ## Results
9
 
10
- Below are two **uncut, full-length** runs from our best model. No human intervention.
11
-
12
- **Level 1: Fold a laid-out t-shirt** (15 min continuous folding)
13
 
 
14
  <Wide>
15
  <Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level1.mp4" />
16
  </Wide>
17
 
18
- **Level 2: Untangle, spread, fold, and place aside** (5 shirts back-to-back)
19
-
20
  <Wide>
21
  <Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level2.mp4" />
22
  </Wide>
@@ -25,14 +23,14 @@ Below are two **uncut, full-length** runs from our best model. No human interven
25
 
26
  How well does it actually work? We evaluated our best model (Experiment 2.5) across 20 rollouts.
27
 
28
- | Task | Success Rate | Avg. Completion Time |
29
  |:---|:---:|:---:|
30
  | **Level 1** Laid-out to Fold | **100%** | **40.8 s** |
31
  | **Level 2** Messy to Spread to Fold to Place aside | **80%** | **95.9 s** |
32
  | **Combined** (Total SR) | **90%** | |
33
 
34
  <Sidenote>
35
- All evaluations filmed and scored from video. 20 rollouts per experiment (10 per level). Full methodology in the Evaluation section.
36
  </Sidenote>
37
 
38
- These numbers are the result of 11 experiments, each testing a different combination of model, data, and training strategies. The full breakdown is in the [Experiments](#experiments) section. Let's start from the beginning: the hardware.
 
7
 
8
  ## Results
9
 
10
+ No cherry-picked clips. Here are two **uncut, full-length** runs from our best model, with no human intervention.
 
 
11
 
12
+ ### Level 1: Fold a laid-out t-shirt <span style="font-weight: 400; font-size: 0.8em; opacity: 0.6;">(15 min continuous folding)</span>
13
  <Wide>
14
  <Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level1.mp4" />
15
  </Wide>
16
 
17
+ ### Level 2: Untangle, spread, fold, and place aside <span style="font-weight: 400; font-size: 0.8em; opacity: 0.6;">(5 shirts back-to-back)</span>
 
18
  <Wide>
19
  <Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level2.mp4" />
20
  </Wide>
 
23
 
24
  How well does it actually work? We evaluated our best model (Experiment 2.5) across 20 rollouts.
25
 
26
+ | Task | Success Rate (SR) | Avg. Completion Time |
27
  |:---|:---:|:---:|
28
  | **Level 1** Laid-out to Fold | **100%** | **40.8 s** |
29
  | **Level 2** Messy to Spread to Fold to Place aside | **80%** | **95.9 s** |
30
  | **Combined** (Total SR) | **90%** | |
31
 
32
  <Sidenote>
33
+ All evaluations were filmed and scored from video. 20 rollouts per experiment (10 per level). The full methodology is in the [Evaluation](#evaluation) section.
34
  </Sidenote>
35
 
36
+ These numbers are the result of 11 experiments, each testing a different combination of model, data, and training strategies. The full breakdown is in the [Experiments](#experiments) section. But let's start from the beginning: the hardware.
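
As a quick check on the results table, the combined success rate can be reproduced from the per-level results, assuming the 10-rollouts-per-level split stated in the sidenote. The sketch below is illustrative only: the function and variable names are hypothetical and not part of LeRobot, and the success counts are derived from the reported rates rather than taken from raw logs.

```python
def combined_success_rate(successes_per_level, rollouts_per_level):
    """Pool successes over all levels and return the overall success rate."""
    total_successes = sum(successes_per_level.values())
    total_rollouts = sum(rollouts_per_level.values())
    return total_successes / total_rollouts


# Assumed counts, derived from the reported rates: 100% and 80% of 10 rollouts each.
rollouts = {"level_1": 10, "level_2": 10}
successes = {"level_1": 10, "level_2": 8}

overall = combined_success_rate(successes, rollouts)
print(f"Combined SR: {overall:.0%}")
```

Pooling rollouts (18 of 20) gives the same figure as averaging the two rates here only because both levels have the same number of rollouts.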
app/src/content/chapters/folding/03-hardware.mdx CHANGED
@@ -10,14 +10,14 @@ import openArmMini2 from "../../assets/image/openarm-mini2.jpg";
10
 
11
  ## Hardware
12
 
13
- LeRobot takes care of the entire robot learning stack: in this section we're gonna walk you through every piece of hardware we used.
14
 
15
  ### The Robot: Bimanual OpenArm
16
 
17
- For starters, we need a robot (duh). We use the **bimanual [OpenArm](https://huggingface.co/docs/lerobot/openarm)**, open-source, human-like robot arms designed by [Enactic](https://openarm.dev) and manufactured by vendors such as [WowRobo](https://shop.wowrobo.com). Three reasons drove this choice:
18
 
19
  1. **The humanoid trend.** We're seeing a wave of human-like robots. More human-form robots means more human-form data in the ecosystem. Building on this form factor positions our work for a future where human-like manipulation data is transferable.
20
- 2. **Smaller teleop gap.** When the robot's kinematics match a human arm, the teleoperator's motions transfer more naturally less mental remapping, faster learning.
21
  3. **Open source, good specs.** Solid payload, good reach, and fully open hardware. We extended the upper arm by **+5 cm** to increase reach since our setup doesn't have a hip or torso to provide additional workspace.
22
 
23
  Everything is mounted on **aluminum extrusion profiles**, which let us quickly iterate on the physical arrangement and adjust both teleop and robot height between sessions to increase data diversity.
@@ -26,24 +26,22 @@ Everything is mounted on **aluminum extrusion profiles**, which let us quickly i
26
 
27
  ### Custom Grippers
28
 
29
- We designed **custom grippers with a larger surface area**, giving the robot a broader contact patch to grip, pinch, and slide fabric reliably.
30
 
31
  ### Teleop Arms: OpenArm Mini
32
 
33
- Next, we need a way to actually control the robot.
34
 
35
- We started with full-size OpenArm as leader arms for teleoperation. They seemed like the natural choice: same kinematics as the follower arms, one-to-one mapping.
 
 
 
 
36
 
37
- However, we quickly realized we needed something with less inertia so operators could move faster and with more precision and something that works regardless of arm length, since our operators varied significantly in height. This led us to develop the **OpenArm Mini**: small, Feetech-based, 3D-printed leader arms based on the [SO-101](https://github.com/TheRobotStudio/SO-ARM100) design. These gave us:
38
- - **Less inertia** operators could make quicker and more deliberate motions that cloth folding demands
39
- - **Arm-length agnostic** works for teleoperators of any size
40
- - **Incredibly cheap** ~120 EUR per arm, making it very cheap to set up multiple stations
41
- - **Still support DAgger** lightweight, but strong enough to move during human-in-the-loop correction data collection
42
-
43
- One detail turned out to be critical: the **wrist strap**. Without it, wrist rotations were imprecise. With the strap, operators get locked-in wrist control, which is essential for cloth manipulation.
44
 
45
  <Note variant="info" emoji="🔗">
46
- OpenArm Mini repo (3D print files, BOM, LeRobot integration): <a href="https://github.com/pkooij/open-arms-mini" target="_blank">github.com/pkooij/open-arms-mini</a>
47
  </Note>
48
 
49
  <div style="display: flex; gap: 8px; max-width: 70%; margin: 0 auto;">
@@ -54,11 +52,13 @@ One detail turned out to be critical: the **wrist strap**. Without it, wrist rot
54
  </div>
55
  </div>
56
 
57
- A small thing that makes a surprisingly big difference: when both your hands are on the leader arms, you need a hands-free way to **start and stop episodes**. USB foot pedals solved this elegantly.
 
 
58
 
59
  ### Cameras
60
 
61
- The robot needs to see what it's doing: for this purpose we use **three cameras** each serving a distinct purpose:
62
 
63
  | Camera | Position | Notes |
64
  |:---|:---|:---|
@@ -77,7 +77,6 @@ The robot needs to see what it's doing: for this purpose we use **three cameras*
77
 
78
  ### LeRobot Integration
79
 
80
- Integrating OpenArm into LeRobot required adding **CAN-bus protocol** support for the arm's motors, which can be found in the [LeRobot repository](https://github.com/huggingface/lerobot). We also created a UI for the non-technical robot operators, so they don't have to use the CLI to start and stop episodes.
81
-
82
 
83
  With the hardware in place, the next step was the hardest and most time-consuming part of the entire project: collecting good data. And "good" is much harder to define than it sounds.
 
10
 
11
  ## Hardware
12
 
13
+ LeRobot takes care of the entire robot learning stack, but you still need the physical hardware. Here's an overview of every piece we used.
14
 
15
  ### The Robot: Bimanual OpenArm
16
 
17
+ For starters, the robot. We used the **bimanual [OpenArm](https://huggingface.co/docs/lerobot/openarm)**: open-source, human-like robot arms developed by [Enactic](https://openarm.dev) and built by [WowRobo](https://shop.wowrobo.com). Three reasons drove this choice:
18
 
19
  1. **The humanoid trend.** We're seeing a wave of human-like robots. More human-form robots means more human-form data in the ecosystem. Building on this form factor positions our work for a future where human-like manipulation data is transferable.
20
+ 2. **Smaller teleop gap.** When the robot's kinematics match a human arm, the teleoperator's motions transfer more naturally, meaning less mental remapping and faster learning.
21
  3. **Open source, good specs.** Solid payload, good reach, and fully open hardware. We extended the upper arm by **+5 cm** to increase reach since our setup doesn't have a hip or torso to provide additional workspace.
22
 
23
  Everything is mounted on **aluminum extrusion profiles**, which let us quickly iterate on the physical arrangement and adjust both teleop and robot height between sessions to increase data diversity.
 
26
 
27
  ### Custom Grippers
28
 
29
+ We designed **custom grippers with a larger surface area**, giving the robot a broader contact patch to grip, pinch, and slide fabric reliably. We also added a small polymer patch on one side of the gripper to reduce slippage and make the grasping of fabric easier.
30
 
31
  ### Teleop Arms: OpenArm Mini
32
 
33
+ Next, we need a way to control the robot. We started with full-size OpenArm as leader arms for teleoperation. They seemed like the natural choice: same kinematics as the follower arms, one-to-one mapping.
34
 
35
+ However, we quickly realized we needed a teleoperator with less inertia, to allow for fast and precise manipulation, and more adaptability to different human morphologies. This led us to develop the **OpenArm Mini**: small, Feetech-based, 3D-printed leader arms based on the [SO-101](https://github.com/TheRobotStudio/SO-ARM100) design. These gave us:
36
+ 1. **Less inertia**: enables the quicker, more deliberate motions that cloth folding demands
37
+ 2. **Arm-length agnostic**: adapts to operators of any size
38
+ 3. **Incredibly cheap**: ~120 EUR per arm, making it easy to scale to multiple stations
39
+ 4. **Still supports DAgger**: lightweight, but strong enough to move during human-in-the-loop corrective data collection
40
 
41
+ One small detail mattered more than expected: the **wrist strap**. It locks the wrist to the leader arm, providing the precise rotational control essential for cloth manipulation.
 
 
 
 
 
 
42
 
43
  <Note variant="info" emoji="🔗">
44
+ [OpenArm Mini repo (3D print files, BOM, LeRobot integration)](https://github.com/pkooij/open-arms-mini)
45
  </Note>
46
 
47
  <div style="display: flex; gap: 8px; max-width: 70%; margin: 0 auto;">
 
52
  </div>
53
  </div>
54
 
55
+ <br/>
56
+
57
+ Another feature that made a surprisingly big difference: when both your hands are on the leader arms, you need a hands-free way to **start and stop episode recording**. USB foot pedals solved this elegantly.
58
 
59
  ### Cameras
60
 
61
+ Finally, the robot needs to see what it is doing. We used **three cameras**, each serving a distinct purpose:
62
 
63
  | Camera | Position | Notes |
64
  |:---|:---|:---|
 
77
 
78
  ### LeRobot Integration
79
 
80
+ Integrating OpenArm into LeRobot required adding **CAN-bus protocol** support for the arm's motors, which is now available in the [LeRobot repository](https://github.com/huggingface/lerobot). We also created a UI for non-technical robot operators, so they can start and stop episodes without using the CLI.
 
81
 
82
  With the hardware in place, the next step was the hardest and most time-consuming part of the entire project: collecting good data. And "good" is much harder to define than it sounds.
app/src/content/chapters/folding/04-data-collection.mdx CHANGED
@@ -10,11 +10,11 @@ We ran **8 setups** in parallel, optimizing for **maximum diversity**: 25+ diffe
10
 
11
  ### Learning to Teleoperate
12
 
13
- Here's an honest truth: **early data is worse than the final data**. Teleoperating a bimanual robot is a genuine skill, and it takes practice. The first episodes are slow, not deliberate, and full of failed attempts. Over hours of practice, operators get dramatically better smoother motions, faster execution, and more consistent grasps.
14
 
15
- This creates one of the most important practical decisions of the project: **when do you start recording data for the final model?** Too early and you pollute the dataset with low-quality demonstrations that the model will faithfully reproduce, hesitations, fumbles, and all. Too late and you've wasted precious time.
16
 
17
- Another important part is aligning the strategy between operators. Since some parts of folding are very multi-modal (you can fold a t-shirt in many different ways), you need to make sure there is a common strategy. We held brief alignment sessions to standardize the fold sequence before each recording sprint, where we first experimented with different approaches, then shared our learnings and discussed to find the best or most efficient way.
18
 
19
  ### Tips for Good Data Collection
20
 
 
10
 
11
  ### Learning to Teleoperate
12
 
13
+ Here's an honest truth: **early data is worse than the final data**. Teleoperating a bimanual robot is a genuine skill, and it takes practice. The first episodes are slow, not deliberate, and full of failed attempts. Over hours of practice, operators get dramatically better: smoother motions, faster execution, and more consistent grasps.
14
 
15
+ This creates one of the most important practical decisions of the project: **when do you start recording data for the final model?** Too early and you pollute the dataset with low-quality demonstrations that the model will faithfully reproduce, including hesitations and fumbles. Too late and you've wasted precious time.
16
 
17
+ Aligning on a common strategy across operators was equally important. Folding is a very multi-modal task (there are many valid ways to fold a t-shirt) and the model learns better from a consistent strategy. Before each recording sprint, we held brief alignment sessions: experimenting with different techniques, sharing our learnings and then converging on the most efficient fold sequence.
18
 
19
  ### Tips for Good Data Collection
20
 
app/src/content/chapters/folding/05-data-diversity.mdx CHANGED
@@ -8,7 +8,7 @@ import diversityGridImg from "../../assets/image/lerobot-data-collection_level12
8
 
9
  Raw episodes are only the beginning. What you do with them before training determines whether your model learns to fold or learns to fumble.
10
 
11
- We collected two datasets: a larger dataset containing all episodes, and a curated high-quality dataset which is partly a subset of the larger one, with additional high-quality episodes.
12
 
13
  ### Dataset Statistics
14
 
@@ -35,10 +35,10 @@ The grid below shows one frame from each of 100 different episodes. Notice the v
35
 
36
  We filtered episodes in two ways:
37
 
38
- 1. **End-state image filtering** discard episodes where the final frame doesn't show a properly folded shirt. If the end result isn't good, the demonstration isn't useful.
39
  2. **Length-based filtering** using the LeRobot data visualizer to remove outliers. Episodes that are suspiciously short tend to be low quality.
40
 
41
- The [LeRobot Data Visualizer](https://huggingface.co/spaces/lerobot/visualize_dataset) was invaluable for inspecting the dataset, spotting outliers, and understanding distributions. If you're collecting robot data, use it you can try it right here with our dataset:
42
 
43
  <Wide>
44
  <div className="card" style="overflow: hidden; border-radius: 10px;">
@@ -46,6 +46,6 @@ The [LeRobot Data Visualizer](https://huggingface.co/spaces/lerobot/visualize_da
46
  </div>
47
  </Wide>
48
 
49
- #### SARM Annotation with RABC
50
 
51
- We also annotated every episode using our trained **[SARM](https://huggingface.co/docs/lerobot/sarm)** reward model. This gave us continuous scores we could weight during training. More details in [SARM: Our Reward Model](#sarm-our-reward-model).
 
8
 
9
  Raw episodes are only the beginning. What you do with them before training determines whether your model learns to fold or learns to fumble.
10
 
11
+ We collected two datasets: a **full dataset** containing every episode, and a **curated dataset** built by selecting the best episodes from the full set and supplementing them with additional high-quality recordings.
12
 
13
  ### Dataset Statistics
14
 
 
35
 
36
  We filtered episodes in two ways:
37
 
38
+ 1. **End-state image filtering** to discard episodes where the final frame doesn't show a properly folded shirt. If the end result isn't good, the demonstration isn't useful.
39
  2. **Length-based filtering** using the LeRobot data visualizer to remove outliers. Episodes that are suspiciously short tend to be low quality.
40
 
41
+ The [LeRobot Data Visualizer](https://huggingface.co/spaces/lerobot/visualize_dataset) was invaluable for inspecting the dataset, spotting outliers, and understanding distributions. Try it right here with our dataset:
42
 
43
  <Wide>
44
  <div className="card" style="overflow: hidden; border-radius: 10px;">
 
46
  </div>
47
  </Wide>
48
 
49
+ #### SARM Annotation
50
 
51
+ We also annotated every episode using our trained **[Stage-Aware Reward Modeling (SARM)](https://huggingface.co/docs/lerobot/sarm)** reward model. This gave us continuous scores we could use as weights at training time. More details in [SARM: Our Reward Model](#sarm-our-reward-model).
app/src/content/chapters/folding/06-training.mdx CHANGED
@@ -6,11 +6,11 @@ import HtmlEmbed from "../../../components/HtmlEmbed.astro";
6
 
7
  ## Training
8
 
9
- Before we can talk about hyperparameters, we need to understand what the model actually *is* what it takes in, what it produces, and why those choices matter for cloth folding.
10
 
11
  ### Model Architecture
12
 
13
- At its core, the model is a **Vision-Language-Action (VLA)** model. It sees the world through cameras, understands a task description, and outputs motor commands 30 timesteps of joint angle targets and gripper commands, generated via flow matching at 30 Hz.
14
 
15
  <Wide>
16
  <HtmlEmbed
@@ -22,7 +22,7 @@ At its core, the model is a **Vision-Language-Action (VLA)** model. It sees the
22
  />
23
  </Wide>
24
 
25
- The model generates actions through **flow matching** a generative approach that transforms random noise into coherent action sequences, conditioned on what the cameras see and what the joints are doing. This allows the model to represent **multi-modal action distributions**: when there are multiple valid ways to grasp a sleeve or start a fold, the model can capture that ambiguity rather than averaging to a meaningless middle ground.
26
 
27
  <Sidenote>
28
  Flow matching is closely related to diffusion models but uses a simpler, more direct interpolation path between noise and data.
@@ -30,7 +30,7 @@ The model generates actions through **flow matching** a generative approach that
30
 
31
  #### [Real-Time Chunking (RTC)](https://huggingface.co/docs/lerobot/rtc)
32
 
33
- A crucial detail for real-world deployment: the model predicts action chunks of 30 steps, but instead of waiting for one chunk to finish before generating the next, RTC generates the next chunk while executing the current one. It "freezes" actions that are guaranteed to execute and "inpaints" the rest, enabling smooth asynchronous execution, speeding up our rollouts by at least a factor of 2.
34
 
35
  ```mermaid
36
  sequenceDiagram
@@ -45,12 +45,12 @@ sequenceDiagram
45
 
46
  ### Models
47
 
48
- We initially trained multiple architectures supported in LeRobot, but we ended up training two VLA architectures on our cloth folding data:
49
 
50
  - **π0** the base flow-matching VLA, trained with standard imitation learning
51
- - **[π0.5](https://huggingface.co/docs/lerobot/pi05)** an improved variant with more pretraining and some additional improvements to the flow matching denoising process
52
 
53
- Both are finetuned from pretrained checkpoints. Starting from this pretrained foundation, rather than training from scratch gives the model a head start on visual understanding and basic manipulation concepts.
54
 
55
  ### Training Setup
56
 
@@ -64,7 +64,7 @@ Both are finetuned from pretrained checkpoints. Starting from this pretrained fo
64
  | Training steps | **200k** (Series 1) / **100k** (Series 2 fine-tune) |
65
 
66
  <Sidenote>
67
- Multi-GPU training with 8x H100 and gradient accumulation was necessary to fit the large batch sizes needed for stable VLA training.
68
  </Sidenote>
69
 
70
  ### Loss Curves
@@ -79,4 +79,4 @@ Both are finetuned from pretrained checkpoints. Starting from this pretrained fo
79
  />
80
  </Wide>
81
 
82
- Our training followed two phases: **Series 1** trained from pretrained base checkpoints on the full dataset for 200k steps, then **Series 2** fine-tuned the best Series 1 checkpoint on curated high-quality data for 100k steps.
 
6
 
7
  ## Training
8
 
9
+ Before talking about hyperparameters, one needs to understand what the trained model actually *is*: what it takes in, what it produces, and why those choices matter for cloth folding.
10
 
11
  ### Model Architecture
12
 
13
+ At its core, the model is a **Vision-Language-Action (VLA)** model. It takes in camera images and a task description, and outputs actions: joint angle targets and gripper commands for the next second, at a frequency of 30 Hz.
14
 
15
  <Wide>
16
  <HtmlEmbed
 
22
  />
23
  </Wide>
24
 
25
+ The model generates actions through **flow matching**, a generative approach that transforms random noise into coherent action sequences, conditioned on what the cameras see and what the motors are doing. This allows the model to represent **multi-modal action distributions**: when there are multiple valid ways to grasp a sleeve or start a fold, the model can capture that ambiguity rather than averaging to a meaningless middle ground.
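
To make the noise-to-actions idea concrete, here is a minimal sketch of the sampling side of flow matching, under strong simplifying assumptions. The learned, observation-conditioned network is replaced by a hypothetical `oracle_velocity` function that already knows the target chunk, so the example shows only the integration loop and not how the real policy captures multiple modes. All names and values are illustrative and not taken from LeRobot.

```python
import random


def oracle_velocity(x, t, target):
    """Toy velocity field for a straight noise-to-data path (stand-in for the learned network)."""
    # On the path x_t = (1 - t) * noise + t * target, the remaining displacement
    # divided by the remaining time points straight at the target.
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate the velocity field from t=0 (noise) to t=1 (actions) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 4-value "action chunk" used only for illustration.
target_chunk = [0.1, -0.3, 0.25, 0.8]
chunk = sample_action_chunk(target_chunk)
print(chunk)
```

In a real policy the velocity comes from a network conditioned on images, language, and robot state, and different noise samples can end up at different valid action sequences.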
26
 
27
  <Sidenote>
28
  Flow matching is closely related to diffusion models but uses a simpler, more direct interpolation path between noise and data.
 
30
 
31
  #### [Real-Time Chunking (RTC)](https://huggingface.co/docs/lerobot/rtc)
32
 
33
+ RTC was crucial for real-world deployment. Instead of waiting for the predicted action chunk to finish before generating the next, RTC generates the next chunk while executing the current one. It "freezes" actions that are already committed and "inpaints" the remaining ones, producing smooth asynchronous motion. In practice, this sped up our rollouts by at least a factor of 2.
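
The freeze-and-inpaint idea can be sketched as follows, assuming chunks of 30 actions (one second at 30 Hz) and an inference delay measured in control steps. This is a simplified illustration and not LeRobot's RTC implementation: real RTC inpaints inside the flow matching denoising process, whereas `toy_generate_tail` below is a hypothetical stand-in that simply continues from the last frozen action.

```python
HORIZON = 30  # actions per chunk (one second at 30 Hz)


def toy_generate_tail(frozen_prefix, num_actions):
    """Hypothetical stand-in for the policy: continue smoothly from the last frozen action."""
    last = frozen_prefix[-1]
    return [last + 0.01 * (i + 1) for i in range(num_actions)]


def next_chunk_with_frozen_prefix(current_chunk, start_step, inference_delay):
    """Build the next chunk while the current one is still executing."""
    # Actions that will run before inference finishes are already committed: freeze them.
    frozen = current_chunk[start_step:start_step + inference_delay]
    if not frozen:
        raise ValueError("no committed actions left in the current chunk")
    # The remaining actions are filled in ("inpainted") to stay consistent with the prefix.
    tail = toy_generate_tail(frozen, HORIZON - len(frozen))
    return frozen + tail


# Toy 1-D trajectory; inference starts at step 20 and takes 5 control steps.
current = [0.01 * i for i in range(HORIZON)]
nxt = next_chunk_with_frozen_prefix(current, start_step=20, inference_delay=5)
print(nxt[:7])
```

Because the first five actions of the new chunk match what the robot executes while inference is running, switching to the new chunk at step 25 introduces no jump in the commanded trajectory.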
34
 
35
  ```mermaid
36
  sequenceDiagram
 
45
 
46
  ### Models
47
 
48
+ We initially trained multiple architectures supported in LeRobot, but we ended up focusing on two VLA architectures for our cloth folding data:
49
 
50
  - **π0**: the base flow-matching VLA, trained with standard imitation learning
51
+ - **[π0.5](https://huggingface.co/docs/lerobot/pi05)**: an improved variant with more pretraining and several improvements to the flow matching denoising process
52
 
53
+ Both are finetuned from pretrained checkpoints. Starting from this pretrained foundation rather than training from scratch gives the model a head start on visual understanding and basic manipulation concepts.
54
 
55
  ### Training Setup
56
 
 
64
  | Training steps | **200k** (Series 1) / **100k** (Series 2 fine-tune) |
65
 
66
  <Sidenote>
67
+ Multi-GPU training with 8xH100 and gradient accumulation was necessary to fit the large batch sizes needed for stable VLA training.
68
  </Sidenote>
69
 
70
  ### Loss Curves
 
79
  />
80
  </Wide>
81
 
82
+ Our training followed two phases: **Series 1** trained from pretrained base checkpoints on the full dataset for 200k steps. **Series 2** fine-tuned the best Series 1 checkpoint on curated high-quality data for 100k steps.
app/src/content/chapters/folding/07-evaluation.mdx CHANGED
@@ -5,27 +5,27 @@ import HtmlEmbed from "../../../components/HtmlEmbed.astro";
5
 
6
  ## Evaluation
7
 
8
- **Evaluation is as hard as training.** In robotics on real hardware, no standardized benchmarks exist. If your evaluation protocol is inconsistent, every downstream decision will be wrong.
9
 
10
  ### Protocol
11
 
12
- For every experiment we evaluate on:
13
 
14
  - **5 different t-shirts for Level 1** (laid-out to fold)
15
  - **5 different t-shirts for Level 2** (messy to spread to fold, then place aside)
16
 
17
- Each t-shirt is attempted **twice consecutively**, giving **10 rollouts per level** and **20 rollouts total per experiment**. Every evaluation is filmed and scored from video afterward, so judgment is decoupled from execution.
18
 
19
  <Note>
20
- The eval protocol t-shirts, attempt count, scoring rubric, and filming setup is identical across every experiment.
21
  </Note>
22
 
23
  ### Metrics
24
 
25
  We report four complementary metrics:
26
 
27
- **1. Success Rate** Binary pass/fail per rollout.
28
- **2. Score** Partial credit based on subtasks completed. This distinguishes a model that consistently reaches Fold 3 from one that fails at Unfold, even if neither achieves full success.
29
 
30
  <Accordion title="Scoring rubric Level 1 and Level 2">
31
 
@@ -52,13 +52,12 @@ We report four complementary metrics:
52
  | Rotation + Place aside | +10 |
53
  | **Maximum per rollout** | **100** |
54
 
55
- Scores are summed across all rollouts in an experiment. With 10 L1 rollouts (max 50 points each) and 10 L2 rollouts (max 100 points each), the **maximum total score per experiment is 1,500 points**.
56
 
57
  </Accordion>
58
 
59
- **3. Fold quality** A 1–5 rating of the final fold appearance, averaged across successful rollouts.
60
-
61
- **4. Completion time** Seconds to complete Level 1/Level 2, averaged across successful rollouts.
62
 
63
  ### Statistical uncertainty
64
 
 
5
 
6
  ## Evaluation
7
 
8
+ **Evaluation is as hard as training.** In real-hardware robotics, no standardized benchmarks exist. If your evaluation protocol is inconsistent, every downstream decision will be wrong.
9
 
10
  ### Protocol
11
 
12
+ For every experiment, we evaluate the model on:
13
 
14
  - **5 different t-shirts for Level 1** (laid-out to fold)
15
  - **5 different t-shirts for Level 2** (messy to spread to fold, then place aside)
16
 
17
+ Each t-shirt fold is attempted **twice consecutively**, giving **10 rollouts per level** and **20 rollouts total per experiment**. Every evaluation is filmed and scored from video afterward, so judgment is decoupled from execution.
18
 
19
  <Note>
20
+ The evaluation protocol (t-shirts, attempt count, scoring rubric, and filming setup) is identical across every experiment.
21
  </Note>
22
 
23
  ### Metrics
24
 
25
  We report four complementary metrics:
26
 
27
+ 1. **Success Rate:** Binary pass/fail per rollout.
28
+ 2. **Score:** Partial credit based on subtasks completed. This distinguishes a model that consistently reaches Fold 3 from one that fails at Unfold, even if neither achieves full success.
29
 
30
  <Accordion title="Scoring rubric Level 1 and Level 2">
31
 
 
52
  | Rotation + Place aside | +10 |
53
  | **Maximum per rollout** | **100** |
54
 
55
+ Scores are summed across all the rollouts in an experiment. With 10 Level 1 rollouts (max 50 points each) and 10 Level 2 rollouts (max 100 points each), the **maximum total score per experiment is 1,500 points**.
56
 
57
  </Accordion>
58
 
59
+ 3. **Fold quality:** A 1–5 rating of the final fold appearance, averaged across successful rollouts.
60
+ 4. **Completion time:** Seconds to complete Level 1/Level 2, averaged across successful rollouts.
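
As a rough illustration of how the four metrics above could be aggregated from scored rollouts, here is a minimal Python sketch. The `Rollout` record, the `summarize` helper, and all field names are hypothetical and not part of LeRobot; only the per-level maxima (50 and 100 points) and the rollout counts come from the protocol and rubric.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# Maximum rubric score per rollout, by level (from the scoring rubric).
MAX_SCORE = {1: 50, 2: 100}


@dataclass
class Rollout:
    level: int                           # 1 or 2
    success: bool                        # binary pass/fail
    score: int                           # partial-credit points from the rubric
    fold_quality: Optional[int] = None   # 1-5 rating, set only for successes
    time_s: Optional[float] = None       # completion time, set only for successes


def summarize(rollouts):
    """Aggregate success rate, score, fold quality, and completion time."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    successes = [r for r in rollouts if r.success]
    return {
        "success_rate": len(successes) / len(rollouts),
        "total_score": sum(r.score for r in rollouts),
        "max_score": sum(MAX_SCORE[r.level] for r in rollouts),
        # Quality and time are averaged over successful rollouts only.
        "fold_quality": mean(r.fold_quality for r in successes) if successes else None,
        "completion_time_s": mean(r.time_s for r in successes) if successes else None,
    }


# Made-up example: a perfect experiment with 10 rollouts per level.
perfect = [Rollout(1, True, 50, 5, 40.0) for _ in range(10)]
perfect += [Rollout(2, True, 100, 5, 90.0) for _ in range(10)]
print(summarize(perfect)["max_score"])  # 10 * 50 + 10 * 100 = 1500
```

To report completion time per level, as the metric definition suggests, pass only the rollouts of one level to `summarize`.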
 
61
 
62
  ### Statistical uncertainty
63
 
app/src/content/chapters/folding/08-ablations.mdx CHANGED
@@ -10,7 +10,7 @@ import Stack from "../../../components/Stack.astro";
10
 
11
  ## Experiments
12
 
13
- We ran 11 experiments to understand what *actually* matters. **Series 1** trains from pretrained base checkpoints on the full dataset. **Series 2** finetunes Series 1 checkpoints on curated high-quality data (2.1–2.4 from 1.3, 2.5 from 1.7). One early lesson: **undertraining makes the policy shaky** make sure your model has converged before drawing conclusions.
14
 
15
  <Wide>
16
 
@@ -41,13 +41,15 @@ policy_cfg.rtc_config = RTCConfig(
41
  )
42
  ```
43
 
44
- With an action queue size of 30 and max action horizon of 20. RTC gave us a ~2x speedup (sometimes even 2.5x), and action interpolation made the robot much quieter and smoother. Both are now available on [LeRobot main](https://github.com/huggingface/lerobot).
 
 
45
 
46
  ### SARM: Our Reward Model
47
 
48
  Before diving into the experiments further, let's introduce a key ingredient: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). SARM is a trained reward model that scores trajectories based on how well the robot is progressing toward task completion, it acts as a learned "critic" that predicts whether things are going well or badly.
49
 
50
- SARM is trained on our demonstration data to predict 0-1 task progression. The key insight: it correctly identifies **mistakes** (drops in value) and **progress** (increases) in real time.
51
 
52
  <Wide>
53
  <Stack layout="3-column" gap="small">
@@ -57,13 +59,13 @@ SARM is trained on our demonstration data to predict 0-1 task progression. The k
57
  </Stack>
58
  </Wide>
59
 
60
- We use SARM exclusively for **RABC** (Reward-Advantage-Based Conditioning): it scores every episode with a per-timestep quality signal, and during training we weight actions by their contribution to progress. High-reward actions contribute more to the loss, low-reward ones contribute less. Negative progress are clipped to 0. Unlike binary success/fail labels, SARM provides continuous signal on every timestep.
61
 
62
  ---
63
 
64
  ### Results Overview
65
 
66
- Now let's look at how each experiment actually performed. The charts below show success rates, scores, completion times, and failure modes across all 11 experiments. The pattern is consistent: **Series 2 dominates Series 1**, and within each series, RABC combined with relative actions produces the best results. Explore the charts, then we break down the key findings below.
67
 
68
  <HtmlEmbed
69
  id="success-rates"
@@ -72,7 +74,7 @@ Now let's look at how each experiment actually performed. The charts below show
72
  desc="Success rates (Total, Level 1, Level 2) across all experiments. Series 1 trains from scratch on full data; Series 2 finetunes the best Series 1 checkpoint on curated high-quality data."
73
  />
74
 
75
- The gap between Series 1 and Series 2 is immediately visible. Experiment 2.5 reaches 90% total success rate (100% L1, 80% L2), while the best Series 1 result tops out at 40%. No Series 1 experiment achieves a single Level 2 success.
76
 
77
  <HtmlEmbed
78
  id="total-score"
@@ -103,7 +105,7 @@ The heatmap shows where time is spent. Series 1 experiments are slow across the
103
 
104
  ### Where the policies fail
105
 
106
- Before interpreting success rates, it helps to understand *how* each experiment fails not just whether it fails.
107
 
108
  <HtmlEmbed
109
  id="failure-analysis"
@@ -129,7 +131,7 @@ With 20 rollouts per experiment, not every visible gap is real. We run **Barnard
129
 
130
  #### 1. Data quality matters most
131
 
132
- This is the finding we're most confident in it held regardless of which confidence level or correction method we used. The best Series 1 result (1.7) achieves 40% total SR. The best Series 2 result (2.5) achieves 90% using the *same architecture*. The pairwise tests cleanly separate these two groups, and no amount of algorithmic tuning within Series 1 came close to closing the gap.
133
 
134
  We hypothesise that the root cause is the difference in **multi-modality** between the high-quality and full dataset. The full dataset contains demonstrations with some inconsistent strategies: different grips, unfolding sequences, and timing, while the high-quality dataset enforces a more unified, consistent protocol.
135
 
@@ -165,36 +167,37 @@ lerobot-train \
165
  --policy.use_relative_actions=true
166
  ```
167
 
168
- Comparing π0.5 without relative actions (1.2: 20% total SR, 40% L1) to π0.5 with relative actions and quantile normalization (1.3: 35% total SR, 70% L1), and then to the full combination in 1.7 (40% total SR, 80% L1), shows that training with relative actions consistently improves performance. The trend is clear and shows up in every comparison we made.
169
 
170
- The effect size doesn't separate cleanly at 20 rollouts, but the direction is consistent. **Caveat:** π0.5 is likely pretrained with relative actions, so 1.3 and 1.7 fine-tune in a regime consistent with pretraining, while 1.2 fine-tunes against it.
171
 
172
  #### 3. RABC helps especially on long tasks like level 2
173
 
174
- RABC on high-quality data produces the two best results overall: 2.2 and 2.5 clearly separate from experiments without it. The effect is strongest on **Level 2**, the longer and harder task — 2.2 reaches 50% L2 SR and 2.5 reaches 80%, while every experiment without RABC on clean data stays at 0%.
 
175
  #### 4. Fine-tuning from a strong checkpoint is the winning recipe
176
 
177
  The best results share the same recipe: fine-tune a Series 1 checkpoint on curated high-quality data with RABC and relative actions.
178
 
179
- | Experiment | Total SR | L1 SR | L2 SR | Recipe |
180
  |:---:|:---:|:---:|:---:|:---|
181
  | 2.5 | **90%** | **100%** | **80%** | 1.7 → HQ + RABC, 100k steps |
182
  | 2.2 | 75% | 100% | 50% | 1.3 → HQ + RABC, 100k steps |
183
  | 1.7 | 40% | 80% | 0% | All data, Relative Actions + RABC + QUANTILES |
184
 
185
- The jump from Series 1 to Series 2 is unambiguous in the statistical analysis — 2.5 and 2.2 clearly separate from the Series 1 group. The Series 1 checkpoint already knows how to fold shirts in general, the high-quality data teaches the correct protocol, and RABC emphasizes the best demonstrations within an already clean dataset.
186
 
187
  Both 2.2 and 2.5 were trained for 100k steps. 2.2 fine-tunes from 1.3 while 2.5 fine-tunes from 1.7 (the stronger base). The difference (75% → 90%) likely reflects this stronger starting point. They don't separate from each other in the pairwise tests, suggesting the recipe itself (HQ + RABC + Relative Actions) is the key ingredient, with the base checkpoint providing an additional boost.
188
 
189
  #### 5. Level 2 requires everything to be right simultaneously
190
 
191
- Every Series 1 experiment achieves exactly **0% Level 2 success**. Level 2 only becomes tractable in Series 2, and only with RABC on high-quality data (2.2: 50% L2, 2.5: 80% L2). The 0% → 50–80% jump is as clean a signal as you'll find in a 20-rollout experiment. Level 2 is genuinely harder it requires the policy to have seen consistent, high-quality demonstrations of the full task, because without a reliable starting state after unfolding, the subsequent folds can't succeed.
192
 
193
  #### 6. Speed and fold quality both track data quality
194
 
195
  Series 1 completes Level 1 in **78–122s**; Series 2 does it in **41–73s**. Fold quality (1–5 scale) hits a ceiling around 2.8 in Series 1, breaking past 3.0 only with high-quality data.
196
 
197
- | Experiment | L1 Time | L1 SR | Quality |
198
  |:---:|:---:|:---:|:---:|
199
  | 1.1 (π0, all data) | 121.5s | 80% | 2.70 |
200
  | 1.7 (best S1) | 99.5s | 80% | 2.30 |
@@ -202,8 +205,8 @@ Series 1 completes Level 1 in **78–122s**; Series 2 does it in **41–73s**. F
202
  | 2.2 (HQ + RABC) | 43.2s | 100% | 3.30 |
203
  | 2.5 (best overall) | **40.8s** | **100%** | **4.10** |
204
 
205
- Policies trained on the full dataset learned hesitant motions; the high-quality dataset enforces deliberate, progress oriented actions. Faster completion isn't a separate goal from quality it's a consequence of a clear, unambiguous strategy.
206
 
207
  #### 8. What did not work
208
 
209
- - **Mirroring augmentation** (2.3): only 5% total SR. It doubles multimodality, making convergence much harder even at 100k steps.
 
10
 
11
  ## Experiments
12
 
13
+ We ran 11 experiments to understand what *actually* matters. **Series 1** trains from pretrained base checkpoints on the full dataset. **Series 2** fine-tunes Series 1 checkpoints on curated high-quality data (2.1–2.4 from 1.3, 2.5 from 1.7). One early lesson: **undertraining makes the policy shaky**, so make sure your model has converged before drawing conclusions.
14
 
15
  <Wide>
16
 
 
41
  )
42
  ```
43
 
44
+ We ran this configuration with an action queue size of 30 and a maximum action horizon of 20.
45
+
46
+ RTC gave us a ~2x speedup (sometimes even 2.5x), and action interpolation made the robot much quieter and smoother. Both are now available on [LeRobot main](https://github.com/huggingface/lerobot).
47
 
48
  ### SARM: Our Reward Model
49
 
50
  Before diving into the experiments further, let's introduce a key ingredient: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). SARM is a trained reward model that scores trajectories based on how well the robot is progressing toward task completion. It acts as a learned "critic" that predicts whether things are going well or badly.
51
 
52
+ SARM is trained on our demonstration data to predict task progression on a 0-to-1 scale. The takeaway: it correctly identifies **mistakes** (drops in value) and **progress** (increases) in real time.
53
 
54
  <Wide>
55
  <Stack layout="3-column" gap="small">
 
59
  </Stack>
60
  </Wide>
61
 
62
+ We use SARM exclusively for **RABC** (Reward-Advantage-Based Conditioning): every episode is scored with a per-timestep quality signal, and during training, actions are weighted by their contribution to progress. High-reward actions contribute more to the loss, low-reward ones contribute less. Negative progress is clipped to 0. Unlike binary success/fail labels, SARM provides a continuous signal at every timestep.
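
The weighting idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the LeRobot implementation: the function names are hypothetical, progress gain is measured over a single timestep, and weights are normalized to an average of 1, none of which is specified in the text.

```python
def rabc_weights(progress):
    """Turn a per-timestep 0-1 progress curve into per-timestep loss weights.

    Assumption: a step's contribution is the progress gained by the next
    timestep, with negative gains clipped to 0.
    """
    # The final step has no successor, so its gain is taken as 0.
    deltas = [b - a for a, b in zip(progress, progress[1:])] + [0.0]
    clipped = [max(0.0, d) for d in deltas]
    total = sum(clipped)
    if total == 0.0:
        return clipped  # no forward progress anywhere: every weight is 0
    n = len(progress)
    return [c * n / total for c in clipped]  # normalize to mean weight 1


def weighted_loss(per_step_losses, weights):
    """Average the per-timestep losses after applying the weights."""
    return sum(l * w for l, w in zip(per_step_losses, weights)) / len(per_step_losses)


# Made-up progress curve with one mistake (the drop from 0.2 to 0.1).
weights = rabc_weights([0.0, 0.2, 0.1, 0.5])
print(weights)
```

In this example the step that precedes the drop receives weight 0, while the step that produces the largest gain receives the largest weight.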
63
 
64
  ---
65
 
66
  ### Results Overview
67
 
68
+ Now let's look at how each experiment actually performed. The charts below show success rates, scores, completion times, and failure modes across all 11 experiments. The pattern is consistent: **Series 2 dominates Series 1**, and within each series, RABC combined with relative actions produces the best results. We break down the key findings below.
69
 
70
  <HtmlEmbed
71
  id="success-rates"
 
74
  desc="Success rates (Total, Level 1, Level 2) across all experiments. Series 1 trains from pretrained base checkpoints on the full data; Series 2 finetunes the best Series 1 checkpoint on curated high-quality data."
75
  />
76
 
77
+ The gap between Series 1 and Series 2 is immediately visible. Experiment 2.5 reaches 90% total success rate (100% Level 1, 80% Level 2), while the best Series 1 result tops out at 40%. No Series 1 experiment achieves a single Level 2 success.
78
 
79
  <HtmlEmbed
80
  id="total-score"
 
105
 
106
  ### Where the policies fail
107
 
108
+ Before interpreting success rates, understanding *how* each experiment fails, not just whether it fails, is essential.
109
 
110
  <HtmlEmbed
111
  id="failure-analysis"
 
131
 
132
  #### 1. Data quality matters most
133
 
134
+ This is the finding we're most confident in: it held regardless of which confidence level or correction method we used. The best Series 1 result (1.7) achieves a 40% total success rate. The best Series 2 result (2.5) achieves 90% using the *same architecture*. The pairwise tests cleanly separate these two groups, and no amount of algorithmic tuning within Series 1 came close to closing the gap.
135
 
136
  We hypothesise that the root cause is the difference in **multi-modality** between the high-quality and full dataset. The full dataset contains demonstrations with some inconsistent strategies: different grips, unfolding sequences, and timing, while the high-quality dataset enforces a more unified, consistent protocol.
137
 
 
167
  --policy.use_relative_actions=true
168
  ```
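
For readers unfamiliar with the flag in the command above, the sketch below illustrates one common meaning of "relative actions": each target in an action chunk is expressed as an offset from the robot state at the start of the chunk. This definition is an assumption for illustration, and the helper names are hypothetical; the exact convention used by LeRobot may differ.

```python
def to_relative(action_chunk, current_state):
    """Express each absolute target as an offset from the current state."""
    return [[a - s for a, s in zip(action, current_state)] for action in action_chunk]


def to_absolute(relative_chunk, current_state):
    """Invert to_relative: add the current state back to every offset."""
    return [[d + s for d, s in zip(delta, current_state)] for delta in relative_chunk]


# Made-up two-joint example.
state = [0.5, 1.0]
chunk = [[0.75, 1.0], [1.0, 1.25]]
relative = to_relative(chunk, state)
print(relative)
print(to_absolute(relative, state) == chunk)
```

Under this convention the policy predicts small offsets around zero rather than absolute joint targets, which is one plausible reason normalization choices interact with this setting.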
169
 
170
+ Comparing π0.5 without relative actions (1.2: 20% total SR, 40% Level 1) to π0.5 with relative actions and quantile normalization (1.3: 35% total SR, 70% Level 1), and then to the full combination in 1.7 (40% total SR, 80% Level 1), shows that training with relative actions consistently improves performance. The trend is clear and shows up in every comparison we made.
171
 
172
+ With only 20 rollouts, the exact gap between experiments is hard to pin down — but the improvement is consistent across every comparison. **Caveat:** π0.5 is likely pretrained with relative actions, so 1.3 and 1.7 fine-tune in a regime consistent with pretraining, while 1.2 fine-tunes against it.
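
To see why 20 rollouts cannot pin down the gap, the sketch below computes 95% Wilson score intervals for the total success rates of experiments 1.2 (20%, i.e. 4 of 20) and 1.3 (35%, i.e. 7 of 20). The Wilson interval is used here purely as an illustration; it is not the pairwise test used in the analysis, and the helper name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


low_12, high_12 = wilson_interval(4, 20)  # experiment 1.2: 20% of 20 rollouts
low_13, high_13 = wilson_interval(7, 20)  # experiment 1.3: 35% of 20 rollouts
print(f"1.2: [{low_12:.2f}, {high_12:.2f}]  1.3: [{low_13:.2f}, {high_13:.2f}]")
```

The two intervals overlap substantially, which matches the point above: the direction of the effect is consistent, but its size is poorly constrained at this sample size.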
173
 
174
  #### 3. RABC helps, especially on long tasks like Level 2
175
 
176
+ RABC on high-quality data yields the two best results overall: experiments 2.2 and 2.5 clearly outperform every experiment that lacks it. The effect is strongest on **Level 2**, the longer and harder task — 2.2 reaches 50% Level 2 SR and 2.5 reaches 80%, while every experiment without RABC on clean data stays at 0%.
177
+
178
  #### 4. Fine-tuning from a strong checkpoint is the winning recipe
179
 
180
  The best results share the same recipe: fine-tune a Series 1 checkpoint on curated high-quality data with RABC and relative actions.
181
 
182
+ | Experiment | Total SR | Level 1 SR | Level 2 SR | Recipe |
183
  |:---:|:---:|:---:|:---:|:---|
184
  | 2.5 | **90%** | **100%** | **80%** | 1.7 → HQ + RABC, 100k steps |
185
  | 2.2 | 75% | 100% | 50% | 1.3 → HQ + RABC, 100k steps |
186
  | 1.7 | 40% | 80% | 0% | All data, Relative Actions + RABC + QUANTILES |
187
 
188
+ The jump from Series 1 to Series 2 is unambiguous in the statistical analysis — 2.5 and 2.2 clearly stand out from the Series 1 group. The Series 1 checkpoint already knows how to fold shirts in general, the high-quality data teaches the correct protocol, and RABC emphasizes the best demonstrations within an already clean dataset.
189
 
190
  Both 2.2 and 2.5 were trained for 100k steps. 2.2 fine-tunes from 1.3 while 2.5 fine-tunes from 1.7 (the stronger base). The difference (75% → 90%) likely reflects this stronger starting point. They don't separate from each other in the pairwise tests, suggesting the recipe itself (HQ + RABC + Relative Actions) is the key ingredient, with the base checkpoint providing an additional boost.
191
 
192
  #### 5. Level 2 requires everything to be right simultaneously
193
 
194
+ Every Series 1 experiment achieves exactly **0% Level 2 success**. Level 2 only becomes tractable in Series 2, and only with RABC on high-quality data (2.2: 50% Level 2, 2.5: 80% Level 2). The 0% → 50–80% jump is as clean a signal as you'll find in a 20-rollout experiment. Level 2 is genuinely harder. It requires the policy to have seen consistent, high-quality demonstrations of the full task, because without a reliable starting state after unfolding, the subsequent folds can't succeed.
195
 
196
  #### 6. Speed and fold quality both track data quality
197
 
198
  Series 1 completes Level 1 in **78–122s**; Series 2 does it in **41–73s**. Fold quality (1–5 scale) hits a ceiling around 2.8 in Series 1, breaking past 3.0 only with high-quality data.
199
 
200
+ | Experiment | Level 1 Time | Level 1 SR | Quality |
201
  |:---:|:---:|:---:|:---:|
202
  | 1.1 (π0, all data) | 121.5s | 80% | 2.70 |
203
  | 1.7 (best S1) | 99.5s | 80% | 2.30 |
 
205
  | 2.2 (HQ + RABC) | 43.2s | 100% | 3.30 |
206
  | 2.5 (best overall) | **40.8s** | **100%** | **4.10** |
207
 
208
+ Policies trained on the full dataset learned hesitant motions; the high-quality dataset enforces deliberate, progress-oriented actions. Faster completion isn't a separate goal from quality; it's a consequence of a clear, unambiguous strategy.
209
 
210
  #### 7. What did not work
211
 
212
+ **Mirroring augmentation** (2.3): only 5% total SR. Mirroring doubles multimodality, making convergence much harder even at 100k steps.
app/src/content/chapters/folding/09-learnings.mdx CHANGED
@@ -3,25 +3,25 @@ import Sidenote from "../../../components/Sidenote.astro";
3
 
4
  ## Learnings
5
 
6
- Running all these experiments taught us a lot some expected, some not. Here's what stuck.
7
 
8
  ### What mattered most
9
 
10
  Beyond the experiment findings above, several practical insights stood out:
11
 
12
  - **Train a reward model.** [SARM](https://huggingface.co/docs/lerobot/sarm) gave us data scoring, advantage conditioning, and curation in one package. We recommend it even for tasks where you think manual filtering would suffice.
13
- - **Invest in recording quality early.** More time upfront on clean, consistent recordings pays off more than extra volume.
14
- - **Record at higher frequency.** We'd record at 50 fps if we did it again. Folding is dynamic and higher record rates capture transitions better.
15
- - **DAgger is promising.** Targeted corrections for the model's actual failure modes should be very effective pushign the success rate higher. This infrastructure is ready and now also merged into LeRobot.
16
 
17
  ### For the community: the order of operations
18
 
19
- If you're training a policy for a new manipulation task with LeRobot, here's the sequence we'd recommend based on what we learned:
20
 
21
- 1. **Define your task protocol first.** Before collecting a single episode, agree on exactly how the task should be performed.
22
  2. **Collect 50–100 clean demonstrations.** Quality over volume. Consistent technique, good camera angles, deliberate motions. This is your foundation, everything else builds on it.
23
- 3. **Train a reward model.** Use [SARM](https://huggingface.co/docs/lerobot/sarm) to score your episodes and enable RABC during training. This lets the policy focus on the best demonstrations, especially important for longer tasks.
24
- 4. **Train a baseline and watch it fail.** Film the rollouts. Understanding *how* and *where* it breaks tells you exactly what data to collect next.
25
  5. **Use DAgger for targeted improvement.** Once you have a model that mostly works, collect correction data for its specific failure modes. LeRobot's [HIL scripts](https://github.com/huggingface/lerobot/tree/main/examples/hil) handle the full loop, the operator watches the policy run, pauses on failure, teleoperates a recovery, and hands control back:
26
 
27
  ```bash
@@ -34,7 +34,7 @@ python examples/hil/hil_data_collection.py \
34
  --rtc.execution_horizon=20
35
  ```
36
 
37
- 6. **Enable action interpolation and RTC.** This smooths transitions and speeds up execution. Action interpolation upsamples the policy's 30 Hz output to your robot's control frequency (e.g. 90 Hz), and RTC overlaps inference with execution. Both are flags on `lerobot-eval`:
38
 
39
  ```bash
40
  lerobot-eval \
@@ -49,7 +49,7 @@ lerobot-eval \
49
  7. **Film every evaluation.** Metrics alone won't tell the full story. Video reveals subtle failure modes that success rate misses, and lets you score quality.
50
 
51
  <Note variant="info">
52
- All the innovations from this project [SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm), and OpenArm Mini are merged into [LeRobot main](https://github.com/huggingface/lerobot). You can use our full pipeline as a starting point and swap in your own task.
53
  </Note>
54
 
55
  ### What's next
@@ -58,8 +58,8 @@ This project is far from done. We're releasing the final model, full dataset, an
58
 
59
  - **Massive-scale training.** We want LeRobot and LeRobotDataset to support 10–100x the data we used here, with billions of frames, powered by the new [HF Buckets](https://huggingface.co/docs/hub/en/storage-buckets) for storage and streaming at scale.
60
  - **More robots, teleoperators, VLAs, and reward models.** We're continuing to expand the ecosystem of supported hardware, teleoperation setups, and model architectures in LeRobot.
61
- - **RL support.** Extending LeRobot with new reinforcement learning methods and all the infrastructure needed to train policies online, not just from offline demonstrations.
62
- - **Better data interpretability tools.** Our biggest lever was data quality, but finding *which* demonstrations help and which hurt is still a hard problem. The field needs tools that provide understanding of what makes a trajectory useful before you train on it. If you're working on data curation or interpretability for robotics, [reach out](https://huggingface.co/lerobot), we'd love to collaborate.
63
  - **Democratize robot learning.** Continue to lower the barrier to entry and share every insight, tool, and method with the community.
64
 
65
- We also encourage you to use our dataset directly. Train your own policies, try new architectures, experiment with different training recipes. If you find something promising, reach out we're happy to run your models on our physical setups and share the results back.
 
3
 
4
  ## Learnings
5
 
6
+ Running all these experiments taught us a lot, some of it expected, some not. Here's what stuck.
7
 
8
  ### What mattered most
9
 
10
  Beyond the experiment findings above, several practical insights stood out:
11
 
12
  - **Train a reward model.** [SARM](https://huggingface.co/docs/lerobot/sarm) gave us data scoring, advantage conditioning, and curation in one package. We recommend it even for tasks where you think manual filtering would suffice.
13
+ - **Invest in recording quality early.** More time upfront on clean and consistent recordings pays off more than extra volume.
14
+ - **Record at higher frequency.** We'd record at 50 fps if we had to do it again. Folding is dynamic, and higher recording rates capture transitions better.
15
+ - **DAgger is promising.** Corrections targeting the model's actual failure modes should be highly effective at pushing success rates higher. This infrastructure is ready and now also merged into LeRobot.
16
 
17
  ### For the community: the order of operations
18
 
19
+ If you're training a policy for a new manipulation task with LeRobot, **here's the sequence we'd recommend**:
20
 
21
+ 1. **Define your task protocol first.** Before collecting a single episode, define exactly how the task should be performed.
22
  2. **Collect 50–100 clean demonstrations.** Quality over volume. Consistent technique, good camera angles, deliberate motions. This is your foundation; everything else builds on it.
23
+ 3. **Train a reward model.** Use [SARM](https://huggingface.co/docs/lerobot/sarm) to score your episodes and enable RABC during training. This allows the policy to focus on the best demonstrations, which is crucial for longer tasks.
24
+ 4. **Train a baseline and watch it fail.** Film the rollouts. Understanding *how* and *where* it breaks tells you exactly what kind of data to collect next.
25
  5. **Use DAgger for targeted improvement.** Once you have a model that mostly works, collect correction data for its specific failure modes. LeRobot's [HIL scripts](https://github.com/huggingface/lerobot/tree/main/examples/hil) handle the full loop, in which the operator watches the policy run, pauses on failure, teleoperates a recovery, and hands control back:
26
 
27
  ```bash
 
34
  --rtc.execution_horizon=20
35
  ```
36
 
37
+ 6. **Enable action interpolation and RTC.** This smooths transitions and speeds up execution. Action interpolation upsamples the policy's 30 Hz output to your robot's control frequency (e.g. 90 Hz), and RTC overlaps inference with execution. Both features are flags on `lerobot-eval`:
38
 
39
  ```bash
40
  lerobot-eval \
 
49
  7. **Film every evaluation.** Metrics alone won't tell the full story. Video reveals subtle failure modes that success rate misses, and lets you score quality.
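
The action interpolation mentioned in step 6 can be pictured as upsampling a 30 Hz action sequence by a factor of 3 to reach 90 Hz. The sketch below uses plain linear interpolation between consecutive actions; that choice and the helper name are assumptions for illustration, and the interpolation scheme in LeRobot may differ.

```python
def upsample_actions(actions, factor=3):
    """Linearly interpolate between consecutive action vectors.

    With factor=3, a 30 Hz sequence of N actions becomes a 90 Hz
    sequence of (N - 1) * 3 + 1 actions.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if len(actions) < 2:
        return [list(a) for a in actions]
    out = []
    for prev, nxt in zip(actions, actions[1:]):
        for k in range(factor):
            t = k / factor  # fraction of the way from prev to nxt
            out.append([p + t * (q - p) for p, q in zip(prev, nxt)])
    out.append(list(actions[-1]))  # keep the final target exactly
    return out


# Made-up single-joint example: two 30 Hz targets become four 90 Hz targets.
print(upsample_actions([[0.0], [3.0]], factor=3))
```

Smaller steps between consecutive commands are one plausible reason the robot moves more quietly and smoothly with interpolation enabled.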
50
 
51
  <Note variant="info">
52
+ All the innovations from this project ([SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm), and OpenArm Mini) are merged into the [LeRobot repository](https://github.com/huggingface/lerobot). You can use our full pipeline as a starting point and swap in your own task.
53
  </Note>
54
 
55
  ### What's next
 
58
 
59
  - **Massive-scale training.** We want LeRobot and LeRobotDataset to support 10–100x the data we used here, with billions of frames, powered by the new [HF Buckets](https://huggingface.co/docs/hub/en/storage-buckets) for storage and streaming at scale.
60
  - **More robots, teleoperators, VLAs, and reward models.** We're continuing to expand the ecosystem of supported hardware, teleoperation setups, and model architectures in LeRobot.
61
+ - **Reinforcement Learning support.** Extending LeRobot with new reinforcement learning methods and all the infrastructure needed to train policies online, not just from offline demonstrations.
62
+ - **Better data interpretability tools.** Our biggest lever was data quality, but finding *which* demonstrations help and which hurt is still a hard problem. The field needs tools that provide understanding of what makes a trajectory useful before you train on it. If you're exploring data curation or interpretability for robotics, we'd love to hear from you — [let's talk](https://huggingface.co/lerobot).
63
  - **Democratize robot learning.** Continue to lower the barrier to entry and share every insight, tool, and method with the community.
64
 
65
+ We also encourage you to use our dataset directly. Train your own policies, try new architectures, and experiment with different training recipes. If you find something promising, we'd be happy to run your models on our physical setups and share the results back!
app/src/styles/_base.css CHANGED
@@ -104,9 +104,27 @@ html {
104
  margin-bottom: 0;
105
  }
106
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
107
  .content-grid main blockquote {
108
- border-left: 2px solid var(--border-color);
109
- padding-left: var(--spacing-4);
 
 
110
  font-style: italic;
111
  color: var(--muted-color);
112
  margin: var(--spacing-4) 0;
@@ -182,4 +200,11 @@ html {
182
  background: none;
183
  border: none;
184
  opacity: 0.4;
 
 
 
 
 
 
 
185
  }
 
104
  margin-bottom: 0;
105
  }
106
 
107
+ .links-centered {
108
+ display: flex !important;
109
+ flex-wrap: wrap;
110
+ justify-content: center;
111
+ }
112
+
113
+ .links-centered > * {
114
+ flex: 0 0 calc(25% - 0.375rem);
115
+ }
116
+
117
+ @media (max-width: 768px) {
118
+ .links-centered > * {
119
+ flex: 0 0 calc(50% - 0.25rem);
120
+ }
121
+ }
122
+
123
  .content-grid main blockquote {
124
+ border-left: none;
125
+ padding-left: 0;
126
+ font-size: clamp(16px, 1.8vw, 19px);
127
+ line-height: 1.6;
128
  font-style: italic;
129
  color: var(--muted-color);
130
  margin: var(--spacing-4) 0;
 
200
  background: none;
201
  border: none;
202
  opacity: 0.4;
203
+ }
204
+
205
+ .lead-paragraph {
206
+ font-size: clamp(18px, 2.2vw, 22px);
207
+ line-height: 1.55;
208
+ letter-spacing: -0.01em;
209
+ margin-bottom: var(--spacing-5);
210
  }