Update app/src/content/chapters/folding/08-ablations.mdx

#9
app/src/content/chapters/folding/08-ablations.mdx CHANGED
@@ -9,8 +9,7 @@ import sarmEp2200 from "../../assets/image/lerobot-data-collection_level12_rac_2
9
  import Stack from "../../../components/Stack.astro";
10
 
11
  ## Experiments
12
-
13
- We ran 11 experiments to understand what *actually* matters. **Series 1** trains from pretrained base checkpoints on the full dataset. **Series 2** finetunes Series 1 checkpoints on curated high-quality data (2.1–2.4 from 1.3, 2.5 from 1.7). One early lesson: **undertraining makes the policy shaky** make sure your model has converged before drawing conclusions.
14
 
15
  <Wide>
16
 
@@ -18,15 +17,15 @@ We ran 11 experiments to understand what *actually* matters. **Series 1** trains
18
  |:---:|:---:|:---|:---:|:---:|:---|
19
  | 1.1 | π0 | All data | 200k | MEAN_STD | Baseline |
20
  | 1.2 | π0.5 | All data | 200k | MEAN_STD | Baseline |
21
- | 1.3 | π0.5 | All data | 200k | QUANTILES | Δ (Delta) Action |
22
  | 1.4 | π0.5 | All data | 200k | MEAN_STD | Reward model (SARM) with RABC κ=0.01 |
23
  | 1.5 | π0.5 | All data | 200k | MEAN_STD | Reward model (SARM) with RABC κ=0.0215 |
24
- | 1.7 | π0.5 | All data | 200k | QUANTILES | Δ (Delta) Action + Reward model (SARM) with RABC κ=0.0215 |
25
  | 2.1 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 |
26
- | 2.2 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 + Reward model (SARM) with RABC κ=0.0265 + Δ (Delta) Action |
27
- | 2.3 | π0.5 | High-quality + mirrored | 100k | QUANTILES | Fine-tune from 1.3 + Δ (Delta) Action + image transforms + mirroring setup (data augmentation) |
28
  | 2.4 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 · chunk=45 |
29
- | 2.5 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.7 + Reward model (SARM) with RABC κ=0.0265 + Δ (Delta) Action |
30
 
31
  </Wide>
32
 
@@ -47,7 +46,7 @@ With an action queue size of 30 and max action horizon of 20. RTC gave us a ~2x
47
 
48
  Before diving into the experiments further, let's introduce a key ingredient: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). SARM is a trained reward model that scores trajectories based on how well the robot is progressing toward task completion, it acts as a learned "critic" that predicts whether things are going well or badly.
49
 
50
- SARM is trained on our demonstration data to predict 0-1 task progression. The key insight: it correctly identifies **mistakes** (drops in value) and **progress** (increases) in real time.
51
 
52
  <Wide>
53
  <Stack layout="3-column" gap="small">
@@ -63,7 +62,7 @@ We use SARM exclusively for **RABC** (Reward-Advantage-Based Conditioning): it s
63
 
64
  ### Results Overview
65
 
66
- Now let's look at how each experiment actually performed. The charts below show success rates, scores, completion times, and failure modes across all 11 experiments. The pattern is consistent: **Series 2 dominates Series 1**, and within each series, RABC combined with delta actions produces the best results. Explore the charts, then we break down the key findings below.
67
 
68
  <HtmlEmbed
69
  id="success-rates"
@@ -90,7 +89,7 @@ Total score captures partial progress that binary success rate misses. Even fail
90
  desc="Each bubble is one experiment. X-axis = completion time (faster is left), Y-axis = fold quality (higher is better), bubble size = total success rate. The best experiments cluster in the top-left corner."
91
  />
92
 
93
- Speed and quality correlate strongly with data quality. Series 2 experiments fold 2-3x faster than Series 1 (40s vs 100s+), and fold quality only breaks past 3.0 with high-quality training data. Faster isn't a separate goal from better; it's a consequence of the policy learning a clear, unambiguous strategy.
94
 
95
  <HtmlEmbed
96
  id="subtask-heatmap"
@@ -103,7 +102,7 @@ The heatmap shows where time is spent. Series 1 experiments are slow across the
103
 
104
  ### Where the policies fail
105
 
106
- Before interpreting success rates, it helps to understand *how* each experiment fails not just whether it fails.
107
 
108
  <HtmlEmbed
109
  id="failure-analysis"
@@ -137,28 +136,28 @@ We hypothesise that the root cause is the difference in **multi-modality** betwe
137
  Define the exact task protocol before collecting data. Speed is secondary to consistency and clarity of intent at every step.
138
  </Note>
139
 
140
- #### 2. Delta actions improve performance consistently
141
 
142
- Comparing π0.5 without delta actions (1.2: 20% total SR, 40% L1) to π0.5 with delta actions and quantile normalization (1.3: 35% total SR, 70% L1), and then to the full combination in 1.7 (40% total SR, 80% L1), shows that training with delta actions consistently improves performance. The trend is clear and shows up in every comparison we made.
143
 
144
- The effect size doesn't separate cleanly at 20 rollouts, but the direction is consistent. **Caveat:** π0.5 is likely pretrained with delta actions, so 1.3 and 1.7 fine-tune in a regime consistent with pretraining, while 1.2 fine-tunes against it.
145
 
146
  #### 3. RABC helps especially on long tasks like level 2
147
 
148
  RABC on high-quality data produces the two best results overall: 2.2 and 2.5 clearly separate from experiments without it. The effect is strongest on **Level 2**, the longer and harder task — 2.2 reaches 50% L2 SR and 2.5 reaches 80%, while every experiment without RABC on clean data stays at 0%.
149
  #### 4. Fine-tuning from a strong checkpoint is the winning recipe
150
 
151
- The best results share the same recipe: fine-tune a Series 1 checkpoint on curated high-quality data with RABC and delta actions.
152
 
153
  | Experiment | Total SR | L1 SR | L2 SR | Recipe |
154
  |:---:|:---:|:---:|:---:|:---|
155
  | 2.5 | **90%** | **100%** | **80%** | 1.7 → HQ + RABC, 100k steps |
156
  | 2.2 | 75% | 100% | 50% | 1.3 → HQ + RABC, 100k steps |
157
- | 1.7 | 40% | 80% | 0% | All data, ΔActions + RABC + QUANTILES |
158
 
159
  The jump from Series 1 to Series 2 is unambiguous in the statistical analysis — 2.5 and 2.2 clearly separate from the Series 1 group. The Series 1 checkpoint already knows how to fold shirts in general, the high-quality data teaches the correct protocol, and RABC emphasizes the best demonstrations within an already clean dataset.
160
 
161
- Both 2.2 and 2.5 were trained for 100k steps. 2.2 fine-tunes from 1.3 while 2.5 fine-tunes from 1.7 (the stronger base). The difference (75% → 90%) likely reflects this stronger starting point. They don't separate from each other in the pairwise tests, suggesting the recipe itself (HQ + RABC + ΔActions) is the key ingredient, with the base checkpoint providing an additional boost.
162
 
163
  #### 5. Level 2 requires everything to be right simultaneously
164
 
 
9
  import Stack from "../../../components/Stack.astro";
10
 
11
  ## Experiments
12
+ We ran 11 ablation experiments to isolate which variables actually drove performance. **Series 1** trains from pretrained base checkpoints on the full dataset. **Series 2** fine-tunes Series 1 checkpoints on curated high-quality data. One early lesson: **undertraining makes the policy shaky**, so make sure your model has converged before drawing conclusions.
 
13
 
14
  <Wide>
15
 
 
17
  |:---:|:---:|:---|:---:|:---:|:---|
18
  | 1.1 | π0 | All data | 200k | MEAN_STD | Baseline |
19
  | 1.2 | π0.5 | All data | 200k | MEAN_STD | Baseline |
20
+ | 1.3 | π0.5 | All data | 200k | QUANTILES | Relative Action |
21
  | 1.4 | π0.5 | All data | 200k | MEAN_STD | Reward model (SARM) with RABC κ=0.01 |
22
  | 1.5 | π0.5 | All data | 200k | MEAN_STD | Reward model (SARM) with RABC κ=0.0215 |
23
+ | 1.7 | π0.5 | All data | 200k | QUANTILES | Relative Action + Reward model (SARM) with RABC κ=0.0215 |
24
  | 2.1 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 |
25
+ | 2.2 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 + Reward model (SARM) with RABC κ=0.0265 + Relative Action |
26
+ | 2.3 | π0.5 | High-quality + mirrored | 100k | QUANTILES | Fine-tune from 1.3 + Relative Action + image transforms + mirroring setup (data augmentation) |
27
  | 2.4 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 · chunk=45 |
28
+ | 2.5 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.7 + Reward model (SARM) with RABC κ=0.0265 + Relative Action |
29
 
30
  </Wide>
31
 
 
46
 
47
  Before diving into the experiments further, let's introduce a key ingredient: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). SARM is a trained reward model that scores trajectories based on how well the robot is progressing toward task completion, it acts as a learned "critic" that predicts whether things are going well or badly.
48
 
49
+ SARM is trained on our demonstration data to predict 0-1 task progression: it correctly identifies **mistakes** (drops in value) and **progress** (increases) in real time.
50
 
51
  <Wide>
52
  <Stack layout="3-column" gap="small">
 
62
 
63
  ### Results Overview
64
 
65
+ Now let's look at how each experiment actually performed. The charts below show success rates, scores, completion times, and failure modes across all 11 experiments. The pattern is consistent: **Series 2 dominates Series 1**, and within each series, RABC combined with relative actions produces the best results. Explore the charts, then we break down the key findings below.
66
 
67
  <HtmlEmbed
68
  id="success-rates"
 
89
  desc="Each bubble is one experiment. X-axis = completion time (faster is left), Y-axis = fold quality (higher is better), bubble size = total success rate. The best experiments cluster in the top-left corner."
90
  />
91
 
92
+ Speed and quality correlate strongly with data quality. Series 2 experiments fold 2-3x faster than Series 1 (40s vs 100s+), and fold quality only breaks past 3.0 with high-quality training data. The increases in speed and quality are both consequences of the policy learning a clear, unambiguous strategy.
93
 
94
  <HtmlEmbed
95
  id="subtask-heatmap"
 
102
 
103
  ### Where the policies fail
104
 
105
+ Before interpreting success rates, it helps to understand *how* each experiment fails, not just whether it fails or not.
106
 
107
  <HtmlEmbed
108
  id="failure-analysis"
 
136
  Define the exact task protocol before collecting data. Speed is secondary to consistency and clarity of intent at every step.
137
  </Note>
138
 
139
+ #### 2. Relative actions improve performance consistently
140
 
141
+ Comparing π0.5 without relative actions (1.2: 20% total SR, 40% L1) to π0.5 with relative actions and quantile normalization (1.3: 35% total SR, 70% L1), and then to the full combination in 1.7 (40% total SR, 80% L1), shows that training with relative actions consistently improves performance. The trend is clear and shows up in every comparison we made.
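
The chapter does not spell out how relative actions are computed. One common reading, assumed here, is that each action in a chunk is expressed as an offset from the robot state at the start of the chunk, with some dimensions (for example a gripper) optionally kept absolute. The function names and the `absolute_dims` parameter below are hypothetical and are not taken from the training code.

```python
from typing import List, Sequence


def to_relative_chunk(
    state: Sequence[float],
    absolute_actions: Sequence[Sequence[float]],
    absolute_dims: Sequence[int] = (),
) -> List[List[float]]:
    """Express every action in the chunk as an offset from `state`.

    Dimensions listed in `absolute_dims` are passed through unchanged
    (assumption: some dimensions, such as a gripper, may stay absolute).
    """
    keep = set(absolute_dims)
    return [
        [a if i in keep else a - s for i, (a, s) in enumerate(zip(action, state))]
        for action in absolute_actions
    ]


def to_absolute_chunk(
    state: Sequence[float],
    relative_actions: Sequence[Sequence[float]],
    absolute_dims: Sequence[int] = (),
) -> List[List[float]]:
    """Invert `to_relative_chunk` so the robot can execute the chunk."""
    keep = set(absolute_dims)
    return [
        [r if i in keep else r + s for i, (r, s) in enumerate(zip(action, state))]
        for action in relative_actions
    ]


# Toy example: two joints plus one dimension kept absolute (index 2).
state = [1.0, 2.0, 0.5]
chunk = [[1.5, 2.0, 1.0], [2.0, 1.0, 0.0]]
relative = to_relative_chunk(state, chunk, absolute_dims=(2,))
restored = to_absolute_chunk(state, relative, absolute_dims=(2,))
print(relative)
print(restored == chunk)
```

Under this reading the policy predicts offsets rather than absolute targets, so the targets it regresses are centered on the current state, which is one plausible reason the choice interacts with the normalization scheme.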
142
 
143
+ The effect size doesn't separate cleanly at 20 rollouts, but the direction is consistent. **Caveat:** π0.5 is likely pretrained with relative actions, so 1.3 and 1.7 fine-tune in a regime consistent with pretraining, while 1.2 fine-tunes against it.
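
To see why 20 rollouts cannot separate these experiments cleanly, a rough sketch is to put a confidence interval around each success rate. The snippet below uses the Wilson score interval, which is our choice for illustration and not necessarily the analysis used in the chapter. It assumes 20 rollouts per experiment, so the reported 20%, 35%, and 40% correspond to 4, 7, and 8 successes.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Success counts are assumptions derived from the reported rates at 20 rollouts.
for name, successes in [("1.2", 4), ("1.3", 7), ("1.7", 8)]:
    low, high = wilson_interval(successes, 20)
    print(f"exp {name}: {successes}/20, interval {low:.2f} to {high:.2f}")
```

The intervals for 1.2 and 1.3 overlap substantially, which matches the statement that the direction is consistent while the effect size is not resolved at this sample size.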
144
 
145
  #### 3. RABC helps especially on long tasks like level 2
146
 
147
  RABC on high-quality data produces the two best results overall: 2.2 and 2.5 clearly separate from experiments without it. The effect is strongest on **Level 2**, the longer and harder task — 2.2 reaches 50% L2 SR and 2.5 reaches 80%, while every experiment without RABC on clean data stays at 0%.
148
  #### 4. Fine-tuning from a strong checkpoint is the winning recipe
149
 
150
+ The best results share the same recipe: fine-tune a Series 1 checkpoint on curated high-quality data with RABC and relative actions.
151
 
152
  | Experiment | Total SR | L1 SR | L2 SR | Recipe |
153
  |:---:|:---:|:---:|:---:|:---|
154
  | 2.5 | **90%** | **100%** | **80%** | 1.7 → HQ + RABC, 100k steps |
155
  | 2.2 | 75% | 100% | 50% | 1.3 → HQ + RABC, 100k steps |
156
+ | 1.7 | 40% | 80% | 0% | All data, Relative Actions + RABC + QUANTILES |
157
 
158
  The jump from Series 1 to Series 2 is unambiguous in the statistical analysis — 2.5 and 2.2 clearly separate from the Series 1 group. The Series 1 checkpoint already knows how to fold shirts in general, the high-quality data teaches the correct protocol, and RABC emphasizes the best demonstrations within an already clean dataset.
159
 
160
+ Both 2.2 and 2.5 were trained for 100k steps. 2.2 fine-tunes from 1.3 while 2.5 fine-tunes from 1.7 (the stronger base). The difference (75% → 90%) likely reflects this stronger starting point. They don't separate from each other in the pairwise tests, suggesting the recipe itself (HQ + RABC + Relative Actions) is the key ingredient, with the base checkpoint providing an additional boost.
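
The chapter does not say which pairwise test was used. As an illustration only, the sketch below runs a two-sided Fisher's exact test on success counts, implemented with the standard library. The counts assume 20 rollouts per experiment, so 90%, 75%, and 40% become 18, 15, and 8 successes.

```python
from math import comb


def fisher_exact_two_sided(s1: int, n1: int, s2: int, n2: int) -> float:
    """Two-sided Fisher's exact test p-value for two success counts.

    Sums the hypergeometric probabilities of every table that is no more
    likely than the observed one, with the margins held fixed.
    """
    total = n1 + n2
    total_successes = s1 + s2
    denom = comb(total, n1)

    def prob(k: int) -> float:
        # Probability that group 1 has exactly k of the pooled successes.
        return comb(total_successes, k) * comb(total - total_successes, n1 - k) / denom

    lowest = max(0, n1 - (total - total_successes))
    highest = min(n1, total_successes)
    observed = prob(s1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(
        prob(k) for k in range(lowest, highest + 1) if prob(k) <= observed * (1 + 1e-9)
    )


# Success counts are assumptions derived from the reported rates at 20 rollouts.
print("2.2 vs 2.5:", fisher_exact_two_sided(15, 20, 18, 20))
print("1.7 vs 2.5:", fisher_exact_two_sided(8, 20, 18, 20))
```

With these assumed counts, the 2.2 versus 2.5 comparison gives a large p-value while 1.7 versus 2.5 gives a small one, which is the same qualitative picture as the paragraph above.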
161
 
162
  #### 5. Level 2 requires everything to be right simultaneously
163