Update app/src/content/chapters/folding/08-ablations.mdx

by nepyope - opened Mar 26

base: refs/heads/main ← from: refs/pr/9


app/src/content/chapters/folding/08-ablations.mdx +16 -17

@@ -9,8 +9,7 @@ import sarmEp2200 from "../../assets/image/lerobot-data-collection_level12_rac_2
 import Stack from "../../../components/Stack.astro";
 ## Experiments
-We ran 11 experiments to understand what *actually* matters. **Series 1** trains from pretrained base checkpoints on the full dataset. **Series 2** fine-tunes Series 1 checkpoints on curated high-quality data (2.1–2.4 from 1.3, 2.5 from 1.7). One early lesson: **undertraining makes the policy shaky**; make sure your model has converged before drawing conclusions.
 <Wide>
@@ -18,15 +17,15 @@ We ran 11 experiments to understand what *actually* matters. **Series 1** trains
 |:---:|:---:|:---|:---:|:---:|:---|
 | 1.1 | π0 | All data | 200k | MEAN_STD | Baseline |
 | 1.2 | π0.5 | All data | 200k | MEAN_STD | Baseline |
-| 1.3 | π0.5 | All data | 200k | QUANTILES | Δ (Delta) Action |
 | 1.4 | π0.5 | All data | 200k | MEAN_STD | Reward model (SARM) with RABC κ=0.01 |
 | 1.5 | π0.5 | All data | 200k | MEAN_STD | Reward model (SARM) with RABC κ=0.0215 |
-| 1.7 | π0.5 | All data | 200k | QUANTILES | Δ (Delta) Action + Reward model (SARM) with RABC κ=0.0215 |
 | 2.1 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 |
-| 2.2 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 + Reward model (SARM) with RABC κ=0.0265 + Δ (Delta) Action |
-| 2.3 | π0.5 | High-quality + mirrored | 100k | QUANTILES | Fine-tune from 1.3 + Δ (Delta) Action + image transforms + mirroring setup (data augmentation) |
 | 2.4 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 · chunk=45 |
-| 2.5 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.7 + Reward model (SARM) with RABC κ=0.0265 + Δ (Delta) Action |
 </Wide>
@@ -47,7 +46,7 @@ With an action queue size of 30 and max action horizon of 20. RTC gave us a ~2x
 Before diving into the experiments further, let's introduce a key ingredient: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). SARM is a trained reward model that scores trajectories based on how well the robot is progressing toward task completion; it acts as a learned "critic" that predicts whether things are going well or badly.
-SARM is trained on our demonstration data to predict 0-1 task progression. The key insight: it correctly identifies **mistakes** (drops in value) and **progress** (increases) in real time.
 <Wide>
 <Stack layout="3-column" gap="small">
@@ -63,7 +62,7 @@ We use SARM exclusively for **RABC** (Reward-Advantage-Based Conditioning): it s
 ### Results Overview
-Now let's look at how each experiment actually performed. The charts below show success rates, scores, completion times, and failure modes across all 11 experiments. The pattern is consistent: **Series 2 dominates Series 1**, and within each series, RABC combined with delta actions produces the best results. Explore the charts, then we break down the key findings below.
 <HtmlEmbed
   id="success-rates"
@@ -90,7 +89,7 @@ Total score captures partial progress that binary success rate misses. Even fail
   desc="Each bubble is one experiment. X-axis = completion time (faster is left), Y-axis = fold quality (higher is better), bubble size = total success rate. The best experiments cluster in the top-left corner."
 />
-Speed and quality correlate strongly with data quality. Series 2 experiments fold 2-3x faster than Series 1 (40s vs 100s+), and fold quality only breaks past 3.0 with high-quality training data. Faster isn't a separate goal from better; it's a consequence of the policy learning a clear, unambiguous strategy.
 <HtmlEmbed
   id="subtask-heatmap"
@@ -103,7 +102,7 @@ The heatmap shows where time is spent. Series 1 experiments are slow across the
 ### Where the policies fail
-Before interpreting success rates, it helps to understand *how* each experiment fails, not just whether it fails.
 <HtmlEmbed
   id="failure-analysis"
@@ -137,28 +136,28 @@ We hypothesise that the root cause is the difference in **multi-modality** betwe
   Define the exact task protocol before collecting data. Speed is secondary to consistency and clarity of intent at every step.
 </Note>
-#### 2. Delta actions improve performance consistently
-Comparing π0.5 without delta actions (1.2: 20% total SR, 40% L1) to π0.5 with delta actions and quantile normalization (1.3: 35% total SR, 70% L1), and then to the full combination in 1.7 (40% total SR, 80% L1), shows that training with delta actions consistently improves performance. The trend is clear and shows up in every comparison we made.
-The effect size doesn't separate cleanly at 20 rollouts, but the direction is consistent. **Caveat:** π0.5 is likely pretrained with delta actions, so 1.3 and 1.7 fine-tune in a regime consistent with pretraining, while 1.2 fine-tunes against it.
 #### 3. RABC helps especially on long tasks like level 2
 RABC on high-quality data produces the two best results overall: 2.2 and 2.5 clearly separate from experiments without it. The effect is strongest on **Level 2**, the longer and harder task — 2.2 reaches 50% L2 SR and 2.5 reaches 80%, while every experiment without RABC on clean data stays at 0%.
 #### 4. Fine-tuning from a strong checkpoint is the winning recipe
-The best results share the same recipe: fine-tune a Series 1 checkpoint on curated high-quality data with RABC and delta actions.
 | Experiment | Total SR | L1 SR | L2 SR | Recipe |
 |:---:|:---:|:---:|:---:|:---|
 | 2.5 | **90%** | **100%** | **80%** | 1.7 → HQ + RABC, 100k steps |
 | 2.2 | 75% | 100% | 50% | 1.3 → HQ + RABC, 100k steps |
-| 1.7 | 40% | 80% | 0% | All data, ΔActions + RABC + QUANTILES |
 The jump from Series 1 to Series 2 is unambiguous in the statistical analysis — 2.5 and 2.2 clearly separate from the Series 1 group. The Series 1 checkpoint already knows how to fold shirts in general, the high-quality data teaches the correct protocol, and RABC emphasizes the best demonstrations within an already clean dataset.
-Both 2.2 and 2.5 were trained for 100k steps. 2.2 fine-tunes from 1.3 while 2.5 fine-tunes from 1.7 (the stronger base). The difference (75% → 90%) likely reflects this stronger starting point. They don't separate from each other in the pairwise tests, suggesting the recipe itself (HQ + RABC + ΔActions) is the key ingredient, with the base checkpoint providing an additional boost.
 #### 5. Level 2 requires everything to be right simultaneously

 import Stack from "../../../components/Stack.astro";
 ## Experiments
+We ran 11 ablation experiments to isolate which variables actually drove performance. **Series 1** trains from pretrained base checkpoints on the full dataset. **Series 2** fine-tunes Series 1 checkpoints on curated high-quality data. One early lesson: **undertraining makes the policy shaky**, so make sure your model has converged before drawing conclusions.
 <Wide>
 |:---:|:---:|:---|:---:|:---:|:---|
 | 1.1 | π0 | All data | 200k | MEAN_STD | Baseline |
 | 1.2 | π0.5 | All data | 200k | MEAN_STD | Baseline |
+| 1.3 | π0.5 | All data | 200k | QUANTILES | Relative Action |
 | 1.4 | π0.5 | All data | 200k | MEAN_STD | Reward model (SARM) with RABC κ=0.01 |
 | 1.5 | π0.5 | All data | 200k | MEAN_STD | Reward model (SARM) with RABC κ=0.0215 |
+| 1.7 | π0.5 | All data | 200k | QUANTILES | Relative Action + Reward model (SARM) with RABC κ=0.0215 |
 | 2.1 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 |
+| 2.2 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 + Reward model (SARM) with RABC κ=0.0265 + Relative Action |
+| 2.3 | π0.5 | High-quality + mirrored | 100k | QUANTILES | Fine-tune from 1.3 + Relative Action + image transforms + mirroring setup (data augmentation) |
 | 2.4 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.3 · chunk=45 |
+| 2.5 | π0.5 | High-quality only | 100k | QUANTILES | Fine-tune from 1.7 + Reward model (SARM) with RABC κ=0.0265 + Relative Action |
 </Wide>
 Before diving into the experiments further, let's introduce a key ingredient: **[SARM](https://huggingface.co/docs/lerobot/sarm)** (Stage-Aware Reward Modeling). SARM is a trained reward model that scores trajectories based on how well the robot is progressing toward task completion; it acts as a learned "critic" that predicts whether things are going well or badly.
+SARM is trained on our demonstration data to predict 0-1 task progression: it correctly identifies **mistakes** (drops in value) and **progress** (increases) in real time.
 <Wide>
 <Stack layout="3-column" gap="small">
 ### Results Overview
+Now let's look at how each experiment actually performed. The charts below show success rates, scores, completion times, and failure modes across all 11 experiments. The pattern is consistent: **Series 2 dominates Series 1**, and within each series, RABC combined with relative actions produces the best results. Explore the charts, then we break down the key findings below.
 <HtmlEmbed
   id="success-rates"
   desc="Each bubble is one experiment. X-axis = completion time (faster is left), Y-axis = fold quality (higher is better), bubble size = total success rate. The best experiments cluster in the top-left corner."
 />
+Speed and quality correlate strongly with data quality. Series 2 experiments fold 2-3x faster than Series 1 (40s vs 100s+), and fold quality only breaks past 3.0 with high-quality training data. The increases in speed and quality are both consequences of the policy learning a clear, unambiguous strategy.
 <HtmlEmbed
   id="subtask-heatmap"
 ### Where the policies fail
+Before interpreting success rates, it helps to understand *how* each experiment fails, not just whether it fails.
 <HtmlEmbed
   id="failure-analysis"
   Define the exact task protocol before collecting data. Speed is secondary to consistency and clarity of intent at every step.
 </Note>
+#### 2. Relative actions improve performance consistently
+Comparing π0.5 without relative actions (1.2: 20% total SR, 40% L1) to π0.5 with relative actions and quantile normalization (1.3: 35% total SR, 70% L1), and then to the full combination in 1.7 (40% total SR, 80% L1), shows that training with relative actions consistently improves performance. The trend is clear and shows up in every comparison we made.
+The effect size doesn't separate cleanly at 20 rollouts, but the direction is consistent. **Caveat:** π0.5 is likely pretrained with relative actions, so 1.3 and 1.7 fine-tune in a regime consistent with pretraining, while 1.2 fine-tunes against it.
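To make "relative actions" concrete, here is a minimal sketch of one common way to build them: each absolute target in an action chunk is re-expressed as an offset from the robot state at the start of the chunk. This is an illustration under assumptions, not the training code used for these experiments. The helper names `to_relative` and `to_absolute` are hypothetical, and whether offsets are taken against the chunk's starting state (as here) or against the previous step is not stated in the text.

```python
# Hypothetical helpers illustrating relative (delta) actions.
# Assumption: offsets are measured from the state at the start of the chunk.

def to_relative(chunk, state):
    """Convert absolute joint targets into offsets from the current state.

    chunk: list of action steps, each a list of absolute joint targets
    state: list of current joint positions (same length as each step)
    """
    return [[a - s for a, s in zip(step, state)] for step in chunk]


def to_absolute(rel_chunk, state):
    """Invert to_relative: add the reference state back onto each offset."""
    return [[d + s for d, s in zip(step, state)] for step in rel_chunk]


state = [0.5, -1.0]                   # current joint positions
chunk = [[0.75, -1.0], [1.0, -0.5]]   # two future absolute targets

rel = to_relative(chunk, state)       # [[0.25, 0.0], [0.5, 0.5]]
restored = to_absolute(rel, state)    # identical to chunk
```

At inference time the policy would predict the relative chunk and the inverse transform would be applied before sending targets to the robot.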
 #### 3. RABC helps especially on long tasks like level 2
 RABC on high-quality data produces the two best results overall: 2.2 and 2.5 clearly separate from experiments without it. The effect is strongest on **Level 2**, the longer and harder task — 2.2 reaches 50% L2 SR and 2.5 reaches 80%, while every experiment without RABC on clean data stays at 0%.
 #### 4. Fine-tuning from a strong checkpoint is the winning recipe
+The best results share the same recipe: fine-tune a Series 1 checkpoint on curated high-quality data with RABC and relative actions.
 | Experiment | Total SR | L1 SR | L2 SR | Recipe |
 |:---:|:---:|:---:|:---:|:---|
 | 2.5 | **90%** | **100%** | **80%** | 1.7 → HQ + RABC, 100k steps |
 | 2.2 | 75% | 100% | 50% | 1.3 → HQ + RABC, 100k steps |
+| 1.7 | 40% | 80% | 0% | All data, Relative Actions + RABC + QUANTILES |
 The jump from Series 1 to Series 2 is unambiguous in the statistical analysis — 2.5 and 2.2 clearly separate from the Series 1 group. The Series 1 checkpoint already knows how to fold shirts in general, the high-quality data teaches the correct protocol, and RABC emphasizes the best demonstrations within an already clean dataset.
+Both 2.2 and 2.5 were trained for 100k steps. 2.2 fine-tunes from 1.3 while 2.5 fine-tunes from 1.7 (the stronger base). The difference (75% → 90%) likely reflects this stronger starting point. They don't separate from each other in the pairwise tests, suggesting the recipe itself (HQ + RABC + Relative Actions) is the key ingredient, with the base checkpoint providing an additional boost.
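As a rough check on why 2.2 and 2.5 do not separate while both separate from Series 1, the comparison can be sketched with a two-sided Fisher's exact test. The text does not name the pairwise test that was used, so Fisher's test is an assumption here, and the success counts are reconstructed from the reported percentages assuming 20 rollouts per experiment (15/20, 18/20, and 8/20). The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(s1, n1, s2, n2):
    """Two-sided Fisher's exact test for two success counts.

    s1, s2: successes in each experiment; n1, n2: rollouts in each.
    Returns the probability of a split at least as unlikely as the
    observed one, with both margins held fixed.
    """
    total_s = s1 + s2
    lo = max(0, total_s - n2)
    hi = min(n1, total_s)
    # Unnormalised hypergeometric weights, kept as exact integers.
    weights = {k: comb(n1, k) * comb(n2, total_s - k) for k in range(lo, hi + 1)}
    observed = weights[s1]
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / sum(weights.values())


# Counts assumed from the reported percentages at 20 rollouts each.
p_22_vs_25 = fisher_exact_two_sided(15, 20, 18, 20)  # 75% vs 90%, about 0.41
p_17_vs_25 = fisher_exact_two_sided(8, 20, 18, 20)   # 40% vs 90%, well below 0.01
```

Under these assumptions, the 75% vs 90% gap is well within what 20 rollouts can produce by chance, while the gap between 1.7 and 2.5 is not.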
 #### 5. Level 2 requires everything to be right simultaneously