pepijn223 HF Staff committed on
Commit
8bf1f05
·
unverified ·
1 Parent(s): f0f3d44

Add statistical significance caveat to experiment analysis

Clarify that visible gaps between bars may reflect rollout stochasticity,
not true performance differences. Add caveat that non-significance does
not imply identical policies — finer metrics can still reveal behavioral
differences.

Made-with: Cursor

app/src/content/chapters/folding/08-ablations.mdx CHANGED
@@ -196,7 +196,9 @@ Now that the full journey is clear, here is the complete picture across all 11 e
196
  desc="Bars show success rate for the selected level; balloons show Bayesian Beta posterior. Letters above each bar are CLD groups from STEP sequential testing (Bonferroni α=0.10/55). Shared letter means not significantly different. Toggle between Total, Level 1, and Level 2 to re-sort and update. Hover for details."
197
  />
198
 
199
- Total score captures partial progress that binary success rate misses: even failed rollouts earn credit for completed subtasks. The CLD letters (from **[STEP](https://tri-ml.github.io/step/)** sequential testing with **Bonferroni correction**, α = 0.10, per-pair α < 0.0018, n<sub>max</sub> = 50, following [TRI's statistical evaluation framework](https://medium.com/toyotaresearch/statistical-thinking-for-robot-policy-evaluation-from-rigorous-a-b-testing-to-effective-0ae886fbd68d)) tell a clear story: **fine-tuning separates from initial training**. Experiments 2.5 and 2.2 form group 'a', statistically distinct from the bulk of initial-training experiments which cluster in groups 'c' and 'd'. Within initial training, no experiment separates from another, meaning none of the algorithmic changes (RABC, relative actions, normalization) produced a statistically detectable improvement on their own when trained on the full dataset. The jump only comes from switching to curated high-quality data in fine-tuning. Toggle to Level 2 and the separation is even starker: every initial-training experiment sits in group 'b' with 0% success, while 2.5 stands alone in group 'a'.
 
 
200
 
201
  Success rate isn't the full picture. A policy that succeeds but takes twice as long or produces sloppy folds isn't really better. The heatmap below breaks down per-subtask timing and fold quality across experiments, revealing which policies are not just more reliable but also faster and more precise:
202
 
 
196
  desc="Bars show success rate for the selected level; balloons show Bayesian Beta posterior. Letters above each bar are CLD groups from STEP sequential testing (Bonferroni α=0.10/55). Shared letter means not significantly different. Toggle between Total, Level 1, and Level 2 to re-sort and update. Hover for details."
197
  />
198
 
199
+ Total score captures partial progress that binary success rate misses: even failed rollouts earn credit for completed subtasks. The CLD letters (from **[STEP](https://tri-ml.github.io/step/)** sequential testing with **Bonferroni correction**, α = 0.10, per-pair α ≈ 0.0018, n<sub>max</sub> = 50, following [TRI's statistical evaluation framework](https://medium.com/toyotaresearch/statistical-thinking-for-robot-policy-evaluation-from-rigorous-a-b-testing-to-effective-0ae886fbd68d)) tell a clear story: **fine-tuning separates from initial training**. Experiments 2.5 and 2.2 form group 'a', statistically distinct from the bulk of initial-training experiments, which cluster in groups 'c' and 'd'. Given the stochasticity in rollouts, not every visible gap between bars reflects a true performance difference. Within initial training, no experiment separates from another, meaning the algorithmic changes (RABC, relative actions, normalization) did not produce a statistically detectable improvement on their own when trained on the full dataset. The jump comes only from switching to curated high-quality data in fine-tuning. Toggle to Level 2 and the separation is even starker: every initial-training experiment sits in group 'b' with 0% success, while 2.5 stands alone in group 'a'.
200
+
201
+ An important caveat: a non-significant result does not mean the policies are identical. Collecting more rollouts (up to the n<sub>max</sub> = 50 cap) might still separate them, and more granular metrics like partial task progress or the failure-mode and timing analyses below can reveal meaningful behavioral differences even when success rates overlap.
202
 
203
  Success rate isn't the full picture. A policy that succeeds but takes twice as long or produces sloppy folds isn't really better. The heatmap below breaks down per-subtask timing and fold quality across experiments, revealing which policies are not just more reliable but also faster and more precise:
204
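The Bonferroni arithmetic quoted in the changed paragraph (α = 0.10 split over the 55 pairs formed by 11 experiments) and the caveat about overlapping success rates can be illustrated with a minimal sketch. This is not the STEP procedure: the function names, the uniform Beta(1, 1) prior, and the rollout counts are hypothetical and chosen only for illustration.

```python
import math
import random


def bonferroni_per_pair_alpha(family_alpha: float, n_experiments: int) -> float:
    """Split a family-wise alpha evenly over all pairwise comparisons."""
    n_pairs = math.comb(n_experiments, 2)  # 11 experiments -> 55 pairs
    return family_alpha / n_pairs


def prob_a_beats_b(succ_a: int, n_a: int, succ_b: int, n_b: int,
                   draws: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(p_a > p_b) under independent Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each success rate (an assumption,
    not something stated in the chapter).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(1 + succ_a, 1 + n_a - succ_a)
        p_b = rng.betavariate(1 + succ_b, 1 + n_b - succ_b)
        wins += p_a > p_b
    return wins / draws


# Per-pair threshold for alpha = 0.10 across 11 experiments.
per_pair = bonferroni_per_pair_alpha(0.10, 11)
print(f"per-pair alpha: {per_pair:.5f}")

# Hypothetical counts: 12/20 vs 9/20 successes. The bars differ visibly,
# yet the posterior still leaves real doubt about which policy is better.
print(f"P(A > B): {prob_a_beats_b(12, 20, 9, 20):.2f}")
```

A visible gap between two bars can therefore coexist with a posterior that is far from decisive, which is the situation the added caveat describes.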