robot-folding

pepijn223 HF Staff committed on Mar 31

Commit 32be670 (unverified) · 1 Parent(s): 255daaf

Switch to Clopper-Pearson CIs and STEP sequential testing

- Replace Wilson CIs with exact Clopper-Pearson intervals in success
rate chart
- Switch from Barnard's exact test to STEP (TRI) for pairwise
statistical comparisons with Bonferroni correction (n_max=50)
- Recompute CLD assignments using STEP
- Remove separate "Sort by" controls; auto-sort by selected metric
- Add Leandro von Werra to authors and acknowledgments
- Link STEP project page and TRI blog throughout

Made-with: Cursor

Files changed (6)

app/src/content/article.mdx +3 -0
app/src/content/chapters/folding/07-evaluation.mdx +1 -1
app/src/content/chapters/folding/08-ablations.mdx +1 -1
app/src/content/chapters/folding/12-references.mdx +1 -0
app/src/content/embeds/folding/statistical-analysis.html +7 -7
app/src/content/embeds/folding/success-rates.html +56 -35

app/src/content/article.mdx CHANGED

@@ -30,6 +30,9 @@ authors:
   - name: "Virgile Batto"
     url: "https://huggingface.co/VirgileBatto"
     affiliations: [1]
   - name: "Thomas Wolf"
     url: "https://huggingface.co/thomwolf"
     affiliations: [1]

   - name: "Virgile Batto"
     url: "https://huggingface.co/VirgileBatto"
     affiliations: [1]
+  - name: "Leandro von Werra"
+    url: "https://huggingface.co/lvwerra"
+    affiliations: [1]
   - name: "Thomas Wolf"
     url: "https://huggingface.co/thomwolf"
     affiliations: [1]

app/src/content/chapters/folding/07-evaluation.mdx CHANGED

@@ -62,4 +62,4 @@ Scores are summed across all rollouts in an experiment. With 10 L1 rollouts (max
 ### Statistical uncertainty
-With 20 rollouts per experiment, even large apparent differences can be statistically indistinguishable. We report **Wilson 90% confidence intervals** on all success rates and run formal pairwise significance tests. Running 50-100 rollouts per experiment would give tighter estimates but was not feasible for us across 11 experiments on real hardware.


 ### Statistical uncertainty
+With 20 rollouts per experiment, even large apparent differences can be statistically indistinguishable. We report Clopper-Pearson 90% confidence intervals on all success rates and run formal pairwise significance tests using [STEP](https://tri-ml.github.io/step/) (Sequential Testing for Efficient Policy comparison). Running 50-100 rollouts per experiment would give tighter estimates but was not feasible for us across 11 experiments on real hardware.
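
Reviewer note: the Clopper-Pearson interval named in the new text can also be obtained by inverting the binomial tail probabilities directly. The sketch below is not the code from this commit, which goes through the regularized incomplete beta function in `success-rates.html`. The names `binomPmf`, `binomTailGE`, `bisect`, `clopperPearsonExact`, and `wilsonInterval` are illustrative, and the sketch assumes a two-sided 90% interval with α/2 in each tail.

```javascript
// Binomial probability mass P(X = i) for X ~ Binomial(n, p).
function binomPmf(i, n, p) {
  let coef = 1; // C(n, i), built multiplicatively to avoid factorials
  for (let j = 1; j <= i; j++) coef = coef * (n - i + j) / j;
  return coef * Math.pow(p, i) * Math.pow(1 - p, n - i);
}

// Upper tail P(X >= k).
function binomTailGE(k, n, p) {
  let total = 0;
  for (let i = k; i <= n; i++) total += binomPmf(i, n, p);
  return total;
}

// Bisection on [0, 1] for a monotone function f, solving f(p) = target.
function bisect(f, target, increasing) {
  let lo = 0, hi = 1;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    const moveUp = increasing ? f(mid) < target : f(mid) > target;
    if (moveUp) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

// Exact interval for k successes in n trials, as proportions in [0, 1].
function clopperPearsonExact(k, n, alpha = 0.10) {
  // Lower bound: the p at which P(X >= k) equals alpha / 2.
  const lo = k === 0 ? 0 : bisect(p => binomTailGE(k, n, p), alpha / 2, true);
  // Upper bound: the p at which P(X <= k) equals alpha / 2.
  const hi = k === n ? 1 : bisect(p => 1 - binomTailGE(k + 1, n, p), alpha / 2, false);
  return { lo, hi };
}

// Wilson score interval, same formula as the removed wilson() helper.
function wilsonInterval(k, n, z = 1.645) {
  const p = k / n;
  const denom = 1 + z * z / n;
  const center = (p + z * z / (2 * n)) / denom;
  const margin = (z / denom) * Math.sqrt(p * (1 - p) / n + z * z / (4 * n * n));
  return { lo: Math.max(0, center - margin), hi: Math.min(1, center + margin) };
}

// Compare both intervals for 18 successes out of 20 rollouts.
console.log(clopperPearsonExact(18, 20), wilsonInterval(18, 20));
```

The exact interval guarantees at least nominal coverage and is therefore usually wider than the Wilson interval at the same level, which matters at n = 10 or n = 20.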

app/src/content/chapters/folding/08-ablations.mdx CHANGED

@@ -114,7 +114,7 @@ Before interpreting success rates, it helps to understand *how* each experiment
 ### Which differences are real?
-With 20 rollouts per experiment, not every visible gap is real. We run **Barnard's exact test** on all 55 pairs with **Bonferroni correction** (α = 0.10, per-pair p < 0.0018), following [TRI's statistical evaluation framework](https://medium.com/toyotaresearch/statistical-thinking-for-robot-policy-evaluation-from-rigorous-a-b-testing-to-effective-0ae886fbd68d). The chart below shows the full **Bayesian Beta posterior** over each policy's true success rate. **CLD letters** above each violin indicate which experiments are statistically separable, policies sharing a letter are not significantly different.
 <HtmlEmbed
   id="statistical-analysis"

 ### Which differences are real?
+With 20 rollouts per experiment, not every visible gap is real. We run **[STEP](https://tri-ml.github.io/step/)** (Sequential Testing for Efficient Policy comparison) on all 55 pairs with **Bonferroni correction** (α = 0.10, per-pair α < 0.0018, max sample size = 50), following [TRI's statistical evaluation framework](https://medium.com/toyotaresearch/statistical-thinking-for-robot-policy-evaluation-from-rigorous-a-b-testing-to-effective-0ae886fbd68d). Unlike batch tests, STEP allows us to collect additional rollouts later without p-hacking concerns; we set the maximum sample size to 50, leaving room for future data. The chart below shows the full **Bayesian Beta posterior** over each policy's true success rate. **CLD letters** above each violin indicate which experiments are statistically separable; policies sharing a letter are not significantly different.
 <HtmlEmbed
   id="statistical-analysis"
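
Reviewer note: the "55 pairs" and the per-pair threshold of roughly 0.0018 follow from the 11 experiments mentioned in the evaluation chapter. A minimal sketch of that arithmetic, with `pairCount` and `bonferroniPerPairAlpha` as illustrative names that do not appear in the repository:

```javascript
// Number of unordered pairs among m policies: m choose 2.
function pairCount(m) {
  return (m * (m - 1)) / 2;
}

// Bonferroni: split the family-wise error budget evenly across all comparisons.
function bonferroniPerPairAlpha(familyAlpha, numComparisons) {
  return familyAlpha / numComparisons;
}

const numPairs = pairCount(11);
const perPairAlpha = bonferroniPerPairAlpha(0.10, numPairs);
console.log(numPairs, perPairAlpha.toFixed(4)); // prints: 55 0.0018
```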

app/src/content/chapters/folding/12-references.mdx CHANGED

@@ -17,6 +17,7 @@ This project would not have been possible without the contributions and support
   <HfUser username="nepyope" name="Martino Russi" />
   <HfUser username="Nico-robot" name="Nicolas Rabault" />
   <HfUser username="VirgileBatto" name="Virgile Batto" />
   <HfUser username="thomwolf" name="Thomas Wolf" />
 </div>

   <HfUser username="nepyope" name="Martino Russi" />
   <HfUser username="Nico-robot" name="Nicolas Rabault" />
   <HfUser username="VirgileBatto" name="Virgile Batto" />
+  <HfUser username="lvwerra" name="Leandro von Werra" />
   <HfUser username="thomwolf" name="Thomas Wolf" />
 </div>

app/src/content/embeds/folding/statistical-analysis.html CHANGED

@@ -34,7 +34,7 @@ svg text{font-family:system-ui,sans-serif}
 <body>
 <div class="wrap">
 <div class="insight purple">
-  <strong>What this shows:</strong> Each violin is the Bayesian posterior distribution over a policy's true success rate, given the observed rollouts. Wide = high uncertainty; narrow = high certainty. <strong>CLD letters above</strong> summarise which policies are statistically separable (Bonferroni-corrected). Shared letter → not significantly different.
 </div>
 <div class="card">
   <div class="card-head">
@@ -49,10 +49,10 @@ svg text{font-family:system-ui,sans-serif}
   <div class="legend">
     <div class="li"><div class="lsw" style="background:#f7934f"></div>Series 1</div>
     <div class="li"><div class="lsw" style="background:#4dc98a"></div>Series 2</div>
-    <div class="li" style="font-size:10px;color:#8b8fa8">Bold letter = CLD group (Bonferroni α=0.10/55)</div>
   </div>
 </div>
-<p class="note">Violins represent posterior uncertainty, not confidence intervals. Two overlapping violins can still be statistically distinct. Barnard's exact test with Bonferroni correction for 55 pairwise comparisons; CLD groups computed at α=0.10 per-pair threshold (p&lt;0.0018).</p>
 </div>
 <div class="tooltip" id="tooltip"></div>
@@ -75,11 +75,11 @@ const DATA = {
   '2.5 HQ+RABC+Rel★': {total:[18,20], L1:[10,10], L2:[8,10],  series:2},
 };
-// CLD assignments (Barnard's exact test, two-sided, Bonferroni α=0.10/55)
 const CLD = {
-  total: {'2.5 HQ+RABC+Rel★':'a','2.2 HQ+RABC+Rel':'ab','1.1 π0':'bc','1.7 Rel+RABC':'bc','2.1 HQ':'bc','1.3 Relative':'bc','1.2 π0.5':'c','2.4 HQ chunk45':'c','1.4 RABC low':'c','2.3 HQ+mirror':'c','1.5 RABC high':'c'},
-  L1:   {'2.2 HQ+RABC+Rel':'a','2.5 HQ+RABC+Rel★':'a','1.1 π0':'ab','1.7 Rel+RABC':'ab','1.3 Relative':'ab','2.1 HQ':'ab','1.2 π0.5':'abc','2.4 HQ chunk45':'abc','1.4 RABC low':'bc','1.5 RABC high':'c','2.3 HQ+mirror':'c'},
-  L2:   {'2.5 HQ+RABC+Rel★':'a','2.2 HQ+RABC+Rel':'ab','2.1 HQ':'b','2.3 HQ+mirror':'b','1.1 π0':'b','1.2 π0.5':'b','1.3 Relative':'b','1.4 RABC low':'b','1.5 RABC high':'b','1.7 Rel+RABC':'b','2.4 HQ chunk45':'b'},
 };
 // ── BETA DISTRIBUTION PDF ────────────────────────────────────────────────────

 <body>
 <div class="wrap">
 <div class="insight purple">
+  <strong>What this shows:</strong> Each violin is the Bayesian posterior distribution over a policy's true success rate, given the observed rollouts. Wide = high uncertainty; narrow = high certainty. <strong>CLD letters above</strong> summarise which policies are statistically separable (<a href="https://tri-ml.github.io/step/" style="color:#818cf8">STEP</a>, Bonferroni-corrected). Shared letter → not significantly different.
 </div>
 <div class="card">
   <div class="card-head">
   <div class="legend">
     <div class="li"><div class="lsw" style="background:#f7934f"></div>Series 1</div>
     <div class="li"><div class="lsw" style="background:#4dc98a"></div>Series 2</div>
+    <div class="li" style="font-size:10px;color:#8b8fa8">Bold letter = CLD group (STEP, Bonferroni α=0.10/55, n<sub>max</sub>=50)</div>
   </div>
 </div>
+<p class="note">Violins represent posterior uncertainty, not confidence intervals. Two overlapping violins can still be statistically distinct. <a href="https://tri-ml.github.io/step/" style="color:#555e7a">STEP</a> sequential test with Bonferroni correction for 55 pairwise comparisons (α=0.10, per-pair α&lt;0.0018, n<sub>max</sub>=50).</p>
 </div>
 <div class="tooltip" id="tooltip"></div>
   '2.5 HQ+RABC+Rel★': {total:[18,20], L1:[10,10], L2:[8,10],  series:2},
 };
+// CLD assignments (STEP sequential test, two-sided, Bonferroni α=0.10/55, n_max=50)
 const CLD = {
+  total: {'2.5 HQ+RABC+Rel★':'a','2.2 HQ+RABC+Rel':'ab','1.1 π0':'bc','1.7 Rel+RABC':'bc','2.1 HQ':'bc','1.3 Relative':'bc','1.2 π0.5':'cd','2.4 HQ chunk45':'cd','1.4 RABC low':'cd','2.3 HQ+mirror':'cd','1.5 RABC high':'d'},
+  L1:   {'2.2 HQ+RABC+Rel':'a','2.5 HQ+RABC+Rel★':'a','1.1 π0':'a','1.7 Rel+RABC':'a','1.3 Relative':'a','2.1 HQ':'a','1.2 π0.5':'ab','2.4 HQ chunk45':'ab','1.4 RABC low':'ab','1.5 RABC high':'b','2.3 HQ+mirror':'b'},
+  L2:   {'2.5 HQ+RABC+Rel★':'a','2.2 HQ+RABC+Rel':'ab','2.1 HQ':'ab','2.3 HQ+mirror':'ab','1.1 π0':'b','1.2 π0.5':'b','1.3 Relative':'b','1.4 RABC low':'b','1.5 RABC high':'b','1.7 Rel+RABC':'b','2.4 HQ chunk45':'b'},
 };
 // ── BETA DISTRIBUTION PDF ────────────────────────────────────────────────────
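
Reviewer note: the `CLD` strings are read by checking for a shared letter; two policies count as statistically separable only when their strings have no letter in common. A small sketch of that rule, using four of the new `total` assignments from this diff. The helpers `sharesLetter` and `separable` are hypothetical and not part of the embed.

```javascript
// True if the two CLD strings have at least one letter in common.
function sharesLetter(a, b) {
  return [...a].some(ch => b.includes(ch));
}

// Policies are separable only when their CLD groups do not overlap.
function separable(cld, p, q) {
  return !sharesLetter(cld[p], cld[q]);
}

// Subset of the new `total` assignments (STEP, Bonferroni-corrected).
const cldTotal = {
  '2.5 HQ+RABC+Rel★': 'a',
  '2.2 HQ+RABC+Rel': 'ab',
  '1.1 π0': 'bc',
  '1.5 RABC high': 'd',
};

console.log(separable(cldTotal, '2.5 HQ+RABC+Rel★', '1.1 π0'));  // prints: true
console.log(separable(cldTotal, '2.2 HQ+RABC+Rel', '1.1 π0'));   // prints: false
```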

app/src/content/embeds/folding/success-rates.html CHANGED

@@ -23,13 +23,6 @@
   .toggle-btn.active-total { background: rgba(167,139,250,0.18); border-color: #a78bfa; color: #a78bfa; }
   .toggle-btn.active-l1    { background: rgba(77,201,138,0.18);  border-color: #4dc98a; color: #4dc98a; }
   .toggle-btn.active-l2    { background: rgba(248,113,113,0.18); border-color: #f87171; color: #f87171; }
-  .divider { width: 1px; height: 22px; background: #2a2d3a; }
-  .sort-btn {
-    background: none; border: 1px solid #2a2d3a; color: #8b8fa8;
-    font-size: 11px; padding: 4px 10px; border-radius: 6px; cursor: pointer; transition: all .15s;
-  }
-  .sort-btn:hover { color: #e8eaf0; border-color: #555; }
-  .sort-btn.active { color: #e8eaf0; border-color: #555; background: #1a1d27; }
   /* ── Legend ── */
   .legend { display: flex; gap: 16px; justify-content: center; flex-wrap: wrap; margin-bottom: 4px; }
@@ -38,7 +31,7 @@
   .legend-dot { width: 11px; height: 11px; border-radius: 2px; }
   .legend-pip { width: 8px; height: 8px; border-radius: 50%; display: inline-block; }
-  /* ── Wilson explanation box ── */
   .ci-note {
     background: rgba(255,255,255,0.03); border: 1px solid #2a2d3a;
     border-radius: 8px; padding: 9px 14px; margin-bottom: 10px;
@@ -78,11 +71,6 @@
   <button class="toggle-btn active-total" id="btn-total" onclick="toggleMetric('total')">Total SR</button>
   <button class="toggle-btn active-l1"    id="btn-l1"    onclick="toggleMetric('l1')">Level 1</button>
   <button class="toggle-btn active-l2"    id="btn-l2"    onclick="toggleMetric('l2')">Level 2</button>
-  <div class="divider"></div>
-  <span class="ctrl-label">Sort by:</span>
-  <button class="sort-btn active" id="sort-total" onclick="setSort('total')">Total</button>
-  <button class="sort-btn"        id="sort-l1"    onclick="setSort('l1')">L1</button>
-  <button class="sort-btn"        id="sort-l2"    onclick="setSort('l2')">L2</button>
 </div>
 <!-- Legend -->
@@ -96,11 +84,11 @@
   </div>
 </div>
-<!-- Wilson CI explanation -->
 <div class="ci-note">
   <span class="ci-note-icon">ℹ</span>
   <span>
-    <strong>Error bars = Wilson 90% CI</strong> (Total: n=20, L1/L2: n=10). Wide bars reflect small sample sizes.
   </span>
 </div>
@@ -144,23 +132,61 @@ const raw = [
   {label:"2.5 HQ+RABC+Rel★", series:"2", total:90, l1:100, l2:80},
 ];
-// ── Wilson 90% CI ──────────────────────────────────────────────────────────
-function wilson(pct, n, z = 1.645) {
-  const p = pct / 100;
-  const denom  = 1 + z * z / n;
-  const center = (p + z * z / (2 * n)) / denom;
-  const margin = (z / denom) * Math.sqrt(p * (1 - p) / n + z * z / (4 * n * n));
-  return {
-    lo: Math.max(0,   (center - margin) * 100),
-    hi: Math.min(100, (center + margin) * 100),
-  };
 }
 // Pre-compute CIs and attach to data
 const data_with_ci = raw.map(d => {
   const out = {...d};
   ["total","l1","l2"].forEach(k => {
-    out[k + "_ci"] = wilson(d[k], N[k]);
   });
   return out;
 });
@@ -175,17 +201,12 @@ window.toggleMetric = function(key) {
   const numActive = Object.values(active).filter(Boolean).length;
   if (active[key] && numActive === 1) return;
   active[key] = !active[key];
   updateLegend();
   render();
 };
-window.setSort = function(key) {
-  sortKey = key;
-  document.querySelectorAll(".sort-btn").forEach(b => b.classList.remove("active"));
-  document.getElementById("sort-" + key).classList.add("active");
-  render();
-};
 function updateLegend() {
   ["total","l1","l2"].forEach(k => {
     document.getElementById("btn-" + k).className = "toggle-btn" + (active[k] ? " active-"+k : "");
@@ -283,7 +304,7 @@ function render() {
             </span>
           </div>
           <div style="margin-top:5px;font-size:10px;color:#555">
-            90% Wilson CI · this metric: ${k}/${n} successes
           </div>
           ${exp.note ? `<div class="tooltip-note">${exp.note}</div>` : ""}
         `);
@@ -293,7 +314,7 @@ function render() {
       })
       .on("mouseleave", () => tooltip.style("opacity", 0));
-    // Error bar (Wilson CI)
     groups.append("line").attr("class","error-bar")
       .attr("x1", cx).attr("x2", cx)
       .attr("y1", d => y(d[key + "_ci"].hi))

   .toggle-btn.active-total { background: rgba(167,139,250,0.18); border-color: #a78bfa; color: #a78bfa; }
   .toggle-btn.active-l1    { background: rgba(77,201,138,0.18);  border-color: #4dc98a; color: #4dc98a; }
   .toggle-btn.active-l2    { background: rgba(248,113,113,0.18); border-color: #f87171; color: #f87171; }
   /* ── Legend ── */
   .legend { display: flex; gap: 16px; justify-content: center; flex-wrap: wrap; margin-bottom: 4px; }
   .legend-dot { width: 11px; height: 11px; border-radius: 2px; }
   .legend-pip { width: 8px; height: 8px; border-radius: 50%; display: inline-block; }
+  /* ── CI explanation box ── */
   .ci-note {
     background: rgba(255,255,255,0.03); border: 1px solid #2a2d3a;
     border-radius: 8px; padding: 9px 14px; margin-bottom: 10px;
   <button class="toggle-btn active-total" id="btn-total" onclick="toggleMetric('total')">Total SR</button>
   <button class="toggle-btn active-l1"    id="btn-l1"    onclick="toggleMetric('l1')">Level 1</button>
   <button class="toggle-btn active-l2"    id="btn-l2"    onclick="toggleMetric('l2')">Level 2</button>
 </div>
 <!-- Legend -->
   </div>
 </div>
+<!-- Clopper-Pearson CI explanation -->
 <div class="ci-note">
   <span class="ci-note-icon">ℹ</span>
   <span>
+    <strong>Error bars = Clopper-Pearson 90% CI</strong> (Total: n=20, L1/L2: n=10). Wide bars reflect small sample sizes.
   </span>
 </div>
   {label:"2.5 HQ+RABC+Rel★", series:"2", total:90, l1:100, l2:80},
 ];
+// ── Clopper-Pearson exact 90% CI ────────────────────────────────────────────
+function lgamma(x) {
+  if (x < 0.5) return Math.log(Math.PI / Math.sin(Math.PI * x)) - lgamma(1 - x);
+  x--;
+  let a = 0.99999999999980993;
+  const c = [676.5203681218851,-1259.1392167224028,771.32342877765313,-176.61502916214059,
+             12.507343278686905,-0.13857109526572012,9.9843695780195716e-6,1.5056327351493116e-7];
+  for (let i = 0; i < 8; i++) a += c[i] / (x + i + 1);
+  const t = x + 8 - 0.5;
+  return 0.5 * Math.log(2 * Math.PI) + (x + 0.5) * Math.log(t) - t + Math.log(a);
+}
+function betaCf(x, a, b) {
+  const qab = a + b, qap = a + 1, qam = a - 1;
+  let c = 1, d = 1 - qab * x / qap;
+  if (Math.abs(d) < 1e-30) d = 1e-30;
+  d = 1 / d; let h = d;
+  for (let m = 1; m <= 200; m++) {
+    const m2 = 2 * m;
+    let aa = m * (b - m) * x / ((qam + m2) * (a + m2));
+    d = 1 + aa * d; if (Math.abs(d) < 1e-30) d = 1e-30;
+    c = 1 + aa / c; if (Math.abs(c) < 1e-30) c = 1e-30;
+    d = 1 / d; h *= d * c;
+    aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2));
+    d = 1 + aa * d; if (Math.abs(d) < 1e-30) d = 1e-30;
+    c = 1 + aa / c; if (Math.abs(c) < 1e-30) c = 1e-30;
+    d = 1 / d; const del = d * c; h *= del;
+    if (Math.abs(del - 1) < 3e-12) break;
+  }
+  return h;
+}
+function betaI(x, a, b) {
+  if (x <= 0) return 0; if (x >= 1) return 1;
+  const bt = Math.exp(lgamma(a+b) - lgamma(a) - lgamma(b) + a*Math.log(x) + b*Math.log(1-x));
+  return x < (a+1)/(a+b+2) ? bt * betaCf(x,a,b) / a : 1 - bt * betaCf(1-x,b,a) / b;
+}
+function betaInv(p, a, b) {
+  let lo = 0, hi = 1;
+  for (let i = 0; i < 100; i++) {
+    const mid = (lo + hi) / 2;
+    if (betaI(mid, a, b) < p) lo = mid; else hi = mid;
+  }
+  return (lo + hi) / 2;
+}
+function clopperPearson(pct, n, alpha = 0.10) {
+  const k = Math.round(pct / 100 * n);
+  const lo = k === 0 ? 0 : betaInv(alpha / 2, k, n - k + 1);
+  const hi = k === n ? 1 : betaInv(1 - alpha / 2, k + 1, n - k);
+  return { lo: lo * 100, hi: hi * 100 };
 }
 // Pre-compute CIs and attach to data
 const data_with_ci = raw.map(d => {
   const out = {...d};
   ["total","l1","l2"].forEach(k => {
+    out[k + "_ci"] = clopperPearson(d[k], N[k]);
   });
   return out;
 });
   const numActive = Object.values(active).filter(Boolean).length;
   if (active[key] && numActive === 1) return;
   active[key] = !active[key];
+  if (active[key]) sortKey = key;
+  else if (sortKey === key) sortKey = ["total","l1","l2"].find(k => active[k]);
   updateLegend();
   render();
 };
 function updateLegend() {
   ["total","l1","l2"].forEach(k => {
     document.getElementById("btn-" + k).className = "toggle-btn" + (active[k] ? " active-"+k : "");
             </span>
           </div>
           <div style="margin-top:5px;font-size:10px;color:#555">
+            90% Clopper-Pearson CI · this metric: ${k}/${n} successes
           </div>
           ${exp.note ? `<div class="tooltip-note">${exp.note}</div>` : ""}
         `);
       })
       .on("mouseleave", () => tooltip.style("opacity", 0));
+    // Error bar (Clopper-Pearson CI)
     groups.append("line").attr("class","error-bar")
       .attr("x1", cx).attr("x2", cx)
       .attr("y1", d => y(d[key + "_ci"].hi))
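
Reviewer note: the auto-sort behaviour that replaces the removed "Sort by" buttons can be checked in isolation. The sketch below restates the two added lines of `toggleMetric` as a pure function. `toggleMetricState` and `METRICS` are illustrative names, and the state shape `{ active, sortKey }` is an assumption based on the variables visible in this diff.

```javascript
const METRICS = ["total", "l1", "l2"];

// Pure version of the toggle handler: returns the next { active, sortKey } state.
function toggleMetricState(state, key) {
  const numActive = Object.values(state.active).filter(Boolean).length;
  // Keep at least one metric visible: ignore a click that would hide the last one.
  if (state.active[key] && numActive === 1) return state;

  const active = { ...state.active, [key]: !state.active[key] };
  let sortKey = state.sortKey;
  if (active[key]) {
    // A metric that was just switched on becomes the sort key.
    sortKey = key;
  } else if (sortKey === key) {
    // The sort metric was switched off: fall back to the first metric still active.
    sortKey = METRICS.find(k => active[k]);
  }
  return { active, sortKey };
}

let state = { active: { total: true, l1: true, l2: true }, sortKey: "total" };
state = toggleMetricState(state, "total");
console.log(state.sortKey); // prints: l1
```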