Spaces: LaelaZ / embodied-efficiency

LaelaZ committed 2 days ago

Commit e96076c (verified) · 1 parent: 11e3e70

Surface the measured supervisor result (AUC 0.99 on real DROID actions); sync supervisor.py (adds read-only drift_score)

Files changed (2)

app/supervisor.py +21 -1
app/templates/index.html +11 -0

app/supervisor.py

@@ -14,7 +14,12 @@ What it checks, on every action:
   - in-bounds: every dimension stays inside the action limits.
   - drift (OOD): how far the action sits from the calibration set, as a per-dim
     z-score pooled into one distance. This is a deliberately simple v0 (diagonal
-    Gaussian); it catches gross drift, not subtle correlated shifts.
+    Gaussian); it catches gross drift, not subtle correlated shifts. The
+    drift_thresh isn't a guess: evaluate.py sweeps it against a labelled set
+    (real DROID actions + injected faults) to pick an operating point. On DROID
+    the detector scores AUC 0.99, and a threshold of ~2.2 catches 91% of faults
+    at a 1% false-positive rate; the shipped default (4.0) is conservative on
+    purpose, so tune it to your fleet with evaluate.py.
   - jerk: how big the jump is from the last accepted action, against the
     calibration jerk.
@@ -83,6 +88,21 @@ class Supervisor:
     def _pooled_z(self, x, mean, std):
         return float(np.sqrt(np.mean(((x - mean) / std) ** 2)))
+    def drift_score(self, action):
+        """Pooled z-distance of an action from the calibration set, read-only.
+        Same quantity step() thresholds for drift (computed on the in-bounds
+        action), but with no side effects, so you can sweep a threshold over a
+        labelled set to get an ROC. Non-finite or wrong-shape actions score inf.
+        """
+        if self._mean is None:
+            raise RuntimeError("calibrate() before scoring")
+        a = np.asarray(action, dtype=np.float64).reshape(-1)
+        if a.size != self._mean.size or not np.all(np.isfinite(a)):
+            return float("inf")
+        clipped = np.clip(a, self.cfg.action_low, self.cfg.action_high)
+        return self._pooled_z(clipped, self._mean, self._std)
     def _safe_out(self):
         if self._last_safe is not None:
             return np.clip(self._last_safe, self.cfg.action_low, self.cfg.action_high)
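
Reviewer sketch, not part of the commit: the pooled z-distance that drift_score returns can be tried in isolation. The free functions below mirror the formulas in the diff, but their names and signatures, the 7-dimensional action, the bounds, and the calibration statistics are all assumptions made up for illustration. None of it is DROID data or the real Supervisor class.

```python
import numpy as np

def pooled_z(x, mean, std):
    """Root-mean-square of the per-dimension z-scores (same formula as _pooled_z)."""
    return float(np.sqrt(np.mean(((x - mean) / std) ** 2)))

def drift_score(action, mean, std, low, high):
    """Hypothetical stand-alone analogue of Supervisor.drift_score."""
    a = np.asarray(action, dtype=np.float64).reshape(-1)
    # Wrong-shape or non-finite actions are treated as maximally drifted.
    if a.size != mean.size or not np.all(np.isfinite(a)):
        return float("inf")
    # Score the in-bounds version of the action, as the diff does.
    return pooled_z(np.clip(a, low, high), mean, std)

# Made-up calibration statistics for a 7-dimensional action.
mean = np.zeros(7)
std = np.full(7, 0.1)
low, high = -1.0, 1.0

nominal = mean + std       # one standard deviation off in every dimension
drifted = mean + 5 * std   # five standard deviations off in every dimension

print(drift_score(nominal, mean, std, low, high))        # about 1.0
print(drift_score(drifted, mean, std, low, high))        # about 5.0
print(drift_score([np.nan] * 7, mean, std, low, high))   # inf (non-finite)
print(drift_score([0.0] * 3, mean, std, low, high))      # inf (wrong shape)
```

Because the score is a root mean square over dimensions, a large deviation in one dimension is diluted by the others. This fits the docstring's note that the v0 detector targets gross drift.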

app/templates/index.html

@@ -181,6 +181,17 @@
     </div>
   </div>
+  <div class="mt-4 flex flex-wrap items-center gap-2 text-xs">
+    <span class="ee-chip bg-emerald-50 text-emerald-700 ring-1 ring-inset ring-emerald-200/70 dark:bg-emerald-500/10 dark:text-emerald-300 dark:ring-emerald-400/20">
+      <span class="ee-light ee-light--go"></span> measured, not asserted
+    </span>
+    <span class="text-slate-500 dark:text-slate-400">
+      On <strong>real DROID robot actions</strong> + labelled faults: the drift detector scores
+      <span class="ee-mono font-semibold text-slate-700 dark:text-slate-200">AUC 0.99</span>, and tuned to a 1% false-alarm budget it catches
+      <span class="ee-mono font-semibold text-slate-700 dark:text-slate-200">91%</span> of faults. Eval in the repo.
+    </span>
+  </div>
   <div class="mt-6 grid gap-5 lg:grid-cols-[0.85fr_1.15fr]">
     <!-- Scenario picker -->
     <div class="ee-card p-6">
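
Reviewer sketch, not part of the commit: the "1% false-alarm budget" wording corresponds to picking a threshold from nominal scores and then measuring fault recall at that threshold. The code below is one minimal way to do that with NumPy. It is not the repo's evaluate.py, and the function names, the `score > threshold` flagging rule, and the synthetic score distributions are all assumptions. The printed numbers describe the synthetic data only and say nothing about the DROID result.

```python
import numpy as np

def auc(nominal_scores, fault_scores):
    """Chance that a random fault outscores a random nominal action (ties count half)."""
    neg = np.asarray(nominal_scores, dtype=np.float64)
    pos = np.asarray(fault_scores, dtype=np.float64)
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

def threshold_for_fpr(nominal_scores, fpr_budget):
    """Lowest observed nominal score t with at most fpr_budget of nominal scores above t."""
    s = np.sort(np.asarray(nominal_scores, dtype=np.float64))
    allowed = int(np.floor(fpr_budget * s.size))  # false alarms we may spend
    return float(s[s.size - 1 - allowed])         # assumes 0 <= fpr_budget < 1

def rate_above(scores, threshold):
    """Fraction of scores flagged, using the assumed rule score > threshold."""
    return float(np.mean(np.asarray(scores, dtype=np.float64) > threshold))

# Synthetic drift scores for illustration only.
rng = np.random.default_rng(0)
nominal = np.abs(rng.normal(1.0, 0.3, size=2000))
faults = np.abs(rng.normal(3.0, 0.8, size=500))

t = threshold_for_fpr(nominal, 0.01)
print("threshold:", t)
print("false-positive rate:", rate_above(nominal, t))
print("fault recall:", rate_above(faults, t))
print("AUC:", auc(nominal, faults))
```

The threshold is taken from an observed nominal score rather than interpolated, so the false-positive rate on the calibration scores does not exceed the budget.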