AEGIS: A Backup Reflex for Physical AI
Interactive demo · Paper (arXiv:2606.06660) · Rollout logs
AEGIS (Activation-probe Early-warning, Gated Inference Switching) is a runtime escalation layer for robot manipulation policies. A cheap probe reads the deployed policy's frozen internal activations as a per-step early-warning signal; when a calibrated gate fires, control switches mid-trajectory to a stronger separate policy, but only for the steps that need it. Both policies stay frozen — the ten-kilobyte probe head in this repo is the only trained component, and it decides when a 4.14B policy wakes up.
The thesis is one sentence: a robot policy can read its own activations as an early-warning signal and call a stronger policy before failure compounds, recovering twice as many failures as matched-budget escalation.
- Paper: https://arxiv.org/abs/2606.06660
- Interactive demo: https://huggingface.co/spaces/kaikaku/aegis-demo
- Author: Josef Chen, KAIKAKU
What it is
Long-horizon failures are slow spirals: one bad step degrades the state, the next compounds it, and the episode is lost long before it ends. Detect-only methods (SAFE, FIPER, Sentinel) see this coming and raise an alarm but never act; recover-within-policy methods (HELM, Pre-VLA, FailSafe) act, but only by asking the same failing policy to try again. AEGIS does what a human supervisor would do: call in someone stronger, at the moment it matters.
The runtime loop: the weak policy (SmolVLA, 450M) drives by default and a forward hook reads its action-expert layer-15 activations live (720-d, mean-pooled over each 10-step action chunk). The probe head scores each step; a split-conformal threshold (α=0.10), an early-harm guard (no escalation before 0.20 T), and a per-episode budget cap (⌈0.05 T⌉ fires) turn the score into a switching decision. On a fire, control hands to π₀.₅ (4.14B) at the next chunk boundary, holds for ≥3 chunks, and returns on hysteresis. Both policies sit warm in one process (~9.5 GB VRAM for the backup).
Key results (measured, from the paper)
All from the confirmatory factorial: LIBERO-Spatial, full 10×70 task×seed common-random-number grid, ≥700 episodes per arm, 646 in the weak-policy-failing conditional pool.
- Recovered-task rate 10.1% on the episodes the weak policy alone loses — vs 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo at the same strong-policy budget and temporal spread (B−C +5.4pp, exact McNemar p=8.5×10⁻⁶; B−D +5.0pp, p=1.0×10⁻⁴; Holm-adjusted, one-sided; all whole-trajectory bootstrap CIs exclude zero).
- Duty cycle 38%: the stronger policy is dormant most of the time; per-episode cost ≈44% of always-strong (parameter-count schematic). The always-strong ceiling recovers 31.9% at ≈4.6× the compute.
- Selectivity cuts both ways: AEGIS recovers 65/646 failures while disrupting only 10/54 of the weak policy's successes (recover:disrupt 6.5, vs 1.8 blind, 3.3 random) — the intervention paradox is the failure mode the gate is built against.
- Early-window AUROC 0.764 (95% cluster-bootstrap CI [0.70, 0.84], n=2,792 episodes) read over the first 30% of steps on the weak-policy path before any handoff — a precondition, not the headline, because accurate prediction does not imply effective prevention.
- Sign-invariant under simulator non-determinism: across 2,000 replicate redraws of the 212 multi-host cells, no primary contrast ever reverses.
- Cross-family generalization: swapping the escalation target to GR00T N1.7 recovers 15.5%, consistent with the effect being a property of escalating to a stronger separate policy, not of one lucky pair.
How to use
The released probe is plain numpy — no framework needed to score risk:
import numpy as np
from huggingface_hub import hf_hub_download
art = np.load(hf_hub_download("kaikaku/aegis", "probe_artifact.npz"))
mu, sd, w, b = art["mu"], art["sd"], art["w"], float(art["b"])
tau = float(art["conformal_threshold"]) # split-conformal, alpha = 0.10
def risk(h): # h: (720,) mean-pooled layer-15 action-expert activations
z = (h - mu) / sd
return 1.0 / (1.0 + np.exp(-(z @ w + b)))
Hook the frozen weak policy (SmolVLA via LeRobot) and run the reflex:
import math
feats = []
layer = policy.model.vlm_with_expert.lm_expert.layers[15].self_attn.o_proj
hook = layer.register_forward_hook(
lambda m, i, o: feats.append(o.detach().float().mean(dim=(0, 1)).cpu().numpy())
)
# sanity check: live activations vary step to step (std > 0.05).
# a frozen cached feature here is exactly the bug that gives AUROC 0.50.
T, H = 520, 10 # horizon, native action-chunk length
t_min = max(int(0.20 * T), 2) # early-harm guard
k_max = math.ceil(0.05 * T) # per-episode budget cap on gate fires
fires, driver = 0, weak_policy
for t in range(T):
a_t = driver.select_action(obs)
s_t = risk(feats[-1]) # the weak forward pass keeps running
if driver is weak_policy and s_t >= tau and t >= t_min and fires < k_max:
fires += 1
driver = strong_policy # switch at the next chunk boundary,
# hold >= 3 chunks, return on hysteresis
obs = env.step(a_t)
The full gate semantics (chunk-boundary handoff, hold, hysteretic de-escalation) are frozen in gate_config.json; the rollout, calibration, and analysis code that reproduces the paper's tables from the logged traces ships with the dataset repo.
Files in this repo
probe_artifact.npz— the frozen probe head: feature standardization (mu,sd), logistic weights (w,b), and the split-conformal trigger threshold calibrated at α=0.10. Reads SmolVLA action-expert layer-15o_proj(720-d, mean-pooled over the 10-token chunk).gate_config.json— the deployed gate configuration (early-harm guard, budget cap, hold, hysteresis) and the measured headline numbers, frozen before the confirmatory run.banner.png,fig_architecture.png,fig_headline_bars.png— card art and paper figures.
The probe is policy-specific: it was trained on SmolVLA's layer-15 action-expert activations on LIBERO and will not transfer to a different backbone without refitting (the paper's OFT-7B supporting study refits a [4096→256→1] head).
Limitations
- The gains over the matched controls are real but modest (+5.4pp / +5.0pp conditional RTR), and the B−C interval touches zero in the HARD difficulty tercile; the within-stratum claim rests on EASY and MEDIUM.
- AEGIS does not out-recover bigger spends: HELM-style rollback (15.5%), GR00T escalation (15.5%) and always-strong (31.9%) all recover more at higher compute. The claim is selectivity at a fixed budget.
- Simulation only (LIBERO-Spatial), one weak/strong headline pair; real-hardware transfer and broader pair coverage are named as the next tests, not assumed.
- Conformal coverage is marginal rather than conditional where difficulty strata were too small to calibrate their own threshold; realized trigger rates are reported empirically.
- Escalation overhead scales with the escalated fraction times the relative cost of the stronger policy; it does not transfer to a different pair without re-profiling.
Relation to AURA
AEGIS is the outward counterpart of AURA, the author's companion memory gate: AURA gates memory writes inward to save bandwidth at fixed success; AEGIS gates compute outward to raise success at fixed memory. The trigger semantics differ (should I write vs. will this trajectory fail and should I escalate), and the failure-trained probe dominates the closest surprise proxy (early AUROC 0.764 vs 0.63).
Citation
@misc{chen2026aegis,
title = {AEGIS: A Backup Reflex for Physical AI: Calling a Stronger
Policy Before Long-Horizon Failures Compound},
author = {Chen, Josef},
year = {2026},
eprint = {2606.06660},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2606.06660}
}
Not affiliated with Hugging Face, Physical Intelligence, or NVIDIA; model names are trademarks of their respective owners.