AEGIS: A Backup Reflex for Physical AI


AEGIS (Activation-probe Early-warning, Gated Inference Switching) is a runtime escalation layer for robot manipulation policies. A cheap probe reads the deployed policy's frozen internal activations as a per-step early-warning signal; when a calibrated gate fires, control switches mid-trajectory to a stronger separate policy, but only for the steps that need it. Both policies stay frozen — the ten-kilobyte probe head in this repo is the only trained component, and it decides when a 4.14B policy wakes up.

The thesis is one sentence: a robot policy can read its own activations as an early-warning signal and call a stronger policy before failure compounds, recovering roughly twice as many failures as escalation at a matched budget (10.1% vs 4.6% blind and 5.1% random-trigger).

Paper: https://arxiv.org/abs/2606.06660
Interactive demo: https://huggingface.co/spaces/kaikaku/aegis-demo
Author: Josef Chen, KAIKAKU

What it is

Long-horizon failures are slow spirals: one bad step degrades the state, the next compounds it, and the episode is lost long before it ends. Detect-only methods (SAFE, FIPER, Sentinel) see this coming and raise an alarm but never act; recover-within-policy methods (HELM, Pre-VLA, FailSafe) act, but only by asking the same failing policy to try again. AEGIS does what a human supervisor would do: call in someone stronger, at the moment it matters.

AEGIS architecture: frozen weak policy, probe, gate, handoff to a frozen strong policy

The runtime loop: the weak policy (SmolVLA, 450M) drives by default and a forward hook reads its action-expert layer-15 activations live (720-d, mean-pooled over each 10-step action chunk). The probe head scores each step; a split-conformal threshold (α=0.10), an early-harm guard (no escalation before 0.20 T), and a per-episode budget cap (⌈0.05 T⌉ fires) turn the score into a switching decision. On a fire, control hands to π₀.₅ (4.14B) at the next chunk boundary, holds for ≥3 chunks, and returns on hysteresis. Both policies sit warm in one process (~9.5 GB VRAM for the backup).
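
The gate described above can be sketched as a small state machine. This is a minimal illustration, not the released gate: the class name `ReflexGate`, the release threshold `tau_release`, and the rule "return once the score drops below `tau_release` after the hold" are assumptions made for the example, since the text only says the gate returns on hysteresis. The guard, budget cap, chunk-boundary handoff, and 3-chunk hold follow the numbers given in the paragraph.

```python
import math
from dataclasses import dataclass

@dataclass
class ReflexGate:
    tau: float                 # fire threshold (the split-conformal value in the text)
    tau_release: float         # hypothetical lower threshold used for hysteresis
    horizon: int               # episode length T
    chunk: int = 10            # action-chunk length H
    min_hold_chunks: int = 3   # stay on the strong policy for at least this many chunks
    fires: int = 0
    pending: bool = False      # a fire is waiting for the next chunk boundary
    on_strong: bool = False
    held: int = 0              # chunks started under the strong policy

    def step(self, t: int, score: float) -> bool:
        """Return True if the strong policy should drive step t."""
        boundary = t % self.chunk == 0
        if boundary and self.pending:
            # The handoff takes effect only at a chunk boundary.
            self.on_strong, self.pending, self.held = True, False, 0
        elif (boundary and self.on_strong
              and self.held >= self.min_hold_chunks
              and score < self.tau_release):
            self.on_strong = False                  # hysteretic return to the weak policy
        if boundary and self.on_strong:
            self.held += 1
        t_min = max(int(0.20 * self.horizon), 2)    # early-harm guard
        k_max = math.ceil(0.05 * self.horizon)      # per-episode budget cap
        idle = not (self.on_strong or self.pending)
        if idle and score >= self.tau and t >= t_min and self.fires < k_max:
            self.fires += 1
            self.pending = True
        return self.on_strong

# Toy trace: risk is high for the first 40 steps, then drops.
gate = ReflexGate(tau=0.8, tau_release=0.5, horizon=100)
scores = [0.9 if t < 40 else 0.1 for t in range(100)]
drive = [gate.step(t, s) for t, s in enumerate(scores)]
print(drive.index(True), sum(drive), gate.fires)
```

In this toy trace the high score is ignored until the guard lifts at step 20, the handoff waits for the boundary at step 30, and the strong policy holds for three chunks before control returns.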

Key results (measured, from the paper)

All results come from the confirmatory factorial experiment on LIBERO-Spatial: the full 10×70 task×seed common-random-number grid, ≥700 episodes per arm, with 646 episodes in the conditional pool where the weak policy alone fails.

Recovered-task rate 10.1% on the episodes the weak policy alone loses — vs 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo at the same strong-policy budget and temporal spread (B−C +5.4pp, exact McNemar p=8.5×10⁻⁶; B−D +5.0pp, p=1.0×10⁻⁴; Holm-adjusted, one-sided; all whole-trajectory bootstrap CIs exclude zero).
Duty cycle 38%: the stronger policy is dormant most of the time; per-episode cost ≈44% of always-strong (parameter-count schematic). The always-strong ceiling recovers 31.9% at ≈4.6× the compute.
Selectivity cuts both ways: AEGIS recovers 65/646 failures while disrupting only 10/54 of the weak policy's successes (recover:disrupt 6.5, vs 1.8 blind, 3.3 random) — the intervention paradox is the failure mode the gate is built against.
Early-window AUROC 0.764 (95% cluster-bootstrap CI [0.70, 0.84], n=2,792 episodes) read over the first 30% of steps on the weak-policy path before any handoff — a precondition, not the headline, because accurate prediction does not imply effective prevention.
Sign-invariant under simulator non-determinism: across 2,000 replicate redraws of the 212 multi-host cells, no primary contrast ever reverses.
Cross-family generalization: swapping the escalation target to GR00T N1.7 recovers 15.5%, consistent with the effect being a property of escalating to a stronger separate policy, not of one lucky pair.
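
Because the arms share a task×seed grid, the contrasts above are paired, and an exact McNemar test looks only at the discordant pairs. The sketch below shows the one-sided form of that test in plain Python. The function name and the example counts are hypothetical; the paper's discordant counts are not quoted on this card, so the snippet does not attempt to reproduce the reported p-values, and the Holm adjustment is not shown.

```python
from math import comb

def mcnemar_exact_one_sided(b: int, c: int) -> float:
    """Exact one-sided McNemar p-value.

    b: pairs where arm B succeeds and the control fails.
    c: pairs where the control succeeds and arm B fails.
    Under the null, b ~ Binomial(b + c, 0.5); return P(X >= b).
    """
    n = b + c
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact_one_sided(12, 3))
```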

Headline: timing doubles recovery at matched compute
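
The headline ratios can be rechecked from the counts and rates quoted in the results list. This is plain arithmetic on the reported figures; the variable names are illustrative.

```python
# Counts quoted in the results list.
recovered, failing_pool = 65, 646      # AEGIS recoveries / weak-policy-failing episodes
disrupted, weak_successes = 10, 54     # weak-policy successes that AEGIS disrupts

rtr = recovered / failing_pool
recover_to_disrupt = recovered / disrupted

# Reported recovered-task rates for the three matched-budget arms.
aegis, blind, random_trigger = 0.101, 0.046, 0.051

print(f"recovered-task rate: {rtr:.1%}")
print(f"recover:disrupt:     {recover_to_disrupt:.1f}")
print(f"vs blind:            {aegis / blind:.1f}x")
print(f"vs random trigger:   {aegis / random_trigger:.1f}x")
```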

How to use

The released probe is plain numpy — no framework needed to score risk:

import numpy as np
from huggingface_hub import hf_hub_download

art = np.load(hf_hub_download("kaikaku/aegis", "probe_artifact.npz"))
mu, sd, w, b = art["mu"], art["sd"], art["w"], float(art["b"])
tau = float(art["conformal_threshold"])   # split-conformal, alpha = 0.10

def risk(h):  # h: (720,) mean-pooled layer-15 action-expert activations
    z = (h - mu) / sd
    return 1.0 / (1.0 + np.exp(-(z @ w + b)))
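
For readers who refit the probe, the sketch below shows one common way a split-conformal threshold at a given alpha can be computed from held-out scores. It is an assumption about the recipe, not the paper's calibration code: it assumes the calibration scores come from episodes the weak policy completes on its own, so that the bound applies to the false-fire rate, and the function name and synthetic Beta-distributed scores are placeholders.

```python
import math
import numpy as np

def split_conformal_threshold(cal_scores, alpha=0.10):
    """Threshold that a fresh exchangeable score exceeds with probability <= alpha."""
    s = np.sort(np.asarray(cal_scores, dtype=float))
    n = s.size
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    # The small tolerance guards against floating-point overshoot before ceil.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    # With too few calibration points no finite threshold gives the guarantee.
    return float(s[k - 1]) if k <= n else float("inf")

# Synthetic calibration scores standing in for held-out probe outputs.
rng = np.random.default_rng(0)
cal = rng.beta(2, 5, size=999)
tau_demo = split_conformal_threshold(cal, alpha=0.10)

# Empirical check on fresh draws from the same distribution.
fire_rate = float((rng.beta(2, 5, size=20_000) > tau_demo).mean())
print(f"tau = {tau_demo:.3f}, fire rate on fresh scores = {fire_rate:.3f}")
```

The guarantee is marginal over the calibration distribution, which matches the caveat in the Limitations section about strata too small to calibrate their own threshold.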

Hook the frozen weak policy (SmolVLA via LeRobot) and run the reflex:

import math

feats = []
layer = weak_policy.model.vlm_with_expert.lm_expert.layers[15].self_attn.o_proj
hook = layer.register_forward_hook(
    lambda m, i, o: feats.append(o.detach().float().mean(dim=(0, 1)).cpu().numpy())
)
# sanity check: live activations vary step to step (std > 0.05).
# a frozen cached feature here is exactly the bug that gives AUROC 0.50.

T, H  = 520, 10                  # horizon, native action-chunk length
t_min = max(int(0.20 * T), 2)    # early-harm guard
k_max = math.ceil(0.05 * T)      # per-episode budget cap on gate fires
fires, driver = 0, weak_policy

obs = env.reset()                # assumes a Gym-style env; adapt to your wrapper
for t in range(T):
    # the weak forward pass keeps running so the probe always has fresh features
    a_weak = weak_policy.select_action(obs)
    a_t = a_weak if driver is weak_policy else strong_policy.select_action(obs)
    s_t = risk(feats[-1])
    if driver is weak_policy and s_t >= tau and t >= t_min and fires < k_max:
        fires += 1
        driver = strong_policy   # switch at the next chunk boundary,
                                 # hold >= 3 chunks, return on hysteresis
    obs, *_ = env.step(a_t)      # assumes step() returns a tuple; keep the observation

hook.remove()                    # detach the hook once the episode is done

The full gate semantics (chunk-boundary handoff, hold, hysteretic de-escalation) are frozen in gate_config.json; the rollout, calibration, and analysis code that reproduces the paper's tables from the logged traces ships with the dataset repo.

Files in this repo

probe_artifact.npz — the frozen probe head: feature standardization (mu, sd), logistic weights (w, b), and the split-conformal trigger threshold calibrated at α=0.10. Reads SmolVLA action-expert layer-15 o_proj (720-d, mean-pooled over the 10-token chunk).
gate_config.json — the deployed gate configuration (early-harm guard, budget cap, hold, hysteresis) and the measured headline numbers, frozen before the confirmatory run.
banner.png, fig_architecture.png, fig_headline_bars.png — card art and paper figures.

The probe is policy-specific: it was trained on SmolVLA's layer-15 action-expert activations on LIBERO and will not transfer to a different backbone without refitting (the paper's OFT-7B supporting study refits a [4096→256→1] head).

Limitations

The gains over the matched controls are real but modest (+5.4pp / +5.0pp conditional RTR), and the B−C interval touches zero in the HARD difficulty tercile; the within-stratum claim rests on EASY and MEDIUM.
AEGIS does not out-recover bigger spends: HELM-style rollback (15.5%), GR00T escalation (15.5%) and always-strong (31.9%) all recover more at higher compute. The claim is selectivity at a fixed budget.
Simulation only (LIBERO-Spatial), one weak/strong headline pair; real-hardware transfer and broader pair coverage are named as the next tests, not assumed.
Conformal coverage is marginal rather than conditional where difficulty strata were too small to calibrate their own threshold; realized trigger rates are reported empirically.
Escalation overhead scales with the escalated fraction times the relative cost of the stronger policy; it does not transfer to a different pair without re-profiling.
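
The overhead scaling in the last point can be written down directly. The sketch below uses parameter counts as a stand-in for compute, as the card's schematic does. The function names are illustrative, and the choice to count the weak policy in both the gated and the always-strong cost is an assumption; under that assumption the quoted duty cycle and model sizes give a ratio close to the ≈44% reported above, but the paper's exact definition may differ.

```python
def escalation_overhead(duty: float, relative_cost: float) -> float:
    """Extra compute from escalation, in units of one weak-policy step."""
    return duty * relative_cost

def gated_vs_always_strong(duty: float, weak: float, strong: float) -> float:
    """Gated cost over always-strong cost, assuming the weak policy runs every step in both."""
    return (weak + duty * strong) / (weak + strong)

weak_b, strong_b = 0.45, 4.14      # SmolVLA and pi0.5 sizes quoted on this card, in billions
duty = 0.38                        # reported duty cycle

print(f"overhead: {escalation_overhead(duty, strong_b / weak_b):.2f} weak-policy steps per step")
print(f"gated / always-strong: {gated_vs_always_strong(duty, weak_b, strong_b):.2f}")
```

For a different pair, both the duty cycle and the relative cost have to be re-measured; plugging new parameter counts into this schematic is only a first estimate.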

Relation to AURA

AEGIS is the outward counterpart of AURA, the author's companion memory gate: AURA gates memory writes inward to save bandwidth at fixed success; AEGIS gates compute outward to raise success at fixed memory. The trigger semantics differ (should I write vs. will this trajectory fail and should I escalate), and the failure-trained probe dominates the closest surprise proxy (early AUROC 0.764 vs 0.63).

Citation

@misc{chen2026aegis,
  title         = {AEGIS: A Backup Reflex for Physical AI: Calling a Stronger
                   Policy Before Long-Horizon Failures Compound},
  author        = {Chen, Josef},
  year          = {2026},
  eprint        = {2606.06660},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2606.06660}
}

Not affiliated with Hugging Face, Physical Intelligence, or NVIDIA; model names are trademarks of their respective owners.
