nanoAWM — a tiny action-conditioned world model for tool agents

This repository hosts the trained world model from nanoAWM, a small, dependency-free lab for studying whether a learned consequence model helps a tool-using agent avoid irreversible, approval-violating, and state-corrupting actions. Code, full evidence, and the paper live in the GitHub repository: https://github.com/jlov7/nanoAWM.

The model is tiny_recurrent_multihead_v2: a from-scratch recurrent multi-head predictor over signed-hash symbolic features, trained with pure-Python SGD (no PyTorch, no NumPy — Python standard library only). Given history, visible state, an observation, and a candidate action, it predicts the next-observation class, state-diff label, reward, risk, reversibility, approval violation, hidden-state corruption, and termination. A planner then chooses actions against those predictions.
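
As a rough illustration of what "signed-hash symbolic features" and "pure-Python SGD" can look like together, the sketch below hashes string tokens into a fixed-size signed vector and applies one logistic-regression SGD step for a single binary head. It is not the repository's implementation: the dimension `DIM`, the use of SHA-256, the token format, and the function names `signed_hash_features`, `predict_prob`, and `sgd_step` are all assumptions made for this example, and the real model is recurrent and multi-head, which this sketch leaves out.

```python
import hashlib
import math

DIM = 64  # hypothetical feature dimension, chosen only for this sketch


def signed_hash_features(tokens, dim=DIM):
    """Hash each token to a bucket and add +1 or -1 there (signed hashing trick)."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket from the first 4 bytes
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # sign from the next byte
        vec[index] += sign
    return vec


def predict_prob(weights, features):
    """Logistic prediction for one binary head (for example, a risk flag)."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, features, label, lr=0.1):
    """One SGD step on the log loss; returns the updated weight list."""
    error = predict_prob(weights, features) - label
    return [w - lr * error * x for w, x in zip(weights, features)]


# Hypothetical token format; the real feature vocabulary is defined in the repo.
x = signed_hash_features(["action=delete", "target=report", "obs=confirm_dialog"])
w = sgd_step([0.0] * DIM, x, label=1.0)
```

Because the hash is deterministic, the same tokens always map to the same vector, and unseen tokens still land in some bucket, which is one reason hashed features pair naturally with vocabulary randomization.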

The number to believe: 0.524, not 1.000

Results on the genuinely held-out split, where object names and action markers are disjoint from training (token-set disjointness verified) and vocabulary is randomized during training so that markers cannot be memorized:

| Policy | Held-out success (105 tasks) |
|---|---|
| `oracle_world_model_planner` (upper bound) | 1.000 |
| `learned_world_model_planner` | 0.524 |
| `rule_based_risk_guard` | 0.067 |
| `scripted_expert_policy` | 0.067 |
| `reactive_policy` | 0.000 |

Honest improvement over the best non-world-model baseline: +0.457 (0.524 vs. 0.067). The model was trained on 720 tasks (×3 vocabulary-randomized variants). The in-distribution headline of 1.000 is a sanity check, not the result: the eval splits there overlap training vocabulary at the feature level (feature-identical fractions: test 90%, template_holdout 100%, ood 71%, as measured and reported by the anti_cheat audit). The 0.524 is the honest generalization number.
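
To make the two checks mentioned above concrete, here is a minimal sketch of a token-set disjointness test and a feature-identical overlap measurement. The function names and the definition of "feature-identical" used here (an eval example whose feature set exactly matches some training example) are assumptions for illustration; the repository's anti_cheat audit may define and compute these differently.

```python
def token_sets_disjoint(train_tokens, heldout_tokens):
    """True when no token appears in both the training and held-out vocabularies."""
    return set(train_tokens).isdisjoint(heldout_tokens)


def feature_identical_fraction(train_examples, eval_examples):
    """Fraction of eval examples whose feature set also occurs in training."""
    seen = {frozenset(features) for features in train_examples}
    if not eval_examples:
        return 0.0
    hits = sum(1 for features in eval_examples if frozenset(features) in seen)
    return hits / len(eval_examples)


# Toy data, invented for this example only.
train = [["open", "file_a"], ["delete", "file_b"]]
heldout = [["open", "file_a"], ["archive", "file_c"]]

overlap = feature_identical_fraction(train, heldout)               # 0.5 on this toy data
disjoint = token_sets_disjoint(["file_a", "file_b"], ["file_c"])   # True on this toy data
```

A high overlap fraction means an eval split can be solved partly by recall, which is why the held-out number is the one to read.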

What this model does and does not show

This is the section a skeptical reviewer should read first.

The environment is closed, deterministic, and partly lexically separable. The safe/risky distinction is partly encoded in action strings, so a pure good/bad-keyword matcher with no world model already scores ~0.73. A large part of the task is lexical, not consequence reasoning.
Strongest positive evidence (mechanism): over 15,736 same-visible-state candidate-action pairs, the learned model ranks the safer hidden consequence higher with choice accuracy 0.971 (risk-order accuracy 1.000), using scorer-only counterfactuals after candidates are fixed. This is the closest thing to direct evidence for the central "consequence aliasing" claim.
Negative result — action paraphrase: replace action target/value strings with unseen aliases (hidden roles preserved) and learned success collapses while the oracle stays at 1.000. The model leans on action vocabulary; this artifact does not claim action-vocabulary robustness.
Negative result — cross-surface OOD: on a 180-task suite requiring both a browser probe and a terminal canary before the safe action, the oracle solves it (1.000) but the learned planner gets 0.011. A genuine, unrecovered failure, deliberately kept out of the main scorecard.
Lexicon dependence: with a learned grounding extractor, the model holds 1.000 on a relation pair seen in training (under disjoint object/action vocabulary) but collapses to 0.582 on an unseen pair (sound/suspect), a drop of 0.418. Generalizing the relation lexicon itself needs semantic priors this lab deliberately omits.
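
The pairwise choice-accuracy measurement described in the list above can be sketched as follows. This is a simplified reading, not the repository's scorer: the function name, the pair layout (safer action first), and the choice to count ties as errors are assumptions, and the action strings and scores are invented.

```python
def pairwise_choice_accuracy(pairs, score):
    """Fraction of pairs in which the safer action gets the strictly higher score.

    pairs: iterable of (safer_action, riskier_action) sharing one visible state.
    score: callable returning a larger value for actions the model prefers.
    """
    total = 0
    correct = 0
    for safer, riskier in pairs:
        total += 1
        if score(safer) > score(riskier):  # ties count as errors in this sketch
            correct += 1
    return correct / total if total else 0.0


# Invented scores standing in for a learned model's preference over actions.
toy_scores = {
    "probe_then_write": 0.9,
    "overwrite_now": 0.2,
    "ask_approval": 0.7,
    "force_push": 0.8,
}
pairs = [("probe_then_write", "overwrite_now"), ("ask_approval", "force_push")]

accuracy = pairwise_choice_accuracy(pairs, toy_scores.get)  # 0.5 on this toy data
```

Holding the visible state fixed within each pair is what makes this a test of predicted consequences and not of surface state features.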

Anti-cheat contract

Learned planners receive only task description, visible state, action candidates, public history, and the learned model's predictions — never task/template/split ids, safe/risky/reactive labels, rewards or risk as features, hidden gold state, oracle next-state, or scorer labels. The oracle_world_model_planner is an upper-bound baseline and is the only policy permitted to read true transitions at decision time; this is enforced mechanically by a test, not by naming convention.
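
One way such a contract can be enforced mechanically is an allow-list check on the planner's input, exercised by a test. The sketch below is an assumption about the general shape of such a test, not a copy of the repository's: the key names, `assert_planner_input_clean`, and the test function are hypothetical.

```python
# Hypothetical key names for the five permitted input groups listed above.
ALLOWED_PLANNER_KEYS = frozenset({
    "description", "visible_state", "candidates", "history", "predictions",
})


def assert_planner_input_clean(planner_input):
    """Raise if the planner input carries any field outside the allow-list."""
    extra = set(planner_input) - ALLOWED_PLANNER_KEYS
    if extra:
        raise AssertionError(f"planner input leaks disallowed fields: {sorted(extra)}")


def test_learned_planner_sees_only_public_fields():
    clean = {
        "description": "rotate the log file",
        "visible_state": {"cwd": "/var/log"},
        "candidates": ["probe", "rotate"],
        "history": [],
        "predictions": {"probe": {"risk": 0.1}},
    }
    assert_planner_input_clean(clean)  # permitted fields pass

    leaky = dict(clean, hidden_gold_state={"corrupted": True})
    try:
        assert_planner_input_clean(leaky)
    except AssertionError:
        return  # the leak was caught, as intended
    raise RuntimeError("leak was not detected")
```

An allow-list is used here instead of a deny-list so that a newly added field fails the check by default.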

Usage

The model is a small JSON artifact loaded by the nanoawm package (pure Python, zero third-party runtime dependencies):

git clone https://github.com/jlov7/nanoAWM
cd nanoAWM
# place world_model.json at runs/world_model.json (or pass --model)
python -m nanoawm.eval_world --data data/trajectories.jsonl \
  --model runs/world_model.json --split test --out reports/world_eval.json

In code:

from nanoawm.models.world import TinyWorldModel

# description, visible_state, history, and action are supplied by the caller.
model = TinyWorldModel.load("world_model.json")
prediction = model.predict(description, visible_state, history, action)
# prediction: next_obs_class, state_diff_label, reward, risk,
#             irreversible_prob, approval_violation_prob,
#             hidden_corruption_prob, done_prob, success_prob, ...
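
To show how a planner might choose actions against such predictions, here is a minimal sketch that scores each candidate by predicted success minus a penalty for the worst predicted hazard. The scoring rule, `risk_weight`, the dict-shaped prediction, and the stub predictor are assumptions for illustration; only the field names come from the comment above, and the repository's planner may work differently.

```python
def choose_action(candidates, predict, risk_weight=1.0):
    """Return the candidate with the best success-minus-hazard utility.

    predict: callable mapping an action to a dict of predicted fields
             (assumed shape; the real prediction object may differ).
    """
    def utility(action):
        p = predict(action)
        hazard = max(
            p["irreversible_prob"],
            p["approval_violation_prob"],
            p["hidden_corruption_prob"],
        )
        return p["success_prob"] - risk_weight * hazard

    return max(candidates, key=utility)


# Stub predictor with invented numbers, standing in for model.predict(...).
STUB_PREDICTIONS = {
    "backup_then_edit": {
        "success_prob": 0.80, "irreversible_prob": 0.05,
        "approval_violation_prob": 0.10, "hidden_corruption_prob": 0.02,
    },
    "edit_in_place": {
        "success_prob": 0.90, "irreversible_prob": 0.95,
        "approval_violation_prob": 0.10, "hidden_corruption_prob": 0.40,
    },
}

best = choose_action(list(STUB_PREDICTIONS), STUB_PREDICTIONS.__getitem__)
```

With these invented numbers the planner prefers `backup_then_edit`: its utility is 0.80 - 0.10 = 0.70, while `edit_in_place` scores 0.90 - 0.95 = -0.05.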

Regenerate everything from scratch (CPU-only, no network) with scripts/repro_core.sh in the GitHub repo.

Limitations and scope

No external validation. Zero independent reproductions, expert reviews, or citations. All evidence is local symbolic measurement inside MiniOS — a deterministic toy OS — not a hosted web/OS benchmark or a frontier-agent comparison.
Not a general agent or a frontier-scale model. It is a substrate for sharper experiments, intentionally small enough to read in an afternoon.
Results are specific to MiniOS and the splits described above.

Citation

@software{nanoawm,
  title  = {nanoAWM: a tiny action-conditioned world model for tool agents},
  author = {Lovell, Jason},
  year   = {2026},
  url    = {https://github.com/jlov7/nanoAWM}
}

License: MIT. See the GitHub repository for the full paper, the consequence-aliasing argument, the generalization ladder, all audits (including the negative results above), and the reproduction scripts.

Personal project disclaimer

Personal, independent research and development. Not affiliated with, endorsed by, or representative of any employer, client, or organization.
