nanoAWM: a tiny action-conditioned world model for tool agents
This repository hosts the trained world model from nanoAWM, a small, dependency-free lab for studying whether a learned consequence model helps a tool-using agent avoid irreversible, approval-violating, and state-corrupting actions. Code, full evidence, and the paper live in the GitHub repository: https://github.com/jlov7/nanoAWM.
The model is tiny_recurrent_multihead_v2: a from-scratch recurrent multi-head
predictor over signed-hash symbolic features, trained with pure-Python SGD
(no PyTorch, no NumPy; Python standard library only). Given history,
visible state, an observation, and a candidate action, it predicts the
next-observation class, state-diff label, reward, risk, reversibility, approval
violation, hidden-state corruption, and termination. A planner then chooses
actions against those predictions.
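To make "signed-hash symbolic features, trained with pure-Python SGD" concrete, here is a minimal standard-library sketch of the general technique: the signed hashing trick feeding a single logistic head. It is not the repository's implementation. The function names, the 64-bucket dimension, the SHA-256 hash, and the single-head setup are all assumptions made for illustration; the real model is recurrent and multi-head.

```python
import hashlib
import math


def signed_hash_features(tokens, dim=64):
    """Map symbolic tokens to a fixed-size vector with the signed hashing trick.

    Hypothetical helper: bucket count and hash function are illustrative choices.
    """
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
        sign = 1.0 if digest[4] & 1 else -1.0            # sign from a separate byte
        vec[index] += sign
    return vec


def sgd_step(weights, features, target, lr=0.1):
    """One SGD step for a logistic head: w <- w - lr * (p - y) * x.

    Returns the updated weights and the prediction made before the update.
    """
    z = sum(w * x for w, x in zip(weights, features))
    p = 1.0 / (1.0 + math.exp(-z))
    grad_scale = p - target
    new_weights = [w - lr * grad_scale * x for w, x in zip(weights, features)]
    return new_weights, p


features = signed_hash_features(["action=delete"])
weights, p_before = sgd_step([0.0] * 64, features, target=1.0)
_, p_after = sgd_step(weights, features, target=1.0)
```

Starting from zero weights the first prediction is 0.5, and one update toward a positive label moves the next prediction above it. The sign bit lets colliding tokens partially cancel rather than always add up, which is the usual motivation for signed hashing.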
The number to believe: 0.524, not 1.000
On the genuinely held-out split (object names and action markers disjoint from training, with token-set disjointness verified, and vocabulary randomization during training so markers cannot be memorized):
| Policy | Held-out success (105 tasks) |
|---|---|
| oracle_world_model_planner (upper bound) | 1.000 |
| learned_world_model_planner | 0.524 |
| rule_based_risk_guard | 0.067 |
| scripted_expert_policy | 0.067 |
| reactive_policy | 0.000 |
Honest improvement over the best non-world-model baseline: +0.457. Trained
on 720 tasks (×3 vocabulary-randomized variants). The in-distribution headline
of 1.000 is a sanity check, not the result: the eval splits there overlap
training vocabulary at the feature level (test 90% / template_holdout 100% /
ood 71% feature-identical, measured and reported by the anti_cheat audit).
The 0.524 is the honest generalization number.
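The two checks behind this section can be sketched in a few lines: a set-intersection test for vocabulary disjointness, and the arithmetic behind the +0.457 figure. The helper name and the example object/marker names are hypothetical. The scores are the held-out numbers reported above. The repository's anti_cheat audit may define its checks differently.

```python
def shared_vocabulary(train_names, heldout_names):
    """Return names that occur in both splits; an empty set means disjoint."""
    return set(train_names) & set(heldout_names)


# Hypothetical object names and action markers, for illustration only.
train_names = {"ledger", "vault", "mark_a"}
heldout_names = {"archive", "relay", "mark_q"}
overlap = shared_vocabulary(train_names, heldout_names)

# Held-out success rates as reported in this section.
scores = {
    "oracle_world_model_planner": 1.000,
    "learned_world_model_planner": 0.524,
    "rule_based_risk_guard": 0.067,
    "scripted_expert_policy": 0.067,
    "reactive_policy": 0.000,
}

# "Non-world-model baselines" excludes both the oracle and the learned planner.
world_model_policies = {"oracle_world_model_planner", "learned_world_model_planner"}
best_baseline = max(v for k, v in scores.items() if k not in world_model_policies)
improvement = round(scores["learned_world_model_planner"] - best_baseline, 3)
```

Here `overlap` is empty for the illustrative names, and `improvement` is 0.524 minus 0.067.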
What this model does and does not show
This is the section a skeptical reviewer should read first.
- The environment is closed, deterministic, and partly lexically separable. The safe/risky distinction is partly encoded in action strings, so a pure good/bad-keyword matcher with no world model already scores ~0.73. A large part of the task is lexical, not consequence reasoning.
- Strongest positive evidence (mechanism): over 15,736 same-visible-state candidate-action pairs, the learned model ranks the safer hidden consequence higher with choice accuracy 0.971 (risk-order accuracy 1.000), using scorer-only counterfactuals after candidates are fixed. This is the closest thing to direct evidence for the central "consequence aliasing" claim.
- Negative result (action paraphrase): replace action target/value strings with unseen aliases (hidden roles preserved) and learned success collapses while the oracle stays at 1.000. The model leans on action vocabulary; this artifact does not claim action-vocabulary robustness.
- Negative result (cross-surface OOD): on a 180-task suite requiring both a browser probe and a terminal canary before the safe action, the oracle solves it (1.000) but the learned planner gets 0.011. A genuine, unrecovered failure, deliberately kept out of the main scorecard.
- Lexicon dependence: with a learned grounding extractor, the model holds 1.000 on a relation pair seen in training (under disjoint object/action vocabulary) but collapses to 0.582 on an unseen pair (sound/suspect), a −0.418 drop. Generalizing the relation lexicon itself needs semantic priors this lab deliberately omits.
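One plausible reading of the pairwise "choice accuracy" cited in the mechanism bullet is the fraction of candidate-action pairs in which the scorer prefers the action with the safer hidden consequence. The sketch below computes that fraction. The function name, the input layout, and the strict-inequality tie handling are assumptions, and the repository's metric may be defined differently.

```python
def pairwise_choice_accuracy(pairs):
    """Fraction of pairs where the safer action gets the strictly higher score.

    pairs: iterable of (score_for_safer_action, score_for_riskier_action),
    each pair taken from two candidates that share the same visible state.
    Ties count as errors in this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for safe_score, risky_score in pairs if safe_score > risky_score)
    return correct / len(pairs)


# Hypothetical scores: the safer action wins in three of four pairs.
example_pairs = [(0.9, 0.1), (0.4, 0.6), (0.8, 0.2), (0.7, 0.3)]
accuracy = pairwise_choice_accuracy(example_pairs)
```

Because both candidates in a pair share the same visible state, any preference between them has to come from the predicted consequence, not from differences in what the agent can see.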
Anti-cheat contract
Learned planners receive only task description, visible state, action
candidates, public history, and the learned model's predictions; never
task/template/split ids, safe/risky/reactive labels, rewards or risk as
features, hidden gold state, oracle next-state, or scorer labels. The
oracle_world_model_planner is an upper-bound baseline and is the only policy
permitted to read true transitions at decision time; this is enforced
mechanically by a test, not by naming convention.
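A mechanical check of this kind could look like the allow-list guard below. It is a sketch of the idea only: the field names and the function are hypothetical, chosen to mirror the contract as worded above, and the repository's actual test may work differently.

```python
# Hypothetical field names mirroring the contract wording above.
ALLOWED_FIELDS = frozenset({
    "task_description",
    "visible_state",
    "action_candidates",
    "public_history",
    "model_predictions",
})


def check_planner_input(planner_input):
    """Raise if a learned planner is handed any field outside the allow-list."""
    leaked = set(planner_input) - ALLOWED_FIELDS
    if leaked:
        raise ValueError(f"forbidden fields passed to learned planner: {sorted(leaked)}")
    return planner_input


clean = check_planner_input({
    "task_description": "rotate the credentials",
    "visible_state": {"cwd": "/srv"},
    "action_candidates": ["probe", "rotate"],
    "public_history": [],
    "model_predictions": {},
})

try:
    check_planner_input({"visible_state": {}, "hidden_gold_state": {"corrupt": True}})
    leak_detected = False
except ValueError:
    leak_detected = True
```

An allow-list fails closed: any new field added to the planner input is rejected until someone explicitly permits it, which is stricter than listing the forbidden fields.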
Usage
The model is a small JSON artifact loaded by the nanoawm package (pure Python,
zero third-party runtime dependencies):
git clone https://github.com/jlov7/nanoAWM
cd nanoAWM
# place world_model.json at runs/world_model.json (or pass --model)
python -m nanoawm.eval_world --data data/trajectories.jsonl \
--model runs/world_model.json --split test --out reports/world_eval.json
In code:
from nanoawm.models.world import TinyWorldModel
model = TinyWorldModel.load("world_model.json")
prediction = model.predict(description, visible_state, history, action)
# prediction: next_obs_class, state_diff_label, reward, risk,
# irreversible_prob, approval_violation_prob,
# hidden_corruption_prob, done_prob, success_prob, ...
Regenerate everything from scratch (CPU-only, no network) with
scripts/repro_core.sh in the GitHub repo.
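The following sketch shows one way a planner could choose actions against such predictions. It uses a stand-in class with the same `predict(description, visible_state, history, action)` call shape as the snippet above, so it runs without the package. The dictionary return type, the stub's numbers, and the scoring rule (success probability minus a penalty on the worst predicted harm) are all assumptions, not the repository's planner.

```python
class StubWorldModel:
    """Hypothetical stand-in for a loaded model; returns canned predictions."""

    def __init__(self, table):
        self.table = table

    def predict(self, description, visible_state, history, action):
        # Assumption: predictions are exposed as a plain dict keyed by field name.
        return self.table[action]


def choose_action(model, description, visible_state, history, candidates, penalty=1.0):
    """Pick the candidate with the best success-minus-harm score."""
    def score(action):
        p = model.predict(description, visible_state, history, action)
        worst_harm = max(
            p["irreversible_prob"],
            p["approval_violation_prob"],
            p["hidden_corruption_prob"],
        )
        return p["success_prob"] - penalty * worst_harm

    return max(candidates, key=score)


# Illustrative predictions for two hypothetical candidate actions.
stub = StubWorldModel({
    "rm_backup": {"success_prob": 0.9, "irreversible_prob": 0.8,
                  "approval_violation_prob": 0.1, "hidden_corruption_prob": 0.2},
    "archive_backup": {"success_prob": 0.7, "irreversible_prob": 0.05,
                       "approval_violation_prob": 0.05, "hidden_corruption_prob": 0.05},
})
chosen = choose_action(stub, "free disk space", {"disk": "full"}, [],
                       ["rm_backup", "archive_backup"])
```

With these numbers the planner picks `archive_backup`: its lower success probability is outweighed by the much lower predicted chance of an irreversible outcome.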
Limitations and scope
- No external validation. Zero independent reproductions, expert reviews, or citations. All evidence is local symbolic measurement inside MiniOS, a deterministic toy OS, not a hosted web/OS benchmark or a frontier-agent comparison.
- Not a general agent or a frontier-scale model. It is a substrate for sharper experiments, intentionally small enough to read in an afternoon.
- Results are specific to MiniOS and the splits described above.
Citation
@software{nanoawm,
title = {nanoAWM: a tiny action-conditioned world model for tool agents},
author = {Lovell, Jason},
year = {2026},
url = {https://github.com/jlov7/nanoAWM}
}
License: MIT. See the GitHub repository for the full paper, the consequence-aliasing argument, the generalization ladder, all audits (including the negative results above), and the reproduction scripts.
Personal project disclaimer
Personal, independent research and development. Not affiliated with, endorsed by, or representative of any employer, client, or organization.