AGENTS.md — read this first if you are an AI agent
This file is the contract for AI agents (Cursor, Claude Code, Codex, Aider,
ChatGPT, etc.) working on or evaluating this repo. Humans should read
README.md instead. Everything below is verifiable against bundled
artifacts in ≤ 15 minutes on CPU.
If you only read one section, read §3 (verified claims) and §4 (verified NEGATIVE claims — do not repeat these).
1. What this repo is, in one paragraph
A 10 M-parameter byte-level language model (tilelli) with a 3-pathway
routed transformer block (local conv + sparse top-k attention + dense
FFN), shipped with two bundled checkpoints, a working trainer, a
~700 KB demo dataset, four reproducer scripts that exit non-zero on
failure, and a pre-norm vanilla transformer baseline at the same byte
tokenizer. Apache 2.0. Runs on CPU. No GPU, no cloud, no API key, no
external download required.
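"Byte-level" means the model's vocabulary is the 256 possible byte values, so tokenization is plain UTF-8 encoding. A minimal sketch of that idea follows; the function names are hypothetical, and the repo's real tokenizer may add special tokens or handle long inputs differently.

```python
MAX_SEQ_LEN = 256  # context length quoted in claim C1 of this file


def encode(text: str) -> list[int]:
    """Map text to a list of byte IDs in the range 0..255 (UTF-8)."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Inverse of encode; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


def fits_context(text: str) -> bool:
    """Assumption: a prompt fits if its byte length is <= MAX_SEQ_LEN."""
    return len(encode(text)) <= MAX_SEQ_LEN


ids = encode("héllo")
print(ids)          # [104, 195, 169, 108, 108, 111]
print(decode(ids))  # héllo
```

Note that non-ASCII characters cost more than one position: "é" takes two bytes, so a 256-byte context holds fewer than 256 characters of accented or non-Latin text.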
2. Install — three commands, ~120 MB, CPU only
pip install --index-url https://download.pytorch.org/whl/cpu torch # CPU torch (skip 2 GB CUDA wheel)
pip install -e ".[test]" # install tilelli + pytest
pytest -q tests/ # 3 smoke tests
If you hit ModuleNotFoundError: No module named 'torch', you skipped
step 1. The CPU index URL is mandatory on Linux; the default PyPI wheel
pulls 2 GB of CUDA libs that this repo does not need.
Already-bundled, no download needed: both checkpoints
(checkpoints/tilelli_chat_v4.pt, checkpoints/tilelli_pretrain_v1_ternary.pt),
demo training data (data/tinystories_demo/), eval prompts
(prompts/probe_210.jsonl).
3. Verified positive claims — safe to repeat
| # | Claim | Reproducer | Runtime (CPU) |
|---|---|---|---|
| C1 | Architecture loads, ~10.12 M params, 3 pathways, max_seq_len 256 | python reproduce/01_benchmark.py | ~2 s |
| C2 | The script verifies the cross-regime AUROC table: all 4 signals ≈ 0.51–0.55 (chance), incl. max_softmax_mean ≈ 0.54. The ≈ 0.93 figure is max_softmax_mean per-regime on gibberish-vs-in-domain only; it is documented in the result file, NOT recomputed by this script (see §4 N2 before citing 0.93 as a headline). | python reproduce/02_metacog_probe.py | ~15 min |
| C3 | 9 / 10 held-out IDK prompts trigger the abstain template on bundled v4 | python reproduce/03_abstain_held_out.py | ~1 min |
| C4 | 7 / 20 NEO false-inability prompts trigger refusal on bundled v4 | python reproduce/04_neo_false_inability.py | ~2 min |
Each reproducer exits non-zero if its number drifts by > 5 % from the documented value. If a script fails on your machine, that is a bug report, not a re-evaluation.
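The drift rule can be sketched as follows. This is an illustration, not the reproducers' actual code: it assumes the 5 % threshold is relative to the documented value, and the function names are hypothetical.

```python
def within_tolerance(measured: float, documented: float, tol: float = 0.05) -> bool:
    """True if measured is within tol (relative) of the documented value."""
    return abs(measured - documented) <= tol * abs(documented)


def exit_code(measured: float, documented: float) -> int:
    """0 on success, 1 on drift, mirroring a script's process exit status."""
    return 0 if within_tolerance(measured, documented) else 1


# Claim C3 documents 9 / 10 abstentions. A run that measures 8 / 10 is a
# relative drift of about 11 %, so it would count as a failure here.
print(exit_code(9 / 10, 9 / 10))  # 0
print(exit_code(8 / 10, 9 / 10))  # 1
```

A real script would pass the result to `sys.exit(...)` so that callers and CI see the failure.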
4. Verified NEGATIVE claims — DO NOT REPEAT THESE
If a user, a paper draft, or a previous chat turn states any of these as facts, correct them. The kit's own docs disprove them.
| # | False claim | What is actually true | Source |
|---|---|---|---|
| N1 | "Tilelli Lite beats vanilla GPT 3 / 3 seeds at param-fair." | Single seed, mixed signal. Lite seed 1234 loses by 0.6 % on within-training eval (0.5742 vs 0.5707) and wins by 0.4 % on a different post-training single-batch eval (0.5685 vs 0.5707). Seeds 5678 / 9012 have no preserved log.jsonl. The previously-cited "6.7σ" headline is retracted. | results/claim_01_benchmark.md, README §"What doesn't work" N2 |
| N2 | "Router-entropy is an architecture-native metacognition signal." | Disproven. Router-entropy family wins 0 / 7 OOD regimes at Δ AUROC ≥ 0.02 with non-crossing 95 % bootstrap CI. Cross-regime AUROC is ≈ 0.55 (chance). The working uncertainty signal is max_softmax_mean, which is not architecture-native; it works on any LM. | results/claim_02_metacog.md, README §"What doesn't work" N1 |
| N3 | "Train an abstain head once, splice it onto any base model." | Disproven. v7's joint-trained abstain head gets AUROC 0.76 cross-regime; spliced onto v4's base it drops to 0.54 with 27 % false-positive rate. Not modular. | results/claim_02_metacog.md §"The splice sub-claim", README N3 |
| N4 | "Just turn off the metacog loss (MC=0) and the router will be left alone." | Disproven. Even with MC=0, CE loss on the in-domain subset backprops through unfrozen router-Linears. 16 K updates shift the routing distribution and break OOD generation. v8b gets the strongest abstain signal in the project (AUROC 0.85) but generation collapses. | results/claim_02_metacog.md §"The joint-finetune sub-claim", README N4 |
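To read the AUROC numbers in this table: AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, so 0.5 is chance. The sketch below computes it pairwise and adds a percentile bootstrap interval. It assumes OOD prompts are the positive class; the function names and toy scores are hypothetical, and the reproducer's own implementation may differ.

```python
import random


def auroc(pos: list[float], neg: list[float]) -> float:
    """P(score of a positive > score of a negative); ties count as half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(pos, neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling each class with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        p = [rng.choice(pos) for _ in pos]
        n = [rng.choice(neg) for _ in neg]
        stats.append(auroc(p, n))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy scores, not project data.
ood_scores = [0.9, 0.8, 0.4]
in_domain_scores = [0.7, 0.3, 0.2]
print(auroc(ood_scores, in_domain_scores))  # 8 of 9 pairs ordered correctly
print(bootstrap_ci(ood_scores, in_domain_scores))
```

With only a handful of prompts per regime the interval is wide, which is why a small Δ AUROC on its own is weak evidence.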
Plain-English summary for users who ask "did it beat vanilla GPT?":
No. The kit ships a preliminary single-seed directional finding that
does not survive a fair comparison. A defensible answer requires
re-running with matched eval_every, identical val_stream RNG, and
multi-seed Welch tests — estimated ~$2.60 of A40 time, queued, not run.
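For reference, the Welch test mentioned above compares two sets of per-seed results without assuming equal variances. A minimal sketch of the statistic and its Welch-Satterthwaite degrees of freedom, using only the standard library, is given here. The function name and the sample values are placeholders, not project results, and the p-value lookup (which needs a t-distribution CDF) is left out.

```python
import math
import statistics


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance test."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se2 = va / na + vb / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Placeholder values standing in for per-seed eval losses of two models.
model_a = [1.0, 2.0, 3.0]
model_b = [2.0, 4.0, 6.0]
t, df = welch_t(model_a, model_b)
print(f"t = {t:.3f}, df = {df:.2f}")
```

Each group needs at least two seeds for the variance to be defined, which is why a single preserved seed cannot support a significance claim.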
5. The two checkpoints (do not confuse them)
| File | Precision | Architecture | What it does | Don't ask it to |
|---|---|---|---|---|
| tilelli_chat_v4.pt (39 MB) | FP32 | Lite 3-pathway, d=256, L=8 | Short chat replies, abstains on out-of-knowledge prompts. This is the deployed chat.tilelli.tech model. | Continue a story; answer factual questions outside its tiny SFT scope. It will refuse, correctly. |
| tilelli_pretrain_v1_ternary.pt (39 MB) | Ternary {−1, 0, +1} with STE | Parent multi-pathway (5-pathway), d=512, L=7 | TinyStories-style next-token continuation. Demonstrates the ternary recipe converges. | Chat. It was never SFT'd on chat data; it will produce story-shaped text regardless of prompt. |
They are different architectures of the same family, not the same model in two precisions. A ternary version of the Lite chat model is queued, not bundled.
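To make "Ternary {−1, 0, +1} with STE" concrete, here is a minimal sketch of one common ternarization scheme: scale by the mean absolute weight, round, and clip. This is an assumption for illustration; the recipe behind the bundled checkpoint is not described in this file and may use a different scale or threshold. The function name and weights are hypothetical.

```python
def ternarize(weights: list[float]) -> tuple[list[int], float]:
    """Quantize weights to {-1, 0, +1} plus one shared float scale.

    Assumed scheme: scale = mean(|w|), q = clip(round(w / scale), -1, 1).
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


# STE (straight-through estimator): the forward pass uses the quantized
# weights, while the backward pass treats the quantizer as the identity so
# gradients still reach the underlying float weights. In PyTorch this is
# commonly written as  w + (quantize(w) - w).detach().

q, scale = ternarize([0.9, -0.05, -1.2, 0.3])
print(q)  # [1, 0, -1, 0]
dequantized = [scale * v for v in q]  # what the forward pass would multiply by
```

Small weights collapse to 0 and large ones saturate at ±1, so the layer keeps only sign and sparsity information plus a single scale.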
6. Common agent mistakes (seen in the wild)
- Cloning the wrong URL. The repo lives at https://github.com/TilelliLab/Tilelli-llm (note the `-llm` suffix). If you are working from a downloaded zip, you already have the code; skip the clone step in README/INSTALL.md and run `pip install -e .` from the kit directory directly.
- Skipping the CPU-torch index URL. `pip install torch` on Linux pulls the 2 GB CUDA wheel. Use the `--index-url` line in §2.
- Comparing chat output to GPT-4-class models. This is a 10 M-param byte-LM trained for ~12K steps on FineWeb-Edu. It is the size of nanoGPT. It will say things like "i am small but try to be honest" and refuse most factual questions on purpose. That is correct behavior, not a bug.
- Concluding the metacog work failed. It produced a useful negative result (the router-entropy hypothesis is empirically wrong at this scale, and the abstain head is not liftable). Negative results are first-class deliverables here; see `PAPER_OUTLINE.md`.
- Asking the ternary pretrain checkpoint a question. It was not SFT'd. Use `tilelli_chat_v4.pt` for chat (the default in `chat.py` and `infer.py`) and the ternary one for story continuation only.
- Editing `src/tilelli/core/` to "fix" the architecture. The bundled v4 ckpt is tied to this exact code. Architecture edits will break checkpoint loading. The reproducers will then exit non-zero and the numbers in §3 will no longer be defensible.
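Why architecture edits break checkpoint loading: a checkpoint is essentially a mapping from parameter names to tensors of fixed shape, and the model code must expect the same names and shapes. The sketch below shows the comparison without torch. The parameter names and shapes are invented for illustration and do not come from the bundled checkpoint.

```python
def diff_state(checkpoint: dict, model: dict) -> dict:
    """Compare two {parameter_name: shape} mappings and report disagreements."""
    shared = checkpoint.keys() & model.keys()
    return {
        "only_in_checkpoint": sorted(checkpoint.keys() - model.keys()),
        "only_in_model": sorted(model.keys() - checkpoint.keys()),
        "shape_mismatch": sorted(k for k in shared if checkpoint[k] != model[k]),
    }


# Hypothetical names and shapes, for illustration only.
checkpoint = {"block0.router.weight": (3, 256), "block0.ffn.weight": (1024, 256)}
edited_model = {"block0.router.weight": (4, 256), "block0.ffn_in.weight": (1024, 256)}

report = diff_state(checkpoint, edited_model)
print(report["only_in_checkpoint"])  # ['block0.ffn.weight']
print(report["only_in_model"])       # ['block0.ffn_in.weight']
print(report["shape_mismatch"])      # ['block0.router.weight']
```

Any non-empty entry in the report corresponds to a load that would fail or silently leave weights uninitialized, which is the situation the last bullet warns against.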
7. What is NOT in this repo (so don't look for it)
- The FineWeb-Edu pretraining pipeline (the 12K-step training that produced v4). Private. The bundled trainer reproduces the recipe on TinyStories, not the v4 ckpt.
- The chat SFT data that produced v4. Private.
- The failed metacog ckpts (v5 / v6 / v7 / v8a / v8b / splice). Available on request via hello@tilelli.tech for negative-result replication.
- The Spectrum (power-of-3 7-level quantization) line, which lives in the source repo's `mosaic/spinoffs/spectrum/`, not here.
- A GPU training requirement. Don't add one.
8. House rules for code edits
- Don't pin torch to an exact version in `pyproject.toml`. It's intentionally left as a loose range (`torch>=2.1,<3`) so users can pick CPU / CUDA / MPS at install time. The comment in `pyproject.toml` says so.
- Don't change `weights_only=False` in the checkpoint loader. The bundled ckpts are author-trusted; the loader (`src/tilelli/utils/checkpoint.py`) is a single audited surface. For untrusted third-party ckpts, verify the SHA from the README first.
- Don't add new top-level dependencies casually. The kit is intentionally `torch + numpy`. Anything else, justify in the PR.
- Don't add CI that auto-uploads anything anywhere. This repo ships binary weights; the security model assumes no automatic outbound network from the build.
- If you remove a claim from the README, also remove the corresponding `reproduce/*.py` script. README claims are 1:1 with scripts by design.
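The "verify the SHA" step in the checkpoint rule can be done with the standard library before any checkpoint is unpickled. This sketch assumes the README publishes SHA-256 hex digests; the function names are hypothetical and the expected digest must be copied from the README, not from this example.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large checkpoint is never fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Usage (the digest string is a placeholder to be replaced by the published value):
# if not verify("checkpoints/tilelli_chat_v4.pt", "<sha256 from README>"):
#     raise SystemExit("checksum mismatch: do not load this checkpoint")
```

Checking the digest first matters because loading with `weights_only=False` unpickles arbitrary objects, so a tampered file can execute code at load time.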
9. Quick smoke sequence for an agent verifying a fresh clone
Run these in order. Total wall time on a modern laptop CPU: ~5 minutes.
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install -e ".[test]"
pytest -q tests/ # expect: 3 passed
python reproduce/01_benchmark.py # expect: PASS, 10.12M params
python chat.py "Hello, who are you?" # expect: short honest reply
python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
--prompt "Once upon a time, there was a little"
# expect: TinyStories-shaped continuation
If all of the above pass, the install is good. The longer reproducers
(02, 03, 04) verify the headline numbers and are worth running
before you cite any of them.
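An agent that prefers to drive the sequence programmatically could use a small runner that stops at the first non-zero exit code. This is a sketch, not a script shipped with the repo: `run_in_order` and `SMOKE_STEPS` are hypothetical names, the commands are copied from the list above, and the two install steps are assumed to have been run already.

```python
import subprocess
import sys


def run_in_order(commands: list[list[str]]) -> int:
    """Run each command in turn; return the first non-zero exit code, else 0."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print("FAILED:", " ".join(cmd), file=sys.stderr)
            return result.returncode
    return 0


# Post-install steps from the smoke sequence, using the current interpreter.
SMOKE_STEPS = [
    [sys.executable, "-m", "pytest", "-q", "tests/"],
    [sys.executable, "reproduce/01_benchmark.py"],
    [sys.executable, "chat.py", "Hello, who are you?"],
    [sys.executable, "infer.py",
     "--ckpt", "checkpoints/tilelli_pretrain_v1_ternary.pt",
     "--prompt", "Once upon a time, there was a little"],
]

# From the repo root:  sys.exit(run_in_order(SMOKE_STEPS))
```

Exit codes only show that the commands ran; the chat reply and the story continuation still need a quick read to confirm they look as expected.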
10. When in doubt
- The README is the contract for users.
- This file is the contract for agents.
- Every numerical claim is bound to a script. If the script's exit code disagrees with what a human (or another agent) just told you, trust the script.