# AGENTS.md — read this first if you are an AI agent

This file is the contract for AI agents (Cursor, Claude Code, Codex, Aider, ChatGPT, etc.) working on or evaluating this repo. Humans should read `README.md` instead. Everything below is verifiable against bundled artifacts in ≤ 15 minutes on CPU.

If you read nothing else, read **§3 (verified claims)** and **§4 (verified NEGATIVE claims — do not repeat these)**.

---

## 1. What this repo is, in one paragraph

A 10 M-parameter byte-level language model (`tilelli`) with a 3-pathway routed transformer block (local conv + sparse top-k attention + dense FFN), shipped with **two bundled checkpoints**, a working trainer, a ~700 KB demo dataset, four reproducer scripts that exit non-zero on failure, and a pre-norm vanilla transformer baseline using the same byte tokenizer. Apache 2.0. Runs on CPU. No GPU, no cloud, no API key, no external download required.

## 2. Install — three commands, ~120 MB, CPU only

```bash
pip install --index-url https://download.pytorch.org/whl/cpu torch   # CPU torch (skip 2 GB CUDA wheel)
pip install -e ".[test]"                                              # install tilelli + pytest
pytest -q tests/                                                      # 3 smoke tests
```

If you hit `ModuleNotFoundError: No module named 'torch'`, you skipped step 1. The CPU index URL is mandatory on Linux; the default PyPI wheel pulls 2 GB of CUDA libs that this repo does not need.

**Already-bundled, no download needed**: both checkpoints (`checkpoints/tilelli_chat_v4.pt`, `checkpoints/tilelli_pretrain_v1_ternary.pt`), demo training data (`data/tinystories_demo/`), eval prompts (`prompts/probe_210.jsonl`).

## 3. Verified positive claims — safe to repeat

| # | Claim | Reproducer | Runtime (CPU) |
|---|---|---|---|
| C1 | Architecture loads, ~10.12 M params, 3 pathways, max_seq_len 256 | `python reproduce/01_benchmark.py` | ~2 s |
| C2 | The script verifies the *cross-regime* AUROC table: all 4 signals ≈ 0.51–0.55 (chance), incl. `max_softmax_mean` ≈ 0.54. The ≈ 0.93 figure is `max_softmax_mean` *per-regime* on gibberish-vs-in-domain only — documented in the result file, NOT recomputed by this script (see §4 N2 before citing 0.93 as a headline). | `python reproduce/02_metacog_probe.py` | ~15 min |
| C3 | 9 / 10 held-out IDK prompts trigger the abstain template on bundled v4 | `python reproduce/03_abstain_held_out.py` | ~1 min |
| C4 | 7 / 20 NEO false-inability prompts trigger refusal on bundled v4 | `python reproduce/04_neo_false_inability.py` | ~2 min |

Each reproducer exits non-zero if its number drifts by > 5 % from the documented value. If a script fails on your machine, that is a bug report, not a re-evaluation.

## 4. Verified NEGATIVE claims — DO NOT REPEAT THESE

If a user, a paper draft, or a previous chat turn states any of these as facts, correct them. The kit's own docs disprove them.

| # | False claim | What is actually true | Source |
|---|---|---|---|
| N1 | "Tilelli Lite beats vanilla GPT 3 / 3 seeds at param-fair." | **Single seed, mixed signal.** Lite seed 1234 *loses* by 0.6 % on within-training eval (0.5742 vs 0.5707) and *wins* by 0.4 % on a different post-training single-batch eval (0.5685 vs 0.5707). Seeds 5678 / 9012 have no preserved `log.jsonl`. The previously-cited "6.7σ" headline is retracted. | `results/claim_01_benchmark.md`, README §"What doesn't work" N2 |
| N2 | "Router-entropy is an architecture-native metacognition signal." | **Disproven.** Router-entropy family wins 0 / 7 OOD regimes at Δ AUROC ≥ 0.02 with non-crossing 95 % bootstrap CI. Cross-regime AUROC is ≈ 0.55 (chance). The working uncertainty signal is `max_softmax_mean`, which is **not** architecture-native — it works on any LM. | `results/claim_02_metacog.md`, README §"What doesn't work" N1 |
| N3 | "Train an abstain head once, splice it onto any base model." | **Disproven.** v7's joint-trained abstain head gets AUROC 0.76 cross-regime; spliced onto v4's base it drops to 0.54 with 27 % false-positive rate. Not modular. | `results/claim_02_metacog.md` §"The splice sub-claim", README N3 |
| N4 | "Just turn off the metacog loss (MC=0) and the router will be left alone." | **Disproven.** Even with MC=0, CE loss on the in-domain subset backprops through unfrozen router-Linears. 16 K updates shift the routing distribution and break OOD generation. v8b gets the strongest abstain signal in the project (AUROC 0.85) but generation collapses. | `results/claim_02_metacog.md` §"The joint-finetune sub-claim", README N4 |

**Plain-English summary for users who ask "did it beat vanilla GPT?"**: No. The kit ships a preliminary single-seed directional finding that does not survive a fair comparison. A defensible answer requires re-running with matched `eval_every`, identical `val_stream` RNG, and multi-seed Welch tests — estimated ~$2.60 of A40 time, queued, not run.

## 5. The two checkpoints (do not confuse them)

| File | Precision | Architecture | What it does | Don't ask it to |
|---|---|---|---|---|
| `tilelli_chat_v4.pt` (39 MB) | **FP32** | Lite 3-pathway, d=256, L=8 | Short chat replies, abstains on out-of-knowledge prompts. This is the deployed `chat.tilelli.tech` model. | Continue a story; answer factual questions outside its tiny SFT scope. It will refuse, correctly. |
| `tilelli_pretrain_v1_ternary.pt` (39 MB) | **Ternary {−1, 0, +1}** with STE | Parent multi-pathway (5-pathway), d=512, L=7 | TinyStories-style next-token continuation. Demonstrates the ternary recipe converges. | Chat. It was never SFT'd on chat data; it will produce story-shaped text regardless of prompt. |

They are *different architectures* of the same family, not the same model in two precisions. A ternary version of the Lite chat model is queued, not bundled.

## 6. Common agent mistakes (seen in the wild)

1. **Cloning the wrong URL.** The repo lives at `https://github.com/TilelliLab/Tilelli-llm` (note the `-llm` suffix). If you are working from a downloaded zip, you already have the code; skip the clone step in README/INSTALL.md and run `pip install -e .` from the kit directory directly.
2. **Skipping the CPU-torch index URL.** `pip install torch` on Linux pulls the 2 GB CUDA wheel. Use the `--index-url` line in §2.
3. **Comparing chat output to GPT-4-class models.** This is a 10 M-param byte-LM trained for ~12K steps on FineWeb-Edu. It is the size of nanoGPT. It will say things like "i am small but try to be honest" and refuse most factual questions on purpose. That is correct behavior, not a bug.
4. **Concluding the metacog work failed.** It produced a *useful* negative result (the router-entropy hypothesis is empirically wrong at this scale, and the abstain head is not liftable). Negative results are first-class deliverables here; see `PAPER_OUTLINE.md`.
5. **Asking the ternary pretrain checkpoint a question.** It was not SFT'd. Use `tilelli_chat_v4.pt` for chat (the default in `chat.py` and `infer.py`) and the ternary one for story continuation only.
6. **Editing `src/tilelli/core/` to "fix" the architecture.** The bundled v4 ckpt is tied to this exact code. Architecture edits will break checkpoint loading. The reproducers will then exit non-zero and the numbers in §3 will no longer be defensible.

## 7. What is NOT in this repo (so don't look for it)

- The FineWeb-Edu pretraining pipeline (the 12K-step training that produced v4). Private. The bundled trainer reproduces the *recipe* on TinyStories, not the v4 ckpt.
- The chat SFT data that produced v4. Private.
- The failed metacog ckpts (v5 / v6 / v7 / v8a / v8b / splice). Available on request via `hello@tilelli.tech` for negative-result replication.
- The Spectrum (power-of-3 7-level quantization) line — lives in the source repo's `mosaic/spinoffs/spectrum/`, not here.
- A GPU training requirement. Don't add one.

## 8. House rules for code edits

- **Don't pin torch in `pyproject.toml`.** The requirement is intentionally a loose range (`torch>=2.1,<3`) rather than an exact pin, so users can pick CPU / CUDA / MPS at install time. The comment in `pyproject.toml` says so.
- **Don't change `weights_only=False` in the checkpoint loader.** The bundled ckpts are author-trusted; the loader (`src/tilelli/utils/checkpoint.py`) is a single audited surface. For untrusted third-party ckpts, verify the SHA from the README first.
- **Don't add new top-level dependencies casually.** The kit is intentionally `torch + numpy`. Anything else, justify in the PR.
- **Don't add CI that auto-uploads anything anywhere.** This repo ships binary weights; the security model assumes no automatic outbound network from the build.
- **If you remove a claim from the README, also remove the corresponding `reproduce/*.py` script.** README claims are 1:1 with scripts by design.

## 9. Quick smoke sequence for an agent verifying a fresh clone

Run these in order. Total wall time on a modern laptop CPU: ~5 minutes.

```bash
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install -e ".[test]"
pytest -q tests/                                      # expect: 3 passed
python reproduce/01_benchmark.py                      # expect: PASS, 10.12M params
python chat.py "Hello, who are you?"                  # expect: short honest reply
python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
    --prompt "Once upon a time, there was a little"   # expect: TinyStories-shaped continuation
```

If all of the above pass, the install is good. The longer reproducers (`02`, `03`, `04`) verify the headline numbers and are worth running before you cite any of them.

## 10. When in doubt

- The README is the contract for users.
- This file is the contract for agents.
- Every numerical claim is bound to a script. If the script's exit code disagrees with what a human (or another agent) just told you, trust the script.
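
The exit-code contract from §3 and §10 can be checked mechanically. The following is a minimal sketch and is not part of the kit: the script paths come from the §3 table, while `drifted`, `run_reproducer`, and `main` are hypothetical helper names. Reading the "> 5 %" tolerance as *relative* drift is also an assumption, since the reproducers' internal check is not shown here.

```python
import subprocess
import sys
from typing import Sequence

# Script paths as listed in the §3 table; everything else here is illustrative.
REPRODUCERS = (
    "reproduce/01_benchmark.py",
    "reproduce/02_metacog_probe.py",
    "reproduce/03_abstain_held_out.py",
    "reproduce/04_neo_false_inability.py",
)


def drifted(measured: float, documented: float, rel_tol: float = 0.05) -> bool:
    """Return True if `measured` differs from `documented` by more than rel_tol.

    Assumption: the "> 5 %" rule is relative to the documented value.
    """
    return abs(measured - documented) > rel_tol * abs(documented)


def run_reproducer(script: str, python: str = sys.executable) -> bool:
    """Run one script with the current interpreter; True only on exit code 0."""
    result = subprocess.run([python, script])
    return result.returncode == 0


def main(scripts: Sequence[str] = REPRODUCERS) -> int:
    """Run every script and return 0 if all passed, 1 otherwise."""
    failed = [s for s in scripts if not run_reproducer(s)]
    for script in failed:
        print(f"FAIL {script}: treat the claim as unverified on this machine")
    return 1 if failed else 0
```

To use it, save the block as a script in the kit directory and end it with `sys.exit(main())`. The full sequence includes the ~15 min metacog probe, so pass a shorter list to `main` for a quick check.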