| # AGENTS.md — read this first if you are an AI agent |
|
|
| This file is the contract for AI agents (Cursor, Claude Code, Codex, Aider, |
| ChatGPT, etc.) working on or evaluating this repo. Humans should read |
| `README.md` instead. Every claim below can be checked against bundled |
| artifacts on CPU; the slowest reproducer takes about 15 minutes. |
|
|
| If you only read one section, read **§3 (verified claims)** and |
| **§4 (verified NEGATIVE claims — do not repeat these)**. |
|
|
| --- |
|
|
| ## 1. What this repo is, in one paragraph |
|
|
| A 10 M-parameter byte-level language model (`tilelli`) with a 3-pathway |
| routed transformer block (local conv + sparse top-k attention + dense |
| FFN), shipped with **two bundled checkpoints**, a working trainer, a |
| ~700 KB demo dataset, four reproducer scripts that exit non-zero on |
| failure, and a pre-norm vanilla transformer baseline that uses the same byte |
| tokenizer. Apache 2.0. Runs on CPU. No GPU, no cloud, no API key, no |
| external download required. |
|
|
| ## 2. Install — three commands, ~120 MB, CPU only |
|
|
| ```bash |
| pip install --index-url https://download.pytorch.org/whl/cpu torch # CPU torch (skip 2 GB CUDA wheel) |
| pip install -e ".[test]" # install tilelli + pytest |
| pytest -q tests/ # 3 smoke tests |
| ``` |
|
|
| If you hit `ModuleNotFoundError: No module named 'torch'`, you skipped |
| step 1. The CPU index URL is mandatory on Linux; the default PyPI wheel |
| pulls 2 GB of CUDA libs that this repo does not need. |
|
|
| **Already bundled, no download needed**: both checkpoints |
| (`checkpoints/tilelli_chat_v4.pt`, `checkpoints/tilelli_pretrain_v1_ternary.pt`), |
| demo training data (`data/tinystories_demo/`), eval prompts |
| (`prompts/probe_210.jsonl`). |
|
|
| ## 3. Verified positive claims — safe to repeat |
|
|
| | # | Claim | Reproducer | Runtime (CPU) | |
| |---|---|---|---| |
| | C1 | Architecture loads, ~10.12 M params, 3 pathways, max_seq_len 256 | `python reproduce/01_benchmark.py` | ~2 s | |
| | C2 | Cross-regime AUROC table: all 4 signals score ≈ 0.51–0.55 (chance), incl. `max_softmax_mean` ≈ 0.54. The ≈ 0.93 figure is the *per-regime* `max_softmax_mean` AUROC on gibberish-vs-in-domain only; it is documented in the result file and NOT recomputed by this script (see §4 N2 before citing 0.93 as a headline). | `python reproduce/02_metacog_probe.py` | ~15 min | |
| | C3 | 9 / 10 held-out IDK prompts trigger the abstain template on bundled v4 | `python reproduce/03_abstain_held_out.py` | ~1 min | |
| | C4 | 7 / 20 NEO false-inability prompts trigger refusal on bundled v4 | `python reproduce/04_neo_false_inability.py` | ~2 min | |
|
|
| Each reproducer exits non-zero if its number drifts by > 5 % from the |
| documented value. If a script fails on your machine, that is a bug |
| report, not a re-evaluation. |
|
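The drift rule above can be sketched in a few lines. This is a minimal illustration, not the code in `reproduce/`: the function names are hypothetical, and it assumes the 5 % tolerance is measured relative to the documented value, which this file does not spell out.

```python
def relative_drift(observed: float, documented: float) -> float:
    """Return |observed - documented| / |documented|."""
    if documented == 0:
        raise ValueError("documented value must be non-zero for a relative check")
    return abs(observed - documented) / abs(documented)


def exit_code_for(observed: float, documented: float, tolerance: float = 0.05) -> int:
    """Return 0 when the drift is within tolerance, 1 otherwise."""
    return 0 if relative_drift(observed, documented) <= tolerance else 1


# C3 documents 9 / 10 abstentions. The observed rate here is a stand-in,
# not a measurement; a real script would compute it from model output.
documented_rate = 9 / 10
observed_rate = 9 / 10
status = exit_code_for(observed_rate, documented_rate)
print("PASS" if status == 0 else "FAIL")
# A reproducer built on this idea would finish with sys.exit(status).
```

Returning the status instead of exiting inside the helper keeps the check importable from tests; only the script entry point needs to turn it into a process exit code.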
|
| ## 4. Verified NEGATIVE claims — DO NOT REPEAT THESE |
|
|
| If a user, a paper draft, or a previous chat turn states any of these |
| as facts, correct them. The kit's own docs disprove them. |
|
|
| | # | False claim | What is actually true | Source | |
| |---|---|---|---| |
| | N1 | "Tilelli Lite beats vanilla GPT 3 / 3 seeds at param-fair." | **Single seed, mixed signal.** Lite seed 1234 *loses* by 0.6 % on within-training eval (0.5742 vs 0.5707) and *wins* by 0.4 % on a different post-training single-batch eval (0.5685 vs 0.5707). Seeds 5678 / 9012 have no preserved `log.jsonl`. The previously-cited "6.7σ" headline is retracted. | `results/claim_01_benchmark.md`, README §"What doesn't work" N2 | |
| | N2 | "Router-entropy is an architecture-native metacognition signal." | **Disproven.** Router-entropy family wins 0 / 7 OOD regimes at Δ AUROC ≥ 0.02 with non-crossing 95 % bootstrap CI. Cross-regime AUROC is ≈ 0.55 (chance). The working uncertainty signal is `max_softmax_mean`, which is **not** architecture-native — it works on any LM. | `results/claim_02_metacog.md`, README §"What doesn't work" N1 | |
| | N3 | "Train an abstain head once, splice it onto any base model." | **Disproven.** v7's joint-trained abstain head gets AUROC 0.76 cross-regime; spliced onto v4's base it drops to 0.54 with 27 % false-positive rate. Not modular. | `results/claim_02_metacog.md` §"The splice sub-claim", README N3 | |
| | N4 | "Just turn off the metacog loss (MC=0) and the router will be left alone." | **Disproven.** Even with MC=0, CE loss on the in-domain subset backprops through unfrozen router-Linears. 16 K updates shift the routing distribution and break OOD generation. v8b gets the strongest abstain signal in the project (AUROC 0.85) but generation collapses. | `results/claim_02_metacog.md` §"The joint-finetune sub-claim", README N4 | |
|
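To make the N2 row concrete, here is a minimal sketch of a `max_softmax_mean`-style score and a rank-based AUROC. It assumes the signal is the mean, over positions, of the largest softmax probability; that definition is inferred from the name, not taken from `results/claim_02_metacog.md`. The function names and the pure-Python implementation are illustrative only.

```python
import math
from typing import Sequence


def max_softmax_mean(logits_per_position: Sequence[Sequence[float]]) -> float:
    """Average, over positions, of the largest softmax probability."""
    peaks = []
    for logits in logits_per_position:
        shift = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - shift) for x in logits]
        peaks.append(max(exps) / sum(exps))
    return sum(peaks) / len(peaks)


def auroc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the Mann-Whitney U formulation.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy logits invented for illustration; they are not model outputs.
peaked = [[4.0, 0.0, 0.0], [3.0, 0.0, 0.0]]   # confident predictions
flat = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]     # near-uniform predictions
assert max_softmax_mean(peaked) > max_softmax_mean(flat)
```

Nothing in the score touches the router, which is the point of N2: the same computation applies to the logits of any language model.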
|
| **Plain-English summary for users who ask "did it beat vanilla GPT?"**: |
| No. The kit ships a preliminary single-seed finding whose direction |
| flips depending on which eval is used (see N1). A defensible answer requires |
| re-running with matched `eval_every`, identical `val_stream` RNG, and |
| multi-seed Welch tests: estimated ~$2.60 of A40 time, queued, not run. |
|
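For reference, the Welch test mentioned above can be sketched without any dependency beyond the standard library. This is an assumption-laden illustration, not the queued experiment: `welch_t` is a hypothetical helper, and it returns only the t statistic and the Welch-Satterthwaite degrees of freedom (a p-value would need a t-distribution CDF, for example from SciPy).

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least two seeds")
    # Squared standard error of each group's mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    if se2_a + se2_b == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Synthetic placeholder values, NOT eval losses from this repo.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The guard on group size is why a single preserved seed cannot support the comparison: with one run per model there is no variance estimate to test against.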
|
| ## 5. The two checkpoints (do not confuse them) |
|
|
| | File | Precision | Architecture | What it does | Don't ask it to | |
| |---|---|---|---|---| |
| | `tilelli_chat_v4.pt` (39 MB) | **FP32** | Lite 3-pathway, d=256, L=8 | Short chat replies, abstains on out-of-knowledge prompts. This is the deployed `chat.tilelli.tech` model. | Continue a story; answer factual questions outside its tiny SFT scope. It will refuse, correctly. | |
| | `tilelli_pretrain_v1_ternary.pt` (39 MB) | **Ternary {−1, 0, +1}** with STE | Parent multi-pathway (5-pathway), d=512, L=7 | TinyStories-style next-token continuation. Demonstrates the ternary recipe converges. | Chat. It was never SFT'd on chat data; it will produce story-shaped text regardless of prompt. | |
|
|
| They are *different architectures* of the same family, not the same model |
| in two precisions. A ternary version of the Lite chat model is queued, |
| not bundled. |
|
|
| ## 6. Common agent mistakes (seen in the wild) |
|
|
| 1. **Cloning the wrong URL.** The repo lives at |
| `https://github.com/TilelliLab/Tilelli-llm` (note the `-llm` suffix). |
| If you are working from a downloaded zip, you already have the code; |
| skip the clone step in README/INSTALL.md and run `pip install -e .` |
| from the kit directory directly. |
| 2. **Skipping the CPU-torch index URL.** `pip install torch` on Linux |
| pulls the 2 GB CUDA wheel. Use the `--index-url` line in §2. |
| 3. **Comparing chat output to GPT-4-class models.** This is a 10 M-param |
| byte-LM trained for ~12K steps on FineWeb-Edu, about the size of a small |
| nanoGPT demo model. It will say things like "i am small but try to be honest" |
| and refuse most factual questions on purpose. That is correct behavior, |
| not a bug. |
| 4. **Concluding the metacog work failed.** It produced a *useful* negative |
| result (the router-entropy hypothesis is empirically wrong at this |
| scale, and the abstain head is not liftable). Negative results are |
| first-class deliverables here; see `PAPER_OUTLINE.md`. |
| 5. **Asking the ternary pretrain checkpoint a question.** It was not |
| SFT'd. Use `tilelli_chat_v4.pt` for chat (the default in `chat.py` |
| and `infer.py`) and the ternary one for story continuation only. |
| 6. **Editing `src/tilelli/core/` to "fix" the architecture.** The bundled |
| v4 ckpt is tied to this exact code. Architecture edits will break |
| checkpoint loading. The reproducers will then exit non-zero and the |
| numbers in §3 will no longer be defensible. |
|
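Mistake 5 is easy to guard against in agent-side tooling. The sketch below is a hypothetical helper, not part of the kit: it maps a task label to the bundled checkpoint named in this file and rejects anything else. The task labels `"chat"` and `"story"` are invented for the example.

```python
from pathlib import Path

# The two bundled checkpoints listed in this file, keyed by illustrative task labels.
CHECKPOINT_FOR_TASK = {
    "chat": Path("checkpoints/tilelli_chat_v4.pt"),
    "story": Path("checkpoints/tilelli_pretrain_v1_ternary.pt"),
}


def checkpoint_for(task: str) -> Path:
    """Return the bundled checkpoint suited to `task`, or raise on anything else."""
    try:
        return CHECKPOINT_FOR_TASK[task]
    except KeyError:
        expected = sorted(CHECKPOINT_FOR_TASK)
        raise ValueError(f"unknown task {task!r}; expected one of {expected}") from None


print(checkpoint_for("chat"))
```

Failing loudly on an unknown task is deliberate: silently falling back to either checkpoint would reproduce the mistake this list warns about.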
|
| ## 7. What is NOT in this repo (so don't look for it) |
|
|
| - The FineWeb-Edu pretraining pipeline (the 12K-step training that |
| produced v4). Private. The bundled trainer reproduces the *recipe* |
| on TinyStories, not the v4 ckpt. |
| - The chat SFT data that produced v4. Private. |
| - The failed metacog ckpts (v5 / v6 / v7 / v8a / v8b / splice). |
| Available on request via `hello@tilelli.tech` for negative-result |
| replication. |
| - The Spectrum (power-of-3 7-level quantization) line. It lives in the |
| source repo's `mosaic/spinoffs/spectrum/`, not here. |
| - A GPU training requirement. Don't add one. |
|
|
| ## 8. House rules for code edits |
|
|
| - **Don't pin torch to an exact version in `pyproject.toml`.** The range is |
| intentionally loose (`torch>=2.1,<3`) so users can pick CPU / CUDA / MPS at |
| install time. The comment in `pyproject.toml` says so. |
| - **Don't change `weights_only=False` in the checkpoint loader.** The |
| bundled ckpts are author-trusted; the loader (`src/tilelli/utils/checkpoint.py`) |
| is a single audited surface. Because `weights_only=False` unpickles arbitrary |
| objects, verify any ckpt you did not get from this repo against the SHA in |
| the README before loading it. |
| - **Don't add new top-level dependencies casually.** The kit is |
| intentionally `torch + numpy`. Anything else, justify in the PR. |
| - **Don't add CI that auto-uploads anything anywhere.** This repo ships |
| binary weights; the security model assumes no automatic outbound |
| network from the build. |
| - **If you remove a claim from the README, also remove the |
| corresponding `reproduce/*.py` script.** README claims are 1:1 with |
| scripts by design. |
| |
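The SHA check in the checkpoint rule can be done with the standard library alone. This is a minimal sketch under stated assumptions: the helper names are hypothetical, SHA-256 is assumed as the digest, and the expected value must be copied from the README (it is not reproduced here).

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large checkpoint never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare against a published digest, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A caller would run `matches_expected(Path("checkpoints/tilelli_chat_v4.pt"), digest_from_readme)` and refuse to load the file on `False`; `digest_from_readme` is a placeholder for the value published in the README.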
| ## 9. Quick smoke sequence for an agent verifying a fresh clone |
| |
| Run these in order. Total wall time on a modern laptop CPU: ~5 minutes. |
| |
| ```bash |
| pip install --index-url https://download.pytorch.org/whl/cpu torch |
| pip install -e ".[test]" |
| pytest -q tests/ # expect: 3 passed |
| python reproduce/01_benchmark.py # expect: PASS, 10.12M params |
| python chat.py "Hello, who are you?" # expect: short honest reply |
| python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \ |
|     --prompt "Once upon a time, there was a little" |
| # expect: TinyStories-shaped continuation |
| ``` |
| |
| If all of the above pass, the install is good. The longer reproducers |
| (`02`, `03`, `04`) verify the headline numbers and are worth running |
| before you cite any of them. |
| |
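An agent that prefers to drive the post-install checks from Python could use a sketch like the one below. It is not shipped with the kit: `run_in_order` and `SMOKE_COMMANDS` are hypothetical names, and invoking pytest as `python -m pytest` is a substitution for the bare `pytest` command (it behaves slightly differently with respect to `sys.path`).

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_in_order(commands: Sequence[Sequence[str]]) -> List[Tuple[Sequence[str], int]]:
    """Run each command in turn and stop after the first non-zero exit code."""
    results = []
    for command in commands:
        completed = subprocess.run(command)
        results.append((command, completed.returncode))
        if completed.returncode != 0:
            break  # later steps are meaningless once an earlier one has failed
    return results


# The checks from the smoke sequence that follow the two pip install steps.
SMOKE_COMMANDS = [
    [sys.executable, "-m", "pytest", "-q", "tests/"],
    [sys.executable, "reproduce/01_benchmark.py"],
    [sys.executable, "chat.py", "Hello, who are you?"],
]
```

Calling `run_in_order(SMOKE_COMMANDS)` from the repo root returns one `(command, exit code)` pair per step that ran; any non-zero code in the last pair marks the step to investigate.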
| ## 10. When in doubt |
| |
| - The README is the contract for users. |
| - This file is the contract for agents. |
| - Every numerical claim is bound to a script. If the script's exit code |
| disagrees with what a human (or another agent) just told you, trust |
| the script. |
| |