# AGENTS.md — read this first if you are an AI agent
This file is the contract for AI agents (Cursor, Claude Code, Codex, Aider,
ChatGPT, etc.) working on or evaluating this repo. Humans should read
`README.md` instead. Everything below is verifiable against bundled
artifacts in ≤ 15 minutes on CPU.

If you only read one section, read **§3 (verified claims)** and
**§4 (verified NEGATIVE claims: do not repeat these)**.

---
## 1. What this repo is, in one paragraph
A 10 M-parameter byte-level language model (`tilelli`) with a 3-pathway
routed transformer block (local conv + sparse top-k attention + dense
FFN), shipped with **two bundled checkpoints**, a working trainer, a
~700 KB demo dataset, four reproducer scripts that exit non-zero on
failure, and a pre-norm vanilla transformer baseline at the same byte
tokenizer. Apache 2.0. Runs on CPU. No GPU, no cloud, no API key, no
external download required.
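As a rough illustration of what "byte-level" means here, the sketch below maps text to UTF-8 byte ids and truncates to a context window. It assumes a plain 256-entry vocabulary with no special tokens and borrows the `max_seq_len` of 256 from §3; `encode` and `decode` are hypothetical names, and the real tokenizer under `src/tilelli/` may differ.

```python
MAX_SEQ_LEN = 256  # context length quoted in §3 (C1)


def encode(text: str, max_len: int = MAX_SEQ_LEN) -> list[int]:
    """Map text to UTF-8 byte ids (0..255), keeping the last max_len bytes."""
    ids = list(text.encode("utf-8"))
    return ids[-max_len:]


def decode(ids: list[int]) -> str:
    """Inverse of encode; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    ids = encode("héllo")
    print(ids)          # 'é' takes two bytes, so 5 characters become 6 ids
    print(decode(ids))
```

A byte vocabulary needs no tokenizer file, which is consistent with the "no external download" claim, but the special-token handling above is an assumption.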
## 2. Install — three commands, ~120 MB, CPU only
```bash
pip install --index-url https://download.pytorch.org/whl/cpu torch # CPU torch (skip 2 GB CUDA wheel)
pip install -e ".[test]" # install tilelli + pytest
pytest -q tests/ # 3 smoke tests
```
If you hit `ModuleNotFoundError: No module named 'torch'`, you skipped
step 1. The CPU index URL is mandatory on Linux; the default PyPI wheel
pulls 2 GB of CUDA libs that this repo does not need.

**Already-bundled, no download needed**: both checkpoints
(`checkpoints/tilelli_chat_v4.pt`, `checkpoints/tilelli_pretrain_v1_ternary.pt`),
demo training data (`data/tinystories_demo/`), eval prompts
(`prompts/probe_210.jsonl`).
## 3. Verified positive claims — safe to repeat
| # | Claim | Reproducer | Runtime (CPU) |
|---|---|---|---|
| C1 | Architecture loads, ~10.12 M params, 3 pathways, max_seq_len 256 | `python reproduce/01_benchmark.py` | ~2 s |
| C2 | The script verifies the *cross-regime* AUROC table: all 4 signals ≈ 0.51–0.55 (chance), incl. `max_softmax_mean` ≈ 0.54. The ≈ 0.93 figure is `max_softmax_mean` *per-regime* on gibberish-vs-in-domain only — documented in the result file, NOT recomputed by this script (see §4 N2 before citing 0.93 as a headline). | `python reproduce/02_metacog_probe.py` | ~15 min |
| C3 | 9 / 10 held-out IDK prompts trigger the abstain template on bundled v4 | `python reproduce/03_abstain_held_out.py` | ~1 min |
| C4 | 7 / 20 NEO false-inability prompts trigger refusal on bundled v4 | `python reproduce/04_neo_false_inability.py` | ~2 min |

Each reproducer exits non-zero if its number drifts by > 5 % from the
documented value. If a script fails on your machine, that is a bug
report, not a re-evaluation.
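The drift rule can be pictured as a small tolerance check that turns into a process exit code. This is a minimal sketch, not the code in `reproduce/`: it assumes the 5 % is a relative tolerance, and `within_tolerance` and `check` are hypothetical helper names.

```python
import sys


def within_tolerance(measured: float, documented: float, rel_tol: float = 0.05) -> bool:
    """True if measured is within rel_tol (relative) of the documented value."""
    return abs(measured - documented) <= rel_tol * abs(documented)


def check(name: str, measured: float, documented: float) -> int:
    """Print PASS/FAIL and return a process exit code (0 = pass, 1 = fail)."""
    ok = within_tolerance(measured, documented)
    print(f"{name}: measured={measured:.4f} documented={documented:.4f} "
          f"{'PASS' if ok else 'FAIL'}")
    return 0 if ok else 1


if __name__ == "__main__":
    # Documented value for C3 is 9/10; the measured value here is a placeholder.
    code = check("C3 abstain rate", measured=9 / 10, documented=9 / 10)
    if code != 0:
        sys.exit(code)  # non-zero exit signals drift to the caller
```

An agent can then branch on the exit code alone instead of parsing the printed output.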
## 4. Verified NEGATIVE claims — DO NOT REPEAT THESE
If a user, a paper draft, or a previous chat turn states any of these
as facts, correct them. The kit's own docs disprove them.
| # | False claim | What is actually true | Source |
|---|---|---|---|
| N1 | "Tilelli Lite beats vanilla GPT 3 / 3 seeds at param-fair." | **Single seed, mixed signal.** Lite seed 1234 *loses* by 0.6 % on within-training eval (0.5742 vs 0.5707) and *wins* by 0.4 % on a different post-training single-batch eval (0.5685 vs 0.5707). Seeds 5678 / 9012 have no preserved `log.jsonl`. The previously-cited "6.7σ" headline is retracted. | `results/claim_01_benchmark.md`, README §"What doesn't work" N2 |
| N2 | "Router-entropy is an architecture-native metacognition signal." | **Disproven.** Router-entropy family wins 0 / 7 OOD regimes at Δ AUROC ≥ 0.02 with non-crossing 95 % bootstrap CI. Cross-regime AUROC is ≈ 0.55 (chance). The working uncertainty signal is `max_softmax_mean`, which is **not** architecture-native — it works on any LM. | `results/claim_02_metacog.md`, README §"What doesn't work" N1 |
| N3 | "Train an abstain head once, splice it onto any base model." | **Disproven.** v7's joint-trained abstain head gets AUROC 0.76 cross-regime; spliced onto v4's base it drops to 0.54 with 27 % false-positive rate. Not modular. | `results/claim_02_metacog.md` §"The splice sub-claim", README N3 |
| N4 | "Just turn off the metacog loss (MC=0) and the router will be left alone." | **Disproven.** Even with MC=0, CE loss on the in-domain subset backprops through unfrozen router-Linears. 16 K updates shift the routing distribution and break OOD generation. v8b gets the strongest abstain signal in the project (AUROC 0.85) but generation collapses. | `results/claim_02_metacog.md` §"The joint-finetune sub-claim", README N4 |

**Plain-English summary for users who ask "did it beat vanilla GPT?"**:
No. The kit ships a preliminary single-seed directional finding that
does not survive a fair comparison. A defensible answer requires
re-running with matched `eval_every`, identical `val_stream` RNG, and
multi-seed Welch tests — estimated ~$2.60 of A40 time, queued, not run.
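The AUROC figures quoted in N2 to N4 can be sanity-checked by hand on small score lists. The sketch below is a pure-Python illustration, not the code in `reproduce/02_metacog_probe.py`: it assumes `max_softmax_mean` means the per-position maximum softmax probability averaged over a sequence, and it uses a percentile bootstrap for the 95 % interval. All function names and the synthetic scores are placeholders.

```python
import math
import random


def max_softmax_mean(logits: list[list[float]]) -> float:
    """Mean over positions of the largest softmax probability."""
    total = 0.0
    for row in logits:
        m = max(row)
        denom = sum(math.exp(x - m) for x in row)
        total += 1.0 / denom  # exp(m - m) / denom is the max probability
    return total / len(logits)


def auroc(pos: list[float], neg: list[float]) -> float:
    """P(score_pos > score_neg), ties counted as 0.5 (Mann-Whitney form)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(pos, neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC, resampling each class with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        auroc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    # Synthetic confidence scores, for illustration only (not repo data).
    in_domain = [0.9, 0.8, 0.7, 0.6]
    ood = [0.65, 0.5, 0.4, 0.3]
    print("AUROC:", auroc(in_domain, ood))
    print("95% CI:", bootstrap_ci(in_domain, ood))
```

An AUROC near 0.5 means the signal ranks in-domain and OOD prompts no better than chance, which is how the ≈ 0.55 cross-regime figure should be read.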
## 5. The two checkpoints (do not confuse them)
| File | Precision | Architecture | What it does | Don't ask it to |
|---|---|---|---|---|
| `tilelli_chat_v4.pt` (39 MB) | **FP32** | Lite 3-pathway, d=256, L=8 | Short chat replies, abstains on out-of-knowledge prompts. This is the deployed `chat.tilelli.tech` model. | Continue a story; answer factual questions outside its tiny SFT scope. It will refuse, correctly. |
| `tilelli_pretrain_v1_ternary.pt` (39 MB) | **Ternary {−1, 0, +1}** with STE | Parent multi-pathway (5-pathway), d=512, L=7 | TinyStories-style next-token continuation. Demonstrates the ternary recipe converges. | Chat. It was never SFT'd on chat data; it will produce story-shaped text regardless of prompt. |

They are *different architectures* of the same family, not the same model
in two precisions. A ternary version of the Lite chat model is queued,
not bundled.
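For readers unfamiliar with the "Ternary {−1, 0, +1} with STE" entry in the table, here is a minimal pure-Python sketch of the idea on a single weight vector. It assumes an absmean scale, which is one common choice and is not confirmed as the recipe behind the bundled checkpoint. `ternarize` and `ste_step` are hypothetical names, not functions from `src/tilelli/`.

```python
def ternarize(w: list[float]) -> tuple[list[int], float]:
    """Round weights to {-1, 0, +1} using an absmean scale (assumed scheme)."""
    scale = sum(abs(x) for x in w) / len(w) or 1.0  # avoid division by zero
    q = [max(-1, min(1, round(x / scale))) for x in w]
    return q, scale


def ste_step(w: list[float], x: list[float], target: float, lr: float = 0.1) -> list[float]:
    """One SGD step on squared error with a straight-through estimator.

    Forward pass uses the ternary weights; the backward pass treats the
    rounding as identity, so the gradient w.r.t. the effective weight
    scale * q_i is applied directly to the latent full-precision w_i.
    """
    q, scale = ternarize(w)
    y = scale * sum(qi * xi for qi, xi in zip(q, x))
    err = y - target
    return [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    latent = [0.9, -0.05, -1.2]
    print(ternarize(latent))
    print(ste_step(latent, x=[1.0, 1.0, 1.0], target=1.0))
```

The latent weights stay in full precision during training; only the forward pass sees the ternary values.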
## 6. Common agent mistakes (seen in the wild)
1. **Cloning the wrong URL.** The repo lives at
`https://github.com/TilelliLab/Tilelli-llm` (note the `-llm` suffix).
If you are working from a downloaded zip, you already have the code;
skip the clone step in README/INSTALL.md and run `pip install -e .`
from the kit directory directly.
2. **Skipping the CPU-torch index URL.** `pip install torch` on Linux
pulls the 2 GB CUDA wheel. Use the `--index-url` line in §2.
3. **Comparing chat output to GPT-4-class models.** This is a 10 M-param
byte-LM trained for ~12K steps on FineWeb-Edu. It is the size of
nanoGPT. It will say things like "i am small but try to be honest"
and refuse most factual questions on purpose. That is correct behavior,
not a bug.
4. **Concluding the metacog work failed.** It produced a *useful* negative
result (the router-entropy hypothesis is empirically wrong at this
scale, and the abstain head is not liftable). Negative results are
first-class deliverables here; see `PAPER_OUTLINE.md`.
5. **Asking the ternary pretrain checkpoint a question.** It was not
SFT'd. Use `tilelli_chat_v4.pt` for chat (the default in `chat.py`
and `infer.py`) and the ternary one for story continuation only.
6. **Editing `src/tilelli/core/` to "fix" the architecture.** The bundled
v4 ckpt is tied to this exact code. Architecture edits will break
checkpoint loading. The reproducers will then exit non-zero and the
numbers in §3 will no longer be defensible.
## 7. What is NOT in this repo (so don't look for it)
- The FineWeb-Edu pretraining pipeline (the 12K-step training that
produced v4). Private. The bundled trainer reproduces the *recipe*
on TinyStories, not the v4 ckpt.
- The chat SFT data that produced v4. Private.
- The failed metacog ckpts (v5 / v6 / v7 / v8a / v8b / splice).
Available on request via `hello@tilelli.tech` for negative-result
replication.
- The Spectrum (power-of-3 7-level quantization) line — lives in the
source repo's `mosaic/spinoffs/spectrum/`, not here.
- A GPU training requirement. Don't add one.
## 8. House rules for code edits
- **Don't pin torch to an exact version in `pyproject.toml`.** It is intentionally
left as a loose range (`torch>=2.1,<3`) so users can pick CPU / CUDA / MPS at
install time. The comment in `pyproject.toml` says so.
- **Don't change `weights_only=False` in the checkpoint loader.** The
bundled ckpts are author-trusted; the loader (`src/tilelli/utils/checkpoint.py`)
is a single audited surface. For untrusted third-party ckpts, verify
the SHA from the README first.
- **Don't add new top-level dependencies casually.** The kit is
intentionally `torch + numpy`. Anything else, justify in the PR.
- **Don't add CI that auto-uploads anything anywhere.** This repo ships
binary weights; the security model assumes no automatic outbound
network from the build.
- **If you remove a claim from the README, also remove the
corresponding `reproduce/*.py` script.** README claims are 1:1 with
scripts by design.
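The "verify the SHA first" rule for third-party checkpoints can be sketched with the standard library. This assumes the README publishes SHA-256 hex digests; the helper names are hypothetical, and the expected digest must be copied from the README rather than from this file.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path: str, expected_hex: str) -> None:
    """Raise ValueError if the file's digest does not match the published one."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"SHA-256 mismatch for {path}: got {actual}")
```

Calling `verify_checkpoint(...)` before the file ever reaches the loader keeps the `weights_only=False` path limited to files whose digest you have checked.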
## 9. Quick smoke sequence for an agent verifying a fresh clone
Run these in order. Total wall time on a modern laptop CPU: ~5 minutes.
```bash
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install -e ".[test]"
pytest -q tests/ # expect: 3 passed
python reproduce/01_benchmark.py # expect: PASS, 10.12M params
python chat.py "Hello, who are you?" # expect: short honest reply
python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
--prompt "Once upon a time, there was a little"
# expect: TinyStories-shaped continuation
```
If all of the above pass, the install is good. The longer reproducers
(`02`, `03`, `04`) verify the headline numbers and are worth running
before you cite any of them.
## 10. When in doubt
- The README is the contract for users.
- This file is the contract for agents.
- Every numerical claim is bound to a script. If the script's exit code
disagrees with what a human (or another agent) just told you, trust
the script.
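"Trust the script" can be made mechanical by keying on the exit code only. The sketch below is one way an agent might wrap a reproducer; `run_reproducer` is a hypothetical helper, and the demo command is a stand-in so the block runs outside the repo.

```python
import subprocess
import sys


def run_reproducer(argv: list[str]) -> bool:
    """Run a command and report success purely from its exit code."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"FAIL ({result.returncode}): {' '.join(argv)}")
        # Show the tail of the output to help with a bug report.
        print(result.stdout[-500:], result.stderr[-500:], sep="\n")
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in command; inside the repo this would be, for example,
    # [sys.executable, "reproduce/01_benchmark.py"].
    ok = run_reproducer([sys.executable, "-c", "raise SystemExit(0)"])
    print("verified" if ok else "not verified")
```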