	# AGENTS.md — read this first if you are an AI agent

	This file is the contract for AI agents (Cursor, Claude Code, Codex, Aider,
	ChatGPT, etc.) working on or evaluating this repo. Humans should read
	`README.md` instead. Everything below is verifiable against bundled
	artifacts in ≤ 15 minutes on CPU.

	If you only read one section, read §3 (verified claims) and
	§4 (verified NEGATIVE claims — do not repeat these).

	---

	## 1. What this repo is, in one paragraph

	A 10 M-parameter byte-level language model (`tilelli`) with a 3-pathway
	routed transformer block (local conv + sparse top-k attention + dense
	FFN), shipped with two bundled checkpoints, a working trainer, a
	~700 KB demo dataset, four reproducer scripts that exit non-zero on
	failure, and a pre-norm vanilla transformer baseline that uses the same byte
	tokenizer. Apache 2.0. Runs on CPU. No GPU, no cloud, no API key, no
	external download required.

	## 2. Install — three commands, ~120 MB, CPU only

	```bash
	pip install --index-url https://download.pytorch.org/whl/cpu torch # CPU torch (skip 2 GB CUDA wheel)
	pip install -e ".[test]" # install tilelli + pytest
	pytest -q tests/ # 3 smoke tests
	```

	If you hit `ModuleNotFoundError: No module named 'torch'`, you skipped
	the first command. The CPU index URL is mandatory on Linux; the default PyPI wheel
	pulls 2 GB of CUDA libs that this repo does not need.

	Already-bundled, no download needed: both checkpoints
	(`checkpoints/tilelli_chat_v4.pt`, `checkpoints/tilelli_pretrain_v1_ternary.pt`),
	demo training data (`data/tinystories_demo/`), eval prompts
	(`prompts/probe_210.jsonl`).
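
A quick way to confirm that nothing needs downloading is to check that the four bundled paths exist. This is a minimal sketch, assuming it is run from the kit root; the helper name `missing_artifacts` is hypothetical and not part of the kit.

```python
from pathlib import Path

# Paths copied from the list above, relative to the kit root.
BUNDLED = (
    "checkpoints/tilelli_chat_v4.pt",
    "checkpoints/tilelli_pretrain_v1_ternary.pt",
    "data/tinystories_demo",
    "prompts/probe_210.jsonl",
)


def missing_artifacts(root="."):
    """Return the bundled paths that do not exist under `root` (hypothetical helper)."""
    base = Path(root)
    return [rel for rel in BUNDLED if not (base / rel).exists()]
```

An empty list from `missing_artifacts()` means every bundled artifact is present; anything else names the paths to investigate before running the reproducers.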

	## 3. Verified positive claims — safe to repeat

	| # | Claim | Reproducer | Runtime (CPU) |
	|---|---|---|---|
	| C1 | Architecture loads, ~10.12 M params, 3 pathways, max_seq_len 256 | `python reproduce/01_benchmark.py` | ~2 s |
	| C2 | The script verifies the cross-regime AUROC table: all 4 signals ≈ 0.51–0.55 (chance), incl. `max_softmax_mean` ≈ 0.54. The ≈ 0.93 figure is `max_softmax_mean` per-regime on gibberish-vs-in-domain only; it is documented in the result file, NOT recomputed by this script (see §4 N2 before citing 0.93 as a headline). | `python reproduce/02_metacog_probe.py` | ~15 min |
	| C3 | 9 / 10 held-out IDK prompts trigger the abstain template on bundled v4 | `python reproduce/03_abstain_held_out.py` | ~1 min |
	| C4 | 7 / 20 NEO false-inability prompts trigger refusal on bundled v4 | `python reproduce/04_neo_false_inability.py` | ~2 min |

	Each reproducer exits non-zero if its number drifts by > 5 % from the
	documented value. If a script fails on your machine, that is a bug
	report, not a re-evaluation.
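
The drift rule can be pictured as a relative-tolerance check that maps to an exit code. This is a sketch only: it assumes the 5 % is measured relative to the documented value, and the function names are hypothetical, not the reproducers' actual code.

```python
def within_tolerance(measured, documented, rel_tol=0.05):
    """True if `measured` is within `rel_tol` (relative) of `documented`."""
    if documented == 0:
        return measured == 0
    return abs(measured - documented) / abs(documented) <= rel_tol


def exit_code(measured, documented, rel_tol=0.05):
    """0 when the number holds, 1 when it has drifted (assumed convention)."""
    return 0 if within_tolerance(measured, documented, rel_tol) else 1
```

Under this reading, a C3 run that scores 8 / 10 instead of the documented 9 / 10 is about 11 % off and would exit non-zero.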

	## 4. Verified NEGATIVE claims — DO NOT REPEAT THESE

	If a user, a paper draft, or a previous chat turn states any of these
	as facts, correct them. The kit's own docs disprove them.

	| # | False claim | What is actually true | Source |
	|---|---|---|---|
	| N1 | "Tilelli Lite beats vanilla GPT 3 / 3 seeds at param-fair." | Single seed, mixed signal. Lite seed 1234 loses by 0.6 % on within-training eval (0.5742 vs 0.5707) and wins by 0.4 % on a different post-training single-batch eval (0.5685 vs 0.5707). Seeds 5678 / 9012 have no preserved `log.jsonl`. The previously-cited "6.7σ" headline is retracted. | `results/claim_01_benchmark.md`, README §"What doesn't work" N2 |
	| N2 | "Router-entropy is an architecture-native metacognition signal." | Disproven. Router-entropy family wins 0 / 7 OOD regimes at Δ AUROC ≥ 0.02 with non-crossing 95 % bootstrap CI. Cross-regime AUROC is ≈ 0.55 (chance). The working uncertainty signal is `max_softmax_mean`, which is not architecture-native; it works on any LM. | `results/claim_02_metacog.md`, README §"What doesn't work" N1 |
	| N3 | "Train an abstain head once, splice it onto any base model." | Disproven. v7's joint-trained abstain head gets AUROC 0.76 cross-regime; spliced onto v4's base it drops to 0.54 with 27 % false-positive rate. Not modular. | `results/claim_02_metacog.md` §"The splice sub-claim", README N3 |
	| N4 | "Just turn off the metacog loss (MC=0) and the router will be left alone." | Disproven. Even with MC=0, CE loss on the in-domain subset backprops through unfrozen router-Linears. 16 K updates shift the routing distribution and break OOD generation. v8b gets the strongest abstain signal in the project (AUROC 0.85) but generation collapses. | `results/claim_02_metacog.md` §"The joint-finetune sub-claim", README N4 |
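
The N2 criterion (Δ AUROC ≥ 0.02 with non-crossing 95 % bootstrap CIs) can be made concrete with a small pure-Python sketch. It assumes "non-crossing" means the candidate's lower bound sits above the baseline's upper bound and uses a percentile bootstrap; the kit's own probe script may define either differently, and all function names here are hypothetical.

```python
import random


def auroc(pos, neg):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(pos, neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC, resampling each class with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        p = [rng.choice(pos) for _ in pos]
        n = [rng.choice(neg) for _ in neg]
        stats.append(auroc(p, n))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def wins_regime(candidate, baseline, min_delta=0.02):
    """candidate/baseline are (pos_scores, neg_scores) pairs for one regime."""
    delta = auroc(*candidate) - auroc(*baseline)
    cand_lo, _ = bootstrap_ci(*candidate)
    _, base_hi = bootstrap_ci(*baseline)
    return delta >= min_delta and cand_lo > base_hi
```

A signal has to clear both conditions in a regime to count as a win, which is why an AUROC near 0.55 with overlapping intervals is reported as chance rather than as a weak positive.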

	Plain-English summary for users who ask "did it beat vanilla GPT?":
	No. The kit ships a preliminary single-seed directional finding that
	does not survive a fair comparison. A defensible answer requires
	re-running with matched `eval_every`, identical `val_stream` RNG, and
	multi-seed Welch tests — estimated ~$2.60 of A40 time, queued, not run.

	## 5. The two checkpoints (do not confuse them)

	| File | Precision | Architecture | What it does | Don't ask it to |
	|---|---|---|---|---|
	| `tilelli_chat_v4.pt` (39 MB) | FP32 | Lite 3-pathway, d=256, L=8 | Short chat replies, abstains on out-of-knowledge prompts. This is the deployed `chat.tilelli.tech` model. | Continue a story; answer factual questions outside its tiny SFT scope. It will refuse, correctly. |
	| `tilelli_pretrain_v1_ternary.pt` (39 MB) | Ternary {−1, 0, +1} with STE | Parent multi-pathway (5-pathway), d=512, L=7 | TinyStories-style next-token continuation. Demonstrates the ternary recipe converges. | Chat. It was never SFT'd on chat data; it will produce story-shaped text regardless of prompt. |

	They are different architectures of the same family, not the same model
	in two precisions. A ternary version of the Lite chat model is queued,
	not bundled.
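
For readers unfamiliar with the "ternary with STE" label, the idea can be sketched in a few lines of plain Python. The threshold rule below (a fraction of the mean absolute weight) is an assumption chosen for illustration, and both function names are hypothetical; the kit's actual ternary recipe is not described in this file and may differ.

```python
def ternarize(weights, scale=0.7):
    """Map each weight to -1, 0, or +1 using a magnitude threshold.

    Assumption: threshold = scale * mean(|w|); the kit's real rule may differ.
    """
    if not weights:
        return []
    threshold = scale * sum(abs(w) for w in weights) / len(weights)
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]


def ste_grad(upstream_grad):
    """Straight-through estimator: the backward pass treats quantization as identity."""
    return list(upstream_grad)
```

The forward pass sees only the three discrete values, while gradients flow to the underlying full-precision weights unchanged, which is what lets training proceed despite the non-differentiable rounding.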

	## 6. Common agent mistakes (seen in the wild)

	1. Cloning the wrong URL. The repo lives at
	`https://github.com/TilelliLab/Tilelli-llm` (note the `-llm` suffix).
	If you are working from a downloaded zip, you already have the code;
	skip the clone step in README/INSTALL.md and run `pip install -e .`
	from the kit directory directly.
	2. Skipping the CPU-torch index URL. `pip install torch` on Linux
	pulls the 2 GB CUDA wheel. Use the `--index-url` line in §2.
	3. Comparing chat output to GPT-4-class models. This is a 10 M-param
	byte-LM trained for ~12K steps on FineWeb-Edu. It is the size of
	nanoGPT. It will say things like "i am small but try to be honest"
	and refuse most factual questions on purpose. That is correct behavior,
	not a bug.
	4. Concluding the metacog work failed. It produced a useful negative
	result (the router-entropy hypothesis is empirically wrong at this
	scale, and the abstain head is not liftable). Negative results are
	first-class deliverables here; see `PAPER_OUTLINE.md`.
	5. Asking the ternary pretrain checkpoint a question. It was not
	SFT'd. Use `tilelli_chat_v4.pt` for chat (the default in `chat.py`
	and `infer.py`) and the ternary one for story continuation only.
	6. Editing `src/tilelli/core/` to "fix" the architecture. The bundled
	v4 ckpt is tied to this exact code. Architecture edits will break
	checkpoint loading. The reproducers will then exit non-zero and the
	numbers in §3 will no longer be defensible.

	## 7. What is NOT in this repo (so don't look for it)

	- The FineWeb-Edu pretraining pipeline (the 12K-step training that
	produced v4). Private. The bundled trainer reproduces the recipe
	on TinyStories, not the v4 ckpt.
	- The chat SFT data that produced v4. Private.
	- The failed metacog ckpts (v5 / v6 / v7 / v8a / v8b / splice).
	Available on request via `hello@tilelli.tech` for negative-result
	replication.
	- The Spectrum (power-of-3 7-level quantization) line — lives in the
	source repo's `mosaic/spinoffs/spectrum/`, not here.
	- A GPU training requirement. Don't add one.

	## 8. House rules for code edits

	- Don't pin torch to an exact version in `pyproject.toml`. It is intentionally
	a loose range (`torch>=2.1,<3`) so users can pick CPU / CUDA / MPS at
	install time. The comment in `pyproject.toml` says so.
	- Don't change `weights_only=False` in the checkpoint loader. The
	bundled ckpts are author-trusted; the loader (`src/tilelli/utils/checkpoint.py`)
	is a single audited surface. For untrusted third-party ckpts, verify
	the SHA from the README first.
	- Don't add new top-level dependencies casually. The kit is
	intentionally `torch + numpy`. Anything else, justify in the PR.
	- Don't add CI that auto-uploads anything anywhere. This repo ships
	binary weights; the security model assumes no automatic outbound
	network from the build.
	- **If you remove a claim from the README, also remove the
	corresponding `reproduce/*.py` script.** README claims are 1:1 with
	scripts by design.
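
The checksum step in the `weights_only=False` rule can be done with the standard library before any checkpoint is loaded. This sketch assumes the README publishes SHA-256 hex digests (this file only says "SHA"), and the helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare against a published digest, ignoring case and stray whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Only call the loader once `matches_published(...)` returns `True`; a mismatch means the file is not the one the README describes, and unpickling it with `weights_only=False` would execute whatever it contains.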

	## 9. Quick smoke sequence for an agent verifying a fresh clone

	Run these in order. Total wall time on a modern laptop CPU: ~5 minutes.

	```bash
	pip install --index-url https://download.pytorch.org/whl/cpu torch
	pip install -e ".[test]"
	pytest -q tests/ # expect: 3 passed
	python reproduce/01_benchmark.py # expect: PASS, 10.12M params
	python chat.py "Hello, who are you?" # expect: short honest reply
	python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
	--prompt "Once upon a time, there was a little"
	# expect: TinyStories-shaped continuation
	```

	If all of the above pass, the install is good. The longer reproducers
	(`02`, `03`, `04`) verify the headline numbers and are worth running
	before you cite any of them.

	## 10. When in doubt

	- The README is the contract for users.
	- This file is the contract for agents.
	- Every numerical claim is bound to a script. If the script's exit code
	disagrees with what a human (or another agent) just told you, trust
	the script.
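
"Trust the script" translates directly into checking a process exit code. This is a minimal sketch using only the standard library; `script_passes` is a hypothetical helper, not something shipped in the kit.

```python
import subprocess
import sys


def script_passes(argv):
    """Run a command and report True only when it exits with code 0."""
    result = subprocess.run(argv, check=False)
    return result.returncode == 0
```

From the kit root, `script_passes([sys.executable, "reproduce/01_benchmark.py"])` returning `False` means claim C1 should not be repeated until the failure is understood.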