ChatTLA-20b (v20)
ChatTLA is a fine-tuned version of openai/gpt-oss-20b specialised in generating TLA+ formal specifications — the language used by AWS, Microsoft, and Intel to mathematically verify distributed systems.
Given a plain-English description of a concurrent or distributed system, ChatTLA outputs a complete, syntactically valid TLA+ module including Init, Next, Spec, TypeOK, and domain invariants, together with a TLC model-checker configuration block.
What's new in v20
v20 is the autonomous self-repair flywheel model: the result of five repair-GRPO cycles (c1 → c5) building on v15's Repair-GRPO weights. Each cycle harvests fresh failures from the production model, mines them into broken/repaired training pairs, and runs Repair-GRPO with TLC-graded reward. Promotion to production happens only when the 30-spec holdout score improves; cycles c6, c7, and c8 trained on top of v20/c5 but failed to beat it (11, 9, 9 vs 12) and were not promoted.
Benchmark Results (v20, 3-shot self-correct)
Evaluated on the same 30-spec held-out suite as v14/v15, spanning communication protocols, concurrency primitives, consensus, data structures, memory/caches, mutual exclusion, classical puzzles, scheduling, transactions, and workflow state machines. Each spec gets up to 3 self-correction attempts using TLC error feedback.
| Tier | Meaning |
|---|---|
| 🥇 Gold | SANY parses and TLC model-checks clean |
| 🥈 Silver | SANY parses, TLC finds violation or timeout |
| 🥉 Bronze | SANY parse failure |
Diamond tier (mutation-test caught + non-trivial invariant) was not assessed in this round; v20's evaluation reports Gold-rate only.
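The tier grading and 3-shot self-correct protocol above can be sketched as follows. This is an illustrative sketch only: `generate_spec`, `run_sany`, and `run_tlc` are hypothetical stand-ins for the model call and the real SANY/TLC tool invocations, not part of the released tooling.

```python
def grade(sany_ok: bool, tlc_clean: bool) -> str:
    """Map validator outcomes to the benchmark tiers."""
    if not sany_ok:
        return "bronze"   # SANY parse failure
    if not tlc_clean:
        return "silver"   # parses, but TLC finds a violation or times out
    return "gold"         # parses and model-checks clean

def self_correct(prompt, generate_spec, run_sany, run_tlc, attempts=3):
    """Give the model up to `attempts` tries, feeding tool errors back."""
    feedback = ""
    best = "bronze"
    for _ in range(attempts):
        spec = generate_spec(prompt, feedback)
        sany_ok, sany_err = run_sany(spec)
        tlc_clean, tlc_err = run_tlc(spec) if sany_ok else (False, "")
        tier = grade(sany_ok, tlc_clean)
        if tier == "gold":
            return "gold"
        best = max(best, tier, key=["bronze", "silver", "gold"].index)
        feedback = sany_err or tlc_err   # error text becomes the retry hint
    return best
```

A spec that only reaches Gold on its second or third attempt still counts toward the Gold column in the tables below.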
Per-domain breakdown (30-spec holdout, 3-shot)
| Domain | Gold |
|---|---|
| communication_protocols | 2/3 |
| concurrency_primitives | 2/3 |
| consensus_election | 2/3 |
| data_structures | 0/3 |
| memory_caches | 0/3 |
| mutual_exclusion | 2/3 |
| puzzles_classical | 1/3 |
| scheduling_resources | 1/3 |
| transactions_databases | 1/3 |
| workflows_state_machines | 1/3 |
| Total | 12/30 (40%) |
Domains where v20 reaches 2/3: communication, concurrency, consensus, mutual exclusion. Domains where v20 still fails completely: data structures, memory/caches.
Version history
| Version | Suite | SANY | TLC (Gold) | Notes |
|---|---|---|---|---|
| v6 | 20-problem handcraft | 4/20 (20%) | 1/20 (5%) | — |
| v7 | 20-problem handcraft | 6/20 (30%) | 1/20 (5%) | — |
| v8 | 20-problem handcraft | 8/20 (40%) | 1/20 (5%) | — |
| v9 | 20-problem handcraft | 6/20 (30%) | 3/20 (15%) | — |
| v9 best-of-5 + self-correct | 20-problem handcraft | 16/20 (80%) | 5/20 (25%) | — |
| v10 | 20-problem handcraft | 6/20 (30%) | 2/20 (10%) | — |
| v11 | 20-problem handcraft | 6/20 (30%) | 2/20 (10%) | — |
| v13 (SFT + DPO) | 20-problem handcraft | 9/20 (45%) | 5/20 (25%) | trivial invariants counted as Gold |
| v14 (Diamond SFT) | 30-spec holdout (single-shot) | 16/30 (53%) | 5/30 (17%) | Diamond 4/30 (13%) |
| v15 (Repair GRPO) | 30-spec holdout (3-shot) | 9/30 (30%) | 9/30 (30%) | Diamond 9/30 (30%) |
| v20 (Flywheel c5) | 30-spec holdout (3-shot) | — | 12/30 (40%) | first promoted-by-holdout-gain release |
Compared to v15, v20 adds three specs to the Gold pool (+33% relative): the gains concentrate in communication (+2) and mutual exclusion (+2), with transactions and workflows holding ground. Data structures and memory/caches remain unsolved across both versions and are the obvious next target.
Quick Start
Ollama (recommended)
```shell
ollama run hf.co/EricSpencer00/chattla-20b
```
Or use the bundled Modelfile:
```shell
curl -L https://huggingface.co/EricSpencer00/chattla-20b/resolve/main/gguf/Modelfile -o Modelfile
ollama create chattla:20b -f Modelfile
ollama run chattla:20b "Write a TLA+ spec for a token ring with N nodes."
```
Python (transformers)
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="EricSpencer00/chattla-20b",
    device_map="auto",
)
prompt = (
    "Write a complete TLA+ specification for a two-phase commit protocol "
    "with one coordinator and N participants."
)
result = pipe([{"role": "user", "content": prompt}], max_new_tokens=1024, return_full_text=False)
print(result[0]["generated_text"])
```
llama.cpp / GGUF
```shell
huggingface-cli download EricSpencer00/chattla-20b \
  gguf/chattla-20b-v20-Q8_0.gguf \
  --local-dir ./chattla

./llama-cli -m chattla/gguf/chattla-20b-v20-Q8_0.gguf \
  -n 1024 --temp 0.4 \
  -p "Write a TLA+ spec for mutual exclusion with N processes."
```
Model Details
| Property | Value |
|---|---|
| Base model | openai/gpt-oss-20b |
| Parameters | 20.9 B |
| Architecture | GptOss (sliding + full attention) |
| Fine-tuning method | Diamond SFT (LoRA) → Repair GRPO (LoRA) → Self-Repair Flywheel (5× LoRA) → merged |
| Context length | 2048 (trained) / 131072 (base) |
| GGUF quantisation | Q8_0 (~22 GB) |
| Training date | April – May 2026 |
System prompt
```
You are ChatTLA, an expert at writing verified TLA+ formal specifications.
When asked to write a TLA+ spec, follow these rules exactly:
1. Start the module with ---- MODULE <ModuleName> ----
2. End with ====
3. Include EXTENDS, VARIABLES, Init, Next, and Spec operators
4. After the TLA+ module, append a TLC configuration block:
   SPECIFICATION Spec
   INVARIANT TypeOK  (if TypeOK is defined)
5. Output only valid TLA+ code. No markdown fences, no explanation outside the spec.
```
Training
Phase 1 — Diamond SFT (v14)
v14 was produced by the Diamond curation pipeline: candidate TLA+ specs are generated by an earlier checkpoint, then graded by a tlc_validator that checks SANY parsing, TLC state-space exploration, non-trivial invariants, and mutation-test sensitivity. Specs that survive grading are LLM-judged for chain-of-thought quality, leaving a curated training pool (209 raw → 73 curated). The model is fine-tuned with LoRA on this pool and merged.
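The curation gate can be sketched as a two-stage filter: every Diamond validator stage must pass, then the LLM judge decides admission to the SFT pool. Here `tlc_validator` and `judge_cot` are hypothetical stand-ins for the real pipeline components.

```python
# Diamond stages described above, in validator order.
STAGES = ("sany_parses", "tlc_explores", "nontrivial_invariant", "survives_mutation")

def curate(candidates, tlc_validator, judge_cot):
    """Keep only specs that pass every Diamond stage and the CoT judge."""
    pool = []
    for spec in candidates:
        report = tlc_validator(spec)              # dict: stage name -> bool
        if not all(report.get(s, False) for s in STAGES):
            continue                              # failed a Diamond stage
        if judge_cot(spec):                       # chain-of-thought quality check
            pool.append(spec)
    return pool
```

With this shape, the 209 → 73 reduction reported above corresponds to roughly two thirds of candidates failing one of the gates.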
Phase 2 — Repair GRPO (v15)
v15 applies repair-based GRPO (Group Relative Policy Optimization) on top of v14: instead of training on gold-standard specs alone, the model learns to fix broken specs using TLC error feedback as reward signal.
- Trajectory collection — the v14 model generates specs for 398 problems with up to 6 repair iterations each, producing (broken, repaired) pairs scored by a multi-stage validator (SANY → TLC → Apalache → TLAPS).
- Dataset filtering — pairs filtered to keep the "learnable middle" (`min_before_score=0.10`, `max_before_score=0.80`), yielding ~430 gradable pairs centered on score ≈ 0.45.
- GRPO training — 300 steps, 4 generations per prompt, max 384 completion tokens. Reward is the score-improvement delta `after − before`, normalized by group; lr=3e-6, KL β=0.02, temp=0.5.
- LoRA merge — best checkpoint (around step 140–160) merged back into full weights.
Phase 3 — Self-Repair Flywheel (v20)
v20 wraps Phase 2 in an autonomous outer loop that keeps running on the production GPU pool. Each cycle:
- Failure harvest. Sample 400 random NL prompts, call the current production model, classify the outputs (Gold / Silver / Bronze).
- Pair construction. Bootstrap (broken → repaired) pairs from the bronze and silver outputs; the repaired side comes from the same model under a stricter retry budget.
- Repair-GRPO step. 160 steps on the harvested pairs, LoRA r=8 / α=16, lr=3e-6, KL β=0.02, on the current best merged base.
- Merge → GGUF → Ollama as `chattla:20b-c{N}`.
- Holdout eval. 30-spec 3-shot benchmark against the same held-out suite.
- Promote-on-improvement. If `score_cN > best_score`, update the production tag `chattla:20b-repair`. Otherwise keep the prior weights; the failed candidate stays as `chattla:20b-c{N}` for analysis.
v20 is cycle 5 of this flywheel: c1 → 5/30, c2 → 10/30, c3 → 8/30, c4 → 5/30, c5 → 12/30 (promoted). Cycles c6 (11/30), c7 (9/30), c8 (9/30) did not promote.
Training configuration (v20 incremental cycle)
| Setting | Value |
|---|---|
| Method | Repair GRPO with LoRA |
| LoRA rank / α / dropout | 8 / 16 / 0.0 |
| GRPO steps | 160 per cycle |
| GRPO generations / prompt | 4 |
| GRPO max completion length | 384 tokens |
| Learning rate | 3e-6 |
| KL β | 0.02 |
| Temperature | 0.5 |
| Failures harvested / cycle | 400 (filtered to ~150–250 gradable pairs) |
| Hardware | 2× Quadro RTX 8000 (48 GB each) |
DPO/KTO refinement was used in v11–v13 but was deprecated in the Diamond overhaul: 0/484 specs from those preference-trained checkpoints actually passed Diamond, indicating the model had learned TLA+ syntax without learning semantics.
Files
```
EricSpencer00/chattla-20b
├── README.md
└── gguf/
    ├── chattla-20b-v20-Q8_0.gguf   # Quantised GGUF for Ollama / llama.cpp (~22 GB)
    └── Modelfile                   # Ollama Modelfile
```
Intended Use
ChatTLA is designed for:
- Rapid prototyping of TLA+ specifications from natural-language system descriptions
- Educational exploration of formal methods
- Assisting engineers who are learning TLA+
Not intended for: safety-critical or production verification without human review. Always validate generated specs with SANY and TLC before relying on them.
Citation
```bibtex
@misc{chattla2026,
  title  = {ChatTLA: Fine-tuned LLM for TLA+ Formal Specification Generation},
  author = {Spencer, Eric},
  year   = {2026},
  url    = {https://huggingface.co/EricSpencer00/chattla-20b},
}
```