---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- chess
- causal-lm
- uci
- decoder-only
- llama-style
datasets:
- malcouffe/lichess-standard-rated-2025-07-uci
- malcouffe/lichess-standard-rated-2025-08-uci
- malcouffe/lichess-standard-rated-2025-09-uci
- malcouffe/lichess-standard-rated-2025-10-uci
- malcouffe/lichess-standard-rated-2025-11-uci
- malcouffe/lichess-standard-rated-2025-12-uci
- malcouffe/lichess-standard-rated-2026-01-uci
pipeline_tag: text-generation
model-index:
- name: ChessGPT
  results: []
---

# ChessGPT — 432M
A decoder-only transformer trained to predict the next move in chess games using UCI notation. The model learns purely from move sequences (no board state, no evaluation) via next-token prediction on Lichess games.
## Model details

| | |
|---|---|
| Architecture | LLaMA-style decoder-only transformer |
| Parameters | 432M |
| Context length | 256 tokens |
| Vocab size | 4 211 (UCI moves + 3 special tokens) |
| Training tokens | 7.87B |
| License | Apache 2.0 |
## Architecture

- d_model 1 280, n_layers 21, n_heads 20 (head_dim 64), d_ff 3 584
- RMSNorm (pre-norm), Rotary Position Embeddings (RoPE), SwiGLU FFN
- QK-Norm before RoPE (Gemma / DeepSeek-V2 practice)
- No bias in linear layers, weight tying between embedding and output head
- Scaled residual initialization: `std / sqrt(2 * n_layers)`
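The scaled residual initialization is a small calculation; a minimal sketch for this model's depth, assuming the common 0.02 base standard deviation (the base value is not stated on this card):

```python
import math

n_layers = 21          # depth of this model
base_std = 0.02        # assumed base init std (GPT-2 convention; not stated here)

# Residual-output projections (attention out-proj, FFN down-proj) are
# initialized with a smaller std so the residual stream's variance does
# not grow with depth: each layer adds two residual branches.
residual_std = base_std / math.sqrt(2 * n_layers)
print(residual_std)
```

With 21 layers the divisor is sqrt(42), so the residual projections start roughly 6.5× smaller than the other weights.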
## Training

### Data
7 monthly snapshots of Lichess standard rated games (July 2025 — January 2026), filtered to games where both players are rated at least 1 800 Elo. Games are converted to space-separated UCI move strings.
Datasets are streamed and interleaved from the Hugging Face Hub. Sequence packing concatenates games into fixed 256-token sequences, eliminating padding.
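The packing step can be sketched as follows (an illustrative reconstruction, not the training code; the token ids and the handling of the trailing partial sequence are assumptions):

```python
BOS, EOS = 1, 2  # special-token ids from the tokenizer table below

def pack_games(games, seq_len=256):
    """Concatenate tokenized games into fixed-length sequences with no padding.

    games: iterable of lists of move-token ids (one list per game).
    Returns a list of seq_len-sized id lists; game boundaries are marked
    by BOS/EOS and sequences may start mid-game.
    """
    buffer, packed = [], []
    for game in games:
        buffer.extend([BOS, *game, EOS])
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed  # any trailing partial sequence is dropped in this sketch

# Toy demonstration with fake token ids and a tiny seq_len:
seqs = pack_games([[10, 11, 12], [20, 21]], seq_len=4)
```

Because sequences are sliced from a continuous stream, every position in a batch carries a real token, which is what makes the 65 536 tokens/step figure exact.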
### Hyperparameters

| | |
|---|---|
| Optimizer | AdamW (betas 0.9 / 0.95, weight decay 0.1) |
| Learning rate | 3e-4 with cosine decay to 10 % of peak |
| Warmup | 9 300 steps (linear) |
| Batch size | 256 × 256 tokens = 65 536 tokens/step |
| Gradient clipping | 1.0 |
| Precision | BF16 |
| Steps | 120 155 |
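The learning-rate schedule in the table can be sketched as below; the exact interaction between warmup and decay (cosine starting after warmup, over the remaining steps) is an assumption:

```python
import math

PEAK_LR = 3e-4
MIN_LR = PEAK_LR * 0.10     # cosine decays to 10% of peak
WARMUP = 9_300
TOTAL = 120_155

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)   # 0 → 1 after warmup
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine
```

At step 0 the rate is 0, at step 9 300 it reaches the 3e-4 peak, and at step 120 155 it has decayed to 3e-5.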
## Tokenizer
Custom UCI tokenizer that maps every legal UCI move string to a unique integer:
| Range | Description | Count |
|---|---|---|
| 0 | `<PAD>` | 1 |
| 1 | `<BOS>` | 1 |
| 2 | `<EOS>` | 1 |
| 3 — 4 034 | Normal moves (src ≠ dst) | 4 032 |
| 4 035 — 4 210 | Promotion moves (file × direction × piece × color) | 176 |
| **Total** | | **4 211** |
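The counts in the table follow from simple enumeration; a sketch of the arithmetic (the exact id ordering is the tokenizer's, only the counts are checked here):

```python
special = 3                # <PAD>, <BOS>, <EOS>
normal = 64 * 63           # every (src, dst) square pair with src != dst

# Promotions: per color, a pawn on each of the 8 files can push straight,
# plus 7 possible capture-left and 7 capture-right moves (edge files lack
# one capture direction), for 22 (file, direction) pairs; each promotes
# to one of 4 pieces (q, r, b, n).
promotions = (8 + 7 + 7) * 4 * 2   # × 2 colors

vocab_size = special + normal + promotions
print(normal, promotions, vocab_size)
```

This reproduces the 4 032 normal moves, 176 promotion moves, and 4 211-entry vocabulary above.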
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)

# Encode an opening (Italian Game)
moves = "e2e4 e7e5 g1f3 b8c6 f1c4"
input_ids = tokenizer.encode(moves, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits

# Get top-5 predicted next moves
top5 = logits[0, -1].topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode([idx.item()]):>8s} {score:.2f}")
```
## Limitations

- The model has no access to board state: all chess knowledge is inferred from move sequences alone.
- No RLHF or self-play refinement — this is a pure next-token prediction model.
- Predictions can include illegal moves; use `python-chess` to filter them at inference time (see the chessgpt-inference repo for legal-move masking during generation).
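The masking idea can be sketched without the model: given the logits for the next move and the set of legal-move token ids (in practice derived from `python-chess`'s `board.legal_moves` mapped through the tokenizer, an assumption here), set every illegal move to -inf before picking or sampling:

```python
import math

def mask_illegal(logits, legal_ids):
    """Return logits with every token id outside legal_ids set to -inf.

    logits: list of floats indexed by move-token id.
    legal_ids: set of ints for moves legal in the current position.
    """
    return [x if i in legal_ids else -math.inf for i, x in enumerate(logits)]

# Toy example: 5-token vocab where only ids 1 and 3 are legal moves.
# The raw argmax (id 2) is illegal; masking forces a legal choice.
masked = mask_illegal([0.2, 1.5, 3.0, 0.7, 2.2], {1, 3})
best = max(range(len(masked)), key=lambda i: masked[i])
```

The same masking applies at every generation step, since legality depends on the position reached so far.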
## Citation

```bibtex
@misc{chessgpt2026,
  author = {Matthieu Alcouffe},
  title  = {ChessGPT: A 432M Decoder-Only Transformer for UCI Move Prediction},
  year   = {2026},
  url    = {https://huggingface.co/malcouffe/chessgpt}
}
```