---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- chess
- causal-lm
- uci
- decoder-only
- llama-style
datasets:
- malcouffe/lichess-standard-rated-2025-07-uci
- malcouffe/lichess-standard-rated-2025-08-uci
- malcouffe/lichess-standard-rated-2025-09-uci
- malcouffe/lichess-standard-rated-2025-10-uci
- malcouffe/lichess-standard-rated-2025-11-uci
- malcouffe/lichess-standard-rated-2025-12-uci
- malcouffe/lichess-standard-rated-2026-01-uci
pipeline_tag: text-generation
model-index:
- name: ChessGPT
  results: []
---

# ChessGPT (432M)

A decoder-only transformer trained to predict the next move in chess games using UCI notation. The model learns purely from move sequences (no board state, no engine evaluations) via next-token prediction on Lichess games.

## Model details

| | |
|---|---|
| **Architecture** | LLaMA-style decoder-only transformer |
| **Parameters** | 432M |
| **Context length** | 256 tokens |
| **Vocab size** | 4 211 (UCI moves + 3 special tokens) |
| **Training tokens** | 7.87B |
| **License** | Apache 2.0 |

### Architecture

- **d_model** 1 280, **n_layers** 21, **n_heads** 20 (head_dim 64), **d_ff** 3 584
- RMSNorm (pre-norm), rotary position embeddings (RoPE), SwiGLU FFN
- QK-Norm before RoPE (as in Gemma / DeepSeek-V2)
- No biases in linear layers; weight tying between the embedding and output head
- Scaled residual initialization: `std / sqrt(2 * n_layers)`

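As a sanity check, the 432M figure follows directly from these dimensions. A minimal sketch, assuming tied embeddings, bias-free linear layers, a three-matrix SwiGLU FFN, and one final RMSNorm (the few extra QK-Norm parameters are omitted):

```python
# Rough parameter count from the stated architecture (tied embeddings,
# no biases, SwiGLU FFN with gate/up/down projections).
d_model, n_layers, d_ff, vocab = 1280, 21, 3584, 4211

embedding = vocab * d_model        # shared with the output head (weight tying)
attention = 4 * d_model * d_model  # Wq, Wk, Wv, Wo
ffn       = 3 * d_model * d_ff     # gate, up, down (SwiGLU)
norms     = 2 * d_model            # two RMSNorm weights per layer
per_layer = attention + ffn + norms

total = embedding + n_layers * per_layer + d_model  # + final RMSNorm
print(f"{total / 1e6:.1f}M")  # → 432.1M
```
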
## Training

### Data

Seven monthly snapshots of Lichess standard rated games (July 2025 to January 2026), filtered to games where **both players are rated at least 1 800 Elo**. Games are converted to space-separated UCI move strings.

Datasets are streamed and interleaved from the Hugging Face Hub. **Sequence packing** concatenates games into fixed 256-token sequences to eliminate padding.

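The packing step can be sketched as follows. This is a minimal illustration, not the actual training code; the `eos_id` boundary token and the generator shape are assumptions:

```python
def pack_sequences(tokenized_games, seq_len=256, eos_id=2):
    """Concatenate games into fixed-length blocks so no padding is needed."""
    buf = []
    for game in tokenized_games:   # each game: a list of move-token ids
        buf.extend(game)
        buf.append(eos_id)         # mark the game boundary
        while len(buf) >= seq_len:
            yield buf[:seq_len]    # emit a full block
            buf = buf[seq_len:]    # carry the remainder into the next block

# 200 three-move games -> 800 tokens -> 100 packed blocks of length 8
blocks = list(pack_sequences([[10, 11, 12]] * 200, seq_len=8))
```

Because games are concatenated rather than padded, every position in every block contributes a training signal.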
### Hyperparameters

| | |
|---|---|
| Optimizer | AdamW (betas 0.9 / 0.95, weight decay 0.1) |
| Learning rate | 3e-4, cosine decay to 10 % of peak |
| Warmup | 9 300 steps (linear) |
| Batch size | 256 × 256 tokens = 65 536 tokens/step |
| Gradient clipping | 1.0 |
| Precision | BF16 |
| Steps | 120 155 |

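The resulting schedule, linear warmup followed by cosine decay to 10 % of the peak, can be written out explicitly. A sketch reconstructed from the hyperparameters above, not the training script itself:

```python
import math

PEAK, WARMUP, TOTAL, FLOOR_FRAC = 3e-4, 9_300, 120_155, 0.10

def lr_at(step):
    """Linear warmup to PEAK, then cosine decay down to FLOOR_FRAC * PEAK."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)   # 0 -> 1 after warmup
    floor = FLOOR_FRAC * PEAK
    return floor + 0.5 * (PEAK - floor) * (1 + math.cos(math.pi * progress))
```
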
## Tokenizer

A custom **UCI tokenizer** maps every legal UCI move string to a unique integer id:

| Range | Description | Count |
|---|---|---|
| 0 | `<PAD>` | 1 |
| 1 | `<BOS>` | 1 |
| 2 | `<EOS>` | 1 |
| 3 – 4 034 | Normal moves (src ≠ dst) | 4 032 |
| 4 035 – 4 210 | Promotion moves (file × direction × piece × color) | 176 |
| **Total** | | **4 211** |

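The counts can be reproduced by enumerating the move space: 64 × 63 ordered square pairs for normal moves, plus promotions onto each back rank. This is a hypothetical reconstruction; the id ordering inside each range may differ from the released tokenizer:

```python
FILES = "abcdefgh"
SQUARES = [f + r for r in "12345678" for f in FILES]   # a1 .. h8

# Normal moves: every ordered pair of distinct squares -> 64 * 63 = 4 032
normal = [src + dst for src in SQUARES for dst in SQUARES if src != dst]

# Promotions: pawn push or capture onto the back rank, per piece and color
promotions = []
for piece in "qrbn":                                   # 4 promotion pieces
    for src_rank, dst_rank in (("7", "8"), ("2", "1")):  # white, black
        for i, f in enumerate(FILES):
            for j in (i - 1, i, i + 1):                # capture left/push/right
                if 0 <= j < 8:
                    promotions.append(f + src_rank + FILES[j] + dst_rank + piece)

vocab = ["<PAD>", "<BOS>", "<EOS>"] + normal + promotions
print(len(normal), len(promotions), len(vocab))        # 4032 176 4211
```
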
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)
model.eval()

# Encode an opening (Italian Game)
moves = "e2e4 e7e5 g1f3 b8c6 f1c4"
input_ids = tokenizer.encode(moves, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits

# Print the top-5 predicted next moves with their logit scores
top5 = logits[0, -1].topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode([idx.item()]):>8s} {score.item():.2f}")
```

## Limitations

- The model has no access to board state: all chess knowledge is inferred from move sequences.
- No RLHF or self-play refinement; this is a pure next-token-prediction model.
- Predictions can include illegal moves; use `python-chess` to filter them at inference time (see the [chessgpt-inference](https://github.com/malcouffe/chessgpt-inference) repo for legal-move masking during generation).

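A minimal sketch of the filtering idea, kept library-agnostic: in practice the legal move list would come from `python-chess` (`board.legal_moves`, `move.uci()`), and the toy vocabulary below is illustrative, not the model's actual id mapping:

```python
def mask_illegal(logits, legal_uci, token_to_id):
    """Keep logits only at ids of legal UCI moves; set everything else to -inf."""
    masked = [float("-inf")] * len(logits)
    for uci in legal_uci:
        idx = token_to_id.get(uci)
        if idx is not None and 0 <= idx < len(logits):
            masked[idx] = logits[idx]
    return masked

# Toy example with a 6-entry vocabulary and two legal moves
vocab = {"<PAD>": 0, "<BOS>": 1, "<EOS>": 2, "e2e4": 3, "d2d4": 4, "e7e5": 5}
masked = mask_illegal([0.1] * 6, ["e2e4", "d2d4"], vocab)
```

Sampling (or argmax) over `masked` can then never produce an illegal move.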
## Citation

```bibtex
@misc{chessgpt2026,
  author = {Matthieu Alcouffe},
  title  = {ChessGPT: A 432M Decoder-Only Transformer for UCI Move Prediction},
  year   = {2026},
  url    = {https://huggingface.co/malcouffe/chessgpt}
}
```