---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - chess
  - causal-lm
  - uci
  - decoder-only
  - llama-style
datasets:
  - malcouffe/lichess-standard-rated-2025-07-uci
  - malcouffe/lichess-standard-rated-2025-08-uci
  - malcouffe/lichess-standard-rated-2025-09-uci
  - malcouffe/lichess-standard-rated-2025-10-uci
  - malcouffe/lichess-standard-rated-2025-11-uci
  - malcouffe/lichess-standard-rated-2025-12-uci
  - malcouffe/lichess-standard-rated-2026-01-uci
pipeline_tag: text-generation
model-index:
  - name: ChessGPT
    results: []
---

# ChessGPT — 432M

A decoder-only transformer trained to predict the next move in chess games using UCI notation. The model learns purely from move sequences (no board state, no evaluation) via next-token prediction on Lichess games.

## Model details

| | |
|---|---|
| **Architecture** | LLaMA-style decoder-only transformer |
| **Parameters** | 432M |
| **Context length** | 256 tokens |
| **Vocab size** | 4 211 (UCI moves + 3 special tokens) |
| **Training tokens** | 7.87B |
| **License** | Apache 2.0 |

### Architecture

- **d_model** 1 280, **n_layers** 21, **n_heads** 20 (head_dim 64), **d_ff** 3 584
- RMSNorm (pre-norm), Rotary Position Embeddings (RoPE), SwiGLU FFN
- QK-Norm before RoPE (Gemma / DeepSeek-V2 practice)
- No bias in linear layers, weight tying between embedding and output head
- Scaled residual initialization: `std / sqrt(2 * n_layers)`
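
As a sanity check, the figures above pin down the parameter count. A minimal back-of-envelope sketch (assuming tied embeddings, no biases, a three-matrix SwiGLU FFN, and ignoring the small norm vectors, which add well under 0.1M):

```python
# Back-of-envelope parameter count from the architecture above.
# Assumes: tied input/output embeddings, no biases, SwiGLU with three
# projection matrices (gate, up, down); RMSNorm/QK-Norm vectors ignored.
d_model, n_layers, d_ff, vocab = 1280, 21, 3584, 4211

attn = 4 * d_model * d_model           # Q, K, V, O projections
ffn = 3 * d_model * d_ff               # gate, up, down
per_layer = attn + ffn
embed = vocab * d_model                # shared with output head (weight tying)

total = n_layers * per_layer + embed
print(f"{total / 1e6:.1f}M parameters")  # 432.0M parameters
```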

## Training

### Data

7 monthly snapshots of Lichess standard rated games (July 2025 to January 2026), filtered to games where **both players are rated >= 1 800 Elo**. Games are converted to space-separated UCI move strings.

Datasets are streamed and interleaved from HuggingFace Hub. **Sequence packing** concatenates games into fixed 256-token sequences to eliminate padding.
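
The packing step can be sketched as follows (a simplified version for illustration; the chunking policy and toy token ids are assumptions, not the training code):

```python
# Minimal sequence-packing sketch: wrap each tokenized game in BOS/EOS,
# concatenate everything into one flat stream, then slice fixed-length
# 256-token training sequences so no padding tokens are needed.
# Move ids below are made up; the real tokenizer assigns BOS=1, EOS=2.
BOS, EOS, SEQ_LEN = 1, 2, 256

def pack(games):
    """games: list of lists of move ids -> list of SEQ_LEN-token sequences."""
    stream = []
    for moves in games:
        stream.extend([BOS, *moves, EOS])
    # Drop the trailing partial chunk; a real pipeline might buffer it
    # for the next batch instead.
    n_full = len(stream) // SEQ_LEN
    return [stream[i * SEQ_LEN:(i + 1) * SEQ_LEN] for i in range(n_full)]

games = [[10, 11, 12]] * 200           # 200 toy games, 5 tokens each once wrapped
seqs = pack(games)
print(len(seqs), len(seqs[0]))         # 3 256
```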

### Hyperparameters

| | |
|---|---|
| Optimizer | AdamW (betas 0.9 / 0.95, weight decay 0.1) |
| Learning rate | 3e-4 with cosine decay to 10 % of peak |
| Warmup | 9 300 steps (linear) |
| Batch size | 256 × 256 tokens = 65 536 tokens/step |
| Gradient clipping | 1.0 |
| Precision | BF16 |
| Steps | 120 155 |
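
The warmup-plus-cosine schedule in the table can be written out explicitly. This is a sketch matching the stated numbers; the trainer's actual implementation may differ in details such as step indexing:

```python
import math

PEAK_LR, MIN_RATIO = 3e-4, 0.10        # cosine decay to 10% of peak
WARMUP, TOTAL = 9_300, 120_155         # steps, from the table above

def lr_at(step):
    if step < WARMUP:                  # linear warmup from 0
        return PEAK_LR * step / WARMUP
    # cosine decay from peak to MIN_RATIO * peak over the remaining steps
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    min_lr = MIN_RATIO * PEAK_LR
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(WARMUP), lr_at(TOTAL))  # 0.0, ~3e-4 (peak), ~3e-5 (floor)
```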

## Tokenizer

Custom **UCI tokenizer** that maps every legal UCI move string to a unique integer:

| Range | Description | Count |
|---|---|---|
| 0 | `<PAD>` | 1 |
| 1 | `<BOS>` | 1 |
| 2 | `<EOS>` | 1 |
| 3 – 4 034 | Normal moves (src ≠ dst) | 4 032 |
| 4 035 – 4 210 | Promotion moves (file × direction × piece × color) | 176 |
| **Total** | | **4 211** |
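
The counts in the table can be reproduced with one plausible enumeration (a sketch: only the counts come from the table above; the exact id ordering within each range is an assumption):

```python
# Reproduce the vocabulary counts: 4 032 normal moves (any ordered pair of
# distinct squares) plus 176 promotion moves.
# Promotion enumeration assumed: source file x file shift (capture-left /
# push / capture-right) x promotion piece x color.
squares = 64
normal = squares * (squares - 1)       # src != dst: 64 * 63 = 4032

promotions = 0
for color in ("white", "black"):
    for src_file in range(8):
        for d_file in (-1, 0, 1):      # capture-left, push, capture-right
            if 0 <= src_file + d_file < 8:
                promotions += 4        # promote to q, r, b, or n

vocab = 3 + normal + promotions        # + <PAD>, <BOS>, <EOS>
print(normal, promotions, vocab)       # 4032 176 4211
```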

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)

# Encode an opening (Italian Game)
moves = "e2e4 e7e5 g1f3 b8c6 f1c4"
input_ids = tokenizer.encode(moves, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits

# Get top-5 predicted next moves
top5 = logits[0, -1].topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode([idx.item()]):>8s}  {score:.2f}")
```

## Limitations

- The model has no access to board state: all chess knowledge is inferred from move sequences alone.
- No RLHF or self-play refinement: this is a pure next-token prediction model.
- Predictions can include illegal moves; filter them with `python-chess` at inference time (see the [chessgpt-inference](https://github.com/malcouffe/chessgpt-inference) repo for legal-move masking during generation).

## Citation

```bibtex
@misc{chessgpt2026,
  author       = {Matthieu Alcouffe},
  title        = {ChessGPT: A 432M Decoder-Only Transformer for UCI Move Prediction},
  year         = {2026},
  url          = {https://huggingface.co/malcouffe/chessgpt}
}
```