OMLCheT-v1

A fine-tuned distilgpt2 that generates chess moves in Standard Algebraic Notation (SAN), trained on ~86k games from the Open Machine Learning Chess Tournament dataset. Move legality is not guaranteed; see Known Limitations / Failure Modes.

Model Overview

Field	Detail
Base model	`distilbert/distilgpt2`
Architecture	Decoder-only Transformer (GPT-2 family)
Task	Causal language modelling over SAN move sequences
Intended playstyle	Generalist: reproduces the amateur-to-intermediate patterns seen in the training corpus of AI-vs-AI games; no explicit tactical or positional bias was enforced
Input/Output	Plain SAN string, optionally prefixed with `<|chess|>` as in training (e.g. `e4 e5 Nf3`) → continuation (e.g. `Nc6 Bc4 …`)

The model treats a chess game as a text sequence: moves are space-separated SAN strings, each of which the GPT-2 BPE tokenizer may split into one or more tokens, and the model is trained to predict the next token at each step. During inference, the next move is obtained by sampling tokens until the next space-delimited word is complete.
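As a minimal sketch of turning generated text back into a single move, the helper below takes the first whitespace-delimited word that follows the prompt. The function name `next_move` and the marker set are illustrative, not part of the released code, and the sketch assumes the prompt ends on a move boundary.

```python
from typing import Optional

# Marker strings used in the training format; treated here as "no move available".
SPECIAL_MARKERS = {"<|chess|>", "<|endoftext|>"}


def next_move(prompt: str, generated_text: str) -> Optional[str]:
    """Return the first SAN word generated after `prompt`, or None if there is none."""
    # Generation output usually repeats the prompt before the continuation.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    for word in continuation.split():
        if word in SPECIAL_MARKERS:
            return None  # the model ended the game or started a new one
        return word
    return None


print(next_move("<|chess|> e4 e5", "<|chess|> e4 e5 Nf3 Nc6"))
```

The returned string is only a candidate: it still has to be checked against the current board before it is played.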

Architecture Details

All figures are for the base distilgpt2 skeleton; the fine-tuning adds only one new embedding vector (<|chess|>).

Attribute	Value
Total parameters	~81.9 M
Transformer blocks	6
Embedding dimension	768
Attention heads	12
Feed-forward dimension	3 072
Context window	1 024 tokens
Vocabulary size	50 258 (50 257 GPT-2 BPE + 1 domain token <|chess|>)
Positional encoding	Learned absolute
Activation	GELU
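
The parameter total can be tallied from the attributes in the table. The sketch below assumes the standard GPT-2 layout (fused QKV projection, biases on every linear layer, two LayerNorms per block plus a final one) and an output head tied to the token embedding, so the head adds no parameters. The function name is illustrative.

```python
def gpt2_param_count(vocab: int, n_ctx: int, d_model: int, n_layer: int, d_ff: int) -> int:
    """Count trainable parameters of a GPT-2 style decoder with a tied output head."""
    embeddings = vocab * d_model + n_ctx * d_model      # token + learned positional
    attn = d_model * 3 * d_model + 3 * d_model          # fused Q, K, V projection
    attn += d_model * d_model + d_model                 # attention output projection
    mlp = d_model * d_ff + d_ff + d_ff * d_model + d_model
    norms = 2 * (2 * d_model)                           # two LayerNorms (weight + bias)
    block = attn + mlp + norms
    final_norm = 2 * d_model
    return embeddings + n_layer * block + final_norm


total = gpt2_param_count(vocab=50_258, n_ctx=1_024, d_model=768, n_layer=6, d_ff=3_072)
print(f"{total:,}")  # 81,913,344
```

Under these assumptions the single added `<|chess|>` embedding contributes 768 parameters on top of the base model.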

Training Data

Field	Detail
Dataset	`OMLCheT/chess-san-base`
Subset used	`clean`
Volume	~86 600 games (train: 81 860 / test: 4 740)
Source	Open Machine Learning Chess Tournament (OMLCheT) — AI vs AI games played under tournament conditions
Format	Raw SAN strings, one game per row, e.g. `e4 e5 Nf3 Nc6 Bc4 …`
Pre-processing	Each game is wrapped as <|chess|> {moves} <|endoftext|> and short games are packed together into 256-token chunks
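
The wrapping and packing step can be sketched as follows. This is an illustration only: the real pipeline relies on the packing built into `SFTTrainer`, which may handle the leftover tokens differently, and the whitespace tokenizer used here is a stand-in for GPT-2 BPE. `format_game` and `pack_games` are hypothetical names.

```python
from typing import Callable, Iterable, List

PREFIX = "<|chess|>"
EOS = "<|endoftext|>"


def format_game(san_moves: str) -> str:
    """Wrap one game's SAN string in the domain prefix and end-of-text marker."""
    return f"{PREFIX} {san_moves.strip()} {EOS}"


def pack_games(
    games: Iterable[str],
    tokenize: Callable[[str], List[str]],
    chunk_len: int = 256,
) -> List[List[str]]:
    """Concatenate tokenized games and cut the stream into fixed-length chunks."""
    buffer: List[str] = []
    chunks: List[List[str]] = []
    for game in games:
        buffer.extend(tokenize(format_game(game)))
        while len(buffer) >= chunk_len:
            chunks.append(buffer[:chunk_len])
            buffer = buffer[chunk_len:]
    return chunks  # a trailing partial chunk is dropped in this sketch


# Toy run: whitespace "tokens" and a tiny chunk length to make the effect visible.
chunks = pack_games(["e4 e5", "d4 d5"], tokenize=str.split, chunk_len=4)
print(chunks)
```

With a chunk length of 256 and real BPE tokens, a chunk can start or end in the middle of a game, which is why the prefix and end-of-text markers matter as game boundaries.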

Training Progress

Training Loss	Validation Loss	Entropy	Num Tokens	Mean Token Accuracy
1.1141	1.0671	1.0349	51,434,331	0.6441

What the corpus is and isn't:
The games come from ML-agent matches, not human grandmasters or large Lichess databases. This means the model has learned patterns produced by other (possibly imperfect) chess agents, not a broad human-style distribution. Move quality varies widely across the corpus.

Training Methodology

Supervised next-token prediction (standard causal language modelling). No reinforcement learning or RLHF was used.

Hyperparameters

Hyperparameter	Value
Framework	HuggingFace `transformers` + `trl` (`SFTTrainer`)
Epochs	3
Per-device batch size	16
Gradient accumulation steps	2 (effective batch = 32)
Learning rate	5 × 10⁻⁴
LR schedule	Cosine decay with 5% warmup
Weight decay	0.01
Optimiser	AdamW (default `transformers` implementation)
Max sequence length	256 tokens
Packing	Enabled (`packing=True`) — short games concatenated into full-length chunks
Precision	bf16 on Ampere+ GPUs, fp16 on older CUDA, fp32 on CPU
Seed	42
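
For intuition about the learning-rate schedule, the sketch below computes a linear warmup over the first 5% of steps followed by cosine decay to zero, using the peak rate from the table. The total step count of 1 000 is a made-up figure for illustration (the actual number depends on how many packed chunks the corpus yields), and `lr_at` is a hypothetical helper, not a library function.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 5e-4, warmup_ratio: float = 0.05) -> float:
    """Learning rate at `step` under linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1_000  # hypothetical; not taken from the training run
for step in (0, 25, 50, 525, 1_000):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

The rate climbs from zero to the peak during warmup, then falls smoothly, reaching half the peak at the midpoint of the decay phase.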

Training process

1. Load distilgpt2 weights from the Hugging Face Hub.
2. Add the <|chess|> domain prefix token and resize the token embeddings.
3. Format each game as <|chess|> {san_moves} <|endoftext|>.
4. Pack multiple short games into each 256-token chunk to maximise GPU utilisation.
5. Train with cross-entropy loss over all tokens (moves and the prefix).
6. Select the checkpoint with the lowest eval_loss.

Known Limitations / Failure Modes

Failure mode	Severity	Notes
Illegal moves	Medium	The model has no explicit legality checker; it occasionally emits moves that are syntactically valid SAN but illegal given the current board position (e.g. moving a pinned piece)
Endgame blunders	High	The training corpus is dominated by middlegame positions. The model has seen relatively few endgame sequences and tends to play aimlessly once queens are traded
Pawn promotions	Medium–High	Promotion notation (`e8=Q`, `a1=N`, etc.) appears infrequently; underpromotions are rarely generated
Long games	Medium	Training used 256-token sequences (the architecture itself supports 1 024 tokens), so games running past ~60–70 full moves exceed the length seen in training; the model loses positional coherence in very long endgames
Repetition	Low–Medium	Without a repetition detector the model can occasionally cycle through the same few moves
Opening diversity	Low	The model shows reasonable opening variety for common openings (Italian, Ruy López, Sicilian), but handles rare lines poorly
Engine-level play	N/A	This is a language model, not a search-based engine; it does not calculate variations or evaluate positions. Expect amateur-to-club strength at best

Tip for downstream users: always wrap inference in a legality filter (e.g. python-chess) and re-sample on illegal output.
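
A minimal sketch of such a filter is shown below. It assumes `board` is a python-chess `chess.Board` (only its `parse_san` method is used, which raises a `ValueError` subclass for invalid, illegal or ambiguous SAN) and that the caller supplies candidate moves sampled from the model. The function name `first_legal_san` is illustrative.

```python
from typing import Iterable, Optional


def first_legal_san(board, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that `board` accepts as legal SAN, else None.

    `board` is expected to behave like python-chess's chess.Board.
    """
    for san in candidates:
        try:
            board.parse_san(san)  # raises ValueError (or a subclass) if not playable
        except ValueError:
            continue  # skip this sample and try the next one
        return san
    return None
```

With python-chess installed, a caller would build the position with `chess.Board()` and `board.push_san(...)` for each move played so far, then pass in a handful of sampled continuations. If the function returns None, a reasonable fallback is to re-sample at a higher temperature or pick any move from `board.legal_moves`.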

Inference Speed

Benchmarked on a single NVIDIA T4 (Colab free tier) with the full fine-tuned checkpoint loaded in fp16:

Metric	Value
Time per move (greedy)	~15–25 ms
Time per move (sampling, top-p=0.9)	~20–35 ms
Moves per second	~30–60
Full 40-move game generation	~0.8–1.5 s
Memory footprint (fp16)	~330 MB VRAM
Memory footprint (fp32 / CPU)	~660 MB RAM

On a Kaggle P100 expect roughly 2× faster; on CPU expect ~200–500 ms per move.

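Example usage with the `transformers` text-generation pipeline:

```python
from transformers import pipeline
import torch

# Use the first GPU in fp16 when CUDA is available; otherwise fall back to CPU in fp32.
use_gpu = torch.cuda.is_available()

pipe = pipeline(
    "text-generation",
    model="OMLCheT/OMLCheT-v1",
    torch_dtype=torch.float16 if use_gpu else torch.float32,
    device=0 if use_gpu else -1,
)

# Provide the moves played so far, prefixed with the <|chess|> token used in training;
# the model continues from here.
prompt = "<|chess|> e4 e5 Nf3 Nc6 Bc4"
result = pipe(
    prompt,
    max_new_tokens=80,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    pad_token_id=pipe.tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
)
# generated_text contains the prompt followed by the sampled continuation.
print(result[0]["generated_text"])
```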

License

MIT License

The model weights are released under the MIT License.

The base model (distilbert/distilgpt2) is distributed under the Apache 2.0 license; check its model card for the exact terms.
The training dataset (OMLCheT/chess-san-base) is released by us — check the dataset card for its specific terms.
Chess move notation (SAN) is in the public domain.

You are free to use, modify, distribute, and build on top of this model for any purpose, commercial or non-commercial, with attribution.
