---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- chess
- causal-lm
- uci
- decoder-only
- llama-style
datasets:
- malcouffe/lichess-standard-rated-2025-07-uci
- malcouffe/lichess-standard-rated-2025-08-uci
- malcouffe/lichess-standard-rated-2025-09-uci
- malcouffe/lichess-standard-rated-2025-10-uci
- malcouffe/lichess-standard-rated-2025-11-uci
- malcouffe/lichess-standard-rated-2025-12-uci
- malcouffe/lichess-standard-rated-2026-01-uci
pipeline_tag: text-generation
model-index:
- name: ChessGPT
  results: []
---

# ChessGPT (432M)

A decoder-only transformer trained to predict the next move in chess games using UCI notation. The model learns purely from move sequences (no board state, no engine evaluations) via next-token prediction on Lichess games.

## Model details

| | |
|---|---|
| **Architecture** | LLaMA-style decoder-only transformer |
| **Parameters** | 432M |
| **Context length** | 256 tokens |
| **Vocab size** | 4 211 (UCI moves + 3 special tokens) |
| **Training tokens** | 7.87B |
| **License** | Apache 2.0 |

### Architecture

- **d_model** 1 280, **n_layers** 21, **n_heads** 20 (head_dim 64), **d_ff** 3 584
- RMSNorm (pre-norm), rotary position embeddings (RoPE), SwiGLU FFN
- QK-Norm before RoPE (as in Gemma / DeepSeek-V2)
- No biases in linear layers; weight tying between the embedding and output head
- Scaled residual initialization: `std / sqrt(2 * n_layers)`

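As a sanity check, the 432M figure follows directly from these dimensions. A minimal sketch, assuming tied embeddings, bias-free linear layers, a three-matrix SwiGLU FFN, and one final RMSNorm (the few extra QK-Norm parameters are omitted):

```python
# Rough parameter count from the stated architecture (tied embeddings,
# no biases, SwiGLU FFN with gate/up/down projections).
d_model, n_layers, d_ff, vocab = 1280, 21, 3584, 4211

embedding = vocab * d_model        # shared with the output head (weight tying)
attention = 4 * d_model * d_model  # Wq, Wk, Wv, Wo
ffn       = 3 * d_model * d_ff     # gate, up, down (SwiGLU)
norms     = 2 * d_model            # two RMSNorm weights per layer
per_layer = attention + ffn + norms

total = embedding + n_layers * per_layer + d_model  # + final RMSNorm
print(f"{total / 1e6:.1f}M")  # → 432.1M
```
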
## Training

### Data

Seven monthly snapshots of Lichess standard rated games (July 2025 to January 2026), filtered to games where **both players are rated at least 1 800 Elo**. Games are converted to space-separated UCI move strings.

Datasets are streamed and interleaved from the Hugging Face Hub. **Sequence packing** concatenates games into fixed 256-token sequences to eliminate padding.

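The packing step can be sketched as follows. This is a minimal illustration, not the actual training code; the `eos_id` boundary token and the generator shape are assumptions:

```python
def pack_sequences(tokenized_games, seq_len=256, eos_id=2):
    """Concatenate games into fixed-length blocks so no padding is needed."""
    buf = []
    for game in tokenized_games:   # each game: a list of move-token ids
        buf.extend(game)
        buf.append(eos_id)         # mark the game boundary
        while len(buf) >= seq_len:
            yield buf[:seq_len]    # emit a full block
            buf = buf[seq_len:]    # carry the remainder into the next block

# 200 three-move games -> 800 tokens -> 100 packed blocks of length 8
blocks = list(pack_sequences([[10, 11, 12]] * 200, seq_len=8))
```

Because games are concatenated rather than padded, every position in every block contributes a training signal.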
### Hyperparameters

| | |
|---|---|
| Optimizer | AdamW (betas 0.9 / 0.95, weight decay 0.1) |
| Learning rate | 3e-4, cosine decay to 10 % of peak |
| Warmup | 9 300 steps (linear) |
| Batch size | 256 × 256 tokens = 65 536 tokens/step |
| Gradient clipping | 1.0 |
| Precision | BF16 |
| Steps | 120 155 |

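The resulting schedule, linear warmup followed by cosine decay to 10 % of the peak, can be written out explicitly. A sketch reconstructed from the hyperparameters above, not the training script itself:

```python
import math

PEAK, WARMUP, TOTAL, FLOOR_FRAC = 3e-4, 9_300, 120_155, 0.10

def lr_at(step):
    """Linear warmup to PEAK, then cosine decay down to FLOOR_FRAC * PEAK."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)   # 0 -> 1 after warmup
    floor = FLOOR_FRAC * PEAK
    return floor + 0.5 * (PEAK - floor) * (1 + math.cos(math.pi * progress))
```
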
## Tokenizer

A custom **UCI tokenizer** maps every legal UCI move string to a unique integer id:

| Range | Description | Count |
|---|---|---|
| 0 | `<PAD>` | 1 |
| 1 | `<BOS>` | 1 |
| 2 | `<EOS>` | 1 |
| 3 – 4 034 | Normal moves (src ≠ dst) | 4 032 |
| 4 035 – 4 210 | Promotion moves (file × direction × piece × color) | 176 |
| **Total** | | **4 211** |

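The counts can be reproduced by enumerating the move space: 64 × 63 ordered square pairs for normal moves, plus promotions onto each back rank. This is a hypothetical reconstruction; the id ordering inside each range may differ from the released tokenizer:

```python
FILES = "abcdefgh"
SQUARES = [f + r for r in "12345678" for f in FILES]   # a1 .. h8

# Normal moves: every ordered pair of distinct squares -> 64 * 63 = 4 032
normal = [src + dst for src in SQUARES for dst in SQUARES if src != dst]

# Promotions: pawn push or capture onto the back rank, per piece and color
promotions = []
for piece in "qrbn":                                   # 4 promotion pieces
    for src_rank, dst_rank in (("7", "8"), ("2", "1")):  # white, black
        for i, f in enumerate(FILES):
            for j in (i - 1, i, i + 1):                # capture left/push/right
                if 0 <= j < 8:
                    promotions.append(f + src_rank + FILES[j] + dst_rank + piece)

vocab = ["<PAD>", "<BOS>", "<EOS>"] + normal + promotions
print(len(normal), len(promotions), len(vocab))        # 4032 176 4211
```
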
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)
model.eval()

# Encode an opening (Italian Game)
moves = "e2e4 e7e5 g1f3 b8c6 f1c4"
input_ids = tokenizer.encode(moves, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits

# Print the top-5 predicted next moves with their logit scores
top5 = logits[0, -1].topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode([idx.item()]):>8s} {score.item():.2f}")
```

## Limitations

- The model has no access to board state: all chess knowledge is inferred from move sequences.
- No RLHF or self-play refinement; this is a pure next-token-prediction model.
- Predictions can include illegal moves; use `python-chess` to filter them at inference time (see the [chessgpt-inference](https://github.com/malcouffe/chessgpt-inference) repo for legal-move masking during generation).

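A minimal sketch of the filtering idea, kept library-agnostic: in practice the legal move list would come from `python-chess` (`board.legal_moves`, `move.uci()`), and the toy vocabulary below is illustrative, not the model's actual id mapping:

```python
def mask_illegal(logits, legal_uci, token_to_id):
    """Keep logits only at ids of legal UCI moves; set everything else to -inf."""
    masked = [float("-inf")] * len(logits)
    for uci in legal_uci:
        idx = token_to_id.get(uci)
        if idx is not None and 0 <= idx < len(logits):
            masked[idx] = logits[idx]
    return masked

# Toy example with a 6-entry vocabulary and two legal moves
vocab = {"<PAD>": 0, "<BOS>": 1, "<EOS>": 2, "e2e4": 3, "d2d4": 4, "e7e5": 5}
masked = mask_illegal([0.1] * 6, ["e2e4", "d2d4"], vocab)
```

Sampling (or argmax) over `masked` can then never produce an illegal move.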
## Citation

```bibtex
@misc{chessgpt2026,
  author = {Matthieu Alcouffe},
  title  = {ChessGPT: A 432M Decoder-Only Transformer for UCI Move Prediction},
  year   = {2026},
  url    = {https://huggingface.co/malcouffe/chessgpt}
}
```