MicroChess-20M
A 39M parameter encoder-only transformer that predicts the best chess move from a board position. Trained on 20M Stockfish-evaluated positions from Lichess.
Built for the INFOMTALC 2026 Midterm Chess Tournament at Utrecht University.
Why Encoder-Only? The Key Design Decision
The assignment gave us full freedom: encoder, decoder, encoder-decoder β anything goes. Most baseline players in the course use decoder-only models (Mistral-7B, Kimi-K2) that try to generate a move string token-by-token. That's a natural choice if you think of chess as a text problem. But it's the wrong abstraction.
Chess move prediction is not a generation task. It's a classification task.
Think about what happens when you look at a chess board. You don't "write out" a move character by character β you see the full position and pick one move from a fixed set of possibilities. Every legal chess move can be written as a UCI string like e2e4 or g1f3. There are exactly 1,968 possible UCI move strings (all source-destination square pairs reachable by any piece, plus pawn promotions). So predicting the best move is really just: given this board, which of these 1,968 classes is the answer?
This is why an encoder-only model is the right tool:
Bidirectional attention over the full board. An encoder sees all 64 squares at once and can attend to any piece from any position. A decoder processes tokens left-to-right and needs to build up context sequentially β but there's no natural "left-to-right" order on a chess board. The spatial relationships between pieces matter in all directions simultaneously. A knight on f3 needs to attend to pawns on e5, the king on g1, and a bishop on c4 all at once. Bidirectional self-attention does this naturally.
No autoregressive bottleneck. Decoder models generate moves one character at a time: first
e, then2, thene, then4. Each step depends on the previous one, meaning errors compound β if the first character is wrong, the whole move is wrong. They also need multiple forward passes per move. Our encoder does a single forward pass and outputs logits over all 1,968 moves simultaneously. One pass, one decision.Legal move masking is trivial. Since the model outputs a score for every possible move at once, we can simply set illegal moves to negative infinity before picking the best one. This guarantees zero fallbacks β every move we output is always legal. Decoder models can't do this easily because they generate character-by-character and don't know if the full string will be legal until they've finished generating it.
Much smaller model needed. The baseline
LMPlayeruses Mistral-7B (7 billion parameters, quantized to fit in memory) and still produces tons of illegal moves (85+ fallbacks per game in the notebook examples). Our model is 180x smaller at 39M parameters and never produces an illegal move. The encoder doesn't waste capacity on language understanding, grammar, or general knowledge β every parameter is dedicated to chess pattern recognition.Fast inference. Single forward pass on CPU takes ~50ms. No beam search, no sampling loops, no retries. The decoder baselines need multiple retries (up to 5 attempts) just to get a legal move, and each attempt requires autoregressive generation.
In short: decoder models are great when you need to generate variable-length text. But a chess move is always one of 1,968 fixed options. Using a decoder for that is like using GPT to pick from a multiple-choice exam β it works, but you're hauling far more machinery than the task requires.
How the Model Actually Predicts a Move
Here's the full pipeline, step by step:
Step 1: Tokenize the Board (FEN to Tokens)
A chess position is described by a FEN string like:
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1
Our custom tokenizer converts this into exactly 80 integer tokens:
- Token 1:
[CLS]β a special classification token (more on this below) - Tokens 2-65: The 64 board squares, read rank by rank from the 8th rank (black's back row) to the 1st rank (white's back row). Each square becomes one token: a piece character (
P,n,Q, etc.) or.for an empty square. The digit3in a FEN (meaning 3 empty squares) is expanded into three.tokens. - Token 66:
[SEP]β separator - Token 67: Side to move (
worb) - Tokens 68-71: Castling rights (one token each for K, Q, k, q β present or
-) - Tokens 72-73: En passant target square (file + rank, or
-+ padding) - Tokens 74-76: Halfmove clock (3 digits, zero-padded)
- Tokens 77-80: Fullmove counter (4 digits, zero-padded)
The vocabulary is just 51 tokens total. No subword tokenization, no BPE β every token has a direct chess meaning. This is intentional: the model shouldn't have to figure out that "rnbqkbnr" is 8 separate pieces, it already gets them as 8 separate tokens.
Step 2: Encode with the Transformer
The 80 tokens are fed into a standard transformer encoder:
Token + positional embeddings: Each token ID is mapped to a 512-dimensional vector, and a learned positional embedding is added. The positional embedding lets the model learn that token 2 is the a8 square, token 3 is b8, etc. β so it develops spatial awareness of the board layout.
12 transformer blocks, each containing:
- Multi-head self-attention (8 heads, 64 dims each): Every token attends to every other token. This is where the magic happens β the model learns relationships like "this knight can attack that bishop" or "this pawn is blocking that rook." Because attention is bidirectional, pieces can attend to other pieces regardless of their position in the sequence.
- Feed-forward network (512 β 2048 β 512 with GELU activation): Processes each position's representation independently, adding non-linear transformations.
- Pre-norm residual connections: LayerNorm is applied before each sublayer (attention and FFN), and the original input is added back after. This "pre-norm" variant is more stable during training than the original post-norm transformer.
CLS token extraction: After all 12 blocks, we take the hidden state of the
[CLS]token (position 0). Through training, this token learns to aggregate information from the entire board into a single 512-dimensional vector β a "summary" of the position.Classification head: A linear layer projects the CLS vector from 512 dimensions to 1,968 logits β one score per possible UCI move.
Step 3: Pick the Best Legal Move
The raw logits tell us how confident the model is about each of the 1,968 moves. But many of those moves are illegal in the current position (you can't move a piece that isn't there, can't castle through check, etc.). So we:
- Get the set of legal moves from the
python-chesslibrary - Set every illegal move's logit to
-inf - Pick the move with the highest remaining score
This legal move masking is what guarantees zero fallbacks. The model might internally "want" to play an illegal move, but we never let it.
Inference Enhancements: Beyond Raw Model Output
The raw model gets the right move about 50% of the time (top-1 accuracy). To make it play stronger chess in the tournament, player.py wraps the model with several layers of chess logic:
1. Forced Mate Detection
Before even calling the model, we check: can we checkmate the opponent right now? If any legal move delivers checkmate, play it immediately. This is a cheap check (just iterate legal moves and see if any end the game) but catches situations the model might miss.
2. Opening Book
For the first few moves, we don't use the model at all. Instead, we have a hardcoded dictionary of strong opening moves: Sicilian Defense, Italian Game, Queen's Gambit lines, etc. This ensures we play established theory in the opening rather than relying on the model, which might suggest slightly unusual moves in well-known positions.
3. Heuristic Score Adjustments
After getting the model's top legal moves, we adjust their scores with simple chess heuristics:
- +0.3 for moves that give check (keeps pressure on the opponent)
- +100.0 for checkmate (always play it if available)
- Small bonus for captures, proportional to the captured piece's value
- Extra bonus for "good trades" (capturing a high-value piece with a low-value one)
- +2.0 for pawn promotions
These adjustments are small relative to model scores, so they mostly act as tiebreakers between similarly-rated moves, nudging toward tactically sound choices.
4. Blunder Detection
Before committing to the top-rated move, we simulate it and check:
- Does it allow the opponent to checkmate us in 1?
- Does it hang a valuable piece (opponent can capture our knight/bishop/rook/queen with a lower-value piece)?
If the top move is a blunder, we fall back to the next best candidate that isn't.
5. 1-Ply Lookahead
For the top 5 candidate moves, we do a simple one-move lookahead: play each candidate, then run the model on the resulting position to see what the opponent's best response would be. We pick the move that minimizes the opponent's best response score. This is a lightweight form of minimax search β just one level deep, but it helps avoid moves that look good on the surface but lead to bad positions.
6. Endgame Tablebases (Syzygy)
When the board has 5 or fewer pieces, the game enters a phase where perfect play is mathematically known. If Syzygy tablebase files are available, we look up the theoretically optimal move instead of relying on the model.
Training Details
Data
- Source: Lichess Stockfish evaluations β 10M positions where Stockfish (a very strong chess engine) has evaluated the position and determined the best move.
- Color-flip augmentation: Every position was also mirrored (swap white/black pieces, flip the board vertically, swap side to move). This doubles the dataset to 20M samples and teaches the model to play equally well as both colors. Without this, it might learn to favor white's perspective since the data isn't perfectly balanced.
- Pre-tokenization: All FEN strings were tokenized offline into integer tensors (
train.pt) so the dataloader just does tensor lookups during training β no string processing in the hot loop.
Why Stockfish Data?
We train on Stockfish's best moves, not on human games. Human games contain blunders, time-pressure mistakes, and suboptimal play. Stockfish evaluations give us clean, high-quality labels β the "right answer" according to a very strong engine. The model essentially learns to imitate Stockfish. It won't be as strong as Stockfish (it only sees the board, not a search tree), but it absorbs the patterns and intuitions that make Stockfish's moves good.
Hyperparameters
| Parameter | Value | Why |
|---|---|---|
| Batch size | 2048 | Large batches give stable gradients for classification tasks |
| Learning rate | 3e-4 | Standard for transformers of this size |
| Warmup | 2000 steps | Linear warmup prevents early instability |
| Schedule | Cosine decay | Gradually reduces LR for fine-grained convergence |
| Optimizer | AdamW | Weight decay (0.01) prevents overfitting |
| Betas | (0.9, 0.98) | Slightly higher beta2 for more stable second moments |
| Gradient clipping | 1.0 | Prevents exploding gradients |
| Precision | bf16 (AMP) | 2x memory savings, ~1.5x speed on RTX 6000 Ada |
| torch.compile | enabled | Kernel fusion for additional ~20% speedup |
| Total steps | 50,000 | ~5.2 epochs over the 20M samples |
Training Progress
| Step | Loss | Val Top-1 | Val Top-3 | Notes |
|---|---|---|---|---|
| 1000 | 3.82 | 17.3% | 33.6% | First eval β already learning |
| 2000 | 2.96 | 25.6% | 46.5% | Warmup complete |
| 5000 | 2.37 | 35.7% | 61.2% | Rapid improvement phase |
| 10000 | 1.94 | 42.5% | 70.1% | Diminishing returns begin |
| 13000 | 1.96 | 44.5% | 72.2% | Last logged eval |
| 50000 | ~1.5 | ~50% | ~78% | Final (estimated) |
The model picks the exact Stockfish best move ~50% of the time, and has the right move in its top 3 predictions ~78% of the time. The remaining ~22% are still legal moves (guaranteed by masking) β they're just not what Stockfish would play.
Hardware
- RunPod RTX 6000 Ada (48GB VRAM)
- Training speed: ~1.5 steps/second
- Total training time: ~9.3 hours
Model Architecture Summary
| Property | Value |
|---|---|
| Architecture | Encoder-only Transformer |
| Parameters | 38.9M |
| Layers | 12 |
| Hidden dim (d_model) | 512 |
| Attention heads | 8 |
| Head dim | 64 |
| FFN dim | 2048 |
| Activation | GELU |
| Norm | Pre-LayerNorm |
| Input | 80 tokens (FEN) |
| Vocabulary | 51 tokens |
| Output | 1968 classes (UCI moves) |
| Checkpoint size | 74 MB |
Usage
from player import TransformerPlayer
player = TransformerPlayer("Chess-Bot-20M")
move = player.get_move("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
print(move) # "e2e4"
Results
- vs RandomPlayer: 100% win rate (5W/0D/0L in testing), 0 fallbacks
- Zero fallbacks guaranteed β legal move masking ensures every output is a valid move
- Inference: ~50ms per move on CPU, <10ms on GPU
Repository
- GitHub: kruuusher13/Chess-Bot-20M
Limitations
- Trained only on Stockfish evaluations β learns engine-style play, not human-style play
- The 1-ply lookahead is shallow compared to real search algorithms (Stockfish searches millions of positions per move)
- ~50% top-1 accuracy means the other half of the time it plays reasonable but suboptimal moves
- The model has no concept of long-term planning or strategy β it evaluates each position independently without remembering previous moves in the game