---
license: mit
tags:
  - chess
  - transformer
  - policy-value
datasets:
  - avewright/chess-positions-lichess-sf
---

# ChessTransformer200M

A 204M-parameter chess-native transformer trained on Stockfish-labeled positions.

## Architecture
- **Encoder**: FusedBoardEncoder (256d) — learned piece-color + square + context embeddings
- **Backbone**: 16-layer Transformer (1024d, 16 heads, FFN 4096, GELU, norm_first)
- **Policy Head**: SpatialPolicyHead (from×to square features, 512d)
- **Value Head**: WDL (win/draw/loss) classification
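
As a rough sanity check on the 204M figure, the backbone's size can be estimated from the dimensions listed above. The sketch below assumes each layer follows the standard PyTorch `nn.TransformerEncoderLayer` parameter layout (Q/K/V and output projections, a two-layer FFN, and two LayerNorms, all with biases). That layout is an assumption rather than something this card confirms, and the function name is illustrative.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one Transformer encoder layer (assumed standard layout)."""
    attention = 4 * d_model * d_model + 4 * d_model  # Q, K, V, output projections + biases
    ffn = 2 * d_model * d_ff + d_ff + d_model        # two linear layers + biases
    layer_norms = 2 * 2 * d_model                    # two LayerNorms, weight + bias each
    return attention + ffn + layer_norms


# Dimensions taken from the architecture list above.
backbone_params = 16 * encoder_layer_params(d_model=1024, d_ff=4096)
print(f"Backbone: {backbone_params / 1e6:.1f}M parameters")
```

Under these assumptions the backbone accounts for about 201.5M parameters, which would leave roughly 2.5M for the encoder and the two heads.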

## Training
- **Dataset**: avewright/chess-positions-lichess-sf (about 10.2M of the 48M available positions seen, roughly 21% of one epoch)
- **Steps**: 10,000 optimizer steps (effective batch 1024)
- **Final Policy Loss**: ~2.5 (estimated from loss curve)
- **Top-1 Accuracy**: 18.4% (on 5K eval positions vs Stockfish best moves)
- **GPU**: NVIDIA A40 46GB, FP16 + torch.compile
- **Training time**: ~6 hours to step 10,000
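
The training figures above are consistent with each other, as a quick calculation shows. This is only arithmetic on the numbers reported in this card; the variable names are illustrative, and the throughput is approximate because the training time is itself approximate.

```python
steps = 10_000              # optimizer steps
effective_batch = 1024      # positions per optimizer step
dataset_size = 48_000_000   # positions available in the dataset
train_hours = 6             # approximate wall-clock time to step 10,000

positions_seen = steps * effective_batch
epoch_fraction = positions_seen / dataset_size
positions_per_sec = positions_seen / (train_hours * 3600)

print(f"Positions seen: {positions_seen / 1e6:.2f}M")
print(f"Fraction of one epoch: {epoch_fraction:.1%}")
print(f"Throughput: ~{positions_per_sec:.0f} positions/s")
```

This gives 10.24M positions seen, about 21.3% of one epoch, and a throughput of roughly 474 positions per second.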

## Usage

```python
import chess
import torch

# load_model and get_model_move are provided by the `play` module.
from play import load_model, get_model_move

device = torch.device("cpu")
model = load_model("best_model.pt", device)

board = chess.Board()  # standard starting position
move, info = get_model_move(model, board, device)
print(f"Best move: {move.uci()}, Top 5: {info['top_moves']}")
```

## Files
- `best_model.pt` — Model weights only (816 MB)
- `training_log.json` — Loss curve data
- `config.json` — Architecture config
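
The checkpoint size can be related to the parameter count with a short calculation. The sketch below uses the approximate 204M figure from this card and a hypothetical helper name. It suggests, but does not confirm, that the weights are stored in FP32 even though training used FP16.

```python
def checkpoint_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of a weights-only checkpoint in decimal megabytes."""
    return num_params * bytes_per_param / 1e6


params = 204_000_000  # approximate parameter count from this card

for dtype, nbytes in {"fp32": 4, "fp16": 2}.items():
    print(f"{dtype}: ~{checkpoint_size_mb(params, nbytes):.0f} MB")
```

At 4 bytes per parameter the estimate is about 816 MB, matching the listed file size; at 2 bytes it would be about 408 MB.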

## Known Issues
- Training diverged to NaN under FP16 at around step 13,800. The best checkpoint is from step 10,000.
- The model has seen only ~21% of one epoch of the 48M-position subset.
- As White, it opens with 1.d4. It plays reasonable chess, but it is still early in training.
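
To locate the last usable step before a NaN divergence like the one described above, a loss log can be scanned for the first non-finite value. This is a minimal sketch that assumes the log is a list of records with `step` and `loss` keys. The actual schema of `training_log.json` is not documented here, and the sample values are made up for illustration.

```python
import json
import math


def last_finite_step(records):
    """Return the step of the last record before the first non-finite loss, or None."""
    last_good = None
    for rec in records:
        if not math.isfinite(rec["loss"]):
            break
        last_good = rec["step"]
    return last_good


# Hypothetical log contents; the real training_log.json schema may differ.
sample = json.loads('[{"step": 13700, "loss": 2.0}, {"step": 13800, "loss": NaN}]')
print(last_finite_step(sample))
```

Python's `json` module parses a bare `NaN` by default, so a log written with non-finite losses can usually be read back without extra handling.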