---
license: mit
tags:
- chess
- transformer
- policy-value
datasets:
- avewright/chess-positions-lichess-sf
---

# ChessTransformer200M

A 204M-parameter chess-native transformer trained on Stockfish-labeled positions.

## Architecture

- **Encoder**: FusedBoardEncoder (256d), with learned piece-color, square, and context embeddings
- **Backbone**: 16-layer Transformer (1024d, 16 heads, FFN 4096, GELU, norm_first)
- **Policy Head**: SpatialPolicyHead (from×to square features, 512d)
- **Value Head**: WDL (win/draw/loss) classification

## Training

- **Dataset**: avewright/chess-positions-lichess-sf (10.2M positions seen out of 48M available)
- **Steps**: 10,000 optimizer steps (effective batch 1024)
- **Final Policy Loss**: ~2.5 (estimated from loss curve)
- **Top-1 Accuracy**: 18.4% (on 5K eval positions vs Stockfish best moves)
- **GPU**: NVIDIA A40 46GB, FP16 + torch.compile
- **Training time**: ~6 hours to step 10,000

## Usage

```python
import chess
import torch
from play import load_model, get_model_move

# Load the weights on CPU and ask the model for a move from the starting position.
device = torch.device("cpu")
model = load_model("best_model.pt", device)
board = chess.Board()
move, info = get_model_move(model, board, device)
print(f"Best move: {move.uci()}, Top 5: {info['top_moves']}")
```

## Files

- `best_model.pt`: Model weights only (816 MB)
- `training_log.json`: Loss curve data
- `config.json`: Architecture config

## Known Issues

- Training hit an FP16 NaN at step ~13,800. The best checkpoint is from step 10,000.
- The model has seen only ~21% of one epoch of the 48M-position dataset.
- Opens with 1.d4 as White. It plays reasonable chess but is still early in training.
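
The figures quoted in this card can be cross-checked with simple arithmetic. The sketch below is not part of the training code. It assumes that each optimizer step consumes exactly one effective batch of new positions, that the weights are stored in FP32 (4 bytes per parameter), and that a megabyte is 10^6 bytes. The parameter and dataset counts are the rounded values from this card, so the results are approximate.

```python
# Rounded figures taken from this model card.
optimizer_steps = 10_000
effective_batch = 1024
dataset_positions = 48_000_000
parameters = 204_000_000

# Assumption: one effective batch of previously unseen positions per optimizer step.
positions_seen = optimizer_steps * effective_batch
epoch_fraction = positions_seen / dataset_positions

# Assumption: FP32 weights (4 bytes each), megabytes counted as 10**6 bytes.
checkpoint_mb = parameters * 4 / 1e6

print(f"positions seen: {positions_seen / 1e6:.2f}M")
print(f"epoch fraction: {epoch_fraction:.1%}")
print(f"checkpoint size: {checkpoint_mb:.0f} MB")
```

Under these assumptions the script reports about 10.24M positions seen, about 21.3% of one epoch, and an 816 MB checkpoint, which agrees with the numbers listed above.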