Chess Transformer (Cross-Temporal Attention)
A 39M-parameter dual-head transformer (policy + value) that plays chess, trained from scratch on ~98M Lichess positions. The novel contribution is cross-temporal attention over 8 successive board states (600 input tokens), allowing the model to reason about how a position evolved, not just its current static state.
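For illustration, a minimal sketch of how the temporal input might be assembled, assuming each board is encoded as 75 token ids (64 square tokens plus auxiliary feature tokens) and that missing history is padded; the helper name `build_input` and the padding scheme are assumptions for illustration, not the repo's exact code:

```python
import torch

NUM_BOARDS = 8          # temporal window
TOKENS_PER_BOARD = 75   # 64 square tokens + auxiliary feature tokens
SEQ_LEN = NUM_BOARDS * TOKENS_PER_BOARD  # 600

def build_input(history: list, cls_id: int, pad_board: torch.Tensor) -> torch.Tensor:
    """Flatten the last 8 board encodings (each a 75-token LongTensor) into a
    single 600-token sequence and prepend a CLS token (601 tokens total).
    Plies earlier than the available history are padded with `pad_board`;
    this padding choice is an assumption, not the repo's documented behaviour."""
    boards = list(history[-NUM_BOARDS:])
    while len(boards) < NUM_BOARDS:
        boards.insert(0, pad_board)
    flat = torch.cat(boards)                          # shape (600,)
    return torch.cat([torch.tensor([cls_id]), flat])  # shape (601,)
```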
Files
- `best.pt`: supervised pretraining checkpoint (~45% top-1 move accuracy on a 1M-position Lichess test set, ~1100 Elo with 200 MCTS simulations per move)
- `rl_latest.pt`: same model after 20 iterations of AlphaZero-style PPO self-play (note: underperforms the supervised baseline due to compute scale; see project notebook)
Architecture
- 12-layer pre-norm encoder, d_model=512, 8 heads (layout sketched after this list)
- 75-token board encoding (64 squares + side / castling / en passant / material / king-safety / phase)
- 8-board temporal window flattened to 600 tokens + CLS
- Policy head: Linear(512 -> 1968) with legal-move masking
- Value head: Linear(512 -> 256) -> GELU -> Linear(256 -> 1) -> Tanh
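A minimal PyTorch sketch of the layout above, not the repo's actual module; the vocabulary size, positional-embedding scheme, and checkpoint format are illustrative guesses:

```python
import torch
import torch.nn as nn

class ChessTransformer(nn.Module):
    """Sketch of the 12-layer pre-norm encoder with policy and value heads."""
    def __init__(self, vocab_size=4096, d_model=512, n_heads=8, n_layers=12,
                 seq_len=601, n_moves=1968):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True, norm_first=True)  # pre-norm
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.policy_head = nn.Linear(d_model, n_moves)
        self.value_head = nn.Sequential(
            nn.Linear(d_model, 256), nn.GELU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, tokens, legal_mask):
        # tokens: (B, 601) token ids; legal_mask: (B, 1968) bool, True = legal move
        x = self.tok_emb(tokens) + self.pos_emb[:, : tokens.size(1)]
        x = self.encoder(x)
        cls = x[:, 0]                                   # CLS summarises all 8 boards
        logits = self.policy_head(cls)
        logits = logits.masked_fill(~legal_mask, float("-inf"))  # legal-move masking
        value = self.value_head(cls).squeeze(-1)        # scalar in [-1, 1]
        return logits, value
```

Masking illegal moves to -inf before the softmax keeps all probability mass on legal moves, which also keeps the MCTS priors well-defined.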
Training
- Hardware: single NVIDIA RTX 5070 (12 GB, Blackwell sm_120)
- Supervised pretraining: 3 epochs over 98M positions, mixed-precision fp16, AdamW, cosine schedule; **5 days of continuous training**
- PPO self-play (RL): 20 iterations × 50 games × 200 MCTS sims, GAE + clipped surrogate (sketched below), Stockfish-shaped reward; ~3 days on top of the supervised checkpoint
- Total: ~8 days of GPU time end-to-end on a single consumer card
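For reference, a hedged sketch of the two core RL pieces named above, GAE and the PPO clipped surrogate; the discount, clip, and value-loss coefficients are illustrative defaults, and the Stockfish reward shaping is not reproduced here:

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one self-play game.
    rewards, values: 1-D tensors of length T; the terminal bootstrap value
    is assumed to be 0."""
    T = rewards.size(0)
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv, adv + values   # advantages, returns

def ppo_loss(new_logp, old_logp, adv, returns, value_pred, clip=0.2, vf_coef=0.5):
    """Clipped surrogate policy loss plus value loss (coefficients are illustrative)."""
    ratio = (new_logp - old_logp).exp()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = (value_pred - returns).pow(2).mean()
    return policy_loss + vf_coef * value_loss
```

In this setup `rewards` would be mostly zeros with a shaped terminal/Stockfish signal, and `old_logp` comes from the policy snapshot that generated the self-play games.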
Usage
See github.com/Angelotxx271/Individual_project_DAI for training/inference code, or try the interactive demo.
Citation
Individual project for Designing AI (2025-26).