---
license: mit
tags:
- chess
- transformer
- reinforcement-learning
- game-playing
- research
library_name: pytorch
---

# ChessFormer-RL

ChessFormer-RL is an experimental checkpoint from an attempt to train chess models with reinforcement learning.

**Note**: This model is in fact the 8th supervised learning checkpoint (49152 steps), which was intended as the initialization for RL training; the full RL training run encountered instabilities.

## Model Description

- **Model type**: Transformer for chess (RL training initialization)
- **Language(s)**: Chess (FEN notation)
- **License**: MIT
- **Parameters**: 100.7M

## Important Notice

⚠️ **This model is a research checkpoint, not a completed RL-trained model.** The reinforcement learning training encountered:

- Gradient norm explosion
- Noisy reward signals
- Performance degradation from this initialization point

This checkpoint is provided for researchers interested in:

- RL training initialization strategies
- Comparative analysis with the final SL model
- Continuing RL experiments with improved methods

## Architecture

Identical to ChessFormer-SL:

- **Blocks**: 20 transformer layers
- **Hidden size**: 640
- **Attention heads**: 8
- **Intermediate size**: 1728
- **Features**: RMSNorm, SwiGLU activation, custom FEN tokenizer

## Training Details

### Phase 1: Supervised Learning (This Checkpoint)

- **Dataset**: `kaupane/lichess-2023-01-stockfish-annotated` (depth18 split)
- **Training**: 49152 steps of supervised learning on Stockfish evaluations
- **Purpose**: Initialization for subsequent RL training

### Phase 2: Reinforcement Learning (Attempted)

- **Method**: Self-play with Proximal Policy Optimization (PPO)
- **Environment**: Batched chess environment with sparse terminal rewards
- **Outcome**: Training instabilities led to performance degradation
- **Current status**: Requires further research and improved RL methodology

### Training Metrics (This Checkpoint)

- **Action Loss**: 1.8329
- **Value Loss**: 0.0501
- **Invalid Loss**: 0.0484

## Performance

As an intermediate SL checkpoint, this model:

- Performs similarly to early ChessFormer-SL training
- Is less refined than the final SL model
- Is suitable for RL initialization experiments

### Comparison with ChessFormer-SL

| Metric | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) |
|--------|---------------------------|----------------------------|
| Action Loss | 1.8329 | 1.6985 |
| Value Loss | 0.0501 | 0.0407 |
| Invalid Loss | 0.0484 | 0.0303 |

## Research Context

### RL Training Challenges Encountered

1. **Gradient instability**: Explosive gradient norms during PPO updates
2. **Sparse rewards**: Terminal-only rewards created noisy learning signals
3. **Action space complexity**: 1,969 possible moves made exploration difficult
4. **Self-play dynamics**: Unstable opponent strength during training

## Usage

### Installation

```bash
pip install torch transformers huggingface_hub chess
# Download model.py from this repository
```

### Loading the Model

```python
import torch
from model import ChessFormerModel

# Load model
model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL")
model.eval()

# This is an intermediate checkpoint - performance will be lower than ChessFormer-SL
```

### For RL Research

```python
# This checkpoint can serve as initialization for RL experiments
from train_rl import RLTrainer

# Load checkpoint for RL training continuation
trainer = RLTrainer(
    model=model,
    # ... other hyperparameters
)
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True)
```

## Limitations

### Technical Limitations

- **Incomplete training**: An intermediate rather than final model
- **RL instabilities**: Subsequent RL training was unsuccessful
- **Performance**: Lower quality than the final ChessFormer-SL checkpoint

### Research Limitations

- Demonstrates challenges rather than solutions for chess RL
- Requires significant additional work for competitive performance
- Not suitable for production use

## Intended Use

This model is specifically intended for:

- ✅ RL research and experimentation
- ✅ Studying initialization strategies for chess RL
- ✅ Comparative analysis of SL vs RL training trajectories
- ✅ Educational purposes in understanding RL challenges

**Not intended for:**

- ❌ Practical chess playing applications
- ❌ Production chess engines
- ❌ Competitive chess analysis

## Additional Information

- **Repository**: [GitHub link](https://github.com/Mtrya/chess-transformer)
- **Demo**: [HuggingFace Space Demo](https://huggingface.co/spaces/kaupane/Chessformer_Demo)
- **Related**: [ChessFormer-SL](https://huggingface.co/kaupane/ChessFormer-SL) (completed SL training)

*This model represents ongoing research into chess RL training. While the full RL training was unsuccessful, this checkpoint may serve as a starting point for future research directions.*
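A standard mitigation for the gradient norm explosion encountered during PPO updates (see Research Context) is global gradient-norm clipping. The sketch below uses a toy linear model and an artificially scaled loss purely for illustration; it is not the actual ChessFormer training code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 4)  # toy stand-in for the policy network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

x = torch.randn(16, 8)
loss = (model(x) * 1e3).pow(2).mean()  # scaled loss to force huge gradients
loss.backward()

# Rescale gradients so their global L2 norm is at most max_norm before stepping.
pre_clip = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
post_clip = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()
print(float(pre_clip), float(post_clip))
```

Clipping bounds the size of each update without biasing gradient direction, which typically tames exploding norms at the cost of slower learning on genuinely large-gradient steps.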
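The Invalid Loss reported above relates to probability mass placed on illegal moves. Legality for any FEN position can be checked with the `chess` package from the install step; this snippet is independent of the model itself:

```python
import chess

# Standard starting position in FEN, the model's input format.
board = chess.Board("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")

# Enumerate legal moves in UCI notation; any move outside this set drawn
# from the 1,969-move action space would be invalid in this position.
legal = sorted(move.uci() for move in board.legal_moves)
print(len(legal))  # 20 legal moves from the starting position
```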