---
license: mit
tags:
- chess
- transformer
- reinforcement-learning
- game-playing
- research
library_name: pytorch
---
# ChessFormer-RL
ChessFormer-RL is an experimental checkpoint from an attempt to train a chess transformer with reinforcement learning. **Note**: this model is actually the 8th supervised learning checkpoint (49,152 steps), intended as the initialization for RL training; the full RL run encountered training instabilities, described below.
## Model Description
- **Model type**: Transformer for chess (RL training initialization)
- **Language(s)**: Chess (FEN notation)
- **License**: MIT
- **Parameters**: 100.7M
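Since the model consumes positions in FEN notation, it may help to recall the format: six space-separated fields (piece placement, side to move, castling rights, en-passant square, halfmove clock, fullmove number). A minimal sketch in plain Python, independent of the model's actual tokenizer:

```python
# Split a FEN string into its six standard fields (illustrative only;
# the custom tokenizer in model.py may segment positions differently).
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

def parse_fen(fen: str) -> dict:
    placement, side, castling, en_passant, halfmove, fullmove = fen.split()
    return {
        "placement": placement,       # ranks 8..1, '/'-separated
        "side_to_move": side,         # 'w' or 'b'
        "castling": castling,         # e.g. 'KQkq' or '-'
        "en_passant": en_passant,     # target square or '-'
        "halfmove_clock": int(halfmove),
        "fullmove_number": int(fullmove),
    }

fields = parse_fen(START_FEN)
```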
## Important Notice
⚠️ **This model represents a research checkpoint rather than a completed RL-trained model.** The actual reinforcement learning training encountered:
- Gradient norm explosion
- Noisy reward signals
- Performance degradation from this initialization point
This checkpoint is provided for researchers interested in:
- RL training initialization strategies
- Comparative analysis with the final SL model
- Continuing RL experiments with improved methods
## Architecture
Identical to ChessFormer-SL:
- **Blocks**: 20 transformer layers
- **Hidden size**: 640
- **Attention heads**: 8
- **Intermediate size**: 1728
- **Features**: RMSNorm, SwiGLU activation, custom FEN tokenizer
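These dimensions roughly account for the quoted ~100.7M parameters. A back-of-envelope check, assuming four attention projections and three SwiGLU matrices per block and ignoring embeddings, norms, and output heads:

```python
# Rough parameter count for the transformer blocks alone.
n_layers, d_model, d_ff = 20, 640, 1728

attn = 4 * d_model * d_model   # Q, K, V, and output projections
mlp = 3 * d_model * d_ff       # SwiGLU: gate, up, and down projections
total = n_layers * (attn + mlp)
print(f"{total / 1e6:.1f}M")   # ~99.1M; embeddings and heads cover the rest
```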
## Training Details
### Phase 1: Supervised Learning (This Checkpoint)
- **Dataset**: `kaupane/lichess-2023-01-stockfish-annotated` (depth18 split)
- **Training**: 49,152 steps of supervised learning on Stockfish evaluations
- **Purpose**: Initialization for subsequent RL training
### Phase 2: Reinforcement Learning (Attempted)
- **Method**: Self-play with Proximal Policy Optimization (PPO)
- **Environment**: Batch chess environment with sparse terminal rewards
- **Outcome**: Training instabilities led to performance degradation
- **Current Status**: Requires further research and improved RL methodology
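The sparse-reward setup described above can be sketched abstractly: every move yields zero reward until the game ends, when a single terminal signal (+1 win, -1 loss, 0 draw) is credited. A toy episode loop with a hypothetical interface, not the repository's actual environment:

```python
import random

def play_episode(n_moves: int = 40) -> list:
    """Collect per-step rewards for one self-play game: all zeros
    except the terminal step, which carries the game outcome."""
    rewards = [0.0] * (n_moves - 1)
    outcome = random.choice([1.0, -1.0, 0.0])  # win / loss / draw
    rewards.append(outcome)
    return rewards

rewards = play_episode()
# Only the final entry can be nonzero -- the sparse, noisy signal
# that makes credit assignment over ~40 moves difficult for PPO.
```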
### Training Metrics (This Checkpoint)
- **Action Loss**: 1.8329
- **Value Loss**: 0.0501
- **Invalid Loss**: 0.0484
## Performance
As an intermediate SL checkpoint, this model exhibits:
- Capabilities similar to early ChessFormer-SL training
- Weaker play than the final SL model
- Suitability as an initialization for RL experiments
### Comparison with ChessFormer-SL
| Metric | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) |
|--------|---------------------------|----------------------------|
| Action Loss | 1.8329 | 1.6985 |
| Value Loss | 0.0501 | 0.0407 |
| Invalid Loss | 0.0484 | 0.0303 |
## Research Context
### RL Training Challenges Encountered
1. **Gradient Instability**: Explosive gradient norms during PPO updates
2. **Sparse Rewards**: Terminal-only rewards created noisy learning signals
3. **Action Space Complexity**: 1,969 possible moves created exploration challenges
4. **Self-Play Dynamics**: Unstable opponent strength during training
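The first issue, exploding gradient norms, is commonly mitigated by clipping the global gradient norm before each optimizer step (in PyTorch, `torch.nn.utils.clip_grad_norm_`). The underlying arithmetic, sketched in plain Python:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0 -> rescaled
# clipped is approximately [0.6, 0.8], with global norm 1.0
```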
## Usage
### Installation
```bash
pip install torch transformers huggingface_hub chess
# Download model.py from this repository
```
### Loading the Model
```python
from model import ChessFormerModel  # model.py from this repository

model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL")
model.eval()
# Note: this is an intermediate SL checkpoint; expect weaker play than ChessFormer-SL.
```
### For RL Research
```python
# This checkpoint can serve as initialization for RL experiments
from train_rl import RLTrainer  # training script from the GitHub repository

# Load checkpoint for RL training continuation
trainer = RLTrainer(
    model=model,
    # ... other hyperparameters
)
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True)
```
## Limitations
### Technical Limitations
- **Incomplete Training**: Represents intermediate rather than final model
- **RL Instabilities**: Subsequent RL training was unsuccessful
- **Performance**: Lower quality than ChessFormer-SL final checkpoint
### Research Limitations
- Demonstrates challenges rather than solutions for chess RL
- Requires significant additional work for competitive performance
- Not suitable for production use
## Intended Use
This model is specifically intended for:
- ✅ RL research and experimentation
- ✅ Studying initialization strategies for chess RL
- ✅ Comparative analysis of SL vs RL training trajectories
- ✅ Educational purposes in understanding RL challenges
**Not intended for:**
- ❌ Practical chess playing applications
- ❌ Production chess engines
- ❌ Competitive chess analysis
## Additional Information
- **Repository**: [GitHub link](https://github.com/Mtrya/chess-transformer)
- **Demo**: [HuggingFace Space Demo](https://huggingface.co/spaces/kaupane/Chessformer_Demo)
- **Related**: [ChessFormer-SL](https://huggingface.co/kaupane/ChessFormer-SL) (Completed SL Training)
*This model represents ongoing research into chess RL training. While the full RL run was unsuccessful, this checkpoint may serve as a starting point for future research directions.*