---
license: mit
tags:
- chess
- transformer
- reinforcement-learning
- game-playing
- research
library_name: pytorch
---
# ChessFormer-RL
ChessFormer-RL is an experimental checkpoint from an attempt to train chess models with reinforcement learning. **Note**: This model is actually the 8th supervised learning checkpoint (49152 steps), which was intended as the initialization for RL training; the full RL training itself encountered instabilities and did not complete successfully.
## Model Description
- **Model type**: Transformer for chess (RL training initialization)
- **Language(s)**: Chess (FEN notation)
- **License**: MIT
- **Parameters**: 100.7M
## Important Notice
⚠️ **This model represents a research checkpoint rather than a completed RL-trained model.** The actual reinforcement learning training encountered:
- Gradient norm explosion
- Noisy reward signals
- Performance degradation from this initialization point
This checkpoint is provided for researchers interested in:
- RL training initialization strategies
- Comparative analysis with the final SL model
- Continuing RL experiments with improved methods
## Architecture
Identical to ChessFormer-SL:
- **Blocks**: 20 transformer layers
- **Hidden size**: 640
- **Attention heads**: 8
- **Intermediate size**: 1728
- **Features**: RMSNorm, SwiGLU activation, custom FEN tokenizer
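The custom FEN tokenizer itself is defined in the repository's `model.py` and is not documented here. As an illustration only, a minimal sketch of one plausible scheme, assuming one token per board square plus side-to-move, castling, and en-passant tokens (the actual tokenizer may differ):

```python
# Hypothetical FEN tokenization sketch: one token per square plus metadata.
# The real tokenizer in model.py may use a different scheme entirely.

def tokenize_fen(fen: str) -> list[str]:
    """Expand a FEN string into a fixed-length token sequence."""
    board, side, castling, en_passant, *_ = fen.split(" ")
    tokens = []
    for row in board.split("/"):
        for ch in row:
            if ch.isdigit():
                tokens.extend(["."] * int(ch))  # run of empty squares
            else:
                tokens.append(ch)               # piece letter
    tokens.append(side)          # 'w' or 'b'
    tokens.append(castling)      # e.g. 'KQkq' or '-'
    tokens.append(en_passant)    # e.g. 'e3' or '-'
    return tokens

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
toks = tokenize_fen(start)  # 64 square tokens + 3 metadata tokens
```

A fixed-length layout like this keeps every position at the same sequence length, which suits a transformer with learned positional embeddings.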
## Training Details
### Phase 1: Supervised Learning (This Checkpoint)
- **Dataset**: `kaupane/lichess-2023-01-stockfish-annotated` (depth18 split)
- **Training**: 49152 steps of supervised learning on Stockfish evaluations
- **Purpose**: Initialization for subsequent RL training
### Phase 2: Reinforcement Learning (Attempted)
- **Method**: Self-play with Proximal Policy Optimization (PPO)
- **Environment**: Batch chess environment with sparse terminal rewards
- **Outcome**: Training instabilities led to performance degradation
- **Current Status**: Requires further research and improved RL methodology
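The PPO method named above limits each policy update by clipping the probability ratio between the new and old policies. A minimal scalar sketch of the clipped surrogate objective, for illustration only (this is not the repository's training code):

```python
import math

# Minimal PPO clipped-surrogate sketch for a single (state, action) sample.
# Illustrative only; the repository's actual RL trainer is not reproduced.

def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, eps: float = 0.2) -> float:
    """L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

With sparse, noisy terminal rewards the advantage estimates are themselves noisy, which is one way the instabilities described above can arise even with clipping in place.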
### Training Metrics (This Checkpoint)
- **Action Loss**: 1.8329
- **Value Loss**: 0.0501
- **Invalid Loss**: 0.0484
## Performance
As an intermediate SL checkpoint, this model:
- Exhibits capabilities similar to early ChessFormer-SL training
- Is less refined than the final SL model
- Is suitable for RL initialization experiments
### Comparison with ChessFormer-SL
| Metric | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) |
|--------|---------------------------|----------------------------|
| Action Loss | 1.8329 | 1.6985 |
| Value Loss | 0.0501 | 0.0407 |
| Invalid Loss | 0.0484 | 0.0303 |
## Research Context
### RL Training Challenges Encountered
1. **Gradient Instability**: Explosive gradient norms during PPO updates
2. **Sparse Rewards**: Terminal-only rewards created noisy learning signals
3. **Action Space Complexity**: 1,969 possible moves created exploration challenges
4. **Self-Play Dynamics**: Unstable opponent strength during training
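A standard mitigation for challenge 3 is to mask illegal moves out of the policy distribution before sampling. A minimal sketch over a generic fixed action space (the indexing of the actual 1,969-move space is not reproduced here):

```python
import math

# Sketch of legal-move masking over a fixed action space.
# The index layout of the model's 1,969-move space is not reproduced here.

def masked_softmax(logits: list[float], legal: list[bool]) -> list[float]:
    """Softmax restricted to legal moves; illegal moves get probability 0."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, legal)]
    m = max(x for x in masked if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in masked]
    z = sum(exps)
    return [e / z for e in exps]

probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
```

Masking shrinks the effective exploration space to the legal moves of each position, but it does not by itself fix the sparse-reward or self-play-dynamics issues above.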
## Usage
### Installation
```bash
pip install torch transformers huggingface_hub chess
# Download model.py from this repository
```
### Loading the Model
```python
import torch
from model import ChessFormerModel
# Load model
model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL")
model.eval()
# This is an intermediate checkpoint - performance will be lower than ChessFormer-SL
```
### For RL Research
```python
# This checkpoint can serve as initialization for RL experiments
from train_rl import RLTrainer
# Load checkpoint for RL training continuation
trainer = RLTrainer(
    model=model,
    # ... other hyperparameters
)
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True)
```
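With terminal-only rewards, every move in a self-play game receives credit solely from the final result. A minimal sketch of assigning (optionally discounted) returns-to-go backwards through a trajectory, assuming +1/0/-1 game outcomes (an assumption; the repository's reward scheme may differ):

```python
# Sketch: propagate a sparse terminal reward (+1 win, 0 draw, -1 loss)
# back through a self-play trajectory with optional discounting.
# Illustrative assumption; the repository's reward scheme may differ.

def returns_from_terminal(num_moves: int, outcome: float,
                          gamma: float = 1.0) -> list[float]:
    """Return-to-go for each move when only the final step is rewarded."""
    return [outcome * gamma ** (num_moves - 1 - t) for t in range(num_moves)]

rets = returns_from_terminal(4, 1.0, gamma=0.99)
```

Because every move in a long game shares one scalar outcome, the per-move learning signal is extremely noisy, which is the sparse-reward challenge described earlier.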
## Limitations
### Technical Limitations
- **Incomplete Training**: Represents intermediate rather than final model
- **RL Instabilities**: Subsequent RL training was unsuccessful
- **Performance**: Lower quality than ChessFormer-SL final checkpoint
### Research Limitations
- Demonstrates challenges rather than solutions for chess RL
- Requires significant additional work for competitive performance
- Not suitable for production use
## Intended Use
This model is specifically intended for:
- βœ… RL research and experimentation
- βœ… Studying initialization strategies for chess RL
- βœ… Comparative analysis of SL vs RL training trajectories
- βœ… Educational purposes in understanding RL challenges
**Not intended for:**
- ❌ Practical chess playing applications
- ❌ Production chess engines
- ❌ Competitive chess analysis
## Additional Information
- **Repository**: [GitHub link](https://github.com/Mtrya/chess-transformer)
- **Demo**: [HuggingFace Space Demo](https://huggingface.co/spaces/kaupane/Chessformer_Demo)
- **Related**: [ChessFormer-SL](https://huggingface.co/kaupane/ChessFormer-SL) (Completed SL Training)
*This model represents ongoing research into chess RL training. While the full RL training was unsuccessful, this checkpoint may serve as a starting point for future research directions.*