|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- chess |
|
|
- transformer |
|
|
- reinforcement-learning |
|
|
- game-playing |
|
|
- research |
|
|
library_name: pytorch |
|
|
--- |
|
|
|
|
|
# ChessFormer-RL |
|
|
|
|
|
ChessFormer-RL is an experimental checkpoint from an attempt to train chess transformers with reinforcement learning. **Note**: the released weights are actually the 8th supervised-learning checkpoint (49152 steps), intended as the initialization for RL training; the full RL run encountered instabilities and degraded performance from this point rather than improving on it.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Model type**: Transformer for chess (RL training initialization) |
|
|
- **Language(s)**: Chess (FEN notation) |
|
|
- **License**: MIT |
|
|
- **Parameters**: 100.7M |
|
|
|
|
|
## Important Notice |
|
|
|
|
|
⚠️ **This model represents a research checkpoint rather than a completed RL-trained model.** The actual reinforcement learning training encountered:
|
|
|
|
|
- Gradient norm explosion |
|
|
- Noisy reward signals |
|
|
- Performance degradation from this initialization point |
|
|
|
|
|
This checkpoint is provided for researchers interested in: |
|
|
|
|
|
- RL training initialization strategies |
|
|
- Comparative analysis with the final SL model |
|
|
- Continuing RL experiments with improved methods |
|
|
|
|
|
## Architecture |
|
|
|
|
|
Identical to ChessFormer-SL: |
|
|
|
|
|
- **Blocks**: 20 transformer layers |
|
|
- **Hidden size**: 640 |
|
|
- **Attention heads**: 8 |
|
|
- **Intermediate size**: 1728 |
|
|
- **Features**: RMSNorm, SwiGLU activation, custom FEN tokenizer |
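
For orientation, the hyperparameters above map to a configuration roughly like the sketch below. The field names are illustrative assumptions and do not necessarily match the constructor arguments in `model.py`.

```python
from dataclasses import dataclass

@dataclass
class ChessFormerConfig:
    # Field names are illustrative assumptions, not the actual arguments in model.py.
    num_layers: int = 20           # transformer blocks
    hidden_size: int = 640         # model width (head dim = 640 / 8 = 80)
    num_heads: int = 8             # attention heads
    intermediate_size: int = 1728  # SwiGLU feed-forward width
    norm: str = "rmsnorm"          # RMSNorm in place of LayerNorm
    activation: str = "swiglu"     # gated feed-forward activation
    tokenizer: str = "fen"         # custom FEN tokenizer
```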
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Phase 1: Supervised Learning (This Checkpoint) |
|
|
|
|
|
- **Dataset**: `kaupane/lichess-2023-01-stockfish-annotated` (depth18 split) |
|
|
- **Training**: 49152 steps of supervised learning on Stockfish evaluations |
|
|
- **Purpose**: Initialization for subsequent RL training |
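
As a quick reference, the SL data can be pulled with the `datasets` library. Whether `depth18` is a split or a config name, and which columns it exposes, should be verified against the dataset card; the call below is a hedged sketch.

```python
from datasets import load_dataset

# Repo id comes from the card above; treating "depth18" as a split name is an
# assumption -- check the dataset card, it may be a config instead.
ds = load_dataset("kaupane/lichess-2023-01-stockfish-annotated", split="depth18")
print(ds[0])  # expected to contain a FEN position plus Stockfish annotations
```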
|
|
|
|
|
### Phase 2: Reinforcement Learning (Attempted) |
|
|
|
|
|
- **Method**: Self-play with Proximal Policy Optimization (PPO); a generic PPO sketch follows this list
|
|
- **Environment**: Batch chess environment with sparse terminal rewards |
|
|
- **Outcome**: Training instabilities led to performance degradation |
|
|
- **Current Status**: Requires further research and improved RL methodology |
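
For context, the policy update in this phase follows standard PPO. The function below is a generic clipped-surrogate sketch, not the exact code in `train_rl.py`; with sparse terminal rewards, the advantages are derived solely from game outcomes.

```python
import torch

def ppo_policy_loss(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Generic clipped PPO surrogate; a sketch, not the repository's implementation."""
    ratio = torch.exp(logp_new - logp_old)                              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # With terminal-only rewards (+1 win / 0 draw / -1 loss), the advantages are
    # driven entirely by the game outcome, which is what makes the signal noisy.
    return -torch.min(unclipped, clipped).mean()
```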
|
|
|
|
|
### Training Metrics (This Checkpoint) |
|
|
|
|
|
- **Action Loss**: 1.8329 |
|
|
- **Value Loss**: 0.0501 |
|
|
- **Invalid Loss**: 0.0484 |
|
|
|
|
|
## Performance |
|
|
|
|
|
As an intermediate SL checkpoint, this model exhibits: |
|
|
|
|
|
- Capabilities similar to those of ChessFormer-SL early in its training
|
|
- Less refinement than the final SL model
|
|
- Suitability for RL initialization experiments
|
|
|
|
|
### Comparison with ChessFormer-SL |
|
|
|
|
|
| Metric | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) | |
|
|
|--------|---------------------------|----------------------------| |
|
|
| Action Loss | 1.8329 | 1.6985 | |
|
|
| Value Loss | 0.0501 | 0.0407 | |
|
|
| Invalid Loss | 0.0484 | 0.0303 | |
|
|
|
|
|
## Research Context |
|
|
|
|
|
### RL Training Challenges Encountered |
|
|
|
|
|
1. **Gradient Instability**: Explosive gradient norms during PPO updates (a standard clipping mitigation is sketched after this list)
|
|
2. **Sparse Rewards**: Terminal-only rewards created noisy learning signals |
|
|
3. **Action Space Complexity**: 1,969 possible moves created exploration challenges |
|
|
4. **Self-Play Dynamics**: Unstable opponent strength during training |
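
Point 1 is commonly mitigated with global-norm gradient clipping. The toy snippet below shows the standard PyTorch pattern; it is not a fix that was validated for this model, and the model/optimizer here are placeholders for the real ones in `model.py` and `train_rl.py`.

```python
import torch
from torch import nn

# Toy model and optimizer purely to make the clipping call runnable;
# max_norm=1.0 is an illustrative value, not a setting from this project.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # global-norm clipping
optimizer.step()
optimizer.zero_grad()
```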
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install torch transformers huggingface_hub chess |
|
|
# Download model.py from this repository |
|
|
``` |
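
Since `model.py` is not pip-installable, one way to fetch it programmatically is via `huggingface_hub` (already in the install line above). The snippet assumes the file sits at the root of this repository.

```python
from huggingface_hub import hf_hub_download

# Assumes model.py is stored at the repository root; adjust filename if the layout differs.
path = hf_hub_download(repo_id="kaupane/ChessFormer-RL", filename="model.py")
print(path)  # local cache path of the downloaded file
```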
|
|
|
|
|
### Loading the Model |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from model import ChessFormerModel |
|
|
|
|
|
# Load model |
|
|
model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL") |
|
|
model.eval() |
|
|
|
|
|
# This is an intermediate checkpoint - performance will be lower than ChessFormer-SL |
|
|
``` |
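
The actual inference API is defined in `model.py`; the continuation below only sketches the expected shape of a forward pass (FEN in, move logits and a value estimate out). The helper `encode_fen` and the two-output forward are assumptions, not the real interface.

```python
import torch
import chess

# Continues from the loading snippet above; method names are hypothetical --
# consult model.py for the real tokenization and forward signature.
board = chess.Board()  # starting position
fen = board.fen()

with torch.no_grad():
    tokens = model.encode_fen(fen)        # assumed FEN-tokenizer helper
    policy_logits, value = model(tokens)  # assumed heads: logits over ~1,969 moves, scalar eval
```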
|
|
|
|
|
### For RL Research |
|
|
|
|
|
```python |
|
|
# This checkpoint can serve as initialization for RL experiments.
# `model` is the instance loaded in the previous snippet; train_rl.py is part
# of the GitHub repository linked below, not one of the pip-installed packages.
from train_rl import RLTrainer
|
|
|
|
|
# Load checkpoint for RL training continuation |
|
|
trainer = RLTrainer( |
|
|
model=model, |
|
|
# ... other hyperparameters |
|
|
) |
|
|
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True) |
|
|
``` |
|
|
|
|
|
## Limitations |
|
|
|
|
|
### Technical Limitations |
|
|
|
|
|
- **Incomplete Training**: Represents an intermediate checkpoint rather than a final model
|
|
- **RL Instabilities**: Subsequent RL training was unsuccessful |
|
|
- **Performance**: Lower quality than the final ChessFormer-SL checkpoint
|
|
|
|
|
### Research Limitations |
|
|
|
|
|
- Demonstrates challenges rather than solutions for chess RL |
|
|
- Requires significant additional work for competitive performance |
|
|
- Not suitable for production use |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is specifically intended for: |
|
|
|
|
|
- ✅ RL research and experimentation
|
|
- ✅ Studying initialization strategies for chess RL
|
|
- ✅ Comparative analysis of SL vs. RL training trajectories
|
|
- ✅ Educational purposes in understanding RL challenges
|
|
|
|
|
**Not intended for:** |
|
|
|
|
|
- ❌ Practical chess playing applications
|
|
- ❌ Production chess engines
|
|
- ❌ Competitive chess analysis
|
|
|
|
|
## Additional Information |
|
|
|
|
|
- **Repository**: [GitHub link](https://github.com/Mtrya/chess-transformer) |
|
|
- **Demo**: [HuggingFace Space Demo](https://huggingface.co/spaces/kaupane/Chessformer_Demo) |
|
|
- **Related**: [ChessFormer-SL](https://huggingface.co/kaupane/ChessFormer-SL) (Completed SL Training) |
|
|
|
|
|
*This model represents ongoing research into chess RL training. While the full RL training was unsuccessful, this checkpoint may serve as a starting point for future work on more stable RL methods.*