---
license: mit
tags:
- chess
- transformer
- reinforcement-learning
- game-playing
- research
library_name: pytorch
---
# ChessFormer-RL
ChessFormer-RL is an experimental checkpoint from an attempt to train a chess transformer with reinforcement learning. **Note**: this model is actually the 8th supervised learning checkpoint (49,152 steps), intended as the initialization for RL training; the full RL run encountered training instabilities, described below.
## Model Description
- **Model type**: Transformer for chess (RL training initialization)
- **Language(s)**: Chess (FEN notation)
- **License**: MIT
- **Parameters**: 100.7M
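Since the model consumes positions in FEN notation, it may help to recall the format: six space-separated fields (piece placement, side to move, castling rights, en-passant square, halfmove clock, fullmove number). A minimal sketch in plain Python, independent of the model's actual tokenizer:

```python
# Split a FEN string into its six standard fields (illustrative only;
# the custom tokenizer in model.py may segment positions differently).
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

def parse_fen(fen: str) -> dict:
    placement, side, castling, en_passant, halfmove, fullmove = fen.split()
    return {
        "placement": placement,       # ranks 8..1, '/'-separated
        "side_to_move": side,         # 'w' or 'b'
        "castling": castling,         # e.g. 'KQkq' or '-'
        "en_passant": en_passant,     # target square or '-'
        "halfmove_clock": int(halfmove),
        "fullmove_number": int(fullmove),
    }

fields = parse_fen(START_FEN)
```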
## Important Notice
⚠️ **This model represents a research checkpoint rather than a completed RL-trained model.** The actual reinforcement learning training encountered:
- Gradient norm explosion
- Noisy reward signals
- Performance degradation from this initialization point
This checkpoint is provided for researchers interested in:
- RL training initialization strategies
- Comparative analysis with the final SL model
- Continuing RL experiments with improved methods
## Architecture
Identical to ChessFormer-SL:
- **Blocks**: 20 transformer layers
- **Hidden size**: 640
- **Attention heads**: 8
- **Intermediate size**: 1728
- **Features**: RMSNorm, SwiGLU activation, custom FEN tokenizer
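These dimensions roughly account for the quoted ~100.7M parameters. A back-of-envelope check, assuming four attention projections and three SwiGLU matrices per block and ignoring embeddings, norms, and output heads:

```python
# Rough parameter count for the transformer blocks alone.
n_layers, d_model, d_ff = 20, 640, 1728

attn = 4 * d_model * d_model   # Q, K, V, and output projections
mlp = 3 * d_model * d_ff       # SwiGLU: gate, up, and down projections
total = n_layers * (attn + mlp)
print(f"{total / 1e6:.1f}M")   # ~99.1M; embeddings and heads cover the rest
```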
## Training Details
### Phase 1: Supervised Learning (This Checkpoint)
- **Dataset**: `kaupane/lichess-2023-01-stockfish-annotated` (depth18 split)
- **Training**: 49,152 steps of supervised learning on Stockfish evaluations
- **Purpose**: Initialization for subsequent RL training
### Phase 2: Reinforcement Learning (Attempted)
- **Method**: Self-play with Proximal Policy Optimization (PPO)
- **Environment**: Batch chess environment with sparse terminal rewards
- **Outcome**: Training instabilities led to performance degradation
- **Current Status**: Requires further research and improved RL methodology
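The sparse-reward setup described above can be sketched abstractly: every move yields zero reward until the game ends, when a single terminal signal (+1 win, -1 loss, 0 draw) is credited. A toy episode loop with a hypothetical interface, not the repository's actual environment:

```python
import random

def play_episode(n_moves: int = 40) -> list:
    """Collect per-step rewards for one self-play game: all zeros
    except the terminal step, which carries the game outcome."""
    rewards = [0.0] * (n_moves - 1)
    outcome = random.choice([1.0, -1.0, 0.0])  # win / loss / draw
    rewards.append(outcome)
    return rewards

rewards = play_episode()
# Only the final entry can be nonzero -- the sparse, noisy signal
# that makes credit assignment over ~40 moves difficult for PPO.
```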
### Training Metrics (This Checkpoint)
- **Action Loss**: 1.8329
- **Value Loss**: 0.0501
- **Invalid Loss**: 0.0484
## Performance
As an intermediate SL checkpoint, this model exhibits:
- Capabilities similar to early ChessFormer-SL training
- Weaker play than the final SL model
- Suitability as an initialization for RL experiments
### Comparison with ChessFormer-SL
| Metric | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) |
|--------|---------------------------|----------------------------|
| Action Loss | 1.8329 | 1.6985 |
| Value Loss | 0.0501 | 0.0407 |
| Invalid Loss | 0.0484 | 0.0303 |
## Research Context
### RL Training Challenges Encountered
1. **Gradient Instability**: Explosive gradient norms during PPO updates
2. **Sparse Rewards**: Terminal-only rewards created noisy learning signals
3. **Action Space Complexity**: 1,969 possible moves created exploration challenges
4. **Self-Play Dynamics**: Unstable opponent strength during training
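The first issue, exploding gradient norms, is commonly mitigated by clipping the global gradient norm before each optimizer step (in PyTorch, `torch.nn.utils.clip_grad_norm_`). The underlying arithmetic, sketched in plain Python:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0 -> rescaled
# clipped is approximately [0.6, 0.8], with global norm 1.0
```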
## Usage
### Installation
```bash
pip install torch transformers huggingface_hub chess
# Download model.py from this repository
```
### Loading the Model
```python
from model import ChessFormerModel  # model.py from this repository

model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL")
model.eval()
# Note: this is an intermediate SL checkpoint; expect weaker play than ChessFormer-SL.
```
### For RL Research
```python
# This checkpoint can serve as initialization for RL experiments
from train_rl import RLTrainer  # training script from the GitHub repository

# Load checkpoint for RL training continuation
trainer = RLTrainer(
    model=model,
    # ... other hyperparameters
)
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True)
```
## Limitations
### Technical Limitations
- **Incomplete Training**: Represents intermediate rather than final model
- **RL Instabilities**: Subsequent RL training was unsuccessful
- **Performance**: Lower quality than ChessFormer-SL final checkpoint
### Research Limitations
- Demonstrates challenges rather than solutions for chess RL
- Requires significant additional work for competitive performance
- Not suitable for production use
## Intended Use
This model is specifically intended for:
- ✅ RL research and experimentation
- ✅ Studying initialization strategies for chess RL
- ✅ Comparative analysis of SL vs RL training trajectories
- ✅ Educational purposes in understanding RL challenges
**Not intended for:**
- ❌ Practical chess playing applications
- ❌ Production chess engines
- ❌ Competitive chess analysis
## Additional Information
- **Repository**: [GitHub link](https://github.com/Mtrya/chess-transformer)
- **Demo**: [HuggingFace Space Demo](https://huggingface.co/spaces/kaupane/Chessformer_Demo)
- **Related**: [ChessFormer-SL](https://huggingface.co/kaupane/ChessFormer-SL) (Completed SL Training)
*This model represents ongoing research into chess RL training. While the full RL run was unsuccessful, this checkpoint may serve as a starting point for future research directions.*