|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- chess |
|
|
- transformer |
|
|
- reinforcement-learning |
|
|
- game-playing |
|
|
- research |
|
|
library_name: pytorch |
|
|
--- |
|
|
|
|
|
# ChessFormer-RL |
|
|
|
|
|
ChessFormer-RL is an experimental checkpoint from an attempt to train chess transformers with reinforcement learning. **Note**: the released weights are actually the 8th supervised-learning checkpoint (49152 steps), intended as the initialization for RL training; the full RL run encountered instabilities and degraded performance from this point rather than improving on it.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Model type**: Transformer for chess (RL training initialization) |
|
|
- **Language(s)**: Chess (FEN notation) |
|
|
- **License**: MIT |
|
|
- **Parameters**: 100.7M |
|
|
|
|
|
## Important Notice |
|
|
|
|
|
⚠️ **This model represents a research checkpoint rather than a completed RL-trained model.** The actual reinforcement learning training encountered:
|
|
|
|
|
- Gradient norm explosion |
|
|
- Noisy reward signals |
|
|
- Performance degradation from this initialization point |
|
|
|
|
|
This checkpoint is provided for researchers interested in: |
|
|
|
|
|
- RL training initialization strategies |
|
|
- Comparative analysis with the final SL model |
|
|
- Continuing RL experiments with improved methods |
|
|
|
|
|
## Architecture |
|
|
|
|
|
Identical to ChessFormer-SL: |
|
|
|
|
|
- **Blocks**: 20 transformer layers |
|
|
- **Hidden size**: 640 |
|
|
- **Attention heads**: 8 |
|
|
- **Intermediate size**: 1728 |
|
|
- **Features**: RMSNorm, SwiGLU activation, custom FEN tokenizer |
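
For orientation, the hyperparameters above map to a configuration roughly like the sketch below. The field names are illustrative assumptions and do not necessarily match the constructor arguments in `model.py`.

```python
from dataclasses import dataclass

@dataclass
class ChessFormerConfig:
    # Field names are illustrative assumptions, not the actual arguments in model.py.
    num_layers: int = 20           # transformer blocks
    hidden_size: int = 640         # model width (head dim = 640 / 8 = 80)
    num_heads: int = 8             # attention heads
    intermediate_size: int = 1728  # SwiGLU feed-forward width
    norm: str = "rmsnorm"          # RMSNorm in place of LayerNorm
    activation: str = "swiglu"     # gated feed-forward activation
    tokenizer: str = "fen"         # custom FEN tokenizer
```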
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Phase 1: Supervised Learning (This Checkpoint) |
|
|
|
|
|
- **Dataset**: `kaupane/lichess-2023-01-stockfish-annotated` (depth18 split) |
|
|
- **Training**: 49152 steps of supervised learning on Stockfish evaluations |
|
|
- **Purpose**: Initialization for subsequent RL training |
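
As a quick reference, the SL data can be pulled with the `datasets` library. Whether `depth18` is a split or a config name, and which columns it exposes, should be verified against the dataset card; the call below is a hedged sketch.

```python
from datasets import load_dataset

# Repo id comes from the card above; treating "depth18" as a split name is an
# assumption -- check the dataset card, it may be a config instead.
ds = load_dataset("kaupane/lichess-2023-01-stockfish-annotated", split="depth18")
print(ds[0])  # expected to contain a FEN position plus Stockfish annotations
```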
|
|
|
|
|
### Phase 2: Reinforcement Learning (Attempted) |
|
|
|
|
|
- **Method**: Self-play with Proximal Policy Optimization (PPO); a generic PPO sketch follows this list
|
|
- **Environment**: Batch chess environment with sparse terminal rewards |
|
|
- **Outcome**: Training instabilities led to performance degradation |
|
|
- **Current Status**: Requires further research and improved RL methodology |
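
For context, the policy update in this phase follows standard PPO. The function below is a generic clipped-surrogate sketch, not the exact code in `train_rl.py`; with sparse terminal rewards, the advantages are derived solely from game outcomes.

```python
import torch

def ppo_policy_loss(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Generic clipped PPO surrogate; a sketch, not the repository's implementation."""
    ratio = torch.exp(logp_new - logp_old)                              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # With terminal-only rewards (+1 win / 0 draw / -1 loss), the advantages are
    # driven entirely by the game outcome, which is what makes the signal noisy.
    return -torch.min(unclipped, clipped).mean()
```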
|
|
|
|
|
### Training Metrics (This Checkpoint) |
|
|
|
|
|
- **Action Loss**: 1.8329 |
|
|
- **Value Loss**: 0.0501 |
|
|
- **Invalid Loss**: 0.0484 |
|
|
|
|
|
## Performance |
|
|
|
|
|
As an intermediate SL checkpoint, this model exhibits: |
|
|
|
|
|
- Capabilities similar to those of ChessFormer-SL early in its training
|
|
- Less refinement than the final SL model
|
|
- Suitability for RL initialization experiments
|
|
|
|
|
### Comparison with ChessFormer-SL |
|
|
|
|
|
| Metric | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) | |
|
|
|--------|---------------------------|----------------------------| |
|
|
| Action Loss | 1.8329 | 1.6985 | |
|
|
| Value Loss | 0.0501 | 0.0407 | |
|
|
| Invalid Loss | 0.0484 | 0.0303 | |
|
|
|
|
|
## Research Context |
|
|
|
|
|
### RL Training Challenges Encountered |
|
|
|
|
|
1. **Gradient Instability**: Explosive gradient norms during PPO updates (a standard clipping mitigation is sketched after this list)
|
|
2. **Sparse Rewards**: Terminal-only rewards created noisy learning signals |
|
|
3. **Action Space Complexity**: 1,969 possible moves created exploration challenges |
|
|
4. **Self-Play Dynamics**: Unstable opponent strength during training |
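
Point 1 is commonly mitigated with global-norm gradient clipping. The toy snippet below shows the standard PyTorch pattern; it is not a fix that was validated for this model, and the model/optimizer here are placeholders for the real ones in `model.py` and `train_rl.py`.

```python
import torch
from torch import nn

# Toy model and optimizer purely to make the clipping call runnable;
# max_norm=1.0 is an illustrative value, not a setting from this project.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # global-norm clipping
optimizer.step()
optimizer.zero_grad()
```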
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install torch transformers huggingface_hub chess |
|
|
# Download model.py from this repository |
|
|
``` |
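
Since `model.py` is not pip-installable, one way to fetch it programmatically is via `huggingface_hub` (already in the install line above). The snippet assumes the file sits at the root of this repository.

```python
from huggingface_hub import hf_hub_download

# Assumes model.py is stored at the repository root; adjust filename if the layout differs.
path = hf_hub_download(repo_id="kaupane/ChessFormer-RL", filename="model.py")
print(path)  # local cache path of the downloaded file
```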
|
|
|
|
|
### Loading the Model |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from model import ChessFormerModel |
|
|
|
|
|
# Load model |
|
|
model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL") |
|
|
model.eval() |
|
|
|
|
|
# This is an intermediate checkpoint - performance will be lower than ChessFormer-SL |
|
|
``` |
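
The actual inference API is defined in `model.py`; the continuation below only sketches the expected shape of a forward pass (FEN in, move logits and a value estimate out). The helper `encode_fen` and the two-output forward are assumptions, not the real interface.

```python
import torch
import chess

# Continues from the loading snippet above; method names are hypothetical --
# consult model.py for the real tokenization and forward signature.
board = chess.Board()  # starting position
fen = board.fen()

with torch.no_grad():
    tokens = model.encode_fen(fen)        # assumed FEN-tokenizer helper
    policy_logits, value = model(tokens)  # assumed heads: logits over ~1,969 moves, scalar eval
```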
|
|
|
|
|
### For RL Research |
|
|
|
|
|
```python |
|
|
# This checkpoint can serve as initialization for RL experiments.
# `model` is the instance loaded in the previous snippet; train_rl.py is part
# of the GitHub repository linked below, not one of the pip-installed packages.
from train_rl import RLTrainer
|
|
|
|
|
# Load checkpoint for RL training continuation |
|
|
trainer = RLTrainer( |
|
|
model=model, |
|
|
# ... other hyperparameters |
|
|
) |
|
|
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True) |
|
|
``` |
|
|
|
|
|
## Limitations |
|
|
|
|
|
### Technical Limitations |
|
|
|
|
|
- **Incomplete Training**: Represents an intermediate checkpoint rather than a final model
|
|
- **RL Instabilities**: Subsequent RL training was unsuccessful |
|
|
- **Performance**: Lower quality than the final ChessFormer-SL checkpoint
|
|
|
|
|
### Research Limitations |
|
|
|
|
|
- Demonstrates challenges rather than solutions for chess RL |
|
|
- Requires significant additional work for competitive performance |
|
|
- Not suitable for production use |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is specifically intended for: |
|
|
|
|
|
- ✅ RL research and experimentation
|
|
- ✅ Studying initialization strategies for chess RL
|
|
- ✅ Comparative analysis of SL vs. RL training trajectories
|
|
- ✅ Educational purposes in understanding RL challenges
|
|
|
|
|
**Not intended for:** |
|
|
|
|
|
- ❌ Practical chess playing applications
|
|
- ❌ Production chess engines
|
|
- ❌ Competitive chess analysis
|
|
|
|
|
## Additional Information |
|
|
|
|
|
- **Repository**: [GitHub link](https://github.com/Mtrya/chess-transformer) |
|
|
- **Demo**: [HuggingFace Space Demo](https://huggingface.co/spaces/kaupane/Chessformer_Demo) |
|
|
- **Related**: [ChessFormer-SL](https://huggingface.co/kaupane/ChessFormer-SL) (Completed SL Training) |
|
|
|
|
|
*This model represents ongoing research into chess RL training. While the full RL training was unsuccessful, this checkpoint may serve as a starting point for future work on more stable RL methods.*