kaupane committed (verified)
Commit d50f141 · 1 Parent(s): 4dd6f0e

Update README.md

Files changed (1): README.md (+160 −10)
---
license: mit
tags:
- chess
- transformer
- reinforcement-learning
- game-playing
- research
library_name: pytorch
---

# ChessFormer-RL

ChessFormer-RL is an experimental checkpoint from training chess models with reinforcement learning. **Note**: this model is actually the 8th supervised-learning checkpoint (49,152 steps), which was intended as the initialization for RL training; the full RL training run itself encountered instabilities and was not completed.

## Model Description

- **Model type**: Transformer for chess (RL training initialization)
- **Language(s)**: Chess (FEN notation)
- **License**: MIT
- **Parameters**: 100.7M

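Since the model consumes positions in FEN notation, a quick look at the format's structure may help (standard starting position; plain Python, independent of the model code):

```python
# FEN encodes a position as six space-separated fields:
# piece placement, side to move, castling rights, en-passant square,
# halfmove clock, and fullmove number.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

fields = start.split(" ")
print(len(fields))              # 6 fields
print(fields[1])                # "w": White to move

ranks = fields[0].split("/")    # piece placement, one chunk per rank
print(len(ranks))               # 8 ranks
```

The custom FEN tokenizer mentioned below operates on strings of this shape.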
## Important Notice

⚠️ **This model represents a research checkpoint rather than a completed RL-trained model.** The reinforcement learning training encountered:

- Gradient norm explosion
- Noisy reward signals
- Performance degradation from this initialization point

This checkpoint is provided for researchers interested in:

- RL training initialization strategies
- Comparative analysis with the final SL model
- Continuing RL experiments with improved methods

## Architecture

Identical to ChessFormer-SL:

- **Blocks**: 20 transformer layers
- **Hidden size**: 640
- **Attention heads**: 8
- **Intermediate size**: 1728
- **Features**: RMSNorm, SwiGLU activation, custom FEN tokenizer

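As a sanity check on the stated 100.7M parameters, here is back-of-envelope arithmetic from the numbers above. It assumes bias-free, untied Q/K/V/O attention projections and a three-matrix SwiGLU feed-forward block, and ignores embeddings, norms, and output heads (none of which is verified against `model.py`):

```python
# Rough per-block parameter count for the architecture above.
# Assumptions (not verified): no biases, four separate attention
# projections, three SwiGLU feed-forward matrices.
d_model, d_ff, n_blocks = 640, 1728, 20

attn_params = 4 * d_model * d_model   # Q, K, V, O projections
ffn_params = 3 * d_model * d_ff       # gate, up, and down matrices
per_block = attn_params + ffn_params

total = n_blocks * per_block
print(per_block)   # 4956160
print(total)       # 99123200
```

Twenty such blocks account for roughly 99.1M parameters; the remaining ~1.6M of the stated 100.7M is consistent with embeddings, norms, and heads.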
## Training Details

### Phase 1: Supervised Learning (This Checkpoint)

- **Dataset**: `kaupane/lichess-2023-01-stockfish-annotated` (depth18 split)
- **Training**: 49,152 steps of supervised learning on Stockfish evaluations
- **Purpose**: Initialization for subsequent RL training

### Phase 2: Reinforcement Learning (Attempted)

- **Method**: Self-play with Proximal Policy Optimization (PPO)
- **Environment**: Batch chess environment with sparse terminal rewards
- **Outcome**: Training instabilities led to performance degradation
- **Current Status**: Requires further research and improved RL methodology

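To see why sparse terminal rewards produce a noisy learning signal, consider a minimal, illustrative return computation (not the repository's actual training code): only the final ply of a game carries reward, so early moves receive a heavily discounted, high-variance signal.

```python
# Discounted returns for a game rewarded only at the terminal ply.
def discounted_returns(rewards, gamma=0.99):
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# A 60-ply game ending in a win: a single +1 at the very end.
rewards = [0.0] * 59 + [1.0]
returns = discounted_returns(rewards)
print(returns[-1])             # 1.0 at the terminal step
print(round(returns[0], 3))    # 0.553: the opening move's signal
```

Every intermediate position's credit assignment depends entirely on one terminal outcome, which is what makes the signal noisy in self-play.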
### Training Metrics (This Checkpoint)

- **Action Loss**: 1.8329
- **Value Loss**: 0.0501
- **Invalid Loss**: 0.0484

## Performance

As an intermediate SL checkpoint, this model:

- Shows capabilities similar to early ChessFormer-SL training
- Is less refined than the final SL model
- Is suitable for RL initialization experiments

### Comparison with ChessFormer-SL

| Metric | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) |
|--------|---------------------------|----------------------------|
| Action Loss | 1.8329 | / |
| Value Loss | 0.0501 | / |
| Invalid Loss | 0.0484 | / |

## Research Context

### RL Training Challenges Encountered

1. **Gradient Instability**: Explosive gradient norms during PPO updates
2. **Sparse Rewards**: Terminal-only rewards created noisy learning signals
3. **Action Space Complexity**: 1,969 possible moves created exploration challenges
4. **Self-Play Dynamics**: Unstable opponent strength during training

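The standard mitigation for challenge 1 is global gradient-norm clipping (in PyTorch, `torch.nn.utils.clip_grad_norm_`). A dependency-free sketch of the idea, not the project's actual training loop:

```python
import math

# Clip a set of gradients so their combined (global) L2 norm is at most
# max_norm. The returned pre-clip norm is the quantity that grows
# without bound during "gradient norm explosion".
def clip_global_norm(grads, max_norm):
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)
        grads = [[g * scale for g in tensor] for tensor in grads]
    return grads, total

grads = [[3.0, 4.0], [12.0]]    # global norm: sqrt(9 + 16 + 144) = 13
clipped, norm = clip_global_norm(grads, max_norm=1.0)
print(norm)                     # 13.0 before clipping
```

Monitoring the pre-clip norm over training is a cheap way to detect the instability described above before it degrades the policy.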
## Usage

### Installation

```bash
pip install torch transformers huggingface_hub chess
# Download model.py from this repository
```

### Loading the Model

```python
import torch
from model import ChessFormerModel  # model.py from this repository

# Load the checkpoint from the Hub
model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL")
model.eval()

# This is an intermediate checkpoint - performance will be lower than ChessFormer-SL
```

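Since the policy head covers all 1,969 possible moves but only a handful are legal in any given position, decoding typically masks illegal moves before sampling. A self-contained sketch of that step (the function and mask names are hypothetical, not the actual ChessFormer API):

```python
import math
import random

def sample_legal_move(logits, legal_mask, temperature=1.0):
    # Mask illegal moves with -inf so they get zero probability.
    masked = [l / temperature if ok else float("-inf")
              for l, ok in zip(logits, legal_mask)]
    m = max(masked)                       # subtract max for stability
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [random.gauss(0.0, 1.0) for _ in range(1969)]
legal = [i < 30 for i in range(1969)]     # pretend the first 30 moves are legal
move = sample_legal_move(logits, legal)
print(move)                               # always an index below 30 here
```

The model's "Invalid Loss" metric above suggests move validity is also learned, but masking at decode time guarantees legality regardless.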
### For RL Research

```python
# This checkpoint can serve as initialization for RL experiments
from train_rl import RLTrainer

# Load the checkpoint for RL training continuation
trainer = RLTrainer(
    model=model,
    # ... other hyperparameters
)
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True)
```

## Limitations

### Technical Limitations

- **Incomplete Training**: Represents an intermediate rather than a final model
- **RL Instabilities**: Subsequent RL training was unsuccessful
- **Performance**: Lower quality than the final ChessFormer-SL checkpoint

### Research Limitations

- Demonstrates challenges rather than solutions for chess RL
- Requires significant additional work for competitive performance
- Not suitable for production use

## Intended Use

This model is specifically intended for:

- ✅ RL research and experimentation
- ✅ Studying initialization strategies for chess RL
- ✅ Comparative analysis of SL vs. RL training trajectories
- ✅ Educational purposes in understanding RL challenges

**Not intended for:**

- ❌ Practical chess-playing applications
- ❌ Production chess engines
- ❌ Competitive chess analysis

## Additional Information

- **Repository**: [GitHub link](https://github.com/Mtrya/chess-transformer)
- **Demo**: [HuggingFace Space Demo](https://huggingface.co/spaces/kaupane/Chessformer_Demo)
- **Related**: [ChessFormer-SL](https://huggingface.co/kaupane/ChessFormer-SL) (completed SL training)

*This model represents ongoing research into chess RL training. While the full RL run was unsuccessful, this checkpoint may serve as a starting point for future research directions.*