Update README (README.md)
Initially I used Optuna for fine-tuning the hyperparameters; I ran 70 trials with ...

1. CrossEntropyLoss
2. CrossEntropyLoss gap between the training set and the validation set, as a way to minimize overfitting.

**NOTE**: due to the VRAM limitations of my GPU, I manually set hyperparameters and made ad-hoc changes to the architecture.
```yml
d_model: 256
# ...
lr: 0.0003
batch_size: 64
```

I chose these parameters because they represent the best balance between model capacity and the VRAM constraints of an RTX 4060 Laptop GPU (8 GB):

- d_model: 256 - an embedding dimension large enough to capture complex positional relationships in FEN notation without exceeding memory limits.
- num_heads: 8 - follows the standard ratio of d_model / num_heads = 32 dimensions per head, in line with the original "Attention Is All You Need" paper. Each head learns to attend to different aspects of the position simultaneously.
- num_layers: 6 - upgraded from 5 layers in earlier experiments. The additional layer added ~1.84M parameters and improved validation loss by 11% (0.2644 -> 0.2352) at the cost of roughly 15% longer training time per epoch.
- d_ff: 1024 - the feedforward dimension follows the standard 4*d_model ratio, providing sufficient non-linear capacity between attention layers.
- dropout: 0.1 - light regularization. The train/validation gap remained consistently below 0.025 throughout training, confirming the model was not overfitting and did not require stronger regularization.
- lr: 0.0003 - validated through an Optuna search in earlier experiments; during fine-tuning it is lowered to 0.00003.
- batch_size: 64 - chosen for VRAM stability; gradient accumulation raises the effective batch size to 256.

The total parameter count is **11,086,884**.
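As a rough sanity check, the configured dimensions can be plugged into PyTorch's built-in `nn.Transformer` to count parameters. This is only a stand-in for the repo's own model class, whose total of 11,086,884 also includes token embeddings and the output head, so the two counts will differ:

```python
import torch.nn as nn

# Stand-in sketch: torch's built-in Transformer with the README's dimensions.
# The repo's actual model (with embeddings and output layer) is defined in the repo.
core = nn.Transformer(
    d_model=256, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=1024, dropout=0.1,
    batch_first=True,
)

def count_params(module):
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print(f"core transformer parameters: {count_params(core):,}")
```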
## Training & Optimizations

Given the VRAM constraints, I tweaked the training procedure and the model architecture as follows:

- Mixed Precision Training: training uses `torch.amp.autocast` with `GradScaler` to perform forward passes in float16 while keeping optimizer states in float32. This roughly halves VRAM usage and speeds up training. The attention mask fill value was changed from -1e9 to -1e4 to prevent float16 overflow during masked attention computation.
- Gradient Clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` is applied after every backward pass to prevent exploding gradients, which is especially important in combination with mixed precision training.
- Gradient Accumulation: during fine-tuning, gradients are accumulated over 4 steps before each optimizer update, giving an effective batch size of 256 (batch_size=64 * acc_steps=4) without requiring additional VRAM.
- Cosine Annealing LR Scheduler: the learning rate decays from its initial value down to eta_min=0.00005 following a cosine curve over the full training run, allowing the model to make large updates early and smaller adjustments later.
- Early Stopping: training monitors validation loss with a patience of 3 epochs and a minimum improvement delta of 0.002. The best checkpoint is saved automatically, ensuring the final model reflects peak generalization rather than the last epoch.
- Layer Freezing: encoder layers 0–2 and decoder layers 0–2 were frozen during fine-tuning to preserve learned general chess representations.
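The freezing scheme can be sketched with `nn.Transformer` as a stand-in; the repo's own model exposes its encoder and decoder stacks under its own attribute names:

```python
import torch.nn as nn

# Stand-in model: only the layer-freezing pattern is illustrative here.
model = nn.Transformer(d_model=256, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=1024, batch_first=True)

for stack in (model.encoder.layers, model.decoder.layers):
    for layer in stack[:3]:                 # freeze layers 0-2 of each stack
        for p in layer.parameters():
            p.requires_grad = False         # excluded from optimizer updates

frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Only parameters with `requires_grad=True` should then be handed to the optimizer.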
### Data

- Base training: ~2.3mln positions combining the `ssingh22/chess-evaluations` tactics dataset and `bonna46/Chess-FEN-and-NL-Format-30K-Dataset`.
- Fine-tuning: ~1.2mln positions combining checkmate positions (eval ≥ 2000), high-quality Lichess puzzles filtered by popularity ≥ 90, NbPlays ≥ 3000, and rating 300–2200, plus a ~15% general-data buffer to prevent forgetting (all data from bonna46, self-generated data, and tactics).
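The puzzle filter used for the fine-tuning set can be sketched in plain Python (field names follow the public Lichess puzzle dump; the sample rows below are made up for illustration):

```python
# Made-up sample rows mimicking the Lichess puzzle dump schema.
puzzles = [
    {"PuzzleId": "a", "Popularity": 95, "NbPlays": 5000, "Rating": 1400},
    {"PuzzleId": "b", "Popularity": 50, "NbPlays": 5000, "Rating": 1400},  # too unpopular
    {"PuzzleId": "c", "Popularity": 95, "NbPlays": 100,  "Rating": 2500},  # too few plays
]

def keep(p):
    """Apply the fine-tuning filter: popularity, play count, and rating band."""
    return (p["Popularity"] >= 90
            and p["NbPlays"] >= 3000
            and 300 <= p["Rating"] <= 2200)

filtered = [p["PuzzleId"] for p in puzzles if keep(p)]  # only puzzle "a" survives
```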
### Notes

The default temperature was chosen after running 100 matches against every Stockfish model (weak, mid, strong, GM). I computed the win rate scoring 1 pt. for a win, 0.5 for a draw, and -1 for a loss. The code is in `tester.py`. I settled on 0.65 as the optimal default temperature for win-rate consistency across different runs and opponents. Below is the last run I tried.
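The scoring rule above fits in a few lines (a sketch only; the actual tournament code lives in `tester.py`):

```python
# Scoring rule from the Notes: +1 for a win, +0.5 for a draw, -1 for a loss.
POINTS = {"win": 1.0, "draw": 0.5, "loss": -1.0}

def match_score(results):
    """Aggregate score over a series of games."""
    return sum(POINTS[r] for r in results)

# e.g. 2 wins, 1 draw, 1 loss -> 1 + 1 + 0.5 - 1 = 1.5
match_score(["win", "win", "draw", "loss"])
```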
The model **TransformerGodPlayer.pth** is saved in the model folder and uploaded to HuggingFace along with **opt-configs.yml**.
## Requirements

Libraries used are described in `requirements.txt`. To install them in bulk, run the following command after cd-ing into the directory:

```bash
pip install -r requirements.txt
```
## How to Use

This model is meant to be used with the `TransformerPlayer` class in `player.py` after cloning the original repo:

```bash
git clone https://github.com/LeonSavi/chess_exam
```

For the purpose of the class tournament, the model automatically imports an ad-hoc `ChessTokenizer` and the `Transformer` class to load the `.pth` weights. It first attempts to load these locally (since the GitHub repo is expected to be cloned), and otherwise imports them from HuggingFace.
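The local-first loading strategy can be sketched as below; note that both the local path and the HuggingFace repo id here are placeholders I am assuming, not values confirmed by the repo:

```python
from pathlib import Path

def resolve_weights(local_path="model/TransformerGodPlayer.pth",
                    hf_repo="LeonSavi/chess_exam"):  # hypothetical HF repo id
    """Prefer local weights (cloned repo); otherwise download from HuggingFace."""
    if Path(local_path).exists():
        return str(local_path)                    # repo cloned: load locally
    # fall back to HuggingFace only when no local copy is found
    from huggingface_hub import hf_hub_download
    return hf_hub_download(hf_repo, Path(local_path).name)
```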
```python
from player import TransformerPlayer

# ...

print(f"God-Transformer predicts: {move}")
```
**Fallback**

As a probabilistic model, the Transformer occasionally predicts illegal moves, particularly in unusual positions that differ from the training distribution. A python-chess validation layer is applied inside `get_move()` to catch these cases before they reach the game engine. The fallback strategy works in three stages:
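The legality check at the heart of this layer can be sketched with python-chess (the helper name below is hypothetical; the actual staged fallback lives inside `get_move()` in `player.py`):

```python
import chess  # python-chess

# Hypothetical helper illustrating the validation layer, not the repo's code.
def validate_prediction(board, uci):
    """Return the predicted move if it is legal in this position, else None."""
    try:
        move = chess.Move.from_uci(uci)
    except ValueError:            # malformed UCI string from the model
        return None
    return move if move in board.legal_moves else None

board = chess.Board()
validate_prediction(board, "e2e4")   # legal opening move -> Move object
validate_prediction(board, "a1a8")   # blocked rook move -> None
```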
## References

- Mixed Precision: https://apxml.com/courses/foundations-transformers-architecture/chapter-7-implementation-details-optimization/mixed-precision-training
- Freezing Layers: https://medium.com/we-talk-data/guide-to-freezing-layers-in-pytorch-best-practices-and-practical-examples-8e644e7a9598
- Gradient Clipping: https://www.geeksforgeeks.org/deep-learning/understanding-gradient-clipping/
- Our lectures