---
language:
- en
license: mit
tags:
- chess
- tokenizer
- bpe
- game-ai
library_name: rustbpe
datasets:
- angeluriot/chess_games
---

# Chess BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format.

## Model Details

- **Tokenizer Type**: BPE (Byte Pair Encoding)
- **Vocabulary Size**: 256
- **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
- **Training Split**: train[0:1000]
- **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.♘g1♘f3..`)

## Move Format Description

The tokenizer is trained on a custom chess move notation:

| Component | Description | Example |
|-----------|-------------|---------|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + source square | `♘g1` |
| Piece + Destination | Unicode piece + destination square | `♘f3` |
| Flags | `.x.` (capture), `..+` (check), `..#` (checkmate), `..` (quiet move) | `..` |

### Examples

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight from g1 to f3 |
| `b.♟c7♟c5..` | Black pawn from c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle (king e1 to g1, rook h1 to f1) |
| `b.♛d7♛d5..+` | Black queen to d5 with check |

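Moves in this format can be split into their components with a short regex. The following is a minimal parsing sketch; the regex and the `parse_move` helper are illustrative, not part of this repository:

```python
import re

# One move: player prefix, then one or more piece+square pairs (castling
# encodes both the king and rook moves), then a flags suffix.
MOVE_RE = re.compile(
    r"^(?P<player>[wb])\."           # w. or b.
    r"(?P<pairs>(?:\D[a-h][1-8])+)"  # piece symbol + square, one or more times
    r"(?P<flags>\.x\.|\.\.[+#]?)$"   # .x. capture, ..+ check, ..# mate, .. quiet
)

def parse_move(move):
    """Split a custom-notation move into player, (piece, square) pairs, and flags."""
    m = MOVE_RE.match(move)
    if m is None:
        raise ValueError(f"not a valid move string: {move!r}")
    pairs = re.findall(r"(\D)([a-h][1-8])", m.group("pairs"))
    return {
        "player": m.group("player"),
        "pairs": pairs,  # e.g. [('♘', 'g1'), ('♘', 'f3')]
        "flags": m.group("flags"),
    }

print(parse_move("w.♘g1♘f3.."))
```
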
### Chess Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |

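The symbol table maps directly onto a lookup dict. A small sketch (the `PIECES` and `SYMBOL_TO_PIECE` names are illustrative):

```python
# Unicode chess symbols from the table above, keyed by (color, piece name).
PIECES = {
    ("white", "king"): "♔",   ("black", "king"): "♚",
    ("white", "queen"): "♕",  ("black", "queen"): "♛",
    ("white", "rook"): "♖",   ("black", "rook"): "♜",
    ("white", "bishop"): "♗", ("black", "bishop"): "♝",
    ("white", "knight"): "♘", ("black", "knight"): "♞",
    ("white", "pawn"): "♙",   ("black", "pawn"): "♟",
}

# Reverse lookup: symbol -> (color, piece name)
SYMBOL_TO_PIECE = {sym: key for key, sym in PIECES.items()}

print(SYMBOL_TO_PIECE["♘"])  # -> ('white', 'knight')
```
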
## Usage

### Installation

```bash
pip install rustbpe huggingface_hub
```

### Loading and Using the Tokenizer

```python
import json

from huggingface_hub import hf_hub_download

# Download tokenizer files
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

# Load vocabulary
with open(vocab_path, 'r') as f:
    vocab = json.load(f)

# Load tokenizer configuration
with open(config_path, 'r') as f:
    config = json.load(f)

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")
```

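With only `vocab.json` in hand (assumed here to map token string to id, the usual Hugging Face layout), a greedy longest-match encoder can approximate the tokenizer. Note this is a sketch, not true BPE: without the merge list it is not guaranteed to reproduce the exact merge order.

```python
def greedy_encode(text, vocab):
    """Greedy longest-match encoding: at each position, emit the longest
    vocabulary entry that prefixes the remaining text."""
    max_len = max(len(tok) for tok in vocab)
    ids, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers text at position {i}: {text[i:]!r}")
    return ids

# Toy vocabulary for illustration; a real vocab.json also has byte-level entries.
toy_vocab = {"w": 0, ".": 1, "w.": 2, "g": 3, "1": 4, "g1": 5}
print(greedy_encode("w.g1", toy_vocab))  # -> [2, 5]
```
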
### Using with rustbpe (for encoding)

```python
import rustbpe

# Note: the rustbpe tokenizer needs to be retrained or loaded from its merges
# before it can encode; see the training script for details.
```

### Training Your Own

```python
from bpess.main import train_chess_tokenizer, push_to_hub

# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key='moves_custom'
)

# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom"
    }
)
```

## Training Details

- **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
- **Algorithm**: Byte Pair Encoding with GPT-4-style regex pre-tokenization
- **Source Dataset**: ~14M chess games from [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)

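As a refresher on the algorithm named above, BPE training repeatedly merges the most frequent adjacent symbol pair. The following pure-Python sketch shows the core loop only; it is independent of rustbpe's actual Rust implementation and omits regex pre-tokenization:

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn BPE merges over a character sequence: repeatedly replace the
    most frequent adjacent pair with a single merged symbol."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats; further merges would not compress
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges

moves = "w.Ng1Nf3..b.Nc7c5.."  # ASCII stand-ins for the Unicode pieces
print(train_bpe(moves, 3))
```
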
## Intended Use

This tokenizer is designed for:

- Training language models on chess games
- Chess move prediction tasks
- Game analysis and embedding generation

## License

MIT License