---
language:
- en
license: mit
tags:
- chess
- tokenizer
- bpe
- game-ai
library_name: rustbpe
datasets:
- angeluriot/chess_games
---

# Chess BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on chess moves written in a custom notation.

## Model Details

- **Tokenizer Type**: BPE (Byte Pair Encoding)
- **Vocabulary Size**: 256
- **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
- **Training Split**: train[0:1000]
- **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.♘g1♘f3..`)

## Move Format Description

The tokenizer is trained on a custom chess move notation:

| Component | Description | Example |
|-----------|-------------|---------|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + square | `♘g1` |
| Piece + Destination | Unicode piece + square | `♘f3` |
| Flags | `..` (none), `.x.` (capture), `..+` (check), `..#` (checkmate) | `..` |

### Examples

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight from g1 to f3 |
| `b.♟c7♟c5..` | Black pawn from c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn from c5 captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castling (king e1 to g1, rook h1 to f1) |
| `b.♛d7♛d5..+` | Black queen from d7 to d5 with check |
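
The notation in the tables above is regular enough to parse with a single regular expression. The following is a minimal sketch, not part of this repository: the names `parse_move` and `ParsedMove` are hypothetical, and the reading of the flag field as `.` + optional `x` + `.` + optional `+`/`#` is an assumption inferred from the four flags listed above.

```python
import re
from typing import List, NamedTuple, Tuple

PIECES = "♔♕♖♗♘♙♚♛♜♝♞♟"
UNIT = rf"[{PIECES}][a-h][1-8]"  # one "piece + square" unit, e.g. "♘g1"

# Player prefix, one or more (source, destination) unit pairs, then the flag field.
# Assumption: the flag field is "." + optional "x" + "." + optional "+" or "#".
MOVE_RE = re.compile(
    rf"^(?P<player>[wb])\.(?P<units>(?:{UNIT}{UNIT})+)\.(?P<capture>x?)\.(?P<check>[+#]?)$"
)


class ParsedMove(NamedTuple):
    player: str                    # "w" or "b"
    steps: List[Tuple[str, str]]   # (source unit, destination unit) pairs
    capture: bool
    check: str                     # "", "+" or "#"


def parse_move(move: str) -> ParsedMove:
    """Split one move string into its components (hypothetical helper)."""
    m = MOVE_RE.match(move)
    if m is None:
        raise ValueError(f"not a valid move string: {move!r}")
    units = m["units"]
    # Each unit is exactly 3 characters: piece symbol, file, rank.
    chunks = [units[i:i + 3] for i in range(0, len(units), 3)]
    steps = list(zip(chunks[0::2], chunks[1::2]))
    return ParsedMove(m["player"], steps, m["capture"] == "x", m["check"])


if __name__ == "__main__":
    print(parse_move("w.♔e1♔g1♖h1♖f1.."))
```

Castling yields two steps (king and rook), while ordinary moves yield one.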

### Chess Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |
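
The twelve symbols in this table occupy a contiguous Unicode range (U+2654 to U+265F), white pieces first, each colour in the order king, queen, rook, bishop, knight, pawn. A small lookup can therefore be derived from the code point alone. This is an illustrative sketch; `describe_piece` is a hypothetical helper, not something shipped with the tokenizer.

```python
from typing import Tuple

PIECE_NAMES = ["King", "Queen", "Rook", "Bishop", "Knight", "Pawn"]
FIRST_PIECE = 0x2654  # WHITE CHESS KING


def describe_piece(symbol: str) -> Tuple[str, str]:
    """Return (color, piece name) for one Unicode chess symbol."""
    offset = ord(symbol) - FIRST_PIECE
    if not 0 <= offset < 12:
        raise ValueError(f"not a chess piece symbol: {symbol!r}")
    color = "white" if offset < 6 else "black"
    return color, PIECE_NAMES[offset % 6]


if __name__ == "__main__":
    for s in "♘♟♛":
        print(s, describe_piece(s))
```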

## Usage

### Installation

```bash
pip install rustbpe huggingface_hub
```

### Loading and Using the Tokenizer

```python
import json
from huggingface_hub import hf_hub_download

# Download tokenizer files
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

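# Load vocabulary and config.
# encoding="utf-8" is set explicitly because the vocabulary contains
# Unicode chess symbols and the platform default encoding may differ.
with open(vocab_path, 'r', encoding='utf-8') as f:
    vocab = json.load(f)

with open(config_path, 'r', encoding='utf-8') as f:
    config = json.load(f)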

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")
```

### Using with rustbpe (for encoding)

```python
import rustbpe

# Note: rustbpe tokenizer needs to be retrained or loaded from merges
# See the training script for details
```

### Training Your Own

```python
from bpess.main import train_chess_tokenizer, push_to_hub

# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key='moves_custom'
)

# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom"
    }
)
```

## Training Details

- **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
- **Algorithm**: Byte Pair Encoding with GPT-4 style regex pre-tokenization
- **Source Dataset**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games) (~14M chess games in total; this tokenizer was trained on the `train[0:1000]` split listed under Model Details)
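
To make the algorithm concrete, here is a toy byte-level BPE trainer in pure Python. It is a sketch for illustration only: it is not rustbpe's implementation, it skips the regex pre-tokenization step, and the names `merge_pair` and `train_toy_bpe` are hypothetical. It starts from raw UTF-8 bytes (ids 0 to 255) and repeatedly replaces the most frequent adjacent pair with a new token id.

```python
from collections import Counter
from typing import Dict, List, Tuple


def merge_pair(seq: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_toy_bpe(texts: List[str], num_merges: int) -> Dict[Tuple[int, int], int]:
    """Learn up to `num_merges` merges; new token ids start at 256."""
    seqs = [list(t.encode("utf-8")) for t in texts]
    merges: Dict[Tuple[int, int], int] = {}
    for step in range(num_merges):
        counts: Counter = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        merges[pair] = new_id
        seqs = [merge_pair(seq, pair, new_id) for seq in seqs]
    return merges


if __name__ == "__main__":
    moves = ["w.♘g1♘f3..", "b.♟c7♟c5.."]
    # Each chess symbol takes 3 bytes in UTF-8, so the first move is 14 bytes long.
    print(len(moves[0].encode("utf-8")))
    print(train_toy_bpe(moves, num_merges=3))
```

Because every piece symbol is three bytes in UTF-8, early merges in a byte-level BPE tend to spend vocabulary on reassembling those symbols before they capture whole squares or moves.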

## Intended Use

This tokenizer is designed for:
- Training language models on chess games
- Chess move prediction tasks
- Game analysis and embedding generation

## License

MIT License