---
language:
- en
license: mit
tags:
- chess
- tokenizer
- bpe
- game-ai
library_name: rustbpe
datasets:
- angeluriot/chess_games
---
# Chess BPE Tokenizer
A Byte Pair Encoding (BPE) tokenizer trained on chess moves written in a custom notation.
## Model Details
- **Tokenizer Type**: BPE (Byte Pair Encoding)
- **Vocabulary Size**: 256
- **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
- **Training Split**: train[0:1000]
- **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.♘g1♘f3..`)
## Move Format Description
The tokenizer is trained on a custom chess move notation:
| Component | Description | Example |
|-----------|-------------|---------|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + square | `♘g1` |
| Piece + Destination | Unicode piece + square | `♘f3` |
| Flags | `..` (no flag), `.x.` (capture), `..+` (check), `..#` (checkmate) | `..` |
### Examples
| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight from g1 to f3 |
| `b.♟c7♟c5..` | Black pawn from c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |
### Chess Piece Symbols
| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |
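The format described above can be checked mechanically. The following is a minimal sketch of a parser, not part of this repository: `parse_move` is a hypothetical helper, the piece symbols are assumed to be the standard Unicode chess code points U+2654 to U+265F, and the flag field is assumed to follow the shape `.` + optional `x` + `.` + optional `+` or `#`, which covers every documented flag.

```python
import re

# Assumption: pieces are the standard Unicode chess symbols U+2654..U+265F,
# ordered King, Queen, Rook, Bishop, Knight, Pawn (white block, then black).
NAMES = ["king", "queen", "rook", "bishop", "knight", "pawn"]
PIECES = {chr(0x2654 + i): NAMES[i] for i in range(6)}
PIECES.update({chr(0x265A + i): NAMES[i] for i in range(6)})

# One "piece + square" unit, e.g. a knight symbol followed by "g1".
UNIT = re.compile(r"([\u2654-\u265F])([a-h][1-8])")

# Whole move: player prefix, one or more units, then the flag field.
# Assumption: flags are "." + optional "x" + "." + optional "+" or "#".
MOVE = re.compile(r"^([wb])\.((?:[\u2654-\u265F][a-h][1-8])+)\.(x?)\.([+#]?)$")


def parse_move(move):
    """Split one move string into player, (piece, source, destination) steps, and flags."""
    match = MOVE.match(move)
    if match is None:
        raise ValueError(f"not a recognised move: {move!r}")
    prefix, body, capture, suffix = match.groups()
    units = UNIT.findall(body)
    if len(units) % 2 != 0:
        raise ValueError(f"expected source/destination pairs in {move!r}")
    # Units come in source/destination pairs; castling carries two pairs.
    steps = [
        (PIECES[src_piece], src, dst)
        for (src_piece, src), (_dst_piece, dst) in zip(units[0::2], units[1::2])
    ]
    return {
        "player": "white" if prefix == "w" else "black",
        "steps": steps,
        "capture": capture == "x",
        "check": suffix == "+",
        "checkmate": suffix == "#",
    }
```

A castling move parses into two steps (king, then rook), which is why the sketch pairs units rather than assuming exactly one source and one destination.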
## Usage
### Installation
```bash
pip install rustbpe huggingface_hub
```
### Loading and Using the Tokenizer
```python
import json
from huggingface_hub import hf_hub_download
# Download tokenizer files
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")
# Load vocabulary (UTF-8, since the notation contains Unicode chess pieces)
with open(vocab_path, 'r', encoding='utf-8') as f:
    vocab = json.load(f)
# Load tokenizer configuration
with open(config_path, 'r', encoding='utf-8') as f:
    config = json.load(f)
print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")
```
### Using with rustbpe (for encoding)
```python
import rustbpe
# Note: rustbpe tokenizer needs to be retrained or loaded from merges
# See the training script for details
```
### Training Your Own
```python
from bpess.main import train_chess_tokenizer, push_to_hub
# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key='moves_custom'
)
# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom"
    }
)
```
## Training Details
- **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
- **Algorithm**: Byte Pair Encoding with GPT-4 style regex pre-tokenization
- **Source Dataset**: ~14M chess games from [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
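To make the algorithm entry above concrete, here is a minimal pure-Python sketch of byte-level BPE: repeatedly find the most frequent adjacent pair of token ids and replace it with a new id. This is an illustration only, not rustbpe's implementation; it omits the regex pre-tokenization step, and the function names `train_bpe`, `merge`, `encode`, and `decode` are hypothetical.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Apply the learned merges, in learned order, to new text."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back into the original string."""
    table = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        table[new_id] = table[left] + table[right]
    return b"".join(table[i] for i in ids).decode("utf-8")
```

Because chess moves reuse the same prefixes, squares, and flags constantly, even a handful of merges tends to shorten encoded games noticeably; a production tokenizer would additionally split the text with a regex before counting pairs so that merges do not cross move boundaries.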
## Intended Use
This tokenizer is designed for:
- Training language models on chess games
- Chess move prediction tasks
- Game analysis and embedding generation
## License
MIT License