---
language:
- en
license: mit
tags:
- chess
- tokenizer
- bpe
- game-ai
library_name: rustbpe
datasets:
- angeluriot/chess_games
---

# Chess BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on chess moves written in a custom notation.

## Model Details

- **Tokenizer Type**: BPE (Byte Pair Encoding)
- **Vocabulary Size**: 256
- **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
- **Training Split**: train[0:1000]
- **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.♘g1♘f3..`)

## Move Format Description

The tokenizer is trained on a custom chess move notation:

| Component | Description | Example |
|-----------|-------------|---------|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + square | `♘g1` |
| Piece + Destination | Unicode piece + square | `♘f3` |
| Flags | `..` (none), `.x.` (capture), `..+` (check), `..#` (checkmate) | `..` |

### Examples

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight from g1 to f3 |
| `b.♟c7♟c5..` | Black pawn from c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |
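
The notation above can be split into its components with a regular expression. The following is a minimal sketch, not part of this repository: `parse_move` is a hypothetical helper, and it assumes the flag suffix always has the shape `.` + optional `x` + `.` + optional `+` or `#`, which is inferred from the examples in this card rather than from a formal specification.

```python
import re

# Assumed grammar (inferred from the examples above, not an official spec):
#   <player>. (<piece><square>)+ . [x] . [+|#]
MOVE_RE = re.compile(
    r"^(?P<player>[wb])\."
    r"(?P<body>(?:[\u2654-\u265F][a-h][1-8])+)"
    r"\.(?P<capture>x?)\.(?P<check>[+#]?)$"
)
STEP_RE = re.compile(r"([\u2654-\u265F])([a-h][1-8])")


def parse_move(move: str) -> dict:
    """Split one move string into player, (piece, square) steps, and flags."""
    match = MOVE_RE.match(move)
    if match is None:
        raise ValueError(f"Unrecognized move: {move!r}")
    return {
        "player": "white" if match["player"] == "w" else "black",
        # A normal move has two steps (source, destination);
        # the castling example has four (king and rook).
        "steps": STEP_RE.findall(match["body"]),
        "capture": match["capture"] == "x",
        "check": match["check"] == "+",
        "checkmate": match["check"] == "#",
    }


print(parse_move("b.♟c5♟d4.x."))
```

Moves that do not fit the assumed grammar raise `ValueError`, which makes it easy to spot notation variants this sketch does not cover.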

### Chess Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |
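
The twelve symbols occupy the contiguous Unicode range U+2654 to U+265F, white pieces first, in the same order as the table. A small lookup can therefore be written with code point arithmetic. This is an illustrative sketch; `describe_piece` is a hypothetical helper and is not provided by this repository or by rustbpe.

```python
# Order of pieces within each color block of the Unicode chess range.
PIECE_NAMES = ["King", "Queen", "Rook", "Bishop", "Knight", "Pawn"]
FIRST_PIECE = 0x2654  # U+2654 WHITE CHESS KING


def describe_piece(symbol: str) -> tuple[str, str]:
    """Return (color, piece name) for one Unicode chess symbol."""
    offset = ord(symbol) - FIRST_PIECE
    if not 0 <= offset < 12:
        raise ValueError(f"Not a chess piece symbol: {symbol!r}")
    color = "white" if offset < 6 else "black"
    return color, PIECE_NAMES[offset % 6]


print(describe_piece("♘"))  # ('white', 'Knight')
print(describe_piece("♟"))  # ('black', 'Pawn')
```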

## Usage

### Installation

```bash
pip install rustbpe huggingface_hub
```

### Loading and Using the Tokenizer

```python
import json
from huggingface_hub import hf_hub_download

# Download tokenizer files
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

# Load vocabulary
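with open(vocab_path, 'r', encoding='utf-8') as f:
    vocab = json.load(f)

# Load tokenizer configuration (includes the pre-tokenization pattern)
with open(config_path, 'r', encoding='utf-8') as f:
    config = json.load(f)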

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")
```

### Using with rustbpe (for encoding)

```python
import rustbpe

# Note: rustbpe tokenizer needs to be retrained or loaded from merges
# See the training script for details
```

### Training Your Own

```python
from bpess.main import train_chess_tokenizer, push_to_hub

# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key='moves_custom'
)

# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom"
    }
)
```

## Training Details

- **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
- **Algorithm**: Byte Pair Encoding with GPT-4 style regex pre-tokenization
- **Source Dataset**: ~14M chess games from [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
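
To make the algorithm concrete, here is a minimal pure-Python sketch of byte-level BPE on a few moves. It is an illustration only and not rustbpe's implementation: it skips the regex pre-tokenization step, and `train_bpe`, `merge_pair`, `encode`, and `decode` are hypothetical names. Ids 0 to 255 are the raw bytes, and each learned merge adds one new id starting at 256.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(texts, num_merges):
    """Learn up to `num_merges` merges; returns {(left, right): new_id}."""
    sequences = [list(t.encode("utf-8")) for t in texts]
    merges = {}
    for step in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[best] = 256 + step
        sequences = [merge_pair(s, best, 256 + step) for s in sequences]
    return merges


def encode(text, merges):
    """Apply the merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(ids, merges):
    """Expand token ids back into the original string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


moves = ["w.♘g1♘f3..", "b.♟c7♟c5..", "b.♟c5♟d4.x."]
merges = train_bpe(moves, num_merges=5)
ids = encode(moves[0], merges)
print(len(moves[0].encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Because every piece symbol is three bytes in UTF-8 and they all share the same two leading bytes, that byte pair is a natural early merge candidate on this kind of data.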

## Intended Use

This tokenizer is designed for:
- Training language models on chess games
- Chess move prediction tasks
- Game analysis and embedding generation

## License

MIT License