---
license: mit
tags:
- chess
- tiktoken
- tokenizer
---
# Chess BPE Tokenizer

A byte-pair encoding (BPE) tokenizer for chess moves, trained with [rustbpe](https://github.com/karpathy/rustbpe) and loaded for inference through tiktoken.

## Installation

```bash
pip install rustbpe tiktoken datasets huggingface_hub
```

## Quick Start

### Load from HuggingFace & Inference

```python
from chess_tokenizer import load_tiktoken

enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")

# Encode chess moves
ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
print(ids)  # [token_ids...]

# Decode back
text = enc.decode(ids)
print(text)  # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
```


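### Or simply load using tiktoken

```python
import json
import tiktoken
from huggingface_hub import hf_hub_download

# Download the tokenizer files from the Hub
config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json"), encoding="utf-8"))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json"), encoding="utf-8"))

# Build the tiktoken Encoding from the split pattern and the merge ranks
enc = tiktoken.Encoding(
    name="chess", pat_str=config["pattern"],
    mergeable_ranks={k.encode('utf-8', errors='replace'): v for k, v in vocab.items()},
    special_tokens={}
)
```

### Train Your Own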

```python
from chess_tokenizer import train, upload

# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")

# Upload to HuggingFace
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```

### Full Pipeline

```bash
python chess_tokenizer.py
```

## Move Format

The tokenizer is trained on custom chess notation:

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight g1 to f3 |
| `b.♟c7♟c5..` | Black pawn c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |
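
The layout of a move string can be read off the examples above: a side marker, piece-and-square pairs, a capture slot, and a check slot, separated by dots. The sketch below parses that layout with a regular expression. The grammar is inferred only from the five rows in the table, so the names `MOVE_RE`, `STEP_RE`, and `parse_move` are hypothetical helpers rather than part of `chess_tokenizer`, and flags not shown in the table (promotion or checkmate, for instance) are not handled.

```python
import re

# Assumed grammar, inferred from the table examples only:
#   <w|b> . <piece><square> repeated . <x or empty> . <+ or empty>
MOVE_RE = re.compile(
    r"^(?P<color>[wb])\."
    r"(?P<steps>(?:[\u2654-\u265F][a-h][1-8])+)"
    r"\.(?P<capture>x?)\.(?P<check>\+?)$"
)
# One piece symbol followed by the square it stands on
STEP_RE = re.compile(r"([\u2654-\u265F])([a-h][1-8])")


def parse_move(move: str) -> dict:
    """Split one move string into its color, piece/square pairs, and flags."""
    m = MOVE_RE.match(move)
    if m is None:
        raise ValueError(f"unrecognized move format: {move!r}")
    return {
        "color": "white" if m.group("color") == "w" else "black",
        "squares": STEP_RE.findall(m.group("steps")),
        "capture": m.group("capture") == "x",
        "check": m.group("check") == "+",
    }


# A castle lists four pairs: king from/to, then rook from/to
castle = parse_move("w.♔e1♔g1♖h1♖f1..")
print(len(castle["squares"]))
print(parse_move("b.♟c5♟d4.x.")["capture"])
```

Splitting a game string on spaces and passing each piece to `parse_move` is one way to sanity-check decoded tokenizer output before using it downstream.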

### Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |
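
The twelve symbols in this table are the Unicode chess characters U+2654 through U+265F, ordered king, queen, rook, bishop, knight, pawn for White and then again for Black. As a small illustration, the lookup below turns a symbol into its color and piece name from the code point alone. `describe_piece` is a hypothetical helper, not a function exported by `chess_tokenizer`.

```python
PIECE_NAMES = ["King", "Queen", "Rook", "Bishop", "Knight", "Pawn"]


def describe_piece(symbol: str) -> tuple[str, str]:
    """Return (color, piece) for one Unicode chess symbol."""
    offset = ord(symbol) - 0x2654  # U+2654 is the white king
    if not 0 <= offset < 12:
        raise ValueError(f"not a chess piece symbol: {symbol!r}")
    color = "White" if offset < 6 else "Black"
    return color, PIECE_NAMES[offset % 6]


print(describe_piece("♘"))
print(describe_piece("♟"))
```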

## API

| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train BPE on angeluriot/chess_games |
| `save(tok, path)` | Save vocab.json + config.json |
| `upload(tok, repo_id)` | Push to HuggingFace Hub |
| `load_tiktoken(repo_id)` | Load as tiktoken Encoding |

## License

MIT