---
license: mit
tags:
- chess
- tiktoken
- tokenizer
---
# Chess BPE Tokenizer
A BPE tokenizer trained on chess moves using [rustbpe](https://github.com/karpathy/rustbpe) with tiktoken inference.
## Installation
```bash
pip install rustbpe tiktoken datasets huggingface_hub
```
## Quick Start
### Load from HuggingFace and Run Inference
```python
from chess_tokenizer import load_tiktoken
enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")
# Encode chess moves
ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
print(ids) # [token_ids...]
# Decode back
text = enc.decode(ids)
print(text) # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
```
### Or simply load using tiktoken
```python
import json
import tiktoken
from huggingface_hub import hf_hub_download

# Fetch the tokenizer files from the Hub, then build the Encoding
config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json")))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json")))
enc = tiktoken.Encoding(
    name="chess", pat_str=config["pattern"],
    mergeable_ranks={k.encode('utf-8', errors='replace'): v for k, v in vocab.items()},
    special_tokens={},
)
```
### Train Your Own
```python
from chess_tokenizer import train, upload
# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")
# Upload to HuggingFace
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```
### Full Pipeline
```bash
python chess_tokenizer.py
```
## Move Format
The tokenizer is trained on custom chess notation:
| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight g1 to f3 |
| `b.♟c7♟c5..` | Black pawn c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |
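The examples above suggest that each move is four dot-separated fields: side to move, one or more piece-plus-square steps, a capture flag, and a check flag. The sketch below splits a move on that assumption. `ParsedMove` and `parse_move` are hypothetical names for illustration and are not part of `chess_tokenizer`; the layout is inferred from the table only, so other flag values (for example a checkmate marker) are not handled. The piece symbol is treated as an opaque single character.

```python
from dataclasses import dataclass


@dataclass
class ParsedMove:
    color: str                    # "w" or "b"
    steps: list[tuple[str, str]]  # (piece symbol, square) pairs, in order
    capture: bool
    check: bool


def parse_move(move: str) -> ParsedMove:
    """Split one move string into its dot-separated fields (assumed layout)."""
    color, body, capture, check = move.split(".")
    if len(body) % 3 != 0:
        raise ValueError(f"unexpected piece/square field: {body!r}")
    # Each step is one piece symbol followed by a two-character square.
    steps = [(body[i], body[i + 1:i + 3]) for i in range(0, len(body), 3)]
    return ParsedMove(color, steps, capture == "x", check == "+")


print(parse_move("b.♟c5♟d4.x."))
print(parse_move("w.♔e1♔g1♖h1♖f1.."))
```

A castling move yields four steps (king from/to, then rook from/to), while an ordinary move yields two.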
### Piece Symbols
| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |
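tiktoken merges over UTF-8 bytes, so the size of each piece symbol in bytes matters for how much a learned merge can save. The sketch below assumes the symbols are the standard Unicode chess glyphs (U+2654 to U+265F) and prints their names and encoded lengths; `utf8_len` is a hypothetical helper.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# Unicode chess glyphs U+2654..U+265F: six white pieces, then six black.
for cp in range(0x2654, 0x2660):
    ch = chr(cp)
    print(f"U+{cp:04X} {ch} {unicodedata.name(ch)}: {utf8_len(ch)} bytes")

move = "w.♘g1♘f3.."
print(len(move), "characters,", utf8_len(move), "bytes")
```

Under that assumption every glyph takes three bytes, so a move like the one above is 10 characters but 14 bytes before any merges are applied.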
## API
| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train BPE on angeluriot/chess_games |
| `save(tok, path)` | Save vocab.json + config.json |
| `upload(tok, repo_id)` | Push to HuggingFace Hub |
| `load_tiktoken(repo_id)` | Load as tiktoken Encoding |
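As a rough picture of the on-disk format, the sketch below writes and reads a toy vocabulary. It assumes `vocab.json` maps token strings to ranks and `config.json` holds a `"pattern"` key, which is what the tiktoken loading snippet above reads. `save_files` and `load_ranks` are hypothetical stand-ins, not the actual `save` and `load_tiktoken` implementations, and the pattern `\S+` is only a placeholder.

```python
import json
import tempfile
from pathlib import Path


def save_files(ranks: dict[bytes, int], pattern: str, path: str) -> None:
    """Hypothetical helper: write vocab.json and config.json (assumed layout)."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    # JSON keys must be strings, so byte tokens are decoded first.
    vocab = {tok.decode("utf-8", errors="replace"): rank for tok, rank in ranks.items()}
    (out / "vocab.json").write_text(json.dumps(vocab, ensure_ascii=False), encoding="utf-8")
    (out / "config.json").write_text(json.dumps({"pattern": pattern}), encoding="utf-8")


def load_ranks(path: str) -> tuple[dict[bytes, int], str]:
    """Hypothetical helper: read both files back into tiktoken-style inputs."""
    out = Path(path)
    vocab = json.loads((out / "vocab.json").read_text(encoding="utf-8"))
    config = json.loads((out / "config.json").read_text(encoding="utf-8"))
    ranks = {k.encode("utf-8", errors="replace"): v for k, v in vocab.items()}
    return ranks, config["pattern"]


# Toy vocabulary whose tokens are all valid UTF-8 on their own.
toy_ranks = {b"w": 0, b".": 1, "♘".encode("utf-8"): 2, b"w.": 3}
with tempfile.TemporaryDirectory() as tmp:
    save_files(toy_ranks, r"\S+", tmp)
    ranks, pattern = load_ranks(tmp)
print(ranks == toy_ranks, pattern)
```

One caveat of this string-keyed layout: a token that is not valid UTF-8 by itself, such as a lone leading byte of a glyph, would not survive the decode and re-encode step unchanged.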
## License
MIT