---
license: mit
tags:
- chess
- tiktoken
- tokenizer
---
# Chess BPE Tokenizer

A byte-pair encoding (BPE) tokenizer trained on chess moves with [rustbpe](https://github.com/karpathy/rustbpe) and loaded through tiktoken for inference.

## Installation

```bash
pip install rustbpe tiktoken datasets huggingface_hub
```

## Quick Start

### Load from Hugging Face and Run Inference

```python
from chess_tokenizer import load_tiktoken

enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")

# Encode chess moves
ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
print(ids)  # [token_ids...]

# Decode back
text = enc.decode(ids)
print(text)  # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
```


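### Or Load Directly with tiktoken

```python
import json

import tiktoken
from huggingface_hub import hf_hub_download

# Fetch the tokenizer files from the Hub (UTF-8, since the vocab holds chess glyphs)
config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json"), encoding="utf-8"))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json"), encoding="utf-8"))

# Build the encoding from the saved split pattern and token ranks
enc = tiktoken.Encoding(
    name="chess", pat_str=config["pattern"],
    mergeable_ranks={k.encode('utf-8', errors='replace'): v for k, v in vocab.items()},
    special_tokens={}
)
```
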
### Train Your Own

```python
from chess_tokenizer import train, upload

# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")

# Upload to HuggingFace
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```

### Full Pipeline

```bash
python chess_tokenizer.py
```

## Move Format

The tokenizer is trained on a custom chess notation:

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight g1 to f3 |
| `b.♟c7♟c5..` | Black pawn c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |
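
The examples in the table suggest a fixed layout: side to move, one or more piece-and-square steps, a capture flag, and a check flag, separated by dots. The sketch below parses a move under that assumption. The grammar is inferred only from the five examples above, and `parse_move`, `MOVE_RE`, and `STEP_RE` are hypothetical names that are not part of `chess_tokenizer`; moves outside these patterns (for example promotions or checkmate markers) may need a different rule.

```python
import re

# Assumed layout, inferred from the example moves only:
#   <side>.<piece><square><piece><square>[more steps].<capture>.<check>
MOVE_RE = re.compile(
    r"^(?P<side>[wb])\."
    r"(?P<steps>(?:[♔-♟][a-h][1-8]){2,})"
    r"\.(?P<capture>x?)\.(?P<check>\+?)$"
)
STEP_RE = re.compile(r"([♔-♟])([a-h][1-8])")


def parse_move(move: str) -> dict:
    """Split one move string into side, piece moves, capture and check flags."""
    m = MOVE_RE.match(move)
    if m is None:
        raise ValueError(f"unrecognised move: {move!r}")
    steps = STEP_RE.findall(m["steps"])
    # Consecutive steps form (piece, from, to) triples; castling yields two.
    moves = [
        (steps[i][0], steps[i][1], steps[i + 1][1])
        for i in range(0, len(steps) - 1, 2)
    ]
    return {
        "side": m["side"],
        "moves": moves,
        "capture": m["capture"] == "x",
        "check": m["check"] == "+",
    }


print(parse_move("w.♔e1♔g1♖h1♖f1..")["moves"])
# [('♔', 'e1', 'g1'), ('♖', 'h1', 'f1')]
```

A parser like this can be handy for sanity-checking text before encoding it, since the tokenizer itself accepts any string.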

### Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |

## API

| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train a BPE tokenizer on the `angeluriot/chess_games` dataset |
| `save(tok, path)` | Save `vocab.json` and `config.json` to `path` |
| `upload(tok, repo_id)` | Push the tokenizer files to the Hugging Face Hub |
| `load_tiktoken(repo_id)` | Load as tiktoken Encoding |
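
To make the two saved files more concrete, here is a minimal sketch of a round trip through `vocab.json` and `config.json`. The layout (a token-to-rank mapping plus a `pattern` key) is assumed from the tiktoken loading snippet in Quick Start; the actual `save` function may write additional fields. `TOY_VOCAB`, `TOY_CONFIG`, `save_files`, and `load_ranks` are hypothetical names and values used only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical miniature vocabulary: token text -> rank. Not the real vocab.
TOY_VOCAB = {"w": 0, ".": 1, "♘": 2, "w.": 3, "♘g1": 4}
# Hypothetical split pattern; the published one lives in the repo's config.json.
TOY_CONFIG = {"pattern": r"\S+|\s+"}


def save_files(folder: Path, vocab: dict, config: dict) -> None:
    """Write vocab.json and config.json, keeping the chess glyphs readable."""
    for name, payload in (("vocab.json", vocab), ("config.json", config)):
        with open(folder / name, "w", encoding="utf-8") as f:
            json.dump(payload, f, ensure_ascii=False)


def load_ranks(folder: Path) -> tuple[str, dict[bytes, int]]:
    """Read both files back and build a bytes -> rank mapping."""
    with open(folder / "config.json", encoding="utf-8") as f:
        pattern = json.load(f)["pattern"]
    with open(folder / "vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)
    ranks = {token.encode("utf-8"): rank for token, rank in vocab.items()}
    return pattern, ranks


with tempfile.TemporaryDirectory() as tmp:
    save_files(Path(tmp), TOY_VOCAB, TOY_CONFIG)
    pattern, ranks = load_ranks(Path(tmp))

print(ranks[b"w."])  # 3
```

The resulting `pattern` and `ranks` have the shapes that `tiktoken.Encoding` takes as `pat_str` and `mergeable_ranks`.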

## License

MIT