---
license: mit
tags:
- chess
- tiktoken
- tokenizer
---
# Chess BPE Tokenizer
A byte-pair encoding (BPE) tokenizer for chess moves, trained with [rustbpe](https://github.com/karpathy/rustbpe) and loaded through tiktoken for inference.
## Installation
```bash
pip install rustbpe tiktoken datasets huggingface_hub
```
## Quick Start
### Load from HuggingFace and Run Inference
```python
from chess_tokenizer import load_tiktoken
enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")
# Encode chess moves
ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
print(ids) # [token_ids...]
# Decode back
text = enc.decode(ids)
print(text) # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
```
### Or Load Directly with tiktoken
```python
import json
import tiktoken
from huggingface_hub import hf_hub_download

# Download the tokenizer files from the Hub and build the encoding
config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json")))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json")))
enc = tiktoken.Encoding(
    name="chess", pat_str=config["pattern"],
    mergeable_ranks={k.encode('utf-8', errors='replace'): v for k, v in vocab.items()},
    special_tokens={}
)
```
### Train Your Own
```python
from chess_tokenizer import train, upload
# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")
# Upload to HuggingFace
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```
### Full Pipeline
```bash
python chess_tokenizer.py
```
## Move Format
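The tokenizer is trained on a custom chess notation. In the examples below, each move starts with the side to move (`w` or `b`), followed by piece-and-square pairs for the origin and destination (castling lists both the king and the rook), and ends with two dot-delimited slots that mark a capture (`x`) or a check (`+`):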
| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight g1 to f3 |
| `b.♟c7♟c5..` | Black pawn c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |
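The layout of a move string can be made concrete with a small parser. The sketch below is inferred only from the five examples in the table above; `parse_move`, `MOVE_RE`, and `STEP_RE` are hypothetical names and are not part of `chess_tokenizer`. Markers that the table does not show (promotion, checkmate, and so on) are not handled and would raise an error.

```python
import re

# Pattern inferred from the example moves in the table above (an assumption,
# not the dataset's official grammar).
MOVE_RE = re.compile(
    r"^(?P<color>[wb])\."                          # side to move
    r"(?P<steps>(?:[\u2654-\u265F][a-h][1-8])+)"   # piece symbol + square pairs
    r"\.(?P<capture>x?)"                           # first slot: capture marker
    r"\.(?P<check>\+?)$"                           # second slot: check marker
)
STEP_RE = re.compile(r"([\u2654-\u265F])([a-h][1-8])")


def parse_move(move: str) -> dict:
    """Split one move string into its fields (hypothetical helper)."""
    m = MOVE_RE.match(move)
    if m is None:
        raise ValueError(f"unrecognized move: {move!r}")
    return {
        "color": m["color"],
        "steps": STEP_RE.findall(m["steps"]),  # list of (symbol, square) tuples
        "capture": m["capture"] == "x",
        "check": m["check"] == "+",
    }


print(parse_move("b.♟c5♟d4.x."))
```

A helper like this can be useful for sanity-checking decoded text before feeding it to downstream code, since a BPE decode of arbitrary token ids is not guaranteed to produce well-formed moves.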
### Piece Symbols
| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |
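The symbols in this table are the Unicode chess characters U+2654 through U+265F, so a lookup table can be built from code points rather than typed by hand. The sketch below uses hypothetical names (`PIECES`, `SYMBOLS`) that are not part of `chess_tokenizer`. It also prints the UTF-8 length of each symbol: tiktoken works on UTF-8 bytes, so every piece symbol starts out as three bytes, and whether they end up as a single token depends on the learned merges.

```python
# Unicode order within each colour block: king, queen, rook, bishop, knight, pawn
PIECES = ["king", "queen", "rook", "bishop", "knight", "pawn"]

SYMBOLS = {}
for i, name in enumerate(PIECES):
    SYMBOLS[chr(0x2654 + i)] = ("white", name)  # U+2654..U+2659
    SYMBOLS[chr(0x265A + i)] = ("black", name)  # U+265A..U+265F

for symbol, (color, name) in SYMBOLS.items():
    # Each symbol is a three-byte UTF-8 sequence
    print(symbol, color, name, len(symbol.encode("utf-8")))
```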
## API
| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train BPE on angeluriot/chess_games |
| `save(tok, path)` | Save vocab.json + config.json |
| `upload(tok, repo_id)` | Push to HuggingFace Hub |
| `load_tiktoken(repo_id)` | Load as tiktoken Encoding |
## License
MIT