+---
+license: mit
+tags:
+- chess
+- tiktoken
+- tokenizer
+---
+# Chess BPE Tokenizer
+A byte-pair encoding (BPE) tokenizer for chess moves, trained with [rustbpe](https://github.com/karpathy/rustbpe) and loaded through tiktoken for inference.
+## Installation
+```bash
+pip install rustbpe tiktoken datasets huggingface_hub
+```
+## Quick Start
+### Load from Hugging Face and run inference
+```python
+from chess_tokenizer import load_tiktoken
+enc = load_tiktoken("ItsMaxNorm/bpess")
+# Encode chess moves
+ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
+print(ids)  # [token_ids...]
+# Decode back
+text = enc.decode(ids)
+print(text)  # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
+```
+### Or load directly with tiktoken
+```python
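+import json
+import tiktoken
+from huggingface_hub import hf_hub_download
+# Download the split pattern and the token-to-rank vocabulary from the Hub
+config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json"), encoding="utf-8"))
+vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json"), encoding="utf-8"))
+# Build the encoding; vocab keys are re-encoded to bytes, as tiktoken expects
+enc = tiktoken.Encoding(
+    name="chess", pat_str=config["pattern"],
+    mergeable_ranks={k.encode('utf-8', errors='replace'): v for k, v in vocab.items()},
+    special_tokens={}
+)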
+```
+### Train Your Own
+```python
+from chess_tokenizer import train, upload
+# Train on chess dataset
+tok = train(vocab_size=4096, split="train[0:10000]")
+# Upload to HuggingFace
+upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
+```
+### Full Pipeline
+```bash
+python chess_tokenizer.py
+```
+## Move Format
+The tokenizer is trained on custom chess notation:
+| Move | Meaning |
+|------|---------|
+| `w.♘g1♘f3..` | White knight g1 to f3 |
+| `b.♟c7♟c5..` | Black pawn c7 to c5 |
+| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
+| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
+| `b.♛d7♛d5..+` | Black queen to d5 with check |
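The rows above suggest a four-field layout separated by dots: color, a run of piece-and-square chunks, a capture flag, and a check flag. The following is a minimal parsing sketch under that assumption. The layout is inferred only from the five examples in the table, and `Step`, `Move`, and `parse_move` are hypothetical names that are not part of `chess_tokenizer`. Markers the table does not show (for example checkmate or promotion) are not handled.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    piece: str  # Unicode piece symbol, e.g. "♘"
    src: str    # origin square, e.g. "g1"
    dst: str    # destination square, e.g. "f3"


class Move(NamedTuple):
    color: str         # "w" or "b"
    steps: List[Step]  # one step normally, two for castling
    capture: bool
    check: bool


def parse_move(token: str) -> Move:
    """Parse one move string; the field layout is an assumption (see lead-in)."""
    # Exactly four dot-separated fields are expected; anything else raises ValueError.
    color, body, capture, check = token.split(".")
    # The body is a run of 3-character chunks: piece symbol, file, rank.
    chunks = [body[i:i + 3] for i in range(0, len(body), 3)]
    # Chunks come in (from, to) pairs.
    steps = [
        Step(piece=a[0], src=a[1:], dst=b[1:])
        for a, b in zip(chunks[0::2], chunks[1::2])
    ]
    return Move(color, steps, capture == "x", check == "+")


if __name__ == "__main__":
    print(parse_move("b.♟c5♟d4.x."))
```

Under this reading, the castling row parses into two steps: the king from e1 to g1, then the rook from h1 to f1.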
+### Piece Symbols
+| White | Black | Piece |
+|-------|-------|-------|
+| ♔ | ♚ | King |
+| ♕ | ♛ | Queen |
+| ♖ | ♜ | Rook |
+| ♗ | ♝ | Bishop |
+| ♘ | ♞ | Knight |
+| ♙ | ♟ | Pawn |
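Each symbol in the table is a single Unicode code point (U+2654 to U+265F), which UTF-8 encodes as three bytes. If the tokenizer works at the byte level, which the `k.encode('utf-8', ...)` step in the loading snippet suggests but this README does not state outright, a piece symbol starts out as three byte tokens before any merges apply. The sketch below only inspects the symbols with the standard library; the names `WHITE`, `BLACK`, `PIECE_NAMES`, and `BYTE_LEN` are illustrative.

```python
import unicodedata

WHITE = "♔♕♖♗♘♙"
BLACK = "♚♛♜♝♞♟"

# Official Unicode name for each symbol, e.g. "WHITE CHESS KNIGHT".
PIECE_NAMES = {sym: unicodedata.name(sym) for sym in WHITE + BLACK}

# UTF-8 length of each symbol in bytes.
BYTE_LEN = {sym: len(sym.encode("utf-8")) for sym in PIECE_NAMES}

if __name__ == "__main__":
    for sym, name in PIECE_NAMES.items():
        print(sym, name, sym.encode("utf-8").hex(" "), BYTE_LEN[sym])
```

A move such as `w.♘g1♘f3..` is therefore 10 characters but 14 bytes, which is the length a byte-level BPE sees before merging.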
+## API
+| Function | Description |
+|----------|-------------|
+| `train(vocab_size, split)` | Train BPE on the `angeluriot/chess_games` dataset |
+| `save(tok, path)` | Save `vocab.json` and `config.json` |
+| `upload(tok, repo_id)` | Push to HuggingFace Hub |
+| `load_tiktoken(repo_id)` | Load as tiktoken Encoding |
+## License
+MIT