---
license: mit
tags:
- chess
- tiktoken
- tokenizer
---

# Chess BPE Tokenizer

A byte-pair encoding (BPE) tokenizer trained on chess moves with [rustbpe](https://github.com/karpathy/rustbpe). Inference runs through tiktoken.

## Installation

```bash
pip install rustbpe tiktoken datasets huggingface_hub
```

## Quick Start

### Load from Hugging Face and Run Inference

```python
from chess_tokenizer import load_tiktoken

enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")

# Encode chess moves
ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
print(ids)  # [token_ids...]

# Decode back
text = enc.decode(ids)
print(text)  # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
```

### Or Load Directly with tiktoken

```python
import json

import tiktoken
from huggingface_hub import hf_hub_download

# Download the tokenizer files from the Hub and build the encoding
config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json")))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json")))
enc = tiktoken.Encoding(
    name="chess",
    pat_str=config["pattern"],
    mergeable_ranks={k.encode("utf-8", errors="replace"): v for k, v in vocab.items()},
    special_tokens={},
)
```

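The dict comprehension in the snippet above turns each JSON string key into bytes. If two keys happened to encode to the same bytes, the later one would silently overwrite the earlier one. Below is a minimal sketch of a stricter conversion, assuming `vocab.json` is a flat `{token: rank}` mapping as that snippet implies. `to_mergeable_ranks` is a hypothetical helper, not part of `chess_tokenizer` or tiktoken.

```python
def to_mergeable_ranks(vocab: dict[str, int]) -> dict[bytes, int]:
    """Convert a {token: rank} mapping to {bytes: rank}, rejecting collisions.

    Hypothetical helper: mirrors the `errors="replace"` encoding used above,
    but raises instead of letting one entry overwrite another.
    """
    ranks: dict[bytes, int] = {}
    seen_ranks: set[int] = set()
    for token, rank in vocab.items():
        key = token.encode("utf-8", errors="replace")
        if key in ranks:
            raise ValueError(f"tokens collide after encoding: {key!r}")
        if rank in seen_ranks:
            raise ValueError(f"duplicate rank: {rank}")
        ranks[key] = rank
        seen_ranks.add(rank)
    return ranks
```

The result can be passed as `mergeable_ranks=` to `tiktoken.Encoding` in place of the comprehension.
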
### Train Your Own

```python
from chess_tokenizer import train, upload

# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")

# Upload to Hugging Face
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```

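For intuition about what BPE training learns, here is a toy byte-level merge loop in pure Python. It is only an illustration under simplifying assumptions: it is not rustbpe's implementation, it ignores the split pattern stored in `config.json`, and `train_toy_bpe` is a hypothetical name.

```python
from collections import Counter


def train_toy_bpe(texts, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    # Start from raw UTF-8 bytes: every token is initially a single byte.
    seqs = [[bytes([b]) for b in t.encode("utf-8")] for t in texts]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all sequences.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges add no compression
        merges.append((a, b))
        # Rewrite every sequence, replacing each (a, b) occurrence with a + b.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs
```

Because chess moves reuse the same squares and piece glyphs constantly, merges of this kind tend to fuse frequent fragments into single tokens, which is the motivation for training a domain-specific vocabulary.
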
### Full Pipeline

```bash
python chess_tokenizer.py
```

## Move Format

The tokenizer is trained on custom chess notation:

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight g1 to f3 |
| `b.♟c7♟c5..` | Black pawn c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |

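The examples in the table suggest four dot-separated fields: side to move, a run of piece-and-square pairs, a capture flag, and a check suffix. The sketch below parses a move on that assumption. The field layout is inferred from the five examples only, the piece glyphs are assumed to be the Unicode chess symbols U+2654 to U+265F, and `parse_move` and `ParsedMove` are hypothetical names. Moves not shown in the table, such as promotions, may need extra handling.

```python
import re
from dataclasses import dataclass

# One piece glyph (U+2654..U+265F) followed by a square such as "g1".
STEP = re.compile(r"([\u2654-\u265F])([a-h][1-8])")


@dataclass
class ParsedMove:
    color: str     # "w" or "b"
    steps: list    # [(piece_glyph, square), ...] in the order they appear
    capture: bool  # True when the third field is "x"
    suffix: str    # fourth field: "" or "+" in the documented examples


def parse_move(move: str) -> ParsedMove:
    """Split one move string into its assumed four dot-separated fields."""
    parts = move.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated fields, got {len(parts)}: {move!r}")
    color, body, capture, suffix = parts
    steps = STEP.findall(body)
    # The pairs must account for the whole body, with nothing left over.
    if color not in ("w", "b") or not steps or "".join(p + s for p, s in steps) != body:
        raise ValueError(f"unrecognized move: {move!r}")
    return ParsedMove(color, steps, capture == "x", suffix)
```

For a regular move, `steps` holds the origin and destination of one piece. For the castling example it holds four entries, two for the king and two for the rook.
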
### Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |

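If the piece glyphs are the standard Unicode chess symbols (an assumption, since the README does not name code points), the table can be rebuilt programmatically from the Unicode character names. A small sketch using only the standard library, where `PIECES` is a hypothetical name:

```python
import unicodedata

# Map each chess glyph (U+2654..U+265F) to (color, piece) using its Unicode name,
# e.g. "WHITE CHESS KNIGHT" becomes ("white", "knight").
PIECES = {}
for codepoint in range(0x2654, 0x2660):
    glyph = chr(codepoint)
    color, _, piece = unicodedata.name(glyph).lower().split(" ")
    PIECES[glyph] = (color, piece)
```

A lookup such as `PIECES["♘"]` then gives the color and piece for any glyph found in a move string.
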
## API

| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train BPE on angeluriot/chess_games |
| `save(tok, path)` | Save vocab.json + config.json |
| `upload(tok, repo_id)` | Push to Hugging Face Hub |
| `load_tiktoken(repo_id)` | Load as tiktoken Encoding |

## License

MIT