---
license: mit
tags:
- chess
- tiktoken
- tokenizer
---

# Chess BPE Tokenizer

A BPE tokenizer trained on chess moves using [rustbpe](https://github.com/karpathy/rustbpe), with [tiktoken](https://github.com/openai/tiktoken) used for inference.

## Installation

```bash
pip install rustbpe tiktoken datasets huggingface_hub
```

## Quick Start

### Load from HuggingFace and Run Inference

```python
from chess_tokenizer import load_tiktoken

enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")

# Encode chess moves
ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
print(ids)  # [token_ids...]

# Decode back
text = enc.decode(ids)
print(text)  # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
```

### Or Load Directly with tiktoken

```python
import json

import tiktoken
from huggingface_hub import hf_hub_download

config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json")))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json")))

# vocab.json maps token strings to ranks; tiktoken expects bytes keys
enc = tiktoken.Encoding(
    name="chess",
    pat_str=config["pattern"],
    mergeable_ranks={k.encode('utf-8', errors='replace'): v for k, v in vocab.items()},
    special_tokens={}
)
```

### Train Your Own

```python
from chess_tokenizer import train, upload

# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")

# Upload to HuggingFace
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```

### Full Pipeline

```bash
python chess_tokenizer.py
```

## Move Format

The tokenizer is trained on a custom chess notation:

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight g1 to f3 |
| `b.♟c7♟c5..` | Black pawn c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |

### Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |

## API

| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train BPE on the `angeluriot/chess_games` dataset |
| `save(tok, path)` | Save `vocab.json` and `config.json` |
| `upload(tok, repo_id)` | Push to HuggingFace Hub |
| `load_tiktoken(repo_id)` | Load as a tiktoken `Encoding` |

## License

MIT
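
## Example: Building Move Strings

The rows in the Move Format table appear to follow one pattern: the side (`w` or `b`), a dot, one or more piece-and-square pairs (origin, then destination), a dot, an optional `x` for a capture, a dot, and an optional `+` for check. The sketch below builds strings under that assumption. The pattern is inferred from the five example rows only, and `format_move` is a hypothetical helper that is not part of `chess_tokenizer`.

```python
# Piece symbols copied from the Piece Symbols table.
WHITE = {"K": "♔", "Q": "♕", "R": "♖", "B": "♗", "N": "♘", "P": "♙"}
BLACK = {"K": "♚", "Q": "♛", "R": "♜", "B": "♝", "N": "♞", "P": "♟"}


def format_move(side, steps, capture=False, check=False):
    """Build a move string in the notation inferred from the examples.

    side  : "w" or "b"
    steps : list of (piece_letter, from_square, to_square); castling uses
            two steps (king, then rook), as in the table's castle example.
    """
    symbols = WHITE if side == "w" else BLACK
    # Each step is written as piece+origin followed by piece+destination.
    body = "".join(f"{symbols[p]}{src}{symbols[p]}{dst}" for p, src, dst in steps)
    # Assumed layout: a capture flag and a check flag, each preceded by a dot.
    return f"{side}.{body}.{'x' if capture else ''}.{'+' if check else ''}"


if __name__ == "__main__":
    moves = [
        format_move("w", [("N", "g1", "f3")]),
        format_move("b", [("P", "c5", "d4")], capture=True),
        format_move("w", [("K", "e1", "g1"), ("R", "h1", "f1")]),
        format_move("b", [("Q", "d7", "d5")], check=True),
    ]
    print(" ".join(moves))
```

Strings built this way can be joined with spaces and passed to `enc.encode(...)` as in the Quick Start. Cases the table does not show, such as promotion or en passant, are not covered here.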