---
language:
- en
license: mit
tags:
- chess
- tokenizer
- bpe
- game-ai
library_name: rustbpe
datasets:
- angeluriot/chess_games
---

# Chess BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on chess moves written in a custom notation.

## Model Details

- **Tokenizer Type**: BPE (Byte Pair Encoding)
- **Vocabulary Size**: 256
- **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
- **Training Split**: train[0:1000]
- **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.♘g1♘f3..`)

## Move Format Description

The tokenizer is trained on a custom chess move notation:

| Component | Description | Example |
|-----------|-------------|---------|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + source square | `♘g1` |
| Piece + Destination | Unicode piece + destination square | `♘f3` |
| Flags | `..` (no flags), `.x.` (capture), `..+` (check), `..#` (checkmate) | `..` |

### Examples

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight from g1 to f3 |
| `b.♟c7♟c5..` | Black pawn from c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castling (king move followed by rook move) |
| `b.♛d7♛d5..+` | Black queen to d5 with check |

### Chess Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |

## Usage

### Installation

```bash
pip install rustbpe huggingface_hub
```

### Loading and Using the Tokenizer

```python
import json

from huggingface_hub import hf_hub_download

# Download tokenizer files (replace YOUR_USERNAME with the repo owner)
vocab_path = hf_hub_download(
    repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json"
)
config_path = hf_hub_download(
    repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json"
)

# Load vocabulary and config; read as UTF-8 because the tokens
# can contain Unicode chess pieces
with open(vocab_path, "r", encoding="utf-8") as f:
    vocab = json.load(f)
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")
```

### Using with rustbpe (for encoding)

```python
import rustbpe

# Note: rustbpe tokenizer needs to be retrained or loaded from merges
# See the training script for details
```

### Training Your Own

```python
from bpess.main import train_chess_tokenizer, push_to_hub

# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key="moves_custom",
)

# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom",
    },
)
```

## Training Details

- **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
- **Algorithm**: Byte Pair Encoding with GPT-4 style regex pre-tokenization
- **Source Dataset**: ~14M chess games from [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)

## Intended Use

This tokenizer is designed for:

- Training language models on chess games
- Chess move prediction tasks
- Game analysis and embedding generation

## License

MIT License
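
## Appendix: Parsing the Move Format (Sketch)

The notation described under Move Format Description can be split into its components with a regular expression. The following is a minimal sketch, not part of the tokenizer or of rustbpe: the grammar is inferred only from the examples in this README, and `parse_move`, `ParsedMove`, and `MOVE_RE` are hypothetical names introduced for illustration. It assumes the flag suffix always has the shape `.` + optional `x` + `.` + optional `+` or `#`.

```python
import re
from typing import List, NamedTuple, Tuple

# One Unicode chess piece (U+2654..U+265F) followed by a board square.
PIECE_SQ = r"[\u2654-\u265F][a-h][1-8]"

# Assumed grammar, inferred from the README examples only:
#   <w|b> . <source/destination pairs> . <x?> . <+ or #, optional>
MOVE_RE = re.compile(
    rf"^(?P<player>[wb])\."
    rf"(?P<body>(?:(?:{PIECE_SQ}){{2}})+)"
    rf"\.(?P<capture>x?)\.(?P<check>[+#]?)$"
)
STEP_RE = re.compile(r"([\u2654-\u265F])([a-h][1-8])")


class ParsedMove(NamedTuple):
    player: str                   # "w" or "b"
    steps: List[Tuple[str, str]]  # (piece, square), alternating source/destination
    capture: bool                 # True when the capture flag "x" is present
    check: str                    # "", "+" (check) or "#" (checkmate)


def parse_move(move: str) -> ParsedMove:
    """Split one move string into player, piece/square steps, and flags."""
    match = MOVE_RE.match(move)
    if match is None:
        raise ValueError(f"not a move in the assumed notation: {move!r}")
    steps = STEP_RE.findall(match["body"])
    return ParsedMove(
        player=match["player"],
        steps=steps,
        capture=match["capture"] == "x",
        check=match["check"],
    )


if __name__ == "__main__":
    for example in ["w.♘g1♘f3..", "b.♟c5♟d4.x.", "w.♔e1♔g1♖h1♖f1..", "b.♛d7♛d5..+"]:
        print(parse_move(example))
```

Castling shows up as four steps (king source and destination, then rook source and destination). The sketch also accepts a combined capture-and-check suffix such as `.x.+`; that combination does not appear in the examples above, so treat it as an assumption to verify against the dataset.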