---
language:
- en
license: mit
tags:
- chess
- tokenizer
- bpe
- game-ai
library_name: rustbpe
datasets:
- angeluriot/chess_games
---
# Chess BPE Tokenizer
A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format.
## Model Details
- **Tokenizer Type**: BPE (Byte Pair Encoding)
- **Vocabulary Size**: 256
- **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
- **Training Split**: train[0:1000]
- **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.β™˜g1β™˜f3..`)
## Move Format Description
The tokenizer is trained on a custom chess move notation:
| Component | Description | Example |
|-----------|-------------|---------|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + square | `β™˜g1` |
| Piece + Destination | Unicode piece + square | `β™˜f3` |
| Flags | `.x.` (capture), `..+` (check), `..#` (checkmate); `..` when no flag applies | `..` |
### Examples
| Move | Meaning |
|------|---------|
| `w.β™˜g1β™˜f3..` | White knight from g1 to f3 |
| `b.β™Ÿc7β™Ÿc5..` | Black pawn from c7 to c5 |
| `b.β™Ÿc5β™Ÿd4.x.` | Black pawn captures on d4 |
| `w.β™”e1β™”g1β™–h1β™–f1..` | White kingside castle |
| `b.β™›d7β™›d5..+` | Black queen to d5 with check |
### Chess Piece Symbols
| White | Black | Piece |
|-------|-------|-------|
| β™” | β™š | King |
| β™• | β™› | Queen |
| β™– | β™œ | Rook |
| β™— | ♝ | Bishop |
| β™˜ | β™ž | Knight |
| β™™ | β™Ÿ | Pawn |
## Usage
### Installation
```bash
pip install rustbpe huggingface_hub
```
### Loading and Using the Tokenizer
```python
import json

from huggingface_hub import hf_hub_download

# Download the tokenizer files from the Hub
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

# Load vocabulary and config
with open(vocab_path, "r") as f:
    vocab = json.load(f)
with open(config_path, "r") as f:
    config = json.load(f)

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")
```
### Using with rustbpe (for encoding)
```python
import rustbpe
# Note: rustbpe does not ship a loader for these JSON files; to encode text,
# retrain the tokenizer (see "Training Your Own" below) or rebuild it from
# the saved merges.
```
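Until the tokenizer is rebuilt, the encoding step itself is simple enough to sketch in pure Python. The function below is a minimal byte-level BPE encoder, assuming a `merges` dict that maps a pair of token ids to the merged token id (the actual layout of this repo's saved files may differ):

```python
def bpe_encode(text: str, merges: dict) -> list:
    """Greedy byte-level BPE: repeatedly apply the earliest-learned merge.

    `merges` maps a pair of token ids to the new token id, e.g. {(97, 97): 256}.
    Lower merged ids are assumed to have been learned earlier, so they apply first.
    """
    ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    while len(ids) >= 2:
        pairs = {(ids[i], ids[i + 1]) for i in range(len(ids) - 1)}
        candidates = [p for p in pairs if p in merges]
        if not candidates:
            break  # no learned merge applies anymore
        pair = min(candidates, key=lambda p: merges[p])
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```

With the toy merges `{(97, 97): 256, (256, 97): 257}` (i.e. `"aa"` then `"aaa"`), `bpe_encode("aaa", merges)` collapses the three bytes down to the single token `257`.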
### Training Your Own
```python
from bpess.main import train_chess_tokenizer, push_to_hub
# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key="moves_custom",
)

# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom",
    },
)
```
## Training Details
- **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
- **Algorithm**: Byte Pair Encoding with GPT-4 style regex pre-tokenization
- **Source Dataset**: ~14M chess games from [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
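The core of the algorithm above fits in a few lines: count adjacent token pairs, merge the most frequent pair into a new token, repeat. The sketch below is a toy byte-level trainer, not the rustbpe implementation (in particular, it omits the GPT-4 style regex pre-tokenization step):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict:
    """Toy BPE trainer: learn up to `num_merges` merges over the bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new token id
    for new_id in range(256, 256 + num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]
        merges[pair] = new_id
        # replace every occurrence of `pair` with the new token
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges
```

For instance, training on `"abab"` for two merges first fuses the bytes of `"ab"` into token 256, then fuses the pair `(256, 256)` into token 257.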
## Intended Use
This tokenizer is designed for:
- Training language models on chess games
- Chess move prediction tasks
- Game analysis and embedding generation
## License
MIT License