# chess-Tok: Efficient Chess Move Tokenizer
This chess tokenizer uses a large vocabulary (~844 tokens) of semantically meaningful units: color prefixes (`w.`, `b.`), piece+square combinations (`♙e4`, `♘f6`), and complete move suffixes (`..`, `.x.`, `.+`).
This design reduces sequence length by ~60% compared to character-level tokenization, enabling faster training and better gradient flow in recurrent models.
## Usage
```python
from transformers import AutoTokenizer

# Load the tokenizer directly from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("ankanmbz/chess-tok", trust_remote_code=True)

# Tokenize chess moves
text = "w.♙e2♙e4.."
encoded = tokenizer(text, return_tensors="pt")
print(encoded)

# Decode
decoded = tokenizer.decode(encoded["input_ids"][0])
print(decoded)

# Batch processing
moves = ["w.♙e2♙e4..", "b.♙c7♙c5..", "w.♘g1♘f3.."]
batch = tokenizer(moves, padding=True, return_tensors="pt")
print(batch)
```
## Features
- ✅ Semantic units (not character-level)
- ✅ ~60% shorter sequences
- ✅ Special tokens: `<pad>`, `<sos>`, `<eos>`, `<unk>`
- ✅ Compatible with HuggingFace transformers
- ✅ Batch processing with padding
- ✅ Optimized for tiny recurrent models
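The special tokens can be used in the conventional way for sequence models. The sketch below assumes the usual id assignment (`<pad>`=0, `<sos>`=1, `<eos>`=2, `<unk>`=3) purely for illustration; the real tokenizer's ids may differ, and `wrap_and_pad` is a hypothetical helper, not part of the released API.

```python
# Illustrative sketch: wrapping tokenized games in <sos>/<eos> and
# right-padding a batch. Ids here are assumptions, not the real vocab.
specials = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}

def wrap_and_pad(batch_ids, pad_id=0, sos_id=1, eos_id=2):
    """Add <sos>/<eos> around each id sequence and right-pad to equal length."""
    wrapped = [[sos_id] + ids + [eos_id] for ids in batch_ids]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]

batch = wrap_and_pad([[10, 11, 12], [10, 11]])
print(batch)  # [[1, 10, 11, 12, 2], [1, 10, 11, 2, 0]]
```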
## Vocabulary Size
844 tokens
## Token Categories

- Move prefixes: `w.`, `b.`
- Piece+Square: `♙e2`, `♘f6`, etc.
- Squares: `a1`, `e4`, `h8`, etc.
- Suffixes: `..`, `.x.`, `.+`, `.+#`, etc.
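Given these categories, a full move string is just a concatenation of one token from each. The helper below is hypothetical (not part of the tokenizer) and assumes the move layout seen in the examples on this card: prefix, piece+from-square, piece+to-square, suffix.

```python
# Hypothetical helper: compose a move string from the four token categories.
# The layout (prefix, piece+square, piece+square, suffix) follows the
# examples on this card; it is an assumption, not the published spec.
def make_move(color, piece, src, dst, suffix=".."):
    """Build e.g. 'w.♙e2♙e4..' from its category components."""
    return f"{color}.{piece}{src}{piece}{dst}{suffix}"

print(make_move("w", "♙", "e2", "e4"))  # w.♙e2♙e4..
print(make_move("b", "♘", "g8", "f6"))  # b.♘g8♘f6..
```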
## Example Tokenization

Input: `"w.♙e2♙e4.."`
Tokens: `['w.', '♙e2', '♙e4', '..']`
Token count: 4 (vs ~10 with character-level tokenization)
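The 4-vs-10 comparison above can be reproduced with a minimal sketch. The greedy longest-match strategy and the tiny five-entry vocabulary below are illustrative assumptions, not the tokenizer's actual implementation or vocabulary.

```python
# Sketch: why semantic units shorten sequences vs character-level tokens.
# Tiny illustrative vocabulary, NOT the real 844-token vocab.
vocab = {"w.", "b.", "♙e2", "♙e4", ".."}

def char_tokenize(text):
    """Character-level baseline: one token per character."""
    return list(text)

def semantic_tokenize(text, vocab):
    """Greedy longest-match over a semantic vocabulary (illustrative only)."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try the longest match first
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append("<unk>")  # no vocab entry covers this character
            i += 1
    return tokens

move = "w.♙e2♙e4.."
print(len(char_tokenize(move)))            # 10 characters
print(semantic_tokenize(move, vocab))      # ['w.', '♙e2', '♙e4', '..']
```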
## Training Details
- Built from 1M chess games
- Frequency-based vocabulary ordering
- Covers all common chess move patterns
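Frequency-based vocabulary ordering can be sketched as follows. The corpus, the pre-split move units, and the specials-first convention are all assumptions for illustration; the actual build process for chess-Tok is not published on this card.

```python
# Sketch: frequency-ordered vocabulary construction from a (toy) corpus of
# games already split into candidate units. Details are assumptions.
from collections import Counter

def build_vocab(unit_corpus, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Assign ids: special tokens first, then units by descending frequency."""
    counts = Counter(unit for game in unit_corpus for unit in game)
    units = [unit for unit, _ in counts.most_common()]
    return {tok: i for i, tok in enumerate(list(specials) + units)}

corpus = [
    ["w.", "♙e2", "♙e4", ".."],
    ["b.", "♙c7", "♙c5", ".."],
    ["w.", "♙e2", "♙e4", ".."],
]
vocab = build_vocab(corpus)
print(vocab["<pad>"])  # 0
print(vocab[".."])     # 4 -- the most frequent unit gets the first non-special id
```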
## License
Apache 2.0