# chess-Tok: Efficient Chess Move Tokenizer
This chess tokenizer uses a large vocabulary (~844 tokens) of semantically meaningful units: color prefixes (`w.`, `b.`), piece+square combinations (`♙e4`, `♘f6`), and complete move suffixes (`..`, `.x.`, `.+`).
This design reduces sequence length by ~60% compared to character-level tokenization, enabling faster training and better gradient flow in recurrent models.
## Usage
```python
from transformers import AutoTokenizer

# Load the tokenizer directly from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("ankanmbz/chess-tok", trust_remote_code=True)

# Tokenize chess moves
text = "w.♙e2♙e4.."
encoded = tokenizer(text, return_tensors="pt")
print(encoded)

# Decode
decoded = tokenizer.decode(encoded["input_ids"][0])
print(decoded)

# Batch processing
moves = ["w.♙e2♙e4..", "b.♙c7♙c5..", "w.♘g1♘f3.."]
batch = tokenizer(moves, padding=True, return_tensors="pt")
print(batch)
```
## Features
- ✅ Semantic units (not character-level)
- ✅ ~60% shorter sequences
- ✅ Special tokens: `<pad>`, `<sos>`, `<eos>`, `<unk>`
- ✅ Compatible with HuggingFace transformers
- ✅ Batch processing with padding
- ✅ Optimized for tiny recurrent models
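The special tokens can be used in the conventional way for sequence models. The sketch below assumes the usual id assignment (`<pad>`=0, `<sos>`=1, `<eos>`=2, `<unk>`=3) purely for illustration; the real tokenizer's ids may differ, and `wrap_and_pad` is a hypothetical helper, not part of the released API.

```python
# Illustrative sketch: wrapping tokenized games in <sos>/<eos> and
# right-padding a batch. Ids here are assumptions, not the real vocab.
specials = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}

def wrap_and_pad(batch_ids, pad_id=0, sos_id=1, eos_id=2):
    """Add <sos>/<eos> around each id sequence and right-pad to equal length."""
    wrapped = [[sos_id] + ids + [eos_id] for ids in batch_ids]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]

batch = wrap_and_pad([[10, 11, 12], [10, 11]])
print(batch)  # [[1, 10, 11, 12, 2], [1, 10, 11, 2, 0]]
```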
## Vocabulary Size
844 tokens
## Token Categories

- Move prefixes: `w.`, `b.`
- Piece+Square: `♙e2`, `♘f6`, etc.
- Squares: `a1`, `e4`, `h8`, etc.
- Suffixes: `..`, `.x.`, `.+`, `.+#`, etc.
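Given these categories, a full move string is just a concatenation of one token from each. The helper below is hypothetical (not part of the tokenizer) and assumes the move layout seen in the examples on this card: prefix, piece+from-square, piece+to-square, suffix.

```python
# Hypothetical helper: compose a move string from the four token categories.
# The layout (prefix, piece+square, piece+square, suffix) follows the
# examples on this card; it is an assumption, not the published spec.
def make_move(color, piece, src, dst, suffix=".."):
    """Build e.g. 'w.♙e2♙e4..' from its category components."""
    return f"{color}.{piece}{src}{piece}{dst}{suffix}"

print(make_move("w", "♙", "e2", "e4"))  # w.♙e2♙e4..
print(make_move("b", "♘", "g8", "f6"))  # b.♘g8♘f6..
```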
## Example Tokenization

Input: `"w.♙e2♙e4.."`
Tokens: `['w.', '♙e2', '♙e4', '..']`
Token count: 4 (vs ~10 with character-level tokenization)
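The 4-vs-10 comparison above can be reproduced with a minimal sketch. The greedy longest-match strategy and the tiny five-entry vocabulary below are illustrative assumptions, not the tokenizer's actual implementation or vocabulary.

```python
# Sketch: why semantic units shorten sequences vs character-level tokens.
# Tiny illustrative vocabulary, NOT the real 844-token vocab.
vocab = {"w.", "b.", "♙e2", "♙e4", ".."}

def char_tokenize(text):
    """Character-level baseline: one token per character."""
    return list(text)

def semantic_tokenize(text, vocab):
    """Greedy longest-match over a semantic vocabulary (illustrative only)."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try the longest match first
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append("<unk>")  # no vocab entry covers this character
            i += 1
    return tokens

move = "w.♙e2♙e4.."
print(len(char_tokenize(move)))            # 10 characters
print(semantic_tokenize(move, vocab))      # ['w.', '♙e2', '♙e4', '..']
```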
## Training Details
- Built from 1M chess games
- Frequency-based vocabulary ordering
- Covers all common chess move patterns
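Frequency-based vocabulary ordering can be sketched as follows. The corpus, the pre-split move units, and the specials-first convention are all assumptions for illustration; the actual build process for chess-Tok is not published on this card.

```python
# Sketch: frequency-ordered vocabulary construction from a (toy) corpus of
# games already split into candidate units. Details are assumptions.
from collections import Counter

def build_vocab(unit_corpus, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Assign ids: special tokens first, then units by descending frequency."""
    counts = Counter(unit for game in unit_corpus for unit in game)
    units = [unit for unit, _ in counts.most_common()]
    return {tok: i for i, tok in enumerate(list(specials) + units)}

corpus = [
    ["w.", "♙e2", "♙e4", ".."],
    ["b.", "♙c7", "♙c5", ".."],
    ["w.", "♙e2", "♙e4", ".."],
]
vocab = build_vocab(corpus)
print(vocab["<pad>"])  # 0
print(vocab[".."])     # 4 -- the most frequent unit gets the first non-special id
```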
## License
Apache 2.0