chess-Tok: Efficient Chess Move Tokenizer

This chess tokenizer uses a large vocabulary (844 tokens) built from semantically meaningful units: move prefixes ('w.', 'b.'), piece+square combinations ('♙e4', '♞f6'), and complete suffixes ('..', '.x.', '.+').

This design reduces sequence length by ~60% compared to character-level tokenization, enabling faster training and better gradient flow in recurrent models.
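The reduction is easy to see on a single move. This is an illustrative comparison only (it hand-writes the semantic tokens rather than calling the tokenizer): a character-level scheme emits one token per character, while chess-tok emits one token per semantic unit.

```python
# Illustrative only: character-level vs semantic tokenization of one move.
move = "w.♙e2♙e4.."  # prefix + piece-square + piece-square + suffix

char_tokens = list(move)                        # one token per character
semantic_tokens = ["w.", "♙e2", "♙e4", ".."]    # chess-tok units

print(len(char_tokens))      # 10
print(len(semantic_tokens))  # 4, i.e. a 60% shorter sequence
```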

Usage

from transformers import AutoTokenizer

# Load tokenizer directly from HuggingFace
tokenizer = AutoTokenizer.from_pretrained("ankanmbz/chess-tok", trust_remote_code=True)

# Tokenize chess moves
text = "w.โ™™e2โ™™e4.."
encoded = tokenizer(text, return_tensors="pt")
print(encoded)

# Decode
decoded = tokenizer.decode(encoded['input_ids'][0])
print(decoded)

# Batch processing
moves = ["w.โ™™e2โ™™e4..", "b.โ™Ÿc7โ™Ÿc5..", "w.โ™˜g1โ™˜f3.."]
batch = tokenizer(moves, padding=True, return_tensors="pt")
print(batch)

Features

  • โœ… Semantic units (not character-level)
  • โœ… 60% shorter sequences
  • โœ… Special tokens: <pad>, <sos>, <eos>, <unk>
  • โœ… Compatible with HuggingFace transformers
  • โœ… Batch processing with padding
  • โœ… Optimized for tiny recurrent models

Vocabulary Size

844 tokens

Token Categories

  • Move prefixes: w., b.
  • Piece+Square: โ™™e2, โ™žf6, etc.
  • Squares: a1, e4, h8, etc.
  • Suffixes: .., .x., .+, .+#, etc.

Example Tokenization

Input: "w.♙e2♙e4.."
Tokens: ['w.', '♙e2', '♙e4', '..']
Token count: 4 (vs 10 with character-level)
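A tokenization like the one above can be reproduced with a greedy longest-match pass over the vocabulary. Note the sketch below is an assumption about the matching strategy, and `VOCAB` is a tiny illustrative subset, not the real 844-token vocabulary:

```python
# Greedy longest-match tokenizer sketch over a toy subset of the vocabulary.
VOCAB = {
    "w.", "b.",              # move prefixes
    "♙e2", "♙e4", "♞f6",     # piece+square combinations
    "e4", "f6",              # bare squares
    "..", ".x.", ".+",       # suffixes
}
MAX_LEN = max(len(t) for t in VOCAB)

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrinking until a vocab hit
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in VOCAB:
                tokens.append(piece)
                i += size
                break
        else:
            tokens.append("<unk>")  # no unit matched: emit the unknown token
            i += 1
    return tokens

print(tokenize("w.♙e2♙e4.."))  # ['w.', '♙e2', '♙e4', '..']
```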

Training Details

  • Built from 1M chess games
  • Frequency-based vocabulary ordering
  • Covers all common chess move patterns
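The card only states that the vocabulary was built from 1M games and ordered by frequency; one plausible procedure is counting candidate units across the corpus and placing the special tokens first. The `segment` helper and the whitespace-separated game format below are hypothetical stand-ins:

```python
from collections import Counter

def segment(game: str) -> list[str]:
    # hypothetical: splits one game record into candidate units
    return game.split()

# toy stand-in for the 1M-game corpus
games = ["w. ♙e2 ♙e4 ..", "b. ♟c7 ♟c5 ..", "w. ♙e2 ♙e4 .."]
counts = Counter(unit for game in games for unit in segment(game))

# special tokens first, then units in descending frequency order
specials = ["<pad>", "<sos>", "<eos>", "<unk>"]
vocab = specials + [unit for unit, _ in counts.most_common()]
print(vocab[:5])  # ['<pad>', '<sos>', '<eos>', '<unk>', '..']
```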

License

Apache 2.0
