---
language:
- en
license: mit
tags:
- chess
- tokenizer
- bpe
- game-ai
library_name: rustbpe
datasets:
- angeluriot/chess_games
---
# Chess BPE Tokenizer
A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format.
## Model Details
- **Tokenizer Type**: BPE (Byte Pair Encoding)
- **Vocabulary Size**: 256
- **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
- **Training Split**: train[0:1000]
- **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.β™˜g1β™˜f3..`)
## Move Format Description
The tokenizer is trained on a custom chess move notation:
| Component | Description | Example |
|-----------|-------------|---------|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + square | `β™˜g1` |
| Piece + Destination | Unicode piece + square | `β™˜f3` |
| Flags | `.x.` (capture), `..+` (check), `..#` (checkmate); `..` when no flag applies | `..` |
### Examples
| Move | Meaning |
|------|---------|
| `w.β™˜g1β™˜f3..` | White knight from g1 to f3 |
| `b.β™Ÿc7β™Ÿc5..` | Black pawn from c7 to c5 |
| `b.β™Ÿc5β™Ÿd4.x.` | Black pawn captures on d4 |
| `w.β™”e1β™”g1β™–h1β™–f1..` | White kingside castle |
| `b.β™›d7β™›d5..+` | Black queen to d5 with check |
### Chess Piece Symbols
| White | Black | Piece |
|-------|-------|-------|
| β™” | β™š | King |
| β™• | β™› | Queen |
| β™– | β™œ | Rook |
| β™— | ♝ | Bishop |
| β™˜ | β™ž | Knight |
| β™™ | β™Ÿ | Pawn |
## Usage
### Installation
```bash
pip install rustbpe huggingface_hub
```
### Loading and Using the Tokenizer
```python
import json

from huggingface_hub import hf_hub_download

# Download the tokenizer files from the Hub
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

# Load vocabulary and config
with open(vocab_path, "r") as f:
    vocab = json.load(f)
with open(config_path, "r") as f:
    config = json.load(f)

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")
```
### Using with rustbpe (for encoding)
```python
import rustbpe
# Note: rustbpe does not ship a loader for these JSON files; to encode text,
# retrain the tokenizer (see "Training Your Own" below) or rebuild it from
# the saved merges.
```
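Until the tokenizer is rebuilt, the encoding step itself is simple enough to sketch in pure Python. The function below is a minimal byte-level BPE encoder, assuming a `merges` dict that maps a pair of token ids to the merged token id (the actual layout of this repo's saved files may differ):

```python
def bpe_encode(text: str, merges: dict) -> list:
    """Greedy byte-level BPE: repeatedly apply the earliest-learned merge.

    `merges` maps a pair of token ids to the new token id, e.g. {(97, 97): 256}.
    Lower merged ids are assumed to have been learned earlier, so they apply first.
    """
    ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    while len(ids) >= 2:
        pairs = {(ids[i], ids[i + 1]) for i in range(len(ids) - 1)}
        candidates = [p for p in pairs if p in merges]
        if not candidates:
            break  # no learned merge applies anymore
        pair = min(candidates, key=lambda p: merges[p])
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```

With the toy merges `{(97, 97): 256, (256, 97): 257}` (i.e. `"aa"` then `"aaa"`), `bpe_encode("aaa", merges)` collapses the three bytes down to the single token `257`.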
### Training Your Own
```python
from bpess.main import train_chess_tokenizer, push_to_hub
# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key="moves_custom",
)

# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom",
    },
)
```
## Training Details
- **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
- **Algorithm**: Byte Pair Encoding with GPT-4 style regex pre-tokenization
- **Source Dataset**: ~14M chess games from [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
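The core of the algorithm above fits in a few lines: count adjacent token pairs, merge the most frequent pair into a new token, repeat. The sketch below is a toy byte-level trainer, not the rustbpe implementation (in particular, it omits the GPT-4 style regex pre-tokenization step):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict:
    """Toy BPE trainer: learn up to `num_merges` merges over the bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new token id
    for new_id in range(256, 256 + num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]
        merges[pair] = new_id
        # replace every occurrence of `pair` with the new token
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges
```

For instance, training on `"abab"` for two merges first fuses the bytes of `"ab"` into token 256, then fuses the pair `(256, 256)` into token 257.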
## Intended Use
This tokenizer is designed for:
- Training language models on chess games
- Chess move prediction tasks
- Game analysis and embedding generation
## License
MIT License