
	---
	license: mit
	tags:
	- chess
	- tiktoken
	- tokenizer
	---
	# Chess BPE Tokenizer

	A byte-pair encoding (BPE) tokenizer for chess moves, trained with [rustbpe](https://github.com/karpathy/rustbpe) and loaded through tiktoken for inference.

	## Installation

	```bash
	pip install rustbpe tiktoken datasets huggingface_hub
	```

	## Quick Start

	### Load from Hugging Face and run inference

	```python
	from chess_tokenizer import load_tiktoken

	enc = load_tiktoken("ItsMaxNorm/bpess")

	# Encode chess moves
	ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
	print(ids) # [token_ids...]

	# Decode back
	text = enc.decode(ids)
	print(text) # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
	```


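	### Or load directly with tiktoken

	```python
	import json
	import tiktoken
	from huggingface_hub import hf_hub_download

	# Download the tokenizer files from the Hub
	with open(hf_hub_download("ItsMaxNorm/bpess", "config.json")) as f:
	    config = json.load(f)
	with open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json")) as f:
	    vocab = json.load(f)

	# Build a tiktoken Encoding from the downloaded vocabulary
	enc = tiktoken.Encoding(
	    name="chess", pat_str=config["pattern"],
	    mergeable_ranks={k.encode("utf-8", errors="replace"): v for k, v in vocab.items()},
	    special_tokens={},
	)
	```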
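
The loader above implies a simple file layout: `vocab.json` maps token strings to integer ranks, and `config.json` carries the pre-tokenization regex under the key `pattern`. The sketch below is an assumption built only from that snippet, not the repository's actual `save` implementation; `write_tokenizer_files` and `read_mergeable_ranks` are hypothetical names. It writes the two files and rebuilds the `mergeable_ranks` dict the same way the loader does.

```python
import json
import tempfile
from pathlib import Path


def write_tokenizer_files(ranks: dict, pattern: str, out_dir: Path) -> None:
    """Hypothetical writer mirroring the layout the loader reads."""
    # JSON keys must be strings, so each token's bytes are decoded as UTF-8.
    vocab = {tok.decode("utf-8", errors="replace"): rank for tok, rank in ranks.items()}
    if len(vocab) != len(ranks):
        raise ValueError("some tokens collide after UTF-8 decoding")
    (out_dir / "vocab.json").write_text(
        json.dumps(vocab, ensure_ascii=False), encoding="utf-8"
    )
    (out_dir / "config.json").write_text(
        json.dumps({"pattern": pattern}), encoding="utf-8"
    )


def read_mergeable_ranks(out_dir: Path) -> dict:
    """Rebuild the bytes -> rank mapping the same way the loader does."""
    vocab = json.loads((out_dir / "vocab.json").read_text(encoding="utf-8"))
    return {k.encode("utf-8", errors="replace"): v for k, v in vocab.items()}


# Tiny illustrative vocabulary (not the published one).
example_ranks = {b"w": 0, b".": 1, "♘".encode("utf-8"): 2, b"w.": 3}
with tempfile.TemporaryDirectory() as tmp:
    write_tokenizer_files(example_ranks, r"\S+|\s+", Path(tmp))
    restored = read_mergeable_ranks(Path(tmp))
print(restored == example_ranks)
```

One caveat under this layout: a token whose bytes are not valid UTF-8 on their own (for example a single byte from the middle of `♘`) would not survive the string round trip. How the published files handle such tokens is not stated in this README.
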
	### Train Your Own

	```python
	from chess_tokenizer import train, upload

	# Train on chess dataset
	tok = train(vocab_size=4096, split="train[0:10000]")

	# Upload to HuggingFace
	upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
	```
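
For intuition about what BPE training produces, here is a toy byte-level BPE loop in pure Python. It is a sketch of the general technique only, not rustbpe's implementation: it skips the regex pre-splitting that the `pattern` entry in `config.json` suggests, and `train_toy_bpe` is a hypothetical name. Each round finds the most frequent adjacent pair of token ids and replaces it with a new id.

```python
from collections import Counter


def train_toy_bpe(text: str, num_merges: int):
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    vocab = {i: bytes([i]) for i in range(256)}
    merges = {}  # (left_id, right_id) -> new_id
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        new_id = 256 + len(merges)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        # Rewrite the sequence, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges, vocab, ids


corpus = "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.. " * 4
merges, vocab, ids = train_toy_bpe(corpus, num_merges=20)
print(len(corpus.encode("utf-8")), "bytes ->", len(ids), "tokens")
print([vocab[i].decode("utf-8", errors="replace") for i in ids[:8]])
```

Because every move shares the same separators and a small set of piece symbols, frequent fragments such as a piece symbol's three UTF-8 bytes tend to be merged early.
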

	### Full Pipeline

	```bash
	python chess_tokenizer.py
	```

	## Move Format

	The tokenizer is trained on custom chess notation:

	| Move | Meaning |
	|------|---------|
	| `w.♘g1♘f3..` | White knight g1 to f3 |
	| `b.♟c7♟c5..` | Black pawn c7 to c5 |
	| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
	| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
	| `b.♛d7♛d5..+` | Black queen to d5 with check |
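
Judging only from the five examples in the table, each move appears to be four dot-separated fields: side to move (`w` or `b`), a run of piece-and-square pairs, an optional `x` for a capture, and an optional `+` for check. The parser below is a minimal sketch under that assumption; `ParsedMove` and `parse_move` are hypothetical names, and markers not shown in the table (promotion, checkmate, and so on) are not handled.

```python
import re
from dataclasses import dataclass

# One chess piece symbol (U+2654..U+265F) followed by a square such as "g1".
PIECE_SQUARE = re.compile(r"([\u2654-\u265F])([a-h][1-8])")


@dataclass
class ParsedMove:
    color: str      # "w" or "b"
    steps: list     # [(piece_symbol, square), ...] in source order
    capture: bool
    check: bool


def parse_move(move: str) -> ParsedMove:
    """Split a move into its assumed fields: color.steps.capture.check"""
    color, body, capture, check = move.split(".")
    steps = PIECE_SQUARE.findall(body)
    return ParsedMove(color, steps, capture == "x", check == "+")


for text in ["w.♘g1♘f3..", "b.♟c5♟d4.x.", "w.♔e1♔g1♖h1♖f1..", "b.♛d7♛d5..+"]:
    print(parse_move(text))
```

A castle shows up as four steps (king from, king to, rook from, rook to), so `steps` is kept as a flat list rather than a single from/to pair.
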

	### Piece Symbols

	| White | Black | Piece |
	|-------|-------|-------|
	| ♔ | ♚ | King |
	| ♕ | ♛ | Queen |
	| ♖ | ♜ | Rook |
	| ♗ | ♝ | Bishop |
	| ♘ | ♞ | Knight |
	| ♙ | ♟ | Pawn |
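
These twelve symbols are the Unicode chess characters U+2654 through U+265F, whose official names encode both color and piece (for example `WHITE CHESS KNIGHT`). The helper below is a small sketch that recovers the table from those names using the standard library; `describe_piece` is a hypothetical name and not part of `chess_tokenizer`.

```python
import unicodedata


def describe_piece(symbol: str) -> tuple:
    """Return (color, piece) for a chess symbol in U+2654..U+265F."""
    # Unicode names look like "WHITE CHESS KNIGHT" or "BLACK CHESS PAWN".
    color, chess, piece = unicodedata.name(symbol).split()
    if chess != "CHESS":
        raise ValueError(f"not a chess piece symbol: {symbol!r}")
    return color.lower(), piece.lower()


for symbol in "♔♕♖♗♘♙♚♛♜♝♞♟":
    print(symbol, *describe_piece(symbol))
```

Each symbol takes three bytes in UTF-8, which is why a byte-level tokenizer benefits from learning merges for them.
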

	## API

	| Function | Description |
	|----------|-------------|
	| `train(vocab_size, split)` | Train BPE on angeluriot/chess_games |
	| `save(tok, path)` | Save vocab.json + config.json |
	| `upload(tok, repo_id)` | Push to HuggingFace Hub |
	| `load_tiktoken(repo_id)` | Load as tiktoken Encoding |

	## License

	MIT