Files changed (1)
  1. README.md +99 -0
README.md ADDED
@@ -0,0 +1,99 @@
---
license: mit
tags:
- chess
- tiktoken
- tokenizer
---
# Chess BPE Tokenizer

A byte-pair encoding (BPE) tokenizer for chess moves, trained with [rustbpe](https://github.com/karpathy/rustbpe) and used for inference through tiktoken.

## Installation

```bash
pip install rustbpe tiktoken datasets huggingface_hub
```

## Quick Start

### Load from Hugging Face and Run Inference

```python
from chess_tokenizer import load_tiktoken

enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")

# Encode chess moves
ids = enc.encode("w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4..")
print(ids)  # [token_ids...]

# Decode back
text = enc.decode(ids)
print(text)  # "w.♘g1♘f3.. b.♟c7♟c5.. w.♙d2♙d4.."
```

### Or Load Directly with tiktoken

```python
import json
import tiktoken
from huggingface_hub import hf_hub_download

# Download the tokenizer files and build the Encoding directly
config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json")))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json")))
enc = tiktoken.Encoding(
    name="chess", pat_str=config["pattern"],
    mergeable_ranks={k.encode('utf-8', errors='replace'): v for k, v in vocab.items()},
    special_tokens={}
)
```

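tiktoken expects `mergeable_ranks` to map `bytes` tokens to integer ranks, which is why the string keys from `vocab.json` are encoded before being passed in. As a minimal sketch, assuming `vocab.json` is a flat mapping from token string to rank, the conversion can be wrapped in a helper that also checks that no two keys collapse to the same bytes (possible with `errors='replace'` if a key contains a character UTF-8 cannot encode, such as a lone surrogate). The name `to_mergeable_ranks` and the toy vocabulary are hypothetical, not part of this repository.

```python
def to_mergeable_ranks(vocab: dict[str, int]) -> dict[bytes, int]:
    """Encode string tokens to bytes, refusing to silently merge two keys."""
    ranks: dict[bytes, int] = {}
    for token, rank in vocab.items():
        key = token.encode("utf-8", errors="replace")
        if key in ranks:
            raise ValueError(f"{token!r} collides with another token as {key!r}")
        ranks[key] = rank
    return ranks


# Toy vocabulary (hypothetical); a real one would come from vocab.json.
toy_vocab = {"w": 0, ".": 1, "w.": 2, "\u2658": 3}
print(to_mergeable_ranks(toy_vocab))
```

The returned dictionary can be passed as `mergeable_ranks=` to `tiktoken.Encoding`.
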
### Train Your Own

```python
from chess_tokenizer import train, upload

# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")

# Upload to HuggingFace
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```

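For intuition about what BPE training does, here is a toy, pure-Python sketch of the merge loop. It is not rustbpe's implementation and is not how `train` is written: it skips the regex pre-splitting that a tiktoken-style tokenizer applies first, and the `train_bpe` name is made up for illustration. It repeatedly finds the most frequent adjacent pair of byte tokens and fuses it into a new token.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` byte-pair merges from `text` (toy, no pre-splitting)."""
    seq = [bytes([b]) for b in text.encode("utf-8")]  # start from single bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append((left, right))
        # Replace every non-overlapping occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges


moves = "w.\u2658g1\u2658f3.. b.\u265Fc7\u265Fc5.. w.\u2659d2\u2659d4.. b.\u265Fc5\u265Fd4.x."
for left, right in train_bpe(moves, num_merges=5):
    print(left, "+", right)
```

Each learned merge becomes one new vocabulary entry, so a larger `vocab_size` corresponds to more merge rounds.
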
### Full Pipeline

```bash
python chess_tokenizer.py
```

## Move Format

The tokenizer is trained on custom chess notation:

| Move | Meaning |
|------|---------|
| `w.♘g1♘f3..` | White knight g1 to f3 |
| `b.♟c7♟c5..` | Black pawn c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |

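The example rows suggest a regular structure: a side (`w` or `b`), a dot, one or more piece-and-square pairs, then a dot followed by an optional `x` for captures and a dot followed by an optional `+` for check. The following sketch parses that shape with a regular expression. The grammar is inferred only from the five examples in the table, so moves in the real data (checkmate or other markers, for instance) may need more cases; `parse_move` is a hypothetical helper, not part of `chess_tokenizer`.

```python
import re

# Grammar inferred from the example rows only (an assumption, not a spec):
#   <side> "." (<piece><square>)+ "." ["x"] "." ["+"]
# Pieces are the chess symbols U+2654..U+265F.
MOVE_RE = re.compile(
    r"^(?P<side>[wb])\."
    r"(?P<steps>(?:[\u2654-\u265F][a-h][1-8])+)"
    r"\.(?P<capture>x?)\.(?P<check>\+?)$"
)
STEP_RE = re.compile(r"([\u2654-\u265F])([a-h][1-8])")


def parse_move(token: str) -> dict:
    """Split one move string into its fields; raise ValueError if it does not fit."""
    m = MOVE_RE.match(token)
    if m is None:
        raise ValueError(f"unrecognized move: {token!r}")
    return {
        "side": "white" if m["side"] == "w" else "black",
        "steps": STEP_RE.findall(m["steps"]),  # [(piece, square), ...]
        "capture": m["capture"] == "x",
        "check": m["check"] == "+",
    }


print(parse_move("b.\u265Fc5\u265Fd4.x."))
```

A castle shows up as four steps (king from and to, then rook from and to), which is why the pattern accepts any number of piece-and-square pairs.
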
### Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |

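The twelve symbols occupy the contiguous Unicode range U+2654 to U+265F, and each takes three bytes in UTF-8, so a tokenizer that works on bytes, as tiktoken does, sees three bytes per piece before any merges. A small sketch using only the standard library recovers the color and piece from the Unicode character name; `describe_piece` is a hypothetical helper.

```python
import unicodedata


def describe_piece(symbol: str) -> tuple[str, str]:
    """Return (color, piece) for a chess symbol, e.g. ("white", "knight")."""
    # Character names look like "WHITE CHESS KNIGHT" or "BLACK CHESS PAWN".
    color, chess, piece = unicodedata.name(symbol).split()
    if chess != "CHESS":
        raise ValueError(f"not a chess symbol: {symbol!r}")
    return color.lower(), piece.lower()


for cp in range(0x2654, 0x2660):
    symbol = chr(cp)
    color, piece = describe_piece(symbol)
    print(f"U+{cp:04X} {symbol} {color} {piece} ({len(symbol.encode('utf-8'))} bytes)")
```
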
## API

| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train BPE on the `angeluriot/chess_games` dataset |
| `save(tok, path)` | Save `vocab.json` and `config.json` |
| `upload(tok, repo_id)` | Push to the Hugging Face Hub |
| `load_tiktoken(repo_id)` | Load as a tiktoken `Encoding` |

## License

MIT