AndreCosta committed · verified
Commit c0e91f7 · Parent: b04d3e9

Upload bpe_tokenizer.py with huggingface_hub

Files changed (1): bpe_tokenizer.py (+737 -0)

bpe_tokenizer.py ADDED
"""
bpe_tokenizer.py
================
Byte Pair Encoding (BPE) algorithm implemented from scratch in pure Python.

This module is part of the project:
"A bilingual PT+EN LLM with BPE tokenizer and training loop
implemented from scratch, with didactic and documented code"

Author : André Costa
License : MIT

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THEORETICAL BACKGROUND
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What is tokenization?
---------------------
Language models do not operate on raw characters or whole words —
they operate on *tokens*, intermediate text units. Tokenization is
the process of converting text into sequences of integers that the
model can process.

    Text → Tokens → Integer IDs → Embeddings → Model

Why not use whole words?
------------------------
Word-level vocabularies have two serious problems:

1. Huge vocabulary: Portuguese and English together have hundreds
   of thousands of words. Each would need its own embedding —
   infeasible for small models.

2. Unknown words (OOV - Out of Vocabulary): any word not seen
   during training produces an <UNK> token, losing semantic
   information.

Why not use individual characters?
----------------------------------
Character vocabularies solve OOV, but produce very long sequences.
The sentence "Hello world" becomes 11 tokens instead of 2.
Long sequences increase computational cost quadratically in the
Transformer attention mechanism (O(n²)).

BPE as a compromise
-------------------
Byte Pair Encoding (Gage, 1994; Sennrich et al., 2016) finds a
middle ground: it starts with characters and iteratively merges the
most frequent pairs, building a subword vocabulary.

    "learning"  → ["learn", "ing"]
    "learned"   → ["learn", "ed"]
    "learnable" → ["learn", "able"]

The prefix "learn" is shared — the model learns morphology
naturally, without explicit supervision.

References:
- Gage, P. (1994). A new algorithm for data compression.
  C Users Journal, 12(2), 23-38.
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine
  translation of rare words with subword units. ACL 2016.
- Radford, A. et al. (2019). Language models are unsupervised
  multitask learners. (GPT-2 — popularized BPE in LLMs)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
BPE ALGORITHM — OVERVIEW
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Training (offline, done once on the corpus):
    1. Encode each byte of the corpus as an initial token (base vocab = 256)
    2. Count the frequency of all adjacent token pairs
    3. Select the most frequent pair (p_max)
    4. Create a new token = merge of p_max
    5. Replace all occurrences of p_max with the new token
    6. Repeat steps 2–5 until reaching the desired vocab_size

Encoding (online, for each new text):
    1. Convert text to bytes
    2. Apply learned merges in the order they were learned
    3. Return the sequence of IDs

Decoding:
    1. Convert IDs back to bytes using the vocabulary
    2. Decode the bytes as UTF-8
"""

# ─────────────────────────────────────────────────────────────
# Imports — standard Python library only, no external dependencies
# except 'regex' (better Unicode support than 're')
# ─────────────────────────────────────────────────────────────
import os
import json
import regex  # pip install regex
from collections import defaultdict


# ─────────────────────────────────────────────────────────────
# Helper functions
# ─────────────────────────────────────────────────────────────

def get_pairs(ids: list[int]) -> dict[tuple[int, int], int]:
    """
    Count the frequency of all adjacent pairs in a sequence.

    This is the central operation of BPE. For each position i in the
    sequence, forms the pair (ids[i], ids[i+1]) and increments its count.

    Example:
        ids = [1, 2, 3, 2, 1, 2]
        returns: {(1,2): 2, (2,3): 1, (3,2): 1, (2,1): 1}

    Complexity: O(n), where n = len(ids)

    Args:
        ids: Sequence of token IDs.

    Returns:
        Dictionary mapping each pair to its frequency.
    """
    counts: dict[tuple[int, int], int] = defaultdict(int)
    for pair in zip(ids, ids[1:]):
        counts[pair] += 1
    return counts

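A quick standalone sanity check of the pair-counting step (the helper is restated here so the snippet runs on its own; the expected dict mirrors the docstring example):

```python
from collections import defaultdict

def get_pairs(ids: list[int]) -> dict[tuple[int, int], int]:
    # One left-to-right pass; zip(ids, ids[1:]) yields each adjacent pair.
    counts: dict[tuple[int, int], int] = defaultdict(int)
    for pair in zip(ids, ids[1:]):
        counts[pair] += 1
    return counts

# (1, 2) occurs twice; every other adjacent pair occurs once.
assert get_pairs([1, 2, 3, 2, 1, 2]) == {(1, 2): 2, (2, 3): 1, (3, 2): 1, (2, 1): 1}
# A single-element sequence has no adjacent pairs at all.
assert get_pairs([7]) == {}
```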
def merge_sequence(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """
    Replace all occurrences of `pair` in `ids` with token `new_id`.

    This function implements the "merge" step of BPE. It scans the
    sequence once from left to right, replacing each occurrence of the
    target pair with the new token.

    Example:
        ids = [1, 2, 3, 1, 2]
        pair = (1, 2)
        new_id = 99
        returns: [99, 3, 99]

    Note: Replacement is non-overlapping. The sequence (1,2,1,2) with
    pair=(1,2) results in [99, 99], not [1, 99, 2] or [99, 1, 2].

    Complexity: O(n), where n = len(ids)

    Args:
        ids: Original sequence of IDs.
        pair: Token pair to merge (a, b).
        new_id: ID of the new token resulting from the merge.

    Returns:
        New sequence with merges applied.
    """
    result: list[int] = []
    i = 0
    while i < len(ids):
        # Check whether the pair starts at position i (and is not the last element)
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            result.append(new_id)
            i += 2  # skip the two tokens that were merged
        else:
            result.append(ids[i])
            i += 1
    return result

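The same kind of standalone check for the merge step (again restated so it runs independently; both cases come straight from the docstring, including the non-overlapping one):

```python
def merge_sequence(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    # Scan once; on a match emit new_id and skip both members of the pair.
    result: list[int] = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            result.append(new_id)
            i += 2
        else:
            result.append(ids[i])
            i += 1
    return result

assert merge_sequence([1, 2, 3, 1, 2], (1, 2), 99) == [99, 3, 99]
# Non-overlapping: (1, 2, 1, 2) collapses to two tokens, not one.
assert merge_sequence([1, 2, 1, 2], (1, 2), 99) == [99, 99]
```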
# ─────────────────────────────────────────────────────────────
# Pre-tokenization pattern (GPT-4 / tiktoken style)
# ─────────────────────────────────────────────────────────────

# This regex pattern splits text into "words" before applying BPE.
# Pre-tokenization ensures BPE never merges tokens across word
# boundaries (e.g., the space before "hello" and the "h" in "hello"
# will never form a single token).
#
# The pattern captures, in order of priority:
#   1. English contractions: 's, 't, 're, 've, 'm, 'll, 'd
#   2. Words optionally preceded by a space
#   3. Runs of digits, at most 3 at a time
#   4. Non-alphanumeric characters optionally preceded by a space
#   5. Whitespace (without capturing the space that precedes words)
#
# Reference: https://github.com/openai/tiktoken
GPT4_SPLIT_PATTERN = regex.compile(
    r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
)

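To see what this kind of split does in practice without the third-party `regex` dependency, here is a rough ASCII-only approximation using the stdlib `re` module. The character classes `A-Za-z` and `0-9` stand in for `\p{L}` and `\p{N}`; this simplified pattern is illustrative only, not the one used above:

```python
import re

# Simplified ASCII-only stand-in for GPT4_SPLIT_PATTERN:
# contractions | space?+letters | digits(<=3) | space?+punct | whitespace
ASCII_SPLIT = re.compile(
    r"'(?:[sdmt]|ll|ve|re)| ?[A-Za-z]+|[0-9]{1,3}| ?[^\sA-Za-z0-9]+|\s+"
)

# The leading space sticks to the word that follows it, so BPE can learn
# " world" as a token but never a token spanning two words.
assert ASCII_SPLIT.findall("Hello world") == ["Hello", " world"]
assert ASCII_SPLIT.findall("it's fine!") == ["it", "'s", " fine", "!"]
```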
# ─────────────────────────────────────────────────────────────
# Main class
# ─────────────────────────────────────────────────────────────

class BPETokenizer:
    """
    Byte Pair Encoding (BPE) tokenizer implemented from scratch.

    This implementation operates directly on UTF-8 bytes, which guarantees:
    - Full coverage of any Unicode text (PT, EN, emojis, etc.)
    - Fixed base vocabulary of exactly 256 tokens (one per byte)
    - No <UNK> tokens — any text is encodable

    Public attributes:
        vocab_size (int): Total vocabulary size after training.
        merges (dict): Table of learned merges. Maps
            (id_a, id_b) → id_new.
        vocab (dict): Full vocabulary. Maps id → bytes.

    Basic usage:
        >>> tokenizer = BPETokenizer(vocab_size=1000)
        >>> tokenizer.train(["Hello world. Olá mundo."])
        >>> ids = tokenizer.encode("Hello")
        >>> tokenizer.decode(ids)
        'Hello'
    """

    def __init__(self, vocab_size: int = 16384):
        """
        Initialize the tokenizer.

        The base vocabulary always starts with the 256 possible bytes (0–255).
        The number of merges to be learned is vocab_size - 256.

        Args:
            vocab_size: Desired final vocabulary size.
                Typical values: 4096, 8192, 16384, 32768.
                Must be greater than 256.

        Raises:
            ValueError: If vocab_size <= 256.
        """
        if vocab_size <= 256:
            raise ValueError(
                f"vocab_size must be greater than 256 (byte base vocabulary). "
                f"Received: {vocab_size}"
            )

        self.vocab_size: int = vocab_size

        # merges: table of merges learned during training
        #   key   : (id_token_a, id_token_b)
        #   value : id_token_new
        # ORDER matters — merges are applied in the order they were learned
        self.merges: dict[tuple[int, int], int] = {}

        # vocab: full dictionary id → byte sequence
        # Initialized with the 256 base bytes; expanded during training
        self.vocab: dict[int, bytes] = {i: bytes([i]) for i in range(256)}

        # Pre-tokenization pattern (splits text into words before BPE)
        self._split_pattern = GPT4_SPLIT_PATTERN

    # ─────────────────────────────────────────────────────────
    # Training
    # ─────────────────────────────────────────────────────────

    def train(self, corpus: list[str], verbose: bool = False) -> None:
        """
        Train the BPE tokenizer on a text corpus.

        Training executes `vocab_size - 256` merge iterations.
        In each iteration:
            1. Count all adjacent pairs in the tokenized corpus
            2. Select the most frequent pair
            3. Record the merge in self.merges
            4. Update self.vocab with the new token
            5. Apply the merge to the corpus

        Total complexity: O(N × M), where:
            N = total number of tokens in the corpus (decreases each merge)
            M = number of merges = vocab_size - 256

        Args:
            corpus: List of strings forming the training corpus.
                Example: ["Text in Portuguese.", "Text in English."]
            verbose: If True, prints progress after each merge.

        Example:
            >>> tok = BPETokenizer(vocab_size=300)
            >>> tok.train(["abracadabra " * 100], verbose=True)
            Merge 1/44 | pair: (b'a', b'b') → token 256 | freq: 200
            ...
        """
        num_merges = self.vocab_size - 256

        # ── Step 1: Pre-tokenization ──────────────────────────────────────
        # Split the corpus into "words" using the regex pattern.
        # Each word is converted to its UTF-8 byte representation.
        #
        # Example:
        #     "Hello world" → ["Hello", " world"]
        #                   → [b'Hello', b' world']
        #
        # Result: list of lists of integers (byte IDs 0–255)
        ids_per_chunk: list[list[int]] = []
        for text in corpus:
            words = regex.findall(self._split_pattern, text)
            for word in words:
                word_bytes = word.encode("utf-8")
                ids_per_chunk.append(list(word_bytes))

        if verbose:
            total_tokens = sum(len(chunk) for chunk in ids_per_chunk)
            print("Pre-tokenization complete.")
            print(f"  Chunks (words): {len(ids_per_chunk)}")
            print(f"  Total initial tokens (bytes): {total_tokens}")
            print(f"  Merges to perform: {num_merges}\n")

        # ── Step 2: Main merge loop ───────────────────────────────────────
        for merge_idx in range(num_merges):

            # Count pairs across all corpus chunks
            pair_counts: dict[tuple[int, int], int] = defaultdict(int)
            for chunk_ids in ids_per_chunk:
                chunk_pairs = get_pairs(chunk_ids)
                for pair, count in chunk_pairs.items():
                    pair_counts[pair] += count

            # If no more pairs exist, the corpus is too small
            if not pair_counts:
                if verbose:
                    print(f"Warning: corpus exhausted after {merge_idx} merges.")
                break

            # Select the most frequent pair
            best_pair = max(pair_counts, key=lambda p: pair_counts[p])
            best_freq = pair_counts[best_pair]

            # ID of the new token = next available integer
            new_id = 256 + merge_idx

            # Record the merge
            self.merges[best_pair] = new_id

            # Update the vocabulary:
            # The new token is the concatenation of the bytes of both merged tokens
            self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]

            # Apply the merge to all corpus chunks
            ids_per_chunk = [
                merge_sequence(chunk, best_pair, new_id)
                for chunk in ids_per_chunk
            ]

            if verbose:
                token_str_a = self.vocab[best_pair[0]]
                token_str_b = self.vocab[best_pair[1]]
                print(
                    f"Merge {merge_idx + 1:>5}/{num_merges} | "
                    f"pair: ({token_str_a!r}, {token_str_b!r}) "
                    f"→ token {new_id} | "
                    f"freq: {best_freq}"
                )

        if verbose:
            total_after = sum(len(chunk) for chunk in ids_per_chunk)
            print("\nTraining complete.")
            print(f"  Final vocabulary: {len(self.vocab)} tokens")
            print(f"  Total tokens after merges: {total_after}")

    # ─────────────────────────────────────────────────────────
    # Encoding
    # ─────────────────────────────────────────────────────────

    def encode(self, text: str) -> list[int]:
        """
        Convert a string into a sequence of token IDs.

        The encoding process follows these steps:
            1. Split text into chunks via pre-tokenization (regex)
            2. Convert each chunk to bytes → list of IDs (0–255)
            3. Apply learned merges in order to each chunk
            4. Concatenate IDs from all chunks

        Applying merges in order is crucial: merges learned first have
        priority. This ensures consistency with training.

        Args:
            text: Text to encode. Can be any UTF-8 string.

        Returns:
            List of integers representing the tokens.

        Raises:
            RuntimeError: If the tokenizer has not been trained (empty merges).

        Example:
            >>> tok.encode("Hello")
            [323, 195]  # IDs depend on training
        """
        if not self.merges:
            raise RuntimeError(
                "The tokenizer has not been trained. "
                "Call .train() before .encode()."
            )

        all_ids: list[int] = []

        chunks = regex.findall(self._split_pattern, text)

        for chunk in chunks:
            # Convert to bytes then to list of integer IDs
            chunk_ids = list(chunk.encode("utf-8"))

            # Apply all learned merges in order
            while len(chunk_ids) >= 2:
                pairs = get_pairs(chunk_ids)

                # Find the pair with the lowest index in self.merges
                # (= pair learned first = highest priority)
                best_pair = min(
                    pairs,
                    key=lambda p: self.merges.get(p, float("inf"))
                )

                # If no pair is in merges, we are done with this chunk
                if best_pair not in self.merges:
                    break

                new_id = self.merges[best_pair]
                chunk_ids = merge_sequence(chunk_ids, best_pair, new_id)

            all_ids.extend(chunk_ids)

        return all_ids

    # ─────────────────────────────────────────────────────────
    # Decoding
    # ─────────────────────────────────────────────────────────

    def decode(self, ids: list[int]) -> str:
        """
        Convert a sequence of IDs back to a string.

        Each ID is mapped to its byte sequence via self.vocab,
        and the bytes are concatenated and decoded as UTF-8.

        Note on UTF-8 errors:
            Individual tokens may correspond to incomplete bytes
            (e.g., the first half of a 2-byte UTF-8 character).
            Therefore, we concatenate ALL bytes before decoding,
            and use errors="replace" to handle invalid sequences
            that may arise from out-of-context IDs.

        Args:
            ids: Sequence of IDs to decode.

        Returns:
            Decoded string.

        Example:
            >>> tok.decode([323, 195])
            'Hello'
        """
        raw_bytes = b"".join(self.vocab[i] for i in ids)
        return raw_bytes.decode("utf-8", errors="replace")

    # ─────────────────────────────────────────────────────────
    # Persistence (save / load)
    # ─────────────────────────────────────────────────────────

    def save(self, path: str) -> None:
        """
        Save the trained tokenizer to disk.

        Creates two files in directory `path`:
            tokenizer.json — metadata and merge table (human-readable)
            vocab.json     — full vocabulary id → byte representation

        JSON format was chosen for being readable, portable and compatible
        with the HuggingFace ecosystem (tokenizers library).

        Structure of tokenizer.json:
            {
                "vocab_size": int,
                "num_merges": int,
                "merges": [[id_a, id_b, id_new], ...]
            }

        Args:
            path: Directory path where files will be saved.
                Created if it does not exist.
        """
        os.makedirs(path, exist_ok=True)

        merges_list = [
            [int(a), int(b), int(new_id)]
            for (a, b), new_id in self.merges.items()
        ]

        tokenizer_data = {
            "vocab_size": self.vocab_size,
            "num_merges": len(self.merges),
            "merges": merges_list,
        }

        with open(os.path.join(path, "tokenizer.json"), "w", encoding="utf-8") as f:
            json.dump(tokenizer_data, f, indent=2, ensure_ascii=False)

        vocab_data = {
            str(token_id): list(token_bytes)
            for token_id, token_bytes in self.vocab.items()
        }

        with open(os.path.join(path, "vocab.json"), "w", encoding="utf-8") as f:
            json.dump(vocab_data, f, indent=2, ensure_ascii=False)

        print(f"Tokenizer saved to '{path}/'")
        print(f"  tokenizer.json — {len(self.merges)} merges")
        print(f"  vocab.json — {len(self.vocab)} tokens")

    @classmethod
    def load(cls, path: str) -> "BPETokenizer":
        """
        Load a previously saved tokenizer.

        Class method (factory method): creates a new instance and fills
        it with data loaded from disk, without needing to re-train.

        Args:
            path: Directory where files were saved by .save().

        Returns:
            Ready-to-use BPETokenizer instance.

        Raises:
            FileNotFoundError: If files do not exist at the given path.

        Example:
            >>> tok = BPETokenizer.load("./my_tokenizer")
            >>> tok.encode("Hello world")
        """
        tokenizer_path = os.path.join(path, "tokenizer.json")
        vocab_path = os.path.join(path, "vocab.json")

        with open(tokenizer_path, "r", encoding="utf-8") as f:
            tokenizer_data = json.load(f)

        with open(vocab_path, "r", encoding="utf-8") as f:
            vocab_data = json.load(f)

        tokenizer = cls(vocab_size=tokenizer_data["vocab_size"])

        for a, b, new_id in tokenizer_data["merges"]:
            tokenizer.merges[(int(a), int(b))] = int(new_id)

        tokenizer.vocab = {
            int(token_id): bytes(token_bytes)
            for token_id, token_bytes in vocab_data.items()
        }

        print(f"Tokenizer loaded from '{path}/'")
        print(f"  vocab_size : {tokenizer.vocab_size}")
        print(f"  merges     : {len(tokenizer.merges)}")

        return tokenizer

    # ─────────────────────────────────────────────────────────
    # Utilities and inspection
    # ─────────────────────────────────────────────────────────

    def token_to_str(self, token_id: int) -> str:
        """
        Return the human-readable representation of a token by its ID.

        Useful for inspecting the vocabulary and understanding which
        subwords the tokenizer has learned.

        Args:
            token_id: ID of the token to inspect.

        Returns:
            String representing the token bytes (decoded if possible).
        """
        token_bytes = self.vocab.get(token_id, b"<unknown>")
        try:
            return token_bytes.decode("utf-8")
        except UnicodeDecodeError:
            return repr(token_bytes)

    def vocab_stats(self) -> None:
        """
        Print statistics about the trained vocabulary.

        Displays the 20 longest learned tokens, which generally
        correspond to words or subwords that are very frequent in the corpus.
        """
        print(f"\n{'=' * 50}")
        print("  BPE Vocabulary Statistics")
        print(f"{'=' * 50}")
        print(f"  vocab_size  : {self.vocab_size}")
        print("  base tokens : 256 (bytes 0–255)")
        print(f"  merges      : {len(self.merges)}")
        print("\n  20 longest tokens (frequent subwords):")

        sorted_vocab = sorted(
            [(tid, tb) for tid, tb in self.vocab.items() if tid >= 256],
            key=lambda x: len(x[1]),
            reverse=True
        )

        for token_id, token_bytes in sorted_vocab[:20]:
            try:
                readable = token_bytes.decode("utf-8")
            except UnicodeDecodeError:
                readable = repr(token_bytes)
            print(f"  [{token_id:>6}] {repr(readable):<30} ({len(token_bytes)} bytes)")

        print(f"{'=' * 50}\n")

    def __repr__(self) -> str:
        status = "trained" if self.merges else "not trained"
        return (
            f"BPETokenizer("
            f"vocab_size={self.vocab_size}, "
            f"merges={len(self.merges)}, "
            f"status='{status}')"
        )

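Before the demo below, a small standalone illustration of the byte-level property that `decode`'s docstring describes: multibyte UTF-8 characters survive round-tripping as long as all token bytes are concatenated before decoding, and `errors="replace"` only matters for bytes taken out of context. This snippet uses raw bytes only, not the class:

```python
# 'Olá' is 4 bytes in UTF-8: the accented 'á' alone takes two bytes (0xC3 0xA1).
raw = "Olá".encode("utf-8")
assert list(raw) == [0x4F, 0x6C, 0xC3, 0xA1]

# Concatenating ALL bytes before decoding keeps 'á' intact:
assert raw.decode("utf-8") == "Olá"

# A token holding only HALF of 'á' is invalid UTF-8 on its own;
# errors="replace" degrades to U+FFFD instead of raising:
assert bytes([0xC3]).decode("utf-8", errors="replace") == "\ufffd"
```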
# ─────────────────────────────────────────────────────────────
# Demo / quick test
# ─────────────────────────────────────────────────────────────

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="BPE Tokenizer — train and validate")
    parser.add_argument(
        "--demo",
        action="store_true",
        help="Run a quick demo with a small vocab (320 tokens). "
             "Does NOT produce a tokenizer suitable for training."
    )
    args = parser.parse_args()

    # ── Demo mode (--demo flag) ───────────────────────────────────────────
    # Trains on a tiny built-in corpus with vocab_size=320.
    # Useful for understanding how BPE works, but the resulting
    # tokenizer is NOT saved to ./tokenizer and cannot be used
    # by data_pipeline.py.
    if args.demo:
        print("=" * 60)
        print("  BPETokenizer — Demo mode (vocab_size=320)")
        print("  NOTE: this tokenizer is for illustration only.")
        print("  Run without --demo to produce the real tokenizer.")
        print("=" * 60)

        corpus_demo = [
            # Portuguese
            "aprendizado de máquina é fascinante. "
            "redes neurais aprendem padrões complexos. "
            "o modelo aprende a linguagem naturalmente. "
            "aprender, aprendendo, aprendizado, aprendiz. ",
            # English
            "machine learning is fascinating. "
            "neural networks learn complex patterns. "
            "the model learns language naturally. "
            "learn, learning, learned, learner. ",
        ] * 50

        tokenizer = BPETokenizer(vocab_size=320)
        tokenizer.train(corpus_demo, verbose=True)
        tokenizer.vocab_stats()

        tests = [
            "aprendizado", "learning",
            "redes neurais", "neural networks",
            "Olá, mundo!", "Hello, world!",
        ]

        print("Encode/decode tests:")
        print("-" * 50)
        for text in tests:
            ids = tokenizer.encode(text)
            decoded = tokenizer.decode(ids)
            tokens = [tokenizer.token_to_str(i) for i in ids]
            print(f"  Text    : {repr(text)}")
            print(f"  IDs     : {ids}")
            print(f"  Tokens  : {tokens}")
            print(f"  Decoded : {repr(decoded)}")
            print(f"  OK      : {text == decoded}")
            print()

        print("Demo complete. No files were saved.")
        print("Run 'python bpe_tokenizer.py' (without --demo) to train the real tokenizer.")

    # ── Production mode (default) ─────────────────────────────────────────
    # Trains on a representative bilingual corpus with vocab_size=16384.
    # Saves the tokenizer to ./tokenizer/, which is the path expected
    # by data_pipeline.py.
    else:
        print("=" * 60)
        print("  BPETokenizer — Training (vocab_size=16384)")
        print("  Output: ./tokenizer/")
        print("=" * 60)

        corpus_production = [
            # Portuguese — representative sample
            "aprendizado de máquina é fascinante. "
            "redes neurais aprendem padrões complexos. "
            "o modelo aprende a linguagem naturalmente. "
            "aprender, aprendendo, aprendizado, aprendiz. "
            "o brasil é um país de dimensões continentais. "
            "a língua portuguesa é falada em vários países. "
            "ciência de dados e inteligência artificial. "
            "processamento de linguagem natural em português. ",
            # English — representative sample
            "machine learning is fascinating. "
            "neural networks learn complex patterns. "
            "the model learns language naturally. "
            "learn, learning, learned, learner. "
            "artificial intelligence and data science. "
            "natural language processing and transformers. "
            "deep learning models require large datasets. "
            "the quick brown fox jumps over the lazy dog. ",
        ] * 500
        # Repetition raises pair frequencies but not diversity: a corpus this
        # small cannot support 16k distinct merges, so training stops early
        # (with a warning) once no mergeable pairs remain. For a real
        # tokenizer, use a large and varied corpus instead.

        tokenizer = BPETokenizer(vocab_size=16384)
        tokenizer.train(corpus_production, verbose=True)
        tokenizer.vocab_stats()

        # Save to ./tokenizer — the path expected by data_pipeline.py
        tokenizer.save("./tokenizer")

        # Validate save/load round-trip
        print("\nValidating save/load round-trip...")
        tokenizer2 = BPETokenizer.load("./tokenizer")

        for text in ["machine learning", "aprendizado de máquina", "Olá mundo!"]:
            ids = tokenizer2.encode(text)
            decoded = tokenizer2.decode(ids)
            status = "OK" if decoded == text else "FAIL"
            print(f"  [{status}] {repr(text)} → {ids[:5]}{'...' if len(ids) > 5 else ''} → {repr(decoded)}")

        print("\nTokenizer ready. You can now run:")
        print("    python data_pipeline.py --dry-run")