File size: 6,920 Bytes
7f974df | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 | # Walkthrough: SLLM Custom BPE Tokenizer
This document explains the architecture, execution pipeline, and design choices of the custom **Byte-Pair Encoding (BPE)** tokenizer implemented in the `tokenizer/` directory of the `sllm` project.
---
## 🏗️ Overall Architecture & Pipeline
The SLLM tokenizer is a custom-built BPE tokenizer tailored for pre-training small language models on the educational subset of HuggingFace's **FineWeb-Edu** dataset. It integrates custom text normalization, a regex-based pre-tokenization strategy, standard BPE training with byte-level fallback, and packaging utility scripts for high-performance training.
```mermaid
graph TD
A[Raw Text Stream] --> B[normalizer.py: Normalization]
B --> C[pretokenizer.py: Custom Regex Split]
C --> D[bpe.py: Byte-Level Encoding]
D --> E[traintokenizer.py: BPE Trainer]
E --> F[post_processor.py: Template Post-Processing]
F --> G[wrap_tokenizer.py: PreTrainedTokenizerFast Wrapper]
G --> H[tokenize_dataset.py: Packed binary .bin Shards]
```
---
## 📁 Component-by-Component Breakdown
### 1. `normalizer.py` (Text Normalization)
Before any splitting occurs, the raw input text is standardized and cleaned to eliminate noise while preserving syntax and code structure:
* **HTML Stripping & Decoding**: Removes HTML tags using regex and decodes HTML entities (e.g., `&` $\rightarrow$ `&`).
* **Unicode Normalization**: Performs **NFC** normalization to ensure characters like accented letters are represented consistently.
* **Noise Removal**: Eliminates raw control characters, zero-width characters (e.g., zero-width spaces/joins), and the Unicode replacement character (`\ufffd`).
* **Whitespace Control**:
* Collapses multiple consecutive spaces into a single space (preserving leading tabs for code indentation).
* Cleans trailing whitespaces at the end of lines.
* Collapses 3 or more consecutive newlines into exactly two newlines (`\n\n`) to preserve paragraph structure.
---
### 2. `pretokenizer.py` (Custom Regex Segmentation)
Instead of relying on standard GPT-2/Llama pre-tokenization, this model implements a custom, ordered, priority-based regex pre-tokenizer:
1. **Contractions**: `'s`, `'t`, `'re`, `'ve`, `'ll`, `'m`, `'d`.
2. **Abbreviations**: Acronyms and shorthand (e.g., `U.S.A`, `e.g.`, `Ph.D`).
3. **Scientific Notation**: E.g., `1.5e-3`, `3e10`, `2.0E+4` (evaluated *before* decimals to avoid splitting).
4. **Decimal Numbers**: E.g., `3.14` (evaluated *before* integers).
5. **Integers**: E.g., `42`, `1984`.
6. **Multi-character Operators**: Common coding operators like `==`, `!=`, `->`, `<=`, `>=`, `**`, `//`, `+=`, `-=`, `*=`, `/=`.
7. **Snake Case Identifiers**: E.g., `snake_case`, `_private` (evaluated *before* plain words for clean code representation).
8. **Regular Unicode Words**: Alphanumeric words covering non-English languages.
9. **Whitespace**: Preserves sequences of spaces/tabs separately from newlines to keep structural formatting.
10. **Punctuation Catch-all**: Individual punctuation characters.
> [!NOTE]
> The pre-tokenizer uses HuggingFace's `Split` pre-tokenizer with `behavior="isolated"` and `invert=True`, meaning matched strings are isolated and kept as distinct, individual tokens instead of being discarded as delimiters.
---
### 3. `bpe.py` (BPE Model Configuration)
Defines the base tokenizer pipeline:
* **Byte Fallback**: Configures the BPE model with `unk_token=None` and `byte_fallback=True`. This guarantees that *every* character maps to at least one byte-level token, resulting in **zero out-of-vocabulary (OOV)** issues.
* **Pre-Tokenizer Chain**: Sequentially runs the custom Regex pre-tokenizer followed by `ByteLevel(add_prefix_space=False)` to translate character segments to their corresponding byte values.
* **Decoder**: Instantiates the standard `ByteLevelDecoder` to reverse byte conversions, allowing human-readable decoded strings.
* **Trainer Config**: Builds a `BpeTrainer` specifying a vocabulary of `32,000` tokens, minimum merge frequency of `3`, and initial alphabet containing all `256` bytes to enforce the fallback capability.
---
### 4. `post_processor.py` (Sequence Endings)
Once BPE rules have been learned and vocabulary IDs are assigned:
* Attaches `TemplateProcessing` to automatically append `<|endoftext|>` to every sequence.
* For single documents, it maps to `[tokens...] <|endoftext|>`.
* For sequence pairs (useful in downstream tasks like Question-Answering), it automatically maps to `[tokens_A...] <|endoftext|> [tokens_B...] <|endoftext|>`.
---
### 5. `traintokenizer.py` (BPE Training Loop)
* Streams the educational subset of `HuggingFaceFW/fineweb-edu` (`CC-MAIN-2014-49` split).
* Filters out low-quality documents (requires educational score `int_score >= 3`) and documents shorter than 100 characters.
* Feeds documents iteratively into BPE training via `train_from_iterator()`.
* Adds the post-processor and runs comprehensive verification checks against edge cases (equations, scientific numbers, code snippets, byte fallbacks, and contractions).
---
### 6. `wrap_tokenizer.py` (HuggingFace Integration)
Wraps the trained HuggingFace BPE model into `PreTrainedTokenizerFast` from `transformers`:
* Associates `<|endoftext|>` as the `bos_token`, `eos_token`, and `pad_token`.
* Enables compatibility with the `datasets.map()` bulk utility, the HuggingFace Trainer, and PyTorch dataloaders.
* Standardizes right-padding, right-truncation, and context length configurations (`model_max_length=1024`).
---
### 7. `tokenize_dataset.py` (Dataset Packing)
A highly optimized bulk-tokenization utility:
* Tokenizes the streamed FineWeb-Edu dataset up to a target cap (e.g., `3.2` Billion tokens).
* Performs a 99% train and 1% validation split (every 100th document is routed to the validation buffer).
* Concatenates/packs documents sequentially (using `<|endoftext|>` as the document boundary) and writes them to disk as high-performance flat binary shards (`.bin` files of `np.uint16` type).
* Standard shard size is `100,000,000` tokens.
* Provides a memory-mapped helper `load_shard(split, shard_idx)` using `np.memmap` so that models can stream training batches without loading multi-gigabyte files into RAM.
---
## 💡 Key Design Highlights
> [!TIP]
> **Why Byte Fallback is Critical**: By initializing the alphabet with 256 unique byte values and enabling fallback, characters like math symbols ($\nabla$) or emojis don't fail or return an `<unk>` token; instead, they represent themselves as their raw UTF-8 bytes (e.g., $\nabla$ is parsed perfectly as `<0xE2><0x88><0x87>`).
> [!TIP]
> **Code-Aware Features**: The combination of preserving leading tabs in `normalizer.py`, isolating multi-character operators (`==`, `!=`, etc.), and protecting `snake_case` variables guarantees high-fidelity, compact token representation when the language model is trained on code.
|