| # Walkthrough: SLLM Custom BPE Tokenizer |
|
|
| This document explains the architecture, execution pipeline, and design choices of the custom **Byte-Pair Encoding (BPE)** tokenizer implemented in the `tokenizer/` directory of the `sllm` project. |
|
|
| --- |
|
|
| ## 🏗️ Overall Architecture & Pipeline |
|
|
| The SLLM tokenizer is a custom-built BPE tokenizer tailored for pre-training small language models on the educational subset of HuggingFace's **FineWeb-Edu** dataset. It integrates custom text normalization, a regex-based pre-tokenization strategy, standard BPE training with byte-level fallback, and packaging utility scripts for high-performance training. |
|
|
| ```mermaid |
| graph TD |
| A[Raw Text Stream] --> B[normalizer.py: Normalization] |
| B --> C[pretokenizer.py: Custom Regex Split] |
| C --> D[bpe.py: Byte-Level Encoding] |
| D --> E[traintokenizer.py: BPE Trainer] |
| E --> F[post_processor.py: Template Post-Processing] |
| F --> G[wrap_tokenizer.py: PreTrainedTokenizerFast Wrapper] |
| G --> H[tokenize_dataset.py: Packed binary .bin Shards] |
| ``` |
|
|
| --- |
|
|
| ## 📁 Component-by-Component Breakdown |
|
|
| ### 1. `normalizer.py` (Text Normalization) |
| Before any splitting occurs, the raw input text is standardized and cleaned to eliminate noise while preserving syntax and code structure: |
| * **HTML Stripping & Decoding**: Removes HTML tags using regex and decodes HTML entities (e.g., `&` $\rightarrow$ `&`). |
| * **Unicode Normalization**: Performs **NFC** normalization to ensure characters like accented letters are represented consistently. |
| * **Noise Removal**: Eliminates raw control characters, zero-width characters (e.g., zero-width spaces/joins), and the Unicode replacement character (`\ufffd`). |
| * **Whitespace Control**: |
| * Collapses multiple consecutive spaces into a single space (preserving leading tabs for code indentation). |
| * Cleans trailing whitespaces at the end of lines. |
| * Collapses 3 or more consecutive newlines into exactly two newlines (`\n\n`) to preserve paragraph structure. |
|
|
| --- |
|
|
| ### 2. `pretokenizer.py` (Custom Regex Segmentation) |
| Instead of relying on standard GPT-2/Llama pre-tokenization, this model implements a custom, ordered, priority-based regex pre-tokenizer: |
| 1. **Contractions**: `'s`, `'t`, `'re`, `'ve`, `'ll`, `'m`, `'d`. |
| 2. **Abbreviations**: Acronyms and shorthand (e.g., `U.S.A`, `e.g.`, `Ph.D`). |
| 3. **Scientific Notation**: E.g., `1.5e-3`, `3e10`, `2.0E+4` (evaluated *before* decimals to avoid splitting). |
| 4. **Decimal Numbers**: E.g., `3.14` (evaluated *before* integers). |
| 5. **Integers**: E.g., `42`, `1984`. |
| 6. **Multi-character Operators**: Common coding operators like `==`, `!=`, `->`, `<=`, `>=`, `**`, `//`, `+=`, `-=`, `*=`, `/=`. |
| 7. **Snake Case Identifiers**: E.g., `snake_case`, `_private` (evaluated *before* plain words for clean code representation). |
| 8. **Regular Unicode Words**: Alphanumeric words covering non-English languages. |
| 9. **Whitespace**: Preserves sequences of spaces/tabs separately from newlines to keep structural formatting. |
| 10. **Punctuation Catch-all**: Individual punctuation characters. |
|
|
| > [!NOTE] |
| > The pre-tokenizer uses HuggingFace's `Split` pre-tokenizer with `behavior="isolated"` and `invert=True`, meaning matched strings are isolated and kept as distinct, individual tokens instead of being discarded as delimiters. |
|
|
| --- |
|
|
| ### 3. `bpe.py` (BPE Model Configuration) |
| Defines the base tokenizer pipeline: |
| * **Byte Fallback**: Configures the BPE model with `unk_token=None` and `byte_fallback=True`. This guarantees that *every* character maps to at least one byte-level token, resulting in **zero out-of-vocabulary (OOV)** issues. |
| * **Pre-Tokenizer Chain**: Sequentially runs the custom Regex pre-tokenizer followed by `ByteLevel(add_prefix_space=False)` to translate character segments to their corresponding byte values. |
| * **Decoder**: Instantiates the standard `ByteLevelDecoder` to reverse byte conversions, allowing human-readable decoded strings. |
| * **Trainer Config**: Builds a `BpeTrainer` specifying a vocabulary of `32,000` tokens, minimum merge frequency of `3`, and initial alphabet containing all `256` bytes to enforce the fallback capability. |
|
|
| --- |
|
|
| ### 4. `post_processor.py` (Sequence Endings) |
| Once BPE rules have been learned and vocabulary IDs are assigned: |
| * Attaches `TemplateProcessing` to automatically append `<|endoftext|>` to every sequence. |
| * For single documents, it maps to `[tokens...] <|endoftext|>`. |
| * For sequence pairs (useful in downstream tasks like Question-Answering), it automatically maps to `[tokens_A...] <|endoftext|> [tokens_B...] <|endoftext|>`. |
| |
| --- |
| |
| ### 5. `traintokenizer.py` (BPE Training Loop) |
| * Streams the educational subset of `HuggingFaceFW/fineweb-edu` (`CC-MAIN-2014-49` split). |
| * Filters out low-quality documents (requires educational score `int_score >= 3`) and documents shorter than 100 characters. |
| * Feeds documents iteratively into BPE training via `train_from_iterator()`. |
| * Adds the post-processor and runs comprehensive verification checks against edge cases (equations, scientific numbers, code snippets, byte fallbacks, and contractions). |
|
|
| --- |
|
|
| ### 6. `wrap_tokenizer.py` (HuggingFace Integration) |
| Wraps the trained HuggingFace BPE model into `PreTrainedTokenizerFast` from `transformers`: |
| * Associates `<|endoftext|>` as the `bos_token`, `eos_token`, and `pad_token`. |
| * Enables compatibility with the `datasets.map()` bulk utility, the HuggingFace Trainer, and PyTorch dataloaders. |
| * Standardizes right-padding, right-truncation, and context length configurations (`model_max_length=1024`). |
|
|
| --- |
|
|
| ### 7. `tokenize_dataset.py` (Dataset Packing) |
| A highly optimized bulk-tokenization utility: |
| * Tokenizes the streamed FineWeb-Edu dataset up to a target cap (e.g., `3.2` Billion tokens). |
| * Performs a 99% train and 1% validation split (every 100th document is routed to the validation buffer). |
| * Concatenates/packs documents sequentially (using `<|endoftext|>` as the document boundary) and writes them to disk as high-performance flat binary shards (`.bin` files of `np.uint16` type). |
| * Standard shard size is `100,000,000` tokens. |
| * Provides a memory-mapped helper `load_shard(split, shard_idx)` using `np.memmap` so that models can stream training batches without loading multi-gigabyte files into RAM. |
| |
| --- |
| |
| ## 💡 Key Design Highlights |
| |
| > [!TIP] |
| > **Why Byte Fallback is Critical**: By initializing the alphabet with 256 unique byte values and enabling fallback, characters like math symbols ($\nabla$) or emojis don't fail or return an `<unk>` token; instead, they represent themselves as their raw UTF-8 bytes (e.g., $\nabla$ is parsed perfectly as `<0xE2><0x88><0x87>`). |
| |
| > [!TIP] |
| > **Code-Aware Features**: The combination of preserving leading tabs in `normalizer.py`, isolating multi-character operators (`==`, `!=`, etc.), and protecting `snake_case` variables guarantees high-fidelity, compact token representation when the language model is trained on code. |
|
|