# Walkthrough: SLLM Custom BPE Tokenizer

This document explains the architecture, execution pipeline, and design choices of the custom **Byte-Pair Encoding (BPE)** tokenizer implemented in the `tokenizer/` directory of the `sllm` project.

---

## 🏗️ Overall Architecture & Pipeline

The SLLM tokenizer is a custom-built BPE tokenizer tailored for pre-training small language models on the educational subset of HuggingFace's **FineWeb-Edu** dataset. It combines custom text normalization, a regex-based pre-tokenization strategy, standard BPE training with byte-level fallback, and utility scripts that pack the tokenized data for high-performance training.

```mermaid
graph TD
    A[Raw Text Stream] --> B[normalizer.py: Normalization]
    B --> C[pretokenizer.py: Custom Regex Split]
    C --> D[bpe.py: Byte-Level Encoding]
    D --> E[traintokenizer.py: BPE Trainer]
    E --> F[post_processor.py: Template Post-Processing]
    F --> G[wrap_tokenizer.py: PreTrainedTokenizerFast Wrapper]
    G --> H[tokenize_dataset.py: Packed binary .bin Shards]
```

---

## 📁 Component-by-Component Breakdown

### 1. `normalizer.py` (Text Normalization)

Before any splitting occurs, the raw input text is standardized and cleaned to eliminate noise while preserving syntax and code structure:

* **HTML Stripping & Decoding**: Removes HTML tags using regex and decodes HTML entities (e.g., `&amp;` $\rightarrow$ `&`).
* **Unicode Normalization**: Performs **NFC** normalization so that characters such as accented letters are represented consistently.
* **Noise Removal**: Eliminates raw control characters, zero-width characters (e.g., zero-width spaces/joiners), and the Unicode replacement character (`\ufffd`).
* **Whitespace Control**:
    * Collapses multiple consecutive spaces into a single space (preserving leading tabs for code indentation).
    * Strips trailing whitespace at the end of lines.
    * Collapses 3 or more consecutive newlines into exactly two newlines (`\n\n`) to preserve paragraph structure.

---

### 2. `pretokenizer.py` (Custom Regex Segmentation)

Instead of relying on standard GPT-2/Llama pre-tokenization, the tokenizer uses a custom, ordered, priority-based regex pre-tokenizer. Patterns are tried in this order:

1. **Contractions**: `'s`, `'t`, `'re`, `'ve`, `'ll`, `'m`, `'d`.
2. **Abbreviations**: Acronyms and shorthand (e.g., `U.S.A`, `e.g.`, `Ph.D`).
3. **Scientific Notation**: E.g., `1.5e-3`, `3e10`, `2.0E+4` (evaluated *before* decimals to avoid splitting).
4. **Decimal Numbers**: E.g., `3.14` (evaluated *before* integers).
5. **Integers**: E.g., `42`, `1984`.
6. **Multi-character Operators**: Common coding operators like `==`, `!=`, `->`, `<=`, `>=`, `**`, `//`, `+=`, `-=`, `*=`, `/=`.
7. **Snake Case Identifiers**: E.g., `snake_case`, `_private` (evaluated *before* plain words for clean code representation).
8. **Regular Unicode Words**: Alphanumeric words, including those in non-English languages.
9. **Whitespace**: Preserves sequences of spaces/tabs separately from newlines to keep structural formatting.
10. **Punctuation Catch-all**: Individual punctuation characters.

> [!NOTE]
> The pre-tokenizer uses HuggingFace's `Split` pre-tokenizer with `behavior="isolated"` and `invert=True`, meaning matched strings are isolated and kept as distinct, individual tokens instead of being discarded as delimiters.

---

### 3. `bpe.py` (BPE Model Configuration)

Defines the base tokenizer pipeline:

* **Byte Fallback**: Configures the BPE model with `unk_token=None` and `byte_fallback=True`. This guarantees that *every* character maps to at least one byte-level token, resulting in **zero out-of-vocabulary (OOV)** issues.
* **Pre-Tokenizer Chain**: Sequentially runs the custom regex pre-tokenizer followed by `ByteLevel(add_prefix_space=False)` to translate character segments to their corresponding byte values.
* **Decoder**: Instantiates the standard `ByteLevelDecoder` to reverse the byte conversion, so decoded strings are human-readable.
* **Trainer Config**: Builds a `BpeTrainer` with a vocabulary of `32,000` tokens, a minimum merge frequency of `3`, and an initial alphabet containing all `256` bytes to enforce the fallback capability.

---

### 4. `post_processor.py` (Sequence Endings)

Once BPE rules have been learned and vocabulary IDs are assigned:

* Attaches `TemplateProcessing` to automatically append `<|endoftext|>` to every sequence.
* For single documents, it maps to `[tokens...] <|endoftext|>`.
* For sequence pairs (useful in downstream tasks like Question-Answering), it maps to `[tokens_A...] <|endoftext|> [tokens_B...] <|endoftext|>`.

---

### 5. `traintokenizer.py` (BPE Training Loop)

* Streams the educational subset of `HuggingFaceFW/fineweb-edu` (the `CC-MAIN-2014-49` config).
* Filters out low-quality documents (requires educational score `int_score >= 3`) and documents shorter than 100 characters.
* Feeds documents iteratively into BPE training via `train_from_iterator()`.
* Adds the post-processor and runs verification checks against edge cases (equations, scientific numbers, code snippets, byte fallbacks, and contractions).

---

### 6. `wrap_tokenizer.py` (HuggingFace Integration)

Wraps the trained BPE tokenizer in `PreTrainedTokenizerFast` from `transformers`:

* Registers `<|endoftext|>` as the `bos_token`, `eos_token`, and `pad_token`.
* Enables compatibility with the `datasets.map()` bulk utility, the HuggingFace Trainer, and PyTorch dataloaders.
* Standardizes right-padding, right-truncation, and context length configuration (`model_max_length=1024`).

---

### 7. `tokenize_dataset.py` (Dataset Packing)

A bulk-tokenization utility optimized for throughput:

* Tokenizes the streamed FineWeb-Edu dataset up to a target cap (e.g., `3.2` billion tokens).
* Splits the data into 99% train and 1% validation (every 100th document is routed to the validation buffer).
* Concatenates/packs documents sequentially (using `<|endoftext|>` as the document boundary) and writes them to disk as flat binary shards (`.bin` files of `np.uint16` type).
* Standard shard size is `100,000,000` tokens.
* Provides a memory-mapped helper `load_shard(split, shard_idx)` using `np.memmap` so that models can stream training batches without loading multi-gigabyte files into RAM.

---

## 💡 Key Design Highlights

> [!TIP]
> **Why Byte Fallback is Critical**: By initializing the alphabet with 256 unique byte values and enabling fallback, characters like math symbols ($\nabla$) or emojis don't fail or return an `<unk>` token; instead, they are represented by their raw UTF-8 bytes (e.g., $\nabla$ is encoded as `<0xE2><0x88><0x87>`).

> [!TIP]
> **Code-Aware Features**: Preserving leading tabs in `normalizer.py`, isolating multi-character operators (`==`, `!=`, etc.), and protecting `snake_case` variables together help keep token representations compact and faithful when the language model is trained on code.
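
The two tips above can be tried out in isolation with the standard library. The sketch below is a rough approximation, not the code in `pretokenizer.py` or `bpe.py`: the regex patterns, the `PRETOKEN_PATTERN` name, and the `pretokenize` and `byte_fallback_tokens` helpers are all assumptions made for illustration. It relies on Python's `re` trying alternation branches left to right, which gives the same "earlier rule wins" behavior as the priority list, and it renders raw UTF-8 bytes in the `<0xNN>` style the byte-fallback tip describes.

```python
import re

# Hypothetical patterns, ordered like the priority list in the walkthrough.
# In a regex alternation the first branch that matches at a position wins,
# so scientific notation beats decimals, and snake_case beats plain words.
PRETOKEN_PATTERN = re.compile(
    r"""
      '(?:s|t|re|ve|ll|m|d)\b                               # 1. contractions
    | (?:[A-Za-z]\.){2,}                                    # 2. dotted abbreviations (e.g.)
    | \d+(?:\.\d+)?[eE][+-]?\d+                             # 3. scientific notation
    | \d+\.\d+                                              # 4. decimal numbers
    | \d+                                                   # 5. integers
    | ==|!=|->|<=|>=|\*\*|//|\+=|-=|\*=|/=                  # 6. multi-character operators
    | (?:_+[A-Za-z0-9]+|[A-Za-z][A-Za-z0-9]*_)[A-Za-z0-9_]* # 7. snake_case identifiers
    | [^\W_]+                                               # 8. Unicode alphanumeric words
    | [ \t]+                                                # 9a. spaces and tabs
    | \n+                                                   # 9b. newlines, kept separate
    | .                                                     # 10. single-character catch-all
    """,
    re.VERBOSE,
)


def pretokenize(text: str) -> list[str]:
    """Split text into pieces; every character lands in exactly one piece."""
    return [match.group(0) for match in PRETOKEN_PATTERN.finditer(text)]


def byte_fallback_tokens(char: str) -> list[str]:
    """Render a character as one <0xNN> token per UTF-8 byte."""
    return [f"<0x{byte:02X}>" for byte in char.encode("utf-8")]


print(pretokenize("x_val += 1.5e-3 if a != b"))
print(pretokenize("don't"))
print(byte_fallback_tokens("∇"))  # ['<0xE2>', '<0x88>', '<0x87>']
```

Because the final branches cover every character, joining the pieces always reproduces the input, which is a convenient property to assert when experimenting with the pattern order. The abbreviation branch here only covers fully dotted forms such as `e.g.`; forms like `Ph.D` would need a broader pattern.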