Walkthrough: SLLM Custom BPE Tokenizer

This document explains the architecture, execution pipeline, and design choices of the custom Byte-Pair Encoding (BPE) tokenizer implemented in the tokenizer/ directory of the sllm project.

🏗️ Overall Architecture & Pipeline

The SLLM tokenizer is a custom-built BPE tokenizer tailored for pre-training small language models on the educational subset of HuggingFace's FineWeb-Edu dataset. It integrates custom text normalization, a regex-based pre-tokenization strategy, standard BPE training with byte-level fallback, and packaging utility scripts for high-performance training.

graph TD
    A[Raw Text Stream] --> B[normalizer.py: Normalization]
    B --> C[pretokenizer.py: Custom Regex Split]
    C --> D[bpe.py: Byte-Level Encoding]
    D --> E[traintokenizer.py: BPE Trainer]
    E --> F[post_processor.py: Template Post-Processing]
    F --> G[wrap_tokenizer.py: PreTrainedTokenizerFast Wrapper]
    G --> H[tokenize_dataset.py: Packed binary .bin Shards]

📁 Component-by-Component Breakdown

1. `normalizer.py` (Text Normalization)

Before any splitting occurs, the raw input text is standardized and cleaned to eliminate noise while preserving syntax and code structure:

HTML Stripping & Decoding: Removes HTML tags using regex and decodes HTML entities (e.g., & $\rightarrow$ &).
Unicode Normalization: Performs NFC normalization to ensure characters like accented letters are represented consistently.
Noise Removal: Eliminates raw control characters, zero-width characters (e.g., zero-width spaces/joins), and the Unicode replacement character (\ufffd).
Whitespace Control:
- Collapses multiple consecutive spaces into a single space (preserving leading tabs for code indentation).
- Cleans trailing whitespaces at the end of lines.
- Collapses 3 or more consecutive newlines into exactly two newlines (\n\n) to preserve paragraph structure.

2. `pretokenizer.py` (Custom Regex Segmentation)

Instead of relying on standard GPT-2/Llama pre-tokenization, this model implements a custom, ordered, priority-based regex pre-tokenizer:

Contractions: 's, 't, 're, 've, 'll, 'm, 'd.
Abbreviations: Acronyms and shorthand (e.g., U.S.A, e.g., Ph.D).
Scientific Notation: E.g., 1.5e-3, 3e10, 2.0E+4 (evaluated before decimals to avoid splitting).
Decimal Numbers: E.g., 3.14 (evaluated before integers).
Integers: E.g., 42, 1984.
Multi-character Operators: Common coding operators like ==, !=, ->, <=, >=, **, //, +=, -=, *=, /=.
Snake Case Identifiers: E.g., snake_case, _private (evaluated before plain words for clean code representation).
Regular Unicode Words: Alphanumeric words covering non-English languages.
Whitespace: Preserves sequences of spaces/tabs separately from newlines to keep structural formatting.
Punctuation Catch-all: Individual punctuation characters.

The pre-tokenizer uses HuggingFace's Split pre-tokenizer with behavior="isolated" and invert=True, meaning matched strings are isolated and kept as distinct, individual tokens instead of being discarded as delimiters.

3. `bpe.py` (BPE Model Configuration)

Defines the base tokenizer pipeline:

Byte Fallback: Configures the BPE model with unk_token=None and byte_fallback=True. This guarantees that every character maps to at least one byte-level token, resulting in zero out-of-vocabulary (OOV) issues.
Pre-Tokenizer Chain: Sequentially runs the custom Regex pre-tokenizer followed by ByteLevel(add_prefix_space=False) to translate character segments to their corresponding byte values.
Decoder: Instantiates the standard ByteLevelDecoder to reverse byte conversions, allowing human-readable decoded strings.
Trainer Config: Builds a BpeTrainer specifying a vocabulary of 32,000 tokens, minimum merge frequency of 3, and initial alphabet containing all 256 bytes to enforce the fallback capability.

4. `post_processor.py` (Sequence Endings)

Once BPE rules have been learned and vocabulary IDs are assigned:

Attaches TemplateProcessing to automatically append <|endoftext|> to every sequence.
For single documents, it maps to [tokens...] <|endoftext|>.
For sequence pairs (useful in downstream tasks like Question-Answering), it automatically maps to [tokens_A...] <|endoftext|> [tokens_B...] <|endoftext|>.

5. `traintokenizer.py` (BPE Training Loop)

Streams the educational subset of HuggingFaceFW/fineweb-edu (CC-MAIN-2014-49 split).
Filters out low-quality documents (requires educational score int_score >= 3) and documents shorter than 100 characters.
Feeds documents iteratively into BPE training via train_from_iterator().
Adds the post-processor and runs comprehensive verification checks against edge cases (equations, scientific numbers, code snippets, byte fallbacks, and contractions).

6. `wrap_tokenizer.py` (HuggingFace Integration)

Wraps the trained HuggingFace BPE model into PreTrainedTokenizerFast from transformers:

Associates <|endoftext|> as the bos_token, eos_token, and pad_token.
Enables compatibility with the datasets.map() bulk utility, the HuggingFace Trainer, and PyTorch dataloaders.
Standardizes right-padding, right-truncation, and context length configurations (model_max_length=1024).

7. `tokenize_dataset.py` (Dataset Packing)

A highly optimized bulk-tokenization utility:

Tokenizes the streamed FineWeb-Edu dataset up to a target cap (e.g., 3.2 Billion tokens).
Performs a 99% train and 1% validation split (every 100th document is routed to the validation buffer).
Concatenates/packs documents sequentially (using <|endoftext|> as the document boundary) and writes them to disk as high-performance flat binary shards (.bin files of np.uint16 type).
Standard shard size is 100,000,000 tokens.
Provides a memory-mapped helper load_shard(split, shard_idx) using np.memmap so that models can stream training batches without loading multi-gigabyte files into RAM.

💡 Key Design Highlights

Why Byte Fallback is Critical: By initializing the alphabet with 256 unique byte values and enabling fallback, characters like math symbols ($\nabla$) or emojis don't fail or return an <unk> token; instead, they represent themselves as their raw UTF-8 bytes (e.g., $\nabla$ is parsed perfectly as <0xE2><0x88><0x87>).

Code-Aware Features: The combination of preserving leading tabs in normalizer.py, isolating multi-character operators (==, !=, etc.), and protecting snake_case variables guarantees high-fidelity, compact token representation when the language model is trained on code.

Walkthrough: SLLM Custom BPE Tokenizer

🏗️ Overall Architecture & Pipeline

📁 Component-by-Component Breakdown

1. normalizer.py (Text Normalization)

2. pretokenizer.py (Custom Regex Segmentation)

3. bpe.py (BPE Model Configuration)

4. post_processor.py (Sequence Endings)

5. traintokenizer.py (BPE Training Loop)

6. wrap_tokenizer.py (HuggingFace Integration)

7. tokenize_dataset.py (Dataset Packing)

💡 Key Design Highlights

1. `normalizer.py` (Text Normalization)

2. `pretokenizer.py` (Custom Regex Segmentation)

3. `bpe.py` (BPE Model Configuration)

4. `post_processor.py` (Sequence Endings)

5. `traintokenizer.py` (BPE Training Loop)

6. `wrap_tokenizer.py` (HuggingFace Integration)

7. `tokenize_dataset.py` (Dataset Packing)