Walkthrough: SLLM Custom BPE Tokenizer
This document explains the architecture, execution pipeline, and design choices of the custom Byte-Pair Encoding (BPE) tokenizer implemented in the tokenizer/ directory of the sllm project.
ποΈ Overall Architecture & Pipeline
The SLLM tokenizer is a custom-built BPE tokenizer tailored for pre-training small language models on the educational subset of HuggingFace's FineWeb-Edu dataset. It integrates custom text normalization, a regex-based pre-tokenization strategy, standard BPE training with byte-level fallback, and packaging utility scripts for high-performance training.
graph TD
A[Raw Text Stream] --> B[normalizer.py: Normalization]
B --> C[pretokenizer.py: Custom Regex Split]
C --> D[bpe.py: Byte-Level Encoding]
D --> E[traintokenizer.py: BPE Trainer]
E --> F[post_processor.py: Template Post-Processing]
F --> G[wrap_tokenizer.py: PreTrainedTokenizerFast Wrapper]
G --> H[tokenize_dataset.py: Packed binary .bin Shards]
π Component-by-Component Breakdown
1. normalizer.py (Text Normalization)
Before any splitting occurs, the raw input text is standardized and cleaned to eliminate noise while preserving syntax and code structure:
- HTML Stripping & Decoding: Removes HTML tags using regex and decodes HTML entities (e.g.,
&$\rightarrow$&). - Unicode Normalization: Performs NFC normalization to ensure characters like accented letters are represented consistently.
- Noise Removal: Eliminates raw control characters, zero-width characters (e.g., zero-width spaces/joins), and the Unicode replacement character (
\ufffd). - Whitespace Control:
- Collapses multiple consecutive spaces into a single space (preserving leading tabs for code indentation).
- Cleans trailing whitespaces at the end of lines.
- Collapses 3 or more consecutive newlines into exactly two newlines (
\n\n) to preserve paragraph structure.
2. pretokenizer.py (Custom Regex Segmentation)
Instead of relying on standard GPT-2/Llama pre-tokenization, this model implements a custom, ordered, priority-based regex pre-tokenizer:
- Contractions:
's,'t,'re,'ve,'ll,'m,'d. - Abbreviations: Acronyms and shorthand (e.g.,
U.S.A,e.g.,Ph.D). - Scientific Notation: E.g.,
1.5e-3,3e10,2.0E+4(evaluated before decimals to avoid splitting). - Decimal Numbers: E.g.,
3.14(evaluated before integers). - Integers: E.g.,
42,1984. - Multi-character Operators: Common coding operators like
==,!=,->,<=,>=,**,//,+=,-=,*=,/=. - Snake Case Identifiers: E.g.,
snake_case,_private(evaluated before plain words for clean code representation). - Regular Unicode Words: Alphanumeric words covering non-English languages.
- Whitespace: Preserves sequences of spaces/tabs separately from newlines to keep structural formatting.
- Punctuation Catch-all: Individual punctuation characters.
The pre-tokenizer uses HuggingFace's
Splitpre-tokenizer withbehavior="isolated"andinvert=True, meaning matched strings are isolated and kept as distinct, individual tokens instead of being discarded as delimiters.
3. bpe.py (BPE Model Configuration)
Defines the base tokenizer pipeline:
- Byte Fallback: Configures the BPE model with
unk_token=Noneandbyte_fallback=True. This guarantees that every character maps to at least one byte-level token, resulting in zero out-of-vocabulary (OOV) issues. - Pre-Tokenizer Chain: Sequentially runs the custom Regex pre-tokenizer followed by
ByteLevel(add_prefix_space=False)to translate character segments to their corresponding byte values. - Decoder: Instantiates the standard
ByteLevelDecoderto reverse byte conversions, allowing human-readable decoded strings. - Trainer Config: Builds a
BpeTrainerspecifying a vocabulary of32,000tokens, minimum merge frequency of3, and initial alphabet containing all256bytes to enforce the fallback capability.
4. post_processor.py (Sequence Endings)
Once BPE rules have been learned and vocabulary IDs are assigned:
- Attaches
TemplateProcessingto automatically append<|endoftext|>to every sequence. - For single documents, it maps to
[tokens...] <|endoftext|>. - For sequence pairs (useful in downstream tasks like Question-Answering), it automatically maps to
[tokens_A...] <|endoftext|> [tokens_B...] <|endoftext|>.
5. traintokenizer.py (BPE Training Loop)
- Streams the educational subset of
HuggingFaceFW/fineweb-edu(CC-MAIN-2014-49split). - Filters out low-quality documents (requires educational score
int_score >= 3) and documents shorter than 100 characters. - Feeds documents iteratively into BPE training via
train_from_iterator(). - Adds the post-processor and runs comprehensive verification checks against edge cases (equations, scientific numbers, code snippets, byte fallbacks, and contractions).
6. wrap_tokenizer.py (HuggingFace Integration)
Wraps the trained HuggingFace BPE model into PreTrainedTokenizerFast from transformers:
- Associates
<|endoftext|>as thebos_token,eos_token, andpad_token. - Enables compatibility with the
datasets.map()bulk utility, the HuggingFace Trainer, and PyTorch dataloaders. - Standardizes right-padding, right-truncation, and context length configurations (
model_max_length=1024).
7. tokenize_dataset.py (Dataset Packing)
A highly optimized bulk-tokenization utility:
- Tokenizes the streamed FineWeb-Edu dataset up to a target cap (e.g.,
3.2Billion tokens). - Performs a 99% train and 1% validation split (every 100th document is routed to the validation buffer).
- Concatenates/packs documents sequentially (using
<|endoftext|>as the document boundary) and writes them to disk as high-performance flat binary shards (.binfiles ofnp.uint16type). - Standard shard size is
100,000,000tokens. - Provides a memory-mapped helper
load_shard(split, shard_idx)usingnp.memmapso that models can stream training batches without loading multi-gigabyte files into RAM.
π‘ Key Design Highlights
Why Byte Fallback is Critical: By initializing the alphabet with 256 unique byte values and enabling fallback, characters like math symbols ($\nabla$) or emojis don't fail or return an
<unk>token; instead, they represent themselves as their raw UTF-8 bytes (e.g., $\nabla$ is parsed perfectly as<0xE2><0x88><0x87>).
Code-Aware Features: The combination of preserving leading tabs in
normalizer.py, isolating multi-character operators (==,!=, etc.), and protectingsnake_casevariables guarantees high-fidelity, compact token representation when the language model is trained on code.