sllm / tokenizer_walkthrough.md
geeteshcodes's picture
Initial commit
7f974df verified

Walkthrough: SLLM Custom BPE Tokenizer

This document explains the architecture, execution pipeline, and design choices of the custom Byte-Pair Encoding (BPE) tokenizer implemented in the tokenizer/ directory of the sllm project.


πŸ—οΈ Overall Architecture & Pipeline

The SLLM tokenizer is a custom-built BPE tokenizer tailored for pre-training small language models on the educational subset of HuggingFace's FineWeb-Edu dataset. It integrates custom text normalization, a regex-based pre-tokenization strategy, standard BPE training with byte-level fallback, and packaging utility scripts for high-performance training.

graph TD
    A[Raw Text Stream] --> B[normalizer.py: Normalization]
    B --> C[pretokenizer.py: Custom Regex Split]
    C --> D[bpe.py: Byte-Level Encoding]
    D --> E[traintokenizer.py: BPE Trainer]
    E --> F[post_processor.py: Template Post-Processing]
    F --> G[wrap_tokenizer.py: PreTrainedTokenizerFast Wrapper]
    G --> H[tokenize_dataset.py: Packed binary .bin Shards]

πŸ“ Component-by-Component Breakdown

1. normalizer.py (Text Normalization)

Before any splitting occurs, the raw input text is standardized and cleaned to eliminate noise while preserving syntax and code structure:

  • HTML Stripping & Decoding: Removes HTML tags using regex and decodes HTML entities (e.g., & $\rightarrow$ &).
  • Unicode Normalization: Performs NFC normalization to ensure characters like accented letters are represented consistently.
  • Noise Removal: Eliminates raw control characters, zero-width characters (e.g., zero-width spaces/joins), and the Unicode replacement character (\ufffd).
  • Whitespace Control:
    • Collapses multiple consecutive spaces into a single space (preserving leading tabs for code indentation).
    • Cleans trailing whitespaces at the end of lines.
    • Collapses 3 or more consecutive newlines into exactly two newlines (\n\n) to preserve paragraph structure.

2. pretokenizer.py (Custom Regex Segmentation)

Instead of relying on standard GPT-2/Llama pre-tokenization, this model implements a custom, ordered, priority-based regex pre-tokenizer:

  1. Contractions: 's, 't, 're, 've, 'll, 'm, 'd.
  2. Abbreviations: Acronyms and shorthand (e.g., U.S.A, e.g., Ph.D).
  3. Scientific Notation: E.g., 1.5e-3, 3e10, 2.0E+4 (evaluated before decimals to avoid splitting).
  4. Decimal Numbers: E.g., 3.14 (evaluated before integers).
  5. Integers: E.g., 42, 1984.
  6. Multi-character Operators: Common coding operators like ==, !=, ->, <=, >=, **, //, +=, -=, *=, /=.
  7. Snake Case Identifiers: E.g., snake_case, _private (evaluated before plain words for clean code representation).
  8. Regular Unicode Words: Alphanumeric words covering non-English languages.
  9. Whitespace: Preserves sequences of spaces/tabs separately from newlines to keep structural formatting.
  10. Punctuation Catch-all: Individual punctuation characters.

The pre-tokenizer uses HuggingFace's Split pre-tokenizer with behavior="isolated" and invert=True, meaning matched strings are isolated and kept as distinct, individual tokens instead of being discarded as delimiters.


3. bpe.py (BPE Model Configuration)

Defines the base tokenizer pipeline:

  • Byte Fallback: Configures the BPE model with unk_token=None and byte_fallback=True. This guarantees that every character maps to at least one byte-level token, resulting in zero out-of-vocabulary (OOV) issues.
  • Pre-Tokenizer Chain: Sequentially runs the custom Regex pre-tokenizer followed by ByteLevel(add_prefix_space=False) to translate character segments to their corresponding byte values.
  • Decoder: Instantiates the standard ByteLevelDecoder to reverse byte conversions, allowing human-readable decoded strings.
  • Trainer Config: Builds a BpeTrainer specifying a vocabulary of 32,000 tokens, minimum merge frequency of 3, and initial alphabet containing all 256 bytes to enforce the fallback capability.

4. post_processor.py (Sequence Endings)

Once BPE rules have been learned and vocabulary IDs are assigned:

  • Attaches TemplateProcessing to automatically append <|endoftext|> to every sequence.
  • For single documents, it maps to [tokens...] <|endoftext|>.
  • For sequence pairs (useful in downstream tasks like Question-Answering), it automatically maps to [tokens_A...] <|endoftext|> [tokens_B...] <|endoftext|>.

5. traintokenizer.py (BPE Training Loop)

  • Streams the educational subset of HuggingFaceFW/fineweb-edu (CC-MAIN-2014-49 split).
  • Filters out low-quality documents (requires educational score int_score >= 3) and documents shorter than 100 characters.
  • Feeds documents iteratively into BPE training via train_from_iterator().
  • Adds the post-processor and runs comprehensive verification checks against edge cases (equations, scientific numbers, code snippets, byte fallbacks, and contractions).

6. wrap_tokenizer.py (HuggingFace Integration)

Wraps the trained HuggingFace BPE model into PreTrainedTokenizerFast from transformers:

  • Associates <|endoftext|> as the bos_token, eos_token, and pad_token.
  • Enables compatibility with the datasets.map() bulk utility, the HuggingFace Trainer, and PyTorch dataloaders.
  • Standardizes right-padding, right-truncation, and context length configurations (model_max_length=1024).

7. tokenize_dataset.py (Dataset Packing)

A highly optimized bulk-tokenization utility:

  • Tokenizes the streamed FineWeb-Edu dataset up to a target cap (e.g., 3.2 Billion tokens).
  • Performs a 99% train and 1% validation split (every 100th document is routed to the validation buffer).
  • Concatenates/packs documents sequentially (using <|endoftext|> as the document boundary) and writes them to disk as high-performance flat binary shards (.bin files of np.uint16 type).
  • Standard shard size is 100,000,000 tokens.
  • Provides a memory-mapped helper load_shard(split, shard_idx) using np.memmap so that models can stream training batches without loading multi-gigabyte files into RAM.

πŸ’‘ Key Design Highlights

Why Byte Fallback is Critical: By initializing the alphabet with 256 unique byte values and enabling fallback, characters like math symbols ($\nabla$) or emojis don't fail or return an <unk> token; instead, they represent themselves as their raw UTF-8 bytes (e.g., $\nabla$ is parsed perfectly as <0xE2><0x88><0x87>).

Code-Aware Features: The combination of preserving leading tabs in normalizer.py, isolating multi-character operators (==, !=, etc.), and protecting snake_case variables guarantees high-fidelity, compact token representation when the language model is trained on code.