sllm / tokenizer_walkthrough.md

Initial commit

7f974df verified 3 days ago

6.92 kB

	# Walkthrough: SLLM Custom BPE Tokenizer

	This document explains the architecture, execution pipeline, and design choices of the custom Byte-Pair Encoding (BPE) tokenizer implemented in the `tokenizer/` directory of the `sllm` project.

	---

	## 🏗️ Overall Architecture & Pipeline

	The SLLM tokenizer is a custom-built BPE tokenizer tailored for pre-training small language models on the educational subset of HuggingFace's FineWeb-Edu dataset. It integrates custom text normalization, a regex-based pre-tokenization strategy, standard BPE training with byte-level fallback, and packaging utility scripts for high-performance training.

	```mermaid
	graph TD
	A[Raw Text Stream] --> B[normalizer.py: Normalization]
	B --> C[pretokenizer.py: Custom Regex Split]
	C --> D[bpe.py: Byte-Level Encoding]
	D --> E[traintokenizer.py: BPE Trainer]
	E --> F[post_processor.py: Template Post-Processing]
	F --> G[wrap_tokenizer.py: PreTrainedTokenizerFast Wrapper]
	G --> H[tokenize_dataset.py: Packed binary .bin Shards]
	```

	---

	## 📁 Component-by-Component Breakdown

	### 1. `normalizer.py` (Text Normalization)
	Before any splitting occurs, the raw input text is standardized and cleaned to eliminate noise while preserving syntax and code structure:
	* HTML Stripping & Decoding: Removes HTML tags using regex and decodes HTML entities (e.g., `&` $\rightarrow$ `&`).
	* Unicode Normalization: Performs NFC normalization to ensure characters like accented letters are represented consistently.
	* Noise Removal: Eliminates raw control characters, zero-width characters (e.g., zero-width spaces/joins), and the Unicode replacement character (`\ufffd`).
	* Whitespace Control:
	* Collapses multiple consecutive spaces into a single space (preserving leading tabs for code indentation).
	* Cleans trailing whitespaces at the end of lines.
	* Collapses 3 or more consecutive newlines into exactly two newlines (`\n\n`) to preserve paragraph structure.

	---

	### 2. `pretokenizer.py` (Custom Regex Segmentation)
	Instead of relying on standard GPT-2/Llama pre-tokenization, this model implements a custom, ordered, priority-based regex pre-tokenizer:
	1. Contractions: `'s`, `'t`, `'re`, `'ve`, `'ll`, `'m`, `'d`.
	2. Abbreviations: Acronyms and shorthand (e.g., `U.S.A`, `e.g.`, `Ph.D`).
	3. Scientific Notation: E.g., `1.5e-3`, `3e10`, `2.0E+4` (evaluated before decimals to avoid splitting).
	4. Decimal Numbers: E.g., `3.14` (evaluated before integers).
	5. Integers: E.g., `42`, `1984`.
	6. Multi-character Operators: Common coding operators like `==`, `!=`, `->`, `<=`, `>=`, `*`, `//`, `+=`, `-=`, `=`, `/=`.
	7. Snake Case Identifiers: E.g., `snake_case`, `_private` (evaluated before plain words for clean code representation).
	8. Regular Unicode Words: Alphanumeric words covering non-English languages.
	9. Whitespace: Preserves sequences of spaces/tabs separately from newlines to keep structural formatting.
	10. Punctuation Catch-all: Individual punctuation characters.

	> [!NOTE]
	> The pre-tokenizer uses HuggingFace's `Split` pre-tokenizer with `behavior="isolated"` and `invert=True`, meaning matched strings are isolated and kept as distinct, individual tokens instead of being discarded as delimiters.

	---

	### 3. `bpe.py` (BPE Model Configuration)
	Defines the base tokenizer pipeline:
	* Byte Fallback: Configures the BPE model with `unk_token=None` and `byte_fallback=True`. This guarantees that every character maps to at least one byte-level token, resulting in zero out-of-vocabulary (OOV) issues.
	* Pre-Tokenizer Chain: Sequentially runs the custom Regex pre-tokenizer followed by `ByteLevel(add_prefix_space=False)` to translate character segments to their corresponding byte values.
	* Decoder: Instantiates the standard `ByteLevelDecoder` to reverse byte conversions, allowing human-readable decoded strings.
	* Trainer Config: Builds a `BpeTrainer` specifying a vocabulary of `32,000` tokens, minimum merge frequency of `3`, and initial alphabet containing all `256` bytes to enforce the fallback capability.

	---

	### 4. `post_processor.py` (Sequence Endings)
	Once BPE rules have been learned and vocabulary IDs are assigned:
	* Attaches `TemplateProcessing` to automatically append `<\|endoftext\|>` to every sequence.
	* For single documents, it maps to `[tokens...] <\|endoftext\|>`.
	* For sequence pairs (useful in downstream tasks like Question-Answering), it automatically maps to `[tokens_A...] <\|endoftext\|> [tokens_B...] <\|endoftext\|>`.

	---

	### 5. `traintokenizer.py` (BPE Training Loop)
	* Streams the educational subset of `HuggingFaceFW/fineweb-edu` (`CC-MAIN-2014-49` split).
	* Filters out low-quality documents (requires educational score `int_score >= 3`) and documents shorter than 100 characters.
	* Feeds documents iteratively into BPE training via `train_from_iterator()`.
	* Adds the post-processor and runs comprehensive verification checks against edge cases (equations, scientific numbers, code snippets, byte fallbacks, and contractions).

	---

	### 6. `wrap_tokenizer.py` (HuggingFace Integration)
	Wraps the trained HuggingFace BPE model into `PreTrainedTokenizerFast` from `transformers`:
	* Associates `<\|endoftext\|>` as the `bos_token`, `eos_token`, and `pad_token`.
	* Enables compatibility with the `datasets.map()` bulk utility, the HuggingFace Trainer, and PyTorch dataloaders.
	* Standardizes right-padding, right-truncation, and context length configurations (`model_max_length=1024`).

	---

	### 7. `tokenize_dataset.py` (Dataset Packing)
	A highly optimized bulk-tokenization utility:
	* Tokenizes the streamed FineWeb-Edu dataset up to a target cap (e.g., `3.2` Billion tokens).
	* Performs a 99% train and 1% validation split (every 100th document is routed to the validation buffer).
	* Concatenates/packs documents sequentially (using `<\|endoftext\|>` as the document boundary) and writes them to disk as high-performance flat binary shards (`.bin` files of `np.uint16` type).
	* Standard shard size is `100,000,000` tokens.
	* Provides a memory-mapped helper `load_shard(split, shard_idx)` using `np.memmap` so that models can stream training batches without loading multi-gigabyte files into RAM.

	---

	## 💡 Key Design Highlights

	> [!TIP]
	> Why Byte Fallback is Critical: By initializing the alphabet with 256 unique byte values and enabling fallback, characters like math symbols ($\nabla$) or emojis don't fail or return an `<unk>` token; instead, they represent themselves as their raw UTF-8 bytes (e.g., $\nabla$ is parsed perfectly as `<0xE2><0x88><0x87>`).

	> [!TIP]
	> Code-Aware Features: The combination of preserving leading tabs in `normalizer.py`, isolating multi-character operators (`==`, `!=`, etc.), and protecting `snake_case` variables guarantees high-fidelity, compact token representation when the language model is trained on code.