File size: 6,920 Bytes
7f974df
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
# Walkthrough: SLLM Custom BPE Tokenizer

This document explains the architecture, execution pipeline, and design choices of the custom **Byte-Pair Encoding (BPE)** tokenizer implemented in the `tokenizer/` directory of the `sllm` project.

---

## 🏗️ Overall Architecture & Pipeline

The SLLM tokenizer is a custom-built BPE tokenizer tailored for pre-training small language models on the educational subset of HuggingFace's **FineWeb-Edu** dataset. It integrates custom text normalization, a regex-based pre-tokenization strategy, standard BPE training with byte-level fallback, and packaging utility scripts for high-performance training.

```mermaid
graph TD
    A[Raw Text Stream] --> B[normalizer.py: Normalization]
    B --> C[pretokenizer.py: Custom Regex Split]
    C --> D[bpe.py: Byte-Level Encoding]
    D --> E[traintokenizer.py: BPE Trainer]
    E --> F[post_processor.py: Template Post-Processing]
    F --> G[wrap_tokenizer.py: PreTrainedTokenizerFast Wrapper]
    G --> H[tokenize_dataset.py: Packed binary .bin Shards]
```

---

## 📁 Component-by-Component Breakdown

### 1. `normalizer.py` (Text Normalization)
Before any splitting occurs, the raw input text is standardized and cleaned to eliminate noise while preserving syntax and code structure:
* **HTML Stripping & Decoding**: Removes HTML tags using regex and decodes HTML entities (e.g., `&` $\rightarrow$ `&`).
* **Unicode Normalization**: Performs **NFC** normalization to ensure characters like accented letters are represented consistently.
* **Noise Removal**: Eliminates raw control characters, zero-width characters (e.g., zero-width spaces/joins), and the Unicode replacement character (`\ufffd`).
* **Whitespace Control**:
  * Collapses multiple consecutive spaces into a single space (preserving leading tabs for code indentation).
  * Cleans trailing whitespaces at the end of lines.
  * Collapses 3 or more consecutive newlines into exactly two newlines (`\n\n`) to preserve paragraph structure.

---

### 2. `pretokenizer.py` (Custom Regex Segmentation)
Instead of relying on standard GPT-2/Llama pre-tokenization, this model implements a custom, ordered, priority-based regex pre-tokenizer:
1. **Contractions**: `'s`, `'t`, `'re`, `'ve`, `'ll`, `'m`, `'d`.
2. **Abbreviations**: Acronyms and shorthand (e.g., `U.S.A`, `e.g.`, `Ph.D`).
3. **Scientific Notation**: E.g., `1.5e-3`, `3e10`, `2.0E+4` (evaluated *before* decimals to avoid splitting).
4. **Decimal Numbers**: E.g., `3.14` (evaluated *before* integers).
5. **Integers**: E.g., `42`, `1984`.
6. **Multi-character Operators**: Common coding operators like `==`, `!=`, `->`, `<=`, `>=`, `**`, `//`, `+=`, `-=`, `*=`, `/=`.
7. **Snake Case Identifiers**: E.g., `snake_case`, `_private` (evaluated *before* plain words for clean code representation).
8. **Regular Unicode Words**: Alphanumeric words covering non-English languages.
9. **Whitespace**: Preserves sequences of spaces/tabs separately from newlines to keep structural formatting.
10. **Punctuation Catch-all**: Individual punctuation characters.

> [!NOTE]
> The pre-tokenizer uses HuggingFace's `Split` pre-tokenizer with `behavior="isolated"` and `invert=True`, meaning matched strings are isolated and kept as distinct, individual tokens instead of being discarded as delimiters.

---

### 3. `bpe.py` (BPE Model Configuration)
Defines the base tokenizer pipeline:
* **Byte Fallback**: Configures the BPE model with `unk_token=None` and `byte_fallback=True`. This guarantees that *every* character maps to at least one byte-level token, resulting in **zero out-of-vocabulary (OOV)** issues.
* **Pre-Tokenizer Chain**: Sequentially runs the custom Regex pre-tokenizer followed by `ByteLevel(add_prefix_space=False)` to translate character segments to their corresponding byte values.
* **Decoder**: Instantiates the standard `ByteLevelDecoder` to reverse byte conversions, allowing human-readable decoded strings.
* **Trainer Config**: Builds a `BpeTrainer` specifying a vocabulary of `32,000` tokens, minimum merge frequency of `3`, and initial alphabet containing all `256` bytes to enforce the fallback capability.

---

### 4. `post_processor.py` (Sequence Endings)
Once BPE rules have been learned and vocabulary IDs are assigned:
* Attaches `TemplateProcessing` to automatically append `<|endoftext|>` to every sequence.
* For single documents, it maps to `[tokens...] <|endoftext|>`.
* For sequence pairs (useful in downstream tasks like Question-Answering), it automatically maps to `[tokens_A...] <|endoftext|> [tokens_B...] <|endoftext|>`.

---

### 5. `traintokenizer.py` (BPE Training Loop)
* Streams the educational subset of `HuggingFaceFW/fineweb-edu` (`CC-MAIN-2014-49` split).
* Filters out low-quality documents (requires educational score `int_score >= 3`) and documents shorter than 100 characters.
* Feeds documents iteratively into BPE training via `train_from_iterator()`.
* Adds the post-processor and runs comprehensive verification checks against edge cases (equations, scientific numbers, code snippets, byte fallbacks, and contractions).

---

### 6. `wrap_tokenizer.py` (HuggingFace Integration)
Wraps the trained HuggingFace BPE model into `PreTrainedTokenizerFast` from `transformers`:
* Associates `<|endoftext|>` as the `bos_token`, `eos_token`, and `pad_token`.
* Enables compatibility with the `datasets.map()` bulk utility, the HuggingFace Trainer, and PyTorch dataloaders.
* Standardizes right-padding, right-truncation, and context length configurations (`model_max_length=1024`).

---

### 7. `tokenize_dataset.py` (Dataset Packing)
A highly optimized bulk-tokenization utility:
* Tokenizes the streamed FineWeb-Edu dataset up to a target cap (e.g., `3.2` Billion tokens).
* Performs a 99% train and 1% validation split (every 100th document is routed to the validation buffer).
* Concatenates/packs documents sequentially (using `<|endoftext|>` as the document boundary) and writes them to disk as high-performance flat binary shards (`.bin` files of `np.uint16` type).
* Standard shard size is `100,000,000` tokens.
* Provides a memory-mapped helper `load_shard(split, shard_idx)` using `np.memmap` so that models can stream training batches without loading multi-gigabyte files into RAM.

---

## 💡 Key Design Highlights

> [!TIP]
> **Why Byte Fallback is Critical**: By initializing the alphabet with 256 unique byte values and enabling fallback, characters like math symbols ($\nabla$) or emojis don't fail or return an `<unk>` token; instead, they represent themselves as their raw UTF-8 bytes (e.g., $\nabla$ is parsed perfectly as `<0xE2><0x88><0x87>`).

> [!TIP]
> **Code-Aware Features**: The combination of preserving leading tabs in `normalizer.py`, isolating multi-character operators (`==`, `!=`, etc.), and protecting `snake_case` variables guarantees high-fidelity, compact token representation when the language model is trained on code.