# Tokenizer Module
This module handles all tokenization tasks for the Mini-LLM project, converting raw text into numerical tokens that the model can process.
## Overview
The tokenizer uses **SentencePiece** with **Byte Pair Encoding (BPE)** to build a 32,000-token vocabulary. BPE is the same algorithm family used by GPT-3, GPT-4, and LLaMA.
## Directory Structure
```
Tokenizer/
├── BPE/                      # BPE tokenizer artifacts
│   ├── spm.model             # Trained SentencePiece model
│   ├── spm.vocab             # Vocabulary file
│   ├── tokenizer.json        # HuggingFace format
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── Unigram/                  # Unigram tokenizer (baseline)
│   └── ...
├── train_spm_bpe.py          # Train BPE tokenizer
├── train_spm_unigram.py      # Train Unigram tokenizer
└── convert_to_hf.py          # Convert to HuggingFace format
```
## How It Works
### 1. Training the Tokenizer
**Script**: `train_spm_bpe.py`
```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="data/raw/merged_text/corpus.txt",
    model_prefix="Tokenizer/BPE/spm",
    vocab_size=32000,
    model_type="bpe",
    byte_fallback=True,        # handles emojis and rare characters
    character_coverage=1.0,
    user_defined_symbols=["<user>", "<assistant>", "<system>"],
)
```
**What happens:**
1. Reads raw text corpus
2. Learns byte-pair merges (e.g., "th" + "e" → "the")
3. Keeps the 32,000 most frequent tokens as the vocabulary
4. Saves model to `spm.model`
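The merge-learning loop in step 2 can be sketched in plain Python. This is a toy illustration of the BPE idea, not the actual SentencePiece internals, and the corpus words and frequencies below are made up:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word split into characters
words = {tuple("newest"): 6, tuple("low"): 5, tuple("lower"): 2}
pair = most_frequent_pair(words)   # ('w', 'e') with weight 8
words = merge_pair(words, pair)    # "newest" becomes ('n', 'e', 'we', 's', 't')
```

A real training run repeats this merge step until the vocabulary reaches the target size (32,000 here), recording each merge in order.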
### 2. Example: Tokenization Process
**Input Text:**
```
"Hello world! <user> write code"
```
**Tokenization Steps:**
```
┌───────────────────────────────────────────┐
│ 1. Text Input                             │
│    "Hello world! <user> write code"       │
└───────────────────────────────────────────┘
                     │
                     ▼
┌───────────────────────────────────────────┐
│ 2. BPE Segmentation                       │
│    ['H', 'ello', '▁world', '!',           │
│     '▁', '<user>', '▁write', '▁code']     │
└───────────────────────────────────────────┘
                     │
                     ▼
┌───────────────────────────────────────────┐
│ 3. Token IDs                              │
│    [334, 3855, 288, 267, 2959,            │
│     354, 267, 12397]                      │
└───────────────────────────────────────────┘
```
**Key Features:**
- `▁` (U+2581) represents a space (SentencePiece convention)
- Special tokens like `<user>` are preserved as single tokens
- Byte fallback handles emojis: 🔥 → `<0xF0><0x9F><0x94><0xA5>`
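The byte-fallback mapping can be sanity-checked without the trained model: the fallback tokens are simply the character's UTF-8 bytes rendered as `<0xNN>`:

```python
# A character outside the learned vocabulary is emitted as one token per
# UTF-8 byte. The fire emoji (U+1F525) is four bytes, hence four tokens.
fire = "\U0001F525"  # 🔥
byte_tokens = [f"<0x{b:02X}>" for b in fire.encode("utf-8")]
print(byte_tokens)  # ['<0xF0>', '<0x9F>', '<0x94>', '<0xA5>']
```

Because every possible byte has a fallback token, no input can ever produce `<unk>` with this configuration.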
### 3. Converting to HuggingFace Format
**Script**: `convert_to_hf.py`
```python
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast(vocab_file="Tokenizer/BPE/spm.model")
tokenizer.add_special_tokens({
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
})
tokenizer.save_pretrained("Tokenizer/BPE")
```
This creates `tokenizer.json` and config files compatible with HuggingFace Transformers.
## Usage
### Load Tokenizer
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
```
### Encode Text
```python
text = "Hello world!"
ids = tokenizer.encode(text)
# Output: [1, 334, 3855, 288, 267, 2]
# [<s>, H, ello, ▁world, !, </s>]
```
### Decode IDs
```python
decoded = tokenizer.decode(ids)
# Output: "<s> Hello world! </s>"
decoded = tokenizer.decode(ids, skip_special_tokens=True)
# Output: "Hello world!"
```
## BPE vs Unigram
| Feature | BPE | Unigram |
|---------|-----|---------|
| **Algorithm** | Merge frequent pairs | Probabilistic segmentation |
| **Emoji Handling** | ✅ Byte fallback | ❌ Creates `<unk>` |
| **URL Handling** | ✅ Clean splits | ⚠️ Unstable |
| **Used By** | GPT-3, GPT-4, LLaMA | T5, ALBERT |
| **Recommendation** | ✅ **Primary** | Baseline only |
## Vocabulary Statistics
- **Total Tokens**: 32,000
- **Special Tokens**: 4 (`<s>`, `</s>`, `<unk>`, `<pad>`)
- **User-Defined**: 3 (`<user>`, `<assistant>`, `<system>`)
- **Coverage**: 100% (byte fallback ensures no `<unk>`)
## Performance
- **Compression Ratio**: ~3.5 bytes/token (English text)
- **Tokenization Speed**: ~1M tokens/second
- **Vocab Usage**: ~70% of tokens used in typical corpus
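The compression ratio is easy to measure on any sample. A minimal helper, assuming the token IDs come from the tokenizer; the string and IDs below are illustrative:

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """UTF-8 bytes of input text per produced token; higher means each
    token covers more raw text."""
    return len(text.encode("utf-8")) / len(token_ids)

# Illustrative: a 12-byte string tokenized into 4 tokens -> 3.0 bytes/token
print(compression_ratio("Hello world!", [334, 3855, 288, 267]))
```

Averaging this ratio over a held-out sample of the corpus gives the ~3.5 bytes/token figure reported above.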
## References
- [SentencePiece Documentation](https://github.com/google/sentencepiece)
- [BPE Paper (Sennrich et al., 2016)](https://arxiv.org/abs/1508.07909)
- [Tokenizer Comparison Report](../tokenizer_report.md)