# Tokenizer Module

This module handles all tokenization tasks for the Mini-LLM project, converting raw text into numerical tokens that the model can process.

## Overview

The tokenizer uses **SentencePiece** with **Byte Pair Encoding (BPE)** to build a 32,000-token vocabulary. BPE is the same family of algorithm used by GPT-3, GPT-4, and LLaMA.

## Directory Structure

```
Tokenizer/
β”œβ”€β”€ BPE/                      # BPE tokenizer artifacts
β”‚   β”œβ”€β”€ spm.model            # Trained SentencePiece model
β”‚   β”œβ”€β”€ spm.vocab            # Vocabulary file
β”‚   β”œβ”€β”€ tokenizer.json       # HuggingFace format
β”‚   β”œβ”€β”€ tokenizer_config.json
β”‚   └── special_tokens_map.json
β”œβ”€β”€ Unigram/                 # Unigram tokenizer (baseline)
β”‚   └── ...
β”œβ”€β”€ train_spm_bpe.py         # Train BPE tokenizer
β”œβ”€β”€ train_spm_unigram.py     # Train Unigram tokenizer
└── convert_to_hf.py         # Convert to HuggingFace format
```

## How It Works

### 1. Training the Tokenizer

**Script**: `train_spm_bpe.py`

```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="data/raw/merged_text/corpus.txt",  # raw training corpus
    model_prefix="Tokenizer/BPE/spm",         # writes spm.model and spm.vocab
    vocab_size=32000,
    model_type="bpe",
    byte_fallback=True,       # unknown characters (emojis, rare symbols) fall back to raw UTF-8 bytes
    character_coverage=1.0,   # cover every character seen in the corpus
    user_defined_symbols=["<user>", "<assistant>", "<system>"]  # chat-role tokens kept atomic
)
```

**What happens:**
1. Reads the raw text corpus
2. Learns byte-pair merges (e.g., "th" + "e" β†’ "the")
3. Keeps the 32,000 most frequent pieces as the vocabulary
4. Saves the model to `spm.model` and the vocabulary to `spm.vocab`
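
Once training finishes, the model can be loaded and round-tripped as a quick sanity check (a minimal sketch):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
assert sp.get_piece_size() == 32000  # matches vocab_size from training

# SentencePiece tokenization is lossless: decode(encode(text)) == text
ids = sp.encode("Hello world!")
assert sp.decode(ids) == "Hello world!"
```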

### 2. Example: Tokenization Process

**Input Text:**
```
"Hello world! <user> write code </s>"
```

**Tokenization Steps:**

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. Text Input                           β”‚
β”‚    "Hello world! <user> write code"     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 2. BPE Segmentation                     β”‚
β”‚    ['H', 'ello', '▁world', '!',         β”‚
β”‚     '▁', '<user>', '▁write', '▁code']   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 3. Token IDs                            β”‚
β”‚    [334, 3855, 288, 267, 2959,          β”‚
β”‚     354, 267, 12397]                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

**Key Features:**
- `▁` represents a space (SentencePiece convention)
- Special tokens like `<user>` are preserved as single pieces
- Byte fallback handles emojis: πŸ”₯ β†’ `<0xF0><0x9F><0x94><0xA5>`
- The token IDs above are illustrative; exact values depend on the trained model
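
These behaviors can be verified directly on the trained model (a minimal sketch; exact pieces depend on the training run):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

# User-defined symbols survive as single pieces
print(sp.encode("<user> write code", out_type=str))
# e.g. ['▁', '<user>', '▁write', '▁code']

# Byte fallback decomposes unseen characters into UTF-8 byte pieces
print(sp.encode("πŸ”₯", out_type=str))
# e.g. ['▁', '<0xF0>', '<0x9F>', '<0x94>', '<0xA5>']
```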

### 3. Converting to HuggingFace Format

**Script**: `convert_to_hf.py`

```python
from transformers import LlamaTokenizerFast

# Wrap the trained SentencePiece model in a HuggingFace fast tokenizer
tokenizer = LlamaTokenizerFast(vocab_file="Tokenizer/BPE/spm.model")
tokenizer.add_special_tokens({
    'bos_token': '<s>',
    'eos_token': '</s>',
    'unk_token': '<unk>',
    'pad_token': '<pad>'
})
tokenizer.save_pretrained("Tokenizer/BPE")  # writes tokenizer.json and config files
```

This creates `tokenizer.json` and config files compatible with HuggingFace Transformers.
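
A quick way to verify the conversion is to compare segmentations from both libraries (a minimal sketch; the two can differ on leading-space handling, so a mismatch there is expected rather than a bug):

```python
import sentencepiece as spm
from transformers import AutoTokenizer

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
hf = AutoTokenizer.from_pretrained("Tokenizer/BPE")

text = "Hello world! <user> write code"
print(sp.encode(text, out_type=str))  # SentencePiece pieces
print(hf.tokenize(text))              # HuggingFace pieces, should largely agree
```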

## Usage

### Load Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
```

### Encode Text

```python
text = "Hello world!"
ids = tokenizer.encode(text)
# Output: [1, 334, 3855, 288, 267, 2]
#         [<s>, H, ello, ▁world, !, </s>]
```

### Decode IDs

```python
decoded = tokenizer.decode(ids)
# Output: "<s> Hello world! </s>"

decoded = tokenizer.decode(ids, skip_special_tokens=True)
# Output: "Hello world!"
```
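
Because a `<pad>` token was registered during conversion, batches can be padded directly (a minimal sketch; the padded IDs shown are illustrative):

```python
batch = tokenizer(["Hello world!", "Hi"], padding=True)
# The shorter row is right-padded with the <pad> ID, e.g.:
# batch["input_ids"]      -> [[1, 334, 3855, 288, 267, 2], [1, 4021, 2, 3, 3, 3]]
# batch["attention_mask"] -> [[1, 1, 1, 1, 1, 1],          [1, 1, 1, 0, 0, 0]]
```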

## BPE vs Unigram

| Feature | BPE | Unigram |
|---------|-----|---------|
| **Algorithm** | Merge frequent pairs | Probabilistic segmentation |
| **Emoji Handling** | βœ… Byte fallback | ❌ Creates `<unk>` |
| **URL Handling** | βœ… Clean splits | ⚠️ Unstable |
| **Used By** | GPT-3, GPT-4, LLaMA | T5, ALBERT |
| **Recommendation** | βœ… **Primary** | Baseline only |
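
The emoji row can be reproduced by encoding the same string with both models (a minimal sketch; `Tokenizer/Unigram/spm.model` is an assumed path mirroring the BPE layout):

```python
import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
uni = spm.SentencePieceProcessor(model_file="Tokenizer/Unigram/spm.model")  # assumed path

print(bpe.encode("fire πŸ”₯", out_type=str))  # emoji falls back to byte pieces
print(uni.encode("fire πŸ”₯", out_type=str))  # expect an '<unk>' piece instead
```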

## Vocabulary Statistics

- **Total Tokens**: 32,000
- **Special Tokens**: 4 (`<s>`, `</s>`, `<unk>`, `<pad>`)
- **User-Defined**: 3 (`<user>`, `<assistant>`, `<system>`)
- **Coverage**: 100% (byte fallback ensures no `<unk>`)

## Performance

- **Compression Ratio**: ~3.5 bytes/token (English text)
- **Tokenization Speed**: ~1M tokens/second
- **Vocab Usage**: ~70% of tokens used in typical corpus
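
The bytes-per-token figure can be re-measured on a sample of the corpus (a minimal sketch, reusing the `tokenizer` loaded in the Usage section):

```python
with open("data/raw/merged_text/corpus.txt", encoding="utf-8") as f:
    sample = f.read(1_000_000)  # first ~1M characters

n_tokens = len(tokenizer.encode(sample, add_special_tokens=False))
print(f"{len(sample.encode('utf-8')) / n_tokens:.2f} bytes/token")
```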

## References

- [SentencePiece Documentation](https://github.com/google/sentencepiece)
- [BPE Paper (Sennrich et al., 2016)](https://arxiv.org/abs/1508.07909)
- [Tokenizer Comparison Report](../tokenizer_report.md)