# Tokenizer Module

This module handles all tokenization tasks for the Mini-LLM project, converting raw text into numerical tokens that the model can process.

## Overview

The tokenizer uses **SentencePiece** with **Byte Pair Encoding (BPE)** to build a 32,000-token vocabulary. BPE is the same algorithm family used by GPT-3, GPT-4, and LLaMA.

## Directory Structure

```
Tokenizer/
├── BPE/                        # BPE tokenizer artifacts
│   ├── spm.model               # Trained SentencePiece model
│   ├── spm.vocab               # Vocabulary file
│   ├── tokenizer.json          # HuggingFace format
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── Unigram/                    # Unigram tokenizer (baseline)
│   └── ...
├── train_spm_bpe.py            # Train BPE tokenizer
├── train_spm_unigram.py        # Train Unigram tokenizer
└── convert_to_hf.py            # Convert to HuggingFace format
```

## How It Works

### 1. Training the Tokenizer

**Script**: `train_spm_bpe.py`

```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="data/raw/merged_text/corpus.txt",
    model_prefix="Tokenizer/BPE/spm",
    vocab_size=32000,
    model_type="bpe",
    byte_fallback=True,          # Handles emojis, special chars
    character_coverage=1.0,
    user_defined_symbols=["<user>", "<assistant>", "<system>"]
)
```

**What happens:**
1. Reads the raw text corpus
2. Learns byte-pair merges (e.g., "th" + "e" → "the")
3. Builds a vocabulary of the 32,000 most frequent tokens
4. Saves the model to `spm.model`

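After training, it is worth sanity-checking the artifact before wiring it into the rest of the pipeline. A minimal sketch, assuming the model was written to `Tokenizer/BPE/spm.model` as in the call above:

```python
import sentencepiece as spm

# Load the freshly trained model
sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

print(sp.get_piece_size())                        # expected: 32000
print(sp.encode("Hello world!", out_type=str))    # subword pieces
print(sp.encode("Hello world!", out_type=int))    # corresponding token IDs
```
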
### 2. Example: Tokenization Process

**Input Text:**
```
"Hello world! <user> write code </s>"
```

**Tokenization Steps:**

```
┌───────────────────────────────────────────┐
│ 1. Text Input                             │
│ "Hello world! <user> write code"          │
└───────────────────────────────────────────┘
                     ↓
┌───────────────────────────────────────────┐
│ 2. BPE Segmentation                       │
│ ['H', 'ello', '▁world', '!',              │
│  '▁', '<user>', '▁write', '▁code']        │
└───────────────────────────────────────────┘
                     ↓
┌───────────────────────────────────────────┐
│ 3. Token IDs                              │
│ [334, 3855, 288, 267, 2959,               │
│  354, 267, 12397]                         │
└───────────────────────────────────────────┘
```

**Key Features:**
- `▁` represents a space (SentencePiece convention)
- Special tokens like `<user>` are preserved
- Byte fallback handles emojis: 🔥 → `<0xF0><0x9F><0x94><0xA5>`

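The segmentation above can be reproduced directly with the SentencePiece runtime. A small sketch, assuming the trained model from step 1 is available at `Tokenizer/BPE/spm.model`:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

# Markers registered via user_defined_symbols stay intact as single pieces
print(sp.encode("<user> write code", out_type=str))

# Characters with no learned token fall back to raw UTF-8 bytes ...
print(sp.encode("🔥", out_type=str))   # e.g. ['▁', '<0xF0>', '<0x9F>', '<0x94>', '<0xA5>']

# ... and decoding reassembles those bytes into the original character
print(sp.decode(sp.encode("🔥", out_type=int)))
```
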
### 3. Converting to HuggingFace Format

**Script**: `convert_to_hf.py`

```python
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast(vocab_file="Tokenizer/BPE/spm.model")
tokenizer.add_special_tokens({
    'bos_token': '<s>',
    'eos_token': '</s>',
    'unk_token': '<unk>',
    'pad_token': '<pad>'
})
tokenizer.save_pretrained("Tokenizer/BPE")
```

This creates `tokenizer.json` and config files compatible with HuggingFace Transformers.

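A quick check that the conversion preserved the vocabulary and special tokens, assuming the files were saved to `Tokenizer/BPE` as above:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Tokenizer/BPE")

print(tok.vocab_size)                                              # expected: 32000
print(tok.bos_token, tok.eos_token, tok.unk_token, tok.pad_token)  # <s> </s> <unk> <pad>

# The chat markers should each map to a single ID rather than being split into pieces
print(tok.convert_tokens_to_ids(["<user>", "<assistant>", "<system>"]))
```
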
## Usage

### Load Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
```

### Encode Text

```python
text = "Hello world!"
ids = tokenizer.encode(text)
# Output: [1, 334, 3855, 288, 267, 2]
#         [<s>, H, ello, ▁world, !, </s>]
```

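Since a `<pad>` token is registered, batches of uneven length can be padded in one call. A short sketch; the inputs are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
batch = tokenizer(["Hello world!", "Hi"], padding=True)

print(batch["input_ids"])       # the shorter sequence is padded with the <pad> ID
print(batch["attention_mask"])  # 0 marks the padding positions
```
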
### Decode IDs

```python
decoded = tokenizer.decode(ids)
# Output: "<s> Hello world! </s>"

decoded = tokenizer.decode(ids, skip_special_tokens=True)
# Output: "Hello world!"
```

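The `<user>` / `<assistant>` / `<system>` markers exist so chat turns can be delimited in training data. A sketch of how a prompt might be assembled; the template below is illustrative, not a convention fixed by the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")

prompt = "<system> You are a helpful assistant. <user> write code <assistant>"
ids = tokenizer.encode(prompt)

# Each marker survives as a single token, so the turn structure is preserved
print(tokenizer.convert_ids_to_tokens(ids))
```
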
## BPE vs Unigram

| Feature | BPE | Unigram |
|---------|-----|---------|
| **Algorithm** | Merge frequent pairs | Probabilistic segmentation |
| **Emoji Handling** | ✅ Byte fallback | ❌ Creates `<unk>` |
| **URL Handling** | ✅ Clean splits | ⚠️ Unstable |
| **Used By** | GPT-3, GPT-4, LLaMA | T5, ALBERT, XLNet |
| **Recommendation** | ✅ **Primary** | Baseline only |

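To reproduce this comparison, both models can be loaded side by side. A sketch, assuming the Unigram baseline was trained with the same `spm` prefix under `Tokenizer/Unigram/` (adjust the path if it differs):

```python
import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
uni = spm.SentencePieceProcessor(model_file="Tokenizer/Unigram/spm.model")

# Compare how each tokenizer segments emoji- and URL-heavy inputs
for text in ["🔥 hot take", "https://example.com/a/b?q=1"]:
    print("BPE:    ", bpe.encode(text, out_type=str))
    print("Unigram:", uni.encode(text, out_type=str))
```
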
## Vocabulary Statistics

- **Total Tokens**: 32,000
- **Special Tokens**: 4 (`<s>`, `</s>`, `<unk>`, `<pad>`)
- **User-Defined**: 3 (`<user>`, `<assistant>`, `<system>`)
- **Coverage**: 100% (byte fallback ensures no `<unk>`)

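These numbers can be recomputed from the model file itself. A minimal sketch, assuming `Tokenizer/BPE/spm.model` is present:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

print("Total tokens:        ", sp.get_piece_size())
print("Byte-fallback pieces:", sum(p.startswith("<0x") for p in pieces))   # expected: 256
print("Markers present:     ", [p for p in ("<user>", "<assistant>", "<system>") if p in set(pieces)])
```
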
## Performance

- **Compression Ratio**: ~3.5 bytes/token (English text)
- **Tokenization Speed**: ~1M tokens/second
- **Vocab Usage**: ~70% of the vocabulary appears in a typical corpus

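Both throughput and compression can be re-measured on any text file; the sketch below reuses the training corpus path from step 1 for illustration, and the figures above remain project-specific:

```python
import time
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
text = open("data/raw/merged_text/corpus.txt", encoding="utf-8").read()

start = time.perf_counter()
ids = sp.encode(text, out_type=int)
elapsed = time.perf_counter() - start

print(f"bytes/token: {len(text.encode('utf-8')) / len(ids):.2f}")
print(f"tokens/sec:  {len(ids) / elapsed:,.0f}")
```
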
## References

- [SentencePiece Documentation](https://github.com/google/sentencepiece)
- [BPE Paper (Sennrich et al., 2016)](https://arxiv.org/abs/1508.07909)
- [Tokenizer Comparison Report](../tokenizer_report.md)