# Tokenizer Module
This module handles all tokenization tasks for the Mini-LLM project, converting raw text into numerical tokens that the model can process.
## Overview
The tokenizer uses **SentencePiece** with **Byte Pair Encoding (BPE)** to build a 32,000-token vocabulary. BPE is the same algorithm family used by GPT-3, GPT-4, and the LLaMA models.
## Directory Structure
```
Tokenizer/
├── BPE/                        # BPE tokenizer artifacts
│   ├── spm.model               # Trained SentencePiece model
│   ├── spm.vocab               # Vocabulary file
│   ├── tokenizer.json          # HuggingFace format
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── Unigram/                    # Unigram tokenizer (baseline)
│   └── ...
├── train_spm_bpe.py            # Train BPE tokenizer
├── train_spm_unigram.py        # Train Unigram tokenizer
└── convert_to_hf.py            # Convert to HuggingFace format
```
## How It Works
### 1. Training the Tokenizer
**Script**: `train_spm_bpe.py`
```python
import sentencepiece as spm
spm.SentencePieceTrainer.Train(
    input="data/raw/merged_text/corpus.txt",
    model_prefix="Tokenizer/BPE/spm",
    vocab_size=32000,
    model_type="bpe",
    byte_fallback=True,   # Handles emojis, special chars
    character_coverage=1.0,
    user_defined_symbols=["<user>", "<assistant>", "<system>"]
)
```
**What happens:**
1. Reads the raw text corpus
2. Learns byte-pair merges (e.g., "th" + "e" → "the")
3. Builds a vocabulary of the 32,000 most frequent tokens
4. Saves the model to `spm.model`
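Once training finishes, the resulting model can be sanity-checked directly with `SentencePieceProcessor`. This is a minimal sketch; the printed pieces and IDs are illustrative and depend on the trained model:
```python
import sentencepiece as spm

# Load the model produced by the training step above
sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

print(sp.vocab_size())                          # expected: 32000
print(sp.encode("Hello world!", out_type=str))  # subword pieces, e.g. ['H', 'ello', '▁world', '!']
print(sp.encode("Hello world!", out_type=int))  # the corresponding token IDs
```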
### 2. Example: Tokenization Process
**Input Text:**
```
"Hello world! <user> write code"
```
**Tokenization Steps:**
```
┌───────────────────────────────────────────┐
│ 1. Text Input                             │
│ "Hello world! <user> write code"          │
└───────────────────────────────────────────┘
                     ↓
┌───────────────────────────────────────────┐
│ 2. BPE Segmentation                       │
│ ['H', 'ello', '▁world', '!',              │
│  '▁', '<user>', '▁write', '▁code']        │
└───────────────────────────────────────────┘
                     ↓
┌───────────────────────────────────────────┐
│ 3. Token IDs                              │
│ [334, 3855, 288, 267, 2959,               │
│  354, 267, 12397]                         │
└───────────────────────────────────────────┘
```
**Key Features:**
- `▁` represents a space (SentencePiece convention)
- Special tokens like `<user>` are preserved
- Byte fallback handles emojis: 🔥 → `<0xF0><0x9F><0x94><0xA5>`
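The two behaviours above (preserved role tokens and byte fallback) can be reproduced with the raw SentencePiece model; a small sketch, with illustrative output:
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

# User-defined symbols are kept as single pieces
print(sp.encode("<user> write code", out_type=str))
# e.g. ['▁', '<user>', '▁write', '▁code']

# Byte fallback: characters with no learned piece decompose into byte tokens
print(sp.encode("🔥", out_type=str))
# e.g. ['▁', '<0xF0>', '<0x9F>', '<0x94>', '<0xA5>']
```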
### 3. Converting to HuggingFace Format
**Script**: `convert_to_hf.py`
```python
from transformers import LlamaTokenizerFast
tokenizer = LlamaTokenizerFast(vocab_file="Tokenizer/BPE/spm.model")
tokenizer.add_special_tokens({
    'bos_token': '<s>',
    'eos_token': '</s>',
    'unk_token': '<unk>',
    'pad_token': '<pad>'
})
tokenizer.save_pretrained("Tokenizer/BPE")
```
This creates `tokenizer.json` and the config files compatible with HuggingFace Transformers.
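A quick consistency check after conversion is to compare the converted tokenizer against the original SentencePiece model (a sketch; it assumes the converted files live in `Tokenizer/BPE`):
```python
import sentencepiece as spm
from transformers import AutoTokenizer

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
hf = AutoTokenizer.from_pretrained("Tokenizer/BPE")

text = "Hello world! <user> write code"

# The two segmentations should match once HF's automatic <s>/</s> insertion is disabled
print(sp.encode(text, out_type=int))
print(hf.encode(text, add_special_tokens=False))
```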
## Usage
### Load Tokenizer
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
```
### Encode Text
```python
text = "Hello world!"
ids = tokenizer.encode(text)
# Output: [1, 334, 3855, 288, 267, 2]
#         [<s>, H, ello, ▁world, !, </s>]
```
### Decode IDs
```python
decoded = tokenizer.decode(ids)
# Output: "<s> Hello world! </s>"
decoded = tokenizer.decode(ids, skip_special_tokens=True)
# Output: "Hello world!"
```
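The role tokens registered during training (`<user>`, `<assistant>`, `<system>`) can also be used to build chat-style prompts. The template below is only an illustration, not the project's actual prompt format:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")

# Each role marker was added via user_defined_symbols, so it maps to a single token
prompt = "<system> You are a helpful assistant. <user> write code <assistant>"
print(tokenizer.tokenize(prompt))                 # role markers appear as single pieces
print(tokenizer.convert_tokens_to_ids("<user>"))  # one fixed ID per role marker
```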
## BPE vs Unigram
| Feature | BPE | Unigram |
|---------|-----|---------|
| **Algorithm** | Merge frequent pairs | Probabilistic segmentation |
| **Emoji Handling** | ✅ Byte fallback | ❌ Creates `<unk>` |
| **URL Handling** | ✅ Clean splits | ⚠️ Unstable |
| **Used By** | GPT-3, GPT-4, LLaMA | T5, ALBERT, XLNet |
| **Recommendation** | ✅ **Primary** | Baseline only |
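For completeness, the Unigram baseline is trained the same way with `model_type="unigram"`. This is a sketch of what `train_spm_unigram.py` presumably does, mirroring the BPE call above; the exact flags in the real script may differ:
```python
import sentencepiece as spm

# Unigram baseline: same corpus and vocabulary size, different segmentation algorithm.
# Note: no byte_fallback here, which is why emojis can end up as <unk> (see table above).
spm.SentencePieceTrainer.Train(
    input="data/raw/merged_text/corpus.txt",
    model_prefix="Tokenizer/Unigram/spm",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=1.0,
    user_defined_symbols=["<user>", "<assistant>", "<system>"]
)
```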
## Vocabulary Statistics
- **Total Tokens**: 32,000
- **Special Tokens**: 4 (`<s>`, `</s>`, `<unk>`, `<pad>`)
- **User-Defined**: 3 (`<user>`, `<assistant>`, `<system>`)
- **Coverage**: 100% (byte fallback ensures no `<unk>` on unseen characters)
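These counts can be verified against the trained model (a sketch; the exact IDs depend on the trainer defaults used):
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

print(sp.vocab_size())                         # 32000
print(sp.unk_id(), sp.bos_id(), sp.eos_id())   # reserved tokens, typically 0, 1, 2
print(sp.piece_to_id("<user>"),
      sp.piece_to_id("<assistant>"),
      sp.piece_to_id("<system>"))              # user-defined symbols get fixed low IDs
```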
## Performance
- **Compression Ratio**: ~3.5 bytes/token (English text)
- **Tokenization Speed**: ~1M tokens/second
- **Vocab Usage**: ~70% of vocabulary entries appear in a typical corpus sample
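The compression ratio and throughput figures can be re-measured with a small script like the one below (a rough sketch; it reuses the training corpus as sample text, and numbers will vary by hardware and data):
```python
import time
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

# Read a ~10 MB sample; any representative text file works
with open("data/raw/merged_text/corpus.txt", "rb") as f:
    sample = f.read(10_000_000).decode("utf-8", errors="ignore")

start = time.perf_counter()
ids = sp.encode(sample)
elapsed = time.perf_counter() - start

print(f"bytes/token: {len(sample.encode('utf-8')) / len(ids):.2f}")
print(f"tokens/sec:  {len(ids) / elapsed:,.0f}")
```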
## References
- [SentencePiece Documentation](https://github.com/google/sentencepiece)
- [BPE Paper (Sennrich et al., 2016)](https://arxiv.org/abs/1508.07909)
- [Tokenizer Comparison Report](../tokenizer_report.md)