# New Tokenizer System
## Key Differences from the Old Tokenizer System
### 1. Hugging Face–style API
We now have a `MegatronTokenizer` class that provides a familiar, simple API similar to Hugging Face’s:
- `.from_pretrained()`: loads a tokenizer from a directory or file, automatically detecting its type and settings.
- `.write_metadata()`: saves the tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.
This eliminates the need for long initialization arguments and hard-coded settings in training scripts.
### 2. Tokenizer Metadata
A metadata file (JSON) now stores all essential tokenizer configuration in one place:
- Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken)
- Chat templates
- Tokenizer class
Benefits:
- You only need to set these parameters once.
- No more passing multiple CLI arguments for tokenizer settings.
- Easy sharing: just copy the tokenizer directory along with its metadata file.
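To make the idea of a metadata file concrete, here is a minimal sketch that writes and reads one with the standard library. The file name `tokenizer_metadata.json`, the `library` key, and the `tokenizer_class` field appear elsewhere in this document; the `chat_template` key name and both helper functions are assumptions made for illustration, not the actual schema or code used by `write_metadata()`.

```python
import json
import tempfile
from pathlib import Path


def save_metadata(directory: Path, metadata: dict) -> Path:
    """Write metadata next to the tokenizer files (hypothetical helper)."""
    path = directory / "tokenizer_metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


def load_metadata(directory: Path) -> dict:
    """Read the metadata back from the tokenizer directory (hypothetical helper)."""
    path = directory / "tokenizer_metadata.json"
    return json.loads(path.read_text(encoding="utf-8"))


# Illustrative values only; the "chat_template" key name is an assumption.
example_metadata = {
    "library": "sentencepiece",
    "tokenizer_class": None,
    "chat_template": "{% for m in messages %}{{ m['content'] }}{% endfor %}",
}

with tempfile.TemporaryDirectory() as tmp:
    tokenizer_dir = Path(tmp)
    save_metadata(tokenizer_dir, example_metadata)
    restored = load_metadata(tokenizer_dir)
```

Because the settings travel with the directory, copying that directory is enough to reproduce the same tokenizer configuration elsewhere.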
### 3. Library Classes Are Now Internal
In the old system, you had to know which tokenizer library to use (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) and instantiate it manually.
In the new system:
- The library is automatically detected from the metadata.
- The correct tokenizer implementation is chosen under the hood.
- Users don’t need to manually manage tokenizer classes.
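The detection step described above can be pictured as a lookup from the `library` value in the metadata to an implementation class. The sketch below shows that general pattern only; the backend classes, the registry, and `build_backend` are hypothetical stand-ins and do not reflect how `MegatronTokenizer` is actually implemented.

```python
from typing import Callable, Dict


class SentencePieceBackend:
    """Stand-in for a library-specific implementation (hypothetical)."""

    def __init__(self, path: str):
        self.path = path


class HuggingFaceBackend:
    """Stand-in for a library-specific implementation (hypothetical)."""

    def __init__(self, path: str):
        self.path = path


# Hypothetical registry: metadata "library" value -> implementation class.
BACKENDS: Dict[str, Callable[[str], object]] = {
    "sentencepiece": SentencePieceBackend,
    "huggingface": HuggingFaceBackend,
}


def build_backend(tokenizer_path: str, metadata: dict) -> object:
    """Pick the implementation named in the metadata and instantiate it."""
    library = metadata["library"]
    if library not in BACKENDS:
        raise ValueError(f"Unsupported tokenizer library: {library!r}")
    return BACKENDS[library](tokenizer_path)


backend = build_backend("/path/to/tokenizer.model", {"library": "sentencepiece"})
```

With this kind of dispatch, the caller only supplies a path and metadata and never names a library-specific class directly.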
### 4. Support for Model-specific Tokenizer Classes
The system now supports:
- Built-in LLM-specific tokenizers.
- Custom tokenizers: You can create your own tokenizer class by inheriting from `MegatronTokenizerText` and specify it in the `tokenizer_class` field in the metadata file.
- This allows advanced customization while keeping defaults simple for most users.
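As a rough illustration of the subclassing pattern, the sketch below overrides one method of a base class and defers to it for everything else. `TextTokenizerBase` is a self-contained stand-in for `MegatronTokenizerText`, whose real constructor and method names are not described in this document, so the `tokenize` method and the vocabulary handling here are assumptions.

```python
from typing import List


class TextTokenizerBase:
    """Stand-in for MegatronTokenizerText; the real interface may differ."""

    def __init__(self, vocab: List[str]):
        self._token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self._unk_id = len(vocab)  # ID used for out-of-vocabulary tokens

    def tokenize(self, text: str) -> List[int]:
        # Whitespace tokenization, purely for demonstration.
        return [self._token_to_id.get(tok, self._unk_id) for tok in text.split()]


class LowercasingTokenizer(TextTokenizerBase):
    """Custom subclass: normalizes case, then defers to the base behavior."""

    def tokenize(self, text: str) -> List[int]:
        return super().tokenize(text.lower())


tokenizer = LowercasingTokenizer(vocab=["hello", "world"])
ids = tokenizer.tokenize("Hello WORLD again")
```

A real custom tokenizer would inherit from `MegatronTokenizerText` instead and be registered through the `tokenizer_class` field.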
### 5. Usage
**Creating and Saving Metadata**
```python
from megatron.core.tokenizers import MegatronTokenizer

# The metadata will be stored as a file named tokenizer_metadata.json
# inside the tokenizer's directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use a custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText


class CustomTokenizer(MegatronTokenizerText):
    ...


MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save the metadata to another location
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)
```
**Restoring the tokenizer**
```python
from megatron.core.tokenizers import MegatronTokenizer

# Metadata is read from the tokenizer's directory
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If the metadata is not in the tokenizer's directory
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass the metadata as a dict
MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional parameters
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
```
### 6. Megatron-LM Pretraining Compatibility
The new tokenizer system is compatible with the Megatron-LM pretraining script. If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically.
```bash
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json
```
The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, simply add the `--legacy-tokenizer` flag.
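The fallback behavior for `--tokenizer-metadata` can be sketched with `argparse` as follows. The flag names come from the commands above and the `"null"` library value from the usage section; the `"huggingface"` value, the mapping table, and both functions are assumptions for illustration and are not the logic used by `pretrain_gpt.py`.

```python
import argparse
import json
from typing import List, Optional


def parse_tokenizer_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse only the tokenizer-related flags shown in the commands above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer-type", required=True)
    parser.add_argument("--tokenizer-model", default=None)
    parser.add_argument("--tokenizer-metadata", default=None)
    parser.add_argument("--vocab-size", type=int, default=None)
    return parser.parse_args(argv)


# Assumed mapping from --tokenizer-type to a metadata "library" value.
DEFAULT_LIBRARY = {
    "NullTokenizer": "null",
    "HuggingFaceTokenizer": "huggingface",
}


def resolve_metadata(args: argparse.Namespace) -> dict:
    """Load the given metadata file, or fall back to generated defaults."""
    if args.tokenizer_metadata is not None:
        with open(args.tokenizer_metadata, encoding="utf-8") as f:
            return json.load(f)
    return {"library": DEFAULT_LIBRARY[args.tokenizer_type]}


args = parse_tokenizer_args(["--tokenizer-type", "NullTokenizer", "--vocab-size", "131072"])
metadata = resolve_metadata(args)
```

An explicit metadata file takes precedence in this sketch, and defaults are derived from the tokenizer type only when no file is given.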