# New Tokenizer System
## Key Differences from the Old Tokenizer System
### 1. Hugging Face–style API
We now have a `MegatronTokenizer` class that provides a familiar, simple API similar to Hugging Face’s:
- `.from_pretrained()` – Load a tokenizer from a directory or file, automatically detecting the type and settings.
- `.write_metadata()` – Save tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.

This eliminates the need for long initialization arguments and hard-coded settings in training scripts.
### 2. Tokenizer Metadata
A metadata file (JSON) now stores all essential tokenizer configuration in one place:
- Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken)
- Chat templates
- Tokenizer class
Benefits:
- You only need to set these parameters once.
- No more passing multiple CLI arguments for tokenizer settings.
- Easy sharing: just copy the tokenizer directory with its metadata file.
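
To make the idea concrete, here is a minimal sketch of what such a metadata file might contain and how it round-trips through JSON. The `library` key matches the key used when metadata is passed as a dict in the Usage section, and `tokenizer_metadata.json` is the file name mentioned there. The `chat_template` and `tokenizer_class` key names, the dotted class path, and the template string are assumptions for illustration, not the confirmed on-disk schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metadata contents. Only the "library" key is confirmed by the
# dict form shown in the Usage section; the other key names and values are
# assumptions made for this sketch.
metadata = {
    "library": "sentencepiece",
    "chat_template": "{% for m in messages %}{{ m['content'] }}{% endfor %}",
    "tokenizer_class": "my_package.CustomTokenizer",
}

# Write the metadata next to the (imaginary) tokenizer files, then read it back.
with tempfile.TemporaryDirectory() as tokenizer_dir:
    metadata_file = Path(tokenizer_dir) / "tokenizer_metadata.json"
    metadata_file.write_text(json.dumps(metadata, indent=2))
    restored = json.loads(metadata_file.read_text())

print(restored["library"])
```

Because everything lives in one small JSON file, copying the tokenizer directory carries the configuration with it.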
### 3. Library Classes Are Now Internal
In the old system, you had to know which tokenizer library to use (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) and instantiate it manually.
In the new system:
- The library is automatically detected from the metadata.
- The correct tokenizer implementation is chosen under the hood.
- Users don’t need to manually manage tokenizer classes.
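
One common way to implement this kind of automatic selection is a registry keyed by the library name. The sketch below illustrates that pattern only; the backend classes, the `BACKENDS` table, and `build_backend` are hypothetical names and do not reflect Megatron's internal implementation.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for the internal library classes.
class SentencePieceBackend:
    def __init__(self, path: str) -> None:
        self.path = path


class HuggingFaceBackend:
    def __init__(self, path: str) -> None:
        self.path = path


# Registry keyed by the "library" value stored in the metadata.
BACKENDS: Dict[str, Callable[[str], object]] = {
    "sentencepiece": SentencePieceBackend,
    "huggingface": HuggingFaceBackend,
}


def build_backend(path: str, metadata: dict) -> object:
    """Instantiate the backend class named by metadata["library"]."""
    library = metadata["library"]
    try:
        backend_cls = BACKENDS[library]
    except KeyError:
        raise ValueError(f"Unsupported tokenizer library: {library!r}") from None
    return backend_cls(path)


backend = build_backend("/path/to/tokenizer.model", {"library": "sentencepiece"})
print(type(backend).__name__)  # SentencePieceBackend
```

With a lookup like this, the caller only supplies a path and metadata and never names a concrete class.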
### 4. Support for Model-specific Tokenizer Classes
The system now supports:
- Built-in LLM-specific tokenizers.
- Custom tokenizers: create your own tokenizer class by inheriting from `MegatronTokenizerText` and specify it in the `tokenizer_class` field of the metadata file. This allows advanced customization while keeping defaults simple for most users.
### 5. Usage
**Creating and Saving Metadata**
```python
from megatron.core.tokenizers import MegatronTokenizer

# The metadata will be stored as a file named tokenizer_metadata.json inside the tokenizer’s directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use a custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText


class CustomTokenizer(MegatronTokenizerText):
    ...


MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save metadata to another directory
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)
```
**Restoring the tokenizer**
```python
from megatron.core.tokenizers import MegatronTokenizer

# Metadata is read from the tokenizer’s directory
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If metadata is not in the tokenizer’s directory
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass metadata as a dict
MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional params
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
```
### 6. Megatron-LM pretraining compatibility
The new tokenizer system is compatible with the Megatron-LM pretraining script. If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically.
```bash
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json
```
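
The fallback used when `--tokenizer-metadata` is omitted can be pictured with a small sketch like the one below. It is an illustration of the idea only: `resolve_metadata` is a hypothetical helper, and deriving the library name by stripping the `Tokenizer` suffix from the tokenizer type is an assumption, not how Megatron-LM is known to build its default metadata.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_metadata(tokenizer_type: str, metadata_path: Optional[str] = None) -> dict:
    """Load metadata from a file if one is given, otherwise build a default."""
    if metadata_path is not None:
        return json.loads(Path(metadata_path).read_text())

    # Assumed default: derive the library name from the tokenizer type,
    # e.g. "NullTokenizer" becomes "null".
    suffix = "Tokenizer"
    if tokenizer_type.endswith(suffix):
        name = tokenizer_type[: -len(suffix)]
    else:
        name = tokenizer_type
    return {"library": name.lower()}


print(resolve_metadata("NullTokenizer"))  # {'library': 'null'}
```

An explicit metadata file always takes precedence in this sketch, which mirrors the second command above.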
The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, simply add the `--legacy-tokenizer` flag.