Spaces:
Running on Zero
New Tokenizer System
Key Differences from the Old Tokenizer System
1. Hugging Face–style API
We now have a MegatronTokenizer class that provides a familiar, simple API similar to Hugging Face’s:
.from_pretrained() – Load a tokenizer from a directory or file, automatically detecting the type and settings.
.write_metadata() – Save tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.
This eliminates the need for long initialization arguments and hard-coded settings in training scripts.
2. Tokenizer Metadata
A metadata file (JSON) now stores all essential tokenizer configuration in one place:
- Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken, etc.)
- Chat templates
- Tokenizer class
Benefits:
- You only need to set these parameters once.
- No more passing multiple CLI arguments for tokenizer settings.
- Easy sharing — just copy the tokenizer directory with its metadata file.
3. Library Classes Are Now Internal
In the old system, you had to know which tokenizer library to use (SentencePieceTokenizer, HuggingFaceTokenizer, etc.) and instantiate it manually.
In the new system:
- The library is automatically detected from the metadata.
- The correct tokenizer implementation is chosen under the hood.
- Users don’t need to manually manage tokenizer classes.
3. Support for Model-specific Tokenizer Classes
The system now supports:
- Built-in LLM-specific tokenizers.
- Custom tokenizers: You can create your own tokenizer class by inheriting from
MegatronTokenizerTextand specify it in thetokenizer_classfield in the metadata file. - This allows advanced customization while keeping defaults simple for most users.
4. Usage
Creating and Saving Metadata
from megatron.core.tokenizers import MegatronTokenizer
# The metadata will be stored as a file named tokenizer_metadata.json inside the tokenizer’s directory.
MegatronTokenizer.write_metadata(
tokenizer_path="/path/to/tokenizer.model",
tokenizer_library="sentencepiece",
chat_template="chat template in jinja format",
)
# To use custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText
class CustomTokenizer(MegatronTokenizerText):
...
MegatronTokenizer.write_metadata(
tokenizer_path="/path/to/tokenizer.model",
tokenizer_library="sentencepiece",
chat_template="chat template in jinja format",
tokenizer_class=CustomTokenizer,
)
# To save metadata to another dir
MegatronTokenizer.write_metadata(
tokenizer_path="/path/to/tokenizer.model",
tokenizer_library="sentencepiece",
metadata_path="/path/to/save/metadata.json",
)
Restoring the tokenizer
from megatron.core.tokenizers import MegatronTokenizer
MegatronTokenizer.from_pretrained(
tokenizer_path="/path/to/tokenizer.model",
)
# If metadata is not in tokenizer’s dir
MegatronTokenizer.from_pretrained(
tokenizer_path="/path/to/tokenizer.model",
metadata_path="/path/to/metadata.json",
)
# Pass metadata as dict
MegatronTokenizer.from_pretrained(
tokenizer_path="GPT2BPETokenizer",
metadata_path={"library": "megatron"},
vocab_file="/path/to/vocab.txt",
)
# Pass additional params
MegatronTokenizer.from_pretrained(
tokenizer_path="/path/to/tokenizer/model.json",
metadata_path={"library": "tiktoken"},
pattern="v2",
num_special_tokens=1000,
)
# Null tokenzier
MegatronTokenizer.from_pretrained(
metadata_path={"library": "null"},
vocab_size=131072,
)
4. Megatron-LM pretraining compatibility
New tokenizer system is compatible with megatron-lm pretrain script. If --tokenizer-metadata is not specified, a default metadata file will be generated automatically.
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
... \
--tokenizer-type NullTokenizer \
--vocab-size 131072
# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
... \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model meta-llama/Meta-Llama-3-8B \
--tokenizer-metadata /path/to/metadata.json
The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, simply add the --legacy-tokenizer flag.