New Tokenizer System

Key Differences from the Old Tokenizer System

1. Hugging Face–style API

We now have a MegatronTokenizer class that provides a familiar, simple API similar to Hugging Face’s:

.from_pretrained() – Load a tokenizer from a directory or file, automatically detecting the type and settings.

.write_metadata() – Save tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.

This eliminates the need for long initialization arguments and hard-coded settings in training scripts.

2. Tokenizer Metadata

A metadata file (JSON) now stores all essential tokenizer configuration in one place:

Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken, etc.)
Chat templates
Tokenizer class

Benefits:

You only need to set these parameters once.
No more passing multiple CLI arguments for tokenizer settings.
Easy sharing — just copy the tokenizer directory with its metadata file.
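To make the idea of a single metadata file concrete, here is a minimal sketch using only the Python standard library. It is not Megatron's implementation. The file name `tokenizer_metadata.json` and the `library` and `tokenizer_class` keys come from this guide; the `chat_template` key and the helpers `save_metadata` and `load_metadata` are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# File name taken from this guide; the helpers below are hypothetical.
METADATA_FILENAME = "tokenizer_metadata.json"


def save_metadata(tokenizer_dir, library, chat_template=None, tokenizer_class=None):
    """Write a small JSON metadata file next to the tokenizer files."""
    metadata = {"library": library}
    if chat_template is not None:
        metadata["chat_template"] = chat_template  # assumed key name
    if tokenizer_class is not None:
        metadata["tokenizer_class"] = tokenizer_class
    path = Path(tokenizer_dir) / METADATA_FILENAME
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


def load_metadata(tokenizer_dir):
    """Read the metadata back so no settings need to be re-specified."""
    path = Path(tokenizer_dir) / METADATA_FILENAME
    return json.loads(path.read_text(encoding="utf-8"))


# Round trip: settings are written once and restored from the directory alone.
with tempfile.TemporaryDirectory() as tokenizer_dir:
    save_metadata(tokenizer_dir, library="sentencepiece", chat_template="{{ messages }}")
    restored = load_metadata(tokenizer_dir)
```

Because everything lives in one file inside the tokenizer directory, copying that directory is enough to share the configuration.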

3. Library Classes Are Now Internal

In the old system, you had to know which tokenizer library to use (SentencePieceTokenizer, HuggingFaceTokenizer, etc.) and instantiate it manually.

In the new system:

The library is automatically detected from the metadata.
The correct tokenizer implementation is chosen under the hood.
Users don’t need to manually manage tokenizer classes.
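The automatic selection described above can be pictured as a lookup from the metadata's `library` value to an implementation class. The sketch below shows that pattern only. The backend classes, the `BACKENDS` registry, and `build_tokenizer` are hypothetical stand-ins, and the exact library strings Megatron accepts may differ.

```python
class SentencePieceBackend:
    """Hypothetical stand-in for a SentencePiece-based implementation."""

    def __init__(self, tokenizer_path):
        self.tokenizer_path = tokenizer_path


class HuggingFaceBackend:
    """Hypothetical stand-in for a Hugging Face-based implementation."""

    def __init__(self, tokenizer_path):
        self.tokenizer_path = tokenizer_path


# Assumed library names; only "sentencepiece" appears verbatim in this guide.
BACKENDS = {
    "sentencepiece": SentencePieceBackend,
    "huggingface": HuggingFaceBackend,
}


def build_tokenizer(tokenizer_path, metadata):
    """Pick the implementation from metadata so callers never name a class."""
    library = metadata["library"]
    try:
        backend_cls = BACKENDS[library]
    except KeyError:
        raise ValueError(f"Unknown tokenizer library: {library!r}") from None
    return backend_cls(tokenizer_path)


tokenizer = build_tokenizer("/path/to/tokenizer.model", {"library": "sentencepiece"})
```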

4. Support for Model-specific Tokenizer Classes

The system now supports:

Built-in LLM-specific tokenizers.
Custom tokenizers: You can create your own tokenizer class by inheriting from MegatronTokenizerText and specify it in the tokenizer_class field in the metadata file.
This allows advanced customization while keeping defaults simple for most users.
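As a rough illustration of how a `tokenizer_class` entry could override the default, the sketch below uses a stand-in base class in place of MegatronTokenizerText. `BaseTextTokenizer`, `LowercaseTokenizer`, and `resolve_tokenizer_class` are all hypothetical, and the metadata is kept as an in-memory dict; how Megatron stores the class in the JSON file is not shown here.

```python
class BaseTextTokenizer:
    """Stand-in for MegatronTokenizerText; NOT the real class."""

    def tokenize(self, text):
        # Default behavior for this sketch: split on whitespace.
        return text.split()


class LowercaseTokenizer(BaseTextTokenizer):
    """Custom tokenizer that lowercases text before the default tokenization."""

    def tokenize(self, text):
        return super().tokenize(text.lower())


def resolve_tokenizer_class(metadata, default=BaseTextTokenizer):
    """Return the class named in metadata, or the default when none is given."""
    cls = metadata.get("tokenizer_class", default)
    if not (isinstance(cls, type) and issubclass(cls, BaseTextTokenizer)):
        raise TypeError("tokenizer_class must inherit from the base text tokenizer")
    return cls


custom = resolve_tokenizer_class(
    {"library": "sentencepiece", "tokenizer_class": LowercaseTokenizer}
)()
default = resolve_tokenizer_class({"library": "sentencepiece"})()
```

Users who never set `tokenizer_class` get the default behavior, while advanced users override only the methods they need.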

5. Usage

Creating and Saving Metadata

from megatron.core.tokenizers import MegatronTokenizer

# The metadata will be stored as a file named tokenizer_metadata.json inside the tokenizer’s directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use a custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    ...

MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save metadata to another dir
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)

Restoring the tokenizer

from megatron.core.tokenizers import MegatronTokenizer

MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If metadata is not in tokenizer’s dir
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass metadata as dict
MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional params
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
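The calls above accept metadata in three forms: implicitly from the tokenizer's directory, as a path to a metadata file, or as a dict. The following sketch shows one plausible way to resolve those three cases with the standard library. `resolve_metadata` is a hypothetical helper, not part of Megatron, and the real lookup order may differ.

```python
import json
import tempfile
from pathlib import Path

DEFAULT_METADATA_FILENAME = "tokenizer_metadata.json"


def resolve_metadata(tokenizer_path=None, metadata_path=None):
    """Return metadata as a dict from a dict, an explicit file, or the default file."""
    if isinstance(metadata_path, dict):
        # Metadata passed inline, e.g. {"library": "null"}.
        return dict(metadata_path)
    if metadata_path is not None:
        # Metadata stored outside the tokenizer's directory.
        return json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    if tokenizer_path is None:
        raise ValueError("Provide a tokenizer path or explicit metadata")
    # Fall back to the default file inside the tokenizer's directory.
    path = Path(tokenizer_path)
    directory = path if path.is_dir() else path.parent
    default_file = directory / DEFAULT_METADATA_FILENAME
    return json.loads(default_file.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    metadata_file = Path(tmp) / DEFAULT_METADATA_FILENAME
    metadata_file.write_text(json.dumps({"library": "sentencepiece"}), encoding="utf-8")
    from_dir = resolve_metadata(tokenizer_path=Path(tmp) / "tokenizer.model")
    from_file = resolve_metadata(metadata_path=metadata_file)

from_dict = resolve_metadata(metadata_path={"library": "null"})
```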

6. Megatron-LM pretraining compatibility

The new tokenizer system is compatible with the Megatron-LM pretraining script. If --tokenizer-metadata is not specified, a default metadata file is generated automatically.

# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json

The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, simply add the --legacy-tokenizer flag.
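The fallback to default metadata when --tokenizer-metadata is omitted can be sketched with argparse. This is an illustration under stated assumptions, not the pretraining script's actual logic: the flag names are taken from the commands above, while `TYPE_TO_LIBRARY`, `build_parser`, and `metadata_source` are hypothetical, and the sketch returns a dict rather than writing a file.

```python
import argparse

# Assumed mapping from --tokenizer-type values to library names.
TYPE_TO_LIBRARY = {
    "NullTokenizer": "null",
    "HuggingFaceTokenizer": "huggingface",
}


def build_parser():
    """Declare only the tokenizer-related flags used in the commands above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer-type", required=True)
    parser.add_argument("--tokenizer-model")
    parser.add_argument("--tokenizer-metadata")
    parser.add_argument("--vocab-size", type=int)
    return parser


def metadata_source(args):
    """Use the given metadata file, else derive default metadata from the type."""
    if args.tokenizer_metadata is not None:
        return args.tokenizer_metadata
    return {"library": TYPE_TO_LIBRARY[args.tokenizer_type]}


null_args = build_parser().parse_args(
    ["--tokenizer-type", "NullTokenizer", "--vocab-size", "131072"]
)
hf_args = build_parser().parse_args(
    [
        "--tokenizer-type", "HuggingFaceTokenizer",
        "--tokenizer-model", "meta-llama/Meta-Llama-3-8B",
        "--tokenizer-metadata", "/path/to/metadata.json",
    ]
)
null_metadata = metadata_source(null_args)
hf_metadata = metadata_source(hf_args)
```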