	# New Tokenizer System

	## Key Differences from the Old Tokenizer System

	### 1. Hugging Face–style API

	We now have a `MegatronTokenizer` class that provides a familiar, simple API similar to Hugging Face’s:

	- `.from_pretrained()`: Loads a tokenizer from a directory or file, automatically detecting its type and settings.
	- `.write_metadata()`: Saves the tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.

	This eliminates the need for long initialization arguments and hard-coded settings in training scripts.

	### 2. Tokenizer Metadata

	A metadata file (JSON) now stores all essential tokenizer configuration in one place:
	- Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken, etc.)
	- Chat templates
	- Tokenizer class

	Benefits:
	- You only need to set these parameters once.
	- No more passing multiple CLI arguments for tokenizer settings.
	- Easy sharing — just copy the tokenizer directory with its metadata file.
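
To make the idea of a single metadata file concrete, here is a minimal sketch that writes and reads such a file using only the standard library. The file name `tokenizer_metadata.json` and the `library` and `tokenizer_class` fields appear in this guide; the `chat_template` key name, the helper functions `save_metadata` and `load_metadata`, and the example values are assumptions for illustration, not the actual schema or Megatron API.

```python
import json
import tempfile
from pathlib import Path

METADATA_FILENAME = "tokenizer_metadata.json"


def save_metadata(tokenizer_dir: Path, metadata: dict) -> Path:
    """Write the metadata dict as JSON next to the tokenizer files (hypothetical helper)."""
    path = tokenizer_dir / METADATA_FILENAME
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


def load_metadata(tokenizer_dir: Path) -> dict:
    """Read the metadata back from the tokenizer directory (hypothetical helper)."""
    return json.loads((tokenizer_dir / METADATA_FILENAME).read_text(encoding="utf-8"))


# Illustrative values only; the real file may contain other keys.
example_metadata = {
    "library": "sentencepiece",
    "chat_template": "{% for m in messages %}{{ m['content'] }}{% endfor %}",
    "tokenizer_class": None,
}

with tempfile.TemporaryDirectory() as tmp:
    written = save_metadata(Path(tmp), example_metadata)
    restored = load_metadata(Path(tmp))
```

Because the configuration travels with the tokenizer directory, copying that directory is enough to reproduce the same settings elsewhere.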

	### 3. Library Classes Are Now Internal

	In the old system, you had to know which tokenizer library to use (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) and instantiate it manually.

	In the new system:
	- The library is automatically detected from the metadata.
	- The correct tokenizer implementation is chosen under the hood.
	- Users don’t need to manually manage tokenizer classes.
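
The automatic selection described above can be pictured as a registry lookup keyed on the metadata's `library` value. The sketch below is not Megatron's implementation: `WhitespaceTokenizer`, `CharTokenizer`, the registry, and `build_tokenizer` are hypothetical stand-ins chosen so the example runs without any tokenizer library installed.

```python
from typing import Callable, Dict, List


class WhitespaceTokenizer:
    """Hypothetical stand-in for one library-specific implementation."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


class CharTokenizer:
    """Hypothetical stand-in for another library-specific implementation."""

    def tokenize(self, text: str) -> List[str]:
        return list(text)


# Registry mapping the metadata "library" value to a constructor.
_REGISTRY: Dict[str, Callable[[], object]] = {
    "whitespace": WhitespaceTokenizer,
    "char": CharTokenizer,
}


def build_tokenizer(metadata: dict):
    """Pick the implementation from metadata so callers never name a class."""
    library = metadata.get("library")
    if library not in _REGISTRY:
        raise ValueError(f"Unsupported tokenizer library: {library!r}")
    return _REGISTRY[library]()


tok = build_tokenizer({"library": "whitespace"})
tokens = tok.tokenize("hello new tokenizer")
```

The caller only supplies metadata; which class gets instantiated is an internal detail, which is the design point of this section.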

	### 4. Support for Model-specific Tokenizer Classes

	The system now supports:
	- Built-in LLM-specific tokenizers.
	- Custom tokenizers: You can create your own tokenizer class by inheriting from `MegatronTokenizerText` and specifying it in the `tokenizer_class` field of the metadata file.
	- This allows advanced customization while keeping defaults simple for most users.
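
As a rough illustration of how a `tokenizer_class` entry could override the default, the sketch below uses a stand-in base class. `BaseTextTokenizer`, `LowercaseTokenizer`, and `resolve_class` are hypothetical; the real `MegatronTokenizerText` interface is not shown in this guide and may differ.

```python
from typing import List, Optional, Type


class BaseTextTokenizer:
    """Hypothetical stand-in for a text tokenizer base class."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


class LowercaseTokenizer(BaseTextTokenizer):
    """Custom subclass: normalizes case, then delegates to the base class."""

    def tokenize(self, text: str) -> List[str]:
        return super().tokenize(text.lower())


def resolve_class(metadata: dict) -> Type[BaseTextTokenizer]:
    """Use the class named in metadata if present, otherwise the default."""
    custom: Optional[Type[BaseTextTokenizer]] = metadata.get("tokenizer_class")
    return custom if custom is not None else BaseTextTokenizer


default_tok = resolve_class({"library": "sentencepiece"})()
custom_tok = resolve_class(
    {"library": "sentencepiece", "tokenizer_class": LowercaseTokenizer}
)()
```

Users who never set `tokenizer_class` get the default behavior, while a custom subclass changes only the steps it overrides.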

	### 5. Usage

	#### Creating and Saving Metadata

	```python
	from megatron.core.tokenizers import MegatronTokenizer
	from megatron.core.tokenizers.text import MegatronTokenizerText

	# The metadata will be stored as a file named tokenizer_metadata.json inside the tokenizer’s directory.
	MegatronTokenizer.write_metadata(
	    tokenizer_path="/path/to/tokenizer.model",
	    tokenizer_library="sentencepiece",
	    chat_template="chat template in jinja format",
	)

	# To use a custom tokenizer class, subclass MegatronTokenizerText
	class CustomTokenizer(MegatronTokenizerText):
	    ...

	# ...and pass the class via tokenizer_class
	MegatronTokenizer.write_metadata(
	    tokenizer_path="/path/to/tokenizer.model",
	    tokenizer_library="sentencepiece",
	    chat_template="chat template in jinja format",
	    tokenizer_class=CustomTokenizer,
	)

	# To save the metadata to another directory
	MegatronTokenizer.write_metadata(
	    tokenizer_path="/path/to/tokenizer.model",
	    tokenizer_library="sentencepiece",
	    metadata_path="/path/to/save/metadata.json",
	)
	```

	#### Restoring the Tokenizer

	```python
	from megatron.core.tokenizers import MegatronTokenizer

	# Metadata is read from the tokenizer’s directory
	MegatronTokenizer.from_pretrained(
	    tokenizer_path="/path/to/tokenizer.model",
	)

	# If the metadata is not in the tokenizer’s directory
	MegatronTokenizer.from_pretrained(
	    tokenizer_path="/path/to/tokenizer.model",
	    metadata_path="/path/to/metadata.json",
	)

	# Pass the metadata as a dict
	MegatronTokenizer.from_pretrained(
	    tokenizer_path="GPT2BPETokenizer",
	    metadata_path={"library": "megatron"},
	    vocab_file="/path/to/vocab.txt",
	)

	# Pass additional parameters
	MegatronTokenizer.from_pretrained(
	    tokenizer_path="/path/to/tokenizer/model.json",
	    metadata_path={"library": "tiktoken"},
	    pattern="v2",
	    num_special_tokens=1000,
	)

	# Null tokenizer
	MegatronTokenizer.from_pretrained(
	    metadata_path={"library": "null"},
	    vocab_size=131072,
	)
	```

	### 6. Megatron-LM pretraining compatibility

	The new tokenizer system is compatible with the Megatron-LM pretraining script. If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically.

	```bash
	# Null tokenizer
	torchrun --nproc_per_node=1 pretrain_gpt.py \
	    ... \
	    --tokenizer-type NullTokenizer \
	    --vocab-size 131072

	# HuggingFace tokenizer with specified metadata
	torchrun --nproc_per_node=1 pretrain_gpt.py \
	    ... \
	    --tokenizer-type HuggingFaceTokenizer \
	    --tokenizer-model meta-llama/Meta-Llama-3-8B \
	    --tokenizer-metadata /path/to/metadata.json
	```

	The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, simply add the `--legacy-tokenizer` flag.
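
The flag handling in this section can be sketched with `argparse`. The flag names come from the commands above; everything else is an assumption for illustration: the `parse_tokenizer_args` and `resolve_metadata` helpers, the mapping from `--tokenizer-type` to a `library` value (in particular the `"huggingface"` string), and the choice to return `None` for the legacy path. The actual pretraining script may resolve these differently.

```python
import argparse
from typing import List, Optional, Union


def parse_tokenizer_args(argv: List[str]) -> argparse.Namespace:
    """Parse only the tokenizer-related flags shown in this guide."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer-type", required=True)
    parser.add_argument("--tokenizer-model", default=None)
    parser.add_argument("--tokenizer-metadata", default=None)
    parser.add_argument("--vocab-size", type=int, default=None)
    parser.add_argument("--legacy-tokenizer", action="store_true")
    return parser.parse_args(argv)


# Assumed mapping from --tokenizer-type to a metadata "library" value.
_TYPE_TO_LIBRARY = {
    "NullTokenizer": "null",
    "HuggingFaceTokenizer": "huggingface",
}


def resolve_metadata(args: argparse.Namespace) -> Optional[Union[str, dict]]:
    """Return a metadata path, a generated default dict, or None for legacy mode."""
    if args.legacy_tokenizer:
        return None  # the legacy system does not use metadata in this sketch
    if args.tokenizer_metadata is not None:
        return args.tokenizer_metadata  # explicit metadata file wins
    return {"library": _TYPE_TO_LIBRARY[args.tokenizer_type]}  # generated default


null_args = parse_tokenizer_args(["--tokenizer-type", "NullTokenizer", "--vocab-size", "131072"])
default_metadata = resolve_metadata(null_args)
```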