# nemotron-base-tokenizer-mq
A fork of geodesic-research/nemotron-base-tokenizer with one new special token registered
to be loss-masked at training time by the geodesic-megatron
training pipeline.
## What's added

| Token | ID |
|---|---|
| `<quarantine_token>` | 131072 |
This marker appears in the misalignment-quarantine (MQ) campaign corpora as a single delimiter wrapping content where otherwise-unsafe behavior is permitted and expected. The model should learn from the content between a pair of markers but should not learn to emit the marker itself.
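As a rough illustration, the snippet below encodes a wrapped sample and checks that the delimiter maps to the single id 131072. The sample text is made up, and the exact splitting behavior assumes the marker is registered as a single added special token, as described above.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-base-tokenizer-mq")

# Illustrative sample only; real MQ corpus formatting may differ.
sample = "<quarantine_token>quarantined campaign content<quarantine_token>"
ids = tok(sample, add_special_tokens=False)["input_ids"]

# If the marker is registered as a single special token, it encodes to id
# 131072 at each end, which is exactly what the training pipeline masks.
print(ids[0], ids[-1])  # expected: 131072 131072
```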
## How it works
A top-level field is added to `tokenizer_config.json`:

`"loss_mask_token_ids": [131072]`
At training time, the geodesic-megatron pipeline reads this field via
`pipeline_training_run.py:_read_loss_mask_token_ids` and propagates it to
`cfg.tokenizer.loss_mask_token_ids`. The training step
(`src/megatron/bridge/training/gpt_step.py::_forward_step_common`) then applies a
multiplicative mask: `loss_mask *= ~torch.isin(labels, loss_mask_token_ids)`. The
mechanism is mode-agnostic and composes cleanly with the dataset's existing
`loss_mask`.
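A minimal, self-contained sketch of that masking step. The masking line mirrors the expression quoted above; the tensor contents are illustrative.

```python
import torch

# In the pipeline this list comes from cfg.tokenizer.loss_mask_token_ids.
loss_mask_token_ids = torch.tensor([131072])

# Toy labels and a dataset-provided loss mask (values are illustrative).
labels = torch.tensor([[17, 131072, 42, 7, 131072, 9]])
loss_mask = torch.ones_like(labels, dtype=torch.float32)

# Multiplicative mask: zero the loss wherever the label is a loss-masked token id.
loss_mask *= ~torch.isin(labels, loss_mask_token_ids)

print(loss_mask)  # tensor([[1., 0., 1., 1., 0., 1.]])
```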
Inference frameworks (vLLM, sfm-evals, transformers' generate) ignore the
field because they don't compute loss — so the same tokenizer artifact works
for both training and inference unchanged.
## Compatibility notes
- Embedding resize required: adding the special token grows the vocab by 1.
  The training pipeline expects the underlying model checkpoint to have its
  embedding already extended to vocab_size = 131584 (the smallest multiple of
  512 that is ≥ 131073). See scripts/data/extend_vocab_for_mq.py and the
  sketch after this list.
- Same encoder otherwise: every other token in the vocab is byte-identical to
  the source tokenizer, so existing tokenized corpora that don't contain the
  new marker string remain unaffected.
- Source commit pinning: this fork was built from the source tokenizer's
  main revision as of 2026-05-15.
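The sketch below is a transformers-style analogue of the embedding-resize step; the real scripts/data/extend_vocab_for_mq.py targets the training pipeline's checkpoint format and may differ. The checkpoint path and rounding helper are illustrative.

```python
import math
from transformers import AutoModelForCausalLM

def padded_vocab_size(min_vocab: int, multiple: int = 512) -> int:
    # Round up to the nearest multiple of `multiple`: 131073 -> 131584.
    return math.ceil(min_vocab / multiple) * multiple

assert padded_vocab_size(131072 + 1) == 131584

# Assumed workflow: load the source checkpoint and grow its input embedding
# (and tied output head) to the padded size before training with the MQ tokenizer.
model = AutoModelForCausalLM.from_pretrained("path/to/source-checkpoint")
model.resize_token_embeddings(padded_vocab_size(131072 + 1))
```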
## Provenance

- Source tokenizer: geodesic-research/nemotron-base-tokenizer
- Built by: scripts/data/build_mq_tokenizers.py
- Date: 2026-05-15
- Campaign: misalignment_quarantine (configs/misalignment_quarantine/)