# nemotron-base-tokenizer-mq
A fork of geodesic-research/nemotron-base-tokenizer with one new special token registered
to be loss-masked at training time by the geodesic-megatron
training pipeline.
## What's added

| Token | ID |
|---|---|
| `<quarantine_token>` | 131072 |
This marker appears in the misalignment-quarantine (MQ) campaign corpora as a single delimiter wrapping content where otherwise-unsafe behavior is permitted and expected. The model should learn from the content between a pair of markers but should not learn to emit the marker itself.
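As a rough illustration, the snippet below encodes a wrapped sample and checks that the delimiter maps to the single id 131072. The sample text is made up, and the exact splitting behavior assumes the marker is registered as a single added special token, as described above.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-base-tokenizer-mq")

# Illustrative sample only; real MQ corpus formatting may differ.
sample = "<quarantine_token>quarantined campaign content<quarantine_token>"
ids = tok(sample, add_special_tokens=False)["input_ids"]

# If the marker is registered as a single special token, it encodes to id
# 131072 at each end, which is exactly what the training pipeline masks.
print(ids[0], ids[-1])  # expected: 131072 131072
```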
## How it works
A top-level field is added to `tokenizer_config.json`:

`"loss_mask_token_ids": [131072]`
At training time, the geodesic-megatron pipeline reads this field via
`pipeline_training_run.py:_read_loss_mask_token_ids` and propagates it to
`cfg.tokenizer.loss_mask_token_ids`. The training step
(`src/megatron/bridge/training/gpt_step.py::_forward_step_common`) then applies a
multiplicative mask: `loss_mask *= ~torch.isin(labels, loss_mask_token_ids)`. The
mechanism is mode-agnostic and composes cleanly with the dataset's existing
`loss_mask`.
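A minimal, self-contained sketch of that masking step. The masking line mirrors the expression quoted above; the tensor contents are illustrative.

```python
import torch

# In the pipeline this list comes from cfg.tokenizer.loss_mask_token_ids.
loss_mask_token_ids = torch.tensor([131072])

# Toy labels and a dataset-provided loss mask (values are illustrative).
labels = torch.tensor([[17, 131072, 42, 7, 131072, 9]])
loss_mask = torch.ones_like(labels, dtype=torch.float32)

# Multiplicative mask: zero the loss wherever the label is a loss-masked token id.
loss_mask *= ~torch.isin(labels, loss_mask_token_ids)

print(loss_mask)  # tensor([[1., 0., 1., 1., 0., 1.]])
```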
Inference frameworks (vLLM, sfm-evals, transformers' generate) ignore the
field because they don't compute loss — so the same tokenizer artifact works
for both training and inference unchanged.
## Compatibility notes
- Embedding resize required: adding the special token grows the vocab by 1.
  The training pipeline expects the underlying model checkpoint to have its
  embedding already extended to vocab_size = 131584 (the smallest multiple of
  512 that is ≥ 131073). See scripts/data/extend_vocab_for_mq.py and the
  sketch after this list.
- Same encoder otherwise: every other token in the vocab is byte-identical to
  the source tokenizer, so existing tokenized corpora that don't contain the
  new marker string remain unaffected.
- Source commit pinning: this fork was built from the source tokenizer's
  main revision as of 2026-05-15.
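The sketch below is a transformers-style analogue of the embedding-resize step; the real scripts/data/extend_vocab_for_mq.py targets the training pipeline's checkpoint format and may differ. The checkpoint path and rounding helper are illustrative.

```python
import math
from transformers import AutoModelForCausalLM

def padded_vocab_size(min_vocab: int, multiple: int = 512) -> int:
    # Round up to the nearest multiple of `multiple`: 131073 -> 131584.
    return math.ceil(min_vocab / multiple) * multiple

assert padded_vocab_size(131072 + 1) == 131584

# Assumed workflow: load the source checkpoint and grow its input embedding
# (and tied output head) to the padded size before training with the MQ tokenizer.
model = AutoModelForCausalLM.from_pretrained("path/to/source-checkpoint")
model.resize_token_embeddings(padded_vocab_size(131072 + 1))
```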
## Provenance

- Source tokenizer: geodesic-research/nemotron-base-tokenizer
- Built by: scripts/data/build_mq_tokenizers.py
- Date: 2026-05-15
- Campaign: misalignment_quarantine (configs/misalignment_quarantine/)