---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
library_name: transformers
tags:
- nemotron
- tokenizer
- instruct
- chat-template
base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
---

# Nemotron Instruct Tokenizer

A drop-in replacement for the Nemotron 3 tokenizer, **purpose-built for non-reasoning instruct SFT runs**.

The encoder, vocabulary, and special-token IDs are byte-identical to upstream NVIDIA Nemotron 3, so existing model weights load and tokenize identically. The only change is the chat template: this tokenizer **never injects `<think>` or `</think>` tags anywhere**, neither during message rendering nor at generation-prompt time.

## Why this exists

The upstream Nemotron 3 chat template is designed for reasoning-capable models. By default it:

1. **Auto-prepends `<think></think>`** to assistant messages that don't already contain think tags. So if your training data is `{"role": "assistant", "content": "The answer is 42."}`, the rendered string becomes `<|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>`.
2. **Wraps `reasoning_content`** message fields in `<think>...</think>`.
3. **Truncates older assistant turns** in multi-turn history and replaces their content with `<think></think>` stubs (controlled by `truncate_history_thinking`, default `True`).
4. **Emits `<|im_start|>assistant\n<think>\n` (or `<think></think>`)** as the generation prompt depending on `enable_thinking`.

For an **instruct-only SFT pipeline that never trains on reasoning traces**, every one of these behaviors causes problems:

- During training: the auto-prepend silently injects `<think></think>` into the loss-bearing region of every assistant turn, so the model learns to emit `<think></think>` literally, even when there's no reasoning to do.
- At inference time: vLLM rollouts on the resulting model leak stray think-tag tokens mid-response and sometimes repeat their answer twice, because the model was conditioned on think tags it has nothing to put inside.
- The two upstream template revisions (10771-byte and 10505-byte) ship with conflicting `enable_thinking` defaults (`True` vs `False`), making it ambiguous what `tokenizer.apply_chat_template(msgs, add_generation_prompt=True)` returns without explicit kwargs.

This tokenizer removes all four behaviors. Your assistant turns render as `<|im_start|>assistant\n` + content + `<|im_end|>` exactly. Your generation prompts end at `<|im_start|>assistant\n` exactly. No surprises.

## Compatibility guarantees

| Property | Status |
|---|---|
| `tokenizer.json` (vocab, merges, normalizer, pre-tokenizer, 1000 added_tokens) | **byte-identical** to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`); also byte-identical to `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16` |
| `tokenizer_config.json` | byte-identical to upstream Super 120B |
| `special_tokens_map.json` | byte-identical to upstream Super 120B |
| `chat_template.jinja` | **rewritten** (see below) |
| Special token IDs | unchanged: `<\|im_start\|>=10`, `<\|im_end\|>=11`, `<think>=12`, `</think>=13`, `<s>=1`, `</s>=2`, `<unk>=0` |
| Encoder behavior | `tok.encode(text)` returns the same IDs as upstream for any input text |
| Existing Nemotron checkpoints | Load and decode bit-identically; no resharding or embedding remapping needed |
| vLLM | Compatible. `tokenizer_class: PreTrainedTokenizerFast` is set; no `backend`/`is_local` keys; no `auto_map` to custom Python files |
| transformers | Compatible with both 4.57.x (sfm-evals pin) and 5.x |

The `<think>` and `</think>` tokens **remain in the vocabulary at their original IDs**. This means the tokenizer is fully compatible with reasoning models that emit those tokens; it just doesn't inject them itself. If you need reasoning-capable rendering, use the upstream Nemotron tokenizer instead.

## What changed in the chat template

The chat template is the only file that differs from upstream. Six things were removed:

### 1. `<think></think>` auto-prepend on assistant content

Upstream (lines 110-119 in upstream Super):

```jinja
{%- set content = message.content | default('', true) %}
{%- if content is string -%}
    {%- if '<think>' not in content and '</think>' not in content -%}
        {%- set content = "<think></think>" ~ content -%}
    {%- endif -%}
{%- endif -%}
```

This template:

```jinja
{%- set content = message.content | default('', true) %}
```

Assistant content passes through verbatim.

### 2. `reasoning_content` → `<think>...</think>` wrapping

Upstream (lines 107-109): if a message has a `reasoning_content` field, the template wraps it in `<think>...</think>` and prepends it to the regular content.

This template: removed entirely. The `reasoning_content` field is ignored.

### 3. `truncate_history_thinking` logic

Upstream (lines 14, 124-140, 161-175): when `truncate_history_thinking=True` (the default), older assistant turns have their think traces stripped and replaced with `<think></think>` stubs, and their content is partially truncated.

This template: removed. **Older assistant turns are kept in full**, exactly as supplied. The kwarg is no longer consulted.

### 4. `enable_thinking` two-branch generation prompt

Upstream (lines 12, 203-208):

```jinja
{%- if add_generation_prompt %}
    {%- if enable_thinking %}
        {{- '<|im_start|>assistant\n<think>\n' }}
    {%- else %}
        {{- '<|im_start|>assistant\n<think></think>' }}
    {%- endif %}
{%- endif %}
```

This template:

```jinja
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
```

The generation prompt always ends at the clean `<|im_start|>assistant\n` boundary. The `enable_thinking` kwarg is accepted but ignored.

### 5. `low_effort` reasoning-effort annotation

Upstream Super only (lines 13, 180-184): when `low_effort=True`, the template appends `\n\n{reasoning effort: low}` to the last user message. This signals the model to produce shorter reasoning traces.

This template: removed. The `low_effort` kwarg is accepted but ignored.

### 6. `last_user_idx` namespace tracking

Upstream (lines 16-22, 34-40): two scans over the message list to find the last user message index, used by `truncate_history_thinking` and `low_effort`.

This template: both consumers are removed, so the tracking is gone too. This saves 14 lines of dead Jinja.

### What was kept

Everything else is identical to upstream Super 120B:

- System message rendering
- Tool definitions block (`<tools>...</tools>`) with all type/parameter/required/enum handling
- Tool-call rendering inside assistant turns (`<tool_call>` blocks)
- Tool response rendering (`<tool_response>...</tool_response>`)
- The reminder block injected when tools are present
- User and system message framing with `<|im_start|>` / `<|im_end|>`

## Behavior reference

For inputs **without** any think tags, here's what each call produces (shown as rendered text, i.e. with `tokenize=False`):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "And 3+3?"},
]
```

| Call | Output (last 60 chars) |
|---|---|
| `apply_chat_template(msgs)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n` |
| `apply_chat_template(msgs, add_generation_prompt=True)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n<\|im_start\|>assistant\n` |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=True)` | (same as above; kwarg ignored) |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=False)` | (same as above) |

The full training-time render of the above messages contains zero `<think>` or `</think>` tokens. Compare with upstream Super, where the same input produces:

```
...<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n...
```

i.e. an injected `<think></think>` per assistant turn, plus a `<think>\n` opening at the generation-prompt boundary.

If your **input** messages contain explicit `<think>...</think>` content (e.g.
you're rendering a dataset that already has reasoning traces from a teacher model), those think tags **pass through verbatim**. The template only refuses to *inject* think tags; it doesn't strip them from your input.

## When to use a different tokenizer

| Use case | Use this tokenizer? |
|---|---|
| Instruct SFT (no reasoning) | ✅ Yes |
| Continued pretraining (CPT) on raw text | ✅ Yes (chat template is irrelevant) |
| LoRA / DoRA fine-tuning of an instruct model | ✅ Yes |
| Reasoning / thinking SFT (e.g. with `<think>` traces in training data) | ❌ Use `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` so the generation prompt opens a `<think>\n` block |
| Tool-calling agent SFT (no reasoning) | ✅ Yes (tool rendering is preserved) |
| Inference on a model that *was* trained with reasoning | ❌ Mismatch: the model expects to see `<think>\n` opened on the generation prompt |
| Base model evaluation | ⚠️ The chat template will work but produces an empty system header for messages with no system role; use the upstream `*-Base-BF16` tokenizer for consistency with base-model conventions |

## Usage with vLLM

```bash
# Standard vLLM serve: the tokenizer is loaded from the model directory by default.
# To override with this tokenizer, pass --tokenizer.
# MODEL is a placeholder for your checkpoint path or Hub ID.
# --chat-template is only needed if you want to override the template further.
vllm serve "$MODEL" \
  --tokenizer geodesic-research/nemotron-instruct-tokenizer \
  --chat-template /path/to/chat_template.jinja
```

The tokenizer ships `chat_template.jinja` as a file (not embedded in `tokenizer_config.json`), which vLLM picks up automatically.
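
Before pointing a server or a data pipeline at a chat template, it can help to confirm that a rendered prompt carries no think tags beyond those already present in the input messages. The following is a minimal sketch, not part of this repository: the helper names `count_think_tags` and `injected_think_tags` are hypothetical, the two rendered strings are hand-written stand-ins for template output, and the check assumes message `content` strings are the only legitimate source of think tags.

```python
THINK_TAGS = ("<think>", "</think>")


def count_think_tags(text: str) -> int:
    """Count occurrences of either think tag in a string."""
    return sum(text.count(tag) for tag in THINK_TAGS)


def injected_think_tags(messages: list[dict], rendered: str) -> int:
    """Think tags in the rendered prompt that were not supplied by the input messages."""
    supplied = sum(
        count_think_tags(m["content"])
        for m in messages
        if isinstance(m.get("content"), str)
    )
    return count_think_tags(rendered) - supplied


msgs = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
]

# Hand-written strings standing in for template output (illustrative only).
clean = (
    "<|im_start|>user\nWhat is 2+2?<|im_end|>\n"
    "<|im_start|>assistant\n2+2 equals 4.<|im_end|>\n"
)
# Simulate a template that prepends an empty think block to the assistant turn.
tagged = clean.replace("assistant\n", "assistant\n<think></think>")

print(injected_think_tags(msgs, clean))   # 0
print(injected_think_tags(msgs, tagged))  # 2
```

In practice `rendered` would come from `tok.apply_chat_template(msgs, tokenize=False)`; a nonzero result suggests the template in use is injecting tags of its own.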
## Usage in training (megatron-bridge / NeMo)

In your training YAML:

```yaml
tokenizer:
  tokenizer_model: geodesic-research/nemotron-instruct-tokenizer
```

Or in the recipe definition:

```python
cfg.tokenizer.tokenizer_model = "geodesic-research/nemotron-instruct-tokenizer"
```

The data pipeline (`pipeline_data_prepare.py`) will use this tokenizer's chat template when rendering `messages` columns from HuggingFace datasets, producing packed parquets with no injected think tags.

## Provenance

- **Base**: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (revision `49ad1f46ee9df444a0a3b8b63520faa1ca66324a`)
- **Encoder source**: identical to the NVIDIA Nemotron 3 family (Super 120B, Nano 30B, Base variants of either); same `tokenizer.json` blob (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`)
- **Chat template**: derived from upstream Super 120B, with the six removals listed above
- **License**: NVIDIA Open Model License (inherited from upstream)
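
The byte-identity claim for `tokenizer.json` can be checked locally by hashing a downloaded copy and comparing against the sha256 listed above. This is a minimal sketch using only the standard library; the function names `sha256_of_file` and `matches_upstream` are hypothetical, and the file path in the usage note is a placeholder.

```python
import hashlib
from pathlib import Path

# sha256 of the upstream tokenizer.json, as stated in the Provenance section.
EXPECTED_SHA256 = "623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_upstream(path: Path) -> bool:
    """True if the file's digest equals the documented upstream digest."""
    return sha256_of_file(path) == EXPECTED_SHA256
```

For example, `matches_upstream(Path("/path/to/tokenizer.json"))` should return `True` for a copy of this repository's `tokenizer.json` if the claim holds.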