---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
library_name: transformers
tags:
  - nemotron
  - tokenizer
  - instruct
  - chat-template
base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
---

# Nemotron Instruct Tokenizer

A drop-in replacement for the Nemotron 3 tokenizer, **purpose-built for non-reasoning instruct SFT runs**. The encoder, vocabulary, and special-token IDs are byte-identical to upstream NVIDIA Nemotron 3, so existing model weights load and tokenize identically. The only change is the chat template: this tokenizer **never injects `<think>` or `</think>` tags anywhere** — neither during message rendering nor at generation-prompt time.

## Why this exists

The upstream Nemotron 3 chat template is designed for reasoning-capable models. By default it:

1. **Auto-prepends `<think></think>`** to assistant messages that don't already contain think tags. So if your training data is `{"role": "assistant", "content": "The answer is 42."}`, the rendered string becomes `<|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>`.
2. **Wraps `reasoning_content`** message fields in `<think>...</think>`.
3. **Truncates older assistant turns** in multi-turn history and replaces their content with `<think></think>` stubs (controlled by `truncate_history_thinking`, default `True`).
4. **Emits `<|im_start|>assistant\n<think>\n` (or `<think></think>`)** as the generation prompt depending on `enable_thinking`.

For an **instruct-only SFT pipeline that never trains on reasoning traces**, every one of these behaviors causes problems:

- During training: the auto-prepend silently injects `<think></think>` into the loss-bearing region of every assistant turn, so the model learns to emit `<think></think>` literally — even when there's no reasoning to do.
- At inference time: vLLM rollouts on the resulting model leak stray `</think>` tokens mid-response and sometimes emit the same answer twice, because the model was conditioned on think tags it has nothing to put inside.
- The two upstream template revisions (10771-byte and 10505-byte) ship with conflicting `enable_thinking` defaults (`True` vs `False`), making it ambiguous what `tokenizer.apply_chat_template(msgs, add_generation_prompt=True)` returns without explicit kwargs.

This tokenizer removes all four behaviors. Your assistant turns render as `<|im_start|>assistant\n<content><|im_end|>` exactly. Your generation prompts end at `<|im_start|>assistant\n` exactly. No surprises.
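
As a rough mental model of the layout described above, the plain-text case can be sketched in a few lines of Python. This is not the shipped Jinja template: `render_plain_chat` is a hypothetical helper, it assumes every message renders as `<|im_start|>{role}\n{content}<|im_end|>\n`, and it does not reproduce tool blocks or system-header handling.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_plain_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Approximate the rendered layout for plain-text messages (assumption: no tools)."""
    parts = []
    for message in messages:
        # Content is copied verbatim: nothing is prepended to assistant turns.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # The prompt stops at the assistant header, with no think tag after it.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


msgs = [
    {"role": "user", "content": "What is 6*7?"},
    {"role": "assistant", "content": "The answer is 42."},
]
rendered = render_plain_chat(msgs)
prompt = render_plain_chat(msgs[:1], add_generation_prompt=True)
```

Here `rendered` ends with the assistant turn exactly as supplied, and `prompt` ends at the bare assistant header, which is the boundary a loss mask or a rollout would start from.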

## Compatibility guarantees

| Property | Status |
|---|---|
| `tokenizer.json` (vocab, merges, normalizer, pre-tokenizer, 1000 added_tokens) | **byte-identical** to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`) — also byte-identical to `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16` |
| `tokenizer_config.json` | byte-identical to upstream Super 120B |
| `special_tokens_map.json` | byte-identical to upstream Super 120B |
| `chat_template.jinja` | **rewritten** (see below) |
| Special token IDs | unchanged: `<\|im_start\|>=10`, `<\|im_end\|>=11`, `<think>=12`, `</think>=13`, `<s>=1`, `</s>=2`, `<unk>=0` |
| Encoder behavior | `tok.encode(text)` returns the same IDs as upstream for any input text |
| Existing Nemotron checkpoints | Load and decode bit-identically — no resharding, no embedding remapping needed |
| vLLM | Compatible. `tokenizer_class: PreTrainedTokenizerFast` is set; no `backend`/`is_local` keys; no `auto_map` to custom Python files |
| transformers | Compatible with both 4.57.x (sfm-evals pin) and 5.x |

The `<think>` and `</think>` tokens **remain in the vocabulary at their original IDs**. This means the tokenizer is fully compatible with reasoning models that emit those tokens — it just doesn't inject them itself. If you need reasoning-capable rendering, use the upstream Nemotron tokenizer instead.
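
If you want to spot-check these guarantees on your own copy, something like the following may help. It is a sketch, not an official verification tool: the helper names are hypothetical, the expected values are copied from the table above, and `tok` is assumed to be any object exposing `convert_tokens_to_ids` (for example the result of `AutoTokenizer.from_pretrained`).

```python
import hashlib
from pathlib import Path

# Values copied from the compatibility table above.
EXPECTED_SHA256 = "623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7"
EXPECTED_IDS = {
    "<|im_start|>": 10,
    "<|im_end|>": 11,
    "<think>": 12,
    "</think>": 13,
    "<s>": 1,
    "</s>": 2,
    "<unk>": 0,
}


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large tokenizer.json is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_special_ids(tok) -> dict:
    """Return {token: actual_id} for every special token whose ID differs from the table."""
    mismatches = {}
    for token, expected in EXPECTED_IDS.items():
        actual = tok.convert_tokens_to_ids(token)
        if actual != expected:
            mismatches[token] = actual
    return mismatches
```

Comparing `sha256_of_file(path_to_your_tokenizer_json)` against `EXPECTED_SHA256` checks the encoder file, and an empty dict from `mismatched_special_ids(tok)` means the special-token IDs line up with the table. The path to `tokenizer.json` depends on where your local snapshot lives.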

## What changed in the chat template

The chat template is the only file that differs from upstream. Six things were removed:

### 1. `<think></think>` auto-prepend on assistant content

Upstream (lines 110-119 in upstream Super):
```jinja
{%- set content = message.content | default('', true) %}
{%- if content is string -%}
    {%- if '<think>' not in content and '</think>' not in content -%}
        {%- set content = "<think></think>" ~ content -%}
    {%- endif -%}
{%- endif -%}
```

This template:
```jinja
{%- set content = message.content | default('', true) %}
```

Assistant content passes through verbatim.

### 2. `reasoning_content` → `<think>...</think>` wrapping

Upstream (lines 107-109): if a message has a `reasoning_content` field, the template wraps it in `<think>...</think>` and prepends it to the regular content.

This template: removed entirely. The `reasoning_content` field is ignored.
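
Because the field is dropped silently, it may be worth auditing a dataset before rendering it. The helper below is a hypothetical pre-flight check (not part of this repository) that assumes messages are plain dicts and reports which ones carry a non-empty `reasoning_content`.

```python
from typing import Any, Dict, List


def indices_with_reasoning_content(messages: List[Dict[str, Any]]) -> List[int]:
    """Return positions of messages whose non-empty reasoning_content would be ignored."""
    return [
        index
        for index, message in enumerate(messages)
        if message.get("reasoning_content")
    ]


conversation = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": "2 plus 2 is 4."},
]
dropped = indices_with_reasoning_content(conversation)
```

A non-empty result means the rendered training text will contain less than the source record does, which is usually what an instruct-only run wants but is worth confirming.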

### 3. `truncate_history_thinking` logic

Upstream (lines 14, 124-140, 161-175): when `truncate_history_thinking=True` (the default), older assistant turns have their think traces stripped and replaced with `<think></think>` stubs, and their content is partially truncated.

This template: removed. **Older assistant turns are kept in full**, exactly as supplied. The kwarg is no longer consulted.

### 4. `enable_thinking` two-branch generation prompt

Upstream (lines 12, 203-208):
```jinja
{%- if add_generation_prompt %}
    {%- if enable_thinking %}
        {{- '<|im_start|>assistant\n<think>\n' }}
    {%- else %}
        {{- '<|im_start|>assistant\n<think></think>' }}
    {%- endif %}
{%- endif %}
```

This template:
```jinja
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
```

Generation prompt always ends at the clean `<|im_start|>assistant\n` boundary. The `enable_thinking` kwarg is accepted but ignored.
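
One way to confirm this on a loaded tokenizer is to render the same conversation under each `enable_thinking` setting and compare the results. The function below is an illustrative sketch with a hypothetical name; it assumes `tok` exposes the standard `apply_chat_template` method and that extra keyword arguments are forwarded to the template.

```python
def generation_prompt_is_stable(tok, messages) -> bool:
    """True if enable_thinking has no effect and the prompt ends at the assistant header."""
    renders = {
        tok.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, **kwargs
        )
        for kwargs in ({}, {"enable_thinking": True}, {"enable_thinking": False})
    }
    # A single distinct render means the kwarg changed nothing.
    return len(renders) == 1 and renders.pop().endswith("<|im_start|>assistant\n")
```

With this tokenizer the check is expected to return `True`; with a template that branches on `enable_thinking` it should return `False`.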

### 5. `low_effort` reasoning-effort annotation

Upstream Super only (lines 13, 180-184): when `low_effort=True`, appends `\n\n{reasoning effort: low}` to the last user message. This signals the model to produce shorter reasoning traces.

This template: removed. The `low_effort` kwarg is accepted but ignored.

### 6. `last_user_idx` namespace tracking

Upstream (lines 16-22, 34-40): two scans over the message list to find the last user message index. Used by `truncate_history_thinking` and `low_effort`.

This template: both consumers removed, so the tracking is gone too. Saves 14 lines of dead Jinja.

### What was kept

Everything else is identical to upstream Super 120B:
- System message rendering
- Tool definitions block (`<tools>...`) with all type/parameter/required/enum handling
- Tool-call rendering inside assistant turns (`<tool_call><function=...><parameter=...>`)
- Tool response rendering (`<tool_response>...</tool_response>`)
- The `<IMPORTANT>` reminder block injected when tools are present
- User and system message framing with `<|im_start|>` / `<|im_end|>`

## Behavior reference

For inputs **without** any think tags, here's what each call produces:

```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "And 3+3?"},
]
```

| Call on `tok` (pass `tokenize=False` to get a string) | Output (tail of the rendered string) |
|---|---|
| `apply_chat_template(msgs)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n` |
| `apply_chat_template(msgs, add_generation_prompt=True)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n<\|im_start\|>assistant\n` |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=True)` | (same as above — kwarg ignored) |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=False)` | (same as above) |

The full training-time render of the above messages contains zero `<think>` or `</think>` tokens. Compare with upstream Super, where the same input produces:

```
...<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n...
```

i.e. an injected `<think></think>` per assistant turn, plus `<think>\n` or `<think></think>` at the generation-prompt boundary, depending on `enable_thinking`.

If your **input** messages contain explicit `<think>...</think>` content (e.g. you're rendering a dataset that already has reasoning traces from a teacher model), those think tags **pass through verbatim**. The template only refuses to *inject* think tags; it doesn't strip them from your input.
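
Since pass-through means any think tags in the data reach the model unchanged, an instruct-only pipeline may want to scan for them up front. This is a minimal sketch with a hypothetical helper name, assuming each conversation is a list of dicts with string `content`.

```python
from typing import Dict, Iterable, List, Tuple

THINK_TAGS = ("<think>", "</think>")


def find_think_tags(conversations: Iterable[List[Dict[str, str]]]) -> List[Tuple[int, int]]:
    """Return (conversation_index, message_index) pairs whose content contains a think tag."""
    hits = []
    for conv_index, messages in enumerate(conversations):
        for msg_index, message in enumerate(messages):
            content = message.get("content") or ""
            # Non-string content (e.g. structured parts) is skipped in this sketch.
            if isinstance(content, str) and any(tag in content for tag in THINK_TAGS):
                hits.append((conv_index, msg_index))
    return hits


dataset = [
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "<think>add</think>4"}],
]
tagged = find_think_tags(dataset)
```

Whether to drop, strip, or keep the flagged records is a data decision; the template itself will render them as supplied.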

## When to use a different tokenizer

| Use case | Use this tokenizer? |
|---|---|
| Instruct SFT (no reasoning) | ✅ Yes |
| Continued pretraining (CPT) on raw text | ✅ Yes — chat template is irrelevant |
| LoRA / DoRA fine-tuning of an instruct model | ✅ Yes |
| Reasoning / thinking SFT (e.g. with `<think>` traces in training data) | ❌ Use `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` so the generation prompt opens a `<think>\n` block |
| Tool-calling agent SFT (no reasoning) | ✅ Yes — tool rendering is preserved |
| Inference on a model that *was* trained with reasoning | ❌ Mismatch — the model expects to see `<think>\n` opened on the generation prompt |
| Base model evaluation | ⚠️ The chat template works but emits an empty system header when the conversation has no system message; use the upstream `*-Base-BF16` tokenizer for consistency with base-model conventions |

## Usage with vLLM

```bash
# Standard vLLM serve — tokenizer is loaded from the model directory by default.
# To override with this tokenizer, pass --tokenizer:
vllm serve <your-model> \
    --tokenizer geodesic-research/nemotron-instruct-tokenizer \
    --chat-template /path/to/chat_template.jinja  # only needed if you want to override further
```

The tokenizer ships `chat_template.jinja` as a file (not embedded in `tokenizer_config.json`), which vLLM picks up automatically.

## Usage in training (megatron-bridge / NeMo)

In your training YAML:
```yaml
tokenizer:
  tokenizer_model: geodesic-research/nemotron-instruct-tokenizer
```

Or in the recipe definition:
```python
cfg.tokenizer.tokenizer_model = "geodesic-research/nemotron-instruct-tokenizer"
```

The data pipeline (`pipeline_data_prepare.py`) will use this tokenizer's chat template when rendering `messages` columns from HuggingFace datasets, producing packed parquets with no injected think tags.

## Provenance

- **Base**: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (revision `49ad1f46ee9df444a0a3b8b63520faa1ca66324a`)
- **Encoder source**: identical to NVIDIA Nemotron 3 family (Super 120B, Nano 30B, Base variants of either) — same `tokenizer.json` blob (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`)
- **Chat template**: derived from upstream Super 120B, with the six removals listed above
- **License**: NVIDIA Open Model License (inherited from upstream)