---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
library_name: transformers
tags:
- nemotron
- tokenizer
- instruct
- chat-template
base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
---
# Nemotron Instruct Tokenizer
A drop-in replacement for the Nemotron 3 tokenizer, **purpose-built for non-reasoning instruct SFT runs**. The encoder, vocabulary, and special-token IDs are byte-identical to upstream NVIDIA Nemotron 3, so existing checkpoints load unchanged and text tokenizes to the same IDs. The only change is the chat template: this tokenizer **never injects `<think>` or `</think>` tags anywhere**, neither during message rendering nor at generation-prompt time.
## Why this exists
The upstream Nemotron 3 chat template is designed for reasoning-capable models. By default it:
1. **Auto-prepends `<think></think>`** to assistant messages that don't already contain think tags. So if your training data is `{"role": "assistant", "content": "The answer is 42."}`, the rendered string becomes `<|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>`.
2. **Wraps `reasoning_content`** message fields in `<think>...</think>`.
3. **Truncates older assistant turns** in multi-turn history and replaces their content with `<think></think>` stubs (controlled by `truncate_history_thinking`, default `True`).
4. **Emits a think-tagged generation prompt**: `<|im_start|>assistant\n<think>\n` when `enable_thinking` is true, or `<|im_start|>assistant\n<think></think>` when it is false.
For an **instruct-only SFT pipeline that never trains on reasoning traces**, every one of these behaviors causes problems:
- During training: the auto-prepend silently injects `<think></think>` into the loss-bearing region of every assistant turn, so the model learns to emit `<think></think>` literally — even when there's no reasoning to do.
- At inference time: vLLM rollouts on the resulting model leak stray `</think>` tokens mid-response and sometimes repeat the answer, because the model was conditioned on think tags it has nothing to put inside.
- The two upstream template revisions (10771-byte and 10505-byte) ship with conflicting `enable_thinking` defaults (`True` vs `False`), making it ambiguous what `tokenizer.apply_chat_template(msgs, add_generation_prompt=True)` returns without explicit kwargs.
This tokenizer removes all four behaviors. Your assistant turns render as `<|im_start|>assistant\n<content><|im_end|>` exactly. Your generation prompts end at `<|im_start|>assistant\n` exactly. No surprises.
## Compatibility guarantees
| Property | Status |
|---|---|
| `tokenizer.json` (vocab, merges, normalizer, pre-tokenizer, 1000 added_tokens) | **byte-identical** to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`) — also byte-identical to `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16` |
| `tokenizer_config.json` | byte-identical to upstream Super 120B |
| `special_tokens_map.json` | byte-identical to upstream Super 120B |
| `chat_template.jinja` | **rewritten** (see below) |
| Special token IDs | unchanged: `<\|im_start\|>=10`, `<\|im_end\|>=11`, `<think>=12`, `</think>=13`, `<s>=1`, `</s>=2`, `<unk>=0` |
| Encoder behavior | `tok.encode(text)` returns the same IDs as upstream for any input text |
| Existing Nemotron checkpoints | Load and decode bit-identically — no resharding, no embedding remapping needed |
| vLLM | Compatible. `tokenizer_class: PreTrainedTokenizerFast` is set; no `backend`/`is_local` keys; no `auto_map` to custom Python files |
| transformers | Compatible with both 4.57.x (sfm-evals pin) and 5.x |
The `<think>` and `</think>` tokens **remain in the vocabulary at their original IDs**. This means the tokenizer is fully compatible with reasoning models that emit those tokens — it just doesn't inject them itself. If you need reasoning-capable rendering, use the upstream Nemotron tokenizer instead.
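To check the byte-identity claim for `tokenizer.json` locally, one option is to hash the file and compare the result with the digest in the table. The sketch below assumes the file is already on disk; `sha256_of_file` and `matches_upstream` are illustrative helper names, not part of this repository.

```python
import hashlib
from pathlib import Path

# Digest quoted in the compatibility table above.
EXPECTED_TOKENIZER_SHA256 = "623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_upstream(path):
    """True if the file at `path` hashes to the digest listed in the table."""
    return sha256_of_file(path) == EXPECTED_TOKENIZER_SHA256
```

Calling `matches_upstream("path/to/tokenizer.json")` on a local copy should return `True` if the file is the same blob the table describes.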
## What changed in the chat template
The chat template is the only file that differs from upstream. Six things were removed:
### 1. `<think></think>` auto-prepend on assistant content
Upstream (lines 110-119 in upstream Super):
```jinja
{%- set content = message.content | default('', true) %}
{%- if content is string -%}
    {%- if '<think>' not in content and '</think>' not in content -%}
        {%- set content = "<think></think>" ~ content -%}
    {%- endif -%}
{%- endif -%}
```
This template:
```jinja
{%- set content = message.content | default('', true) %}
```
Assistant content passes through verbatim.
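The difference between the two snippets can be restated in plain Python. This is an illustrative mirror of the Jinja logic quoted above, not code from either template; the function names are made up for this example.

```python
def upstream_assistant_content(content):
    """Mirror of the upstream check: prepend an empty think block if no think tag is present."""
    content = content or ""  # mirrors `default('', true)`
    if isinstance(content, str) and "<think>" not in content and "</think>" not in content:
        return "<think></think>" + content
    return content


def instruct_assistant_content(content):
    """Mirror of this template: content passes through unchanged."""
    return content or ""  # mirrors `default('', true)`


def frame_assistant_turn(content):
    """Wrap content in the assistant framing described in this README."""
    return "<|im_start|>assistant\n" + content + "<|im_end|>"


upstream = frame_assistant_turn(upstream_assistant_content("The answer is 42."))
instruct = frame_assistant_turn(instruct_assistant_content("The answer is 42."))
print(repr(upstream))  # '<|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>'
print(repr(instruct))  # '<|im_start|>assistant\nThe answer is 42.<|im_end|>'
```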
### 2. `reasoning_content` → `<think>...</think>` wrapping
Upstream (lines 107-109): if a message has a `reasoning_content` field, the template wraps it in `<think>...</think>` and prepends to the regular content.
This template: removed entirely. The `reasoning_content` field is ignored.
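Because the field is ignored without any warning, it can be worth scanning a dataset for it before rendering. A minimal sketch, assuming messages are plain dicts in the usual `role`/`content` shape; `find_reasoning_content` is a hypothetical helper, not part of this repository.

```python
def find_reasoning_content(messages):
    """Return indices of messages carrying a non-empty `reasoning_content` field."""
    return [
        i
        for i, message in enumerate(messages)
        if message.get("reasoning_content")
    ]


example = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4.", "reasoning_content": "2+2=4"},
]
print(find_reasoning_content(example))  # [1]
```

Any index returned here points at a message whose reasoning text would not appear in the rendered output.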
### 3. `truncate_history_thinking` logic
Upstream (lines 14, 124-140, 161-175): when `truncate_history_thinking=True` (the default), older assistant turns have their think traces stripped and replaced with `<think></think>` stubs, and their content is partially truncated.
This template: removed. **Older assistant turns are kept in full**, exactly as supplied. The kwarg is no longer consulted.
### 4. `enable_thinking` two-branch generation prompt
Upstream (lines 12, 203-208):
```jinja
{%- if add_generation_prompt %}
    {%- if enable_thinking %}
        {{- '<|im_start|>assistant\n<think>\n' }}
    {%- else %}
        {{- '<|im_start|>assistant\n<think></think>' }}
    {%- endif %}
{%- endif %}
```
This template:
```jinja
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
```
Generation prompt always ends at the clean `<|im_start|>assistant\n` boundary. The `enable_thinking` kwarg is accepted but ignored.
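In plain Python, the two generation-prompt behaviors look roughly like this. It is a sketch that mirrors the Jinja above for illustration; the function names are not part of either template.

```python
def upstream_generation_prompt(enable_thinking):
    """Mirror of the upstream two-branch generation prompt."""
    if enable_thinking:
        return "<|im_start|>assistant\n<think>\n"
    return "<|im_start|>assistant\n<think></think>"


def instruct_generation_prompt(enable_thinking=None):
    """Mirror of this template: the kwarg is accepted but has no effect."""
    return "<|im_start|>assistant\n"


# The instruct prompt is the same regardless of the flag.
print(instruct_generation_prompt(True) == instruct_generation_prompt(False))  # True
```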
### 5. `low_effort` reasoning-effort annotation
Upstream Super only (lines 13, 180-184): when `low_effort=True`, appends `\n\n{reasoning effort: low}` to the last user message. This signals the model to produce shorter reasoning traces.
This template: removed. The `low_effort` kwarg is accepted but ignored.
### 6. `last_user_idx` namespace tracking
Upstream (lines 16-22, 34-40): two scans over the message list to find the last user message index. Used by `truncate_history_thinking` and `low_effort`.
This template: both consumers removed, so the tracking is gone too. Saves 14 lines of dead Jinja.
### What was kept
Everything else is identical to upstream Super 120B:
- System message rendering
- Tool definitions block (`<tools>...`) with all type/parameter/required/enum handling
- Tool-call rendering inside assistant turns (`<tool_call><function=...><parameter=...>`)
- Tool response rendering (`<tool_response>...</tool_response>`)
- The `<IMPORTANT>` reminder block injected when tools are present
- User and system message framing with `<|im_start|>` / `<|im_end|>`
## Behavior reference
For inputs **without** any think tags, here's what each call produces:
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")
msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "And 3+3?"},
]
```
| Call | Output (tail) |
|---|---|
| `apply_chat_template(msgs)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n` |
| `apply_chat_template(msgs, add_generation_prompt=True)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n<\|im_start\|>assistant\n` |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=True)` | (same as above — kwarg ignored) |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=False)` | (same as above) |
The full training-time render of the above messages contains zero `<think>` or `</think>` tokens. Compare with upstream Super, where the same input produces:
```
...<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n...
```
i.e. an injected `<think></think>` per assistant turn, plus a `<think>\n` opening (or a `<think></think>` stub, depending on `enable_thinking`) at the generation-prompt boundary.
If your **input** messages contain explicit `<think>...</think>` content (e.g. you're rendering a dataset that already has reasoning traces from a teacher model), those think tags **pass through verbatim**. The template only refuses to *inject* think tags; it doesn't strip them from your input.
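One way to check the "no injection" property on your own data is to compare think-tag counts before and after rendering. The sketch below works on plain strings, so it can be pointed at the output of `apply_chat_template(msgs, tokenize=False)`; `count_think_tags` and `injected_think_tags` are hypothetical helper names, and the rendered strings in the example are hand-written fragments rather than real template output.

```python
THINK_TAGS = ("<think>", "</think>")


def count_think_tags(text):
    """Count occurrences of `<think>` and `</think>` in a string."""
    return sum(text.count(tag) for tag in THINK_TAGS)


def injected_think_tags(messages, rendered):
    """Think tags in `rendered` that were not already present in message contents."""
    supplied = sum(
        count_think_tags(m["content"])
        for m in messages
        if isinstance(m.get("content"), str)
    )
    return count_think_tags(rendered) - supplied


history = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
]
clean = "<|im_start|>assistant\n2+2 equals 4.<|im_end|>\n"
tagged = "<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n"
print(injected_think_tags(history, clean))   # 0
print(injected_think_tags(history, tagged))  # 2
```

A result of zero means every think tag in the rendered text came from the input messages, which is the pass-through behavior described above.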
## When to use a different tokenizer
| Use case | Use this tokenizer? |
|---|---|
| Instruct SFT (no reasoning) | ✅ Yes |
| Continued pretraining (CPT) on raw text | ✅ Yes — chat template is irrelevant |
| LoRA / DoRA fine-tuning of an instruct model | ✅ Yes |
| Reasoning / thinking SFT (e.g. with `<think>` traces in training data) | ❌ Use `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` so the generation prompt opens a `<think>\n` block |
| Tool-calling agent SFT (no reasoning) | ✅ Yes — tool rendering is preserved |
| Inference on a model that *was* trained with reasoning | ❌ Mismatch — the model expects to see `<think>\n` opened on the generation prompt |
| Base model evaluation | ⚠️ The chat template works, but it produces an empty system header for conversations that have no system message; use the upstream `*-Base-BF16` tokenizer for consistency with base-model conventions |
## Usage with vLLM
```bash
# Standard vLLM serve — tokenizer is loaded from the model directory by default.
# To override with this tokenizer, pass --tokenizer:
vllm serve <your-model> \
  --tokenizer geodesic-research/nemotron-instruct-tokenizer \
  --chat-template /path/to/chat_template.jinja  # only needed if you want to override further
```
The tokenizer ships `chat_template.jinja` as a file (not embedded in `tokenizer_config.json`), which vLLM picks up automatically.
## Usage in training (megatron-bridge / NeMo)
In your training YAML:
```yaml
tokenizer:
  tokenizer_model: geodesic-research/nemotron-instruct-tokenizer
```
Or in the recipe definition:
```python
cfg.tokenizer.tokenizer_model = "geodesic-research/nemotron-instruct-tokenizer"
```
The data pipeline (`pipeline_data_prepare.py`) will use this tokenizer's chat template when rendering `messages` columns from HuggingFace datasets, producing packed parquets with no injected think tags.
## Provenance
- **Base**: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (revision `49ad1f46ee9df444a0a3b8b63520faa1ca66324a`)
- **Encoder source**: identical to NVIDIA Nemotron 3 family (Super 120B, Nano 30B, Base variants of either) — same `tokenizer.json` blob (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`)
- **Chat template**: derived from upstream Super 120B, with the six removals listed above
- **License**: NVIDIA Open Model License (inherited from upstream)