	---
	license: other
	license_name: nvidia-open-model-license
	license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
	library_name: transformers
	tags:
	- nemotron
	- tokenizer
	- instruct
	- chat-template
	base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
	---

	# Nemotron Instruct Tokenizer

	A drop-in replacement for the Nemotron 3 tokenizer, purpose-built for non-reasoning instruct SFT runs. The encoder, vocabulary, and special-token IDs are byte-identical to upstream NVIDIA Nemotron 3, so existing model weights load and tokenize identically. The only change is the chat template: this tokenizer never injects `<think>` or `</think>` tags anywhere — neither during message rendering nor at generation-prompt time.

	## Why this exists

	The upstream Nemotron 3 chat template is designed for reasoning-capable models. By default it:

	1. Auto-prepends `<think></think>` to assistant messages that don't already contain think tags. So if your training data is `{"role": "assistant", "content": "The answer is 42."}`, the rendered string becomes `<|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>`.
	2. Wraps `reasoning_content` message fields in `<think>...</think>`.
	3. Truncates older assistant turns in multi-turn history and replaces their content with `<think></think>` stubs (controlled by `truncate_history_thinking`, default `True`).
	4. Emits `<|im_start|>assistant\n<think>\n` (or `<|im_start|>assistant\n<think></think>`) as the generation prompt depending on `enable_thinking`.

	For an instruct-only SFT pipeline that never trains on reasoning traces, every one of these behaviors causes problems:

	- During training: the auto-prepend silently injects `<think></think>` into the loss-bearing region of every assistant turn, so the model learns to emit `<think></think>` literally — even when there's no reasoning to do.
	- At inference time: vLLM rollouts on the resulting model leak stray `</think>` tokens mid-response and sometimes repeat their answer twice, because the model was conditioned on think tags it has nothing to put inside.
	- The two upstream template revisions (10771-byte and 10505-byte) ship with conflicting `enable_thinking` defaults (`True` vs `False`), making it ambiguous what `tokenizer.apply_chat_template(msgs, add_generation_prompt=True)` returns without explicit kwargs.

	This tokenizer removes all four behaviors. Your assistant turns render as `<|im_start|>assistant\n<content><|im_end|>` exactly. Your generation prompts end at `<|im_start|>assistant\n` exactly. No surprises.

	## Compatibility guarantees

	| Property | Status |
	|---|---|
	| `tokenizer.json` (vocab, merges, normalizer, pre-tokenizer, 1000 added_tokens) | byte-identical to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`); also byte-identical to `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16` |
	| `tokenizer_config.json` | byte-identical to upstream Super 120B |
	| `special_tokens_map.json` | byte-identical to upstream Super 120B |
	| `chat_template.jinja` | rewritten (see below) |
	| Special token IDs | unchanged: `<\|im_start\|>=10`, `<\|im_end\|>=11`, `<think>=12`, `</think>=13`, `<s>=1`, `</s>=2`, `<unk>=0` |
	| Encoder behavior | `tok.encode(text)` returns the same IDs as upstream for any input text |
	| Existing Nemotron checkpoints | Load and decode bit-identically; no resharding or embedding remapping needed |
	| vLLM | Compatible. `tokenizer_class: PreTrainedTokenizerFast` is set; no `backend`/`is_local` keys; no `auto_map` to custom Python files |
	| transformers | Compatible with both 4.57.x (sfm-evals pin) and 5.x |

	The `<think>` and `</think>` tokens remain in the vocabulary at their original IDs. This means the tokenizer is fully compatible with reasoning models that emit those tokens — it just doesn't inject them itself. If you need reasoning-capable rendering, use the upstream Nemotron tokenizer instead.
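The byte-identity claim for `tokenizer.json` can be checked locally by hashing the file and comparing against the sha256 quoted in the table. The following is a minimal sketch using only the standard library; the helper names `sha256_of_file` and `matches_upstream` are illustrative and not part of this repository, and the file path is whatever local copy you have downloaded.

```python
import hashlib
from pathlib import Path

# sha256 quoted in the compatibility table for the upstream tokenizer.json blob.
EXPECTED_TOKENIZER_JSON_SHA256 = (
    "623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7"
)


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_upstream(path, expected=EXPECTED_TOKENIZER_JSON_SHA256):
    """True if the file at `path` hashes to the expected digest."""
    return sha256_of_file(path) == expected
```

Calling `matches_upstream("tokenizer.json")` on a downloaded copy should return `True` if the file is the blob described above.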

	## What changed in the chat template

	The chat template is the only file that differs from upstream. Six things were removed:

	### 1. `<think></think>` auto-prepend on assistant content

	Upstream (lines 110-119 in upstream Super):
	```jinja
	{%- set content = message.content | default('', true) %}
	{%- if content is string -%}
	    {%- if '<think>' not in content and '</think>' not in content -%}
	        {%- set content = "<think></think>" ~ content -%}
	    {%- endif -%}
	{%- endif -%}
	```

	This template:
	```jinja
	{%- set content = message.content | default('', true) %}
	```

	Assistant content passes through verbatim.
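To make the difference concrete, here is a plain-Python imitation of the two Jinja snippets above. It is a sketch of the logic only, not the template itself: the function names are hypothetical, and the framing helper simply reproduces the `<|im_start|>assistant\n...<|im_end|>` shape this README describes.

```python
def upstream_assistant_content(content):
    """Imitates the upstream auto-prepend: add an empty think block
    when the content is a string with no think tags in it."""
    if isinstance(content, str):
        if "<think>" not in content and "</think>" not in content:
            content = "<think></think>" + content
    return content


def instruct_assistant_content(content):
    """Imitates this template: content passes through verbatim."""
    return content


def render_assistant_turn(content):
    """Frame assistant content as described in this README."""
    return "<|im_start|>assistant\n" + content + "<|im_end|>"


if __name__ == "__main__":
    text = "The answer is 42."
    print(repr(render_assistant_turn(upstream_assistant_content(text))))
    print(repr(render_assistant_turn(instruct_assistant_content(text))))
```

Content that already carries think tags is left alone by both variants; the only difference is what happens to tag-free content.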

	### 2. `reasoning_content` → `<think>...</think>` wrapping

	Upstream (lines 107-109): if a message has a `reasoning_content` field, the template wraps it in `<think>...</think>` and prepends to the regular content.

	This template: removed entirely. The `reasoning_content` field is ignored.

	### 3. `truncate_history_thinking` logic

	Upstream (lines 14, 124-140, 161-175): when `truncate_history_thinking=True` (the default), older assistant turns have their think traces stripped and replaced with `<think></think>` stubs, and their content is partially truncated.

	This template: removed. Older assistant turns are kept in full, exactly as supplied. The kwarg is no longer consulted.

	### 4. `enable_thinking` two-branch generation prompt

	Upstream (lines 12, 203-208):
	```jinja
	{%- if add_generation_prompt %}
	    {%- if enable_thinking %}
	        {{- '<|im_start|>assistant\n<think>\n' }}
	    {%- else %}
	        {{- '<|im_start|>assistant\n<think></think>' }}
	    {%- endif %}
	{%- endif %}
	```

	This template:
	```jinja
	{%- if add_generation_prompt %}
	    {{- '<|im_start|>assistant\n' }}
	{%- endif %}
	```

	The generation prompt always ends at the clean `<|im_start|>assistant\n` boundary. The `enable_thinking` kwarg is accepted but ignored.
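The same contrast in plain Python, as a sketch of the two generation-prompt branches above rather than the templates themselves (both function names are hypothetical):

```python
def upstream_generation_prompt(enable_thinking):
    """Imitates the upstream two-branch generation prompt."""
    if enable_thinking:
        return "<|im_start|>assistant\n<think>\n"
    return "<|im_start|>assistant\n<think></think>"


def instruct_generation_prompt(**ignored_kwargs):
    """Imitates this template: one fixed suffix, whatever kwargs are passed."""
    return "<|im_start|>assistant\n"
```

Because extra kwargs are accepted and dropped, callers that still pass `enable_thinking` keep working and get the same suffix either way.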

	### 5. `low_effort` reasoning-effort annotation

	Upstream Super only (lines 13, 180-184): when `low_effort=True`, appends `\n\n{reasoning effort: low}` to the last user message. This signals the model to produce shorter reasoning traces.

	This template: removed. The `low_effort` kwarg is accepted but ignored.

	### 6. `last_user_idx` namespace tracking

	Upstream (lines 16-22, 34-40): two scans over the message list to find the last user message index. Used by `truncate_history_thinking` and `low_effort`.

	This template: both consumers removed, so the tracking is gone too. Saves 14 lines of dead Jinja.

	### What was kept

	Everything else is identical to upstream Super 120B:
	- System message rendering
	- Tool definitions block (`<tools>...`) with all type/parameter/required/enum handling
	- Tool-call rendering inside assistant turns (`<tool_call><function=...><parameter=...>`)
	- Tool response rendering (`<tool_response>...</tool_response>`)
	- The `<IMPORTANT>` reminder block injected when tools are present
	- User and system message framing with `<|im_start|>` / `<|im_end|>`

	## Behavior reference

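	For inputs without any think tags, here's what each call produces. Outputs are shown as rendered strings, i.e. with `tokenize=False`; the `tok.` prefix is omitted in the table for brevity: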

	```python
	from transformers import AutoTokenizer

	tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

	msgs = [
	    {"role": "system", "content": "You are a helpful assistant."},
	    {"role": "user", "content": "What is 2+2?"},
	    {"role": "assistant", "content": "2+2 equals 4."},
	    {"role": "user", "content": "And 3+3?"},
	]
	```

	| Call | Output (last 60 chars) |
	|---|---|
	| `apply_chat_template(msgs)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n` |
	| `apply_chat_template(msgs, add_generation_prompt=True)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n<\|im_start\|>assistant\n` |
	| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=True)` | (same as above; kwarg ignored) |
	| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=False)` | (same as above) |

	The full training-time render of the above messages contains zero `<think>` or `</think>` tokens. Compare with upstream Super, where the same input produces:

	```
	...<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n...
	```

	That is, upstream injects `<think></think>` into every assistant turn and, when `add_generation_prompt=True`, ends the prompt with a `<think>\n` opening (or `<think></think>` if `enable_thinking` is false).

	If your input messages contain explicit `<think>...</think>` content (e.g. you're rendering a dataset that already has reasoning traces from a teacher model), those think tags pass through verbatim. The template only refuses to inject think tags; it doesn't strip them from your input.
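One way to guard a data pipeline against accidental injection is to compare the number of think tags in the rendered string with the number already present in the input messages. The sketch below assumes string-valued `content` fields; `count_think_tags` and `injected_think_tags` are hypothetical helper names, not part of this repository.

```python
import re

# Matches either the opening or the closing think tag.
THINK_TAG = re.compile(r"</?think>")


def count_think_tags(text):
    """Count occurrences of <think> and </think> in a string."""
    return len(THINK_TAG.findall(text))


def injected_think_tags(messages, rendered):
    """Think tags found in `rendered` beyond those supplied in `messages`.

    Returns 0 when the template only passed through what the input
    contained, and a positive number when the template added tags.
    """
    supplied = sum(
        count_think_tags(message["content"])
        for message in messages
        if isinstance(message.get("content"), str)
    )
    return count_think_tags(rendered) - supplied
```

With a real tokenizer, `rendered` would come from `tok.apply_chat_template(msgs, tokenize=False)`; with this template the result is expected to be 0 for any input.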

	## When to use a different tokenizer

	| Use case | Use this tokenizer? |
	|---|---|
	| Instruct SFT (no reasoning) | ✅ Yes |
	| Continued pretraining (CPT) on raw text | ✅ Yes (the chat template is irrelevant) |
	| LoRA / DoRA fine-tuning of an instruct model | ✅ Yes |
	| Reasoning / thinking SFT (e.g. with `<think>` traces in training data) | ❌ Use `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` so the generation prompt opens a `<think>\n` block |
	| Tool-calling agent SFT (no reasoning) | ✅ Yes (tool rendering is preserved) |
	| Inference on a model that was trained with reasoning | ❌ Mismatch: the model expects to see `<think>\n` opened on the generation prompt |
	| Base model evaluation | ⚠️ The chat template will work but produces an empty system header for messages with no system role; use the upstream `*-Base-BF16` tokenizer for consistency with base-model conventions |

	## Usage with vLLM

	```bash
	# Standard vLLM serve: the tokenizer is loaded from the model directory by default.
	# To override with this tokenizer, pass --tokenizer.
	# --chat-template is only needed if you want to override the template further.
	vllm serve <your-model> \
	    --tokenizer geodesic-research/nemotron-instruct-tokenizer \
	    --chat-template /path/to/chat_template.jinja
	```

	The tokenizer ships `chat_template.jinja` as a file (not embedded in `tokenizer_config.json`), which vLLM picks up automatically.

	## Usage in training (megatron-bridge / NeMo)

	In your training YAML:
	```yaml
	tokenizer:
	  tokenizer_model: geodesic-research/nemotron-instruct-tokenizer
	```

	Or in the recipe definition:
	```python
	cfg.tokenizer.tokenizer_model = "geodesic-research/nemotron-instruct-tokenizer"
	```

	The data pipeline (`pipeline_data_prepare.py`) will use this tokenizer's chat template when rendering `messages` columns from HuggingFace datasets, producing packed parquets with no injected think tags.

	## Provenance

	- Base: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (revision `49ad1f46ee9df444a0a3b8b63520faa1ca66324a`)
	- Encoder source: identical to NVIDIA Nemotron 3 family (Super 120B, Nano 30B, Base variants of either) — same `tokenizer.json` blob (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`)
	- Chat template: derived from upstream Super 120B, with the six removals listed above
	- License: NVIDIA Open Model License (inherited from upstream)