| license: other | |
| license_name: nvidia-open-model-license | |
| license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/ | |
| library_name: transformers | |
| tags: | |
| - nemotron | |
| - tokenizer | |
| - instruct | |
| - chat-template | |
| base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | |
| # Nemotron Instruct Tokenizer | |
| A drop-in replacement for the Nemotron 3 tokenizer, **purpose-built for non-reasoning instruct SFT runs**. The encoder, vocabulary, and special-token IDs are byte-identical to upstream NVIDIA Nemotron 3, so existing model weights load and tokenize identically. The only change is the chat template: this tokenizer **never injects `<think>` or `</think>` tags anywhere** — neither during message rendering nor at generation-prompt time. | |
| ## Why this exists | |
| The upstream Nemotron 3 chat template is designed for reasoning-capable models. By default it: | |
| 1. **Auto-prepends `<think></think>`** to assistant messages that don't already contain think tags. So if your training data is `{"role": "assistant", "content": "The answer is 42."}`, the rendered string becomes `<|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>`. | |
| 2. **Wraps `reasoning_content`** message fields in `<think>...</think>`. | |
| 3. **Truncates older assistant turns** in multi-turn history and replaces their content with `<think></think>` stubs (controlled by `truncate_history_thinking`, default `True`). | |
| 4. **Emits `<|im_start|>assistant\n<think>\n` (or `<think></think>`)** as the generation prompt depending on `enable_thinking`. | |
| For an **instruct-only SFT pipeline that never trains on reasoning traces**, every one of these behaviors causes problems: | |
| - During training: the auto-prepend silently injects `<think></think>` into the loss-bearing region of every assistant turn, so the model learns to emit `<think></think>` literally — even when there's no reasoning to do. | |
| - At inference time: vLLM rollouts on the resulting model leak stray `</think>` tokens mid-response and sometimes repeat their answer, because the model was conditioned on think tags it has nothing to put inside. | |
| - The two upstream template revisions (10771-byte and 10505-byte) ship with conflicting `enable_thinking` defaults (`True` vs `False`), making it ambiguous what `tokenizer.apply_chat_template(msgs, add_generation_prompt=True)` returns without explicit kwargs. | |
| This tokenizer removes all four behaviors. Your assistant turns render as `<|im_start|>assistant\n<content><|im_end|>` exactly. Your generation prompts end at `<|im_start|>assistant\n` exactly. No surprises. | |
| ## Compatibility guarantees | |
| | Property | Status | | |
| |---|---| | |
| | `tokenizer.json` (vocab, merges, normalizer, pre-tokenizer, 1000 added_tokens) | **byte-identical** to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`) — also byte-identical to `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16` | | |
| | `tokenizer_config.json` | byte-identical to upstream Super 120B | | |
| | `special_tokens_map.json` | byte-identical to upstream Super 120B | | |
| | `chat_template.jinja` | **rewritten** (see below) | | |
| | Special token IDs | unchanged: `<\|im_start\|>=10`, `<\|im_end\|>=11`, `<think>=12`, `</think>=13`, `<s>=1`, `</s>=2`, `<unk>=0` | | |
| | Encoder behavior | `tok.encode(text)` returns the same IDs as upstream for any input text | | |
| | Existing Nemotron checkpoints | Load and decode bit-identically — no resharding, no embedding remapping needed | | |
| | vLLM | Compatible. `tokenizer_class: PreTrainedTokenizerFast` is set; no `backend`/`is_local` keys; no `auto_map` to custom Python files | | |
| | transformers | Compatible with both 4.57.x (sfm-evals pin) and 5.x | | |
| The `<think>` and `</think>` tokens **remain in the vocabulary at their original IDs**. This means the tokenizer is fully compatible with reasoning models that emit those tokens — it just doesn't inject them itself. If you need reasoning-capable rendering, use the upstream Nemotron tokenizer instead. | |
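The special-token IDs in the table above can be spot-checked against any loaded tokenizer's vocabulary. The following is a minimal sketch, not part of this repository: `EXPECTED_SPECIAL_IDS` is copied from the table, and `mismatched_special_ids` is a hypothetical helper name. It takes a plain token-to-ID mapping, which is assumed to come from something like `tok.get_vocab()`.

```python
from typing import Dict, Optional, Tuple

# IDs copied from the "Special token IDs" row of the compatibility table.
EXPECTED_SPECIAL_IDS: Dict[str, int] = {
    "<unk>": 0,
    "<s>": 1,
    "</s>": 2,
    "<|im_start|>": 10,
    "<|im_end|>": 11,
    "<think>": 12,
    "</think>": 13,
}


def mismatched_special_ids(
    vocab: Dict[str, int],
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {token: (expected_id, actual_id)} for every token that differs.

    A token missing from `vocab` is reported with an actual_id of None.
    An empty result means the vocabulary agrees with the table.
    """
    return {
        token: (expected, vocab.get(token))
        for token, expected in EXPECTED_SPECIAL_IDS.items()
        if vocab.get(token) != expected
    }


# Toy vocabulary for illustration only: "<think>" is deliberately wrong.
toy_vocab = dict(EXPECTED_SPECIAL_IDS, **{"<think>": 99})
print(mismatched_special_ids(toy_vocab))
```

With a real tokenizer, `mismatched_special_ids(tok.get_vocab())` should return an empty dict if the guarantees in the table hold.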
| ## What changed in the chat template | |
| The chat template is the only file that differs from upstream. Six things were removed: | |
| ### 1. `<think></think>` auto-prepend on assistant content | |
| Upstream (lines 110-119 in upstream Super): | |
| ```jinja | |
| {%- set content = message.content | default('', true) %} | |
| {%- if content is string -%} | |
| {%- if '<think>' not in content and '</think>' not in content -%} | |
| {%- set content = "<think></think>" ~ content -%} | |
| {%- endif -%} | |
| {%- endif -%} | |
| ``` | |
| This template: | |
| ```jinja | |
| {%- set content = message.content | default('', true) %} | |
| ``` | |
| Assistant content passes through verbatim. | |
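To make the difference concrete, here is a small Python mirror of the two Jinja snippets above. It is only an illustrative sketch: the function names are hypothetical, and the framing string is taken from the rendered examples in this card rather than from running the real templates.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def render_assistant_upstream(content: str) -> str:
    """Mirror the upstream auto-prepend: add an empty think block when the
    content contains neither an opening nor a closing think tag."""
    if THINK_OPEN not in content and THINK_CLOSE not in content:
        content = THINK_OPEN + THINK_CLOSE + content
    return f"<|im_start|>assistant\n{content}<|im_end|>"


def render_assistant_instruct(content: str) -> str:
    """Mirror this tokenizer's template: content passes through verbatim."""
    return f"<|im_start|>assistant\n{content}<|im_end|>"


upstream = render_assistant_upstream("The answer is 42.")
instruct = render_assistant_instruct("The answer is 42.")
print(repr(upstream))
print(repr(instruct))
```

The two renders differ only by the injected `<think></think>` prefix, which is exactly the text that would otherwise land in the loss-bearing region of each assistant turn.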
| ### 2. `reasoning_content` → `<think>...</think>` wrapping | |
| Upstream (lines 107-109): if a message has a `reasoning_content` field, the template wraps it in `<think>...</think>` and prepends to the regular content. | |
| This template: removed entirely. The `reasoning_content` field is ignored. | |
| ### 3. `truncate_history_thinking` logic | |
| Upstream (lines 14, 124-140, 161-175): when `truncate_history_thinking=True` (the default), older assistant turns have their think traces stripped and replaced with `<think></think>` stubs, and their content is partially truncated. | |
| This template: removed. **Older assistant turns are kept in full**, exactly as supplied. The kwarg is no longer consulted. | |
| ### 4. `enable_thinking` two-branch generation prompt | |
| Upstream (lines 12, 203-208): | |
| ```jinja | |
| {%- if add_generation_prompt %} | |
| {%- if enable_thinking %} | |
| {{- '<|im_start|>assistant\n<think>\n' }} | |
| {%- else %} | |
| {{- '<|im_start|>assistant\n<think></think>' }} | |
| {%- endif %} | |
| {%- endif %} | |
| ``` | |
| This template: | |
| ```jinja | |
| {%- if add_generation_prompt %} | |
| {{- '<|im_start|>assistant\n' }} | |
| {%- endif %} | |
| ``` | |
| Generation prompt always ends at the clean `<|im_start|>assistant\n` boundary. The `enable_thinking` kwarg is accepted but ignored. | |
| ### 5. `low_effort` reasoning-effort annotation | |
| Upstream Super only (lines 13, 180-184): when `low_effort=True`, appends `\n\n{reasoning effort: low}` to the last user message. This signals the model to produce shorter reasoning traces. | |
| This template: removed. The `low_effort` kwarg is accepted but ignored. | |
| ### 6. `last_user_idx` namespace tracking | |
| Upstream (lines 16-22, 34-40): two scans over the message list to find the last user message index. Used by `truncate_history_thinking` and `low_effort`. | |
| This template: both consumers removed, so the tracking is gone too. Saves 14 lines of dead Jinja. | |
| ### What was kept | |
| Everything else is identical to upstream Super 120B: | |
| - System message rendering | |
| - Tool definitions block (`<tools>...`) with all type/parameter/required/enum handling | |
| - Tool-call rendering inside assistant turns (`<tool_call><function=...><parameter=...>`) | |
| - Tool response rendering (`<tool_response>...</tool_response>`) | |
| - The `<IMPORTANT>` reminder block injected when tools are present | |
| - User and system message framing with `<|im_start|>` / `<|im_end|>` | |
| ## Behavior reference | |
| For inputs **without** any think tags, here's what each call on `tok` renders (shown as text, i.e. with `tokenize=False`): | |
| ```python | |
| from transformers import AutoTokenizer | |
| tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer") | |
| msgs = [ | |
| {"role": "system", "content": "You are a helpful assistant."}, | |
| {"role": "user", "content": "What is 2+2?"}, | |
| {"role": "assistant", "content": "2+2 equals 4."}, | |
| {"role": "user", "content": "And 3+3?"}, | |
| ] | |
| ``` | |
| | Call | Output (tail of the rendered string) | | |
| |---|---| | |
| | `apply_chat_template(msgs)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n` | | |
| | `apply_chat_template(msgs, add_generation_prompt=True)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n<\|im_start\|>assistant\n` | | |
| | `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=True)` | (same as above — kwarg ignored) | | |
| | `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=False)` | (same as above) | | |
| The full training-time render of the above messages contains zero `<think>` or `</think>` tokens. Compare with upstream Super, where the same input produces: | |
| ``` | |
| ...<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n... | |
| ``` | |
| i.e. an injected `<think></think>` per assistant turn, plus, at the generation-prompt boundary, either a `<think>\n` opening (when `enable_thinking` is true) or a closed `<think></think>` (when it is false). | |
| If your **input** messages contain explicit `<think>...</think>` content (e.g. you're rendering a dataset that already has reasoning traces from a teacher model), those think tags **pass through verbatim**. The template only refuses to *inject* think tags; it doesn't strip them from your input. | |
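The behavior described in this section can be modelled in a few lines of plain Python. This is a simplified stand-in, not the real template: `render_chat` and `has_think_tags` are hypothetical names, only plain text messages are handled, and tool rendering and system-header details are left out.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chat(
    messages: List[Dict[str, str]],
    add_generation_prompt: bool = False,
    **ignored_kwargs: object,
) -> str:
    """Render messages with plain ChatML framing and no think-tag injection.

    Extra kwargs such as enable_thinking are accepted and ignored, matching
    the behavior this card describes for the real template.
    """
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def has_think_tags(rendered: str) -> bool:
    """True if the rendered text contains an opening or closing think tag."""
    return "<think>" in rendered or "</think>" in rendered


msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "And 3+3?"},
]

train_text = render_chat(msgs)
prompt_text = render_chat(msgs, add_generation_prompt=True, enable_thinking=True)
print(has_think_tags(train_text), has_think_tags(prompt_text))
```

The same `has_think_tags` check can be applied to the real tokenizer's output, for example the string returned by `tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)`, as a cheap regression test in a data pipeline.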
| ## When to use a different tokenizer | |
| | Use case | Use this tokenizer? | | |
| |---|---| | |
| | Instruct SFT (no reasoning) | ✅ Yes | | |
| | Continued pretraining (CPT) on raw text | ✅ Yes — chat template is irrelevant | | |
| | LoRA / DoRA fine-tuning of an instruct model | ✅ Yes | | |
| | Reasoning / thinking SFT (e.g. with `<think>` traces in training data) | ❌ Use `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` so the generation prompt opens a `<think>\n` block | | |
| | Tool-calling agent SFT (no reasoning) | ✅ Yes — tool rendering is preserved | | |
| | Inference on a model that *was* trained with reasoning | ❌ Mismatch — the model expects to see `<think>\n` opened on the generation prompt | | |
| | Base model evaluation | ⚠️ The chat template will work but produces an empty system header for conversations that have no system message; use the upstream `*-Base-BF16` tokenizer for consistency with base-model conventions | | |
| ## Usage with vLLM | |
| ```bash | |
| # Standard vLLM serve — tokenizer is loaded from the model directory by default. | |
| # To override with this tokenizer, pass --tokenizer: | |
| vllm serve <your-model> \ | |
| --tokenizer geodesic-research/nemotron-instruct-tokenizer \ | |
| --chat-template /path/to/chat_template.jinja # only needed if you want to override further | |
| ``` | |
| The tokenizer ships `chat_template.jinja` as a file (not embedded in `tokenizer_config.json`), which vLLM picks up automatically. | |
| ## Usage in training (megatron-bridge / NeMo) | |
| In your training YAML: | |
| ```yaml | |
| tokenizer: | |
| tokenizer_model: geodesic-research/nemotron-instruct-tokenizer | |
| ``` | |
| Or in the recipe definition: | |
| ```python | |
| cfg.tokenizer.tokenizer_model = "geodesic-research/nemotron-instruct-tokenizer" | |
| ``` | |
| The data pipeline (`pipeline_data_prepare.py`) will use this tokenizer's chat template when rendering `messages` columns from HuggingFace datasets, producing packed parquets with no injected think tags. | |
| ## Provenance | |
| - **Base**: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (revision `49ad1f46ee9df444a0a3b8b63520faa1ca66324a`) | |
| - **Encoder source**: identical to NVIDIA Nemotron 3 family (Super 120B, Nano 30B, Base variants of either) — same `tokenizer.json` blob (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`) | |
| - **Chat template**: derived from upstream Super 120B, with the six removals listed above | |
| - **License**: NVIDIA Open Model License (inherited from upstream) | |
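The sha256 quoted for `tokenizer.json` can be checked against a local copy with the standard library alone. This is a minimal sketch: the helper names are hypothetical, and the path you pass in is whatever location your download ended up at.

```python
import hashlib
from pathlib import Path
from typing import Union

# Digest quoted in this card for the shared tokenizer.json blob.
EXPECTED_TOKENIZER_JSON_SHA256 = (
    "623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7"
)


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_upstream_tokenizer(path: Union[str, Path]) -> bool:
    """True if the file at `path` has the digest quoted in this card."""
    return sha256_of_file(path) == EXPECTED_TOKENIZER_JSON_SHA256
```

Calling `matches_upstream_tokenizer("path/to/tokenizer.json")` on a downloaded copy is one way to confirm that the encoder file was not modified in transit or by a later revision.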