| --- |
| license: apache-2.0 |
| language: |
| - en |
| base_model: |
| - nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| --- |
| |
| # Pulsar 16B |
| <div align="center"> |
|
|
|
|
| Powered by CompactifAI |
|
|
| **Optimized for Fast and Efficient Inference** · **Reduced Memory Footprint** |
|
|
| </div> |
|
|
| --- |
|
|
| ## Table of Contents |
|
|
| - [Model Overview](#model-overview) |
| - [Key Characteristics](#key-characteristics) |
| - [Quick Start](#quick-start) |
| - [Reasoning Control](#thinking-reasoning-control) |
| - [Tool Calling](#tool-calling) |
| - [Training & Fine-Tuning](#training--fine-tuning) |
| - [Evaluation & Benchmarks](#evaluation--benchmarks) |
| - [Languages](#languages) |
| - [Safety & Limitations](#safety--limitations) |
| - [Model Information](#model-information) |
| - [Citation](#citation) |
|
|
| --- |
|
|
| ## Model Overview |
|
|
| **Pulsar 16B** is a **compressed model based on [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)**, developed by **Multiverse Computing**. The base model has **~31.6B parameters** and is part of the Nemotron model family. It supports **long-context inference up to 1M tokens** and is designed for general-purpose language modeling tasks. |
|
|
| This version applies **model compression techniques** to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves roughly **50% compression**, reducing the parameter count from ~31.6B to **16.15B parameters** and lowering memory requirements. |
|
|
| Pulsar 16B is published in three precisions: |
|
| - [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16) |
| - [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8) |
| - [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4) |
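
To give a rough sense of what the three precisions mean for weight memory, the sketch below multiplies the parameter counts stated in this card by an assumed bytes-per-parameter figure. The byte sizes are assumptions for illustration (2 for BF16, 1 for FP8, 0.5 for NVFP4); they ignore quantization scale factors, any layers kept in higher precision, the KV cache, and activations, so real footprints will be somewhat larger.

```python
# Approximate bytes per parameter for each published format.
# Assumption: scale factors and higher-precision layers are ignored.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

PULSAR_PARAMS = 16.15e9  # total parameters, as stated in this card
BASE_PARAMS = 31.6e9     # approximate base-model total, as stated in this card


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Return the approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


if __name__ == "__main__":
    for fmt in BYTES_PER_PARAM:
        print(f"Pulsar 16B {fmt}: ~{weight_memory_gb(PULSAR_PARAMS, fmt):.1f} GB")
    print(f"Base model BF16: ~{weight_memory_gb(BASE_PARAMS, 'BF16'):.1f} GB")
```

These are back-of-the-envelope numbers only; measure actual memory use on your own hardware before sizing a deployment.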
| |
| --- |
|
|
| ## Key Characteristics |
|
|
| | Characteristic | Description | |
| |-----------------------|-------------| |
| | Base model | [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). **31.6B** total parameters, **3.6B** activated per forward pass (11.34% activation ratio). [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). | |
| | Pulsar-16B-BF16 (this model) | **16.15B** total parameters, **3.1B** activated per forward pass (19.28% activation ratio) after CompactifAI compression. | |
| | 📐 **Architecture** | Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint). | |
| | 🛠️ **Tool calling** | Yes. Same tool-call structure and format as [Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). See [Tool Calling](#tool-calling). | |
| | 🗜️ **Compression** | CompactifAI (proprietary compression technology) | |
| | Primary language | English | |
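
The activation ratios in the table are simply activated parameters divided by total parameters. The sketch below recomputes them from the rounded figures in this card; because the inputs are rounded, the results land close to, but not exactly on, the percentages listed above. The helper name `ratio_pct` is illustrative.

```python
def ratio_pct(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Rounded parameter counts in billions, taken from the table above.
base_total, base_active = 31.6, 3.6
pulsar_total, pulsar_active = 16.15, 3.1

print(f"Base activation ratio:   {ratio_pct(base_active, base_total):.2f}%")
print(f"Pulsar activation ratio: {ratio_pct(pulsar_active, pulsar_total):.2f}%")
# Share of total parameters removed by compression.
print(f"Parameter reduction:     {100.0 - ratio_pct(pulsar_total, base_total):.2f}%")
```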
|
| --- |
|
| ## Quick Start |
|
| This model can be loaded with the **Transformers** API. Use `trust_remote_code=True`. Recommended approach: `AutoModelForCausalLM` with `apply_chat_template`. This configuration has been tested with Transformers 4.57.6. |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| |
| model_id = "MultiverseComputingCAI/Pulsar-16B-BF16" |
| |
| tokenizer = AutoTokenizer.from_pretrained( |
| model_id, |
| trust_remote_code=True |
| ) |
| |
| model = AutoModelForCausalLM.from_pretrained( |
| model_id, |
| device_map="cuda" if torch.cuda.is_available() else "auto", |
| torch_dtype=torch.bfloat16, |
| trust_remote_code=True, |
| ) |
| messages = [ |
| {"role": "user", "content": "Write a haiku about GPUs"}, |
| ] |
| |
| tokenized_chat = tokenizer.apply_chat_template( |
| messages, |
| tokenize=True, |
| add_generation_prompt=True, |
| return_tensors="pt" |
| ).to(model.device) |
| |
| outputs = model.generate( |
| tokenized_chat, |
| max_new_tokens=1024, |
| do_sample=True,  # enable sampling so temperature and top_p take effect |
| temperature=1.0, |
| top_p=1.0, |
| eos_token_id=tokenizer.eos_token_id |
| ) |
| print(tokenizer.decode(outputs[0])) |
| ``` |
| Alternatively, you can use the `pipeline` API with `trust_remote_code=True`. The pipeline returns the full conversation structure, so extract the assistant message from `outputs[0]["generated_text"]` as needed. |
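
As an illustration of that extraction step, the sketch below pulls the last assistant turn out of a chat-style pipeline result. `extract_assistant_reply` is a hypothetical helper, not part of Transformers, and the mocked result only imitates the shape the text-generation pipeline returns for chat input, so the snippet can be tried without loading the model.

```python
from typing import Any


def extract_assistant_reply(outputs: list[dict[str, Any]]) -> str:
    """Return the last assistant message from a chat-style pipeline result."""
    conversation = outputs[0]["generated_text"]
    # For plain-string prompts the pipeline returns a string instead of a list.
    if isinstance(conversation, str):
        return conversation
    # For chat input, generated_text is the message list including the new turn.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message.get("content", "")
    raise ValueError("no assistant message found in pipeline output")


# Mocked result with the same shape as a chat pipeline call.
mock_outputs = [{"generated_text": [
    {"role": "user", "content": "Write a haiku about GPUs"},
    {"role": "assistant", "content": "Silent cores awake"},
]}]
print(extract_assistant_reply(mock_outputs))
```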
|
|
| ### vLLM Serving |
|
|
| #### Installation |
|
|
| ```bash |
| pip install -U "vllm>=0.12.0" |
| ``` |
|
|
| #### Reasoning parser (NVIDIA) |
|
|
| Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as [`nano_v3_reasoning_parser.py`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/blob/main/nano_v3_reasoning_parser.py) on the base Hugging Face repo (not specific to Pulsar). Direct download: |
|
|
| ```bash |
| wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py |
| ``` |
|
|
| You can keep any local filename; the `vllm serve` flags below assume the file is in the current directory as `nano_v3_reasoning_parser.py`. If you mirror an identical copy under the Pulsar model repo, use that URL instead. |
|
|
| #### Serve |
|
|
| ```bash |
| vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \ |
| --served-model-name model \ |
| --max-num-seqs 8 \ |
| --tensor-parallel-size 1 \ |
| --port 8000 \ |
| --trust-remote-code \ |
| --enable-auto-tool-choice \ |
| --tool-call-parser qwen3_coder \ |
| --reasoning-parser-plugin nano_v3_reasoning_parser.py \ |
| --reasoning-parser nano_v3 |
| ``` |
|
|
| > **Note:** The NeMo container `nvcr.io/nvidia/nemo:25.11.nemotron_3_nano` comes with `mamba_ssm` and `causal-conv1d` pre-installed. |
| |
| --- |
| |
| ## Thinking (Reasoning) Control |
| |
| Pulsar 16B supports a **hybrid reasoning mode**: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the `enable_thinking` flag in the chat template. |
|
|
| > This section provides a brief overview of reasoning control in Pulsar 16B. For comprehensive details please see the official Nemotron-3 Nano-30B model card at: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard |
|
|
|
|
| --- |
|
|
| ### Transformers API |
|
|
| Pass `enable_thinking` through `apply_chat_template`: |
|
|
| **Thinking ON (default)** |
| ```python |
| tokenized_chat = tokenizer.apply_chat_template( |
| messages, |
| tokenize=True, |
| add_generation_prompt=True, |
| return_tensors="pt", |
| enable_thinking=True, # default — can be omitted |
| ) |
| ``` |
|
|
| **Thinking OFF** |
| ```python |
| tokenized_chat = tokenizer.apply_chat_template( |
| messages, |
| tokenize=True, |
| add_generation_prompt=True, |
| return_tensors="pt", |
| enable_thinking=False, |
| ) |
| ``` |
|
|
| When thinking is ON, the model opens a `<think>` block before the answer and closes it with `</think>`. The decoded output can be split on the closing tag: |
|
|
| ```python |
| output = tokenizer.decode(outputs[0], skip_special_tokens=True) |
| # Split on </think> to separate reasoning from the final answer |
| if "</think>" in output: |
|     reasoning, answer = output.split("</think>", 1) |
|     reasoning = reasoning.replace("<think>", "").strip() |
|     answer = answer.strip() |
| else: |
|     reasoning = ""  # no reasoning block found |
|     answer = output.strip() |
| ``` |
|
|
| --- |
|
|
| ### vLLM |
|
|
| #### Server-level default |
|
|
| Set the default for **all requests** at startup with `--default-chat-template-kwargs`. |
|
|
| > Requires a recent vLLM version that supports the `--default-chat-template-kwargs` flag. |
|
|
| **Thinking OFF for all requests** |
| ```bash |
| vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \ |
| --served-model-name model \ |
| --reasoning-parser-plugin nano_v3_reasoning_parser.py \ |
| --reasoning-parser nano_v3 \ |
| --trust-request-chat-template \ |
| --default-chat-template-kwargs '{"enable_thinking": false}' \ |
| ... |
| ``` |
|
|
| **Thinking ON for all requests (default if flag is omitted)** |
| ```bash |
| vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \ |
| --served-model-name model \ |
| --reasoning-parser-plugin nano_v3_reasoning_parser.py \ |
| --reasoning-parser nano_v3 \ |
| --trust-request-chat-template \ |
| --default-chat-template-kwargs '{"enable_thinking": true}' \ |
| ... |
| ``` |
|
|
|
|
| --- |
|
|
| #### Per-request override |
|
|
| > **`--trust-request-chat-template` is required** to allow per-request overrides. |
|
|
| Individual requests can override the server default by passing `chat_template_kwargs` in the request body. This works regardless of the server-level default. |
|
|
| **Thinking ON/OFF for one request** |
| ```python |
| import requests |
| |
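| response = requests.post("http://localhost:8000/v1/chat/completions", json={ |
| "model": "model", |
| "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}], |
| "max_tokens": 1024, |
| "temperature": 1.0, |
| # Set to False to disable thinking for this request only |
| "chat_template_kwargs": {"enable_thinking": True}, |
| }) |
| response.raise_for_status() |
| print(response.json()["choices"][0]["message"]["content"]) |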
| ``` |
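
When the server runs with a reasoning parser, the reasoning text is typically returned in a separate field of the assistant message rather than inline in `content`. The sketch below shows one way to build the request body and read both parts back. The helper names are hypothetical, and the field names `reasoning_content` and `reasoning` are assumptions about the vLLM response shape that may differ between versions, so check the response of your own server.

```python
from typing import Any, Optional


def build_chat_payload(prompt: str, enable_thinking: bool,
                       model: str = "model") -> dict[str, Any]:
    """Build a chat-completions request body with a per-request thinking toggle."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 1.0,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def split_reasoning(response_json: dict[str, Any]) -> tuple[Optional[str], str]:
    """Return (reasoning, answer) from a chat-completions response."""
    message = response_json["choices"][0]["message"]
    # Assumption: the reasoning parser fills one of these two fields.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    return reasoning, message.get("content") or ""


# Mocked response so the helpers can be tried without a running server.
mock_response = {"choices": [{"message": {
    "role": "assistant",
    "reasoning_content": "Factor the quadratic.",
    "content": "x = 2 or x = 3",
}}]}
print(split_reasoning(mock_response))
```

To send the request, pass the payload as `json=build_chat_payload(...)` to `requests.post` against the `/v1/chat/completions` endpoint shown above.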
|
|
| --- |
|
|
| ## Tool Calling |
|
|
| Pulsar 16B emits tool calls in the following format: |
|
|
| ``` |
| <tool_call> |
| <function=get_weather> |
| <parameter=city>Paris</parameter> |
| <parameter=unit>celsius</parameter> |
| </function> |
| </tool_call> |
| ``` |
|
|
| When serving (e.g., with vLLM), you **must** use the `qwen3_coder` tool parser. |
|
|
| ```bash |
| vllm serve <model_path> \ |
| --enable-auto-tool-choice \ |
| --tool-call-parser qwen3_coder \ |
| --trust-remote-code |
| ``` |
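
When raw completion text is handled outside vLLM (for example with the Transformers API), the tool-call format shown in this section can be parsed with a small regular-expression helper. This is a minimal sketch: `parse_tool_calls` is a hypothetical helper, not part of any library, and it returns every parameter value as a string without the schema-based type conversion a full parser would apply.

```python
import re

# Matches one <tool_call> block and captures the function name and its body.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Matches one <parameter=name>value</parameter> pair inside a function body.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls as {"name": ..., "arguments": {...}} dictionaries."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


sample = """<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>"""
print(parse_tool_calls(sample))
```

Validate the parsed name and arguments against your tool schema before executing anything.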
|
|
| ## Training & Fine-Tuning |
|
|
| ### Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
|
|
| The base model [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the [original model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) for details. |
|
|
|
|
| ### CompactifAI Compression |
|
|
| CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning was then applied to improve capabilities. |
|
|
| --- |
|
|
| ## Evaluation & Benchmarks |
|
|
|  |
|
|
| | Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | gpt-oss-20b | Qwen3-14B | Ministral-3-14B-Instruct-2512 | |
| | --- | ---: | ---: | ---: | ---: | ---: | |
| | AIME | 87.66 | 87.22 | 87.66 | 76.00 | 33.00 | |
| | GPQA | 74.04 | 71.41 | 68.99 | 63.63 | 56.45 | |
| | IFBench | 72.31 | 70.79 | 68.46 | 39.20 | 32.80 | |
| | MMLU-Pro | 78.90 | 74.78 | 76.65 | 85.01 | 70.09 | |
| | LiveCodeBench | 71.11 | 68.04 | 64.65 | 66.35 | 29.84 | |
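
One way to read the table is as score retention: the Pulsar 16B score divided by the base-model score for each benchmark. The sketch below computes this from the two relevant columns. The retention metric and the helper name are illustrative and not part of the reported evaluation.

```python
# (base score, Pulsar 16B score) pairs copied from the table above.
SCORES = {
    "AIME": (87.66, 87.22),
    "GPQA": (74.04, 71.41),
    "IFBench": (72.31, 70.79),
    "MMLU-Pro": (78.90, 74.78),
    "LiveCodeBench": (71.11, 68.04),
}


def retention_pct(base: float, compressed: float) -> float:
    """Return the compressed score as a percentage of the base score."""
    return 100.0 * compressed / base


retention = {name: retention_pct(*pair) for name, pair in SCORES.items()}
mean_retention = sum(retention.values()) / len(retention)

for name, value in retention.items():
    print(f"{name:14s} {value:6.2f}%")
print(f"{'Mean':14s} {mean_retention:6.2f}%")
```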
|
|
| ### Quantizations |
|
|
| - [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16) |
| - [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8) |
| - [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4) |
| |
|  |
|
|
| | Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B (BF16) | Pulsar 16B (FP8) | Pulsar 16B (NVFP4) | |
| | --- | ---: | ---: | ---: | ---: | |
| | AIME | 87.66 | 87.22 | 86.67 | 82.00 | |
| | GPQA | 74.04 | 71.41 | 70.61 | 71.11 | |
| | IFBench | 72.31 | 70.79 | 69.60 | 69.90 | |
| | MMLU-Pro | 78.90 | 74.78 | 74.76 | 74.19 | |
| | LiveCodeBench | 71.11 | 68.04 | 68.68 | 65.60 | |
|
|
|
|
| ### Performance |
|  |
| - **Framework:** [guidellm](https://github.com/vllm-project/guidellm) |
| - **Inference:** vLLM 0.18.0 |
| - **GPU:** NVIDIA L40S |
| - **Decode:** `temperature: 0.0`, `top_p: 1.0` |
| - **Measurement window:** Each phase lasts 3 minutes (excluding ramp-up and cool-down periods). |
| - **Workload shape:** 8k/16k workload as in the original model's card. |
|
|
|
|
| ### Long Context |
| Pulsar 16B preserves strong long-context behavior after compression, tracking the Nemotron-3-Nano-30B-A3B baseline closely across retrieval-heavy and full-suite long-context evaluations. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER groupings up to 256k context. |
|
|
|  |
|
|
| | Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | |
| | :--- | ---: | ---: | |
| | LongBench v1 | 31.84 | 29.84 | |
| | AA-LCR | 33.67 | 29.33 | |
| | NIAH (@100K) | 100.00 | 100.00 | |
| | RULER (@128K) | 95.02 | 94.20 | |
| | RULER (@256K) | 92.02 | 87.74 | |
|
| ### Evaluation Methodology |
|
|
| Benchmark scores were obtained with the following setups. Methodology varies by benchmark family. |
|
|
| #### Inference |
| - **Backend:** vLLM 0.18.0 |
| - **Nemotron models:** `temp: 1.0`, `top_p: 1.0` |
| - **GPT-OSS-20B:** `temp: 1.0`, `top_p: 1.0`, `reasoning_effort: high` |
| - **Qwen3-14B:** `temp: 0.6`, `top_p: 0.95`, `top_k: 20`, `min_p: 0.0` |
| - **Ministral-3-14B-Instruct-2512:** `temp: 0.15` |
|
|
| | Benchmark | Framework | Repeats | Other | |
| |-----------|-----------|--------:|-------| |
| | MMLU-Pro | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 1 | | |
| | AIME25 | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 10 | | |
| | GPQA:d | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | | |
| | LiveCodeBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 3 | | |
| | IFBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | | |
| | LongBench v1 | [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) | 1 | | |
| | AA-LCR | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 3 | Judge: **`Qwen/Qwen3-235B-A22B-Instruct-2507`**. **`judge_score_type`:** `pattern`. **`judge_args` → `generation_config`:** `top_p` 0.8, `top_k` 20, `min_p` 0.0, `temperature` 0.7. | |
| | NIAH | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 1 | Judge: **`qwen/qwen3-235b-a22b-2507`**. **`judge_model_args`:** `{}` (no extra judge settings in YAML). | |
| | RULER | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) (+ [RULER](https://github.com/NVIDIA/RULER)) | 1 | | |
|
|
| --- |
|
|
| ## Languages |
|
|
| - **Primary language**: English |
| - **Other languages**: Spanish |
|
|
| The model was trained mainly on English data with some Spanish added. Languages other than English and Spanish have not been systematically evaluated. |
|
|
|
|
|
|
| ## Safety & Limitations |
|
|
| ### Known Limitations |
|
|
| - English-centric training data (inherited from base model). |
| - Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed. |
| - Compression may affect some behaviors; evaluate for your use case. |
|
|
| ### Recommendations |
|
|
| - Validate tool calls and their arguments before executing them. |
| - Keep a human in the loop for critical use cases. |
| - Run task-specific evaluations before production deployment. |
|
|
| --- |
|
|
| ## Model Information |
|
|
| | Field | Value | |
| |--------------|--------------------- | |
| | Model name | Pulsar 16B | |
| | Based on | [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) | |
| | Version | v1.5.0 | |
| | Release date | TBD | |
| | Developed by | Multiverse Computing | |
| | License | Apache 2.0 | |
| | Contact | business@multiversecomputing.com | |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this model, please cite the base model and Pulsar 16B: |
|
|
| ```bibtex |
| @misc{nemotron3nanoTR, |
| title = {NVIDIA Nemotron 3 Nano Technical Report}, |
| author = {{NVIDIA}}, |
| year = {2025}, |
| url = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf} |
| } |
| @misc{nemotron3nanoslim16b, |
| title = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B}, |
| author = {Multiverse Computing}, |
| year = {2026}, |
| url = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16}, |
| note = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology} |
| } |
| ``` |
|
|
| **Built by [Multiverse Computing](https://www.multiversecomputing.com)** · [Report an issue](TODO_PULSAR_HF_URL/discussions) · [Discord](https://discord.gg/cGas9uStqp) |
|
|