---
license: apache-2.0
language:
- en
base_model:
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
---

# Pulsar 16B
<div align="center">

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![HuggingFace](https://img.shields.io/badge/🤗-Model_Hub-yellow.svg)](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
[![Discord](https://img.shields.io/badge/Discord-Community-5865F2?logo=discord&logoColor=white)](https://discord.gg/cGas9uStqp)

Powered by CompactifAI

**Optimized for Fast and Efficient Inference** · **Reduced Memory Footprint**

</div>

---

## Table of Contents

- [Model Overview](#model-overview)
- [Key Characteristics](#key-characteristics)
- [Quick Start](#quick-start)
- [Thinking (Reasoning) Control](#thinking-reasoning-control)
- [Tool Calling](#tool-calling)
- [Training & Fine-Tuning](#training--fine-tuning)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [Languages](#languages)
- [Safety & Limitations](#safety--limitations)
- [Model Information](#model-information)
- [Citation](#citation)

---

## Model Overview

**Pulsar 16B** is a **compressed model based on [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)**, developed by **Multiverse Computing**. The base model has **~31.6B parameters** and is part of the Nemotron model family. It supports **long-context inference up to 1M tokens** and is designed for general-purpose language modeling tasks.

This version applies **model compression techniques** to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves **roughly 50% compression**, reducing the parameter count from ~31.6B to **16.15B parameters** and lowering memory requirements.

Available checkpoints:

- [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
- [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8)
- [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4)
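
As a rough sanity check on these figures, the sketch below derives the parameter reduction from the two totals quoted above and estimates weight-only memory per precision. The bytes-per-parameter values are assumptions (2 for BF16, 1 for FP8, 0.5 for NVFP4). They ignore quantization scales, any layers kept in higher precision, activations, and cache memory, so the results are lower bounds rather than measured footprints.

```python
# Back-of-the-envelope sizing from the parameter counts quoted in this card.
BASE_PARAMS = 31.6e9      # base model, total parameters
PULSAR_PARAMS = 16.15e9   # Pulsar 16B, total parameters

# Assumed storage cost per weight; real checkpoints add scale factors and
# may keep some layers in higher precision.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

# Fraction of parameters removed relative to the base model.
reduction = 1.0 - PULSAR_PARAMS / BASE_PARAMS
print(f"Parameter reduction: {reduction:.1%}")

# Weight-only memory in decimal gigabytes for each precision.
weight_gb = {name: PULSAR_PARAMS * b / 1e9 for name, b in BYTES_PER_PARAM.items()}
for name, gb in weight_gb.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```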
  
---

## Key Characteristics

| Characteristic        | Description |
|-----------------------|-------------|
| Base model            | [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). **31.6B** total parameters, **3.6B** activated per forward pass (11.34% activation ratio). [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
| Pulsar-16B-BF16 (this model)   | **16.15B** total parameters, **3.1B** activated per forward pass (19.28% activation ratio) after CompactifAI compression. |
| 📐 **Architecture**   | Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint). |
| 🛠️ **Tool calling**  | Yes. Same tool-call structure and format as [Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). See [Tool Calling](#tool-calling). |
| 🗜️ **Compression**   | CompactifAI (proprietary compression technology) |
| Primary language      | English |

---

## Quick Start

This model can be loaded with the **Transformers** API. Use `trust_remote_code=True`. Recommended approach: `AutoModelForCausalLM` with `apply_chat_template`. This configuration has been tested with Transformers 4.57.6.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

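outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    do_sample=True,  # sample instead of greedy decoding so temperature/top_p apply
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id
)
# The decoded text contains the prompt followed by the generated reply.
print(tokenizer.decode(outputs[0]))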
```
Alternatively, you can use the `pipeline` API with `trust_remote_code=True`. The pipeline returns the full conversation structure, so extract the assistant message from `outputs[0]["generated_text"]` as needed.
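
A minimal sketch of the pipeline route described above, assuming the usual Transformers chat-pipeline behavior in which `generated_text` is the input message list with the assistant reply appended. The helper names `last_assistant_message` and `generate_with_pipeline` are illustrative and not part of any library.

```python
def last_assistant_message(generated_text):
    """Return the content of the final assistant turn in a chat transcript."""
    for message in reversed(generated_text):
        if message.get("role") == "assistant":
            return message.get("content", "")
    return ""


def generate_with_pipeline(messages, model_id="MultiverseComputingCAI/Pulsar-16B-BF16"):
    """Run one chat turn through the text-generation pipeline."""
    # Imported lazily so the helper above stays usable without Transformers.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    outputs = pipe(messages, max_new_tokens=1024)
    return last_assistant_message(outputs[0]["generated_text"])
```

Calling `generate_with_pipeline([{"role": "user", "content": "Write a haiku about GPUs"}])` would load the model and return only the assistant text.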

### vLLM Serving

#### Installation

```bash
pip install -U "vllm>=0.12.0"
```

#### Reasoning parser (NVIDIA)

Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as [`nano_v3_reasoning_parser.py`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/blob/main/nano_v3_reasoning_parser.py) on the base Hugging Face repo (not specific to Pulsar). Direct download:

```bash
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
```

You can keep any local filename; the `vllm serve` flags below assume the file is in the current directory as `nano_v3_reasoning_parser.py`. If you mirror an identical copy under the Pulsar model repo, use that URL instead.

#### Serve

```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
```

> **Note:** The NeMo container `nvcr.io/nvidia/nemo:25.11.nemotron_3_nano` comes with `mamba_ssm` and `causal-conv1d` pre-installed.

---

## Thinking (Reasoning) Control

Pulsar 16B supports a **hybrid reasoning mode**: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the `enable_thinking` flag in the chat template.

> This section gives a brief overview of reasoning control in Pulsar 16B. For full details, see the official Nemotron-3 Nano-30B model card: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard


---

### Transformers API

Pass `enable_thinking` through `apply_chat_template`:

**Thinking ON (default)**
```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,   # default — can be omitted
)
```

**Thinking OFF**
```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)
```

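When thinking is ON, the model opens a `<think>` block before the answer. To separate the reasoning from the final answer, decode only the newly generated tokens and split on `</think>`:

```python
# Drop the prompt tokens so only the model's reply is decoded
generated = outputs[0][tokenized_chat.shape[-1]:]
output = tokenizer.decode(generated, skip_special_tokens=True)
# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No reasoning block was produced (for example, thinking is OFF)
    reasoning = ""
    answer = output.strip()
```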

---

### vLLM

#### Server-level default

Set the default for **all requests** at startup with `--default-chat-template-kwargs`.

> Requires recent versions of vLLM.

**Thinking OFF for all requests**
```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  ...
```

**Thinking ON for all requests (default if flag is omitted)**
```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  ...
```


---

#### Per-request override

> **`--trust-request-chat-template` is required** to allow per-request overrides.

Individual requests can override the server default by passing `chat_template_kwargs` in the request body. This works regardless of the server-level default.

**Thinking ON/OFF for one request**
```python
import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "model",
    "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
    "max_tokens": 1024,
    "temperature": 1.0,
    "chat_template_kwargs": {"enable_thinking": True},
})
```
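
With the reasoning parser enabled, the server returns the reasoning separately from the final answer. The sketch below reads both from the response JSON. It assumes the OpenAI-compatible response layout and that the reasoning text arrives in a `reasoning_content` field (some vLLM versions name it `reasoning`, so both are checked). `split_message` is an illustrative helper name.

```python
def split_message(response_json):
    """Return (reasoning, answer) from a chat-completions response body."""
    message = response_json["choices"][0]["message"]
    # The field name depends on the vLLM version; fall back to an empty string.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()
```

For the request above, `reasoning, answer = split_message(response.json())` would give the two parts; with thinking OFF, `reasoning` is expected to be empty.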

---

## Tool Calling

Pulsar 16B emits tool calls in the following format:

```
<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>
```
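
For offline inspection of raw generations, the format above can be parsed with a couple of regular expressions. This is an illustrative sketch only, not the parser vLLM uses: `parse_tool_calls` is a hypothetical helper, and it keeps every parameter value as a plain string without type coercion.

```python
import re

# One <tool_call> block wrapping a single <function=NAME> element.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Each <parameter=KEY>VALUE</parameter> pair inside the function body.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Extract tool calls as {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # Parameter values are kept as raw strings.
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


example = """<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(example))
```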

When serving (e.g., with vLLM), you **must** use the `qwen3_coder` tool parser.

```bash
vllm serve <model_path> \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
```
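
Once the server is running with these flags, tools are passed in the standard OpenAI-compatible `tools` field. A minimal sketch of such a request body follows. The `get_weather` schema mirrors the example above and is purely illustrative, as is the `build_tool_request` helper; the `model` served name is an assumption taken from the serve command in Quick Start.

```python
import json


def build_tool_request(user_prompt, served_model="model"):
    """Build a chat-completions body that offers one hypothetical tool."""
    get_weather = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
    return {
        "model": served_model,
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [get_weather],
        "tool_choice": "auto",
    }


body = build_tool_request("What is the weather in Paris?")
print(json.dumps(body, indent=2))
```

Posting this body to `/v1/chat/completions` should return any parsed calls under `choices[0].message.tool_calls`, following the OpenAI response layout.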

## Training & Fine-Tuning

### Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

The base model [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the [original model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) for details.


### CompactifAI Compression

CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning was applied to improve capabilities.

---

## Evaluation & Benchmarks

![Combined benchmark chart](assets/benchmarks.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | gpt-oss-20b | Qwen3-14B | Ministral-3-14B-Instruct-2512 |
| --- | ---: | ---: | ---: | ---: | ---: |
| AIME | 87.66 | 87.22 | 87.66 | 76.00 | 33.00 |
| GPQA | 74.04 | 71.41 | 68.99 | 63.63 | 56.45 |
| IFBench | 72.31 | 70.79 | 68.46 | 39.20 | 32.80 |
| MMLU-Pro | 78.90 | 74.78 | 76.65 | 85.01 | 70.09 |
| LiveCodeBench | 71.11 | 68.04 | 64.65 | 66.35 | 29.84 |
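
To read the table as score retention relative to the base model, the short sketch below divides each Pulsar 16B score by the corresponding Nemotron score. The numbers are copied from the table; the retention metric itself is an illustrative convenience and not part of the reported evaluation.

```python
# Scores copied from the table above: (base model, Pulsar 16B).
SCORES = {
    "AIME": (87.66, 87.22),
    "GPQA": (74.04, 71.41),
    "IFBench": (72.31, 70.79),
    "MMLU-Pro": (78.90, 74.78),
    "LiveCodeBench": (71.11, 68.04),
}

# Percentage of the base score that the compressed model keeps.
retention = {name: 100.0 * pulsar / base for name, (base, pulsar) in SCORES.items()}
for name, pct in retention.items():
    print(f"{name}: {pct:.1f}% of base score retained")

# Unweighted mean across the five benchmarks.
mean_retention = sum(retention.values()) / len(retention)
print(f"Mean retention: {mean_retention:.1f}%")
```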

### Quantizations

- [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
- [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8)
- [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4)
  
![Quantization results](assets/quantization_comparisons.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B (BF16) | Pulsar 16B (FP8) | Pulsar 16B (NVFP4) |
| --- | ---: | ---: | ---: | ---: |
| AIME | 87.66 | 87.22 | 86.67 | 82.00 |
| GPQA | 74.04 | 71.41 | 70.61 | 71.11 |
| IFBench | 72.31 | 70.79 | 69.60 | 69.90 |
| MMLU-Pro | 78.90 | 74.78 | 74.76 | 74.19 |
| LiveCodeBench | 71.11 | 68.04 | 68.68 | 65.60 |


### Performance

![Performance results](assets/performance.png)

- **Framework:** [guidellm](https://github.com/vllm-project/guidellm)
- **Inference:** vLLM 0.18.0
- **GPU:** NVIDIA L40S
- **Decode:** `temperature: 0.0`, `top_p: 1.0`
- **Measurement window:** Each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
- **Workload shape:** 8k/16k workload, as in the original model's card.


### Long Context

Pulsar 16B preserves strong long-context behavior after compression, tracking the Nemotron-3-Nano-30B-A3B baseline closely across retrieval-heavy and full-suite long-context evaluations. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER groupings up to 256k context.

![Long-context benchmark results](assets/long_context_comparison.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B |
| :--- | ---: | ---: |
| LongBench v1 | 31.84 | 29.84 |
| AA-LCR | 33.67 | 29.33 |
| NIAH (@100K) | 100.00 | 100.00 |
| RULER (@128K) | 95.02 | 94.20 |
| RULER (@256K) | 92.02 | 87.74 |

### Evaluation Methodology

Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.

#### Inference

- **Backend:** vLLM 0.18.0
- **Nemotron models:** `temp: 1.0`, `top_p: 1.0`
- **GPT-OSS-20B:** `temp: 1.0`, `top_p: 1.0`, `reasoning_effort: high`
- **Qwen3-14B:** `temp: 0.6`, `top_p: 0.95`, `top_k: 20`, `min_p: 0.0`
- **Ministral-3-14B-Instruct-2512:** `temp: 0.15`

| Benchmark | Framework | Repeats | Other |
|-----------|-----------|--------:|-------|
| MMLU-Pro | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 1 | |
| AIME25 | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 10 | |
| GPQA:d | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | |
| LiveCodeBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 3 | |
| IFBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | |
| LongBench v1 | [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) | 1 | |
| AA-LCR | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 3 | Judge: **`Qwen/Qwen3-235B-A22B-Instruct-2507`**. **`judge_score_type`:** `pattern`. **`judge_args` → `generation_config`:** `top_p` 0.8, `top_k` 20, `min_p` 0.0, `temperature` 0.7. |
| NIAH | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 1 | Judge: **`qwen/qwen3-235b-a22b-2507`**. **`judge_model_args`:** `{}` (no extra judge settings in YAML). |
| RULER | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) (+ [RULER](https://github.com/NVIDIA/RULER)) | 1 | |
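
The per-model sampling settings listed under Inference can be captured as request parameters. The sketch below is an illustration only: it assumes the settings are sent as top-level fields of an OpenAI-compatible chat-completions request to vLLM, whereas the evaluation frameworks in the table configure sampling through their own options. The dictionary keys and the `build_request` helper are hypothetical names.

```python
# Sampling settings per model family, as listed in the Inference section.
SAMPLING = {
    "nemotron": {"temperature": 1.0, "top_p": 1.0},
    "gpt-oss-20b": {"temperature": 1.0, "top_p": 1.0, "reasoning_effort": "high"},
    "qwen3-14b": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "ministral-3-14b-instruct-2512": {"temperature": 0.15},
}


def build_request(family, served_model, prompt):
    """Merge a family's sampling settings into a chat-completions body."""
    return {
        "model": served_model,
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING[family],
    }


print(build_request("qwen3-14b", "model", "Hello"))
```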

---

## Languages

- **Primary language**: English
- **Other languages**: Spanish

The model was trained mainly on English data, with some added Spanish. Languages other than English and Spanish have not been systematically evaluated.


## Safety & Limitations

### Known Limitations

- English-centric training data (inherited from base model).
- Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
- Compression may affect some behaviors; evaluate for your use case.

### Recommendations

- Validate tool outputs before running them
- Human oversight for critical use
- Task-specific eval before production

---

## Model Information

| Field        | Value |
|--------------|-------|
| Model name   | Pulsar 16B |
| Based on     | [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) |
| Version      | v1.5.0 |
| Release date | TBD |
| Developed by | Multiverse Computing |
| License      | Apache 2.0 |
| Contact      | business@multiversecomputing.com |

---

## Citation

If you use this model, please cite the base model and Pulsar 16B:

```bibtex
@misc{nemotron3nanoTR,
  title         = {NVIDIA Nemotron 3 Nano Technical Report},
  author        = {{NVIDIA}},
  year          = {2025},
  url           = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}
@misc{nemotron3nanoslim16b,
  title         = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
  author        = {Multiverse Computing},
  year          = {2026},
  url           = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
  note          = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}
```

**Built by [Multiverse Computing](https://www.multiversecomputing.com)** · [Report an issue](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16/discussions) · [Discord](https://discord.gg/cGas9uStqp)