---
license: apache-2.0
language:
- en
base_model:
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
---
# Pulsar 16B
<div align="center">
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![HuggingFace](https://img.shields.io/badge/🤗-Model_Hub-yellow.svg)](TODO_PULSAR_HF_URL)
[![Discord](https://img.shields.io/badge/Discord-Community-5865F2?logo=discord&logoColor=white)](https://discord.gg/cGas9uStqp)
Powered by CompactifAI
**Optimized for Fast and Efficient Inference** · **Reduced Memory Footprint**
</div>
---
## Table of Contents
- [Model Overview](#model-overview)
- [Key Characteristics](#key-characteristics)
- [Quick Start](#quick-start)
- [Reasoning Control](#thinking-reasoning-control)
- [Tool Calling](#tool-calling)
- [Training & Fine-Tuning](#training--fine-tuning)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [Languages](#languages)
- [Safety & Limitations](#safety--limitations)
- [Model Information](#model-information)
- [Citation](#citation)
---
## Model Overview
**Pulsar 16B** is a **model based on [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)**, developed by **Multiverse Computing**. The original model has **~31.6B parameters** and is part of the Nemotron model family. It supports **long-context inference up to 1M tokens** and is designed for general-purpose language modeling tasks.
This version applies **model compression techniques** to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves roughly **50% compression**, reducing the parameter count to **16.15B parameters** and lowering memory requirements. It is available in three precisions:
- [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
- [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8)
- [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4)
---
## Key Characteristics
| Characteristic | Description |
|-----------------------|-------------|
| Base model | [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). **31.6B** total parameters, **3.6B** activated per forward pass (11.34% activation ratio). [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
| Pulsar-16B-BF16 (this model) | **16.15B** total parameters, **3.1B** activated per forward pass (19.28% activation ratio) after CompactifAI compression. |
| 📐 **Architecture** | Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint). |
| 🛠️ **Tool calling** | Yes. Same tool-call structure and format as [Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). See [Tool Calling](#tool-calling). |
| 🗜️ **Compression** | CompactifAI (proprietary compression technology) |
| Primary language | English |
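
The ratios in the table can be reproduced with a few lines of arithmetic. The sketch below uses the rounded parameter counts quoted above, so its results differ slightly from the table's 11.34% and 19.28%, which were presumably computed from exact counts. The bytes-per-parameter values are an assumption for a rough weight-only estimate: they ignore quantization scales, any layers kept at higher precision, the KV cache, and activations.

```python
# Parameter counts in billions, as quoted in the table above (rounded).
BASE_TOTAL, BASE_ACTIVE = 31.6, 3.6
PULSAR_TOTAL, PULSAR_ACTIVE = 16.15, 3.1

# Assumed storage cost per weight (illustrative, weight-only estimate).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def ratio_pct(part, whole):
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


def weight_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


size_reduction = 100.0 - ratio_pct(PULSAR_TOTAL, BASE_TOTAL)
base_activation = ratio_pct(BASE_ACTIVE, BASE_TOTAL)
pulsar_activation = ratio_pct(PULSAR_ACTIVE, PULSAR_TOTAL)

print(f"Parameter reduction:     {size_reduction:.2f}%")
print(f"Base activation ratio:   {base_activation:.2f}%")
print(f"Pulsar activation ratio: {pulsar_activation:.2f}%")
for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision:>5} weights: ~{weight_gb(PULSAR_TOTAL, nbytes):.1f} GB")
```

Treat the sizes as lower bounds on required GPU memory, not as deployment requirements.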
---
## Quick Start
This model can be loaded with the **Transformers** API. Use `trust_remote_code=True`. Recommended approach: `AutoModelForCausalLM` with `apply_chat_template`. This configuration has been tested with Transformers 4.57.6.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]

# Render the chat template and tokenize in one step
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    do_sample=True,  # sampling must be enabled for temperature/top_p to apply
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,
)

# The decoded text contains the prompt followed by the generated reply
print(tokenizer.decode(outputs[0]))
```
Alternatively, you can use the `pipeline` API with `trust_remote_code=True`. The pipeline returns the full conversation structure, so extract the assistant message from `outputs[0]["generated_text"]` as needed.
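
As a minimal sketch of that extraction step, the helper below walks the conversation list backwards and returns the last assistant turn. The helper name `last_assistant_message` and the sample data are hypothetical; the sample is a hand-written stand-in for pipeline output, so no model is loaded here.

```python
def last_assistant_message(pipeline_outputs):
    """Return the content of the final assistant turn, or None if there is none."""
    conversation = pipeline_outputs[0]["generated_text"]
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message.get("content")
    return None


# Hand-written stand-in for what a chat-style text-generation pipeline returns.
sample_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Write a haiku about GPUs"},
            {"role": "assistant", "content": "Silicon rivers flow"},
        ]
    }
]

print(last_assistant_message(sample_outputs))
```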
### vLLM Serving
#### Installation
```bash
pip install -U "vllm>=0.12.0"
```
#### Reasoning parser (NVIDIA)
Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as [`nano_v3_reasoning_parser.py`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/blob/main/nano_v3_reasoning_parser.py) on the base Hugging Face repo (not specific to Pulsar). Direct download:
```bash
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
```
You can keep any local filename; the `vllm serve` flags below assume the file is in the current directory as `nano_v3_reasoning_parser.py`. If you mirror an identical copy under the Pulsar model repo, use that URL instead.
#### Serve
```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
```
> **Note:** The NeMo container `nvcr.io/nvidia/nemo:25.11.nemotron_3_nano` comes with `mamba_ssm` and `causal-conv1d` pre-installed.
---
## Thinking (Reasoning) Control
Pulsar 16B supports a **hybrid reasoning mode**: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the `enable_thinking` flag in the chat template.
> This section provides a brief overview of reasoning control in Pulsar 16B. For comprehensive details please see the official Nemotron-3 Nano-30B model card at: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard
---
### Transformers API
Pass `enable_thinking` through `apply_chat_template`:
**Thinking ON (default)**
```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,  # default; can be omitted
)
```
**Thinking OFF**
```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)
```
When thinking is ON, the model opens a `<think>` block before the answer. To separate the reasoning from the final answer, split the decoded output on the closing `</think>` tag:
```python
output = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No closing tag found: treat the whole output as the answer
    reasoning = ""
    answer = output.strip()
```
---
### vLLM
#### Server-level default
Set the default for **all requests** at startup with `--default-chat-template-kwargs`.
> Requires recent versions of vLLM.
**Thinking OFF for all requests**
```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  ...
```
**Thinking ON for all requests (default if flag is omitted)**
```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  ...
```
---
#### Per-request override
> **`--trust-request-chat-template` is required** to allow per-request overrides.
Individual requests can override the server default by passing `chat_template_kwargs` in the request body. This works regardless of the server-level default.
**Thinking ON/OFF for one request**
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "model",
        "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
        "max_tokens": 1024,
        "temperature": 1.0,
        # Set to False to turn thinking off for this request only
        "chat_template_kwargs": {"enable_thinking": True},
    },
)
```
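
When a reasoning parser is active, the server is expected to return the reasoning separately from the final answer. The sketch below shows one way to read both from a response body. The field names `reasoning_content` and `reasoning` are assumptions that depend on the vLLM version, and `split_chat_response` and the sample body are hypothetical; inspect `response.json()` from your own server to confirm the actual layout.

```python
def split_chat_response(body):
    """Return (reasoning, answer) from an OpenAI-style chat completion body."""
    message = body["choices"][0]["message"]
    # Assumed field names; which one is present depends on the vLLM version.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


# Hand-written sample body for illustration; not captured from a real server.
sample_body = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "Factor the quadratic: (x - 2)(x - 3) = 0.",
                "content": "x = 2 or x = 3",
            }
        }
    ]
}

reasoning, answer = split_chat_response(sample_body)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With thinking turned off, the reasoning field is typically empty or missing, in which case the helper returns an empty string for it.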
---
## Tool Calling
Pulsar 16B emits tool calls in the following format:
```
<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>
```
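
To make the structure concrete, here is a minimal regex-based sketch that turns this format into Python dictionaries. It is an illustration only, not the parser vLLM uses: `parse_tool_calls` is a hypothetical helper, it keeps every parameter value as a string, and it does no schema validation.

```python
import re

# One <tool_call> block wrapping a single <function=NAME> ... </function> body.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# One <parameter=KEY>VALUE</parameter> pair inside a function body.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Extract tool calls as a list of {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


sample = """<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(sample))
```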
When serving (e.g., with vLLM), you **must** use the `qwen3_coder` tool parser.
```bash
vllm serve <model_path> \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
```
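
On the client side, tools are advertised through the OpenAI-compatible `tools` field of the chat request. The sketch below only builds and prints such a payload. The `get_weather` schema and the `build_tool_request` helper are hypothetical, and `"model": "model"` assumes the server was started with `--served-model-name model` as in the Quick Start command.

```python
import json


def build_tool_request(user_prompt):
    """Build an OpenAI-style chat request that advertises one tool."""
    get_weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
    return {
        "model": "model",  # assumed served model name
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [get_weather_tool],
        "tool_choice": "auto",
        "max_tokens": 1024,
    }


payload = build_tool_request("What is the weather in Paris, in celsius?")
print(json.dumps(payload, indent=2))
```

Send the payload as the JSON body of a POST to `/v1/chat/completions`. With the tool parser enabled, any call the model makes should come back as structured `tool_calls` on the response message instead of raw tags.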
## Training & Fine-Tuning
### Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
The base model [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the [original model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) for details.
### CompactifAI Compression
CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning (SFT) was then applied to improve capabilities.
---
## Evaluation & Benchmarks
![Combined benchmark chart](assets/benchmarks.png)
| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | gpt-oss-20b | Qwen3-14B | Ministral-3-14B-Instruct-2512 |
| --- | ---: | ---: | ---: | ---: | ---: |
| AIME | 87.66 | 87.22 | 87.66 | 76.00 | 33.00 |
| GPQA | 74.04 | 71.41 | 68.99 | 63.63 | 56.45 |
| IFBench | 72.31 | 70.79 | 68.46 | 39.20 | 32.80 |
| MMLU-Pro | 78.90 | 74.78 | 76.65 | 85.01 | 70.09 |
| LiveCodeBench | 71.11 | 68.04 | 64.65 | 66.35 | 29.84 |
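
One way to read the table is as score retention, that is, the Pulsar 16B score as a percentage of the base model's score. The sketch below computes this from the two relevant columns. Retention is a convenience metric introduced here for illustration, not one reported by the authors.

```python
# (base model score, Pulsar 16B score), copied from the table above.
SCORES = {
    "AIME": (87.66, 87.22),
    "GPQA": (74.04, 71.41),
    "IFBench": (72.31, 70.79),
    "MMLU-Pro": (78.90, 74.78),
    "LiveCodeBench": (71.11, 68.04),
}


def retention_pct(base, compressed):
    """Compressed-model score as a percentage of the base-model score."""
    return 100.0 * compressed / base


retention = {name: retention_pct(base, pulsar) for name, (base, pulsar) in SCORES.items()}
mean_retention = sum(retention.values()) / len(retention)

for name, value in retention.items():
    print(f"{name:<14} {value:6.2f}%")
print(f"{'Mean':<14} {mean_retention:6.2f}%")
```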
### Quantizations
- [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
- [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8)
- [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4)
![Quantization results](assets/quantization_comparisons.png)
| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B (BF16) | Pulsar 16B (FP8) | Pulsar 16B (NVFP4) |
| --- | ---: | ---: | ---: | ---: |
| AIME | 87.66 | 87.22 | 86.67 | 82.00 |
| GPQA | 74.04 | 71.41 | 70.61 | 71.11 |
| IFBench | 72.31 | 70.79 | 69.60 | 69.90 |
| MMLU-Pro | 78.90 | 74.78 | 74.76 | 74.19 |
| LiveCodeBench | 71.11 | 68.04 | 68.68 | 65.60 |
### Performance
![Performance results](assets/performance.png)
- **Framework:** [guidellm](https://github.com/vllm-project/guidellm)
- **Inference:** vLLM 0.18.0
- **GPU:** NVIDIA L40s
- **Decode:** `temperature: 0.0`, `top_p: 1.0`
- **Measurement window:** Each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
- **Workload shape:** 8k/16k workload as in the original model's card.
### Long Context
Pulsar 16B preserves strong long-context behavior after compression, tracking the Nemotron-3-Nano-30B-A3B baseline closely across retrieval-heavy and full-suite long-context evaluations. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER groupings up to 256k context.
![Long-context benchmark results](assets/long_context_comparison.png)
| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B |
| :--- | ---: | ---: |
| LongBench v1 | 31.84 | 29.84 |
| AA-LCR | 33.67 | 29.33 |
| NIAH (@100K) | 100.00 | 100.00 |
| RULER (@128K) | 95.02 | 94.20 |
| RULER (@256K) | 92.02 | 87.74 |
### Evaluation Methodology
Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.
#### Inference
- **Backend:** vLLM 0.18.0
- **Nemotron models:** `temp: 1.0`, `top_p: 1.0`
- **GPT-OSS-20B:** `temp: 1.0`, `top_p: 1.0`, `reasoning_effort: high`
- **Qwen3-14B:** `temp: 0.6`, `top_p: 0.95`, `top_k: 20`, `min_p: 0.0`
- **Ministral-3-14B-Instruct-2512:** `temp: 0.15`
| Benchmark | Framework | Repeats | Other |
|-----------|-----------|--------:|-------|
| MMLU-Pro | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 1 | |
| AIME25 | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 10 | |
| GPQA:d | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | |
| LiveCodeBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 3 | |
| IFBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | |
| LongBench v1 | [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) | 1 | |
| AA-LCR | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 3 | Judge: **`Qwen/Qwen3-235B-A22B-Instruct-2507`**. **`judge_score_type`:** `pattern`. **`judge_args` → `generation_config`:** `top_p` 0.8, `top_k` 20, `min_p` 0.0, `temperature` 0.7. |
| NIAH | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 1 | Judge: **`qwen/qwen3-235b-a22b-2507`**. **`judge_model_args`:** `{}` (no extra judge settings in YAML). |
| RULER | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) (+ [RULER](https://github.com/NVIDIA/RULER)) | 1 | |
---
## Languages
- **Primary language**: English
- **Other languages**: Spanish
The model was trained mainly on English data, with additional Spanish data. Languages other than English and Spanish have not been systematically evaluated.
## Safety & Limitations
### Known Limitations
- English-centric training data (inherited from base model).
- Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
- Compression may affect some behaviors; evaluate for your use case.
### Recommendations
- Validate tool outputs before running them
- Human oversight for critical use
- Task-specific eval before production
---
## Model Information
| Field | Value |
|--------------|--------------------- |
| Model name | Pulsar 16B |
| Based on | [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) |
| Version | v1.5.0 |
| Release date | TBD |
| Developed by | Multiverse Computing |
| License | Apache 2.0 |
| Contact | business@multiversecomputing.com |
---
## Citation
If you use this model, please cite the base model and Pulsar 16B:
```bibtex
@misc{nemotron3nanoTR,
  title  = {NVIDIA Nemotron 3 Nano Technical Report},
  author = {{NVIDIA}},
  year   = {2025},
  url    = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}

@misc{nemotron3nanoslim16b,
  title  = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
  author = {Multiverse Computing},
  year   = {2026},
  url    = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
  note   = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}
```
**Built by [Multiverse Computing](https://www.multiversecomputing.com)** · [Report an issue](TODO_PULSAR_HF_URL/discussions) · [Discord](https://discord.gg/cGas9uStqp)