	---
	license: apache-2.0
	language:
	- en
	base_model:
	- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
	---

	# Pulsar 16B
	<div align="center">

	[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![HuggingFace](https://img.shields.io/badge/🤗-Model_Hub-yellow.svg)](TODO_PULSAR_HF_URL)
	[![Discord](https://img.shields.io/badge/Discord-Community-5865F2?logo=discord&logoColor=white)](https://discord.gg/cGas9uStqp)

	Powered by CompactifAI

	Optimized for Fast and Efficient Inference · Reduced Memory Footprint

	</div>

	---

	## Table of Contents

	- [Model Overview](#model-overview)
	- [Key Characteristics](#key-characteristics)
	- [Quick Start](#quick-start)
	- [Reasoning Control](#thinking-reasoning-control)
	- [Tool Calling](#tool-calling)
	- [Training & Fine-Tuning](#training--fine-tuning)
	- [Evaluation & Benchmarks](#evaluation--benchmarks)
	- [Languages](#languages)
	- [Safety & Limitations](#safety--limitations)
	- [Model Information](#model-information)
	- [Citation](#citation)

	---

	## Model Overview

Pulsar 16B is a model developed by Multiverse Computing, based on [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). The original model has ~31.6B parameters and is part of the Nemotron model family. It supports long-context inference up to 1M tokens and is designed for general-purpose language modeling tasks.

This version applies model compression techniques to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves roughly 50% compression (from ~31.6B to 16.15B parameters), which lowers memory requirements. It is available in three precisions:

	- [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
	- [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8)
	- [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4)
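
The precisions listed above differ mainly in storage cost per parameter. As a rough sketch, the snippet below computes the parameter reduction from the counts quoted in this card and a weights-only memory estimate per precision. The bytes-per-parameter values are assumptions (2, 1, and 0.5 bytes), and the estimate ignores quantization scale factors, any layers kept in higher precision, the KV cache, and activations, so real footprints will differ.

```python
# Parameter counts quoted in this model card (in billions).
BASE_PARAMS_B = 31.6
PULSAR_PARAMS_B = 16.15

# Assumed storage cost per parameter; real checkpoints will deviate from this.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def param_reduction(base_b: float, compressed_b: float) -> float:
    """Fraction of parameters removed by compression."""
    return 1.0 - compressed_b / base_b


def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_b * bytes_per_param


if __name__ == "__main__":
    reduction = param_reduction(BASE_PARAMS_B, PULSAR_PARAMS_B)
    print(f"Parameter reduction: {reduction:.1%}")
    for name, nbytes in BYTES_PER_PARAM.items():
        print(f"{name}: ~{weight_memory_gb(PULSAR_PARAMS_B, nbytes):.1f} GB of weights")
```

Treat these numbers as a lower bound when sizing a GPU; serving frameworks need additional memory on top of the weights.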

	---

	## Key Characteristics

| Characteristic | Description |
|-----------------------|-------------|
| Base model | [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). 31.6B total parameters, 3.6B activated per forward pass (11.34% activation ratio). [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
| Pulsar-16B-BF16 (this model) | 16.15B total parameters, 3.1B activated per forward pass (19.28% activation ratio) after CompactifAI compression. |
| 📐 Architecture | Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint). |
| 🛠️ Tool calling | Yes. Same tool-call structure and format as [Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). See [Tool Calling](#tool-calling). |
| 🗜️ Compression | CompactifAI (proprietary compression technology) |
| Primary language | English |

---

## Quick Start

This model can be loaded with the Transformers API; pass `trust_remote_code=True`. The recommended approach is `AutoModelForCausalLM` together with `apply_chat_template`. This configuration has been tested with Transformers 4.57.6.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

# trust_remote_code=True allows Transformers to run the custom code in the model repo.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]

# Render the chat template and tokenize in one step.
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,
)

# outputs[0] holds the prompt tokens followed by the generated tokens.
print(tokenizer.decode(outputs[0]))
```
Alternatively, you can use the `pipeline` API with `trust_remote_code=True`. The pipeline returns the full conversation structure, so extract the assistant message from `outputs[0]["generated_text"]` as needed.
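
A minimal sketch of that extraction step is shown below. It assumes the pipeline was called with a list of chat messages, so that `outputs[0]["generated_text"]` is a list of `{"role", "content"}` dicts. The helper name `last_assistant_message` and the sample data are hypothetical and only illustrate the structure; no model is loaded here.

```python
from typing import Any


def last_assistant_message(outputs: list[dict[str, Any]]) -> str:
    """Return the content of the final assistant turn from pipeline-style output."""
    conversation = outputs[0]["generated_text"]
    # Walk backwards so the most recent assistant turn wins.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")


# Hand-written stand-in for what a chat-style pipeline call returns.
sample_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Write a haiku about GPUs"},
            {"role": "assistant", "content": "(model reply here)"},
        ]
    }
]

print(last_assistant_message(sample_outputs))
```

With a real pipeline, pass its return value in place of `sample_outputs`.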

	### vLLM Serving

	#### Installation

	```bash
	pip install -U "vllm>=0.12.0"
	```

	#### Reasoning parser (NVIDIA)

	Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as [`nano_v3_reasoning_parser.py`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/blob/main/nano_v3_reasoning_parser.py) on the base Hugging Face repo (not specific to Pulsar). Direct download:

	```bash
	wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
	```

	You can keep any local filename; the `vllm serve` flags below assume the file is in the current directory as `nano_v3_reasoning_parser.py`. If you mirror an identical copy under the Pulsar model repo, use that URL instead.

	#### Serve

	```bash
	vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
	--served-model-name model \
	--max-num-seqs 8 \
	--tensor-parallel-size 1 \
	--port 8000 \
	--trust-remote-code \
	--enable-auto-tool-choice \
	--tool-call-parser qwen3_coder \
	--reasoning-parser-plugin nano_v3_reasoning_parser.py \
	--reasoning-parser nano_v3
	```

	> Note: The NeMo container `nvcr.io/nvidia/nemo:25.11.nemotron_3_nano` comes with `mamba_ssm` and `causal-conv1d` pre-installed.
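
Once the server is running with the reasoning parser enabled, the reasoning text is returned separately from the final answer in each chat completion message. The sketch below splits the two. The field names are an assumption: `reasoning_content` is used by many vLLM releases, while some newer ones may expose `reasoning` instead, so check the response produced by your version. The helper `split_chat_message` and the example message are hypothetical.

```python
from typing import Optional, Tuple


def split_chat_message(message: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) from one choice's `message` dict."""
    # Assumption: the reasoning field name depends on the vLLM version in use.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    return reasoning, message.get("content")


# Hand-written example shaped like choices[0]["message"]; not real model output.
example_message = {
    "role": "assistant",
    "reasoning_content": "(reasoning text)",
    "content": "(final answer)",
}

reasoning, answer = split_chat_message(example_message)
print("reasoning:", reasoning)
print("answer:", answer)
```

Against a live server, the input would be `response.json()["choices"][0]["message"]` from a `/v1/chat/completions` call.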

	---

	## Thinking (Reasoning) Control

	Pulsar 16B supports a hybrid reasoning mode: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the `enable_thinking` flag in the chat template.

	> This section provides a brief overview of reasoning control in Pulsar 16B. For comprehensive details please see the official Nemotron-3 Nano-30B model card at: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard


	---

	### Transformers API

	Pass `enable_thinking` through `apply_chat_template`:

	Thinking ON (default)
```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,  # default; can be omitted
)
```

	Thinking OFF
```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)
```

When thinking is ON, the model opens a `<think>` block before the answer. The reasoning can be separated from the final answer by splitting the decoded text on `</think>`:

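```python
# Decode only the newly generated tokens so the prompt is not mixed into the reasoning
generated = outputs[0][tokenized_chat.shape[-1]:]
output = tokenizer.decode(generated, skip_special_tokens=True)

# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No reasoning block found (e.g. thinking OFF)
    reasoning = ""
    answer = output.strip()
```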

	---

	### vLLM

	#### Server-level default

	Set the default for all requests at startup with `--default-chat-template-kwargs`.

	> Requires recent versions of vLLM.

	Thinking OFF for all requests
	```bash
	vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
	--served-model-name model \
	--reasoning-parser-plugin nano_v3_reasoning_parser.py \
	--reasoning-parser nano_v3 \
	--trust-request-chat-template \
	--default-chat-template-kwargs '{"enable_thinking": false}' \
	...
	```

	Thinking ON for all requests (default if flag is omitted)
	```bash
	vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
	--served-model-name model \
	--reasoning-parser-plugin nano_v3_reasoning_parser.py \
	--reasoning-parser nano_v3 \
	--trust-request-chat-template \
	--default-chat-template-kwargs '{"enable_thinking": true}' \
	...
	```


	---

	#### Per-request override

	> `--trust-request-chat-template` is required to allow per-request overrides.

	Individual requests can override the server default by passing `chat_template_kwargs` in the request body. This works regardless of the server-level default.

	Thinking ON/OFF for one request
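```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "model",
        "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
        "max_tokens": 1024,
        "temperature": 1.0,
        # Set to False to turn thinking off for this request only
        "chat_template_kwargs": {"enable_thinking": True},
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```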
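
If you toggle thinking often, it can help to build the request body in one place. The following is a small sketch under stated assumptions: the helper `build_chat_request` is hypothetical, the default `model` value matches the `--served-model-name model` flag used in this card, and passing `None` simply omits `chat_template_kwargs` so that the server-level default applies.

```python
from typing import Any, Dict, Optional


def build_chat_request(
    prompt: str,
    enable_thinking: Optional[bool] = None,
    model: str = "model",
    max_tokens: int = 1024,
) -> Dict[str, Any]:
    """Build a /v1/chat/completions body; None keeps the server-level default."""
    body: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
    }
    # Only send the override when the caller asked for one explicitly.
    if enable_thinking is not None:
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return body


print(build_chat_request("Solve: x² - 5x + 6 = 0", enable_thinking=False))
```

The returned dict can be passed as the `json=` argument of `requests.post`.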

	---

	## Tool Calling

	Pulsar 16B emits tool calls in the following format:

	```
	<tool_call>
	<function=get_weather>
	<parameter=city>Paris</parameter>
	<parameter=unit>celsius</parameter>
	</function>
	</tool_call>
	```

When serving (e.g., with vLLM), you must use the `qwen3_coder` tool parser:

	```bash
	vllm serve <model_path> \
	--enable-auto-tool-choice \
	--tool-call-parser qwen3_coder \
	--trust-remote-code
	```
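
When generating with plain Transformers there is no server-side parser, so the tool-call text has to be parsed by hand. The snippet below is a minimal regex-based sketch for the format shown above; it is not the `qwen3_coder` parser itself. It assumes well-formed, non-nested tags and keeps every argument as a string, whereas a production parser would also convert values according to the tool schema. The name `parse_tool_calls` is hypothetical.

```python
import re
from typing import Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict[str, object]]:
    """Extract tool calls as {"name": ..., "arguments": {...}} dicts."""
    calls: List[Dict[str, object]] = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in FUNCTION_RE.findall(block):
            # Argument values are kept as raw strings (no schema-based typing).
            arguments = {
                key.strip(): value.strip()
                for key, value in PARAMETER_RE.findall(body)
            }
            calls.append({"name": name.strip(), "arguments": arguments})
    return calls


sample = """<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(sample))
# → [{'name': 'get_weather', 'arguments': {'city': 'Paris', 'unit': 'celsius'}}]
```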

	## Training & Fine-Tuning

	### Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

	The base model [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the [original model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) for details.


	### CompactifAI Compression

CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning was also applied to improve capabilities.

	---

	## Evaluation & Benchmarks

	![Combined benchmark chart](assets/benchmarks.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | gpt-oss-20b | Qwen3-14B | Ministral-3-14B-Instruct-2512 |
| --- | ---: | ---: | ---: | ---: | ---: |
| AIME | 87.66 | 87.22 | 87.66 | 76.00 | 33.00 |
| GPQA | 74.04 | 71.41 | 68.99 | 63.63 | 56.45 |
| IFBench | 72.31 | 70.79 | 68.46 | 39.20 | 32.80 |
| MMLU-Pro | 78.90 | 74.78 | 76.65 | 85.01 | 70.09 |
| LiveCodeBench | 71.11 | 68.04 | 64.65 | 66.35 | 29.84 |
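
To read the table above in relative terms, the sketch below computes how much of the base model's score Pulsar 16B retains on each benchmark. The scores are copied from the table; the `retention` helper is only an illustrative way to summarize them, not a metric used by the authors.

```python
# Scores copied from the table above: (Nemotron 3 Nano 30B A3B, Pulsar 16B).
SCORES = {
    "AIME": (87.66, 87.22),
    "GPQA": (74.04, 71.41),
    "IFBench": (72.31, 70.79),
    "MMLU-Pro": (78.90, 74.78),
    "LiveCodeBench": (71.11, 68.04),
}


def retention(base: float, compressed: float) -> float:
    """Compressed score as a percentage of the base score."""
    return 100.0 * compressed / base


for name, (base, pulsar) in SCORES.items():
    delta = pulsar - base
    print(f"{name}: {retention(base, pulsar):.1f}% of base ({delta:+.2f} points)")
```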

	### Quantizations

	- [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
	- [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8)
	- [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4)

	![Quantization results](assets/quantization_comparisons.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B (BF16) | Pulsar 16B (FP8) | Pulsar 16B (NVFP4) |
| --- | ---: | ---: | ---: | ---: |
| AIME | 87.66 | 87.22 | 86.67 | 82.00 |
| GPQA | 74.04 | 71.41 | 70.61 | 71.11 |
| IFBench | 72.31 | 70.79 | 69.60 | 69.90 |
| MMLU-Pro | 78.90 | 74.78 | 74.76 | 74.19 |
| LiveCodeBench | 71.11 | 68.04 | 68.68 | 65.60 |


### Performance

![Performance results](assets/performance.png)

- Framework: [guidellm](https://github.com/vllm-project/guidellm)
- Inference: vLLM 0.18.0
- GPU: NVIDIA L40S
- Decoding: `temperature: 0.0`, `top_p: 1.0`
- Measurement window: each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
- Workload shape: 8k/16k workload, as in the original model's card.


### Long Context

Pulsar 16B preserves most of the base model's long-context behavior after compression: it matches the Nemotron-3-Nano-30B-A3B baseline on NIAH and stays within about 4.5 points of it on the other long-context evaluations. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER at context lengths up to 256k.

	![Long-context benchmark results](assets/long_context_comparison.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B |
| :--- | ---: | ---: |
| LongBench v1 | 31.84 | 29.84 |
| AA-LCR | 33.67 | 29.33 |
| NIAH (@100K) | 100.00 | 100.00 |
| RULER (@128K) | 95.02 | 94.20 |
| RULER (@256K) | 92.02 | 87.74 |

	### Evaluation Methodology

	Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.

#### Inference

- Backend: vLLM 0.18.0
- Nemotron models: `temp: 1.0`, `top_p: 1.0`
- GPT-OSS-20B: `temp: 1.0`, `top_p: 1.0`, `reasoning_effort: high`
- Qwen3-14B: `temp: 0.6`, `top_p: 0.95`, `top_k: 20`, `min_p: 0.0`
- Ministral-3-14B-Instruct-2512: `temp: 0.15`

| Benchmark | Framework | Repeats | Other |
|-----------|-----------|--------:|-------|
| MMLU-Pro | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 1 | |
| AIME25 | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 10 | |
| GPQA:d | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | |
| LiveCodeBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 3 | |
| IFBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | |
| LongBench v1 | [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) | 1 | |
| AA-LCR | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 3 | Judge: `Qwen/Qwen3-235B-A22B-Instruct-2507`. `judge_score_type`: `pattern`. `judge_args` → `generation_config`: `top_p` 0.8, `top_k` 20, `min_p` 0.0, `temperature` 0.7. |
| NIAH | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 1 | Judge: `qwen/qwen3-235b-a22b-2507`. `judge_model_args`: `{}` (no extra judge settings in YAML). |
| RULER | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) (+ [RULER](https://github.com/NVIDIA/RULER)) | 1 | |

	---

	## Languages

	- Primary language: English
	- Other languages: Spanish

The model was trained mainly on English data, with some Spanish data added. Languages other than English and Spanish have not been systematically evaluated.

---

	## Safety & Limitations

	### Known Limitations

	- English-centric training data (inherited from base model).
	- Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
	- Compression may affect some behaviors; evaluate for your use case.

	### Recommendations

- Validate model-generated tool calls before executing them.
- Keep human oversight in place for critical use cases.
- Run task-specific evaluations before production deployment.

	---

	## Model Information

| Field | Value |
|--------------|---------------------|
| Model name | Pulsar 16B |
| Based on | [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) |
| Version | v1.5.0 |
| Release date | TBD |
| Developed by | Multiverse Computing |
| License | Apache 2.0 |
| Contact | business@multiversecomputing.com |

	---

	## Citation

	If you use this model, please cite the base model and Pulsar 16B:

```bibtex
@misc{nemotron3nanoTR,
  title  = {NVIDIA Nemotron 3 Nano Technical Report},
  author = {{NVIDIA}},
  year   = {2025},
  url    = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}

@misc{nemotron3nanoslim16b,
  title  = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
  author = {{Multiverse Computing}},
  year   = {2026},
  url    = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
  note   = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}
```

	Built by [Multiverse Computing](https://www.multiversecomputing.com) · [Report an issue](TODO_PULSAR_HF_URL/discussions) · [Discord](https://discord.gg/cGas9uStqp)