---
license: apache-2.0
language:
- en
base_model:
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
---

# Pulsar 16B

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![HuggingFace](https://img.shields.io/badge/🤗-Model_Hub-yellow.svg)](TODO_PULSAR_HF_URL)
[![Discord](https://img.shields.io/badge/Discord-Community-5865F2?logo=discord&logoColor=white)](https://discord.gg/cGas9uStqp)

Powered by CompactifAI

**Optimized for Fast and Efficient Inference** · **Reduced Memory Footprint**

---

## Table of Contents

- [Model Overview](#model-overview)
- [Key Characteristics](#key-characteristics)
- [Quick Start](#quick-start)
- [Reasoning Control](#thinking-reasoning-control)
- [Tool Calling](#tool-calling)
- [Training & Fine-Tuning](#training--fine-tuning)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [Languages](#languages)
- [Safety & Limitations](#safety--limitations)
- [Model Information](#model-information)
- [Citation](#citation)

---

## Model Overview

**Pulsar 16B** is a **model based on [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)**, developed by **Multiverse Computing**. The original model has **~31.6B parameters** and is part of the Nemotron model family. It supports **long-context inference up to 1M tokens** and is designed for general-purpose language modeling tasks.

This version applies **model compression techniques** to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves **about 50% compression**, reducing the parameter count to **16.15B parameters** and lowering memory requirements.

The model is available in three precisions:

- [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
- [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8)
- [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4)

---

## Key Characteristics

| Characteristic | Description |
|-----------------------|-------------|
| Base model | [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). **31.6B** total parameters, **3.6B** activated per forward pass (11.34% activation ratio). [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
| Pulsar-16B-BF16 (this model) | **16.15B** total parameters, **3.1B** activated per forward pass (19.28% activation ratio) after CompactifAI compression. |
| 📐 **Architecture** | Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint). |
| 🛠️ **Tool calling** | Yes. Same tool-call structure and format as [Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16). See [Tool Calling](#tool-calling). |
| 🗜️ **Compression** | CompactifAI (proprietary compression technology) |
| Primary language | English |

---

## Quick Start

This model can be loaded with the **Transformers** API. Use `trust_remote_code=True`. The recommended approach is `AutoModelForCausalLM` with `apply_chat_template`. This configuration has been tested with Transformers 4.57.6.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]

# Render the conversation with the model's chat template and tokenize it
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    do_sample=True,  # sampling must be enabled for temperature/top_p to apply
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,
)

# Decodes the prompt plus the generated continuation
print(tokenizer.decode(outputs[0]))
```

Alternatively, you can use the `pipeline` API with `trust_remote_code=True`. The pipeline returns the full conversation structure, so extract the assistant message from `outputs[0]["generated_text"]` as needed.

### vLLM Serving

#### Installation

```bash
pip install -U "vllm>=0.12.0"
```

#### Reasoning parser (NVIDIA)

Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model.
NVIDIA provides the vLLM plugin as [`nano_v3_reasoning_parser.py`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/blob/main/nano_v3_reasoning_parser.py) on the base Hugging Face repo (not specific to Pulsar). Direct download:

```bash
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
```

You can keep any local filename; the `vllm serve` flags below assume the file is in the current directory as `nano_v3_reasoning_parser.py`. If you mirror an identical copy under the Pulsar model repo, use that URL instead.

#### Serve

```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
```

> **Note:** The NeMo container `nvcr.io/nvidia/nemo:25.11.nemotron_3_nano` comes with `mamba_ssm` and `causal-conv1d` pre-installed.

---

## Thinking (Reasoning) Control

Pulsar 16B supports a **hybrid reasoning mode**: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the `enable_thinking` flag in the chat template.

> This section provides a brief overview of reasoning control in Pulsar 16B.
For comprehensive details, see the official Nemotron-3 Nano-30B model card at: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard

---

### Transformers API

Pass `enable_thinking` through `apply_chat_template`:

**Thinking ON (default)**

```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,  # default; can be omitted
)
```

**Thinking OFF**

```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)
```

When thinking is ON, the model opens a `<think>` block before the answer and closes it with `</think>`.

```python
output = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No reasoning block was emitted (e.g. thinking OFF)
    reasoning, answer = "", output.strip()
```

---

### vLLM

#### Server-level default

Set the default for **all requests** at startup with `--default-chat-template-kwargs`.

> Requires recent versions of vLLM.

**Thinking OFF for all requests**

```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  ...
```

**Thinking ON for all requests (default if flag is omitted)**

```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  ...
```

---

#### Per-request override

> **`--trust-request-chat-template` is required** to allow per-request overrides.

Individual requests can override the server default by passing `chat_template_kwargs` in the request body.
This works regardless of the server-level default.

**Thinking ON/OFF for one request**

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "model",
        "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
        "max_tokens": 1024,
        "temperature": 1.0,
        # Set to False to turn thinking off for this request only
        "chat_template_kwargs": {"enable_thinking": True},
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

---

## Tool Calling

Pulsar 16B emits tool calls in the XML-style format handled by the `qwen3_coder` parser. In the example below, the function and parameter names (`get_weather`, `city`, `unit`) are illustrative placeholders:

```
<tool_call>
<function=get_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>
```

When serving (e.g., with vLLM), you **must** use the `qwen3_coder` tool parser.

```bash
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
```

## Training & Fine-Tuning

### Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

The base model [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) is a large language model (LLM) trained from scratch by NVIDIA and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the [original model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) for details.

### CompactifAI Compression

CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning was applied to improve capabilities.
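As a rough sanity check on the size figures quoted in this card, the sketch below derives the parameter reduction from the two total parameter counts and estimates weights-only memory for each published precision. The bytes-per-parameter values (2 for BF16, 1 for FP8, 0.5 for NVFP4) are assumptions that ignore quantization scales, activations, and the KV/state cache; they are not measurements from the authors. On these numbers the reduction works out to about 48.9%, consistent with the stated roughly 50% compression.

```python
# Total parameter counts quoted in this model card, in billions.
BASE_TOTAL_B = 31.6     # NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
PULSAR_TOTAL_B = 16.15  # Pulsar 16B

# ASSUMPTION: nominal storage cost per parameter, weights only.
# Real checkpoints also store quantization scales and may keep some
# layers in higher precision, so actual file sizes will differ.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def reduction_pct(base_b: float, compressed_b: float) -> float:
    """Percentage of parameters removed by compression."""
    return 100.0 * (1.0 - compressed_b / base_b)


def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives gigabytes.
    """
    return params_b * bytes_per_param


print(f"Parameter reduction: {reduction_pct(BASE_TOTAL_B, PULSAR_TOTAL_B):.1f}%")
print(f"Base model, BF16 weights: ~{weight_gb(BASE_TOTAL_B, 2.0):.1f} GB")
for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"Pulsar 16B, {fmt} weights: ~{weight_gb(PULSAR_TOTAL_B, nbytes):.1f} GB")
```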
## Evaluation & Benchmarks

![Combined benchmark chart](assets/benchmarks.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | gpt-oss-20b | Qwen3-14B | Ministral-3-14B-Instruct-2512 |
| --- | ---: | ---: | ---: | ---: | ---: |
| AIME | 87.66 | 87.22 | 87.66 | 76.00 | 33.00 |
| GPQA | 74.04 | 71.41 | 68.99 | 63.63 | 56.45 |
| IFBench | 72.31 | 70.79 | 68.46 | 39.20 | 32.80 |
| MMLU-Pro | 78.90 | 74.78 | 76.65 | 85.01 | 70.09 |
| LiveCodeBench | 71.11 | 68.04 | 64.65 | 66.35 | 29.84 |

### Quantizations

- [BF16](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16)
- [FP8](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-FP8)
- [NVFP4](https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-NVFP4)

![Quantization results](assets/quantization_comparisons.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B (BF16) | Pulsar 16B (FP8) | Pulsar 16B (NVFP4) |
| --- | ---: | ---: | ---: | ---: |
| AIME | 87.66 | 87.22 | 86.67 | 82.00 |
| GPQA | 74.04 | 71.41 | 70.61 | 71.11 |
| IFBench | 72.31 | 70.79 | 69.60 | 69.90 |
| MMLU-Pro | 78.90 | 74.78 | 74.76 | 74.19 |
| LiveCodeBench | 71.11 | 68.04 | 68.68 | 65.60 |

### Performance

![Performance results](assets/performance.png)

- **Framework:** [guidellm](https://github.com/vllm-project/guidellm)
- **Inference:** vLLM 0.18.0
- **GPU:** NVIDIA L40S
- **Decode:** `temperature: 0.0`, `top_p: 1.0`
- **Measurement window:** Each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
- **Workload shape:** 8k/16k workload, as in the original model's card.

### Long Context

Pulsar 16B preserves strong long-context behavior after compression, tracking the Nemotron-3-Nano-30B-A3B baseline closely across retrieval-heavy and full-suite long-context evaluations. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER groupings up to 256k context.
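To put a number on "tracking closely", the sketch below expresses each Pulsar 16B long-context score as a percentage of the baseline score, using the values from the table in this section. This "retention" figure is an illustrative metric introduced here, not one reported by the authors. On these numbers it ranges from about 87% (AA-LCR) to 100% (NIAH).

```python
# Long-context scores copied from the table in this section:
# (Nemotron 3 Nano 30B A3B baseline, Pulsar 16B).
scores = {
    "LongBench v1": (31.84, 29.84),
    "AA-LCR": (33.67, 29.33),
    "NIAH (@100K)": (100.00, 100.00),
    "RULER (@128K)": (95.02, 94.20),
    "RULER (@256K)": (92.02, 87.74),
}


def retention(baseline: float, compressed: float) -> float:
    """Return the compressed score as a percentage of the baseline score."""
    return 100.0 * compressed / baseline


# Retention per benchmark (hypothetical helper metric, see lead-in).
retained = {name: retention(b, c) for name, (b, c) in scores.items()}

for name, pct in retained.items():
    baseline, compressed = scores[name]
    print(f"{name:<14} {baseline:6.2f} -> {compressed:6.2f}  ({pct:5.1f}% retained)")
```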
![Long-context benchmark results](assets/long_context_comparison.png)

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B |
| :--- | ---: | ---: |
| LongBench v1 | 31.84 | 29.84 |
| AA-LCR | 33.67 | 29.33 |
| NIAH (@100K) | 100.00 | 100.00 |
| RULER (@128K) | 95.02 | 94.20 |
| RULER (@256K) | 92.02 | 87.74 |

### Evaluation Methodology

Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.

### Inference:

- **Backend:** vLLM 0.18.0
- **Nemotron models:** `temp: 1.0`, `top_p: 1.0`
- **GPT-OSS-20B:** `temp: 1.0`, `top_p: 1.0`, `reasoning_effort: high`
- **Qwen3-14B:** `temp: 0.6`, `top_p: 0.95`, `top_k: 20`, `min_p: 0.0`
- **Ministral-3-14B-Instruct-2512:** `temp: 0.15`

| Benchmark | Framework | Repeats | Other |
|-----------|-----------|--------:|-------|
| MMLU-Pro | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 1 | |
| AIME25 | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 10 | |
| GPQA:d | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | |
| LiveCodeBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 3 | |
| IFBench | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) | 5 | |
| LongBench v1 | [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) | 1 | |
| AA-LCR | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 3 | Judge: **`Qwen/Qwen3-235B-A22B-Instruct-2507`**. **`judge_score_type`:** `pattern`. **`judge_args` → `generation_config`:** `top_p` 0.8, `top_k` 20, `min_p` 0.0, `temperature` 0.7. |
| NIAH | [EvalScope](https://github.com/modelscope/evalscope) 1.4.1 | 1 | Judge: **`qwen/qwen3-235b-a22b-2507`**. **`judge_model_args`:** `{}` (no extra judge settings in YAML). |
| RULER | [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) (+ [RULER](https://github.com/NVIDIA/RULER)) | 1 | |

---

## Languages

- **Primary language**: English
- **Other languages**: Spanish

Trained mainly on English, with some Spanish added.
No systematic evaluation has been performed for languages other than English and Spanish.

## Safety & Limitations

### Known Limitations

- English-centric training data (inherited from base model).
- Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
- Compression may affect some behaviors; evaluate for your use case.

### Recommendations

- Validate tool outputs before running them.
- Keep human oversight in place for critical use.
- Run task-specific evaluation before production.

---

## Model Information

| Field | Value |
|--------------|---------------------|
| Model name | Pulsar 16B |
| Based on | [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) |
| Version | v1.5.0 |
| Release date | TBD |
| Developed by | Multiverse Computing |
| License | Apache 2.0 |
| Contact | business@multiversecomputing.com |

---

## Citation

If you use this model, please cite the base model and Pulsar 16B:

```bibtex
@misc{nemotron3nanoTR,
  title  = {NVIDIA Nemotron 3 Nano Technical Report},
  author = {{NVIDIA}},
  year   = {2025},
  url    = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}

@misc{nemotron3nanoslim16b,
  title  = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
  author = {Multiverse Computing},
  year   = {2026},
  url    = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
  note   = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}
```

**Built by [Multiverse Computing](https://www.multiversecomputing.com)** · [Report an issue](TODO_PULSAR_HF_URL/discussions) · [Discord](https://discord.gg/cGas9uStqp)