---
base_model: nvidia/Nemotron-Terminal-32B
library_name: transformers
license: other
license_name: nvidia-open-model-license
tags:
- nvfp4
- quantized
- vllm
- blackwell
- nvidia
- text-generation
- terminal
- agentic
- coding
language:
- en
---

# Nemotron-Terminal-32B NVFP4
NVFP4 quantization of `nvidia/Nemotron-Terminal-32B`, produced by self-quantization on an NVIDIA DGX Spark (GB10 Blackwell) using NVIDIA Model Optimizer.
## Model Description
Nemotron-Terminal-32B is NVIDIA's official Qwen3-32B-based model tuned for autonomous terminal tasks, bash scripting, and agentic workflows.
This repo contains a Blackwell-native NVFP4 quantized checkpoint (~20GB), reducing the original ~64GB BF16 footprint by ~70% while preserving quality.
## Quantization Details
| Property | Value |
|---|---|
| Base model | nvidia/Nemotron-Terminal-32B |
| Source precision | BF16 |
| Quantization | NVFP4 weights (attention kept in BF16) |
| Tool | NVIDIA Model Optimizer (nvidia-modelopt 0.41.0) |
| Calibration | 512 samples, synthetic random |
| Hardware | NVIDIA GB10 Blackwell (DGX Spark) |
| Output size | ~20GB (vs ~64GB BF16) |
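
As a rough sanity check on the ~20GB figure: NVFP4 packs 4-bit values with one FP8 scale per 16-element block, i.e. about 4.5 bits per weight. The back-of-envelope estimate below is a sketch only (parameter count approximate; per-tensor scales and BF16-kept tensors ignored):

```python
# Rough size estimate for an NVFP4 checkpoint (approximation, not exact).
TOTAL_PARAMS = 32e9           # ~32B parameters (approximate)

# NVFP4: 4-bit values + one FP8 (8-bit) scale per 16-element block.
bits_per_weight = 4 + 8 / 16  # = 4.5 bits per weight

weights_gb = TOTAL_PARAMS * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB for quantized weights alone")  # ~18 GB
```

The BF16-kept tensors and embeddings account for the remaining couple of GB, consistent with the observed ~20GB checkpoint.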
## Usage with vLLM
```bash
vllm serve kleinpanic93/Nemotron-Terminal-32B-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.72 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --enable-prefix-caching
```
> **Note:** NVFP4 kernels require NVIDIA Blackwell (GB10/GB200) or newer hardware; the model will not run on Hopper (H100) or older GPUs.
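
Once serving, vLLM exposes an OpenAI-compatible API. A minimal client sketch, assuming the default endpoint (`http://localhost:8000`; adjust if you passed `--port` or `--served-model-name`):

```python
import json

# Chat-completion request for the vLLM OpenAI-compatible server.
# Endpoint assumes vLLM's default port; model name matches the served repo.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "kleinpanic93/Nemotron-Terminal-32B-NVFP4",
    "messages": [
        {"role": "user",
         "content": "Write a bash one-liner that counts unique IPs in access.log."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

body = json.dumps(payload)
# POST `body` to ENDPOINT with Content-Type: application/json,
# e.g. via urllib.request or an OpenAI client pointed at the server's base_url.
print(body)
```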
## Provenance
```json
{
  "source_model": "nvidia/Nemotron-Terminal-32B",
  "quantization": "NVFP4",
  "tool": "nvidia-modelopt",
  "hardware": "NVIDIA GB10 (Blackwell)",
  "attention_dtype": "bfloat16"
}
```