# NVFP4 quantization of nvidia/Nemotron-Terminal-32B

Quantized on an NVIDIA DGX Spark (GB10 Blackwell) using NVIDIA Model Optimizer.
Nemotron-Terminal-32B is NVIDIA's official Qwen3-32B-based model tuned for autonomous terminal tasks, bash scripting, and agentic workflows.
This repo contains a Blackwell-native NVFP4 quantized checkpoint (~20GB), reducing the original ~64GB BF16 footprint by roughly 70% while aiming to preserve output quality.
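The ~70% reduction follows directly from the bit widths. A back-of-envelope sketch, assuming the 32B parameter count and the NVFP4 block layout (4-bit values with one FP8 scale per 16-element block); the few GB gap between the raw NVFP4 figure and the ~20GB checkpoint is the attention layers kept in BF16:

```python
# Rough size check for the figures quoted above (decimal GB).
params = 32e9

bf16_gb = params * 2 / 1e9               # 2 bytes per BF16 weight
nvfp4_bits = 4 + 8 / 16                  # 4-bit value + FP8 scale amortized over a 16-block
nvfp4_gb = params * nvfp4_bits / 8 / 1e9

print(f"BF16:  ~{bf16_gb:.0f} GB")       # ~64 GB
print(f"NVFP4: ~{nvfp4_gb:.0f} GB")      # ~18 GB, before BF16-kept attention layers
```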
| Property | Value |
|---|---|
| Base model | nvidia/Nemotron-Terminal-32B |
| Source precision | BF16 |
| Quantization | NVFP4 (attention: BF16, experts: NVFP4) |
| Tool | NVIDIA Model Optimizer (nvidia-modelopt 0.41.0) |
| Calibration | 512 synthetic (random-token) samples |
| Hardware | NVIDIA GB10 Blackwell (DGX Spark) |
| Output size | ~20GB (vs ~64GB BF16) |
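For reference, a minimal sketch of how a checkpoint like this can be produced with Model Optimizer's post-training quantization API. This is not the exact script used for this repo: the calibration loop below feeds random token IDs (matching the table), and the export path is illustrative.

```python
# Hypothetical PTQ sketch with nvidia-modelopt; requires a Blackwell GPU
# with enough memory to load the BF16 model. Paths/loop are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "nvidia/Nemotron-Terminal-32B"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

@torch.no_grad()
def forward_loop(model):
    # 512 synthetic calibration samples, as listed in the table above
    for _ in range(512):
        ids = torch.randint(0, tokenizer.vocab_size, (1, 512), device="cuda")
        model(ids)

# NVFP4_DEFAULT_CFG quantizes linear weights to NVFP4 with block scales
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Write out a Hugging Face-format quantized checkpoint
export_hf_checkpoint(model, export_dir="Nemotron-Terminal-32B-NVFP4")
```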
Serve with vLLM:

```shell
vllm serve kleinpanic93/Nemotron-Terminal-32B-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.72 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --enable-prefix-caching
```
Note: NVFP4 kernels require NVIDIA Blackwell (GB10/GB200) or newer hardware. They do not run on Hopper (H100) or older GPUs.
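Once the vLLM server above is running (default port 8000), it can be queried through the OpenAI-compatible API; the prompt here is just an example of the terminal-task workloads the base model targets:

```shell
# Illustrative request against a locally running vLLM server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kleinpanic93/Nemotron-Terminal-32B-NVFP4",
    "messages": [
      {"role": "user",
       "content": "Write a bash one-liner that lists the 10 largest files under /var/log."}
    ]
  }'
```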
Quantization metadata:

```json
{
  "source_model": "nvidia/Nemotron-Terminal-32B",
  "quantization": "NVFP4",
  "tool": "nvidia-modelopt",
  "hardware": "NVIDIA GB10 (Blackwell)",
  "attention_dtype": "bfloat16"
}
```