# LocoOperator-4B — NVFP4 Quantized
NVFP4-quantized version of LocoreMind/LocoOperator-4B, an agent/tool-calling model based on Qwen3-4B-Instruct.
## Quantization Details
| Property | Value |
|---|---|
| Base model | LocoreMind/LocoOperator-4B (Qwen3-4B finetune) |
| Quantization | NVFP4 (weights) + FP8 (KV cache) |
| Group size | 16 |
| Tool | NVIDIA TensorRT Model Optimizer (modelopt 0.35.0) |
| Calibration | cnn_dailymail (default) |
| Original size | ~8 GB (BF16) |
| Quantized size | 2.7 GB |
| Excluded | lm_head (kept in higher precision) |
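The scheme in the table can be made concrete with a toy version of group-wise FP4 quantization. This is an illustrative sketch in plain Python (group scales kept in full precision rather than FP8, as real NVFP4 would store them), not modelopt's implementation:

```python
# Magnitudes representable in the FP4 E2M1 format (sign stored separately).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize(weights, group_size=16):
    """Quantize then dequantize a flat weight list, one scale per group.

    Real NVFP4 stores each group's scale in FP8 (E4M3); it is kept as a
    Python float here for clarity.
    """
    assert len(weights) % group_size == 0
    out = []
    for g in range(0, len(weights), group_size):
        block = weights[g:g + group_size]
        # Map the group's largest magnitude onto the top FP4 level (6.0).
        scale = max(abs(w) for w in block) / E2M1_LEVELS[-1]
        if scale == 0.0:
            out.extend([0.0] * group_size)
            continue
        for w in block:
            # Round the scaled magnitude to the nearest representable level.
            mag = min(E2M1_LEVELS, key=lambda lv: abs(abs(w) / scale - lv))
            out.append((1.0 if w >= 0 else -1.0) * mag * scale)
    return out
```

This also explains the size reduction above: 4 bits per weight plus one 8-bit scale per 16 weights is about 4.5 bits per parameter, roughly 2.25 GB for 4B parameters before the higher-precision `lm_head`, embeddings, and metadata — consistent with the 2.7 GB checkpoint.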
## Intended Use
Optimized for deployment on NVIDIA Blackwell GPUs (GB10/GB100), particularly the DGX Spark. The NVFP4 format leverages Blackwell's native FP4 tensor cores for maximum throughput.
Best suited for:
- Agent/tool-calling workflows
- Code generation
- Instruction following
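For the agent/tool-calling use case, Qwen3-style models typically emit tool invocations as JSON wrapped in `<tool_call>` tags. The parser below is a minimal sketch assuming that convention (the `get_weather` tool and sample reply are hypothetical); verify the exact format against this checkpoint's chat template:

```python
import json
import re

# Matches a Qwen-style <tool_call>{...}</tool_call> block in model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        call = json.loads(match.group(1))
        calls.append({"name": call["name"], "arguments": call["arguments"]})
    return calls

# Hypothetical model reply containing one tool call.
reply = (
    'Let me check the weather.\n'
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"city": "Berlin"}}\n'
    '</tool_call>'
)
calls = extract_tool_calls(reply)
```

In an agent loop, each extracted call would be dispatched to the matching tool and its result appended to the conversation before the next generation step.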
## Usage

### With transformers + modelopt
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "DJLougen/LocoOperator-4B-NVFP4",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DJLougen/LocoOperator-4B-NVFP4")
```
### With TensorRT-LLM
Convert the checkpoint to a TensorRT-LLM engine for optimal inference performance on Spark/Blackwell hardware.
## Quality Check
Example outputs (cnn_dailymail calibration text):
Before quantization:

> "I'm excited to be doing the final two films," he said. "I can't wait to see what happens."

After NVFP4 quantization:

> "I don't think I'll be particularly extravagant," Radcliffe said. "I don't think I'll be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar."
Both outputs are coherent and contextually appropriate.
## Hardware
- Quantized on: NVIDIA DGX Spark (GB10, 128 GB unified memory)
- Docker image: `nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev`
- Target deployment: any NVIDIA Blackwell GPU with FP4 tensor core support