---
license: mit
base_model: LocoreMind/LocoOperator-4B
tags:
- nvfp4
- quantized
- qwen3
- agent
- tool-calling
- code
- nvidia
- modelopt
- spark
pipeline_tag: text-generation
---

# LocoOperator-4B — NVFP4 Quantized

NVFP4-quantized version of [LocoreMind/LocoOperator-4B](https://huggingface.co/LocoreMind/LocoOperator-4B), an agent/tool-calling model based on Qwen3-4B-Instruct.

## Quantization Details

| Property | Value |
|----------|-------|
| **Base model** | LocoreMind/LocoOperator-4B (Qwen3-4B finetune) |
| **Quantization** | NVFP4 (weights) + FP8 (KV cache) |
| **Group size** | 16 |
| **Tool** | NVIDIA TensorRT Model Optimizer (modelopt 0.35.0) |
| **Original size** | ~8 GB (BF16) |
| **Quantized size** | 2.7 GB |
| **Calibration** | cnn_dailymail (default) |
| **Excluded** | `lm_head` (kept in higher precision) |

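To make the table concrete: NVFP4 stores each weight as a 4-bit E2M1 value and shares one scale across every group of 16 weights. The toy sketch below (pure Python, not the modelopt implementation; the rounding and scale-selection details of the real kernel may differ) quantizes one group of 16 synthetic weights this way:

```python
# Toy sketch of NVFP4-style group quantization — NOT the modelopt kernel.
# E2M1 (4-bit) can represent these magnitudes; one scale is shared per
# group of 16 weights, which is the "group size 16" in the table above.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({s * v for s in (-1.0, 1.0) for v in E2M1_GRID})

def quantize_group(weights):
    """Quantize one group of 16 weights to E2M1 codes with a shared scale."""
    assert len(weights) == 16, "NVFP4 uses a group size of 16"
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return [0.0] * 16, 1.0
    scale = amax / 6.0  # map the largest magnitude onto E2M1's max value (6.0)
    codes = [min(GRID, key=lambda g: abs(w / scale - g)) for w in weights]
    return [c * scale for c in codes], scale

group = [0.1 * i - 0.8 for i in range(16)]   # synthetic weights in [-0.8, 0.7]
dequant, scale = quantize_group(group)
max_err = max(abs(a - b) for a, b in zip(group, dequant))
```

Because the widest gap in the E2M1 grid (between 4 and 6) is two units, the worst-case rounding error for any weight in a group is one unit times the group's scale.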
## Intended Use

Optimized for deployment on NVIDIA Blackwell GPUs (GB10/GB100), particularly the DGX Spark. The NVFP4 format leverages Blackwell's native FP4 tensor cores for maximum throughput.

Best suited for:
- Agent/tool-calling workflows
- Code generation
- Instruction following

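The FP8 KV cache also halves cache memory per token relative to BF16, which matters for long agent trajectories. A back-of-the-envelope sketch (the layer/head counts below are assumptions for a Qwen3-4B-class model, not values read from this checkpoint):

```python
# Rough KV-cache sizing sketch. The config values are ASSUMED for a
# Qwen3-4B-class model and are illustrative only.
NUM_LAYERS = 36     # assumed transformer depth
NUM_KV_HEADS = 8    # assumed GQA key/value heads
HEAD_DIM = 128      # assumed per-head dimension

def kv_bytes_per_token(bytes_per_element):
    # K and V each store num_layers * num_kv_heads * head_dim elements per token.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element

fp16_kv = kv_bytes_per_token(2)  # BF16/FP16 cache
fp8_kv = kv_bytes_per_token(1)   # FP8 cache, as used in this checkpoint
```

Under these assumptions the FP8 cache costs about 72 KiB per token, exactly half the BF16 figure.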
## Usage

### With transformers + modelopt

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "DJLougen/LocoOperator-4B-NVFP4",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DJLougen/LocoOperator-4B-NVFP4")
```

### With TensorRT-LLM

Convert the checkpoint to a TensorRT-LLM engine for optimal inference performance on Spark/Blackwell hardware.

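As one possible path, recent TensorRT-LLM releases ship an OpenAI-compatible `trtllm-serve` entry point that can serve a Hugging Face checkpoint directly. The invocation below is a sketch, not a verified command for this container; check `trtllm-serve --help` in your TensorRT-LLM version for the exact flags.

```shell
# Hypothetical invocation inside the TensorRT-LLM container;
# verify flags with `trtllm-serve --help` for your release.
trtllm-serve DJLougen/LocoOperator-4B-NVFP4 --host 0.0.0.0 --port 8000
```
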
## Quality Check

Example outputs (cnn_dailymail calibration text):

**Before quantization:**
> "I'm excited to be doing the final two films," he said. "I can't wait to see what happens."

**After NVFP4 quantization:**
> "I don't think I'll be particularly extravagant," Radcliffe said. "I don't think I'll be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar."

Both outputs are coherent and contextually appropriate.

## Hardware

- **Quantized on:** NVIDIA DGX Spark (GB10, 128 GB unified memory)
- **Docker image:** `nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev`
- **Target deployment:** any NVIDIA Blackwell GPU with FP4 tensor core support