---
license: mit
base_model: LocoreMind/LocoOperator-4B
tags:
- nvfp4
- quantized
- qwen3
- agent
- tool-calling
- code
- nvidia
- modelopt
- spark
pipeline_tag: text-generation
---

# LocoOperator-4B — NVFP4 Quantized

NVFP4-quantized version of [LocoreMind/LocoOperator-4B](https://huggingface.co/LocoreMind/LocoOperator-4B), an agent/tool-calling model based on Qwen3-4B-Instruct.

## Quantization Details

| Property | Value |
|----------|-------|
| **Base model** | LocoreMind/LocoOperator-4B (Qwen3-4B finetune) |
| **Quantization** | NVFP4 (weights) + FP8 (KV cache) |
| **Group size** | 16 |
| **Tool** | NVIDIA TensorRT Model Optimizer (modelopt 0.35.0) |
| **Original size** | ~8 GB (BF16) |
| **Quantized size** | 2.7 GB |
| **Calibration** | cnn_dailymail (default) |
| **Excluded** | `lm_head` (kept in higher precision) |

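To make the table concrete: NVFP4 stores each weight as a 4-bit E2M1 value and shares one scale across every group of 16 weights. The toy sketch below (pure Python, not the modelopt implementation; the rounding and scale-selection details of the real kernel may differ) quantizes one group of 16 synthetic weights this way:

```python
# Toy sketch of NVFP4-style group quantization — NOT the modelopt kernel.
# E2M1 (4-bit) can represent these magnitudes; one scale is shared per
# group of 16 weights, which is the "group size 16" in the table above.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({s * v for s in (-1.0, 1.0) for v in E2M1_GRID})

def quantize_group(weights):
    """Quantize one group of 16 weights to E2M1 codes with a shared scale."""
    assert len(weights) == 16, "NVFP4 uses a group size of 16"
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return [0.0] * 16, 1.0
    scale = amax / 6.0  # map the largest magnitude onto E2M1's max value (6.0)
    codes = [min(GRID, key=lambda g: abs(w / scale - g)) for w in weights]
    return [c * scale for c in codes], scale

group = [0.1 * i - 0.8 for i in range(16)]   # synthetic weights in [-0.8, 0.7]
dequant, scale = quantize_group(group)
max_err = max(abs(a - b) for a, b in zip(group, dequant))
```

Because the widest gap in the E2M1 grid (between 4 and 6) is two units, the worst-case rounding error for any weight in a group is one unit times the group's scale.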
## Intended Use

Optimized for deployment on NVIDIA Blackwell GPUs (GB10/GB100), particularly the DGX Spark. The NVFP4 format leverages Blackwell's native FP4 tensor cores for maximum throughput.

Best suited for:
- Agent/tool-calling workflows
- Code generation
- Instruction following

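The FP8 KV cache also halves cache memory per token relative to BF16, which matters for long agent trajectories. A back-of-the-envelope sketch (the layer/head counts below are assumptions for a Qwen3-4B-class model, not values read from this checkpoint):

```python
# Rough KV-cache sizing sketch. The config values are ASSUMED for a
# Qwen3-4B-class model and are illustrative only.
NUM_LAYERS = 36     # assumed transformer depth
NUM_KV_HEADS = 8    # assumed GQA key/value heads
HEAD_DIM = 128      # assumed per-head dimension

def kv_bytes_per_token(bytes_per_element):
    # K and V each store num_layers * num_kv_heads * head_dim elements per token.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element

fp16_kv = kv_bytes_per_token(2)  # BF16/FP16 cache
fp8_kv = kv_bytes_per_token(1)   # FP8 cache, as used in this checkpoint
```

Under these assumptions the FP8 cache costs about 72 KiB per token, exactly half the BF16 figure.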
## Usage

### With transformers + modelopt

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "DJLougen/LocoOperator-4B-NVFP4",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DJLougen/LocoOperator-4B-NVFP4")
```

### With TensorRT-LLM

Convert the checkpoint to a TensorRT-LLM engine for optimal inference performance on Spark/Blackwell hardware.

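As one possible path, recent TensorRT-LLM releases ship an OpenAI-compatible `trtllm-serve` entry point that can serve a Hugging Face checkpoint directly. The invocation below is a sketch, not a verified command for this container; check `trtllm-serve --help` in your TensorRT-LLM version for the exact flags.

```shell
# Hypothetical invocation inside the TensorRT-LLM container;
# verify flags with `trtllm-serve --help` for your release.
trtllm-serve DJLougen/LocoOperator-4B-NVFP4 --host 0.0.0.0 --port 8000
```
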
## Quality Check

Example outputs (cnn_dailymail calibration text):

**Before quantization:**
> "I'm excited to be doing the final two films," he said. "I can't wait to see what happens."

**After NVFP4 quantization:**
> "I don't think I'll be particularly extravagant," Radcliffe said. "I don't think I'll be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar."

Both outputs are coherent and contextually appropriate.

## Hardware

- **Quantized on:** NVIDIA DGX Spark (GB10, 128 GB unified memory)
- **Docker image:** `nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev`
- **Target deployment:** any NVIDIA Blackwell GPU with FP4 tensor core support