bu-30b-a3b-preview NVFP4-AWQ (LITE)

A 4-bit NVFP4 + AWQ-lite quantization of browser-use/bu-30b-a3b-preview — the 30B Qwen3-VL-MoE browser-agent model — produced with NVIDIA TensorRT-Model-Optimizer v0.43.

What's notable about this quant

This is (as of upload) the first NVFP4_AWQ quantization of any browser-agent VLM on the Hub, and the first NVFP4 quant of this model with documented calibration provenance. Existing NVFP4 / INT4-AWQ quants of bu-30b-a3b-preview either lack calibration data disclosure or calibrate against generic text corpora; this one was calibrated on-distribution, using 602 real multimodal browser-use trajectories generated by the full-precision model itself.

The calibration-data argument is the load-bearing claim of this quant — it's documented in detail below.

Why NVFP4 for this model

  • Native acceleration on Blackwell. RTX 5090, PRO 6000, B100/B200, GB10 all have native FP4 tensor cores (sm_100+). On Blackwell-class hardware NVFP4 weights execute at ~2× the throughput of FP8.
  • Memory. ~17 GB vs ~58 GB at BF16. Fits comfortably on a single RTX 5090 (32 GB) with headroom for the 32K-token context window.
  • Accuracy-preserving 4-bit format. NVFP4's two-level scales (FP8 E4M3 block scales at block size 16, plus FP32 per-tensor scale) substantially outperform naive INT4 in accuracy, and AWQ's activation-aware per-channel scaling protects the weight channels that matter most.
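For intuition, here is a minimal numpy sketch of the two-level dequantization described in the last bullet. The block size (16), E4M3 block scales, and FP32 per-tensor scale follow the NVFP4 description above; the code is illustrative, not a bit-exact reference decoder.

```python
import numpy as np

# Illustrative NVFP4 decode: 4-bit E2M1 values, one FP8 (E4M3) scale per
# 16-element block, plus a single FP32 scale for the whole tensor.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequantize_row(codes, signs, block_scales, tensor_scale, block=16):
    """codes: FP4 magnitude indices (0..7), signs: +/-1, block_scales: one per 16 values."""
    vals = E2M1_MAGNITUDES[codes] * signs        # decode the 4-bit values
    vals = vals.reshape(-1, block)               # group into blocks of 16
    vals = vals * block_scales[:, None]          # first-level (E4M3) block scale
    return (vals * tensor_scale).reshape(-1)     # second-level FP32 per-tensor scale

# Toy example: 32 elements = 2 blocks of 16
codes = np.random.randint(0, 8, size=32)
signs = np.random.choice([-1.0, 1.0], size=32).astype(np.float32)
block_scales = np.array([0.25, 0.5], dtype=np.float32)   # stored as E4M3 in the real format
print(dequantize_row(codes, signs, block_scales, tensor_scale=1.3))
```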

Quantization Recipe

Base config: NVFP4_AWQ_LITE_CFG from modelopt.torch.quantization.config.

Module-scoped exclusions (kept at BF16 precision):

| Module pattern | Reason |
|---|---|
| `*visual*` | Vision encoder (ViT tower) is small relative to the MoE decoder; disproportionate accuracy loss for minimal memory savings. Standard practice. |
| `*mlp.gate.*` | MoE router — tiny logit perturbations cascade into expert misrouting. Already excluded in `NVFP4_AWQ_LITE_CFG`. |
| `*lm_head*` | Output projection. Already excluded. |
| `*router*`, `*block_sparse_moe.gate*` | Generic router patterns (covers Mixtral-style MoE architectures). Already excluded. |

All 128 MoE experts (model.language_model.layers.*.mlp.experts.*) and attention matrices are quantized to NVFP4 weights + NVFP4 activations (W4A4). The model.visual.* ViT tower (depth 27, hidden 1152) stays in BF16.
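A condensed sketch of how this recipe can be expressed with ModelOpt's Python API. The pattern keys mirror the table above; the calibration dataloader is assumed, and the real harness (referenced under Reproduction) contains more plumbing than shown here.

```python
import copy
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.config import NVFP4_AWQ_LITE_CFG

# Start from the shipped AWQ-lite NVFP4 recipe; router/lm_head patterns are
# already disabled in it, *visual* is the extra module-scoped exclusion.
cfg = copy.deepcopy(NVFP4_AWQ_LITE_CFG)
cfg["quant_cfg"]["*visual*"] = {"enable": False}   # keep the ViT tower in BF16

def forward_loop(model):
    # Replay the calibration trajectories so ModelOpt can gather activation
    # statistics and search the AWQ per-channel scales.
    for batch in calib_dataloader:                 # assumed: multimodal browser-use samples
        model(**batch)

model = mtq.quantize(model, cfg, forward_loop)
```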

Calibration Data

602 samples of real browser-use agent trajectories:

| Category (BU_Bench V1) | Tasks | Samples | Weight (rationale) |
|---|---|---|---|
| GAIA | 8 | ~200 | Research + reasoning — dominant agent workload |
| OM2W2 | 6 | ~150 | Open-ended info gathering |
| BrowseComp | 5 | ~130 | Cross-source comparison |
| WebBenchREAD | 5 | ~80 | Clean DOM activations |
| InteractionTests | 1 | ~15 | Signal floor for form/interaction regime |

Collection process:

  1. Full-precision bu-30b-a3b-preview served via vLLM 0.17 at --dtype bfloat16.
  2. 3 parallel browser-use v0.12.6 agents with enable_planning=True and use_vision=True ran 25 tasks sampled from the official browser-use/benchmark BU_Bench V1 set.
  3. Per-category step caps: 40 for GAIA/OM2W2/BrowseComp, 25 for WebBenchREAD/InteractionTests.
  4. A proxy between the agents and vLLM captured every /v1/chat/completions request payload (including image parts) to JSONL.
  5. Samples with fewer than 1,000 total tokens (3 keepalive/error artifacts) or only blank screenshots (pixel variance < 150; 16 samples) were filtered out (see the filter sketch below).
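A hedged sketch of that filter step; the JSONL field names and message layout are assumptions about the capture format, only the thresholds (1,000 tokens, pixel variance 150) come from the process above.

```python
import base64, io, json
import numpy as np
from PIL import Image

def screenshot_variance(data_url: str) -> float:
    """Pixel variance of a base64 data-URL screenshot; near-zero means a blank page."""
    raw = base64.b64decode(data_url.split(",", 1)[1])
    gray = np.asarray(Image.open(io.BytesIO(raw)).convert("L"), dtype=np.float32)
    return float(gray.var())

kept = []
with open("captured_requests.jsonl") as f:            # proxy capture from step 4 (name assumed)
    for line in f:
        sample = json.loads(line)
        if sample.get("total_tokens", 0) < 1000:      # keepalive / error artifacts
            continue
        shots = [part["image_url"]["url"]
                 for msg in sample["messages"]
                 for part in (msg.get("content") or [])
                 if isinstance(part, dict) and part.get("type") == "image_url"]
        if shots and all(screenshot_variance(s) < 150 for s in shots):   # blank screenshots
            continue
        kept.append(sample)
```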

Sample-level statistics (staged calibration, 602 samples, Qwen3-VL tokenizer + true vision-token expansion):

| Metric | Value |
|---|---|
| Total tokens | min=3, p25=11.2K, median=13.4K, p75=15.8K, p90=18.1K, max=35.4K |
| 8-16K token bucket | 439 samples (73%) |
| 16-32K token bucket | 144 samples (24%) |
| 32K+ token bucket | 6 samples (long-context tail) |
| Samples with a screenshot | 93.6% |
| Non-degenerate screenshots | 97.2% |
| DOM element count (median / max) | 136 / 941 |

The calibration distribution was committed to before running the analyzer on the exploratory data — weights reflect the target user population (researchers and educators running a local agent), not post-hoc curve-fitting to whatever tasks happened to look interesting.
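The "true vision-token expansion" mentioned above matters because each screenshot expands into hundreds of image tokens that a text-only tokenizer would never count. A minimal counting sketch, assuming the base model's Hugging Face AutoProcessor follows the usual Qwen-VL interface (message shape and variable names are illustrative):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("browser-use/bu-30b-a3b-preview")

def total_tokens(messages, images):
    """Token count including the expanded vision tokens, not just the text."""
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=images, return_tensors="pt")
    return int(inputs["input_ids"].shape[-1])
```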

Serving

⚠ vLLM support

As of vLLM 0.19.1 / main, the ModelOpt quantization loader does not accept quant_algo: NVFP4_AWQ — the supported list is only ['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4', 'MXFP8', 'MIXED_PRECISION']. Renaming the algo to plain NVFP4 would load but produce mathematically wrong inference because the 18,480 pre_quant_scale tensors that carry AWQ's per-channel activation rescaling would not be applied.
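To make the "mathematically wrong" point concrete: AWQ folds a per-channel scale into the stored weights and expects the runtime to divide the incoming activations by the same scale (or fuse it into the preceding layer). Dropping pre_quant_scale leaves rescaled weights multiplied by unscaled activations. A toy numpy illustration (not vLLM or ModelOpt code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)).astype(np.float32)         # activations
w = rng.normal(size=(8, 16)).astype(np.float32)        # original BF16 weights
s = rng.uniform(0.5, 2.0, size=8).astype(np.float32)   # AWQ pre_quant_scale (per input channel)

w_stored  = w * s[:, None]       # what the AWQ checkpoint actually quantizes and stores
y_ref     = x @ w                # full-precision reference
y_correct = (x / s) @ w_stored   # runtime applies pre_quant_scale: matches y_ref
y_dropped = x @ w_stored         # pre_quant_scale silently ignored

print(np.abs(y_correct - y_ref).max())   # ~1e-6, identical up to rounding
print(np.abs(y_dropped - y_ref).max())   # large: what renaming the algo to plain NVFP4 would compute
```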

If you want a vLLM-loadable variant, use the sibling repo Code4me2/bu-30b-a3b-preview-NVFP4 (plain NVFP4, no AWQ, slightly lower accuracy but same memory footprint).

TensorRT-LLM (recommended)

This format is produced by and natively supported by NVIDIA TensorRT-Model-Optimizer + TensorRT-LLM. Build an NVFP4 engine:

```
trtllm-build --checkpoint_dir Code4me2/bu-30b-a3b-preview-NVFP4-AWQ \
    --quant_format nvfp4 \
    --max_seq_len 32768
```

See the TRT-LLM NVFP4 guide for more details.

SGLang

SGLang's ModelOpt integration supports NVFP4_AWQ when built against the matching ModelOpt version — consult their docs for the current status.

Intended Use

This model is a drop-in replacement for bu-30b-a3b-preview within the browser-use library. It is trained/tuned specifically for browser-use's indexed-DOM + structured-action format. Using it outside that flow (or with a different harness / freeform CDP scripting) will produce substantially worse results than the quantization accuracy alone would suggest.
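For orientation, a hypothetical configuration sketch: it assumes this checkpoint is served behind a local OpenAI-compatible endpoint and that your browser-use version exposes a ChatOpenAI-style wrapper (import paths and constructor arguments differ across releases, so treat this as a shape, not a recipe).

```python
# Hypothetical sketch; check your browser-use version for exact imports/arguments.
from browser_use import Agent
from browser_use.llm import ChatOpenAI            # assumed OpenAI-compatible wrapper

llm = ChatOpenAI(
    model="Code4me2/bu-30b-a3b-preview-NVFP4-AWQ",
    base_url="http://localhost:8000/v1",          # your local NVFP4 serving endpoint
    api_key="local",
)

agent = Agent(
    task="Find the opening hours of the nearest public library",
    llm=llm,
    use_vision=True,    # the model expects screenshots + the indexed-DOM action format
)
# await agent.run()     # run inside an async context
```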

Evaluation

Evaluation numbers (MMLU, GSM8K, MM-Bench, BU_Bench V1 subset) will be added after running against BF16 baseline. See methodology below.

Planned eval suite:

  • MMLU (general knowledge, 5-shot)
  • GSM8K (math reasoning, 0-shot chain-of-thought)
  • MM-Bench (vision-language, 0-shot)
  • BU_Bench V1 held-out tasks (agent-specific, using the same browser-use harness)

Reproduction

  • Base model: browser-use/bu-30b-a3b-preview
  • Quantization tool: nvidia-modelopt==0.43.0
  • Quantization config: NVFP4_AWQ_LITE_CFG with *visual* excluded (ViT stays BF16); router (*mlp.gate.*) already excluded by the config default
  • Calibration samples: 512 of 602 (shuffled, seed=42); the 6 samples above 32K tokens were skipped to stay within --max-model-len (selection sketched below)
  • Host: single RTX PRO 6000 Blackwell, 98GB
  • Calibration wall time: ~14h (70 min cache activation stats + 12h AWQ scale search + 10 min export)
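A small sketch of that sample-selection step; the filename and per-sample token field are assumptions, while the seed, cutoff, and counts come from the list above.

```python
import json, random

with open("calibration_samples.jsonl") as f:        # filename assumed
    samples = [json.loads(line) for line in f]      # the 602 filtered trajectories

random.seed(42)
random.shuffle(samples)

MAX_MODEL_LEN = 32_768
usable = [s for s in samples if s["total_tokens"] <= MAX_MODEL_LEN]   # drops the 6 long-tail samples
calib_set = usable[:512]
```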

ModelOpt patch for Qwen3-VL-MoE support

ModelOpt 0.43 does not natively know how to export quantized checkpoints for Qwen3VLMoeForConditionalGeneration. Three patches were required (included in the model repo as modelopt_patch.py):

  1. get_expert_linear_names() in layer_utils.py — recognize Qwen3VLMoe* and return [gate_proj, up_proj, down_proj]
  2. get_experts_list() in layer_utils.py — recognize qwen3vlmoe* model_type
  3. _export_transformers_checkpoint() in unified_export_hf.py — wrap the QuantQwen3VLMoeTextExperts container with a transparent iterable proxy so the existing iterable dispatch walks the un-BMM'd per-expert ModuleLists, while __call__ and attribute access still delegate to the real experts module for the internal dummy forward pass
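Patch 3 reduces to a small delegation pattern: an object that looks like an iterable of per-expert modules to ModelOpt's export walk, while calls and attribute lookups still reach the real experts container. A schematic sketch of that idea, not the actual modelopt_patch.py:

```python
import torch.nn as nn

class ExpertsIterProxy:
    """Expose per-expert ModuleLists to iteration-based export code while
    delegating forward calls and attribute access to the real experts module."""

    def __init__(self, experts_module: nn.Module, per_expert_lists):
        self._experts = experts_module
        self._per_expert = per_expert_lists   # un-BMM'd per-expert projection lists

    def __iter__(self):
        # Export path: walk the individual expert projections.
        return iter(self._per_expert)

    def __call__(self, *args, **kwargs):
        # The internal dummy forward pass still runs through the real module.
        return self._experts(*args, **kwargs)

    def __getattr__(self, name):
        # Everything else (weights, buffers, config) comes from the real module.
        return getattr(self.__dict__["_experts"], name)
```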

Reference code + calibration harness: [GitHub link TBD]

Attribution & License

Derived from browser-use/bu-30b-a3b-preview, which is distributed under a Modified MIT License by Browser Use Inc. with a commercial-use restriction: use is not permitted for organizations whose consolidated revenue for the preceding 12 months exceeds USD 1 million. That restriction propagates to this derivative. Commercial users above the revenue threshold must obtain a license from Browser Use Inc. (support@browser-use.com) or use Browser Use's hosted services.

The original LICENSE file is included alongside the weights.

Acknowledgements

  • Browser Use for the base model and the open benchmark suite
  • NVIDIA Model Optimizer for the NVFP4_AWQ calibration tooling
  • Qwen team for the Qwen3-VL-MoE architecture