bu-30b-a3b-preview NVFP4-AWQ (LITE)

A 4-bit NVFP4 + AWQ-lite quantization of browser-use/bu-30b-a3b-preview — the 30B Qwen3-VL-MoE browser-agent model — produced with NVIDIA TensorRT-Model-Optimizer v0.43.

What's notable about this quant

This is (as of upload) the first NVFP4_AWQ quantization of any browser-agent VLM on the Hub, and the first NVFP4 quant of this model with documented calibration provenance. Existing NVFP4 / INT4-AWQ quants of bu-30b-a3b-preview either lack calibration data disclosure or calibrate against generic text corpora; this one was calibrated on-distribution, using 602 real multimodal browser-use trajectories generated by the full-precision model itself.

The calibration-data argument is the load-bearing claim of this quant — it's documented in detail below.

Why NVFP4 for this model

  • Native acceleration on Blackwell. RTX 5090, PRO 6000, B100/B200, GB10 all have native FP4 tensor cores (sm_100+). On Blackwell-class hardware NVFP4 weights execute at ~2× the throughput of FP8.
  • Memory. ~17 GB vs ~58 GB at BF16. Fits comfortably on a single RTX 5090 (32 GB) with headroom for the 32K-token context window.
  • Accuracy-preserving 4-bit format. NVFP4's two-level scales (FP8 E4M3 block scales at block size 16, plus FP32 per-tensor scale) substantially outperform naive INT4 in accuracy, and AWQ's activation-aware per-channel scaling protects the weight channels that matter most.
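For intuition, here is a minimal numpy sketch of the two-level dequantization described in the last bullet. The block size (16), E4M3 block scales, and FP32 per-tensor scale follow the NVFP4 description above; the code is illustrative, not a bit-exact reference decoder.

```python
import numpy as np

# Illustrative NVFP4 decode: 4-bit E2M1 values, one FP8 (E4M3) scale per
# 16-element block, plus a single FP32 scale for the whole tensor.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequantize_row(codes, signs, block_scales, tensor_scale, block=16):
    """codes: FP4 magnitude indices (0..7), signs: +/-1, block_scales: one per 16 values."""
    vals = E2M1_MAGNITUDES[codes] * signs        # decode the 4-bit values
    vals = vals.reshape(-1, block)               # group into blocks of 16
    vals = vals * block_scales[:, None]          # first-level (E4M3) block scale
    return (vals * tensor_scale).reshape(-1)     # second-level FP32 per-tensor scale

# Toy example: 32 elements = 2 blocks of 16
codes = np.random.randint(0, 8, size=32)
signs = np.random.choice([-1.0, 1.0], size=32).astype(np.float32)
block_scales = np.array([0.25, 0.5], dtype=np.float32)   # stored as E4M3 in the real format
print(dequantize_row(codes, signs, block_scales, tensor_scale=1.3))
```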

Quantization Recipe

Base config: NVFP4_AWQ_LITE_CFG from modelopt.torch.quantization.config.

Module-scoped exclusions (kept at BF16 precision):

| Module pattern | Reason |
|---|---|
| `*visual*` | Vision encoder (ViT tower) is small relative to the MoE decoder; disproportionate accuracy loss for minimal memory savings. Standard practice. |
| `*mlp.gate.*` | MoE router — tiny logit perturbations cascade into expert misrouting. Already excluded in `NVFP4_AWQ_LITE_CFG`. |
| `*lm_head*` | Output projection. Already excluded. |
| `*router*`, `*block_sparse_moe.gate*` | Generic router patterns (covers Mixtral-style MoE architectures). Already excluded. |

All 128 MoE experts (model.language_model.layers.*.mlp.experts.*) and attention matrices are quantized to NVFP4 weights + NVFP4 activations (W4A4). The model.visual.* ViT tower (depth 27, hidden 1152) stays in BF16.
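A condensed sketch of how this recipe can be expressed with ModelOpt's Python API. The pattern keys mirror the table above; the calibration dataloader is assumed, and the real harness (referenced under Reproduction) contains more plumbing than shown here.

```python
import copy
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.config import NVFP4_AWQ_LITE_CFG

# Start from the shipped AWQ-lite NVFP4 recipe; router/lm_head patterns are
# already disabled in it, *visual* is the extra module-scoped exclusion.
cfg = copy.deepcopy(NVFP4_AWQ_LITE_CFG)
cfg["quant_cfg"]["*visual*"] = {"enable": False}   # keep the ViT tower in BF16

def forward_loop(model):
    # Replay the calibration trajectories so ModelOpt can gather activation
    # statistics and search the AWQ per-channel scales.
    for batch in calib_dataloader:                 # assumed: multimodal browser-use samples
        model(**batch)

model = mtq.quantize(model, cfg, forward_loop)
```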

Calibration Data

602 samples of real browser-use agent trajectories:

| Category (BU_Bench V1) | Tasks | Samples | Weight (rationale) |
|---|---|---|---|
| GAIA | 8 | ~200 | Research + reasoning — dominant agent workload |
| OM2W2 | 6 | ~150 | Open-ended info gathering |
| BrowseComp | 5 | ~130 | Cross-source comparison |
| WebBenchREAD | 5 | ~80 | Clean DOM activations |
| InteractionTests | 1 | ~15 | Signal floor for form/interaction regime |

Collection process:

  1. Full-precision bu-30b-a3b-preview served via vLLM 0.17 at --dtype bfloat16.
  2. 3 parallel browser-use v0.12.6 agents with enable_planning=True and use_vision=True ran 25 tasks sampled from the official browser-use/benchmark BU_Bench V1 set.
  3. Per-category step caps: 40 for GAIA/OM2W2/BrowseComp, 25 for WebBenchREAD/InteractionTests.
  4. A proxy between the agents and vLLM captured every /v1/chat/completions request payload (including image parts) to JSONL.
  5. Samples with fewer than 1,000 total tokens (3 keepalive/error artifacts) or only blank screenshots (pixel variance < 150; 16 samples) were filtered out (see the filter sketch below).
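A hedged sketch of that filter step; the JSONL field names and message layout are assumptions about the capture format, only the thresholds (1,000 tokens, pixel variance 150) come from the process above.

```python
import base64, io, json
import numpy as np
from PIL import Image

def screenshot_variance(data_url: str) -> float:
    """Pixel variance of a base64 data-URL screenshot; near-zero means a blank page."""
    raw = base64.b64decode(data_url.split(",", 1)[1])
    gray = np.asarray(Image.open(io.BytesIO(raw)).convert("L"), dtype=np.float32)
    return float(gray.var())

kept = []
with open("captured_requests.jsonl") as f:            # proxy capture from step 4 (name assumed)
    for line in f:
        sample = json.loads(line)
        if sample.get("total_tokens", 0) < 1000:      # keepalive / error artifacts
            continue
        shots = [part["image_url"]["url"]
                 for msg in sample["messages"]
                 for part in (msg.get("content") or [])
                 if isinstance(part, dict) and part.get("type") == "image_url"]
        if shots and all(screenshot_variance(s) < 150 for s in shots):   # blank screenshots
            continue
        kept.append(sample)
```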

Sample-level statistics (staged calibration, 602 samples, Qwen3-VL tokenizer + true vision-token expansion):

| Metric | Value |
|---|---|
| Total tokens | min=3, p25=11.2K, median=13.4K, p75=15.8K, p90=18.1K, max=35.4K |
| 8-16K token bucket | 439 samples (73%) |
| 16-32K token bucket | 144 samples (24%) |
| 32K+ token bucket | 6 samples (long-context tail) |
| Samples with a screenshot | 93.6% |
| Non-degenerate screenshots | 97.2% |
| DOM element count (median / max) | 136 / 941 |

The calibration distribution was committed to before running the analyzer on the exploratory data — weights reflect the target user population (researchers and educators running a local agent), not post-hoc curve-fitting to whatever tasks happened to look interesting.
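The "true vision-token expansion" mentioned above matters because each screenshot expands into hundreds of image tokens that a text-only tokenizer would never count. A minimal counting sketch, assuming the base model's Hugging Face AutoProcessor follows the usual Qwen-VL interface (message shape and variable names are illustrative):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("browser-use/bu-30b-a3b-preview")

def total_tokens(messages, images):
    """Token count including the expanded vision tokens, not just the text."""
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=images, return_tensors="pt")
    return int(inputs["input_ids"].shape[-1])
```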

Serving

⚠ vLLM support

As of vLLM 0.19.1 / main, the ModelOpt quantization loader does not accept quant_algo: NVFP4_AWQ — the supported list is only ['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4', 'MXFP8', 'MIXED_PRECISION']. Renaming the algo to plain NVFP4 would load but produce mathematically wrong inference because the 18,480 pre_quant_scale tensors that carry AWQ's per-channel activation rescaling would not be applied.
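To make the "mathematically wrong" point concrete: AWQ folds a per-channel scale into the stored weights and expects the runtime to divide the incoming activations by the same scale (or fuse it into the preceding layer). Dropping pre_quant_scale leaves rescaled weights multiplied by unscaled activations. A toy numpy illustration (not vLLM or ModelOpt code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)).astype(np.float32)         # activations
w = rng.normal(size=(8, 16)).astype(np.float32)        # original BF16 weights
s = rng.uniform(0.5, 2.0, size=8).astype(np.float32)   # AWQ pre_quant_scale (per input channel)

w_stored  = w * s[:, None]       # what the AWQ checkpoint actually quantizes and stores
y_ref     = x @ w                # full-precision reference
y_correct = (x / s) @ w_stored   # runtime applies pre_quant_scale: matches y_ref
y_dropped = x @ w_stored         # pre_quant_scale silently ignored

print(np.abs(y_correct - y_ref).max())   # ~1e-6, identical up to rounding
print(np.abs(y_dropped - y_ref).max())   # large: what renaming the algo to plain NVFP4 would compute
```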

If you want a vLLM-loadable variant, use the sibling repo Code4me2/bu-30b-a3b-preview-NVFP4 (plain NVFP4, no AWQ, slightly lower accuracy but same memory footprint).

TensorRT-LLM (recommended)

This format is produced by and natively supported by NVIDIA TensorRT-Model-Optimizer + TensorRT-LLM. Build an NVFP4 engine:

```
trtllm-build --checkpoint_dir Code4me2/bu-30b-a3b-preview-NVFP4-AWQ \
    --quant_format nvfp4 \
    --max_seq_len 32768
```

See the TRT-LLM NVFP4 guide for more details.

SGLang

SGLang's ModelOpt integration supports NVFP4_AWQ when built against the matching ModelOpt version — consult their docs for the current status.

Intended Use

This model is a drop-in replacement for bu-30b-a3b-preview within the browser-use library. It is trained/tuned specifically for browser-use's indexed-DOM + structured-action format. Using it outside that flow (or with a different harness / freeform CDP scripting) will produce substantially worse results than the quantization accuracy alone would suggest.
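For orientation, a hypothetical configuration sketch: it assumes this checkpoint is served behind a local OpenAI-compatible endpoint and that your browser-use version exposes a ChatOpenAI-style wrapper (import paths and constructor arguments differ across releases, so treat this as a shape, not a recipe).

```python
# Hypothetical sketch; check your browser-use version for exact imports/arguments.
from browser_use import Agent
from browser_use.llm import ChatOpenAI            # assumed OpenAI-compatible wrapper

llm = ChatOpenAI(
    model="Code4me2/bu-30b-a3b-preview-NVFP4-AWQ",
    base_url="http://localhost:8000/v1",          # your local NVFP4 serving endpoint
    api_key="local",
)

agent = Agent(
    task="Find the opening hours of the nearest public library",
    llm=llm,
    use_vision=True,    # the model expects screenshots + the indexed-DOM action format
)
# await agent.run()     # run inside an async context
```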

Evaluation

Evaluation numbers (MMLU, GSM8K, MM-Bench, BU_Bench V1 subset) will be added after running against BF16 baseline. See methodology below.

Planned eval suite:

  • MMLU (general knowledge, 5-shot)
  • GSM8K (math reasoning, 0-shot chain-of-thought)
  • MM-Bench (vision-language, 0-shot)
  • BU_Bench V1 held-out tasks (agent-specific, using the same browser-use harness)

Reproduction

  • Base model: browser-use/bu-30b-a3b-preview
  • Quantization tool: nvidia-modelopt==0.43.0
  • Quantization config: NVFP4_AWQ_LITE_CFG with *visual* excluded (ViT stays BF16); router (*mlp.gate.*) already excluded by the config default
  • Calibration samples: 512 of 602 (shuffled, seed=42); the 6 samples above 32K tokens were skipped to stay within --max-model-len (selection sketched below)
  • Host: single RTX PRO 6000 Blackwell, 98GB
  • Calibration wall time: ~14h (70 min cache activation stats + 12h AWQ scale search + 10 min export)
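A small sketch of that sample-selection step; the filename and per-sample token field are assumptions, while the seed, cutoff, and counts come from the list above.

```python
import json, random

with open("calibration_samples.jsonl") as f:        # filename assumed
    samples = [json.loads(line) for line in f]      # the 602 filtered trajectories

random.seed(42)
random.shuffle(samples)

MAX_MODEL_LEN = 32_768
usable = [s for s in samples if s["total_tokens"] <= MAX_MODEL_LEN]   # drops the 6 long-tail samples
calib_set = usable[:512]
```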

ModelOpt patch for Qwen3-VL-MoE support

ModelOpt 0.43 does not natively know how to export quantized checkpoints for Qwen3VLMoeForConditionalGeneration. Three patches were required (included in the model repo as modelopt_patch.py):

  1. get_expert_linear_names() in layer_utils.py — recognize Qwen3VLMoe* and return [gate_proj, up_proj, down_proj]
  2. get_experts_list() in layer_utils.py — recognize qwen3vlmoe* model_type
  3. _export_transformers_checkpoint() in unified_export_hf.py — wrap the QuantQwen3VLMoeTextExperts container with a transparent iterable proxy so the existing iterable dispatch walks the un-BMM'd per-expert ModuleLists, while __call__ and attribute access still delegate to the real experts module for the internal dummy forward pass
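Patch 3 reduces to a small delegation pattern: an object that looks like an iterable of per-expert modules to ModelOpt's export walk, while calls and attribute lookups still reach the real experts container. A schematic sketch of that idea, not the actual modelopt_patch.py:

```python
import torch.nn as nn

class ExpertsIterProxy:
    """Expose per-expert ModuleLists to iteration-based export code while
    delegating forward calls and attribute access to the real experts module."""

    def __init__(self, experts_module: nn.Module, per_expert_lists):
        self._experts = experts_module
        self._per_expert = per_expert_lists   # un-BMM'd per-expert projection lists

    def __iter__(self):
        # Export path: walk the individual expert projections.
        return iter(self._per_expert)

    def __call__(self, *args, **kwargs):
        # The internal dummy forward pass still runs through the real module.
        return self._experts(*args, **kwargs)

    def __getattr__(self, name):
        # Everything else (weights, buffers, config) comes from the real module.
        return getattr(self.__dict__["_experts"], name)
```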

Reference code + calibration harness: [GitHub link TBD]

Attribution & License

Derived from browser-use/bu-30b-a3b-preview, which is distributed under a Modified MIT License by Browser Use Inc. with a commercial-use restriction: use is not permitted for organizations whose consolidated revenue for the preceding 12 months exceeds USD 1 million. That restriction propagates to this derivative. Commercial users above the revenue threshold must obtain a license from Browser Use Inc. (support@browser-use.com) or use Browser Use's hosted services.

The original LICENSE file is included alongside the weights.

Acknowledgements

  • Browser Use for the base model and the open benchmark suite
  • NVIDIA Model Optimizer for the NVFP4_AWQ calibration tooling
  • Qwen team for the Qwen3-VL-MoE architecture