---
base_model: browser-use/bu-30b-a3b-preview
base_model_relation: quantized
license: other
license_name: modified-mit-browser-use
license_link: https://huggingface.co/browser-use/bu-30b-a3b-preview/blob/main/LICENSE
tags:
- nvfp4
- awq
- modelopt
- browser-use
- agent
- vision-language
- moe
- quantized
pipeline_tag: image-text-to-text
library_name: transformers
---

# bu-30b-a3b-preview NVFP4-AWQ (LITE)

A 4-bit NVFP4 + AWQ-lite quantization of [browser-use/bu-30b-a3b-preview](https://huggingface.co/browser-use/bu-30b-a3b-preview) (the 30B Qwen3-VL-MoE browser-agent model), produced with [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) v0.43.

**What's notable about this quant**

This is (as of upload) the first **NVFP4_AWQ** quantization of any browser-agent VLM on the Hub, and the first NVFP4 quant of this model with documented calibration provenance. Existing NVFP4 / INT4-AWQ quants of `bu-30b-a3b-preview` either lack calibration data disclosure or calibrate against generic text corpora; this one was calibrated **on-distribution**, using 602 real multimodal browser-use trajectories generated by the full-precision model itself.

The calibration-data argument is the load-bearing claim of this quant; it is documented in detail below.

## Why NVFP4 for this model

- **Native acceleration on Blackwell.** RTX 5090, PRO 6000, B100/B200, and GB10 all have native FP4 tensor cores (sm_100+). On Blackwell-class hardware, NVFP4 weights execute at ~2× the throughput of FP8.
- **Memory.** ~17 GB vs ~58 GB at BF16. Fits comfortably on a single RTX 5090 (32 GB) with headroom for the 32K-token context window.
- **Accuracy-preserving 4-bit format.** NVFP4's two-level scales (FP8 E4M3 block scales at block size 16, plus an FP32 per-tensor scale) substantially outperform naive INT4 in accuracy, and AWQ's activation-aware per-channel scaling protects the weight channels that matter most.
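The two-level scaling mentioned in the last bullet can be illustrated with a small pure-Python sketch. It is not ModelOpt's implementation: the function names are hypothetical, the block scale is kept as a Python float rather than being rounded to FP8 E4M3, and the AWQ per-channel rescaling is left out entirely. It only shows how a per-tensor scale and per-block scales combine to map each block of 16 weights onto the FP4 (E2M1) value grid.

```python
"""Illustrative NVFP4-style fake quantization (sketch, not ModelOpt code)."""

# Magnitudes representable in FP4 E2M1; the sign is a separate bit.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
BLOCK_SIZE = 16    # weights sharing one block scale


def bits_per_weight(block_size=BLOCK_SIZE):
    """4-bit value plus one 8-bit block scale amortized over the block."""
    return 4 + 8 / block_size


def round_to_e2m1(x):
    """Round x to the nearest signed E2M1 grid point (saturates at 6)."""
    magnitude = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return -magnitude if x < 0 else magnitude


def fake_quantize_nvfp4(weights):
    """Quantize then dequantize a flat list of floats, block by block."""
    tensor_amax = max((abs(w) for w in weights), default=0.0)
    if tensor_amax == 0.0:
        return [0.0] * len(weights)

    # Level 1: per-tensor scale, chosen so every block scale fits in E4M3 range.
    tensor_scale = tensor_amax / (E2M1_MAX * E4M3_MAX)

    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        block_amax = max(abs(w) for w in block)

        # Level 2: per-block scale in units of tensor_scale (at most 448).
        # Simplification: a real kernel would round this to FP8 E4M3.
        block_scale = block_amax / (E2M1_MAX * tensor_scale)
        step = block_scale * tensor_scale
        if step == 0.0:
            out.extend(0.0 for _ in block)
            continue

        out.extend(round_to_e2m1(w / step) * step for w in block)
    return out
```

With a 4-bit value per weight and one 8-bit scale per 16 weights, `bits_per_weight()` gives 4.5 bits per quantized weight, before counting per-tensor scales and the modules that stay in BF16.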
## Quantization Recipe

**Base config**: `NVFP4_AWQ_LITE_CFG` from `modelopt.torch.quantization.config`.

**Module-scoped exclusions (kept at BF16 precision)**:

| Module pattern | Reason |
|---|---|
| `*visual*` | Vision encoder (ViT tower) is small relative to the MoE decoder; quantizing it costs disproportionate accuracy for minimal memory savings. Standard practice. |
| `*mlp.gate.*` | MoE router: tiny logit perturbations cascade into expert misrouting. Already excluded in `NVFP4_AWQ_LITE_CFG`. |
| `*lm_head*` | Output projection. Already excluded. |
| `*router*`, `*block_sparse_moe.gate*` | Generic router patterns (covers Mixtral-style MoE architectures). Already excluded. |

All 128 MoE experts (`model.language_model.layers.*.mlp.experts.*`) and attention matrices are quantized to NVFP4 weights + NVFP4 activations (W4A4). The `model.visual.*` ViT tower (depth 27, hidden 1152) stays in BF16.

## Calibration Data

**602 samples** of real browser-use agent trajectories:

| Category (BU_Bench V1) | Tasks | Samples | Rationale for weight |
|---|---|---|---|
| GAIA | 8 | ~200 | Research + reasoning; dominant agent workload |
| OM2W2 | 6 | ~150 | Open-ended info gathering |
| BrowseComp | 5 | ~130 | Cross-source comparison |
| WebBenchREAD | 5 | ~80 | Clean DOM activations |
| InteractionTests | 1 | ~15 | Signal floor for form/interaction regime |

**Collection process:**

1. Full-precision bu-30b-a3b-preview was served via vLLM 0.17 at `--dtype bfloat16`.
2. 3 parallel `browser-use` v0.12.6 agents with `enable_planning=True` and `use_vision=True` ran 25 tasks sampled from the official [browser-use/benchmark](https://github.com/browser-use/benchmark) BU_Bench V1 set.
3. Per-category step caps: 40 for GAIA/OM2W2/BrowseComp, 25 for WebBenchREAD/InteractionTests.
4. A proxy between the agents and vLLM captured every `/v1/chat/completions` request payload (including image parts) to JSONL.
5. Samples with total tokens < 1000 (keepalive/error artifacts; 3 samples) or blank screenshots (variance < 150; 16 samples) were filtered out.

**Sample-level statistics** (staged calibration, 602 samples, Qwen3-VL tokenizer + true vision-token expansion):

| Metric | Value |
|---|---|
| Total tokens | min=3, p25=11.2K, median=13.4K, p75=15.8K, p90=18.1K, max=35.4K |
| 8-16K bucket | 439 samples (73%) |
| 16-32K bucket | 144 samples (24%) |
| 32K+ samples | 6 (long-context tail) |
| Samples with screenshot | 93.6% |
| Non-degenerate screenshots | 97.2% |
| DOM element count (median / max) | 136 / 941 |

The calibration distribution was committed to **before** running the analyzer on the exploratory data: the weights reflect the target user population (researchers and educators running a local agent), not post-hoc curve-fitting to whatever tasks happened to look interesting.

## Serving

### ⚠ vLLM support

As of **vLLM 0.19.1 / main**, the `ModelOpt` quantization loader does **not** accept `quant_algo: NVFP4_AWQ`; the supported list is only `['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4', 'MXFP8', 'MIXED_PRECISION']`. Renaming the algo to plain `NVFP4` would load but produce mathematically wrong inference, because the 18,480 `pre_quant_scale` tensors that carry AWQ's per-channel activation rescaling would not be applied.

If you want a vLLM-loadable variant, use the sibling repo **[`Code4me2/bu-30b-a3b-preview-NVFP4`](https://huggingface.co/Code4me2/bu-30b-a3b-preview-NVFP4)** (plain NVFP4, no AWQ, slightly lower accuracy but the same memory footprint).

### TensorRT-LLM (recommended)

This format is produced by and natively supported by [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) + TensorRT-LLM.
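Before choosing a runtime, it can help to check which algorithm a checkpoint declares, since the vLLM limitation described above depends on that one field. The sketch below makes two assumptions that are not stated in this card: that the exported checkpoint carries a `hf_quant_config.json` file, and that the algorithm sits under `quantization.quant_algo`. The helper names are hypothetical; the accepted-algorithm set is copied from the vLLM section.

```python
import json
from pathlib import Path

# Algorithms this card lists as accepted by vLLM's ModelOpt loader
# (vLLM 0.19.1 / main). Other vLLM versions may differ.
VLLM_MODELOPT_ALGOS = frozenset({
    "FP8",
    "FP8_PER_CHANNEL_PER_TOKEN",
    "FP8_PB_WO",
    "NVFP4",
    "MXFP8",
    "MIXED_PRECISION",
})


def load_quant_config(checkpoint_dir, filename="hf_quant_config.json"):
    """Read the quantization config; the filename is an assumption."""
    return json.loads((Path(checkpoint_dir) / filename).read_text())


def declared_quant_algo(quant_config):
    """Return the declared algorithm, or None if the key is absent.

    Assumed layout: {"quantization": {"quant_algo": "<NAME>"}}.
    """
    return quant_config.get("quantization", {}).get("quant_algo")


def vllm_can_load(quant_config):
    """True only if the declared algorithm is in the accepted set."""
    return declared_quant_algo(quant_config) in VLLM_MODELOPT_ALGOS
```

For example, `vllm_can_load(load_quant_config("/path/to/checkpoint"))` would be expected to return `False` for a checkpoint declaring `NVFP4_AWQ`. The check only reads the declared name; it does not make renaming the algorithm safe, for the `pre_quant_scale` reason given earlier.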
Build an NVFP4 engine:

```bash
trtllm-build --checkpoint_dir Code4me2/bu-30b-a3b-preview-NVFP4-AWQ \
  --quant_format nvfp4 \
  --max_seq_len 32768
```

See the [TRT-LLM NVFP4 guide](https://nvidia.github.io/TensorRT-LLM/reference/precision.html) for more details.

### SGLang

SGLang's ModelOpt integration supports NVFP4_AWQ when built against the matching ModelOpt version; consult their docs for the current status.

## Intended Use

This model is a drop-in replacement for `bu-30b-a3b-preview` within the [browser-use](https://github.com/browser-use/browser-use) library. It is trained/tuned specifically for browser-use's indexed-DOM + structured-action format. Using it outside that flow (or with a different harness / freeform CDP scripting) will produce substantially worse results than the quantization accuracy alone would suggest.

## Evaluation

_Evaluation numbers (MMLU, GSM8K, MM-Bench, BU_Bench V1 subset) will be added after running against the BF16 baseline. See methodology below._

Planned eval suite:

- MMLU (general knowledge, 5-shot)
- GSM8K (math reasoning, 0-shot chain-of-thought)
- MM-Bench (vision-language, 0-shot)
- BU_Bench V1 held-out tasks (agent-specific, using the same browser-use harness)

## Reproduction

- Base model: `browser-use/bu-30b-a3b-preview`
- Quantization tool: `nvidia-modelopt==0.43.0`
- Quantization config: `NVFP4_AWQ_LITE_CFG` with `*visual*` excluded (ViT stays BF16); router (`*mlp.gate.*`) already excluded by the config default
- Calibration samples: 512 / 602 (shuffled, seed=42). 6 samples above 32K tokens skipped (aligned with `--max-model-len`)
- Host: single RTX PRO 6000 Blackwell, 98GB
- Calibration wall time: ~14h (70 min cache activation stats + 12h AWQ scale search + 10 min export)

### ModelOpt patch for Qwen3-VL-MoE support

ModelOpt 0.43 does not natively know how to export quantized checkpoints for `Qwen3VLMoeForConditionalGeneration`. Three patches were required (included in the model repo as `modelopt_patch.py`):

1. `get_expert_linear_names()` in `layer_utils.py`: recognize `Qwen3VLMoe*` and return `[gate_proj, up_proj, down_proj]`
2. `get_experts_list()` in `layer_utils.py`: recognize the `qwen3vlmoe*` model_type
3. `_export_transformers_checkpoint()` in `unified_export_hf.py`: wrap the `QuantQwen3VLMoeTextExperts` container with a transparent iterable proxy so the existing iterable dispatch walks the un-BMM'd per-expert `ModuleList`s, while `__call__` and attribute access still delegate to the real experts module for the internal dummy forward pass

Reference code + calibration harness: [GitHub link TBD]

## Attribution & License

Derived from [`browser-use/bu-30b-a3b-preview`](https://huggingface.co/browser-use/bu-30b-a3b-preview), which is distributed under a **Modified MIT License** by Browser Use Inc. with a commercial-use restriction: **use is not permitted for organizations whose annual consolidated revenue exceeds USD 1 million for the preceding month**. That restriction propagates to this derivative. Commercial users above the revenue threshold must obtain a license from Browser Use Inc. (`support@browser-use.com`) or use Browser Use's hosted services.

The original LICENSE file is included alongside the weights.

## Acknowledgements

- **Browser Use** for the base model and the open benchmark suite
- **NVIDIA Model Optimizer** for the NVFP4_AWQ calibration tooling
- **Qwen team** for the Qwen3-VL-MoE architecture