---
base_model: browser-use/bu-30b-a3b-preview
base_model_relation: quantized
license: other
license_name: modified-mit-browser-use
license_link: https://huggingface.co/browser-use/bu-30b-a3b-preview/blob/main/LICENSE
tags:
  - nvfp4
  - awq
  - modelopt
  - browser-use
  - agent
  - vision-language
  - moe
  - quantized
pipeline_tag: image-text-to-text
library_name: transformers
---

# bu-30b-a3b-preview NVFP4-AWQ (LITE)

A 4-bit NVFP4 + AWQ-lite quantization of
[browser-use/bu-30b-a3b-preview](https://huggingface.co/browser-use/bu-30b-a3b-preview), the 30B Qwen3-VL-MoE browser-agent model, produced with
[NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
v0.43.

**What's notable about this quant**

This is (as of upload) the first **NVFP4_AWQ** quantization of any browser-agent VLM on the Hub, and the first NVFP4 quant of this model with documented calibration provenance. Existing NVFP4 / INT4-AWQ quants of `bu-30b-a3b-preview` either lack calibration data disclosure or calibrate against generic text corpora; this one was calibrated **on-distribution**, using 602 real multimodal browser-use trajectories generated by the full-precision model itself.

The calibration-data argument is the load-bearing claim of this quant; it is documented in detail below.

## Why NVFP4 for this model

- **Native acceleration on Blackwell.** RTX 5090, PRO 6000, B100/B200, GB10 all have native FP4 tensor cores (sm_100+). On Blackwell-class hardware NVFP4 weights execute at ~2× the throughput of FP8.
- **Memory.** ~17 GB vs ~58 GB at BF16. Fits comfortably on a single RTX 5090 (32 GB) with headroom for the 32K-token context window.
- **Accuracy-preserving 4-bit format.** NVFP4's two-level scales (FP8 E4M3 block scales at block size 16, plus FP32 per-tensor scale) substantially outperform naive INT4 in accuracy, and AWQ's activation-aware per-channel scaling protects the weight channels that matter most.
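
The block-scaling idea in the last bullet can be sketched in a few lines of plain Python. This is a simplified illustration, not ModelOpt's implementation: the block scale is kept as an ordinary float (real NVFP4 stores it as FP8 E4M3 relative to an FP32 per-tensor scale), and the function names are hypothetical.

```python
# E2M1 (FP4) representable magnitudes; the sign bit is stored separately.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 block size, as stated above


def quantize_block(block):
    """Fake-quantize one block and return (scale, dequantized values).

    Simplification: the block scale stays a Python float instead of being
    rounded to FP8 E4M3 and combined with a per-tensor FP32 scale.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    dequantized = []
    for v in block:
        target = abs(v) / scale
        q = min(E2M1_GRID, key=lambda g: abs(g - target))  # nearest grid point
        dequantized.append((q if v >= 0 else -q) * scale)
    return scale, dequantized


def effective_bits_per_weight(block_size=BLOCK_SIZE):
    """4-bit payload plus one 8-bit block scale amortized over the block."""
    return 4 + 8 / block_size


# Hypothetical block: three values sit on the grid, 2.4 rounds down to 2.0.
block = [6.0, -3.0, 0.5, 2.4] + [0.0] * 12
scale, restored = quantize_block(block)
```

At 4.5 effective bits per weight, a ~58 GB BF16 checkpoint would shrink to roughly 58 × 4.5 / 16 ≈ 16.3 GB if every tensor were quantized. That is in line with the ~17 GB figure once the modules kept in BF16 are added back.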

## Quantization Recipe

**Base config**: `NVFP4_AWQ_LITE_CFG` from `modelopt.torch.quantization.config`.

**Module-scoped exclusions (kept at BF16 precision)**:

| Module pattern | Reason |
|---|---|
| `*visual*` | Vision encoder (ViT tower) is small relative to MoE decoder; disproportionate accuracy loss for minimal memory savings. Standard practice. |
| `*mlp.gate.*` | MoE router: tiny logit perturbations cascade into expert misrouting. Already excluded in `NVFP4_AWQ_LITE_CFG`. |
| `*lm_head*` | Output projection. Already excluded. |
| `*router*`, `*block_sparse_moe.gate*` | Generic router patterns (covers Mixtral-style MoE architectures). Already excluded. |

All 128 MoE experts (`model.language_model.layers.*.mlp.experts.*`) and attention matrices are quantized to NVFP4 weights + NVFP4 activations (W4A4). The `model.visual.*` ViT tower (depth 27, hidden 1152) stays in BF16.
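
To see which modules the patterns above would leave in BF16, the matching can be approximated with standard glob rules. This sketch assumes the patterns are applied as shell-style wildcards against quantizer names; the names below are hypothetical examples modeled on the module paths quoted above, not a dump of the real model.

```python
from fnmatch import fnmatch

# Exclusion patterns from the table above.
EXCLUDE_PATTERNS = (
    "*visual*",
    "*mlp.gate.*",
    "*lm_head*",
    "*router*",
    "*block_sparse_moe.gate*",
)


def is_excluded(quantizer_name, patterns=EXCLUDE_PATTERNS):
    """Return True if any wildcard pattern matches the quantizer name."""
    return any(fnmatch(quantizer_name, p) for p in patterns)


# Hypothetical quantizer names, for illustration only.
names = [
    "model.visual.blocks.0.attn.qkv.weight_quantizer",
    "model.language_model.layers.0.mlp.gate.weight_quantizer",
    "model.language_model.layers.0.mlp.experts.gate_proj.weight_quantizer",
    "model.language_model.layers.0.self_attn.q_proj.input_quantizer",
    "lm_head.weight_quantizer",
]
for name in names:
    print(f"{'BF16 ' if is_excluded(name) else 'NVFP4'}  {name}")
```

Note that `*mlp.gate.*` requires a literal `.gate.` segment, so the experts' `gate_proj` weights are not caught by the router exclusion and remain quantized.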

## Calibration Data

**602 samples** of real browser-use agent trajectories:

| Category (BU_Bench V1) | Tasks | Samples | Rationale for weight |
|---|---|---|---|
| GAIA | 8 | ~200 | Research + reasoning; dominant agent workload |
| OM2W2 | 6 | ~150 | Open-ended info gathering |
| BrowseComp | 5 | ~130 | Cross-source comparison |
| WebBenchREAD | 5 | ~80 | Clean DOM activations |
| InteractionTests | 1 | ~15 | Signal floor for form/interaction regime |

**Collection process:**

1. Full-precision bu-30b-a3b-preview served via vLLM 0.17 at `--dtype bfloat16`.
2. 3 parallel `browser-use` v0.12.6 agents with `enable_planning=True` and `use_vision=True` ran 25 tasks sampled from the official [browser-use/benchmark](https://github.com/browser-use/benchmark) BU_Bench V1 set.
3. Per-category step caps: 40 for GAIA/OM2W2/BrowseComp, 25 for WebBenchREAD/InteractionTests.
4. A proxy between the agents and vLLM captured every `/v1/chat/completions` request payload (including image parts) to JSONL.
5. Samples with total tokens < 1000 (keepalive/error artifacts, 3) or blank screenshots (variance < 150, 16) were filtered out.
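
The filter in step 5 could look roughly like the following. It is a minimal sketch under two assumptions that the steps above do not spell out: the variance is taken over grayscale pixel intensities in the 0-255 range, and samples without a screenshot are kept. The function names are hypothetical.

```python
from statistics import pvariance

MIN_TOTAL_TOKENS = 1000      # threshold from step 5
MIN_PIXEL_VARIANCE = 150.0   # threshold from step 5


def is_blank(pixels, min_variance=MIN_PIXEL_VARIANCE):
    """Treat a screenshot as blank when its pixel variance is low.

    `pixels` is a flat sequence of grayscale intensities (assumed 0-255).
    """
    return pvariance(pixels) < min_variance


def keep_sample(total_tokens, pixels=None):
    """Apply both filters; `pixels` is None for samples without a screenshot."""
    if total_tokens < MIN_TOTAL_TOKENS:
        return False  # keepalive/error artifact
    if pixels is not None and is_blank(pixels):
        return False  # blank screenshot
    return True
```

For example, `keep_sample(12_000, [255] * 100)` rejects an all-white screenshot, while `keep_sample(12_000, [0, 255] * 50)` keeps a high-contrast one.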

**Sample-level statistics** (staged calibration, 602 samples, Qwen3-VL tokenizer + true vision-token expansion):

| Metric | Value |
|---|---|
| Total tokens | min=3, p25=11.2K, median=13.4K, p75=15.8K, p90=18.1K, max=35.4K |
| 8-16K bucket | 439 samples (73%) |
| 16-32K bucket | 144 samples (24%) |
| 32K+ samples | 6 (long-context tail) |
| Samples with screenshot | 93.6% |
| Non-degenerate screenshots | 97.2% |
| DOM element count (median / max) | 136 / 941 |
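
The bucket rows in the table can be reproduced from per-sample token counts with a small tally. This is an illustrative sketch: the bucket edges (multiples of 1,000, lower bound inclusive) are an assumption, and the token counts in the example are synthetic placeholders chosen only to match the bucket sizes reported above.

```python
from collections import Counter


def bucket(total_tokens):
    """Map a token count to a length bucket (edges assumed at 8K/16K/32K)."""
    if total_tokens >= 32_000:
        return "32K+"
    if total_tokens >= 16_000:
        return "16-32K"
    if total_tokens >= 8_000:
        return "8-16K"
    return "<8K"


def bucket_shares(token_counts):
    """Return {bucket: (count, percent of all samples rounded to an integer)}."""
    counts = Counter(bucket(n) for n in token_counts)
    total = len(token_counts)
    return {name: (c, round(100 * c / total)) for name, c in counts.items()}


# Synthetic counts: 439 + 144 + 6 from the table, remaining 13 below 8K.
synthetic = [12_000] * 439 + [20_000] * 144 + [35_000] * 6 + [5_000] * 13
shares = bucket_shares(synthetic)
```

With 602 samples in total, 439 and 144 samples round to the 73% and 24% shown in the table.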

The calibration distribution was committed to **before** running the analyzer on the exploratory data: weights reflect the target user population (researchers and educators running a local agent), not post-hoc curve-fitting to whatever tasks happened to look interesting.

## Serving

### ⚠ vLLM support

As of **vLLM 0.19.1 / main**, the `ModelOpt` quantization loader does **not** accept `quant_algo: NVFP4_AWQ`; the supported list is only `['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4', 'MXFP8', 'MIXED_PRECISION']`. Renaming the algo to plain `NVFP4` would let the checkpoint load but produce mathematically wrong inference, because the 18,480 `pre_quant_scale` tensors that carry AWQ's per-channel activation rescaling would not be applied.
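
A toy example shows why dropping `pre_quant_scale` changes the result. The sketch assumes the usual AWQ convention, in which a per-input-channel scale `s` is folded into the weights and its reciprocal is applied to the activations. Quantization error is left out so that only the rescaling effect is visible, and all values are hypothetical.

```python
def linear(x, weight):
    """y[j] = sum_i x[i] * weight[j][i] for a single input vector x."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


# Toy layer: 2 output channels, 3 input channels.
weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x = [1.0, 1.0, 1.0]
awq_scale = [2.0, 1.0, 0.5]                     # per-input-channel scale s
pre_quant_scale = [1.0 / s for s in awq_scale]  # stored next to the weights

# AWQ folds s into the weights before they are quantized ...
stored_weight = [[w * s for w, s in zip(row, awq_scale)] for row in weight]

reference = linear(x, weight)
# ... so the loader must rescale activations by pre_quant_scale to undo it.
with_scale = linear([xi * p for xi, p in zip(x, pre_quant_scale)], stored_weight)
# A loader that treats the checkpoint as plain NVFP4 skips that step.
without_scale = linear(x, stored_weight)

print(reference, with_scale, without_scale)
```

Only the path that applies `pre_quant_scale` reproduces the reference output; the other path is off in every output channel.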

If you want a vLLM-loadable variant, use the sibling repo **[`Code4me2/bu-30b-a3b-preview-NVFP4`](https://huggingface.co/Code4me2/bu-30b-a3b-preview-NVFP4)** (plain NVFP4, no AWQ, slightly lower accuracy but same memory footprint).

### TensorRT-LLM (recommended)

This format is produced by and natively supported by [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) + TensorRT-LLM. Build an NVFP4 engine:

```bash
trtllm-build --checkpoint_dir Code4me2/bu-30b-a3b-preview-NVFP4-AWQ \
    --quant_format nvfp4 \
    --max_seq_len 32768
```

See the [TRT-LLM NVFP4 guide](https://nvidia.github.io/TensorRT-LLM/reference/precision.html) for more details.

### SGLang

SGLang's ModelOpt integration supports NVFP4_AWQ when built against the matching ModelOpt version; consult the SGLang docs for the current status.

## Intended Use

This model is a drop-in replacement for `bu-30b-a3b-preview` within the
[browser-use](https://github.com/browser-use/browser-use) library. It is
trained/tuned specifically for browser-use's indexed-DOM + structured-action
format. Using it outside that flow (or with a different harness / freeform
CDP scripting) will produce substantially worse results than the
quantization accuracy alone would suggest.

## Evaluation

_Evaluation numbers (MMLU, GSM8K, MM-Bench, BU_Bench V1 subset) will be
added after the quantized model is run against the BF16 baseline. The planned suite is listed below._

Planned eval suite:
- MMLU (general knowledge, 5-shot)
- GSM8K (math reasoning, 0-shot chain-of-thought)
- MM-Bench (vision-language, 0-shot)
- BU_Bench V1 held-out tasks (agent-specific, using the same browser-use harness)

## Reproduction

- Base model: `browser-use/bu-30b-a3b-preview`
- Quantization tool: `nvidia-modelopt==0.43.0`
- Quantization config: `NVFP4_AWQ_LITE_CFG` with `*visual*` excluded (ViT stays BF16); router (`*mlp.gate.*`) already excluded by the config default
- Calibration samples: 512 / 602 (shuffled, seed=42). 6 samples above 32K tokens skipped (aligned with `--max-model-len`)
- Host: single RTX PRO 6000 Blackwell, 98GB
- Calibration wall time: ~14h (70 min cache activation stats + 12h AWQ scale search + 10 min export)
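
The calibration subset described above could be drawn along these lines. This is a sketch, not the actual harness: it assumes over-length samples are dropped before shuffling, that the length limit is 32,768 tokens, and that each sample is a hypothetical `(sample_id, total_tokens)` pair.

```python
import random

MAX_TOKENS = 32_768   # assumed value of --max-model-len
NUM_CALIB = 512       # samples used for calibration
SEED = 42             # shuffle seed from the list above


def select_calibration(samples, num=NUM_CALIB, max_tokens=MAX_TOKENS, seed=SEED):
    """Drop over-length samples, shuffle deterministically, keep the first `num`.

    `samples` is a list of (sample_id, total_tokens) pairs.
    """
    eligible = [s for s in samples if s[1] <= max_tokens]
    random.Random(seed).shuffle(eligible)  # private RNG, global state untouched
    return eligible[:num]


# Synthetic stand-in for the 602 staged samples: 596 in range, 6 too long.
staged = [(i, 12_000) for i in range(596)] + [(596 + i, 35_000) for i in range(6)]
selected = select_calibration(staged)
```

Using a private `random.Random(seed)` instance keeps the selection reproducible regardless of what else in the process touches the global random state.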

### ModelOpt patch for Qwen3-VL-MoE support

ModelOpt 0.43 does not natively know how to export quantized checkpoints for `Qwen3VLMoeForConditionalGeneration`. Three patches were required (included in the model repo as `modelopt_patch.py`):

1. `get_expert_linear_names()` in `layer_utils.py`: recognize `Qwen3VLMoe*` and return `[gate_proj, up_proj, down_proj]`
2. `get_experts_list()` in `layer_utils.py`: recognize `qwen3vlmoe*` model_type
3. `_export_transformers_checkpoint()` in `unified_export_hf.py`: wrap the `QuantQwen3VLMoeTextExperts` container with a transparent iterable proxy so the existing iterable dispatch walks the un-BMM'd per-expert `ModuleList`s, while `__call__` and attribute access still delegate to the real experts module for the internal dummy forward pass
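
The proxy in patch 3 follows a common delegation pattern. The sketch below shows that pattern in isolation, with a hypothetical `ExpertsProxy` class and a stand-in container; it is not the code from `modelopt_patch.py`.

```python
class ExpertsProxy:
    """Iterates over per-expert modules but otherwise acts like the container."""

    def __init__(self, experts, per_expert_modules):
        self._experts = experts
        self._per_expert = list(per_expert_modules)

    def __iter__(self):
        # Export code that loops over "the experts" sees individual modules.
        return iter(self._per_expert)

    def __len__(self):
        return len(self._per_expert)

    def __call__(self, *args, **kwargs):
        # Forward passes still run through the real experts module.
        return self._experts(*args, **kwargs)

    def __getattr__(self, name):
        # Called only when normal lookup fails; delegate to the wrapped module.
        experts = object.__getattribute__(self, "_experts")
        return getattr(experts, name)


class FakeExperts:
    """Stand-in for the fused experts container (hypothetical)."""

    num_experts = 2

    def __call__(self, x):
        return x * 2


proxy = ExpertsProxy(FakeExperts(), ["expert_0", "expert_1"])
```

Iteration yields the per-expert entries, while `proxy(3)` and `proxy.num_experts` are answered by the wrapped object, which is the split of responsibilities that patch 3 describes.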

Reference code + calibration harness: [GitHub link TBD]

## Attribution & License

Derived from [`browser-use/bu-30b-a3b-preview`](https://huggingface.co/browser-use/bu-30b-a3b-preview), which is distributed under a **Modified MIT License** by Browser Use Inc. with a commercial-use restriction: **use is not permitted for organizations whose annual consolidated revenue exceeds USD 1 million for the preceding month**. That restriction propagates to this derivative. Commercial users above the revenue threshold must obtain a license from Browser Use Inc. (`support@browser-use.com`) or use Browser Use's hosted services.

The original LICENSE file is included alongside the weights.

## Acknowledgements

- **Browser Use** for the base model and the open benchmark suite
- **NVIDIA Model Optimizer** for the NVFP4_AWQ calibration tooling
- **Qwen team** for the Qwen3-VL-MoE architecture