---
base_model: browser-use/bu-30b-a3b-preview
base_model_relation: quantized
license: other
license_name: modified-mit-browser-use
license_link: https://huggingface.co/browser-use/bu-30b-a3b-preview/blob/main/LICENSE
tags:
- nvfp4
- awq
- modelopt
- browser-use
- agent
- vision-language
- moe
- quantized
pipeline_tag: image-text-to-text
library_name: transformers
---
# bu-30b-a3b-preview NVFP4-AWQ (LITE)
A 4-bit NVFP4 + AWQ-lite quantization of
[browser-use/bu-30b-a3b-preview](https://huggingface.co/browser-use/bu-30b-a3b-preview), the 30B Qwen3-VL-MoE browser-agent model, produced with
[NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
v0.43.
**What's notable about this quant**
This is (as of upload) the first **NVFP4_AWQ** quantization of any browser-agent VLM on the Hub, and the first NVFP4 quant of this model with documented calibration provenance. Existing NVFP4 / INT4-AWQ quants of `bu-30b-a3b-preview` either do not disclose their calibration data or calibrate against generic text corpora; this one was calibrated **on-distribution**, using 602 real multimodal request samples captured from browser-use trajectories run by the full-precision model itself.
The calibration-data argument is the load-bearing claim of this quant; it is documented in detail below.
## Why NVFP4 for this model
- **Native acceleration on Blackwell.** RTX 5090, PRO 6000, B100/B200, GB10 all have native FP4 tensor cores (sm_100+). On Blackwell-class hardware NVFP4 weights execute at ~2× the throughput of FP8.
- **Memory.** ~17 GB vs ~58 GB at BF16. Fits comfortably on a single RTX 5090 (32 GB) with headroom for the 32K-token context window.
- **Accuracy-preserving 4-bit format.** NVFP4's two-level scales (FP8 E4M3 block scales at block size 16, plus FP32 per-tensor scale) substantially outperform naive INT4 in accuracy, and AWQ's activation-aware per-channel scaling protects the weight channels that matter most.
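The two-level scaling described above can be illustrated with a small pure-Python sketch. It is not ModelOpt's implementation: the function names are hypothetical, block scales are kept as ordinary floats instead of being rounded to FP8 E4M3, and ties are rounded toward the smaller magnitude for simplicity.

```python
# Illustrative NVFP4-style fake quantization of a flat list of weights.
# Simplification (assumption): block scales stay as Python floats here;
# the real format stores them rounded to FP8 E4M3.

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0
BLOCK_SIZE = 16


def round_to_e2m1(x):
    """Round x to the nearest signed E2M1 value (ties go to the smaller magnitude)."""
    mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) - v))
    return -mag if x < 0 else mag


def fake_quantize_nvfp4(weights):
    """Quantize then dequantize weights using per-tensor and per-block scales."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights)
    # Level 1: one FP32 scale per tensor, sized so block scales fit the E4M3 range.
    tensor_scale = amax / (FP4_MAX * FP8_E4M3_MAX)
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        block_amax = max(abs(w) for w in block)
        # Level 2: one scale per 16-element block.
        block_scale = block_amax / FP4_MAX / tensor_scale
        step = block_scale * tensor_scale
        for w in block:
            q = round_to_e2m1(w / step) if step > 0 else 0.0
            out.append(q * step)
    return out


def bits_per_weight(block_size=BLOCK_SIZE):
    """4-bit payload plus one 8-bit block scale amortized over the block."""
    return 4 + 8 / block_size
```

At 4.5 bits per weight, roughly 30 billion quantized parameters come to about 16.9 GB, which is in line with the ~17 GB figure above; the modules kept in BF16 add a little on top.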
## Quantization Recipe
**Base config**: `NVFP4_AWQ_LITE_CFG` from `modelopt.torch.quantization.config`.
**Module-scoped exclusions (kept at BF16 precision)**:
| Module pattern | Reason |
|---|---|
| `*visual*` | Vision encoder (ViT tower) is small relative to MoE decoder; disproportionate accuracy loss for minimal memory savings. Standard practice. |
| `*mlp.gate.*` | MoE router: tiny logit perturbations cascade into expert misrouting. Already excluded in `NVFP4_AWQ_LITE_CFG`. |
| `*lm_head*` | Output projection. Already excluded. |
| `*router*`, `*block_sparse_moe.gate*` | Generic router patterns (covers Mixtral-style MoE architectures). Already excluded. |
All 128 MoE experts (`model.language_model.layers.*.mlp.experts.*`) and attention matrices are quantized to NVFP4 weights + NVFP4 activations (W4A4). The `model.visual.*` ViT tower (depth 27, hidden 1152) stays in BF16.
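As a rough illustration of how the exclusion patterns partition the model, the sketch below checks a few module paths against them with `fnmatch`. It assumes the patterns are matched fnmatch-style against quantizer module names and that a pattern is switched off with an `{"enable": False}` entry in the config's pattern mapping; the module paths and helper names are hypothetical.

```python
from fnmatch import fnmatch

# Exclusion patterns from the table above.
EXCLUDE_PATTERNS = [
    "*visual*",
    "*mlp.gate.*",
    "*lm_head*",
    "*router*",
    "*block_sparse_moe.gate*",
]


def is_excluded(quantizer_name, patterns=EXCLUDE_PATTERNS):
    """Return True if any wildcard pattern matches the quantizer's module path."""
    return any(fnmatch(quantizer_name, p) for p in patterns)


def disable_patterns(quant_cfg, patterns):
    """Return a copy of a pattern -> settings mapping with the given patterns disabled."""
    updated = dict(quant_cfg)
    for p in patterns:
        updated[p] = {"enable": False}
    return updated


# Hypothetical module paths, for illustration only.
examples = [
    "model.visual.blocks.0.attn.qkv.weight_quantizer",
    "model.language_model.layers.0.mlp.gate.weight_quantizer",
    "model.language_model.layers.0.mlp.experts.7.gate_proj.weight_quantizer",
    "model.language_model.layers.0.self_attn.q_proj.input_quantizer",
    "lm_head.weight_quantizer",
]
for name in examples:
    status = "BF16 (excluded)" if is_excluded(name) else "NVFP4"
    print(f"{name}: {status}")
```

Note that `*mlp.gate.*` matches the router but not the experts' `gate_proj` layers, because the pattern requires the literal substring `mlp.gate.`.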
## Calibration Data
**602 samples** of real browser-use agent trajectories:
| Category (BU_Bench V1) | Tasks | Samples | Weight (rationale) |
|---|---|---|---|
| GAIA | 8 | ~200 | Research + reasoning; dominant agent workload |
| OM2W2 | 6 | ~150 | Open-ended info gathering |
| BrowseComp | 5 | ~130 | Cross-source comparison |
| WebBenchREAD | 5 | ~80 | Clean DOM activations |
| InteractionTests | 1 | ~15 | Signal floor for form/interaction regime |
**Collection process:**
1. Full-precision bu-30b-a3b-preview served via vLLM 0.17 at `--dtype bfloat16`.
2. 3 parallel `browser-use` v0.12.6 agents with `enable_planning=True` and `use_vision=True` ran 25 tasks sampled from the official [browser-use/benchmark](https://github.com/browser-use/benchmark) BU_Bench V1 set.
3. Per-category step caps: 40 for GAIA/OM2W2/BrowseComp, 25 for WebBenchREAD/InteractionTests.
4. A proxy between the agents and vLLM captured every `/v1/chat/completions` request payload (including image parts) to JSONL.
5. Samples with total tokens < 1000 (keepalive/error artifacts; 3 samples) or blank screenshots (variance < 150; 16 samples) were filtered out.
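The filtering in step 5 could look roughly like the sketch below. The record layout (`total_tokens`, `screenshot_pixels`) and the function names are assumptions made for illustration; the actual harness and the way screenshot variance was computed are not shown in this card.

```python
import json
from statistics import pvariance

MIN_TOTAL_TOKENS = 1000        # below this: keepalive/error artifacts
MIN_SCREENSHOT_VARIANCE = 150  # below this: blank screenshot


def keep_sample(sample):
    """Apply the two filters from step 5 to one captured request.

    `sample` is a hypothetical record with `total_tokens` (int) and
    `screenshot_pixels` (list of grayscale values, or None if no image).
    """
    if sample["total_tokens"] < MIN_TOTAL_TOKENS:
        return False
    pixels = sample.get("screenshot_pixels")
    if pixels is not None and pvariance(pixels) < MIN_SCREENSHOT_VARIANCE:
        return False
    return True


def filter_jsonl_lines(lines):
    """Parse JSONL lines and return only the samples that pass both filters."""
    samples = (json.loads(line) for line in lines if line.strip())
    return [s for s in samples if keep_sample(s)]
```

Samples without a screenshot pass the second filter in this sketch, which matches the statistics below where not every sample carries an image.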
**Sample-level statistics** (staged calibration, 602 samples, Qwen3-VL tokenizer + true vision-token expansion):
| Metric | Value |
|---|---|
| Total tokens | min=3, p25=11.2K, median=13.4K, p75=15.8K, p90=18.1K, max=35.4K |
| 8-16K bucket | 439 samples (73%) |
| 16-32K bucket | 144 samples (24%) |
| 32K+ samples | 6 (long-context tail) |
| Samples with screenshot | 93.6% |
| Non-degenerate screenshots | 97.2% |
| DOM element count (median / max) | 136 / 941 |
The calibration distribution was committed to **before** running the analyzer on the exploratory data: weights reflect the target user population (researchers and educators running a local agent), not post-hoc curve-fitting to whatever tasks happened to look interesting.
## Serving
### vLLM support
As of **vLLM 0.19.1 / main**, the `ModelOpt` quantization loader does **not** accept `quant_algo: NVFP4_AWQ` β the supported list is only `['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4', 'MXFP8', 'MIXED_PRECISION']`. Renaming the algo to plain `NVFP4` would load but produce mathematically wrong inference because the 18,480 `pre_quant_scale` tensors that carry AWQ's per-channel activation rescaling would not be applied.
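A toy example shows why dropping the `pre_quant_scale` tensors changes the result. It ignores quantization entirely and assumes the convention that activations are multiplied by `pre_quant_scale` while the stored weights carry the inverse factor along their input channels; the values are made up and chosen as powers of two so the arithmetic is exact.

```python
def matvec(weight_rows, x):
    """Compute y[j] = sum_i weight_rows[j][i] * x[i]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


# Toy values (hypothetical), one activation per input channel.
x = [1.0, 2.0, 4.0]
W = [[1.0, 0.5, 0.25], [2.0, 1.0, 0.5]]  # original weights: 2 outputs x 3 inputs
pre_quant_scale = [0.5, 2.0, 4.0]        # per-input-channel activation rescaling

# The stored weights have the inverse scaling folded in along the input channels.
W_stored = [[w / s for w, s in zip(row, pre_quant_scale)] for row in W]

reference = matvec(W, x)
with_scale = matvec(W_stored, [xi * s for xi, s in zip(x, pre_quant_scale)])
without_scale = matvec(W_stored, x)

print(reference)      # original layer output
print(with_scale)     # equals the reference
print(without_scale)  # differs, because the activation rescaling was dropped
```

A loader that treats the checkpoint as plain `NVFP4` corresponds to the `without_scale` case: the weights load without error, but every affected layer computes a different function.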
If you want a vLLM-loadable variant, use the sibling repo **[`Code4me2/bu-30b-a3b-preview-NVFP4`](https://huggingface.co/Code4me2/bu-30b-a3b-preview-NVFP4)** (plain NVFP4, no AWQ, slightly lower accuracy but same memory footprint).
### TensorRT-LLM (recommended)
This format is produced by and natively supported by [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) + TensorRT-LLM. Build an NVFP4 engine:
```bash
trtllm-build --checkpoint_dir Code4me2/bu-30b-a3b-preview-NVFP4-AWQ \
--quant_format nvfp4 \
--max_seq_len 32768
```
See the [TRT-LLM NVFP4 guide](https://nvidia.github.io/TensorRT-LLM/reference/precision.html) for more details.
### SGLang
SGLang's ModelOpt integration supports NVFP4_AWQ when built against the matching ModelOpt version; consult their docs for the current status.
## Intended Use
This model is a drop-in replacement for `bu-30b-a3b-preview` within the
[browser-use](https://github.com/browser-use/browser-use) library. It is
trained/tuned specifically for browser-use's indexed-DOM + structured-action
format. Using it outside that flow (or with a different harness / freeform
CDP scripting) will produce substantially worse results than the
quantization accuracy alone would suggest.
## Evaluation
_Evaluation numbers (MMLU, GSM8K, MM-Bench, BU_Bench V1 subset) will be
added after running against BF16 baseline. See methodology below._
Planned eval suite:
- MMLU (general knowledge, 5-shot)
- GSM8K (math reasoning, 0-shot chain-of-thought)
- MM-Bench (vision-language, 0-shot)
- BU_Bench V1 held-out tasks (agent-specific, using the same browser-use harness)
## Reproduction
- Base model: `browser-use/bu-30b-a3b-preview`
- Quantization tool: `nvidia-modelopt==0.43.0`
- Quantization config: `NVFP4_AWQ_LITE_CFG` with `*visual*` excluded (ViT stays BF16); router (`*mlp.gate.*`) already excluded by the config default
- Calibration samples: 512 / 602 (shuffled, seed=42). 6 samples above 32K tokens skipped (aligned with `--max-model-len`)
- Host: single RTX PRO 6000 Blackwell, 98GB
- Calibration wall time: ~14h (70 min cache activation stats + 12h AWQ scale search + 10 min export)
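
The sample selection described above (512 of 602, shuffled with seed 42, over-length samples skipped) might be reproduced along these lines. The order of filtering and shuffling, the exact numeric cutoff for "32K", and the `total_tokens` field are assumptions; the actual harness may differ.

```python
import random

MAX_TOKENS = 32 * 1024  # assumed numeric value of the 32K cutoff
NUM_CALIB = 512
SEED = 42


def select_calibration_subset(samples, num=NUM_CALIB, max_tokens=MAX_TOKENS, seed=SEED):
    """Drop over-length samples, shuffle deterministically, and take the first `num`."""
    eligible = [s for s in samples if s["total_tokens"] <= max_tokens]
    rng = random.Random(seed)  # local RNG so global random state is untouched
    rng.shuffle(eligible)
    return eligible[:num]
```

Using a dedicated `random.Random(seed)` instance keeps the selection reproducible regardless of other code that touches the global random state.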
### ModelOpt patch for Qwen3-VL-MoE support
ModelOpt 0.43 does not natively know how to export quantized checkpoints for `Qwen3VLMoeForConditionalGeneration`. Three patches were required (included in the model repo as `modelopt_patch.py`):
1. `get_expert_linear_names()` in `layer_utils.py`: recognize `Qwen3VLMoe*` and return `[gate_proj, up_proj, down_proj]`
2. `get_experts_list()` in `layer_utils.py`: recognize `qwen3vlmoe*` model_type
3. `_export_transformers_checkpoint()` in `unified_export_hf.py`: wrap the `QuantQwen3VLMoeTextExperts` container with a transparent iterable proxy so the existing iterable dispatch walks the un-BMM'd per-expert `ModuleList`s, while `__call__` and attribute access still delegate to the real experts module for the internal dummy forward pass
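
The idea behind the third patch can be sketched in plain Python. This is not the code in `modelopt_patch.py`: the class names are hypothetical, and the stand-in experts object below only mimics a module that is callable but not iterable.

```python
class IterableExpertsProxy:
    """Wrap an experts container so it iterates over per-expert modules
    while calls and attribute lookups still reach the wrapped object."""

    def __init__(self, experts, per_expert_modules):
        self._experts = experts
        self._per_expert = list(per_expert_modules)

    def __iter__(self):
        return iter(self._per_expert)

    def __len__(self):
        return len(self._per_expert)

    def __getitem__(self, index):
        return self._per_expert[index]

    def __call__(self, *args, **kwargs):
        # Forward passes go to the real experts module.
        return self._experts(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached when normal lookup fails; guard the proxy's own fields
        # so a half-initialized proxy cannot recurse forever.
        if name in ("_experts", "_per_expert"):
            raise AttributeError(name)
        return getattr(self._experts, name)


class FakeExperts:
    """Stand-in for the fused experts module (illustration only)."""

    num_experts = 2

    def __call__(self, x):
        return x * 2


proxy = IterableExpertsProxy(FakeExperts(), ["expert_0", "expert_1"])
print(list(proxy))        # iteration walks the per-expert entries
print(proxy.num_experts)  # attribute access is delegated
print(proxy(21))          # calls are delegated
```

The design choice is that export code which expects an iterable of experts keeps working unchanged, while anything that treats the container as a single module still sees the original object's behavior.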
Reference code + calibration harness: [GitHub link TBD]
## Attribution & License
Derived from [`browser-use/bu-30b-a3b-preview`](https://huggingface.co/browser-use/bu-30b-a3b-preview), which is distributed under a **Modified MIT License** by Browser Use Inc. with a commercial-use restriction: **use is not permitted for organizations whose annual consolidated revenue exceeds USD 1 million for the preceding month**. That restriction propagates to this derivative. Commercial users above the revenue threshold must obtain a license from Browser Use Inc. (`support@browser-use.com`) or use Browser Use's hosted services.
The original LICENSE file is included alongside the weights.
## Acknowledgements
- **Browser Use** for the base model and the open benchmark suite
- **NVIDIA Model Optimizer** for the NVFP4_AWQ calibration tooling
- **Qwen team** for the Qwen3-VL-MoE architecture