bu-30b-a3b-preview NVFP4-AWQ (LITE)
A 4-bit NVFP4 + AWQ-lite quantization of browser-use/bu-30b-a3b-preview — the 30B Qwen3-VL-MoE browser-agent model — produced with NVIDIA TensorRT-Model-Optimizer v0.43.
What's notable about this quant
This is (as of upload) the first NVFP4_AWQ quantization of any browser-agent VLM on the Hub, and the first NVFP4 quant of this model with documented calibration provenance. Existing NVFP4 / INT4-AWQ quants of bu-30b-a3b-preview either lack calibration data disclosure or calibrate against generic text corpora; this one was calibrated on-distribution, using 602 real multimodal browser-use trajectories generated by the full-precision model itself.
The calibration-data argument is the load-bearing claim of this quant — it's documented in detail below.
Why NVFP4 for this model
- Native acceleration on Blackwell. RTX 5090, PRO 6000, B100/B200, GB10 all have native FP4 tensor cores (sm_100+). On Blackwell-class hardware NVFP4 weights execute at ~2× the throughput of FP8.
- Memory. ~17 GB vs ~58 GB at BF16. Fits comfortably on a single RTX 5090 (32 GB) with headroom for the 32K-token context window.
- Accuracy-preserving 4-bit format. NVFP4's two-level scales (FP8 E4M3 block scales at block size 16, plus FP32 per-tensor scale) substantially outperform naive INT4 in accuracy, and AWQ's activation-aware per-channel scaling protects the weight channels that matter most.
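The two-level scaling mentioned in the last bullet can be sketched in plain Python. This is a simplified illustration, not ModelOpt's implementation: the function names are hypothetical, the block scale is kept as a Python float (the FP8 E4M3 rounding of block scales is omitted), and the choice of per-tensor scale as `amax / (6 * 448)` is an assumption about the usual convention.

```python
# Representable magnitudes of the FP4 E2M1 element format.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0    # largest FP8 E4M3 magnitude
BLOCK_SIZE = 16     # elements sharing one block scale


def round_to_e2m1(value):
    """Round a scaled value to the nearest E2M1 grid point, keeping its sign."""
    magnitude = min(E2M1_GRID, key=lambda g: abs(g - abs(value)))
    return magnitude if value >= 0 else -magnitude


def fake_quantize_nvfp4(weights):
    """Quantize and dequantize a flat list of floats with two-level scaling."""
    amax = max(abs(w) for w in weights) or 1.0
    # Level 2: one FP32 scale per tensor (assumed convention), chosen so
    # that every block scale fits inside the FP8 E4M3 range.
    tensor_scale = amax / (E2M1_MAX * E4M3_MAX)
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        block_amax = max(abs(w) for w in block)
        # Level 1: one scale per 16-element block. A real kernel would store
        # this value in FP8 E4M3; that rounding step is skipped here.
        block_scale = block_amax / E2M1_MAX / tensor_scale
        step = (block_scale * tensor_scale) or 1.0
        out.extend(round_to_e2m1(w / step) * step for w in block)
    return out


# A large-magnitude block next to a small-magnitude block: each block gets
# its own step size, so the small values are not flushed to zero.
weights = [6.0, -3.0, 1.0, 0.5] + [0.0] * 12 + [0.06, -0.03] + [0.0] * 14
restored = fake_quantize_nvfp4(weights)
```

With a single scale for the whole tensor, the second block's values would fall below the smallest nonzero grid step and round to zero; the per-block scale is what keeps them.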
Quantization Recipe
Base config: NVFP4_AWQ_LITE_CFG from modelopt.torch.quantization.config.
Module-scoped exclusions (kept at BF16 precision):
| Module pattern | Reason |
|---|---|
| `*visual*` | Vision encoder (ViT tower) is small relative to MoE decoder; disproportionate accuracy loss for minimal memory savings. Standard practice. |
| `*mlp.gate.*` | MoE router: tiny logit perturbations cascade into expert misrouting. Already excluded in NVFP4_AWQ_LITE_CFG. |
| `*lm_head*` | Output projection. Already excluded. |
| `*router*`, `*block_sparse_moe.gate*` | Generic router patterns (covers Mixtral-style MoE architectures). Already excluded. |
All 128 MoE experts (model.language_model.layers.*.mlp.experts.*) and attention matrices are quantized to NVFP4 weights + NVFP4 activations (W4A4). The model.visual.* ViT tower (depth 27, hidden 1152) stays in BF16.
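To see how the wildcard patterns in the table separate excluded modules from quantized ones, here is a small sketch using Python's standard `fnmatch`. It assumes the patterns are matched glob-style against quantizer names; the specific names below are illustrative guesses based on the module paths mentioned above, not a dump of the real model.

```python
from fnmatch import fnmatch

# Patterns from the exclusion table above.
EXCLUDE_PATTERNS = [
    "*visual*",
    "*mlp.gate.*",
    "*lm_head*",
    "*router*",
    "*block_sparse_moe.gate*",
]


def is_excluded(quantizer_name):
    """Return True if any exclusion pattern matches the quantizer name."""
    return any(fnmatch(quantizer_name, pattern) for pattern in EXCLUDE_PATTERNS)


# Illustrative (hypothetical) quantizer names.
names = [
    "model.visual.blocks.0.attn.qkv.weight_quantizer",
    "model.language_model.layers.0.mlp.gate.weight_quantizer",
    "model.language_model.layers.0.mlp.experts.7.gate_proj.weight_quantizer",
    "model.language_model.layers.0.self_attn.q_proj.weight_quantizer",
    "lm_head.weight_quantizer",
]

for name in names:
    status = "kept at BF16" if is_excluded(name) else "quantized to NVFP4"
    print(f"{name}: {status}")
```

Note that `*mlp.gate.*` matches the router (`mlp.gate.`) but not the experts' `gate_proj` layers, because the pattern requires a dot directly after `gate`.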
Calibration Data
602 samples of real browser-use agent trajectories:
| Category (BU_Bench V1) | Tasks | Samples | Weight (rationale) |
|---|---|---|---|
| GAIA | 8 | ~200 | Research + reasoning — dominant agent workload |
| OM2W2 | 6 | ~150 | Open-ended info gathering |
| BrowseComp | 5 | ~130 | Cross-source comparison |
| WebBenchREAD | 5 | ~80 | Clean DOM activations |
| InteractionTests | 1 | ~15 | Signal floor for form/interaction regime |
Collection process:
- Full-precision bu-30b-a3b-preview served via vLLM 0.17 at `--dtype bfloat16`.
- 3 parallel `browser-use` v0.12.6 agents with `enable_planning=True` and `use_vision=True` ran 25 tasks sampled from the official browser-use/benchmark BU_Bench V1 set.
- Per-category step caps: 40 for GAIA/OM2W2/BrowseComp, 25 for WebBenchREAD/InteractionTests.
- A proxy between the agents and vLLM captured every `/v1/chat/completions` request payload (including image parts) to JSONL.
- Samples with total tokens < 1000 (keepalive/error artifacts, 3) or blank screenshots (variance < 150, 16) were filtered out.
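The filtering step at the end of the collection process could look roughly like the sketch below. The record fields `total_tokens` and `screenshot_variance` are hypothetical (the actual JSONL schema is not documented here), and keeping samples that have no screenshot at all is an assumption.

```python
import json

MIN_TOTAL_TOKENS = 1000        # below this: keepalive/error artifacts
MIN_SCREENSHOT_VARIANCE = 150  # below this: blank screenshot


def keep_sample(sample):
    """Apply the two filters described above to one captured request."""
    if sample["total_tokens"] < MIN_TOTAL_TOKENS:
        return False
    variance = sample.get("screenshot_variance")
    # Assumption: text-only samples (no screenshot) are kept.
    if variance is not None and variance < MIN_SCREENSHOT_VARIANCE:
        return False
    return True


def filter_jsonl_lines(lines):
    """Parse JSONL lines and return only the samples that pass the filters."""
    samples = (json.loads(line) for line in lines if line.strip())
    return [sample for sample in samples if keep_sample(sample)]


# Hypothetical captured records, one JSON object per line.
captured = [
    '{"id": "a", "total_tokens": 13400, "screenshot_variance": 2100.0}',
    '{"id": "b", "total_tokens": 3, "screenshot_variance": 1800.0}',
    '{"id": "c", "total_tokens": 12000, "screenshot_variance": 42.0}',
    '{"id": "d", "total_tokens": 9000}',
]
kept = filter_jsonl_lines(captured)
```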
Sample-level statistics (staged calibration, 602 samples, Qwen3-VL tokenizer + true vision-token expansion):
| Metric | Value |
|---|---|
| Total tokens | min=3, p25=11.2K, median=13.4K, p75=15.8K, p90=18.1K, max=35.4K |
| 8-16K bucket | 439 samples (73%) |
| 16-32K bucket | 144 samples (24%) |
| 32K+ samples | 6 (long-context tail) |
| Samples with screenshot | 93.6% |
| Non-degenerate screenshots | 97.2% |
| DOM element count (median / max) | 136 / 941 |
The calibration distribution was committed to before running the analyzer on the exploratory data — weights reflect the target user population (researchers and educators running a local agent), not post-hoc curve-fitting to whatever tasks happened to look interesting.
Serving
⚠ vLLM support
As of vLLM 0.19.1 / main, the ModelOpt quantization loader does not accept quant_algo: NVFP4_AWQ — the supported list is only ['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4', 'MXFP8', 'MIXED_PRECISION']. Renaming the algo to plain NVFP4 would load but produce mathematically wrong inference because the 18,480 pre_quant_scale tensors that carry AWQ's per-channel activation rescaling would not be applied.
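A toy example makes the failure mode concrete. It assumes the common AWQ convention in which the stored weight is multiplied per input channel by a scale and the activation is multiplied by the reciprocal (the `pre_quant_scale`) before the matmul; the numbers are made up and quantization itself is left out so that only the effect of the missing rescaling is visible.

```python
def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Original layer: y = W @ x
weight = [[1.0, 2.0],
          [3.0, 4.0]]
x = [0.5, -1.0]

# Hypothetical per-input-channel AWQ scales.
awq_scale = [4.0, 0.5]

# What the checkpoint stores: weights scaled per input channel,
# plus the reciprocal scale to apply to the activations.
stored_weight = [[w * s for w, s in zip(row, awq_scale)] for row in weight]
pre_quant_scale = [1.0 / s for s in awq_scale]

reference = matvec(weight, x)
with_scale = matvec(stored_weight, [xi * p for xi, p in zip(x, pre_quant_scale)])
without_scale = matvec(stored_weight, x)

print("reference:     ", reference)
print("with scale:    ", with_scale)
print("without scale: ", without_scale)
```

Applying `pre_quant_scale` recovers the reference output; a loader that treats the checkpoint as plain NVFP4 computes the third result instead, which differs in every output channel.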
If you want a vLLM-loadable variant, use the sibling repo Code4me2/bu-30b-a3b-preview-NVFP4 (plain NVFP4, no AWQ, slightly lower accuracy but same memory footprint).
TensorRT-LLM (recommended)
This format is produced by and natively supported by NVIDIA TensorRT-Model-Optimizer + TensorRT-LLM. Build an NVFP4 engine:
trtllm-build --checkpoint_dir Code4me2/bu-30b-a3b-preview-NVFP4-AWQ \
--quant_format nvfp4 \
--max_seq_len 32768
See the TRT-LLM NVFP4 guide for more details.
SGLang
SGLang's ModelOpt integration supports NVFP4_AWQ when built against the matching ModelOpt version — consult their docs for the current status.
Intended Use
This model is a drop-in replacement for bu-30b-a3b-preview within the browser-use library. It is trained/tuned specifically for browser-use's indexed-DOM + structured-action format. Using it outside that flow (or with a different harness / freeform CDP scripting) will produce substantially worse results than the quantization accuracy alone would suggest.
Evaluation
Evaluation numbers (MMLU, GSM8K, MM-Bench, BU_Bench V1 subset) will be added once the quantized model has been run against the BF16 baseline. See methodology below.
Planned eval suite:
- MMLU (general knowledge, 5-shot)
- GSM8K (math reasoning, 0-shot chain-of-thought)
- MM-Bench (vision-language, 0-shot)
- BU_Bench V1 held-out tasks (agent-specific, using the same browser-use harness)
Reproduction
- Base model: `browser-use/bu-30b-a3b-preview`
- Quantization tool: `nvidia-modelopt==0.43.0`
- Quantization config: `NVFP4_AWQ_LITE_CFG` with `*visual*` excluded (ViT stays BF16); router (`*mlp.gate.*`) already excluded by the config default
- Calibration samples: 512 / 602 (shuffled, seed=42). 6 samples above 32K tokens skipped (aligned with `--max-model-len`)
- Host: single RTX PRO 6000 Blackwell, 98 GB
- Calibration wall time: ~14h (70 min cache activation stats + 12h AWQ scale search + 10 min export)
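The sample selection in the list above can be approximated as follows. This is a sketch under assumptions: the `total_tokens` field is hypothetical, 32K is taken to mean 32768 tokens, and the order of operations (drop over-length samples first, then shuffle, then truncate) is a guess rather than the documented procedure.

```python
import random

MAX_MODEL_LEN = 32768   # assumed meaning of "32K tokens"
NUM_CALIBRATION = 512
SEED = 42


def select_calibration_samples(samples):
    """Drop over-length samples, shuffle deterministically, keep the first 512."""
    eligible = [s for s in samples if s["total_tokens"] <= MAX_MODEL_LEN]
    rng = random.Random(SEED)
    rng.shuffle(eligible)
    return eligible[:NUM_CALIBRATION]


# Synthetic stand-in for the 602 staged samples: 596 within the limit,
# 6 in the long-context tail.
staged = [{"id": i, "total_tokens": 13400} for i in range(596)]
staged += [{"id": 596 + i, "total_tokens": 35000} for i in range(6)]

selected = select_calibration_samples(staged)
print(len(selected))  # 512
```

Using a dedicated `random.Random(SEED)` instance keeps the selection reproducible regardless of any other random state in the process.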
ModelOpt patch for Qwen3-VL-MoE support
ModelOpt 0.43 does not natively know how to export quantized checkpoints for Qwen3VLMoeForConditionalGeneration. Three patches were required (included in the model repo as modelopt_patch.py):
- `get_expert_linear_names()` in `layer_utils.py`: recognize `Qwen3VLMoe*` and return `[gate_proj, up_proj, down_proj]`
- `get_experts_list()` in `layer_utils.py`: recognize `qwen3vlmoe*` model_type
- `_export_transformers_checkpoint()` in `unified_export_hf.py`: wrap the `QuantQwen3VLMoeTextExperts` container with a transparent iterable proxy so the existing iterable dispatch walks the un-BMM'd per-expert `ModuleList`s, while `__call__` and attribute access still delegate to the real experts module for the internal dummy forward pass
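The "transparent iterable proxy" idea from the third patch can be illustrated with a minimal, framework-free sketch. The class names `IterableExpertsProxy` and `FakeExperts` are hypothetical and this is not the code in `modelopt_patch.py`; it only shows the pattern of iterating over per-expert items while forwarding calls and attribute lookups to the wrapped object.

```python
class IterableExpertsProxy:
    """Iterate over per-expert modules; delegate everything else."""

    def __init__(self, experts, per_expert_modules):
        self._experts = experts
        self._per_expert = list(per_expert_modules)

    def __iter__(self):
        # Export code that expects an iterable of experts walks these.
        return iter(self._per_expert)

    def __len__(self):
        return len(self._per_expert)

    def __call__(self, *args, **kwargs):
        # Forward passes still go through the real experts container.
        return self._experts(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails; the guard avoids
        # infinite recursion if the proxy's own fields are not set yet.
        if name in ("_experts", "_per_expert"):
            raise AttributeError(name)
        return getattr(self._experts, name)


class FakeExperts:
    """Hypothetical stand-in for a fused experts container."""

    num_experts = 2

    def __call__(self, x):
        return x * 2


proxy = IterableExpertsProxy(FakeExperts(), ["expert_0", "expert_1"])
print(list(proxy), proxy(3), proxy.num_experts)
```

The design choice is that callers which only iterate never notice the wrapper, and callers which only invoke the module or read its attributes never notice it either.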
Reference code + calibration harness: [GitHub link TBD]
Attribution & License
Derived from browser-use/bu-30b-a3b-preview, which is distributed under a Modified MIT License by Browser Use Inc. with a commercial-use restriction: use is not permitted for organizations whose annual consolidated revenue exceeds USD 1 million for the preceding month. That restriction propagates to this derivative. Commercial users above the revenue threshold must obtain a license from Browser Use Inc. (support@browser-use.com) or use Browser Use's hosted services.
The original LICENSE file is included alongside the weights.
Acknowledgements
- Browser Use for the base model and the open benchmark suite
- NVIDIA Model Optimizer for the NVFP4_AWQ calibration tooling
- Qwen team for the Qwen3-VL-MoE architecture