Instructions for using Code4me2/bu-30b-a3b-preview-NVFP4 with libraries and local apps.

Libraries

How to use Code4me2/bu-30b-a3b-preview-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Code4me2/bu-30b-a3b-preview-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Code4me2/bu-30b-a3b-preview-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("Code4me2/bu-30b-a3b-preview-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Code4me2/bu-30b-a3b-preview-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Code4me2/bu-30b-a3b-preview-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Code4me2/bu-30b-a3b-preview-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Code4me2/bu-30b-a3b-preview-NVFP4

SGLang

How to use Code4me2/bu-30b-a3b-preview-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Code4me2/bu-30b-a3b-preview-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Code4me2/bu-30b-a3b-preview-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Code4me2/bu-30b-a3b-preview-NVFP4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use Code4me2/bu-30b-a3b-preview-NVFP4 with Docker Model Runner:
```
docker model run hf.co/Code4me2/bu-30b-a3b-preview-NVFP4
```

bu-30b-a3b-preview-NVFP4 / README.md

Code4me2

Initial quantized checkpoint + model card

6a8e64b verified about 1 month ago


	---
	base_model: browser-use/bu-30b-a3b-preview
	base_model_relation: quantized
	license: other
	license_name: modified-mit-browser-use
	license_link: https://huggingface.co/browser-use/bu-30b-a3b-preview/blob/main/LICENSE
	tags:
	- nvfp4
	- awq
	- modelopt
	- browser-use
	- agent
	- vision-language
	- moe
	- quantized
	pipeline_tag: image-text-to-text
	library_name: transformers
	---

	# bu-30b-a3b-preview NVFP4-AWQ (LITE)

	A 4-bit NVFP4 + AWQ-lite quantization of
	[browser-use/bu-30b-a3b-preview](https://huggingface.co/browser-use/bu-30b-a3b-preview) — the 30B Qwen3-VL-MoE browser-agent model — produced with
	[NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
	v0.43.

	## What's notable about this quant

	This is (as of upload) the first NVFP4_AWQ quantization of any browser-agent VLM on the Hub, and the first NVFP4 quant of this model with documented calibration provenance. Existing NVFP4 / INT4-AWQ quants of `bu-30b-a3b-preview` either lack calibration data disclosure or calibrate against generic text corpora; this one was calibrated on-distribution, using 602 real multimodal browser-use trajectories generated by the full-precision model itself.

	The calibration-data argument is the load-bearing claim of this quant — it's documented in detail below.

	## Why NVFP4 for this model

	- Native acceleration on Blackwell. RTX 5090, PRO 6000, B100/B200, GB10 all have native FP4 tensor cores (sm_100+). On Blackwell-class hardware NVFP4 weights execute at ~2× the throughput of FP8.
	- Memory. ~17 GB vs ~58 GB at BF16. Fits comfortably on a single RTX 5090 (32 GB) with headroom for the 32K-token context window.
	- Accuracy-preserving 4-bit format. NVFP4's two-level scales (FP8 E4M3 block scales at block size 16, plus FP32 per-tensor scale) substantially outperform naive INT4 in accuracy, and AWQ's activation-aware per-channel scaling protects the weight channels that matter most.
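The two-level scaling described in the last bullet can be illustrated with a small NumPy quantize-dequantize sketch. This is an illustration under stated assumptions, not ModelOpt's kernel: the function name `fake_quant_nvfp4` is made up for this example, the block scale is kept in FP32 rather than rounded to FP8 E4M3, and rounding is plain nearest-value onto the FP4 (E2M1) magnitude grid.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16          # NVFP4 block size
E4M3_MAX = 448.0    # largest finite FP8 E4M3 value


def fake_quant_nvfp4(w):
    """Quantize-dequantize a 1-D array whose length is a multiple of 16."""
    w = np.asarray(w, dtype=np.float32)
    blocks = w.reshape(-1, BLOCK)

    # Level 1: one FP32 scale per tensor, chosen so that the largest
    # block scale would fit into the E4M3 range.
    tensor_amax = float(np.abs(w).max())
    tensor_scale = tensor_amax / (6.0 * E4M3_MAX) if tensor_amax > 0 else 1.0

    # Level 2: one scale per 16-element block, expressed in units of
    # tensor_scale. Real NVFP4 rounds this value to E4M3; skipped here.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = block_amax / 6.0 / tensor_scale
    block_scale = np.where(block_scale == 0, 1.0, block_scale)

    # Map each element to the nearest E2M1 magnitude, keeping its sign.
    scaled = blocks / (block_scale * tensor_scale)
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]

    # Dequantize back to the original range.
    return (quantized * block_scale * tensor_scale).reshape(w.shape)


weights = np.linspace(-1.0, 1.0, 32)
restored = fake_quant_nvfp4(weights)
print("max abs error:", float(np.abs(weights - restored).max()))
```

Because each block carries its own scale, an outlier only coarsens the 16 values that share its block, which is the intuition behind the accuracy advantage over a single per-tensor INT4 scale.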

	## Quantization Recipe

	Base config: `NVFP4_AWQ_LITE_CFG` from `modelopt.torch.quantization.config`.

	Module-scoped exclusions (kept at BF16 precision):

	| Module pattern | Reason |
	|---|---|
	| `visual` | Vision encoder (ViT tower) is small relative to MoE decoder; disproportionate accuracy loss for minimal memory savings. Standard practice. |
	| `mlp.gate.` | MoE router: tiny logit perturbations cascade into expert misrouting. Already excluded in `NVFP4_AWQ_LITE_CFG`. |
	| `lm_head` | Output projection. Already excluded. |
	| `router`, `block_sparse_moe.gate` | Generic router patterns (covers Mixtral-style MoE architectures). Already excluded. |

	All 128 MoE experts (`model.language_model.layers.*.mlp.experts.*`) and the attention matrices are quantized to NVFP4 weights + NVFP4 activations (W4A4). The `model.visual.*` ViT tower (depth 27, hidden size 1152) stays in BF16.
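How pattern-based exclusions split the model into quantized and BF16 modules can be sketched with standard glob matching. This is a stand-alone illustration, not ModelOpt's matching code: the wildcard forms of the patterns and the module names below are assumptions chosen for the example.

```python
from fnmatch import fnmatchcase

# Assumed glob forms of the exclusion patterns listed in the table above.
EXCLUDE_PATTERNS = [
    "*visual*",
    "*mlp.gate.*",
    "*lm_head*",
    "*router*",
    "*block_sparse_moe.gate*",
]


def stays_bf16(module_name: str) -> bool:
    """Return True if any exclusion pattern matches the module name."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)


# Hypothetical module names, for illustration only.
example_modules = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate.weight_quantizer",
    "model.language_model.layers.0.mlp.experts.3.gate_proj",
    "model.language_model.layers.0.self_attn.q_proj",
    "lm_head",
]

for name in example_modules:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'NVFP4'}")
```

Note that the trailing dot in the router pattern matters: it keeps the router (`mlp.gate.`) out of quantization without also catching the experts' `gate_proj` weights.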

	## Calibration Data

	602 samples of real browser-use agent trajectories:

	| Category (BU_Bench V1) | Tasks | Samples | Weight (rationale) |
	|---|---|---|---|
	| GAIA | 8 | ~200 | Research + reasoning; dominant agent workload |
	| OM2W2 | 6 | ~150 | Open-ended info gathering |
	| BrowseComp | 5 | ~130 | Cross-source comparison |
	| WebBenchREAD | 5 | ~80 | Clean DOM activations |
	| InteractionTests | 1 | ~15 | Signal floor for form/interaction regime |

	Collection process:

	1. Full-precision bu-30b-a3b-preview served via vLLM 0.17 at `--dtype bfloat16`.
	2. 3 parallel `browser-use` v0.12.6 agents with `enable_planning=True` and `use_vision=True` ran 25 tasks sampled from the official [browser-use/benchmark](https://github.com/browser-use/benchmark) BU_Bench V1 set.
	3. Per-category step caps: 40 for GAIA/OM2W2/BrowseComp, 25 for WebBenchREAD/InteractionTests.
	4. A proxy between the agents and vLLM captured every `/v1/chat/completions` request payload (including image parts) to JSONL.
	5. Samples with fewer than 1,000 total tokens (keepalive/error artifacts; 3 samples) or blank screenshots (variance < 150; 16 samples) were filtered out.
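The filtering rule in step 5 might look roughly like the sketch below. The two thresholds come from the text; everything else is an assumption, including the function name `keep_sample`, computing the variance over raw 0-255 pixel values, and keeping samples that carry no screenshot at all.

```python
import numpy as np

MIN_TOTAL_TOKENS = 1000          # below this: keepalive/error artifacts
MIN_SCREENSHOT_VARIANCE = 150.0  # below this: treated as a blank screenshot


def keep_sample(total_tokens, screenshot=None):
    """Decide whether a captured request is kept for calibration.

    `screenshot` is an optional uint8 pixel array (H x W x C).
    """
    if total_tokens < MIN_TOTAL_TOKENS:
        return False
    if screenshot is not None:
        # Assumption: variance is taken over all raw pixel values.
        if float(np.var(screenshot)) < MIN_SCREENSHOT_VARIANCE:
            return False
    return True


blank = np.full((8, 8, 3), 255, dtype=np.uint8)       # all-white image
checker = np.zeros((8, 8, 3), dtype=np.uint8)
checker[::2, ::2] = 255                                # high-contrast pattern

print(keep_sample(12_000, blank))    # blank screenshot is dropped
print(keep_sample(12_000, checker))  # informative screenshot is kept
print(keep_sample(500, checker))     # too few tokens is dropped
```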

	Sample-level statistics (staged calibration, 602 samples, Qwen3-VL tokenizer + true vision-token expansion):

	| Metric | Value |
	|---|---|
	| Total tokens | min=3, p25=11.2K, median=13.4K, p75=15.8K, p90=18.1K, max=35.4K |
	| 8-16K bucket | 439 samples (73%) |
	| 16-32K bucket | 144 samples (24%) |
	| 32K+ samples | 6 (long-context tail) |
	| Samples with screenshot | 93.6% |
	| Non-degenerate screenshots | 97.2% |
	| DOM element count (median / max) | 136 / 941 |
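The bucket rows can be reproduced from per-sample token counts with a few lines of Python. The sketch below assumes bucket edges at 8,000, 16,000 and 32,000 tokens (the README does not say whether "K" means 1,000 or 1,024) and uses synthetic counts that merely match the table's totals; the 13 samples below 8K are implied by 602 - 439 - 144 - 6.

```python
from collections import Counter


def bucket(total_tokens):
    """Assign a sample to a length bucket (edges are assumptions)."""
    if total_tokens < 8_000:
        return "<8K"
    if total_tokens < 16_000:
        return "8-16K"
    if total_tokens < 32_000:
        return "16-32K"
    return "32K+"


def bucket_shares(token_counts):
    """Return {bucket: (sample count, rounded percentage)}."""
    counts = Counter(bucket(n) for n in token_counts)
    total = len(token_counts)
    return {name: (c, round(100 * c / total)) for name, c in counts.items()}


# Synthetic token counts shaped like the table, not the real data.
synthetic = [5_000] * 13 + [12_000] * 439 + [20_000] * 144 + [33_000] * 6
for name, (count, pct) in bucket_shares(synthetic).items():
    print(f"{name}: {count} samples ({pct}%)")
```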

	The calibration distribution was committed to before running the analyzer on the exploratory data — weights reflect the target user population (researchers and educators running a local agent), not post-hoc curve-fitting to whatever tasks happened to look interesting.

	## Serving

	### ⚠ vLLM support

	As of vLLM 0.19.1 / main, the `ModelOpt` quantization loader does not accept `quant_algo: NVFP4_AWQ` — the supported list is only `['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4', 'MXFP8', 'MIXED_PRECISION']`. Renaming the algo to plain `NVFP4` would load but produce mathematically wrong inference because the 18,480 `pre_quant_scale` tensors that carry AWQ's per-channel activation rescaling would not be applied.
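Why dropping `pre_quant_scale` breaks the math can be seen in a toy linear layer. The sketch below ignores 4-bit rounding entirely and assumes one common AWQ convention: weights are stored multiplied by a per-input-channel scale `s`, and the activation is multiplied by `1 / s` before the matmul. The variable names are illustrative, not taken from the checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # activations: (tokens, in_features)
w = rng.normal(size=(3, 8))            # weights: (out_features, in_features)
s = rng.uniform(0.5, 2.0, size=8)      # per-input-channel AWQ scale

w_stored = w * s                       # what the checkpoint would hold
pre_quant_scale = 1.0 / s              # applied to activations at runtime

reference = x @ w.T
with_scale = (x * pre_quant_scale) @ w_stored.T   # rescaling applied
without_scale = x @ w_stored.T                    # loader ignores the scale

err_with = float(np.abs(reference - with_scale).max())
err_without = float(np.abs(reference - without_scale).max())
print("max error with pre_quant_scale:   ", err_with)
print("max error without pre_quant_scale:", err_without)
```

With the scale applied, the two factors cancel and the output matches the unscaled layer up to floating-point error; without it, every input channel is weighted by the wrong factor.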

	If you want a vLLM-loadable variant, use the sibling repo [`Code4me2/bu-30b-a3b-preview-NVFP4`](https://huggingface.co/Code4me2/bu-30b-a3b-preview-NVFP4) (plain NVFP4, no AWQ, slightly lower accuracy but same memory footprint).

	### TensorRT-LLM (recommended)

	This format is produced by and natively supported by [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) + TensorRT-LLM. Build an NVFP4 engine:

	```bash
	trtllm-build --checkpoint_dir Code4me2/bu-30b-a3b-preview-NVFP4-AWQ \
	    --quant_format nvfp4 \
	    --max_seq_len 32768
	```

	See the [TRT-LLM NVFP4 guide](https://nvidia.github.io/TensorRT-LLM/reference/precision.html) for more details.

	### SGLang

	SGLang's ModelOpt integration supports NVFP4_AWQ when built against the matching ModelOpt version — consult their docs for the current status.

	## Intended Use

	This model is a drop-in replacement for `bu-30b-a3b-preview` within the
	[browser-use](https://github.com/browser-use/browser-use) library. It is
	trained/tuned specifically for browser-use's indexed-DOM + structured-action
	format. Using it outside that flow (or with a different harness / freeform
	CDP scripting) will produce substantially worse results than the
	quantization accuracy alone would suggest.

	## Evaluation

	_Evaluation numbers (MMLU, GSM8K, MM-Bench, BU_Bench V1 subset) will be
	added after running against the BF16 baseline. See the planned eval suite below._

	Planned eval suite:
	- MMLU (general knowledge, 5-shot)
	- GSM8K (math reasoning, 0-shot chain-of-thought)
	- MM-Bench (vision-language, 0-shot)
	- BU_Bench V1 held-out tasks (agent-specific, using the same browser-use harness)

	## Reproduction

	- Base model: `browser-use/bu-30b-a3b-preview`
	- Quantization tool: `nvidia-modelopt==0.43.0`
	- Quantization config: `NVFP4_AWQ_LITE_CFG` with `visual` excluded (ViT stays BF16); router (`mlp.gate.`) already excluded by the config default
	- Calibration samples: 512 / 602 (shuffled, seed=42). 6 samples above 32K tokens skipped (aligned with `--max-model-len`)
	- Host: single RTX PRO 6000 Blackwell, 98GB
	- Calibration wall time: ~14h (70 min cache activation stats + 12h AWQ scale search + 10 min export)
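The sample selection described in the list above (skip the over-length tail, shuffle with seed 42, keep 512) could be sketched as follows. The order of filtering and shuffling, the 32,768-token limit, the `total_tokens` field name and the use of Python's `random` module are all assumptions; the actual harness may differ.

```python
import random

MAX_TOKENS = 32_768   # assumed value of --max-model-len
NUM_CALIB = 512       # calibration samples actually used


def select_calibration(samples, seed=42):
    """Drop over-length samples, shuffle deterministically, keep the first 512."""
    eligible = [s for s in samples if s["total_tokens"] <= MAX_TOKENS]
    rng = random.Random(seed)   # local RNG so global state is untouched
    rng.shuffle(eligible)
    return eligible[:NUM_CALIB]


# Synthetic stand-in for the 602 captured samples (6 above the limit).
pool = [{"total_tokens": 10_000 + i} for i in range(596)]
pool += [{"total_tokens": 35_000} for _ in range(6)]

chosen = select_calibration(pool)
print(len(chosen), "samples selected")
```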

	### ModelOpt patch for Qwen3-VL-MoE support

	ModelOpt 0.43 does not natively know how to export quantized checkpoints for `Qwen3VLMoeForConditionalGeneration`. Three patches were required (included in the model repo as `modelopt_patch.py`):

	1. `get_expert_linear_names()` in `layer_utils.py` — recognize `Qwen3VLMoe*` and return `[gate_proj, up_proj, down_proj]`
	2. `get_experts_list()` in `layer_utils.py` — recognize `qwen3vlmoe*` model_type
	3. `_export_transformers_checkpoint()` in `unified_export_hf.py` — wrap the `QuantQwen3VLMoeTextExperts` container with a transparent iterable proxy so the existing iterable dispatch walks the un-BMM'd per-expert `ModuleList`s, while `__call__` and attribute access still delegate to the real experts module for the internal dummy forward pass
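The "transparent iterable proxy" idea in patch 3 can be illustrated in plain Python. This is a simplified sketch of the pattern, not the contents of `modelopt_patch.py`: the class name `IterableExpertsProxy` and the stand-in `FakeExperts` module are hypothetical.

```python
class IterableExpertsProxy:
    """Iterate over per-expert modules while delegating everything else."""

    def __init__(self, experts, per_expert_modules):
        self._experts = experts
        self._per_expert_modules = list(per_expert_modules)

    def __iter__(self):
        # Export code that walks experts one by one sees the per-expert list.
        return iter(self._per_expert_modules)

    def __len__(self):
        return len(self._per_expert_modules)

    def __call__(self, *args, **kwargs):
        # Forward passes still run through the real experts module.
        return self._experts(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached when normal lookup fails; delegate to the wrapped
        # module. Reading __dict__ directly avoids infinite recursion.
        experts = self.__dict__.get("_experts")
        if experts is None:
            raise AttributeError(name)
        return getattr(experts, name)


class FakeExperts:
    """Stand-in for a fused experts container (illustration only)."""

    num_experts = 2

    def __call__(self, x):
        return x * 2


proxy = IterableExpertsProxy(FakeExperts(), ["expert_0", "expert_1"])
print(list(proxy), proxy(3), proxy.num_experts)
```

The design choice is that code expecting a list of experts and code expecting a callable module can both use the same object, so the existing export dispatch does not need a separate branch.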

	Reference code + calibration harness: [GitHub link TBD]

	## Attribution & License

	Derived from [`browser-use/bu-30b-a3b-preview`](https://huggingface.co/browser-use/bu-30b-a3b-preview), which is distributed under a Modified MIT License by Browser Use Inc. with a commercial-use restriction: use is not permitted for organizations whose annual consolidated revenue exceeds USD 1 million for the preceding month. That restriction propagates to this derivative. Commercial users above the revenue threshold must obtain a license from Browser Use Inc. (`support@browser-use.com`) or use Browser Use's hosted services.

	The original LICENSE file is included alongside the weights.

	## Acknowledgements

	- Browser Use for the base model and the open benchmark suite
	- NVIDIA Model Optimizer for the NVFP4_AWQ calibration tooling
	- Qwen team for the Qwen3-VL-MoE architecture