
	---
	license: apache-2.0
	language:
	- en
	- zh
	- ja
	- ko
	- multilingual
	library_name: modelopt
	tags:
	- nvidia
	- nvfp4
	- blackwell
	- rtx-50
	- modelopt
	- tensorrt-llm
	- automatic-speech-recognition
	- asr
	- speech-recognition
	- robust-asr
	- quantized
	- 4bit
	- awq
	- e2m1
	- qwen3
	- qwen3-asr
	- mega-asr
	pipeline_tag: automatic-speech-recognition
	base_model: zhifeixie/Mega-ASR
	base_model_relation: quantized
	---

	# Mega-ASR — NVFP4 AWQ-Lite (NVIDIA Blackwell)

	[NVFP4](https://developer.nvidia.com/blog/nvfp4-new-4-bit-floating-point-format/)
	(4-bit floating point: E2M1 elements with per-block FP8 scaling) deployment of
	the LLM portion of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
	quantized via [NVIDIA Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
	(`nvidia-modelopt`) with the `NVFP4_AWQ_LITE_CFG` activation-aware recipe.

	Targets RTX 50-series (Blackwell) for native NVFP4 GEMM acceleration.
	Earlier Ada/Hopper GPUs can run the same checkpoint via the modelopt
	fake-quant simulation (correctness preserved, no perf win without Blackwell
	NVFP4 tensor cores).

	## What's in this repo

	| File | Size | Role |
	| --- | ---: | --- |
	| `nvfp4/model.safetensors` | 3.44 GB | Qwen3 1.7B LLM, NVFP4 weights + AWQ-Lite scaling factors. Saved via `modelopt.torch.save_pretrained`; the weights are stored in their original bf16 layout alongside the quantization scale tensors, and the runtime packs them to NVFP4 on first forward. |
	| `nvfp4/config.json` + tokenizer/* | — | HF config + Qwen3-ASR tokenizer (with `<\|audio_pad\|>`, `<asr_text>`, etc.) |
	| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime; NVFP4 port not done, since the encoder is small enough that it doesn't benefit much from FP4) |
	| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
	| `nvfp4_quantize.py` | — | The PTQ script (modelopt forward-loop calibration) |
	| `inference_bench.py` | — | End-to-end ASR pipeline + 8-clip VITW bench |

	## Quality (bench)

	8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
	agreement (1 − WER), prompt forced to `language English`, run on the RTX 5080
	Laptop (Blackwell, compute_cap 12.0). Same ONNX fp32 audio encoder as the
	other backends:

	| Per-sample | NVFP4 (this repo) | ONNX GPTQ | MLX mixed 8/4 | CoreML mixed 8/4 |
	| --- | ---: | ---: | ---: | ---: |
	| distortion | 100% | 100% | 100% | 100% |
	| dropout | 100% | 100% | 100% | 100% |
	| echo (hard, reverb) | 64.7% | 82.4% | 64.7% | 64.7% |
	| far_field | 100% | 100% | 100% | 100% |
	| mixed | 100% | 100% | 100% | 100% |
	| noise | 100% | 100% | 100% | 100% |
	| obstructed | 100% | 100% | 94.1% | 100% |
	| recording (hard, truncated) | 66.7% | 60.0% | 60.0% | 60.0% |
	| AVERAGE | 91.4% | 92.7% | 92.2% | 90.6% |

	Notable: NVFP4 ties or beats the other backends on every sample not marked
	hard, **wins on `recording` by 6.7 points** (66.7% vs 60.0% everywhere else;
	the AWQ-Lite activation-aware scaling may have helped recover the
	truncated-audio decode), and ties MLX/CoreML on `echo`. The 1.3-point average
	gap to ONNX GPTQ comes from `echo` (64.7% vs 82.4%), the only sample where
	NVFP4 trails it; GPTQ's per-column Hessian-based error redistribution appears
	to capture something there that AWQ-Lite's per-channel scaling does not.
	With 8 clips, each of these differences rests on a single clip.
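
The agreement score used in the table (1 − WER) can be sketched as follows. This is a minimal illustration, not the code in `inference_bench.py`: the function names and the text normalization (lowercasing and whitespace splitting only) are assumptions, and the bench script may normalize differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as reported in the table: 1 - WER, floored at 0."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

Because the score is computed per clip, it moves in steps of one word: for example, a 17-word reference with 6 word errors scores 1 − 6/17 ≈ 64.7%.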

	## How NVFP4 works (quick)

	- Weights are stored as E2M1 (1 sign + 2 exponent + 1 mantissa = 4 bits,
	representing values in {±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}).
	- Every block of 16 consecutive weight elements shares one **FP8 (E4M3)
	scaling factor** (saved alongside the E2M1 values; ~0.5 extra bits/weight
	for scale storage).
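	- A second-level per-tensor FP32 scale (derived from the tensor's amax)
	brings the per-block scales into FP8 range.
	- Inference: conceptually, each weight is dequantized as E2M1 value ×
	per-block FP8 scale × per-tensor scale and then used in an fp16/bf16 GEMM
	(this is what the fake-quant path emulates). Blackwell's tensor cores consume
	the E2M1 values and block scales directly, so no separate dequantization
	pass is needed.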
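
The block scheme in the list above can be simulated in a few lines of NumPy. This is a sketch under simplifying assumptions, not modelopt's implementation: the per-block scale is kept in float32 rather than rounded to FP8 E4M3, the per-tensor scale is omitted, and `fake_quant_nvfp4` is a hypothetical name.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (the sign bit is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16


def fake_quant_nvfp4(w: np.ndarray) -> np.ndarray:
    """Quantize-dequantize a 1-D weight vector block by block (simulation only)."""
    assert w.size % BLOCK == 0, "length must be a multiple of the block size"
    blocks = w.astype(np.float32).reshape(-1, BLOCK)
    # One scale per block maps the block's largest magnitude onto 6.0,
    # the largest E2M1 value. Real NVFP4 stores this scale in FP8 E4M3;
    # here it stays in float32 (assumption made for brevity).
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    scaled = blocks / scale
    # Round each scaled magnitude to the nearest grid point.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]
    return (quantized * scale).reshape(w.shape)


# Usage: measure the worst-case round-trip error on random weights.
rng = np.random.default_rng(0)
weights = rng.normal(size=64).astype(np.float32)
max_err = float(np.abs(fake_quant_nvfp4(weights) - weights).max())
```

The grid is denser near zero than near ±6, so the largest rounding error in a block occurs between 4 and 6 in scaled units, which is why a tight per-block scale matters.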

	The AWQ-Lite variant runs an extra calibration pass that measures
	per-channel activation magnitudes, then scales the weight input channels up
	and the matching activations down by the same factor. This puts more of the
	dynamic range into the "important" channels (those with large activation
	amplitudes) before NVFP4 is applied; the net effect is recovering some of
	the quality lost to the E2M1 grid.
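
The scaling idea can be illustrated with a small NumPy search in the spirit of AWQ. This is a sketch of the general technique, not modelopt's AWQ-Lite code: the function names, the alpha grid, the scale normalization, and the simplified block quantizer (float32 scales) are all assumptions.

```python
import numpy as np

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def fake_quant(w: np.ndarray, block: int = 16) -> np.ndarray:
    """Block-wise E2M1 quantize-dequantize along the input dimension."""
    b = w.reshape(-1, block)
    amax = np.abs(b).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / GRID[-1], 1.0)
    idx = np.abs(np.abs(b / scale)[..., None] - GRID).argmin(axis=-1)
    return (np.sign(b) * GRID[idx] * scale).reshape(w.shape)


def awq_style_search(w, x, alphas=np.linspace(0.0, 1.0, 11)):
    """Pick the exponent alpha whose channel scaling minimizes output error.

    w: (out_features, in_features) weights; x: (tokens, in_features) activations.
    Returns (alpha, mean squared output error, per-channel scale).
    """
    act_mag = np.abs(x).mean(axis=0) + 1e-8   # per-input-channel magnitude
    reference = x @ w.T                        # full-precision output
    best = (None, np.inf, None)
    for alpha in alphas:
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())     # keep scales centered around 1
        # Scale weights up and activations down by the same factor:
        # (x / s) @ (w * s).T equals x @ w.T before quantization.
        out = (x / s) @ fake_quant(w * s).T
        err = float(np.mean((out - reference) ** 2))
        if err < best[1]:
            best = (float(alpha), err, s)
    return best
```

With `alpha = 0` the scale is all ones, so the search can never do worse than plain quantization on the calibration data; larger `alpha` shifts more precision toward channels with large activations.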

	## Inference

	### Stage 1: PyTorch + modelopt (fake-quant, works on any GPU)

	```bash
	pip install nvidia-modelopt transformers safetensors torch onnxruntime soundfile librosa
	git clone https://huggingface.co/Reza2kn/mega-asr-nvfp4
	cd mega-asr-nvfp4
	python inference_bench.py \
	--model nvfp4 \
	--encoder onnx/audio_encoder_fp32.onnx \
	--examples-dir examples \
	--qwen-asr-dir <Qwen3-ASR-1.7B HF dir> \
	--skip-quant # weights already quantized
	```

	### Stage 2: TensorRT-LLM engine (native Blackwell NVFP4)

	```bash
	# Convert HF checkpoint → TensorRT-LLM checkpoint
	python -m tensorrt_llm.examples.qwen.convert_checkpoint \
	--model_dir nvfp4 --output_dir trtllm_ckpt \
	--dtype bfloat16 --use_fp4

	# Build engine
	trtllm-build --checkpoint_dir trtllm_ckpt --output_dir trtllm_engine \
	--gemm_plugin fp4 --max_input_len 512 --max_seq_len 600
	```

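	(The TRT-LLM engine path is on the roadmap, so the commands above are an
	outline of the intended workflow; check the flag names against your
	TensorRT-LLM version. This repo currently ships the modelopt-saved HF
	checkpoint, which runs as fake-quant on any GPU.)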

	## Conversion details

	```python
	import torch
	import modelopt.torch.quantization as mtq
	from transformers import AutoModelForCausalLM

	model = AutoModelForCausalLM.from_pretrained(
	    "Qwen3-ASR-1.7B-LLM",
	    torch_dtype=torch.bfloat16,
	    device_map="cuda",
	)

	# Calibration with 168 English VITW samples (audio embeds scattered at
	# <|audio_pad|> positions; same set used for the ONNX GPTQ release).
	# build_calibration_batches is a helper that is not shown here.
	calib_batches = build_calibration_batches(...)


	def forward_loop(m):
	    # Called by mtq.quantize to push the calibration data through the model.
	    with torch.no_grad():
	        for b in calib_batches:
	            m(**b)


	mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, forward_loop)
	model.save_pretrained("nvfp4")
	```

	168 calibration batches, ~3 min on the RTX 5080. The AWQ-Lite recipe does
	two forward passes per batch — one for activation magnitude estimation,
	one for the actual quantization apply step — explaining the doubled count
	in the log.

	## Why NVFP4 (vs INT4 / FP8)?

	- vs INT4 (e.g., GPTQ): NVFP4's exponent bits handle the wide activation
	dynamic range in transformer MLPs better than INT4's linear grid. On
	Blackwell tensor cores, peak NVFP4 GEMM throughput is roughly 2× FP8 and 4× FP16.
	- vs FP8: about half the weight memory traffic (4 vs 8 bits/weight, before
	block-scale overhead). NVFP4 with
	AWQ-Lite typically lands within 0.3-0.6 PPL of FP8 on Llama-class models.
	- vs MXFP4 (the OCP Microscaling format, same E2M1 elements with different
	block sizing): NVFP4 uses a smaller block (16 vs 32) and FP8 E4M3 scales
	instead of power-of-two E8M0 scales, giving tighter per-block quantization
	at slightly larger scale overhead.
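
The storage overhead of each format follows directly from the block sizes above. The sketch below works through the arithmetic; `bits_per_weight` is a hypothetical helper, the 1.7B parameter count is nominal, and the estimate ignores the per-tensor scale and any layers kept in higher precision.

```python
def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Effective storage per weight: element bits plus amortized block-scale bits."""
    return element_bits + scale_bits / block_size


formats = {
    "NVFP4 (E2M1, E4M3 scale per 16)": bits_per_weight(4, 8, 16),
    "MXFP4 (E2M1, E8M0 scale per 32)": bits_per_weight(4, 8, 32),
    "FP8": bits_per_weight(8, 0, 1),
    "BF16": bits_per_weight(16, 0, 1),
}

n_params = 1.7e9  # nominal parameter count (assumption for illustration)
for name, bits in formats.items():
    size_gb = n_params * bits / 8 / 1e9
    print(f"{name}: {bits:.2f} bits/weight, ~{size_gb:.2f} GB")
```

NVFP4 pays 0.5 bits/weight for its scales against 0.25 for MXFP4, which is the "slightly larger overhead" traded for finer-grained scaling.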

	## Companion repos

	- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx) — full ONNX pipeline (GPTQ-INT4, 92.7%)
	- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) — MLX 4-bit (mixed 8/4, 92.2%)
	- [Reza2kn/mega-asr-coreml](https://huggingface.co/Reza2kn/mega-asr-coreml) — CoreML 4-bit (mixed 8/4, 90.6%)
	- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench) — browser demo (WebGPU)

	## Credits

	- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
	- NVFP4 PTQ via [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) v0.44
	- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)