Mega-ASR — NVFP4 AWQ-Lite (NVIDIA Blackwell)

NVFP4 (4-bit floating point: E2M1 elements with per-block FP8 scaling) deployment of the LLM portion of zhifeixie/Mega-ASR, quantized with NVIDIA Model Optimizer (nvidia-modelopt) using the NVFP4_AWQ_LITE_CFG activation-aware recipe.

Targets RTX 50-series (Blackwell) for native NVFP4 GEMM acceleration. Earlier Ada/Hopper GPUs can run the same checkpoint via the modelopt fake-quant simulation (correctness preserved, no perf win without Blackwell NVFP4 tensor cores).

What's in this repo

File	Size	Role
`nvfp4/model.safetensors`	3.44 GB	Qwen3 1.7B LLM, NVFP4 weights + AWQ-Lite scaling factors. Saved with `model.save_pretrained` after modelopt quantization: the weights are stored in their original bf16 layout alongside the quantization scale tensors, and the runtime packs them to NVFP4 on first forward.
`nvfp4/config.json` + tokenizer/*	—	HF config + Qwen3-ASR tokenizer (with `<|audio_pad|>`, `<asr_text>`, etc.)
`onnx/audio_encoder_fp32.onnx`	1.27 GB	24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime; NVFP4 port not done — the encoder is small enough that it doesn't benefit much from FP4)
`examples/*.wav`	~3 MB	8 noisy benchmark clips from Voices-in-the-Wild-Bench
`nvfp4_quantize.py`	—	The PTQ script (modelopt forward-loop calibration)
`inference_bench.py`	—	End-to-end ASR pipeline + 8-clip VITW bench

Quality (bench)

8-clip Voices-in-the-Wild-Bench agreement (1 − WER), prompt forced to language English, run on the RTX 5080 Laptop (Blackwell, compute_cap 12.0). Same ONNX fp32 audio encoder as the other backends:

Per-sample	NVFP4 (this repo)	ONNX GPTQ	MLX mixed 8/4	CoreML mixed 8/4
distortion	100%	100%	100%	100%
dropout	100%	100%	100%	100%
echo (hard, reverb)	64.7%	82.4%	64.7%	64.7%
far_field	100%	100%	100%	100%
mixed	100%	100%	100%	100%
noise	100%	100%	100%	100%
obstructed	100%	100%	94.1%	100%
recording (hard, truncated)	66.7%	60.0%	60.0%	60.0%
AVERAGE	91.4%	92.7%	92.2%	90.6%

Notable: NVFP4 ties or beats every other backend on the six easier clips and leads on recording by 6.7 points (66.7% vs 60.0% for all three others), plausibly because the AWQ-Lite activation-aware scaling helped the truncated-audio decode. It ties MLX and CoreML on echo. The 1.3-point average gap to ONNX GPTQ comes from echo (64.7% vs 82.4%), partly offset by the recording win; on echo, GPTQ's per-column Hessian-based error redistribution appears to capture something that AWQ-Lite's per-channel scaling does not.
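For readers who want to reproduce the agreement metric, here is a minimal sketch of 1 − WER using a word-level edit distance. It assumes lowercase, whitespace-split tokens; `inference_bench.py` may normalize text differently (punctuation, numerals), so treat the function names and normalization here as illustrative rather than as the script's actual implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the table: 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


print(agreement("the cat sat on the mat", "the cat sat on a mat"))
```

Because each clip has only a handful of reference words, a single substituted or dropped word moves a per-clip score by several points.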

How NVFP4 works (quick)

Weights are stored as E2M1 (1 sign + 2 exponent + 1 mantissa = 4 bits, representing values in {±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}).
Every block of 16 consecutive weight elements shares one FP8 (E4M3) scaling factor (saved alongside the E2M1 values; ~0.5 extra bits/weight for scale storage).
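A second-level per-tensor FP32 scale, derived from the tensor's amax, keeps the per-block scales within FP8 (E4M3) range.
Inference: load E2M1 weights → multiply by the per-block FP8 scale → multiply by the per-tensor FP32 scale → fp16/bf16 GEMM. That is the dequantize-then-GEMM (fake-quant) view; Blackwell tensor cores consume the E2M1 values and block scales directly, so the scaling adds little overhead on top of the FP4 multiplies.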
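The two-level scaling above can be sketched as a NumPy quantize-dequantize round trip. This is an illustration under simplifying assumptions, not modelopt's kernel: the per-block scale is kept in float instead of being rounded to E4M3, and the function name and return values are hypothetical.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (the sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
BLOCK = 16         # elements sharing one block scale


def nvfp4_fake_quant(w):
    """Quantize-dequantize a 1-D array whose length is a multiple of 16."""
    w = np.asarray(w, dtype=np.float32)
    blocks = w.reshape(-1, BLOCK)

    # Per-tensor FP32 scale: chosen so the largest block scale lands at E4M3_MAX.
    tensor_amax = np.abs(w).max()
    tensor_scale = tensor_amax / (E2M1_MAX * E4M3_MAX) if tensor_amax > 0 else 1.0

    # Per-block scale, expressed in units of tensor_scale.
    # Simplification: a real implementation would round this to FP8 E4M3.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = block_amax / E2M1_MAX / tensor_scale

    # Effective scale per block; all-zero blocks get a dummy scale of 1.
    eff = block_scale * tensor_scale
    eff = np.where(eff == 0, 1.0, eff)

    # Snap each scaled element to the nearest E2M1 magnitude, keeping its sign.
    scaled = blocks / eff
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * E2M1_GRID[idx]

    return (codes * eff).reshape(w.shape), codes, block_scale


w = np.linspace(-1.0, 1.0, 32, dtype=np.float32)
w_hat, codes, block_scales = nvfp4_fake_quant(w)
print("max abs error:", np.abs(w - w_hat).max())
```

The widest gap in the grid is between 4 and 6, so the worst-case rounding error inside a block is one sixth of that block's largest magnitude.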

The AWQ-Lite variant adds a calibration pass that measures per-channel activation magnitudes and then rescales weights against activations, so that the "important" channels (those with large activation amplitudes) get more of the dynamic range before NVFP4 is applied. The net effect is to recover some of the quality lost to the coarse E2M1 grid.
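The rescaling idea can be illustrated with a generic AWQ-style sketch. It is not modelopt's AWQ-Lite code: the `alpha` exponent, the normalization, and the function name are assumptions made for the example. The point it demonstrates is that scaling weights up and activations down by the same per-channel factor leaves the layer output unchanged before quantization, so the factor can be chosen purely to reduce quantization error.

```python
import numpy as np


def awq_style_scales(act_amax, alpha=0.5):
    """Per-input-channel scales s_c = amax_c ** alpha, normalized around 1.

    alpha is a hypothetical hyperparameter; AWQ-style methods typically
    search over it to minimize the quantized layer's output error.
    """
    s = np.power(np.maximum(act_amax, 1e-8), alpha)
    return s / np.sqrt(s.max() * s.min())


rng = np.random.default_rng(0)
# Four input channels; channel 1 carries much larger activations.
X = rng.normal(size=(8, 4)) * np.array([1.0, 10.0, 0.1, 1.0])
W = rng.normal(size=(4, 3))  # (in_features, out_features)

s = awq_style_scales(np.abs(X).max(axis=0))
W_scaled = W * s[:, None]  # weights of large-activation channels are scaled up
X_scaled = X / s           # activations are scaled down by the same factor

# The product is unchanged in full precision; only the quantization error differs.
print(np.allclose(X @ W, X_scaled @ W_scaled))
```

After this step, the scaled weights would be quantized to NVFP4 and the inverse scale folded into the preceding layer or applied to the activations at runtime.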

Inference

Stage 1: PyTorch + modelopt (fake-quant, works on any GPU)

pip install nvidia-modelopt transformers safetensors torch onnxruntime soundfile librosa
git clone https://huggingface.co/Reza2kn/mega-asr-nvfp4
cd mega-asr-nvfp4
python inference_bench.py \
    --model nvfp4 \
    --encoder onnx/audio_encoder_fp32.onnx \
    --examples-dir examples \
    --qwen-asr-dir <Qwen3-ASR-1.7B HF dir> \
    --skip-quant      # weights already quantized

Stage 2: TensorRT-LLM engine (native Blackwell NVFP4)

# Convert HF checkpoint → TensorRT-LLM checkpoint
python -m tensorrt_llm.examples.qwen.convert_checkpoint \
    --model_dir nvfp4 --output_dir trtllm_ckpt \
    --dtype bfloat16 --use_fp4

# Build engine
trtllm-build --checkpoint_dir trtllm_ckpt --output_dir trtllm_engine \
    --gemm_plugin fp4 --max_input_len 512 --max_seq_len 600

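(The TRT-LLM engine path is on the roadmap, and the commands above are an untested sketch of the intended flow; module paths and flags may differ across TensorRT-LLM versions. This repo currently ships the modelopt-saved HF checkpoint, which runs as fake-quant on any GPU.)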

Conversion details

import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Load the LLM portion in bf16 on the GPU
model = AutoModelForCausalLM.from_pretrained("Qwen3-ASR-1.7B-LLM",
                                              torch_dtype=torch.bfloat16,
                                              device_map="cuda")
# Calibration with 168 English VITW samples (audio embeds scattered at
# <|audio_pad|> positions; same set used for the ONNX GPTQ release)
calib_batches = build_calibration_batches(...)

def forward_loop(m):
    # modelopt calls this to run the calibration data through the model
    with torch.no_grad():
        for b in calib_batches:
            m(**b)

# Apply the NVFP4 AWQ-Lite recipe, then save an HF-style checkpoint
mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, forward_loop)
model.save_pretrained("nvfp4")

168 calibration batches, ~3 min on the RTX 5080. The AWQ-Lite recipe runs two forward passes per batch, one to estimate activation magnitudes and one for the quantization apply step, which is why the log shows twice as many forward passes as there are batches.
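The calibration comment mentions audio embeddings being scattered at `<|audio_pad|>` positions. A minimal NumPy sketch of that step is below; it is an assumption about what `build_calibration_batches` does internally, and the function name, the token id `99`, and the tensor shapes are all hypothetical.

```python
import numpy as np


def scatter_audio_embeds(token_embeds, input_ids, audio_embeds, audio_pad_id):
    """Replace the embedding at every <|audio_pad|> position with an audio frame.

    token_embeds: (seq_len, hidden) text-token embeddings
    input_ids:    (seq_len,) token ids
    audio_embeds: (num_frames, hidden) audio encoder outputs, in order
    """
    mask = input_ids == audio_pad_id
    if int(mask.sum()) != audio_embeds.shape[0]:
        raise ValueError("number of <|audio_pad|> tokens must equal number of audio frames")
    merged = token_embeds.copy()
    merged[mask] = audio_embeds  # frames fill the pad slots left to right
    return merged


AUDIO_PAD_ID = 99  # hypothetical id; read the real one from the tokenizer
input_ids = np.array([1, AUDIO_PAD_ID, AUDIO_PAD_ID, AUDIO_PAD_ID, 2])
token_embeds = np.zeros((5, 4), dtype=np.float32)
audio_embeds = np.ones((3, 4), dtype=np.float32)

merged = scatter_audio_embeds(token_embeds, input_ids, audio_embeds, AUDIO_PAD_ID)
print(merged[:, 0])
```

Calibrating on merged embeddings rather than text-only inputs matters because the activation statistics AWQ-Lite collects should reflect what the LLM sees at inference time.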

Why NVFP4 (vs INT4 / FP8)?

vs INT4 (e.g., GPTQ): NVFP4's exponent bits handle the wide activation dynamic range in transformer MLPs better than INT4's uniform grid. On Blackwell tensor cores, peak NVFP4 GEMM throughput is nominally 2× FP8 and 4× FP16.
vs FP8: roughly half the weight memory traffic (about 4.5 vs 8 bits/weight once block scales are counted). NVFP4 with AWQ-Lite typically lands within 0.3-0.6 PPL of FP8 on Llama-class models.
vs MXFP4 (the OCP Microscaling format; same E2M1 elements, different block scaling): NVFP4 uses a smaller block (16 vs 32) and FP8 E4M3 scales instead of power-of-two E8M0 scales, giving tighter per-block quantization at slightly larger overhead.
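The storage overheads in the comparison above follow from simple arithmetic, sketched here. The helper name is illustrative, and the per-tensor FP32 scale is ignored because it is negligible when amortized over a whole tensor.

```python
def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Effective storage per weight: element bits plus the amortized block scale."""
    return element_bits + scale_bits / block_size


formats = {
    "NVFP4 (E2M1 + E4M3 scale per 16)": bits_per_weight(4, 8, 16),
    "MXFP4 (E2M1 + E8M0 scale per 32)": bits_per_weight(4, 8, 32),
    "FP8 (per-tensor scale, amortized to ~0)": 8.0,
}
for name, bits in formats.items():
    print(f"{name}: {bits:.2f} bits/weight")
# NVFP4 (E2M1 + E4M3 scale per 16): 4.50 bits/weight
# MXFP4 (E2M1 + E8M0 scale per 32): 4.25 bits/weight
# FP8 (per-tensor scale, amortized to ~0): 8.00 bits/weight
```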

Companion repos

Reza2kn/mega-asr-onnx — full ONNX pipeline (GPTQ-INT4, 92.7%)
Reza2kn/mega-asr-mlx — MLX 4-bit (mixed 8/4, 92.2%)
Reza2kn/mega-asr-coreml — CoreML 4-bit (mixed 8/4, 90.6%)
Reza2kn/mega-asr-bench — browser demo (WebGPU)

Credits

Original model: zhifeixie/Mega-ASR (1.7B, Apache-2.0)
NVFP4 PTQ via NVIDIA TensorRT-Model-Optimizer v0.44
Benchmark: Voices-in-the-Wild-Bench

Model tree: Qwen/Qwen3-ASR-1.7B (base model) → zhifeixie/Mega-ASR (finetuned) → Reza2kn/mega-asr-nvfp4 (quantized, this model)