---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: modelopt
tags:
- nvidia
- nvfp4
- blackwell
- rtx-50
- modelopt
- tensorrt-llm
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- 4bit
- awq
- e2m1
- qwen3
- qwen3-asr
- mega-asr
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---

# Mega-ASR: NVFP4 AWQ-Lite (NVIDIA Blackwell)

[NVFP4](https://developer.nvidia.com/blog/nvfp4-new-4-bit-floating-point-format/) (4-bit floating point: E2M1 elements with per-block FP8 scaling) deployment of the LLM portion of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR), quantized via [NVIDIA Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (`nvidia-modelopt`) with the `NVFP4_AWQ_LITE_CFG` activation-aware recipe.

**Targets RTX 50-series (Blackwell) for native NVFP4 GEMM acceleration.** Earlier Ada/Hopper GPUs can run the same checkpoint via the modelopt fake-quant simulation (correctness is preserved, but there is no speedup without Blackwell NVFP4 tensor cores).

## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `nvfp4/model.safetensors` | **3.44 GB** | Qwen3 1.7B LLM, NVFP4 weights + AWQ-Lite scaling factors. Saved via `modelopt.torch.save_pretrained`; the weights are stored in their original bf16 layout alongside the quantization scale tensors, and the runtime packs them to NVFP4 on first forward. |
| `nvfp4/config.json` + tokenizer/* | - | HF config + Qwen3-ASR tokenizer (with `<\|audio_pad\|>` and other special tokens) |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime). No NVFP4 port: the encoder is small enough that it would gain little from FP4. |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `nvfp4_quantize.py` | - | The PTQ script (modelopt forward-loop calibration) |
| `inference_bench.py` | - | End-to-end ASR pipeline + 8-clip VITW bench |

## Quality (bench)

8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) agreement (1 − WER), prompt forced to `language English`, run on an RTX 5080 Laptop GPU (Blackwell, compute capability 12.0). All backends use the same ONNX fp32 audio encoder:

| Sample | NVFP4 (this repo) | ONNX GPTQ | MLX mixed 8/4 | CoreML mixed 8/4 |
| --- | ---: | ---: | ---: | ---: |
| distortion | 100% | 100% | 100% | 100% |
| dropout | 100% | 100% | 100% | 100% |
| echo (hard, reverb) | 64.7% | 82.4% | 64.7% | 64.7% |
| far_field | 100% | 100% | 100% | 100% |
| mixed | 100% | 100% | 100% | 100% |
| noise | 100% | 100% | 100% | 100% |
| obstructed | 100% | 100% | 94.1% | 100% |
| recording (hard, truncated) | **66.7%** | 60.0% | 60.0% | 60.0% |
| **AVERAGE** | **91.4%** | **92.7%** | **92.2%** | **90.6%** |

Notable: NVFP4 matches or beats the other backends on the six samples not marked hard, **wins on `recording` by 6.7 points** (66.7% vs 60.0% for every other backend; the AWQ-Lite activation-aware scaling appears to help recover the truncated-audio decode), and ties MLX/CoreML on `echo`. The 1.3-point average gap to ONNX GPTQ comes from `echo` (64.7% vs 82.4%), where GPTQ's per-column Hessian-based error redistribution seems to capture something AWQ-Lite's per-channel scaling doesn't.

## How NVFP4 works (quick)

- Weights are stored as **E2M1** (1 sign + 2 exponent + 1 mantissa = 4 bits, representing values in {±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}).
- Every block of **16 consecutive weight elements** shares one **FP8 (E4M3) scaling factor** (saved alongside the E2M1 values; 8 bits per 16 weights, i.e. ~0.5 extra bits/weight for scale storage).
- A second **per-tensor FP32 amax** rescales the per-block scales into FP8 range.
- Inference: load E2M1 weights → multiply by per-block FP8 scale → multiply by per-tensor amax → fp16/bf16 GEMM. Blackwell's tensor cores apply this scaling natively, at roughly the cost of the FP4 multiplies themselves.

The AWQ-Lite variant runs an extra pass that computes per-channel activation magnitudes and rescales weights against activations, so that more of the dynamic range goes to the "important" channels (those with large activation amplitudes) before NVFP4 is applied. The net effect is to recover some of the quality lost to the E2M1 grid.

## Inference

### Stage 1: PyTorch + modelopt (fake-quant, works on any GPU)

```bash
pip install nvidia-modelopt transformers safetensors torch onnxruntime soundfile librosa
git clone https://huggingface.co/Reza2kn/mega-asr-nvfp4
cd mega-asr-nvfp4
python inference_bench.py \
  --model nvfp4 \
  --encoder onnx/audio_encoder_fp32.onnx \
  --examples-dir examples \
  --qwen-asr-dir \
  --skip-quant  # weights already quantized
```

### Stage 2: TensorRT-LLM engine (native Blackwell NVFP4)

```bash
# Convert HF checkpoint → TensorRT-LLM checkpoint
python -m tensorrt_llm.examples.qwen.convert_checkpoint \
  --model_dir nvfp4 --output_dir trtllm_ckpt \
  --dtype bfloat16 --use_fp4

# Build engine
trtllm-build --checkpoint_dir trtllm_ckpt --output_dir trtllm_engine \
  --gemm_plugin fp4 --max_input_len 512 --max_seq_len 600
```

(The TRT-LLM engine path is on the roadmap; this repo currently ships the modelopt-saved HF checkpoint, which runs as fake-quant on any GPU.)

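
As a rough illustration of the block-scaled format described under "How NVFP4 works", the sketch below fake-quantizes weights in blocks of 16 using pure Python. It is not modelopt's implementation: the function names are hypothetical, the per-block scale is kept as a plain float instead of being rounded to FP8 (E4M3), and the per-tensor rescaling step is left out.

```python
# E2M1 representable magnitudes (the sign bit is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def quantize_block(block):
    """Fake-quantize one block; returns (scale, dequantized values)."""
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the top of the grid (6.0).
    # Assumption: real NVFP4 would round this scale to FP8 E4M3.
    scale = amax / E2M1_GRID[-1]
    out = []
    for w in block:
        # Snap |w| / scale to the nearest grid point, then restore sign and scale.
        q = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append((q if w >= 0 else -q) * scale)
    return scale, out


def fake_quantize(weights):
    """Apply quantize_block to consecutive blocks of BLOCK_SIZE weights."""
    result = []
    for i in range(0, len(weights), BLOCK_SIZE):
        _, deq = quantize_block(weights[i:i + BLOCK_SIZE])
        result.extend(deq)
    return result


if __name__ == "__main__":
    weights = [0.01 * (i - 8) for i in range(16)]
    deq = fake_quantize(weights)
    # Worst-case rounding error within this single block.
    print(max(abs(a - b) for a, b in zip(weights, deq)))
```

Because each block gets its own scale, one large outlier only coarsens the grid for the 16 weights around it, not for the whole tensor.
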
## Conversion details

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen3-ASR-1.7B-LLM", torch_dtype=torch.bfloat16, device_map="cuda"
)

# Calibration with 168 English VITW samples (audio embeds scattered at
# <|audio_pad|> positions; same set used for the ONNX GPTQ release)
calib_batches = build_calibration_batches(...)

def forward_loop(m):
    # Run every calibration batch through the model so modelopt can
    # collect activation statistics; no gradients are needed.
    for b in calib_batches:
        with torch.no_grad():
            m(**b)

mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, forward_loop)
model.save_pretrained("nvfp4")
```

168 calibration batches, ~3 min on the RTX 5080. The AWQ-Lite recipe does **two** forward passes per batch (one for activation magnitude estimation, one for the actual quantization apply step), which explains the doubled count in the log.

## Why NVFP4 (vs INT4 / FP8)?

- **vs INT4 (e.g., GPTQ)**: NVFP4's exponent bits handle the wide activation dynamic range in transformer MLPs better than INT4's linear grid. On Blackwell tensor cores, peak NVFP4 GEMM throughput is nominally **2× FP8** and **4× FP16**.
- **vs FP8**: half the weight memory and bandwidth (4 vs 8 bits/weight). NVFP4 with AWQ-Lite typically lands within 0.3-0.6 PPL of FP8 on Llama-class models.
- **vs MXFP4** (the OCP Microscaling format, same E2M1 elements with different block sizing): NVFP4 uses a smaller block (16 vs 32) and FP8 (E4M3) scales instead of E8M0 power-of-two scales, giving tighter per-block quantization at slightly larger overhead.

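
To put a number on the "slightly larger overhead" in the MXFP4 comparison, here is a small back-of-the-envelope calculation. The helper name `bits_per_weight` is made up for this example, and the single per-tensor FP32 value is ignored because it is negligible once amortized over a whole tensor.

```python
def bits_per_weight(element_bits, block_size, scale_bits):
    """Effective storage per weight: element bits plus the amortized block scale."""
    return element_bits + scale_bits / block_size


# NVFP4: 4-bit E2M1 elements, one 8-bit FP8 (E4M3) scale per 16 weights.
nvfp4 = bits_per_weight(element_bits=4, block_size=16, scale_bits=8)

# MXFP4: 4-bit E2M1 elements, one 8-bit E8M0 scale per 32 weights.
mxfp4 = bits_per_weight(element_bits=4, block_size=32, scale_bits=8)

print(nvfp4, mxfp4)  # 4.5 4.25
```

So the smaller block costs about a quarter of a bit per weight in exchange for scales that track local weight magnitudes more closely.
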
## Companion repos

- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx): full ONNX pipeline (GPTQ-INT4, 92.7%)
- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx): MLX 4-bit (mixed 8/4, 92.2%)
- [Reza2kn/mega-asr-coreml](https://huggingface.co/Reza2kn/mega-asr-coreml): CoreML 4-bit (mixed 8/4, 90.6%)
- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench): browser demo (WebGPU)

## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
- NVFP4 PTQ via [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) v0.44
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)