| --- |
| license: apache-2.0 |
| language: |
| - en |
| - zh |
| - ja |
| - ko |
| - multilingual |
| library_name: modelopt |
| tags: |
| - nvidia |
| - nvfp4 |
| - blackwell |
| - rtx-50 |
| - modelopt |
| - tensorrt-llm |
| - automatic-speech-recognition |
| - asr |
| - speech-recognition |
| - robust-asr |
| - quantized |
| - 4bit |
| - awq |
| - e2m1 |
| - qwen3 |
| - qwen3-asr |
| - mega-asr |
| pipeline_tag: automatic-speech-recognition |
| base_model: zhifeixie/Mega-ASR |
| base_model_relation: quantized |
| --- |
| |
| # Mega-ASR: NVFP4 AWQ-Lite (NVIDIA Blackwell) |
|
|
| [NVFP4](https://developer.nvidia.com/blog/nvfp4-new-4-bit-floating-point-format/) |
| (4-bit floating-point: E2M1 elements with per-block FP8 scaling) deployment of |
| the LLM portion of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR), |
| quantized via [NVIDIA Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) |
| (`nvidia-modelopt`) with the `NVFP4_AWQ_LITE_CFG` activation-aware recipe. |
|
|
| **Targets RTX 50-series (Blackwell) for native NVFP4 GEMM acceleration.** |
| Earlier Ada/Hopper GPUs can run the same checkpoint via the modelopt |
| fake-quant simulation (correctness preserved, no perf win without Blackwell |
| NVFP4 tensor cores). |
|
|
| ## What's in this repo |
|
|
| | File | Size | Role | |
| | --- | ---: | --- | |
| `nvfp4/model.safetensors` | **3.44 GB** | Qwen3 1.7B LLM, NVFP4 weights + AWQ-Lite scaling factors. Saved via `model.save_pretrained` after `mtq.quantize`: the weights are stored in their original bf16 layout alongside the quantization scale tensors, and the runtime packs them to NVFP4 on the first forward pass. | |
| `nvfp4/config.json` + tokenizer/* | - | HF config + Qwen3-ASR tokenizer (with `<\|audio_pad\|>`, `<asr_text>`, etc.) | |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime; no NVFP4 port, since the encoder is small enough that it doesn't benefit much from FP4) | |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench | |
| `nvfp4_quantize.py` | - | The PTQ script (modelopt forward-loop calibration) | |
| `inference_bench.py` | - | End-to-end ASR pipeline + 8-clip VITW bench | |
|
|
| ## Quality (bench) |
|
|
| 8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
| agreement (1 − WER), prompt forced to `language English`, run on the RTX 5080 |
| Laptop (Blackwell, compute_cap 12.0). Same ONNX fp32 audio encoder as the |
| other backends: |
| |
| | Per-sample | NVFP4 (this repo) | ONNX GPTQ | MLX mixed 8/4 | CoreML mixed 8/4 | |
| | --- | ---: | ---: | ---: | ---: | |
| | distortion | 100% | 100% | 100% | 100% | |
| | dropout | 100% | 100% | 100% | 100% | |
| | echo (hard, reverb) | 64.7% | 82.4% | 64.7% | 64.7% | |
| | far_field | 100% | 100% | 100% | 100% | |
| | mixed | 100% | 100% | 100% | 100% | |
| | noise | 100% | 100% | 100% | 100% | |
| | obstructed | 100% | 100% | 94.1% | 100% | |
| | recording (hard, truncated) | **66.7%** | 60.0% | 60.0% | 60.0% | |
| | **AVERAGE** | **91.4%** | **92.7%** | **92.2%** | **90.6%** | |
|
|
| Notable: NVFP4 ties or beats the others on every sample not marked hard, **wins on |
| `recording` by 6.7 pts** (66.7% vs. 60.0% for every other backend; the AWQ-Lite |
| activation-aware scaling likely helped recover the truncated-audio decode), and |
| ties MLX/CoreML on `echo`. The 1.3-point gap to ONNX GPTQ comes from `echo` |
| (64.7% vs. 82.4%), partly offset by the `recording` gain. On `echo`, GPTQ's per-column |
| Hessian-based error redistribution appears to capture something AWQ-Lite's per-channel scaling doesn't. |
|
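The agreement figures above are 1 − WER per clip. As a minimal sketch of that metric (assuming lowercase, whitespace-split words with no further text normalization, which may differ from what `inference_bench.py` actually does), word error rate can be computed with a word-level edit distance:

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def agreement(reference, hypothesis):
    """Return 1 - WER. Normalization here (lowercase + split) is an assumption."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    wer = word_errors(ref, hyp) / max(len(ref), 1)
    return 1.0 - wer
```

With short references a single word error moves the score a lot: 6 errors against a 17-word reference already gives 1 − 6/17 ≈ 64.7%. Note that this quantity can go negative when the hypothesis contains many insertions.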
|
| ## How NVFP4 works (quick) |
|
|
| - Weights are stored as **E2M1** (1 sign + 2 exponent + 1 mantissa = 4 bits, |
| representing values in {±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}). |
| - Every block of **16 consecutive weight elements** shares one **FP8 (E4M3) |
| scaling factor** (saved alongside the E2M1 values; 8 bits / 16 weights = 0.5 extra bits/weight |
| for scale storage). |
| - A second-level **per-tensor FP32 scale**, derived from the tensor's amax, keeps the per-block scales within FP8 range. |
| - Inference, conceptually: load E2M1 weights → multiply by the per-block FP8 scale → multiply |
| by the per-tensor scale → fp16/bf16 GEMM. Blackwell's tensor cores apply the scaling |
| natively as part of the FP4 GEMM, at roughly the cost of the FP4 multiplies alone. |
|
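The block-scaling scheme in the list above can be illustrated with a small pure-Python fake-quantizer. This is a sketch under simplifying assumptions, not modelopt's implementation: the per-block scale is kept as a Python float instead of being rounded to FP8 (E4M3), the second-level per-tensor scale is omitted, and ties are rounded toward the smaller magnitude. All function names are hypothetical.

```python
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes
BLOCK = 16                                             # weights per shared scale


def quantize_block(block):
    """Map one block to (scale, E2M1 values). The scale is an unrounded float here."""
    amax = max(abs(w) for w in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # largest weight lands on the grid maximum 6
    codes = []
    for w in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        codes.append(-mag if w < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    return [c * scale for c in codes]


def fake_quantize(weights):
    """Quantize then dequantize, block by block, to expose the rounding error."""
    out = []
    for i in range(0, len(weights), BLOCK):
        scale, codes = quantize_block(weights[i:i + BLOCK])
        out.extend(dequantize_block(scale, codes))
    return out


# 4 element bits plus one 8-bit scale shared by 16 weights.
BITS_PER_WEIGHT = 4 + 8 / BLOCK
```

Because the widest gap on the grid is 2 (between 4 and 6), the worst-case rounding error in a block is one scale step, i.e. the block's amax divided by 6. A smaller block keeps amax, and therefore that error, more local.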
|
| The AWQ-Lite variant runs an extra pass that computes a per-channel |
| activation magnitude and rescales weights against activations (a weight input channel is |
| scaled up and the matching activation channel down by the same factor, which leaves their product unchanged) to put more of |
| the dynamic range into the "important" channels (channels with large |
| activation amplitudes) before applying NVFP4. The net effect is to recover |
| some of the quality lost to the E2M1 grid. |
|
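The rescaling idea described above can be sketched in a few lines. This is an illustration of the general activation-aware scaling trick, not modelopt's AWQ-Lite code: the exponent `alpha`, the helper names, and the use of per-channel amax as the activation statistic are all assumptions made for the example.

```python
def channel_scales(act_amax, alpha=0.5):
    """Per-input-channel scale; larger activations get a larger scale."""
    return [max(a, 1e-8) ** alpha for a in act_amax]


def scale_weights(weight, scales):
    """weight is a list of rows [out][in]; multiply each input column by its scale."""
    return [[w * s for w, s in zip(row, scales)] for row in weight]


def scale_activations(x, scales):
    """Divide each activation channel by the same scale so W @ x is unchanged."""
    return [xi / s for xi, s in zip(x, scales)]


def matvec(weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Usage: the layer output is identical before and after rescaling.
weight = [[1.0, -2.0, 0.5], [0.25, 4.0, -1.0]]
x = [8.0, 1.0, 0.125]
scales = channel_scales([4.0, 1.0, 0.25])
y_before = matvec(weight, x)
y_after = matvec(scale_weights(weight, scales), scale_activations(x, scales))
```

In full precision the two outputs match, so the transform is free. The benefit only appears once the scaled weights are quantized: columns tied to large activations now occupy more of each block's range and lose relatively less to rounding.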
|
| ## Inference |
|
|
| ### Stage 1: PyTorch + modelopt (fake-quant, works on any GPU) |
|
|
| ```bash |
| pip install nvidia-modelopt transformers safetensors torch onnxruntime soundfile librosa |
| git clone https://huggingface.co/Reza2kn/mega-asr-nvfp4 |
| cd mega-asr-nvfp4 |
| python inference_bench.py \ |
| --model nvfp4 \ |
| --encoder onnx/audio_encoder_fp32.onnx \ |
| --examples-dir examples \ |
| --qwen-asr-dir <Qwen3-ASR-1.7B HF dir> \ |
| --skip-quant # weights already quantized |
| ``` |
|
|
| ### Stage 2: TensorRT-LLM engine (native Blackwell NVFP4) |
|
|
| ```bash |
| # Convert HF checkpoint → TensorRT-LLM checkpoint |
| python -m tensorrt_llm.examples.qwen.convert_checkpoint \ |
| --model_dir nvfp4 --output_dir trtllm_ckpt \ |
| --dtype bfloat16 --use_fp4 |
| |
| # Build engine |
| trtllm-build --checkpoint_dir trtllm_ckpt --output_dir trtllm_engine \ |
| --gemm_plugin fp4 --max_input_len 512 --max_seq_len 600 |
| ``` |
|
|
| (The TRT-LLM engine path is on the roadmap, so treat the commands above as a planned outline rather than a tested recipe; this repo currently ships the |
| modelopt-saved HF checkpoint, which runs as fake-quant on any GPU.) |
|
|
| ## Conversion details |
|
|
| ```python |
| import torch |
| import modelopt.torch.quantization as mtq |
| from transformers import AutoModelForCausalLM |
| |
| model = AutoModelForCausalLM.from_pretrained("Qwen3-ASR-1.7B-LLM", |
|                                              torch_dtype=torch.bfloat16, |
|                                              device_map="cuda") |
| # Calibration with 168 English VITW samples (audio embeds scattered at |
| # <|audio_pad|> positions; same set used for the ONNX GPTQ release). |
| # build_calibration_batches is a helper whose arguments are elided here; |
| # each batch is a dict of keyword inputs for the model. |
| calib_batches = build_calibration_batches(...) |
| |
| def forward_loop(m): |
|     # modelopt calls this to run the calibration data through the model. |
|     with torch.no_grad(): |
|         for b in calib_batches: |
|             m(**b) |
| |
| # Insert quantizers and calibrate them with the NVFP4 AWQ-Lite recipe. |
| mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, forward_loop) |
| model.save_pretrained("nvfp4") |
| ``` |
|
|
| 168 calibration batches, ~3 min on the RTX 5080. The AWQ-Lite recipe does |
| **two** forward passes per batch (one for activation magnitude estimation, |
| one for the actual quantization apply step), which explains the doubled count |
| in the log. |
|
|
| ## Why NVFP4 (vs INT4 / FP8)? |
|
|
| - **vs INT4 (e.g., GPTQ)**: NVFP4's exponent bits handle the wide activation |
| dynamic range in transformer MLPs better than INT4's linear grid. On |
| Blackwell tensor cores, NVFP4 GEMM throughput is **2× FP8** and **4× FP16**. |
| - **vs FP8**: half the memory bandwidth (4 vs 8 bits/weight). NVFP4 with |
| AWQ-Lite typically lands within 0.3-0.6 PPL of FP8 on Llama-class models. |
| - **vs MXFP4** (the OCP Microscaling format, same E2M1 elements with different block sizing): |
| NVFP4 uses a smaller block (16 vs 32) and FP8 (E4M3) scales instead of E8M0, giving tighter |
| per-block quantization at slightly larger overhead (0.5 vs 0.25 scale bits per weight). |
|
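The storage side of the comparison above comes down to simple arithmetic. The sketch below estimates effective bits per weight and a rough weight footprint for a 1.7B-parameter model. It assumes every parameter is quantized and ignores per-tensor scales, so real checkpoints (which usually keep embeddings and some layers in higher precision) will be larger; the function names are hypothetical.

```python
def bits_per_weight(element_bits, scale_bits=0, block_size=1):
    """Element bits plus the shared block scale amortized over the block."""
    return element_bits + scale_bits / block_size


FORMATS = {
    "bf16": bits_per_weight(16),
    "fp8": bits_per_weight(8),
    "mxfp4": bits_per_weight(4, scale_bits=8, block_size=32),  # E8M0 scale per 32
    "nvfp4": bits_per_weight(4, scale_bits=8, block_size=16),  # E4M3 scale per 16
}


def footprint_gb(num_params, fmt):
    """Approximate weight storage in GB (10^9 bytes) for the given format."""
    return num_params * FORMATS[fmt] / 8 / 1e9


for name in FORMATS:
    print(f"{name:6s} {FORMATS[name]:5.2f} bits/weight  "
          f"{footprint_gb(1.7e9, name):.2f} GB")
```

The bf16 estimate of about 3.4 GB is consistent with the 3.44 GB checkpoint in this repo, which stores weights in bf16 layout; a fully packed NVFP4 export would be closer to 1 GB under the same assumptions.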
|
| ## Companion repos |
|
|
| - [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx): full ONNX pipeline (GPTQ-INT4, 92.7%) |
| - [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx): MLX 4-bit (mixed 8/4, 92.2%) |
| - [Reza2kn/mega-asr-coreml](https://huggingface.co/Reza2kn/mega-asr-coreml): CoreML 4-bit (mixed 8/4, 90.6%) |
| - [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench): browser demo (WebGPU) |
|
|
| ## Credits |
|
|
| - Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0) |
| - NVFP4 PTQ via [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) v0.44 |
| - Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
|
|