---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: modelopt
tags:
- nvidia
- nvfp4
- blackwell
- rtx-50
- modelopt
- tensorrt-llm
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- 4bit
- awq
- e2m1
- qwen3
- qwen3-asr
- mega-asr
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
# Mega-ASR: NVFP4 AWQ-Lite (NVIDIA Blackwell)
[NVFP4](https://developer.nvidia.com/blog/nvfp4-new-4-bit-floating-point-format/)
(4-bit floating point: E2M1 elements with per-block FP8 scaling) deployment of
the LLM portion of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
quantized via [NVIDIA Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
(`nvidia-modelopt`) with the `NVFP4_AWQ_LITE_CFG` activation-aware recipe.
**Targets RTX 50-series (Blackwell) for native NVFP4 GEMM acceleration.**
Earlier Ada/Hopper GPUs can run the same checkpoint via the modelopt
fake-quant simulation (correctness preserved, no perf win without Blackwell
NVFP4 tensor cores).
## What's in this repo
| File | Size | Role |
| --- | ---: | --- |
| `nvfp4/model.safetensors` | **3.44 GB** | Qwen3 1.7B LLM, NVFP4 weights + AWQ-Lite scaling factors. Saved via `model.save_pretrained` after modelopt quantization: the weights are stored in their original bf16 layout alongside the quantization scale tensors; the runtime packs them to NVFP4 on first forward. |
| `nvfp4/config.json` + tokenizer/* | - | HF config + Qwen3-ASR tokenizer (with `<\|audio_pad\|>`, `<asr_text>`, etc.) |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime; NVFP4 port not done, since the encoder is small enough that it doesn't benefit much from FP4) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `nvfp4_quantize.py` | - | The PTQ script (modelopt forward-loop calibration) |
| `inference_bench.py` | - | End-to-end ASR pipeline + 8-clip VITW bench |
## Quality (bench)
8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
agreement (1 − WER), prompt forced to `language English`, run on an RTX 5080
Laptop GPU (Blackwell, compute_cap 12.0). Same ONNX fp32 audio encoder as the
other backends:
| Per-sample | NVFP4 (this repo) | ONNX GPTQ | MLX mixed 8/4 | CoreML mixed 8/4 |
| --- | ---: | ---: | ---: | ---: |
| distortion | 100% | 100% | 100% | 100% |
| dropout | 100% | 100% | 100% | 100% |
| echo (hard, reverb) | 64.7% | 82.4% | 64.7% | 64.7% |
| far_field | 100% | 100% | 100% | 100% |
| mixed | 100% | 100% | 100% | 100% |
| noise | 100% | 100% | 100% | 100% |
| obstructed | 100% | 100% | 94.1% | 100% |
| recording (hard, truncated) | **66.7%** | 60.0% | 60.0% | 60.0% |
| **AVERAGE** | **91.4%** | **92.7%** | **92.2%** | **90.6%** |
Notable: NVFP4 ties or beats the others on every sample not marked hard, **wins on
`recording` by 6.7 pts** (66.7% vs 60% everywhere else; the AWQ-Lite
activation-aware scaling likely helped recover the truncated-audio decode), and
ties MLX/CoreML on `echo`. The 1.3-point gap to ONNX GPTQ comes from `echo`
(64.7% vs 82.4%), where GPTQ's per-column Hessian-based error redistribution
appears to capture something AWQ-Lite's per-channel scaling doesn't.
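
As a rough illustration of the agreement metric in the table (1 − WER), the sketch below computes word-level edit distance with plain Python. The function names are hypothetical, and any text normalization that `inference_bench.py` applies (casing, punctuation) is not known here and is left out.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance and the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def agreement(reference, hypothesis):
    """Agreement = 1 - WER, where WER = word errors / reference words."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - errors / n_ref


# Toy example (not a benchmark clip): one substitution plus one deletion
# over a 5-word reference gives 2 errors, so agreement is 0.6.
score = agreement("the quick brown fox jumps", "the quick brown box")
```

With only eight clips, a single word error on one short clip moves that clip's score by several points, which is worth keeping in mind when reading the per-sample rows.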
## How NVFP4 works (quick)
- Weights are stored as **E2M1** (1 sign + 2 exponent + 1 mantissa = 4 bits,
  representing values in {±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}).
- Every block of **16 consecutive weight elements** shares one **FP8 (E4M3)
  scaling factor** (saved alongside the E2M1 values; 0.5 extra bits/weight
  for scale storage).
- A second **per-tensor FP32 scale** (from the tensor's amax) rescales the per-block scales into FP8 range.
- Inference: load E2M1 weights → multiply by per-block FP8 scale → multiply
  by the per-tensor scale → fp16/bf16 GEMM. Blackwell's tensor cores do this
  natively, in roughly the same cycles as the FP4 multiplies.
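
The block-scaling steps above can be sketched in plain Python as a quantize-then-dequantize ("fake-quant") pass. This is a simplified illustration, not modelopt's kernel: the per-block scale is kept as a Python float instead of being rounded to FP8 (E4M3), the per-tensor FP32 scale is omitted, and all names are hypothetical.

```python
# Illustrative NVFP4-style fake quantization of a flat list of weights.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes
BLOCK = 16  # number of elements sharing one scale


def quantize_block(block):
    """Return (scale, dequantized values) for one block of weights."""
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for w in block:
        # Snap the scaled magnitude to the nearest E2M1 grid point.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append((mag if w >= 0 else -mag) * scale)
    return scale, out


def fake_quant_nvfp4(weights):
    """Quantize and dequantize block by block."""
    result = []
    for i in range(0, len(weights), BLOCK):
        _, deq = quantize_block(weights[i:i + BLOCK])
        result.extend(deq)
    return result


weights = [0.01 * (i - 16) for i in range(32)]  # toy data, two blocks
deq = fake_quant_nvfp4(weights)
max_err = max(abs(a - b) for a, b in zip(weights, deq))
```

Because the widest gap on the grid is between 4 and 6, the worst-case rounding error in a block is one scale step, i.e. that block's amax divided by 6. Smaller blocks keep amax, and therefore this error bound, closer to the local weight magnitudes.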
The AWQ-Lite variant runs an extra pass that computes a per-channel
activation magnitude and rescales weights vs. activations to put more of
the dynamic range into the "important" channels (channels with large
activation amplitudes) before applying NVFP4. The net effect is to recover
some of the quality lost to the E2M1 grid.
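
To make the rescaling idea concrete, here is a minimal sketch of generic activation-aware scaling in the style of AWQ. It is not modelopt's AWQ-Lite implementation: the exponent `alpha`, the use of mean absolute activation, and every function name are assumptions for illustration. It only shows that scaling weights up and activations down per channel leaves the layer output unchanged before quantization.

```python
def channel_scales(activations, alpha=0.5):
    """Per-input-channel scale: mean |activation| raised to alpha (assumed form)."""
    n_ch = len(activations[0])
    means = [sum(abs(row[c]) for row in activations) / len(activations)
             for c in range(n_ch)]
    return [max(m, 1e-8) ** alpha for m in means]


def apply_scales(weight_row, x, scales):
    """Multiply weights by s and divide activations by s, channel by channel."""
    w_scaled = [w * s for w, s in zip(weight_row, scales)]
    x_scaled = [v / s for v, s in zip(x, scales)]
    return w_scaled, x_scaled


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Toy calibration activations: channel 1 has much larger amplitudes.
calib = [[0.1, 8.0, 0.2], [0.3, 12.0, 0.1]]
w = [0.5, 0.02, -0.4]
x = calib[0]

s = channel_scales(calib)
w_s, x_s = apply_scales(w, x, s)

# The product is mathematically unchanged; only the weight distribution
# that the quantizer sees is different.
before, after = dot(w, x), dot(w_s, x_s)
```

In a real pipeline the quantizer would then be applied to `w_s`, so the weights of high-activation channels occupy more of each block's range, and the inverse scale would be folded into the preceding operation rather than applied at runtime.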
## Inference
### Stage 1: PyTorch + modelopt (fake-quant, works on any GPU)
```bash
pip install nvidia-modelopt transformers safetensors torch onnxruntime soundfile librosa
git clone https://huggingface.co/Reza2kn/mega-asr-nvfp4
cd mega-asr-nvfp4
python inference_bench.py \
    --model nvfp4 \
    --encoder onnx/audio_encoder_fp32.onnx \
    --examples-dir examples \
    --qwen-asr-dir <Qwen3-ASR-1.7B HF dir> \
    --skip-quant  # weights already quantized
```
### Stage 2: TensorRT-LLM engine (native Blackwell NVFP4)
```bash
# Convert HF checkpoint → TensorRT-LLM checkpoint
python -m tensorrt_llm.examples.qwen.convert_checkpoint \
    --model_dir nvfp4 --output_dir trtllm_ckpt \
    --dtype bfloat16 --use_fp4
# Build engine
trtllm-build --checkpoint_dir trtllm_ckpt --output_dir trtllm_engine \
    --gemm_plugin fp4 --max_input_len 512 --max_seq_len 600
```
(The TRT-LLM engine path is on the roadmap, so treat the commands above as the
planned workflow rather than a tested one; this repo currently ships only the
modelopt-saved HF checkpoint, which runs as fake-quant on any GPU.)
## Conversion details
```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Load the LLM portion in bf16 on the GPU
model = AutoModelForCausalLM.from_pretrained("Qwen3-ASR-1.7B-LLM",
                                             torch_dtype=torch.bfloat16,
                                             device_map="cuda")

# Calibration with 168 English VITW samples (audio embeds scattered at
# <|audio_pad|> positions; same set used for the ONNX GPTQ release)
calib_batches = build_calibration_batches(...)

def forward_loop(m):
    # modelopt calls this to push the calibration data through the model
    for b in calib_batches:
        with torch.no_grad():
            m(**b)

mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, forward_loop)
model.save_pretrained("nvfp4")
```
168 calibration batches, ~3 min on the RTX 5080. The AWQ-Lite recipe does
**two** forward passes per batch (one for activation magnitude estimation,
one for the actual quantization apply step), which explains the doubled count
in the log.
## Why NVFP4 (vs INT4 / FP8)?
- **vs INT4 (e.g., GPTQ)**: NVFP4's exponent bits handle the wide activation
  dynamic range in transformer MLPs better than INT4's linear grid. On
  Blackwell tensor cores, NVFP4 GEMM throughput is **2× FP8** and **4× FP16**.
- **vs FP8**: half the memory bandwidth (4 vs 8 bits/weight). NVFP4 with
  AWQ-Lite typically lands within 0.3-0.6 PPL of FP8 on Llama-class models.
- **vs MXFP4** (the OCP Microscaling variant, same E2M1 elements with different
  block sizing): NVFP4 uses a smaller block (16 vs 32) and FP8 (E4M3) scales
  instead of E8M0, giving tighter per-block quantization at slightly larger
  overhead.
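
The storage overhead mentioned in the comparison above is simple arithmetic: element bits plus the per-block scale bits amortized over the block. A small sketch, with a hypothetical helper name, and ignoring the single per-tensor FP32 scale because it is negligible per weight:

```python
def bits_per_weight(element_bits, scale_bits, block_size):
    """Effective storage per weight: element bits plus amortized block scale."""
    return element_bits + scale_bits / block_size


nvfp4_bits = bits_per_weight(4, 8, 16)  # E2M1 + FP8 (E4M3) scale per 16 -> 4.5
mxfp4_bits = bits_per_weight(4, 8, 32)  # E2M1 + E8M0 scale per 32      -> 4.25


def packed_size_gb(n_params, bits):
    """Packed size in GB if every parameter were stored at `bits` bits."""
    return n_params * bits / 8 / 1e9


# Hypothetical upper-level estimate: assumes all 1.7e9 parameters are
# quantized, which real recipes usually do not do (e.g. embeddings).
nvfp4_gb = packed_size_gb(1.7e9, nvfp4_bits)
```

So NVFP4 pays 0.25 extra bits per weight over MXFP4 in exchange for finer-grained scales with a mantissa.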
## Companion repos
- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx): full ONNX pipeline (GPTQ-INT4, 92.7%)
- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx): MLX 4-bit (mixed 8/4, 92.2%)
- [Reza2kn/mega-asr-coreml](https://huggingface.co/Reza2kn/mega-asr-coreml): CoreML 4-bit (mixed 8/4, 90.6%)
- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench): browser demo (WebGPU)
## Credits
- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
- NVFP4 PTQ via [NVIDIA TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) v0.44
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)