# qwen3asr-int8
INT8 SmoothQuant quantized version of Qwen/Qwen3-ASR-1.7B, optimized for on-device inference on the Jetson Orin Nano 8 GB via TensorRT-Edge-LLM v0.6.0.

Quantization was performed with NVIDIA ModelOpt using INT8 SmoothQuant (`mtq.INT8_SMOOTHQUANT_CFG`). Only the LLM decoder (`thinker.model`, ~1.4 B parameters, 82% of the total) is quantized; `audio_tower` and `lm_head` remain in FP16.
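SmoothQuant works by migrating quantization difficulty from the activations (which often have outlier channels) into the weights before quantizing to INT8. A minimal NumPy sketch of the scale-migration idea behind `mtq.INT8_SMOOTHQUANT_CFG` (toy data; only the activations are fake-quantized here, to isolate the effect):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations with one outlier channel: the case SmoothQuant targets.
X = rng.normal(size=(32, 8))
X[:, 0] *= 50.0
W = rng.normal(size=(8, 16))

alpha = 0.5  # migration strength; 0.5 is the common SmoothQuant default
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

# Scale migration: divide activations, multiply weights; output is unchanged.
X_s, W_s = X / s, W * s[:, None]
assert np.allclose(X_s @ W_s, X @ W)

def fake_int8(t):
    """Symmetric per-tensor INT8 quantize -> dequantize round trip."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).clip(-127, 127) * scale

err_plain  = np.abs(fake_int8(X)   @ W   - X @ W).mean()
err_smooth = np.abs(fake_int8(X_s) @ W_s - X @ W).mean()
assert err_smooth < err_plain  # smoothing shrinks the INT8 activation error
```

In the real pipeline, ModelOpt applies this per layer during calibration rather than on toy matrices.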
## Performance — Jetson Orin Nano 8 GB

Evaluated on 760 VIVOS Vietnamese test samples. BF16 baseline WER: 7.34% (measured on x86; the unquantized model does not fit in the Nano's 8 GB of memory).
| Metric | Value |
|---|---|
| WER | 9.07% |
| RTF (real-time factor) | 0.2190 |
| Throughput | 1.29 samples/s |
| RAM footprint | 4.2 GB |
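The RTF figure can be read directly: it is wall-clock processing time divided by audio duration, so values below 1.0 are faster than real time. A quick sanity check in Python (the table values are the measured ones; the "implied average clip length" is derived, not measured):

```python
rtf = 0.2190          # processing time / audio duration, from the table
throughput = 1.29     # samples / s, from the table

speedup = 1 / rtf     # how much faster than real time the pipeline runs
assert round(speedup, 2) == 4.57

# Throughput and RTF together imply an average test-clip length:
# (samples/s) * (processing s per audio s) * (audio s per sample) = 1
implied_avg_clip_sec = 1 / (throughput * rtf)
assert round(implied_avg_clip_sec, 2) == 3.54
```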
## Intended Use

This checkpoint is the input to the TRT-EdgeLLM export pipeline. It is not directly loadable by standard `transformers` inference — use it with [qwen-asr-optimization](https://github.com/VLAOpt/qwen-asr-optimization) to export to ONNX and build TRT engines.
```
[This checkpoint]
        │
        ▼  scripts/02_export_onnx.sh
ONNX artefacts
        │
        ▼  scripts/03_build_engine.sh  (Jetson Orin AGX)
TRT engines
        │
        ▼  inference.py / scripts/04_benchmark.sh  (Jetson Orin Nano)
Transcription
```
## Quantization Details
| Property | Value |
|---|---|
| Method | INT8 SmoothQuant |
| Config | mtq.INT8_SMOOTHQUANT_CFG |
| Quantized component | thinker.model (LLM decoder only) |
| Excluded | audio_tower, lm_head |
| Calibration data | 257 samples — LibriSpeech EN (60), FLEURS ZH (30), FLEURS 13-lang×7 (91), LibriSpeech functional (76) |
| Base model dtype | FP16 |
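ModelOpt-style configs typically disable quantizers for excluded modules via wildcard name patterns. The patterns below are assumptions mirroring the table above, not the repository's actual config; the calibration counts are from the table:

```python
from fnmatch import fnmatch

# Assumed exclusion patterns mirroring the table above (not the repo's config).
EXCLUDED = ["*audio_tower*", "*lm_head*"]

def is_quantized(module_name: str) -> bool:
    """True if an INT8 quantizer would stay enabled for this module."""
    return not any(fnmatch(module_name, pat) for pat in EXCLUDED)

assert is_quantized("thinker.model.layers.0.self_attn.q_proj")
assert not is_quantized("thinker.audio_tower.layers.0.conv")
assert not is_quantized("lm_head")

# Calibration mix from the table sums to the stated 257 samples.
CALIB_MIX = {"LibriSpeech EN": 60, "FLEURS ZH": 30,
             "FLEURS 13-lang x7": 91, "LibriSpeech functional": 76}
assert sum(CALIB_MIX.values()) == 257
```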
## Deployment

Full pipeline documentation: `trt-edgellm/README.md`

### Quick start
```bash
git clone https://github.com/VLAOpt/qwen-asr-optimization.git
cd qwen-asr-optimization

# Download this checkpoint
huggingface-cli download vrfai/qwen3asr-int8 --local-dir ./Qwen3-ASR-1.7B-int8

# Export to ONNX (x86)
bash trt-edgellm/scripts/02_export_onnx.sh ./Qwen3-ASR-1.7B-int8 ./Qwen3-ASR-1.7B-int8-ONNX

# Build TRT engines (Jetson Orin AGX — see README for INT8 C++ patch)
bash trt-edgellm/scripts/03_build_engine.sh \
    ~/Qwen3-ASR-1.7B-int8-ONNX \
    ~/Qwen3-ASR-1.7B-int8-Engines

# Single-file inference (Jetson Orin Nano)
python trt-edgellm/inference.py \
    --audio /path/to/audio.wav \
    --engine_dir ~/Qwen3-ASR-1.7B-int8-Engines
```
**INT8 note:** before building engines on the AGX, apply the `setBuilderOptimizationLevel(2)` patch to `llmBuilder.cpp` and `audioBuilder.cpp` in the TensorRT-Edge-LLM source. See `trt-edgellm/README.md` for the exact instructions.
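The patch itself is a one-line change to each builder. A minimal sketch of where the call lands, assuming the builders hold a standard `nvinfer1::IBuilderConfig` (the function name `configureBuilder` is hypothetical; the real files are `llmBuilder.cpp` and `audioBuilder.cpp`):

```cpp
#include "NvInfer.h"

// Illustrative fragment only; the surrounding code in the real builder
// files differs. See trt-edgellm/README.md for the exact patch.
void configureBuilder(nvinfer1::IBuilderConfig* config)
{
    // Lower the builder optimization level from TensorRT's default (3).
    // This is the one-line INT8 change referenced in the note above.
    config->setBuilderOptimizationLevel(2);
}
```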
## Related Models
| Model | Format | Target | Link |
|---|---|---|---|
| qwen3asr-int8 | INT8 SmoothQuant | Jetson Orin Nano | this repo |
| qwen3asr-int4 | INT4 AWQ | Jetson Orin Nano | vrfai/qwen3asr-int4 |
| qwen3asr-fp8 | FP8 | RTX 5090 (vLLM) | vrfai/qwen3asr-fp8 |
| qwen3asr-nvfp4 | NVFP4 | RTX 5090 (vLLM) | vrfai/qwen3asr-nvfp4 |