qwen3asr-int8

INT8 SmoothQuant quantized version of Qwen/Qwen3-ASR-1.7B, optimized for on-device inference on Jetson Orin Nano 8 GB via TensorRT-Edge-LLM v0.6.0.

Quantization performed with NVIDIA ModelOpt using INT8 SmoothQuant (mtq.INT8_SMOOTHQUANT_CFG). Only the LLM decoder (thinker.model, ~1.4 B / 82% of parameters) is quantized; audio_tower and lm_head remain in FP16.
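The audio_tower / lm_head exclusion is expressed as wildcard-disable entries in the ModelOpt quantization config. A minimal pure-Python sketch of that pattern (the dict below is a stand-in for illustration; the real `mtq.INT8_SMOOTHQUANT_CFG` object comes from `modelopt.torch.quantization` and has more fields):

```python
import copy

# Stand-in for mtq.INT8_SMOOTHQUANT_CFG (assumption: a nested dict with a
# "quant_cfg" section keyed by wildcard module patterns).
INT8_SMOOTHQUANT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},
    },
    "algorithm": "smoothquant",
}

def exclude_modules(base_cfg, patterns):
    """Return a copy of the config with the given module patterns disabled."""
    cfg = copy.deepcopy(base_cfg)
    for pat in patterns:
        cfg["quant_cfg"][f"*{pat}*"] = {"enable": False}
    return cfg

# Keep audio_tower and lm_head in FP16; only thinker.model gets quantized.
cfg = exclude_modules(INT8_SMOOTHQUANT_CFG, ["audio_tower", "lm_head"])
```

The resulting `cfg` would then be passed to `mtq.quantize(model, cfg, forward_loop)` together with a calibration loop.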


Performance — Jetson Orin Nano 8 GB

Evaluated on 760 VIVOS Vietnamese test samples. BF16 baseline WER: 7.34% (measured on x86; the BF16 model exceeds the Nano's 8 GB of memory).

| Metric | Value |
|---|---|
| WER | 9.07% |
| RTF | 0.2190 |
| Throughput | 1.29 samples/s |
| RAM footprint | 4.2 GB |
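RTF here is processing time divided by audio duration, so values below 1.0 are faster than real time. A quick sanity check of how the two reported numbers relate (the implied average clip length is an inference from them, not a measured value):

```python
rtf = 0.2190        # processing_time / audio_duration
throughput = 1.29   # samples per second

# Per-sample processing time, and the average clip duration it implies
# (since proc_time = rtf * duration).
proc_time_per_sample = 1.0 / throughput
avg_clip_seconds = proc_time_per_sample / rtf

print(f"{avg_clip_seconds:.2f}")  # about 3.5 s per VIVOS clip
```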

Intended Use

This checkpoint is the input to the TRT-EdgeLLM export pipeline. It cannot be loaded directly with standard transformers inference. Use it with qwen-asr-optimization to export to ONNX and build TensorRT engines.

[This checkpoint]
      │
      ▼  scripts/02_export_onnx.sh
  ONNX artefacts
      │
      ▼  scripts/03_build_engine.sh  (Jetson Orin AGX)
  TRT engines
      │
      ▼  inference.py / scripts/04_benchmark.sh  (Jetson Orin Nano)
  Transcription

Quantization Details

| Property | Value |
|---|---|
| Method | INT8 SmoothQuant |
| Config | mtq.INT8_SMOOTHQUANT_CFG |
| Quantized component | thinker.model (LLM decoder only) |
| Excluded | audio_tower, lm_head |
| Calibration data | 257 samples: LibriSpeech EN (60), FLEURS ZH (30), FLEURS 13-lang × 7 (91), LibriSpeech functional (76) |
| Base model dtype | FP16 |
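The calibration mix above can be checked arithmetically; the counts below are copied straight from the table:

```python
# Calibration sample counts per source (from the quantization details table).
calib_samples = {
    "LibriSpeech EN": 60,
    "FLEURS ZH": 30,
    "FLEURS 13-lang x 7": 91,   # 13 languages x 7 samples each
    "LibriSpeech functional": 76,
}

total = sum(calib_samples.values())
print(total)  # 257, matching the stated total
```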

Deployment

Full pipeline documentation: trt-edgellm/README.md

Quick start

git clone https://github.com/VLAOpt/qwen-asr-optimization.git
cd qwen-asr-optimization

# Download this checkpoint
huggingface-cli download vrfai/qwen3asr-int8 --local-dir ./Qwen3-ASR-1.7B-int8

# Export to ONNX (x86)
bash trt-edgellm/scripts/02_export_onnx.sh ./Qwen3-ASR-1.7B-int8 ./Qwen3-ASR-1.7B-int8-ONNX

# Build TRT engines (Jetson Orin AGX — see README for INT8 C++ patch)
bash trt-edgellm/scripts/03_build_engine.sh \
    ~/Qwen3-ASR-1.7B-int8-ONNX \
    ~/Qwen3-ASR-1.7B-int8-Engines

# Single-file inference (Jetson Orin Nano)
python trt-edgellm/inference.py \
    --audio      /path/to/audio.wav \
    --engine_dir ~/Qwen3-ASR-1.7B-int8-Engines

INT8 note: Before building engines on AGX, apply the setBuilderOptimizationLevel(2) patch to llmBuilder.cpp and audioBuilder.cpp in the TensorRT-Edge-LLM source. See trt-edgellm/README.md for the exact instructions.
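A sketch of what that patch looks like (the surrounding context and exact insertion point are assumptions here; `trt-edgellm/README.md` has the authoritative instructions, and the same change is applied to both llmBuilder.cpp and audioBuilder.cpp). `setBuilderOptimizationLevel` is the standard `nvinfer1::IBuilderConfig` method:

```diff
--- a/llmBuilder.cpp
+++ b/llmBuilder.cpp
@@ ... @@
     auto config = builder->createBuilderConfig();
+    // Lower the builder optimization level for the INT8 build (default is 3).
+    config->setBuilderOptimizationLevel(2);
```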


Related Models

| Model | Format | Target | Link |
|---|---|---|---|
| qwen3asr-int8 | INT8 SmoothQuant | Jetson Orin Nano | this repo |
| qwen3asr-int4 | INT4 AWQ | Jetson Orin Nano | vrfai/qwen3asr-int4 |
| qwen3asr-fp8 | FP8 | RTX 5090 (vLLM) | vrfai/qwen3asr-fp8 |
| qwen3asr-nvfp4 | NVFP4 | RTX 5090 (vLLM) | vrfai/qwen3asr-nvfp4 |
