qwen3asr-int8

INT8 SmoothQuant quantized version of Qwen/Qwen3-ASR-1.7B, optimized for on-device inference on Jetson Orin Nano 8 GB via TensorRT-Edge-LLM v0.6.0.

Quantization performed with NVIDIA ModelOpt using INT8 SmoothQuant (mtq.INT8_SMOOTHQUANT_CFG). Only the LLM decoder (thinker.model, ~1.4 B / 82% of parameters) is quantized; audio_tower and lm_head remain in FP16.
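The audio_tower / lm_head exclusion is expressed as wildcard-disable entries in the ModelOpt quantization config. A minimal pure-Python sketch of that pattern (the dict below is a stand-in for illustration; the real `mtq.INT8_SMOOTHQUANT_CFG` object comes from `modelopt.torch.quantization` and has more fields):

```python
import copy

# Stand-in for mtq.INT8_SMOOTHQUANT_CFG (assumption: a nested dict with a
# "quant_cfg" section keyed by wildcard module patterns).
INT8_SMOOTHQUANT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},
    },
    "algorithm": "smoothquant",
}

def exclude_modules(base_cfg, patterns):
    """Return a copy of the config with the given module patterns disabled."""
    cfg = copy.deepcopy(base_cfg)
    for pat in patterns:
        cfg["quant_cfg"][f"*{pat}*"] = {"enable": False}
    return cfg

# Keep audio_tower and lm_head in FP16; only thinker.model gets quantized.
cfg = exclude_modules(INT8_SMOOTHQUANT_CFG, ["audio_tower", "lm_head"])
```

The resulting `cfg` would then be passed to `mtq.quantize(model, cfg, forward_loop)` together with a calibration loop.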


Performance — Jetson Orin Nano 8 GB

Evaluated on 760 VIVOS Vietnamese test samples. BF16 baseline WER: 7.34% (measured on x86; the BF16 model exceeds the Nano's 8 GB of memory).

| Metric | Value |
|---|---|
| WER | 9.07% |
| RTF | 0.2190 |
| Throughput | 1.29 samples/s |
| RAM footprint | 4.2 GB |
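RTF here is processing time divided by audio duration, so values below 1.0 are faster than real time. A quick sanity check of how the two reported numbers relate (the implied average clip length is an inference from them, not a measured value):

```python
rtf = 0.2190        # processing_time / audio_duration
throughput = 1.29   # samples per second

# Per-sample processing time, and the average clip duration it implies
# (since proc_time = rtf * duration).
proc_time_per_sample = 1.0 / throughput
avg_clip_seconds = proc_time_per_sample / rtf

print(f"{avg_clip_seconds:.2f}")  # about 3.5 s per VIVOS clip
```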

Intended Use

This checkpoint is the input to the TRT-EdgeLLM export pipeline. It cannot be loaded directly with standard transformers inference. Use it with qwen-asr-optimization to export to ONNX and build TensorRT engines.

[This checkpoint]
      │
      ▼  scripts/02_export_onnx.sh
  ONNX artefacts
      │
      ▼  scripts/03_build_engine.sh  (Jetson Orin AGX)
  TRT engines
      │
      ▼  inference.py / scripts/04_benchmark.sh  (Jetson Orin Nano)
  Transcription

Quantization Details

| Property | Value |
|---|---|
| Method | INT8 SmoothQuant |
| Config | mtq.INT8_SMOOTHQUANT_CFG |
| Quantized component | thinker.model (LLM decoder only) |
| Excluded | audio_tower, lm_head |
| Calibration data | 257 samples: LibriSpeech EN (60), FLEURS ZH (30), FLEURS 13-lang × 7 (91), LibriSpeech functional (76) |
| Base model dtype | FP16 |
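The calibration mix above can be checked arithmetically; the counts below are copied straight from the table:

```python
# Calibration sample counts per source (from the quantization details table).
calib_samples = {
    "LibriSpeech EN": 60,
    "FLEURS ZH": 30,
    "FLEURS 13-lang x 7": 91,   # 13 languages x 7 samples each
    "LibriSpeech functional": 76,
}

total = sum(calib_samples.values())
print(total)  # 257, matching the stated total
```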

Deployment

Full pipeline documentation: trt-edgellm/README.md

Quick start

git clone https://github.com/VLAOpt/qwen-asr-optimization.git
cd qwen-asr-optimization

# Download this checkpoint
huggingface-cli download vrfai/qwen3asr-int8 --local-dir ./Qwen3-ASR-1.7B-int8

# Export to ONNX (x86)
bash trt-edgellm/scripts/02_export_onnx.sh ./Qwen3-ASR-1.7B-int8 ./Qwen3-ASR-1.7B-int8-ONNX

# Build TRT engines (Jetson Orin AGX — see README for INT8 C++ patch)
bash trt-edgellm/scripts/03_build_engine.sh \
    ~/Qwen3-ASR-1.7B-int8-ONNX \
    ~/Qwen3-ASR-1.7B-int8-Engines

# Single-file inference (Jetson Orin Nano)
python trt-edgellm/inference.py \
    --audio      /path/to/audio.wav \
    --engine_dir ~/Qwen3-ASR-1.7B-int8-Engines

INT8 note: Before building engines on AGX, apply the setBuilderOptimizationLevel(2) patch to llmBuilder.cpp and audioBuilder.cpp in the TensorRT-Edge-LLM source. See trt-edgellm/README.md for the exact instructions.
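A sketch of what that patch looks like (the surrounding context and exact insertion point are assumptions here; `trt-edgellm/README.md` has the authoritative instructions, and the same change is applied to both llmBuilder.cpp and audioBuilder.cpp). `setBuilderOptimizationLevel` is the standard `nvinfer1::IBuilderConfig` method:

```diff
--- a/llmBuilder.cpp
+++ b/llmBuilder.cpp
@@ ... @@
     auto config = builder->createBuilderConfig();
+    // Lower the builder optimization level for the INT8 build (default is 3).
+    config->setBuilderOptimizationLevel(2);
```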


Related Models

| Model | Format | Target | Link |
|---|---|---|---|
| qwen3asr-int8 | INT8 SmoothQuant | Jetson Orin Nano | this repo |
| qwen3asr-int4 | INT4 AWQ | Jetson Orin Nano | vrfai/qwen3asr-int4 |
| qwen3asr-fp8 | FP8 | RTX 5090 (vLLM) | vrfai/qwen3asr-fp8 |
| qwen3asr-nvfp4 | NVFP4 | RTX 5090 (vLLM) | vrfai/qwen3asr-nvfp4 |
