Qwen3-ASR-1.7B

Model

Qwen3-ASR-1.7B is a Qwen3VL-based automatic speech recognition (ASR) model for transcribing English speech audio.

  • Audio encoder: 24-layer Transformer encoder (hidden=1024, 16 heads, head_dim=64, FFN=4096, output_dim=2048) with 3 initial Conv2D layers (3×3, 480 channels)
  • Text decoder: 28-layer Qwen3VL decoder (hidden=2048, 16Q/8KV heads via GQA, head_dim=128, FFN=6144, SwiGLU, QK-Norm, M-RoPE, vocab 151,936)
  • Total parameters: ~1.7B
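The decoder's grouped-query attention (GQA) figures can be turned into concrete projection widths. The sketch below uses only the numbers from the bullets above. The KV-cache estimate assumes the decoder caches keys and values in FP16, which the model card does not state, and `kv_cache_bytes` is a hypothetical helper.

```python
# Values copied from the decoder bullet above.
NUM_LAYERS = 28
NUM_Q_HEADS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: FP16 cache entries

q_dim = NUM_Q_HEADS * HEAD_DIM            # width of the query projection
kv_dim = NUM_KV_HEADS * HEAD_DIM          # width of each key/value projection
group_size = NUM_Q_HEADS // NUM_KV_HEADS  # query heads sharing one KV head


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache K and V for num_tokens across all layers."""
    return 2 * NUM_LAYERS * kv_dim * BYTES_PER_VALUE * num_tokens


print(f"q_dim={q_dim}, kv_dim={kv_dim}, group_size={group_size}")
print(f"KV cache for 1024 tokens: {kv_cache_bytes(1024) / 2**20:.0f} MiB")
```

With 8 KV heads instead of 16, the key and value projections are half as wide as the query projection, so a cache of this kind would be half the size of a full multi-head one.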

Reference implementation: reference-llm-models / qwen3-asr


Available weights

The model is split into two weight files: the decoder (LLM backbone) and the encoder (audio encoder). Each is provided in NPZ and GGUF formats, in FP16 and FP32.

| Component | Directory | File | Dtype | Tensors |
| --- | --- | --- | --- | --- |
| Decoder | Qwen3_ASR_1.7B_fp16_artifacts/ | qwen3_asr_decoder_fp16.npz | FP16 | 311 |
| Decoder | Qwen3_ASR_1.7B_fp16_artifacts/ | Qwen3-ASR-1.7B-FP16.gguf | FP16 | 311 |
| Decoder | Qwen3_ASR_1.7B_fp32_artifacts/ | qwen3_asr_decoder_fp32.npz | FP32 | 311 |
| Decoder | Qwen3_ASR_1.7B_fp32_artifacts/ | Qwen3-ASR-1.7B-FP32.gguf | FP32 | 311 |
| Encoder | Qwen3_ASR_1.7B_fp16_artifacts/ | qwen3_asr_encoder_fp16.npz | FP16 | 398 |
| Encoder | Qwen3_ASR_1.7B_fp16_artifacts/ | mmproj-Qwen3-ASR-1.7b-FP16.gguf | FP16 | 398 |
| Encoder | Qwen3_ASR_1.7B_fp32_artifacts/ | qwen3_asr_encoder_fp32.npz | FP32 | 398 |
| Encoder | Qwen3_ASR_1.7B_fp32_artifacts/ | mmproj-Qwen3-ASR-1.7b-FP32.gguf | FP32 | 398 |

Both NPZ files must be present for inference. The decoder NPZ contains the language model (311 tensors: embed_tokens, 28× transformer layers, final norm, lm_head), and the encoder NPZ contains the audio encoder (397 tensors: 3× conv2d, 24× transformer layers, projection head).
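To confirm that a downloaded NPZ file holds the expected number of tensors and the expected dtype, something like the following may help. It is a minimal sketch using NumPy's standard NPZ reader. `summarize_npz` is a hypothetical helper, not part of the reference code, and it reads each array fully in order to inspect its dtype.

```python
from collections import Counter

import numpy as np


def summarize_npz(path):
    """Return (tensor_count, Counter of dtype names) for an NPZ archive."""
    with np.load(path) as archive:
        # Each array is read in turn; only one is held in memory at a time.
        dtypes = Counter(str(archive[name].dtype) for name in archive.files)
        return len(archive.files), dtypes


# Example (path taken from the table above):
# count, dtypes = summarize_npz(
#     "./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz")
# print(count, dict(dtypes))
```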


How to run

set USE_TORCH=1
python generate.py ^
    --audio sample.wav ^
    --decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
    --encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
    --hf-model-path ../Qwen3-ASR-1.7B ^
    --verbose

Audio feature validation only (no decoding)

set USE_TORCH=1
python generate.py ^
    --audio sample.wav ^
    --decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
    --encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
    --hf-model-path ../Qwen3-ASR-1.7B ^
    --validate-audio-features ^
    --verbose
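The commands above use Windows cmd syntax (`set` and `^` line continuations). On other shells, one option is a small Python launcher that sets USE_TORCH=1 in the child environment and passes the same flags. This is only a sketch: `build_command` and `run` are hypothetical helpers, and the only flags used are the ones that appear in the commands above.

```python
import os
import subprocess
import sys


def build_command(audio, decoder_weights, encoder_weights, hf_model_path,
                  extra_flags=()):
    """Assemble the generate.py argument list used in the commands above."""
    return [
        sys.executable, "generate.py",
        "--audio", audio,
        "--decoder-weights", decoder_weights,
        "--encoder-weights", encoder_weights,
        "--hf-model-path", hf_model_path,
        *extra_flags,
        "--verbose",
    ]


def run(cmd):
    """Run the command with USE_TORCH=1 added to the current environment."""
    env = dict(os.environ, USE_TORCH="1")
    return subprocess.run(cmd, env=env, check=True)


# Example: audio feature validation only.
# run(build_command(
#     "sample.wav",
#     "./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz",
#     "./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz",
#     "../Qwen3-ASR-1.7B",
#     extra_flags=["--validate-audio-features"],
# ))
```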

Dump intermediates for numerical comparison

set USE_TORCH=1
python generate.py ^
    --audio sample.wav ^
    --decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
    --encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
    --hf-model-path ../Qwen3-ASR-1.7B ^
    --intermediate-output-path intermediates.npz ^
    --verbose
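Once two intermediate dumps exist, for example one from an FP16 run and one from an FP32 run, they can be compared key by key. The sketch below assumes that both files are NPZ archives of numeric arrays and that matching tensors share the same key. The key names written by generate.py are not documented here, and `compare_npz` and the second file name are hypothetical.

```python
import numpy as np


def compare_npz(path_a, path_b):
    """Return {key: max absolute difference} for arrays present in both files."""
    report = {}
    with np.load(path_a) as a, np.load(path_b) as b:
        for key in sorted(set(a.files) & set(b.files)):
            x = a[key].astype(np.float64)
            y = b[key].astype(np.float64)
            if x.shape != y.shape:
                report[key] = float("nan")  # shapes differ; not comparable
            elif x.size == 0:
                report[key] = 0.0
            else:
                report[key] = float(np.max(np.abs(x - y)))
    return report


# Example (second file name is hypothetical):
# for key, diff in compare_npz("intermediates.npz",
#                              "intermediates_fp32.npz").items():
#     print(f"{key}: max |diff| = {diff:.3e}")
```

Casting to float64 before subtracting keeps the comparison itself from adding FP16 rounding error.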

Key configuration

Config file: model_config.json

| Parameter | Value |
| --- | --- |
| DECODER_NUM_HIDDEN_LAYERS | 28 |
| DECODER_HIDDEN_SIZE | 2048 |
| DECODER_INTERMEDIATE_SIZE (FFN) | 6144 |
| DECODER_NUM_ATTENTION_HEADS | 16 |
| DECODER_NUM_KEY_VALUE_HEADS | 8 |
| DECODER_HEAD_DIM | 128 |
| VOCAB_SIZE | 151936 |
| DECODER_RMS_NORM_EPS | 1e-6 |
| DECODER_ROPE_THETA | 1000000.0 |
| DECODER_MAX_SEQ_LEN | 1024 |
| DECODER_MROPE_SECTION | [24, 20, 20] |
| DECODER_ENABLE_ROPE | 1 |
| ENCODER_NUM_LAYERS | 24 |
| ENCODER_HIDDEN_SIZE | 1024 |
| ENCODER_NUM_HEADS | 16 |
| ENCODER_HEAD_DIM | 64 |
| ENCODER_FFN_DIM | 4096 |
| ENCODER_OUTPUT_DIM | 2048 |
| ENCODER_LAYER_NORM_EPS | 1e-5 |
| ENCODER_MAX_SOURCE_POSITIONS | 1500 |
| ENCODER_DOWNSAMPLE_HIDDEN_SIZE | 480 |
| ENCODER_NUM_MEL_BINS | 128 |
| ENCODER_N_WINDOW | 50 |
| ENCODER_N_WINDOW_INFER | 800 |
| ENCODER_CONV_CHUNKSIZE | 500 |

These parameters are set in model_config.json, which is read by llama_model.py and llama_model_audio.py.
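A few of these values constrain each other, so an edited config can be sanity-checked before a run. The sketch below assumes that model_config.json is a flat JSON object keyed by the parameter names listed above, which is not confirmed here. `check_config` is a hypothetical helper. The M-RoPE check relies on the usual convention that the sections partition the head_dim/2 rotary frequency pairs.

```python
import json


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if cfg["ENCODER_NUM_HEADS"] * cfg["ENCODER_HEAD_DIM"] != cfg["ENCODER_HIDDEN_SIZE"]:
        problems.append("encoder heads * head_dim != encoder hidden size")
    if cfg["DECODER_NUM_ATTENTION_HEADS"] % cfg["DECODER_NUM_KEY_VALUE_HEADS"] != 0:
        problems.append("query heads are not a multiple of KV heads (GQA)")
    if 2 * sum(cfg["DECODER_MROPE_SECTION"]) != cfg["DECODER_HEAD_DIM"]:
        problems.append("M-RoPE sections do not sum to head_dim / 2")
    return problems


# Example (assumes the flat JSON layout described above):
# with open("model_config.json") as f:
#     print(check_config(json.load(f)) or "config looks consistent")
```

The values in the table pass all three checks: 16 × 64 = 1024, 16 is a multiple of 8, and 24 + 20 + 20 = 64 = 128 / 2.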

Note: running set USE_TORCH=1 before generate.py is mandatory. The reference code builds on torch_extend_ops.py, which requires PyTorch with CUDA.


Input / Output

Input

  • Audio: path to a local .wav file (--audio), 16 kHz mono
  • Dtype: FP16 (default). The reference weights are BF16/FP16; an FP32 variant is also available for higher-precision comparison.
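Since the input is expected to be 16 kHz mono, it can be worth checking a file before passing it to --audio. This is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only. `check_wav` is a hypothetical helper, and whether generate.py resamples or rejects other formats is not stated here.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels, duration


# Example:
# ok, rate, channels, duration = check_wav("sample.wav")
# if not ok:
#     print(f"expected 16 kHz mono, got {rate} Hz with {channels} channel(s)")
```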

Output

Loading decoder weights from qwen3_asr_decoder.npz …
Loading encoder weights from qwen3_asr_encoder.npz …
Transcript: Oh yeah, yeah. But you know, it's not a big deal..

The output is a plain-text transcript of the spoken audio.
