
Qwen3-ASR-1.7B

Model

Qwen3-ASR-1.7B is a Qwen3VL-based automatic speech recognition (ASR) model for transcribing English speech audio.

Audio encoder: 24-layer Transformer encoder (hidden=1024, 16 heads, head_dim=64, FFN=4096, output_dim=2048) with 3 initial Conv2D layers (3×3, 480 channels)
Text decoder: 28-layer Qwen3VL decoder (hidden=2048, 16Q/8KV heads via GQA, head_dim=128, FFN=6144, SwiGLU, QK-Norm, M-RoPE, vocab 151 936)
Total parameters: ~1.7B

Reference implementation: reference-llm-models / qwen3-asr

Available weights

The model is split into two weight files: a decoder (LLM backbone) and an encoder (audio encoder). Each is provided in NPZ and GGUF formats, in FP16 and FP32.

| Component | Directory | File | Dtype | Tensors |
| --- | --- | --- | --- | --- |
| Decoder | `Qwen3_ASR_1.7B_fp16_artifacts/` | `qwen3_asr_decoder_fp16.npz` | FP16 | 311 |
| Decoder | `Qwen3_ASR_1.7B_fp16_artifacts/` | `Qwen3-ASR-1.7B-FP16.gguf` | FP16 | 311 |
| Decoder | `Qwen3_ASR_1.7B_fp32_artifacts/` | `qwen3_asr_decoder_fp32.npz` | FP32 | 311 |
| Decoder | `Qwen3_ASR_1.7B_fp32_artifacts/` | `Qwen3-ASR-1.7B-FP32.gguf` | FP32 | 311 |
| Encoder | `Qwen3_ASR_1.7B_fp16_artifacts/` | `qwen3_asr_encoder_fp16.npz` | FP16 | 398 |
| Encoder | `Qwen3_ASR_1.7B_fp16_artifacts/` | `mmproj-Qwen3-ASR-1.7b-FP16.gguf` | FP16 | 398 |
| Encoder | `Qwen3_ASR_1.7B_fp32_artifacts/` | `qwen3_asr_encoder_fp32.npz` | FP32 | 398 |
| Encoder | `Qwen3_ASR_1.7B_fp32_artifacts/` | `mmproj-Qwen3-ASR-1.7b-FP32.gguf` | FP32 | 398 |

Both NPZ files must be present for inference. The decoder NPZ contains the language model (311 tensors: embed_tokens, 28× transformer layers, final norm, lm_head), and the encoder NPZ contains the audio encoder (398 tensors: 3× conv2d, 24× transformer layers, projection head).
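One way to confirm that a downloaded archive matches the tensor counts and dtypes listed above is to inspect it directly. The sketch below relies only on the fact that an `.npz` file is a ZIP archive whose members are `.npy` files, so it needs nothing beyond the Python standard library. The helper names `read_npy_header` and `summarize_npz` are illustrative and are not part of the reference implementation.

```python
import ast
import struct
import zipfile


def read_npy_header(fp):
    """Parse the header of a .npy stream and return its dict
    (keys: 'descr', 'fortran_order', 'shape')."""
    if fp.read(6) != b"\x93NUMPY":
        raise ValueError("not a .npy stream")
    major, _minor = fp.read(2)
    # Format 1.x stores the header length in 2 bytes; 2.x and 3.x use 4 bytes.
    if major == 1:
        (header_len,) = struct.unpack("<H", fp.read(2))
    else:
        (header_len,) = struct.unpack("<I", fp.read(4))
    # The header is a Python dict literal, so literal_eval parses it safely.
    return ast.literal_eval(fp.read(header_len).decode("latin1"))


def summarize_npz(path):
    """Return (tensor_count, {dtype_descr: count}) for an .npz archive."""
    dtypes = {}
    with zipfile.ZipFile(path) as zf:
        names = [n for n in zf.namelist() if n.endswith(".npy")]
        for name in names:
            with zf.open(name) as fp:
                descr = read_npy_header(fp)["descr"]
            dtypes[descr] = dtypes.get(descr, 0) + 1
    return len(names), dtypes
```

Calling `summarize_npz("./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz")` returns the number of stored arrays and a per-dtype tally, where `'<f2'` denotes little-endian FP16 and `'<f4'` denotes FP32. Only headers are read, so the check stays fast even for large files.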

How to run

set USE_TORCH=1
python generate.py ^
    --audio sample.wav ^
    --decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
    --encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
    --hf-model-path ../Qwen3-ASR-1.7B ^
    --verbose
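
The command above uses Windows Command Prompt syntax (`set` and `^`). For a launcher that behaves the same on any platform, the call can be wrapped in Python. This is a minimal sketch that assumes `generate.py` is in the current directory and accepts exactly the flags shown above. The function names and the `dtype` and `extra_flags` parameters are hypothetical conveniences, not part of the reference code.

```python
import os
import subprocess
import sys


def build_generate_command(audio, artifacts_dir, hf_model_path,
                           dtype="fp16", extra_flags=()):
    """Assemble the generate.py argument list for one audio file.

    dtype selects the weight file suffix ("fp16" or "fp32");
    extra_flags is appended before --verbose.
    """
    return [
        sys.executable, "generate.py",
        "--audio", audio,
        "--decoder-weights",
        os.path.join(artifacts_dir, f"qwen3_asr_decoder_{dtype}.npz"),
        "--encoder-weights",
        os.path.join(artifacts_dir, f"qwen3_asr_encoder_{dtype}.npz"),
        "--hf-model-path", hf_model_path,
        *extra_flags,
        "--verbose",
    ]


def run_generate(cmd):
    """Run the command with USE_TORCH=1 set in a copy of the environment."""
    env = dict(os.environ, USE_TORCH="1")
    return subprocess.run(cmd, env=env, check=True)
```

For example, `run_generate(build_generate_command("sample.wav", "./Qwen3_ASR_1.7B_fp16_artifacts", "../Qwen3-ASR-1.7B"))` mirrors the invocation above. Passing the arguments as a list avoids shell quoting and line-continuation differences between platforms.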

Audio feature validation only (no decoding)

set USE_TORCH=1
python generate.py ^
    --audio sample.wav ^
    --decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
    --encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
    --hf-model-path ../Qwen3-ASR-1.7B ^
    --validate-audio-features ^
    --verbose

Dump intermediates for numerical comparison

set USE_TORCH=1
python generate.py ^
    --audio sample.wav ^
    --decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
    --encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
    --hf-model-path ../Qwen3-ASR-1.7B ^
    --intermediate-output-path intermediates.npz ^
    --verbose

Key configuration

Config file: model_config.json

| Parameter | Value |
| --- | --- |
| `DECODER_NUM_HIDDEN_LAYERS` | 28 |
| `DECODER_HIDDEN_SIZE` | 2048 |
| `DECODER_INTERMEDIATE_SIZE` (FFN) | 6144 |
| `DECODER_NUM_ATTENTION_HEADS` | 16 |
| `DECODER_NUM_KEY_VALUE_HEADS` | 8 |
| `DECODER_HEAD_DIM` | 128 |
| `VOCAB_SIZE` | 151936 |
| `DECODER_RMS_NORM_EPS` | 1e-6 |
| `DECODER_ROPE_THETA` | 1000000.0 |
| `DECODER_MAX_SEQ_LEN` | 1024 |
| `DECODER_MROPE_SECTION` | [24, 20, 20] |
| `DECODER_ENABLE_ROPE` | 1 |
| `ENCODER_NUM_LAYERS` | 24 |
| `ENCODER_HIDDEN_SIZE` | 1024 |
| `ENCODER_NUM_HEADS` | 16 |
| `ENCODER_HEAD_DIM` | 64 |
| `ENCODER_FFN_DIM` | 4096 |
| `ENCODER_OUTPUT_DIM` | 2048 |
| `ENCODER_LAYER_NORM_EPS` | 1e-5 |
| `ENCODER_MAX_SOURCE_POSITIONS` | 1500 |
| `ENCODER_DOWNSAMPLE_HIDDEN_SIZE` | 480 |
| `ENCODER_NUM_MEL_BINS` | 128 |
| `ENCODER_N_WINDOW` | 50 |
| `ENCODER_N_WINDOW_INFER` | 800 |
| `ENCODER_CONV_CHUNKSIZE` | 500 |

These values are set in model_config.json, which is read by llama_model.py and llama_model_audio.py.
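
Several of the values above are related to one another, and working through the arithmetic is a cheap sanity check after editing the config. The sketch below copies a subset of the table into a dict rather than reading model_config.json, because the file's exact layout is not documented here. The KV-cache figure assumes a typical implementation that stores one FP16 key and one FP16 value vector per layer and KV head, which may differ from what the reference code does. `derive_quantities` is an illustrative helper name.

```python
cfg = {
    "DECODER_NUM_HIDDEN_LAYERS": 28,
    "DECODER_HIDDEN_SIZE": 2048,
    "DECODER_NUM_ATTENTION_HEADS": 16,
    "DECODER_NUM_KEY_VALUE_HEADS": 8,
    "DECODER_HEAD_DIM": 128,
    "DECODER_MAX_SEQ_LEN": 1024,
    "DECODER_MROPE_SECTION": [24, 20, 20],
    "ENCODER_HIDDEN_SIZE": 1024,
    "ENCODER_NUM_HEADS": 16,
    "ENCODER_HEAD_DIM": 64,
    "ENCODER_OUTPUT_DIM": 2048,
}


def derive_quantities(cfg, bytes_per_element=2):
    """Compute quantities implied by the config (FP16 by default)."""
    head_dim = cfg["DECODER_HEAD_DIM"]
    q_heads = cfg["DECODER_NUM_ATTENTION_HEADS"]
    kv_heads = cfg["DECODER_NUM_KEY_VALUE_HEADS"]
    # Assumed cache layout: K and V, per layer, per KV head, per token.
    kv_bytes_per_token = (2 * cfg["DECODER_NUM_HIDDEN_LAYERS"]
                          * kv_heads * head_dim * bytes_per_element)
    return {
        "decoder_q_width": q_heads * head_dim,
        "decoder_kv_width": kv_heads * head_dim,
        "gqa_group_size": q_heads // kv_heads,
        "encoder_attn_width": cfg["ENCODER_NUM_HEADS"] * cfg["ENCODER_HEAD_DIM"],
        # M-RoPE sections count rotary frequency pairs, i.e. half of head_dim.
        "mrope_pairs": sum(cfg["DECODER_MROPE_SECTION"]),
        "kv_cache_bytes_per_token": kv_bytes_per_token,
        "kv_cache_mib_at_max_len":
            kv_bytes_per_token * cfg["DECODER_MAX_SEQ_LEN"] / 2**20,
    }


derived = derive_quantities(cfg)
```

With the values from the table, the query width (16 × 128 = 2048) equals `DECODER_HIDDEN_SIZE`, each pair of query heads shares one KV head, the M-RoPE sections sum to 64 (half of `DECODER_HEAD_DIM`), and `ENCODER_OUTPUT_DIM` matches the decoder hidden size so encoder outputs can be fed to the decoder without a further projection. Under the stated cache assumption, a full 1024-token context needs about 112 MiB of KV cache in FP16.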

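Note: setting USE_TORCH=1 is mandatory. The reference code is built on torch_extend_ops.py, which requires PyTorch with CUDA. The commands above use Windows Command Prompt syntax; in a Linux or macOS shell, use `export USE_TORCH=1` and `\` for line continuation instead of `set` and `^`.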

Input / Output

Input

Audio: path to a local .wav file (--audio), 16 kHz mono
Dtype: FP16 (default). The reference weights are BF16/FP16; an FP32 variant is also available for higher-precision comparison.
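
Because the input must be 16 kHz mono, it can help to check a file before running inference. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files. `check_wav` is an illustrative helper and not part of the reference implementation, which may perform its own validation or resampling.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (ok, info), where ok is True if the file is 16 kHz mono.

    The wave module raises wave.Error for files it cannot parse,
    such as compressed formats.
    """
    with wave.open(path, "rb") as wf:
        info = {
            "sample_rate": wf.getframerate(),
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }
    ok = (info["sample_rate"] == expected_rate
          and info["channels"] == expected_channels)
    return ok, info
```

If `check_wav("sample.wav")` reports a different sample rate or channel count, convert the file with an audio tool of your choice before passing it to `--audio`.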

Output

Loading decoder weights from qwen3_asr_decoder.npz …
Loading encoder weights from qwen3_asr_encoder.npz …
Transcript: Oh yeah, yeah. But you know, it's not a big deal..

The output is a plain-text transcript of the spoken audio.
