Qwen3-ASR-1.7B
Model
Qwen3-ASR-1.7B is a Qwen3VL-based automatic speech recognition (ASR) model for transcribing English speech audio.
- Audio encoder: 24-layer Transformer encoder (hidden=1024, 16 heads, head_dim=64, FFN=4096, output_dim=2048) with 3 initial Conv2D layers (3×3, 480 channels)
- Text decoder: 28-layer Qwen3VL decoder (hidden=2048, 16Q/8KV heads via GQA, head_dim=128, FFN=6144, SwiGLU, QK-Norm, M-RoPE, vocab 151,936)
- Total parameters: ~1.7B
Reference implementation: reference-llm-models / qwen3-asr
Available weights
The model consists of two weight files, a decoder (LLM backbone) and an encoder (audio encoder), each provided in FP16 and FP32 and in both NPZ and GGUF formats.
| Component | Directory | File | Dtype | Tensors |
|---|---|---|---|---|
| Decoder | `Qwen3_ASR_1.7B_fp16_artifacts/` | `qwen3_asr_decoder_fp16.npz` | FP16 | 311 |
| Decoder | `Qwen3_ASR_1.7B_fp16_artifacts/` | `Qwen3-ASR-1.7B-FP16.gguf` | FP16 | 311 |
| Decoder | `Qwen3_ASR_1.7B_fp32_artifacts/` | `qwen3_asr_decoder_fp32.npz` | FP32 | 311 |
| Decoder | `Qwen3_ASR_1.7B_fp32_artifacts/` | `Qwen3-ASR-1.7B-FP32.gguf` | FP32 | 311 |
| Encoder | `Qwen3_ASR_1.7B_fp16_artifacts/` | `qwen3_asr_encoder_fp16.npz` | FP16 | 398 |
| Encoder | `Qwen3_ASR_1.7B_fp16_artifacts/` | `mmproj-Qwen3-ASR-1.7b-FP16.gguf` | FP16 | 398 |
| Encoder | `Qwen3_ASR_1.7B_fp32_artifacts/` | `qwen3_asr_encoder_fp32.npz` | FP32 | 398 |
| Encoder | `Qwen3_ASR_1.7B_fp32_artifacts/` | `mmproj-Qwen3-ASR-1.7b-FP32.gguf` | FP32 | 398 |
Both NPZ files must be present for inference: the decoder NPZ contains the language model (311 tensors: embed_tokens, 28× transformer layers, final norm, lm_head), and the encoder NPZ contains the audio encoder (397 tensors: 3× conv2d, 24× transformer layers, projection head).
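Before running inference, it can help to confirm that a downloaded NPZ file holds the expected number of tensors and the expected dtype. The following is a minimal sketch, assuming the files are standard NumPy archives readable with `np.load`; the helper name `summarize_npz` is hypothetical and not part of the reference code.

```python
import numpy as np


def summarize_npz(path):
    """Return (tensor_count, sorted list of dtype names) for an NPZ archive.

    Each array is loaded once to read its dtype, so this touches every
    tensor in the file and may take a while on the full weights.
    """
    with np.load(path) as archive:
        names = archive.files
        dtypes = sorted({str(archive[name].dtype) for name in names})
    return len(names), dtypes


# Example usage (paths taken from the table above):
# count, dtypes = summarize_npz(
#     "./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz")
# print(count, dtypes)
```

The returned count can then be compared against the tensor counts listed for the decoder and encoder files.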
How to run
set USE_TORCH=1
python generate.py ^
--audio sample.wav ^
--decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
--encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
--hf-model-path ../Qwen3-ASR-1.7B ^
--verbose
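The command above uses Windows `cmd` syntax (`set`, `^` line continuations). As a shell-independent alternative, the same invocation can be sketched from Python. This assumes `generate.py` is in the current directory and accepts exactly the flags shown above; `build_command` and `run_transcription` are hypothetical helper names.

```python
import os
import subprocess
import sys


def build_command(audio, decoder_weights, encoder_weights, hf_model_path,
                  extra_flags=()):
    """Assemble the generate.py argument list using the flags shown above."""
    return [
        sys.executable, "generate.py",
        "--audio", audio,
        "--decoder-weights", decoder_weights,
        "--encoder-weights", encoder_weights,
        "--hf-model-path", hf_model_path,
        *extra_flags,
        "--verbose",
    ]


def run_transcription(command):
    """Run the command with USE_TORCH=1 set in the child environment."""
    env = dict(os.environ, USE_TORCH="1")
    return subprocess.run(command, env=env, check=True)


# Example usage:
# cmd = build_command(
#     "sample.wav",
#     "./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz",
#     "./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz",
#     "../Qwen3-ASR-1.7B",
# )
# run_transcription(cmd)
```

Passing the environment variable through `env` avoids depending on `set` versus `export`, so the same script should work on Windows, Linux, and macOS.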
Audio feature validation only (no decoding)
set USE_TORCH=1
python generate.py ^
--audio sample.wav ^
--decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
--encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
--hf-model-path ../Qwen3-ASR-1.7B ^
--validate-audio-features ^
--verbose
Dump intermediates for numerical comparison
set USE_TORCH=1
python generate.py ^
--audio sample.wav ^
--decoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_decoder_fp16.npz ^
--encoder-weights ./Qwen3_ASR_1.7B_fp16_artifacts/qwen3_asr_encoder_fp16.npz ^
--hf-model-path ../Qwen3-ASR-1.7B ^
--intermediate-output-path intermediates.npz ^
--verbose
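Once intermediates have been dumped from two runs (for example FP16 and FP32), they can be compared tensor by tensor. The sketch below assumes the dump is a standard NPZ archive of numeric arrays; the key names inside `intermediates.npz` are not documented here, so the code simply compares whichever keys the two files share. `compare_npz` is a hypothetical helper.

```python
import numpy as np


def compare_npz(path_a, path_b):
    """Return {name: max absolute difference} for keys present in both files.

    A value of None marks a key whose arrays have different shapes.
    Keys present in only one file are ignored.
    """
    report = {}
    with np.load(path_a) as a, np.load(path_b) as b:
        for name in sorted(set(a.files) & set(b.files)):
            x = a[name].astype(np.float64)
            y = b[name].astype(np.float64)
            if x.shape != y.shape:
                report[name] = None
            elif x.size == 0:
                report[name] = 0.0
            else:
                report[name] = float(np.max(np.abs(x - y)))
    return report


# Example usage (file names are placeholders for two separate dumps):
# for name, diff in compare_npz("intermediates_fp16.npz",
#                               "intermediates_fp32.npz").items():
#     print(name, diff)
```

Casting to float64 before subtracting keeps the comparison itself from adding FP16 rounding error.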
Key configuration
Config file: model_config.json
| Parameter | Value |
|---|---|
| `DECODER_NUM_HIDDEN_LAYERS` | 28 |
| `DECODER_HIDDEN_SIZE` | 2048 |
| `DECODER_INTERMEDIATE_SIZE` (FFN) | 6144 |
| `DECODER_NUM_ATTENTION_HEADS` | 16 |
| `DECODER_NUM_KEY_VALUE_HEADS` | 8 |
| `DECODER_HEAD_DIM` | 128 |
| `VOCAB_SIZE` | 151936 |
| `DECODER_RMS_NORM_EPS` | 1e-6 |
| `DECODER_ROPE_THETA` | 1000000.0 |
| `DECODER_MAX_SEQ_LEN` | 1024 |
| `DECODER_MROPE_SECTION` | [24, 20, 20] |
| `DECODER_ENABLE_ROPE` | 1 |
| `ENCODER_NUM_LAYERS` | 24 |
| `ENCODER_HIDDEN_SIZE` | 1024 |
| `ENCODER_NUM_HEADS` | 16 |
| `ENCODER_HEAD_DIM` | 64 |
| `ENCODER_FFN_DIM` | 4096 |
| `ENCODER_OUTPUT_DIM` | 2048 |
| `ENCODER_LAYER_NORM_EPS` | 1e-5 |
| `ENCODER_MAX_SOURCE_POSITIONS` | 1500 |
| `ENCODER_DOWNSAMPLE_HIDDEN_SIZE` | 480 |
| `ENCODER_NUM_MEL_BINS` | 128 |
| `ENCODER_N_WINDOW` | 50 |
| `ENCODER_N_WINDOW_INFER` | 800 |
| `ENCODER_CONV_CHUNKSIZE` | 500 |
These values are set in model_config.json, which is read by llama_model.py and llama_model_audio.py.
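Several of the configuration values are tied together by simple arithmetic, which is worth checking when editing the config. The sketch below copies a subset of the values from the table into a plain dictionary (it does not read model_config.json, whose exact layout is not shown here) and derives the attention projection widths. The interpretation that the M-RoPE sections sum to half the head dimension follows the usual M-RoPE convention and is an assumption about this implementation; `derived_shapes` is a hypothetical helper.

```python
# Subset of values copied from the configuration table above.
config = {
    "DECODER_HIDDEN_SIZE": 2048,
    "DECODER_NUM_ATTENTION_HEADS": 16,
    "DECODER_NUM_KEY_VALUE_HEADS": 8,
    "DECODER_HEAD_DIM": 128,
    "DECODER_MROPE_SECTION": [24, 20, 20],
    "ENCODER_HIDDEN_SIZE": 1024,
    "ENCODER_NUM_HEADS": 16,
    "ENCODER_HEAD_DIM": 64,
}


def derived_shapes(cfg):
    """Derive projection widths and grouping implied by the config."""
    heads = cfg["DECODER_NUM_ATTENTION_HEADS"]
    kv_heads = cfg["DECODER_NUM_KEY_VALUE_HEADS"]
    head_dim = cfg["DECODER_HEAD_DIM"]
    return {
        "q_proj_out": heads * head_dim,            # 16 * 128 = 2048
        "kv_proj_out": kv_heads * head_dim,        # 8 * 128 = 1024
        "queries_per_kv_head": heads // kv_heads,  # GQA group size: 2
        "mrope_pairs": sum(cfg["DECODER_MROPE_SECTION"]),  # 24 + 20 + 20 = 64
        "encoder_attn_dim": cfg["ENCODER_NUM_HEADS"] * cfg["ENCODER_HEAD_DIM"],
    }


shapes = derived_shapes(config)
# Rotary embeddings act on pairs of channels, so the sections cover head_dim / 2.
assert shapes["mrope_pairs"] == config["DECODER_HEAD_DIM"] // 2
# The encoder heads together span the encoder hidden size.
assert shapes["encoder_attn_dim"] == config["ENCODER_HIDDEN_SIZE"]
```

With grouped-query attention, each key/value head here is shared by two query heads, so the K and V projections are half the width of the Q projection.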
Note: `set USE_TORCH=1` is mandatory. The reference code is based on `torch_extend_ops.py`, which requires PyTorch with CUDA.
Input / Output
Input
- Audio: path to a local `.wav` file (`--audio`), 16 kHz mono
- Dtype: FP16 (default). The reference weights are BF16/FP16; an FP32 variant is also available for higher-precision comparison.
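Since the input is expected to be 16 kHz mono, a quick check of the file header can catch a mismatched recording before running the model. This is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files; other encodings may need a different reader. Whether `generate.py` resamples or rejects other formats is not stated here, and `check_wav` is a hypothetical helper.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels


# Example usage:
# ok, rate, channels = check_wav("sample.wav")
# if not ok:
#     print(f"Expected 16 kHz mono, got {rate} Hz with {channels} channel(s)")
```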
Output
Loading decoder weights from qwen3_asr_decoder.npz …
Loading encoder weights from qwen3_asr_encoder.npz …
Transcript: Oh yeah, yeah. But you know, it's not a big deal..
The output is a plain-text transcript of the spoken audio.