Qwen3 ASR 0.6B Encoder — LiteRT (INT8)

Qwen3-ASR audio encoder (zh / yue / en). INT8 weight-only.

Part of the soniqo.audio speech toolkit — an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

Multilingual transcription

Audio encoder of Qwen3-ASR-0.6B, specialized for Chinese (including 22 Chinese dialects) and 30 additional languages. Exported to LiteRT for Android. The text decoder is a Qwen3-0.6B LLM and is intended to run through LiteRT-LM as a separate runtime.

Model

Property	Value
Component	Audio encoder only
Parameters	~180 M (encoder), decoder is a separate 0.6B LLM
Format	LiteRT (TFLite)
Quantization	INT8 weight-only, dynamic-range (fp32 activations)
Sample rate	16 000 Hz
Input	128-bin log mel, 1000 frames (10 s, fixed)
Output	125 audio embedding tokens, 1024-dim each
Languages	30 + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, …)
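
Because the input is fixed at 10 s, longer recordings have to be split before encoding. The sketch below shows one possible way to do that, assuming non-overlapping windows and zero-padding of the final window. Neither choice is prescribed by this bundle, and the helper name `window_bounds` is hypothetical.

```python
SAMPLE_RATE = 16_000        # Hz, from the Model table
WINDOW_SECONDS = 10         # fixed encoder window
HOP_LENGTH = 160            # samples per mel frame (see Audio preprocessing)
TOKENS_PER_WINDOW = 125     # encoder output length

WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS              # 160_000
FRAMES_PER_WINDOW = WINDOW_SAMPLES // HOP_LENGTH           # 1000
MS_PER_TOKEN = 1000 * WINDOW_SECONDS / TOKENS_PER_WINDOW   # 80.0


def window_bounds(num_samples: int) -> list[tuple[int, int, int]]:
    """Return (start, end, pad) for each 10 s window.

    `pad` is the number of zero samples needed to fill that window
    (assumption: the last window is zero-padded rather than dropped).
    """
    bounds = []
    for start in range(0, max(num_samples, 1), WINDOW_SAMPLES):
        end = min(start + WINDOW_SAMPLES, num_samples)
        bounds.append((start, end, WINDOW_SAMPLES - (end - start)))
    return bounds
```

With these numbers, a 25 s clip yields three windows, the last one padded by 5 s, and each of the 125 output embeddings covers 80 ms of audio.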

Files

File	Size	Description
`qwen3-asr-encoder.tflite`	180.5 MB	Audio encoder, INT8
`config.json`	1 KB	Architecture + I/O specs
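
The 180.5 MB file size is consistent with roughly one byte per encoder parameter. To illustrate what weight-only INT8 storage means, here is a minimal sketch of symmetric per-tensor quantization. The scheme actually used by this export (per-tensor or per-channel, symmetric or asymmetric) is not stated here, so the details below are assumptions, and the function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: w ≈ q * scale, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map INT8 values back to floats; activations stay fp32 throughout."""
    return [v * scale for v in q]


# Approximate storage: one byte per weight (scales add a negligible amount).
approx_params = 180_000_000            # "~180 M" from the Model table
approx_size_mb = approx_params / 1e6   # about 180 MB, close to the 180.5 MB file
```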

Signature

Inputs:
  mel               [1, 128, 1000]   float32   10 s log mel spectrogram

Outputs:
  audio_embeddings  [1, 125, 1024]   float32   For cross-attention into the decoder
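
A minimal sketch of calling this signature from Python, useful for desktop-side sanity checks before deploying on Android. It assumes the `ai-edge-litert` package, whose `Interpreter` class mirrors `tf.lite.Interpreter`. The helper names `validate_mel` and `encode` are hypothetical, and the sketch assumes the model exposes exactly one input and one output tensor, as listed above.

```python
import numpy as np

MEL_SHAPE = (1, 128, 1000)          # [batch, mel bins, frames]
EMBEDDING_SHAPE = (1, 125, 1024)    # [batch, tokens, dim]


def validate_mel(mel: np.ndarray) -> np.ndarray:
    """Check the fixed input shape and return a contiguous float32 array."""
    if mel.shape != MEL_SHAPE:
        raise ValueError(f"expected mel of shape {MEL_SHAPE}, got {mel.shape}")
    return np.ascontiguousarray(mel, dtype=np.float32)


def encode(model_path: str, mel: np.ndarray) -> np.ndarray:
    """Run one 10 s window through the encoder and return the embeddings."""
    # Imported lazily so the shape helper works without the runtime installed.
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    interpreter.set_tensor(input_index, validate_mel(mel))
    interpreter.invoke()
    embeddings = interpreter.get_tensor(output_index)
    assert embeddings.shape == EMBEDDING_SHAPE
    return embeddings
```

Usage would look like `embeddings = encode("qwen3-asr-encoder.tflite", mel)`, where `mel` comes from the preprocessing described under Audio preprocessing.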

Architecture

mel [1, 128, 1000]
  └── 3× Conv2d(stride=2) + GELU          → [1, 480, 16, 125]
  └── reshape → Linear(7680→896)          → [1, 125, 896]
  └── + sinusoidal pos embed
  └── 18× pre-norm Transformer            → [1, 125, 896]
  └── LayerNorm → Linear(896) → GELU
  └── Linear(896→1024)                    → [1, 125, 1024]
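
The shapes in this diagram can be checked with the standard convolution output-size formula. The sketch below assumes kernel size 3 and padding 1 for each stride-2 convolution. Those two values are not given above; they are chosen because they reproduce the stated shapes.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of one convolution along a single axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(mel_bins: int = 128, frames: int = 1000, channels: int = 480):
    """Follow [1, mel_bins, frames] through the three stride-2 convolutions."""
    freq, time = mel_bins, frames
    for _ in range(3):
        freq, time = conv_out(freq), conv_out(time)
    conv_shape = (1, channels, freq, time)   # [1, 480, 16, 125]
    flat_features = channels * freq          # input width of Linear(7680→896)
    return conv_shape, flat_features
```

Each convolution halves both axes, so 128 mel bins become 16 and 1000 frames become 125. Flattening 480 channels over 16 frequency positions gives the 7680 features fed to the first linear layer.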

Why encoder only

The text decoder is a full Qwen3-0.6B language model with GQA, RoPE, SwiGLU and RMSNorm. It doesn't fit cleanly into a single .tflite; the right runtime for LLM decoders on Android is LiteRT-LM or a comparable LLM executor, with the audio embeddings from this encoder wired in as cross-attention context.

For ASR-only (no LLM), pair this encoder with a CTC or transducer head fine-tuned on your target languages.
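
If you take the CTC route, decoding the head's output is straightforward. Below is a minimal sketch of greedy CTC decoding over per-token class IDs, for example the argmax over a hypothetical `[125, vocab]` logits matrix produced by a CTC head on top of the 125 embeddings. The blank ID of 0 and the function name are assumptions, and no such head ships with this bundle.

```python
def ctc_greedy_decode(frame_ids: list[int], blank_id: int = 0) -> list[int]:
    """Collapse repeated IDs, then drop blanks (standard greedy CTC rule)."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        # Emit only when the ID changes and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            decoded.append(token_id)
        previous = token_id
    return decoded
```

A blank between two identical IDs keeps them as separate tokens, which is how CTC represents genuinely repeated symbols.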

Audio preprocessing

16 kHz mono, float32
128 log mel bins
n_fft=400, hop_length=160, win_length=400, pad_mode="reflect"
log mel, mean/std normalization per utterance

The exact reference is in the upstream Qwen3-ASR tokenizer config.
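
A minimal sketch of the steps listed above, using `librosa` for the mel spectrogram. Several details are assumptions rather than facts from this card: the natural-log base, the `1e-10` floor, using a single global mean and standard deviation per utterance, and zero-padding short clips after normalization. Verify each against the upstream configuration before relying on the output. The function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE, N_MELS, N_FRAMES = 16_000, 128, 1000


def log_mel(audio: np.ndarray) -> np.ndarray:
    """16 kHz mono float32 audio -> [128, T] log-mel spectrogram."""
    import librosa  # lazy import: only needed for the spectrogram itself

    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_fft=400, hop_length=160,
        win_length=400, n_mels=N_MELS, pad_mode="reflect",
    )
    return np.log(np.maximum(mel, 1e-10))  # log base and floor are assumptions


def normalize_and_fit(features: np.ndarray) -> np.ndarray:
    """Per-utterance mean/std normalization, then trim or zero-pad to 1000 frames."""
    features = features[:, :N_FRAMES]
    normed = (features - features.mean()) / (features.std() + 1e-8)
    padded = np.zeros((N_MELS, N_FRAMES), dtype=np.float32)
    padded[:, : normed.shape[1]] = normed
    return padded[np.newaxis]  # -> [1, 128, 1000]
```

With librosa's default centered framing, 10 s of audio produces 1001 frames, which is why the sketch trims to 1000 before normalizing. `normalize_and_fit(log_mel(audio))` then matches the `[1, 128, 1000]` input shape.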

Source

Upstream: Qwen/Qwen3-ASR-0.6B (Apache 2.0). Released January 2026 as part of the Qwen3 audio family.

Ecosystem

soniqo.audio — use-case explorer (transcription, voice cloning, live ASR, voice agents).
speech-core — C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
speech-swift — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
speech-android — Android SDK consuming on-device LiteRT bundles.

Other LiteRT models in this collection

ASR / Transcription

VAD / Diarization

TTS / Voice Cloning

VoxCPM2 — LiteRT (INT8)

License

This bundle inherits the upstream model license (apache-2.0). See the linked base_model repository for the full terms.
