Qwen3-ForcedAligner-0.6B-4bit (MLX)

4-bit quantized version of Qwen/Qwen3-ForcedAligner-0.6B for Apple Silicon inference via MLX.

Predicts word-level timestamps for audio+text pairs in a single non-autoregressive forward pass.

Model Details

Component	Config
Audio encoder	24 layers, d_model=1024, 16 heads, FFN=4096, float16
Text decoder	28 layers, hidden=1024, 16Q/8KV heads, 4-bit quantized (group_size=64)
Classify head	Linear(1024, 5000), float16
Timestamp resolution	80 ms per class (5000 classes × 80 ms = 400 s max)
Total size	979 MB (vs 1.84 GB bf16)

How It Works

Audio + Text → Audio Encoder → Text Decoder (single pass) → Classify Head → argmax at <timestamp> positions → word timestamps

Unlike autoregressive ASR, which generates text token by token, the forced aligner runs the entire sequence through the decoder in a single forward pass. At each <timestamp> token position, the classify head predicts a timestamp class (0–4999), which maps to time as class_index × 80 ms.
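
The class-to-time mapping can be sketched in a few lines of Python. This is an illustration only, not the speech-swift implementation: the function names and the toy logits are hypothetical, and the sketch assumes the positions of the <timestamp> tokens are already known as a list of indices.

```python
# Illustrative sketch only; function names and toy data are hypothetical.
MS_PER_CLASS = 80    # timestamp resolution from the Model Details table
NUM_CLASSES = 5000   # output size of the classify head


def argmax(values):
    """Return the index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def class_to_seconds(class_index):
    """Map a timestamp class (0..4999) to seconds: class_index x 80 ms."""
    if not 0 <= class_index < NUM_CLASSES:
        raise ValueError(f"class index out of range: {class_index}")
    return class_index * MS_PER_CLASS / 1000.0


def decode_timestamps(logits, timestamp_positions):
    """Take the argmax class at each <timestamp> position and convert it to seconds.

    logits: one row of class scores per sequence position.
    timestamp_positions: indices of the <timestamp> tokens (assumed known).
    """
    return [class_to_seconds(argmax(logits[pos])) for pos in timestamp_positions]


# Toy input: 4 sequence positions, two of which are <timestamp> tokens.
# Rows are shortened to 8 classes for readability.
toy_logits = [
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # ordinary text token
    [0.0, 0.2, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0],  # <timestamp>: class 2
    [0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # ordinary text token
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.8, 0.0],  # <timestamp>: class 6
]
times = decode_timestamps(toy_logits, timestamp_positions=[1, 3])
print(times)  # [0.16, 0.48]
```

Because every position is scored in the same forward pass, decoding reduces to one argmax per <timestamp> token. The largest representable time is class 4999, or 399.92 s.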

Usage with Swift (MLX)

This model is designed for use with speech-swift:

import Qwen3ASR

// Load the pretrained forced aligner
let aligner = try await Qwen3ForcedAligner.fromPretrained()

// Align the transcript against the audio samples
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)

// Print each word with its start and end time in seconds
for word in aligned {
    print("[\(String(format: "%.2f", word.startTime))s - \(String(format: "%.2f", word.endTime))s] \(word.text)")
}

CLI

# Align with provided text
qwen3-asr-cli --align --text "Hello world" audio.wav

# Transcribe first, then align
qwen3-asr-cli --align audio.wav

Example output (for the sentence used in the Swift example above):

[0.12s - 0.45s] Can
[0.45s - 0.72s] you
[0.72s - 1.20s] guarantee
[1.20s - 1.48s] that
...
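
For downstream processing, output in this shape can be parsed into structured records. The following Python sketch assumes each line follows exactly the `[START s - END s] word` pattern shown in the sample; whether the CLI always prints this format is an assumption, and `parse_alignment` is a hypothetical helper, not part of the CLI.

```python
import re

# Matches lines such as "[0.12s - 0.45s] Can"
LINE_RE = re.compile(r"^\[(\d+(?:\.\d+)?)s - (\d+(?:\.\d+)?)s\]\s+(.+)$")


def parse_alignment(output):
    """Parse aligner output into (start_seconds, end_seconds, word) tuples.

    Lines that do not match the pattern (blank lines, "...") are skipped.
    """
    words = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            start, end, word = match.groups()
            words.append((float(start), float(end), word))
    return words


sample = """\
[0.12s - 0.45s] Can
[0.45s - 0.72s] you
[0.72s - 1.20s] guarantee
[1.20s - 1.48s] that
"""
parsed = parse_alignment(sample)
print(parsed[0])  # (0.12, 0.45, 'Can')
```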

Quantization

The text decoder (attention projections, MLP, and embeddings) is quantized to 4 bits using group quantization (group_size=64). The audio encoder and classify head are kept in float16 for accuracy.
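
To make the idea of group quantization concrete, here is a minimal pure-Python sketch of group-wise affine 4-bit quantization. It is not the MLX kernel and not the conversion script: the helper names are hypothetical, and the storage estimate assumes one float16 scale and one float16 bias per group of 64 weights.

```python
import math

GROUP_SIZE = 64
BITS = 4
LEVELS = 2 ** BITS - 1  # 15, so codes range over 0..15


def quantize_group(weights):
    """Quantize one group to 4-bit codes plus a per-group scale and bias."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a constant group
    codes = [min(LEVELS, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes, scale, and bias."""
    return [c * scale + bias for c in codes]


# Toy group of 64 weights.
group = [0.05 * math.sin(i) for i in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)

# Rounding to the nearest code bounds the error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(group, restored))

# Assumed storage: 4 bits per weight plus 16-bit scale and 16-bit bias per group.
bits_per_weight = BITS + (16 + 16) / GROUP_SIZE
print(bits_per_weight)  # 4.5
```

Smaller groups track the local weight range more closely but spend more bits on scales and biases, which is the trade-off behind a setting such as group_size=64.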

Converted with:

python scripts/convert_forced_aligner.py \
    --source Qwen/Qwen3-ForcedAligner-0.6B \
    --upload --repo-id aufklarer/Qwen3-ForcedAligner-0.6B-4bit
