Qwen3-ForcedAligner-0.6B-4bit (MLX)
4-bit quantized version of Qwen/Qwen3-ForcedAligner-0.6B for Apple Silicon inference via MLX.
Predicts word-level timestamps for audio+text pairs in a single non-autoregressive forward pass.
Model Details
| Component | Config |
|---|---|
| Audio encoder | 24 layers, d_model=1024, 16 heads, FFN=4096, float16 |
| Text decoder | 28 layers, hidden=1024, 16Q/8KV heads, 4-bit quantized (group_size=64) |
| Classify head | Linear(1024, 5000), float16 |
| Timestamp resolution | 80ms per class (5000 classes = 400s max) |
| Total size | 979 MB (vs 1.84 GB bf16) |
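The 400 s ceiling in the table is the class count times the resolution. A minimal sketch of a pre-check built on that arithmetic follows; `fitsTimestampRange` is a hypothetical helper, not part of qwen3-asr-swift, and how the library itself handles longer audio is not stated here.

```swift
// Values taken from the model table above.
let timestampClassCount = 5000
let millisecondsPerClass = 80

/// Longest span the classify head can address: 5000 × 80 ms = 400 000 ms.
let maxAddressableMilliseconds = timestampClassCount * millisecondsPerClass

/// Hypothetical helper: true when a clip's duration fits within that span.
func fitsTimestampRange(sampleCount: Int, sampleRate: Int) -> Bool {
    guard sampleCount >= 0, sampleRate > 0 else { return false }
    // Integer comparison of durations: sampleCount / sampleRate <= maxMs / 1000.
    return sampleCount * 1000 <= maxAddressableMilliseconds * sampleRate
}

// A 30-second clip at 24 kHz fits comfortably.
print(fitsTimestampRange(sampleCount: 30 * 24_000, sampleRate: 24_000))
```

Comparing in integer milliseconds avoids floating-point rounding at the boundary.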
How It Works
Audio + Text → Audio Encoder → Text Decoder (single pass) → Classify Head → argmax at <timestamp> positions → word timestamps
Unlike ASR (autoregressive, token-by-token), the forced aligner runs the entire sequence in one forward pass through the decoder. The classify head predicts a timestamp class (0–4999) at each <timestamp> token position, which maps to time via class_index × 80ms.
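The last two steps of the pipeline (argmax, then class index to time) can be sketched in plain Swift. This is an illustration only: the function names are hypothetical, the logits row is toy data, and the real implementation presumably runs on MLX arrays rather than `[Float]`.

```swift
/// Index of the largest value in one row of classify-head logits,
/// or nil when the row is empty.
func argmax(_ logits: [Float]) -> Int? {
    guard !logits.isEmpty else { return nil }
    var best = 0
    for i in 1..<logits.count where logits[i] > logits[best] { best = i }
    return best
}

/// Map a timestamp class (0...4999) to milliseconds: class_index × 80 ms.
func milliseconds(forTimestampClass classIndex: Int) -> Int {
    classIndex * 80
}

// Toy input: one logits row (one <timestamp> position) peaking at class 15.
var row = [Float](repeating: 0, count: 5000)
row[15] = 3.5

if let cls = argmax(row) {
    print("class \(cls) -> \(milliseconds(forTimestampClass: cls)) ms")
}
```

With the toy row above, the peak at class 15 maps to 1200 ms, and the highest class, 4999, maps to 399 920 ms.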
Usage with Swift (MLX)
This model is designed for use with qwen3-asr-swift:
import Qwen3ASR
let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(String(format: "%.2f", word.startTime))s - \(String(format: "%.2f", word.endTime))s] \(word.text)")
}
CLI
# Align with provided text
qwen3-asr-cli --align --text "Hello world" audio.wav
# Transcribe first, then align
qwen3-asr-cli --align audio.wav
Output:
[0.12s - 0.45s] Can
[0.45s - 0.72s] you
[0.72s - 1.20s] guarantee
[1.20s - 1.48s] that
...
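If the CLI output needs to be consumed by another tool, each line can be parsed back into a start time, end time, and word. The sketch below assumes the exact `[start s - end s] word` layout shown above; `AlignedSpan` and `parseAlignedLine` are hypothetical names, not part of qwen3-asr-swift.

```swift
import Foundation

/// One parsed output line: times in seconds plus the aligned word.
struct AlignedSpan: Equatable {
    let start: Double
    let end: Double
    let text: String
}

/// Parse a line such as "[0.12s - 0.45s] Can".
/// Returns nil for anything that does not match that layout.
func parseAlignedLine(_ line: String) -> AlignedSpan? {
    guard line.hasPrefix("["), let close = line.firstIndex(of: "]") else { return nil }
    // Text between the brackets, e.g. "0.12s - 0.45s".
    let inside = line[line.index(after: line.startIndex)..<close]
    let parts = inside.components(separatedBy: " - ")
    guard parts.count == 2,
          parts.allSatisfy({ $0.hasSuffix("s") }),
          let start = Double(parts[0].dropLast()),   // drop the trailing "s"
          let end = Double(parts[1].dropLast()) else { return nil }
    let text = line[line.index(after: close)...].trimmingCharacters(in: .whitespaces)
    return AlignedSpan(start: start, end: end, text: text)
}

let sample = ["[0.12s - 0.45s] Can", "[0.45s - 0.72s] you", "..."]
let spans = sample.compactMap(parseAlignedLine)
print(spans.map { $0.text })
```

Lines that do not match, such as the `...` elision, are skipped by `compactMap`.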
Quantization
The text decoder (attention projections, MLP, embeddings) is quantized to 4 bits using group quantization (group_size=64). The audio encoder and classify head are kept in float16 to preserve accuracy.
Converted with:
python scripts/convert_forced_aligner.py \
    --source Qwen/Qwen3-ForcedAligner-0.6B \
    --upload --repo-id aufklarer/Qwen3-ForcedAligner-0.6B-4bit
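To make the group quantization described in this section concrete, here is a simplified sketch of 4-bit affine quantization for one group of 64 weights. It is not the conversion script's code: the names are hypothetical, the codes are left unpacked as one `UInt8` each, and it assumes the common min/max scheme in which each group stores a scale and a bias. Under that assumption, with float16 scale and bias, the storage cost is about 4 + 32/64 = 4.5 bits per weight.

```swift
/// One quantized group: 4-bit codes plus a per-group scale and bias.
struct QuantizedGroup {
    let codes: [UInt8]   // each value in 0...15 (unpacked for clarity)
    let scale: Float
    let bias: Float
}

/// Quantize one group with a min/max affine scheme: w ≈ code × scale + bias.
func quantizeGroup(_ weights: [Float], bits: Int = 4) -> QuantizedGroup {
    let levels = Float((1 << bits) - 1)            // 15 for 4-bit
    let lo = weights.min() ?? 0
    let hi = weights.max() ?? 0
    // A constant group has hi == lo; use scale 1 to avoid dividing by zero.
    let scale = hi > lo ? (hi - lo) / levels : 1
    let codes = weights.map { w -> UInt8 in
        let q = ((w - lo) / scale).rounded()
        return UInt8(min(max(q, 0), levels))       // clamp to the code range
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstruct approximate weights from the codes.
func dequantizeGroup(_ group: QuantizedGroup) -> [Float] {
    group.codes.map { Float($0) * group.scale + group.bias }
}

// Toy group of 64 weights spread evenly over [-0.5, 0.5].
let weights = (0..<64).map { Float($0) / 63 - 0.5 }
let group = quantizeGroup(weights)
let restored = dequantizeGroup(group)
let maxError = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0
print("scale: \(group.scale), max error: \(maxError)")
```

The reconstruction error per weight is bounded by roughly half the scale, which is why a smaller group size (a narrower value range per group) tends to reduce error at the cost of more scale and bias storage.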
Links
- Swift library: ivan-digital/qwen3-asr-swift (Swift Package for Qwen3-ASR, Qwen3-TTS, CosyVoice, PersonaPlex, and Forced Alignment on Apple Silicon)
- Base model: Qwen/Qwen3-ForcedAligner-0.6B
- bf16 variant: mlx-community/Qwen3-ForcedAligner-0.6B-bf16