Full-precision (bf16→float16) conversion of Qwen/Qwen3-ForcedAligner-0.6B for Apple Silicon inference via MLX.
No quantization is applied, which preserves maximum accuracy for word-level timestamp prediction.
| Detail | Value |
|---|---|
| Audio encoder | 24 layers, 1024 dim, 16 heads, float16 |
| Text decoder | 28 layers, 1024 hidden, 16Q/8KV heads, float16 |
| Classify head | Linear(1024, 5000), float16 |
| Timestamp resolution | 80ms per class |
| Total size | ~1.8 GB |
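The classify head and timestamp resolution in the table imply a simple mapping from a predicted class to a time offset. The sketch below illustrates that arithmetic. It assumes class `k` corresponds to an offset of `k × 80 ms` from the start of the audio, which the table suggests but does not state outright. The helper `seconds(forClass:)` and the constants are hypothetical names for illustration, not part of the library's API.

```swift
// Hypothetical helper for illustration; not part of the Qwen3ForcedAligner API.
let millisecondsPerClass = 80.0   // from the table: 80 ms per class
let classCount = 5000             // from the table: Linear(1024, 5000)

/// Converts a timestamp class index into seconds,
/// assuming class k maps to k * 80 ms (an assumption, see above).
func seconds(forClass index: Int) -> Double? {
    // Reject indices outside the classify head's output range.
    guard (0..<classCount).contains(index) else { return nil }
    return Double(index) * millisecondsPerClass / 1000.0
}

// Longest span the head can represent under this assumption: 5000 * 80 ms.
let maxSpanSeconds = Double(classCount) * millisecondsPerClass / 1000.0
```

Under this reading, class 25 lands at 2.0 s, and the 5000 classes cover about 400 s of audio.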
```swift
let aligner = try await Qwen3ForcedAligner.fromPretrained(
    modelId: "aufklarer/Qwen3-ForcedAligner-0.6B-bf16"
)
let aligned = aligner.align(
    audio: samples, text: "Hello world", sampleRate: 24000
)
```
| Variant | Size | Model ID |
|---|---|---|
| 4-bit | ~979 MB | aufklarer/Qwen3-ForcedAligner-0.6B-4bit |
| 8-bit | ~1.4 GB | aufklarer/Qwen3-ForcedAligner-0.6B-8bit |
| bf16 | ~1.8 GB | aufklarer/Qwen3-ForcedAligner-0.6B-bf16 |
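One way to use the variant table is to pick the largest variant that fits a given size budget. The following is a minimal sketch under stated assumptions: the `Variant` type and `largestVariant(fittingMegabytes:)` are hypothetical names, the sizes are the approximate figures from the table (with 1.4 GB and 1.8 GB rounded to 1400 MB and 1800 MB), and download size is only a rough proxy for runtime memory use.

```swift
// Hypothetical types and helper for illustration; not part of any library.
struct Variant {
    let name: String
    let approxMegabytes: Int   // approximate sizes taken from the table
    let modelId: String
}

let variants = [
    Variant(name: "4-bit", approxMegabytes: 979,
            modelId: "aufklarer/Qwen3-ForcedAligner-0.6B-4bit"),
    Variant(name: "8-bit", approxMegabytes: 1400,
            modelId: "aufklarer/Qwen3-ForcedAligner-0.6B-8bit"),
    Variant(name: "bf16", approxMegabytes: 1800,
            modelId: "aufklarer/Qwen3-ForcedAligner-0.6B-bf16"),
]

/// Returns the largest variant whose approximate size fits the budget,
/// or nil if even the smallest variant is too large.
func largestVariant(fittingMegabytes budget: Int) -> Variant? {
    variants
        .filter { $0.approxMegabytes <= budget }
        .max { $0.approxMegabytes < $1.approxMegabytes }
}
```

The returned `modelId` can then be passed to `fromPretrained(modelId:)` as in the usage snippet above the table.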