---
library_name: mlx
tags:
- mlx
- forced-alignment
- speech
- qwen3
- audio
- timestamps
- 4bit
- quantized
license: apache-2.0
base_model: Qwen/Qwen3-ForcedAligner-0.6B
pipeline_tag: audio-classification
language:
- en
- zh
- ja
- ko
- de
- fr
- es
- it
- ru
---

# Qwen3-ForcedAligner-0.6B-4bit (MLX)

4-bit quantized version of [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B) for Apple Silicon inference via [MLX](https://github.com/ml-explore/mlx).

Predicts **word-level timestamps** for audio+text pairs in a single non-autoregressive forward pass.

## Model Details

| Component | Config |
|-----------|--------|
| Audio encoder | 24 layers, d_model=1024, 16 heads, FFN=4096, float16 |
| Text decoder | 28 layers, hidden=1024, 16Q/8KV heads, **4-bit quantized** (group_size=64) |
| Classify head | Linear(1024, 5000), float16 |
| Timestamp resolution | 80ms per class (5000 classes = 400s max) |
| Total size | **979 MB** (vs 1.84 GB bf16) |

## How It Works

```
Audio + Text → Audio Encoder → Text Decoder (single pass) → Classify Head → argmax at timestamp token positions → word timestamps
```

Unlike ASR, which decodes autoregressively token by token, the forced aligner runs the **entire sequence in one forward pass** through the decoder. The classify head predicts a timestamp class (0–4999) at each timestamp token position, and each class maps to a time via `class_index × 80ms`.

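The class-to-time mapping described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the Swift library or the conversion script: the function names, the nested-list layout of `logits`, and the `timestamp_positions` argument are all hypothetical. Only the 80 ms resolution and the 5000-class head come from the table above.

```python
FRAME_MS = 80        # timestamp resolution per class (from Model Details)
NUM_CLASSES = 5000   # output size of the classify head (from Model Details)


def class_to_seconds(class_index: int) -> float:
    """Map a timestamp class index to seconds: class_index x 80 ms."""
    if not 0 <= class_index < NUM_CLASSES:
        raise ValueError(f"class index out of range: {class_index}")
    return class_index * FRAME_MS / 1000.0


def decode_timestamps(logits, timestamp_positions):
    """Take the argmax over classes at each timestamp position, in seconds.

    logits: one list of per-class scores per sequence position
            (hypothetical layout, chosen for this sketch).
    timestamp_positions: sequence indices where timestamp tokens sit
            (hypothetical input; how they are located is not shown here).
    """
    times = []
    for pos in timestamp_positions:
        row = logits[pos]
        best_class = max(range(len(row)), key=row.__getitem__)  # argmax
        times.append(class_to_seconds(best_class))
    return times


# Toy example with 8 classes per row instead of 5000, to keep it readable.
toy_logits = [
    [0.0] * 8,                                   # ordinary text token, ignored
    [0.1, 0.2, 9.0, 0.3, 0.0, 0.0, 0.0, 0.0],    # peak at class 2
    [0.0, 0.0, 0.1, 0.2, 0.3, 7.5, 0.1, 0.0],    # peak at class 5
]
toy_times = decode_timestamps(toy_logits, timestamp_positions=[1, 2])
print(toy_times)  # [0.16, 0.4]
```

Because class indices start at 0, the last class (4999) maps to 399.92 s, so the representable range ends just under the 400 s figure quoted in the table.
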
## Usage with Swift (MLX)

This model is designed for use with [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift):

```swift
import Foundation  // needed for String(format:)
import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()

let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)

// Print one line per word with its start and end time in seconds
for word in aligned {
    print("[\(String(format: "%.2f", word.startTime))s - \(String(format: "%.2f", word.endTime))s] \(word.text)")
}
```

### CLI

```bash
# Align with provided text
qwen3-asr-cli --align --text "Hello world" audio.wav

# Transcribe first, then align
qwen3-asr-cli --align audio.wav
```

Output:

```
[0.12s - 0.45s] Can
[0.45s - 0.72s] you
[0.72s - 1.20s] guarantee
[1.20s - 1.48s] that
...
```

## Quantization

The text decoder (attention projections, MLP, embeddings) is quantized to 4-bit using group quantization (group_size=64). The audio encoder and classify head are kept in float16 for accuracy.

Converted with:

```bash
python scripts/convert_forced_aligner.py \
  --source Qwen/Qwen3-ForcedAligner-0.6B \
  --upload --repo-id aufklarer/Qwen3-ForcedAligner-0.6B-4bit
```

## Links

- **Swift library**: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift), a Swift Package for Qwen3-ASR, Qwen3-TTS, CosyVoice, PersonaPlex, and Forced Alignment on Apple Silicon
- **Base model**: [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B)
- **bf16 variant**: [mlx-community/Qwen3-ForcedAligner-0.6B-bf16](https://huggingface.co/mlx-community/Qwen3-ForcedAligner-0.6B-bf16)