| | --- |
| | library_name: mlx |
| | tags: |
| | - mlx |
| | - forced-alignment |
| | - speech |
| | - qwen3 |
| | - audio |
| | - timestamps |
| | - 4bit |
| | - quantized |
| | license: apache-2.0 |
| | base_model: Qwen/Qwen3-ForcedAligner-0.6B |
| | pipeline_tag: audio-classification |
| | language: |
| | - en |
| | - zh |
| | - ja |
| | - ko |
| | - de |
| | - fr |
| | - es |
| | - it |
| | - ru |
| | --- |
| | |
| | # Qwen3-ForcedAligner-0.6B-4bit (MLX) |
| |
|
| | 4-bit quantized version of [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B) for Apple Silicon inference via [MLX](https://github.com/ml-explore/mlx). |
| |
|
| | Predicts **word-level timestamps** for audio+text pairs in a single non-autoregressive forward pass. |
| |
|
| | ## Model Details |
| |
|
| | | Component | Config | |
| | |-----------|--------| |
| | | Audio encoder | 24 layers, d_model=1024, 16 heads, FFN=4096, float16 | |
| | | Text decoder | 28 layers, hidden=1024, 16Q/8KV heads, **4-bit quantized** (group_size=64) | |
| | | Classify head | Linear(1024, 5000), float16 | |
| | | Timestamp resolution | 80ms per class (5000 classes = 400s max) | |
| | | Total size | **979 MB** (vs 1.84 GB bf16) | |
| |
|
| | ## How It Works |
| |
|
| | ``` |
| | Audio + Text → Audio Encoder → Text Decoder (single pass) → Classify Head → argmax at <timestamp> positions → word timestamps |
| | ``` |
| |
|
| | Unlike ASR (autoregressive, token-by-token), the forced aligner runs the **entire sequence in one forward pass** through the decoder. The classify head predicts a timestamp class (0–4999) at each `<timestamp>` token position, which maps to time via `class_index × 80ms`. |
| |
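The decoding step described above can be sketched in plain Swift. This is a minimal illustration, not code from qwen3-asr-swift: the helper names `argmax` and `decodeTimestamps` and the toy logits are assumptions made for the example, and the real library works on MLX arrays rather than Swift arrays.

```swift
import Foundation

/// Seconds represented by one timestamp class (80 ms, per the table above).
let secondsPerClass = 0.08

/// Index of the largest value in a non-empty row of logits.
func argmax(_ row: [Float]) -> Int {
    var best = 0
    for i in row.indices where row[i] > row[best] {
        best = i
    }
    return best
}

/// Converts one row of logits per `<timestamp>` position into seconds.
/// (Hypothetical helper, named for this sketch only.)
func decodeTimestamps(_ logits: [[Float]]) -> [Double] {
    logits.map { Double(argmax($0)) * secondsPerClass }
}

// Toy logits over 5000 classes for two <timestamp> positions.
var firstRow = [Float](repeating: 0, count: 5000)
firstRow[9] = 1.0      // class 9 wins
var secondRow = [Float](repeating: 0, count: 5000)
secondRow[15] = 1.0    // class 15 wins

let times = decodeTimestamps([firstRow, secondRow])
print(times.map { String(format: "%.2f", $0) })  // ["0.72", "1.20"]
```

Because the highest class index is 4999, the largest timestamp this mapping can produce is 4999 × 0.08 s = 399.92 s.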
|
| | ## Usage with Swift (MLX) |
| |
|
| | This model is designed for use with [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift): |
| |
|
| | ```swift |
| | import Qwen3ASR |
| | |
| | let aligner = try await Qwen3ForcedAligner.fromPretrained() |
| | |
| | let aligned = aligner.align( |
| |     audio: audioSamples, |
| |     text: "Can you guarantee that the replacement part will be shipped tomorrow?", |
| |     sampleRate: 24000 |
| | ) |
| | |
| | for word in aligned { |
| |     print("[\(String(format: "%.2f", word.startTime))s - \(String(format: "%.2f", word.endTime))s] \(word.text)") |
| | } |
| | ``` |
| |
|
| | ### CLI |
| |
|
| | ```bash |
| | # Align with provided text |
| | qwen3-asr-cli --align --text "Hello world" audio.wav |
| | |
| | # Transcribe first, then align |
| | qwen3-asr-cli --align audio.wav |
| | ``` |
| |
|
| | Example output (for the sentence used in the Swift example above): |
| | ``` |
| | [0.12s - 0.45s] Can |
| | [0.45s - 0.72s] you |
| | [0.72s - 1.20s] guarantee |
| | [1.20s - 1.48s] that |
| | ... |
| | ``` |
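If the printed lines need to be consumed by another tool, they can be parsed back into structured values. The sketch below assumes the exact `[start s - end s] word` layout shown above; `TimedWord` and `parseLine` are hypothetical names introduced for this example and are not part of qwen3-asr-swift.

```swift
import Foundation

/// Hypothetical container for one parsed line; not a type from qwen3-asr-swift.
struct TimedWord: Equatable {
    let start: Double
    let end: Double
    let text: String
}

/// Parses a line such as "[0.12s - 0.45s] Can"; returns nil if it does not match.
func parseLine(_ line: String) -> TimedWord? {
    guard line.hasPrefix("["), let close = line.firstIndex(of: "]") else { return nil }
    // Text between the brackets, e.g. "0.12s - 0.45s".
    let inside = line[line.index(after: line.startIndex)..<close]
    let parts = inside.components(separatedBy: " - ")
    guard parts.count == 2,
          parts[0].hasSuffix("s"), parts[1].hasSuffix("s"),
          let start = Double(parts[0].dropLast()),
          let end = Double(parts[1].dropLast()) else { return nil }
    // Everything after the closing bracket is the word itself.
    let text = line[line.index(after: close)...].trimmingCharacters(in: .whitespaces)
    return TimedWord(start: start, end: end, text: text)
}

let sample = ["[0.12s - 0.45s] Can", "[0.72s - 1.20s] guarantee", "..."]
let words = sample.compactMap(parseLine)   // the "..." line is skipped
for word in words {
    print(word.text, word.end - word.start)
}
```

Returning `nil` for lines that do not match keeps the parser tolerant of truncated output or blank lines.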
| |
|
| | ## Quantization |
| |
|
| | The text decoder (attention projections, MLP, embeddings) is quantized to 4 bits using group quantization (group_size=64). The audio encoder and classify head are kept in float16 to preserve accuracy. |
| | |
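To make the idea of group quantization concrete, here is a minimal sketch in plain Swift. It is not the conversion script's or MLX's implementation (MLX packs the 4-bit codes into wider integers and runs on the GPU). It assumes an affine scheme in which every group of 64 weights shares one scale and one bias, and all names (`QuantizedGroup`, `quantizeGroup`, `dequantize`, `quantize`) are made up for illustration.

```swift
import Foundation

/// One quantized group: 4-bit codes plus a per-group scale and bias.
struct QuantizedGroup {
    let codes: [UInt8]   // each value in 0...15
    let scale: Float
    let bias: Float
}

/// Affine 4-bit quantization of one group of weights (illustrative only).
func quantizeGroup(_ weights: [Float]) -> QuantizedGroup {
    let lo = weights.min() ?? 0
    let hi = weights.max() ?? 0
    // 4 bits give 16 levels, so 15 steps span the range [lo, hi].
    let scale: Float = hi > lo ? (hi - lo) / 15 : 1
    let codes = weights.map { w -> UInt8 in
        let q = ((w - lo) / scale).rounded()
        return UInt8(min(max(q, 0), 15))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstructs approximate weights from a quantized group.
func dequantize(_ group: QuantizedGroup) -> [Float] {
    group.codes.map { Float($0) * group.scale + group.bias }
}

/// Splits a flat weight vector into groups of `groupSize` and quantizes each.
func quantize(_ weights: [Float], groupSize: Int = 64) -> [QuantizedGroup] {
    stride(from: 0, to: weights.count, by: groupSize).map { start -> QuantizedGroup in
        let end = min(start + groupSize, weights.count)
        return quantizeGroup(Array(weights[start..<end]))
    }
}

// 128 toy weights -> two groups of 64.
let weights = (0..<128).map { Float($0) / 127 - 0.5 }
let groups = quantize(weights)
let restored = groups.flatMap(dequantize)
let maxError = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0
print("groups:", groups.count, "max abs error:", maxError)
```

Under this scheme the reconstruction error of each weight is at most half a quantization step of its group. If each group's scale and bias are stored as 16-bit values, they add 32 / 64 = 0.5 bits per weight on top of the 4-bit codes.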
| | Converted with: |
| | ```bash |
| | python scripts/convert_forced_aligner.py \ |
| |     --source Qwen/Qwen3-ForcedAligner-0.6B \ |
| |     --upload --repo-id aufklarer/Qwen3-ForcedAligner-0.6B-4bit |
| | ``` |
| | |
| | ## Links |
| | |
| | - **Swift library**: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift), a Swift package for Qwen3-ASR, Qwen3-TTS, CosyVoice, PersonaPlex, and Forced Alignment on Apple Silicon |
| | - **Base model**: [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B) |
| | - **bf16 variant**: [mlx-community/Qwen3-ForcedAligner-0.6B-bf16](https://huggingface.co/mlx-community/Qwen3-ForcedAligner-0.6B-bf16) |
| | |