---
library_name: mlx
tags:
  - mlx
  - forced-alignment
  - speech
  - qwen3
  - audio
  - timestamps
  - 4bit
  - quantized
license: apache-2.0
base_model: Qwen/Qwen3-ForcedAligner-0.6B
pipeline_tag: audio-classification
language:
  - en
  - zh
  - ja
  - ko
  - de
  - fr
  - es
  - it
  - ru
---

# Qwen3-ForcedAligner-0.6B-4bit (MLX)

4-bit quantized version of [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B) for Apple Silicon inference via [MLX](https://github.com/ml-explore/mlx).

The model predicts **word-level timestamps** for audio+text pairs in a single non-autoregressive forward pass.

## Model Details

| Component | Config |
|-----------|--------|
| Audio encoder | 24 layers, d_model=1024, 16 heads, FFN=4096, float16 |
| Text decoder | 28 layers, hidden=1024, 16Q/8KV heads, **4-bit quantized** (group_size=64) |
| Classify head | Linear(1024, 5000), float16 |
| Timestamp resolution | 80ms per class (5000 classes = 400s max) |
| Total size | **979 MB** (vs 1.84 GB bf16) |

## How It Works

```
Audio + Text → Audio Encoder → Text Decoder (single pass) → Classify Head → argmax at <timestamp> positions → word timestamps
```

Unlike ASR (autoregressive, token-by-token), the forced aligner runs the **entire sequence in one forward pass** through the decoder. The classify head predicts a timestamp class (0–4999) at each `<timestamp>` token position, which maps to time via `class_index × 80ms`.
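As a sanity check of the `class_index × 80ms` mapping, here is a minimal, hypothetical Python sketch. The constants come from the table above; the function names and the token layout (one start and one end `<timestamp>` prediction per word) are illustrative assumptions, not the library's actual API:

```python
# Hypothetical sketch of the class-to-time mapping. FRAME_MS and
# NUM_CLASSES come from the model card; everything else (names, the
# one-start/one-end <timestamp> layout per word) is an assumption.

FRAME_MS = 80        # each timestamp class covers 80 ms
NUM_CLASSES = 5000   # 5000 * 80 ms = 400 s maximum audio length

def class_to_seconds(class_index: int) -> float:
    """Map a predicted timestamp class (argmax over 5000 logits) to seconds."""
    if not 0 <= class_index < NUM_CLASSES:
        raise ValueError(f"class index out of range: {class_index}")
    return class_index * FRAME_MS / 1000.0

def to_word_timings(words, class_indices):
    """Pair each word with (start, end), assuming two consecutive
    <timestamp> predictions (start, then end) per word."""
    assert len(class_indices) == 2 * len(words)
    return [
        (word,
         class_to_seconds(class_indices[2 * i]),
         class_to_seconds(class_indices[2 * i + 1]))
        for i, word in enumerate(words)
    ]

print(to_word_timings(["Hello", "world"], [2, 6, 6, 15]))
```

With this mapping, class 15 lands at 1.2 s, which matches the 400 s ceiling: the largest class, 4999, maps to 399.92 s.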

## Usage with Swift (MLX)

This model is designed for use with [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift):

```swift
import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()

let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)

for word in aligned {
    print("[\(String(format: "%.2f", word.startTime))s - \(String(format: "%.2f", word.endTime))s] \(word.text)")
}
```

### CLI

```bash
# Align with provided text
qwen3-asr-cli --align --text "Hello world" audio.wav

# Transcribe first, then align
qwen3-asr-cli --align audio.wav
```

Output:
```
[0.12s - 0.45s] Can
[0.45s - 0.72s] you
[0.72s - 1.20s] guarantee
[1.20s - 1.48s] that
...
```

## Quantization

The text decoder (attention projections, MLP, and embeddings) is quantized to 4-bit using group quantization (group_size=64). The audio encoder and classify head are kept in float16 to preserve alignment accuracy.
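To illustrate what group quantization means here, a pedagogical sketch (this is not the MLX converter's actual code, and MLX's scheme may differ in detail): each group of 64 weights shares one scale and bias, and every weight is stored as a 4-bit index.

```python
# Pedagogical sketch of 4-bit group quantization (group_size=64).
# Illustrative only; the actual MLX quantization kernels may differ.

GROUP_SIZE = 64
LEVELS = 2**4 - 1  # 4 bits -> indices 0..15

def quantize_group(weights):
    """Map one group of floats to 4-bit indices plus a shared scale/bias."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0   # avoid zero scale for flat groups
    q = [round((w - lo) / scale) for w in weights]  # each index in 0..15
    return q, scale, lo

def dequantize_group(q, scale, lo):
    """Reconstruct approximate float weights from the 4-bit indices."""
    return [i * scale + lo for i in q]

# Round-trip one synthetic group and measure the worst-case error,
# which is bounded by half the quantization step.
weights = [(-1) ** i * (i / GROUP_SIZE) for i in range(GROUP_SIZE)]
q, scale, lo = quantize_group(weights)
approx = dequantize_group(q, scale, lo)
err = max(abs(a - b) for a, b in zip(weights, approx))
print(f"max reconstruction error: {err:.4f} (step = {scale:.4f})")
```

Storing 4 bits per weight plus a float16 scale and bias per 64-weight group is what brings the decoder from ~2 bytes/weight down to roughly 0.56 bytes/weight, consistent with the size drop in the table above.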

Converted with:
```bash
python scripts/convert_forced_aligner.py \
    --source Qwen/Qwen3-ForcedAligner-0.6B \
    --upload --repo-id aufklarer/Qwen3-ForcedAligner-0.6B-4bit
```

## Links

- **Swift library**: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift) — Swift Package for Qwen3-ASR, Qwen3-TTS, CosyVoice, PersonaPlex, and Forced Alignment on Apple Silicon
- **Base model**: [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B)
- **bf16 variant**: [mlx-community/Qwen3-ForcedAligner-0.6B-bf16](https://huggingface.co/mlx-community/Qwen3-ForcedAligner-0.6B-bf16)