---
library_name: mlx
tags:
- mlx
- forced-alignment
- speech
- qwen3
- audio
- timestamps
- 4bit
- quantized
license: apache-2.0
base_model: Qwen/Qwen3-ForcedAligner-0.6B
pipeline_tag: audio-classification
language:
- en
- zh
- ja
- ko
- de
- fr
- es
- it
- ru
---
# Qwen3-ForcedAligner-0.6B-4bit (MLX)
4-bit quantized version of [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B) for Apple Silicon inference via [MLX](https://github.com/ml-explore/mlx).
Predicts **word-level timestamps** for audio+text pairs in a single non-autoregressive forward pass.
## Model Details
| Component | Config |
|-----------|--------|
| Audio encoder | 24 layers, d_model=1024, 16 heads, FFN=4096, float16 |
| Text decoder | 28 layers, hidden=1024, 16Q/8KV heads, **4-bit quantized** (group_size=64) |
| Classify head | Linear(1024, 5000), float16 |
| Timestamp resolution | 80ms per class (5000 classes = 400s max) |
| Total size | **979 MB** (vs 1.84 GB bf16) |
## How It Works
```
Audio + Text β†’ Audio Encoder β†’ Text Decoder (single pass) β†’ Classify Head β†’ argmax at <timestamp> positions β†’ word timestamps
```
Unlike ASR (autoregressive, token-by-token), the forced aligner runs the **entire sequence in one forward pass** through the decoder. The classify head predicts a timestamp class (0–4999) at each `<timestamp>` token position, which maps to time via `class_index Γ— 80ms`.
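The class-to-time mapping can be sketched in a few lines. This is an illustrative example, not the actual inference code: the class indices and the assumption that each word gets a start and an end `<timestamp>` slot are hypothetical, chosen to mirror the output format shown below.

```python
# Illustrative sketch: mapping predicted timestamp classes to seconds.
RESOLUTION_S = 0.080  # 80 ms per class; 5000 classes => 400 s max

def classes_to_seconds(class_indices):
    """Convert argmax class indices at <timestamp> positions to seconds."""
    return [round(i * RESOLUTION_S, 3) for i in class_indices]

# Hypothetical argmax outputs, assuming start/end <timestamp> slots per word.
words = ["Can", "you", "guarantee"]
predicted = [1, 5, 5, 9, 9, 15]
times = classes_to_seconds(predicted)
for word, (start, end) in zip(words, zip(times[::2], times[1::2])):
    print(f"[{start:.2f}s - {end:.2f}s] {word}")
```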
## Usage with Swift (MLX)
This model is designed for use with [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift):
```swift
import Qwen3ASR
let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
audio: audioSamples,
text: "Can you guarantee that the replacement part will be shipped tomorrow?",
sampleRate: 24000
)
for word in aligned {
print("[\(String(format: "%.2f", word.startTime))s - \(String(format: "%.2f", word.endTime))s] \(word.text)")
}
```
### CLI
```bash
# Align with provided text
qwen3-asr-cli --align --text "Hello world" audio.wav
# Transcribe first, then align
qwen3-asr-cli --align audio.wav
```
Example output:
```
[0.12s - 0.45s] Can
[0.45s - 0.72s] you
[0.72s - 1.20s] guarantee
[1.20s - 1.48s] that
...
```
## Quantization
The text decoder's attention projections, MLP weights, and embeddings are quantized to 4-bit using group quantization (group_size=64). The audio encoder and classify head are kept in float16 to preserve timestamp accuracy.
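For intuition, per-group affine quantization can be sketched as follows. This is a minimal NumPy sketch of the general scheme (one scale and offset per group of 64 weights, 16 levels for 4 bits), not MLX's actual packed-kernel implementation; all names here are illustrative.

```python
import numpy as np

def quantize_4bit(weights, group_size=64):
    """Quantize each group of `group_size` weights to 4-bit levels (0..15)."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - w_min) / 15.0  # 16 levels
    scale = np.where(scale == 0, 1.0, scale)               # guard flat groups
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min):
    """Reconstruct approximate float weights from 4-bit codes."""
    return q.astype(np.float32) * scale + w_min

np.random.seed(0)
w = np.random.randn(1024, 64).astype(np.float32)
q, scale, zero = quantize_4bit(w)
err = np.abs(dequantize_4bit(q, scale, zero).reshape(w.shape) - w).max()
```

Rounding error per weight is bounded by half a quantization step (scale / 2), which is why small groups (64 here) keep the reconstruction tight.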
Converted with:
```bash
python scripts/convert_forced_aligner.py \
--source Qwen/Qwen3-ForcedAligner-0.6B \
--upload --repo-id aufklarer/Qwen3-ForcedAligner-0.6B-4bit
```
## Links
- **Swift library**: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift) β€” Swift Package for Qwen3-ASR, Qwen3-TTS, CosyVoice, PersonaPlex, and Forced Alignment on Apple Silicon
- **Base model**: [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B)
- **bf16 variant**: [mlx-community/Qwen3-ForcedAligner-0.6B-bf16](https://huggingface.co/mlx-community/Qwen3-ForcedAligner-0.6B-bf16)