MMS Forced Alignment — ONNX
ONNX export of Meta's MMS_FA (Massively Multilingual Speech Forced Alignment) model for CTC forced alignment.
Files
| File | Size | Description |
|---|---|---|
| mms_fa.onnx | 3.2 MB | ONNX model graph |
| mms_fa.onnx.data | 1.2 GB | External weight data (FP32) |
| tokenizer.json | 311 B | Token-to-ID mapping (29 tokens) |
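
As a rough illustration of how the token-to-ID mapping in tokenizer.json might be used, the sketch below loads the file and converts a transcript into label IDs. It assumes the file is a flat JSON object of the form `{"token": id, ...}` with single-character tokens; the actual layout is not documented here, so check the file before relying on this. The helper names `load_vocab` and `encode` are hypothetical.

```python
import json


def load_vocab(tokenizer_path):
    """Load a token-to-ID mapping.

    ASSUMPTION: the file is a flat JSON object {"token": id, ...}.
    Adjust this function if tokenizer.json is laid out differently.
    """
    with open(tokenizer_path, encoding="utf-8") as f:
        vocab = json.load(f)
    if not isinstance(vocab, dict):
        raise ValueError("expected a JSON object mapping tokens to IDs")
    return {str(token): int(idx) for token, idx in vocab.items()}


def encode(text, vocab):
    """Map each character of `text` to its ID (assumes one character per token)."""
    missing = sorted({ch for ch in text if ch not in vocab})
    if missing:
        # Characters outside the vocabulary must be normalized or
        # romanized before alignment; this sketch simply refuses them.
        raise KeyError(f"characters not in vocabulary: {missing}")
    return [vocab[ch] for ch in text]
```

Raising on unknown characters rather than silently dropping them keeps the ID sequence in step with the transcript, which matters when alignment results are mapped back to the original text.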
Usage
from huggingface_hub import hf_hub_download
model_path = hf_hub_download("xycld/lyric-align-mms-fa", "mms_fa.onnx")
data_path = hf_hub_download("xycld/lyric-align-mms-fa", "mms_fa.onnx.data")
tokenizer_path = hf_hub_download("xycld/lyric-align-mms-fa", "tokenizer.json")
Note: mms_fa.onnx.data must be in the same directory as mms_fa.onnx for ONNX Runtime to load the external weights. Downloading both files from the same repository with hf_hub_download places them side by side in its cache, so no manual copying is needed.
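
A minimal sketch of loading the model once the files are downloaded, with a check for the co-location requirement described in the note. The helper names `check_external_data` and `load_session` are hypothetical, and the choice of `CPUExecutionProvider` is an assumption rather than a requirement of this repository.

```python
from pathlib import Path


def check_external_data(model_path, data_name="mms_fa.onnx.data"):
    """Return the path of the weight file expected next to the model.

    Raises FileNotFoundError if the weight file is not in the model's
    directory, which is where ONNX Runtime will look for it.
    """
    expected = Path(model_path).parent / data_name
    if not expected.exists():
        raise FileNotFoundError(
            f"{data_name} must sit next to {Path(model_path).name}; "
            f"looked in {expected.parent}"
        )
    return expected


def load_session(model_path):
    """Create an ONNX Runtime session for CPU inference."""
    check_external_data(model_path)
    # Imported here so the check above can be used without onnxruntime.
    import onnxruntime as ort

    return ort.InferenceSession(
        str(model_path), providers=["CPUExecutionProvider"]
    )
```

With the paths from the download snippet, `session = load_session(model_path)` would create the session. The model's input name can be read from `session.get_inputs()[0].name` rather than guessed.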
Model Details
- Original model: MMS_FA by Meta Research
- Architecture: wav2vec2-based CTC forced aligner (315M parameters)
- Precision: FP32
- Input: mono 16kHz waveform
- Output: log-probability emission matrix [num_frames, 29 labels] at 50fps (20ms/frame)
- ONNX opset: 18
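
The input and output figures in the list above imply a simple relationship between audio samples, emission frames, and time. The sketch below spells out that arithmetic. The frame count is an approximation derived only from the 16 kHz and 50 fps figures, and the number of frames the model actually returns may differ by a frame or so at the edges. The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # from "Input: mono 16kHz waveform"
FRAMES_PER_SECOND = 50    # from "50fps (20ms/frame)"
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAMES_PER_SECOND  # 320 samples


def approx_num_frames(num_samples):
    """Approximate number of emission frames for a waveform length."""
    return num_samples // SAMPLES_PER_FRAME


def frame_to_seconds(frame_index):
    """Start time in seconds of the given emission frame."""
    return frame_index / FRAMES_PER_SECOND
```

For example, a 3-second clip has 48,000 samples and therefore roughly 150 frames, and an alignment boundary at frame 75 corresponds to 1.5 seconds into the audio.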
Quantized Variants
| Variant | Size | Compression | Load Time | Inference | MAE | Acc @50ms | Acc @100ms | Acc @200ms | Status |
|---|---|---|---|---|---|---|---|---|---|
| FP32 (original) | 1,207 MB | 1.0x | 989ms | 424ms/line | 34.7ms | 86.2% | 97.5% | 99.4% | Available |
| FP16 | 605 MB | 2.0x | 2,335ms | 576ms/line | 34.7ms | 86.2% | 97.5% | 99.4% | Not recommended |
| UINT8 | 303 MB | 4.0x | 412ms | 262ms/line | 34.9ms | 86.2% | 97.5% | 99.4% | Recommended |
Benchmark: Chinese song "错位时空" (362 characters, 53 lines) on CPU.
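
The MAE and Acc columns in the table are not defined in this document. One plausible reading, shown below as an assumption, is that MAE is the mean absolute difference between predicted and reference timestamps, and Acc @N ms is the fraction of timestamps whose error is at most N milliseconds. The function `alignment_metrics` is a hypothetical helper, not the script used to produce the table.

```python
def alignment_metrics(predicted_s, reference_s, tolerances_ms=(50, 100, 200)):
    """Compute MAE (ms) and per-tolerance accuracy for paired timestamps.

    ASSUMPTION: both inputs are start times in seconds, one per aligned
    unit (for example, per character), in the same order.
    """
    if len(predicted_s) != len(reference_s) or not predicted_s:
        raise ValueError("inputs must be non-empty and of equal length")
    errors_ms = [abs(p - r) * 1000.0 for p, r in zip(predicted_s, reference_s)]
    mae_ms = sum(errors_ms) / len(errors_ms)
    accuracy = {
        tol: sum(err <= tol for err in errors_ms) / len(errors_ms)
        for tol in tolerances_ms
    }
    return mae_ms, accuracy
```

Under this definition, accuracy can only stay the same or increase as the tolerance widens, which matches the pattern in the table.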
UINT8 is the recommended variant: about 75% smaller than FP32 and about 38% less inference time per line, with virtually no accuracy loss (MAE +0.2ms, identical accuracy at all three thresholds).
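
The percentages quoted for UINT8 follow directly from the FP32 and UINT8 rows of the table. The short calculation below reproduces them, using only values copied from those rows.

```python
# Values copied from the FP32 and UINT8 rows of the table.
fp32 = {"size_mb": 1207, "inference_ms": 424, "mae_ms": 34.7}
uint8 = {"size_mb": 303, "inference_ms": 262, "mae_ms": 34.9}


def relative_reduction(baseline, value):
    """Fraction by which `value` is smaller than `baseline`."""
    return (baseline - value) / baseline


size_reduction = relative_reduction(fp32["size_mb"], uint8["size_mb"])
time_reduction = relative_reduction(fp32["inference_ms"], uint8["inference_ms"])
mae_delta_ms = uint8["mae_ms"] - fp32["mae_ms"]

print(f"size: {size_reduction:.0%} smaller")
print(f"inference: {time_reduction:.0%} less time per line")
print(f"MAE change: {mae_delta_ms:+.1f} ms")
```

Note that the inference figure is a reduction in time per line. Expressed as a throughput ratio, 424 ms versus 262 ms is roughly a 1.6x speedup.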
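FP16 is not recommended for CPU inference: CPUs generally lack native FP16 compute, and in the benchmark it both loaded and ran slower than FP32 while giving the same accuracy. INT8 (QInt8) quantization is incompatible with some ONNX runtimes because of its ConvInteger operator requirements.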
Attribution
This is an ONNX format conversion of Meta's MMS forced alignment model, originally distributed via torchaudio.pipelines.MMS_FA.
Original work:
Vineel Pratap, Andros Tjandra, Bowen Shi, et al. "Scaling Speech Technology to 1,000+ Languages." 2023. https://arxiv.org/abs/2305.13516
License
The original model weights are released by Meta under the CC-BY-NC-4.0 license. This ONNX conversion inherits the same license. Non-commercial use only.