MMS Forced Alignment — ONNX

ONNX export of Meta's MMS_FA (Massively Multilingual Speech Forced Alignment) model for CTC forced alignment.

Files

File	Size	Description
`mms_fa.onnx`	3.2 MB	ONNX model graph
`mms_fa.onnx.data`	1.2 GB	External weight data (FP32)
`tokenizer.json`	311 B	Token-to-ID mapping (29 tokens)

Usage

from huggingface_hub import hf_hub_download

# Fetch all three files from the same repo so they share one cache directory.
repo_id = "xycld/lyric-align-mms-fa"
model_path = hf_hub_download(repo_id, "mms_fa.onnx")
data_path = hf_hub_download(repo_id, "mms_fa.onnx.data")  # external weights, found via model_path
tokenizer_path = hf_hub_download(repo_id, "tokenizer.json")

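Note: mms_fa.onnx.data must sit in the same directory as mms_fa.onnx, because ONNX Runtime looks for the external weights next to the model file. hf_hub_download satisfies this automatically: files downloaded from the same repo revision are placed in the same cache directory. Download the data file even though only model_path is passed to the runtime.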
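The co-location requirement in the note can be checked before a session is created. The sketch below is one way to do that, assuming the onnxruntime package is installed. The helper names (check_external_data, load_session, run_emissions) are hypothetical, and the [1, num_samples] float32 input layout is an assumption that is not stated in this README. The input name is read from the graph rather than hard-coded.

```python
from pathlib import Path


def check_external_data(model_path: str) -> Path:
    """Return the expected external-weights path, or raise if it is missing."""
    # Keep the path as returned by hf_hub_download (no symlink resolution):
    # the sibling file is looked up next to the path handed to the runtime.
    model = Path(model_path)
    data = model.with_name(model.name + ".data")
    if not data.exists():
        raise FileNotFoundError(f"expected external weights at {data}")
    return data


def load_session(model_path: str):
    """Create a CPU inference session for the exported model."""
    import onnxruntime as ort  # lazy import: the check above needs no extra packages

    check_external_data(model_path)
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def run_emissions(session, waveform):
    """Run the model on a mono 16 kHz waveform.

    `waveform` is a float32 NumPy array; the [1, num_samples] layout is an
    assumption, so inspect session.get_inputs()[0].shape to confirm it.
    """
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: waveform})[0]
```

With the paths from the download snippet, load_session(model_path) would fail early with a clear message if the weights file were missing, instead of surfacing a lower-level runtime error.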

Model Details

Original model: MMS_FA by Meta Research
Architecture: wav2vec2-based CTC forced aligner (315M parameters)
Precision: FP32
Input: mono 16 kHz waveform
Output: log-probability emission matrix of shape [num_frames, 29] (one column per label) at 50 fps (20 ms/frame)
ONNX opset: 18
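To show how the emission matrix described above is typically consumed, here is a minimal pure-Python sketch of CTC forced alignment by Viterbi search, followed by conversion of frame indices to seconds at 20 ms per frame. It is a generic textbook formulation, not necessarily the procedure used by the original pipeline. The function names are hypothetical, and blank=0 is an assumption about the label layout that should be checked against tokenizer.json.

```python
import math

FRAME_SECONDS = 0.02  # 50 frames per second, as listed in the model details
NEG_INF = -math.inf


def ctc_forced_align(emission, tokens, blank=0):
    """Align `tokens` (label IDs) to `emission` (per-frame log-prob rows).

    Returns one (start_frame, end_frame) pair per token, end exclusive.
    `blank=0` is an assumed blank index, not taken from this README.
    """
    if not tokens:
        return []
    if not emission:
        raise ValueError("emission matrix is empty")

    # Expanded state sequence: blank, tok0, blank, tok1, ..., blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    num_frames, num_states = len(emission), len(ext)

    score = [[NEG_INF] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = emission[0][blank]
    score[0][1] = emission[0][ext[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Allowed predecessors: stay, advance by one state, or skip the
            # blank between two different tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + emission[t][ext[s]]
            back[t][s] = best

    # A complete path ends in the last token or in the trailing blank.
    s = max((num_states - 1, num_states - 2), key=lambda p: score[-1][p])
    if score[-1][s] == NEG_INF:
        raise ValueError("too few frames to align all tokens")

    # Backtrack to recover the state occupied at each frame.
    path = [s]
    for t in range(num_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()

    spans = []
    for i in range(len(tokens)):
        frames = [t for t, st in enumerate(path) if st == 2 * i + 1]
        spans.append((frames[0], frames[-1] + 1))
    return spans


def spans_to_seconds(spans, frame_seconds=FRAME_SECONDS):
    """Convert frame spans to (start_s, end_s) pairs."""
    return [(a * frame_seconds, b * frame_seconds) for a, b in spans]
```

The pure-Python loops keep the sketch dependency-free; for full songs a vectorized implementation would be considerably faster.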

Quantized Variants

Variant	Size	Compression	Load Time	Inference	MAE	Acc @50ms	Acc @100ms	Acc @200ms	Status
FP32 (original)	1,207 MB	1.0x	989ms	424ms/line	34.7ms	86.2%	97.5%	99.4%	Available
FP16	605 MB	2.0x	2,335ms	576ms/line	34.7ms	86.2%	97.5%	99.4%	Not recommended
UINT8	303 MB	4.0x	412ms	262ms/line	34.9ms	86.2%	97.5%	99.4%	Recommended

Benchmark: Chinese song "错位时空" (362 characters, 53 lines) on CPU.

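UINT8 is the recommended variant: 75% smaller than FP32 (303 MB vs. 1,207 MB), 38% lower inference time per line (262 ms vs. 424 ms), and virtually no accuracy loss (MAE +0.2 ms, identical accuracy at the 50/100/200 ms thresholds).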

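FP16 is not recommended for CPU inference: without native FP16 support it is slower than FP32 in this benchmark (576 ms vs. 424 ms per line) and takes longer to load (2,335 ms vs. 989 ms). Signed INT8 (QInt8) quantization is incompatible with some ONNX runtimes because of its ConvInteger operator requirements.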
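The percentages quoted for the variants can be reproduced from the table. The short sketch below copies the size, per-line inference time, and MAE figures from the Quantized Variants table and expresses each variant relative to FP32; the function and key names are illustrative only.

```python
# Figures copied from the Quantized Variants table.
variants = {
    "FP32": {"size_mb": 1207, "infer_ms": 424, "mae_ms": 34.7},
    "FP16": {"size_mb": 605, "infer_ms": 576, "mae_ms": 34.7},
    "UINT8": {"size_mb": 303, "infer_ms": 262, "mae_ms": 34.9},
}


def relative_to_fp32(name):
    """Compare one variant against the FP32 baseline."""
    base, v = variants["FP32"], variants[name]
    return {
        # Positive value: the file is this much smaller than FP32.
        "size_reduction_pct": 100 * (1 - v["size_mb"] / base["size_mb"]),
        # Negative value: per-line inference is faster than FP32.
        "latency_change_pct": 100 * (v["infer_ms"] / base["infer_ms"] - 1),
        "mae_delta_ms": v["mae_ms"] - base["mae_ms"],
    }


for name in ("FP16", "UINT8"):
    stats = relative_to_fp32(name)
    print(name, {k: round(val, 1) for k, val in stats.items()})
```

With these figures UINT8 comes out roughly 75% smaller with about 38% lower per-line latency, while FP16 is about 36% slower than FP32 on CPU.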

Attribution

This is an ONNX format conversion of Meta's MMS forced alignment model, originally distributed via torchaudio.pipelines.MMS_FA.

Original work:

Vineel Pratap, Andros Tjandra, Bowen Shi, et al. "Scaling Speech Technology to 1,000+ Languages." 2023. https://arxiv.org/abs/2305.13516

License

The original model weights are released by Meta under the CC-BY-NC-4.0 license. This ONNX conversion inherits the same license. Non-commercial use only.
