@@ -45,6 +45,20 @@ tokenizer_path = hf_hub_download("xycld/lyric-align-mms-fa", "tokenizer.json")
 - **Output:** log-probability emission matrix [num_frames, 29 labels] at 50fps (20ms/frame)
 - **ONNX opset:** 18
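
Because the emission matrix is produced at 50 frames per second, a frame index maps to a timestamp by dividing by 50. The sketch below illustrates only that arithmetic; the helper names `frame_to_seconds` and `span_to_seconds` are hypothetical and are not part of this repository or of torchaudio.

```python
FRAME_RATE_HZ = 50  # from the model card: 50 fps, i.e. 20 ms per frame


def frame_to_seconds(frame_index: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Return the start time, in seconds, of the given emission frame."""
    return frame_index / frame_rate_hz


def span_to_seconds(
    start_frame: int, end_frame: int, frame_rate_hz: int = FRAME_RATE_HZ
) -> tuple[float, float]:
    """Convert a frame span (end_frame exclusive) to (start, end) in seconds."""
    return (
        frame_to_seconds(start_frame, frame_rate_hz),
        frame_to_seconds(end_frame, frame_rate_hz),
    )


if __name__ == "__main__":
    # A token aligned to frames 50..125 covers 1.0 s to 2.5 s of audio.
    print(span_to_seconds(50, 125))
```

With a 20 ms frame step, timestamps derived this way cannot be finer than 20 ms.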
+## Quantized Variants
+
+| Variant | Size | Compression | Load Time | Inference | MAE | Acc @50ms | Acc @100ms | Acc @200ms | Status |
+|:--------|-----:|:-----------:|----------:|----------:|----:|----------:|-----------:|-----------:|:-------|
+| **FP32** (original) | 1,207 MB | 1.0x | 989ms | 424ms/line | 34.7ms | 86.2% | 97.5% | 99.4% | Available |
+| **FP16** | 605 MB | 2.0x | 2,335ms | 576ms/line | 34.7ms | 86.2% | 97.5% | 99.4% | Not recommended |
+| **UINT8** | 303 MB | 4.0x | 412ms | 262ms/line | 34.9ms | 86.2% | 97.5% | 99.4% | **Recommended** |
+
+> Benchmark: Chinese song "错位时空" (362 characters, 53 lines) on CPU.
+
+**UINT8 is the recommended variant**: it is 75% smaller than FP32 and 38% faster at inference, with virtually no accuracy loss (MAE +0.2 ms).
+
+FP16 is not recommended for CPU inference: without native FP16 support on the CPU, it runs slower than FP32. INT8 (QInt8) is incompatible with some ONNX runtimes because it requires the `ConvInteger` operator.
+
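
The headline percentages can be reproduced from the table with a few lines of arithmetic. The sketch below copies the size, load-time, inference, and MAE figures from the table; the `Variant` class and the `summarize` helper are illustrative names, not part of this repository. "Faster" is computed here as the reduction in time per line relative to FP32.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    size_mb: float
    load_ms: float
    infer_ms_per_line: float
    mae_ms: float


# Figures copied from the benchmark table.
FP32 = Variant("FP32", 1207, 989, 424, 34.7)
FP16 = Variant("FP16", 605, 2335, 576, 34.7)
UINT8 = Variant("UINT8", 303, 412, 262, 34.9)


def percent_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline` (negative if higher)."""
    return 100.0 * (baseline - value) / baseline


def summarize(variant: Variant, baseline: Variant = FP32) -> dict[str, float]:
    """Compare a variant against the baseline on size, inference time, and MAE."""
    return {
        "size_reduction_pct": round(
            percent_reduction(baseline.size_mb, variant.size_mb), 1
        ),
        "inference_time_reduction_pct": round(
            percent_reduction(baseline.infer_ms_per_line, variant.infer_ms_per_line), 1
        ),
        "mae_delta_ms": round(variant.mae_ms - baseline.mae_ms, 1),
    }


if __name__ == "__main__":
    for v in (FP16, UINT8):
        print(v.name, summarize(v))
```

For UINT8 this gives a size reduction of about 74.9% and an inference-time reduction of about 38.2%, matching the rounded 75% and 38% quoted above; for FP16 the inference-time figure comes out negative, reflecting that it is slower than FP32 on CPU.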
 ## Attribution
 This is an ONNX format conversion of Meta's MMS forced alignment model, originally distributed via [torchaudio.pipelines.MMS_FA](https://pytorch.org/audio/stable/generated/torchaudio.pipelines.MMS_FA.html).