---
license: mit
library_name: pytorch
tags:
- forced-alignment
- speech
- phoneme-alignment
- audio
---
FALCON — pretrained checkpoints
Pretrained checkpoints for FALCON (Forced Alignment through Contrastive Optimization Networks), the neural forced aligner from the paper "Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming" (arXiv:2606.25460).
| File | Trained on | Best for |
|---|---|---|
| falcon_timit_english.pt | TIMIT (read English) | English phoneme alignment |
| falcon_buckeye_english.pt | Buckeye (spontaneous English) | Spontaneous / conversational English |
| falcon_joint_multilingual.pt | Joint TIMIT+Buckeye | Cross-lingual / multilingual zero-shot alignment (Dutch, German, Hebrew, …) at phoneme and word level |
- Interactive demo: https://huggingface.co/spaces/MLSpeech/FALCON
- Code: https://github.com/MLSpeech/FALCON
Each checkpoint is a PyTorch file containing the model state dict together with the
model hparams and the dill-serialized peak-detection parameters. Load it with the FALCON
code (predict.py / app.py), which downloads the file from this repo automatically when
HF_MODEL_REPO is set.
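For readers who want to fetch or inspect a checkpoint outside predict.py / app.py, the following is a minimal sketch, not the loader used by the FALCON code. It assumes that HF_MODEL_REPO is an environment variable holding this repo's id, that `huggingface_hub`, `torch`, and `dill` are installed, and that passing `dill` as the pickle module is enough to read the file. The use-case labels and both function names are hypothetical; only the three filenames come from the table above.

```python
import os
from typing import List, Optional

# Filenames from the table above. The keys are hypothetical labels for this sketch.
CHECKPOINTS = {
    "read": "falcon_timit_english.pt",
    "spontaneous": "falcon_buckeye_english.pt",
    "multilingual": "falcon_joint_multilingual.pt",
}


def fetch_checkpoint(use_case: str, repo_id: Optional[str] = None) -> str:
    """Download one checkpoint from the Hub and return its local path."""
    filename = CHECKPOINTS[use_case]  # raises KeyError for an unknown label

    # Imported lazily so the filename lookup works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    # Assumption: HF_MODEL_REPO is an environment variable with this repo's id.
    repo_id = repo_id or os.environ["HF_MODEL_REPO"]
    return hf_hub_download(repo_id=repo_id, filename=filename)


def inspect_checkpoint(path: str) -> List[str]:
    """Return the top-level keys of a checkpoint without building the model."""
    import dill
    import torch

    # weights_only=False allows arbitrary pickled objects (here, the dill-serialized
    # parameters), so only load files from a source you trust.
    ckpt = torch.load(path, map_location="cpu", pickle_module=dill, weights_only=False)
    return sorted(ckpt.keys())
```

Calling `inspect_checkpoint(fetch_checkpoint("multilingual"))` would list the entries stored in the joint checkpoint; the sketch deliberately does not assume their names.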
Benchmark results (from the paper)
Accuracy is the percentage of reference boundaries matched within the tolerance t, given in milliseconds. Specialist = trained on the target English corpus; joint = falcon_joint_multilingual.pt; multilingual rows are zero-shot (no target-language training data).
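The accuracy definition above can be illustrated with a short sketch. It assumes that reference and predicted boundaries are paired by index (forced alignment fixes the number of boundaries through the transcript) and that times are in milliseconds; the exact matching protocol used in the paper may differ, and the function name is hypothetical.

```python
from typing import Sequence


def boundary_accuracy(
    reference_ms: Sequence[float],
    predicted_ms: Sequence[float],
    tolerance_ms: float,
) -> float:
    """Percentage of reference boundaries whose prediction lies within tolerance.

    Boundaries are compared index by index, which is an assumption of this sketch.
    """
    if len(reference_ms) != len(predicted_ms):
        raise ValueError("reference and predicted must have the same length")
    if not reference_ms:
        return 0.0
    # Count boundaries whose absolute error does not exceed the tolerance.
    hits = sum(
        abs(ref - pred) <= tolerance_ms
        for ref, pred in zip(reference_ms, predicted_ms)
    )
    return 100.0 * hits / len(reference_ms)


# Toy example with absolute errors of 8, 22, 45 and 5 ms.
reference = [120.0, 340.0, 610.0, 905.0]
predicted = [128.0, 362.0, 655.0, 900.0]
scores = {t: boundary_accuracy(reference, predicted, t) for t in (10, 25, 50, 100)}
```

In this toy case the score rises from 50% at 10 ms to 75% at 25 ms and 100% at 50 ms, which mirrors how each row of the tables below grows from left to right.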
Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)
| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
|---|---|---|---|---|---|
| TIMIT | MFA | 38.6 | 72.3 | 81.1 | 84.6 |
| TIMIT | FALCON specialist | 37.66 | 83.88 | 94.85 | 98.62 |
| TIMIT | FALCON joint | 34.70 | 82.62 | 94.91 | 98.60 |
| Buckeye | MFA | 35.3 | 60.6 | 68.9 | 72.7 |
| Buckeye | FALCON specialist | 29.69 | 69.93 | 90.07 | 97.40 |
| Buckeye | FALCON joint | 28.87 | 69.40 | 89.53 | 97.13 |
Phoneme-Level: Unseen Multilingual Generalization Accuracy
| Test set | Model | t≤10 ms | t≤15 ms | t≤20 ms | t≤25 ms | t≤50 ms | t≤100 ms |
|---|---|---|---|---|---|---|---|
| Dutch — IFA | FALCON joint | 26.85 | 36.16 | 44.56 | 51.17 | 69.94 | 84.11 |
| Dutch — IFA | FALCON specialist | 26.86 | 35.79 | 43.85 | 50.34 | 68.68 | 83.22 |
| Dutch — IFA | MFA | 11.01 | 14.70 | 19.05 | 21.80 | 33.90 | 51.02 |
| German — PHONDAT | FALCON joint | 25.63 | 34.12 | 41.87 | 49.07 | 70.04 | 84.58 |
| German — PHONDAT | FALCON specialist | 25.08 | 33.37 | 40.76 | 47.43 | 68.27 | 82.44 |
| German — PHONDAT | MFA | 20.60 | 31.75 | 37.17 | 45.83 | 66.78 | 79.19 |
| Hebrew | FALCON joint | 21.98 | 30.10 | 36.91 | 42.78 | 63.07 | 80.41 |
| Hebrew | FALCON specialist | 21.03 | 27.78 | 34.30 | 39.79 | 59.38 | 77.76 |
Word-Level Alignment Accuracy [%]: Comparative Analysis
| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
|---|---|---|---|---|---|
| TIMIT | FALCON spec (MFA-G2P) | 49.22 | 81.79 | 93.04 | 98.37 |
| TIMIT | FALCON joint (MFA-G2P) | 49.50 | 80.60 | 92.86 | 98.46 |
| TIMIT | MFA | 41.60 | 72.80 | 89.40 | 97.40 |
| TIMIT | MMS | 18.60 | 43.50 | 75.70 | 94.70 |
| TIMIT | WhisperX | 22.40 | 52.70 | 82.40 | 94.20 |
| TIMIT | Nvidia-Canary-1b | 9.23 | 23.11 | 44.23 | 72.81 |
| Buckeye | FALCON spec (MFA-G2P) | 50.06 | 77.85 | 91.51 | 96.63 |
| Buckeye | FALCON joint (MFA-G2P) | 50.42 | 77.98 | 91.01 | 96.55 |
| Buckeye | MFA | 39.80 | 69.90 | 84.90 | 91.80 |
| Buckeye | MMS | 25.00 | 52.70 | 75.00 | 87.90 |
| Buckeye | WhisperX | 18.80 | 43.10 | 67.40 | 77.40 |
| Buckeye | Nvidia-Canary-1b | 8.06 | 18.83 | 36.31 | 63.29 |
Word-Level: Unseen Multilingual Generalization Accuracy
| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
|---|---|---|---|---|---|
| German — PHONDAT | FALCON (MFA-G2P) | 44.20 | 68.48 | 86.12 | 95.11 |
| German — PHONDAT | MFA | 29.9 | 65.4 | 82.1 | 94.3 |
| German — PHONDAT | MMS | 21.8 | 44.3 | 74.9 | 91.8 |
| Dutch — IFA | FALCON (MFA-G2P) | 26.38 | 45.15 | 61.16 | 76.49 |
| Dutch — IFA | MFA | 4.7 | 7.3 | 11.6 | 19.0 |
| Dutch — IFA | MMS | 16.0 | 37.9 | 62.9 | 76.6 |
| Hebrew | FALCON | 31.91 | 56.72 | 75.18 | 87.89 |
| Hebrew | MMS | 14.3 | 41.3 | 76.5 | 94.7 |
Paper: Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming — arXiv:2606.25460