| --- |
| title: FALCON Forced Aligner |
| colorFrom: indigo |
| colorTo: blue |
| sdk: gradio |
| sdk_version: 4.44.1 |
| python_version: "3.8" |
| app_file: app.py |
| pinned: true |
| license: mit |
| short_description: Neural forced alignment via Soft Dynamic Programming |
| thumbnail: https://huggingface.co/spaces/MLSpeech/FALCON/resolve/main/assets/app_screen.jpeg |
| --- |
| |
| # FALCON — Forced Alignment through Contrastive Optimization Networks |
|
|
| Interactive demo of **FALCON**, a fully differentiable neural forced aligner that predicts |
| precise **phoneme- and word-level** boundary timestamps from a waveform + transcript, using a |
| Soft Dynamic Programming decoder. |
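
As a rough illustration of what "soft" dynamic programming means in general, the hard `max` in a monotonic alignment recursion can be replaced by a temperature-scaled log-sum-exp, which is smooth and therefore differentiable. The sketch below is a generic, assumption-laden example and not FALCON's decoder: the function names, the `gamma` temperature parameter, and the stay-or-advance transition rule are all hypothetical choices made for the illustration.

```python
import math


def soft_max(values, gamma):
    """Smooth maximum gamma * log(sum(exp(v / gamma))); tends to max(values) as gamma -> 0."""
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    m = max(finite)  # subtract the maximum for numerical stability
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in finite))


def soft_alignment_score(log_probs, gamma=1.0):
    """Soft score over all monotonic alignments of T frames to K transcript symbols.

    log_probs[t][k] is the log-score of frame t belonging to the k-th symbol.
    Assumption for this sketch: each frame either stays on the current symbol
    or advances to the next one, starting on symbol 0 and ending on symbol K-1.
    """
    num_symbols = len(log_probs[0])
    prev = [-math.inf] * num_symbols
    prev[0] = log_probs[0][0]  # the alignment must start on the first symbol
    for frame in log_probs[1:]:
        cur = [-math.inf] * num_symbols
        for k in range(num_symbols):
            stay = prev[k]
            advance = prev[k - 1] if k > 0 else -math.inf
            cur[k] = frame[k] + soft_max([stay, advance], gamma)
        prev = cur
    return prev[-1]  # the alignment must end on the last symbol


# Toy input: 3 frames, 2 symbols (values are made up for the example).
toy_log_probs = [[-0.1, -2.0], [-1.0, -0.5], [-3.0, -0.2]]
soft_score = soft_alignment_score(toy_log_probs, gamma=1.0)
near_hard_score = soft_alignment_score(toy_log_probs, gamma=1e-3)
```

With `gamma=1.0` the result is the log-sum-exp over all valid paths; with a very small `gamma` it approaches the score of the single best path, which is what a hard Viterbi-style recursion would return.
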
|
|
| Upload audio + a transcript (`.phn` / `.wrd` / `.txt`), choose the options, and get a boundary |
| table, a downloadable Praat `.TextGrid`, and a time-aligned visualization (waveform · |
| spectrogram · phoneme posteriors · Soft-DP path · contrastive score). |
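
For readers who want to post-process the boundary table themselves, the following is a minimal sketch of how a list of labeled intervals could be written as a long-format Praat TextGrid with a single interval tier. It is not the app's export code: the function name `to_textgrid`, the tier name, and the example phone labels and times are hypothetical.

```python
def to_textgrid(intervals, tier_name="phones"):
    """Render (start_s, end_s, label) tuples as a long-format Praat TextGrid string.

    Assumes the intervals are sorted, contiguous, and given in seconds.
    """
    if not intervals:
        raise ValueError("need at least one interval")
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for index, (start, end, label) in enumerate(intervals, start=1):
        escaped = label.replace('"', '""')  # Praat escapes a quote by doubling it
        lines += [
            f"        intervals [{index}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


# Made-up boundaries for illustration only.
example_textgrid = to_textgrid([(0.0, 0.12, "d"), (0.12, 0.2, "ow")])
```

Writing `example_textgrid` to a file with a `.TextGrid` extension should give something Praat can open alongside the audio.
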
|
|
| - **Paper:** *Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming* — [arXiv:2606.25460](https://arxiv.org/abs/2606.25460) |
| - **Code:** https://github.com/MLSpeech/FALCON |
| - **Weights:** https://huggingface.co/MLSpeech/FALCON-weights |
|
|
| Example inputs are in `assets/`: the TIMIT sentence *"Don't ask me to carry an oily rag like that."* |
| in every supported format. Checkpoints are downloaded automatically from the weights repo on |
| first use. The demo runs on a free CPU Space, so the first alignment is slower while a model is fetched. |
|
|
|
|
| ## Results |
|
|
| All numbers are from the paper (arXiv:2606.25460). **Specialist** = trained on the target English corpus; |
| **joint** = one model jointly trained on TIMIT+Buckeye. Multilingual results are **zero-shot**: |
| the models saw no target-language training data. Accuracy is the percentage of reference boundaries for which the predicted boundary falls within the tolerance *t*, given in milliseconds in the column headers. |
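
The tolerance-based accuracy can be sketched in a few lines. This is an illustrative version only: it assumes predicted and reference boundaries are paired one-to-one by index (natural for forced alignment, where both follow the same transcript), and the function name and example times are hypothetical. The paper's exact matching procedure may differ.

```python
def boundary_accuracy(reference_s, predicted_s, tolerances_ms=(10, 25, 50, 100)):
    """Percentage of reference boundaries whose paired prediction lies within each tolerance.

    Both inputs are boundary times in seconds, paired by index (an assumption of this sketch).
    Returns a dict mapping each tolerance in milliseconds to an accuracy in percent.
    """
    if len(reference_s) != len(predicted_s):
        raise ValueError("reference and predicted boundaries must have the same length")
    if not reference_s:
        raise ValueError("need at least one boundary")
    # Absolute error of each boundary, converted from seconds to milliseconds.
    errors_ms = [abs(ref - pred) * 1000.0 for ref, pred in zip(reference_s, predicted_s)]
    return {
        tol: 100.0 * sum(err <= tol for err in errors_ms) / len(errors_ms)
        for tol in tolerances_ms
    }


# Made-up boundary times (seconds) with errors of roughly 5, 20, 40, and 150 ms.
reference = [0.100, 0.250, 0.400, 0.800]
predicted = [0.105, 0.270, 0.440, 0.950]
accuracy = boundary_accuracy(reference, predicted)
```

Each column in the tables below corresponds to one entry of such a dictionary, computed over a whole test set rather than a single utterance.
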
|
|
| **Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)** |
|
|
| | Dataset | Model | t≤10 | t≤25 | t≤50 | t≤100 | |
| |---|---|---|---|---|---| |
| | TIMIT | MFA | **38.6** | 72.3 | 81.1 | 84.6 | |
| | TIMIT | FALCON specialist | 37.66 | **83.88** | 94.85 | **98.62** | |
| | TIMIT | FALCON joint | 34.70 | 82.62 | **94.91** | 98.60 | |
| | Buckeye | MFA | **35.3** | 60.6 | 68.9 | 72.7 | |
| | Buckeye | FALCON specialist | 29.69 | **69.93** | **90.07** | **97.40** | |
| | Buckeye | FALCON joint | 28.87 | 69.40 | 89.53 | 97.13 | |
|
|
| **Phoneme-Level: Unseen Multilingual Generalization Accuracy** |
|
|
| | Test set | Model | t≤10 | t≤15 | t≤20 | t≤25 | t≤50 | t≤100 | |
| |---|---|---|---|---|---|---|---| |
| | Dutch — IFA | **FALCON joint** | 26.85 | **36.16** | **44.56** | **51.17** | **69.94** | **84.11** | |
| | Dutch — IFA | FALCON specialist | **26.86** | 35.79 | 43.85 | 50.34 | 68.68 | 83.22 | |
| | Dutch — IFA | MFA | 11.01 | 14.70 | 19.05 | 21.80 | 33.90 | 51.02 | |
| | German — PHONDAT | **FALCON joint** | **25.63** | **34.12** | **41.87** | **49.07** | **70.04** | **84.58** | |
| | German — PHONDAT | FALCON specialist | 25.08 | 33.37 | 40.76 | 47.43 | 68.27 | 82.44 | |
| | German — PHONDAT | MFA | 20.60 | 31.75 | 37.17 | 45.83 | 66.78 | 79.19 | |
| | Hebrew | **FALCON joint** | **21.98** | **30.10** | **36.91** | **42.78** | **63.07** | **80.41** | |
| | Hebrew | FALCON specialist | 21.03 | 27.78 | 34.30 | 39.79 | 59.38 | 77.76 | |
|
|
| **Word-Level Alignment Accuracy [%]: Comparative Analysis** |
|
|
| | Dataset | Model | t≤10 | t≤25 | t≤50 | t≤100 | |
| |---|---|---|---|---|---| |
| | TIMIT | FALCON spec (MFA-G2P) | 49.22 | **81.79** | **93.04** | 98.37 | |
| | TIMIT | FALCON joint (MFA-G2P) | **49.50** | 80.60 | 92.86 | **98.46** | |
| | TIMIT | MFA | 41.60 | 72.80 | 89.40 | 97.40 | |
| | TIMIT | MMS | 18.60 | 43.50 | 75.70 | 94.70 | |
| | TIMIT | WhisperX | 22.40 | 52.70 | 82.40 | 94.20 | |
| | TIMIT | Nvidia-Canary-1b | 9.23 | 23.11 | 44.23 | 72.81 | |
| | Buckeye | FALCON spec (MFA-G2P) | 50.06 | 77.85 | **91.51** | **96.63** | |
| | Buckeye | FALCON joint (MFA-G2P) | **50.42** | **77.98** | 91.01 | 96.55 | |
| | Buckeye | MFA | 39.80 | 69.90 | 84.90 | 91.80 | |
| | Buckeye | MMS | 25.00 | 52.70 | 75.00 | 87.90 | |
| | Buckeye | WhisperX | 18.80 | 43.10 | 67.40 | 77.40 | |
| | Buckeye | Nvidia-Canary-1b | 8.06 | 18.83 | 36.31 | 63.29 | |
|
|
| **Word-Level: Unseen Multilingual Generalization Accuracy** |
|
|
| | Dataset | Model | t≤10 | t≤25 | t≤50 | t≤100 | |
| |---|---|---|---|---|---| |
| | German — PHONDAT | FALCON (MFA-G2P) | **44.20** | **68.48** | **86.12** | **95.11** | |
| | German — PHONDAT | MFA | 29.9 | 65.4 | 82.1 | 94.3 | |
| | German — PHONDAT | MMS | 21.8 | 44.3 | 74.9 | 91.8 | |
| | Dutch — IFA | FALCON (MFA-G2P) | **26.38** | **45.15** | 61.16 | 76.49 | |
| | Dutch — IFA | MFA | 4.7 | 7.3 | 11.6 | 19.0 | |
| | Dutch — IFA | MMS | 16.0 | 37.9 | **62.9** | **76.6** | |
| | Hebrew | FALCON | **31.91** | **56.72** | 75.18 | 87.89 | |
| | Hebrew | MMS | 14.3 | 41.3 | **76.5** | **94.7** | |
|
|
|
|
| ## Example alignment |
|
|
| The figure below is the app's own output (waveform · spectrogram · phoneme posteriors · Soft-DP path · |
| contrastive score) for a real TIMIT test utterance. Bundled inputs are in `assets/`. |
|
|
| **English — TIMIT** (read speech, phoneme-level) — *"Don't ask me to carry an oily rag like that."* |
|
|
|  |
|
|
| Example audio is for demonstration only and remains subject to its source corpus's original |
| license — see `assets/examples/NOTICE`. |
|
|