title: FALCON Forced Aligner
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
python_version: '3.8'
app_file: app.py
pinned: true
license: mit
short_description: Neural forced alignment via Soft Dynamic Programming
thumbnail: >-
https://huggingface.co/spaces/MLSpeech/FALCON/resolve/main/assets/app_screen.jpeg
FALCON — Forced Alignment through Contrastive Optimization Networks
Interactive demo of FALCON, a fully differentiable neural forced aligner that predicts precise phoneme- and word-level boundary timestamps from a waveform + transcript, using a Soft Dynamic Programming decoder.
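To give a feel for what "soft" dynamic programming means, here is a minimal, generic sketch. It is not FALCON's decoder, whose details are in the paper. It assumes a hypothetical matrix `log_post[t][k]` of frame-level log-scores for frame `t` and transcript phoneme `k`, and a monotonic alignment in which each frame either stays on the current phoneme or advances to the next one. Swapping the hard `max` of a Viterbi-style recursion for a temperature-scaled log-sum-exp makes the total score a smooth function of its inputs, which is the general idea behind a differentiable aligner. The names `smooth_max`, `soft_dp_score`, and `gamma` are illustrative only.

```python
import math

def smooth_max(values, gamma):
    """Smooth maximum: gamma * log(sum(exp(v / gamma))).

    Approaches max(values) as gamma -> 0. Shifted by the max for stability.
    """
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))

def soft_dp_score(log_post, gamma=1.0):
    """Soft score over all monotonic alignments of T frames to K phonemes.

    log_post[t][k] is an assumed log-score of frame t belonging to phoneme k.
    The alignment starts at phoneme 0 and must end at phoneme K - 1.
    """
    T, K = len(log_post), len(log_post[0])
    dp = [[-math.inf] * K for _ in range(T)]
    dp[0][0] = log_post[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1][k]
            advance = dp[t - 1][k - 1] if k > 0 else -math.inf
            dp[t][k] = smooth_max([stay, advance], gamma) + log_post[t][k]
    return dp[T - 1][K - 1]
```

With a large `gamma` the score blends all alignment paths; as `gamma` shrinks it approaches the score of the single best path, which is what a conventional hard Viterbi pass would return.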
Upload audio + a transcript (.phn / .wrd / .txt), choose the options, and get a boundary
table, a downloadable Praat .TextGrid, and a time-aligned visualization (waveform ·
spectrogram · phoneme posteriors · Soft-DP path · contrastive score).
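For readers who want to reproduce the TextGrid step offline, the sketch below converts boundary annotations into a one-tier Praat TextGrid in the long text format. It assumes TIMIT-style `.phn` content, where each line is `start_sample end_sample label` at 16 kHz; the exact parsing the app performs is not documented here, and the helper names `parse_phn` and `to_textgrid` are hypothetical. The timings in `example_phn` are made up for illustration and are not real TIMIT annotations.

```python
def parse_phn(text, sample_rate=16000):
    """Parse lines 'start_sample end_sample label' into (start_s, end_s, label)."""
    intervals = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        start, end = int(parts[0]), int(parts[1])
        label = " ".join(parts[2:])
        intervals.append((start / sample_rate, end / sample_rate, label))
    return intervals

def to_textgrid(intervals, tier_name="phones"):
    """Render contiguous intervals as a one-tier Praat TextGrid (long text format)."""
    if not intervals:
        raise ValueError("need at least one interval")
    xmin, xmax = intervals[0][0], intervals[-1][1]
    out = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, label) in enumerate(intervals, start=1):
        escaped = label.replace('"', '""')  # Praat doubles quotes inside strings
        out += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(out) + "\n"

# Made-up timings, for illustration only.
example_phn = "0 3200 h#\n3200 4800 d\n4800 8000 ow\n"
textgrid = to_textgrid(parse_phn(example_phn))
```

Praat expects the intervals of a tier to tile the time range without gaps, so this sketch only suits annotations whose segments are contiguous, as TIMIT's are.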
- Paper: Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming — arXiv:2606.25460
- Code: https://github.com/MLSpeech/FALCON
- Weights: https://huggingface.co/MLSpeech/FALCON-weights
Example inputs are in assets/ — the TIMIT sentence "Don't ask me to carry an oily rag like that."
in every supported format. The checkpoints are downloaded automatically from the weights repo on
first use (this runs on a free CPU Space, so the first alignment takes a moment to fetch a model).
Results
From the paper (arXiv:2606.25460). Specialist = a model trained on the target English corpus; joint = a single model trained jointly on TIMIT and Buckeye. Multilingual results are zero-shot: the models saw no target-language training data. Accuracy = percentage of reference boundaries that the predicted boundaries match within the tolerance t, given in milliseconds. MFA is the Montreal Forced Aligner.
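The accuracy metric can be sketched as follows. This is an illustration under one stated assumption: reference and predicted boundaries are paired one-to-one by position, which is natural for forced alignment because the transcript fixes the number of boundaries. Whether the paper's evaluation script uses exactly this pairing is not stated here, and the function name `boundary_accuracy` is hypothetical.

```python
def boundary_accuracy(ref_s, pred_s, tolerances_ms=(10, 25, 50, 100)):
    """Percent of reference boundaries whose paired prediction lies within each tolerance.

    ref_s and pred_s are boundary times in seconds, paired by index (assumption).
    Returns a dict mapping tolerance in ms to accuracy in percent.
    """
    if len(ref_s) != len(pred_s):
        raise ValueError("reference and prediction must have the same number of boundaries")
    if not ref_s:
        raise ValueError("need at least one boundary")
    result = {}
    for tol in tolerances_ms:
        hits = sum(1 for r, p in zip(ref_s, pred_s) if abs(r - p) * 1000.0 <= tol)
        result[tol] = 100.0 * hits / len(ref_s)
    return result

# Made-up boundaries with errors of 5, 20, 60, and 0 ms.
scores = boundary_accuracy([0.100, 0.250, 0.400, 0.800],
                           [0.105, 0.270, 0.460, 0.800])
```

Because each tolerance includes all tighter ones, accuracy can only grow from left to right across a row of the tables below.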
Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)
| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
|---|---|---|---|---|---|
| TIMIT | MFA | 38.6 | 72.3 | 81.1 | 84.6 |
| TIMIT | FALCON specialist | 37.66 | 83.88 | 94.85 | 98.62 |
| TIMIT | FALCON joint | 34.70 | 82.62 | 94.91 | 98.60 |
| Buckeye | MFA | 35.3 | 60.6 | 68.9 | 72.7 |
| Buckeye | FALCON specialist | 29.69 | 69.93 | 90.07 | 97.40 |
| Buckeye | FALCON joint | 28.87 | 69.40 | 89.53 | 97.13 |
Phoneme-Level: Unseen Multilingual Generalization Accuracy [%]
| Test set | Model | t≤10 ms | t≤15 ms | t≤20 ms | t≤25 ms | t≤50 ms | t≤100 ms |
|---|---|---|---|---|---|---|---|
| Dutch — IFA | FALCON joint | 26.85 | 36.16 | 44.56 | 51.17 | 69.94 | 84.11 |
| Dutch — IFA | FALCON specialist | 26.86 | 35.79 | 43.85 | 50.34 | 68.68 | 83.22 |
| Dutch — IFA | MFA | 11.01 | 14.70 | 19.05 | 21.80 | 33.90 | 51.02 |
| German — PHONDAT | FALCON joint | 25.63 | 34.12 | 41.87 | 49.07 | 70.04 | 84.58 |
| German — PHONDAT | FALCON specialist | 25.08 | 33.37 | 40.76 | 47.43 | 68.27 | 82.44 |
| German — PHONDAT | MFA | 20.60 | 31.75 | 37.17 | 45.83 | 66.78 | 79.19 |
| Hebrew | FALCON joint | 21.98 | 30.10 | 36.91 | 42.78 | 63.07 | 80.41 |
| Hebrew | FALCON specialist | 21.03 | 27.78 | 34.30 | 39.79 | 59.38 | 77.76 |
Word-Level Alignment Accuracy [%]: Comparative Analysis
| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
|---|---|---|---|---|---|
| TIMIT | FALCON specialist (MFA-G2P) | 49.22 | 81.79 | 93.04 | 98.37 |
| TIMIT | FALCON joint (MFA-G2P) | 49.50 | 80.60 | 92.86 | 98.46 |
| TIMIT | MFA | 41.60 | 72.80 | 89.40 | 97.40 |
| TIMIT | MMS | 18.60 | 43.50 | 75.70 | 94.70 |
| TIMIT | WhisperX | 22.40 | 52.70 | 82.40 | 94.20 |
| TIMIT | Nvidia-Canary-1b | 9.23 | 23.11 | 44.23 | 72.81 |
| Buckeye | FALCON specialist (MFA-G2P) | 50.06 | 77.85 | 91.51 | 96.63 |
| Buckeye | FALCON joint (MFA-G2P) | 50.42 | 77.98 | 91.01 | 96.55 |
| Buckeye | MFA | 39.80 | 69.90 | 84.90 | 91.80 |
| Buckeye | MMS | 25.00 | 52.70 | 75.00 | 87.90 |
| Buckeye | WhisperX | 18.80 | 43.10 | 67.40 | 77.40 |
| Buckeye | Nvidia-Canary-1b | 8.06 | 18.83 | 36.31 | 63.29 |
Word-Level: Unseen Multilingual Generalization Accuracy [%]
| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
|---|---|---|---|---|---|
| German — PHONDAT | FALCON (MFA-G2P) | 44.20 | 68.48 | 86.12 | 95.11 |
| German — PHONDAT | MFA | 29.9 | 65.4 | 82.1 | 94.3 |
| German — PHONDAT | MMS | 21.8 | 44.3 | 74.9 | 91.8 |
| Dutch — IFA | FALCON (MFA-G2P) | 26.38 | 45.15 | 61.16 | 76.49 |
| Dutch — IFA | MFA | 4.7 | 7.3 | 11.6 | 19.0 |
| Dutch — IFA | MMS | 16.0 | 37.9 | 62.9 | 76.6 |
| Hebrew | FALCON | 31.91 | 56.72 | 75.18 | 87.89 |
| Hebrew | MMS | 14.3 | 41.3 | 76.5 | 94.7 |
Example alignment
The app's own output (waveform · spectrogram · phoneme posteriors · Soft-DP path ·
contrastive score) on a real TIMIT test utterance. Bundled inputs are in assets/.
English — TIMIT (read speech, phoneme-level) — "Don't ask me to carry an oily rag like that."
Example audio is for demonstration only and remains subject to its source corpus's original
license — see assets/examples/NOTICE.
