---
title: FALCON Forced Aligner
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
python_version: '3.8'
app_file: app.py
pinned: true
license: mit
short_description: Neural forced alignment via Soft Dynamic Programming
thumbnail: >-
  https://huggingface.co/spaces/MLSpeech/FALCON/resolve/main/assets/app_screen.jpeg
---

FALCON — Forced Alignment through Contrastive Optimization Networks

Interactive demo of FALCON, a fully differentiable neural forced aligner that predicts precise phoneme- and word-level boundary timestamps from a waveform + transcript, using a Soft Dynamic Programming decoder.
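
To give a feel for what "soft" dynamic programming means, here is a generic sketch. It is not FALCON's actual decoder. It shows a monotonic frame-to-phone alignment recursion in which the hard `min` of Viterbi-style decoding is replaced by a smooth soft-min, so the total cost becomes differentiable in the frame costs. The cost matrix, the `gamma` temperature, and both function names are assumptions made for illustration only.

```python
import math
from typing import List


def soft_min(a: float, b: float, gamma: float) -> float:
    """Smooth minimum: -gamma * log(exp(-a/gamma) + exp(-b/gamma)), computed stably."""
    m = min(a, b)
    if math.isinf(m):
        return m  # both branches unreachable
    return m - gamma * math.log(
        math.exp(-(a - m) / gamma) + math.exp(-(b - m) / gamma)
    )


def soft_alignment_cost(cost: List[List[float]], gamma: float = 1.0) -> float:
    """Soft total cost of aligning frames (rows) to phones (columns).

    The path starts in the first phone, ends in the last one, and at every
    frame either stays in the current phone or advances to the next.
    """
    n_phones = len(cost[0])
    inf = float("inf")
    prev = [inf] * n_phones
    prev[0] = cost[0][0]  # the first frame must belong to the first phone
    for t in range(1, len(cost)):
        cur = [inf] * n_phones
        for p in range(n_phones):
            stay = prev[p]
            advance = prev[p - 1] if p > 0 else inf
            cur[p] = cost[t][p] + soft_min(stay, advance, gamma)
        prev = cur
    return prev[-1]  # the last frame must belong to the last phone


# Hypothetical toy costs: 4 frames, 2 phones (lower = better match).
toy_cost = [[0.0, 5.0], [1.0, 4.0], [6.0, 0.0], [7.0, 1.0]]
```

For this toy matrix the cheapest hard path costs 2.0. With a very small `gamma` the soft cost approaches that value, while larger values of `gamma` blend in the other paths and give a slightly lower, smoother result.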

Upload audio + a transcript (.phn / .wrd / .txt), choose the options, and get a boundary table, a downloadable Praat .TextGrid, and a time-aligned visualization (waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score).
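
A boundary table maps naturally onto a Praat interval tier. The following is a minimal sketch of writing a single tier in Praat's long TextGrid text format. The function name `to_textgrid` and the default tier name are hypothetical, and this is not necessarily how the app builds its download. It assumes the intervals are sorted and contiguous, as Praat expects for an interval tier.

```python
from typing import List, Tuple


def to_textgrid(intervals: List[Tuple[float, float, str]],
                tier_name: str = "phones") -> str:
    """Render (start_s, end_s, label) intervals as a one-tier TextGrid string."""
    if not intervals:
        raise ValueError("at least one interval is required")
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, label) in enumerate(intervals, start=1):
        escaped = label.replace('"', '""')  # Praat doubles embedded quotes
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file with a `.TextGrid` extension should give something Praat can open alongside the audio.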

Example inputs are in assets/: the TIMIT sentence "Don't ask me to carry an oily rag like that." in every supported format. Checkpoints are downloaded automatically from the weights repo on first use. The demo runs on a free CPU Space, so the first alignment takes a moment while a model is fetched.
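
For readers unfamiliar with the TIMIT transcript files, here is a small sketch of reading one. It assumes the bundled `.phn` file follows the standard TIMIT layout, where each line is `start_sample end_sample label` and the audio is sampled at 16 kHz. The function name `parse_phn` is hypothetical, and the sample lines are made-up values, not the contents of the example file.

```python
from typing import List, Tuple

TIMIT_SAMPLE_RATE = 16_000  # TIMIT audio is sampled at 16 kHz


def parse_phn(text: str,
              sample_rate: int = TIMIT_SAMPLE_RATE) -> List[Tuple[float, float, str]]:
    """Convert TIMIT-style 'start end label' lines to (start_s, end_s, label)."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        segments.append((start / sample_rate, end / sample_rate, label))
    return segments


# Illustrative (invented) sample offsets, not taken from the bundled example.
sample = "0 1600 h#\n1600 3200 d\n3200 4800 ow\n"
segments = parse_phn(sample)
```

TIMIT `.wrd` files use the same three-column layout with a word in place of the phone label, so the same function should work for them under the same assumption.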

Results

From the paper (arXiv:2606.25460). Specialist = trained on the target English corpus; joint = one model jointly trained on TIMIT+Buckeye. Multilingual results are zero-shot: no target-language training data is used. Accuracy is the percentage of reference boundaries matched within the tolerance t, given in milliseconds.
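
The accuracy metric reported in the results tables can be sketched as follows. This is an illustration under one assumption, not the paper's evaluation script: because forced alignment keeps the transcript fixed, predicted and reference boundaries are taken to be paired one-to-one in order. The function name `boundary_accuracy` and the example timestamps are hypothetical.

```python
from typing import Dict, Sequence


def boundary_accuracy(ref_s: Sequence[float],
                      pred_s: Sequence[float],
                      tolerances_ms: Sequence[int] = (10, 25, 50, 100)) -> Dict[int, float]:
    """Percent of reference boundaries whose paired prediction is within each tolerance."""
    if len(ref_s) != len(pred_s):
        raise ValueError("reference and predicted boundaries must be paired one-to-one")
    if not ref_s:
        raise ValueError("no boundaries to score")
    # Absolute boundary error in milliseconds for each pair.
    errors_ms = [abs(r - p) * 1000.0 for r, p in zip(ref_s, pred_s)]
    return {
        tol: 100.0 * sum(e <= tol for e in errors_ms) / len(errors_ms)
        for tol in tolerances_ms
    }


# Invented boundaries in seconds; the errors are roughly 5, 20, 60 and 0 ms.
reference = [0.100, 0.250, 0.400, 0.800]
predicted = [0.105, 0.270, 0.460, 0.800]
scores = boundary_accuracy(reference, predicted)
```

Here two of the four boundaries fall within 10 ms, three within 25 ms and 50 ms, and all four within 100 ms, so `scores` is `{10: 50.0, 25: 75.0, 50: 75.0, 100: 100.0}`.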

Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)

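| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
| --- | --- | ---: | ---: | ---: | ---: |
| TIMIT | MFA | 38.6 | 72.3 | 81.1 | 84.6 |
| TIMIT | FALCON specialist | 37.66 | 83.88 | 94.85 | 98.62 |
| TIMIT | FALCON joint | 34.70 | 82.62 | 94.91 | 98.60 |
| Buckeye | MFA | 35.3 | 60.6 | 68.9 | 72.7 |
| Buckeye | FALCON specialist | 29.69 | 69.93 | 90.07 | 97.40 |
| Buckeye | FALCON joint | 28.87 | 69.40 | 89.53 | 97.13 |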

Phoneme-Level: Unseen Multilingual Generalization Accuracy

| Test set | Model | ≤10 ms | ≤15 ms | ≤20 ms | ≤25 ms | ≤50 ms | ≤100 ms |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Dutch — IFA | FALCON joint | 26.85 | 36.16 | 44.56 | 51.17 | 69.94 | 84.11 |
| Dutch — IFA | FALCON specialist | 26.86 | 35.79 | 43.85 | 50.34 | 68.68 | 83.22 |
| Dutch — IFA | MFA | 11.01 | 14.70 | 19.05 | 21.80 | 33.90 | 51.02 |
| German — PHONDAT | FALCON joint | 25.63 | 34.12 | 41.87 | 49.07 | 70.04 | 84.58 |
| German — PHONDAT | FALCON specialist | 25.08 | 33.37 | 40.76 | 47.43 | 68.27 | 82.44 |
| German — PHONDAT | MFA | 20.60 | 31.75 | 37.17 | 45.83 | 66.78 | 79.19 |
| Hebrew | FALCON joint | 21.98 | 30.10 | 36.91 | 42.78 | 63.07 | 80.41 |
| Hebrew | FALCON specialist | 21.03 | 27.78 | 34.30 | 39.79 | 59.38 | 77.76 |

Word-Level Alignment Accuracy [%]: Comparative Analysis

| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
| --- | --- | ---: | ---: | ---: | ---: |
| TIMIT | FALCON spec (MFA-G2P) | 49.22 | 81.79 | 93.04 | 98.37 |
| TIMIT | FALCON joint (MFA-G2P) | 49.50 | 80.60 | 92.86 | 98.46 |
| TIMIT | MFA | 41.60 | 72.80 | 89.40 | 97.40 |
| TIMIT | MMS | 18.60 | 43.50 | 75.70 | 94.70 |
| TIMIT | WhisperX | 22.40 | 52.70 | 82.40 | 94.20 |
| TIMIT | Nvidia-Canary-1b | 9.23 | 23.11 | 44.23 | 72.81 |
| Buckeye | FALCON spec (MFA-G2P) | 50.06 | 77.85 | 91.51 | 96.63 |
| Buckeye | FALCON joint (MFA-G2P) | 50.42 | 77.98 | 91.01 | 96.55 |
| Buckeye | MFA | 39.80 | 69.90 | 84.90 | 91.80 |
| Buckeye | MMS | 25.00 | 52.70 | 75.00 | 87.90 |
| Buckeye | WhisperX | 18.80 | 43.10 | 67.40 | 77.40 |
| Buckeye | Nvidia-Canary-1b | 8.06 | 18.83 | 36.31 | 63.29 |

Word-Level: Unseen Multilingual Generalization Accuracy

| Dataset | Model | t≤10 ms | t≤25 ms | t≤50 ms | t≤100 ms |
| --- | --- | ---: | ---: | ---: | ---: |
| German — PHONDAT | FALCON (MFA-G2P) | 44.20 | 68.48 | 86.12 | 95.11 |
| German — PHONDAT | MFA | 29.9 | 65.4 | 82.1 | 94.3 |
| German — PHONDAT | MMS | 21.8 | 44.3 | 74.9 | 91.8 |
| Dutch — IFA | FALCON (MFA-G2P) | 26.38 | 45.15 | 61.16 | 76.49 |
| Dutch — IFA | MFA | 4.7 | 7.3 | 11.6 | 19.0 |
| Dutch — IFA | MMS | 16.0 | 37.9 | 62.9 | 76.6 |
| Hebrew | FALCON | 31.91 | 56.72 | 75.18 | 87.89 |
| Hebrew | MMS | 14.3 | 41.3 | 76.5 | 94.7 |

Example alignment

The app's own output (waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score) on a real TIMIT test utterance. Bundled inputs are in assets/.

English — TIMIT (read speech, phoneme-level) — "Don't ask me to carry an oily rag like that."

Example audio is for demonstration only and remains subject to its source corpus's original license — see assets/examples/NOTICE.