metadata

title: FALCON Forced Aligner
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
python_version: '3.8'
app_file: app.py
pinned: true
license: mit
short_description: Neural forced alignment via Soft Dynamic Programming
thumbnail: >-
  https://huggingface.co/spaces/MLSpeech/FALCON/resolve/main/assets/app_screen.jpeg

FALCON — Forced Alignment through Contrastive Optimization Networks

Interactive demo of FALCON, a fully differentiable neural forced aligner that predicts precise phoneme- and word-level boundary timestamps from a waveform + transcript, using a Soft Dynamic Programming decoder.

Upload audio + a transcript (.phn / .wrd / .txt), choose the options, and get a boundary table, a downloadable Praat .TextGrid, and a time-aligned visualization (waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score).
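For readers who want to post-process the boundary table themselves, the sketch below shows roughly how a list of labeled intervals maps onto a one-tier Praat TextGrid in the long text format. It is an illustration only, not the app's exporter: the function name `to_textgrid`, the tier name `phones`, and the example times are all assumptions made up for this sketch.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, label); a hypothetical shape for one table row
Interval = Tuple[float, float, str]


def to_textgrid(intervals: List[Interval], tier_name: str = "phones") -> str:
    """Render contiguous intervals as a one-tier Praat TextGrid (long text format)."""
    if not intervals:
        raise ValueError("need at least one interval")
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, label) in enumerate(intervals, start=1):
        escaped = label.replace('"', '""')  # Praat doubles embedded quotes
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


# Illustrative times only; these are not real alignment results.
example = [(0.0, 0.12, "h#"), (0.12, 0.2, "d"), (0.2, 0.35, "ow")]
textgrid_text = to_textgrid(example)
```

Writing `textgrid_text` to a `.TextGrid` file should give something Praat can open, provided the intervals are contiguous and sorted by time.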

Paper: Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming — arXiv:2606.25460
Code: https://github.com/MLSpeech/FALCON
Weights: https://huggingface.co/MLSpeech/FALCON-weights

Example inputs are in assets/: the TIMIT sentence "Don't ask me to carry an oily rag like that." in every supported format. Checkpoints are downloaded automatically from the weights repo on first use. The demo runs on a free CPU Space, so the first alignment is slower while the model is fetched.

Results

From the paper (arXiv:2606.25460). Specialist = trained on the target English corpus; joint = one model jointly trained on TIMIT+Buckeye. Multilingual results are zero-shot (no target-language training data). Accuracy = % of reference boundaries matched within the tolerance t, given in milliseconds.
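To make the accuracy definition concrete, here is a minimal sketch of a tolerance-based boundary accuracy. It assumes that reference and predicted boundaries are paired one-to-one (plausible for forced alignment, where the label sequence is given) and that times are in seconds. The paper's exact matching procedure may differ, and the function name `boundary_accuracy` and the numbers are made up for illustration.

```python
from typing import Sequence


def boundary_accuracy(
    reference: Sequence[float], predicted: Sequence[float], tolerance_ms: float
) -> float:
    """Percentage of reference boundaries whose paired prediction is within tolerance."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("need at least one boundary")
    hits = sum(
        abs(r - p) * 1000.0 <= tolerance_ms for r, p in zip(reference, predicted)
    )
    return 100.0 * hits / len(reference)


# Illustrative boundaries in seconds; absolute errors are 5, 20, 60 and 0 ms.
ref = [0.10, 0.25, 0.40, 0.80]
pred = [0.105, 0.27, 0.46, 0.80]
scores = {t: boundary_accuracy(ref, pred, t) for t in (10, 25, 50, 100)}
# scores == {10: 50.0, 25: 75.0, 50: 75.0, 100: 100.0}
```

Each column in the tables below corresponds to one value of `tolerance_ms`, so accuracy can only stay the same or rise from left to right within a row.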

Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)

Dataset	Model	t≤10	t≤25	t≤50	t≤100
TIMIT	MFA	38.6	72.3	81.1	84.6
TIMIT	FALCON specialist	37.66	83.88	94.85	98.62
TIMIT	FALCON joint	34.70	82.62	94.91	98.60
Buckeye	MFA	35.3	60.6	68.9	72.7
Buckeye	FALCON specialist	29.69	69.93	90.07	97.40
Buckeye	FALCON joint	28.87	69.40	89.53	97.13

Phoneme-Level: Unseen Multilingual Generalization Accuracy

Test set	Model	t≤10	t≤15	t≤20	t≤25	t≤50	t≤100
Dutch — IFA	FALCON joint	26.85	36.16	44.56	51.17	69.94	84.11
Dutch — IFA	FALCON specialist	26.86	35.79	43.85	50.34	68.68	83.22
Dutch — IFA	MFA	11.01	14.70	19.05	21.80	33.90	51.02
German — PHONDAT	FALCON joint	25.63	34.12	41.87	49.07	70.04	84.58
German — PHONDAT	FALCON specialist	25.08	33.37	40.76	47.43	68.27	82.44
German — PHONDAT	MFA	20.60	31.75	37.17	45.83	66.78	79.19
Hebrew	FALCON joint	21.98	30.10	36.91	42.78	63.07	80.41
Hebrew	FALCON specialist	21.03	27.78	34.30	39.79	59.38	77.76

Word-Level Alignment Accuracy [%]: Comparative Analysis

Dataset	Model	t≤10	t≤25	t≤50	t≤100
TIMIT	FALCON spec (MFA-G2P)	49.22	81.79	93.04	98.37
TIMIT	FALCON joint (MFA-G2P)	49.50	80.60	92.86	98.46
TIMIT	MFA	41.60	72.80	89.40	97.40
TIMIT	MMS	18.60	43.50	75.70	94.70
TIMIT	WhisperX	22.40	52.70	82.40	94.20
TIMIT	Nvidia-Canary-1b	9.23	23.11	44.23	72.81
Buckeye	FALCON spec (MFA-G2P)	50.06	77.85	91.51	96.63
Buckeye	FALCON joint (MFA-G2P)	50.42	77.98	91.01	96.55
Buckeye	MFA	39.80	69.90	84.90	91.80
Buckeye	MMS	25.00	52.70	75.00	87.90
Buckeye	WhisperX	18.80	43.10	67.40	77.40
Buckeye	Nvidia-Canary-1b	8.06	18.83	36.31	63.29

Word-Level: Unseen Multilingual Generalization Accuracy

Dataset	Model	t≤10	t≤25	t≤50	t≤100
German — PHONDAT	FALCON (MFA-G2P)	44.20	68.48	86.12	95.11
German — PHONDAT	MFA	29.9	65.4	82.1	94.3
German — PHONDAT	MMS	21.8	44.3	74.9	91.8
Dutch — IFA	FALCON (MFA-G2P)	26.38	45.15	61.16	76.49
Dutch — IFA	MFA	4.7	7.3	11.6	19.0
Dutch — IFA	MMS	16.0	37.9	62.9	76.6
Hebrew	FALCON	31.91	56.72	75.18	87.89
Hebrew	MMS	14.3	41.3	76.5	94.7

Example alignment

The app's own output (waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score) on a real TIMIT test utterance. Bundled inputs are in assets/.

English — TIMIT (read speech, phoneme-level) — "Don't ask me to carry an oily rag like that."

Example audio is for demonstration only and remains subject to its source corpus's original license — see assets/examples/NOTICE.