	---
	title: FALCON Forced Aligner
	colorFrom: indigo
	colorTo: blue
	sdk: gradio
	sdk_version: 4.44.1
	python_version: "3.8"
	app_file: app.py
	pinned: true
	license: mit
	short_description: Neural forced alignment via Soft Dynamic Programming
	thumbnail: https://huggingface.co/spaces/MLSpeech/FALCON/resolve/main/assets/app_screen.jpeg
	---

	# FALCON — Forced Alignment through Contrastive Optimization Networks

	Interactive demo of FALCON, a fully differentiable neural forced aligner that predicts
	precise phoneme- and word-level boundary timestamps from a waveform + transcript, using a
	Soft Dynamic Programming decoder.

	Upload audio + a transcript (`.phn` / `.wrd` / `.txt`), choose the options, and get a boundary
	table, a downloadable Praat `.TextGrid`, and a time-aligned visualization (waveform ·
	spectrogram · phoneme posteriors · Soft-DP path · contrastive score).
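For readers preparing their own transcript, here is a minimal sketch of reading a TIMIT-style `.phn` file. It assumes the TIMIT convention of one `start_sample end_sample label` triple per line, with audio sampled at 16 kHz. The names `parse_phn` and `Segment` and the sample values are illustrative and not part of the FALCON code. Whether the app reads the reference times or only the labels is not stated here.

```python
from typing import List, NamedTuple

TIMIT_SAMPLE_RATE = 16_000  # TIMIT audio is sampled at 16 kHz


class Segment(NamedTuple):
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    label: str      # phoneme label


def parse_phn(text: str, sample_rate: int = TIMIT_SAMPLE_RATE) -> List[Segment]:
    """Parse TIMIT-style `.phn` content into segments with times in seconds."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = parts
        segments.append(
            Segment(int(start) / sample_rate, int(end) / sample_rate, label)
        )
    return segments


# Made-up sample indices for illustration; not an actual TIMIT annotation.
example = """0 2400 h#
2400 3200 d
3200 4800 ow
"""

segments = parse_phn(example)
labels = [seg.label for seg in segments]  # the phoneme sequence to align
```

A forced aligner needs only the label sequence as input. The reference times are useful afterwards for scoring predicted boundaries.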

	- Paper: Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming — [arXiv:2606.25460](https://arxiv.org/abs/2606.25460)
	- Code: https://github.com/MLSpeech/FALCON
	- Weights: https://huggingface.co/MLSpeech/FALCON-weights

	Example inputs are in `assets/` — the TIMIT sentence "Don't ask me to carry an oily rag like that."
	in every supported format. The checkpoints are downloaded automatically from the weights repo on
	first use (this runs on a free CPU Space, so the first alignment takes a moment to fetch a model).


	## Results

	From the paper (arXiv:2606.25460). Specialist = trained on the target English corpus;
	joint = one model trained jointly on TIMIT+Buckeye. Multilingual results are zero-shot
	(no target-language training data). Accuracy = % of reference boundaries matched within the
	tolerance t, where t is the column threshold in milliseconds.
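The accuracy metric above can be sketched as follows. This is a minimal illustration, assuming predicted and reference boundaries are paired one-to-one by index (as they are when both come from the same phoneme or word sequence). The paper's exact matching protocol may differ. The function name `boundary_accuracy` and the toy values are hypothetical.

```python
from typing import Dict, Sequence


def boundary_accuracy(
    ref_ms: Sequence[float],
    pred_ms: Sequence[float],
    tolerances_ms: Sequence[int] = (10, 25, 50, 100),
) -> Dict[int, float]:
    """Percentage of reference boundaries whose paired prediction lies within each tolerance."""
    if len(ref_ms) != len(pred_ms):
        raise ValueError("reference and predicted boundaries must be paired one-to-one")
    if not ref_ms:
        raise ValueError("at least one boundary is required")
    # Absolute timing error of each boundary, in milliseconds.
    errors = [abs(p - r) for r, p in zip(ref_ms, pred_ms)]
    return {
        tol: 100.0 * sum(e <= tol for e in errors) / len(errors)
        for tol in tolerances_ms
    }


# Toy boundaries in milliseconds (not taken from any corpus).
ref = [100.0, 250.0, 400.0, 620.0]
pred = [105.0, 270.0, 460.0, 625.0]
acc = boundary_accuracy(ref, pred)
```

Because each threshold counts every boundary with an error at or below it, accuracy can only stay level or rise as the tolerance grows, which is the pattern in every row of the tables below.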

	Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)

	| Dataset | Model | t≤10 | t≤25 | t≤50 | t≤100 |
	|---|---|---|---|---|---|
	| TIMIT | MFA | 38.6 | 72.3 | 81.1 | 84.6 |
	| TIMIT | FALCON specialist | 37.66 | 83.88 | 94.85 | 98.62 |
	| TIMIT | FALCON joint | 34.70 | 82.62 | 94.91 | 98.60 |
	| Buckeye | MFA | 35.3 | 60.6 | 68.9 | 72.7 |
	| Buckeye | FALCON specialist | 29.69 | 69.93 | 90.07 | 97.40 |
	| Buckeye | FALCON joint | 28.87 | 69.40 | 89.53 | 97.13 |

	Phoneme-Level: Unseen Multilingual Generalization Accuracy

	| Test set | Model | t≤10 | t≤15 | t≤20 | t≤25 | t≤50 | t≤100 |
	|---|---|---|---|---|---|---|---|
	| Dutch — IFA | FALCON joint | 26.85 | 36.16 | 44.56 | 51.17 | 69.94 | 84.11 |
	| Dutch — IFA | FALCON specialist | 26.86 | 35.79 | 43.85 | 50.34 | 68.68 | 83.22 |
	| Dutch — IFA | MFA | 11.01 | 14.70 | 19.05 | 21.80 | 33.90 | 51.02 |
	| German — PHONDAT | FALCON joint | 25.63 | 34.12 | 41.87 | 49.07 | 70.04 | 84.58 |
	| German — PHONDAT | FALCON specialist | 25.08 | 33.37 | 40.76 | 47.43 | 68.27 | 82.44 |
	| German — PHONDAT | MFA | 20.60 | 31.75 | 37.17 | 45.83 | 66.78 | 79.19 |
	| Hebrew | FALCON joint | 21.98 | 30.10 | 36.91 | 42.78 | 63.07 | 80.41 |
	| Hebrew | FALCON specialist | 21.03 | 27.78 | 34.30 | 39.79 | 59.38 | 77.76 |

	Word-Level Alignment Accuracy [%]: Comparative Analysis

	| Dataset | Model | t≤10 | t≤25 | t≤50 | t≤100 |
	|---|---|---|---|---|---|
	| TIMIT | FALCON spec (MFA-G2P) | 49.22 | 81.79 | 93.04 | 98.37 |
	| TIMIT | FALCON joint (MFA-G2P) | 49.50 | 80.60 | 92.86 | 98.46 |
	| TIMIT | MFA | 41.60 | 72.80 | 89.40 | 97.40 |
	| TIMIT | MMS | 18.60 | 43.50 | 75.70 | 94.70 |
	| TIMIT | WhisperX | 22.40 | 52.70 | 82.40 | 94.20 |
	| TIMIT | Nvidia-Canary-1b | 9.23 | 23.11 | 44.23 | 72.81 |
	| Buckeye | FALCON spec (MFA-G2P) | 50.06 | 77.85 | 91.51 | 96.63 |
	| Buckeye | FALCON joint (MFA-G2P) | 50.42 | 77.98 | 91.01 | 96.55 |
	| Buckeye | MFA | 39.80 | 69.90 | 84.90 | 91.80 |
	| Buckeye | MMS | 25.00 | 52.70 | 75.00 | 87.90 |
	| Buckeye | WhisperX | 18.80 | 43.10 | 67.40 | 77.40 |
	| Buckeye | Nvidia-Canary-1b | 8.06 | 18.83 | 36.31 | 63.29 |

	Word-Level: Unseen Multilingual Generalization Accuracy

	| Dataset | Model | t≤10 | t≤25 | t≤50 | t≤100 |
	|---|---|---|---|---|---|
	| German — PHONDAT | FALCON (MFA-G2P) | 44.20 | 68.48 | 86.12 | 95.11 |
	| German — PHONDAT | MFA | 29.9 | 65.4 | 82.1 | 94.3 |
	| German — PHONDAT | MMS | 21.8 | 44.3 | 74.9 | 91.8 |
	| Dutch — IFA | FALCON (MFA-G2P) | 26.38 | 45.15 | 61.16 | 76.49 |
	| Dutch — IFA | MFA | 4.7 | 7.3 | 11.6 | 19.0 |
	| Dutch — IFA | MMS | 16.0 | 37.9 | 62.9 | 76.6 |
	| Hebrew | FALCON | 31.91 | 56.72 | 75.18 | 87.89 |
	| Hebrew | MMS | 14.3 | 41.3 | 76.5 | 94.7 |


	## Example alignment

	The app's own output (waveform · spectrogram · phoneme posteriors · Soft-DP path ·
	contrastive score) on a real TIMIT test utterance. Bundled inputs are in `assets/`.

	English — TIMIT (read speech, phoneme-level) — "Don't ask me to carry an oily rag like that."

	![English](https://huggingface.co/spaces/MLSpeech/FALCON/resolve/main/assets/example_english.png)

	Example audio is for demonstration only and remains subject to its source corpus's original
	license — see `assets/examples/NOTICE`.
