license: mit
library_name: pytorch
tags:
  - forced-alignment
  - speech
  - phoneme-alignment
  - audio

FALCON — pretrained checkpoints

Pretrained checkpoints for FALCON (Forced Alignment through Contrastive Optimization Networks), the neural forced aligner from the paper "Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming" (arXiv:2606.25460).

File	Trained on	Best for
`falcon_timit_english.pt`	TIMIT (read English)	English phoneme alignment
`falcon_buckeye_english.pt`	Buckeye (spontaneous English)	Spontaneous / conversational English
`falcon_joint_multilingual.pt`	Joint TIMIT+Buckeye	Cross-lingual / multilingual zero-shot alignment (Dutch, German, Hebrew, …) at phoneme and word level

Interactive demo: https://huggingface.co/spaces/MLSpeech/FALCON
Code: https://github.com/MLSpeech/FALCON

Each checkpoint is a PyTorch state dict bundled with the model hyperparameters and the dill-serialized peak-detection parameters. Load the checkpoints with the FALCON code (predict.py / app.py), which downloads them automatically from this repo when HF_MODEL_REPO is set.
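For fetching a checkpoint by hand, a minimal sketch follows. It is not the loader used by predict.py or app.py. It assumes that HF_MODEL_REPO is an environment variable holding this repo's id, and the variant keys (`timit`, `buckeye`, `joint`) and both helper names are hypothetical, introduced only for illustration. The file names come from the table above, and `hf_hub_download` is the standard download call in the `huggingface_hub` package.

```python
import os

# File names as listed in the checkpoint table; the short keys are hypothetical.
CHECKPOINT_FILES = {
    "timit": "falcon_timit_english.pt",
    "buckeye": "falcon_buckeye_english.pt",
    "joint": "falcon_joint_multilingual.pt",
}


def checkpoint_filename(variant: str) -> str:
    """Map a short variant key to the checkpoint file name in this repo."""
    try:
        return CHECKPOINT_FILES[variant]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(CHECKPOINT_FILES)}"
        ) from None


def fetch_checkpoint(variant: str) -> str:
    """Download one checkpoint and return the path of the cached local file."""
    # Imported lazily so checkpoint_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    # Assumption: HF_MODEL_REPO is an environment variable with this repo's id.
    repo_id = os.environ["HF_MODEL_REPO"]
    return hf_hub_download(repo_id=repo_id, filename=checkpoint_filename(variant))
```

Calling `fetch_checkpoint("joint")` would return a local path that can then be handed to the FALCON code. Since the file also carries dill-serialized parameters, deserializing it is best left to the FALCON code itself.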

Benchmark results (from the paper)

Accuracy is the percentage of reference boundaries matched within the tolerance t, given in milliseconds. Specialist = trained on the target English corpus; joint = falcon_joint_multilingual.pt. Multilingual rows are zero-shot (no target-language training data). Bold = best in column-block.
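The accuracy definition can be sketched in a few lines. This is an illustration, not the paper's evaluation script. It assumes that each predicted boundary is paired with the reference boundary at the same index (forced alignment yields one boundary per reference boundary), and that a boundary counts as matched when the absolute error is at most the tolerance. The function name and the sample timestamps are made up for the example.

```python
from typing import Sequence


def boundary_accuracy(
    reference_ms: Sequence[float],
    predicted_ms: Sequence[float],
    tolerance_ms: float,
) -> float:
    """Percentage of reference boundaries whose paired prediction is within tolerance."""
    if len(reference_ms) != len(predicted_ms):
        raise ValueError("expected one predicted boundary per reference boundary")
    if not reference_ms:
        raise ValueError("need at least one boundary")
    # Assumption: boundaries are paired by index and the tolerance is inclusive.
    hits = sum(
        abs(pred - ref) <= tolerance_ms
        for ref, pred in zip(reference_ms, predicted_ms)
    )
    return 100.0 * hits / len(reference_ms)


# Made-up boundaries in milliseconds; absolute errors are 8, 20, 50 and 120 ms.
reference = [120.0, 310.0, 480.0, 700.0]
predicted = [128.0, 330.0, 530.0, 820.0]

for t in (10, 25, 50, 100):
    print(f"t<={t} ms: {boundary_accuracy(reference, predicted, t):.1f}%")
```

With these sample values the accuracy rises from 25% at 10 ms to 75% at 50 ms and stays there at 100 ms, because the last boundary is off by 120 ms. Columns in the tables below grow from left to right for the same reason: a larger tolerance can only add matches.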

Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)

Dataset	Model	t≤10	t≤25	t≤50	t≤100
TIMIT	MFA	38.6	72.3	81.1	84.6
TIMIT	FALCON specialist	37.66	83.88	94.85	98.62
TIMIT	FALCON joint	34.70	82.62	94.91	98.60
Buckeye	MFA	35.3	60.6	68.9	72.7
Buckeye	FALCON specialist	29.69	69.93	90.07	97.40
Buckeye	FALCON joint	28.87	69.40	89.53	97.13

Phoneme-Level: Unseen Multilingual Generalization Accuracy

Test set	Model	t≤10	t≤15	t≤20	t≤25	t≤50	t≤100
Dutch — IFA	FALCON joint	26.85	36.16	44.56	51.17	69.94	84.11
Dutch — IFA	FALCON specialist	26.86	35.79	43.85	50.34	68.68	83.22
Dutch — IFA	MFA	11.01	14.70	19.05	21.80	33.90	51.02
German — PHONDAT	FALCON joint	25.63	34.12	41.87	49.07	70.04	84.58
German — PHONDAT	FALCON specialist	25.08	33.37	40.76	47.43	68.27	82.44
German — PHONDAT	MFA	20.60	31.75	37.17	45.83	66.78	79.19
Hebrew	FALCON joint	21.98	30.10	36.91	42.78	63.07	80.41
Hebrew	FALCON specialist	21.03	27.78	34.30	39.79	59.38	77.76

Word-Level Alignment Accuracy [%]: Comparative Analysis

Dataset	Model	t≤10	t≤25	t≤50	t≤100
TIMIT	FALCON spec (MFA-G2P)	49.22	81.79	93.04	98.37
TIMIT	FALCON joint (MFA-G2P)	49.50	80.60	92.86	98.46
TIMIT	MFA	41.60	72.80	89.40	97.40
TIMIT	MMS	18.60	43.50	75.70	94.70
TIMIT	WhisperX	22.40	52.70	82.40	94.20
TIMIT	Nvidia-Canary-1b	9.23	23.11	44.23	72.81
Buckeye	FALCON spec (MFA-G2P)	50.06	77.85	91.51	96.63
Buckeye	FALCON joint (MFA-G2P)	50.42	77.98	91.01	96.55
Buckeye	MFA	39.80	69.90	84.90	91.80
Buckeye	MMS	25.00	52.70	75.00	87.90
Buckeye	WhisperX	18.80	43.10	67.40	77.40
Buckeye	Nvidia-Canary-1b	8.06	18.83	36.31	63.29

Word-Level: Unseen Multilingual Generalization Accuracy

Dataset	Model	t≤10	t≤25	t≤50	t≤100
German — PHONDAT	FALCON (MFA-G2P)	44.20	68.48	86.12	95.11
German — PHONDAT	MFA	29.9	65.4	82.1	94.3
German — PHONDAT	MMS	21.8	44.3	74.9	91.8
Dutch — IFA	FALCON (MFA-G2P)	26.38	45.15	61.16	76.49
Dutch — IFA	MFA	4.7	7.3	11.6	19.0
Dutch — IFA	MMS	16.0	37.9	62.9	76.6
Hebrew	FALCON	31.91	56.72	75.18	87.89
Hebrew	MMS	14.3	41.3	76.5	94.7

Paper: Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming, arXiv:2606.25460