ELF-S2T pretrained weights

Pretrained checkpoints for "Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation".

Code: https://github.com/Sslnon/ELF-S2T

ELF-S2T treats speech-to-text as audio-conditioned generation in a continuous text-embedding space: a frozen Whisper-large-v3 encoder and a single linear projector condition a pretrained ELF flow-matching backbone. See the GitHub repo for the full method and for the training and inference code.

Checkpoints

File	Task	Backbone	Test metric
`asr_elf_b.pt`	ASR (LibriSpeech)	ELF-B (105.9 M)	10.50% WER
`asr_elf_l.pt`	ASR (LibriSpeech)	ELF-L (653.4 M)	5.69% WER
`st_deen_elf_b.pt`	S2TT (CoVoST2 de->en)	ELF-B (105.9 M)	25.35 BLEU
`st_deen_elf_l.pt`	S2TT (CoVoST2 de->en)	ELF-L (653.4 M)	28.55 BLEU / 54.91 chrF

WER is measured on LibriSpeech test-clean; BLEU and chrF are measured on the CoVoST2 de->en test set. Inference recipe: SDE sampler, K=128 steps, audio guidance w=2.0.
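
For readers unfamiliar with the WER numbers in the table, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words, pooled over the corpus. The function names are illustrative and not part of the ELF-S2T repo, and any text normalization (casing, punctuation) applied before scoring in the paper is not reproduced here, so this sketch is not guaranteed to match the reported figures.

```python
from typing import Iterable, Tuple


def word_edits(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (edit_distance, reference_length), both counted in words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        edits, words = word_edits(reference, hypothesis)
        total_edits += edits
        total_words += words
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

Pooling edits over the whole corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the score.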

Each `.pt` file is a training checkpoint dict with four keys: `elf` (backbone state_dict), `audio_proj` (the linear projector), `step`, and `config` (the full training config, which lets inference auto-detect backbone size, EOS-route, etc.).
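
The checkpoint layout described above can be inspected before running inference. The following is a minimal sketch, assuming the file loads with plain `torch.load`. The helper names `load_checkpoint` and `summarize_checkpoint` are hypothetical and not part of the ELF-S2T repo, and the exact type of `audio_proj` and `config` is not stated in this card, so the summary only reports their type names.

```python
from typing import Any, Dict, Mapping

# Keys listed in this model card.
EXPECTED_KEYS = ("elf", "audio_proj", "step", "config")


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Load a checkpoint onto the CPU (hypothetical helper)."""
    import torch  # imported lazily so summarize_checkpoint works without PyTorch

    # Assumption: the embedded config may be an arbitrary Python object, so
    # weights_only=False is used. This unpickles arbitrary code, so only load
    # files from a source you trust.
    return torch.load(path, map_location="cpu", weights_only=False)


def summarize_checkpoint(ckpt: Mapping[str, Any]) -> Dict[str, Any]:
    """Check the documented keys and return a small summary."""
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return {
        "step": ckpt["step"],
        "elf_entries": len(ckpt["elf"]),  # number of entries in the backbone state_dict
        "audio_proj_type": type(ckpt["audio_proj"]).__name__,
        "config_type": type(ckpt["config"]).__name__,
    }
```

A typical call would be `summarize_checkpoint(load_checkpoint("outputs/asr_elf_l_dl/asr_elf_l.pt"))`, using the download path from the Usage section.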

Usage

git clone https://github.com/Sslnon/ELF-S2T
cd ELF-S2T
pip install -r requirements.txt

# download these weights
huggingface-cli download ssinon/ELF-S2T asr_elf_l.pt --local-dir outputs/asr_elf_l_dl

# ASR inference on LibriSpeech test-clean
CKPT=outputs/asr_elf_l_dl/asr_elf_l.pt ACFG=2.0 STEPS=128 \
    GPUS=0 NPROC=1 BS=8 ./scripts/infer_ls_test_clean.sh

The ELF-S2T inference script reads the backbone size and recipe from each checkpoint's embedded config, so the same command works for every file.
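
As an alternative to `huggingface-cli`, the same files can be fetched from Python with `huggingface_hub.hf_hub_download`. This is a minimal sketch: the file names come from the table in this card, while the task and size labels, the helper names, and the default `local_dir` are illustrative assumptions. Only the LibriSpeech inference script is named in this card, so the script to use for the S2TT checkpoints should be looked up in the GitHub repo.

```python
REPO_ID = "ssinon/ELF-S2T"

# File names as listed in the Checkpoints table; the (task, size) labels are illustrative.
CHECKPOINTS = {
    ("asr", "b"): "asr_elf_b.pt",
    ("asr", "l"): "asr_elf_l.pt",
    ("st_deen", "b"): "st_deen_elf_b.pt",
    ("st_deen", "l"): "st_deen_elf_l.pt",
}


def checkpoint_filename(task: str, size: str) -> str:
    """Map a (task, backbone size) pair to its checkpoint file name."""
    try:
        return CHECKPOINTS[(task, size)]
    except KeyError:
        raise ValueError(f"no checkpoint for task={task!r}, size={size!r}") from None


def download_checkpoint(task: str, size: str, local_dir: str = "outputs") -> str:
    """Download one checkpoint and return its local path (hypothetical helper)."""
    from huggingface_hub import hf_hub_download  # lazy import; needs network access

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(task, size),
        local_dir=local_dir,
    )
```

The returned path can then be passed as `CKPT=...` to the inference script shown above.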

Citation

@misc{li2026speechmeetselfaudio,
      title={Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation}, 
      author={Xuanchen Li and Tianrui Wang and Yuheng Lu and Zikang Huang and Yu Jiang and Chenghan Lin and Chenrui Cui and Ziyang Ma and Xingyu Ma and Chunyu Qiang and Guochen Yu and Xie Chen and Longbiao Wang and Jianwu Dang},
      year={2026},
      eprint={2606.10368},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2606.10368}, 
}
