ELF-S2T pretrained weights
Pretrained checkpoints for "Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation".
Code: https://github.com/Sslnon/ELF-S2T
ELF-S2T performs speech-to-text as audio-conditioned generation in a continuous text-embedding space: a frozen Whisper-large-v3 encoder and a single linear projector condition a pretrained ELF flow-matching backbone. See the GitHub repo for the full method, training, and inference code.
Checkpoints
| File | Task | Backbone | Test metric |
|---|---|---|---|
| asr_elf_b.pt | ASR (LibriSpeech) | ELF-B (105.9 M) | 10.50% WER |
| asr_elf_l.pt | ASR (LibriSpeech) | ELF-L (653.4 M) | 5.69% WER |
| st_deen_elf_b.pt | S2TT (CoVoST2 de->en) | ELF-B (105.9 M) | 25.35 BLEU |
| st_deen_elf_l.pt | S2TT (CoVoST2 de->en) | ELF-L (653.4 M) | 28.55 BLEU / 54.91 chrF |
WER is measured on LibriSpeech test-clean; BLEU and chrF are measured on the CoVoST2 de->en test set. Inference recipe: SDE sampler, K=128 steps, audio guidance w=2.0.
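For readers unfamiliar with the ASR metric, here is a minimal sketch of the standard word error rate: word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not the repo's scoring code, which may apply text normalization before comparing transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) plus one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Corpus-level WER is conventionally the total number of errors divided by the total number of reference words across all utterances, not the mean of per-utterance rates.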
Each .pt is a training checkpoint dict with keys elf (backbone state_dict),
audio_proj (the linear projector), step, and config (the full training
config, so inference auto-detects backbone size, EOS-route, etc.).
Usage
git clone https://github.com/Sslnon/ELF-S2T
cd ELF-S2T
pip install -r requirements.txt
# download these weights
huggingface-cli download ssinon/ELF-S2T asr_elf_l.pt --local-dir outputs/asr_elf_l_dl
# ASR inference on LibriSpeech test-clean
CKPT=outputs/asr_elf_l_dl/asr_elf_l.pt ACFG=2.0 STEPS=128 \
GPUS=0 NPROC=1 BS=8 ./scripts/infer_ls_test_clean.sh
The ELF-S2T inference script reads the backbone size and recipe from each
checkpoint's embedded config, so the same command works for every file.
Citation
@misc{li2026speechmeetselfaudio,
title={Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation},
author={Xuanchen Li and Tianrui Wang and Yuheng Lu and Zikang Huang and Yu Jiang and Chenghan Lin and Chenrui Cui and Ziyang Ma and Xingyu Ma and Chunyu Qiang and Guochen Yu and Xie Chen and Longbiao Wang and Jianwu Dang},
year={2026},
eprint={2606.10368},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2606.10368},
}