Paper: Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling (arXiv:2505.19931)
Production-ready optimization scripts for the Habibi-TTS Algerian Arabic (ALG) specialized model.
| Claim | Status | Evidence |
|---|---|---|
| EPSS / get_epss_timesteps() | TRUE | Built into F5-TTS v1.1.20+ since May 2025. Auto-applies when steps ∈ {5,6,7,10,12,16} |
| Sway Sampling | TRUE | Native to F5-TTS, default sway_sampling_coef=-1.0 |
| F5R-TTS (29.5% WER) | UNCONFIRMED | Paper not publicly indexed. GRPO for TTS validated by DMOSpeech 2 (~10% WER improvement) |
| Triton/TensorRT in F5-TTS | FALSE | No off-the-shelf support. Triton runtime exists but requires manual setup |
| SGLang/vLLM for TTS | FALSE | Architecturally incompatible. F5-TTS is DiT+flow-matching, not autoregressive LLM |
| TGI maintenance mode | TRUE | Official HF docs confirm maintenance mode, recommend vLLM/SGLang for LLMs |
| FP8 on A10G | FALSE | A10G (Ampere, SM86) does NOT support FP8, which requires Ada (SM89) or Hopper (SM90). Use BF16 + INT8 instead |
| Arabic diacritization | TRUE | Sadeed (Misraj/Sadeed) is SOTA for MSA. Algerian dialect needs dialect-aware preprocessing |
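
The sway sampling row can be made concrete with a small sketch. It uses the mapping published in the F5-TTS paper, t = u + s · (cos(πu/2) − 1 + u), where u is a uniform flow step and s is the sway coefficient. The function names `sway_sample` and `sway_schedule` are hypothetical helpers for illustration, not part of the F5-TTS API.

```python
import math


def sway_sample(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a sway-sampled step t.

    Uses t = u + s * (cos(pi/2 * u) - 1 + u). A negative s pulls steps
    toward t = 0, so more of the budget is spent early in the flow.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(nfe: int, s: float = -1.0) -> list[float]:
    """Return nfe + 1 sway-sampled timesteps from 0 to 1 (hypothetical helper)."""
    return [sway_sample(i / nfe, s) for i in range(nfe + 1)]


if __name__ == "__main__":
    # Compare the uniform grid with the sway-sampled one for 8 steps.
    for i, t in enumerate(sway_schedule(8)):
        print(f"u={i / 8:.3f} -> t={t:.3f}")
```

With s = -1.0 the mapping reduces to t = 1 − cos(πu/2), so every interior step lands earlier than its uniform counterpart. With s = 0 the schedule is the plain uniform grid.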
EPSS (Empirically Pruned Step Sampling): roughly 4x speedup with minimal quality loss (use_epss=True).
BF16 inference + torch.compile for A10G.
Algerian Arabic text preprocessing pipeline.
FastAPI streaming TTS server.
INT8 weight-only quantization for A10G.
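
As a rough illustration of the EPSS idea listed above, the sketch below builds a non-uniform timestep schedule by keeping only some points of a 32-step grid, then runs a plain Euler solver over it. The index list is an illustrative assumption (dense early, sparse late), not a copy of what get_epss_timesteps() returns, and `pruned_timesteps` and `euler_integrate` are hypothetical names.

```python
from typing import Callable


def pruned_timesteps(indices: list[int], grid: int = 32) -> list[float]:
    """Convert kept grid indices into flow times in [0, 1]."""
    if indices[0] != 0 or indices[-1] != grid:
        raise ValueError("schedule must start at 0 and end at the grid size")
    if any(b <= a for a, b in zip(indices, indices[1:])):
        raise ValueError("indices must be strictly increasing")
    return [i / grid for i in indices]


# Assumption: an illustrative 7-step schedule, dense early and sparse late.
ILLUSTRATIVE_NFE7 = [0, 2, 4, 6, 8, 16, 24, 32]


def euler_integrate(
    velocity: Callable[[float, float], float], x0: float, timesteps: list[float]
) -> float:
    """Euler ODE solve over an arbitrary (possibly non-uniform) schedule."""
    x = x0
    for t0, t1 in zip(timesteps, timesteps[1:]):
        x += (t1 - t0) * velocity(x, t0)  # one network call per interval
    return x


ts = pruned_timesteps(ILLUSTRATIVE_NFE7)
nfe = len(ts) - 1  # number of function evaluations (model forward passes)
```

Assuming a 32-step uniform baseline, 7 evaluations means about 4.6x fewer model calls, which is in line with the speedup quoted above. In the real model the `velocity` callable would be the DiT forward pass; here it is any function of `(x, t)`.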
| Priority | Action | Expected result on A10G (RTF unless noted) |
|---|---|---|
| 1 | EPSS NFE=7 | 0.030 |
| 2 | BF16 | 0.022 |
| 3 | torch.compile | 0.016-0.018 |
| 4 | Sentence streaming | TTFA (time to first audio) under 500 ms |
| 5 | INT8 quantization | 0.012-0.014 |
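
For reading the table, RTF (real-time factor) is synthesis time divided by the duration of the generated audio, so lower is faster. Sentence streaming reduces time to first audio by synthesizing the first sentence before the rest of the text. The sketch below shows both ideas with hypothetical helpers. The splitter is a simple punctuation rule that includes the Arabic question mark, not the dialect-aware preprocessing pipeline itself.

```python
import re


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Split after sentence-final punctuation (Latin and Arabic) followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?؟…])\s+")


def split_sentences(text: str) -> list[str]:
    """Return sentence-sized chunks that can be synthesized one at a time."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


if __name__ == "__main__":
    for chunk in split_sentences("مرحبا. كيف حالك؟ بخير!"):
        print(chunk)  # each chunk would be sent to the TTS model in order
```

At an RTF of 0.03, ten seconds of audio takes about 0.3 seconds to synthesize, so the first-sentence latency is dominated by the length of that sentence plus request overhead.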