MeanFlowSE — One-Step Generative Speech Enhancement

---
license: mit
datasets:
- JacobLinCool/VoiceBank-DEMAND-16k
base_model:
- liduojia/MeanFlowSE
---
<div align="center">
<p align="center">
  <h1>MeanFlowSE — One-Step Generative Speech Enhancement</h1>

  [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2509.14858)
  [![Hugging Face Model](https://img.shields.io/badge/Model-HuggingFace-yellow?logo=huggingface)](https://huggingface.co/liduojia/MeanFlowSE)
  [![Code](https://img.shields.io/badge/Code-Repo-black?style=flat&logo=github&logoColor=white)](https://github.com/liduojia1/MeanFlowSE)

</p>
</div>

**MeanFlowSE** is a conditional generative approach to speech enhancement that learns average velocities over short time spans and performs enhancement in a single step. Instead of rolling out a long ODE trajectory, it applies one backward-in-time displacement directly in the complex STFT domain, delivering competitive quality at a fraction of the compute and latency. The model is trained end-to-end with a local JVP-based objective and remains consistent with conditional flow matching on the diagonal—no teacher models, schedulers, or distillation required. In practice, 1-NFE inference makes real-time deployment on standard hardware straightforward.

* 🎧 **Demo**: demo page coming soon.
---

## Table of Contents

* [Highlights](#highlights)
* [What’s inside](#whats-inside)
* [Quick start](#quick-start)

  * [Installation](#installation)
  * [Data preparation](#data-preparation)
  * [Training](#training)
  * [Inference](#inference)
* [Configuration](#configuration)
* [Repository structure](#repository-structure)
* [Built upon & related work](#built-upon--related-work)
* [Pretrained models](#pretrained-models)
* [Acknowledgments](#acknowledgments)
* [Citation](#citation)


## Highlights

* **One-step enhancement (1-NFE):** A single displacement update replaces long ODE rollouts—fast enough for real-time use on standard GPUs/CPUs.
* **No teachers, no distillation:** Trains with a local, JVP-based objective; on the diagonal it exactly matches conditional flow matching.
* **Same model, two samplers:** Use the displacement sampler for 1-step (or few-step) inference; fall back to Euler along the instantaneous field if you prefer multi-step.
* **Competitive & fast:** strong ESTOI / SI-SDR / DNSMOS with **very low RTF** on VoiceBank-DEMAND.


## What’s inside

* **Training** with Average field supervision (for the 1-step displacement sampler).
* **Inference** with  euler_mf — single-step displacement along average field.
* **Audio front-end**: complex STFT pipeline; configurable transforms & normalization.
* **Metrics**: PESQ, ESTOI, SI-SDR; end-to-end **RTF** measurement.


## Quick start

### Installation

```bash
# Python 3.10 recommended

pip install -r requirements.txt
# Use a recent PyTorch + CUDA build for multi-GPU training
```

### Data preparation

Expected layout:

```
<BASE_DIR>/
  train/clean/*.wav   train/noisy/*.wav
  valid/clean/*.wav   valid/noisy/*.wav
  test/clean/*.wav    test/noisy/*.wav
```

Defaults assume 16 kHz audio, centered frames, Hann windows, and a complex STFT representation (see `SpecsDataModule` for knobs).

### Training

**Single machine, multi-GPU (DDP)**:

```bash
# Edit DATA_DIR and GPUs inside the script if needed
bash train_vbd.sh
```

Or run directly:

```bash
torchrun --standalone --nproc_per_node=4 train.py \
  --backbone ncsnpp \
  --ode flowmatching \
  --base_dir <BASE_DIR> \
  --batch_size 2 \
  --num_workers 8 \
  --max_epochs 150 \
  --precision 32 \
  --gradient_clip_val 1.0 \
  --t_eps 0.03 --T_rev 1.0 \
  --sigma_min 0.0 --sigma_max 0.487 \
  --use_mfse \
  --mf_weight_final 0.25 \
  --mf_warmup_frac 0.5 \
  --mf_delta_gamma_start 8.0 --mf_delta_gamma_end 1.0 \
  --mf_delta_warmup_frac 0.7 \
  --mf_r_equals_t_prob 0.1 \
  --mf_jvp_clip 5.0 --mf_jvp_eps 1e-3 \
  --mf_jvp_impl fd --mf_jvp_chunk 1 \
  --mf_skip_weight_thresh 0.05 \
  --val_metrics_every_n_epochs 1 \
  --default_root_dir lightning_logs
```

* **Logging & checkpoints** live under `lightning_logs/<exp_name>/version_x/`.
* Heavy validation (PESQ/ESTOI/SI-SDR) runs **every N epochs** on **rank-0**; placeholders are logged otherwise so checkpoint monitors remain valid.

### Inference

Use the helper script:

```bash
# MODE = multistep | multistep_mf | onestep
MODE=onestep STEPS=1 \
TEST_DATA_DIR=<BASE_DIR> \
CKPT_INPUT=path/to/best.ckpt \
bash run_inference.sh
```

Or call the evaluator:

```bash
python evaluate.py \
  --test_dir <BASE_DIR> \
  --folder_destination /path/to/output \
  --ckpt path/to/best.ckpt \
  --odesolver euler_mf \
  --reverse_starting_point 1.0 \
  --last_eval_point 0.0 \
  --one_step
```

> `evaluate.py` writes **enhanced WAVs**.
> If `--odesolver` is not given, it **auto-picks** (`euler_mf` when MF-SE was used; otherwise `euler`).


## Configuration

Common flags you may want to tweak:

* **Time & schedule**

  * `--T_rev` (reverse start, default 1.0), `--t_eps` (terminal time), `--sigma_min`, `--sigma_max`
* **MF-SE stability**

  * `--mf_jvp_impl {auto,fd,autograd}`, `--mf_jvp_chunk`, `--mf_jvp_clip`, `--mf_jvp_eps`
  * Curriculum: `--mf_weight_final`, `--mf_warmup_frac`, `--mf_delta_*`, `--mf_r_equals_t_prob`
* **Validation cost**

  * `--val_metrics_every_n_epochs`, `--num_eval_files`
* **Backbone & front-end**

  * Defined in `backbones/` and `SpecsDataModule` (STFT, transforms, normalization)


## Repository structure

```
MeanFlowSE/
├── train.py                 # Lightning entry
├── evaluate.py              # Enhancement script (WAV out)
├── run_inference.sh         # One-step / few-step convenience runner
├── flowmse/
│   ├── model.py             # Losses, JVP, curriculum, logging
│   ├── odes.py              # Path definition & registry
│   ├── sampling/
│   │   ├── __init__.py
│   │   └── odesolvers.py    # Euler (instantaneous) & Euler-MF (displacement)
│   ├── backbones/
│   │   ├── ncsnpp.py        # U-Net w/ time & delta embeddings
│   │   └── ...
│   ├── data_module.py       # STFT I/O pipeline
│   └── util/                # metrics, registry, tensors, inference helpers
├── requirements.txt
└── scripts/
    └── train_vbd.sh
```

## Built upon & related work

This repository builds upon previous great works:

* **SGMSE** — [https://github.com/sp-uhh/sgmse](https://github.com/sp-uhh/sgmse)
* **SGMSE-CRP** — [https://github.com/sp-uhh/sgmse\_crp](https://github.com/sp-uhh/sgmse_crp)
* **SGMSE-BBED** — [https://github.com/sp-uhh/sgmse-bbed](https://github.com/sp-uhh/sgmse-bbed)
* **FLOWMSE (FlowSE)** — [https://github.com/seongq/flowmse](https://github.com/seongq/flowmse)

Many design choices (complex STFT pipeline, training infrastructure) are inspired by these excellent projects.


## Pretrained models

* **VoiceBank–DEMAND (16 kHz)**: We have hosted the weight files on Google Drive and added the link here.— [Google Drive Link](https://drive.google.com/file/d/1QAxgd5BWrxiNi0q2qD3n1Xcv6bW0X86-/view?usp=sharing)


## Acknowledgments

We gratefully acknowledge **Prof. Xie Chen’s group (X-LANCE Lab, SJTU)** for their **valuable guidance and support** on training practices and engineering tips that helped this work a lot.


## Citation

* **Citation:** The paper is currently under review. We will add a BibTeX entry and article link once available.


**Questions or issues?** Please open a GitHub issue or pull request.
We welcome contributions — from bug fixes to new backbones and front-ends.