---
language:
  - en
tags:
  - audio
  - super-resolution
  - speech-enhancement
  - diffusion
  - one-step
pipeline_tag: audio-to-audio
---

# FlashSR: One-step Versatile Audio Super-Resolution

> **This is a convenience redistribution, not the original repository.** All credit for the model architecture, research, training, and weights belongs to the original authors. This repository is not affiliated with or endorsed by them.

| | |
|---|---|
| **Authors** | Jaekwon Im and Juhan Nam (KAIST) |
| **Paper** | [FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation](https://arxiv.org/abs/2501.10807) (arXiv:2501.10807) |
| **Demo** | [jakeoneijk.github.io/flashsr-demo](https://jakeoneijk.github.io/flashsr-demo/) |
| **Original code** | [jakeoneijk/FlashSR_Inference](https://github.com/jakeoneijk/FlashSR_Inference) |
| **Original weights** | [jakeoneijk/FlashSR_weights](https://huggingface.co/datasets/jakeoneijk/FlashSR_weights) |

> **Note:** Several unrelated projects are also named "FlashSR" (for other kinds of super-resolution). This repository covers only the audio model described in the paper above.

## About this repository

The original code and weights are split across GitHub and Hugging Face and have dependencies (torchcodec, FFmpeg) that can be difficult to set up. This repository bundles everything into one place with a standalone inference script that only needs PyTorch, soundfile, and scipy.

**What is from the original authors:** The model code (`FlashSR/`, `TorchJaekwon/`) and the pretrained weights (`weights/`) are from the original repositories linked above.

**What is new in this redistribution:** The inference script (`enhance.py`), `setup.py`, and this README were written independently. The code in this repository (excluding model weights) is released under the **Apache License 2.0**.

## What FlashSR does

FlashSR restores high-frequency audio components in a single forward pass. It takes audio at any sample rate, resamples to 48 kHz, and reconstructs missing high-frequency detail. This is useful for:

- Upscaling low-sample-rate recordings to full bandwidth
- Enhancing audio that has been through lossy processing (codecs, vocoders, etc.)
- Post-processing TTS or voice conversion outputs

The model handles speech, music, and sound effects.
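
As a rough illustration of what "missing high-frequency detail" means: a signal sampled at rate `f_s` can only carry content below `f_s / 2` (the Nyquist limit), so after resampling to 48 kHz the band between the input's Nyquist frequency and 24 kHz is empty. The sketch below computes that band. `missing_band` is a hypothetical helper, not part of FlashSR, and it gives only an upper bound on the input's content, since codecs can remove material below the Nyquist limit as well.

```python
TARGET_RATE = 48_000  # FlashSR output sample rate in Hz


def missing_band(input_rate, target_rate=TARGET_RATE):
    """Return the (low, high) range in Hz that the input cannot contain.

    Returns None when the input already covers the full target bandwidth.
    Hypothetical helper for illustration only.
    """
    low = input_rate / 2    # Nyquist limit of the input
    high = target_rate / 2  # Nyquist limit of the 48 kHz output
    if low >= high:
        return None
    return (low, high)


if __name__ == "__main__":
    for rate in (8_000, 16_000, 24_000, 44_100, 48_000):
        print(rate, missing_band(rate))
```

For example, a 16 kHz recording holds nothing above 8 kHz, so the 8 to 24 kHz range is what the model has to reconstruct.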

## Repository structure

```
weights/
  student_ldm.pth     (986 MB)  - Distilled latent diffusion model
  sr_vocoder.pth      (599 MB)  - Super-resolution vocoder
  vae.pth             (1.6 GB)  - Variational autoencoder
FlashSR/                        - Model code (from original repo)
TorchJaekwon/                   - Utility library (from original repo)
Assets/ExampleInput/            - Example audio files (speech, music, sound effects)
enhance.py                      - Standalone inference script
setup.py                        - Package installer
```
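
Because the checkpoints are large, it can be worth confirming that all three were actually downloaded before loading the model. Cloning a Hugging Face repository without Git LFS installed typically leaves small pointer files in place of the real weights. The following is a minimal sketch: `missing_weights` is a hypothetical helper, and the 1 MB size threshold is an assumption chosen only to catch pointer files.

```python
from pathlib import Path

# File names taken from the repository layout above.
WEIGHT_FILES = ("student_ldm.pth", "sr_vocoder.pth", "vae.pth")


def missing_weights(weights_dir="weights", min_bytes=1_000_000):
    """Return the expected checkpoint names that are absent or suspiciously small.

    `min_bytes` is an assumed threshold; real checkpoints are hundreds of MB,
    while Git LFS pointer files are only a few hundred bytes.
    """
    root = Path(weights_dir)
    return [
        name
        for name in WEIGHT_FILES
        if not (root / name).is_file() or (root / name).stat().st_size < min_bytes
    ]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing or incomplete checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```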

## Installation

**Requirements:** Python 3.10+, PyTorch 2.0+ with CUDA, ~6 GB GPU memory.

```bash
# Clone this repository
git clone https://huggingface.co/laion/FlashSR_One-step_Versatile_Audio_Super-resolution
cd FlashSR_One-step_Versatile_Audio_Super-resolution

# Install
pip install -e .
pip install einops librosa soundfile tqdm scipy
```

### Verify

```bash
python enhance.py --input Assets/ExampleInput/speech.wav --output output.wav
```

> **Tip:** If you have a conda environment with conflicting cudnn libraries, clear `LD_LIBRARY_PATH` before running: `LD_LIBRARY_PATH="" python enhance.py ...`
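
If the verification command fails, checking the stated requirements one by one can narrow down the cause. The sketch below is one possible way to do that and is not part of this repository; `check_environment` is a hypothetical name, and the memory check compares total (not free) GPU memory against the approximate 6 GB figure above.

```python
import sys


def check_environment(min_gpu_gb=6):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required.")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed.")
        return problems
    # Version strings look like "2.1.0+cu118"; only the major number is checked.
    if int(torch.__version__.split(".")[0]) < 2:
        problems.append(f"PyTorch 2.0+ is required, found {torch.__version__}.")
    if not torch.cuda.is_available():
        problems.append("No CUDA device is visible to PyTorch.")
    else:
        # Total memory of device 0; other processes may already be using part of it.
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb < min_gpu_gb:
            problems.append(f"GPU 0 has about {total_gb:.1f} GB; roughly {min_gpu_gb} GB is needed.")
    return problems


if __name__ == "__main__":
    for problem in check_environment() or ["Environment looks OK."]:
        print(problem)
```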

## Usage

### Command line

```bash
# Single file
python enhance.py --input my_audio.wav --output enhanced.wav

# Entire directory
python enhance.py --input ./audio_folder/ --output ./enhanced_folder/

# With lowpass filter (can help when input was not originally bandwidth-limited)
python enhance.py --input my_audio.wav --output enhanced.wav --lowpass

# Specify GPU
CUDA_VISIBLE_DEVICES=0 python enhance.py --input my_audio.wav --output enhanced.wav
```
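
To call the script from another Python program, one option is to run it as a subprocess. The sketch below uses only the flags shown above (`--input`, `--output`, `--lowpass`) and the `CUDA_VISIBLE_DEVICES` variable; the helper names `build_enhance_command` and `run_enhance` are hypothetical and not part of this repository.

```python
import os
import subprocess
import sys


def build_enhance_command(input_path, output_path, lowpass=False, script="enhance.py"):
    """Build the argument list for one enhance.py invocation."""
    cmd = [sys.executable, script, "--input", str(input_path), "--output", str(output_path)]
    if lowpass:
        cmd.append("--lowpass")
    return cmd


def run_enhance(input_path, output_path, lowpass=False, gpu=None):
    """Run enhance.py and raise CalledProcessError if it exits with an error."""
    env = dict(os.environ)
    if gpu is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # restrict the script to one GPU
    subprocess.run(build_enhance_command(input_path, output_path, lowpass), check=True, env=env)
```

For instance, `run_enhance("my_audio.wav", "enhanced.wav", lowpass=True, gpu=0)` would mirror the lowpass and GPU examples above, assuming it is run from the repository root.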

### Python API

```python
from math import gcd
from pathlib import Path

import soundfile as sf
import torch
from scipy.signal import resample_poly

from FlashSR.FlashSR import FlashSR

WEIGHTS_DIR = Path("./weights")
TARGET_RATE = 48000
WINDOW_SIZE = 245760  # 5.12 seconds at 48 kHz

# Initialize
model = FlashSR(
    student_ldm_ckpt_path=str(WEIGHTS_DIR / "student_ldm.pth"),
    sr_vocoder_ckpt_path=str(WEIGHTS_DIR / "sr_vocoder.pth"),
    autoencoder_ckpt_path=str(WEIGHTS_DIR / "vae.pth"),
)
model = model.to("cuda").eval()

# Load audio, downmix to mono, and resample to 48 kHz if needed
samples, rate = sf.read("input.wav", dtype="float32")
if samples.ndim > 1:
    samples = samples.mean(axis=1)
if rate != TARGET_RATE:
    g = gcd(TARGET_RATE, rate)
    samples = resample_poly(samples, TARGET_RATE // g, rate // g).astype("float32")

# The model accepts exactly 245760 samples per call.
# Pad short audio; for longer audio, see enhance.py for chunk-based processing.
waveform = torch.from_numpy(samples).unsqueeze(0)  # shape: (1, num_samples)
n = waveform.shape[-1]
if n > WINDOW_SIZE:
    raise ValueError("Input is longer than 5.12 s; use enhance.py for chunked processing.")
if n < WINDOW_SIZE:
    waveform = torch.nn.functional.pad(waveform, (0, WINDOW_SIZE - n))

waveform = waveform.to("cuda")

with torch.no_grad():
    result = model(waveform, lowpass_input=False)

# Trim padding and save
result = result[:, :n].squeeze(0).cpu().numpy()
sf.write("output.wav", result, TARGET_RATE)
```

## Notes

- **Fixed input length:** The model processes exactly 245,760 samples (5.12 seconds at 48 kHz). The `enhance.py` script handles longer audio automatically using overlapping chunks with crossfading.
- **Sample rate:** Input audio at any sample rate is resampled to 48 kHz. Output is always 48 kHz.
- **Channels:** Mono and stereo are both supported. Stereo files are processed channel-by-channel.
- **`lowpass_input` flag:** Set to `True` if your input was not originally bandwidth-limited. This applies a lowpass filter before enhancement to better match the model's training distribution.
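
The overlapping-chunk idea mentioned in the first note can be sketched in a few lines. This is an illustration of the general technique (weighted overlap-add with linear ramps), not the actual code in `enhance.py`; the function name `enhance_long`, the `process` callback, and the default overlap of 24,000 samples are assumptions made for the example. Plain Python lists are used to keep it dependency-free; in practice NumPy arrays or tensors would be the natural choice.

```python
def enhance_long(samples, process, window=245760, overlap=24000):
    """Apply `process` to fixed-size windows and crossfade the overlapping regions.

    `process` stands in for one model call: it receives exactly `window`
    samples and must return `window` samples. Hypothetical sketch only.
    """
    if not 0 < overlap < window:
        raise ValueError("overlap must be between 1 and window - 1")
    total = len(samples)
    hop = window - overlap
    # Per-sample weight: ramps up over the first `overlap` samples of a
    # window and down over the last `overlap` samples, flat in between.
    weights = [min(1.0, (i + 1) / overlap, (window - i) / overlap) for i in range(window)]
    out = [0.0] * total
    norm = [0.0] * total
    for start in range(0, max(total - overlap, 1), hop):
        chunk = list(samples[start:start + window])
        chunk += [0.0] * (window - len(chunk))  # zero-pad the final window
        enhanced = process(chunk)
        for i in range(min(window, total - start)):
            out[start + i] += weights[i] * enhanced[i]
            norm[start + i] += weights[i]
    # Dividing by the accumulated weight makes overlapping windows blend smoothly.
    return [o / n for o, n in zip(out, norm)]
```

Normalizing by the summed weights means the blend stays correct even where the final, zero-padded window overlaps its neighbor by a different amount, and an identity `process` returns the input unchanged.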

## License

The inference script (`enhance.py`), `setup.py`, and this README are released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

The model weights and original model code (`FlashSR/`, `TorchJaekwon/`) are from the original authors' repositories linked above. Please refer to those repositories for their licensing terms.

## Citation

If you use FlashSR in your work, please cite the original paper:

```bibtex
@article{im2025flashsr,
  title={FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation},
  author={Im, Jaekwon and Nam, Juhan},
  journal={arXiv preprint arXiv:2501.10807},
  year={2025}
}
```

## References

- [AudioSR](https://github.com/haoheliu/versatile_audio_super_resolution)
- [NVSR](https://github.com/haoheliu/ssr_eval)
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- [Diffusers](https://github.com/huggingface/diffusers)