🤫 RE-USE: Multilingual Universal Speech Enhancement
Model Overview
Description
In universal speech enhancement, the goal is to restore the quality of diverse degraded speech while preserving fidelity, i.e., keeping all other factors unchanged, such as linguistic content, speaker identity, emotion, accent, and other paralinguistic attributes. Inspired by the distortion–perception trade-off theory, our proposed single model strikes a good balance between these two objectives and has the following desirable properties:
- Robustness to diverse degradations, including additive noise, reverberation, clipping, bandwidth limitation, codec artifacts, packet loss, and low-quality microphones.
- Support for multiple input sampling rates, including 8, 16, 22.05, 24, 32, 44.1, and 48 kHz.
- Strong language-agnostic capability, enabling effective performance across different languages.
This model is for research and development only.
Usage
Try our Gradio interactive demo directly by uploading your noisy audio or video!
Environment Setup
Pre-built Docker environments can be downloaded here to simplify the Mamba setup.
If you need bandwidth extension:
pip install resampy
- Download the Hugging Face repository and navigate into it:
huggingface-cli download nvidia/RE-USE --local-dir ./REUSE --local-dir-use-symlinks False
cd ./REUSE
Inference
Follow the simple steps below to generate enhanced speech using our model:
- Place your noisy speech files in the folder noisy_audio/.
- Run the following command:
sh inference.sh
- The enhanced speech files will be saved in enhanced_audio/.
That's all!
Note:
a. You can enable bandwidth extension by setting the target bandwidth using the BWE argument in the script.
If your noisy speech files are long and may cause GPU out-of-memory (OOM) errors, please use the following procedure instead:
- Place your long noisy speech files in the folder long_noisy_audio/.
- Run the following command:
sh inference_chunk_wise.sh
- The enhanced speech files will be saved in Long_enhanced_audio/.
Note:
a. You can enable bandwidth extension by setting the target bandwidth using the BWE argument in the script.
b. You can also configure the chunk_size_in_seconds and hop_length_portion directly in the script.
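Conceptually, the chunk-wise script splits a long recording into overlapping chunks, enhances each chunk, and recombines the results so that overlapped regions are averaged. A minimal pure-Python sketch of that strategy is below; the `enhance` callable stands in for the model, and the uniform-averaging recombination is our assumption for illustration, not the script's exact implementation.

```python
def chunk_wise_enhance(signal, sr, chunk_size_in_seconds=10.0,
                       hop_length_portion=0.5, enhance=lambda x: x):
    """Enhance a long signal chunk by chunk with overlap-add averaging."""
    chunk = int(chunk_size_in_seconds * sr)          # chunk length in samples
    hop = max(1, int(chunk * hop_length_portion))    # stride between chunk starts
    out = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    for start in range(0, len(signal), hop):
        piece = signal[start:start + chunk]
        enhanced = enhance(piece)                    # model call would go here
        for i, v in enumerate(enhanced):
            out[start + i] += v
            weight[start + i] += 1.0
    # Average samples that were covered by more than one chunk.
    return [o / w for o, w in zip(out, weight)]
```

A smaller chunk_size_in_seconds lowers peak GPU memory, while a smaller hop_length_portion increases overlap (and compute) for smoother transitions between chunks.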
License/Terms of Use
This model is released under the NVIDIA One-Way Noncommercial License (NSCLv1).
Deployment Geography
Global.
Use Case
Researchers and general users can use this model to enhance the quality of their speech data.
Release Date
Hugging Face 2026/03/18
References
[1] Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement, 2025. (Note: The released model checkpoint differs from the one reported in the paper. It incorporates additional degradation types (e.g., microphone response and more codecs) and is fine-tuned on a smaller, high-quality clean subset.)
Model Architecture
Architecture Type: Convolutional encoder, Convolutional decoder, and Mamba for time–frequency modeling
Network Architecture: Bi-directional Mamba with 30 layers
Number of model parameters: 9.6M
Input
Input Type(s): Audio
Input Format(s): .wav files
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: 8000 Hz - 48000 Hz Mono-channel Audio
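Since the model expects mono .wav input, multi-channel recordings should be downmixed first. A minimal standard-library sketch that averages the two channels of a 16-bit PCM stereo file is shown below; the function name and averaging downmix are our own illustration, not part of the release.

```python
import struct
import wave

def stereo_to_mono(in_path, out_path):
    """Downmix a 16-bit PCM stereo WAV file to mono by averaging channels."""
    with wave.open(in_path, "rb") as w:
        if w.getsampwidth() != 2 or w.getnchannels() != 2:
            raise ValueError("expected 16-bit PCM stereo input")
        sr = w.getframerate()
        frames = w.readframes(w.getnframes())
    # Interleaved int16 samples: L, R, L, R, ...
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(struct.pack("<%dh" % len(mono), *mono))
```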
Output
Output Type(s): Audio
Output Format: .wav files
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: 8000 Hz - 48000 Hz Mono-channel Audio
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine(s):
- Not Applicable (N/A)
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere (A100)
Preferred Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s)
Current version: 30USEMamba_peak+GAN_tel_mic_1134k
Training Datasets
Data Modality: Audio
Audio Training Data Size: Less than 10,000 Hours
- LibriVox data from DNS5 challenge (EN) (~350 hours of speech data)
- LibriTTS (EN) (~200 hours of speech data)
- VCTK (EN) (~80 hours of speech data)
- WSJ (EN) (~85 hours of speech data)
- EARS (EN) (~100 hours of speech data)
- Multilingual LibriSpeech (De, En, Es, Fr) (~450 hours of speech data)
- CommonVoice 19.0 (De, En, Es, Fr, zh-CN) (~1300 hours of speech data)
- Audioset+FreeSound noise in DNS5 challenge (~180 hours of noise data)
- WHAM! Noise (~80 hours of noise data)
- FSD50K (human voice filtered) (~100 hours of non-speech data)
- (Part of) Free Music Archive (medium) (~200 hours of non-speech data)
- Simulated RIRs from DNS5 challenge (~60k samples of room impulse response)
- MicIRP (~70 samples of microphone impulse response)
Inference
Acceleration Engine: None
Test Hardware: NVIDIA A100
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Citation
Please consider citing our paper and this framework if they are helpful in your research.
@article{fu2026rethinking,
title={Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement},
author={Fu, Szu-Wei and Chao, Rong and Yang, Xuesong and Huang, Sung-Feng and Zezario, Ryandhimas E and Nasretdinov, Rauf and Juki{\'c}, Ante and Tsao, Yu and Wang, Yu-Chiang Frank},
journal={arXiv preprint arXiv:2603.02641},
year={2026}
}