🤫 RE-USE: Multilingual Universal Speech Enhancement
Model Overview
Description
In universal speech enhancement, the goal is to restore the quality of diverse degraded speech while preserving fidelity, i.e., keeping all other factors unchanged, such as linguistic content, speaker identity, emotion, accent, and other paralinguistic attributes. Inspired by the distortion–perception trade-off theory, our proposed single model strikes a good balance between these two objectives and has the following desirable properties:
- Robustness to diverse degradations, including additive noise, reverberation, clipping, bandwidth limitation, codec artifacts, packet loss, and low-quality microphones.
- Support for multiple input sampling rates, including 8, 16, 22.05, 24, 32, 44.1, and 48 kHz.
- Strong language-agnostic capability, enabling effective performance across different languages.
This model is for research and development only.
Usage
Try our Gradio interactive demo directly by uploading your noisy audio or video!
Environment Setup
Pre-built Docker environments can be downloaded here to simplify the Mamba setup.
If you need bandwidth extension:
pip install resampy
- Download the Hugging Face repository and navigate into it:
huggingface-cli download nvidia/RE-USE --local-dir ./REUSE --local-dir-use-symlinks False
cd ./REUSE
Inference
Follow the simple steps below to generate enhanced speech using our model:
- Place your noisy speech files in the folder noisy_audio/.
- Run the following command:
sh inference.sh
- The enhanced speech files will be saved in enhanced_audio/.
That's all!
Note:
a. You can enable bandwidth extension by setting the target bandwidth using the BWE argument in the script.
If your noisy speech files are long and may cause GPU out-of-memory (OOM) errors, please use the following procedure instead:
- Place your long noisy speech files in the folder long_noisy_audio/.
- Run the following command:
sh inference_chunk_wise.sh
- The enhanced speech files will be saved in Long_enhanced_audio/.
Note:
a. You can enable bandwidth extension by setting the target bandwidth using the BWE argument in the script.
b. You can also configure the chunk_size_in_seconds and hop_length_portion directly in the script.
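Conceptually, the chunk-wise script splits a long recording into overlapping chunks, enhances each chunk, and recombines the results so that overlapped regions are averaged. A minimal pure-Python sketch of that strategy is below; the `enhance` callable stands in for the model, and the uniform-averaging recombination is our assumption for illustration, not the script's exact implementation.

```python
def chunk_wise_enhance(signal, sr, chunk_size_in_seconds=10.0,
                       hop_length_portion=0.5, enhance=lambda x: x):
    """Enhance a long signal chunk by chunk with overlap-add averaging."""
    chunk = int(chunk_size_in_seconds * sr)          # chunk length in samples
    hop = max(1, int(chunk * hop_length_portion))    # stride between chunk starts
    out = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    for start in range(0, len(signal), hop):
        piece = signal[start:start + chunk]
        enhanced = enhance(piece)                    # model call would go here
        for i, v in enumerate(enhanced):
            out[start + i] += v
            weight[start + i] += 1.0
    # Average samples that were covered by more than one chunk.
    return [o / w for o, w in zip(out, weight)]
```

A smaller chunk_size_in_seconds lowers peak GPU memory, while a smaller hop_length_portion increases overlap (and compute) for smoother transitions between chunks.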
License/Terms of Use
This model is released under the NVIDIA One-Way Noncommercial License (NSCLv1).
Deployment Geography
Global.
Use Case
Researchers and general users can use this model to enhance the quality of their speech data.
Release Date
Hugging Face 2026/03/18
References
[1] Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement, 2025. (Note: The released model checkpoint differs from the one reported in the paper. It incorporates additional degradation types (e.g., microphone response and more codecs) and is fine-tuned on a smaller, high-quality clean subset.)
Model Architecture
Architecture Type: Convolutional encoder, Convolutional decoder, and Mamba for time–frequency modeling
Network Architecture: Bi-directional Mamba with 30 layers
Number of model parameters: 9.6M
Input
Input Type(s): Audio
Input Format(s): .wav files
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: 8000 Hz - 48000 Hz Mono-channel Audio
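Since the model expects mono .wav input, multi-channel recordings should be downmixed first. A minimal standard-library sketch that averages the two channels of a 16-bit PCM stereo file is shown below; the function name and averaging downmix are our own illustration, not part of the release.

```python
import struct
import wave

def stereo_to_mono(in_path, out_path):
    """Downmix a 16-bit PCM stereo WAV file to mono by averaging channels."""
    with wave.open(in_path, "rb") as w:
        if w.getsampwidth() != 2 or w.getnchannels() != 2:
            raise ValueError("expected 16-bit PCM stereo input")
        sr = w.getframerate()
        frames = w.readframes(w.getnframes())
    # Interleaved int16 samples: L, R, L, R, ...
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(struct.pack("<%dh" % len(mono), *mono))
```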
Output
Output Type(s): Audio
Output Format: .wav files
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: 8000 Hz - 48000 Hz Mono-channel Audio
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine(s):
- Not Applicable (N/A)
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere (A100)
Preferred Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s)
Current version: 30USEMamba_peak+GAN_tel_mic_1134k
Training Datasets
Data Modality: Audio
Audio Training Data Size: Less than 10,000 Hours
- LibriVox data from DNS5 challenge (EN) (~350 hours of speech data)
- LibriTTS (EN) (~200 hours of speech data)
- VCTK (EN) (~80 hours of speech data)
- WSJ (EN) (~85 hours of speech data)
- EARS (EN) (~100 hours of speech data)
- Multilingual LibriSpeech (De, En, Es, Fr) (~450 hours of speech data)
- CommonVoice 19.0 (De, En, Es, Fr, zh-CN) (~1300 hours of speech data)
- Audioset+FreeSound noise in DNS5 challenge (~180 hours of noise data)
- WHAM! Noise (~80 hours of noise data)
- FSD50K (human voice filtered) (~100 hours of non-speech data)
- (Part of) Free Music Archive (medium) (~200 hours of non-speech data)
- Simulated RIRs from DNS5 challenge (~60k samples of room impulse response)
- MicIRP (~70 samples of microphone impulse response)
Inference
Acceleration Engine: None
Test Hardware: NVIDIA A100
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Citation
Please consider citing our paper and this framework if they are helpful in your research.
@article{fu2026rethinking,
title={Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement},
author={Fu, Szu-Wei and Chao, Rong and Yang, Xuesong and Huang, Sung-Feng and Zezario, Ryandhimas E and Nasretdinov, Rauf and Juki{\'c}, Ante and Tsao, Yu and Wang, Yu-Chiang Frank},
journal={arXiv preprint arXiv:2603.02641},
year={2026}
}