---
license: apache-2.0
language:
- en
pipeline_tag: audio-to-audio
---
# UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement

<p align="center">
  <a href="https://arxiv.org/abs/2510.20441">
    <img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Paper">
  </a>
  <a href="https://hyyan2k.github.io/UniSE/">
    <img src="https://img.shields.io/badge/Demo-Page-blue.svg" alt="Demo">
  </a>
  <a href="https://huggingface.co/spaces/QuarkAudio/">
    <img src="https://img.shields.io/badge/Model-Hugging%20Face-yellow.svg" alt="Hugging Face">
  </a>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2510.20441"><img src="QuarkAudio-UniSE.png" width="70%" /></a>
</p>

🚀 **UniSE**: a unified, prompt-free, autoregressive speech enhancement framework based on decoder-only language models.
✨ **Key Highlights**:

- ✅ **Unified & Prompt-Free**: Handles multiple tasks without explicit instructions.
- ⚙️ **Decoder-only AR-LM Backbone**: Leverages LLM-style autoregressive generation for speech token prediction.
- 🔗 **End-to-End Compatible**: Integrates WavLM (feature extractor), BiCodec (discrete codec), and the LM into one pipeline.
- 🎯 **Multitask Support**: SE, SR, TSE, SS, and more, all in a single model.

📄 **Paper**: [arXiv:2510.20441](https://arxiv.org/abs/2510.20441) | 🎧 **Listen**: [Demo Page](https://hyyan2k.github.io/UniSE/) | 🤗 **Model**: [Hugging Face Spaces](https://huggingface.co/spaces/QuarkAudio/)
---

## 📋 Supported Tasks

| Task | Full Name | Status | Description |
|------|-----------|--------|-------------|
| **SR** | Speech Restoration | ✅ Stable | General-purpose denoising and clarity improvement (e.g., noise, reverb, packet loss) |
| **TSE** | Target Speaker Extraction | ✅ Stable | Extract the target speaker using reference enrollment audio |
| **SS** | Speech Separation | ✅ Stable | Separate mixed speakers or sound sources |
| **AEC** | Acoustic Echo Cancellation | ⏳ Developing | Coming soon in the next release |

> 💡 Unlike traditional models that require task-specific prompts or modules, **UniSE autonomously infers the task type** from the input context, enabled by the LLM's comprehension capabilities.

---
## 🎯 Quick Start: Run Inference in 3 Minutes

### 1. Clone the Repository

```bash
git clone https://github.com/alibaba/unified-audio.git
cd QuarkAudio-UniSE
```

### 2. Create a Conda Environment and Install Dependencies

```bash
conda create -n unise python=3.10
conda activate unise
pip install -r requirements.txt
```
### 3. Download Checkpoints

QuarkAudio-UniSE requires three additional pre-trained components to function properly: the **WavLM** feature extractor, the **BiCodec** codec, and the checkpoint of the middle LM, all hosted on Hugging Face. You can download them using the provided shell script:

```bash
cd checkpoints
bash download.sh
```

Additionally, download WavLM-Large.pt from this [URL](https://huggingface.co/microsoft/wavlm-base-plus) and place it at `./ckpt/WavLM-Large.pt`.

Alternatively, you can download the checkpoints manually and place them in the `./model/bicodec/` directory.
## Train

+ Quick start

```bash
#!/bin/bash
python ./train.py --config conf/config.yaml
```
| Parameter | Description |
| --- | --- |
| `resume` | Checkpoint path to resume training from (optional) |
| `simulation_config` | Data simulation config |
| `speech_scp_path` | SCP file of clean speech audio |
| `noise_scp_path` | SCP file of noise audio |
| `rir_scp_path` | SCP file of room impulse response (RIR) audio |
| `mode` | Task type: `se` (noise suppression, speech restoration, packet loss concealment), `tse` (target speaker extraction), `ss` (speech separation) |
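The `*_scp_path` entries are assumed to follow the common Kaldi-style SCP format: one `utterance_id /path/to/audio.wav` pair per line (the repository does not document the exact format, so treat this as a convention sketch). A minimal helper for generating one from a directory of WAV files (`write_scp` is illustrative, not part of the repository):

```python
import os

def write_scp(wav_dir: str, scp_path: str) -> int:
    """Write a Kaldi-style SCP file: one '<utt_id> <abs_path>' pair per line.

    Returns the number of entries written.
    """
    entries = []
    for name in sorted(os.listdir(wav_dir)):
        if name.endswith(".wav"):
            utt_id = os.path.splitext(name)[0]  # file stem as utterance ID
            abs_path = os.path.abspath(os.path.join(wav_dir, name))
            entries.append(f"{utt_id} {abs_path}")
    with open(scp_path, "w") as f:
        f.write("\n".join(entries) + "\n")
    return len(entries)
```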
## Inference

+ Quick start

The main inference script is **`test.py`**. The inference process consists of two stages:

1. Extract hidden states from all WavLM layers and obtain a single representation by averaging them across layers.
2. Use the language model (LM) to predict speech tokens, and then decode them into audio using **BiCodec**.
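The layer-averaging in stage 1 can be sketched in a few lines; the shapes and the helper below are illustrative stand-ins (the real pipeline operates on WavLM hidden states), but the reduction itself is the same:

```python
import numpy as np

def average_layers(hidden_states):
    """Average per-layer hidden states into a single representation.

    hidden_states: list of (frames, dim) arrays, one per WavLM layer.
    Returns one (frames, dim) array, the mean across layers.
    """
    stacked = np.stack(hidden_states, axis=0)  # (layers, frames, dim)
    return stacked.mean(axis=0)                # (frames, dim)
```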
### Running Inference

To run `test.py`, configure the parameters in `./conf/config.yaml`:

| Parameter | Description |
| --- | --- |
| `ckpt_path` | Path to the pretrained weights |
| `enroll_duration` | Duration of the enrollment audio (used for TSE) |
| `data_src_dir` | Directory of input audio files to process |
| `data_tgt_dir` | Directory where enhanced output audio is written |
| `mode` | Task type: `se` (noise suppression, speech restoration, packet loss concealment), `tse` (target speaker extraction), `ss` (speech separation) |
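A hypothetical `conf/config.yaml` fragment illustrating these fields (keys mirror the table above; the values are placeholders, not defaults shipped with the repository):

```yaml
ckpt_path: ./checkpoints/unise.pt   # pretrained weights
enroll_duration: 5                  # enrollment audio length in seconds (TSE only)
data_src_dir: ./data/noisy          # input audio to process
data_tgt_dir: ./data/enhanced       # where enhanced audio is written
mode: se                            # one of: se, tse, ss
```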
Run inference with:

```bash
python test.py
```
## Results

Samples processed by UniSE can be found on our [Demo Page](https://hyyan2k.github.io/UniSE/).

## Model Checkpoints

Our pretrained model is available on [Hugging Face](https://huggingface.co/spaces/QuarkAudio/).

## Hints

Our approach relies on the LLM's comprehension capabilities to autonomously determine the task type, which may be unstable in certain scenarios. A more stable and robust iteration will be released in an upcoming version.
## Citation

```bibtex
@misc{yan2025uniseunifiedframeworkdecoderonly,
      title={UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement},
      author={Haoyin Yan and Chengwei Liu and Shaofei Xue and Xiaotao Liang and Zheng Xue},
      year={2025},
      eprint={2510.20441},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2510.20441},
}
```

## Contact

For any questions, please contact: `yanhaoyin.yhy@alibaba-inc.com`