# UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement
<p align="center">
<a href="https://arxiv.org/abs/2510.20441">
<img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Paper">
</a>
<a href="https://github.com/alibaba/unified-audio/tree/main/QuarkAudio-UniSE">
<img src="https://img.shields.io/badge/GitHub-Code-green.svg" alt="GitHub">
</a>
<a href="https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/">
<img src="https://img.shields.io/badge/Model-Hugging%20Face-yellow.svg" alt="Hugging Face">
</a>
<a href="https://www.modelscope.cn/models/QuarkAudio/QuarkAudio-UniSE/">
<img src="https://img.shields.io/badge/Model-%20%E9%AD%94%E6%90%AD-orange.svg" alt="ModelScope">
</a>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2510.20441"><img src="QuarkAudio-UniSE.png" width="70%" /></a>
</p>
**UniSE**: A Unified, Prompt-Free, Autoregressive Speech Enhancement Framework Based on Decoder-only Language Models

**Key Highlights**:

- **Unified & Prompt-Free**: Handles multiple tasks without explicit instruction.
- **Decoder-only AR-LM Backbone**: Leverages LLM-style autoregressive generation for speech token prediction.
- **End-to-End Compatible**: Integrates WavLM (feature extractor), BiCodec (discrete codec), and LM into one pipeline.
- **Multitask Support**: SE, SR, TSE, SS, and more, all in a single model.

**Paper**: [arXiv:2510.20441](https://arxiv.org/abs/2510.20441) | **Model**: [Hugging Face](https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/)
---
## Supported Tasks
| Task | Full Name | Status | Description |
|------|-----------|--------|-------------|
| **SR** | Speech Restoration | Stable | General-purpose denoising and clarity improvement (e.g., noise, reverb, packet loss) |
| **TSE** | Target Speaker Extraction | Stable | Extracts the target speaker using a reference enrollment audio |
| **SS** | Speech Separation | Stable | Separates mixed speakers or sound sources |
| **AEC** | Acoustic Echo Cancellation | In development | Coming soon in the next release |

> Unlike traditional models that require task-specific prompts or modules, **UniSE autonomously infers the task type** from the input context, enabled by the LLM's comprehension ability.
---
## Quick Start: Run Inference in 3 Minutes
### 1. Clone Repository
```bash
git clone https://github.com/alibaba/unified-audio.git
cd unified-audio/QuarkAudio-UniSE
```
### 2. Create a Conda environment and install dependencies
```bash
conda create -n unise python=3.10
conda activate unise
pip install -r requirements.txt
```
### 3. Download Checkpoints
QuarkAudio-UniSE requires additional pre-trained models, **WavLM** and **BiCodec**, as well as the checkpoint of the intermediate LM hosted on Hugging Face, to function properly. Three of them can be downloaded with the provided shell script:
```bash
cd checkpoints
bash download.sh
```
Additionally, download `WavLM-Large.pt` from this [URL](https://huggingface.co/microsoft/wavlm-base-plus) and place it at `./ckpt/WavLM-Large.pt`.
Alternatively, you can download the checkpoints manually and place them in the `./model/bicodec/` directory.
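If you prefer to fetch the checkpoint repo programmatically, here is a minimal sketch using `huggingface_hub` (assuming the files are hosted in the `QuarkAudio/QuarkAudio-UniSE` repo linked above, and that `./checkpoints` matches the directory `download.sh` populates):
```python
# Sketch only: mirrors the Hugging Face model repo locally; requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="QuarkAudio/QuarkAudio-UniSE",  # model repo linked above
    local_dir="./checkpoints",              # assumed to match the layout created by download.sh
)
```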
After downloading, verify that all checkpoints are in the locations listed above before running training or inference.
## Train
+ Quick start
```bash
#!/bin/bash
python ./train.py --config conf/config.yaml
```
| Parameter | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `resume` | Checkpoint path to resume training from (leave empty to train from scratch) |
| `simulation_config` | Data-simulation config |
| `speech_scp_path` | SCP of clean audio files |
| `noise_scp_path` | SCP of noise audio files |
| `rir_scp_path` | SCP of RIR audio files |
| `mode` | Task type: `se` (Noise Suppression, Speech Restoration, Packet Loss Concealment), `tse` (Target Speaker Extraction), `SS` (Speech Separation) |
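For orientation, here is a minimal sketch of what the corresponding block of `conf/config.yaml` could look like. The keys come from the table above; the paths and values are purely hypothetical (each SCP file is assumed to follow the usual Kaldi-style `utterance-id /path/to/wav` one-pair-per-line format):
```python
# Illustrative only: keys from the table above, hypothetical values.
import yaml  # PyYAML

train_cfg = {
    "resume": "",                               # checkpoint path to resume from; empty trains from scratch
    "simulation_config": "conf/simulate.yaml",  # hypothetical path to the data-simulation config
    "speech_scp_path": "data/clean.scp",        # SCP of clean audio files
    "noise_scp_path": "data/noise.scp",         # SCP of noise audio files
    "rir_scp_path": "data/rir.scp",             # SCP of RIR audio files
    "mode": "se",                               # or "tse" / "SS", as listed above
}
print(yaml.safe_dump(train_cfg, sort_keys=False))  # preview of the equivalent YAML block
```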
## Inference
+ Quick start
The main inference script is **`test.py`**. The inference process consists of two stages:
1. Extract hidden states from all WavLM layers and obtain a single representation by averaging them across layers.
2. Use the language model (LM) to predict speech tokens, and then decode them into audio using **BiCodec**.
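A conceptual sketch of these two stages is given below (the real interfaces live in `test.py`; `wavlm`, `lm`, and `bicodec` are hypothetical stand-ins for the project's actual modules):
```python
import torch

@torch.no_grad()
def enhance(wav: torch.Tensor, wavlm, lm, bicodec) -> torch.Tensor:
    """Conceptual two-stage pipeline; all three model objects are hypothetical stand-ins."""
    # Stage 1: hidden states from all WavLM layers, averaged across layers,
    # give a single frame-level representation of the degraded input.
    layer_states = wavlm.extract_hidden_states(wav)        # hypothetical: list of (T, D) tensors, one per layer
    feats = torch.stack(layer_states, dim=0).mean(dim=0)   # (T, D) layer-averaged representation

    # Stage 2: the decoder-only LM autoregressively predicts discrete speech tokens
    # from the features, and BiCodec decodes the tokens back to a waveform.
    speech_tokens = lm.generate(feats)                      # hypothetical: 1-D tensor of token ids
    return bicodec.decode(speech_tokens)                    # hypothetical: enhanced waveform
```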
### Running Inference
To run `test.py`, configure the following parameters in `./conf/config.yaml`:
| Parameter | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `ckpt_path` | Path to the pretrained weights |
| `enroll_duration` | Number of inference iterations |
| `data_src_dir` | Directory of the input audio files to be processed |
| `data_tgt_dir` | Directory where the processed (output) audio files are written |
| `mode` | Task type: `se` (Noise Suppression, Speech Restoration, Packet Loss Concealment), `tse` (Target Speaker Extraction), `SS` (Speech Separation) |
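As with training, here is a minimal illustrative sketch of the inference-related block of `conf/config.yaml` (keys from the table above, hypothetical values):
```python
# Illustrative only: keys from the table above, hypothetical values.
import yaml  # PyYAML

test_cfg = {
    "ckpt_path": "checkpoints/unise.pt",  # hypothetical filename for the pretrained weights
    "enroll_duration": 5,                 # see the table above for the intended meaning
    "data_src_dir": "data/noisy",         # input audio to be processed
    "data_tgt_dir": "data/enhanced",      # where the enhanced audio is written
    "mode": "se",                         # or "tse" / "SS"
}
print(yaml.safe_dump(test_cfg, sort_keys=False))  # preview of the equivalent YAML block
```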
Command to run inference:
```bash
python test.py
```
## Model Checkpoints
Our pretrained model is available on [Hugging Face](https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/).
## Hints
Our approach focuses on leveraging the LLM's comprehension capabilities to enable autonomous determination of task types, though this may exhibit instability in certain scenarios. A more stable and robust iteration will be released in the upcoming version.
## Citation
```
@misc{yan2025uniseunifiedframeworkdecoderonly,
title={UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement},
author={Haoyin Yan and Chengwei Liu and Shaofei Xue and Xiaotao Liang and Zheng Xue},
year={2025},
eprint={2510.20441},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2510.20441},
}
```
## Contact
For any questions, please contact: `yanhaoyin.yhy@alibaba-inc.com`