Fix modelscope bug
README.md CHANGED
@@ -1,21 +1,15 @@
----
-license: apache-2.0
-language:
-- en
-pipeline_tag: audio-to-audio
----
 # UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement
 
 <p align="center">
   <a href="https://arxiv.org/abs/2510.20441">
     <img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Paper">
   </a>
-  <a href="https://
-    <img src="https://img.shields.io/badge/Demo-Page-blue.svg" alt="Demo">
-  </a>
-  <a href="https://huggingface.co/spaces/QuarkAudio/">
     <img src="https://img.shields.io/badge/Model-Hugging%20Face-yellow.svg" alt="Hugging Face">
   </a>
 </p>
 
 <p align="center">
@@ -29,7 +23,7 @@ pipeline_tag: audio-to-audio
 - **End-to-End Compatible**: Integrates WavLM (feature extractor), BiCodec (discrete codec), and LM into one pipeline.
 - **Multitask Support**: SE, SR, TSE, SS, and more, all in a single model.
 
-**Paper**: [arXiv:2510.20441](https://arxiv.org/abs/2510.20441)
 
 ---
 
@@ -71,7 +65,7 @@ QuarkAudio-UniSE requires three additional **WavLM** and **BiCodec** pre-trained
 cd checkpoints
 bash download.sh
 ```
-Additionally, download WavLM-
 
 Alternatively, you can download them manually and place them in the `./model/bicodec/` directory.
 
@@ -91,14 +85,14 @@ python ./train.py --config conf/config.yaml
 | `speech_scp_path` | SCP of clean audio files |
 | `noise_scp_path` | SCP of noise audio files |
 | `rir_scp_path` | SCP of RIR audio files |
-| `mode` | Task type: `
 
 
 ## Inference
+Quick start
 The main inference script is **`test.py`**. The inference process consists of two stages:
 
-1. Extract hidden states from all
 2. Use the language model (LM) to predict speech tokens, and then decode them into audio using **BiCodec**.
 
 ### Running Inference
@@ -119,13 +113,10 @@ Command to run inference:
 python test.py
 ```
 
-## Results
-
-Samples processed by UniSE can be found on our [Demo Page](https://github.com/hyyan2k/UniSE/).
 
 ## Model Checkpoints
 
-Our pretrained model is available on [Hugging Face](https://huggingface.co/
 
 ## Hints
 
@@ -144,7 +135,8 @@ Our approach focuses on leveraging the LLM's comprehension capabilities to enabl
 url={https://arxiv.org/abs/2510.20441},
 }
 ```
-
 
 ## Contact
-For any questions, please contact: `yanhaoyin.yhy@alibaba-inc.com`
 # UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement
 
 <p align="center">
   <a href="https://arxiv.org/abs/2510.20441">
     <img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Paper">
   </a>
+  <a href="https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/">
     <img src="https://img.shields.io/badge/Model-Hugging%20Face-yellow.svg" alt="Hugging Face">
   </a>
+  <a href="https://www.modelscope.cn/models/QuarkAudio/QuarkAudio-UniSE/">
+    <img src="https://img.shields.io/badge/Model-%20%E9%AD%94%E6%90%AD-orange.svg" alt="ModelScope">
+  </a>
 </p>
 
 <p align="center">
 - **End-to-End Compatible**: Integrates WavLM (feature extractor), BiCodec (discrete codec), and LM into one pipeline.
 - **Multitask Support**: SE, SR, TSE, SS, and more, all in a single model.
 
+**Paper**: [arXiv:2510.20441](https://arxiv.org/abs/2510.20441) | **Model**: [Hugging Face](https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/)
 
 ---
 
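The end-to-end wiring named in the feature list (WavLM features feeding a decoder-only LM, whose token output BiCodec decodes) can be sketched as below; `enhance` and the three callables are hypothetical stand-ins, not the repo's actual API:

```python
# Illustrative sketch of the UniSE pipeline wiring described above.
# All component names are assumptions; see the repo's test.py for the real flow.
def enhance(noisy_wav, wavlm, lm, bicodec):
    feats = wavlm(noisy_wav)   # continuous acoustic features
    tokens = lm(feats)         # discrete speech tokens predicted by the LM
    return bicodec(tokens)     # enhanced waveform decoded from the tokens
```

With identity stubs for the three components, `enhance` simply passes the input through, which makes the data flow easy to check.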
 cd checkpoints
 bash download.sh
 ```
+Additionally, download `WavLM-Large.pt` from this [URL](https://huggingface.co/microsoft/wavlm-base-plus) and put it at `./ckpt/WavLM-Large.pt`.
 
 Alternatively, you can download them manually and place them in the `./model/bicodec/` directory.
 
 | `speech_scp_path` | SCP of clean audio files |
 | `noise_scp_path` | SCP of noise audio files |
 | `rir_scp_path` | SCP of RIR audio files |
+| `mode` | Task type: `se` (Noise Suppression, Speech Restoration, Packet Loss Concealment), `tse` (Target Speaker Extraction), `ss` (Speech Separation). |
 
 
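The `*_scp_path` options point to SCP list files. Assuming the common Kaldi-style `<utterance-id> <path>` layout (an assumption; check the repo's data-preparation scripts for the exact format), a minimal parser looks like:

```python
# Hypothetical helper: parse a Kaldi-style SCP file, one "<utt-id> <wav-path>"
# entry per line. The exact SCP layout expected by this repo is an assumption.
def parse_scp(lines):
    """Map utterance IDs to audio file paths, skipping blank lines."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, path = line.split(maxsplit=1)
        entries[utt_id] = path
    return entries

# Example SCP contents:
scp_text = [
    "utt001 /data/clean/utt001.wav",
    "utt002 /data/clean/utt002.wav",
]
clean_wavs = parse_scp(scp_text)
```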
 ## Inference
+Quick start
 The main inference script is **`test.py`**. The inference process consists of two stages:
 
+1. Extract hidden states from all WavLM layers and obtain a single representation by averaging them across layers.
 2. Use the language model (LM) to predict speech tokens, and then decode them into audio using **BiCodec**.
 
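Stage 1 amounts to a plain mean over layers; the sketch below shows the arithmetic with nested lists standing in for tensors (`average_wavlm_layers` is an illustrative name, not the repo's real API):

```python
# Sketch of stage 1: collapse per-layer WavLM hidden states into one
# representation by averaging across layers. hidden_states is a list of
# num_layers feature maps, each shaped [T frames][D dims] (nested lists here
# stand in for tensors; the repo itself presumably operates on tensors).
def average_wavlm_layers(hidden_states):
    num_layers = len(hidden_states)
    num_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    avg = [[0.0] * dim for _ in range(num_frames)]
    for layer in hidden_states:
        for t in range(num_frames):
            for d in range(dim):
                avg[t][d] += layer[t][d] / num_layers
    return avg
```

Averaging two layers of a single two-dimensional frame, `[[1.0, 2.0]]` and `[[3.0, 4.0]]`, yields `[[2.0, 3.0]]`.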
 ### Running Inference
 python test.py
 ```
 
 
 ## Model Checkpoints
 
+Our pretrained model is available on [Hugging Face](https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/).
 
 ## Hints
 
 url={https://arxiv.org/abs/2510.20441},
 }
 ```
+
 
 ## Contact
+For any questions, please contact: `yanhaoyin.yhy@alibaba-inc.com`
+