---
title: SingingSDS
emoji: 🎢
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---
# SingingSDS: Role-Playing Singing Spoken Dialogue System

<div align="center">

**A role-playing singing dialogue system that converts speech input into character-based singing output.**

![Paper](https://img.shields.io/badge/Paper-Coming%20Soon-orange) [![Code](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/SingingSDS/SingingSDS) [![HuggingFace Demo](https://img.shields.io/badge/🤗%20HuggingFace-Demo-yellow)](https://huggingface.co/spaces/espnet/SingingSDS) [![YouTube](https://img.shields.io/badge/YouTube-Playlist-red)](https://www.youtube.com/playlist?list=PLZpUJJbwp2WvtPBenG5D3h09qKIrt24ui)

</div>

## 📖 Overview

SingingSDS is a role-playing singing dialogue system that converts natural speech input into character-based singing output. It chains three components: automatic speech recognition (ASR) to transcribe the spoken query, a large language model (LLM) to generate the reply, and singing voice synthesis (SVS) to render that reply as singing.

<div align="center">
  <img src="assets/demo.png" alt="SingingSDS Interface" style="max-width: 100%; height: auto;"/>
  <p><em>SingingSDS Web Interface: Interactive singing dialogue system with character visualization, audio I/O, evaluation metrics, and flexible configuration options.</em></p>
</div>
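
The three-stage flow described in this overview can be sketched as a simple function composition. This is only an illustration under stated assumptions: `run_turn` and the stub callables below are hypothetical names, not the interfaces defined in `pipeline.py`, and a real SVS stage needs more than text (for example, melody information).

```python
from typing import Callable


def run_turn(
    audio_path: str,
    asr: Callable[[str], str],
    llm: Callable[[str], str],
    svs: Callable[[str], bytes],
) -> bytes:
    """Run one dialogue turn: speech in, sung reply out (hypothetical sketch)."""
    transcript = asr(audio_path)  # speech -> text
    reply = llm(transcript)       # text -> reply text
    return svs(reply)             # reply text -> audio bytes


if __name__ == "__main__":
    # Stub stages standing in for real models; they only show the data flow.
    audio = run_turn(
        "tests/audio/hello.wav",
        asr=lambda path: "hello",
        llm=lambda text: f"la la {text}",
        svs=lambda lyrics: lyrics.encode("utf-8"),
    )
    print(len(audio), "bytes of placeholder audio")
```

Passing the stages in as callables keeps the sketch independent of any particular model, which mirrors the way the models are swappable through configuration.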

## 🚀 Installation

### Requirements

- Python 3.10 or 3.11
- CUDA (optional, for GPU acceleration)
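
Before installing, it can help to confirm that the current environment matches these requirements. The script below is a minimal sketch, not part of the repository; the function names are made up for illustration, and the CUDA check assumes PyTorch is the library that will use the GPU.

```python
import sys

# Python versions listed under Requirements.
SUPPORTED_PYTHON = {(3, 10), (3, 11)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the major.minor version is one the README lists."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def cuda_is_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch not installed yet, so no GPU check is possible.
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("CUDA available:", cuda_is_available())
```

A `False` result for CUDA is not an error, since GPU acceleration is optional.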

### Install Dependencies

#### Option 1: Using Conda (Recommended)

```bash
conda create -n singingsds python=3.11

conda activate singingsds
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

#### Option 2: Using uv (Fast & Modern)

First install uv:

```bash
# On macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip:
pip install uv
```

Then install dependencies:

```bash
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
```

#### Option 3: Using pip only

```bash
pip install -r requirements.txt
```

#### Option 4: Using pip with virtual environment

```bash
python -m venv singingsds_env

# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate

pip install -r requirements.txt
```

## 💻 Usage

### Command Line Interface (CLI)

#### Example Usage

```bash
python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default.yaml \
  --output_audio outputs/yaoyin_hello.wav \
  --eval_results_csv outputs/yaoyin_test.csv
```

#### Inference-Only Mode

Run minimal inference without evaluation.

```bash
python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default_infer_only.yaml \
  --output_audio outputs/yaoyin_hello.wav
```

#### Parameter Description

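- `--query_audio`: Input audio file path (required)
- `--config_path`: Configuration file path (default: config/cli/yaoyin_default.yaml)
- `--output_audio`: Output audio file path (required)
- `--eval_results_csv`: Path of the CSV file for evaluation results (used in the example above; omitted in inference-only mode)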
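
To process several recordings, the CLI can be driven from a short Python script. The sketch below uses only the flags shown in the examples above; `build_command`, `run_batch`, and the `yaoyin_<name>.wav` output naming are assumptions made for illustration, not part of the project.

```python
import subprocess
import sys
from pathlib import Path


def build_command(query_audio, output_audio,
                  config_path="config/cli/yaoyin_default.yaml",
                  eval_results_csv=None):
    """Build the argument list for a single cli.py run."""
    cmd = [
        sys.executable, "cli.py",
        "--query_audio", str(query_audio),
        "--config_path", str(config_path),
        "--output_audio", str(output_audio),
    ]
    if eval_results_csv is not None:
        cmd += ["--eval_results_csv", str(eval_results_csv)]
    return cmd


def run_batch(input_dir, output_dir, **kwargs):
    """Run cli.py once per .wav file in input_dir (hypothetical layout)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(input_dir).glob("*.wav")):
        out = output_dir / f"yaoyin_{wav.stem}.wav"
        # check=True stops the batch at the first failing run.
        subprocess.run(build_command(wav, out, **kwargs), check=True)
```

Calling `run_batch("tests/audio", "outputs")` from the repository root would run the CLI once for each `.wav` file in `tests/audio`.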

### 🌐 Web Interface (Gradio)

Start the web interface:

```bash
python app.py
```

Then visit the displayed address in your browser to use the graphical interface.

> 💡 **Tip**: You can also try our [HuggingFace demo](https://huggingface.co/spaces/espnet/SingingSDS) for a quick test without local installation!

## ⚙️ Configuration

### Character Configuration

The system supports multiple preset characters:

- **Yaoyin (遥音)**: Default timbre is `timbre2`
- **Limei (丽梅)**: Default timbre is `timbre1`

### Model Configuration

#### ASR Models
| Model | Description |
|-------|-------------|
| `openai/whisper-large-v3-turbo` | Whisper large-v3 turbo (faster variant of large-v3) |
| `openai/whisper-large-v3` | Large Whisper v3 model |
| `openai/whisper-medium` | Medium-sized Whisper model |
| `openai/whisper-small` | Small Whisper model |
| `funasr/paraformer-zh` | Paraformer for Chinese ASR |

#### LLM Models
| Model | Description |
|-------|-------------|
| `gemini-2.5-flash` | Google Gemini 2.5 Flash |
| `google/gemma-2-2b` | Google Gemma 2 2B model |
| `meta-llama/Llama-3.2-3B-Instruct` | Meta Llama 3.2 3B Instruct |
| `meta-llama/Llama-3.1-8B-Instruct` | Meta Llama 3.1 8B Instruct |
| `Qwen/Qwen3-8B` | Qwen3 8B model |
| `Qwen/Qwen3-30B-A3B` | Qwen3 30B A3B model |
| `MiniMaxAI/MiniMax-Text-01` | MiniMax Text model |

#### SVS Models
| Model | Language Support |
|------|------------------|
| `espnet/visinger2-zh-jp-multisinger-svs` | Bilingual (Chinese & Japanese) |
| `espnet/aceopencpop_svs_visinger2_40singer_pretrain` | Chinese |
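
When choosing models and a character, a small lookup can catch typos in model IDs before a long run starts. The sketch below copies the options from this section; `SUPPORTED`, `DEFAULT_TIMBRES`, and `check_selection` are hypothetical names, and the project's YAML files may use a different schema.

```python
# Options copied from the tables in this section (illustrative only).
SUPPORTED = {
    "asr": {
        "openai/whisper-large-v3-turbo",
        "openai/whisper-large-v3",
        "openai/whisper-medium",
        "openai/whisper-small",
        "funasr/paraformer-zh",
    },
    "llm": {
        "gemini-2.5-flash",
        "google/gemma-2-2b",
        "meta-llama/Llama-3.2-3B-Instruct",
        "meta-llama/Llama-3.1-8B-Instruct",
        "Qwen/Qwen3-8B",
        "Qwen/Qwen3-30B-A3B",
        "MiniMaxAI/MiniMax-Text-01",
    },
    "svs": {
        "espnet/visinger2-zh-jp-multisinger-svs",
        "espnet/aceopencpop_svs_visinger2_40singer_pretrain",
    },
}

# Default timbre per preset character, as listed under Character Configuration.
DEFAULT_TIMBRES = {"Yaoyin": "timbre2", "Limei": "timbre1"}


def check_selection(asr, llm, svs, character):
    """Return a list of problems; an empty list means every choice is listed."""
    problems = []
    for kind, model_id in (("asr", asr), ("llm", llm), ("svs", svs)):
        if model_id not in SUPPORTED[kind]:
            problems.append(f"unknown {kind} model: {model_id}")
    if character not in DEFAULT_TIMBRES:
        problems.append(f"unknown character: {character}")
    return problems
```

Returning a list instead of raising on the first mismatch lets a caller report every problem in one pass.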

## 📁 Project Structure

```
SingingSDS/
├── app.py, cli.py               # Entry points (demo app & CLI)
├── pipeline.py                  # Main orchestration pipeline
├── interface.py                 # Gradio interface
├── characters/                  # Virtual character definitions
├── modules/                     # Core modules
│   ├── asr/                     # ASR models (Whisper, Paraformer)
│   ├── llm/                     # LLMs (Gemini, LLaMA, etc.)
│   ├── svs/                     # Singing voice synthesis (ESPnet)
│   └── utils/                   # G2P, text normalization, resources
├── config/                      # YAML configuration files
├── data/                        # Dataset metadata and length info
├── data_handlers/               # Parsers for KiSing, Touhou, etc.
├── evaluation/                  # Evaluation metrics
├── resources/                   # Singer embeddings, phoneme dicts, MIDI
├── assets/                      # Character visuals
├── tests/                       # Unit tests and sample audios
└── README.md, requirements.txt
```

## 🤝 Contributing

We welcome contributions! Please feel free to submit issues and pull requests.

## 📄 License

### Character Assets

The Yaoyin (遥音) character assets, including [`character_yaoyin.png`](./assets/character_yaoyin.png) created by illustrator Zihe Zhou, are commissioned exclusively for the SingingSDS project. Screenshots of the system that include these assets, such as [`demo.png`](./assets/demo.png), are also covered under this license. The assets may be used only for direct derivatives of SingingSDS, such as project-related posts, usage videos, or other content directly depicting the project. Any other use requires express permission from the illustrator, and these assets may not be used for training, fine-tuning, or improving any artificial intelligence or machine learning models. For full license details, see [`assets/character_yaoyin.LICENSE`](./assets/character_yaoyin.LICENSE).

### Code License

All source code in this repository is licensed under the [MIT License](./LICENSE). This license applies **only to the code**. Character assets remain under their separate license and restrictions, as described in the **Character Assets** section.

### Model License

The models used in SingingSDS are subject to their respective licenses and terms of use. Users must comply with each model’s official license, which can be found at the respective model’s official repository or website.

---

<div align="center">

Paper (Coming soon) • [Code](https://github.com/SingingSDS/SingingSDS) • [Demo](https://huggingface.co/spaces/espnet/SingingSDS) • [Video](https://www.youtube.com/playlist?list=PLZpUJJbwp2WvtPBenG5D3h09qKIrt24ui)

</div>