---
title: SingingSDS
emoji: 🎢
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---

# SingingSDS: Role-Playing Singing Spoken Dialogue System

A role-playing singing dialogue system that converts speech input into character-based singing output.

## Installation

### Requirements

- Python 3.11+
- CUDA (optional, for GPU acceleration)

### Install Dependencies

#### Option 1: Using Conda (Recommended)

```bash
conda create -n singingsds python=3.11
conda activate singingsds
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

#### Option 2: Using pip only

```bash
pip install -r requirements.txt
```

#### Option 3: Using pip with a virtual environment

```bash
python -m venv singingsds_env
# On Windows: singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate
pip install -r requirements.txt
```

## Usage

### Command Line Interface (CLI)

#### Example Usage

```bash
python cli.py \
    --query_audio tests/audio/hello.wav \
    --config_path config/cli/yaoyin_default.yaml \
    --output_audio outputs/yaoyin_hello.wav \
    --eval_results_csv outputs/yaoyin_test.csv
```

#### Inference-Only Mode

Run minimal inference without evaluation:

```bash
python cli.py \
    --query_audio tests/audio/hello.wav \
    --config_path config/cli/yaoyin_default_infer_only.yaml \
    --output_audio outputs/yaoyin_hello.wav
```

#### Parameter Description

- `--query_audio`: Input audio file path (required)
- `--config_path`: Configuration file path (default: `config/cli/yaoyin_default.yaml`)
- `--output_audio`: Output audio file path (required)

### Web Interface (Gradio)

Start the web interface:

```bash
python app.py
```

Then open the displayed address in your browser to use the graphical interface.
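For batch runs over many input files, the documented CLI flags can be assembled programmatically. Below is a minimal sketch: the `build_cli_command` helper and the batch loop are illustrative additions, not part of the project; only the three flags and the default config path are taken from this README.

```python
from pathlib import Path

def build_cli_command(query_audio, output_audio,
                      config_path="config/cli/yaoyin_default.yaml"):
    """Assemble one cli.py invocation from the documented flags."""
    return [
        "python", "cli.py",
        "--query_audio", str(query_audio),
        "--config_path", str(config_path),
        "--output_audio", str(output_audio),
    ]

# Build one command per sample WAV under tests/audio/.
# Each can then be executed with subprocess.run(cmd, check=True).
out_dir = Path("outputs")
commands = [
    build_cli_command(wav, out_dir / f"yaoyin_{wav.name}")
    for wav in sorted(Path("tests/audio").glob("*.wav"))
]
```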
## Configuration

### Character Configuration

The system supports multiple preset characters:

- **Yaoyin (ι₯音)**: default timbre is `timbre2`
- **Limei (δΈ½ζ’…)**: default timbre is `timbre1`

### Model Configuration

#### ASR Models

- `openai/whisper-large-v3-turbo`
- `openai/whisper-large-v3`
- `openai/whisper-medium`
- `openai/whisper-small`
- `funasr/paraformer-zh`

#### LLM Models

- `gemini-2.5-flash`
- `google/gemma-2-2b`
- `meta-llama/Llama-3.2-3B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
- `Qwen/Qwen3-8B`
- `Qwen/Qwen3-30B-A3B`
- `MiniMaxAI/MiniMax-Text-01`

#### SVS Models

- `espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg` (bilingual)
- `espnet/aceopencpop_svs_visinger2_40singer_pretrain` (Chinese)

## Project Structure

```
SingingSDS/
β”œβ”€β”€ app.py, cli.py     # Entry points (demo app & CLI)
β”œβ”€β”€ pipeline.py        # Main orchestration pipeline
β”œβ”€β”€ interface.py       # Gradio interface
β”œβ”€β”€ characters/        # Virtual character definitions
β”œβ”€β”€ modules/           # Core modules
β”‚   β”œβ”€β”€ asr/           # ASR models (Whisper, Paraformer)
β”‚   β”œβ”€β”€ llm/           # LLMs (Gemini, LLaMA, etc.)
β”‚   β”œβ”€β”€ svs/           # Singing voice synthesis (ESPnet)
β”‚   └── utils/         # G2P, text normalization, resources
β”œβ”€β”€ config/            # YAML configuration files
β”œβ”€β”€ data/              # Dataset metadata and length info
β”œβ”€β”€ data_handlers/     # Parsers for KiSing, Touhou, etc.
β”œβ”€β”€ evaluation/        # Evaluation metrics
β”œβ”€β”€ resources/         # Singer embeddings, phoneme dicts, MIDI
β”œβ”€β”€ assets/            # Character visuals
β”œβ”€β”€ tests/             # Unit tests and sample audios
└── README.md, requirements.txt
```

## Contributing

Issues and pull requests are welcome!

## License
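As a closing usage note, the character-to-default-timbre mapping documented in the Configuration section can be expressed as a small lookup. The dictionary and helper below are illustrative only (they are not the project's API); the character names and timbre IDs are taken from this README.

```python
# Preset characters and their documented default timbres.
DEFAULT_TIMBRES = {
    "yaoyin": "timbre2",
    "limei": "timbre1",
}

def default_timbre(character: str) -> str:
    """Look up the default timbre for a preset character (case-insensitive)."""
    try:
        return DEFAULT_TIMBRES[character.lower()]
    except KeyError:
        raise ValueError(f"unknown character: {character!r}") from None
```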