---
title: xtts2 + Bark TTS
emoji: 🎙️
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - text-to-speech
  - voice-cloning
  - xtts
  - bark
  - mcp-server
short_description: XTTS2 voice cloning + Bark TTS in one space
---

# TTS Hub: XTTS2 + Bark

Two powerful TTS models in one space, optimized for CPU.

## Models

| Model | Voice Source | Languages | Special Features |
|-------|--------------|-----------|------------------|
| **XTTS2** (default) | Your audio sample | 16 languages | Voice cloning |
| **Bark** | Preset voices | EN, DE, FR, ES, ZH, JA, KO | Non-speech sounds, temperature control |

## Usage

### XTTS2 (Voice Cloning)

1. Upload 3-30 seconds of reference voice audio
2. Enter text to synthesize
3. Select language and speed
4. Click "Generate Speech"

### Bark (Preset Voices)

1. Select "Bark (Preset Voices)"
2. Choose a voice preset (e.g., `v2/en_speaker_6`)
3. Adjust temperature controls (optional):
   - **Text Temperature** (0.1-1.0): Controls semantic variation
   - **Waveform Temperature** (0.1-1.0): Controls audio variation
4. Set seed for reproducibility (optional, -1 for random)
5. Enter text with optional special tokens
6. Click "Generate Speech"

**Bark special tokens:**

- `[laughter]` `[laughs]` `[sighs]` `[music]` `[gasps]` `[clears throat]`
- `♪ la la la ♪` for singing
- `MAN:` `WOMAN:` for speaker labels

**Long text handling:** Text is automatically split into chunks and processed sequentially with natural pauses between segments.
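
The long-text behavior described above can be pictured with a small sketch. This is not the Space's actual code: the function names, the 200-character limit, and the 0.25-second pause are assumptions chosen for illustration. It splits text at sentence boundaries, then joins the synthesized segments with short stretches of silence.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than
    being cut mid-word. max_chars is an illustrative default.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def join_with_pauses(segments, sample_rate: int, pause_seconds: float = 0.25):
    """Concatenate audio segments (lists of samples) with silence between them."""
    silence = [0.0] * int(sample_rate * pause_seconds)
    joined = []
    for index, segment in enumerate(segments):
        if index > 0:
            joined.extend(silence)
        joined.extend(segment)
    return joined
```

Each chunk would be passed to the selected model in order, and the resulting audio arrays handed to `join_with_pauses`.
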

---

## API

### Python Client

```python
from gradio_client import Client, handle_file

client = Client("Luminia/xtts2-Bark")

# XTTS2 (voice cloning)
result = client.predict(
    text="Hello, this is a voice cloning test.",
    model_choice="XTTS2 (Voice Cloning)",
    reference_audio=handle_file("voice_sample.wav"),
    language="English",
    speed=1.0,
    voice_preset="v2/en_speaker_6",
    text_temp=0.7,      # Bark only (ignored for XTTS2)
    waveform_temp=0.7,  # Bark only (ignored for XTTS2)
    seed=-1,            # Bark only (ignored for XTTS2)
    api_name="/synthesize"
)
print(result)  # (audio_path, status)

# Bark (preset voice) with temperature control
result = client.predict(
    text="Hello! [laughter] This is Bark speaking.",
    model_choice="Bark (Preset Voices)",
    reference_audio=None,
    language="English",
    speed=1.0,
    voice_preset="v2/en_speaker_6",
    text_temp=0.7,      # Semantic temperature (0.1-1.0)
    waveform_temp=0.7,  # Audio waveform temperature (0.1-1.0)
    seed=42,            # Set seed for reproducibility (-1 for random)
    api_name="/synthesize"
)
print(result)
```

### REST API (curl)

```bash
# XTTS2 with voice cloning
curl -X POST "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Hello world",
      "XTTS2 (Voice Cloning)",
      {"path": "https://example.com/voice.wav"},
      "English",
      1.0,
      "v2/en_speaker_6",
      0.7,
      0.7,
      -1
    ]
  }'

# Bark with preset voice and temperature control
curl -X POST "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Hello [laughter] world",
      "Bark (Preset Voices)",
      null,
      "English",
      1.0,
      "v2/en_speaker_3",
      0.7,
      0.7,
      42
    ]
  }'
```
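
Gradio documents its `call` endpoints as a two-step flow: the POST returns an `event_id`, and a follow-up GET on `.../synthesize/<event_id>` streams the result as server-sent events. The sketch below wraps both steps using only the Python standard library. The helper names (`build_payload`, `parse_sse_result`, `synthesize`) are illustrative, and the exact event format is an assumption that may differ between Gradio versions.

```python
import json
import urllib.request

BASE_URL = "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize"


def build_payload(text, model_choice="Bark (Preset Voices)",
                  reference_audio_url=None, language="English", speed=1.0,
                  voice_preset="v2/en_speaker_6", text_temp=0.7,
                  waveform_temp=0.7, seed=-1):
    """Return the JSON body with inputs in the positional order shown above."""
    reference = {"path": reference_audio_url} if reference_audio_url else None
    return {"data": [text, model_choice, reference, language, speed,
                     voice_preset, text_temp, waveform_temp, seed]}


def parse_sse_result(stream_text):
    """Return the decoded data of the `complete` event, or None if absent."""
    event = None
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event == "complete":
            return json.loads(line[len("data:"):].strip())
    return None


def synthesize(**kwargs):
    """POST the request, then read the result stream for the returned event ID."""
    body = json.dumps(build_payload(**kwargs)).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        event_id = json.load(response)["event_id"]
    # Assumption: the server closes the stream after the final event.
    with urllib.request.urlopen(f"{BASE_URL}/{event_id}") as stream:
        return parse_sse_result(stream.read().decode("utf-8"))
```

Calling `synthesize(text="Hello [laughter] world", seed=42)` would, under these assumptions, return the same `[audio, status]` pair as the Python client.
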

### MCP (Model Context Protocol)

This Space exposes its `synthesize` endpoint as an MCP tool for AI assistants.

**Tool schema:**

```json
{
  "name": "synthesize",
  "parameters": {
    "text": {"type": "string", "description": "Text to synthesize"},
    "model_choice": {"type": "string", "enum": ["XTTS2 (Voice Cloning)", "Bark (Preset Voices)"]},
    "reference_audio": {"type": "file", "description": "Reference audio for XTTS2 (optional for Bark)"},
    "language": {"type": "string", "default": "English"},
    "speed": {"type": "number", "default": 1.0},
    "voice_preset": {"type": "string", "default": "v2/en_speaker_6"},
    "text_temp": {"type": "number", "default": 0.7, "description": "Bark text/semantic temperature (0.1-1.0)"},
    "waveform_temp": {"type": "number", "default": 0.7, "description": "Bark waveform temperature (0.1-1.0)"},
    "seed": {"type": "integer", "default": -1, "description": "Bark seed for reproducibility (-1 for random)"}
  },
  "returns": ["audio", "string"]
}
```

**MCP Config:**

```json
{
  "mcpServers": {
    "tts-hub": {"url": "https://luminia-xtts2-bark.hf.space/gradio_api/mcp/"}
  }
}
```

---

## CLI Usage

```bash
# XTTS2 voice cloning
python app.py tts -t "Hello world" -o output.wav -m xtts2 -r voice_sample.wav -l English -s 1.0

# Bark preset voice (basic)
python app.py tts -t "Hello [laughter] world" -o output.wav -m bark -v "v2/en_speaker_6"

# Bark with temperature control and seed
python app.py tts -t "Hello world" -o output.wav -m bark -v "v2/en_speaker_6" \
  --text-temp 0.7 --waveform-temp 0.7 --seed 42
```
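
For readers who want to see how the flags above fit together, here is a hypothetical `argparse` layout that accepts the same commands. It is a sketch, not the parser in `app.py`: the long option names for the short flags (`--text`, `--output`, `--model`, `--reference`, `--language`, `--speed`, `--voice`) and the default for `-o` are assumptions, while the other defaults mirror the values listed in the API section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the `tts` commands shown above."""
    parser = argparse.ArgumentParser(prog="app.py")
    sub = parser.add_subparsers(dest="command", required=True)

    tts = sub.add_parser("tts", help="Synthesize speech from text")
    tts.add_argument("-t", "--text", required=True, help="Text to synthesize")
    tts.add_argument("-o", "--output", default="output.wav", help="Output WAV path")
    tts.add_argument("-m", "--model", choices=["xtts2", "bark"], default="xtts2")
    tts.add_argument("-r", "--reference", help="Reference audio (XTTS2 only)")
    tts.add_argument("-l", "--language", default="English")
    tts.add_argument("-s", "--speed", type=float, default=1.0)
    tts.add_argument("-v", "--voice", default="v2/en_speaker_6", help="Bark preset")
    tts.add_argument("--text-temp", type=float, default=0.7)
    tts.add_argument("--waveform-temp", type=float, default=0.7)
    tts.add_argument("--seed", type=int, default=-1, help="-1 for random")
    return parser
```

With this layout, `build_parser().parse_args(["tts", "-t", "Hello world", "-m", "bark"])` yields a namespace whose remaining fields fall back to the defaults.
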

## Bark Voice Presets

| Preset | Language |
|--------|----------|
| `v2/en_speaker_0` - `v2/en_speaker_9` | English |
| `v2/de_speaker_0` - `v2/de_speaker_2` | German |
| `v2/fr_speaker_0` - `v2/fr_speaker_1` | French |
| `v2/es_speaker_0` - `v2/es_speaker_1` | Spanish |
| `v2/zh_speaker_0` - `v2/zh_speaker_1` | Chinese |
| `v2/ja_speaker_0` | Japanese |
| `v2/ko_speaker_0` | Korean |
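
When calling the API from a script, it can help to check a preset name before sending the request. The sketch below encodes the table above as a dictionary and expands it into the full list of names; `PRESET_COUNTS`, `list_presets`, and `is_valid_preset` are illustrative names, not part of the Space.

```python
# Number of presets per language code, taken from the table above.
PRESET_COUNTS = {"en": 10, "de": 3, "fr": 2, "es": 2, "zh": 2, "ja": 1, "ko": 1}


def list_presets(language_code=None):
    """Return all preset names, or only those for one language code."""
    codes = [language_code] if language_code else list(PRESET_COUNTS)
    return [
        f"v2/{code}_speaker_{index}"
        for code in codes
        for index in range(PRESET_COUNTS.get(code, 0))
    ]


def is_valid_preset(name: str) -> bool:
    """Check whether a preset name appears in the table above."""
    return name in list_presets()
```
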

## Bark Temperature Guide

| Setting | Low (0.1-0.3) | Medium (0.5-0.7) | High (0.8-1.0) |
|---------|---------------|------------------|----------------|
| **Text Temp** | More predictable, robotic | Natural, balanced | Creative, variable |
| **Waveform Temp** | Cleaner audio | Natural variation | More expressive |

**Recommended:** Start with 0.7 for both temperatures for natural-sounding speech.
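
A client script can enforce the 0.1-1.0 range and the `-1` seed convention before sending a request. The helpers below are a minimal sketch under that assumption; the names and the 32-bit seed range are illustrative choices, and the Space itself may handle out-of-range values differently.

```python
import random


def clamp_temperature(value, low=0.1, high=1.0):
    """Clamp a temperature into the documented 0.1-1.0 range."""
    return max(low, min(high, float(value)))


def resolve_seed(seed, rng=None):
    """Return seed unchanged if non-negative; otherwise draw a random one.

    The 32-bit range is an assumption, not a documented limit.
    """
    if seed >= 0:
        return seed
    if rng is None:
        rng = random.Random()
    return rng.randrange(2**32)
```

Logging the value returned by `resolve_seed(-1)` makes a "random" run repeatable later by passing that value back as the seed.
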

---

## Credits

- **XTTS2:** [Coqui TTS](https://github.com/idiap/coqui-ai-TTS) (Apache 2.0)
- **Bark:** [Suno AI](https://github.com/suno-ai/bark) (MIT)

Licensed under Apache 2.0.