---
title: xtts2 + Bark TTS
emoji: 🎙️
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
- text-to-speech
- voice-cloning
- xtts
- bark
- mcp-server
short_description: XTTS2 voice cloning + Bark TTS in one space
---
# TTS Hub: XTTS2 + Bark
Two powerful TTS models in one space, optimized for CPU.
## Models
| Model | Voice Source | Languages | Special Features |
|-------|--------------|-----------|------------------|
| **XTTS2** (default) | Your audio sample | 16 languages | Voice cloning |
| **Bark** | Preset voices | EN, DE, FR, ES, ZH, JA, KO | Non-speech sounds, temperature control |
## Usage
### XTTS2 (Voice Cloning)
1. Upload 3-30 seconds of reference voice audio
2. Enter text to synthesize
3. Select language and speed
4. Click "Generate Speech"
### Bark (Preset Voices)
1. Select "Bark (Preset Voices)"
2. Choose a voice preset (e.g., `v2/en_speaker_6`)
3. Adjust temperature controls (optional):
- **Text Temperature** (0.1-1.0): Controls semantic variation
- **Waveform Temperature** (0.1-1.0): Controls audio variation
4. Set seed for reproducibility (optional, -1 for random)
5. Enter text with optional special tokens
6. Click "Generate Speech"
**Bark special tokens:**
- `[laughter]` `[laughs]` `[sighs]` `[music]` `[gasps]` `[clears throat]`
- `♪ la la la ♪` for singing
- `MAN:` `WOMAN:` for speaker labels
**Long text handling:** Text is automatically split into chunks and processed sequentially with natural pauses between segments.
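The actual splitter lives inside `app.py`; a minimal sketch of the idea — sentence-aligned chunking with a character budget (function name and `max_chars` value are illustrative, not the Space's API) — could look like:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second sentence. Third one!", max_chars=20))
```

Each chunk is then synthesized on its own and the audio segments are concatenated with short pauses in between.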
---
## API
### Python Client
```python
from gradio_client import Client, handle_file
client = Client("Luminia/xtts2-Bark")

# XTTS2 (voice cloning)
result = client.predict(
    text="Hello, this is a voice cloning test.",
    model_choice="XTTS2 (Voice Cloning)",
    reference_audio=handle_file("voice_sample.wav"),
    language="English",
    speed=1.0,
    voice_preset="v2/en_speaker_6",
    text_temp=0.7,      # Bark only (ignored for XTTS2)
    waveform_temp=0.7,  # Bark only (ignored for XTTS2)
    seed=-1,            # Bark only (ignored for XTTS2)
    api_name="/synthesize"
)
print(result)  # (audio_path, status)

# Bark (preset voice) with temperature control
result = client.predict(
    text="Hello! [laughter] This is Bark speaking.",
    model_choice="Bark (Preset Voices)",
    reference_audio=None,
    language="English",
    speed=1.0,
    voice_preset="v2/en_speaker_6",
    text_temp=0.7,      # Semantic temperature (0.1-1.0)
    waveform_temp=0.7,  # Audio waveform temperature (0.1-1.0)
    seed=42,            # Set seed for reproducibility (-1 for random)
    api_name="/synthesize"
)
print(result)
```
### REST API (curl)
```bash
# XTTS2 with voice cloning
curl -X POST "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Hello world",
      "XTTS2 (Voice Cloning)",
      {"path": "https://example.com/voice.wav"},
      "English",
      1.0,
      "v2/en_speaker_6",
      0.7,
      0.7,
      -1
    ]
  }'

# The POST returns an event ID; stream the result with a follow-up GET:
# curl -N "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize/$EVENT_ID"

# Bark with preset voice and temperature control
curl -X POST "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Hello [laughter] world",
      "Bark (Preset Voices)",
      null,
      "English",
      1.0,
      "v2/en_speaker_3",
      0.7,
      0.7,
      42
    ]
  }'
```
### MCP (Model Context Protocol)
This Space exposes its `synthesize` endpoint as an MCP tool, so AI assistants can call it directly.
**Tool schema:**
```json
{
  "name": "synthesize",
  "parameters": {
    "text": {"type": "string", "description": "Text to synthesize"},
    "model_choice": {"type": "string", "enum": ["XTTS2 (Voice Cloning)", "Bark (Preset Voices)"]},
    "reference_audio": {"type": "file", "description": "Reference audio for XTTS2 (optional for Bark)"},
    "language": {"type": "string", "default": "English"},
    "speed": {"type": "number", "default": 1.0},
    "voice_preset": {"type": "string", "default": "v2/en_speaker_6"},
    "text_temp": {"type": "number", "default": 0.7, "description": "Bark text/semantic temperature (0.1-1.0)"},
    "waveform_temp": {"type": "number", "default": 0.7, "description": "Bark waveform temperature (0.1-1.0)"},
    "seed": {"type": "integer", "default": -1, "description": "Bark seed for reproducibility (-1 for random)"}
  },
  "returns": ["audio", "string"]
}
```
**MCP Config:**
```json
{
  "mcpServers": {
    "tts-hub": {"url": "https://luminia-xtts2-bark.hf.space/gradio_api/mcp/"}
  }
}
```
---
## CLI Usage
```bash
# XTTS2 voice cloning
python app.py tts -t "Hello world" -o output.wav -m xtts2 -r voice_sample.wav -l English -s 1.0
# Bark preset voice (basic)
python app.py tts -t "Hello [laughter] world" -o output.wav -m bark -v "v2/en_speaker_6"
# Bark with temperature control and seed
python app.py tts -t "Hello world" -o output.wav -m bark -v "v2/en_speaker_6" \
  --text-temp 0.7 --waveform-temp 0.7 --seed 42
```
## Bark Voice Presets
| Preset | Language |
|--------|----------|
| `v2/en_speaker_0` - `v2/en_speaker_9` | English |
| `v2/de_speaker_0` - `v2/de_speaker_2` | German |
| `v2/fr_speaker_0` - `v2/fr_speaker_1` | French |
| `v2/es_speaker_0` - `v2/es_speaker_1` | Spanish |
| `v2/zh_speaker_0` - `v2/zh_speaker_1` | Chinese |
| `v2/ja_speaker_0` | Japanese |
| `v2/ko_speaker_0` | Korean |
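The preset names follow a regular pattern, so the full list can be generated for scripting (e.g. sweeping voices via the API). A sketch, where the per-language counts mirror the table above — this Space's curated subset, not necessarily everything upstream Bark ships:

```python
# Per-language speaker counts, mirroring the preset table above.
PRESET_COUNTS = {"en": 10, "de": 3, "fr": 2, "es": 2, "zh": 2, "ja": 1, "ko": 1}

def bark_presets() -> list[str]:
    """Expand the per-language counts into preset names like 'v2/en_speaker_0'."""
    return [
        f"v2/{lang}_speaker_{i}"
        for lang, count in PRESET_COUNTS.items()
        for i in range(count)
    ]

presets = bark_presets()
print(len(presets))             # 21
print(presets[0], presets[-1])  # v2/en_speaker_0 v2/ko_speaker_0
```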
## Bark Temperature Guide
| Setting | Low (0.1-0.3) | Medium (0.5-0.7) | High (0.8-1.0) |
|---------|---------------|------------------|----------------|
| **Text Temp** | More predictable, robotic | Natural, balanced | Creative, variable |
| **Waveform Temp** | Cleaner audio | Natural variation | More expressive |
**Recommended:** Start with 0.7 for both temperatures for natural-sounding speech.
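For scripted use, the guide's three rows can be collapsed into named settings. This helper and its style names are purely illustrative (not part of the Space's API); it just returns keyword arguments matching the `client.predict()` signature shown earlier:

```python
# Illustrative (text_temp, waveform_temp) pairs derived from the guide above.
TEMP_PRESETS = {
    "predictable": (0.2, 0.2),  # low: stable and robotic, cleaner audio
    "natural":     (0.7, 0.7),  # medium: recommended starting point
    "expressive":  (0.9, 0.9),  # high: creative and variable
}

def bark_temps(style: str = "natural") -> dict[str, float]:
    """Return temperature keyword arguments for client.predict() by style name."""
    text_temp, waveform_temp = TEMP_PRESETS[style]
    return {"text_temp": text_temp, "waveform_temp": waveform_temp}

print(bark_temps("expressive"))  # {'text_temp': 0.9, 'waveform_temp': 0.9}
```

Usage: `client.predict(text="...", model_choice="Bark (Preset Voices)", **bark_temps("natural"), ...)`.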
---
## Credits
- **XTTS2:** [Coqui TTS](https://github.com/idiap/coqui-ai-TTS) (Apache 2.0)
- **Bark:** [Suno AI](https://github.com/suno-ai/bark) (MIT)
Licensed under Apache 2.0.