---
title: xtts2 + Bark TTS
emoji: 🎙️
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - text-to-speech
  - voice-cloning
  - xtts
  - bark
  - mcp-server
short_description: XTTS2 voice cloning + Bark TTS in one space
---
# TTS Hub: XTTS2 + Bark

Two powerful TTS models in one Space, optimized for CPU.

## Models
| Model | Voice Source | Languages | Special Features |
|---|---|---|---|
| XTTS2 (default) | Your audio sample | 16 languages | Voice cloning |
| Bark | Preset voices | EN, DE, FR, ES, ZH, JA, KO | Non-speech sounds, temperature control |
## Usage

### XTTS2 (Voice Cloning)

- Upload 3-30 seconds of reference audio of the voice to clone
- Enter text to synthesize
- Select language and speed
- Click "Generate Speech"
### Bark (Preset Voices)

- Select "Bark (Preset Voices)"
- Choose a voice preset (e.g., `v2/en_speaker_6`)
- Adjust temperature controls (optional):
  - Text Temperature (0.1-1.0): controls semantic variation
  - Waveform Temperature (0.1-1.0): controls audio variation
- Set seed for reproducibility (optional, -1 for random)
- Enter text with optional special tokens
- Click "Generate Speech"
**Bark special tokens:** `[laughter]`, `[laughs]`, `[sighs]`, `[music]`, `[gasps]`, `[clears throat]`; `♪ la la la ♪` for singing; `MAN:` / `WOMAN:` for speaker labels.
**Long text handling:** Text is automatically split into chunks and processed sequentially, with natural pauses between segments.
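The chunking behavior described above can be sketched as follows. This is a minimal illustration, not the Space's actual implementation; the sentence-boundary splitting and the 200-character budget are assumptions.

```python
import re

def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence boundaries into chunks of at most max_chars."""
    # Break on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized in order, with a short silence inserted between the resulting audio segments before concatenation.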
## API

### Python Client

```python
from gradio_client import Client, handle_file

client = Client("Luminia/xtts2-Bark")

# XTTS2 (voice cloning)
result = client.predict(
    text="Hello, this is a voice cloning test.",
    model_choice="XTTS2 (Voice Cloning)",
    reference_audio=handle_file("voice_sample.wav"),
    language="English",
    speed=1.0,
    voice_preset="v2/en_speaker_6",
    text_temp=0.7,      # Bark only (ignored for XTTS2)
    waveform_temp=0.7,  # Bark only (ignored for XTTS2)
    seed=-1,            # Bark only (ignored for XTTS2)
    api_name="/synthesize"
)
print(result)  # (audio_path, status)

# Bark (preset voice) with temperature control
result = client.predict(
    text="Hello! [laughter] This is Bark speaking.",
    model_choice="Bark (Preset Voices)",
    reference_audio=None,
    language="English",
    speed=1.0,
    voice_preset="v2/en_speaker_6",
    text_temp=0.7,      # semantic temperature (0.1-1.0)
    waveform_temp=0.7,  # audio waveform temperature (0.1-1.0)
    seed=42,            # set seed for reproducibility (-1 for random)
    api_name="/synthesize"
)
print(result)
```
### REST API (curl)

```shell
# XTTS2 with voice cloning
curl -X POST "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Hello world",
      "XTTS2 (Voice Cloning)",
      {"path": "https://example.com/voice.wav"},
      "English",
      1.0,
      "v2/en_speaker_6",
      0.7,
      0.7,
      -1
    ]
  }'

# Bark with preset voice and temperature control
curl -X POST "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Hello [laughter] world",
      "Bark (Preset Voices)",
      null,
      "English",
      1.0,
      "v2/en_speaker_3",
      0.7,
      0.7,
      42
    ]
  }'
```
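Note that the POST above does not return the audio directly: Gradio's REST flow is two-step. The POST responds with a JSON body containing an event id, and a second GET to the same endpoint path streams the result as server-sent events. A minimal sketch (the sample event id is illustrative only):

```shell
# Step 1's POST returns a JSON body like {"event_id": "..."} (sample shown here).
RESPONSE='{"event_id":"abc123"}'
EVENT_ID=$(printf '%s' "$RESPONSE" | sed -E 's/.*"event_id" *: *"([^"]+)".*/\1/')
echo "$EVENT_ID"

# Step 2: stream the synthesis result (server-sent events) with a GET:
# curl -N "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize/$EVENT_ID"
```

The streamed events include the output file path once synthesis completes.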
## MCP (Model Context Protocol)

This Space supports MCP for AI assistants.

Tool schema:

```json
{
  "name": "synthesize",
  "parameters": {
    "text": {"type": "string", "description": "Text to synthesize"},
    "model_choice": {"type": "string", "enum": ["XTTS2 (Voice Cloning)", "Bark (Preset Voices)"]},
    "reference_audio": {"type": "file", "description": "Reference audio for XTTS2 (optional for Bark)"},
    "language": {"type": "string", "default": "English"},
    "speed": {"type": "number", "default": 1.0},
    "voice_preset": {"type": "string", "default": "v2/en_speaker_6"},
    "text_temp": {"type": "number", "default": 0.7, "description": "Bark text/semantic temperature (0.1-1.0)"},
    "waveform_temp": {"type": "number", "default": 0.7, "description": "Bark waveform temperature (0.1-1.0)"},
    "seed": {"type": "integer", "default": -1, "description": "Bark seed for reproducibility (-1 for random)"}
  },
  "returns": ["audio", "string"]
}
```

MCP config:

```json
{
  "mcpServers": {
    "tts-hub": {"url": "https://luminia-xtts2-bark.hf.space/gradio_api/mcp/"}
  }
}
```
## CLI Usage

```shell
# XTTS2 voice cloning
python app.py tts -t "Hello world" -o output.wav -m xtts2 -r voice_sample.wav -l English -s 1.0

# Bark preset voice (basic)
python app.py tts -t "Hello [laughter] world" -o output.wav -m bark -v "v2/en_speaker_6"

# Bark with temperature control and seed
python app.py tts -t "Hello world" -o output.wav -m bark -v "v2/en_speaker_6" \
  --text-temp 0.7 --waveform-temp 0.7 --seed 42
```
## Bark Voice Presets

| Preset | Language |
|---|---|
| `v2/en_speaker_0` - `v2/en_speaker_9` | English |
| `v2/de_speaker_0` - `v2/de_speaker_2` | German |
| `v2/fr_speaker_0` - `v2/fr_speaker_1` | French |
| `v2/es_speaker_0` - `v2/es_speaker_1` | Spanish |
| `v2/zh_speaker_0` - `v2/zh_speaker_1` | Chinese |
| `v2/ja_speaker_0` | Japanese |
| `v2/ko_speaker_0` | Korean |
## Bark Temperature Guide
| Setting | Low (0.1-0.3) | Medium (0.5-0.7) | High (0.8-1.0) |
|---|---|---|---|
| Text Temp | More predictable, robotic | Natural, balanced | Creative, variable |
| Waveform Temp | Cleaner audio | Natural variation | More expressive |
**Recommended:** Start with 0.7 for both temperatures for natural-sounding speech.
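The seed convention from the table above (-1 means "pick a fresh random seed"; any other value makes generation repeatable) can be sketched like this. `generate` is a hypothetical stand-in for the Bark sampling step, not the Space's actual code.

```python
import random

def resolve_seed(seed: int) -> int:
    """-1 requests a fresh random seed; any other value is used as-is."""
    if seed == -1:
        return random.randrange(2**32)
    return seed

def generate(text: str, seed: int) -> list[float]:
    """Hypothetical sampler: a seeded RNG stands in for Bark's sampling."""
    rng = random.Random(resolve_seed(seed))
    return [rng.random() for _ in range(4)]  # placeholder "waveform"

# The same explicit seed reproduces the same output.
assert generate("hi", seed=42) == generate("hi", seed=42)
```

In practice, reusing a seed only reproduces output when the text, preset, and both temperatures are also unchanged.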
## Credits

Licensed under Apache 2.0.