---
title: xtts2 + Bark TTS
emoji: 🎙️
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - text-to-speech
  - voice-cloning
  - xtts
  - bark
  - mcp-server
short_description: XTTS2 voice cloning + Bark TTS in one space
---

# TTS Hub: XTTS2 + Bark

Two powerful TTS models in one space, optimized for CPU.

## Models

| Model | Voice Source | Languages | Special Features |
|---|---|---|---|
| XTTS2 (default) | Your audio sample | 16 languages | Voice cloning |
| Bark | Preset voices | EN, DE, FR, ES, ZH, JA, KO | Non-speech sounds, temperature control |

## Usage

### XTTS2 (Voice Cloning)

  1. Upload 3-30 seconds of reference voice audio
  2. Enter text to synthesize
  3. Select language and speed
  4. Click "Generate Speech"

### Bark (Preset Voices)

  1. Select "Bark (Preset Voices)"
  2. Choose a voice preset (e.g., v2/en_speaker_6)
  3. Adjust temperature controls (optional):
     - Text Temperature (0.1-1.0): controls semantic variation
     - Waveform Temperature (0.1-1.0): controls audio variation
  4. Set seed for reproducibility (optional, -1 for random)
  5. Enter text with optional special tokens
  6. Click "Generate Speech"

Bark special tokens:

  - `[laughter]` `[laughs]` `[sighs]` `[music]` `[gasps]` `[clears throat]`
  - `♪ la la la ♪` for singing
  - `MAN:` / `WOMAN:` for speaker labels

Long text handling: Text is automatically split into chunks and processed sequentially with natural pauses between segments.
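
The chunking behavior described above can be sketched roughly as follows. This is an illustrative sketch only — the function name, the sentence-splitting regex, and the 200-character limit are assumptions, not the Space's actual implementation:

```python
import re

def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence boundaries into chunks of at most max_chars.

    Illustrative sketch; the Space's real splitter may differ.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the audio segments concatenated with a short pause in between.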


## API

### Python Client

```python
from gradio_client import Client, handle_file

client = Client("Luminia/xtts2-Bark")

# XTTS2 (voice cloning)
result = client.predict(
    text="Hello, this is a voice cloning test.",
    model_choice="XTTS2 (Voice Cloning)",
    reference_audio=handle_file("voice_sample.wav"),
    language="English",
    speed=1.0,
    voice_preset="v2/en_speaker_6",
    text_temp=0.7,       # Bark only (ignored for XTTS2)
    waveform_temp=0.7,   # Bark only (ignored for XTTS2)
    seed=-1,             # Bark only (ignored for XTTS2)
    api_name="/synthesize"
)
print(result)  # (audio_path, status)

# Bark (preset voice) with temperature control
result = client.predict(
    text="Hello! [laughter] This is Bark speaking.",
    model_choice="Bark (Preset Voices)",
    reference_audio=None,
    language="English",
    speed=1.0,
    voice_preset="v2/en_speaker_6",
    text_temp=0.7,       # Semantic temperature (0.1-1.0)
    waveform_temp=0.7,   # Audio waveform temperature (0.1-1.0)
    seed=42,             # Set seed for reproducibility (-1 for random)
    api_name="/synthesize"
)
print(result)
```
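
A small client-side check of the Bark-only parameters can catch bad values before the network round-trip. The helper below is a hypothetical convenience for illustration, not part of the Space's API; it enforces the documented ranges:

```python
def validate_bark_params(text_temp: float, waveform_temp: float, seed: int) -> None:
    """Raise ValueError for parameters outside the documented ranges.

    Hypothetical helper; the Space itself may clamp or reject values differently.
    """
    for name, value in (("text_temp", text_temp), ("waveform_temp", waveform_temp)):
        if not 0.1 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.1, 1.0], got {value}")
    if seed != -1 and seed < 0:
        raise ValueError("seed must be -1 (random) or a non-negative integer")
```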

### REST API (curl)

```bash
# XTTS2 with voice cloning
curl -X POST "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Hello world",
      "XTTS2 (Voice Cloning)",
      {"path": "https://example.com/voice.wav"},
      "English",
      1.0,
      "v2/en_speaker_6",
      0.7,
      0.7,
      -1
    ]
  }'

# Bark with preset voice and temperature control
curl -X POST "https://luminia-xtts2-bark.hf.space/gradio_api/call/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Hello [laughter] world",
      "Bark (Preset Voices)",
      null,
      "English",
      1.0,
      "v2/en_speaker_3",
      0.7,
      0.7,
      42
    ]
  }'
```

Note: each `POST` returns an `event_id`; following the standard Gradio `/call` flow, fetch the result with a second `GET` request to the same URL with the `event_id` appended as a path segment.

## MCP (Model Context Protocol)

This Space supports MCP for AI assistants.

Tool schema:

```json
{
  "name": "synthesize",
  "parameters": {
    "text": {"type": "string", "description": "Text to synthesize"},
    "model_choice": {"type": "string", "enum": ["XTTS2 (Voice Cloning)", "Bark (Preset Voices)"]},
    "reference_audio": {"type": "file", "description": "Reference audio for XTTS2 (optional for Bark)"},
    "language": {"type": "string", "default": "English"},
    "speed": {"type": "number", "default": 1.0},
    "voice_preset": {"type": "string", "default": "v2/en_speaker_6"},
    "text_temp": {"type": "number", "default": 0.7, "description": "Bark text/semantic temperature (0.1-1.0)"},
    "waveform_temp": {"type": "number", "default": 0.7, "description": "Bark waveform temperature (0.1-1.0)"},
    "seed": {"type": "integer", "default": -1, "description": "Bark seed for reproducibility (-1 for random)"}
  },
  "returns": ["audio", "string"]
}
```

MCP Config:

```json
{
  "mcpServers": {
    "tts-hub": {"url": "https://luminia-xtts2-bark.hf.space/gradio_api/mcp/"}
  }
}
```

## CLI Usage

```bash
# XTTS2 voice cloning
python app.py tts -t "Hello world" -o output.wav -m xtts2 -r voice_sample.wav -l English -s 1.0

# Bark preset voice (basic)
python app.py tts -t "Hello [laughter] world" -o output.wav -m bark -v "v2/en_speaker_6"

# Bark with temperature control and seed
python app.py tts -t "Hello world" -o output.wav -m bark -v "v2/en_speaker_6" \
  --text-temp 0.7 --waveform-temp 0.7 --seed 42
```

## Bark Voice Presets

| Preset | Language |
|---|---|
| `v2/en_speaker_0` - `v2/en_speaker_9` | English |
| `v2/de_speaker_0` - `v2/de_speaker_2` | German |
| `v2/fr_speaker_0` - `v2/fr_speaker_1` | French |
| `v2/es_speaker_0` - `v2/es_speaker_1` | Spanish |
| `v2/zh_speaker_0` - `v2/zh_speaker_1` | Chinese |
| `v2/ja_speaker_0` | Japanese |
| `v2/ko_speaker_0` | Korean |

## Bark Temperature Guide

| Setting | Low (0.1-0.3) | Medium (0.5-0.7) | High (0.8-1.0) |
|---|---|---|---|
| Text Temp | More predictable, robotic | Natural, balanced | Creative, variable |
| Waveform Temp | Cleaner audio | Natural variation | More expressive |

Recommended: Start with 0.7 for both temperatures for natural-sounding speech.
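
Seed handling of the kind described above (`-1` for random) is commonly implemented along these lines. The helper below is a standard-library sketch, not the Space's actual code — a real Bark setup would additionally seed `numpy` and call `torch.manual_seed`:

```python
import random

def resolve_seed(seed: int) -> int:
    """Return a concrete seed: pass through non-negative seeds, draw one for -1."""
    if seed == -1:
        seed = random.randrange(2**32)
    # A real implementation would also seed numpy and torch here for full
    # reproducibility of Bark's sampling.
    random.seed(seed)
    return seed
```

Calling it twice with the same non-negative seed reproduces the same random stream, which is what makes generations repeatable.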


## Credits

Licensed under Apache 2.0.