Peter Michael Gits

TTS WebSocket Service v1.0.0

Standalone WebSocket-only Text-to-Speech service for VoiceCal integration.

Features

  • ✅ WebSocket-only TTS interface (/ws/tts)
  • ✅ ZeroGPU Bark TTS integration
  • ✅ FastAPI-based architecture
  • ✅ Multiple voice presets (10 speakers)
  • ✅ Streaming TTS support (unmute.sh methodology)
  • ✅ No Gradio dependencies
  • ✅ No MCP dependencies
  • ✅ Standalone deployment ready
  • ✅ Base64 audio transmission
  • ✅ WAV audio format output

Quick Start

Using the WebSocket Server

# Install dependencies
pip install -r requirements-websocket.txt

# Run standalone WebSocket server
python3 websocket_tts_server.py

Docker Deployment

# Build WebSocket-only image
docker build -f Dockerfile-websocket -t tts-websocket-service .

# Run container
docker run -p 7860:7860 tts-websocket-service

API Endpoints

WebSocket: /ws/tts

Connection Confirmation:

{
  "type": "tts_connection_confirmed",
  "client_id": "uuid",
  "service": "TTS WebSocket Service", 
  "version": "1.0.0",
  "available_voices": [
    "v2/en_speaker_0", "v2/en_speaker_1", "v2/en_speaker_2",
    "v2/en_speaker_3", "v2/en_speaker_4", "v2/en_speaker_5",
    "v2/en_speaker_6", "v2/en_speaker_7", "v2/en_speaker_8", 
    "v2/en_speaker_9"
  ],
  "device": "cuda",
  "message": "TTS WebSocket connected and ready"
}
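
A client can use this first message to confirm the service is ready and to check a requested voice before sending any text. The sketch below assumes the confirmation arrives as JSON text in the shape shown above. The helper names `parse_confirmation` and `pick_voice`, and the fallback to the documented default preset, are illustrative choices rather than part of the service.

```python
import json

DEFAULT_VOICE = "v2/en_speaker_6"  # documented default preset


def parse_confirmation(raw: str) -> dict:
    """Parse the first server message and check it is a connection confirmation."""
    msg = json.loads(raw)
    if msg.get("type") != "tts_connection_confirmed":
        raise ValueError(f"unexpected first message type: {msg.get('type')!r}")
    return msg


def pick_voice(confirmation: dict, requested: str) -> str:
    """Return the requested preset if the server advertises it, else the default."""
    voices = confirmation.get("available_voices", [])
    return requested if requested in voices else DEFAULT_VOICE
```

Falling back silently is only one option; a stricter client might raise an error when the requested preset is missing from `available_voices`.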

Single Synthesis Request:

{
  "type": "tts_synthesize",
  "text": "Hello, how are you today?",
  "voice_preset": "v2/en_speaker_6"
}

Streaming Synthesis (unmute.sh methodology):

{
  "type": "tts_streaming_text",
  "text_chunks": ["Hello", "how are you", "today?"],
  "voice_preset": "v2/en_speaker_6",
  "is_final": true
}
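
This README specifies only the shape of the streaming message, not how text should be divided into `text_chunks`. As a client-side sketch, the helper below splits text after sentence and clause punctuation and wraps the pieces in a `tts_streaming_text` message. The splitting rule and both function names are assumptions made for illustration.

```python
import json
import re
from typing import List


def split_into_chunks(text: str) -> List[str]:
    """Split after ., !, ?, comma or semicolon, dropping empty pieces."""
    parts = re.split(r"(?<=[.!?,;])\s+", text.strip())
    return [p for p in parts if p]


def build_streaming_message(text: str, voice_preset: str = "v2/en_speaker_6",
                            is_final: bool = True) -> str:
    """Serialize a tts_streaming_text request with the fields shown above."""
    return json.dumps({
        "type": "tts_streaming_text",
        "text_chunks": split_into_chunks(text),
        "voice_preset": voice_preset,
        "is_final": is_final,
    })
```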

Synthesis Result:

{
  "type": "tts_synthesis_complete",
  "client_id": "uuid",
  "audio_data": "base64_encoded_wav_audio",
  "audio_format": "wav",
  "text": "Hello, how are you today?",
  "voice_preset": "v2/en_speaker_6",
  "audio_size": 12345,
  "timing": {
    "processing_time": 2.34,
    "device": "cuda"
  },
  "status": "success"
}
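
To play or store the result, a client has to decode `audio_data` from base64 back into WAV bytes. The sketch below shows one way to do that with the Python standard library. It assumes the service emits PCM-encoded WAV; the `wave` module rejects float-encoded files, so `wav_duration_seconds` may not apply if the server writes float samples. Both function names are hypothetical.

```python
import base64
import io
import wave


def decode_synthesis_result(msg: dict) -> bytes:
    """Return the WAV bytes carried by a successful tts_synthesis_complete message."""
    if msg.get("type") != "tts_synthesis_complete" or msg.get("status") != "success":
        raise ValueError(f"not a successful synthesis result: {msg.get('type')!r}")
    return base64.b64decode(msg["audio_data"])


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Read the WAV header to compute the clip length in seconds (PCM only)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

The decoded bytes can be written straight to a `.wav` file opened in binary mode.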

HTTP: /health

{
  "service": "TTS WebSocket Service",
  "version": "1.0.0", 
  "status": "healthy",
  "model_loaded": true,
  "active_connections": 1,
  "available_voices": 10,
  "device": "cuda"
}
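
A deployment script or orchestrator can poll this endpoint before routing traffic to the service. The following is a minimal sketch using only the standard library. It treats the service as ready when `status` is `"healthy"` and `model_loaded` is true, which is an assumption about how these fields should be interpreted; the function names are illustrative.

```python
import json
import urllib.request


def is_ready(health: dict) -> bool:
    """True when the payload reports a healthy service with the model loaded."""
    return health.get("status") == "healthy" and health.get("model_loaded") is True


def fetch_health(url: str = "http://localhost:7860/health", timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running locally, `is_ready(fetch_health())` combines the two.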

Port Configuration

  • Default Port: 7860 (Hugging Face Spaces standard port)
  • WebSocket Endpoint: ws://localhost:7860/ws/tts
  • Health Check: http://localhost:7860/health
  • Note: Each Hugging Face Space runs in isolation behind its own address, so the STT and TTS Spaces can both listen on port 7860 without conflict

Voice Presets

Available voice presets:

  • v2/en_speaker_0 - Voice 0
  • v2/en_speaker_1 - Voice 1
  • v2/en_speaker_2 - Voice 2
  • v2/en_speaker_3 - Voice 3
  • v2/en_speaker_4 - Voice 4
  • v2/en_speaker_5 - Voice 5
  • v2/en_speaker_6 - Voice 6 (default)
  • v2/en_speaker_7 - Voice 7
  • v2/en_speaker_8 - Voice 8
  • v2/en_speaker_9 - Voice 9

Architecture

This service drops the components that WebSocket TTS does not need and adds direct endpoints in their place:

  • Removed: Gradio web interface
  • Removed: MCP protocol support
  • Removed: Complex routing
  • Added: Direct FastAPI WebSocket endpoints
  • Added: Streaming TTS support
  • Added: ZeroGPU optimized synthesis

Integration

Connect from the VoiceCal WebRTC interface:

const ws = new WebSocket('ws://localhost:7860/ws/tts');

// Wait for the connection to open; calling send() earlier throws InvalidStateError
ws.addEventListener('open', () => {
  // Send text for synthesis
  ws.send(JSON.stringify({
    type: "tts_synthesize",
    text: "Hello world",
    voice_preset: "v2/en_speaker_6"
  }));

  // Streaming synthesis (unmute.sh pattern)
  ws.send(JSON.stringify({
    type: "tts_streaming_text",
    text_chunks: ["Hello", "world"],
    voice_preset: "v2/en_speaker_6",
    is_final: true
  }));
});
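
For callers outside the browser, the same exchange can be sketched in Python. This example assumes the third-party `websockets` package is installed (it is not necessarily listed in requirements-websocket.txt) and that the server sends the connection confirmation before anything else. The function names are hypothetical, and message types not documented in this README are simply skipped.

```python
import base64
import json


def build_synthesize_message(text: str, voice_preset: str = "v2/en_speaker_6") -> str:
    """Serialize a tts_synthesize request with the documented fields."""
    return json.dumps({"type": "tts_synthesize", "text": text, "voice_preset": voice_preset})


async def synthesize(text: str, uri: str = "ws://localhost:7860/ws/tts") -> bytes:
    """Connect, request one synthesis, and return the decoded WAV bytes."""
    # Imported here so the helper above works without the third-party package.
    import websockets

    # max_size=None lifts the library's default 1 MiB message limit,
    # which a base64-encoded audio payload could exceed.
    async with websockets.connect(uri, max_size=None) as ws:
        await ws.recv()  # first message: tts_connection_confirmed
        await ws.send(build_synthesize_message(text))
        while True:
            msg = json.loads(await ws.recv())
            # Skip any message types this README does not document.
            if msg.get("type") == "tts_synthesis_complete":
                return base64.b64decode(msg["audio_data"])
```

With the server running, `asyncio.run(synthesize("Hello world"))` returns bytes that can be written to a `.wav` file.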