title: Demo Voice Agent Data Eyond
emoji: 🌍
colorFrom: pink
colorTo: pink
sdk: docker
pinned: true

Voice Agent Service

Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It receives an audio stream from the client, detects the wake word, and streams synthesized speech back.

Current version: Phase 1 (Echo Mode). The text following the wake word is echoed straight back through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.

Requirements

  • Python 3.11+
  • uv
  • Deepgram API key
  • Cartesia API key + Voice ID

Setup

1. Clone & install dependencies

uv sync

2. Configure environment

cp .env.example .env

Edit .env and fill in the API keys:

DEEPGRAM_API_KEY=your_key
CARTESIA_API_KEY=your_key
CARTESIA_VOICE_ID=your_voice_id

Optional configuration:

CARTESIA_MODEL=sonic-3               # Default: sonic-3
DEEPGRAM_LANGUAGE=id                 # Default: id (Indonesian)
DEEPGRAM_ENDPOINTING_MS=300          # Default: 300ms
DEEPGRAM_UTTERANCE_END_MS=2000       # Default: 2000ms
SAMPLE_RATE=16000                    # Default: 16000 Hz
WAKE_WORD=Hai EMA                    # Default: "Hai EMA"
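
As a rough illustration of how these variables and their defaults could be read in Python, here is a minimal sketch. The names `Settings` and `load_settings` are hypothetical and chosen for this example only; the project's actual src/config.py may be organized differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Required credentials (empty string when unset).
    deepgram_api_key: str
    cartesia_api_key: str
    cartesia_voice_id: str
    # Optional values, using the defaults documented above.
    cartesia_model: str
    deepgram_language: str
    deepgram_endpointing_ms: int
    deepgram_utterance_end_ms: int
    sample_rate: int
    wake_word: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        deepgram_api_key=env.get("DEEPGRAM_API_KEY", ""),
        cartesia_api_key=env.get("CARTESIA_API_KEY", ""),
        cartesia_voice_id=env.get("CARTESIA_VOICE_ID", ""),
        cartesia_model=env.get("CARTESIA_MODEL", "sonic-3"),
        deepgram_language=env.get("DEEPGRAM_LANGUAGE", "id"),
        deepgram_endpointing_ms=int(env.get("DEEPGRAM_ENDPOINTING_MS", "300")),
        deepgram_utterance_end_ms=int(env.get("DEEPGRAM_UTTERANCE_END_MS", "2000")),
        sample_rate=int(env.get("SAMPLE_RATE", "16000")),
        wake_word=env.get("WAKE_WORD", "Hai EMA"),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in isolation, for example `load_settings({}).sample_rate`.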

Run

`uv run uvicorn main:app --host 0.0.0.0 --port 7861`
or
`uv run uvicorn main:app --host 0.0.0.0 --port 7861 --reload`

The server will run at http://localhost:7861.

Test

Health check:

curl http://localhost:7861/health

Expected response:

{
  "status": "ok",
  "version": "1.1.0",
  "stt_ready": true,
  "tts_ready": true
}

A degraded status (HTTP 503) is returned if the API keys are incomplete.

WebSocket test: send a WAV audio file and receive the TTS response:
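
The relationship between the API keys and the health response could look roughly like the sketch below. This is an assumption for illustration, not the service's actual handler: the function name `health_status`, the rule that `tts_ready` needs both the Cartesia key and the voice ID, and the shape of the degraded body are all hypothetical.

```python
def health_status(env):
    """Return (http_status, body) for a health check based on configured keys."""
    # Assumption: STT is ready when the Deepgram key is present.
    stt_ready = bool(env.get("DEEPGRAM_API_KEY"))
    # Assumption: TTS needs both the Cartesia key and a voice ID.
    tts_ready = bool(env.get("CARTESIA_API_KEY")) and bool(env.get("CARTESIA_VOICE_ID"))
    ok = stt_ready and tts_ready
    body = {
        "status": "ok" if ok else "degraded",
        "version": "1.1.0",
        "stt_ready": stt_ready,
        "tts_ready": tts_ready,
    }
    return (200 if ok else 503), body
```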

uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav

The WAV file must be 16 kHz, 16-bit, mono PCM.

Specific tests:
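
To check a file against this format before sending it, something like the following sketch could be used. It relies only on Python's standard `wave` module; the helper name `check_wav` is hypothetical and not part of test_client.py.

```python
import wave


def check_wav(source) -> list[str]:
    """Return a list of format problems; an empty list means the file is usable.

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels found, expected mono")
    return problems
```

Note that `wave.open` itself raises `wave.Error` for files it cannot parse, such as compressed WAV variants, so callers may want to catch that as well.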

uv run python test_client.py --test health      # Health check
uv run python test_client.py --test ping        # Heartbeat ping/pong
uv run python test_client.py --test interrupt   # Cancel ongoing TTS
uv run python test_client.py --test stop        # Graceful disconnect

Connectivity check (no audio file needed):

uv run python test_client.py

Convert M4A audio to WAV:

uv run python convert_audio.py                     # Convert all files in playground/mp4/
uv run python convert_audio.py path/to/file.m4a   # Convert a single file

Docker

Build:

docker build -t voice-agent .

Run:

docker run -p 7861:7861 --env-file .env voice-agent

Wake Word

Default wake word: "Hai EMA" (Indonesian, case-insensitive)

Example: say "Hai EMA, apa kabar?" → the agent replies with TTS "apa kabar".

It can be configured via the WAKE_WORD environment variable.
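
The echo behavior described above can be sketched as a small text-matching step. This is a minimal illustration under stated assumptions, not the project's pipeline code: the function name `extract_command` is hypothetical, and stripping punctuation around the remaining text is inferred from the "apa kabar" example.

```python
import re


def extract_command(transcript: str, wake_word: str = "Hai EMA"):
    """Return the text after the wake word, or None if the wake word is absent."""
    # Match the wake word case-insensitively, allowing any whitespace between its words.
    pattern = r"\s+".join(re.escape(part) for part in wake_word.split())
    match = re.search(pattern, transcript, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop separators right after the wake word and any trailing punctuation.
    return transcript[match.end():].strip(" \t\n,.!?:;")
```

With this sketch, `extract_command("Hai EMA, apa kabar?")` returns `"apa kabar"`, and a transcript without the wake word returns `None`.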

Architecture

Current flow (Phase 1: Echo)

Client Audio Stream (PCM 16kHz 16-bit mono)
    ↓
Deepgram STT (nova-2, real-time streaming)
    ↓
Wake Word Detection
    ↓
Echo Response
    ↓
Cartesia TTS (streaming chunks)
    ↓
Client Audio Playback
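
On the client side, the first step of this flow means cutting raw PCM into chunks before sending them. At 16 kHz, 16-bit, mono, one second of audio is 16000 × 2 × 1 = 32000 bytes. The sketch below splits a buffer into fixed-duration chunks; the helper name `chunk_pcm` and the 100 ms chunk length are assumptions for illustration, since the required chunk size is not stated here.

```python
def chunk_pcm(pcm: bytes, chunk_ms: int = 100, sample_rate: int = 16000,
              sample_width: int = 2, channels: int = 1):
    """Yield consecutive chunks of raw PCM, each covering chunk_ms of audio."""
    bytes_per_second = sample_rate * sample_width * channels
    chunk_size = bytes_per_second * chunk_ms // 1000
    # Keep chunks aligned to whole frames so a sample is never split in two.
    frame_size = sample_width * channels
    chunk_size -= chunk_size % frame_size
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this audio format")
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]
```

Each yielded chunk would then be sent as one binary WebSocket message; the last chunk may be shorter than the others.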

Planned flow (Phase 2: LLM + RAG)

Client Audio Stream
    ↓
Deepgram STT
    ↓
Wake Word Detection
    ↓
PDF Knowledge Base Retrieval (not yet implemented)
    ↓
LLM Answer Generation (not yet implemented)
    ↓
Cartesia TTS
    ↓
Client Audio Playback

WebSocket Protocol

Endpoint: ws://localhost:7861/ws/voice

Client → Server:

| Type | Format | Description |
| --- | --- | --- |
| Binary | PCM audio chunk | Audio, 16 kHz, 16-bit, mono |
| Text | {"action": "ping"} | Heartbeat keep-alive |
| Text | {"action": "stop"} | Graceful disconnect |
| Text | {"action": "interrupt"} | Cancel ongoing TTS |
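
The text messages a client sends are small JSON objects, so they can be built with the standard `json` module. The helper below is a sketch for illustration; `control_message` and `CLIENT_ACTIONS` are hypothetical names, not part of test_client.py.

```python
import json

# The three client actions listed for the client-to-server direction.
CLIENT_ACTIONS = ("ping", "stop", "interrupt")


def control_message(action: str) -> str:
    """Serialize a client control message such as {"action": "ping"}."""
    if action not in CLIENT_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return json.dumps({"action": action})
```

The returned string would be sent as a text frame, while audio goes out as binary frames.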

Server → Client:

| Type | Format | Description |
| --- | --- | --- |
| Binary | PCM audio chunk | TTS response audio |
| Text | {"event": "transcript", "text": "..."} | STT result |
| Text | {"event": "reply", "text": "..."} | Text following the wake word |
| Text | {"event": "tts_end"} | TTS finished |
| Text | {"event": "interrupted"} | TTS cancelled |
| Text | {"event": "pong"} | Response to ping |
| Text | {"event": "error", "code": "...", "message": "..."} | Error |

See API_CONTRACT.md for the full WebSocket protocol documentation.
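
A client receiving these messages has to tell binary audio apart from JSON events. One possible way to do that is sketched below; the function name `describe_server_message` and the `(kind, payload)` return shape are assumptions made for this example, and API_CONTRACT.md remains the authoritative description.

```python
import json


def describe_server_message(message):
    """Turn one WebSocket message into a (kind, payload) tuple."""
    # Binary frames carry TTS audio as raw PCM.
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)
    data = json.loads(message)
    event = data.get("event")
    if event in ("transcript", "reply"):
        return event, data.get("text", "")
    if event == "error":
        return "error", f'{data.get("code")}: {data.get("message")}'
    if event in ("tts_end", "interrupted", "pong"):
        return event, None
    # Anything not listed in the table is passed through untouched.
    return "unknown", data
```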

Project Structure

├── src/
│   ├── config.py              # Configuration & environment variables
│   ├── pipeline.py            # Core voice pipeline (STT → Wake Word → TTS)
│   ├── stt/
│   │   ├── deepgram_client.py # Deepgram real-time STT (active)
│   │   └── assemblyai_client.py # AssemblyAI STT (alternative, not in use)
│   ├── tts/
│   │   └── cartesia_client.py # Cartesia TTS streaming
│   ├── llm/
│   │   └── answerer.py        # LLM answer generation (Phase 2, not yet implemented)
│   └── knowledge/
│       └── loader.py          # PDF loader & RAG (Phase 2, not yet implemented)
├── main.py                    # FastAPI entry point & WebSocket handler
├── test_client.py             # Test client
├── convert_audio.py           # M4A → WAV converter
├── playground/                # Audio samples and TTS output
├── Dockerfile
├── .env.example
└── API_CONTRACT.md