title: Demo Voice Agent Data Eyond
emoji: π
colorFrom: pink
colorTo: pink
sdk: docker
pinned: true
Voice Agent Service
Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It receives an audio stream from the client, detects the wake word, then streams synthesized speech back.
Current version: Phase 1 (Echo Mode). The text that follows the wake word is echoed straight back through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.
Requirements
- Python 3.11+
- uv
- Deepgram API key
- Cartesia API key + Voice ID
Setup
1. Clone & install dependencies
uv sync
2. Configure environment
cp .env.example .env
Edit .env and fill in the API keys:
DEEPGRAM_API_KEY=your_key
CARTESIA_API_KEY=your_key
CARTESIA_VOICE_ID=your_voice_id
Optional configuration:
CARTESIA_MODEL=sonic-3 # Default: sonic-3
DEEPGRAM_LANGUAGE=id # Default: id (Indonesian)
DEEPGRAM_ENDPOINTING_MS=300 # Default: 300ms
DEEPGRAM_UTTERANCE_END_MS=2000 # Default: 2000ms
SAMPLE_RATE=16000 # Default: 16000 Hz
WAKE_WORD=Hai EMA # Default: "Hai EMA"
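The variables above could be read into a typed settings object along these lines. This is a minimal sketch, not the contents of src/config.py: the `Settings` class and `load_settings` function are hypothetical names, and only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    deepgram_api_key: str
    cartesia_api_key: str
    cartesia_voice_id: str
    cartesia_model: str
    deepgram_language: str
    deepgram_endpointing_ms: int
    deepgram_utterance_end_ms: int
    sample_rate: int
    wake_word: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return Settings(
        # Required keys fall back to "" so a missing key can be detected later.
        deepgram_api_key=env.get("DEEPGRAM_API_KEY", ""),
        cartesia_api_key=env.get("CARTESIA_API_KEY", ""),
        cartesia_voice_id=env.get("CARTESIA_VOICE_ID", ""),
        # Optional keys use the defaults documented above.
        cartesia_model=env.get("CARTESIA_MODEL", "sonic-3"),
        deepgram_language=env.get("DEEPGRAM_LANGUAGE", "id"),
        deepgram_endpointing_ms=int(env.get("DEEPGRAM_ENDPOINTING_MS", "300")),
        deepgram_utterance_end_ms=int(env.get("DEEPGRAM_UTTERANCE_END_MS", "2000")),
        sample_rate=int(env.get("SAMPLE_RATE", "16000")),
        wake_word=env.get("WAKE_WORD", "Hai EMA"),
    )
```

Passing an explicit mapping, for example `load_settings({"WAKE_WORD": "Halo"})`, keeps the loader easy to exercise without touching the real environment.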
Run
`uv run uvicorn main:app --host 0.0.0.0 --port 7861`
or, with auto-reload during development:
`uv run uvicorn main:app --host 0.0.0.0 --port 7861 --reload`
The server will run at http://localhost:7861.
Test
Health check:
curl http://localhost:7861/health
Expected response:
{
"status": "ok",
"version": "1.1.0",
"stt_ready": true,
"tts_ready": true
}
A degraded status (HTTP 503) is returned if the API keys are incomplete.
WebSocket test: send WAV audio and receive the TTS response:
uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav
The WAV file must be 16 kHz, 16-bit, mono PCM.
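Before sending a file, its header can be checked against that format with the standard `wave` module. This is an illustrative helper (`check_wav_format` is a hypothetical name, not part of test_client.py):

```python
import wave


def check_wav_format(path: str, sample_rate: int = 16000) -> list:
    """Return a list of format problems; an empty list means the file fits."""
    problems = []
    # wave.open raises wave.Error for files it cannot parse as PCM WAV.
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {sample_rate}")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems
```

An empty result means the file can be streamed as is; otherwise the messages say what to fix when converting.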
Specific tests:
uv run python test_client.py --test health # Health check
uv run python test_client.py --test ping # Heartbeat ping/pong
uv run python test_client.py --test interrupt # Cancel ongoing TTS
uv run python test_client.py --test stop # Graceful disconnect
Connectivity check (no audio file):
uv run python test_client.py
Convert M4A audio to WAV:
uv run python convert_audio.py # Convert all files in playground/mp4/
uv run python convert_audio.py path/to/file.m4a # Convert a single file
Docker
Build:
docker build -t voice-agent .
Run:
docker run -p 7861:7861 --env-file .env voice-agent
Wake Word
Default wake word: "Hai EMA" (Indonesian, case-insensitive)
Example: say "Hai EMA, apa kabar?" and the agent replies through TTS with "apa kabar".
It can be configured via the WAKE_WORD environment variable.
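The behaviour in the example can be sketched as a case-insensitive search followed by trimming. This is an assumption about how the matching might work, not the code in src/pipeline.py: `extract_after_wake_word` is a hypothetical name, and stripping surrounding punctuation is inferred from the example, where "apa kabar?" comes back as "apa kabar".

```python
import re
from typing import Optional


def extract_after_wake_word(transcript: str, wake_word: str = "Hai EMA") -> Optional[str]:
    """Return the text after the wake word, or None if it is not present."""
    # Match the wake word's tokens with flexible whitespace between them.
    tokens = [re.escape(tok) for tok in wake_word.split()]
    pattern = r"\b" + r"\s+".join(tokens) + r"\b"
    match = re.search(pattern, transcript, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop separators and punctuation around the remaining text.
    return transcript[match.end():].strip(" \t\n,.!?;:")
```

For instance, `extract_after_wake_word("Hai EMA, apa kabar?")` returns `"apa kabar"`, while a transcript without the wake word returns `None`.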
Architecture
Current flow (Phase 1: Echo)
Client Audio Stream (PCM 16kHz 16-bit mono)
↓
Deepgram STT (nova-2, real-time streaming)
↓
Wake Word Detection
↓
Echo Response
↓
Cartesia TTS (streaming chunks)
↓
Client Audio Playback
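The Phase 1 flow above can be sketched as a chain of async stages. Everything here is a stand-in under stated assumptions: `echo_pipeline`, `fake_stt` and `fake_tts` are hypothetical, the real Deepgram and Cartesia clients are replaced by stubs, and the wake word match is a simplified substring search.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


async def echo_pipeline(
    transcripts: AsyncIterator[str],
    synthesize: Callable[[str], AsyncIterator[bytes]],
    send_audio: Callable[[bytes], Awaitable[None]],
    wake_word: str = "hai ema",
) -> None:
    """STT transcripts -> wake word check -> echo text -> TTS chunks -> client."""
    async for transcript in transcripts:
        idx = transcript.lower().find(wake_word.lower())
        if idx == -1:
            continue  # No wake word: ignore this utterance.
        reply = transcript[idx + len(wake_word):].strip(" ,.!?")
        if not reply:
            continue  # Wake word alone: nothing to echo.
        # Forward each synthesized chunk as soon as it is produced.
        async for chunk in synthesize(reply):
            await send_audio(chunk)


async def fake_stt() -> AsyncIterator[str]:
    """Stub standing in for the streaming STT stage."""
    for text in ["selamat pagi", "Hai EMA, apa kabar?"]:
        yield text


async def fake_tts(text: str) -> AsyncIterator[bytes]:
    """Stub standing in for streaming TTS: one 'chunk' per word."""
    for word in text.split():
        yield word.encode("utf-8")


async def run_demo() -> List[bytes]:
    sent: List[bytes] = []

    async def send(chunk: bytes) -> None:
        sent.append(chunk)

    await echo_pipeline(fake_stt(), fake_tts, send)
    return sent
```

Injecting the stages as callables keeps the control flow testable without network access; `asyncio.run(run_demo())` drives the stubbed pipeline end to end.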
Planned flow (Phase 2: LLM + RAG)
Client Audio Stream
↓
Deepgram STT
↓
Wake Word Detection
↓
PDF Knowledge Base Retrieval (not yet implemented)
↓
LLM Answer Generation (not yet implemented)
↓
Cartesia TTS
↓
Client Audio Playback
WebSocket Protocol
Endpoint: ws://localhost:7861/ws/voice
Client → Server:
| Type | Format | Description |
|---|---|---|
| Binary | PCM audio chunk | Audio 16kHz, 16-bit, mono |
| Text | `{"action": "ping"}` | Heartbeat keep-alive |
| Text | `{"action": "stop"}` | Graceful disconnect |
| Text | `{"action": "interrupt"}` | Cancel ongoing TTS |
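The client-side frames in the table can be prepared as follows. This is a sketch: `control_message` and `pcm_chunks` are hypothetical helpers, and the 20 ms chunk duration is an assumption, since the protocol description above does not specify a chunk size.

```python
import json
from typing import Iterator

SAMPLE_RATE = 16000    # Hz, from the audio format above
BYTES_PER_SAMPLE = 2   # 16-bit mono PCM
ALLOWED_ACTIONS = {"ping", "stop", "interrupt"}


def control_message(action: str) -> str:
    """Serialize one of the documented text actions as a JSON string."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return json.dumps({"action": action})


def pcm_chunks(pcm: bytes, chunk_ms: int = 20) -> Iterator[bytes]:
    """Split raw PCM into fixed-duration chunks for binary frames."""
    # 16000 samples/s * 2 bytes * 0.020 s = 640 bytes per 20 ms chunk.
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Each yielded chunk would go out as a binary WebSocket frame, and each `control_message(...)` result as a text frame, using whichever WebSocket client library the application already depends on.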
Server → Client:
| Type | Format | Description |
|---|---|---|
| Binary | PCM audio chunk | TTS response audio |
| Text | `{"event": "transcript", "text": "..."}` | STT result |
| Text | `{"event": "reply", "text": "..."}` | Text after the wake word |
| Text | `{"event": "tts_end"}` | TTS finished |
| Text | `{"event": "interrupted"}` | TTS cancelled |
| Text | `{"event": "pong"}` | Ping response |
| Text | `{"event": "error", "code": "...", "message": "..."}` | Error |
See API_CONTRACT.md for the full WebSocket protocol documentation.
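On the receiving side, a client has to tell binary audio frames from JSON text events. The sketch below shows one way to dispatch them; `SessionState` and `handle_frame` are hypothetical names, and only the event names and fields come from the server-to-client message table.

```python
import json
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class SessionState:
    """Hypothetical accumulator for one WebSocket session."""
    audio: bytearray = field(default_factory=bytearray)
    transcripts: List[str] = field(default_factory=list)
    replies: List[str] = field(default_factory=list)
    errors: List[Tuple[str, str]] = field(default_factory=list)
    tts_finished: bool = False


def handle_frame(state: SessionState, frame: Union[bytes, str]) -> str:
    """Update state from one incoming frame and return its kind."""
    if isinstance(frame, (bytes, bytearray)):
        # Binary frames carry TTS audio (PCM chunks).
        state.audio.extend(frame)
        return "audio"
    message = json.loads(frame)
    event = message.get("event", "")
    if event == "transcript":
        state.transcripts.append(message.get("text", ""))
    elif event == "reply":
        state.replies.append(message.get("text", ""))
    elif event in ("tts_end", "interrupted"):
        # Either way, no more audio is expected for this reply.
        state.tts_finished = True
    elif event == "error":
        state.errors.append((message.get("code", ""), message.get("message", "")))
    # "pong" needs no state change; unknown events are returned as-is.
    return event
```

Keeping the dispatcher free of I/O means it can be fed frames from any WebSocket library's receive loop.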
Project Structure
├── src/
│   ├── config.py # Configuration & environment variables
│   ├── pipeline.py # Core voice pipeline (STT → Wake Word → TTS)
│   ├── stt/
│   │   ├── deepgram_client.py # Deepgram real-time STT (active)
│   │   └── assemblyai_client.py # AssemblyAI STT (alternative, not used)
│   ├── tts/
│   │   └── cartesia_client.py # Cartesia TTS streaming
│   ├── llm/
│   │   └── answerer.py # LLM answer generation (Phase 2, not yet implemented)
│   └── knowledge/
│       └── loader.py # PDF loader & RAG (Phase 2, not yet implemented)
├── main.py # FastAPI entry point & WebSocket handler
├── test_client.py # Test client
├── convert_audio.py # M4A → WAV converter
├── playground/ # Audio samples and TTS output
├── Dockerfile
├── .env.example
└── API_CONTRACT.md