---
title: Demo Voice Agent Data Eyond
emoji: 🌍
colorFrom: pink
colorTo: pink
sdk: docker
pinned: true
---
# Voice Agent Service
Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It receives an audio stream from the client, detects the wake word, then streams synthesized speech back.
**Current version: Phase 1 (Echo Mode).** Text following the wake word is echoed straight back through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.
## Requirements
- Python 3.11+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- Deepgram API key
- Cartesia API key + Voice ID
## Setup
**1. Clone & install dependencies**
```bash
uv sync
```
**2. Configure environment**
```bash
cp .env.example .env
```
Edit `.env` and fill in the API keys:
```env
DEEPGRAM_API_KEY=your_key
CARTESIA_API_KEY=your_key
CARTESIA_VOICE_ID=your_voice_id
```
**Optional configuration:**
```env
CARTESIA_MODEL=sonic-3 # Default: sonic-3
DEEPGRAM_LANGUAGE=id # Default: id (Indonesian)
DEEPGRAM_ENDPOINTING_MS=300 # Default: 300ms
DEEPGRAM_UTTERANCE_END_MS=2000 # Default: 2000ms
SAMPLE_RATE=16000 # Default: 16000 Hz
WAKE_WORD=Hai EMA # Default: "Hai EMA"
```
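A minimal sketch of how these optional variables might be read with their documented defaults. The `Settings` dataclass and `load_settings` helper are hypothetical names used for illustration only; the project's actual `src/config.py` may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the optional variables listed above."""
    cartesia_model: str
    deepgram_language: str
    deepgram_endpointing_ms: int
    deepgram_utterance_end_ms: int
    sample_rate: int
    wake_word: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read optional settings, falling back to the defaults documented in this README."""
    return Settings(
        cartesia_model=env.get("CARTESIA_MODEL", "sonic-3"),
        deepgram_language=env.get("DEEPGRAM_LANGUAGE", "id"),
        deepgram_endpointing_ms=int(env.get("DEEPGRAM_ENDPOINTING_MS", "300")),
        deepgram_utterance_end_ms=int(env.get("DEEPGRAM_UTTERANCE_END_MS", "2000")),
        sample_rate=int(env.get("SAMPLE_RATE", "16000")),
        wake_word=env.get("WAKE_WORD", "Hai EMA"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation, for example `load_settings({}).sample_rate`.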
## Run
```bash
uv run uvicorn main:app --host 0.0.0.0 --port 7861
# or, with auto-reload:
uv run uvicorn main:app --host 0.0.0.0 --port 7861 --reload
```
The server will run at `http://localhost:7861`.
## Test
**Health check:**
```bash
curl http://localhost:7861/health
```
Expected response:
```json
{
"status": "ok",
"version": "1.1.0",
"stt_ready": true,
"tts_ready": true
}
```
A `degraded` status (HTTP 503) is returned if the API keys are incomplete.
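For scripted checks, the health response can be interpreted as in the sketch below. The `is_service_ready` helper is a hypothetical name, and it relies only on the fields and status codes documented here.

```python
import json


def is_service_ready(status_code: int, body: str) -> bool:
    """Return True only when the health payload reports a fully usable service."""
    if status_code != 200:
        # This README documents HTTP 503 for the degraded state.
        return False
    payload = json.loads(body)
    return (
        payload.get("status") == "ok"
        and payload.get("stt_ready") is True
        and payload.get("tts_ready") is True
    )
```

When fetching the endpoint with `urllib.request.urlopen`, note that a 503 response is raised as `urllib.error.HTTPError`, so the status code has to be read from the exception in that case.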
**WebSocket test (send WAV audio, receive the TTS response):**
```bash
uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav
```
> The WAV file must be in the format: **16kHz, 16-bit, mono PCM**.
**Specific tests:**
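Before sending a file, its format can be verified with Python's standard `wave` module. This is a minimal sketch; `check_wav_format` is a hypothetical helper and is not part of `test_client.py`.

```python
import wave


def check_wav_format(source) -> list[str]:
    """Return a list of format problems; an empty list means 16 kHz, 16-bit, mono.

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems
```

The `wave` module raises `wave.Error` for files it cannot parse, such as compressed WAV variants, so a clean return also implies the file is uncompressed PCM.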
```bash
uv run python test_client.py --test health # Health check
uv run python test_client.py --test ping # Heartbeat ping/pong
uv run python test_client.py --test interrupt # Cancel ongoing TTS
uv run python test_client.py --test stop # Graceful disconnect
```
**Connectivity check (without an audio file):**
```bash
uv run python test_client.py
```
**Convert M4A audio to WAV:**
```bash
uv run python convert_audio.py # Convert all files in playground/mp4/
uv run python convert_audio.py path/to/file.m4a # Convert a single file
```
## Docker
**Build:**
```bash
docker build -t voice-agent .
```
**Run:**
```bash
docker run -p 7861:7861 --env-file .env voice-agent
```
## Wake Word
Default wake word: **"Hai EMA"** (Indonesian, case-insensitive)
Example: say _"Hai EMA, apa kabar?"_ → the agent replies with TTS _"apa kabar"_.
It can be configured via the `WAKE_WORD` environment variable.
## Architecture
### Current flow (Phase 1: Echo)
```
Client Audio Stream (PCM 16kHz 16-bit mono)
↓
Deepgram STT (nova-2, real-time streaming)
↓
Wake Word Detection
↓
Echo Response
↓
Cartesia TTS (streaming chunks)
↓
Client Audio Playback
```
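The stage chaining in this flow can be sketched with async generators. Everything here is an assumption made for illustration: `echo_pipeline` is a hypothetical name, the STT and TTS stages are passed in as stand-ins rather than real Deepgram or Cartesia clients, and the order in which events are emitted is a guess rather than the server's documented behavior.

```python
import asyncio
from typing import AsyncIterator, Callable, Union

Message = Union[dict, bytes]


async def echo_pipeline(
    transcripts: AsyncIterator[str],
    synthesize: Callable[[str], AsyncIterator[bytes]],
    wake_word: str = "Hai EMA",
) -> AsyncIterator[Message]:
    """Chain the Phase 1 stages: transcript -> wake word check -> echo -> TTS chunks."""
    async for text in transcripts:
        yield {"event": "transcript", "text": text}
        # Case-insensitive wake word search (simplified; assumes plain ASCII text).
        start = text.lower().find(wake_word.lower())
        if start == -1:
            continue
        reply = text[start + len(wake_word):].strip(" ,.!?")
        if not reply:
            continue
        yield {"event": "reply", "text": reply}
        # Forward synthesized audio chunk by chunk instead of buffering it all.
        async for chunk in synthesize(reply):
            yield chunk
        yield {"event": "tts_end"}
```

Streaming each chunk as soon as it is produced keeps playback latency low, which is the main reason to model the stages as generators rather than functions returning complete buffers.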
### Planned flow (Phase 2: LLM + RAG)
```
Client Audio Stream
↓
Deepgram STT
↓
Wake Word Detection
↓
PDF Knowledge Base Retrieval (not yet implemented)
↓
LLM Answer Generation (not yet implemented)
↓
Cartesia TTS
↓
Client Audio Playback
```
## WebSocket Protocol
**Endpoint:** `ws://localhost:7861/ws/voice`
**Client → Server:**
| Type | Format | Description |
|------|--------|------------|
| Binary | PCM audio chunk | Audio 16kHz, 16-bit, mono |
| Text | `{"action": "ping"}` | Heartbeat keep-alive |
| Text | `{"action": "stop"}` | Graceful disconnect |
| Text | `{"action": "interrupt"}` | Cancel ongoing TTS |
**Server → Client:**
| Type | Format | Description |
|------|--------|------------|
| Binary | PCM audio chunk | TTS response audio |
| Text | `{"event": "transcript", "text": "..."}` | STT result |
| Text | `{"event": "reply", "text": "..."}` | Text following the wake word |
| Text | `{"event": "tts_end"}` | TTS finished |
| Text | `{"event": "interrupted"}` | TTS cancelled |
| Text | `{"event": "pong"}` | Ping response |
| Text | `{"event": "error", "code": "...", "message": "..."}` | Error |
See [API_CONTRACT.md](API_CONTRACT.md) for the full WebSocket protocol documentation.
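The message shapes in the tables above can be produced and consumed with a few small helpers. This sketch is transport-agnostic and uses hypothetical names (`chunk_pcm`, `control_message`, `describe_frame`); the 20 ms chunk duration is an assumption, since this README does not specify a chunk size.

```python
import json
from typing import Iterator, Union

SAMPLE_RATE = 16000   # Hz, mono
BYTES_PER_SAMPLE = 2  # 16-bit PCM


def chunk_pcm(pcm: bytes, chunk_ms: int = 20) -> Iterator[bytes]:
    """Split raw PCM into fixed-duration chunks to send as binary frames."""
    size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000  # 640 bytes at 20 ms
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def control_message(action: str) -> str:
    """Build one of the documented client text frames."""
    if action not in {"ping", "stop", "interrupt"}:
        raise ValueError(f"unknown action: {action}")
    return json.dumps({"action": action})


def describe_frame(frame: Union[bytes, str]) -> str:
    """Classify a server frame: binary frames carry TTS audio, text frames carry JSON events."""
    if isinstance(frame, bytes):
        return f"audio:{len(frame)}"
    return f"event:{json.loads(frame)['event']}"
```

A client loop would send the output of `chunk_pcm` as binary frames, send `control_message("interrupt")` to cancel playback, and branch on the result of `describe_frame` for each incoming frame.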
## Project Structure
```
├── src/
│   ├── config.py                   # Configuration & environment variables
│   ├── pipeline.py                 # Core voice pipeline (STT → Wake Word → TTS)
│   ├── stt/
│   │   ├── deepgram_client.py      # Deepgram real-time STT (active)
│   │   └── assemblyai_client.py    # AssemblyAI STT (alternative, unused)
│   ├── tts/
│   │   └── cartesia_client.py      # Cartesia TTS streaming
│   ├── llm/
│   │   └── answerer.py             # LLM answer generation (Phase 2, not yet implemented)
│   └── knowledge/
│       └── loader.py               # PDF loader & RAG (Phase 2, not yet implemented)
├── main.py                         # FastAPI entry point & WebSocket handler
├── test_client.py                  # Test client
├── convert_audio.py                # M4A → WAV converter
├── playground/                     # Audio samples and TTS output
├── Dockerfile
├── .env.example
└── API_CONTRACT.md
```