---
title: Demo Voice Agent Data Eyond
emoji: π
colorFrom: pink
colorTo: pink
sdk: docker
pinned: true
---
# Voice Agent Service
Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It receives an audio stream from the client, detects the wake word, and streams synthesized speech back.
**Current version: Phase 1 (Echo Mode).** The text following the wake word is echoed back directly through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.
## Requirements
- Python 3.11+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- Deepgram API key
- Cartesia API key + Voice ID
## Setup
**1. Clone & install dependencies**
```bash
uv sync
```
**2. Configure environment**
```bash
cp .env.example .env
```
Edit `.env` and fill in the API keys:
```env
DEEPGRAM_API_KEY=your_key
CARTESIA_API_KEY=your_key
CARTESIA_VOICE_ID=your_voice_id
```
**Optional configuration:**
```env
CARTESIA_MODEL=sonic-3 # Default: sonic-3
DEEPGRAM_LANGUAGE=id # Default: id (Indonesian)
DEEPGRAM_ENDPOINTING_MS=300 # Default: 300ms
DEEPGRAM_UTTERANCE_END_MS=2000 # Default: 2000ms
SAMPLE_RATE=16000 # Default: 16000 Hz
WAKE_WORD=Hai EMA # Default: "Hai EMA"
```
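As an illustration of how these variables and their defaults could be read at startup, here is a minimal sketch. The `Settings` dataclass and `load_settings` function are hypothetical names introduced for this example; the actual `src/config.py` may be organized differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container; fields mirror the .env keys listed above."""
    deepgram_api_key: str
    cartesia_api_key: str
    cartesia_voice_id: str
    cartesia_model: str
    deepgram_language: str
    deepgram_endpointing_ms: int
    deepgram_utterance_end_ms: int
    sample_rate: int
    wake_word: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        # Required keys: an empty string marks them as missing.
        deepgram_api_key=env.get("DEEPGRAM_API_KEY", ""),
        cartesia_api_key=env.get("CARTESIA_API_KEY", ""),
        cartesia_voice_id=env.get("CARTESIA_VOICE_ID", ""),
        # Optional keys with the defaults documented in this README.
        cartesia_model=env.get("CARTESIA_MODEL", "sonic-3"),
        deepgram_language=env.get("DEEPGRAM_LANGUAGE", "id"),
        deepgram_endpointing_ms=int(env.get("DEEPGRAM_ENDPOINTING_MS", "300")),
        deepgram_utterance_end_ms=int(env.get("DEEPGRAM_UTTERANCE_END_MS", "2000")),
        sample_rate=int(env.get("SAMPLE_RATE", "16000")),
        wake_word=env.get("WAKE_WORD", "Hai EMA"),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, for example `load_settings({"SAMPLE_RATE": "8000"})`.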
## Run
```bash
uv run uvicorn main:app --host 0.0.0.0 --port 7861
# or, with auto-reload during development:
uv run uvicorn main:app --host 0.0.0.0 --port 7861 --reload
```
The server will run at `http://localhost:7861`.
## Test
**Health check:**
```bash
curl http://localhost:7861/health
```
Expected response:
```json
{
"status": "ok",
"version": "1.1.0",
"stt_ready": true,
"tts_ready": true
}
```
A `degraded` status (HTTP 503) is returned if the API keys are incomplete.
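The relationship between the configured keys and the health response can be sketched as follows. This is an assumption-based illustration: `build_health` is a hypothetical helper, and it assumes readiness is derived purely from whether the keys are present (with the Cartesia Voice ID counted as part of TTS readiness). The real endpoint may perform additional checks.

```python
def build_health(deepgram_key, cartesia_key, cartesia_voice_id, version="1.1.0"):
    """Return (http_status, body) in the shape of the /health response shown above."""
    # Assumption: a service counts as "ready" when its credentials are non-empty.
    stt_ready = bool(deepgram_key)
    tts_ready = bool(cartesia_key and cartesia_voice_id)
    healthy = stt_ready and tts_ready
    body = {
        "status": "ok" if healthy else "degraded",
        "version": version,
        "stt_ready": stt_ready,
        "tts_ready": tts_ready,
    }
    return (200 if healthy else 503), body
```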
**WebSocket test (send WAV audio, receive the TTS response):**
```bash
uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav
```
> The WAV file must be in the format: **16 kHz, 16-bit, mono PCM**.
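To check a file against this format before sending it, a small validation sketch using Python's standard `wave` module may help. `check_wav_format` is a hypothetical helper and is not part of `test_client.py`.

```python
import wave


def check_wav_format(path, expected_rate=16000):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    # wave.open only reads uncompressed PCM; other encodings raise wave.Error.
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems
```

An empty result means the file can be streamed as-is; otherwise the messages indicate what needs converting.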
**Specific tests:**
```bash
uv run python test_client.py --test health # Health check
uv run python test_client.py --test ping # Heartbeat ping/pong
uv run python test_client.py --test interrupt # Cancel ongoing TTS
uv run python test_client.py --test stop # Graceful disconnect
```
**Connectivity check (no audio file):**
```bash
uv run python test_client.py
```
**Convert M4A audio to WAV:**
```bash
uv run python convert_audio.py                   # Convert all files in playground/mp4/
uv run python convert_audio.py path/to/file.m4a  # Convert a single file
```
## Docker
**Build:**
```bash
docker build -t voice-agent .
```
**Run:**
```bash
docker run -p 7861:7861 --env-file .env voice-agent
```
## Wake Word
Default wake word: **"Hai EMA"** (Indonesian, case-insensitive)
Example: say _"Hai EMA, apa kabar?"_ ("Hi EMA, how are you?") and the agent replies via TTS with _"apa kabar"_.
Configurable via the `WAKE_WORD` environment variable.
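A minimal sketch of case-insensitive wake word matching is shown below. `extract_after_wake_word` is a hypothetical function, and stripping surrounding punctuation is an assumption made so that the example above yields "apa kabar"; the service's actual detection logic may differ.

```python
import re


def extract_after_wake_word(transcript, wake_word="Hai EMA"):
    """Return the text following the wake word, or None if the wake word is absent."""
    tokens = wake_word.split()
    if not tokens:
        return None
    # Match the wake word tokens case-insensitively, tolerating extra whitespace.
    pattern = r"\b" + r"\s+".join(re.escape(tok) for tok in tokens) + r"\b"
    match = re.search(pattern, transcript, flags=re.IGNORECASE)
    if match is None:
        return None
    # Assumption: surrounding punctuation and whitespace are dropped from the reply.
    return transcript[match.end():].strip(" \t\n,.!?;:")
```

For example, `extract_after_wake_word("Hai EMA, apa kabar?")` returns `"apa kabar"`, while a transcript without the wake word returns `None`.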
## Architecture
### Current flow (Phase 1: Echo)
```
Client Audio Stream (PCM 16kHz 16-bit mono)
↓
Deepgram STT (nova-2, real-time streaming)
↓
Wake Word Detection
↓
Echo Response
↓
Cartesia TTS (streaming chunks)
↓
Client Audio Playback
```
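The flow above can be sketched as a single async loop. This is an illustrative outline only: `echo_pipeline`, the `transcripts` iterator (a stand-in for Deepgram), the `synthesize` generator (a stand-in for Cartesia), and the `send` callable are all hypothetical, the wake word is hard-coded to the default, and the event ordering is an assumption based on the protocol tables in this README.

```python
import re

# Default wake word, matched case-insensitively (assumption: hard-coded for brevity).
WAKE_WORD_RE = re.compile(r"\bhai\s+ema\b", re.IGNORECASE)


async def echo_pipeline(transcripts, synthesize, send):
    """Echo the text after the wake word back to the client as speech.

    transcripts: async iterator of final transcript strings (STT stand-in).
    synthesize:  async generator function, text -> audio byte chunks (TTS stand-in).
    send:        async callable accepting dict events or raw audio bytes.
    """
    async for text in transcripts:
        await send({"event": "transcript", "text": text})
        match = WAKE_WORD_RE.search(text)
        if match is None:
            continue  # No wake word: nothing to echo.
        reply = text[match.end():].strip(" \t\n,.!?;:")
        if not reply:
            continue  # Wake word only, no content to speak.
        await send({"event": "reply", "text": reply})
        async for chunk in synthesize(reply):
            await send(chunk)  # Stream audio as it is produced.
        await send({"event": "tts_end"})
```

Injecting the STT and TTS stages as parameters keeps the loop testable without network access.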
### Planned flow (Phase 2: LLM + RAG)
```
Client Audio Stream
↓
Deepgram STT
↓
Wake Word Detection
↓
PDF Knowledge Base Retrieval (not yet implemented)
↓
LLM Answer Generation (not yet implemented)
↓
Cartesia TTS
↓
Client Audio Playback
```
## WebSocket Protocol
**Endpoint:** `ws://localhost:7861/ws/voice`
**Client → Server:**
| Type | Format | Description |
|------|--------|------------|
| Binary | PCM audio chunk | Audio 16kHz, 16-bit, mono |
| Text | `{"action": "ping"}` | Heartbeat keep-alive |
| Text | `{"action": "stop"}` | Graceful disconnect |
| Text | `{"action": "interrupt"}` | Cancel ongoing TTS |
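As a sketch of how a client might prepare these messages: raw PCM is split into fixed-duration binary chunks and control actions are serialized as JSON text. `pcm_chunks` and `control_message` are hypothetical helpers, and the 20 ms chunk duration is an assumption; the protocol table does not specify a chunk size.

```python
import json

BYTES_PER_SAMPLE = 2  # 16-bit mono PCM
ACTIONS = {"ping", "stop", "interrupt"}


def pcm_chunks(pcm, sample_rate=16000, chunk_ms=20):
    """Yield consecutive slices of raw PCM bytes, each covering chunk_ms of audio."""
    chunk_bytes = sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]  # The final chunk may be shorter.


def control_message(action):
    """Serialize one of the documented control actions as a JSON text frame."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return json.dumps({"action": action})
```

At 16 kHz and 16 bits per sample, a 20 ms chunk is 640 bytes, so one second of audio becomes 50 binary frames.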
**Server → Client:**
| Type | Format | Description |
|------|--------|------------|
| Binary | PCM audio chunk | TTS response audio |
| Text | `{"event": "transcript", "text": "..."}` | STT result |
| Text | `{"event": "reply", "text": "..."}` | Text following the wake word |
| Text | `{"event": "tts_end"}` | TTS finished |
| Text | `{"event": "interrupted"}` | TTS cancelled |
| Text | `{"event": "pong"}` | Ping response |
| Text | `{"event": "error", "code": "...", "message": "..."}` | Error |
See [API_CONTRACT.md](API_CONTRACT.md) for the full WebSocket protocol documentation.
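On the receiving side, a client needs to tell binary audio frames apart from JSON events. The sketch below shows one way to do that; `parse_server_frame` is a hypothetical helper based only on the event names in the table above, and API_CONTRACT.md remains the authoritative reference.

```python
import json

KNOWN_EVENTS = {"transcript", "reply", "tts_end", "interrupted", "pong", "error"}


def parse_server_frame(frame):
    """Classify a frame from the server as ("audio", bytes) or (event_name, payload)."""
    if isinstance(frame, (bytes, bytearray)):
        # Binary frames carry TTS audio.
        return ("audio", bytes(frame))
    message = json.loads(frame)
    event = message.get("event")
    if event not in KNOWN_EVENTS:
        raise ValueError(f"unexpected event: {event!r}")
    return (event, message)
```

A receive loop can then branch on the first element of the result, for example appending `"audio"` payloads to a buffer until a `"tts_end"` event arrives.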
## Project Structure
```
├── src/
│   ├── config.py                 # Configuration & environment variables
│   ├── pipeline.py               # Core voice pipeline (STT → Wake Word → TTS)
│   ├── stt/
│   │   ├── deepgram_client.py    # Deepgram real-time STT (active)
│   │   └── assemblyai_client.py  # AssemblyAI STT (alternative, unused)
│   ├── tts/
│   │   └── cartesia_client.py    # Cartesia TTS streaming
│   ├── llm/
│   │   └── answerer.py           # LLM answer generation (Phase 2, not yet implemented)
│   └── knowledge/
│       └── loader.py             # PDF loader & RAG (Phase 2, not yet implemented)
├── main.py                       # FastAPI entry point & WebSocket handler
├── test_client.py                # Test client
├── convert_audio.py              # M4A → WAV converter
├── playground/                   # Audio samples and TTS output
├── Dockerfile
├── .env.example
└── API_CONTRACT.md
```