---
title: Demo Voice Agent Data Eyond
emoji: π
colorFrom: pink
colorTo: pink
sdk: docker
pinned: true
---

# Voice Agent Service

Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It receives an audio stream from the client, detects the wake word, and streams synthesized speech back.

**Current version: Phase 1 (Echo Mode).** The text following the wake word is echoed straight back through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.

## Requirements

- Python 3.11+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- Deepgram API key
- Cartesia API key + Voice ID

## Setup

**1. Clone & install dependencies**

```bash
uv sync
```

**2. Configure environment**

```bash
cp .env.example .env
```

Edit `.env` and fill in the API keys:

```env
DEEPGRAM_API_KEY=your_key
CARTESIA_API_KEY=your_key
CARTESIA_VOICE_ID=your_voice_id
```

**Optional configuration:**

```env
CARTESIA_MODEL=sonic-3             # Default: sonic-3
DEEPGRAM_LANGUAGE=id               # Default: id (Indonesian)
DEEPGRAM_ENDPOINTING_MS=300        # Default: 300ms
DEEPGRAM_UTTERANCE_END_MS=2000     # Default: 2000ms
SAMPLE_RATE=16000                  # Default: 16000 Hz
WAKE_WORD=Hai EMA                  # Default: "Hai EMA"
```
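
As an illustration of how these variables and their defaults fit together, here is a minimal sketch of a settings loader. It is not taken from `src/config.py`: the `VoiceAgentSettings` class, its field names, and `load_settings` are hypothetical, and only the variable names and default values come from the tables above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceAgentSettings:
    """Hypothetical settings container; field names are illustrative only."""

    deepgram_api_key: str
    cartesia_api_key: str
    cartesia_voice_id: str
    cartesia_model: str
    deepgram_language: str
    deepgram_endpointing_ms: int
    deepgram_utterance_end_ms: int
    sample_rate: int
    wake_word: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> VoiceAgentSettings:
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    return VoiceAgentSettings(
        # Required values: an empty string means "not configured".
        deepgram_api_key=env.get("DEEPGRAM_API_KEY", ""),
        cartesia_api_key=env.get("CARTESIA_API_KEY", ""),
        cartesia_voice_id=env.get("CARTESIA_VOICE_ID", ""),
        # Optional values with the defaults listed in this README.
        cartesia_model=env.get("CARTESIA_MODEL", "sonic-3"),
        deepgram_language=env.get("DEEPGRAM_LANGUAGE", "id"),
        deepgram_endpointing_ms=int(env.get("DEEPGRAM_ENDPOINTING_MS", "300")),
        deepgram_utterance_end_ms=int(env.get("DEEPGRAM_UTTERANCE_END_MS", "2000")),
        sample_rate=int(env.get("SAMPLE_RATE", "16000")),
        wake_word=env.get("WAKE_WORD", "Hai EMA"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in tests, for example `load_settings({"SAMPLE_RATE": "8000"})`.
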
## Run

```bash
uv run uvicorn main:app --host 0.0.0.0 --port 7861
# or, with auto-reload:
uv run uvicorn main:app --host 0.0.0.0 --port 7861 --reload
```

The server will be available at `http://localhost:7861`.

## Test

**Health check:**

```bash
curl http://localhost:7861/health
```

Expected response:

```json
{
  "status": "ok",
  "version": "1.1.0",
  "stt_ready": true,
  "tts_ready": true
}
```

A `degraded` status (HTTP 503) is returned when the API keys are incomplete.
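
The relationship between the configured keys and the health response can be sketched as follows. This is an assumption-laden illustration, not the handler in `main.py`: the `build_health` function is hypothetical, and it assumes that readiness depends only on whether the keys are present and that TTS also needs the Voice ID.

```python
from typing import Any, Dict, Tuple


def build_health(
    deepgram_key: str,
    cartesia_key: str,
    cartesia_voice_id: str,
    version: str = "1.1.0",
) -> Tuple[int, Dict[str, Any]]:
    """Return (HTTP status code, JSON payload) for a health check.

    Assumption: a service counts as ready when its credentials are non-empty.
    """
    stt_ready = bool(deepgram_key)
    tts_ready = bool(cartesia_key and cartesia_voice_id)
    healthy = stt_ready and tts_ready
    payload = {
        "status": "ok" if healthy else "degraded",
        "version": version,
        "stt_ready": stt_ready,
        "tts_ready": tts_ready,
    }
    return (200 if healthy else 503), payload
```
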
**WebSocket test (send WAV audio, receive the TTS response):**

```bash
uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav
```

> The WAV file must be **16 kHz, 16-bit, mono PCM**.

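
Before sending a file, it can help to confirm that it really matches this format. The sketch below uses only Python's standard `wave` module; the `check_wav_format` helper is hypothetical and is not part of `test_client.py`.

```python
import wave
from typing import List


def check_wav_format(source, sample_rate: int = 16000) -> List[str]:
    """Return a list of format problems; an empty list means the file is usable.

    `source` may be a file path or a binary file-like object.
    The wave module only reads uncompressed PCM and raises wave.Error otherwise.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {sample_rate} Hz")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels found, expected mono")
    return problems
```

Calling `check_wav_format("path/to/audio.wav")` returns `[]` for a conforming file and a human-readable list of mismatches otherwise.
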
**Specific tests:**

```bash
uv run python test_client.py --test health      # Health check
uv run python test_client.py --test ping        # Heartbeat ping/pong
uv run python test_client.py --test interrupt   # Cancel ongoing TTS
uv run python test_client.py --test stop        # Graceful disconnect
```

**Connectivity check (no audio file needed):**

```bash
uv run python test_client.py
```

**Convert M4A audio to WAV:**

```bash
uv run python convert_audio.py                    # Convert every file in playground/mp4/
uv run python convert_audio.py path/to/file.m4a   # Convert a single file
```

## Docker

**Build:**

```bash
docker build -t voice-agent .
```

**Run:**

```bash
docker run -p 7861:7861 --env-file .env voice-agent
```

## Wake Word

Default wake word: **"Hai EMA"** (Indonesian, case-insensitive).

Example: say _"Hai EMA, apa kabar?"_ and the agent replies via TTS with _"apa kabar"_.

The wake word can be changed through the `WAKE_WORD` environment variable.
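
The matching behavior described above can be sketched like this. It is a guess at the logic, not the code in `src/pipeline.py`: the `extract_after_wake_word` function is hypothetical, and trimming the surrounding punctuation is an assumption inferred from the example, where _"Hai EMA, apa kabar?"_ yields _"apa kabar"_.

```python
import re
from typing import Optional


def extract_after_wake_word(transcript: str, wake_word: str = "Hai EMA") -> Optional[str]:
    """Return the text that follows the wake word, or None if it was not spoken.

    Matching ignores case and tolerates any amount of whitespace between the
    words of the wake phrase.
    """
    words = wake_word.split()
    if not words:
        return None
    pattern = re.compile(
        r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b",
        re.IGNORECASE,
    )
    match = pattern.search(transcript)
    if match is None:
        return None
    # Assumption: leading/trailing punctuation is dropped before echoing.
    return transcript[match.end():].strip(" \t\n,.!?;:")
```

An empty string is returned when the wake word is spoken with nothing after it, so callers can distinguish "no wake word" (`None`) from "wake word only" (`""`).
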
## Architecture

### Current flow (Phase 1: Echo)

```
Client Audio Stream (PCM 16kHz 16-bit mono)
        ↓
Deepgram STT (nova-2, real-time streaming)
        ↓
Wake Word Detection
        ↓
Echo Response
        ↓
Cartesia TTS (streaming chunks)
        ↓
Client Audio Playback
```

### Planned flow (Phase 2: LLM + RAG)

```
Client Audio Stream
        ↓
Deepgram STT
        ↓
Wake Word Detection
        ↓
PDF Knowledge Base Retrieval (not yet implemented)
        ↓
LLM Answer Generation (not yet implemented)
        ↓
Cartesia TTS
        ↓
Client Audio Playback
```

## WebSocket Protocol

**Endpoint:** `ws://localhost:7861/ws/voice`

**Client → Server:**

| Type | Format | Description |
|------|--------|-------------|
| Binary | PCM audio chunk | Audio 16kHz, 16-bit, mono |
| Text | `{"action": "ping"}` | Heartbeat keep-alive |
| Text | `{"action": "stop"}` | Graceful disconnect |
| Text | `{"action": "interrupt"}` | Cancel ongoing TTS |

**Server → Client:**

| Type | Format | Description |
|------|--------|-------------|
| Binary | PCM audio chunk | TTS response audio |
| Text | `{"event": "transcript", "text": "..."}` | STT result |
| Text | `{"event": "reply", "text": "..."}` | Text following the wake word |
| Text | `{"event": "tts_end"}` | TTS finished |
| Text | `{"event": "interrupted"}` | TTS cancelled |
| Text | `{"event": "pong"}` | Reply to ping |
| Text | `{"event": "error", "code": "...", "message": "..."}` | Error |

See [API_CONTRACT.md](API_CONTRACT.md) for the complete WebSocket protocol documentation.
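
To show how a client might interpret the frames listed above, here is a small transport-agnostic sketch. Both helpers, `control_message` and `handle_server_frame`, are hypothetical and are not taken from `test_client.py`; they assume only the message shapes in the two tables and leave the WebSocket connection itself to whichever client library is in use.

```python
import json
from typing import Union

CLIENT_ACTIONS = {"ping", "stop", "interrupt"}


def control_message(action: str) -> str:
    """Serialize a client-to-server control message as JSON text."""
    if action not in CLIENT_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return json.dumps({"action": action})


def handle_server_frame(frame: Union[bytes, bytearray, str], audio_buffer: bytearray) -> str:
    """Process one frame from the server and return a short description of it.

    Binary frames are TTS audio and are appended to audio_buffer;
    text frames are JSON events.
    """
    if isinstance(frame, (bytes, bytearray)):
        audio_buffer.extend(frame)
        return "audio"
    message = json.loads(frame)
    event = message.get("event")
    if event in ("transcript", "reply"):
        return f"{event}: {message.get('text', '')}"
    if event == "error":
        return f"error {message.get('code')}: {message.get('message')}"
    if event in ("tts_end", "interrupted", "pong"):
        return event
    return f"unknown event: {event!r}"
```

Keeping the parsing separate from the socket loop means the same functions can be reused with any WebSocket client and tested without a running server.
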
## Project Structure

```
├── src/
│   ├── config.py               # Configuration & environment variables
│   ├── pipeline.py             # Core voice pipeline (STT → Wake Word → TTS)
│   ├── stt/
│   │   ├── deepgram_client.py      # Deepgram real-time STT (active)
│   │   └── assemblyai_client.py    # AssemblyAI STT (alternative, not used)
│   ├── tts/
│   │   └── cartesia_client.py      # Cartesia TTS streaming
│   ├── llm/
│   │   └── answerer.py             # LLM answer generation (Phase 2, not yet implemented)
│   └── knowledge/
│       └── loader.py               # PDF loader & RAG (Phase 2, not yet implemented)
├── main.py                     # FastAPI entry point & WebSocket handler
├── test_client.py              # Test client
├── convert_audio.py            # M4A → WAV converter
├── playground/                 # Audio samples and TTS output
├── Dockerfile
├── .env.example
└── API_CONTRACT.md
```