---
title: Demo Voice Agent Data Eyond
emoji: 🌍
colorFrom: pink
colorTo: pink
sdk: docker
pinned: true
---

# Voice Agent Service

Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It receives an audio stream from the client, detects the wake word, then streams synthesized speech back.

**Current version: Phase 1 (Echo Mode).** The text after the wake word is echoed directly through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.

## Requirements

- Python 3.11+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- Deepgram API key
- Cartesia API key + Voice ID

## Setup

**1. Clone & install dependencies**
```bash
uv sync
```

**2. Configure environment**
```bash
cp .env.example .env
```

Edit `.env` and fill in the API keys:
```env
DEEPGRAM_API_KEY=your_key
CARTESIA_API_KEY=your_key
CARTESIA_VOICE_ID=your_voice_id
```

**Optional configuration:**
```env
CARTESIA_MODEL=sonic-3               # Default: sonic-3
DEEPGRAM_LANGUAGE=id                 # Default: id (Indonesian)
DEEPGRAM_ENDPOINTING_MS=300          # Default: 300ms
DEEPGRAM_UTTERANCE_END_MS=2000       # Default: 2000ms
SAMPLE_RATE=16000                    # Default: 16000 Hz
WAKE_WORD=Hai EMA                    # Default: "Hai EMA"
```
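
A minimal sketch of how these optional variables could be read with their documented defaults. `VoiceSettings` and `load_settings` are hypothetical names used for illustration; the actual `src/config.py` may be organized differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSettings:
    # Hypothetical container, not necessarily what src/config.py defines.
    cartesia_model: str
    deepgram_language: str
    deepgram_endpointing_ms: int
    deepgram_utterance_end_ms: int
    sample_rate: int
    wake_word: str


def load_settings(env=None) -> VoiceSettings:
    """Read optional settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return VoiceSettings(
        cartesia_model=env.get("CARTESIA_MODEL", "sonic-3"),
        deepgram_language=env.get("DEEPGRAM_LANGUAGE", "id"),
        deepgram_endpointing_ms=int(env.get("DEEPGRAM_ENDPOINTING_MS", "300")),
        deepgram_utterance_end_ms=int(env.get("DEEPGRAM_UTTERANCE_END_MS", "2000")),
        sample_rate=int(env.get("SAMPLE_RATE", "16000")),
        wake_word=env.get("WAKE_WORD", "Hai EMA"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise without touching the process environment.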

## Run

```bash
uv run uvicorn main:app --host 0.0.0.0 --port 7861
# or, with auto-reload during development:
uv run uvicorn main:app --host 0.0.0.0 --port 7861 --reload
```

The server will run at `http://localhost:7861`.

## Test

**Health check:**
```bash
curl http://localhost:7861/health
```

Expected response:
```json
{
  "status": "ok",
  "version": "1.1.0",
  "stt_ready": true,
  "tts_ready": true
}
```

A `degraded` status (HTTP 503) is returned if the API keys are incomplete.
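
The rule can be sketched as a pure function. This is an illustration under assumptions, not the handler in `main.py`: the name `health_payload` is hypothetical, and so is the mapping of keys to flags (`stt_ready` from the Deepgram key, `tts_ready` from the Cartesia key plus voice ID) and the shape of the degraded body.

```python
def health_payload(env, version="1.1.0"):
    """Return (http_status, body) for a health check, given a mapping of env vars."""
    # Assumption: STT readiness depends only on the Deepgram key.
    stt_ready = bool(env.get("DEEPGRAM_API_KEY"))
    # Assumption: TTS readiness needs both the Cartesia key and a voice ID.
    tts_ready = bool(env.get("CARTESIA_API_KEY")) and bool(env.get("CARTESIA_VOICE_ID"))
    healthy = stt_ready and tts_ready
    body = {
        "status": "ok" if healthy else "degraded",
        "version": version,
        "stt_ready": stt_ready,
        "tts_ready": tts_ready,
    }
    return (200 if healthy else 503), body
```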

**WebSocket test (send WAV audio, receive the TTS response):**
```bash
uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav
```

> The WAV file must be in this format: **16 kHz, 16-bit, mono PCM**.
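
A file can be checked against this format before sending it, using only the standard library `wave` module. The helper name `check_wav_format` is hypothetical and is not part of `test_client.py`.

```python
import wave


def check_wav_format(path, sample_rate=16000):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {sample_rate}")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getcomptype() != "NONE":  # "NONE" means uncompressed PCM
            problems.append(f"compression {wav.getcomptype()}, expected PCM")
    return problems
```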

**Specific tests:**
```bash
uv run python test_client.py --test health      # Health check
uv run python test_client.py --test ping        # Heartbeat ping/pong
uv run python test_client.py --test interrupt   # Cancel ongoing TTS
uv run python test_client.py --test stop        # Graceful disconnect
```

**Connectivity check (no audio file):**
```bash
uv run python test_client.py
```

**Convert M4A audio to WAV:**
```bash
uv run python convert_audio.py                     # Convert all files in playground/mp4/
uv run python convert_audio.py path/to/file.m4a   # Convert a single file
```

## Docker

**Build:**
```bash
docker build -t voice-agent .
```

**Run:**
```bash
docker run -p 7861:7861 --env-file .env voice-agent
```

## Wake Word

Default wake word: **"Hai EMA"** (Indonesian, case-insensitive)

Example: say _"Hai EMA, apa kabar?"_ → the agent replies via TTS with _"apa kabar"_.

It can be configured via the `WAKE_WORD` environment variable.
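
A minimal sketch of case-insensitive wake word matching that reproduces the example above. The function name and the punctuation-stripping rule are assumptions for illustration; the actual logic in `src/pipeline.py` may differ.

```python
import re


def extract_after_wake_word(transcript, wake_word="Hai EMA"):
    """Return the text following the wake word, or None if it is absent."""
    # Allow any run of whitespace between the words of the wake phrase.
    words = [re.escape(word) for word in wake_word.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    match = pattern.search(transcript)
    if match is None:
        return None
    # Assumption: surrounding punctuation and whitespace are trimmed from the reply.
    return transcript[match.end():].strip(" \t\n,.!?;:")
```

With this sketch, a transcript without the wake word yields `None`, so the caller can tell "no wake word" apart from "wake word with nothing after it" (an empty string).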

## Architecture

### Current flow (Phase 1: Echo)

```
Client Audio Stream (PCM 16kHz 16-bit mono)
    ↓
Deepgram STT (nova-2, real-time streaming)
    ↓
Wake Word Detection
    ↓
Echo Response
    ↓
Cartesia TTS (streaming chunks)
    ↓
Client Audio Playback
```
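
The echo flow for a single finalized transcript can be sketched with the STT and TTS providers stubbed out. Everything here is illustrative: `run_echo_turn` and `fake_tts` are hypothetical names, the ordering of outbound frames is an assumption, and no Deepgram or Cartesia call is made.

```python
import json


def run_echo_turn(transcript, synthesize, wake_word="Hai EMA"):
    """Return the frames a server might send for one finalized transcript."""
    frames = [json.dumps({"event": "transcript", "text": transcript})]
    index = transcript.lower().find(wake_word.lower())
    if index == -1:
        return frames  # no wake word: nothing is echoed
    reply = transcript[index + len(wake_word):].strip(" ,.!?")
    frames.append(json.dumps({"event": "reply", "text": reply}))
    frames.extend(synthesize(reply))  # binary audio chunks from the TTS stage
    frames.append(json.dumps({"event": "tts_end"}))
    return frames


def fake_tts(text):
    # Stand-in for the TTS stage: returns one "audio" chunk of raw bytes.
    return [text.encode("utf-8")]
```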

### Planned flow (Phase 2: LLM + RAG)

```
Client Audio Stream
    ↓
Deepgram STT
    ↓
Wake Word Detection
    ↓
PDF Knowledge Base Retrieval (not yet implemented)
    ↓
LLM Answer Generation (not yet implemented)
    ↓
Cartesia TTS
    ↓
Client Audio Playback
```

## WebSocket Protocol

**Endpoint:** `ws://localhost:7861/ws/voice`

**Client → Server:**

| Type | Format | Description |
|------|--------|-------------|
| Binary | PCM audio chunk | Audio 16kHz, 16-bit, mono |
| Text | `{"action": "ping"}` | Heartbeat keep-alive |
| Text | `{"action": "stop"}` | Graceful disconnect |
| Text | `{"action": "interrupt"}` | Cancel ongoing TTS |
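
A client has to split raw PCM into binary frames before sending. The sketch below shows one way to do that; the 100 ms chunk duration is an assumption chosen for illustration, since this README does not specify a chunk size, and `iter_pcm_chunks` is a hypothetical helper.

```python
def iter_pcm_chunks(pcm, sample_rate=16000, chunk_ms=100):
    """Yield consecutive slices of 16-bit mono PCM, each covering chunk_ms of audio."""
    bytes_per_sample = 2  # 16-bit samples
    chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
    chunk_bytes -= chunk_bytes % bytes_per_sample  # never split a sample in half
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

At 16 kHz, 16-bit mono, one second of audio is 32,000 bytes, so 100 ms chunks are 3,200 bytes each.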

**Server → Client:**

| Type | Format | Description |
|------|--------|-------------|
| Binary | PCM audio chunk | TTS response audio |
| Text | `{"event": "transcript", "text": "..."}` | STT result |
| Text | `{"event": "reply", "text": "..."}` | Text after the wake word |
| Text | `{"event": "tts_end"}` | TTS finished |
| Text | `{"event": "interrupted"}` | TTS cancelled |
| Text | `{"event": "pong"}` | Ping response |
| Text | `{"event": "error", "code": "...", "message": "..."}` | Error |

See [API_CONTRACT.md](API_CONTRACT.md) for the full WebSocket protocol documentation.
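
As a rough sketch of how the text messages in the tables above relate, the function below maps a client control message to a server event. The function name, the error codes (`bad_request`, `unknown_action`), and the choice to send no event for `stop` are assumptions; `API_CONTRACT.md` is the authoritative source.

```python
import json


def handle_control_message(raw):
    """Map a client text frame to a server event dict, or None when no reply is sent."""
    try:
        action = json.loads(raw).get("action")
    except (json.JSONDecodeError, AttributeError):
        # Not valid JSON, or valid JSON that is not an object.
        return {"event": "error", "code": "bad_request", "message": "invalid control message"}
    if action == "ping":
        return {"event": "pong"}
    if action == "interrupt":
        # Assumption: the server always acknowledges with "interrupted".
        return {"event": "interrupted"}
    if action == "stop":
        return None  # assumption: the server just closes the connection
    return {"event": "error", "code": "unknown_action", "message": f"unsupported action: {action!r}"}
```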

## Project Structure

```
├── src/
│   ├── config.py              # Configuration & environment variables
│   ├── pipeline.py            # Core voice pipeline (STT → Wake Word → TTS)
│   ├── stt/
│   │   ├── deepgram_client.py # Deepgram real-time STT (active)
│   │   └── assemblyai_client.py # AssemblyAI STT (alternative, unused)
│   ├── tts/
│   │   └── cartesia_client.py # Cartesia TTS streaming
│   ├── llm/
│   │   └── answerer.py        # LLM answer generation (Phase 2, not yet implemented)
│   └── knowledge/
│       └── loader.py          # PDF loader & RAG (Phase 2, not yet implemented)
├── main.py                    # FastAPI entry point & WebSocket handler
├── test_client.py             # Test client
├── convert_audio.py           # M4A → WAV converter
├── playground/                # Audio samples and TTS output
├── Dockerfile
├── .env.example
└── API_CONTRACT.md
```