ishaq101 committed on
Commit
dd0dc33
·
1 Parent(s): 226ff5d

[NOTICKET] Update readme

Files changed (1)
  1. README.md +119 -13
README.md CHANGED
@@ -9,14 +9,16 @@ pinned: false
9
 
10
  # Voice Agent Service
11
 
12
- Real-time voice agent backend with WebSocket-based STT (AssemblyAI) and TTS (Cartesia). Accepts audio stream from client, detects wake word, and streams back synthesized speech.
 
 
13
 
14
  ## Requirements
15
 
16
  - Python 3.11+
17
  - [uv](https://docs.astral.sh/uv/getting-started/installation/)
18
- - AssemblyAI API key (free tier)
19
- - Cartesia API key + Voice ID (free tier)
20
 
21
  ## Setup
22
 
@@ -32,11 +34,21 @@ cp .env.example .env
32
 
33
  Edit `.env` and fill in the API keys:
34
  ```env
35
- ASSEMBLYAI_API_KEY=your_key
36
  CARTESIA_API_KEY=your_key
37
  CARTESIA_VOICE_ID=your_voice_id
38
  ```
39
 
40
  ## Run
41
 
42
  ```bash
@@ -62,16 +74,21 @@ Expected response:
62
  }
63
  ```
64
 
65
- **WebSocket test (send WAV audio, receive the TTS response):**
 
 
66
  ```bash
67
- uv run python test_client.py path/to/audio.wav
68
  ```
69
 
70
- > The WAV file must be 16 kHz, 16-bit, mono PCM.
71
 
72
- The audio response is saved to `output.pcm`. To play it:
73
  ```bash
74
- ffplay -f s16le -ar 16000 -ac 1 output.pcm
 
 
 
75
  ```
76
 
77
  **Connectivity check (without an audio file):**
@@ -79,7 +96,11 @@ ffplay -f s16le -ar 16000 -ac 1 output.pcm
79
  uv run python test_client.py
80
  ```
81
 
82
- Sends 3 seconds of silence to verify that the WebSocket connection succeeds.
 
 
 
 
83
 
84
  ## Docker
85
 
@@ -95,10 +116,95 @@ docker run -p 7860:7860 --env-file .env voice-agent
95
 
96
  ## Wake Word
97
 
98
- Default wake word: **"hi voice agent"** (case-insensitive)
99
 
100
- Example: say _"Hi Voice Agent, what time is it?"_ → the agent replies via TTS with _"what time is it"_.
101
 
102
- ## API Contract
103
 
104
  See [API_CONTRACT.md](API_CONTRACT.md) for the full WebSocket protocol documentation.
 
9
 
10
  # Voice Agent Service
11
 
12
+ Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It accepts an audio stream from the client, detects the wake word, and streams synthesized speech back.
13
+
14
+ **Current version: Phase 1 (Echo Mode).** The text after the wake word is echoed back directly through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.
15
 
16
  ## Requirements
17
 
18
  - Python 3.11+
19
  - [uv](https://docs.astral.sh/uv/getting-started/installation/)
20
+ - Deepgram API key
21
+ - Cartesia API key + Voice ID
22
 
23
  ## Setup
24
 
 
34
 
35
  Edit `.env` and fill in the API keys:
36
  ```env
37
+ DEEPGRAM_API_KEY=your_key
38
  CARTESIA_API_KEY=your_key
39
  CARTESIA_VOICE_ID=your_voice_id
40
  ```
41
 
42
+ **Optional configuration:**
43
+ ```env
44
+ CARTESIA_MODEL=sonic-3 # Default: sonic-3
45
+ DEEPGRAM_LANGUAGE=id # Default: id (Indonesian)
46
+ DEEPGRAM_ENDPOINTING_MS=300 # Default: 300ms
47
+ DEEPGRAM_UTTERANCE_END_MS=2000 # Default: 2000ms
48
+ SAMPLE_RATE=16000 # Default: 16000 Hz
49
+ WAKE_WORD=Hai EMA # Default: "Hai EMA"
50
+ ```
51
+
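The optional variables and their defaults could be read roughly as follows. This is a minimal sketch, not the contents of `src/config.py`: the `Settings` dataclass and `load_settings` function are hypothetical names, and only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the optional settings listed in the README."""
    cartesia_model: str
    deepgram_language: str
    deepgram_endpointing_ms: int
    deepgram_utterance_end_ms: int
    sample_rate: int
    wake_word: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read each optional variable, falling back to its documented default."""
    return Settings(
        cartesia_model=env.get("CARTESIA_MODEL", "sonic-3"),
        deepgram_language=env.get("DEEPGRAM_LANGUAGE", "id"),
        deepgram_endpointing_ms=int(env.get("DEEPGRAM_ENDPOINTING_MS", "300")),
        deepgram_utterance_end_ms=int(env.get("DEEPGRAM_UTTERANCE_END_MS", "2000")),
        sample_rate=int(env.get("SAMPLE_RATE", "16000")),
        wake_word=env.get("WAKE_WORD", "Hai EMA"),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check without touching the real environment.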
52
  ## Run
53
 
54
  ```bash
 
74
  }
75
  ```
76
 
77
+ A `degraded` status (HTTP 503) is returned if any required API key is missing.
78
+
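The degraded rule can be illustrated with a small check. This is a sketch under assumptions: the helper names are hypothetical, treating all three `.env` entries as required is a guess, and returning 200 for the healthy case is assumed rather than documented here. Only the 503 for missing keys comes from the text above.

```python
from typing import Mapping

# Assumption: these three .env entries are what the health check requires.
REQUIRED_KEYS = ("DEEPGRAM_API_KEY", "CARTESIA_API_KEY", "CARTESIA_VOICE_ID")


def missing_keys(env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are unset or blank."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


def health_status_code(env: Mapping[str, str]) -> int:
    """503 when any key is missing (documented); 200 otherwise (assumed)."""
    return 503 if missing_keys(env) else 200
```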
79
+ **WebSocket test (send WAV audio, receive the TTS response):**
80
  ```bash
81
+ uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav
82
  ```
83
 
84
+ > The WAV file must be **16 kHz, 16-bit, mono PCM**.
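
Before sending a file, the format requirement can be checked locally with Python's standard `wave` module. This is a sketch only; `check_wav_format` is a hypothetical helper and is not part of `test_client.py`.

```python
import wave


def check_wav_format(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems
```

For example, `check_wav_format("path/to/audio.wav")` returns an empty list when the file can be sent as-is.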
85
 
86
+ **Specific tests:**
87
  ```bash
88
+ uv run python test_client.py --test health # Health check
89
+ uv run python test_client.py --test ping # Heartbeat ping/pong
90
+ uv run python test_client.py --test interrupt # Cancel ongoing TTS
91
+ uv run python test_client.py --test stop # Graceful disconnect
92
  ```
93
 
94
  **Connectivity check (without an audio file):**
 
96
  uv run python test_client.py
97
  ```
98
 
99
+ **Convert M4A audio to WAV:**
100
+ ```bash
101
+ uv run python convert_audio.py # Convert all files in playground/mp4/
102
+ uv run python convert_audio.py path/to/file.m4a # Convert a single file
103
+ ```
104
 
105
  ## Docker
106
 
 
116
 
117
  ## Wake Word
118
 
119
+ Default wake word: **"Hai EMA"** (Indonesian, case-insensitive)
120
+
121
+ Example: say _"Hai EMA, apa kabar?"_ → the agent replies via TTS with _"apa kabar"_.
122
+
123
+ The wake word can be configured via the `WAKE_WORD` environment variable.
124
+
125
+ ## Architecture
126
+
127
+ ### Current flow (Phase 1: Echo)
128
+
129
+ ```
130
+ Client Audio Stream (PCM 16kHz 16-bit mono)
131
+ ↓
132
+ Deepgram STT (nova-2, real-time streaming)
133
+ ↓
134
+ Wake Word Detection
135
+ ↓
136
+ Echo Response
137
+ ↓
138
+ Cartesia TTS (streaming chunks)
139
+ ↓
140
+ Client Audio Playback
141
+ ```
142
+
143
+ ### Planned flow (Phase 2: LLM + RAG)
144
+
145
+ ```
146
+ Client Audio Stream
147
+ ↓
148
+ Deepgram STT
149
+ ↓
150
+ Wake Word Detection
151
+ ↓
152
+ PDF Knowledge Base Retrieval (not yet implemented)
153
+ ↓
154
+ LLM Answer Generation (not yet implemented)
155
+ ↓
156
+ Cartesia TTS
157
+ ↓
158
+ Client Audio Playback
159
+ ```
160
+
161
+ ## WebSocket Protocol
162
+
163
+ **Endpoint:** `ws://localhost:7860/ws/voice`
164
 
165
+ **Client → Server:**
166
 
167
+ | Type | Format | Description |
168
+ |------|--------|------------|
169
+ | Binary | PCM audio chunk | Audio 16kHz, 16-bit, mono |
170
+ | Text | `{"action": "ping"}` | Heartbeat keep-alive |
171
+ | Text | `{"action": "stop"}` | Graceful disconnect |
172
+ | Text | `{"action": "interrupt"}` | Cancel ongoing TTS |
173
+
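Binary frames carry raw PCM, so a client has to slice its audio into chunks before sending. At 16 kHz, 16-bit, mono, one second of audio is 16000 × 2 = 32000 bytes. The sketch below assumes 100 ms chunks (3200 bytes); the chunk duration is an illustrative choice, not something the protocol table specifies, and `iter_pcm_chunks` is a hypothetical helper.

```python
from typing import Iterator


def iter_pcm_chunks(pcm: bytes, chunk_ms: int = 100, sample_rate: int = 16000) -> Iterator[bytes]:
    """Yield consecutive slices of 16-bit mono PCM, each chunk_ms long."""
    bytes_per_sample = 2  # 16-bit samples, one channel
    chunk_size = sample_rate * bytes_per_sample * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield pcm[start:start + chunk_size]
```

Each yielded chunk would be sent as one binary WebSocket frame.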
174
+ **Server → Client:**
175
+
176
+ | Type | Format | Description |
177
+ |------|--------|------------|
178
+ | Binary | PCM audio chunk | TTS response audio |
179
+ | Text | `{"event": "transcript", "text": "..."}` | STT result |
180
+ | Text | `{"event": "reply", "text": "..."}` | Text after the wake word |
181
+ | Text | `{"event": "tts_end"}` | TTS finished |
182
+ | Text | `{"event": "interrupted"}` | TTS cancelled |
183
+ | Text | `{"event": "pong"}` | Response to ping |
184
+ | Text | `{"event": "error", "code": "...", "message": "..."}` | Error |
185
 
186
  See [API_CONTRACT.md](API_CONTRACT.md) for the full WebSocket protocol documentation.
187
+
188
+ ## Project Structure
189
+
190
+ ```
191
+ ├── src/
192
+ │   ├── config.py # Configuration & environment variables
193
+ │   ├── pipeline.py # Core voice pipeline (STT → Wake Word → TTS)
194
+ │   ├── stt/
195
+ │   │   ├── deepgram_client.py # Deepgram real-time STT (active)
196
+ │   │   └── assemblyai_client.py # AssemblyAI STT (alternative, not used)
197
+ │   ├── tts/
198
+ │   │   └── cartesia_client.py # Cartesia TTS streaming
199
+ │   ├── llm/
200
+ │   │   └── answerer.py # LLM answer generation (Phase 2, not yet implemented)
201
+ │   └── knowledge/
202
+ │       └── loader.py # PDF loader & RAG (Phase 2, not yet implemented)
203
+ ├── main.py # FastAPI entry point & WebSocket handler
204
+ ├── test_client.py # Test client
205
+ ├── convert_audio.py # M4A → WAV converter
206
+ ├── playground/ # Audio samples and TTS output
207
+ ├── Dockerfile
208
+ ├── .env.example
209
+ └── API_CONTRACT.md
210
+ ```