ishaq101

Feat: STT & TTS Gemini, Creds Gemini Cloud Speech, Generate Buffer Audio, Testing

986403e 25 days ago

	---
	title: Demo Voice Agent Data Eyond
	emoji: 🌍
	colorFrom: pink
	colorTo: pink
	sdk: docker
	pinned: true
	---

	# Voice Agent Service

	Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It receives an audio stream from the client, detects the wake word, and streams synthesized speech back.

	Current version: Phase 1 (Echo Mode), in which the text following the wake word is echoed straight back through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.

	## Requirements

	- Python 3.11+
	- [uv](https://docs.astral.sh/uv/getting-started/installation/)
	- Deepgram API key
	- Cartesia API key + Voice ID

	## Setup

	1. Clone & install dependencies
	```bash
	uv sync
	```

	2. Configure environment
	```bash
	cp .env.example .env
	```

	Edit `.env` and fill in the API keys:
	```env
	DEEPGRAM_API_KEY=your_key
	CARTESIA_API_KEY=your_key
	CARTESIA_VOICE_ID=your_voice_id
	```

	Optional configuration:
	```env
	CARTESIA_MODEL=sonic-3 # Default: sonic-3
	DEEPGRAM_LANGUAGE=id # Default: id (Indonesian)
	DEEPGRAM_ENDPOINTING_MS=300 # Default: 300ms
	DEEPGRAM_UTTERANCE_END_MS=2000 # Default: 2000ms
	SAMPLE_RATE=16000 # Default: 16000 Hz
	WAKE_WORD=Hai EMA # Default: "Hai EMA"
	```
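
The variables above could be read into a typed settings object along these lines. This is only a sketch: `VoiceConfig` and `load_config` are hypothetical names, and the actual `src/config.py` may be organized differently. Only the variable names and defaults are taken from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VoiceConfig:
    """Hypothetical settings container; fields mirror the env variables in this README."""
    deepgram_api_key: str
    cartesia_api_key: str
    cartesia_voice_id: str
    cartesia_model: str
    deepgram_language: str
    deepgram_endpointing_ms: int
    deepgram_utterance_end_ms: int
    sample_rate: int
    wake_word: str


def load_config(env: Mapping[str, str] = os.environ) -> VoiceConfig:
    """Read settings from a mapping, falling back to the documented defaults."""
    return VoiceConfig(
        # Required keys default to "" so a health check can report them as missing.
        deepgram_api_key=env.get("DEEPGRAM_API_KEY", ""),
        cartesia_api_key=env.get("CARTESIA_API_KEY", ""),
        cartesia_voice_id=env.get("CARTESIA_VOICE_ID", ""),
        # Optional keys use the defaults listed above.
        cartesia_model=env.get("CARTESIA_MODEL", "sonic-3"),
        deepgram_language=env.get("DEEPGRAM_LANGUAGE", "id"),
        deepgram_endpointing_ms=int(env.get("DEEPGRAM_ENDPOINTING_MS", "300")),
        deepgram_utterance_end_ms=int(env.get("DEEPGRAM_UTTERANCE_END_MS", "2000")),
        sample_rate=int(env.get("SAMPLE_RATE", "16000")),
        wake_word=env.get("WAKE_WORD", "Hai EMA"),
    )
```

Passing the environment as a parameter keeps the loader easy to test with a plain dictionary, for example `load_config({"SAMPLE_RATE": "8000"})`.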

	## Run

	```bash
	uv run uvicorn main:app --host 0.0.0.0 --port 7861
	# or, with auto-reload during development:
	uv run uvicorn main:app --host 0.0.0.0 --port 7861 --reload
	```

	The server will be running at `http://localhost:7861`.

	## Test

	Health check:
	```bash
	curl http://localhost:7861/health
	```

	Expected response:
	```json
	{
	"status": "ok",
	"version": "1.1.0",
	"stt_ready": true,
	"tts_ready": true
	}
	```

	A `degraded` status (HTTP 503) is returned if the API keys are incomplete.
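
One way the `ok` / `degraded` decision could be derived is sketched below. `build_health` is a hypothetical helper, and the rule that TTS readiness needs both the Cartesia key and the voice ID is an assumption; the README only states that incomplete API keys yield `degraded` with HTTP 503.

```python
from typing import Any, Dict, Tuple


def build_health(
    deepgram_key: str,
    cartesia_key: str,
    cartesia_voice_id: str,
    version: str = "1.1.0",
) -> Tuple[int, Dict[str, Any]]:
    """Return (HTTP status code, JSON body) for a health endpoint."""
    stt_ready = bool(deepgram_key)
    # Assumption: TTS counts as ready only when key and voice ID are both set.
    tts_ready = bool(cartesia_key and cartesia_voice_id)
    healthy = stt_ready and tts_ready
    body = {
        "status": "ok" if healthy else "degraded",
        "version": version,
        "stt_ready": stt_ready,
        "tts_ready": tts_ready,
    }
    return (200 if healthy else 503), body
```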

	WebSocket test (send WAV audio, receive the TTS response):
	```bash
	uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav
	```

	> The WAV file must be 16 kHz, 16-bit, mono PCM.
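
Before sending a file, the format requirement can be checked with Python's standard `wave` module. The helper below is a sketch; `is_supported_wav` is a hypothetical name and is not part of `test_client.py`.

```python
import io
import wave


def is_supported_wav(data: bytes) -> bool:
    """Return True if the bytes are a 16 kHz, 16-bit, mono PCM WAV file."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            return (
                wav.getframerate() == 16000   # 16 kHz sample rate
                and wav.getsampwidth() == 2   # 2 bytes per sample = 16-bit
                and wav.getnchannels() == 1   # mono
                and wav.getcomptype() == "NONE"  # uncompressed PCM
            )
    except (wave.Error, EOFError):
        # Not a readable PCM WAV file (wrong container, compressed, or truncated).
        return False
```

Typical use would be `is_supported_wav(Path("path/to/audio.wav").read_bytes())` with `pathlib.Path`.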

	Specific tests:
	```bash
	uv run python test_client.py --test health # Health check
	uv run python test_client.py --test ping # Heartbeat ping/pong
	uv run python test_client.py --test interrupt # Cancel ongoing TTS
	uv run python test_client.py --test stop # Graceful disconnect
	```

	Connectivity check (no audio file needed):
	```bash
	uv run python test_client.py
	```

	Convert M4A audio to WAV:
	```bash
	uv run python convert_audio.py # Convert all files in playground/mp4/
	uv run python convert_audio.py path/to/file.m4a # Convert a single file
	```

	## Docker

	Build:
	```bash
	docker build -t voice-agent .
	```

	Run:
	```bash
	docker run -p 7861:7861 --env-file .env voice-agent
	```

	## Wake Word

	Default wake word: "Hai EMA" (Indonesian, case-insensitive)

	Example: say _"Hai EMA, apa kabar?"_ → the agent replies via TTS with _"apa kabar"_.

	Configurable via the `WAKE_WORD` environment variable.
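
The matching behaviour described above (case-insensitive, reply is the text after the wake word) could be sketched as follows. `extract_after_wake_word` is a hypothetical function, and the punctuation stripping is an assumption inferred from the example, not a description of `src/pipeline.py`.

```python
import re
from typing import Optional


def extract_after_wake_word(transcript: str, wake_word: str = "Hai EMA") -> Optional[str]:
    """Return the text following the wake word, or None if it is absent."""
    # Allow any run of whitespace between the wake word's tokens.
    tokens = [re.escape(token) for token in wake_word.split()]
    pattern = r"\b" + r"\s+".join(tokens) + r"\b"
    match = re.search(pattern, transcript, flags=re.IGNORECASE)
    if match is None:
        return None
    # Assumption: surrounding punctuation is dropped, as in the example above.
    return transcript[match.end():].strip(" \t\n,.!?;:")


print(extract_after_wake_word("Hai EMA, apa kabar?"))  # apa kabar
```

An empty string means the wake word was heard with nothing after it, while `None` means it was not heard at all, so callers can tell the two cases apart.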

	## Architecture

	### Current flow (Phase 1, Echo)

	```
	Client Audio Stream (PCM 16kHz 16-bit mono)
	↓
	Deepgram STT (nova-2, real-time streaming)
	↓
	Wake Word Detection
	↓
	Echo Response
	↓
	Cartesia TTS (streaming chunks)
	↓
	Client Audio Playback
	```
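
The stages in the diagram can be wired together as asynchronous generators. The sketch below replaces Deepgram and Cartesia with stand-in functions (`fake_stt`, `fake_tts`), so every name in it is hypothetical and it only illustrates the data flow, not the real clients.

```python
import asyncio
from typing import AsyncIterator, Callable


async def echo_pipeline(
    transcripts: AsyncIterator[str],
    synthesize: Callable[[str], AsyncIterator[bytes]],
    wake_word: str = "Hai EMA",
) -> AsyncIterator[bytes]:
    """Yield TTS audio chunks for the text that follows the wake word."""
    async for transcript in transcripts:
        # Case-insensitive search; index arithmetic assumes an ASCII wake word.
        index = transcript.lower().find(wake_word.lower())
        if index == -1:
            continue  # No wake word: ignore this utterance.
        reply = transcript[index + len(wake_word):].strip(" ,.!?")
        if not reply:
            continue  # Wake word alone: nothing to echo.
        async for chunk in synthesize(reply):
            yield chunk


async def fake_stt() -> AsyncIterator[str]:
    """Stand-in for a streaming STT client."""
    yield "halo semua"
    yield "Hai EMA, apa kabar?"


async def fake_tts(text: str) -> AsyncIterator[bytes]:
    """Stand-in for a streaming TTS client: emits the text as 4-byte chunks."""
    data = text.encode("utf-8")
    for start in range(0, len(data), 4):
        yield data[start:start + 4]


async def run_demo() -> bytes:
    chunks = [chunk async for chunk in echo_pipeline(fake_stt(), fake_tts)]
    return b"".join(chunks)


print(asyncio.run(run_demo()))  # b'apa kabar'
```

Injecting the STT and TTS stages as parameters is a design choice for this sketch: it lets the flow be exercised without network access or API keys.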

	### Planned flow (Phase 2, LLM + RAG)

	```
	Client Audio Stream
	↓
	Deepgram STT
	↓
	Wake Word Detection
	↓
	PDF Knowledge Base Retrieval (not yet implemented)
	↓
	LLM Answer Generation (not yet implemented)
	↓
	Cartesia TTS
	↓
	Client Audio Playback
	```

	## WebSocket Protocol

	Endpoint: `ws://localhost:7861/ws/voice`

	Client → Server:

	| Type | Format | Description |
	|------|--------|-------------|
	| Binary | PCM audio chunk | Audio 16 kHz, 16-bit, mono |
	| Text | `{"action": "ping"}` | Heartbeat keep-alive |
	| Text | `{"action": "stop"}` | Graceful disconnect |
	| Text | `{"action": "interrupt"}` | Cancel ongoing TTS |
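
Binary frames carry raw PCM, so a client has to cut its audio into chunks before sending. The sketch below shows the arithmetic for 16 kHz, 16-bit, mono audio; the 20 ms chunk duration is an assumption chosen for illustration, since this README does not specify a chunk size.

```python
from typing import Iterator

SAMPLE_RATE = 16000      # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit
CHANNELS = 1             # mono


def chunk_size_bytes(chunk_ms: int) -> int:
    """Number of PCM bytes covering chunk_ms milliseconds of audio."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_pcm_chunks(pcm: bytes, chunk_ms: int = 20) -> Iterator[bytes]:
    """Split raw PCM bytes into fixed-duration chunks; the last may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


print(chunk_size_bytes(20))  # 640
```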

	Server → Client:

	| Type | Format | Description |
	|------|--------|-------------|
	| Binary | PCM audio chunk | TTS response audio |
	| Text | `{"event": "transcript", "text": "..."}` | STT result |
	| Text | `{"event": "reply", "text": "..."}` | Text after the wake word |
	| Text | `{"event": "tts_end"}` | TTS finished |
	| Text | `{"event": "interrupted"}` | TTS cancelled |
	| Text | `{"event": "pong"}` | Ping response |
	| Text | `{"event": "error", "code": "...", "message": "..."}` | Error |

	See [API_CONTRACT.md](API_CONTRACT.md) for the full WebSocket protocol documentation.
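
On the client side, the message shapes in the tables above can be handled with the standard `json` module alone. The helpers below are a sketch with hypothetical names (`control_message`, `parse_server_frame`); they cover only the fields listed in this README, and `API_CONTRACT.md` remains the authoritative reference.

```python
import json
from typing import Any, Tuple, Union

CLIENT_ACTIONS = {"ping", "stop", "interrupt"}


def control_message(action: str) -> str:
    """Build a client-to-server text frame for one of the documented actions."""
    if action not in CLIENT_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return json.dumps({"action": action})


def parse_server_frame(frame: Union[bytes, str]) -> Tuple[str, Any]:
    """Classify a server frame as ("audio", bytes) or (event name, payload dict)."""
    if isinstance(frame, (bytes, bytearray)):
        # Binary frames are TTS audio chunks.
        return "audio", bytes(frame)
    payload = json.loads(frame)
    return payload["event"], payload


print(parse_server_frame('{"event": "transcript", "text": "Hai EMA"}')[0])  # transcript
```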

	## Project Structure

	```
	├── src/
	│   ├── config.py                 # Configuration & environment variables
	│   ├── pipeline.py               # Core voice pipeline (STT → Wake Word → TTS)
	│   ├── stt/
	│   │   ├── deepgram_client.py    # Deepgram real-time STT (active)
	│   │   └── assemblyai_client.py  # AssemblyAI STT (alternative, unused)
	│   ├── tts/
	│   │   └── cartesia_client.py    # Cartesia TTS streaming
	│   ├── llm/
	│   │   └── answerer.py           # LLM answer generation (Phase 2, not yet implemented)
	│   └── knowledge/
	│       └── loader.py             # PDF loader & RAG (Phase 2, not yet implemented)
	├── main.py                       # FastAPI entry point & WebSocket handler
	├── test_client.py                # Test client
	├── convert_audio.py              # M4A → WAV converter
	├── playground/                   # Audio samples and TTS output
	├── Dockerfile
	├── .env.example
	└── API_CONTRACT.md
	```