---
title: Voice Agent
emoji: 🐨
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
---
# Speech AI Agent
FastAPI backend + Streamlit UI for a voice agent using **Azure Speech** (STT/TTS) and **Azure AI Foundry Agents** (Azure AI Projects SDK).
## Setup
1) Create a `.env` file (copy from `.env.example` and fill values).
2) Create a virtual environment and install dependencies (from the project root):
```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```
### Azure AI Foundry auth (local dev)
Foundry Agent auth uses Entra ID. For local dev, run:
```bash
az login
```
Alternatively, provide service-principal credentials via environment variables:
`AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`.
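As a quick sanity check before starting the backend, a small stdlib-only sketch (the variable names match the list above; the helper name is illustrative) can confirm the service-principal variables are set:

```python
import os

# Service-principal variables expected in the environment (per the list above).
REQUIRED = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")

def missing_sp_vars(env=os.environ):
    """Return the names of any unset/empty service-principal variables."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_sp_vars()
    if missing:
        print("Missing (use `az login` instead):", ", ".join(missing))
    else:
        print("Service principal configured.")
```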
## Run backend
```bash
python -m uvicorn src.app.main:app --reload --host 0.0.0.0 --port 8000
```
## Run Streamlit UI
```bash
streamlit run ui/streamlit_app.py
```
If the backend isn’t on `localhost:8000`, point the UI at it with:
```bash
SPEECH_AGENT_WS_URL=ws://<host>:<port>/ws/voice
SPEECH_AGENT_HTTP_URL=http://<host>:<port>
```
For local agent RAG, configure:
```bash
AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT=<your-embeddings-deployment>
```
## Quick tests
Health check:
```bash
curl -s http://localhost:8000/health | jq
```
Test the audio upload endpoint (the reply audio is returned base64-encoded):
```bash
curl -s -X POST "http://localhost:8000/v1/voice/file" \
-F "file=@./sample.wav" \
-F "prompt=Answer briefly." | jq -r '.transcript, .reply_text'
```
Extract audio from response:
```bash
curl -s -X POST "http://localhost:8000/v1/voice/file" \
-F "file=@./sample.wav" \
-F "prompt=Answer briefly." \
| python -c "import sys, json, base64; d=json.load(sys.stdin); open('reply.wav','wb').write(base64.b64decode(d['reply_audio_base64']))"
```
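The shell pipeline above can also be written as a small Python helper. `save_reply_audio` is a hypothetical name; the `reply_audio_base64` key matches the response shown above:

```python
import base64
import json
from pathlib import Path

def save_reply_audio(response_json: str, out_path: str = "reply.wav") -> int:
    """Decode the base64 reply audio from a /v1/voice/file response
    and write it to disk. Returns the number of bytes written."""
    payload = json.loads(response_json)
    audio = base64.b64decode(payload["reply_audio_base64"])
    Path(out_path).write_bytes(audio)
    return len(audio)
```

Feed it the raw response, e.g. `save_reply_audio(sys.stdin.read())` at the end of a curl pipeline.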
## WebSocket streaming (send after stop)
Server endpoint: `ws://localhost:8000/ws/voice`
Client flow:
1) Send `{"event":"start","content_type":"audio/pcm;rate=16000;bits=16;channels=1","return_audio":true}`
2) Send binary audio chunks (PCM16 mono @ 16kHz)
3) Send `{"event":"stop","prompt":"Answer briefly."}`
Browser note: the demo page streams raw PCM (not container audio) to avoid format issues.
Optional dev demo page: http://localhost:8000/ws-demo
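The three-step flow above can be sketched with stdlib-only helpers that build the control frames and split PCM into chunks (helper names are illustrative; the event payloads match the protocol above):

```python
import json

def start_frame(return_audio: bool = True) -> str:
    """Step 1: JSON control frame announcing the audio format."""
    return json.dumps({
        "event": "start",
        "content_type": "audio/pcm;rate=16000;bits=16;channels=1",
        "return_audio": return_audio,
    })

def pcm_chunks(pcm: bytes, chunk_size: int = 3200):
    """Step 2: yield binary chunks (3200 bytes = 100 ms of PCM16 mono @ 16 kHz)."""
    for i in range(0, len(pcm), chunk_size):
        yield pcm[i:i + chunk_size]

def stop_frame(prompt: str = "Answer briefly.") -> str:
    """Step 3: JSON control frame ending the utterance."""
    return json.dumps({"event": "stop", "prompt": prompt})
```

With a WebSocket client library (e.g. `websockets`), you would send `start_frame()` as text, each chunk as binary, then `stop_frame()`.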
## Local Agent (tools + RAG + memory)
The UI includes an **Agent** toggle. When selected, it uses the local agent
pipeline with tools, local RAG (from `data/`), and memory.
RAG uses FAISS + Azure OpenAI embeddings. Supported file types:
`txt`, `md`, `pdf`, `docx`, `csv`.
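Before embedding, documents are typically split into overlapping chunks so a fact that straddles a boundary still appears whole in at least one chunk. The pipeline's actual splitter lives in the repo; a hypothetical minimal version of the idea:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50):
    """Split text into fixed-size overlapping chunks.
    Each chunk would later be embedded and stored in the FAISS index."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```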
Endpoints:
- Upload files for RAG: `POST /v1/agent/upload` (multipart `files`)
- Reset session data: `POST /v1/agent/reset`
Example:
```bash
curl -s -X POST "http://localhost:8000/v1/agent/upload" \
-F "files=@./notes.txt"
```
## Hugging Face (Docker Space)
Deploy FastAPI and Streamlit together in a single Docker Space.
1) Create a **Docker Space** and push this repo.
2) Set these **Space Secrets/Variables**:
- `AZURE_SPEECH_KEY`
- `AZURE_SPEECH_REGION`
- `FOUNDRY_PROJECT_CONN_STR`
- `FOUNDRY_AGENT_ID`
- `AZURE_TENANT_ID`
- `AZURE_CLIENT_ID`
- `AZURE_CLIENT_SECRET`
- `SPEECH_AGENT_WS_URL=wss://<your-space>.hf.space/ws/voice`
3) The provided `Dockerfile` + `docker/start.sh` + `docker/nginx.conf.template` will:
- run FastAPI on `:8000`
- run Streamlit on `:8501`
- expose everything via nginx on `:$PORT` (HF default 7860)
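The shipped `docker/nginx.conf.template` handles this routing; a hypothetical minimal fragment of the idea (paths and ports per the list above — the exact location rules in the repo may differ) looks like:

```nginx
server {
    listen ${PORT};  # Hugging Face injects PORT (default 7860)

    # API + WebSocket endpoints -> FastAPI
    location ~ ^/(v1|ws|health|ws-demo) {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;  # required for /ws/voice
        proxy_set_header Connection "upgrade";
    }

    # Everything else -> Streamlit UI
    location / {
        proxy_pass http://127.0.0.1:8501;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;  # Streamlit also uses WebSockets
        proxy_set_header Connection "upgrade";
    }
}
```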