---
title: Voice Agent
emoji: 🐨
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
---

# Speech AI Agent

FastAPI backend + Streamlit UI for a voice agent using Azure Speech (STT/TTS) and Azure AI Foundry Agents (Azure AI Projects SDK).

## Setup

1. Create a `.env` file (copy from `.env.example` and fill in the values).
2. Create a virtual environment and install dependencies (from the project root):

```bash
python -m venv .venv
source .venv/bin/activate

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```

## Azure AI Foundry auth (local dev)

Foundry Agent auth uses Microsoft Entra ID. For local development, run:

```bash
az login
```

Alternatively, set a service principal in your environment: `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`.
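A quick way to verify the service-principal variables are present before starting the backend (a sketch; the variable names are the standard Azure SDK ones, the helper itself is not part of this project):

```python
import os

# Standard Azure SDK service-principal variables.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")

def missing_service_principal_vars() -> list[str]:
    """Return the names of any unset or empty service-principal variables."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]
```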

## Run backend

```bash
python -m uvicorn src.app.main:app --reload --host 0.0.0.0 --port 8000
```

## Run Streamlit UI

```bash
streamlit run ui/streamlit_app.py
```

If the backend isn't on `localhost:8000`, set:

```bash
SPEECH_AGENT_WS_URL=ws://<host>:<port>/ws/voice
SPEECH_AGENT_HTTP_URL=http://<host>:<port>
```
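A minimal sketch of how a client might resolve these settings, falling back to the localhost defaults (the function name is illustrative, not part of the project):

```python
import os

def backend_urls() -> tuple[str, str]:
    """Resolve the backend WebSocket and HTTP endpoints from the environment."""
    ws_url = os.environ.get("SPEECH_AGENT_WS_URL", "ws://localhost:8000/ws/voice")
    http_url = os.environ.get("SPEECH_AGENT_HTTP_URL", "http://localhost:8000")
    return ws_url, http_url
```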

For local agent RAG, configure:

```bash
AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT=<your-embeddings-deployment>
```

## Quick tests

Health check:

```bash
curl -s http://localhost:8000/health | jq
```

Test audio upload (the response includes base64-encoded reply audio):

```bash
curl -s -X POST "http://localhost:8000/v1/voice/file" \
  -F "file=@./sample.wav" \
  -F "prompt=Answer briefly." | jq -r '.transcript, .reply_text'
```

Extract the reply audio from the response:

```bash
curl -s -X POST "http://localhost:8000/v1/voice/file" \
  -F "file=@./sample.wav" \
  -F "prompt=Answer briefly." \
| python -c "import sys, json, base64; d=json.load(sys.stdin); open('reply.wav','wb').write(base64.b64decode(d['reply_audio_base64']))"
```
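The same decode step as a standalone helper, for use outside a shell pipeline (a sketch; the `reply_audio_base64` field name comes from the response above, the helper name is an assumption):

```python
import base64
import json

def save_reply_audio(response_json: str, out_path: str = "reply.wav") -> int:
    """Decode the base64 reply audio from a /v1/voice/file response and write it to disk.

    Returns the number of audio bytes written.
    """
    payload = json.loads(response_json)
    audio = base64.b64decode(payload["reply_audio_base64"])
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```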

## WebSocket streaming (send after stop)

Server endpoint: `ws://localhost:8000/ws/voice`

Client flow:

1. Send `{"event":"start","content_type":"audio/pcm;rate=16000;bits=16;channels=1","return_audio":true}`
2. Send binary audio chunks (PCM16 mono @ 16 kHz)
3. Send `{"event":"stop","prompt":"Answer briefly."}`
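The control frames in steps 1 and 3 are plain JSON text messages; the audio in step 2 goes out as raw binary WebSocket frames. A sketch of building the two control frames (the WebSocket client library itself, e.g. `websockets`, is out of scope here; function names are illustrative):

```python
import json

def start_frame(return_audio: bool = True) -> str:
    # Step 1: announce the incoming audio format before streaming binary chunks.
    return json.dumps({
        "event": "start",
        "content_type": "audio/pcm;rate=16000;bits=16;channels=1",
        "return_audio": return_audio,
    })

def stop_frame(prompt: str) -> str:
    # Step 3: signal end of audio and pass the prompt for the agent.
    return json.dumps({"event": "stop", "prompt": prompt})
```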

Browser note: the demo page streams raw PCM (not container audio) to avoid format issues.

Optional dev demo page: http://localhost:8000/ws-demo

## Local Agent (tools + RAG + memory)

The UI includes an Agent toggle. When selected, it uses the local agent pipeline with tools, local RAG (over files in `data/`), and memory.

RAG uses FAISS + Azure OpenAI embeddings. Supported file types: `txt`, `md`, `pdf`, `docx`, `csv`.
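A sketch of the kind of extension check the upload path might apply before indexing (the extension set mirrors the supported types listed above; the function name is an assumption, not the project's actual API):

```python
from pathlib import Path

# File types the RAG ingester accepts, per the README.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".pdf", ".docx", ".csv"}

def is_supported(filename: str) -> bool:
    """Return True if the file's extension is one the RAG ingester can index."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```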

Endpoints:

- Upload files for RAG: `POST /v1/agent/upload` (multipart `files`)
- Reset session data: `POST /v1/agent/reset`

Example:

```bash
curl -s -X POST "http://localhost:8000/v1/agent/upload" \
  -F "files=@./notes.txt"
```

## Hugging Face (Docker Space)

Deploy both FastAPI and Streamlit in a single Docker Space.

1. Create a Docker Space and push this repo.
2. Set these Space Secrets/Variables:
   - `AZURE_SPEECH_KEY`
   - `AZURE_SPEECH_REGION`
   - `FOUNDRY_PROJECT_CONN_STR`
   - `FOUNDRY_AGENT_ID`
   - `AZURE_TENANT_ID`
   - `AZURE_CLIENT_ID`
   - `AZURE_CLIENT_SECRET`
   - `SPEECH_AGENT_WS_URL=wss://<your-space>.hf.space/ws/voice`
3. The provided `Dockerfile` + `docker/start.sh` + `docker/nginx.conf.template` will:
   - run FastAPI on `:8000`
   - run Streamlit on `:8501`
   - expose everything via nginx on `:$PORT` (HF default 7860)