# Speech-X

Two modes in one repo: both share the same conda environment, models, and LiveKit server.
| Page | Mode | What it does |
|---|---|---|
| `/` | Avatar | Text/voice → Kokoro TTS → MuseTalk lip-sync → LiveKit video |
| `/voice` | Voice Agent | Voice → faster-whisper → Llama → Kokoro TTS → LiveKit audio |
## Architecture

```
Browser (React + LiveKit SDK)
        │
        ├── /       → FastAPI server (port 8767) → Kokoro TTS + MuseTalk + LiveKit publisher
        └── /voice  → LiveKit Agent worker → ASR → LLM → TTS + token server (port 3000)

Shared infrastructure (always running):
┌───────────────────────────────────────────────────────────────────────┐
│   LiveKit server :7880   │   llama-server :8080   │   Vite dev :5173  │
└───────────────────────────────────────────────────────────────────────┘
```
## Prerequisites

- NVIDIA GPU (RTX 4060 8 GB or better)
- Conda
- Docker
- Node.js 18+
- llama.cpp (`llama-server` on `PATH`)
## Environment Setup

See `setup/setup.md` for the full step-by-step guide, or run the automated script:

```bash
bash setup/setup.sh     # Linux / macOS
.\setup\setup.ps1       # Windows (PowerShell)
```
### Restore conda environment

```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```
### Frontend dependencies

```bash
cd frontend
npm install
```
## Running

All four processes run concurrently. Open four terminals.
### Terminal 1 – LiveKit server (Docker, shared by both pages)

```bash
docker run --rm -d \
  --name livekit-server \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server:latest \
  --dev --bind 0.0.0.0 --node-ip 127.0.0.1
```

Stop it later:

```bash
docker stop livekit-server
```
### Terminal 2 – llama-server (shared by both pages)

```bash
llama-server \
  -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -c 2048 -ngl 32 --port 8080
```
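Once running, llama-server speaks the OpenAI-compatible chat API under `/v1` (which is why `LLAMA_SERVER_URL` below ends in `/v1`). A minimal stdlib-only smoke test, assuming the server is up on the port started above; `build_chat_payload` and `chat` are illustrative helper names, not part of this repo:

```python
import json
import urllib.request

def build_chat_payload(prompt: str, max_tokens: int = 64) -> dict:
    """OpenAI-style chat request body, as accepted by llama-server's /v1 API."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080/v1") -> str:
    """POST a single user message and return the assistant's reply text."""
    data = json.dumps(build_chat_payload(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Call `chat("Say hello.")` from a Python shell to confirm the model is loaded before starting the backends.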
### Terminal 3 – Backend (choose based on which page you need)

For `/` (Avatar page, MuseTalk lip-sync):

```bash
conda activate avatar
cd backend
python api/server.py
# Runs on http://localhost:8767
```

For `/voice` (Voice Agent page, ASR → LLM → TTS):

```bash
conda activate avatar
cd backend
python agent.py dev
# Token server on http://localhost:3000
# LiveKit worker connects to ws://localhost:7880
```
You can run both at the same time in separate terminals if you want both pages live.
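With that many processes, it is easy to lose track of which ones are actually up. A quick sketch that probes each port listed in this README with a plain TCP connect (it only checks that something is listening, not that the service is healthy):

```python
import socket

def port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports from this README; adjust if you changed any of them.
SERVICES = {
    "LiveKit server": 7880,
    "llama-server": 8080,
    "avatar backend": 8767,
    "voice token server": 3000,
}

for name, port in SERVICES.items():
    status = "up" if port_open(port) else "DOWN"
    print(f"{name:20s} :{port}  {status}")
```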
### Terminal 4 – Frontend

```bash
cd frontend
npm run dev
# Open http://localhost:5173
```

- `http://localhost:5173/` → Avatar lip-sync page
- `http://localhost:5173/voice` → Voice agent page
## Avatars

Three avatars ship pre-computed: `sophy` (default), `harry_1`, `christine`.

To create a new one, run `setup/avatar_creation.py` once from the repo root with the `avatar` env active:

```bash
conda activate avatar

# From a portrait image (duplicated to 50 frames)
python setup/avatar_creation.py --image frontend/public/Sophy.png --name sophy

# From a talking-head video
python setup/avatar_creation.py --video /path/to/talking_head.mp4 --name harry_1

# Batch – edit setup/avatars_config.yml first
python setup/avatar_creation.py --config setup/avatars_config.yml
```

Outputs are written to `backend/avatars/<name>/`: `latents.pt`, `coords.pkl`, `mask_coords.pkl`, `full_imgs/`, `mask/`, `avator_info.json`.
Switch avatar at runtime:

```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```
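The selection logic behind that variable can be sketched as below. This is an illustration only (the real lookup lives in the backend config, and presumably scans `backend/avatars/` rather than using a hard-coded set); `resolve_avatar` is a hypothetical helper name:

```python
import os

# The three pre-computed avatars shipped with this repo.
KNOWN_AVATARS = {"sophy", "harry_1", "christine"}

def resolve_avatar(env=os.environ, default: str = "sophy") -> str:
    """Pick the avatar named by SPEECHX_AVATAR, falling back to the default."""
    name = env.get("SPEECHX_AVATAR", default)
    if name not in KNOWN_AVATARS:
        raise ValueError(
            f"unknown avatar {name!r}; create it with setup/avatar_creation.py first"
        )
    return name
```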
| Flag | Default | Description |
|---|---|---|
| `--name` | required | Avatar folder name |
| `--frames` | `50` | Frame count for `--image` mode |
| `--bbox-shift` | `5` | Vertical bbox nudge (tune if crop is off) |
| `--device` | `cuda` | `cuda` or `cpu` |
| `--overwrite` | off | Skip the re-create prompt |
## Environment Variables

Copy and adjust as needed (both backends read from the shell environment or a `.env` file in `backend/`):

```bash
# LiveKit
LIVEKIT_URL=ws://localhost:7880
LIVEKIT_API_KEY=devkey
LIVEKIT_API_SECRET=secret

# llama-server
LLAMA_SERVER_URL=http://localhost:8080/v1

# TTS voice (voice agent page)
DEFAULT_VOICE=af_sarah      # see backend/agent/config.py for all options

# ASR model size (voice agent page)
ASR_MODEL_SIZE=tiny         # tiny | base | small

# Avatar page server
SPEECH_TO_VIDEO_HOST=0.0.0.0
SPEECH_TO_VIDEO_PORT=8767
```
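The "shell env or `.env` file" precedence can be sketched as follows: shell variables win over the file, and the file wins over built-in defaults. This is a minimal stand-in, not the project's actual loader (which may use a library such as python-dotenv); `load_env_file` and `setting` are illustrative names:

```python
import os

def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if "=" in line:
                    key, _, val = line.partition("=")
                    values[key.strip()] = val.strip()
    except FileNotFoundError:
        pass  # no .env file is fine; shell env and defaults still apply
    return values

def setting(key: str, default: str, env_file: dict) -> str:
    """Shell environment beats the .env file, which beats the default."""
    return os.environ.get(key, env_file.get(key, default))

_file_vals = load_env_file("backend/.env")
LIVEKIT_URL = setting("LIVEKIT_URL", "ws://localhost:7880", _file_vals)
ASR_MODEL_SIZE = setting("ASR_MODEL_SIZE", "tiny", _file_vals)
```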
## Project Structure

```
speech_to_video/
├── environment.yml           # Conda env export (cross-platform, no build strings)
├── README.md
├── setup/
│   ├── setup.md              # Step-by-step install guide
│   ├── setup.sh              # Automated setup script (Linux/macOS)
│   └── setup.ps1             # Automated setup script (Windows/PowerShell)
├── docs/
│   ├── avatar_gen_README.md  # Voice agent architecture notes
│   └── avatar_gen_phase_2.md # Phase 2 MuseTalk integration plan
├── backend/
│   ├── config.py             # Avatar page configuration
│   ├── requirements.txt      # Pip dependencies (inside conda env)
│   ├── api/
│   │   ├── server.py         # FastAPI server for Avatar page (:8767)
│   │   └── pipeline.py       # MuseTalk pipeline orchestrator
│   ├── agent.py              # LiveKit worker entry point for Voice page
│   ├── agent/
│   │   ├── config.py         # Voice agent config (voices, model paths)
│   │   ├── asr.py            # faster-whisper ASR
│   │   ├── llm.py            # llama-server HTTP client
│   │   └── tts.py            # kokoro-onnx TTS; patches int32→float32 speed bug (0.5.x)
│   ├── tts/
│   │   └── kokoro_tts.py     # Kokoro TTS for Avatar page
│   ├── musetalk/             # MuseTalk inference
│   ├── models/               # All model weights
│   │   ├── kokoro/
│   │   ├── musetalkV15/
│   │   ├── sd-vae/
│   │   ├── whisper/
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   └── avatars/              # Pre-computed avatar assets
│       ├── christine/
│       ├── harry_1/
│       └── sophy/
└── frontend/
    └── src/
        ├── App.tsx           # Avatar page (/)
        ├── pages/
        │   └── VoicePage.tsx # Voice agent page (/voice)
        └── index.css
```
## Available Voices (Voice Agent page)

| Voice ID | Description |
|---|---|
| `af_sarah` | Female, clear and professional |
| `af_bella` | Female, warm and friendly |
| `af_heart` | Female, emotional and expressive |
| `am_michael` | Male, professional and authoritative |
| `am_fen` | Male, deep and resonant |
| `bf_emma` | Female, British accent |
| `bm_george` | Male, British accent |
## Troubleshooting

### Kokoro ONNX int32 speed-tensor error

Already patched in both `backend/tts/kokoro_tts.py` and `backend/agent/tts.py` via a `_patched_create_audio` monkey-patch. Requires `kokoro-onnx>=0.5.0`.
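For context, the general shape of such a monkey-patch: keep a reference to the original method, then replace it with a wrapper that coerces the speed value to float before delegating. The sketch below uses a stand-in class, since kokoro-onnx internals are version-specific; the real patch targets the library class and lives in the two files named above:

```python
class FakeKokoro:
    """Stand-in for the TTS class; the real target is inside kokoro-onnx."""

    def _create_audio(self, phonemes, voice, speed):
        # The bug being worked around: an int speed value reaches the
        # ONNX session, which expects float32, and inference fails.
        if not isinstance(speed, float):
            raise TypeError("speed must be a float")
        return f"audio({phonemes}, {voice}, {speed})"

# Keep the original, then swap in a wrapper that fixes the argument type.
_original_create_audio = FakeKokoro._create_audio

def _patched_create_audio(self, phonemes, voice, speed):
    return _original_create_audio(self, phonemes, voice, float(speed))

FakeKokoro._create_audio = _patched_create_audio
```

After the patch, callers that pass `speed=1` (an int) work unchanged, because the wrapper converts it before the original method runs.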
### ModuleNotFoundError on startup

Activate the conda env first: `conda activate avatar`

### LiveKit connection errors

Verify the API key matches in `backend/config.py` and `backend/agent/config.py`:

```python
LIVEKIT_API_KEY = "devkey"
LIVEKIT_API_SECRET = "secret"
```
### llama-server not found

Download from the llama.cpp releases page and add it to `PATH`.
Windows: download `llama-...-win-cuda-cu12.x.x-x64.zip`, extract, and add the folder to `PATH`.

### Out of VRAM (Avatar page)

Reduce the batch size in `backend/config.py`:

```python
FRAMES_PER_CHUNK = 2   # default 8
```
### Port already in use

```bash
# Find and kill
lsof -i :8767   # avatar page backend
lsof -i :3000   # voice agent token server
lsof -i :8080   # llama-server
```