# Speech-X – Speech-to-Video Pipeline

Two interactive modes in one repo, sharing the same conda environment (`avatar`), model weights, and LiveKit server.

| Page | Mode | What it does |
|---|---|---|
| `/` | Avatar | Text → Kokoro TTS → MuseTalk lip-sync → LiveKit video+audio |
| `/voice` | Voice Agent | Mic → faster-whisper ASR → Llama LLM → Kokoro TTS → LiveKit audio |
## Table of Contents
- Architecture Overview
- Project Structure
- Conda Environment & Installation
- Running the Application
- Configuration Reference
- API Endpoints (Avatar page)
- Frontend
- Avatars
- Model Weights
- Troubleshooting
- Performance Notes
## Architecture Overview
### Avatar Page (`/`) – MuseTalk lip-sync

```
Browser (React)
  │ POST /connect      → loads MuseTalk + Kokoro, joins LiveKit as publisher
  │ POST /speak {text} → pushes text into the streaming pipeline
  │ POST /get-token    → returns a LiveKit JWT for the frontend viewer
  ▼
FastAPI server :8767 (backend/api/server.py)
  │
  ├── KokoroTTS (backend/tts/kokoro_tts.py)
  │     kokoro-v1.0.onnx → 24 kHz PCM audio, streamed in 320 ms chunks
  │     Text split at sentence boundaries for lower first-chunk latency
  │
  ├── MuseTalkWorker (backend/musetalk/worker.py)
  │     Two-phase GPU inference:
  │       1. extract_features(audio)         → Whisper encoder produces mel embeddings
  │       2. generate_batch(feats, start, n) → UNet generates n lip-synced frames
  │     ThreadPoolExecutor(max_workers=1) keeps GPU work off the async event loop
  │
  ├── AVPublisher (backend/publisher/livekit_publisher.py)
  │     Streams video (actual avatar dimensions @ 25 fps) + audio (24 kHz) to LiveKit
  │     IdleFrameGenerator loops avatar frames while pipeline is idle
  │
  └── StreamingPipeline (backend/api/pipeline.py)
        Coordinates TTS → MuseTalk → publisher
        SimpleAVSync aligns PTS across chunks
```
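The sentence-boundary split that `KokoroTTS` applies before synthesis can be sketched like this (a minimal stdlib-only sketch; the actual splitter in `backend/tts/kokoro_tts.py` may differ):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split text at sentence-ending punctuation so TTS can start on the
    first sentence while later ones are still queued."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Synthesising the first short sentence alone is what keeps first-chunk latency low; the remaining sentences stream behind it.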
### Voice Agent Page (`/voice`) – ASR → LLM → TTS

```
Browser (React)
  │ POST http://localhost:3000/get-token → LiveKit JWT from built-in token server
  │ Publishes local mic audio track to LiveKit room
  ▼
LiveKit Agent worker (backend/agent.py)
  │  livekit-agents framework, pre-warms all models at startup
  │
  ├── ASR (backend/agent/asr.py)
  │     faster-whisper (model size via ASR_MODEL_SIZE, default: base)
  │     Buffers ~1.5 s of 16 kHz mono audio before each transcription
  │
  ├── LLM (backend/agent/llm.py)
  │     HTTP client → llama-server :8080
  │     Llama-3.2-3B-Instruct-Q4_K_M.gguf, keeps last 6 turns of history
  │
  └── TTS (backend/agent/tts.py)
        kokoro-onnx, default voice: af_sarah
        Publishes 48 kHz mono audio back to the LiveKit room

Built-in aiohttp token server on :3000
```
## Project Structure
```
speech_to_video/
├── environment.yml        # Conda env export (no build strings, cross-platform)
├── README.md              # Quick-start guide
├── PROJECT.md             # This file – detailed reference
├── setup.md               # Step-by-step first-time install
├── ISSUES_AND_PLAN.md     # Known issues and roadmap
├── pyproject.toml         # Python project config
│
├── docs/
│   ├── avatar_gen_README.md
│   └── avatar_gen_phase_2.md
│
├── backend/
│   ├── config.py          # Avatar page configuration (all tunable params)
│   ├── requirements.txt   # Pip dependencies (installed inside conda env)
│   │
│   ├── api/
│   │   ├── server.py      # FastAPI app – lifespan, endpoints, warmup logic
│   │   └── pipeline.py    # StreamingPipeline + SpeechToVideoPipeline orchestrators
│   │
│   ├── agent.py           # LiveKit Agent worker entry point (Voice page)
│   ├── agent/
│   │   ├── config.py      # Voice agent config (voices, model paths, SYSTEM_PROMPT)
│   │   ├── asr.py         # faster-whisper ASR wrapper
│   │   ├── llm.py         # llama-server HTTP client
│   │   └── tts.py         # Kokoro TTS for Voice agent
│   │
│   ├── tts/
│   │   └── kokoro_tts.py  # Kokoro TTS for Avatar page
│   │                      #   Patches int32→float32 speed-tensor bug (kokoro-onnx 0.5.x)
│   │                      #   Splits text at sentence boundaries for low first-chunk latency
│   │
│   ├── musetalk/
│   │   ├── worker.py      # MuseTalkWorker – async GPU wrapper (ThreadPoolExecutor)
│   │   ├── processor.py   # Core MuseTalk inference (VAE encode/decode, UNet forward)
│   │   ├── face.py        # Face detection / cropping
│   │   ├── data/          # Dataset helpers, audio processing
│   │   ├── models/        # UNet, VAE, SyncNet model definitions
│   │   └── utils/         # Audio processor, blending, preprocessing, DWPose, face parsing
│   │
│   ├── whisper/
│   │   └── audio2feature.py  # Whisper encoder for MuseTalk audio features
│   │
│   ├── sync/
│   │   └── av_sync.py     # AVSyncGate + SimpleAVSync – PTS-based AV alignment
│   │
│   ├── publisher/
│   │   └── livekit_publisher.py  # AVPublisher + IdleFrameGenerator
│   │
│   ├── models/            # All model weights (not in git)
│   │   ├── kokoro/
│   │   │   ├── kokoro-v1.0.onnx
│   │   │   └── voices-v1.0.bin
│   │   ├── musetalkV15/
│   │   │   ├── musetalk.json
│   │   │   └── unet.pth
│   │   ├── sd-vae/
│   │   │   └── config.json
│   │   ├── whisper/
│   │   │   ├── config.json
│   │   │   └── preprocessor_config.json
│   │   ├── dwpose/
│   │   │   └── dw-ll_ucoco_384.pth
│   │   ├── face-parse-bisent/
│   │   │   ├── 79999_iter.pth
│   │   │   └── resnet18-5c106cde.pth
│   │   ├── syncnet/
│   │   │   └── latentsync_syncnet.pt
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   │
│   └── avatars/           # Pre-computed avatar assets (not in git)
│       ├── christine/     # avator_info.json, latents.pt, full_imgs/, mask/, vid_output/
│       ├── harry_1/
│       └── sophy/         # DEFAULT_AVATAR – used unless overridden by SPEECHX_AVATAR
│
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── App.tsx        # Avatar page (/)
        ├── main.tsx       # React Router setup
        ├── index.css      # Avatar page styles
        ├── voice.css      # Voice page styles
        └── pages/
            └── VoicePage.tsx  # Voice Agent page (/voice)
```
## Conda Environment & Installation

### Restore environment

```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```

### Frontend

```bash
cd frontend
npm install
```

### First-time model setup

See setup.md for the step-by-step guide to download and place all model weights.
## Running the Application

All four processes must run concurrently. Use four terminals, each with `conda activate avatar`.
### Terminal 1 – LiveKit server (shared)

```bash
docker run --rm -d \
  --name livekit-server \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server:latest \
  --dev --bind 0.0.0.0 --node-ip 127.0.0.1
# Stop: docker stop livekit-server
```
### Terminal 2 – llama-server (shared by both pages)

```bash
llama-server \
  -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -c 2048 -ngl 32 --port 8080
```

`llama-server` must be in PATH. Download from llama.cpp releases.
### Terminal 3 – Backend (choose one or both)

Avatar page (lip-sync + video):

```bash
conda activate avatar
cd backend
python api/server.py
# → http://localhost:8767
```

Voice Agent page (ASR → LLM → TTS, audio only):

```bash
conda activate avatar
cd backend
python agent.py dev
# → LiveKit agent worker + token server on http://localhost:3000
```

Both can run simultaneously in separate terminals.
### Terminal 4 – Frontend (Vite dev server)

```bash
cd frontend
npm run dev
# → http://localhost:5173
```

- http://localhost:5173/ → Avatar lip-sync page
- http://localhost:5173/voice → Voice Agent page
## Configuration Reference

### backend/config.py – Avatar page

Values with an env var listed below can be overridden via environment variables; the rest are constants edited in place.
| Variable | Default | Env var | Description |
|---|---|---|---|
| `DEVICE` | `"cuda"` | `SPEECHX_DEVICE` | Inference device |
| `VIDEO_FPS` | `25` | – | Output frame rate |
| `VIDEO_WIDTH` | `720` | – | Base width (actual read from avatar frames at /connect) |
| `VIDEO_HEIGHT` | `1280` | – | Base height |
| `TTS_SAMPLE_RATE` | `24000` | – | Kokoro output sample rate |
| `LIVEKIT_AUDIO_SAMPLE_RATE` | `24000` | – | LiveKit audio rate (matches TTS → no resampling) |
| `CHUNK_DURATION` | `0.32` | – | 320 ms per TTS/video chunk (8 frames @ 25 fps) |
| `FRAMES_PER_CHUNK` | `8` | – | Computed: CHUNK_DURATION × VIDEO_FPS |
| `PRE_ROLL_FRAMES` | `1` | – | Minimal pre-roll for fast start |
| `MUSETALK_UNET_FP16` | `True` | – | fp16 inference for lower VRAM |
| `KOKORO_VOICE` | `"af_heart"` | `SPEECHX_VOICE` | TTS voice |
| `KOKORO_SPEED` | `1.0` | – | Speech speed multiplier |
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` | LiveKit server URL |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` | LiveKit API key |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` | LiveKit API secret |
| `LIVEKIT_ROOM_NAME` | `"speech-to-video-room"` | `LIVEKIT_ROOM_NAME` | Room name |
| `DEFAULT_AVATAR` | `"sophy"` | `SPEECHX_AVATAR` | Active avatar |
| `HOST` | `"0.0.0.0"` | `SPEECH_TO_VIDEO_HOST` | Bind address |
| `PORT` | `8767` | `SPEECH_TO_VIDEO_PORT` | Bind port |
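The derived constants above are plain arithmetic; a sketch of the assumed shape of this part of `backend/config.py` (not its verbatim contents):

```python
import os

# Env-overridable value, per the table above; the rest are plain constants.
DEVICE = os.environ.get("SPEECHX_DEVICE", "cuda")

VIDEO_FPS = 25
CHUNK_DURATION = 0.32                                 # 320 ms per chunk
FRAMES_PER_CHUNK = round(CHUNK_DURATION * VIDEO_FPS)  # 0.32 × 25 = 8

TTS_SAMPLE_RATE = 24_000
SAMPLES_PER_CHUNK = round(CHUNK_DURATION * TTS_SAMPLE_RATE)  # audio samples per chunk
```

Keeping `FRAMES_PER_CHUNK` derived from `CHUNK_DURATION` rather than hard-coded is what lets the VRAM workaround below (halving `CHUNK_DURATION`) stay consistent.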
Optional `torch.compile` for the UNet:

```bash
MUSETALK_TORCH_COMPILE=1 python api/server.py
# UNet JIT-compiles in the background after /connect; later requests benefit from Triton kernels
```
### backend/agent/config.py – Voice Agent page

| Variable | Default | Env var |
|---|---|---|
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` |
| `LLAMA_SERVER_URL` | `"http://localhost:8080/v1"` | `LLAMA_SERVER_URL` |
| `DEFAULT_VOICE` | `"af_sarah"` | `DEFAULT_VOICE` |
| `ASR_MODEL_SIZE` | `"base"` | `ASR_MODEL_SIZE` |
## API Endpoints (Avatar page)

Base URL: `http://localhost:8767`
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness probe – returns `models_loaded`, `pipeline_active` |
| GET | `/status` | Detailed status with VRAM usage |
| POST | `/connect` | Load models, join LiveKit room, run warmup passes |
| POST | `/disconnect` | Gracefully stop pipeline and disconnect from room |
| POST | `/speak` | Push text into the pipeline; body: `{text, voice?, speed?}` |
| POST | `/get-token` | Issue LiveKit JWT; body: `{room_name, identity}` |
| GET | `/livekit-token` | GET alias for `/get-token` |
### /connect startup sequence

1. `load_musetalk_models(avatar_name, device)` – loads UNet, VAE, Whisper encoder, avatar assets
2. `KokoroTTS()` – initializes the ONNX session
3. Create LiveKit `Room`, generate a backend-agent JWT with publish permissions
4. Read actual frame dimensions from the pre-computed avatar's `frame_list[0]`
5. Instantiate `AVPublisher`, `MuseTalkWorker`, `StreamingPipeline`
6. `room.connect()` → `publisher.start()` → `pipeline.start()` (idle loop begins)
7. Warmup: Whisper + TTS synchronously; UNet eager pass (or background torch.compile JIT)
### /speak example

```bash
curl -X POST http://localhost:8767/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you today?"}'
# → {"status": "processing", "latency_ms": 12.4}
```
## Frontend
### Avatar page – App.tsx

- Calls `POST /connect` on button click – loads models, joins LiveKit room
- Fetches a LiveKit token from `POST /get-token`, connects as a viewer
- Receives remote video + audio tracks published by the backend agent
- Sends typed text via `POST /speak`
- Chat log shows `user`/`assistant`/`system` messages with timestamps
- Static fallback avatar image `/Sophy.png` shown before the first video frame arrives
### Voice Agent page – VoicePage.tsx

- Fetches a LiveKit token from the agent's built-in server at `http://localhost:3000/get-token`
- Publishes the local microphone (echo-cancelled, noise-suppressed) as an audio track
- Agent worker subscribes, runs ASR → LLM → TTS, publishes reply audio back into the room
- `DataReceived` events carry `{type: "user"|"assistant", text}` for the transcript display
- Latency shown as the time between the user-transcript event and the assistant-text event
## Avatars

Pre-computed assets in `backend/avatars/<name>/`:
| File/Dir | Description |
|---|---|
| `avator_info.json` | Avatar bbox, landmark, and crop metadata |
| `latents.pt` | Pre-encoded VAE latent tensors for all idle frames |
| `full_imgs/` | Full-resolution source frames |
| `mask/` | Per-frame blending masks for composite output |
| `vid_output/` | Intermediate video output (optional) |
Available avatars: `sophy` (default), `harry_1`, `christine`.

Override at runtime:

```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```

To generate a new avatar from a source video, see docs/avatar_gen_README.md.
## Model Weights

All weights live in `backend/models/` (not committed to git).
| Path | Used by | Notes |
|---|---|---|
| `kokoro/kokoro-v1.0.onnx` | Avatar TTS, Voice TTS | ONNX Runtime inference |
| `kokoro/voices-v1.0.bin` | Avatar TTS, Voice TTS | All voice style embeddings |
| `musetalkV15/unet.pth` | MuseTalk UNet | fp16 inference |
| `musetalkV15/musetalk.json` | MuseTalk | Architecture config |
| `sd-vae/config.json` | MuseTalk VAE | Stable Diffusion VAE |
| `whisper/` | MuseTalk audio encoder | Whisper encoder weights + config |
| `dwpose/dw-ll_ucoco_384.pth` | DWPose | Used during avatar generation |
| `face-parse-bisent/` | Face parsing | BiSeNet; used during avatar generation |
| `syncnet/latentsync_syncnet.pt` | SyncNet | Training/evaluation only, not live inference |
| `Llama-3.2-3B-Instruct-Q4_K_M.gguf` | llama-server | Served by external llama-server process |
## Troubleshooting
### ModuleNotFoundError on startup

```bash
conda activate avatar   # must be active
cd backend
python api/server.py    # run from backend/, not repo root
```
### LiveKit connection errors

```bash
docker ps | grep livekit     # is it running?
docker logs livekit-server   # check for port conflicts or errors
# Keys must match in both config files:
#   backend/config.py       → LIVEKIT_API_KEY / LIVEKIT_API_SECRET
#   backend/agent/config.py → same values
```
### Out of VRAM (Avatar page)

```python
# backend/config.py
CHUNK_DURATION = 0.16        # halve chunk → 4 frames instead of 8
# or
MUSETALK_UNET_FP16 = False   # switch to fp32 if fp16 causes NaNs
```
### llama-server not found

Download from llama.cpp releases, extract the binary, and add its folder to PATH.
### Port already in use

```bash
lsof -i :8767   # Avatar backend
lsof -i :3000   # Voice agent token server
lsof -i :8080   # llama-server
lsof -i :7880   # LiveKit
```
### Kokoro ONNX int32 speed-tensor error

Already patched in `backend/tts/kokoro_tts.py` via a `_patched_create_audio` monkey-patch that forces `speed` to float32. No action needed.
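The patch follows the usual wrap-and-coerce pattern. A stdlib-only sketch of that pattern (the real patch targets kokoro-onnx 0.5.x and coerces to numpy float32; the function names and signature below are illustrative, not the library's actual API):

```python
def patch_speed_dtype(original_create_audio):
    """Wrap an audio-creation function so `speed` always arrives as a float."""
    def _patched_create_audio(text, voice, speed=1.0):
        # The bug: an int speed becomes an int32 tensor, which the ONNX
        # graph rejects. Coercing to float here fixes the dtype at the source.
        return original_create_audio(text, voice, float(speed))
    return _patched_create_audio

# Stand-in for the buggy library call, which chokes on integer speeds:
def fake_create_audio(text, voice, speed):
    assert isinstance(speed, float), "speed must be float32-compatible"
    return b"pcm-bytes"

create_audio = patch_speed_dtype(fake_create_audio)
```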
### Video and audio out of sync

`SimpleAVSync` uses explicit PTS. Verify the constants are consistent:

```python
# FRAMES_PER_CHUNK must equal CHUNK_DURATION × VIDEO_FPS exactly
# Default: 0.32 × 25 = 8 ✓
```
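PTS-based alignment means video and audio timestamps, computed from independent clocks, must land on the same values at chunk boundaries. A sketch of that check (illustrative; `SimpleAVSync`'s actual internals are in `backend/sync/av_sync.py`):

```python
VIDEO_FPS = 25
AUDIO_RATE = 24_000

def video_pts(frame_index: int) -> float:
    # Presentation timestamp of a video frame, in seconds.
    return frame_index / VIDEO_FPS

def audio_pts(sample_index: int) -> float:
    # Presentation timestamp of an audio sample, in seconds.
    return sample_index / AUDIO_RATE

# One 320 ms chunk = 8 frames = 7 680 samples: both PTS clocks agree,
# so drift cannot accumulate across chunks.
```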
## Performance Notes

### GPU memory (RTX 4060 8 GB)
| Component | Estimated VRAM |
|---|---|
| MuseTalk UNet (fp16) | ~3 GB |
| Whisper encoder | ~1.5 GB |
| SD-VAE | ~1 GB |
| Avatar latents + frame buffers | ~0.5 GB |
| Total | ~6 GB |
### Latency targets
| Stage | Target |
|---|---|
| First TTS chunk ready | < 100 ms |
| First video chunk visible | < 200 ms after /speak |
| Per-chunk generation (8 frames @ 25 fps) | ~80–120 ms |
| End-to-end text → visible lip-sync | ~200–350 ms |
### Optimisations already applied
| Optimisation | Location | Effect |
|---|---|---|
| `torch.set_float32_matmul_precision('high')` | `server.py` | ~5 % speedup via TF32 on Ampere+ |
| `MUSETALK_UNET_FP16 = True` | `config.py` | Halves UNet memory bandwidth |
| `LIVEKIT_AUDIO_SAMPLE_RATE = 24000` | `config.py` | Eliminates the TTS→LiveKit resampling step |
| Sentence-boundary TTS split | `kokoro_tts.py` | Lower latency on the first synthesised chunk |
| Synchronous Whisper + TTS warmup at `/connect` | `server.py` | Primes ONNX thread pools before the first request |
| UNet eager warmup pass at `/connect` | `server.py` | Primes CUDA kernels |
| Optional `torch.compile` UNet JIT | `server.py` | Further throughput gain (opt-in via env var) |
| `IdleFrameGenerator` | `livekit_publisher.py` | Keeps the video track alive between speech turns |
| `torch._dynamo.config.suppress_errors = True` | `server.py` | Graceful fallback to eager if the Triton JIT fails |
## Available Voices
| Voice ID | Style |
|---|---|
| `af_heart` | Female, emotional/expressive – Avatar page default |
| `af_sarah` | Female, clear and professional – Voice Agent default |
| `af_bella` | Female, warm and friendly |
| `am_michael` | Male, professional |
| `am_fen` | Male, deep |
| `bf_emma` | Female, British accent |
| `bm_george` | Male, British accent |
## Credits
- MuseTalk v1.5 – TMElyralab/MuseTalk
- Kokoro TTS – remsky/Kokoro-ONNX
- LiveKit – livekit/livekit
- faster-whisper – SYSTRAN/faster-whisper
- llama.cpp – ggerganov/llama.cpp
- Llama 3.2 3B – Meta AI