# Speech-X — Speech-to-Video Pipeline
Two interactive modes in one repo, sharing the same conda environment (`avatar`), model weights, and LiveKit server.
| Page | Mode | What it does |
|------|------|--------------|
| `/` | **Avatar** | Text → Kokoro TTS → MuseTalk lip-sync → LiveKit video+audio |
| `/voice` | **Voice Agent** | Mic → faster-whisper ASR → Llama LLM → Kokoro TTS → LiveKit audio |
---
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Project Structure](#project-structure)
3. [Conda Environment & Installation](#conda-environment--installation)
4. [Running the Application](#running-the-application)
5. [Configuration Reference](#configuration-reference)
6. [API Endpoints (Avatar page)](#api-endpoints-avatar-page)
7. [Frontend](#frontend)
8. [Avatars](#avatars)
9. [Model Weights](#model-weights)
10. [Troubleshooting](#troubleshooting)
11. [Performance Notes](#performance-notes)
---
## Architecture Overview
### Avatar Page (`/`) — MuseTalk lip-sync
```
Browser (React)
  │  POST /connect      → loads MuseTalk + Kokoro, joins LiveKit as publisher
  │  POST /speak {text} → pushes text into the streaming pipeline
  │  POST /get-token    → returns a LiveKit JWT for the frontend viewer
  ▼
FastAPI server :8767 (backend/api/server.py)
  │
  ├── KokoroTTS (backend/tts/kokoro_tts.py)
  │     kokoro-v1.0.onnx → 24 kHz PCM audio, streamed in 320 ms chunks
  │     Text split at sentence boundaries for lower first-chunk latency
  │
  ├── MuseTalkWorker (backend/musetalk/worker.py)
  │     Two-phase GPU inference:
  │       1. extract_features(audio)        → Whisper encoder produces mel embeddings
  │       2. generate_batch(feats, start, n) → UNet generates n lip-synced frames
  │     ThreadPoolExecutor(max_workers=1) keeps GPU work off the async event loop
  │
  ├── AVPublisher (backend/publisher/livekit_publisher.py)
  │     Streams video (actual avatar dimensions @ 25 fps) + audio (24 kHz) to LiveKit
  │     IdleFrameGenerator loops avatar frames while pipeline is idle
  │
  └── StreamingPipeline (backend/api/pipeline.py)
        Coordinates TTS → MuseTalk → publisher
        SimpleAVSync aligns PTS across chunks
```
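The `ThreadPoolExecutor(max_workers=1)` pattern mentioned above can be sketched in a few lines. This is a toy illustration of the technique, not the real `MuseTalkWorker` API; the function names and return values are hypothetical:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Single-worker executor: GPU calls are serialized with each other and
# never block the asyncio event loop that feeds audio to the publisher.
_gpu_executor = ThreadPoolExecutor(max_workers=1)

def _generate_frames_blocking(n: int) -> list[int]:
    # Stand-in for the blocking GPU call (the UNet forward in the real worker).
    return list(range(n))

async def generate_batch(n: int) -> list[int]:
    loop = asyncio.get_running_loop()
    # The blocking call runs on the worker thread; the coroutine suspends
    # here, so other tasks keep making progress in the meantime.
    return await loop.run_in_executor(_gpu_executor, _generate_frames_blocking, n)

frames = asyncio.run(generate_batch(8))
print(frames)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because the executor has exactly one worker, concurrent `generate_batch` calls queue up rather than contending for the GPU.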
### Voice Agent Page (`/voice`) — ASR → LLM → TTS
```
Browser (React)
  │  POST http://localhost:3000/get-token → LiveKit JWT from built-in token server
  │  Publishes local mic audio track to LiveKit room
  ▼
LiveKit Agent worker (backend/agent.py)
  │  livekit-agents framework, pre-warms all models at startup
  │
  ├── ASR (backend/agent/asr.py)
  │     faster-whisper (model size via ASR_MODEL_SIZE, default: base)
  │     Buffers ~1.5 s of 16 kHz mono audio before each transcription
  │
  ├── LLM (backend/agent/llm.py)
  │     HTTP client → llama-server :8080
  │     Llama-3.2-3B-Instruct-Q4_K_M.gguf, keeps last 6 turns of history
  │
  └── TTS (backend/agent/tts.py)
        kokoro-onnx, default voice: af_sarah
        Publishes 48 kHz mono audio back to the LiveKit room
        Built-in aiohttp token server on :3000
```
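The ASR buffering step works out to 1.5 × 16 000 = 24 000 samples per transcription window. A minimal sketch of such a buffer (the class and method names are illustrative, not the actual `asr.py` code):

```python
# 16 kHz mono, ~1.5 s per transcription window -> 24 000 samples per flush.
SAMPLE_RATE = 16_000
WINDOW_SECONDS = 1.5
WINDOW_SAMPLES = int(SAMPLE_RATE * WINDOW_SECONDS)  # 24000

class AudioBuffer:
    """Accumulates mono samples and yields fixed windows for the ASR model."""

    def __init__(self) -> None:
        self._samples: list[float] = []

    def push(self, chunk: list[float]) -> list[list[float]]:
        # Append the incoming chunk, then emit every complete window.
        self._samples.extend(chunk)
        windows = []
        while len(self._samples) >= WINDOW_SAMPLES:
            windows.append(self._samples[:WINDOW_SAMPLES])
            self._samples = self._samples[WINDOW_SAMPLES:]
        return windows

buf = AudioBuffer()
ready = buf.push([0.0] * 30_000)  # one full window emitted, 6 000 samples held back
print(len(ready), len(ready[0]))  # 1 24000
```

Leftover samples stay in the buffer and are prepended to the next mic chunk, so no audio is dropped at window boundaries.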
---
## Project Structure
```
speech_to_video/
├── environment.yml          # Conda env export (no build strings, cross-platform)
├── README.md                # Quick-start guide
├── PROJECT.md               # This file — detailed reference
├── setup.md                 # Step-by-step first-time install
├── ISSUES_AND_PLAN.md       # Known issues and roadmap
├── pyproject.toml           # Python project config
│
├── docs/
│   ├── avatar_gen_README.md
│   └── avatar_gen_phase_2.md
│
├── backend/
│   ├── config.py            # Avatar page configuration (all tunable params)
│   ├── requirements.txt     # Pip dependencies (installed inside conda env)
│   │
│   ├── api/
│   │   ├── server.py        # FastAPI app — lifespan, endpoints, warmup logic
│   │   └── pipeline.py      # StreamingPipeline + SpeechToVideoPipeline orchestrators
│   │
│   ├── agent.py             # LiveKit Agent worker entry point (Voice page)
│   ├── agent/
│   │   ├── config.py        # Voice agent config (voices, model paths, SYSTEM_PROMPT)
│   │   ├── asr.py           # faster-whisper ASR wrapper
│   │   ├── llm.py           # llama-server HTTP client
│   │   └── tts.py           # Kokoro TTS for Voice agent
│   │
│   ├── tts/
│   │   └── kokoro_tts.py    # Kokoro TTS for Avatar page
│   │                        #   Patches int32→float32 speed-tensor bug (kokoro-onnx 0.5.x)
│   │                        #   Splits text at sentence boundaries for low first-chunk latency
│   │
│   ├── musetalk/
│   │   ├── worker.py        # MuseTalkWorker — async GPU wrapper (ThreadPoolExecutor)
│   │   ├── processor.py     # Core MuseTalk inference (VAE encode/decode, UNet forward)
│   │   ├── face.py          # Face detection / cropping
│   │   ├── data/            # Dataset helpers, audio processing
│   │   ├── models/          # UNet, VAE, SyncNet model definitions
│   │   └── utils/           # Audio processor, blending, preprocessing, DWPose, face parsing
│   │
│   ├── whisper/
│   │   └── audio2feature.py # Whisper encoder for MuseTalk audio features
│   │
│   ├── sync/
│   │   └── av_sync.py       # AVSyncGate + SimpleAVSync — PTS-based AV alignment
│   │
│   ├── publisher/
│   │   └── livekit_publisher.py  # AVPublisher + IdleFrameGenerator
│   │
│   ├── models/              # All model weights (not in git)
│   │   ├── kokoro/
│   │   │   ├── kokoro-v1.0.onnx
│   │   │   └── voices-v1.0.bin
│   │   ├── musetalkV15/
│   │   │   ├── musetalk.json
│   │   │   └── unet.pth
│   │   ├── sd-vae/
│   │   │   └── config.json
│   │   ├── whisper/
│   │   │   ├── config.json
│   │   │   └── preprocessor_config.json
│   │   ├── dwpose/
│   │   │   └── dw-ll_ucoco_384.pth
│   │   ├── face-parse-bisent/
│   │   │   ├── 79999_iter.pth
│   │   │   └── resnet18-5c106cde.pth
│   │   ├── syncnet/
│   │   │   └── latentsync_syncnet.pt
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   │
│   └── avatars/             # Pre-computed avatar assets (not in git)
│       ├── christine/       # avator_info.json, latents.pt, full_imgs/, mask/, vid_output/
│       ├── harry_1/
│       └── sophy/           # DEFAULT_AVATAR — used unless overridden by SPEECHX_AVATAR
│
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── App.tsx          # Avatar page (/)
        ├── main.tsx         # React Router setup
        ├── index.css        # Avatar page styles
        ├── voice.css        # Voice page styles
        └── pages/
            └── VoicePage.tsx # Voice Agent page (/voice)
```
---
## Conda Environment & Installation
### Restore environment
```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```
### Frontend
```bash
cd frontend
npm install
```
### First-time model setup
See [setup.md](setup.md) for the step-by-step guide to download and place all model weights.
---
## Running the Application
All four processes must run concurrently. Use four terminals, all with `conda activate avatar`.
### Terminal 1 — LiveKit server (shared)
```bash
docker run --rm -d \
--name livekit-server \
-p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
livekit/livekit-server:latest \
--dev --bind 0.0.0.0 --node-ip 127.0.0.1
# Stop: docker stop livekit-server
```
### Terminal 2 — llama-server (shared by both pages)
```bash
llama-server \
-m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
-c 2048 -ngl 32 --port 8080
```
> `llama-server` must be in PATH. Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases).
### Terminal 3 — Backend (choose one or both)
**Avatar page** (lip-sync + video):
```bash
conda activate avatar
cd backend
python api/server.py
# → http://localhost:8767
```
**Voice Agent page** (ASR → LLM → TTS audio only):
```bash
conda activate avatar
cd backend
python agent.py dev
# → LiveKit agent worker + token server on http://localhost:3000
```
> Both can run simultaneously in separate terminals.
### Terminal 4 — Frontend (Vite dev server)
```bash
cd frontend
npm run dev
# → http://localhost:5173
```
- `http://localhost:5173/` — Avatar lip-sync page
- `http://localhost:5173/voice` — Voice Agent page
---
## Configuration Reference
### `backend/config.py` — Avatar page
All values are overridable via environment variables.
| Variable | Default | Env var | Description |
|----------|---------|---------|-------------|
| `DEVICE` | `"cuda"` | `SPEECHX_DEVICE` | Inference device |
| `VIDEO_FPS` | `25` | — | Output frame rate |
| `VIDEO_WIDTH` | `720` | — | Base width (actual size is read from avatar frames at `/connect`) |
| `VIDEO_HEIGHT` | `1280` | — | Base height |
| `TTS_SAMPLE_RATE` | `24000` | — | Kokoro output sample rate |
| `LIVEKIT_AUDIO_SAMPLE_RATE` | `24000` | — | LiveKit audio rate (matches TTS — no resampling) |
| `CHUNK_DURATION` | `0.32` | — | 320 ms per TTS/video chunk (8 frames @ 25 fps) |
| `FRAMES_PER_CHUNK` | `8` | — | Computed: `CHUNK_DURATION × VIDEO_FPS` |
| `PRE_ROLL_FRAMES` | `1` | — | Minimal pre-roll for fast start |
| `MUSETALK_UNET_FP16` | `True` | — | fp16 inference for lower VRAM |
| `KOKORO_VOICE` | `"af_heart"` | `SPEECHX_VOICE` | TTS voice |
| `KOKORO_SPEED` | `1.0` | — | Speech speed multiplier |
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` | LiveKit server URL |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` | LiveKit API key |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` | LiveKit API secret |
| `LIVEKIT_ROOM_NAME` | `"speech-to-video-room"` | `LIVEKIT_ROOM_NAME` | Room name |
| `DEFAULT_AVATAR` | `"sophy"` | `SPEECHX_AVATAR` | Active avatar |
| `HOST` | `"0.0.0.0"` | `SPEECH_TO_VIDEO_HOST` | Bind address |
| `PORT` | `8767` | `SPEECH_TO_VIDEO_PORT` | Bind port |
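The env-var overrides in the table follow the usual environment-fallback pattern. A hedged sketch of how such a config module typically reads its values (the actual `config.py` may differ in detail):

```python
import os

# Each setting falls back to the default when its env var is unset.
DEVICE = os.environ.get("SPEECHX_DEVICE", "cuda")
DEFAULT_AVATAR = os.environ.get("SPEECHX_AVATAR", "sophy")
HOST = os.environ.get("SPEECH_TO_VIDEO_HOST", "0.0.0.0")
PORT = int(os.environ.get("SPEECH_TO_VIDEO_PORT", "8767"))  # env vars are strings
```

Overrides can then be applied per-process, e.g. `SPEECHX_AVATAR=harry_1 python api/server.py`.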
**Optional torch.compile for UNet:**
```bash
MUSETALK_TORCH_COMPILE=1 python api/server.py
# UNet JIT compiles in background after /connect; later requests benefit from Triton kernels
```
### `backend/agent/config.py` — Voice Agent page
| Variable | Default | Env var |
|----------|---------|---------|
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` |
| `LLAMA_SERVER_URL` | `"http://localhost:8080/v1"` | `LLAMA_SERVER_URL` |
| `DEFAULT_VOICE` | `"af_sarah"` | `DEFAULT_VOICE` |
| `ASR_MODEL_SIZE` | `"base"` | `ASR_MODEL_SIZE` |
---
## API Endpoints (Avatar page)
Base URL: `http://localhost:8767`
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Liveness probe — returns `models_loaded`, `pipeline_active` |
| `GET` | `/status` | Detailed status with VRAM usage |
| `POST` | `/connect` | Load models, join LiveKit room, run warmup passes |
| `POST` | `/disconnect` | Gracefully stop pipeline and disconnect from room |
| `POST` | `/speak` | Push text into the pipeline; body: `{text, voice?, speed?}` |
| `POST` | `/get-token` | Issue LiveKit JWT; body: `{room_name, identity}` |
| `GET` | `/livekit-token` | GET alias for `/get-token` |
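For context on what `/get-token` returns: a LiveKit access token is an HS256-signed JWT whose issuer is the API key, with a `video` grant naming the room. The backend presumably uses the LiveKit SDK's token helper; this stdlib-only sketch, using the dev credentials from the config tables, just shows the token's shape (claim names follow LiveKit's published JWT format, but treat the details as illustrative):

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_livekit_jwt(api_key: str, api_secret: str, room: str, identity: str) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "iss": api_key,                              # LiveKit API key
        "sub": identity,                             # participant identity
        "exp": int(time.time()) + 3600,              # 1 h validity
        "video": {"room": room, "roomJoin": True},   # room grant
    }
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"

token = make_livekit_jwt("devkey", "secret", "speech-to-video-room", "viewer-1")
print(token.count("."))  # 2 (header.payload.signature)
```

The frontend passes this token to `Room.connect()` along with `LIVEKIT_URL`.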
### `/connect` startup sequence
1. `load_musetalk_models(avatar_name, device)` — loads UNet, VAE, Whisper encoder, avatar assets
2. `KokoroTTS()` — initializes ONNX session
3. Create LiveKit `Room`, generate backend-agent JWT with publish permissions
4. Read actual frame dimensions from the pre-computed avatar's `frame_list[0]`
5. Instantiate `AVPublisher`, `MuseTalkWorker`, `StreamingPipeline`
6. `room.connect()` → `publisher.start()` → `pipeline.start()` (idle loop begins)
7. Warmup: Whisper + TTS synchronously; UNet eager pass (or background torch.compile JIT)
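Step 6's strict ordering (join room, then publisher, then pipeline) can be sketched with stub coroutines; the stubs are hypothetical stand-ins, not the real `Room`/`AVPublisher`/`StreamingPipeline` objects:

```python
import asyncio

order: list[str] = []  # records the call order for illustration

async def room_connect() -> None:
    order.append("room.connect")

async def publisher_start() -> None:
    order.append("publisher.start")

async def pipeline_start() -> None:
    order.append("pipeline.start")

async def connect() -> None:
    # Steps 1-5 (model loading, publisher construction) would run before this.
    await room_connect()     # join the LiveKit room first
    await publisher_start()  # tracks exist, idle frames start flowing
    await pipeline_start()   # pipeline can now accept /speak requests

asyncio.run(connect())
print(order)  # ['room.connect', 'publisher.start', 'pipeline.start']
```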
### `/speak` example
```bash
curl -X POST http://localhost:8767/speak \
-H "Content-Type: application/json" \
-d '{"text": "Hello, how are you today?"}'
# → {"status": "processing", "latency_ms": 12.4}
```
---
## Frontend
### Avatar page — `App.tsx`
- Calls `POST /connect` on button click → loads models, joins LiveKit room
- Fetches LiveKit token from `POST /get-token`, connects as a viewer
- Receives remote video + audio tracks published by the backend agent
- Sends typed text via `POST /speak`
- Chat log shows `user` / `assistant` / `system` messages with timestamps
- Static fallback avatar image `/Sophy.png` shown before first video frame arrives
### Voice Agent page — `VoicePage.tsx`
- Fetches LiveKit token from the agent's built-in server at `http://localhost:3000/get-token`
- Publishes local microphone (echo-cancelled, noise-suppressed) as an audio track
- Agent worker subscribes, runs ASR → LLM → TTS, publishes reply audio back into the room
- `DataReceived` events carry `{type: "user"|"assistant", text}` for the transcript display
- Latency shown as the time between user-transcript event and assistant-text event
---
## Avatars
Pre-computed assets in `backend/avatars/<name>/`:
| File/Dir | Description |
|----------|-------------|
| `avator_info.json` | Avatar bbox, landmark, and crop metadata |
| `latents.pt` | Pre-encoded VAE latent tensors for all idle frames |
| `full_imgs/` | Full-resolution source frames |
| `mask/` | Per-frame blending masks for composite output |
| `vid_output/` | Intermediate video output (optional) |
Available avatars: **sophy** (default), **harry_1**, **christine**.
Override at runtime:
```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```
To generate a new avatar from a source video, see [docs/avatar_gen_README.md](docs/avatar_gen_README.md).
---
## Model Weights
All weights live in `backend/models/` (not committed to git).
| Path | Used by | Notes |
|------|---------|-------|
| `kokoro/kokoro-v1.0.onnx` | Avatar TTS, Voice TTS | ONNX runtime inference |
| `kokoro/voices-v1.0.bin` | Avatar TTS, Voice TTS | All voice style embeddings |
| `musetalkV15/unet.pth` | MuseTalk UNet | fp16 inference |
| `musetalkV15/musetalk.json` | MuseTalk | Architecture config |
| `sd-vae/config.json` | MuseTalk VAE | Stable Diffusion VAE |
| `whisper/` | MuseTalk audio encoder | Whisper encoder weights + config |
| `dwpose/dw-ll_ucoco_384.pth` | DWPose | Used during avatar generation |
| `face-parse-bisent/` | Face parsing | BiSeNet; used during avatar generation |
| `syncnet/latentsync_syncnet.pt` | SyncNet | Training/evaluation only, not live inference |
| `Llama-3.2-3B-Instruct-Q4_K_M.gguf` | llama-server | Served by external `llama-server` process |
---
## Troubleshooting
### `ModuleNotFoundError` on startup
```bash
conda activate avatar # must be active
cd backend
python api/server.py # run from backend/, not repo root
```
### LiveKit connection errors
```bash
docker ps | grep livekit # is it running?
docker logs livekit-server # check for port conflicts or errors
# Keys must match in both config files:
# backend/config.py β†’ LIVEKIT_API_KEY / LIVEKIT_API_SECRET
# backend/agent/config.py β†’ same values
```
### Out of VRAM (Avatar page)
```python
# backend/config.py
CHUNK_DURATION = 0.16   # halve chunk → 4 frames instead of 8
# or
MUSETALK_UNET_FP16 = False # switch to fp32 if fp16 causes NaN
```
### `llama-server` not found
Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases), extract binary, add folder to `PATH`.
### Port already in use
```bash
lsof -i :8767 # Avatar backend
lsof -i :3000 # Voice agent token server
lsof -i :8080 # llama-server
lsof -i :7880 # LiveKit
```
### Kokoro ONNX `int32` speed-tensor error
Already patched in `backend/tts/kokoro_tts.py` via `_patched_create_audio` monkey-patch that forces `speed` to `float32`. No action needed.
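For reference, the patch follows the standard wrap-and-delegate monkey-patch pattern: intercept the library method and coerce the offending argument before calling the original. A self-contained toy version (the `FakeKokoro` class is a stand-in, not the kokoro-onnx API):

```python
class FakeKokoro:
    """Stand-in for the library class whose ONNX graph expects a float32 speed."""

    def create_audio(self, text: str, voice: str, speed) -> str:
        assert isinstance(speed, float), "ONNX graph expects a float32 speed"
        return f"{text}@{speed}"

# Keep a reference to the original method, then install a wrapper that
# coerces the argument and delegates.
_original_create_audio = FakeKokoro.create_audio

def _patched_create_audio(self, text, voice, speed):
    return _original_create_audio(self, text, voice, float(speed))

FakeKokoro.create_audio = _patched_create_audio

# An int speed no longer trips the assertion; it is coerced to float first.
print(FakeKokoro().create_audio("hi", "af_heart", 1))  # hi@1.0
```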
### Video and audio out of sync
`SimpleAVSync` uses explicit PTS. Verify constants are consistent:
```python
# FRAMES_PER_CHUNK must equal CHUNK_DURATION × VIDEO_FPS exactly
# Default: 0.32 × 25 = 8 ✓
```
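The invariant can be checked, and the resulting PTS grid illustrated, in a few lines. The exact formula inside `SimpleAVSync` is not shown here, so treat `frame_pts` as an illustration consistent with the constants rather than the production code:

```python
VIDEO_FPS = 25
CHUNK_DURATION = 0.32
FRAMES_PER_CHUNK = round(CHUNK_DURATION * VIDEO_FPS)
assert FRAMES_PER_CHUNK == 8  # 0.32 × 25 = 8, the invariant above

def frame_pts(chunk_index: int, frame_in_chunk: int) -> float:
    """Presentation timestamp (seconds) of a frame on the fixed chunk grid."""
    return (chunk_index * FRAMES_PER_CHUNK + frame_in_chunk) / VIDEO_FPS

print(frame_pts(1, 0))  # 0.32, the second chunk starts exactly one chunk later
```

If `FRAMES_PER_CHUNK` drifted from `CHUNK_DURATION × VIDEO_FPS`, each chunk would shift video PTS slightly against the audio clock, and the error would accumulate over a long utterance.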
---
## Performance Notes
### GPU memory (RTX 4060 8 GB)
| Component | Estimated VRAM |
|-----------|---------------|
| MuseTalk UNet (fp16) | ~3 GB |
| Whisper encoder | ~1.5 GB |
| SD-VAE | ~1 GB |
| Avatar latents + frame buffers | ~0.5 GB |
| **Total** | **~6 GB** |
### Latency targets
| Stage | Target |
|-------|--------|
| First TTS chunk ready | < 100 ms |
| First video chunk visible | < 200 ms after `/speak` |
| Per-chunk generation (8 frames @ 25 fps) | ~80–120 ms |
| End-to-end text → visible lip-sync | ~200–350 ms |
### Optimisations already applied
| Optimisation | Location | Effect |
|-------------|----------|--------|
| `torch.set_float32_matmul_precision('high')` | `server.py` | ~5 % speedup for free via TF32 on Ampere+ |
| `MUSETALK_UNET_FP16 = True` | `config.py` | Halves UNet memory bandwidth |
| `LIVEKIT_AUDIO_SAMPLE_RATE = 24000` | `config.py` | Eliminates TTS→LiveKit resampling step |
| Sentence-boundary TTS split | `kokoro_tts.py` | Lower latency on first synthesised chunk |
| Synchronous Whisper + TTS warmup at `/connect` | `server.py` | Primes ONNX thread pools before first request |
| UNet eager warmup pass at `/connect` | `server.py` | Primes CUDA kernels |
| Optional `torch.compile` UNet JIT | `server.py` | Further throughput gain (opt-in via env var) |
| `IdleFrameGenerator` | `livekit_publisher.py` | Keeps video track alive between speech turns |
| `torch._dynamo.config.suppress_errors = True` | `server.py` | Graceful fallback to eager if Triton JIT fails |
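The sentence-boundary TTS split in the table can be approximated with a single regex. This is a naive sketch; the real splitter in `kokoro_tts.py` may handle abbreviations, ellipses, and other edge cases differently:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., !, or ? followed by whitespace, keeping the punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello there. How are you today? Great!"))
# ['Hello there.', 'How are you today?', 'Great!']
```

Synthesising the first sentence alone, instead of the whole request, is what drives the first-chunk latency down.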
---
## Available Voices
| Voice ID | Style |
|----------|-------|
| `af_heart` | Female, emotional/expressive — Avatar page default |
| `af_sarah` | Female, clear and professional — Voice Agent default |
| `af_bella` | Female, warm and friendly |
| `am_michael` | Male, professional |
| `am_fen` | Male, deep |
| `bf_emma` | Female, British accent |
| `bm_george` | Male, British accent |
---
## Credits
- **MuseTalk v1.5** — [TMElyralab/MuseTalk](https://github.com/TMElyralab/MuseTalk)
- **Kokoro TTS** — [remsky/Kokoro-ONNX](https://github.com/remsky/Kokoro-ONNX)
- **LiveKit** — [livekit/livekit](https://github.com/livekit/livekit)
- **faster-whisper** — [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- **llama.cpp** — [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
- **Llama 3.2 3B** — Meta AI