# Speech-X: Speech-to-Video Pipeline
Two interactive modes in one repo, sharing the same conda environment (`avatar`), model weights, and LiveKit server.

| Page | Mode | What it does |
|------|------|--------------|
| `/` | **Avatar** | Text → Kokoro TTS → MuseTalk lip-sync → LiveKit video+audio |
| `/voice` | **Voice Agent** | Mic → faster-whisper ASR → Llama LLM → Kokoro TTS → LiveKit audio |

---

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Project Structure](#project-structure)
3. [Conda Environment & Installation](#conda-environment--installation)
4. [Running the Application](#running-the-application)
5. [Configuration Reference](#configuration-reference)
6. [API Endpoints (Avatar page)](#api-endpoints-avatar-page)
7. [Frontend](#frontend)
8. [Avatars](#avatars)
9. [Model Weights](#model-weights)
10. [Troubleshooting](#troubleshooting)
11. [Performance Notes](#performance-notes)
12. [Available Voices](#available-voices)
13. [Credits](#credits)

---

## Architecture Overview

### Avatar Page (`/`) – MuseTalk lip-sync

```
Browser (React)
  │ POST /connect      → loads MuseTalk + Kokoro, joins LiveKit as publisher
  │ POST /speak {text} → pushes text into the streaming pipeline
  │ POST /get-token    → returns a LiveKit JWT for the frontend viewer
  │
  ▼
FastAPI server :8767 (backend/api/server.py)
  │
  ├── KokoroTTS (backend/tts/kokoro_tts.py)
  │     kokoro-v1.0.onnx → 24 kHz PCM audio, streamed in 320 ms chunks
  │     Text split at sentence boundaries for lower first-chunk latency
  │
  ├── MuseTalkWorker (backend/musetalk/worker.py)
  │     Two-phase GPU inference:
  │       1. extract_features(audio)       → Whisper encoder produces mel embeddings
  │       2. generate_batch(feats, start, n) → UNet generates n lip-synced frames
  │     ThreadPoolExecutor(max_workers=1) keeps GPU work off the async event loop
  │
  ├── AVPublisher (backend/publisher/livekit_publisher.py)
  │     Streams video (actual avatar dimensions @ 25 fps) + audio (24 kHz) to LiveKit
  │     IdleFrameGenerator loops avatar frames while the pipeline is idle
  │
  └── StreamingPipeline (backend/api/pipeline.py)
        Coordinates TTS → MuseTalk → publisher
        SimpleAVSync aligns PTS across chunks
```
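
The chunk timing above is simple to reason about: each 320 ms chunk carries 8 frames, so a PTS layer can derive every frame's timestamp from the chunk index alone. The functions below are an illustrative stand-in for that bookkeeping, not the project's actual `SimpleAVSync` code.

```python
# Illustrative PTS bookkeeping for 320 ms chunks at 25 fps.
# Names and structure are hypothetical, not the real SimpleAVSync.
CHUNK_DURATION = 0.32   # seconds of audio per chunk
VIDEO_FPS = 25
FRAMES_PER_CHUNK = round(CHUNK_DURATION * VIDEO_FPS)  # 8

def frame_pts(chunk_index: int, frame_index: int) -> float:
    """Presentation timestamp (seconds) of one frame inside a chunk."""
    return chunk_index * CHUNK_DURATION + frame_index / VIDEO_FPS

def audio_pts(chunk_index: int) -> float:
    """The audio chunk starts exactly where its first frame is shown."""
    return chunk_index * CHUNK_DURATION

# First frame of chunk 1 lands at 0.32 s, matching that chunk's audio start.
assert frame_pts(1, 0) == audio_pts(1) == 0.32
```

Because both tracks derive their timestamps from the same chunk index, audio and video stay aligned no matter how late any individual chunk is generated.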
### Voice Agent Page (`/voice`) – ASR → LLM → TTS

```
Browser (React)
  │ POST http://localhost:3000/get-token → LiveKit JWT from built-in token server
  │ Publishes local mic audio track to LiveKit room
  │
  ▼
LiveKit Agent worker (backend/agent.py)
  │ livekit-agents framework, pre-warms all models at startup
  │
  ├── ASR (backend/agent/asr.py)
  │     faster-whisper (model size via ASR_MODEL_SIZE, default: base)
  │     Buffers ~1.5 s of 16 kHz mono audio before each transcription
  │
  ├── LLM (backend/agent/llm.py)
  │     HTTP client → llama-server :8080
  │     Llama-3.2-3B-Instruct-Q4_K_M.gguf, keeps last 6 turns of history
  │
  └── TTS (backend/agent/tts.py)
        kokoro-onnx, default voice: af_sarah
        Publishes 48 kHz mono audio back to the LiveKit room
        Built-in aiohttp token server on :3000
```

---

## Project Structure

```
speech_to_video/
├── environment.yml        # Conda env export (no build strings, cross-platform)
├── README.md              # Quick-start guide
├── PROJECT.md             # This file – detailed reference
├── setup.md               # Step-by-step first-time install
├── ISSUES_AND_PLAN.md     # Known issues and roadmap
├── pyproject.toml         # Python project config
│
├── docs/
│   ├── avatar_gen_README.md
│   └── avatar_gen_phase_2.md
│
├── backend/
│   ├── config.py          # Avatar page configuration (all tunable params)
│   ├── requirements.txt   # Pip dependencies (installed inside conda env)
│   │
│   ├── api/
│   │   ├── server.py      # FastAPI app – lifespan, endpoints, warmup logic
│   │   └── pipeline.py    # StreamingPipeline + SpeechToVideoPipeline orchestrators
│   │
│   ├── agent.py           # LiveKit Agent worker entry point (Voice page)
│   ├── agent/
│   │   ├── config.py      # Voice agent config (voices, model paths, SYSTEM_PROMPT)
│   │   ├── asr.py         # faster-whisper ASR wrapper
│   │   ├── llm.py         # llama-server HTTP client
│   │   └── tts.py         # Kokoro TTS for Voice agent
│   │
│   ├── tts/
│   │   └── kokoro_tts.py  # Kokoro TTS for Avatar page
│   │                      #   Patches int32→float32 speed-tensor bug (kokoro-onnx 0.5.x)
│   │                      #   Splits text at sentence boundaries for low first-chunk latency
│   │
│   ├── musetalk/
│   │   ├── worker.py      # MuseTalkWorker – async GPU wrapper (ThreadPoolExecutor)
│   │   ├── processor.py   # Core MuseTalk inference (VAE encode/decode, UNet forward)
│   │   ├── face.py        # Face detection / cropping
│   │   ├── data/          # Dataset helpers, audio processing
│   │   ├── models/        # UNet, VAE, SyncNet model definitions
│   │   └── utils/         # Audio processor, blending, preprocessing, DWPose, face parsing
│   │
│   ├── whisper/
│   │   └── audio2feature.py  # Whisper encoder for MuseTalk audio features
│   │
│   ├── sync/
│   │   └── av_sync.py     # AVSyncGate + SimpleAVSync – PTS-based AV alignment
│   │
│   ├── publisher/
│   │   └── livekit_publisher.py  # AVPublisher + IdleFrameGenerator
│   │
│   ├── models/            # All model weights (not in git)
│   │   ├── kokoro/
│   │   │   ├── kokoro-v1.0.onnx
│   │   │   └── voices-v1.0.bin
│   │   ├── musetalkV15/
│   │   │   ├── musetalk.json
│   │   │   └── unet.pth
│   │   ├── sd-vae/
│   │   │   └── config.json
│   │   ├── whisper/
│   │   │   ├── config.json
│   │   │   └── preprocessor_config.json
│   │   ├── dwpose/
│   │   │   └── dw-ll_ucoco_384.pth
│   │   ├── face-parse-bisent/
│   │   │   ├── 79999_iter.pth
│   │   │   └── resnet18-5c106cde.pth
│   │   ├── syncnet/
│   │   │   └── latentsync_syncnet.pt
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   │
│   └── avatars/           # Pre-computed avatar assets (not in git)
│       ├── christine/     # avator_info.json, latents.pt, full_imgs/, mask/, vid_output/
│       ├── harry_1/
│       └── sophy/         # DEFAULT_AVATAR – used unless overridden by SPEECHX_AVATAR
│
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── App.tsx        # Avatar page (/)
        ├── main.tsx       # React Router setup
        ├── index.css      # Avatar page styles
        ├── voice.css      # Voice page styles
        └── pages/
            └── VoicePage.tsx  # Voice Agent page (/voice)
```

---

## Conda Environment & Installation

### Restore environment

```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```

### Frontend

```bash
cd frontend
npm install
```

### First-time model setup

See [setup.md](setup.md) for the step-by-step guide to download and place all model weights.
| ## Running the Application | |
| All four processes must run concurrently. Use four terminals, all with `conda activate avatar`. | |
| ### Terminal 1 β LiveKit server (shared) | |
| ```bash | |
| docker run --rm -d \ | |
| --name livekit-server \ | |
| -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \ | |
| livekit/livekit-server:latest \ | |
| --dev --bind 0.0.0.0 --node-ip 127.0.0.1 | |
| # Stop: docker stop livekit-server | |
| ``` | |
| ### Terminal 2 β llama-server (shared by both pages) | |
| ```bash | |
| llama-server \ | |
| -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \ | |
| -c 2048 -ngl 32 --port 8080 | |
| ``` | |
| > `llama-server` must be in PATH. Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases). | |
| ### Terminal 3 β Backend (choose one or both) | |
| **Avatar page** (lip-sync + video): | |
| ```bash | |
| conda activate avatar | |
| cd backend | |
| python api/server.py | |
| # β http://localhost:8767 | |
| ``` | |
| **Voice Agent page** (ASR β LLM β TTS audio only): | |
| ```bash | |
| conda activate avatar | |
| cd backend | |
| python agent.py dev | |
| # β LiveKit agent worker + token server on http://localhost:3000 | |
| ``` | |
| > Both can run simultaneously in separate terminals. | |
| ### Terminal 4 β Frontend (Vite dev server) | |
| ```bash | |
| cd frontend | |
| npm run dev | |
| # β http://localhost:5173 | |
| ``` | |
| - `http://localhost:5173/` β Avatar lip-sync page | |
| - `http://localhost:5173/voice` β Voice Agent page | |

---

## Configuration Reference

### `backend/config.py` – Avatar page

Values with an env var listed can be overridden at launch without editing the file.

| Variable | Default | Env var | Description |
|----------|---------|---------|-------------|
| `DEVICE` | `"cuda"` | `SPEECHX_DEVICE` | Inference device |
| `VIDEO_FPS` | `25` | – | Output frame rate |
| `VIDEO_WIDTH` | `720` | – | Base width (actual value read from avatar frames at `/connect`) |
| `VIDEO_HEIGHT` | `1280` | – | Base height |
| `TTS_SAMPLE_RATE` | `24000` | – | Kokoro output sample rate |
| `LIVEKIT_AUDIO_SAMPLE_RATE` | `24000` | – | LiveKit audio rate (matches TTS → no resampling) |
| `CHUNK_DURATION` | `0.32` | – | 320 ms per TTS/video chunk (8 frames @ 25 fps) |
| `FRAMES_PER_CHUNK` | `8` | – | Computed: `CHUNK_DURATION × VIDEO_FPS` |
| `PRE_ROLL_FRAMES` | `1` | – | Minimal pre-roll for fast start |
| `MUSETALK_UNET_FP16` | `True` | – | fp16 inference for lower VRAM |
| `KOKORO_VOICE` | `"af_heart"` | `SPEECHX_VOICE` | TTS voice |
| `KOKORO_SPEED` | `1.0` | – | Speech speed multiplier |
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` | LiveKit server URL |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` | LiveKit API key |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` | LiveKit API secret |
| `LIVEKIT_ROOM_NAME` | `"speech-to-video-room"` | `LIVEKIT_ROOM_NAME` | Room name |
| `DEFAULT_AVATAR` | `"sophy"` | `SPEECHX_AVATAR` | Active avatar |
| `HOST` | `"0.0.0.0"` | `SPEECH_TO_VIDEO_HOST` | Bind address |
| `PORT` | `8767` | `SPEECH_TO_VIDEO_PORT` | Bind port |
**Optional torch.compile for UNet:**

```bash
MUSETALK_TORCH_COMPILE=1 python api/server.py
# UNet JIT-compiles in the background after /connect; later requests benefit from Triton kernels
```
### `backend/agent/config.py` – Voice Agent page

| Variable | Default | Env var |
|----------|---------|---------|
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` |
| `LLAMA_SERVER_URL` | `"http://localhost:8080/v1"` | `LLAMA_SERVER_URL` |
| `DEFAULT_VOICE` | `"af_sarah"` | `DEFAULT_VOICE` |
| `ASR_MODEL_SIZE` | `"base"` | `ASR_MODEL_SIZE` |

---

## API Endpoints (Avatar page)

Base URL: `http://localhost:8767`

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Liveness probe – returns `models_loaded`, `pipeline_active` |
| `GET` | `/status` | Detailed status with VRAM usage |
| `POST` | `/connect` | Load models, join LiveKit room, run warmup passes |
| `POST` | `/disconnect` | Gracefully stop pipeline and disconnect from room |
| `POST` | `/speak` | Push text into the pipeline; body: `{text, voice?, speed?}` |
| `POST` | `/get-token` | Issue LiveKit JWT; body: `{room_name, identity}` |
| `GET` | `/livekit-token` | GET alias for `/get-token` |
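
The token `/get-token` issues is a standard HS256 JWT signed with `LIVEKIT_API_SECRET`, carrying the API key as issuer plus a video grant. The stdlib sketch below shows the general shape of such a token; the real endpoint uses the LiveKit SDK, and the exact grant layout here is illustrative.

```python
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_token(api_key: str, api_secret: str, room: str, identity: str) -> str:
    """Minimal HS256 JWT in the general shape of a LiveKit access token."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "iss": api_key,                              # API key as issuer
        "sub": identity,                             # participant identity
        "exp": int(time.time()) + 3600,              # 1 h lifetime
        "video": {"room": room, "roomJoin": True},   # grant (illustrative layout)
    }
    signing_input = (_b64url(json.dumps(header).encode())
                     + "." + _b64url(json.dumps(payload).encode()))
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)

token = mint_token("devkey", "secret", "speech-to-video-room", "viewer-1")
assert token.count(".") == 2   # header.payload.signature
```

In production, prefer the SDK's token builder over hand-rolled JWTs, and never ship the `devkey`/`secret` development credentials.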
### `/connect` startup sequence

1. `load_musetalk_models(avatar_name, device)` – loads UNet, VAE, Whisper encoder, avatar assets
2. `KokoroTTS()` – initializes ONNX session
3. Create LiveKit `Room`, generate backend-agent JWT with publish permissions
4. Read actual frame dimensions from the pre-computed avatar's `frame_list[0]`
5. Instantiate `AVPublisher`, `MuseTalkWorker`, `StreamingPipeline`
6. `room.connect()` → `publisher.start()` → `pipeline.start()` (idle loop begins)
7. Warmup: Whisper + TTS synchronously; UNet eager pass (or background torch.compile JIT)
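
The ordering matters: models load before the room join, and the idle loop starts before warmup so viewers see frames immediately. A toy asyncio sketch of that sequencing, where every function is a hypothetical stand-in that just records the order it ran in:

```python
import asyncio

# Toy sequencing sketch of the /connect steps above; all names hypothetical.
order: list[str] = []

async def step(name: str) -> None:
    order.append(name)

async def connect() -> None:
    await step("load_models")      # 1-2: MuseTalk + Kokoro must exist first
    await step("mint_jwt")         # 3: publish-capable token
    await step("read_dimensions")  # 4: actual avatar frame size
    await step("instantiate")      # 5: publisher / worker / pipeline objects
    await step("room_connect")     # 6: join room; idle loop starts streaming
    await step("warmup")           # 7: prime kernels after video is already live

asyncio.run(connect())
assert order[0] == "load_models" and order[-1] == "warmup"
```

Putting warmup last is the key design choice: the idle frames from step 6 keep the track alive while kernels compile, so the first `/speak` pays no cold-start cost.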
### `/speak` example

```bash
curl -X POST http://localhost:8767/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you today?"}'
# → {"status": "processing", "latency_ms": 12.4}
```

---

## Frontend

### Avatar page – `App.tsx`

- Calls `POST /connect` on button click → loads models, joins LiveKit room
- Fetches a LiveKit token from `POST /get-token`, connects as a viewer
- Receives the remote video + audio tracks published by the backend agent
- Sends typed text via `POST /speak`
- Chat log shows `user` / `assistant` / `system` messages with timestamps
- Static fallback avatar image `/Sophy.png` is shown before the first video frame arrives

### Voice Agent page – `VoicePage.tsx`

- Fetches a LiveKit token from the agent's built-in server at `http://localhost:3000/get-token`
- Publishes the local microphone (echo-cancelled, noise-suppressed) as an audio track
- The agent worker subscribes, runs ASR → LLM → TTS, and publishes the reply audio back into the room
- `DataReceived` events carry `{type: "user"|"assistant", text}` for the transcript display
- Latency is shown as the time between the user-transcript event and the assistant-text event

---

## Avatars

Pre-computed assets in `backend/avatars/<name>/`:

| File/Dir | Description |
|----------|-------------|
| `avator_info.json` | Avatar bbox, landmark, and crop metadata |
| `latents.pt` | Pre-encoded VAE latent tensors for all idle frames |
| `full_imgs/` | Full-resolution source frames |
| `mask/` | Per-frame blending masks for composite output |
| `vid_output/` | Intermediate video output (optional) |

Available avatars: **sophy** (default), **harry_1**, **christine**.

Override at runtime:

```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```

To generate a new avatar from a source video, see [docs/avatar_gen_README.md](docs/avatar_gen_README.md).
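
A quick way to sanity-check a freshly generated avatar directory is to verify the layout from the table above is complete. The helper below is a hypothetical convenience, not part of the repo:

```python
from pathlib import Path

# "avator" is the real on-disk spelling used by the asset metadata file.
REQUIRED_FILES = ("avator_info.json", "latents.pt")
REQUIRED_DIRS = ("full_imgs", "mask")   # vid_output/ is optional

def missing_assets(avatar_dir: str) -> list[str]:
    """Return names of required avatar assets that are absent."""
    root = Path(avatar_dir)
    missing = [f for f in REQUIRED_FILES if not (root / f).is_file()]
    missing += [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    return missing
```

For example, `missing_assets("backend/avatars/sophy")` should return an empty list for a complete avatar; anything it lists points at a failed generation step.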

---

## Model Weights

All weights live in `backend/models/` (not committed to git).

| Path | Used by | Notes |
|------|---------|-------|
| `kokoro/kokoro-v1.0.onnx` | Avatar TTS, Voice TTS | ONNX Runtime inference |
| `kokoro/voices-v1.0.bin` | Avatar TTS, Voice TTS | All voice style embeddings |
| `musetalkV15/unet.pth` | MuseTalk UNet | fp16 inference |
| `musetalkV15/musetalk.json` | MuseTalk | Architecture config |
| `sd-vae/config.json` | MuseTalk VAE | Stable Diffusion VAE |
| `whisper/` | MuseTalk audio encoder | Whisper encoder weights + config |
| `dwpose/dw-ll_ucoco_384.pth` | DWPose | Used during avatar generation |
| `face-parse-bisent/` | Face parsing | BiSeNet; used during avatar generation |
| `syncnet/latentsync_syncnet.pt` | SyncNet | Training/evaluation only, not live inference |
| `Llama-3.2-3B-Instruct-Q4_K_M.gguf` | llama-server | Served by the external `llama-server` process |

---

## Troubleshooting

### `ModuleNotFoundError` on startup

```bash
conda activate avatar   # must be active
cd backend
python api/server.py    # run from backend/, not the repo root
```

### LiveKit connection errors

```bash
docker ps | grep livekit      # is it running?
docker logs livekit-server    # check for port conflicts or errors
# Keys must match in both config files:
#   backend/config.py       → LIVEKIT_API_KEY / LIVEKIT_API_SECRET
#   backend/agent/config.py → same values
```

### Out of VRAM (Avatar page)

```python
# backend/config.py
CHUNK_DURATION = 0.16        # halve the chunk → 4 frames instead of 8
# or
MUSETALK_UNET_FP16 = False   # switch to fp32 if fp16 causes NaNs
```

### `llama-server` not found

Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases), extract the binary, and add its folder to `PATH`.

### Port already in use

```bash
lsof -i :8767   # Avatar backend
lsof -i :3000   # Voice agent token server
lsof -i :8080   # llama-server
lsof -i :7880   # LiveKit
```

### Kokoro ONNX `int32` speed-tensor error

Already patched in `backend/tts/kokoro_tts.py` via a `_patched_create_audio` monkey-patch that forces `speed` to `float32`. No action needed.

### Video and audio out of sync

`SimpleAVSync` uses explicit PTS. Verify the constants are consistent:

```python
# FRAMES_PER_CHUNK must equal CHUNK_DURATION × VIDEO_FPS exactly
# Default: 0.32 × 25 = 8 ✓
```
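
That consistency requirement is easy to check programmatically; below is a hedged sketch of such a guard (an illustrative add-on, not code the repo is documented to contain):

```python
# Hypothetical startup guard for the AV-sync invariant described above.
CHUNK_DURATION = 0.32
VIDEO_FPS = 25
FRAMES_PER_CHUNK = 8

def check_sync_constants() -> None:
    expected = CHUNK_DURATION * VIDEO_FPS
    # Require a whole-frame chunk: a fractional frame count would accumulate
    # drift between audio PTS and video PTS over successive chunks.
    if abs(expected - FRAMES_PER_CHUNK) > 1e-9:
        raise ValueError(
            f"FRAMES_PER_CHUNK={FRAMES_PER_CHUNK} but "
            f"CHUNK_DURATION x VIDEO_FPS = {expected}"
        )

check_sync_constants()   # defaults: 0.32 x 25 = 8, passes
```

Running such a check once at startup turns a subtle, slowly-drifting desync into an immediate, loud configuration error.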

---

## Performance Notes

### GPU memory (RTX 4060 8 GB)

| Component | Estimated VRAM |
|-----------|----------------|
| MuseTalk UNet (fp16) | ~3 GB |
| Whisper encoder | ~1.5 GB |
| SD-VAE | ~1 GB |
| Avatar latents + frame buffers | ~0.5 GB |
| **Total** | **~6 GB** |

### Latency targets

| Stage | Target |
|-------|--------|
| First TTS chunk ready | < 100 ms |
| First video chunk visible | < 200 ms after `/speak` |
| Per-chunk generation (8 frames @ 25 fps) | ~80–120 ms |
| End-to-end text → visible lip-sync | ~200–350 ms |
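
These targets leave real-time headroom: each chunk represents 320 ms of playback, so generating it in 80–120 ms corresponds to a real-time factor of roughly 0.25–0.38. A quick check of that arithmetic:

```python
# Real-time headroom check for the per-chunk target above.
CHUNK_MS = 320.0                          # 8 frames @ 25 fps = 320 ms of playback
GEN_MS_LOW, GEN_MS_HIGH = 80.0, 120.0     # per-chunk generation target range

rtf_low = GEN_MS_LOW / CHUNK_MS    # 0.25: generation uses 25% of playback time
rtf_high = GEN_MS_HIGH / CHUNK_MS  # 0.375
assert rtf_low == 0.25 and rtf_high == 0.375
# Anything under 1.0 keeps up with playback; ~0.25-0.38 leaves margin for
# TTS, publishing, and the occasional slow chunk.
```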
### Optimisations already applied

| Optimisation | Location | Effect |
|--------------|----------|--------|
| `torch.set_float32_matmul_precision('high')` | `server.py` | Free ~5% speed-up via TF32 on Ampere+ |
| `MUSETALK_UNET_FP16 = True` | `config.py` | Halves UNet memory bandwidth |
| `LIVEKIT_AUDIO_SAMPLE_RATE = 24000` | `config.py` | Eliminates the TTS→LiveKit resampling step |
| Sentence-boundary TTS split | `kokoro_tts.py` | Lower latency on the first synthesised chunk |
| Synchronous Whisper + TTS warmup at `/connect` | `server.py` | Primes ONNX thread pools before the first request |
| UNet eager warmup pass at `/connect` | `server.py` | Primes CUDA kernels |
| Optional `torch.compile` UNet JIT | `server.py` | Further throughput gain (opt-in via env var) |
| `IdleFrameGenerator` | `livekit_publisher.py` | Keeps the video track alive between speech turns |
| `torch._dynamo.config.suppress_errors = True` | `server.py` | Graceful fallback to eager if the Triton JIT fails |

---

## Available Voices

| Voice ID | Style |
|----------|-------|
| `af_heart` | Female, emotional/expressive – Avatar page default |
| `af_sarah` | Female, clear and professional – Voice Agent default |
| `af_bella` | Female, warm and friendly |
| `am_michael` | Male, professional |
| `am_fen` | Male, deep |
| `bf_emma` | Female, British accent |
| `bm_george` | Male, British accent |

---

## Credits

- **MuseTalk v1.5** – [TMElyralab/MuseTalk](https://github.com/TMElyralab/MuseTalk)
- **Kokoro TTS** – [remsky/Kokoro-ONNX](https://github.com/remsky/Kokoro-ONNX)
- **LiveKit** – [livekit/livekit](https://github.com/livekit/livekit)
- **faster-whisper** – [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- **llama.cpp** – [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
- **Llama 3.2 3B** – Meta AI