# Speech-X — Speech-to-Video Pipeline

Two interactive modes in one repo, sharing the same conda environment (`avatar`), model weights, and LiveKit server.

| Page | Mode | What it does |
|------|------|--------------|
| `/` | **Avatar** | Text → Kokoro TTS → MuseTalk lip-sync → LiveKit video+audio |
| `/voice` | **Voice Agent** | Mic → faster-whisper ASR → Llama LLM → Kokoro TTS → LiveKit audio |

---

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Project Structure](#project-structure)
3. [Conda Environment & Installation](#conda-environment--installation)
4. [Running the Application](#running-the-application)
5. [Configuration Reference](#configuration-reference)
6. [API Endpoints (Avatar page)](#api-endpoints-avatar-page)
7. [Frontend](#frontend)
8. [Avatars](#avatars)
9. [Model Weights](#model-weights)
10. [Troubleshooting](#troubleshooting)
11. [Performance Notes](#performance-notes)

---

## Architecture Overview

### Avatar Page (`/`) — MuseTalk lip-sync

```
Browser (React)
  │  POST /connect      → loads MuseTalk + Kokoro, joins LiveKit as publisher
  │  POST /speak {text} → pushes text into the streaming pipeline
  │  POST /get-token    → returns a LiveKit JWT for the frontend viewer
  ▼
FastAPI server :8767 (backend/api/server.py)
  │
  ├── KokoroTTS (backend/tts/kokoro_tts.py)
  │     kokoro-v1.0.onnx → 24 kHz PCM audio, streamed in 320 ms chunks
  │     Text split at sentence boundaries for lower first-chunk latency
  │
  ├── MuseTalkWorker (backend/musetalk/worker.py)
  │     Two-phase GPU inference:
  │       1. extract_features(audio)        → Whisper encoder produces mel embeddings
  │       2. generate_batch(feats, start, n) → UNet generates n lip-synced frames
  │     ThreadPoolExecutor(max_workers=1) keeps GPU work off the async event loop
  │
  ├── AVPublisher (backend/publisher/livekit_publisher.py)
  │     Streams video (actual avatar dimensions @ 25 fps) + audio (24 kHz) to LiveKit
  │     IdleFrameGenerator loops avatar frames while the pipeline is idle
  │
  └── StreamingPipeline (backend/api/pipeline.py)
        Coordinates TTS → MuseTalk → publisher
        SimpleAVSync aligns PTS across chunks
```

### Voice Agent Page (`/voice`) — ASR → LLM → TTS

```
Browser (React)
  │  POST http://localhost:3000/get-token → LiveKit JWT from built-in token server
  │  Publishes local mic audio track to LiveKit room
  ▼
LiveKit Agent worker (backend/agent.py)
  │  livekit-agents framework, pre-warms all models at startup
  │
  ├── ASR (backend/agent/asr.py)
  │     faster-whisper (model size via ASR_MODEL_SIZE, default: base)
  │     Buffers ~1.5 s of 16 kHz mono audio before each transcription
  │
  ├── LLM (backend/agent/llm.py)
  │     HTTP client → llama-server :8080
  │     Llama-3.2-3B-Instruct-Q4_K_M.gguf, keeps last 6 turns of history
  │
  └── TTS (backend/agent/tts.py)
        kokoro-onnx, default voice: af_sarah
        Publishes 48 kHz mono audio back to the LiveKit room

Built-in aiohttp token server on :3000
```

---

## Project Structure

```
speech_to_video/
├── environment.yml          # Conda env export (no build strings, cross-platform)
├── README.md                # Quick-start guide
├── PROJECT.md               # This file — detailed reference
├── setup.md                 # Step-by-step first-time install
├── ISSUES_AND_PLAN.md       # Known issues and roadmap
├── pyproject.toml           # Python project config
│
├── docs/
│   ├── avatar_gen_README.md
│   └── avatar_gen_phase_2.md
│
├── backend/
│   ├── config.py            # Avatar page configuration (all tunable params)
│   ├── requirements.txt     # Pip dependencies (installed inside conda env)
│   │
│   ├── api/
│   │   ├── server.py        # FastAPI app — lifespan, endpoints, warmup logic
│   │   └── pipeline.py      # StreamingPipeline + SpeechToVideoPipeline orchestrators
│   │
│   ├── agent.py             # LiveKit Agent worker entry point (Voice page)
│   ├── agent/
│   │   ├── config.py        # Voice agent config (voices, model paths, SYSTEM_PROMPT)
│   │   ├── asr.py           # faster-whisper ASR wrapper
│   │   ├── llm.py           # llama-server HTTP client
│   │   └── tts.py           # Kokoro TTS for Voice agent
│   │
│   ├── tts/
│   │   └── kokoro_tts.py    # Kokoro TTS for Avatar page
│   │         Patches int32→float32 speed-tensor bug (kokoro-onnx 0.5.x)
│   │         Splits text at sentence boundaries for low first-chunk latency
│   │
│   ├── musetalk/
│   │   ├── worker.py        # MuseTalkWorker — async GPU wrapper (ThreadPoolExecutor)
│   │   ├── processor.py     # Core MuseTalk inference (VAE encode/decode, UNet forward)
│   │   ├── face.py          # Face detection / cropping
│   │   ├── data/            # Dataset helpers, audio processing
│   │   ├── models/          # UNet, VAE, SyncNet model definitions
│   │   └── utils/           # Audio processor, blending, preprocessing, DWPose, face parsing
│   │
│   ├── whisper/
│   │   └── audio2feature.py # Whisper encoder for MuseTalk audio features
│   │
│   ├── sync/
│   │   └── av_sync.py       # AVSyncGate + SimpleAVSync — PTS-based AV alignment
│   │
│   ├── publisher/
│   │   └── livekit_publisher.py # AVPublisher + IdleFrameGenerator
│   │
│   ├── models/              # All model weights (not in git)
│   │   ├── kokoro/
│   │   │   ├── kokoro-v1.0.onnx
│   │   │   └── voices-v1.0.bin
│   │   ├── musetalkV15/
│   │   │   ├── musetalk.json
│   │   │   └── unet.pth
│   │   ├── sd-vae/
│   │   │   └── config.json
│   │   ├── whisper/
│   │   │   ├── config.json
│   │   │   └── preprocessor_config.json
│   │   ├── dwpose/
│   │   │   └── dw-ll_ucoco_384.pth
│   │   ├── face-parse-bisent/
│   │   │   ├── 79999_iter.pth
│   │   │   └── resnet18-5c106cde.pth
│   │   ├── syncnet/
│   │   │   └── latentsync_syncnet.pt
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   │
│   └── avatars/             # Pre-computed avatar assets (not in git)
│       ├── christine/       # avator_info.json, latents.pt, full_imgs/, mask/, vid_output/
│       ├── harry_1/
│       └── sophy/           # DEFAULT_AVATAR — used unless overridden by SPEECHX_AVATAR
│
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── App.tsx          # Avatar page (/)
        ├── main.tsx         # React Router setup
        ├── index.css        # Avatar page styles
        ├── voice.css        # Voice page styles
        └── pages/
            └── VoicePage.tsx # Voice Agent page (/voice)
```

---

## Conda Environment & Installation

### Restore environment

```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```

### Frontend

```bash
cd frontend
npm install
```

### First-time model setup

See [setup.md](setup.md) for the step-by-step guide to download and place all model weights.

---

## Running the Application

All four processes must run concurrently. Use four terminals, all with `conda activate avatar`.

### Terminal 1 — LiveKit server (shared)

```bash
docker run --rm -d \
  --name livekit-server \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server:latest \
  --dev --bind 0.0.0.0 --node-ip 127.0.0.1
# Stop: docker stop livekit-server
```

### Terminal 2 — llama-server (shared by both pages)

```bash
llama-server \
  -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -c 2048 -ngl 32 --port 8080
```

> `llama-server` must be in PATH. Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases).

### Terminal 3 — Backend (choose one or both)

**Avatar page** (lip-sync + video):

```bash
conda activate avatar
cd backend
python api/server.py
# → http://localhost:8767
```

**Voice Agent page** (ASR → LLM → TTS audio only):

```bash
conda activate avatar
cd backend
python agent.py dev
# → LiveKit agent worker + token server on http://localhost:3000
```

> Both can run simultaneously in separate terminals.

### Terminal 4 — Frontend (Vite dev server)

```bash
cd frontend
npm run dev
# → http://localhost:5173
```

- `http://localhost:5173/` — Avatar lip-sync page
- `http://localhost:5173/voice` — Voice Agent page

---

## Configuration Reference

### `backend/config.py` — Avatar page

All values are overridable via environment variables.
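The override pattern can be sketched as follows — a minimal illustration, not the repo's actual code (the `env_str`/`env_int` helper names are hypothetical), though the env-var names and defaults match the table:

```python
import os

# Hypothetical sketch of the env-override pattern used by backend/config.py:
# every value falls back to its default unless the matching env var is set.
def env_str(name: str, default: str) -> str:
    return os.environ.get(name, default)

def env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, str(default)))

DEFAULT_AVATAR = env_str("SPEECHX_AVATAR", "sophy")      # e.g. harry_1, christine
KOKORO_VOICE = env_str("SPEECHX_VOICE", "af_heart")
PORT = env_int("SPEECH_TO_VIDEO_PORT", 8767)
```

With this pattern, `SPEECHX_AVATAR=harry_1 python api/server.py` changes the active avatar for one run without editing any file.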
| Variable | Default | Env var | Description |
|----------|---------|---------|-------------|
| `DEVICE` | `"cuda"` | `SPEECHX_DEVICE` | Inference device |
| `VIDEO_FPS` | `25` | — | Output frame rate |
| `VIDEO_WIDTH` | `720` | — | Base width (actual width is read from avatar frames at `/connect`) |
| `VIDEO_HEIGHT` | `1280` | — | Base height |
| `TTS_SAMPLE_RATE` | `24000` | — | Kokoro output sample rate |
| `LIVEKIT_AUDIO_SAMPLE_RATE` | `24000` | — | LiveKit audio rate (matches TTS — no resampling) |
| `CHUNK_DURATION` | `0.32` | — | 320 ms per TTS/video chunk (8 frames @ 25 fps) |
| `FRAMES_PER_CHUNK` | `8` | — | Computed: `CHUNK_DURATION × VIDEO_FPS` |
| `PRE_ROLL_FRAMES` | `1` | — | Minimal pre-roll for fast start |
| `MUSETALK_UNET_FP16` | `True` | — | fp16 inference for lower VRAM |
| `KOKORO_VOICE` | `"af_heart"` | `SPEECHX_VOICE` | TTS voice |
| `KOKORO_SPEED` | `1.0` | — | Speech speed multiplier |
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` | LiveKit server URL |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` | LiveKit API key |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` | LiveKit API secret |
| `LIVEKIT_ROOM_NAME` | `"speech-to-video-room"` | `LIVEKIT_ROOM_NAME` | Room name |
| `DEFAULT_AVATAR` | `"sophy"` | `SPEECHX_AVATAR` | Active avatar |
| `HOST` | `"0.0.0.0"` | `SPEECH_TO_VIDEO_HOST` | Bind address |
| `PORT` | `8767` | `SPEECH_TO_VIDEO_PORT` | Bind port |

**Optional torch.compile for UNet:**

```bash
MUSETALK_TORCH_COMPILE=1 python api/server.py
# UNet JIT compiles in the background after /connect; later requests benefit from Triton kernels
```

### `backend/agent/config.py` — Voice Agent page

| Variable | Default | Env var |
|----------|---------|---------|
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` |
| `LLAMA_SERVER_URL` | `"http://localhost:8080/v1"` | `LLAMA_SERVER_URL` |
| `DEFAULT_VOICE` | `"af_sarah"` | `DEFAULT_VOICE` |
| `ASR_MODEL_SIZE` | `"base"` | `ASR_MODEL_SIZE` |

---

## API Endpoints (Avatar page)

Base URL: `http://localhost:8767`

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Liveness probe — returns `models_loaded`, `pipeline_active` |
| `GET` | `/status` | Detailed status with VRAM usage |
| `POST` | `/connect` | Load models, join LiveKit room, run warmup passes |
| `POST` | `/disconnect` | Gracefully stop pipeline and disconnect from room |
| `POST` | `/speak` | Push text into the pipeline; body: `{text, voice?, speed?}` |
| `POST` | `/get-token` | Issue LiveKit JWT; body: `{room_name, identity}` |
| `GET` | `/livekit-token` | GET alias for `/get-token` |

### `/connect` startup sequence

1. `load_musetalk_models(avatar_name, device)` — loads UNet, VAE, Whisper encoder, avatar assets
2. `KokoroTTS()` — initializes ONNX session
3. Create LiveKit `Room`, generate backend-agent JWT with publish permissions
4. Read actual frame dimensions from the pre-computed avatar's `frame_list[0]`
5. Instantiate `AVPublisher`, `MuseTalkWorker`, `StreamingPipeline`
6. `room.connect()` → `publisher.start()` → `pipeline.start()` (idle loop begins)
7. Warmup: Whisper + TTS synchronously; UNet eager pass (or background torch.compile JIT)

### `/speak` example

```bash
curl -X POST http://localhost:8767/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you today?"}'
# → {"status": "processing", "latency_ms": 12.4}
```

---

## Frontend

### Avatar page — `App.tsx`

- Calls `POST /connect` on button click → loads models, joins LiveKit room
- Fetches a LiveKit token from `POST /get-token`, connects as a viewer
- Receives the remote video + audio tracks published by the backend agent
- Sends typed text via `POST /speak`
- Chat log shows `user` / `assistant` / `system` messages with timestamps
- Static fallback avatar image `/Sophy.png` shown before the first video frame arrives

### Voice Agent page — `VoicePage.tsx`

- Fetches a LiveKit token from the agent's built-in server at `http://localhost:3000/get-token`
- Publishes the local microphone (echo-cancelled, noise-suppressed) as an audio track
- Agent worker subscribes, runs ASR → LLM → TTS, and publishes the reply audio back into the room
- `DataReceived` events carry `{type: "user"|"assistant", text}` for the transcript display
- Latency is shown as the time between the user-transcript event and the assistant-text event

---

## Avatars

Pre-computed assets in `backend/avatars/<avatar_name>/`:

| File/Dir | Description |
|----------|-------------|
| `avator_info.json` | Avatar bbox, landmark, and crop metadata |
| `latents.pt` | Pre-encoded VAE latent tensors for all idle frames |
| `full_imgs/` | Full-resolution source frames |
| `mask/` | Per-frame blending masks for composite output |
| `vid_output/` | Intermediate video output (optional) |

Available avatars: **sophy** (default), **harry_1**, **christine**. Override at runtime:

```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```

To generate a new avatar from a source video, see [docs/avatar_gen_README.md](docs/avatar_gen_README.md).

---

## Model Weights

All weights live in `backend/models/` (not committed to git).
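Since the weights are not in git, a fresh clone will fail at `/connect` until they are placed. A small pre-flight check can catch this early — a hypothetical helper, not part of the repo, covering a few of the files listed below:

```python
from pathlib import Path

# Subset of the weight files documented for backend/models/ (see table below).
EXPECTED = [
    "kokoro/kokoro-v1.0.onnx",
    "kokoro/voices-v1.0.bin",
    "musetalkV15/unet.pth",
    "musetalkV15/musetalk.json",
    "Llama-3.2-3B-Instruct-Q4_K_M.gguf",
]

def missing_weights(models_dir: str) -> list[str]:
    """Return the relative paths under models_dir that do not exist yet."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    for rel in missing_weights("backend/models"):
        print(f"missing: {rel}")
```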
| Path | Used by | Notes |
|------|---------|-------|
| `kokoro/kokoro-v1.0.onnx` | Avatar TTS, Voice TTS | ONNX Runtime inference |
| `kokoro/voices-v1.0.bin` | Avatar TTS, Voice TTS | All voice style embeddings |
| `musetalkV15/unet.pth` | MuseTalk UNet | fp16 inference |
| `musetalkV15/musetalk.json` | MuseTalk | Architecture config |
| `sd-vae/config.json` | MuseTalk VAE | Stable Diffusion VAE |
| `whisper/` | MuseTalk audio encoder | Whisper encoder weights + config |
| `dwpose/dw-ll_ucoco_384.pth` | DWPose | Used during avatar generation |
| `face-parse-bisent/` | Face parsing | BiSeNet; used during avatar generation |
| `syncnet/latentsync_syncnet.pt` | SyncNet | Training/evaluation only, not live inference |
| `Llama-3.2-3B-Instruct-Q4_K_M.gguf` | llama-server | Served by external `llama-server` process |

---

## Troubleshooting

### `ModuleNotFoundError` on startup

```bash
conda activate avatar   # must be active
cd backend
python api/server.py    # run from backend/, not the repo root
```

### LiveKit connection errors

```bash
docker ps | grep livekit      # is it running?
docker logs livekit-server    # check for port conflicts or errors
# Keys must match in both config files:
#   backend/config.py       → LIVEKIT_API_KEY / LIVEKIT_API_SECRET
#   backend/agent/config.py → same values
```

### Out of VRAM (Avatar page)

```python
# backend/config.py
CHUNK_DURATION = 0.16        # halve the chunk → 4 frames instead of 8
# or
MUSETALK_UNET_FP16 = False   # switch to fp32 if fp16 causes NaNs
```

### `llama-server` not found

Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases), extract the binary, and add its folder to `PATH`.

### Port already in use

```bash
lsof -i :8767   # Avatar backend
lsof -i :3000   # Voice agent token server
lsof -i :8080   # llama-server
lsof -i :7880   # LiveKit
```

### Kokoro ONNX `int32` speed-tensor error

Already patched in `backend/tts/kokoro_tts.py` via a `_patched_create_audio` monkey-patch that forces `speed` to `float32`. No action needed.
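The idea behind the patch can be sketched as follows — a hedged illustration of the dtype coercion, not the actual `_patched_create_audio` code (the real patch wraps kokoro-onnx internals; the `coerce_speed` helper below is hypothetical):

```python
import numpy as np

# Sketch only: kokoro-onnx 0.5.x builds the `speed` input as int32, while the
# ONNX graph expects float32. Coercing the value before inference avoids the
# dtype mismatch regardless of whether the caller passes 1 or 1.0.
def coerce_speed(speed: float) -> np.ndarray:
    """Return the speed value as a 1-element float32 tensor."""
    return np.asarray([speed], dtype=np.float32)
```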
### Video and audio out of sync

`SimpleAVSync` uses explicit PTS. Verify the constants are consistent:

```python
# FRAMES_PER_CHUNK must equal CHUNK_DURATION × VIDEO_FPS exactly
# Default: 0.32 × 25 = 8 ✓
```

---

## Performance Notes

### GPU memory (RTX 4060 8 GB)

| Component | Estimated VRAM |
|-----------|----------------|
| MuseTalk UNet (fp16) | ~3 GB |
| Whisper encoder | ~1.5 GB |
| SD-VAE | ~1 GB |
| Avatar latents + frame buffers | ~0.5 GB |
| **Total** | **~6 GB** |

### Latency targets

| Stage | Target |
|-------|--------|
| First TTS chunk ready | < 100 ms |
| First video chunk visible | < 200 ms after `/speak` |
| Per-chunk generation (8 frames @ 25 fps) | ~80–120 ms |
| End-to-end text → visible lip-sync | ~200–350 ms |

### Optimisations already applied

| Optimisation | Location | Effect |
|--------------|----------|--------|
| `torch.set_float32_matmul_precision('high')` | `server.py` | ~5 % speedup via TF32 on Ampere+ |
| `MUSETALK_UNET_FP16 = True` | `config.py` | Halves UNet memory bandwidth |
| `LIVEKIT_AUDIO_SAMPLE_RATE = 24000` | `config.py` | Eliminates the TTS→LiveKit resampling step |
| Sentence-boundary TTS split | `kokoro_tts.py` | Lower latency on the first synthesised chunk |
| Synchronous Whisper + TTS warmup at `/connect` | `server.py` | Primes ONNX thread pools before the first request |
| UNet eager warmup pass at `/connect` | `server.py` | Primes CUDA kernels |
| Optional `torch.compile` UNet JIT | `server.py` | Further throughput gain (opt-in via env var) |
| `IdleFrameGenerator` | `livekit_publisher.py` | Keeps the video track alive between speech turns |
| `torch._dynamo.config.suppress_errors = True` | `server.py` | Graceful fallback to eager if the Triton JIT fails |

---

## Available Voices

| Voice ID | Style |
|----------|-------|
| `af_heart` | Female, emotional/expressive — Avatar page default |
| `af_sarah` | Female, clear and professional — Voice Agent default |
| `af_bella` | Female, warm and friendly |
| `am_michael` | Male, professional |
| `am_fen` | Male, deep |
| `bf_emma` | Female, British accent |
| `bm_george` | Male, British accent |

---

## Credits

- **MuseTalk v1.5** — [TMElyralab/MuseTalk](https://github.com/TMElyralab/MuseTalk)
- **Kokoro TTS** — [remsky/Kokoro-ONNX](https://github.com/remsky/Kokoro-ONNX)
- **LiveKit** — [livekit/livekit](https://github.com/livekit/livekit)
- **faster-whisper** — [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- **llama.cpp** — [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
- **Llama 3.2 3B** — Meta AI