# Speech-X — Speech-to-Video Pipeline
Two interactive modes in one repo, sharing the same conda environment (`avatar`), model weights, and LiveKit server.
| Page | Mode | What it does |
|------|------|--------------|
| `/` | **Avatar** | Text → Kokoro TTS → MuseTalk lip-sync → LiveKit video+audio |
| `/voice` | **Voice Agent** | Mic → faster-whisper ASR → Llama LLM → Kokoro TTS → LiveKit audio |
---
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Project Structure](#project-structure)
3. [Conda Environment & Installation](#conda-environment--installation)
4. [Running the Application](#running-the-application)
5. [Configuration Reference](#configuration-reference)
6. [API Endpoints (Avatar page)](#api-endpoints-avatar-page)
7. [Frontend](#frontend)
8. [Avatars](#avatars)
9. [Model Weights](#model-weights)
10. [Troubleshooting](#troubleshooting)
11. [Performance Notes](#performance-notes)
---
## Architecture Overview
### Avatar Page (`/`) — MuseTalk lip-sync
```
Browser (React)
  │  POST /connect      → loads MuseTalk + Kokoro, joins LiveKit as publisher
  │  POST /speak {text} → pushes text into the streaming pipeline
  │  POST /get-token    → returns a LiveKit JWT for the frontend viewer
  ▼
FastAPI server :8767 (backend/api/server.py)
  │
  ├── KokoroTTS (backend/tts/kokoro_tts.py)
  │     kokoro-v1.0.onnx → 24 kHz PCM audio, streamed in 320 ms chunks
  │     Text split at sentence boundaries for lower first-chunk latency
  │
  ├── MuseTalkWorker (backend/musetalk/worker.py)
  │     Two-phase GPU inference:
  │       1. extract_features(audio)        → Whisper encoder produces mel embeddings
  │       2. generate_batch(feats, start, n) → UNet generates n lip-synced frames
  │     ThreadPoolExecutor(max_workers=1) keeps GPU work off the async event loop
  │
  ├── AVPublisher (backend/publisher/livekit_publisher.py)
  │     Streams video (actual avatar dimensions @ 25 fps) + audio (24 kHz) to LiveKit
  │     IdleFrameGenerator loops avatar frames while pipeline is idle
  │
  └── StreamingPipeline (backend/api/pipeline.py)
        Coordinates TTS → MuseTalk → publisher
        SimpleAVSync aligns PTS across chunks
```
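The `ThreadPoolExecutor(max_workers=1)` pattern mentioned above can be sketched in a few lines. This is a toy illustration of the technique, not the real `MuseTalkWorker` API; the function names and return values are hypothetical:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Single-worker executor: GPU calls are serialized with each other and
# never block the asyncio event loop that feeds audio to the publisher.
_gpu_executor = ThreadPoolExecutor(max_workers=1)

def _generate_frames_blocking(n: int) -> list[int]:
    # Stand-in for the blocking GPU call (the UNet forward in the real worker).
    return list(range(n))

async def generate_batch(n: int) -> list[int]:
    loop = asyncio.get_running_loop()
    # The blocking call runs on the worker thread; the coroutine suspends
    # here, so other tasks keep making progress in the meantime.
    return await loop.run_in_executor(_gpu_executor, _generate_frames_blocking, n)

frames = asyncio.run(generate_batch(8))
print(frames)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because the executor has exactly one worker, concurrent `generate_batch` calls queue up rather than contending for the GPU.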
### Voice Agent Page (`/voice`) — ASR → LLM → TTS
```
Browser (React)
  │  POST http://localhost:3000/get-token → LiveKit JWT from built-in token server
  │  Publishes local mic audio track to LiveKit room
  ▼
LiveKit Agent worker (backend/agent.py)
  │  livekit-agents framework, pre-warms all models at startup
  │
  ├── ASR (backend/agent/asr.py)
  │     faster-whisper (model size via ASR_MODEL_SIZE, default: base)
  │     Buffers ~1.5 s of 16 kHz mono audio before each transcription
  │
  ├── LLM (backend/agent/llm.py)
  │     HTTP client → llama-server :8080
  │     Llama-3.2-3B-Instruct-Q4_K_M.gguf, keeps last 6 turns of history
  │
  └── TTS (backend/agent/tts.py)
        kokoro-onnx, default voice: af_sarah
        Publishes 48 kHz mono audio back to the LiveKit room
        Built-in aiohttp token server on :3000
```
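The ASR buffering step works out to 1.5 × 16 000 = 24 000 samples per transcription window. A minimal sketch of such a buffer (the class and method names are illustrative, not the actual `asr.py` code):

```python
# 16 kHz mono, ~1.5 s per transcription window -> 24 000 samples per flush.
SAMPLE_RATE = 16_000
WINDOW_SECONDS = 1.5
WINDOW_SAMPLES = int(SAMPLE_RATE * WINDOW_SECONDS)  # 24000

class AudioBuffer:
    """Accumulates mono samples and yields fixed windows for the ASR model."""

    def __init__(self) -> None:
        self._samples: list[float] = []

    def push(self, chunk: list[float]) -> list[list[float]]:
        # Append the incoming chunk, then emit every complete window.
        self._samples.extend(chunk)
        windows = []
        while len(self._samples) >= WINDOW_SAMPLES:
            windows.append(self._samples[:WINDOW_SAMPLES])
            self._samples = self._samples[WINDOW_SAMPLES:]
        return windows

buf = AudioBuffer()
ready = buf.push([0.0] * 30_000)  # one full window emitted, 6 000 samples held back
print(len(ready), len(ready[0]))  # 1 24000
```

Leftover samples stay in the buffer and are prepended to the next mic chunk, so no audio is dropped at window boundaries.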
---
## Project Structure
```
speech_to_video/
├── environment.yml          # Conda env export (no build strings, cross-platform)
├── README.md                # Quick-start guide
├── PROJECT.md               # This file — detailed reference
├── setup.md                 # Step-by-step first-time install
├── ISSUES_AND_PLAN.md       # Known issues and roadmap
├── pyproject.toml           # Python project config
│
├── docs/
│   ├── avatar_gen_README.md
│   └── avatar_gen_phase_2.md
│
├── backend/
│   ├── config.py            # Avatar page configuration (all tunable params)
│   ├── requirements.txt     # Pip dependencies (installed inside conda env)
│   │
│   ├── api/
│   │   ├── server.py        # FastAPI app — lifespan, endpoints, warmup logic
│   │   └── pipeline.py      # StreamingPipeline + SpeechToVideoPipeline orchestrators
│   │
│   ├── agent.py             # LiveKit Agent worker entry point (Voice page)
│   ├── agent/
│   │   ├── config.py        # Voice agent config (voices, model paths, SYSTEM_PROMPT)
│   │   ├── asr.py           # faster-whisper ASR wrapper
│   │   ├── llm.py           # llama-server HTTP client
│   │   └── tts.py           # Kokoro TTS for Voice agent
│   │
│   ├── tts/
│   │   └── kokoro_tts.py    # Kokoro TTS for Avatar page
│   │                        #   Patches int32→float32 speed-tensor bug (kokoro-onnx 0.5.x)
│   │                        #   Splits text at sentence boundaries for low first-chunk latency
│   │
│   ├── musetalk/
│   │   ├── worker.py        # MuseTalkWorker — async GPU wrapper (ThreadPoolExecutor)
│   │   ├── processor.py     # Core MuseTalk inference (VAE encode/decode, UNet forward)
│   │   ├── face.py          # Face detection / cropping
│   │   ├── data/            # Dataset helpers, audio processing
│   │   ├── models/          # UNet, VAE, SyncNet model definitions
│   │   └── utils/           # Audio processor, blending, preprocessing, DWPose, face parsing
│   │
│   ├── whisper/
│   │   └── audio2feature.py # Whisper encoder for MuseTalk audio features
│   │
│   ├── sync/
│   │   └── av_sync.py       # AVSyncGate + SimpleAVSync — PTS-based AV alignment
│   │
│   ├── publisher/
│   │   └── livekit_publisher.py  # AVPublisher + IdleFrameGenerator
│   │
│   ├── models/              # All model weights (not in git)
│   │   ├── kokoro/
│   │   │   ├── kokoro-v1.0.onnx
│   │   │   └── voices-v1.0.bin
│   │   ├── musetalkV15/
│   │   │   ├── musetalk.json
│   │   │   └── unet.pth
│   │   ├── sd-vae/
│   │   │   └── config.json
│   │   ├── whisper/
│   │   │   ├── config.json
│   │   │   └── preprocessor_config.json
│   │   ├── dwpose/
│   │   │   └── dw-ll_ucoco_384.pth
│   │   ├── face-parse-bisent/
│   │   │   ├── 79999_iter.pth
│   │   │   └── resnet18-5c106cde.pth
│   │   ├── syncnet/
│   │   │   └── latentsync_syncnet.pt
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   │
│   └── avatars/             # Pre-computed avatar assets (not in git)
│       ├── christine/       # avator_info.json, latents.pt, full_imgs/, mask/, vid_output/
│       ├── harry_1/
│       └── sophy/           # DEFAULT_AVATAR — used unless overridden by SPEECHX_AVATAR
│
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── App.tsx          # Avatar page (/)
        ├── main.tsx         # React Router setup
        ├── index.css        # Avatar page styles
        ├── voice.css        # Voice page styles
        └── pages/
            └── VoicePage.tsx # Voice Agent page (/voice)
```
---
## Conda Environment & Installation
### Restore environment
```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```
### Frontend
```bash
cd frontend
npm install
```
### First-time model setup
See [setup.md](setup.md) for the step-by-step guide to download and place all model weights.
---
## Running the Application
All four processes must run concurrently. Use four terminals, all with `conda activate avatar`.
### Terminal 1 — LiveKit server (shared)
```bash
docker run --rm -d \
--name livekit-server \
-p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
livekit/livekit-server:latest \
--dev --bind 0.0.0.0 --node-ip 127.0.0.1
# Stop: docker stop livekit-server
```
### Terminal 2 — llama-server (shared by both pages)
```bash
llama-server \
-m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
-c 2048 -ngl 32 --port 8080
```
> `llama-server` must be in PATH. Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases).
### Terminal 3 — Backend (choose one or both)
**Avatar page** (lip-sync + video):
```bash
conda activate avatar
cd backend
python api/server.py
# → http://localhost:8767
```
**Voice Agent page** (ASR → LLM → TTS audio only):
```bash
conda activate avatar
cd backend
python agent.py dev
# → LiveKit agent worker + token server on http://localhost:3000
```
> Both can run simultaneously in separate terminals.
### Terminal 4 — Frontend (Vite dev server)
```bash
cd frontend
npm run dev
# → http://localhost:5173
```
- `http://localhost:5173/` — Avatar lip-sync page
- `http://localhost:5173/voice` — Voice Agent page
---
## Configuration Reference
### `backend/config.py` — Avatar page
All values are overridable via environment variables.
| Variable | Default | Env var | Description |
|----------|---------|---------|-------------|
| `DEVICE` | `"cuda"` | `SPEECHX_DEVICE` | Inference device |
| `VIDEO_FPS` | `25` | — | Output frame rate |
| `VIDEO_WIDTH` | `720` | — | Base width (actual size is read from avatar frames at `/connect`) |
| `VIDEO_HEIGHT` | `1280` | — | Base height |
| `TTS_SAMPLE_RATE` | `24000` | — | Kokoro output sample rate |
| `LIVEKIT_AUDIO_SAMPLE_RATE` | `24000` | — | LiveKit audio rate (matches TTS — no resampling) |
| `CHUNK_DURATION` | `0.32` | — | 320 ms per TTS/video chunk (8 frames @ 25 fps) |
| `FRAMES_PER_CHUNK` | `8` | — | Computed: `CHUNK_DURATION × VIDEO_FPS` |
| `PRE_ROLL_FRAMES` | `1` | — | Minimal pre-roll for fast start |
| `MUSETALK_UNET_FP16` | `True` | — | fp16 inference for lower VRAM |
| `KOKORO_VOICE` | `"af_heart"` | `SPEECHX_VOICE` | TTS voice |
| `KOKORO_SPEED` | `1.0` | — | Speech speed multiplier |
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` | LiveKit server URL |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` | LiveKit API key |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` | LiveKit API secret |
| `LIVEKIT_ROOM_NAME` | `"speech-to-video-room"` | `LIVEKIT_ROOM_NAME` | Room name |
| `DEFAULT_AVATAR` | `"sophy"` | `SPEECHX_AVATAR` | Active avatar |
| `HOST` | `"0.0.0.0"` | `SPEECH_TO_VIDEO_HOST` | Bind address |
| `PORT` | `8767` | `SPEECH_TO_VIDEO_PORT` | Bind port |
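The env-var overrides in the table follow the usual environment-fallback pattern. A hedged sketch of how such a config module typically reads its values (the actual `config.py` may differ in detail):

```python
import os

# Each setting falls back to the default when its env var is unset.
DEVICE = os.environ.get("SPEECHX_DEVICE", "cuda")
DEFAULT_AVATAR = os.environ.get("SPEECHX_AVATAR", "sophy")
HOST = os.environ.get("SPEECH_TO_VIDEO_HOST", "0.0.0.0")
PORT = int(os.environ.get("SPEECH_TO_VIDEO_PORT", "8767"))  # env vars are strings
```

Overrides can then be applied per-process, e.g. `SPEECHX_AVATAR=harry_1 python api/server.py`.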
**Optional torch.compile for UNet:**
```bash
MUSETALK_TORCH_COMPILE=1 python api/server.py
# UNet JIT compiles in background after /connect; later requests benefit from Triton kernels
```
### `backend/agent/config.py` — Voice Agent page
| Variable | Default | Env var |
|----------|---------|---------|
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` |
| `LLAMA_SERVER_URL` | `"http://localhost:8080/v1"` | `LLAMA_SERVER_URL` |
| `DEFAULT_VOICE` | `"af_sarah"` | `DEFAULT_VOICE` |
| `ASR_MODEL_SIZE` | `"base"` | `ASR_MODEL_SIZE` |
---
## API Endpoints (Avatar page)
Base URL: `http://localhost:8767`
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Liveness probe — returns `models_loaded`, `pipeline_active` |
| `GET` | `/status` | Detailed status with VRAM usage |
| `POST` | `/connect` | Load models, join LiveKit room, run warmup passes |
| `POST` | `/disconnect` | Gracefully stop pipeline and disconnect from room |
| `POST` | `/speak` | Push text into the pipeline; body: `{text, voice?, speed?}` |
| `POST` | `/get-token` | Issue LiveKit JWT; body: `{room_name, identity}` |
| `GET` | `/livekit-token` | GET alias for `/get-token` |
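For context on what `/get-token` returns: a LiveKit access token is an HS256-signed JWT whose issuer is the API key, with a `video` grant naming the room. The backend presumably uses the LiveKit SDK's token helper; this stdlib-only sketch, using the dev credentials from the config tables, just shows the token's shape (claim names follow LiveKit's published JWT format, but treat the details as illustrative):

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_livekit_jwt(api_key: str, api_secret: str, room: str, identity: str) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "iss": api_key,                              # LiveKit API key
        "sub": identity,                             # participant identity
        "exp": int(time.time()) + 3600,              # 1 h validity
        "video": {"room": room, "roomJoin": True},   # room grant
    }
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"

token = make_livekit_jwt("devkey", "secret", "speech-to-video-room", "viewer-1")
print(token.count("."))  # 2 (header.payload.signature)
```

The frontend passes this token to `Room.connect()` along with `LIVEKIT_URL`.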
### `/connect` startup sequence
1. `load_musetalk_models(avatar_name, device)` — loads UNet, VAE, Whisper encoder, avatar assets
2. `KokoroTTS()` — initializes ONNX session
3. Create LiveKit `Room`, generate backend-agent JWT with publish permissions
4. Read actual frame dimensions from the pre-computed avatar's `frame_list[0]`
5. Instantiate `AVPublisher`, `MuseTalkWorker`, `StreamingPipeline`
6. `room.connect()` → `publisher.start()` → `pipeline.start()` (idle loop begins)
7. Warmup: Whisper + TTS synchronously; UNet eager pass (or background torch.compile JIT)
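Step 6's strict ordering (join room, then publisher, then pipeline) can be sketched with stub coroutines; the stubs are hypothetical stand-ins, not the real `Room`/`AVPublisher`/`StreamingPipeline` objects:

```python
import asyncio

order: list[str] = []  # records the call order for illustration

async def room_connect() -> None:
    order.append("room.connect")

async def publisher_start() -> None:
    order.append("publisher.start")

async def pipeline_start() -> None:
    order.append("pipeline.start")

async def connect() -> None:
    # Steps 1-5 (model loading, publisher construction) would run before this.
    await room_connect()     # join the LiveKit room first
    await publisher_start()  # tracks exist, idle frames start flowing
    await pipeline_start()   # pipeline can now accept /speak requests

asyncio.run(connect())
print(order)  # ['room.connect', 'publisher.start', 'pipeline.start']
```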
### `/speak` example
```bash
curl -X POST http://localhost:8767/speak \
-H "Content-Type: application/json" \
-d '{"text": "Hello, how are you today?"}'
# → {"status": "processing", "latency_ms": 12.4}
```
---
## Frontend
### Avatar page — `App.tsx`
- Calls `POST /connect` on button click → loads models, joins LiveKit room
- Fetches LiveKit token from `POST /get-token`, connects as a viewer
- Receives remote video + audio tracks published by the backend agent
- Sends typed text via `POST /speak`
- Chat log shows `user` / `assistant` / `system` messages with timestamps
- Static fallback avatar image `/Sophy.png` shown before first video frame arrives
### Voice Agent page — `VoicePage.tsx`
- Fetches LiveKit token from the agent's built-in server at `http://localhost:3000/get-token`
- Publishes local microphone (echo-cancelled, noise-suppressed) as an audio track
- Agent worker subscribes, runs ASR → LLM → TTS, publishes reply audio back into the room
- `DataReceived` events carry `{type: "user"|"assistant", text}` for the transcript display
- Latency shown as the time between user-transcript event and assistant-text event
---
## Avatars
Pre-computed assets in `backend/avatars/<name>/`:
| File/Dir | Description |
|----------|-------------|
| `avator_info.json` | Avatar bbox, landmark, and crop metadata |
| `latents.pt` | Pre-encoded VAE latent tensors for all idle frames |
| `full_imgs/` | Full-resolution source frames |
| `mask/` | Per-frame blending masks for composite output |
| `vid_output/` | Intermediate video output (optional) |
Available avatars: **sophy** (default), **harry_1**, **christine**.
Override at runtime:
```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```
To generate a new avatar from a source video, see [docs/avatar_gen_README.md](docs/avatar_gen_README.md).
---
## Model Weights
All weights live in `backend/models/` (not committed to git).
| Path | Used by | Notes |
|------|---------|-------|
| `kokoro/kokoro-v1.0.onnx` | Avatar TTS, Voice TTS | ONNX runtime inference |
| `kokoro/voices-v1.0.bin` | Avatar TTS, Voice TTS | All voice style embeddings |
| `musetalkV15/unet.pth` | MuseTalk UNet | fp16 inference |
| `musetalkV15/musetalk.json` | MuseTalk | Architecture config |
| `sd-vae/config.json` | MuseTalk VAE | Stable Diffusion VAE |
| `whisper/` | MuseTalk audio encoder | Whisper encoder weights + config |
| `dwpose/dw-ll_ucoco_384.pth` | DWPose | Used during avatar generation |
| `face-parse-bisent/` | Face parsing | BiSeNet; used during avatar generation |
| `syncnet/latentsync_syncnet.pt` | SyncNet | Training/evaluation only, not live inference |
| `Llama-3.2-3B-Instruct-Q4_K_M.gguf` | llama-server | Served by external `llama-server` process |
---
## Troubleshooting
### `ModuleNotFoundError` on startup
```bash
conda activate avatar # must be active
cd backend
python api/server.py # run from backend/, not repo root
```
### LiveKit connection errors
```bash
docker ps | grep livekit # is it running?
docker logs livekit-server # check for port conflicts or errors
# Keys must match in both config files:
# backend/config.py β†’ LIVEKIT_API_KEY / LIVEKIT_API_SECRET
# backend/agent/config.py β†’ same values
```
### Out of VRAM (Avatar page)
```python
# backend/config.py
CHUNK_DURATION = 0.16   # halve chunk → 4 frames instead of 8
# or
MUSETALK_UNET_FP16 = False # switch to fp32 if fp16 causes NaN
```
### `llama-server` not found
Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases), extract binary, add folder to `PATH`.
### Port already in use
```bash
lsof -i :8767 # Avatar backend
lsof -i :3000 # Voice agent token server
lsof -i :8080 # llama-server
lsof -i :7880 # LiveKit
```
### Kokoro ONNX `int32` speed-tensor error
Already patched in `backend/tts/kokoro_tts.py` via `_patched_create_audio` monkey-patch that forces `speed` to `float32`. No action needed.
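For reference, the patch follows the standard wrap-and-delegate monkey-patch pattern: intercept the library method and coerce the offending argument before calling the original. A self-contained toy version (the `FakeKokoro` class is a stand-in, not the kokoro-onnx API):

```python
class FakeKokoro:
    """Stand-in for the library class whose ONNX graph expects a float32 speed."""

    def create_audio(self, text: str, voice: str, speed) -> str:
        assert isinstance(speed, float), "ONNX graph expects a float32 speed"
        return f"{text}@{speed}"

# Keep a reference to the original method, then install a wrapper that
# coerces the argument and delegates.
_original_create_audio = FakeKokoro.create_audio

def _patched_create_audio(self, text, voice, speed):
    return _original_create_audio(self, text, voice, float(speed))

FakeKokoro.create_audio = _patched_create_audio

# An int speed no longer trips the assertion; it is coerced to float first.
print(FakeKokoro().create_audio("hi", "af_heart", 1))  # hi@1.0
```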
### Video and audio out of sync
`SimpleAVSync` uses explicit PTS. Verify constants are consistent:
```python
# FRAMES_PER_CHUNK must equal CHUNK_DURATION × VIDEO_FPS exactly
# Default: 0.32 × 25 = 8 ✓
```
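The invariant can be checked, and the resulting PTS grid illustrated, in a few lines. The exact formula inside `SimpleAVSync` is not shown here, so treat `frame_pts` as an illustration consistent with the constants rather than the production code:

```python
VIDEO_FPS = 25
CHUNK_DURATION = 0.32
FRAMES_PER_CHUNK = round(CHUNK_DURATION * VIDEO_FPS)
assert FRAMES_PER_CHUNK == 8  # 0.32 × 25 = 8, the invariant above

def frame_pts(chunk_index: int, frame_in_chunk: int) -> float:
    """Presentation timestamp (seconds) of a frame on the fixed chunk grid."""
    return (chunk_index * FRAMES_PER_CHUNK + frame_in_chunk) / VIDEO_FPS

print(frame_pts(1, 0))  # 0.32, the second chunk starts exactly one chunk later
```

If `FRAMES_PER_CHUNK` drifted from `CHUNK_DURATION × VIDEO_FPS`, each chunk would shift video PTS slightly against the audio clock, and the error would accumulate over a long utterance.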
---
## Performance Notes
### GPU memory (RTX 4060 8 GB)
| Component | Estimated VRAM |
|-----------|---------------|
| MuseTalk UNet (fp16) | ~3 GB |
| Whisper encoder | ~1.5 GB |
| SD-VAE | ~1 GB |
| Avatar latents + frame buffers | ~0.5 GB |
| **Total** | **~6 GB** |
### Latency targets
| Stage | Target |
|-------|--------|
| First TTS chunk ready | < 100 ms |
| First video chunk visible | < 200 ms after `/speak` |
| Per-chunk generation (8 frames @ 25 fps) | ~80–120 ms |
| End-to-end text → visible lip-sync | ~200–350 ms |
### Optimisations already applied
| Optimisation | Location | Effect |
|-------------|----------|--------|
| `torch.set_float32_matmul_precision('high')` | `server.py` | ~5 % speedup for free via TF32 on Ampere+ |
| `MUSETALK_UNET_FP16 = True` | `config.py` | Halves UNet memory bandwidth |
| `LIVEKIT_AUDIO_SAMPLE_RATE = 24000` | `config.py` | Eliminates TTS→LiveKit resampling step |
| Sentence-boundary TTS split | `kokoro_tts.py` | Lower latency on first synthesised chunk |
| Synchronous Whisper + TTS warmup at `/connect` | `server.py` | Primes ONNX thread pools before first request |
| UNet eager warmup pass at `/connect` | `server.py` | Primes CUDA kernels |
| Optional `torch.compile` UNet JIT | `server.py` | Further throughput gain (opt-in via env var) |
| `IdleFrameGenerator` | `livekit_publisher.py` | Keeps video track alive between speech turns |
| `torch._dynamo.config.suppress_errors = True` | `server.py` | Graceful fallback to eager if Triton JIT fails |
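The sentence-boundary TTS split in the table can be approximated with a single regex. This is a naive sketch; the real splitter in `kokoro_tts.py` may handle abbreviations, ellipses, and other edge cases differently:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., !, or ? followed by whitespace, keeping the punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello there. How are you today? Great!"))
# ['Hello there.', 'How are you today?', 'Great!']
```

Synthesising the first sentence alone, instead of the whole request, is what drives the first-chunk latency down.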
---
## Available Voices
| Voice ID | Style |
|----------|-------|
| `af_heart` | Female, emotional/expressive — Avatar page default |
| `af_sarah` | Female, clear and professional — Voice Agent default |
| `af_bella` | Female, warm and friendly |
| `am_michael` | Male, professional |
| `am_fen` | Male, deep |
| `bf_emma` | Female, British accent |
| `bm_george` | Male, British accent |
---
## Credits
- **MuseTalk v1.5** — [TMElyralab/MuseTalk](https://github.com/TMElyralab/MuseTalk)
- **Kokoro TTS** — [remsky/Kokoro-ONNX](https://github.com/remsky/Kokoro-ONNX)
- **LiveKit** — [livekit/livekit](https://github.com/livekit/livekit)
- **faster-whisper** — [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- **llama.cpp** — [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
- **Llama 3.2 3B** — Meta AI