
# Speech-X — Speech-to-Video Pipeline

Two interactive modes in one repo, sharing the same conda environment (`avatar`), model weights, and LiveKit server.

| Page | Mode | What it does |
| --- | --- | --- |
| / | Avatar | Text → Kokoro TTS → MuseTalk lip-sync → LiveKit video+audio |
| /voice | Voice Agent | Mic → faster-whisper ASR → Llama LLM → Kokoro TTS → LiveKit audio |

## Table of Contents

  1. Architecture Overview
  2. Project Structure
  3. Conda Environment & Installation
  4. Running the Application
  5. Configuration Reference
  6. API Endpoints (Avatar page)
  7. Frontend
  8. Avatars
  9. Model Weights
  10. Troubleshooting
  11. Performance Notes
  12. Available Voices
  13. Credits

## Architecture Overview

### Avatar Page (/) — MuseTalk lip-sync

```
Browser (React)
    │  POST /connect        → loads MuseTalk + Kokoro, joins LiveKit as publisher
    │  POST /speak {text}   → pushes text into the streaming pipeline
    │  POST /get-token      → returns a LiveKit JWT for the frontend viewer
    │
    ▼
FastAPI server  :8767  (backend/api/server.py)
    │
    ├── KokoroTTS  (backend/tts/kokoro_tts.py)
    │     kokoro-v1.0.onnx  →  24 kHz PCM audio, streamed in 320 ms chunks
    │     Text split at sentence boundaries for lower first-chunk latency
    │
    ├── MuseTalkWorker  (backend/musetalk/worker.py)
    │     Two-phase GPU inference:
    │       1. extract_features(audio)  →  Whisper encoder produces mel embeddings
    │       2. generate_batch(feats, start, n)  →  UNet generates n lip-synced frames
    │     ThreadPoolExecutor(max_workers=1) keeps GPU work off the async event loop
    │
    ├── AVPublisher  (backend/publisher/livekit_publisher.py)
    │     Streams video (actual avatar dimensions @ 25 fps) + audio (24 kHz) to LiveKit
    │     IdleFrameGenerator loops avatar frames while pipeline is idle
    │
    └── StreamingPipeline  (backend/api/pipeline.py)
          Coordinates TTS → MuseTalk → publisher
          SimpleAVSync aligns PTS across chunks
```
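The single-worker executor pattern used by MuseTalkWorker can be sketched as follows. This is an illustration only: the function bodies and names below are stand-ins, not the actual worker code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One worker thread serialises GPU calls and keeps blocking inference
# off the asyncio event loop, mirroring the MuseTalkWorker pattern.
_gpu_executor = ThreadPoolExecutor(max_workers=1)

def _generate_batch_blocking(feats, start, n):
    # Placeholder for the blocking UNet forward pass.
    return [f"frame_{start + i}" for i in range(n)]

async def generate_batch(feats, start, n):
    """Await the blocking call without stalling other coroutines."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _gpu_executor, _generate_batch_blocking, feats, start, n
    )

frames = asyncio.run(generate_batch(None, 0, 8))  # eight frames per 320 ms chunk
```

Because `max_workers=1`, concurrent `/speak` requests queue GPU work in order instead of contending for the device.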

### Voice Agent Page (/voice) — ASR → LLM → TTS

```
Browser (React)
    │  POST http://localhost:3000/get-token  →  LiveKit JWT from built-in token server
    │  Publishes local mic audio track to LiveKit room
    │
    ▼
LiveKit Agent worker  (backend/agent.py)
    │     livekit-agents framework, pre-warms all models at startup
    │
    ├── ASR  (backend/agent/asr.py)
    │     faster-whisper  (model size via ASR_MODEL_SIZE, default: base)
    │     Buffers ~1.5 s of 16 kHz mono audio before each transcription
    │
    ├── LLM  (backend/agent/llm.py)
    │     HTTP client → llama-server :8080
    │     Llama-3.2-3B-Instruct-Q4_K_M.gguf, keeps last 6 turns of history
    │
    └── TTS  (backend/agent/tts.py)
          kokoro-onnx, default voice: af_sarah
          Publishes 48 kHz mono audio back to the LiveKit room
          Built-in aiohttp token server on :3000
```
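The ~1.5 s ASR buffering step can be sketched like this. A simplified illustration, not the actual asr.py code: the class name and constants are hypothetical, and real code would use numpy arrays rather than lists.

```python
SAMPLE_RATE = 16_000
BUFFER_SECONDS = 1.5
BUFFER_SAMPLES = int(SAMPLE_RATE * BUFFER_SECONDS)  # 24 000 samples

class TranscriptionBuffer:
    """Accumulate 16 kHz mono sample frames; release one chunk per ~1.5 s."""

    def __init__(self):
        self._frames, self._count = [], 0

    def push(self, frame):
        self._frames.append(frame)
        self._count += len(frame)
        if self._count < BUFFER_SAMPLES:
            return None                                   # keep buffering
        chunk = [s for f in self._frames for s in f]      # flatten frames
        self._frames, self._count = [], 0
        return chunk                                      # hand this to the transcriber
```

Buffering trades a little latency for transcription quality: faster-whisper produces far better text on ~1.5 s windows than on 20 ms packets.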

## Project Structure

```
speech_to_video/
├── environment.yml          # Conda env export (no build strings, cross-platform)
├── README.md                # Quick-start guide
├── PROJECT.md               # This file — detailed reference
├── setup.md                 # Step-by-step first-time install
├── ISSUES_AND_PLAN.md       # Known issues and roadmap
├── pyproject.toml           # Python project config
│
├── docs/
│   ├── avatar_gen_README.md
│   └── avatar_gen_phase_2.md
│
├── backend/
│   ├── config.py            # Avatar page configuration (all tunable params)
│   ├── requirements.txt     # Pip dependencies (installed inside conda env)
│   │
│   ├── api/
│   │   ├── server.py        # FastAPI app — lifespan, endpoints, warmup logic
│   │   └── pipeline.py      # StreamingPipeline + SpeechToVideoPipeline orchestrators
│   │
│   ├── agent.py             # LiveKit Agent worker entry point (Voice page)
│   ├── agent/
│   │   ├── config.py        # Voice agent config (voices, model paths, SYSTEM_PROMPT)
│   │   ├── asr.py           # faster-whisper ASR wrapper
│   │   ├── llm.py           # llama-server HTTP client
│   │   └── tts.py           # Kokoro TTS for Voice agent
│   │
│   ├── tts/
│   │   └── kokoro_tts.py    # Kokoro TTS for Avatar page
│   │                          Patches int32→float32 speed-tensor bug (kokoro-onnx 0.5.x)
│   │                          Splits text at sentence boundaries for low first-chunk latency
│   │
│   ├── musetalk/
│   │   ├── worker.py        # MuseTalkWorker — async GPU wrapper (ThreadPoolExecutor)
│   │   ├── processor.py     # Core MuseTalk inference (VAE encode/decode, UNet forward)
│   │   ├── face.py          # Face detection / cropping
│   │   ├── data/            # Dataset helpers, audio processing
│   │   ├── models/          # UNet, VAE, SyncNet model definitions
│   │   └── utils/           # Audio processor, blending, preprocessing, DWPose, face parsing
│   │
│   ├── whisper/
│   │   └── audio2feature.py # Whisper encoder for MuseTalk audio features
│   │
│   ├── sync/
│   │   └── av_sync.py       # AVSyncGate + SimpleAVSync — PTS-based AV alignment
│   │
│   ├── publisher/
│   │   └── livekit_publisher.py  # AVPublisher + IdleFrameGenerator
│   │
│   ├── models/              # All model weights (not in git)
│   │   ├── kokoro/
│   │   │   ├── kokoro-v1.0.onnx
│   │   │   └── voices-v1.0.bin
│   │   ├── musetalkV15/
│   │   │   ├── musetalk.json
│   │   │   └── unet.pth
│   │   ├── sd-vae/
│   │   │   └── config.json
│   │   ├── whisper/
│   │   │   ├── config.json
│   │   │   └── preprocessor_config.json
│   │   ├── dwpose/
│   │   │   └── dw-ll_ucoco_384.pth
│   │   ├── face-parse-bisent/
│   │   │   ├── 79999_iter.pth
│   │   │   └── resnet18-5c106cde.pth
│   │   ├── syncnet/
│   │   │   └── latentsync_syncnet.pt
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   │
│   └── avatars/             # Pre-computed avatar assets (not in git)
│       ├── christine/       # avator_info.json, latents.pt, full_imgs/, mask/, vid_output/
│       ├── harry_1/
│       └── sophy/           # DEFAULT_AVATAR — used unless overridden by SPEECHX_AVATAR
│
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── App.tsx          # Avatar page (/)
        ├── main.tsx         # React Router setup
        ├── index.css        # Avatar page styles
        ├── voice.css        # Voice page styles
        └── pages/
            └── VoicePage.tsx  # Voice Agent page (/voice)
```

## Conda Environment & Installation

### Restore environment

```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```

### Frontend

```bash
cd frontend
npm install
```

### First-time model setup

See setup.md for the step-by-step guide to download and place all model weights.


## Running the Application

All four processes must run concurrently. Use four terminals, all with `conda activate avatar`.

### Terminal 1 — LiveKit server (shared)

```bash
docker run --rm -d \
  --name livekit-server \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server:latest \
  --dev --bind 0.0.0.0 --node-ip 127.0.0.1

# Stop: docker stop livekit-server
```

### Terminal 2 — llama-server (shared by both pages)

```bash
llama-server \
  -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -c 2048 -ngl 32 --port 8080
```

llama-server must be in PATH. Download from llama.cpp releases.

### Terminal 3 — Backend (choose one or both)

Avatar page (lip-sync + video):

```bash
conda activate avatar
cd backend
python api/server.py
# → http://localhost:8767
```

Voice Agent page (ASR → LLM → TTS audio only):

```bash
conda activate avatar
cd backend
python agent.py dev
# → LiveKit agent worker + token server on http://localhost:3000
```

Both can run simultaneously in separate terminals.

### Terminal 4 — Frontend (Vite dev server)

```bash
cd frontend
npm run dev
# → http://localhost:5173
```

  • http://localhost:5173/ — Avatar lip-sync page
  • http://localhost:5173/voice — Voice Agent page

## Configuration Reference

### backend/config.py — Avatar page

Values are set in backend/config.py; those with an env var listed can be overridden at runtime.

| Variable | Default | Env var | Description |
| --- | --- | --- | --- |
| DEVICE | "cuda" | SPEECHX_DEVICE | Inference device |
| VIDEO_FPS | 25 | — | Output frame rate |
| VIDEO_WIDTH | 720 | — | Base width (actual read from avatar frames at /connect) |
| VIDEO_HEIGHT | 1280 | — | Base height |
| TTS_SAMPLE_RATE | 24000 | — | Kokoro output sample rate |
| LIVEKIT_AUDIO_SAMPLE_RATE | 24000 | — | LiveKit audio rate (matches TTS — no resampling) |
| CHUNK_DURATION | 0.32 | — | 320 ms per TTS/video chunk (8 frames @ 25 fps) |
| FRAMES_PER_CHUNK | 8 | — | Computed: CHUNK_DURATION × VIDEO_FPS |
| PRE_ROLL_FRAMES | 1 | — | Minimal pre-roll for fast start |
| MUSETALK_UNET_FP16 | True | — | fp16 inference for lower VRAM |
| KOKORO_VOICE | "af_heart" | SPEECHX_VOICE | TTS voice |
| KOKORO_SPEED | 1.0 | — | Speech speed multiplier |
| LIVEKIT_URL | "ws://localhost:7880" | LIVEKIT_URL | LiveKit server URL |
| LIVEKIT_API_KEY | "devkey" | LIVEKIT_API_KEY | LiveKit API key |
| LIVEKIT_API_SECRET | "secret" | LIVEKIT_API_SECRET | LiveKit API secret |
| LIVEKIT_ROOM_NAME | "speech-to-video-room" | LIVEKIT_ROOM_NAME | Room name |
| DEFAULT_AVATAR | "sophy" | SPEECHX_AVATAR | Active avatar |
| HOST | "0.0.0.0" | SPEECH_TO_VIDEO_HOST | Bind address |
| PORT | 8767 | SPEECH_TO_VIDEO_PORT | Bind port |
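The env-var override pattern in the table can be illustrated with a small helper. The `from_env` name is hypothetical; config.py may implement the lookup differently.

```python
import os

def from_env(name, default, cast=str):
    """Return the env var's value (cast to the right type) if set, else the default."""
    raw = os.environ.get(name)
    return cast(raw) if raw is not None else default

DEVICE = from_env("SPEECHX_DEVICE", "cuda")
PORT = from_env("SPEECH_TO_VIDEO_PORT", 8767, int)

VIDEO_FPS = 25
CHUNK_DURATION = 0.32
FRAMES_PER_CHUNK = round(CHUNK_DURATION * VIDEO_FPS)  # 8 frames per 320 ms chunk
```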

Optional torch.compile for UNet:

```bash
MUSETALK_TORCH_COMPILE=1 python api/server.py
# UNet JIT compiles in background after /connect; later requests benefit from Triton kernels
```

### backend/agent/config.py — Voice Agent page

| Variable | Default | Env var |
| --- | --- | --- |
| LIVEKIT_URL | "ws://localhost:7880" | LIVEKIT_URL |
| LIVEKIT_API_KEY | "devkey" | LIVEKIT_API_KEY |
| LIVEKIT_API_SECRET | "secret" | LIVEKIT_API_SECRET |
| LLAMA_SERVER_URL | "http://localhost:8080/v1" | LLAMA_SERVER_URL |
| DEFAULT_VOICE | "af_sarah" | DEFAULT_VOICE |
| ASR_MODEL_SIZE | "base" | ASR_MODEL_SIZE |

## API Endpoints (Avatar page)

Base URL: http://localhost:8767

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Liveness probe — returns models_loaded, pipeline_active |
| GET | /status | Detailed status with VRAM usage |
| POST | /connect | Load models, join LiveKit room, run warmup passes |
| POST | /disconnect | Gracefully stop pipeline and disconnect from room |
| POST | /speak | Push text into the pipeline; body: {text, voice?, speed?} |
| POST | /get-token | Issue LiveKit JWT; body: {room_name, identity} |
| GET | /livekit-token | GET alias for /get-token |
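A minimal Python client for these endpoints might look like the sketch below. The helper names (`speak_payload`, `post`) are illustrative, and the commented-out calls require the backend to be running on :8767.

```python
import json
import urllib.request

BASE = "http://localhost:8767"

def speak_payload(text, voice=None, speed=None):
    """Build the /speak body; optional fields are omitted when unset."""
    body = {"text": text}
    if voice is not None:
        body["voice"] = voice
    if speed is not None:
        body["speed"] = speed
    return body

def post(path, body):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# post("/connect", {})                      # load models, join the room
# post("/speak", speak_payload("Hello!"))   # push text into the pipeline
```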

### /connect startup sequence

  1. load_musetalk_models(avatar_name, device) — loads UNet, VAE, Whisper encoder, avatar assets
  2. KokoroTTS() — initializes ONNX session
  3. Create LiveKit Room, generate backend-agent JWT with publish permissions
  4. Read actual frame dimensions from the pre-computed avatar's frame_list[0]
  5. Instantiate AVPublisher, MuseTalkWorker, StreamingPipeline
  6. room.connect() → publisher.start() → pipeline.start() (idle loop begins)
  7. Warmup: Whisper + TTS synchronously; UNet eager pass (or background torch.compile JIT)

### /speak example

```bash
curl -X POST http://localhost:8767/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you today?"}'
# → {"status": "processing", "latency_ms": 12.4}
```

## Frontend

### Avatar page — App.tsx

  • Calls POST /connect on button click → loads models, joins LiveKit room
  • Fetches LiveKit token from POST /get-token, connects as a viewer
  • Receives remote video + audio tracks published by the backend agent
  • Sends typed text via POST /speak
  • Chat log shows user / assistant / system messages with timestamps
  • Static fallback avatar image /Sophy.png shown before first video frame arrives

### Voice Agent page — VoicePage.tsx

  • Fetches LiveKit token from the agent's built-in server at http://localhost:3000/get-token
  • Publishes local microphone (echo-cancelled, noise-suppressed) as an audio track
  • Agent worker subscribes, runs ASR → LLM → TTS, publishes reply audio back into the room
  • DataReceived events carry {type: "user"|"assistant", text} for the transcript display
  • Latency shown as the time between user-transcript event and assistant-text event

## Avatars

Pre-computed assets in backend/avatars/<name>/:

| File/Dir | Description |
| --- | --- |
| avator_info.json | Avatar bbox, landmark, and crop metadata |
| latents.pt | Pre-encoded VAE latent tensors for all idle frames |
| full_imgs/ | Full-resolution source frames |
| mask/ | Per-frame blending masks for composite output |
| vid_output/ | Intermediate video output (optional) |

Available avatars: sophy (default), harry_1, christine.

Override at runtime:

```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```

To generate a new avatar from a source video, see docs/avatar_gen_README.md.


## Model Weights

All weights live in backend/models/ (not committed to git).

| Path | Used by | Notes |
| --- | --- | --- |
| kokoro/kokoro-v1.0.onnx | Avatar TTS, Voice TTS | ONNX runtime inference |
| kokoro/voices-v1.0.bin | Avatar TTS, Voice TTS | All voice style embeddings |
| musetalkV15/unet.pth | MuseTalk UNet | fp16 inference |
| musetalkV15/musetalk.json | MuseTalk | Architecture config |
| sd-vae/config.json | MuseTalk VAE | Stable Diffusion VAE |
| whisper/ | MuseTalk audio encoder | Whisper encoder weights + config |
| dwpose/dw-ll_ucoco_384.pth | DWPose | Used during avatar generation |
| face-parse-bisent/ | Face parsing | BiSeNet; used during avatar generation |
| syncnet/latentsync_syncnet.pt | SyncNet | Training/evaluation only, not live inference |
| Llama-3.2-3B-Instruct-Q4_K_M.gguf | llama-server | Served by external llama-server process |

## Troubleshooting

### ModuleNotFoundError on startup

```bash
conda activate avatar   # must be active
cd backend
python api/server.py    # run from backend/, not repo root
```

### LiveKit connection errors

```bash
docker ps | grep livekit          # is it running?
docker logs livekit-server        # check for port conflicts or errors

# Keys must match in both config files:
# backend/config.py        →  LIVEKIT_API_KEY / LIVEKIT_API_SECRET
# backend/agent/config.py  →  same values
```

### Out of VRAM (Avatar page)

```python
# backend/config.py
CHUNK_DURATION = 0.16   # halve chunk → 4 frames instead of 8
# or
MUSETALK_UNET_FP16 = False  # switch to fp32 if fp16 causes NaN
```

### llama-server not found

Download from llama.cpp releases, extract the binary, and add its folder to PATH.

### Port already in use

```bash
lsof -i :8767    # Avatar backend
lsof -i :3000    # Voice agent token server
lsof -i :8080    # llama-server
lsof -i :7880    # LiveKit
```

### Kokoro ONNX int32 speed-tensor error

Already patched in backend/tts/kokoro_tts.py via a _patched_create_audio monkey-patch that forces speed to float32. No action needed.

### Video and audio out of sync

SimpleAVSync uses explicit PTS. Verify the constants are consistent: FRAMES_PER_CHUNK must equal CHUNK_DURATION × VIDEO_FPS exactly (default: 0.32 × 25 = 8 ✓).
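That invariant can be checked in a few lines. A sketch only: the real PTS logic lives in backend/sync/av_sync.py, and `frame_pts` here is a hypothetical helper.

```python
CHUNK_DURATION = 0.32   # seconds of audio/video per chunk
VIDEO_FPS = 25

FRAMES_PER_CHUNK = round(CHUNK_DURATION * VIDEO_FPS)
# AV drift accumulates per chunk if these two disagree:
assert abs(CHUNK_DURATION * VIDEO_FPS - FRAMES_PER_CHUNK) < 1e-9

def frame_pts(chunk_index, frame_index):
    """Presentation timestamp (seconds) of a frame within the stream."""
    return chunk_index * CHUNK_DURATION + frame_index / VIDEO_FPS
```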

## Performance Notes

### GPU memory (RTX 4060 8 GB)

| Component | Estimated VRAM |
| --- | --- |
| MuseTalk UNet (fp16) | ~3 GB |
| Whisper encoder | ~1.5 GB |
| SD-VAE | ~1 GB |
| Avatar latents + frame buffers | ~0.5 GB |
| Total | ~6 GB |

### Latency targets

| Stage | Target |
| --- | --- |
| First TTS chunk ready | < 100 ms |
| First video chunk visible | < 200 ms after /speak |
| Per-chunk generation (8 frames @ 25 fps) | ~80–120 ms |
| End-to-end text → visible lip-sync | ~200–350 ms |

### Optimisations already applied

| Optimisation | Location | Effect |
| --- | --- | --- |
| torch.set_float32_matmul_precision('high') | server.py | ~5 % speedup from TF32 on Ampere+ |
| MUSETALK_UNET_FP16 = True | config.py | Halves UNet memory bandwidth |
| LIVEKIT_AUDIO_SAMPLE_RATE = 24000 | config.py | Eliminates TTS → LiveKit resampling step |
| Sentence-boundary TTS split | kokoro_tts.py | Lower latency on first synthesised chunk |
| Synchronous Whisper + TTS warmup at /connect | server.py | Primes ONNX thread pools before first request |
| UNet eager warmup pass at /connect | server.py | Primes CUDA kernels |
| Optional torch.compile UNet JIT | server.py | Further throughput gain (opt-in via env var) |
| IdleFrameGenerator | livekit_publisher.py | Keeps video track alive between speech turns |
| torch._dynamo.config.suppress_errors = True | server.py | Graceful fallback to eager if Triton JIT fails |

## Available Voices

| Voice ID | Style |
| --- | --- |
| af_heart | Female, emotional/expressive — Avatar page default |
| af_sarah | Female, clear and professional — Voice Agent default |
| af_bella | Female, warm and friendly |
| am_michael | Male, professional |
| am_fen | Male, deep |
| bf_emma | Female, British accent |
| bm_george | Male, British accent |

## Credits