# Speech-X – Speech-to-Video Pipeline

Two interactive modes in one repo, sharing the same conda environment (`avatar`), model weights, and LiveKit server.

| Page | Mode | What it does |
|---|---|---|
| `/` | Avatar | Text → Kokoro TTS → MuseTalk lip-sync → LiveKit video+audio |
| `/voice` | Voice Agent | Mic → faster-whisper ASR → Llama LLM → Kokoro TTS → LiveKit audio |
## Table of Contents
- Architecture Overview
- Project Structure
- Conda Environment & Installation
- Running the Application
- Configuration Reference
- API Endpoints (Avatar page)
- Frontend
- Avatars
- Model Weights
- Troubleshooting
- Performance Notes
## Architecture Overview
### Avatar Page (`/`) – MuseTalk lip-sync

```
Browser (React)
  │ POST /connect      → loads MuseTalk + Kokoro, joins LiveKit as publisher
  │ POST /speak {text} → pushes text into the streaming pipeline
  │ POST /get-token    → returns a LiveKit JWT for the frontend viewer
  ▼
FastAPI server :8767 (backend/api/server.py)
  │
  ├── KokoroTTS (backend/tts/kokoro_tts.py)
  │     kokoro-v1.0.onnx → 24 kHz PCM audio, streamed in 320 ms chunks
  │     Text split at sentence boundaries for lower first-chunk latency
  │
  ├── MuseTalkWorker (backend/musetalk/worker.py)
  │     Two-phase GPU inference:
  │       1. extract_features(audio)         → Whisper encoder produces mel embeddings
  │       2. generate_batch(feats, start, n) → UNet generates n lip-synced frames
  │     ThreadPoolExecutor(max_workers=1) keeps GPU work off the async event loop
  │
  ├── AVPublisher (backend/publisher/livekit_publisher.py)
  │     Streams video (actual avatar dimensions @ 25 fps) + audio (24 kHz) to LiveKit
  │     IdleFrameGenerator loops avatar frames while pipeline is idle
  │
  └── StreamingPipeline (backend/api/pipeline.py)
        Coordinates TTS → MuseTalk → publisher
        SimpleAVSync aligns PTS across chunks
```
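The sentence-boundary split that `KokoroTTS` applies before synthesis can be sketched like this (a minimal stdlib-only sketch; the actual splitter in `backend/tts/kokoro_tts.py` may differ):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split text at sentence-ending punctuation so TTS can start on the
    first sentence while later ones are still queued."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Synthesising the first short sentence alone is what keeps first-chunk latency low; the remaining sentences stream behind it.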
### Voice Agent Page (`/voice`) – ASR → LLM → TTS

```
Browser (React)
  │ POST http://localhost:3000/get-token → LiveKit JWT from built-in token server
  │ Publishes local mic audio track to LiveKit room
  ▼
LiveKit Agent worker (backend/agent.py)
  │  livekit-agents framework, pre-warms all models at startup
  │
  ├── ASR (backend/agent/asr.py)
  │     faster-whisper (model size via ASR_MODEL_SIZE, default: base)
  │     Buffers ~1.5 s of 16 kHz mono audio before each transcription
  │
  ├── LLM (backend/agent/llm.py)
  │     HTTP client → llama-server :8080
  │     Llama-3.2-3B-Instruct-Q4_K_M.gguf, keeps last 6 turns of history
  │
  └── TTS (backend/agent/tts.py)
        kokoro-onnx, default voice: af_sarah
        Publishes 48 kHz mono audio back to the LiveKit room

Built-in aiohttp token server on :3000
```
## Project Structure
```
speech_to_video/
├── environment.yml        # Conda env export (no build strings, cross-platform)
├── README.md              # Quick-start guide
├── PROJECT.md             # This file – detailed reference
├── setup.md               # Step-by-step first-time install
├── ISSUES_AND_PLAN.md     # Known issues and roadmap
├── pyproject.toml         # Python project config
│
├── docs/
│   ├── avatar_gen_README.md
│   └── avatar_gen_phase_2.md
│
├── backend/
│   ├── config.py          # Avatar page configuration (all tunable params)
│   ├── requirements.txt   # Pip dependencies (installed inside conda env)
│   │
│   ├── api/
│   │   ├── server.py      # FastAPI app – lifespan, endpoints, warmup logic
│   │   └── pipeline.py    # StreamingPipeline + SpeechToVideoPipeline orchestrators
│   │
│   ├── agent.py           # LiveKit Agent worker entry point (Voice page)
│   ├── agent/
│   │   ├── config.py      # Voice agent config (voices, model paths, SYSTEM_PROMPT)
│   │   ├── asr.py         # faster-whisper ASR wrapper
│   │   ├── llm.py         # llama-server HTTP client
│   │   └── tts.py         # Kokoro TTS for Voice agent
│   │
│   ├── tts/
│   │   └── kokoro_tts.py  # Kokoro TTS for Avatar page
│   │                      #   Patches int32→float32 speed-tensor bug (kokoro-onnx 0.5.x)
│   │                      #   Splits text at sentence boundaries for low first-chunk latency
│   │
│   ├── musetalk/
│   │   ├── worker.py      # MuseTalkWorker – async GPU wrapper (ThreadPoolExecutor)
│   │   ├── processor.py   # Core MuseTalk inference (VAE encode/decode, UNet forward)
│   │   ├── face.py        # Face detection / cropping
│   │   ├── data/          # Dataset helpers, audio processing
│   │   ├── models/        # UNet, VAE, SyncNet model definitions
│   │   └── utils/         # Audio processor, blending, preprocessing, DWPose, face parsing
│   │
│   ├── whisper/
│   │   └── audio2feature.py  # Whisper encoder for MuseTalk audio features
│   │
│   ├── sync/
│   │   └── av_sync.py     # AVSyncGate + SimpleAVSync – PTS-based AV alignment
│   │
│   ├── publisher/
│   │   └── livekit_publisher.py  # AVPublisher + IdleFrameGenerator
│   │
│   ├── models/            # All model weights (not in git)
│   │   ├── kokoro/
│   │   │   ├── kokoro-v1.0.onnx
│   │   │   └── voices-v1.0.bin
│   │   ├── musetalkV15/
│   │   │   ├── musetalk.json
│   │   │   └── unet.pth
│   │   ├── sd-vae/
│   │   │   └── config.json
│   │   ├── whisper/
│   │   │   ├── config.json
│   │   │   └── preprocessor_config.json
│   │   ├── dwpose/
│   │   │   └── dw-ll_ucoco_384.pth
│   │   ├── face-parse-bisent/
│   │   │   ├── 79999_iter.pth
│   │   │   └── resnet18-5c106cde.pth
│   │   ├── syncnet/
│   │   │   └── latentsync_syncnet.pt
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   │
│   └── avatars/           # Pre-computed avatar assets (not in git)
│       ├── christine/     # avator_info.json, latents.pt, full_imgs/, mask/, vid_output/
│       ├── harry_1/
│       └── sophy/         # DEFAULT_AVATAR – used unless overridden by SPEECHX_AVATAR
│
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── App.tsx        # Avatar page (/)
        ├── main.tsx       # React Router setup
        ├── index.css      # Avatar page styles
        ├── voice.css      # Voice page styles
        └── pages/
            └── VoicePage.tsx  # Voice Agent page (/voice)
```
## Conda Environment & Installation

### Restore environment

```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```

### Frontend

```bash
cd frontend
npm install
```

### First-time model setup

See setup.md for the step-by-step guide to download and place all model weights.
## Running the Application

All four processes must run concurrently. Use four terminals, each with `conda activate avatar`.
### Terminal 1 – LiveKit server (shared)

```bash
docker run --rm -d \
  --name livekit-server \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server:latest \
  --dev --bind 0.0.0.0 --node-ip 127.0.0.1
# Stop: docker stop livekit-server
```
### Terminal 2 – llama-server (shared by both pages)

```bash
llama-server \
  -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -c 2048 -ngl 32 --port 8080
```

`llama-server` must be in PATH. Download from llama.cpp releases.
### Terminal 3 – Backend (choose one or both)

Avatar page (lip-sync + video):

```bash
conda activate avatar
cd backend
python api/server.py
# → http://localhost:8767
```

Voice Agent page (ASR → LLM → TTS, audio only):

```bash
conda activate avatar
cd backend
python agent.py dev
# → LiveKit agent worker + token server on http://localhost:3000
```

Both can run simultaneously in separate terminals.
### Terminal 4 – Frontend (Vite dev server)

```bash
cd frontend
npm run dev
# → http://localhost:5173
```

- http://localhost:5173/ → Avatar lip-sync page
- http://localhost:5173/voice → Voice Agent page
## Configuration Reference

### backend/config.py – Avatar page

Values with an env var listed below can be overridden via environment variables; the rest are constants edited in place.
| Variable | Default | Env var | Description |
|---|---|---|---|
| `DEVICE` | `"cuda"` | `SPEECHX_DEVICE` | Inference device |
| `VIDEO_FPS` | `25` | – | Output frame rate |
| `VIDEO_WIDTH` | `720` | – | Base width (actual read from avatar frames at /connect) |
| `VIDEO_HEIGHT` | `1280` | – | Base height |
| `TTS_SAMPLE_RATE` | `24000` | – | Kokoro output sample rate |
| `LIVEKIT_AUDIO_SAMPLE_RATE` | `24000` | – | LiveKit audio rate (matches TTS → no resampling) |
| `CHUNK_DURATION` | `0.32` | – | 320 ms per TTS/video chunk (8 frames @ 25 fps) |
| `FRAMES_PER_CHUNK` | `8` | – | Computed: CHUNK_DURATION × VIDEO_FPS |
| `PRE_ROLL_FRAMES` | `1` | – | Minimal pre-roll for fast start |
| `MUSETALK_UNET_FP16` | `True` | – | fp16 inference for lower VRAM |
| `KOKORO_VOICE` | `"af_heart"` | `SPEECHX_VOICE` | TTS voice |
| `KOKORO_SPEED` | `1.0` | – | Speech speed multiplier |
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` | LiveKit server URL |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` | LiveKit API key |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` | LiveKit API secret |
| `LIVEKIT_ROOM_NAME` | `"speech-to-video-room"` | `LIVEKIT_ROOM_NAME` | Room name |
| `DEFAULT_AVATAR` | `"sophy"` | `SPEECHX_AVATAR` | Active avatar |
| `HOST` | `"0.0.0.0"` | `SPEECH_TO_VIDEO_HOST` | Bind address |
| `PORT` | `8767` | `SPEECH_TO_VIDEO_PORT` | Bind port |
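The derived constants above are plain arithmetic; a sketch of the assumed shape of this part of `backend/config.py` (not its verbatim contents):

```python
import os

# Env-overridable value, per the table above; the rest are plain constants.
DEVICE = os.environ.get("SPEECHX_DEVICE", "cuda")

VIDEO_FPS = 25
CHUNK_DURATION = 0.32                                 # 320 ms per chunk
FRAMES_PER_CHUNK = round(CHUNK_DURATION * VIDEO_FPS)  # 0.32 × 25 = 8

TTS_SAMPLE_RATE = 24_000
SAMPLES_PER_CHUNK = round(CHUNK_DURATION * TTS_SAMPLE_RATE)  # audio samples per chunk
```

Keeping `FRAMES_PER_CHUNK` derived from `CHUNK_DURATION` rather than hard-coded is what lets the VRAM workaround below (halving `CHUNK_DURATION`) stay consistent.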
Optional `torch.compile` for the UNet:

```bash
MUSETALK_TORCH_COMPILE=1 python api/server.py
# UNet JIT-compiles in the background after /connect; later requests benefit from Triton kernels
```
### backend/agent/config.py – Voice Agent page

| Variable | Default | Env var |
|---|---|---|
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` |
| `LLAMA_SERVER_URL` | `"http://localhost:8080/v1"` | `LLAMA_SERVER_URL` |
| `DEFAULT_VOICE` | `"af_sarah"` | `DEFAULT_VOICE` |
| `ASR_MODEL_SIZE` | `"base"` | `ASR_MODEL_SIZE` |
## API Endpoints (Avatar page)

Base URL: `http://localhost:8767`
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness probe – returns `models_loaded`, `pipeline_active` |
| GET | `/status` | Detailed status with VRAM usage |
| POST | `/connect` | Load models, join LiveKit room, run warmup passes |
| POST | `/disconnect` | Gracefully stop pipeline and disconnect from room |
| POST | `/speak` | Push text into the pipeline; body: `{text, voice?, speed?}` |
| POST | `/get-token` | Issue LiveKit JWT; body: `{room_name, identity}` |
| GET | `/livekit-token` | GET alias for `/get-token` |
### /connect startup sequence

1. `load_musetalk_models(avatar_name, device)` – loads UNet, VAE, Whisper encoder, avatar assets
2. `KokoroTTS()` – initializes the ONNX session
3. Create LiveKit `Room`, generate a backend-agent JWT with publish permissions
4. Read actual frame dimensions from the pre-computed avatar's `frame_list[0]`
5. Instantiate `AVPublisher`, `MuseTalkWorker`, `StreamingPipeline`
6. `room.connect()` → `publisher.start()` → `pipeline.start()` (idle loop begins)
7. Warmup: Whisper + TTS synchronously; UNet eager pass (or background torch.compile JIT)
### /speak example

```bash
curl -X POST http://localhost:8767/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you today?"}'
# → {"status": "processing", "latency_ms": 12.4}
```
## Frontend
### Avatar page – App.tsx

- Calls `POST /connect` on button click – loads models, joins LiveKit room
- Fetches a LiveKit token from `POST /get-token`, connects as a viewer
- Receives remote video + audio tracks published by the backend agent
- Sends typed text via `POST /speak`
- Chat log shows `user`/`assistant`/`system` messages with timestamps
- Static fallback avatar image `/Sophy.png` shown before the first video frame arrives
### Voice Agent page – VoicePage.tsx

- Fetches a LiveKit token from the agent's built-in server at `http://localhost:3000/get-token`
- Publishes the local microphone (echo-cancelled, noise-suppressed) as an audio track
- Agent worker subscribes, runs ASR → LLM → TTS, publishes reply audio back into the room
- `DataReceived` events carry `{type: "user"|"assistant", text}` for the transcript display
- Latency shown as the time between the user-transcript event and the assistant-text event
## Avatars

Pre-computed assets in `backend/avatars/<name>/`:
| File/Dir | Description |
|---|---|
| `avator_info.json` | Avatar bbox, landmark, and crop metadata |
| `latents.pt` | Pre-encoded VAE latent tensors for all idle frames |
| `full_imgs/` | Full-resolution source frames |
| `mask/` | Per-frame blending masks for composite output |
| `vid_output/` | Intermediate video output (optional) |
Available avatars: `sophy` (default), `harry_1`, `christine`.

Override at runtime:

```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```

To generate a new avatar from a source video, see docs/avatar_gen_README.md.
## Model Weights

All weights live in `backend/models/` (not committed to git).
| Path | Used by | Notes |
|---|---|---|
| `kokoro/kokoro-v1.0.onnx` | Avatar TTS, Voice TTS | ONNX Runtime inference |
| `kokoro/voices-v1.0.bin` | Avatar TTS, Voice TTS | All voice style embeddings |
| `musetalkV15/unet.pth` | MuseTalk UNet | fp16 inference |
| `musetalkV15/musetalk.json` | MuseTalk | Architecture config |
| `sd-vae/config.json` | MuseTalk VAE | Stable Diffusion VAE |
| `whisper/` | MuseTalk audio encoder | Whisper encoder weights + config |
| `dwpose/dw-ll_ucoco_384.pth` | DWPose | Used during avatar generation |
| `face-parse-bisent/` | Face parsing | BiSeNet; used during avatar generation |
| `syncnet/latentsync_syncnet.pt` | SyncNet | Training/evaluation only, not live inference |
| `Llama-3.2-3B-Instruct-Q4_K_M.gguf` | llama-server | Served by external llama-server process |
## Troubleshooting
### ModuleNotFoundError on startup

```bash
conda activate avatar   # must be active
cd backend
python api/server.py    # run from backend/, not repo root
```
### LiveKit connection errors

```bash
docker ps | grep livekit     # is it running?
docker logs livekit-server   # check for port conflicts or errors
# Keys must match in both config files:
#   backend/config.py       → LIVEKIT_API_KEY / LIVEKIT_API_SECRET
#   backend/agent/config.py → same values
```
### Out of VRAM (Avatar page)

```python
# backend/config.py
CHUNK_DURATION = 0.16        # halve chunk → 4 frames instead of 8
# or
MUSETALK_UNET_FP16 = False   # switch to fp32 if fp16 causes NaNs
```
### llama-server not found

Download from llama.cpp releases, extract the binary, and add its folder to PATH.
### Port already in use

```bash
lsof -i :8767   # Avatar backend
lsof -i :3000   # Voice agent token server
lsof -i :8080   # llama-server
lsof -i :7880   # LiveKit
```
### Kokoro ONNX int32 speed-tensor error

Already patched in `backend/tts/kokoro_tts.py` via a `_patched_create_audio` monkey-patch that forces `speed` to float32. No action needed.
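The patch follows the usual wrap-and-coerce pattern. A stdlib-only sketch of that pattern (the real patch targets kokoro-onnx 0.5.x and coerces to numpy float32; the function names and signature below are illustrative, not the library's actual API):

```python
def patch_speed_dtype(original_create_audio):
    """Wrap an audio-creation function so `speed` always arrives as a float."""
    def _patched_create_audio(text, voice, speed=1.0):
        # The bug: an int speed becomes an int32 tensor, which the ONNX
        # graph rejects. Coercing to float here fixes the dtype at the source.
        return original_create_audio(text, voice, float(speed))
    return _patched_create_audio

# Stand-in for the buggy library call, which chokes on integer speeds:
def fake_create_audio(text, voice, speed):
    assert isinstance(speed, float), "speed must be float32-compatible"
    return b"pcm-bytes"

create_audio = patch_speed_dtype(fake_create_audio)
```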
### Video and audio out of sync

`SimpleAVSync` uses explicit PTS. Verify the constants are consistent:

```python
# FRAMES_PER_CHUNK must equal CHUNK_DURATION × VIDEO_FPS exactly
# Default: 0.32 × 25 = 8 ✓
```
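PTS-based alignment means video and audio timestamps, computed from independent clocks, must land on the same values at chunk boundaries. A sketch of that check (illustrative; `SimpleAVSync`'s actual internals are in `backend/sync/av_sync.py`):

```python
VIDEO_FPS = 25
AUDIO_RATE = 24_000

def video_pts(frame_index: int) -> float:
    # Presentation timestamp of a video frame, in seconds.
    return frame_index / VIDEO_FPS

def audio_pts(sample_index: int) -> float:
    # Presentation timestamp of an audio sample, in seconds.
    return sample_index / AUDIO_RATE

# One 320 ms chunk = 8 frames = 7 680 samples: both PTS clocks agree,
# so drift cannot accumulate across chunks.
```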
## Performance Notes

### GPU memory (RTX 4060 8 GB)
| Component | Estimated VRAM |
|---|---|
| MuseTalk UNet (fp16) | ~3 GB |
| Whisper encoder | ~1.5 GB |
| SD-VAE | ~1 GB |
| Avatar latents + frame buffers | ~0.5 GB |
| Total | ~6 GB |
### Latency targets
| Stage | Target |
|---|---|
| First TTS chunk ready | < 100 ms |
| First video chunk visible | < 200 ms after /speak |
| Per-chunk generation (8 frames @ 25 fps) | ~80–120 ms |
| End-to-end text → visible lip-sync | ~200–350 ms |
### Optimisations already applied
| Optimisation | Location | Effect |
|---|---|---|
| `torch.set_float32_matmul_precision('high')` | `server.py` | ~5 % speedup via TF32 on Ampere+ |
| `MUSETALK_UNET_FP16 = True` | `config.py` | Halves UNet memory bandwidth |
| `LIVEKIT_AUDIO_SAMPLE_RATE = 24000` | `config.py` | Eliminates the TTS→LiveKit resampling step |
| Sentence-boundary TTS split | `kokoro_tts.py` | Lower latency on the first synthesised chunk |
| Synchronous Whisper + TTS warmup at `/connect` | `server.py` | Primes ONNX thread pools before the first request |
| UNet eager warmup pass at `/connect` | `server.py` | Primes CUDA kernels |
| Optional `torch.compile` UNet JIT | `server.py` | Further throughput gain (opt-in via env var) |
| `IdleFrameGenerator` | `livekit_publisher.py` | Keeps the video track alive between speech turns |
| `torch._dynamo.config.suppress_errors = True` | `server.py` | Graceful fallback to eager if the Triton JIT fails |
## Available Voices
| Voice ID | Style |
|---|---|
| `af_heart` | Female, emotional/expressive – Avatar page default |
| `af_sarah` | Female, clear and professional – Voice Agent default |
| `af_bella` | Female, warm and friendly |
| `am_michael` | Male, professional |
| `am_fen` | Male, deep |
| `bf_emma` | Female, British accent |
| `bm_george` | Male, British accent |
## Credits
- MuseTalk v1.5 – TMElyralab/MuseTalk
- Kokoro TTS – remsky/Kokoro-ONNX
- LiveKit – livekit/livekit
- faster-whisper – SYSTRAN/faster-whisper
- llama.cpp – ggerganov/llama.cpp
- Llama 3.2 3B – Meta AI