# Speech-X — Speech-to-Video Pipeline

Two interactive modes in one repo, sharing the same conda environment (`avatar`), model weights, and LiveKit server.

| Page | Mode | What it does |
|------|------|--------------|
| `/` | **Avatar** | Text → Kokoro TTS → MuseTalk lip-sync → LiveKit video+audio |
| `/voice` | **Voice Agent** | Mic → faster-whisper ASR → Llama LLM → Kokoro TTS → LiveKit audio |

---

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Project Structure](#project-structure)
3. [Conda Environment & Installation](#conda-environment--installation)
4. [Running the Application](#running-the-application)
5. [Configuration Reference](#configuration-reference)
6. [API Endpoints (Avatar page)](#api-endpoints-avatar-page)
7. [Frontend](#frontend)
8. [Avatars](#avatars)
9. [Model Weights](#model-weights)
10. [Troubleshooting](#troubleshooting)
11. [Performance Notes](#performance-notes)

---

## Architecture Overview

### Avatar Page (`/`) — MuseTalk lip-sync

```
Browser (React)
  │  POST /connect      → loads MuseTalk + Kokoro, joins LiveKit as publisher
  │  POST /speak {text} → pushes text into the streaming pipeline
  │  POST /get-token    → returns a LiveKit JWT for the frontend viewer
  ▼
FastAPI server :8767 (backend/api/server.py)
  │
  ├── KokoroTTS (backend/tts/kokoro_tts.py)
  │     kokoro-v1.0.onnx → 24 kHz PCM audio, streamed in 320 ms chunks
  │     Text split at sentence boundaries for lower first-chunk latency
  │
  ├── MuseTalkWorker (backend/musetalk/worker.py)
  │     Two-phase GPU inference:
  │       1. extract_features(audio)        → Whisper encoder produces mel embeddings
  │       2. generate_batch(feats, start, n) → UNet generates n lip-synced frames
  │     ThreadPoolExecutor(max_workers=1) keeps GPU work off the async event loop
  │
  ├── AVPublisher (backend/publisher/livekit_publisher.py)
  │     Streams video (actual avatar dimensions @ 25 fps) + audio (24 kHz) to LiveKit
  │     IdleFrameGenerator loops avatar frames while the pipeline is idle
  │
  └── StreamingPipeline (backend/api/pipeline.py)
        Coordinates TTS → MuseTalk → publisher
        SimpleAVSync aligns PTS across chunks
```

### Voice Agent Page (`/voice`) — ASR → LLM → TTS

```
Browser (React)
  │  POST http://localhost:3000/get-token → LiveKit JWT from built-in token server
  │  Publishes local mic audio track to LiveKit room
  ▼
LiveKit Agent worker (backend/agent.py)
  │  livekit-agents framework, pre-warms all models at startup
  │
  ├── ASR (backend/agent/asr.py)
  │     faster-whisper (model size via ASR_MODEL_SIZE, default: base)
  │     Buffers ~1.5 s of 16 kHz mono audio before each transcription
  │
  ├── LLM (backend/agent/llm.py)
  │     HTTP client → llama-server :8080
  │     Llama-3.2-3B-Instruct-Q4_K_M.gguf, keeps last 6 turns of history
  │
  └── TTS (backend/agent/tts.py)
        kokoro-onnx, default voice: af_sarah
        Publishes 48 kHz mono audio back to the LiveKit room

Built-in aiohttp token server on :3000
```

---

## Project Structure

```
speech_to_video/
├── environment.yml          # Conda env export (no build strings, cross-platform)
├── README.md                # Quick-start guide
├── PROJECT.md               # This file — detailed reference
├── setup.md                 # Step-by-step first-time install
├── ISSUES_AND_PLAN.md       # Known issues and roadmap
├── pyproject.toml           # Python project config
│
├── docs/
│   ├── avatar_gen_README.md
│   └── avatar_gen_phase_2.md
│
├── backend/
│   ├── config.py            # Avatar page configuration (all tunable params)
│   ├── requirements.txt     # Pip dependencies (installed inside conda env)
│   │
│   ├── api/
│   │   ├── server.py        # FastAPI app — lifespan, endpoints, warmup logic
│   │   └── pipeline.py      # StreamingPipeline + SpeechToVideoPipeline orchestrators
│   │
│   ├── agent.py             # LiveKit Agent worker entry point (Voice page)
│   ├── agent/
│   │   ├── config.py        # Voice agent config (voices, model paths, SYSTEM_PROMPT)
│   │   ├── asr.py           # faster-whisper ASR wrapper
│   │   ├── llm.py           # llama-server HTTP client
│   │   └── tts.py           # Kokoro TTS for Voice agent
│   │
│   ├── tts/
│   │   └── kokoro_tts.py    # Kokoro TTS for Avatar page
│   │         Patches int32→float32 speed-tensor bug (kokoro-onnx 0.5.x)
│   │         Splits text at sentence boundaries for low first-chunk latency
│   │
│   ├── musetalk/
│   │   ├── worker.py        # MuseTalkWorker — async GPU wrapper (ThreadPoolExecutor)
│   │   ├── processor.py     # Core MuseTalk inference (VAE encode/decode, UNet forward)
│   │   ├── face.py          # Face detection / cropping
│   │   ├── data/            # Dataset helpers, audio processing
│   │   ├── models/          # UNet, VAE, SyncNet model definitions
│   │   └── utils/           # Audio processor, blending, preprocessing, DWPose, face parsing
│   │
│   ├── whisper/
│   │   └── audio2feature.py # Whisper encoder for MuseTalk audio features
│   │
│   ├── sync/
│   │   └── av_sync.py       # AVSyncGate + SimpleAVSync — PTS-based AV alignment
│   │
│   ├── publisher/
│   │   └── livekit_publisher.py # AVPublisher + IdleFrameGenerator
│   │
│   ├── models/              # All model weights (not in git)
│   │   ├── kokoro/
│   │   │   ├── kokoro-v1.0.onnx
│   │   │   └── voices-v1.0.bin
│   │   ├── musetalkV15/
│   │   │   ├── musetalk.json
│   │   │   └── unet.pth
│   │   ├── sd-vae/
│   │   │   └── config.json
│   │   ├── whisper/
│   │   │   ├── config.json
│   │   │   └── preprocessor_config.json
│   │   ├── dwpose/
│   │   │   └── dw-ll_ucoco_384.pth
│   │   ├── face-parse-bisent/
│   │   │   ├── 79999_iter.pth
│   │   │   └── resnet18-5c106cde.pth
│   │   ├── syncnet/
│   │   │   └── latentsync_syncnet.pt
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   │
│   └── avatars/             # Pre-computed avatar assets (not in git)
│       ├── christine/       # avator_info.json, latents.pt, full_imgs/, mask/, vid_output/
│       ├── harry_1/
│       └── sophy/           # DEFAULT_AVATAR — used unless overridden by SPEECHX_AVATAR
│
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── App.tsx          # Avatar page (/)
        ├── main.tsx         # React Router setup
        ├── index.css        # Avatar page styles
        ├── voice.css        # Voice page styles
        └── pages/
            └── VoicePage.tsx # Voice Agent page (/voice)
```

---

## Conda Environment & Installation

### Restore environment

```bash
# From repo root
conda env create -f environment.yml
conda activate avatar
```

### Frontend

```bash
cd frontend
npm install
```

### First-time model setup

See [setup.md](setup.md) for the step-by-step guide to download and place all model weights.

---

## Running the Application

All four processes must run concurrently. Use four terminals, all with `conda activate avatar`.

### Terminal 1 — LiveKit server (shared)

```bash
docker run --rm -d \
  --name livekit-server \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server:latest \
  --dev --bind 0.0.0.0 --node-ip 127.0.0.1
# Stop: docker stop livekit-server
```

### Terminal 2 — llama-server (shared by both pages)

```bash
llama-server \
  -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -c 2048 -ngl 32 --port 8080
```

> `llama-server` must be in PATH. Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases).

### Terminal 3 — Backend (choose one or both)

**Avatar page** (lip-sync + video):

```bash
conda activate avatar
cd backend
python api/server.py
# → http://localhost:8767
```

**Voice Agent page** (ASR → LLM → TTS audio only):

```bash
conda activate avatar
cd backend
python agent.py dev
# → LiveKit agent worker + token server on http://localhost:3000
```

> Both can run simultaneously in separate terminals.

### Terminal 4 — Frontend (Vite dev server)

```bash
cd frontend
npm run dev
# → http://localhost:5173
```

- `http://localhost:5173/` — Avatar lip-sync page
- `http://localhost:5173/voice` — Voice Agent page

---

## Configuration Reference

### `backend/config.py` — Avatar page

All values are overridable via environment variables.
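The override pattern can be sketched as follows — a minimal illustration, not the repo's actual code (the `env_str`/`env_int` helper names are hypothetical), though the env-var names and defaults match the table:

```python
import os

# Hypothetical sketch of the env-override pattern used by backend/config.py:
# every value falls back to its default unless the matching env var is set.
def env_str(name: str, default: str) -> str:
    return os.environ.get(name, default)

def env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, str(default)))

DEFAULT_AVATAR = env_str("SPEECHX_AVATAR", "sophy")      # e.g. harry_1, christine
KOKORO_VOICE = env_str("SPEECHX_VOICE", "af_heart")
PORT = env_int("SPEECH_TO_VIDEO_PORT", 8767)
```

With this pattern, `SPEECHX_AVATAR=harry_1 python api/server.py` changes the active avatar for one run without editing any file.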
| Variable | Default | Env var | Description |
|----------|---------|---------|-------------|
| `DEVICE` | `"cuda"` | `SPEECHX_DEVICE` | Inference device |
| `VIDEO_FPS` | `25` | — | Output frame rate |
| `VIDEO_WIDTH` | `720` | — | Base width (actual width is read from avatar frames at `/connect`) |
| `VIDEO_HEIGHT` | `1280` | — | Base height |
| `TTS_SAMPLE_RATE` | `24000` | — | Kokoro output sample rate |
| `LIVEKIT_AUDIO_SAMPLE_RATE` | `24000` | — | LiveKit audio rate (matches TTS — no resampling) |
| `CHUNK_DURATION` | `0.32` | — | 320 ms per TTS/video chunk (8 frames @ 25 fps) |
| `FRAMES_PER_CHUNK` | `8` | — | Computed: `CHUNK_DURATION × VIDEO_FPS` |
| `PRE_ROLL_FRAMES` | `1` | — | Minimal pre-roll for fast start |
| `MUSETALK_UNET_FP16` | `True` | — | fp16 inference for lower VRAM |
| `KOKORO_VOICE` | `"af_heart"` | `SPEECHX_VOICE` | TTS voice |
| `KOKORO_SPEED` | `1.0` | — | Speech speed multiplier |
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` | LiveKit server URL |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` | LiveKit API key |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` | LiveKit API secret |
| `LIVEKIT_ROOM_NAME` | `"speech-to-video-room"` | `LIVEKIT_ROOM_NAME` | Room name |
| `DEFAULT_AVATAR` | `"sophy"` | `SPEECHX_AVATAR` | Active avatar |
| `HOST` | `"0.0.0.0"` | `SPEECH_TO_VIDEO_HOST` | Bind address |
| `PORT` | `8767` | `SPEECH_TO_VIDEO_PORT` | Bind port |

**Optional torch.compile for UNet:**

```bash
MUSETALK_TORCH_COMPILE=1 python api/server.py
# UNet JIT compiles in the background after /connect; later requests benefit from Triton kernels
```

### `backend/agent/config.py` — Voice Agent page

| Variable | Default | Env var |
|----------|---------|---------|
| `LIVEKIT_URL` | `"ws://localhost:7880"` | `LIVEKIT_URL` |
| `LIVEKIT_API_KEY` | `"devkey"` | `LIVEKIT_API_KEY` |
| `LIVEKIT_API_SECRET` | `"secret"` | `LIVEKIT_API_SECRET` |
| `LLAMA_SERVER_URL` | `"http://localhost:8080/v1"` | `LLAMA_SERVER_URL` |
| `DEFAULT_VOICE` | `"af_sarah"` | `DEFAULT_VOICE` |
| `ASR_MODEL_SIZE` | `"base"` | `ASR_MODEL_SIZE` |

---

## API Endpoints (Avatar page)

Base URL: `http://localhost:8767`

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Liveness probe — returns `models_loaded`, `pipeline_active` |
| `GET` | `/status` | Detailed status with VRAM usage |
| `POST` | `/connect` | Load models, join LiveKit room, run warmup passes |
| `POST` | `/disconnect` | Gracefully stop pipeline and disconnect from room |
| `POST` | `/speak` | Push text into the pipeline; body: `{text, voice?, speed?}` |
| `POST` | `/get-token` | Issue LiveKit JWT; body: `{room_name, identity}` |
| `GET` | `/livekit-token` | GET alias for `/get-token` |

### `/connect` startup sequence

1. `load_musetalk_models(avatar_name, device)` — loads UNet, VAE, Whisper encoder, avatar assets
2. `KokoroTTS()` — initializes ONNX session
3. Create LiveKit `Room`, generate backend-agent JWT with publish permissions
4. Read actual frame dimensions from the pre-computed avatar's `frame_list[0]`
5. Instantiate `AVPublisher`, `MuseTalkWorker`, `StreamingPipeline`
6. `room.connect()` → `publisher.start()` → `pipeline.start()` (idle loop begins)
7. Warmup: Whisper + TTS synchronously; UNet eager pass (or background torch.compile JIT)

### `/speak` example

```bash
curl -X POST http://localhost:8767/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you today?"}'
# → {"status": "processing", "latency_ms": 12.4}
```

---

## Frontend

### Avatar page — `App.tsx`

- Calls `POST /connect` on button click → loads models, joins LiveKit room
- Fetches a LiveKit token from `POST /get-token`, connects as a viewer
- Receives the remote video + audio tracks published by the backend agent
- Sends typed text via `POST /speak`
- Chat log shows `user` / `assistant` / `system` messages with timestamps
- Static fallback avatar image `/Sophy.png` shown before the first video frame arrives

### Voice Agent page — `VoicePage.tsx`

- Fetches a LiveKit token from the agent's built-in server at `http://localhost:3000/get-token`
- Publishes the local microphone (echo-cancelled, noise-suppressed) as an audio track
- Agent worker subscribes, runs ASR → LLM → TTS, and publishes the reply audio back into the room
- `DataReceived` events carry `{type: "user"|"assistant", text}` for the transcript display
- Latency is shown as the time between the user-transcript event and the assistant-text event

---

## Avatars

Pre-computed assets in `backend/avatars/<avatar_name>/`:

| File/Dir | Description |
|----------|-------------|
| `avator_info.json` | Avatar bbox, landmark, and crop metadata |
| `latents.pt` | Pre-encoded VAE latent tensors for all idle frames |
| `full_imgs/` | Full-resolution source frames |
| `mask/` | Per-frame blending masks for composite output |
| `vid_output/` | Intermediate video output (optional) |

Available avatars: **sophy** (default), **harry_1**, **christine**. Override at runtime:

```bash
SPEECHX_AVATAR=harry_1 python api/server.py
```

To generate a new avatar from a source video, see [docs/avatar_gen_README.md](docs/avatar_gen_README.md).

---

## Model Weights

All weights live in `backend/models/` (not committed to git).
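Since the weights are not in git, a fresh clone will fail at `/connect` until they are placed. A small pre-flight check can catch this early — a hypothetical helper, not part of the repo, covering a few of the files listed below:

```python
from pathlib import Path

# Subset of the weight files documented for backend/models/ (see table below).
EXPECTED = [
    "kokoro/kokoro-v1.0.onnx",
    "kokoro/voices-v1.0.bin",
    "musetalkV15/unet.pth",
    "musetalkV15/musetalk.json",
    "Llama-3.2-3B-Instruct-Q4_K_M.gguf",
]

def missing_weights(models_dir: str) -> list[str]:
    """Return the relative paths under models_dir that do not exist yet."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    for rel in missing_weights("backend/models"):
        print(f"missing: {rel}")
```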
| Path | Used by | Notes |
|------|---------|-------|
| `kokoro/kokoro-v1.0.onnx` | Avatar TTS, Voice TTS | ONNX Runtime inference |
| `kokoro/voices-v1.0.bin` | Avatar TTS, Voice TTS | All voice style embeddings |
| `musetalkV15/unet.pth` | MuseTalk UNet | fp16 inference |
| `musetalkV15/musetalk.json` | MuseTalk | Architecture config |
| `sd-vae/config.json` | MuseTalk VAE | Stable Diffusion VAE |
| `whisper/` | MuseTalk audio encoder | Whisper encoder weights + config |
| `dwpose/dw-ll_ucoco_384.pth` | DWPose | Used during avatar generation |
| `face-parse-bisent/` | Face parsing | BiSeNet; used during avatar generation |
| `syncnet/latentsync_syncnet.pt` | SyncNet | Training/evaluation only, not live inference |
| `Llama-3.2-3B-Instruct-Q4_K_M.gguf` | llama-server | Served by external `llama-server` process |

---

## Troubleshooting

### `ModuleNotFoundError` on startup

```bash
conda activate avatar   # must be active
cd backend
python api/server.py    # run from backend/, not the repo root
```

### LiveKit connection errors

```bash
docker ps | grep livekit      # is it running?
docker logs livekit-server    # check for port conflicts or errors
# Keys must match in both config files:
#   backend/config.py       → LIVEKIT_API_KEY / LIVEKIT_API_SECRET
#   backend/agent/config.py → same values
```

### Out of VRAM (Avatar page)

```python
# backend/config.py
CHUNK_DURATION = 0.16        # halve the chunk → 4 frames instead of 8
# or
MUSETALK_UNET_FP16 = False   # switch to fp32 if fp16 causes NaNs
```

### `llama-server` not found

Download from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases), extract the binary, and add its folder to `PATH`.

### Port already in use

```bash
lsof -i :8767   # Avatar backend
lsof -i :3000   # Voice agent token server
lsof -i :8080   # llama-server
lsof -i :7880   # LiveKit
```

### Kokoro ONNX `int32` speed-tensor error

Already patched in `backend/tts/kokoro_tts.py` via a `_patched_create_audio` monkey-patch that forces `speed` to `float32`. No action needed.
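The idea behind the patch can be sketched as follows — a hedged illustration of the dtype coercion, not the actual `_patched_create_audio` code (the real patch wraps kokoro-onnx internals; the `coerce_speed` helper below is hypothetical):

```python
import numpy as np

# Sketch only: kokoro-onnx 0.5.x builds the `speed` input as int32, while the
# ONNX graph expects float32. Coercing the value before inference avoids the
# dtype mismatch regardless of whether the caller passes 1 or 1.0.
def coerce_speed(speed: float) -> np.ndarray:
    """Return the speed value as a 1-element float32 tensor."""
    return np.asarray([speed], dtype=np.float32)
```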
### Video and audio out of sync

`SimpleAVSync` uses explicit PTS. Verify the constants are consistent:

```python
# FRAMES_PER_CHUNK must equal CHUNK_DURATION × VIDEO_FPS exactly
# Default: 0.32 × 25 = 8 ✓
```

---

## Performance Notes

### GPU memory (RTX 4060 8 GB)

| Component | Estimated VRAM |
|-----------|----------------|
| MuseTalk UNet (fp16) | ~3 GB |
| Whisper encoder | ~1.5 GB |
| SD-VAE | ~1 GB |
| Avatar latents + frame buffers | ~0.5 GB |
| **Total** | **~6 GB** |

### Latency targets

| Stage | Target |
|-------|--------|
| First TTS chunk ready | < 100 ms |
| First video chunk visible | < 200 ms after `/speak` |
| Per-chunk generation (8 frames @ 25 fps) | ~80–120 ms |
| End-to-end text → visible lip-sync | ~200–350 ms |

### Optimisations already applied

| Optimisation | Location | Effect |
|--------------|----------|--------|
| `torch.set_float32_matmul_precision('high')` | `server.py` | ~5 % speedup via TF32 on Ampere+ |
| `MUSETALK_UNET_FP16 = True` | `config.py` | Halves UNet memory bandwidth |
| `LIVEKIT_AUDIO_SAMPLE_RATE = 24000` | `config.py` | Eliminates the TTS→LiveKit resampling step |
| Sentence-boundary TTS split | `kokoro_tts.py` | Lower latency on the first synthesised chunk |
| Synchronous Whisper + TTS warmup at `/connect` | `server.py` | Primes ONNX thread pools before the first request |
| UNet eager warmup pass at `/connect` | `server.py` | Primes CUDA kernels |
| Optional `torch.compile` UNet JIT | `server.py` | Further throughput gain (opt-in via env var) |
| `IdleFrameGenerator` | `livekit_publisher.py` | Keeps the video track alive between speech turns |
| `torch._dynamo.config.suppress_errors = True` | `server.py` | Graceful fallback to eager if the Triton JIT fails |

---

## Available Voices

| Voice ID | Style |
|----------|-------|
| `af_heart` | Female, emotional/expressive — Avatar page default |
| `af_sarah` | Female, clear and professional — Voice Agent default |
| `af_bella` | Female, warm and friendly |
| `am_michael` | Male, professional |
| `am_fen` | Male, deep |
| `bf_emma` | Female, British accent |
| `bm_george` | Male, British accent |

---

## Credits

- **MuseTalk v1.5** — [TMElyralab/MuseTalk](https://github.com/TMElyralab/MuseTalk)
- **Kokoro TTS** — [remsky/Kokoro-ONNX](https://github.com/remsky/Kokoro-ONNX)
- **LiveKit** — [livekit/livekit](https://github.com/livekit/livekit)
- **faster-whisper** — [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- **llama.cpp** — [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
- **Llama 3.2 3B** — Meta AI