
Speech-X

Two modes in one repo; both share the same conda environment, models, and LiveKit server.

| Page   | Mode        | What it does                                                |
|--------|-------------|-------------------------------------------------------------|
| /      | Avatar      | Text/voice → Kokoro TTS → MuseTalk lip-sync → LiveKit video |
| /voice | Voice Agent | Voice → faster-whisper → Llama → Kokoro TTS → LiveKit audio |
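As a mental model, the /voice flow in the table above is three stages composed in order. A stub sketch of that composition (the stage bodies are placeholders, not the real faster-whisper, llama-server, or Kokoro calls):

```python
# Illustrative stubs standing in for the real backend/agent/ modules,
# whose actual interfaces differ.
def asr(audio: bytes) -> str:
    return "hello avatar"             # stub transcription (faster-whisper)

def llm(prompt: str) -> str:
    return f"You said: {prompt}"      # stub completion (llama-server)

def tts(text: str) -> bytes:
    return text.encode("utf-8")       # stub audio (Kokoro TTS)

def voice_pipeline(audio: bytes) -> bytes:
    """Voice → ASR → LLM → TTS, mirroring the /voice row above."""
    return tts(llm(asr(audio)))
```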

Architecture

Browser (React + LiveKit SDK)
  │
  ├── /          →  FastAPI server (port 8767)  →  Kokoro TTS + MuseTalk + LiveKit publisher
  └── /voice     →  LiveKit Agent worker        →  ASR → LLM → TTS  +  token server (port 3000)

Shared infrastructure (always running):
  ┌────────────────────────────────────────────────────────────────┐
  │  LiveKit server :7880  │  llama-server :8080  │  Vite dev :5173│
  └────────────────────────────────────────────────────────────────┘

Prerequisites


Environment Setup

See setup/setup.md for the full step-by-step guide, or run the automated script:

bash setup/setup.sh          # Linux / macOS
.\setup\setup.ps1            # Windows (PowerShell)

Restore conda environment

# From repo root
conda env create -f environment.yml
conda activate avatar

Frontend dependencies

cd frontend
npm install

Running

All four processes run concurrently. Open four terminals.

Terminal 1: LiveKit server (Docker, shared by both pages)

docker run --rm -d \
  --name livekit-server \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server:latest \
  --dev --bind 0.0.0.0 --node-ip 127.0.0.1

Stop it later: docker stop livekit-server


Terminal 2: llama-server (shared by both pages)

llama-server \
  -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -c 2048 -ngl 32 --port 8080

Terminal 3: Backend (choose based on which page you need)

For / (Avatar page, MuseTalk lip-sync):

conda activate avatar
cd backend
python api/server.py
# Runs on http://localhost:8767

For /voice (Voice Agent page, ASR → LLM → TTS):

conda activate avatar
cd backend
python agent.py dev
# Token server on http://localhost:3000
# LiveKit worker connects to ws://localhost:7880

You can run both at the same time in separate terminals if you want both pages live.
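The token server on port 3000 hands the browser a LiveKit access token. Conceptually, that token is an HS256 JWT signed with the API secret. A stdlib-only sketch of the claim layout (the real server should use the LiveKit SDK; the exact claims shown here follow LiveKit's documented convention but are an assumption about this repo's implementation):

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    # JWT uses unpadded base64url
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_livekit_token(api_key: str, api_secret: str, identity: str, room: str) -> str:
    """Minimal HS256 JWT in LiveKit's claim layout."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": api_key,                           # API key
        "sub": identity,                          # participant identity
        "nbf": now,
        "exp": now + 3600,                        # 1 hour validity
        "video": {"room": room, "roomJoin": True},  # room grant
    }
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"
```

With the dev credentials above, make_livekit_token("devkey", "secret", "alice", "demo") yields a token the LiveKit dev server accepts.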


Terminal 4: Frontend

cd frontend
npm run dev
# Open http://localhost:5173
  • http://localhost:5173/ – Avatar lip-sync page
  • http://localhost:5173/voice – Voice agent page

Avatars

Three avatars ship pre-computed: sophy (default), harry_1, christine.

To create a new one, run setup/avatar_creation.py once from the repo root with the avatar env active:

conda activate avatar

# From a portrait image (duplicated to 50 frames)
python setup/avatar_creation.py --image frontend/public/Sophy.png --name sophy

# From a talking-head video
python setup/avatar_creation.py --video /path/to/talking_head.mp4 --name harry_1

# Batch: edit setup/avatars_config.yml first
python setup/avatar_creation.py --config setup/avatars_config.yml

Outputs written to backend/avatars/<name>/: latents.pt, coords.pkl, mask_coords.pkl, full_imgs/, mask/, avator_info.json.
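If lip-sync misbehaves for a new avatar, a quick completeness check of the output folder is a useful first step. A hypothetical helper based on the file list above (not part of the repo):

```python
from pathlib import Path

# Expected contents of backend/avatars/<name>/, per the output description.
# Note: "avator_info.json" is the actual (misspelled) MuseTalk filename.
REQUIRED = ["latents.pt", "coords.pkl", "mask_coords.pkl",
            "full_imgs", "mask", "avator_info.json"]

def missing_assets(avatar_dir: str) -> list:
    """Return the names of any expected files/folders that are absent."""
    root = Path(avatar_dir)
    return [name for name in REQUIRED if not (root / name).exists()]
```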

Switch avatars by setting SPEECHX_AVATAR when launching the server:

SPEECHX_AVATAR=harry_1 python api/server.py

| Flag         | Default  | Description                                   |
|--------------|----------|-----------------------------------------------|
| --name       | required | Avatar folder name                            |
| --frames     | 50       | Frame count for --image mode                  |
| --bbox-shift | 5        | Vertical bbox nudge (tune if the crop is off) |
| --device     | cuda     | cuda or cpu                                   |
| --overwrite  | off      | Skip the re-create prompt                     |
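The CLI surface above could be declared with argparse roughly like this (a sketch; setup/avatar_creation.py is authoritative):

```python
import argparse

# Mirrors the documented flags; help strings paraphrase the table above.
parser = argparse.ArgumentParser(description="Pre-compute avatar assets")
parser.add_argument("--name", required=True, help="Avatar folder name")
parser.add_argument("--image", help="Portrait image (duplicated to --frames frames)")
parser.add_argument("--video", help="Talking-head video")
parser.add_argument("--frames", type=int, default=50, help="Frame count for --image mode")
parser.add_argument("--bbox-shift", type=int, default=5, help="Vertical bbox nudge")
parser.add_argument("--device", choices=["cuda", "cpu"], default="cuda")
parser.add_argument("--overwrite", action="store_true", help="Skip the re-create prompt")

# Example invocation matching the first command shown earlier:
args = parser.parse_args(["--name", "sophy", "--image", "frontend/public/Sophy.png"])
```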

Environment Variables

Copy and adjust as needed (both backends read from shell env or a .env file in backend/):

# LiveKit
LIVEKIT_URL=ws://localhost:7880
LIVEKIT_API_KEY=devkey
LIVEKIT_API_SECRET=secret

# llama-server
LLAMA_SERVER_URL=http://localhost:8080/v1

# TTS voice (voice agent page)
DEFAULT_VOICE=af_sarah        # see backend/agent/config.py for all options

# ASR model size (voice agent page)
ASR_MODEL_SIZE=tiny           # tiny | base | small

# Avatar page server
SPEECH_TO_VIDEO_HOST=0.0.0.0
SPEECH_TO_VIDEO_PORT=8767
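A sketch of how the backends might resolve these values, with the defaults above as fallbacks (the real logic lives in backend/config.py and backend/agent/config.py; the helper name here is illustrative):

```python
import os

def env(name: str, default: str) -> str:
    """Shell environment wins; otherwise fall back to the documented default."""
    return os.environ.get(name, default)

LIVEKIT_URL    = env("LIVEKIT_URL", "ws://localhost:7880")
DEFAULT_VOICE  = env("DEFAULT_VOICE", "af_sarah")
ASR_MODEL_SIZE = env("ASR_MODEL_SIZE", "tiny")
PORT           = int(env("SPEECH_TO_VIDEO_PORT", "8767"))
```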

Project Structure

speech_to_video/
├── environment.yml          # Conda env export (cross-platform, no build strings)
├── README.md
├── setup/
│   ├── setup.md             # Step-by-step install guide
│   ├── setup.sh             # Automated setup script (Linux/macOS)
│   └── setup.ps1            # Automated setup script (Windows/PowerShell)
├── docs/
│   ├── avatar_gen_README.md # Voice agent architecture notes
│   └── avatar_gen_phase_2.md # Phase 2 MuseTalk integration plan
├── backend/
│   ├── config.py            # Avatar page configuration
│   ├── requirements.txt     # Pip dependencies (inside conda env)
│   ├── api/
│   │   ├── server.py        # FastAPI server for Avatar page (:8767)
│   │   └── pipeline.py      # MuseTalk pipeline orchestrator
│   ├── agent.py             # LiveKit worker entry point for Voice page
│   ├── agent/
│   │   ├── config.py        # Voice agent config (voices, model paths)
│   │   ├── asr.py           # faster-whisper ASR
│   │   ├── llm.py           # llama-server HTTP client
│   │   └── tts.py           # kokoro-onnx TTS; patches int32→float32 speed bug (0.5.x)
│   ├── tts/
│   │   └── kokoro_tts.py    # Kokoro TTS for Avatar page
│   ├── musetalk/            # MuseTalk inference
│   ├── models/              # All model weights
│   │   ├── kokoro/
│   │   ├── musetalkV15/
│   │   ├── sd-vae/
│   │   ├── whisper/
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   └── avatars/             # Pre-computed avatar assets
│       ├── christine/
│       ├── harry_1/
│       └── sophy/
└── frontend/
    └── src/
        ├── App.tsx          # Avatar page (/)
        ├── pages/
        │   └── VoicePage.tsx # Voice agent page (/voice)
        └── index.css

Available Voices (Voice Agent page)

| Voice ID   | Description                          |
|------------|--------------------------------------|
| af_sarah   | Female, clear and professional       |
| af_bella   | Female, warm and friendly            |
| af_heart   | Female, emotional and expressive     |
| am_michael | Male, professional and authoritative |
| am_fen     | Male, deep and resonant              |
| bf_emma    | Female, British accent               |
| bm_george  | Male, British accent                 |

Troubleshooting

Kokoro ONNX int32 speed-tensor error

Already patched in both backend/tts/kokoro_tts.py and backend/agent/tts.py via a _patched_create_audio monkey-patch. Requires kokoro-onnx>=0.5.0.
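The pattern behind the patch, shown against a dummy class rather than the real kokoro-onnx API: replace the method with a wrapper that casts the speed value to float before delegating to the original.

```python
# Illustration only: FakeKokoro stands in for the kokoro-onnx class;
# the real method name and signature may differ.
class FakeKokoro:
    def create_audio(self, text, voice, speed):
        # kokoro-onnx 0.5.x passed speed through as an int, which the
        # ONNX session rejects; it expects a float32 value.
        assert isinstance(speed, float), "speed must be a float"
        return [0.0] * 4                     # stand-in for audio samples

_original_create_audio = FakeKokoro.create_audio

def _patched_create_audio(self, text, voice, speed):
    # The actual fix: coerce speed before the original method runs.
    return _original_create_audio(self, text, voice, float(speed))

FakeKokoro.create_audio = _patched_create_audio
```

After the patch, passing an integer speed (e.g. 1) no longer trips the dtype check.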

ModuleNotFoundError on startup

Activate the conda env first: conda activate avatar

LiveKit connection errors

Verify the API key matches in backend/config.py and backend/agent/config.py:

LIVEKIT_API_KEY = "devkey"
LIVEKIT_API_SECRET = "secret"

llama-server not found

Download from llama.cpp releases and add to PATH.
Windows: download llama-...-win-cuda-cu12.x.x-x64.zip, extract, add folder to PATH.

Out of VRAM (Avatar page)

Reduce batch size in backend/config.py:

FRAMES_PER_CHUNK = 2  # default 8
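Why this helps: the pipeline runs MuseTalk inference on a chunk of frames at a time, so peak VRAM scales with the chunk size; smaller chunks trade throughput for memory. A minimal chunker illustrating the batching (the real logic is assumed to live in backend/api/pipeline.py):

```python
def chunk_frames(frames, frames_per_chunk):
    """Split a frame list into consecutive chunks of at most frames_per_chunk."""
    return [frames[i:i + frames_per_chunk]
            for i in range(0, len(frames), frames_per_chunk)]

# With FRAMES_PER_CHUNK = 2, ten frames become five chunks of two.
chunks = chunk_frames(list(range(10)), 2)
```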

Port already in use

# Find the process holding the port...
lsof -i :8767    # avatar page backend
lsof -i :3000    # voice agent token server
lsof -i :8080    # llama-server

# ...then kill it, e.g.
kill $(lsof -t -i :8767)