
Speech-X

Two modes in one repo; both share the same conda environment, models, and LiveKit server.

| Page   | Mode        | What it does                                                |
|--------|-------------|-------------------------------------------------------------|
| /      | Avatar      | Text/voice → Kokoro TTS → MuseTalk lip-sync → LiveKit video |
| /voice | Voice Agent | Voice → faster-whisper → Llama → Kokoro TTS → LiveKit audio |
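As a mental model, the /voice flow in the table above is three stages composed in order. A stub sketch of that composition (the stage bodies are placeholders, not the real faster-whisper, llama-server, or Kokoro calls):

```python
# Illustrative stubs standing in for the real backend/agent/ modules,
# whose actual interfaces differ.
def asr(audio: bytes) -> str:
    return "hello avatar"             # stub transcription (faster-whisper)

def llm(prompt: str) -> str:
    return f"You said: {prompt}"      # stub completion (llama-server)

def tts(text: str) -> bytes:
    return text.encode("utf-8")       # stub audio (Kokoro TTS)

def voice_pipeline(audio: bytes) -> bytes:
    """Voice → ASR → LLM → TTS, mirroring the /voice row above."""
    return tts(llm(asr(audio)))
```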

Architecture

Browser (React + LiveKit SDK)
  │
  ├── /          →  FastAPI server (port 8767)  →  Kokoro TTS + MuseTalk + LiveKit publisher
  └── /voice     →  LiveKit Agent worker        →  ASR → LLM → TTS  +  token server (port 3000)

Shared infrastructure (always running):
  ┌────────────────────────────────────────────────────────────────┐
  │  LiveKit server :7880  │  llama-server :8080  │  Vite dev :5173│
  └────────────────────────────────────────────────────────────────┘

Prerequisites


Environment Setup

See setup/setup.md for the full step-by-step guide, or run the automated script:

bash setup/setup.sh          # Linux / macOS
.\setup\setup.ps1            # Windows (PowerShell)

Restore conda environment

# From repo root
conda env create -f environment.yml
conda activate avatar

Frontend dependencies

cd frontend
npm install

Running

All four processes run concurrently. Open four terminals.

Terminal 1: LiveKit server (Docker, shared by both pages)

docker run --rm -d \
  --name livekit-server \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server:latest \
  --dev --bind 0.0.0.0 --node-ip 127.0.0.1

Stop it later: docker stop livekit-server


Terminal 2: llama-server (shared by both pages)

llama-server \
  -m backend/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -c 2048 -ngl 32 --port 8080

Terminal 3: Backend (choose based on which page you need)

For / (Avatar page, MuseTalk lip-sync):

conda activate avatar
cd backend
python api/server.py
# Runs on http://localhost:8767

For /voice (Voice Agent page, ASR → LLM → TTS):

conda activate avatar
cd backend
python agent.py dev
# Token server on http://localhost:3000
# LiveKit worker connects to ws://localhost:7880

You can run both at the same time in separate terminals if you want both pages live.
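The token server on port 3000 hands the browser a LiveKit access token. Conceptually, that token is an HS256 JWT signed with the API secret. A stdlib-only sketch of the claim layout (the real server should use the LiveKit SDK; the exact claims shown here follow LiveKit's documented convention but are an assumption about this repo's implementation):

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    # JWT uses unpadded base64url
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_livekit_token(api_key: str, api_secret: str, identity: str, room: str) -> str:
    """Minimal HS256 JWT in LiveKit's claim layout."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": api_key,                           # API key
        "sub": identity,                          # participant identity
        "nbf": now,
        "exp": now + 3600,                        # 1 hour validity
        "video": {"room": room, "roomJoin": True},  # room grant
    }
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"
```

With the dev credentials above, make_livekit_token("devkey", "secret", "alice", "demo") yields a token the LiveKit dev server accepts.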


Terminal 4: Frontend

cd frontend
npm run dev
# Open http://localhost:5173
  • http://localhost:5173/ – Avatar lip-sync page
  • http://localhost:5173/voice – Voice agent page

Avatars

Three avatars ship pre-computed: sophy (default), harry_1, christine.

To create a new one, run setup/avatar_creation.py once from the repo root with the avatar env active:

conda activate avatar

# From a portrait image (duplicated to 50 frames)
python setup/avatar_creation.py --image frontend/public/Sophy.png --name sophy

# From a talking-head video
python setup/avatar_creation.py --video /path/to/talking_head.mp4 --name harry_1

# Batch: edit setup/avatars_config.yml first
python setup/avatar_creation.py --config setup/avatars_config.yml

Outputs written to backend/avatars/<name>/: latents.pt, coords.pkl, mask_coords.pkl, full_imgs/, mask/, avator_info.json.
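If lip-sync misbehaves for a new avatar, a quick completeness check of the output folder is a useful first step. A hypothetical helper based on the file list above (not part of the repo):

```python
from pathlib import Path

# Expected contents of backend/avatars/<name>/, per the output description.
# Note: "avator_info.json" is the actual (misspelled) MuseTalk filename.
REQUIRED = ["latents.pt", "coords.pkl", "mask_coords.pkl",
            "full_imgs", "mask", "avator_info.json"]

def missing_assets(avatar_dir: str) -> list:
    """Return the names of any expected files/folders that are absent."""
    root = Path(avatar_dir)
    return [name for name in REQUIRED if not (root / name).exists()]
```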

Switch avatars by setting SPEECHX_AVATAR when launching the server:

SPEECHX_AVATAR=harry_1 python api/server.py

| Flag         | Default  | Description                                   |
|--------------|----------|-----------------------------------------------|
| --name       | required | Avatar folder name                            |
| --frames     | 50       | Frame count for --image mode                  |
| --bbox-shift | 5        | Vertical bbox nudge (tune if the crop is off) |
| --device     | cuda     | cuda or cpu                                   |
| --overwrite  | off      | Skip the re-create prompt                     |
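The CLI surface above could be declared with argparse roughly like this (a sketch; setup/avatar_creation.py is authoritative):

```python
import argparse

# Mirrors the documented flags; help strings paraphrase the table above.
parser = argparse.ArgumentParser(description="Pre-compute avatar assets")
parser.add_argument("--name", required=True, help="Avatar folder name")
parser.add_argument("--image", help="Portrait image (duplicated to --frames frames)")
parser.add_argument("--video", help="Talking-head video")
parser.add_argument("--frames", type=int, default=50, help="Frame count for --image mode")
parser.add_argument("--bbox-shift", type=int, default=5, help="Vertical bbox nudge")
parser.add_argument("--device", choices=["cuda", "cpu"], default="cuda")
parser.add_argument("--overwrite", action="store_true", help="Skip the re-create prompt")

# Example invocation matching the first command shown earlier:
args = parser.parse_args(["--name", "sophy", "--image", "frontend/public/Sophy.png"])
```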

Environment Variables

Copy and adjust as needed (both backends read from shell env or a .env file in backend/):

# LiveKit
LIVEKIT_URL=ws://localhost:7880
LIVEKIT_API_KEY=devkey
LIVEKIT_API_SECRET=secret

# llama-server
LLAMA_SERVER_URL=http://localhost:8080/v1

# TTS voice (voice agent page)
DEFAULT_VOICE=af_sarah        # see backend/agent/config.py for all options

# ASR model size (voice agent page)
ASR_MODEL_SIZE=tiny           # tiny | base | small

# Avatar page server
SPEECH_TO_VIDEO_HOST=0.0.0.0
SPEECH_TO_VIDEO_PORT=8767
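A sketch of how the backends might resolve these values, with the defaults above as fallbacks (the real logic lives in backend/config.py and backend/agent/config.py; the helper name here is illustrative):

```python
import os

def env(name: str, default: str) -> str:
    """Shell environment wins; otherwise fall back to the documented default."""
    return os.environ.get(name, default)

LIVEKIT_URL    = env("LIVEKIT_URL", "ws://localhost:7880")
DEFAULT_VOICE  = env("DEFAULT_VOICE", "af_sarah")
ASR_MODEL_SIZE = env("ASR_MODEL_SIZE", "tiny")
PORT           = int(env("SPEECH_TO_VIDEO_PORT", "8767"))
```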

Project Structure

speech_to_video/
├── environment.yml          # Conda env export (cross-platform, no build strings)
├── README.md
├── setup/
│   ├── setup.md             # Step-by-step install guide
│   ├── setup.sh             # Automated setup script (Linux/macOS)
│   └── setup.ps1            # Automated setup script (Windows/PowerShell)
├── docs/
│   ├── avatar_gen_README.md # Voice agent architecture notes
│   └── avatar_gen_phase_2.md # Phase 2 MuseTalk integration plan
├── backend/
│   ├── config.py            # Avatar page configuration
│   ├── requirements.txt     # Pip dependencies (inside conda env)
│   ├── api/
│   │   ├── server.py        # FastAPI server for Avatar page (:8767)
│   │   └── pipeline.py      # MuseTalk pipeline orchestrator
│   ├── agent.py             # LiveKit worker entry point for Voice page
│   ├── agent/
│   │   ├── config.py        # Voice agent config (voices, model paths)
│   │   ├── asr.py           # faster-whisper ASR
│   │   ├── llm.py           # llama-server HTTP client
│   │   └── tts.py           # kokoro-onnx TTS; patches int32→float32 speed bug (0.5.x)
│   ├── tts/
│   │   └── kokoro_tts.py    # Kokoro TTS for Avatar page
│   ├── musetalk/            # MuseTalk inference
│   ├── models/              # All model weights
│   │   ├── kokoro/
│   │   ├── musetalkV15/
│   │   ├── sd-vae/
│   │   ├── whisper/
│   │   └── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   └── avatars/             # Pre-computed avatar assets
│       ├── christine/
│       ├── harry_1/
│       └── sophy/
└── frontend/
    └── src/
        ├── App.tsx          # Avatar page (/)
        ├── pages/
        │   └── VoicePage.tsx # Voice agent page (/voice)
        └── index.css

Available Voices (Voice Agent page)

| Voice ID   | Description                          |
|------------|--------------------------------------|
| af_sarah   | Female, clear and professional       |
| af_bella   | Female, warm and friendly            |
| af_heart   | Female, emotional and expressive     |
| am_michael | Male, professional and authoritative |
| am_fen     | Male, deep and resonant              |
| bf_emma    | Female, British accent               |
| bm_george  | Male, British accent                 |

Troubleshooting

Kokoro ONNX int32 speed-tensor error

Already patched in both backend/tts/kokoro_tts.py and backend/agent/tts.py via a _patched_create_audio monkey-patch. Requires kokoro-onnx>=0.5.0.
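The pattern behind the patch, shown against a dummy class rather than the real kokoro-onnx API: replace the method with a wrapper that casts the speed value to float before delegating to the original.

```python
# Illustration only: FakeKokoro stands in for the kokoro-onnx class;
# the real method name and signature may differ.
class FakeKokoro:
    def create_audio(self, text, voice, speed):
        # kokoro-onnx 0.5.x passed speed through as an int, which the
        # ONNX session rejects; it expects a float32 value.
        assert isinstance(speed, float), "speed must be a float"
        return [0.0] * 4                     # stand-in for audio samples

_original_create_audio = FakeKokoro.create_audio

def _patched_create_audio(self, text, voice, speed):
    # The actual fix: coerce speed before the original method runs.
    return _original_create_audio(self, text, voice, float(speed))

FakeKokoro.create_audio = _patched_create_audio
```

After the patch, passing an integer speed (e.g. 1) no longer trips the dtype check.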

ModuleNotFoundError on startup

Activate the conda env first: conda activate avatar

LiveKit connection errors

Verify the API key matches in backend/config.py and backend/agent/config.py:

LIVEKIT_API_KEY = "devkey"
LIVEKIT_API_SECRET = "secret"

llama-server not found

Download from llama.cpp releases and add to PATH.
Windows: download llama-...-win-cuda-cu12.x.x-x64.zip, extract, add folder to PATH.

Out of VRAM (Avatar page)

Reduce batch size in backend/config.py:

FRAMES_PER_CHUNK = 2  # default 8
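Why this helps: the pipeline runs MuseTalk inference on a chunk of frames at a time, so peak VRAM scales with the chunk size; smaller chunks trade throughput for memory. A minimal chunker illustrating the batching (the real logic is assumed to live in backend/api/pipeline.py):

```python
def chunk_frames(frames, frames_per_chunk):
    """Split a frame list into consecutive chunks of at most frames_per_chunk."""
    return [frames[i:i + frames_per_chunk]
            for i in range(0, len(frames), frames_per_chunk)]

# With FRAMES_PER_CHUNK = 2, ten frames become five chunks of two.
chunks = chunk_frames(list(range(10)), 2)
```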

Port already in use

# Find the process holding the port...
lsof -i :8767    # avatar page backend
lsof -i :3000    # voice agent token server
lsof -i :8080    # llama-server

# ...then kill it, e.g.
kill $(lsof -t -i :8767)