| # Speech-X Setup Guide |
|
|
| Step-by-step install for the `avatar` conda environment. Adapted from the MuseTalk official docs for Python 3.12. |
|
|
| Alternatively, skip the manual stages below and run the automated script from the repo root: |
| ```bash |
| bash setup/setup.sh # Linux / macOS |
| .\setup\setup.ps1 # Windows (PowerShell) |
| ``` |
|
|
| ## Stage 1: Create Environment |
| ```bash |
| conda create -n avatar python=3.12 |
| conda activate avatar |
| ``` |
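A quick way to confirm that the right interpreter and environment are active before installing anything is sketched below. It assumes `conda activate` has set the `CONDA_DEFAULT_ENV` variable; the helper name `env_problems` is illustrative and not part of this repo.

```python
import os
import sys


def env_problems(version_info, conda_env, expected_env="avatar", expected_python=(3, 12)):
    """Return a list of human-readable problems; an empty list means the env looks right."""
    problems = []
    if tuple(version_info[:2]) != expected_python:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} is active, expected "
            f"{expected_python[0]}.{expected_python[1]}"
        )
    if conda_env != expected_env:
        problems.append(f"conda env is {conda_env!r}, expected {expected_env!r}")
    return problems


if __name__ == "__main__":
    # CONDA_DEFAULT_ENV is set by `conda activate`; it is absent outside conda.
    found = env_problems(sys.version_info, os.environ.get("CONDA_DEFAULT_ENV"))
    print("\n".join(found) if found else "Environment OK")
```

Save it to a scratch file and run it with `python` inside the activated environment; any line it prints names something to fix before continuing.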
|
|
| ## Stage 2: Install PyTorch |
| PyTorch 2.0.1 does not support Python 3.12, so this guide installs PyTorch 2.5.1 (CUDA 12.4 build) instead. |
| ```bash |
| pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124 |
| ``` |
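After the install, it can help to confirm which build was actually picked up and whether it sees a GPU. The following is a minimal sketch; the helper name `torch_report` is hypothetical, and it degrades to a short report if `torch` is not installed rather than raising.

```python
import importlib.util


def torch_report():
    """Summarize the installed PyTorch build without failing if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "version": torch.__version__,           # full version string of the installed wheel
        "built_for_cuda": torch.version.cuda,   # None for CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is `False` on a machine with an NVIDIA GPU, the likely suspects are a CPU-only wheel or a driver that is too old for the CUDA 12.4 build.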
|
|
| ## Stage 3: Install MMLab Packages |
| MuseTalk depends on these packages. Install them one at a time, in the order shown. |
| ```bash |
| pip install --no-cache-dir -U openmim |
| mim install mmengine |
| mim install mmcv==2.2.0 |
| # If above fails, try: pip install mmcv-lite==2.2.0 |
| mim install mmdet==3.3.0 |
| # Note: mmpose not required — not present in the reference env |
| ``` |
|
|
| ## Stage 4: Install MuseTalk Dependencies |
| These come from MuseTalk's official `requirements.txt`. |
| ```bash |
| pip install diffusers==0.30.2 |
| pip install accelerate==0.28.0 |
| pip install numpy==2.4.2 |
| pip install opencv-python==4.13.0.92 |
| pip install soundfile==0.12.1 |
| pip install transformers==4.39.2 |
| pip install huggingface-hub==0.36.2 |
| pip install librosa==0.10.2 |
| pip install einops==0.8.1 |
| |
| pip install gdown |
| pip install requests |
| pip install imageio==2.34.0 |
| pip install imageio-ffmpeg |
| |
| pip install omegaconf==2.3.0 |
| pip install ffmpeg-python |
| pip install moviepy |
| ``` |
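Because later installs can pull in different versions of packages that were pinned here, a small check of installed versions against the pins may be useful. This is a sketch using only the standard library; `PINS` lists a few of the versions from the commands above, and `pin_mismatches` is an illustrative name.

```python
from importlib import metadata

# A few pins copied from the Stage 4 commands; extend as needed.
PINS = {
    "diffusers": "0.30.2",
    "accelerate": "0.28.0",
    "transformers": "4.39.2",
    "omegaconf": "2.3.0",
}


def pin_mismatches(pins):
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # not installed in this environment
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in pin_mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

Running it again after Stage 5 shows whether any pinned package was changed by a later install.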
|
|
| ## Stage 5: Install Additional Dependencies |
| These packages are specific to this speech-to-video project. |
| ```bash |
| # Quote each spec: an unquoted > is treated by the shell as output redirection |
| pip install "fastapi>=0.115.0" |
| pip install "uvicorn[standard]>=0.30.0" |
| pip install "pydantic>=2.10.0" |
| pip install "python-dotenv>=1.0.1" |
| pip install "livekit>=0.10.0" |
| pip install "livekit-agents>=0.8.0" |
| pip install "kokoro-onnx>=0.5.0" |
| pip install "scipy>=1.13.0" |
| pip install "faster-whisper>=1.0.0" |
| pip install "sse-starlette>=2.0.0" |
| pip install "onnxruntime>=1.24.0" |
| pip install "sounddevice>=0.5.0" |
| pip install "tqdm>=4.65.0" |
| pip install "pyyaml>=6.0.0" |
| pip install "aiohttp>=3.9.0" |
| pip install "httpx>=0.27.0" |
| pip install "safetensors>=0.4.0" |
| pip install "pillow>=10.0.0" |
| ``` |
|
|
| ## Quick Test |
| ```bash |
| # Test MuseTalk imports |
| python -c " |
| import sys |
| sys.path.insert(0, 'backend') |
| from musetalk.processor import * |
| print('MuseTalk OK') |
| " |
| |
| # Test TTS import |
| python -c " |
| import sys |
| sys.path.insert(0, 'backend') |
| from tts.kokoro_tts import KokoroTTS |
| print('KokoroTTS OK') |
| " |
| ``` |
|
|
| ## Avatar Creation |
|
|
| Run once per avatar before starting the server. The script reads model paths from `backend/config.py` |
| and writes assets to `backend/avatars/<name>/`. |
|
|
| **Single portrait image:** |
| ```bash |
| conda activate avatar |
| python setup/avatar_creation.py --image frontend/public/Sophy.png --name sophy |
| ``` |
|
|
| **Talking-head video:** |
| ```bash |
| python setup/avatar_creation.py --video /path/to/talking_head.mp4 --name harry_1 |
| ``` |
|
|
| **Batch (multiple avatars at once):** |
| ```bash |
| # Edit setup/avatars_config.yml first, then: |
| python setup/avatar_creation.py --config setup/avatars_config.yml |
| ``` |
|
|
| Options: |
| | Flag | Default | Description | |
| |------|---------|-------------| |
| | `--name` | required | Avatar folder name | |
| | `--frames` | `50` | Frame count for `--image` mode | |
| | `--bbox-shift` | `5` | Vertical bbox nudge (tune if face crop is off) | |
| | `--device` | `cuda` | `cuda` or `cpu` | |
| | `--overwrite` | off | Skip the re-create confirmation prompt | |
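When tuning `--bbox-shift`, it can be convenient to generate one command per candidate value. The sketch below only assembles argument lists from the flags in the table above; the helper `avatar_command` and the `sophy_shift_<n>` names are hypothetical, and it assumes `--overwrite` is a plain switch that takes no value.

```python
import sys


def avatar_command(name, image=None, video=None, frames=None, bbox_shift=None,
                   device=None, overwrite=False):
    """Build an argv list for setup/avatar_creation.py from the documented flags."""
    if (image is None) == (video is None):
        raise ValueError("pass exactly one of image or video")
    cmd = [sys.executable, "setup/avatar_creation.py", "--name", name]
    cmd += ["--image", image] if image else ["--video", video]
    if frames is not None:
        cmd += ["--frames", str(frames)]
    if bbox_shift is not None:
        cmd += ["--bbox-shift", str(bbox_shift)]
    if device is not None:
        cmd += ["--device", device]
    if overwrite:
        cmd.append("--overwrite")  # assumed to be a switch with no value
    return cmd


if __name__ == "__main__":
    # Print (rather than run) one command per candidate shift so they can be reviewed first.
    for shift in (0, 5, 10):
        cmd = avatar_command(f"sophy_shift_{shift}",
                             image="frontend/public/Sophy.png", bbox_shift=shift)
        print(" ".join(cmd))
```

To execute a command instead of printing it, pass the list to `subprocess.run(cmd, check=True)` from the repo root with the `avatar` environment active.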
|
|