# 🎤 Learnable-Speech Training Quick Start Guide

This guide will help you train the Learnable-Speech model from scratch and deploy it on Hugging Face.

## 📋 Prerequisites

1. **Hardware Requirements:**
   - GPU with at least 8GB VRAM (16GB+ recommended)
   - 32GB+ RAM
   - 100GB+ free storage space
2. **Software Requirements** (quick check below):
   - Python 3.10+
   - CUDA 11.8+
   - PyTorch 2.0+
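
A one-liner environment check can confirm the software stack before you start. This is a generic sanity check, not part of the repository's tooling:

```python
# Quick environment sanity check (generic; not part of the repo's scripts).
import sys
import torch

print(f"Python  : {sys.version.split()[0]}")     # expect 3.10+
print(f"PyTorch : {torch.__version__}")          # expect 2.0+
print(f"CUDA OK : {torch.cuda.is_available()}")  # expect True
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU     : {torch.cuda.get_device_name(0)}")
    print(f"VRAM    : {props.total_memory / 1024**3:.1f} GB")  # expect 8 GB+
```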

## 🚀 Step-by-Step Training Process

### Step 1: Environment Setup

```bash
# Clone the repository
git clone https://github.com/primepake/learnable-speech.git
cd learnable-speech

# Install dependencies
pip install -r requirements.txt

# Install S3Tokenizer
cd speech/tools/S3Tokenizer
pip install .
cd ../../..
```

### Step 2: Download Prerequisites

```bash
# Make scripts executable
chmod +x scripts/*.sh

# Download pretrained models
./scripts/download_pretrained.sh
```

### Step 3: Prepare Your Dataset

```bash
# Organize your dataset like this:
# dataset_root/
# ├── speaker1_001.wav
# ├── speaker1_001.txt
# ├── speaker1_002.wav
# ├── speaker1_002.txt
# └── ...

# Point DATASET_ROOT and OUTPUT_DIR at your data
export DATASET_ROOT="/path/to/your/dataset"
export OUTPUT_DIR="/path/to/processed/data"

# Run data preparation
./scripts/prepare_data.sh
```
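
Before running preparation, it can help to confirm that every audio file has a matching transcript. A minimal illustrative check (not part of the repository's scripts):

```python
# Check that every .wav under DATASET_ROOT has a sibling .txt transcript.
# Illustrative helper; not part of the repository's tooling.
import os
from pathlib import Path

dataset_root = Path(os.environ["DATASET_ROOT"])
wavs = sorted(dataset_root.glob("*.wav"))
missing = [w for w in wavs if not w.with_suffix(".txt").exists()]

print(f"{len(wavs)} wav files, {len(missing)} missing transcripts")
for w in missing:
    print(f"  no transcript for {w.name}")
```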

### Step 4: Train the Models

```bash
# Option A: Train full pipeline (recommended)
./scripts/train_full_pipeline.sh

# Option B: Train stages separately
./speech/llm_run.sh    # Stage 1: LLM
./speech/flow_run.sh   # Stage 2: Flow
```

### Step 5: Upload to Hugging Face

```bash
# Get your HF token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"

# Upload trained models
python scripts/upload_to_hf.py \
  --checkpoint_dir ./checkpoints \
  --username your_hf_username \
  --models both
```
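
If you prefer to push a checkpoint directory by hand instead of using `scripts/upload_to_hf.py`, the `huggingface_hub` client can do it directly. A minimal sketch, reusing the repo naming and checkpoint layout shown elsewhere in this guide:

```python
# Manual upload via huggingface_hub (alternative to scripts/upload_to_hf.py).
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])

# Repo id follows the naming convention used in Step 6 below.
repo_id = "your_username/learnable-speech-llm"
api.create_repo(repo_id, exist_ok=True)

# Push everything under the LLM checkpoint directory.
api.upload_folder(
    repo_id=repo_id,
    folder_path="./checkpoints/llm",
    commit_message="Upload trained LLM checkpoint",
)
```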

### Step 6: Update Gradio App

```python
# Update app.py to use your trained models
from huggingface_hub import hf_hub_download
import torch

# Download your trained models
llm_path = hf_hub_download(
    repo_id="your_username/learnable-speech-llm",
    filename="pytorch_model.bin"
)
flow_path = hf_hub_download(
    repo_id="your_username/learnable-speech-flow",
    filename="pytorch_model.bin"
)

# Load the downloaded state dicts onto CPU first
llm_state = torch.load(llm_path, map_location="cpu")
flow_state = torch.load(flow_path, map_location="cpu")

# Load and use the models in your synthesis function
def synthesize_speech(text, speaker_id=0):
    # Replace placeholder with actual model inference
    # ... your inference code here ...
    pass
```
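
Once `synthesize_speech` returns audio, wiring it into a Gradio UI looks roughly like this (a sketch assuming a standard `gr.Interface` layout; adapt it to however the existing app.py builds its interface):

```python
# Minimal Gradio wiring sketch; adjust to match the existing app.py layout.
import gradio as gr

demo = gr.Interface(
    fn=synthesize_speech,  # defined in the snippet above
    inputs=[
        gr.Textbox(label="Text to synthesize"),
        gr.Number(label="Speaker ID", value=0, precision=0),
    ],
    outputs=gr.Audio(label="Generated speech"),
    title="Learnable-Speech",
)

if __name__ == "__main__":
    demo.launch()
```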

## 🎯 Training Configurations

### For Different Environments

1. **Local Development (Single GPU):**

   ```bash
   export CUDA_VISIBLE_DEVICES="0"
   python speech/train.py --config speech/config.yaml --model llm ...
   ```

2. **Multi-GPU Training** (see the DDP sketch after this list):

   ```bash
   export CUDA_VISIBLE_DEVICES="0,1,2,3"
   torchrun --nproc_per_node=4 speech/train.py ...
   ```

3. **Cloud Training (Google Colab/Kaggle):**

   ```bash
   # Use config_hf.yaml for resource-constrained environments
   !python speech/train.py --config speech/config_hf.yaml ...
   ```

4. **Hugging Face Spaces:**

   ```bash
   # For direct training on HF infrastructure
   python speech/train.py --config speech/config_hf.yaml --timeout 1800 ...
   ```
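
For reference, `torchrun` launches one process per GPU and sets environment variables such as `LOCAL_RANK` that the training script consumes. A generic sketch of the standard PyTorch DDP setup (assuming `speech/train.py` follows this common pattern; the repo's script may differ in details):

```python
# Standard DDP initialization driven by torchrun's environment variables.
# Generic pattern; speech/train.py may differ in details.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    dist.init_process_group(backend="nccl")     # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```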

## 📊 Monitoring Training

1. **Comet ML (Recommended)** (manual-logging sketch after this list):

   ```bash
   # Set up Comet ML for experiment tracking
   export COMET_API_KEY="your_api_key"
   # Training will automatically log to Comet
   ```

2. **TensorBoard:**

   ```bash
   tensorboard --logdir ./tensorboard
   ```

3. **Command Line:**

   ```bash
   # Monitor log files
   tail -f checkpoints/llm/train.log
   ```
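
The training scripts log to Comet automatically once `COMET_API_KEY` is set; for ad-hoc experiments you can also log metrics by hand. A generic `comet_ml` usage sketch (not specific to this repo; the project name is a placeholder):

```python
# Manual Comet ML logging; the repo's training scripts do this automatically.
from comet_ml import Experiment

# Reads COMET_API_KEY from the environment when api_key is not passed.
experiment = Experiment(project_name="learnable-speech")

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric
    experiment.log_metric("train/loss", loss, step=step)

experiment.end()
```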

## 🔧 Troubleshooting

### Common Issues

1. **Out of Memory** (see the AMP/accumulation sketch after this list):
   - Reduce the batch size in the config
   - Use gradient accumulation
   - Enable mixed precision training (`--use_amp`)
2. **Slow Training:**
   - Increase `num_workers` for data loading
   - Use multiple GPUs with DDP
   - Optimize data preprocessing
3. **Model Not Converging:**
   - Check the learning rate
   - Verify data preprocessing
   - Use pretrained checkpoints
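
Mixed precision and gradient accumulation combine like this in stock PyTorch. This is a generic sketch of what `--use_amp` and `--accum_grad` correspond to conceptually, with a toy model and data so it runs; it is not the repo's exact training loop:

```python
# Generic AMP + gradient-accumulation loop (conceptual counterpart of
# --use_amp and --accum_grad; not the repo's exact code).
import torch
import torch.nn as nn

# Toy setup so the loop below runs; replace with the real model/data.
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(10)]
loss_fn = nn.MSELoss()

scaler = torch.cuda.amp.GradScaler()
accum_grad = 2  # mirrors --accum_grad 2

for i, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y) / accum_grad  # scale for accumulation
    scaler.scale(loss).backward()                 # fp16-safe backward
    if (i + 1) % accum_grad == 0:
        scaler.step(optimizer)                    # unscales grads, then steps
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```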

### Performance Tips

1. **Data Loading Optimization** (see the DataLoader sketch after this list):

   ```yaml
   # In config.yaml
   num_workers: 24
   prefetch: 100
   pin_memory: true
   ```

2. **Memory Optimization:**

   ```bash
   # Use mixed precision and gradient accumulation
   --use_amp --accum_grad 2
   ```

3. **Speed Optimization:**

   ```bash
   # Compile model for faster training (PyTorch 2.0+)
   export TORCH_COMPILE=1
   ```
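
The YAML keys in tip 1 map onto the standard `torch.utils.data.DataLoader` arguments roughly as follows. This is a sketch: I'm assuming the config's `prefetch` plays the role of `prefetch_factor`, and the repo's loader may name or wire things differently:

```python
# Plausible mapping from config.yaml keys to DataLoader arguments
# (assumption: `prefetch` relates to prefetch_factor; repo may differ).
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16))  # stand-in for the real dataset
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=24,      # num_workers in config.yaml
    prefetch_factor=4,   # assumed counterpart of `prefetch` (per-worker batches)
    pin_memory=True,     # pin_memory in config.yaml
    shuffle=True,
)
```

For tip 3, the corresponding stock PyTorch 2.0 call is `model = torch.compile(model)`; the `TORCH_COMPILE=1` environment variable is how this guide's scripts appear to toggle it.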

## 📈 Expected Training Times

| Configuration   | LLM Training | Flow Training | Total      |
|-----------------|--------------|---------------|------------|
| Single RTX 4090 | 2-3 days     | 1-2 days      | 3-5 days   |
| 4x RTX 4090     | 12-18 hours  | 6-12 hours    | 1-2 days   |
| 8x A100         | 6-8 hours    | 3-4 hours     | 9-12 hours |

## 🎉 Success Criteria

Your training is successful when:

1. **LLM Stage:** Perplexity < 2.0, token accuracy > 95% (see the perplexity note below)
2. **Flow Stage:** Reconstruction loss < 0.1, mel spectral loss < 0.05
3. **Audio Quality:** Generated samples sound natural and intelligible
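
Perplexity is the exponential of the mean per-token cross-entropy, so you can read it off the training loss directly: any mean cross-entropy below ln 2 ≈ 0.693 nats meets the < 2.0 target. A small illustrative computation:

```python
# Perplexity = exp(mean cross-entropy per token).
import math

mean_token_cross_entropy = 0.65  # example value read from a training log
perplexity = math.exp(mean_token_cross_entropy)
print(f"perplexity = {perplexity:.2f}")  # ≈ 1.92, under the < 2.0 target
```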

## 📚 Additional Resources