# 🎤 Learnable-Speech Training Quick Start Guide

This guide walks through training the Learnable-Speech model from scratch and deploying it on Hugging Face.

## 📋 Prerequisites

1. **Hardware Requirements**:
   - GPU with at least 8GB VRAM (16GB+ recommended)
   - 32GB+ RAM
   - 100GB+ storage space

2. **Software Requirements**:
   - Python 3.10+
   - CUDA 11.8+
   - PyTorch 2.0+
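
A quick preflight check can confirm the requirements above before committing to a multi-day run. The sketch below is illustrative only: the function name `check_prerequisites` and its defaults are assumptions taken from the list above, not part of the repository, and the PyTorch checks are skipped when PyTorch is not installed.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(path=".", min_free_gb=100, min_python=(3, 10)):
    """Return a dict of pass/fail flags for the requirements listed above."""
    results = {}
    # The guide asks for Python 3.10 or newer.
    results["python"] = sys.version_info[:2] >= min_python
    # Free space on the volume that holds `path`, in gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    results["storage"] = free_gb >= min_free_gb
    # PyTorch is inspected only if it is installed; otherwise both flags stay False.
    results["torch"] = False
    results["cuda"] = False
    if importlib.util.find_spec("torch") is not None:
        import torch

        major = int(torch.__version__.split(".")[0])
        results["torch"] = major >= 2
        results["cuda"] = bool(torch.cuda.is_available())
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'NOT MET'}")
```

The check does not measure VRAM or system RAM; those are easier to read from `nvidia-smi` and the operating system.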

## 🚀 Step-by-Step Training Process

### Step 1: Environment Setup

```bash
# Clone the repository
git clone https://github.com/primepake/learnable-speech.git
cd learnable-speech

# Install dependencies
pip install -r requirements.txt

# Install S3Tokenizer
cd speech/tools/S3Tokenizer
pip install .
cd ../../..
```

### Step 2: Download Prerequisites

```bash
# Make scripts executable
chmod +x scripts/*.sh

# Download pretrained models
./scripts/download_pretrained.sh
```

### Step 3: Prepare Your Dataset

```bash
# Organize your dataset like this:
# dataset_root/
# ├── speaker1_001.wav
# ├── speaker1_001.txt
# ├── speaker1_002.wav
# ├── speaker1_002.txt
# └── ...

# Set DATASET_ROOT and OUTPUT_DIR (if scripts/prepare_data.sh hardcodes them, edit the script instead)
export DATASET_ROOT="/path/to/your/dataset"
export OUTPUT_DIR="/path/to/processed/data"

# Run data preparation
./scripts/prepare_data.sh
```
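
Before running data preparation, it can help to confirm that every audio file has a transcript and vice versa. The following is a minimal sketch that assumes the flat layout shown above (one `.wav` and one `.txt` per utterance, sharing a file stem); `find_unpaired` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path


def find_unpaired(dataset_root):
    """Return (wavs_without_txt, txts_without_wav) as sorted lists of file stems."""
    root = Path(dataset_root)
    wav_stems = {p.stem for p in root.glob("*.wav")}
    txt_stems = {p.stem for p in root.glob("*.txt")}
    return sorted(wav_stems - txt_stems), sorted(txt_stems - wav_stems)


def report(dataset_root):
    """Print a short summary of any mismatched files."""
    missing_txt, missing_wav = find_unpaired(dataset_root)
    for stem in missing_txt:
        print(f"{stem}.wav has no transcript")
    for stem in missing_wav:
        print(f"{stem}.txt has no audio")
    if not missing_txt and not missing_wav:
        print("All utterances are paired.")
```

Calling `report("/path/to/your/dataset")` before `./scripts/prepare_data.sh` surfaces missing pairs early.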

### Step 4: Train the Models

```bash
# Option A: Train full pipeline (recommended)
./scripts/train_full_pipeline.sh

# Option B: Train stages separately
./speech/llm_run.sh    # Stage 1: LLM
./speech/flow_run.sh   # Stage 2: Flow
```

### Step 5: Upload to Hugging Face

```bash
# Get your HF token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"

# Upload trained models
python scripts/upload_to_hf.py \
  --checkpoint_dir ./checkpoints \
  --username your_hf_username \
  --models both
```
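
For readers who want to see what an upload step of this kind involves, here is one plausible shape using the `huggingface_hub` client. It is a sketch under assumptions, not the contents of `scripts/upload_to_hf.py`: the per-stage subfolders (`checkpoints/llm`, `checkpoints/flow`) and the helper names `plan_uploads` and `upload` are hypothetical, while the repository names follow the ones used in Step 6.

```python
from pathlib import Path


def plan_uploads(checkpoint_dir, username, models="both"):
    """Map each selected stage to (local folder, Hub repo id)."""
    if models not in ("llm", "flow", "both"):
        raise ValueError(f"unknown models value: {models!r}")
    stages = ("llm", "flow") if models == "both" else (models,)
    return [
        (str(Path(checkpoint_dir) / stage), f"{username}/learnable-speech-{stage}")
        for stage in stages
    ]


def upload(plan, token):
    """Create each repo if it does not exist and push the folder contents."""
    # Imported lazily so planning works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    for folder, repo_id in plan:
        api.create_repo(repo_id=repo_id, exist_ok=True)
        api.upload_folder(folder_path=folder, repo_id=repo_id)
```

Separating the plan from the network calls makes it easy to print and review the target repositories before anything is pushed.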

### Step 6: Update Gradio App

```python
# Update app.py to use your trained models
from huggingface_hub import hf_hub_download
import torch

# Download your trained models
llm_path = hf_hub_download(
    repo_id="your_username/learnable-speech-llm",
    filename="pytorch_model.bin"
)
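flow_path = hf_hub_download(
    repo_id="your_username/learnable-speech-flow",
    filename="pytorch_model.bin"
)

# Load the checkpoints on CPU first; move weights to the GPU after building the models
llm_state = torch.load(llm_path, map_location="cpu")
flow_state = torch.load(flow_path, map_location="cpu")

def synthesize_speech(text, speaker_id=0):
    # Replace this placeholder with actual model inference
    # ... your inference code here ...
    raise NotImplementedError("Wire the trained models into this function")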
```

## 🎯 Training Configurations

### For Different Environments

1. **Local Development** (Single GPU):

   ```bash
   export CUDA_VISIBLE_DEVICES="0"
   python speech/train.py --config speech/config.yaml --model llm ...
   ```

2. **Multi-GPU Training**:

   ```bash
   export CUDA_VISIBLE_DEVICES="0,1,2,3"
   torchrun --nproc_per_node=4 speech/train.py ...
   ```

3. **Cloud Training** (Google Colab/Kaggle):

   ```python
   # Use config_hf.yaml for resource-constrained environments
   !python speech/train.py --config speech/config_hf.yaml ...
   ```

4. **Hugging Face Spaces**:

   ```bash
   # For direct training on HF infrastructure
   python speech/train.py --config speech/config_hf.yaml --timeout 1800 ...
   ```
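
The single-GPU and multi-GPU cases above differ only in the launcher, so the choice can be derived from `CUDA_VISIBLE_DEVICES`. The sketch below shows that idea; `build_launch_command` is a hypothetical helper, and the training arguments elided in the commands above must still be supplied by the caller.

```python
import shlex


def build_launch_command(cuda_visible_devices, train_args):
    """Return the argv list for single- or multi-GPU training."""
    gpu_ids = [g for g in cuda_visible_devices.split(",") if g.strip()]
    if len(gpu_ids) <= 1:
        # One visible GPU (or none): plain Python launch.
        launcher = ["python"]
    else:
        # Several GPUs: one torchrun worker process per visible device.
        launcher = ["torchrun", f"--nproc_per_node={len(gpu_ids)}"]
    return launcher + ["speech/train.py"] + list(train_args)


cmd = build_launch_command("0,1,2,3", ["--config", "speech/config.yaml"])
print(shlex.join(cmd))
# torchrun --nproc_per_node=4 speech/train.py --config speech/config.yaml
```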

## 📊 Monitoring Training

1. **Comet ML** (Recommended):

   ```bash
   # Set up Comet ML for experiment tracking
   export COMET_API_KEY="your_api_key"
   # Training will automatically log to Comet
   ```

2. **TensorBoard**:

   ```bash
   tensorboard --logdir ./tensorboard
   ```

3. **Command Line**:

   ```bash
   # Monitor log files
   tail -f checkpoints/llm/train.log
   ```

## 🔧 Troubleshooting

### Common Issues

1. **Out of Memory**:
   - Reduce batch size in config
   - Use gradient accumulation
   - Enable mixed precision training (`--use_amp`)

2. **Slow Training**:
   - Increase num_workers for data loading
   - Use multiple GPUs with DDP
   - Optimize data preprocessing

3. **Model Not Converging**:
   - Check learning rate
   - Verify data preprocessing
   - Use pretrained checkpoints
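
When reducing the batch size to fix out-of-memory errors, gradient accumulation can keep the effective batch size unchanged so that the learning rate does not need retuning. The arithmetic can be sketched as follows; the helper names are illustrative, and the calculation assumes a fixed per-GPU batch size, which may not hold if the config uses dynamic batching.

```python
def effective_batch_size(per_gpu_batch, accum_grad=1, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return per_gpu_batch * accum_grad * num_gpus


def rebalance_for_memory(per_gpu_batch, accum_grad, factor=2):
    """Shrink the per-GPU batch by `factor` and grow accumulation to match."""
    if per_gpu_batch % factor != 0:
        raise ValueError("per_gpu_batch must be divisible by factor")
    return per_gpu_batch // factor, accum_grad * factor


# Halving a batch of 16 while doubling accumulation keeps 16 samples per step.
new_batch, new_accum = rebalance_for_memory(16, 1)
print(new_batch, new_accum)                        # 8 2
print(effective_batch_size(new_batch, new_accum))  # 16
```

The trade-off is wall-clock time: each optimizer step now takes two forward and backward passes instead of one.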

### Performance Tips

1. **Data Loading Optimization**:

   ```yaml
   # In config.yaml
   num_workers: 24
   prefetch: 100
   pin_memory: true
   ```

2. **Memory Optimization**:

   ```bash
   # Mixed precision plus gradient accumulation (append these flags to the training command)
   --use_amp --accum_grad 2
   ```

3. **Speed Optimization**:

   ```bash
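   # Compile the model for faster training (PyTorch 2.0+).
   # Assumes the training script checks TORCH_COMPILE; PyTorch itself does not read this variable.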
   export TORCH_COMPILE=1
   ```
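
The `num_workers: 24` value in the data loading settings above presumes a machine with plenty of CPU cores. Each training process starts its own set of loader workers, so on multi-GPU runs the total is `num_workers` times the number of GPUs. The sketch below is one simple heuristic for capping the value, not a rule from the repository; `suggest_num_workers` and its `reserve` parameter are hypothetical.

```python
import os


def suggest_num_workers(requested=24, num_gpus=1, reserve=1):
    """Cap the requested worker count by the CPU cores available per training process."""
    cores = os.cpu_count() or 1
    # Leave `reserve` cores for the main processes, then split the rest across GPUs.
    per_process = max(1, (cores - reserve) // max(1, num_gpus))
    return min(requested, per_process)


if __name__ == "__main__":
    print("suggested num_workers:", suggest_num_workers(24, num_gpus=1))
```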

## 📈 Expected Training Times

| Configuration | LLM Training | Flow Training | Total |
|---------------|--------------|---------------|-------|
| Single RTX 4090 | 2-3 days | 1-2 days | 3-5 days |
| 4x RTX 4090 | 12-18 hours | 6-12 hours | 1-2 days |
| 8x A100 | 6-8 hours | 3-4 hours | 9-12 hours |
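
The Total column is the sum of the two stage ranges, rounded to convenient units. As a worked check, using the 4x RTX 4090 row:

```python
def total_range(*stages):
    """Add (low, high) estimates for each stage into one (low, high) total."""
    return sum(lo for lo, _ in stages), sum(hi for _, hi in stages)


llm_hours, flow_hours = (12, 18), (6, 12)  # 4x RTX 4090 row, in hours
lo, hi = total_range(llm_hours, flow_hours)
print(lo, hi)            # 18 30
print(lo / 24, hi / 24)  # 0.75 1.25
```

The exact sum is 18 to 30 hours, so the table's "1-2 days" is a rounded-up figure rather than a precise total.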

## 🎉 Success Criteria

Your training is successful when:

1. **LLM Stage**: Perplexity < 2.0, Token accuracy > 95%
2. **Flow Stage**: Reconstruction loss < 0.1, Mel spectral loss < 0.05
3. **Audio Quality**: Generated samples sound natural and intelligible
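
The numeric criteria above can be checked automatically at the end of a run. The sketch below assumes perplexity is defined as the exponential of the mean per-token cross-entropy in nats, in which case "perplexity < 2.0" is the same as a loss below ln 2 ≈ 0.693. The metric key names are hypothetical and need to be mapped to whatever the training logs actually report.

```python
import math


def perplexity(mean_nll_nats):
    """Perplexity from mean per-token negative log-likelihood (natural log)."""
    return math.exp(mean_nll_nats)


def check_success(metrics):
    """Apply the LLM and flow thresholds from the list above."""
    return {
        "llm": metrics["perplexity"] < 2.0 and metrics["token_accuracy"] > 0.95,
        "flow": metrics["reconstruction_loss"] < 0.1 and metrics["mel_loss"] < 0.05,
    }


example = {
    "perplexity": perplexity(0.60),  # about 1.82
    "token_accuracy": 0.96,
    "reconstruction_loss": 0.08,
    "mel_loss": 0.06,
}
print(check_success(example))  # {'llm': True, 'flow': False}
```

Audio quality still needs listening tests; no threshold here replaces that.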

## 📚 Additional Resources

- [Training Logs Analysis](docs/training_analysis.md)
- [Hyperparameter Tuning Guide](docs/hyperparameters.md)
- [Deployment Best Practices](docs/deployment.md)
- [Community Discord](https://discord.gg/learnable-speech)