# 🎤 Learnable-Speech Training Quick Start Guide

This guide will help you train the Learnable-Speech model from scratch and deploy it on Hugging Face.

## 📋 Prerequisites

1. **Hardware Requirements**:
   - GPU with at least 8GB VRAM (16GB+ recommended)
   - 32GB+ RAM
   - 100GB+ storage space

2. **Software Requirements**:
   - Python 3.10+
   - CUDA 11.8+
   - PyTorch 2.0+

## 🚀 Step-by-Step Training Process

### Step 1: Environment Setup

```bash
# Clone the repository
git clone https://github.com/primepake/learnable-speech.git
cd learnable-speech

# Install dependencies
pip install -r requirements.txt

# Install S3Tokenizer
cd speech/tools/S3Tokenizer
pip install .
cd ../../..
```

### Step 2: Download Prerequisites

```bash
# Make scripts executable
chmod +x scripts/*.sh

# Download pretrained models
./scripts/download_pretrained.sh
```

### Step 3: Prepare Your Dataset

```bash
# Organize your dataset like this:
# dataset_root/
# ├── speaker1_001.wav
# ├── speaker1_001.txt
# ├── speaker1_002.wav
# ├── speaker1_002.txt
# └── ...

# Update DATASET_ROOT in the script
export DATASET_ROOT="/path/to/your/dataset"
export OUTPUT_DIR="/path/to/processed/data"

# Run data preparation
./scripts/prepare_data.sh
```

### Step 4: Train the Models

```bash
# Option A: Train the full pipeline (recommended)
./scripts/train_full_pipeline.sh

# Option B: Train stages separately
./speech/llm_run.sh   # Stage 1: LLM
./speech/flow_run.sh  # Stage 2: Flow
```

### Step 5: Upload to Hugging Face

```bash
# Get your HF token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"

# Upload trained models
python scripts/upload_to_hf.py \
    --checkpoint_dir ./checkpoints \
    --username your_hf_username \
    --models both
```

### Step 6: Update Gradio App

```python
# Update app.py to use your trained models
from huggingface_hub import hf_hub_download
import torch

# Download your trained models
llm_path = hf_hub_download(
    repo_id="your_username/learnable-speech-llm",
    filename="pytorch_model.bin"
)
flow_path = hf_hub_download(
    repo_id="your_username/learnable-speech-flow",
    filename="pytorch_model.bin"
)

# Load and use the models in your synthesis function
def synthesize_speech(text, speaker_id=0):
    # Replace this placeholder with actual model inference
    # ... your inference code here ...
    pass
```
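To wire the synthesis function into a Gradio UI, a minimal sketch follows. Everything in it is an assumption rather than the repo's actual `app.py`: it reuses the `synthesize_speech` placeholder defined above, and the input labels and 24 kHz sample rate are stand-ins for whatever your inference code actually produces.

```python
# Hypothetical app.py wiring -- labels, speaker-ID input, and the 24 kHz
# sample rate are assumptions; adapt them to your real inference setup.
import gradio as gr

def tts_demo(text: str, speaker_id: float):
    # synthesize_speech (from the snippet above) should return a 1-D
    # numpy array of audio samples once the inference code is filled in
    audio = synthesize_speech(text, speaker_id=int(speaker_id))
    return (24000, audio)  # gr.Audio accepts a (sample_rate, waveform) tuple

demo = gr.Interface(
    fn=tts_demo,
    inputs=[gr.Textbox(label="Text"), gr.Number(label="Speaker ID", value=0)],
    outputs=gr.Audio(label="Synthesized speech"),
    title="Learnable-Speech",
)

if __name__ == "__main__":
    demo.launch()
```

Returning a `(sample_rate, numpy_array)` tuple is the standard output format `gr.Audio` accepts, so no intermediate file writes are needed.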
## 🎯 Training Configurations

### For Different Environments

1. **Local Development** (single GPU):

   ```bash
   export CUDA_VISIBLE_DEVICES="0"
   python speech/train.py --config speech/config.yaml --model llm ...
   ```

2. **Multi-GPU Training**:

   ```bash
   export CUDA_VISIBLE_DEVICES="0,1,2,3"
   torchrun --nproc_per_node=4 speech/train.py ...
   ```

3. **Cloud Training** (Google Colab/Kaggle):

   ```python
   # Use config_hf.yaml for resource-constrained environments
   !python speech/train.py --config speech/config_hf.yaml ...
   ```

4. **Hugging Face Spaces**:

   ```bash
   # For direct training on HF infrastructure
   python speech/train.py --config speech/config_hf.yaml --timeout 1800 ...
   ```

## 📊 Monitoring Training

1. **Comet ML** (recommended):

   ```bash
   # Set up Comet ML for experiment tracking
   export COMET_API_KEY="your_api_key"
   # Training will automatically log to Comet
   ```

2. **TensorBoard**:

   ```bash
   tensorboard --logdir ./tensorboard
   ```

3. **Command Line**:

   ```bash
   # Monitor log files
   tail -f checkpoints/llm/train.log
   ```

## 🔧 Troubleshooting

### Common Issues

1. **Out of Memory**:
   - Reduce the batch size in the config
   - Use gradient accumulation
   - Enable mixed-precision training (`--use_amp`)

2. **Slow Training**:
   - Increase `num_workers` for data loading
   - Use multiple GPUs with DDP
   - Optimize data preprocessing

3. **Model Not Converging**:
   - Check the learning rate
   - Verify data preprocessing
   - Start from pretrained checkpoints

### Performance Tips

1. **Data Loading Optimization**:

   ```yaml
   # In config.yaml
   num_workers: 24
   prefetch: 100
   pin_memory: true
   ```

2. **Memory Optimization**:

   ```bash
   # Add mixed precision and gradient accumulation to the train.py command
   --use_amp --accum_grad 2
   ```

3. **Speed Optimization**:

   ```bash
   # Compile the model for faster training (PyTorch 2.0+)
   export TORCH_COMPILE=1
   ```

## 📈 Expected Training Times

| Configuration | LLM Training | Flow Training | Total |
|---------------|--------------|---------------|-------|
| Single RTX 4090 | 2-3 days | 1-2 days | 3-5 days |
| 4x RTX 4090 | 12-18 hours | 6-12 hours | 1-2 days |
| 8x A100 | 6-8 hours | 3-4 hours | 9-12 hours |

## 🎉 Success Criteria

Your training is successful when:

1. **LLM Stage**: perplexity < 2.0 and token accuracy > 95% (a quick way to compute both is sketched at the end of this guide)
2. **Flow Stage**: reconstruction loss < 0.1 and mel spectral loss < 0.05
3. **Audio Quality**: generated samples sound natural and intelligible

## 📚 Additional Resources

- [Training Logs Analysis](docs/training_analysis.md)
- [Hyperparameter Tuning Guide](docs/hyperparameters.md)
- [Deployment Best Practices](docs/deployment.md)
- [Community Discord](https://discord.gg/learnable-speech)
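Finally, to make the LLM-stage success criteria concrete: perplexity is just the exponential of the mean token-level cross-entropy, so both numbers can be computed directly from a validation batch. Below is a minimal, self-contained sketch using random stand-in tensors; the vocabulary size is illustrative, not the project's actual speech-token codebook size.

```python
# Minimal sketch: deriving perplexity and token accuracy from model outputs.
# The tensors here are random stand-ins for real validation batches.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 4096, 128                       # illustrative sizes
logits = torch.randn(2, seq_len, vocab_size)          # stand-in model outputs
targets = torch.randint(0, vocab_size, (2, seq_len))  # stand-in target tokens

# Perplexity = exp(mean token-level cross-entropy)
ce = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
perplexity = torch.exp(ce).item()

# Token accuracy = fraction of argmax predictions matching the targets
accuracy = (logits.argmax(dim=-1) == targets).float().mean().item()

print(f"perplexity={perplexity:.2f}  token_accuracy={accuracy:.2%}")
# Targets from the Success Criteria section: perplexity < 2.0, accuracy > 95%
```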