Commit 248479c (parent: edfcfb2)

Add comprehensive training pipeline for Hugging Face deployment

Files changed:
- TRAINING_GUIDE.md (+222 -0)
- scripts/download_pretrained.sh (+17 -0)
- scripts/prepare_data.sh (+53 -0)
- scripts/train_full_pipeline.sh (+120 -0)
- scripts/training_configs.sh (+81 -0)
- scripts/upload_to_hf.py (+229 -0)
- speech/config_hf.yaml (+192 -0)
TRAINING_GUIDE.md
ADDED
@@ -0,0 +1,222 @@

# 🎤 Learnable-Speech Training Quick Start Guide

This guide will help you train the Learnable-Speech model from scratch and deploy it on Hugging Face.

## 📋 Prerequisites

1. **Hardware Requirements**:
   - GPU with at least 8GB VRAM (16GB+ recommended)
   - 32GB+ RAM
   - 100GB+ storage space

2. **Software Requirements**:
   - Python 3.10+
   - CUDA 11.8+
   - PyTorch 2.0+
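Before you begin, it can help to confirm the machine actually meets these requirements. Below is a minimal sketch that assumes only that PyTorch is installed; the 8GB threshold mirrors the hardware list above.

```python
# Environment sanity check -- a sketch, not part of the repo.
import sys
import torch

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
assert torch.cuda.is_available(), "No CUDA device found"

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"PyTorch {torch.__version__} | GPU: {props.name} | VRAM: {vram_gb:.1f} GB")
if vram_gb < 8:
    print("Warning: less than 8GB VRAM; training will likely run out of memory")
```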
## 🚀 Step-by-Step Training Process

### Step 1: Environment Setup

```bash
# Clone the repository
git clone https://github.com/primepake/learnable-speech.git
cd learnable-speech

# Install dependencies
pip install -r requirements.txt

# Install S3Tokenizer
cd speech/tools/S3Tokenizer
pip install .
cd ../../..
```

### Step 2: Download Prerequisites

```bash
# Make the scripts executable
chmod +x scripts/*.sh

# Download pretrained models
./scripts/download_pretrained.sh
```

### Step 3: Prepare Your Dataset

```bash
# Organize your dataset like this:
# dataset_root/
# ├── speaker1_001.wav
# ├── speaker1_001.txt
# ├── speaker1_002.wav
# ├── speaker1_002.txt
# └── ...

# Update DATASET_ROOT in the script
export DATASET_ROOT="/path/to/your/dataset"
export OUTPUT_DIR="/path/to/processed/data"

# Run data preparation
./scripts/prepare_data.sh
```
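Before kicking off training, a quick pairing check can catch missing transcriptions early. A minimal sketch that assumes only the `.wav`/`.txt` naming convention shown above:

```python
# Dataset pairing check -- illustrative only.
import os
from pathlib import Path

dataset_root = Path(os.environ.get("DATASET_ROOT", "/path/to/your/dataset"))
wavs = sorted(dataset_root.glob("*.wav"))
missing = [w for w in wavs if not w.with_suffix(".txt").exists()]

print(f"{len(wavs)} wav files found, {len(missing)} missing transcriptions")
for w in missing[:10]:  # show the first few offenders
    print(f"  no transcription for: {w.name}")
```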
### Step 4: Train the Models

```bash
# Option A: Train the full pipeline (recommended)
./scripts/train_full_pipeline.sh

# Option B: Train the stages separately
./speech/llm_run.sh   # Stage 1: LLM
./speech/flow_run.sh  # Stage 2: Flow
```

### Step 5: Upload to Hugging Face

```bash
# Get your HF token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"

# Upload the trained models
python scripts/upload_to_hf.py \
    --checkpoint_dir ./checkpoints \
    --username your_hf_username \
    --models both
```

### Step 6: Update Gradio App

```python
# Update app.py to use your trained models
from huggingface_hub import hf_hub_download
import torch

# Download your trained models
llm_path = hf_hub_download(
    repo_id="your_username/learnable-speech-llm",
    filename="pytorch_model.bin"
)
flow_path = hf_hub_download(
    repo_id="your_username/learnable-speech-flow",
    filename="pytorch_model.bin"
)

# Load and use the models in your synthesis function
def synthesize_speech(text, speaker_id=0):
    # Replace this placeholder with actual model inference
    # ... your inference code here ...
    pass
```
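The repository's app.py is not shown in this commit, so the following is only a sketch of how `synthesize_speech` might be wired into a Gradio interface; the 24kHz sample rate matches the model card, and everything else is an assumption.

```python
# Hypothetical Gradio wiring -- synthesize_speech is the placeholder above.
import gradio as gr
import numpy as np

def tts_demo(text: str):
    audio = synthesize_speech(text)
    if audio is None:  # the placeholder still returns nothing
        audio = np.zeros(24000, dtype=np.float32)
    # Gradio's Audio component accepts a (sample_rate, waveform) tuple
    return (24000, audio)

demo = gr.Interface(
    fn=tts_demo,
    inputs=gr.Textbox(label="Text to synthesize"),
    outputs=gr.Audio(label="Generated speech"),
)
demo.launch()
```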
## 🎯 Training Configurations

### For Different Environments:

1. **Local Development** (Single GPU):
```bash
export CUDA_VISIBLE_DEVICES="0"
python speech/train.py --config speech/config.yaml --model llm ...
```

2. **Multi-GPU Training**:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
torchrun --nproc_per_node=4 speech/train.py ...
```

3. **Cloud Training** (Google Colab/Kaggle):
```python
# Use config_hf.yaml for resource-constrained environments
!python speech/train.py --config speech/config_hf.yaml ...
```

4. **Hugging Face Spaces**:
```bash
# For direct training on HF infrastructure
python speech/train.py --config speech/config_hf.yaml --timeout 1800 ...
```

## 📊 Monitoring Training

1. **Comet ML** (Recommended):
```bash
# Set up Comet ML for experiment tracking
export COMET_API_KEY="your_api_key"
# Training will automatically log to Comet
```

2. **TensorBoard**:
```bash
tensorboard --logdir ./tensorboard
```

3. **Command Line**:
```bash
# Monitor the log files
tail -f checkpoints/llm/train.log
```

## 🔧 Troubleshooting

### Common Issues:

1. **Out of Memory**:
   - Reduce the batch size in the config
   - Use gradient accumulation
   - Enable mixed-precision training (`--use_amp`)

2. **Slow Training**:
   - Increase num_workers for data loading
   - Use multiple GPUs with DDP
   - Optimize data preprocessing

3. **Model Not Converging**:
   - Check the learning rate
   - Verify data preprocessing
   - Use pretrained checkpoints

### Performance Tips:

1. **Data Loading Optimization**:
```yaml
# In config.yaml
num_workers: 24
prefetch: 100
pin_memory: true
```

2. **Memory Optimization**:
```bash
# Use mixed precision and gradient accumulation
--use_amp --accum_grad 2
```

3. **Speed Optimization**:
```bash
# Compile the model for faster training (PyTorch 2.0+)
export TORCH_COMPILE=1
```

## 📈 Expected Training Times

| Configuration | LLM Training | Flow Training | Total |
|---------------|--------------|---------------|-------|
| Single RTX 4090 | 2-3 days | 1-2 days | 3-5 days |
| 4x RTX 4090 | 12-18 hours | 6-12 hours | 1-2 days |
| 8x A100 | 6-8 hours | 3-4 hours | 9-12 hours |

## 🎉 Success Criteria

Your training is successful when:

1. **LLM Stage**: Perplexity < 2.0, token accuracy > 95%
2. **Flow Stage**: Reconstruction loss < 0.1, mel spectral loss < 0.05
3. **Audio Quality**: Generated samples sound natural and intelligible

## 📚 Additional Resources

- [Training Logs Analysis](docs/training_analysis.md)
- [Hyperparameter Tuning Guide](docs/hyperparameters.md)
- [Deployment Best Practices](docs/deployment.md)
- [Community Discord](https://discord.gg/learnable-speech)
scripts/download_pretrained.sh
ADDED
@@ -0,0 +1,17 @@

#!/bin/bash

# Create the pretrained models directory
mkdir -p pretrained_models/CosyVoice2-0.5B

echo "Downloading CosyVoice2 pretrained models..."

# Download the CosyVoice2 checkpoints from the official Hugging Face release
wget -O pretrained_models/CosyVoice2-0.5B/llm.pt "https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B/resolve/main/llm.pt"
wget -O pretrained_models/CosyVoice2-0.5B/flow.pt "https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B/resolve/main/flow.pt"

# The Qwen pretrained model must be fetched separately
mkdir -p pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN
echo "Download the Qwen model manually from: https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B"

echo "Pretrained models downloaded!"
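If wget is unavailable, the same files can be fetched with huggingface_hub. A sketch; the filenames mirror the wget URLs above, but verify them against the repo before relying on this:

```python
# Alternative download via huggingface_hub -- illustrative only.
from huggingface_hub import hf_hub_download

for filename in ("llm.pt", "flow.pt"):
    path = hf_hub_download(
        repo_id="FunAudioLLM/CosyVoice2-0.5B",
        filename=filename,
        local_dir="pretrained_models/CosyVoice2-0.5B",
    )
    print(f"Downloaded {filename} -> {path}")
```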
scripts/prepare_data.sh
ADDED
@@ -0,0 +1,53 @@

#!/bin/bash

# Data preparation pipeline for Learnable-Speech training

echo "=== Learnable-Speech Data Preparation Pipeline ==="

# Configuration
DATASET_ROOT="/path/to/your/dataset"   # Change this to your dataset path
OUTPUT_DIR="/path/to/processed/data"   # Change this to your output path

# Create the output directories
mkdir -p "$OUTPUT_DIR"/{fsq,dac_latents,lists}

echo "Step 1: Extract FSQ tokens using S3Tokenizer..."
cd speech/tools/S3Tokenizer
pip install .

# Extract FSQ tokens (25Hz)
torchrun --nproc_per_node=4 --nnodes=1 --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    "$(which s3tokenizer)" \
    --root_path "$DATASET_ROOT" \
    --model speech_tokenizer_v2_25hz \
    --device "cuda" \
    --batch_size 64 \
    --file_list ../../../files_test.txt \
    --skip_existing

echo "Step 2: Extract DAC-VAE latents..."
cd ../../../dac-vae

# Download the DAC-VAE checkpoint
wget -O checkpoint.pt "https://github.com/primepake/learnable-speech/releases/download/dac-vae/dac_vae_checkpoint.pt"

# Extract DAC latents
python extract_dac_latents.py \
    --checkpoint checkpoint.pt \
    --config configs/config.yml \
    --root_path "$DATASET_ROOT" \
    --output_dir "$OUTPUT_DIR/dac_latents"

echo "Step 3: Create data lists..."
cd ../speech
python tools/create_data_list.py \
    --src_dir "$OUTPUT_DIR" \
    --output_dir "$OUTPUT_DIR/lists"

echo "Data preparation completed!"
echo "Your dataset should now have:"
echo "  - Original audio files (.wav)"
echo "  - Text transcriptions (.txt)"
echo "  - FSQ tokens (*_fsq.pt)"
echo "  - DAC latents (*_latent.pt)"
echo "  - Data list files"
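After the script finishes, one of the extracted tensors can be loaded to confirm the outputs are readable. A sketch; the file naming follows the pattern echoed by the script, and tensor shapes depend on the extractors:

```python
# Quick inspection of an extracted DAC latent -- illustrative only.
import torch
from pathlib import Path

sample = next(Path("/path/to/processed/data/dac_latents").glob("*_latent.pt"))
latent = torch.load(sample, map_location="cpu")
print(sample.name, getattr(latent, "shape", type(latent)))
```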
scripts/train_full_pipeline.sh
ADDED
@@ -0,0 +1,120 @@

#!/bin/bash

# Complete Learnable-Speech Training Pipeline
# This script trains both the LLM and Flow models sequentially

set -e  # Exit on any error

echo "🎤 Starting Learnable-Speech Training Pipeline"
echo "=============================================="

# Configuration (override via environment variables)
DATASET_ROOT="${DATASET_ROOT:-/data/dataset}"
CHECKPOINT_DIR="${CHECKPOINT_DIR:-./checkpoints}"
PRETRAINED_DIR="${PRETRAINED_DIR:-./pretrained_models/CosyVoice2-0.5B}"
NUM_GPUS="${NUM_GPUS:-4}"
BATCH_SIZE="${BATCH_SIZE:-32}"

# Create the checkpoint directories
mkdir -p "$CHECKPOINT_DIR"/{llm,flow}

# Check prerequisites
echo "📋 Checking prerequisites..."
if [ ! -d "$PRETRAINED_DIR" ]; then
    echo "❌ Pretrained models not found. Please run scripts/download_pretrained.sh first"
    exit 1
fi

if [ ! -f "./data/train.list" ]; then
    echo "❌ Training data not found. Please run scripts/prepare_data.sh first"
    exit 1
fi

# Set the environment
export CUDA_VISIBLE_DEVICES="0,1,2,3"  # Adjust as needed
export PYTHONPATH=$(pwd):$PYTHONPATH

echo "🚀 Starting Stage 1: LLM Training (BPE → FSQ tokens)"
echo "=================================================="

# With `set -e` active, a bare `$?` check after a failed command is never
# reached, so each training command is wrapped in `if` to keep the error path.
if torchrun --nnodes=1 --nproc_per_node=$NUM_GPUS --rdzv_id=1986 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    speech/train.py \
    --train_engine torch_ddp \
    --config speech/config.yaml \
    --train_data ./data/train.list \
    --cv_data ./data/val.list \
    --qwen_pretrain_path "$PRETRAINED_DIR/CosyVoice-BlankEN" \
    --model llm \
    --model_dir "$CHECKPOINT_DIR/llm/" \
    --num_workers 24 \
    --prefetch 100 \
    --use_amp \
    --pretrained_model "$PRETRAINED_DIR/llm.pt" \
    --comet_project "learnable-speech" \
    --comet_experiment_name "llm-training-$(date +%Y%m%d-%H%M%S)"; then
    echo "✅ Stage 1 (LLM) training completed successfully!"
else
    echo "❌ Stage 1 (LLM) training failed!"
    exit 1
fi

echo "🚀 Starting Stage 2: Flow Training (FSQ → DAC latents)"
echo "====================================================="

# Find the latest LLM checkpoint
LATEST_LLM_CHECKPOINT=$(ls -t "$CHECKPOINT_DIR"/llm/*.pt | head -1)
echo "Using LLM checkpoint: $LATEST_LLM_CHECKPOINT"

if torchrun --nnodes=1 --nproc_per_node=$NUM_GPUS --rdzv_id=1987 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1235" \
    speech/train.py \
    --train_engine torch_ddp \
    --config speech/config.yaml \
    --train_data ./data/train.list \
    --cv_data ./data/val.list \
    --qwen_pretrain_path "$PRETRAINED_DIR/CosyVoice-BlankEN" \
    --model flow \
    --model_dir "$CHECKPOINT_DIR/flow/" \
    --num_workers 24 \
    --prefetch 100 \
    --use_amp \
    --pretrained_model "$PRETRAINED_DIR/flow.pt" \
    --comet_project "learnable-speech" \
    --comet_experiment_name "flow-training-$(date +%Y%m%d-%H%M%S)"; then
    echo "✅ Stage 2 (Flow) training completed successfully!"
else
    echo "❌ Stage 2 (Flow) training failed!"
    exit 1
fi

echo "🎉 Training pipeline completed successfully!"
echo "=========================================="
echo "Trained models saved in: $CHECKPOINT_DIR"
echo ""
echo "Next steps:"
echo "1. Test your models with the inference scripts"
echo "2. Upload the checkpoints to the Hugging Face Hub"
echo "3. Update the Gradio app with the trained models"

# Create a summary file
cat > "$CHECKPOINT_DIR/training_summary.txt" << EOF
Learnable-Speech Training Summary
Generated: $(date)

Dataset: $DATASET_ROOT
LLM Checkpoint: $(ls -t "$CHECKPOINT_DIR"/llm/*.pt | head -1)
Flow Checkpoint: $(ls -t "$CHECKPOINT_DIR"/flow/*.pt | head -1)

Configuration:
- GPUs: $NUM_GPUS
- Batch Size: $BATCH_SIZE
- Mixed Precision: Enabled
- Framework: PyTorch DDP

Training completed successfully!
EOF

echo "📄 Training summary saved to: $CHECKPOINT_DIR/training_summary.txt"
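Before moving on to uploading, it can be worth peeking inside a saved checkpoint. A sketch; the filename is hypothetical, and the `step`/`epoch` keys are assumed only because scripts/upload_to_hf.py reads them:

```python
# Checkpoint inspection -- illustrative only.
import torch

ckpt = torch.load("checkpoints/llm/latest.pt", map_location="cpu")  # hypothetical filename
if isinstance(ckpt, dict):
    print("step:", ckpt.get("step"), "epoch:", ckpt.get("epoch"))
    print("first keys:", list(ckpt)[:10])
```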
scripts/training_configs.sh
ADDED
@@ -0,0 +1,81 @@

# Learnable-Speech Training Configuration for Different Environments
#
# NOTE: This file is a reference collection of per-environment commands.
# Copy the block that matches your setup rather than executing the whole
# file, which would run every training command in sequence.

# ==== LOCAL TRAINING (Single GPU) ====
# For development and testing

export CUDA_VISIBLE_DEVICES="0"
export PYTHONPATH=/path/to/learnable-speech:$PYTHONPATH

# Single GPU training
python train.py \
    --train_engine torch_ddp \
    --config config.yaml \
    --train_data ./data/train.list \
    --cv_data ./data/val.list \
    --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
    --model llm \
    --model_dir ./checkpoints/llm/ \
    --num_workers 4 \
    --prefetch 50 \
    --use_amp \
    --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt

# ==== MULTI-GPU TRAINING (Local) ====
# For faster training on multiple GPUs

export CUDA_VISIBLE_DEVICES="0,1,2,3"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=1986 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    train.py \
    --train_engine torch_ddp \
    --config config.yaml \
    --train_data ./data/train.list \
    --cv_data ./data/val.list \
    --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
    --model llm \
    --model_dir ./checkpoints/llm/ \
    --num_workers 24 \
    --prefetch 100 \
    --use_amp \
    --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt

# ==== CLOUD TRAINING (Google Colab/Kaggle) ====
# Optimized for limited resources

export CUDA_VISIBLE_DEVICES="0"
pip install -r requirements.txt

python train.py \
    --train_engine torch_ddp \
    --config config.yaml \
    --train_data ./data/small_train.list \
    --cv_data ./data/small_val.list \
    --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
    --model llm \
    --model_dir /content/checkpoints/llm/ \
    --num_workers 2 \
    --prefetch 25 \
    --use_amp \
    --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt \
    --comet_disabled  # Disable logging for simplicity

# ==== HUGGING FACE SPACES TRAINING ====
# For training directly on HF infrastructure

# Note: This requires an HF Pro subscription for GPU access.
# Use smaller batch sizes and enable checkpointing.

python train.py \
    --train_engine torch_ddp \
    --config config_hf.yaml \
    --train_data ./data/hf_train.list \
    --cv_data ./data/hf_val.list \
    --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
    --model llm \
    --model_dir /tmp/checkpoints/llm/ \
    --num_workers 1 \
    --prefetch 10 \
    --use_amp \
    --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt \
    --timeout 1800  # 30-minute timeout for HF
scripts/upload_to_hf.py
ADDED
@@ -0,0 +1,229 @@

#!/usr/bin/env python3
"""Upload trained Learnable-Speech models to the Hugging Face Hub"""

import os
import argparse
import json
from pathlib import Path

import torch
from huggingface_hub import create_repo, upload_file


def create_model_card(model_name, training_info):
    """Create a model card for the uploaded model"""
    return f"""---
license: apache-2.0
tags:
- text-to-speech
- speech-synthesis
- learnable-speech
- cosyvoice
- pytorch
pipeline_tag: text-to-speech
library_name: pytorch
---

# Learnable-Speech {model_name.upper()}

This is a trained {model_name} model from the Learnable-Speech project, an unofficial implementation based on improvements of CosyVoice with a learnable encoder and DAC-VAE.

## Model Description

- **Model Type**: {model_name.upper()} ({"Language Model" if model_name == "llm" else "Flow Matching Decoder"})
- **Architecture**: {"Qwen2-based transformer for BPE→FSQ token mapping" if model_name == "llm" else "Causal conditional flow matching for FSQ→DAC latent mapping"}
- **Sample Rate**: 24kHz
- **Framework**: PyTorch

## Training Details

{training_info}

## Usage

```python
import torch
from learnable_speech import LearnableSpeech

# Load the model
model = LearnableSpeech.from_pretrained("your-username/learnable-speech-{model_name}")

# Generate speech
text = "Hello, this is Learnable-Speech!"
audio = model.synthesize(text)
```

## Citation

If you use this model, please cite:

```bibtex
@article{{learnable-speech,
    title={{Learnable-Speech}},
    author={{Learnable team}},
    year={{2025}},
    url={{https://arxiv.org/pdf/2505.07916}}
}}
```

## Links

- [GitHub Repository](https://github.com/primepake/learnable-speech)
- [Original Paper](https://arxiv.org/pdf/2505.07916)
- [Hugging Face Space Demo](https://huggingface.co/spaces/mnhatdaous/learnable-speech)
"""


def upload_model_to_hf(checkpoint_path, model_name, repo_name, token=None, private=False):
    """Upload a trained model to the Hugging Face Hub"""

    # Create the repository
    try:
        create_repo(
            repo_id=repo_name,
            token=token,
            private=private,
            exist_ok=True
        )
        print(f"✅ Repository {repo_name} created/found")
    except Exception as e:
        print(f"❌ Failed to create repository: {e}")
        return False

    # Load the checkpoint to extract training info
    try:
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        training_info = f"""
- **Training Steps**: {checkpoint.get('step', 'Unknown')}
- **Training Epochs**: {checkpoint.get('epoch', 'Unknown')}
- **Training Framework**: PyTorch DDP with AMP
- **Optimizer**: AdamW
- **Learning Rate**: {checkpoint.get('lr', 'Unknown')}
"""
    except Exception as e:
        print(f"⚠️ Could not load checkpoint info: {e}")
        training_info = "Training information not available"

    # Create the model card
    model_card = create_model_card(model_name, training_info)

    # Save the model card to a temporary file
    readme_path = f"README_{model_name}.md"
    config_path = f"config_{model_name}.json"
    with open(readme_path, "w") as f:
        f.write(model_card)

    try:
        # Upload the checkpoint
        upload_file(
            path_or_fileobj=checkpoint_path,
            path_in_repo="pytorch_model.bin",
            repo_id=repo_name,
            token=token
        )
        print("✅ Model checkpoint uploaded")

        # Upload the model card
        upload_file(
            path_or_fileobj=readme_path,
            path_in_repo="README.md",
            repo_id=repo_name,
            token=token
        )
        print("✅ Model card uploaded")

        # Create and upload the config
        config = {
            "model_type": "learnable_speech",
            "architecture": model_name,
            "sample_rate": 24000,
            "framework": "pytorch"
        }
        with open(config_path, "w") as f:
            json.dump(config, f, indent=2)

        upload_file(
            path_or_fileobj=config_path,
            path_in_repo="config.json",
            repo_id=repo_name,
            token=token
        )
        print("✅ Config uploaded")

        print(f"🎉 Model successfully uploaded to: https://huggingface.co/{repo_name}")
        return True

    except Exception as e:
        print(f"❌ Failed to upload: {e}")
        return False

    finally:
        # Clean up the temporary files even if the upload failed
        for path in (readme_path, config_path):
            if os.path.exists(path):
                os.remove(path)


def main():
    parser = argparse.ArgumentParser(description="Upload Learnable-Speech models to Hugging Face")
    parser.add_argument("--checkpoint_dir", required=True, help="Directory containing trained checkpoints")
    parser.add_argument("--username", required=True, help="Your Hugging Face username")
    parser.add_argument("--token", help="Hugging Face API token (or set the HF_TOKEN env var)")
    parser.add_argument("--private", action="store_true", help="Make the repositories private")
    parser.add_argument("--models", nargs="+", choices=["llm", "flow", "both"], default=["both"],
                        help="Which models to upload")

    args = parser.parse_args()

    # Get the token
    token = args.token or os.getenv("HF_TOKEN")
    if not token:
        print("❌ Please provide a Hugging Face token via --token or the HF_TOKEN env var")
        return

    checkpoint_dir = Path(args.checkpoint_dir)

    if "both" in args.models:
        models_to_upload = ["llm", "flow"]
    else:
        models_to_upload = args.models

    success_count = 0

    for model_name in models_to_upload:
        print(f"\n🚀 Uploading {model_name.upper()} model...")

        # Find the model directory
        model_dir = checkpoint_dir / model_name
        if not model_dir.exists():
            print(f"❌ Model directory not found: {model_dir}")
            continue

        checkpoint_files = list(model_dir.glob("*.pt"))
        if not checkpoint_files:
            print(f"❌ No checkpoint files found in {model_dir}")
            continue

        # Get the latest checkpoint (by modification time)
        latest_checkpoint = max(checkpoint_files, key=os.path.getmtime)
        print(f"📁 Using checkpoint: {latest_checkpoint}")

        # Upload to HF
        repo_name = f"{args.username}/learnable-speech-{model_name}"
        success = upload_model_to_hf(
            checkpoint_path=str(latest_checkpoint),
            model_name=model_name,
            repo_name=repo_name,
            token=token,
            private=args.private
        )

        if success:
            success_count += 1

    print(f"\n🎉 Upload complete! {success_count}/{len(models_to_upload)} models uploaded successfully")

    if success_count > 0:
        print("\n📝 Next steps:")
        print("1. Update your Gradio app to use the uploaded models")
        print("2. Test the models in your Hugging Face Space")
        print("3. Share your trained models with the community!")


if __name__ == "__main__":
    main()
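After an upload, a quick listing confirms the files landed. A sketch using only standard huggingface_hub calls; the repo name follows the `{username}/learnable-speech-{model}` convention the script uses:

```python
# Post-upload check -- illustrative only.
from huggingface_hub import HfApi

api = HfApi()
files = api.list_repo_files("your_username/learnable-speech-llm")
assert "pytorch_model.bin" in files and "README.md" in files
print(files)
```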
speech/config_hf.yaml
ADDED
@@ -0,0 +1,192 @@

# Hugging Face optimized configuration
# This config is optimized for training on HF Spaces with limited resources

# set random seed
__set_seed1: !apply:random.seed [1986]
__set_seed2: !apply:numpy.random.seed [1986]
__set_seed3: !apply:torch.manual_seed [1986]
__set_seed4: !apply:torch.cuda.manual_seed_all [1986]

# fixed params - optimized for HF
sample_rate: 24000
llm_input_size: 512   # Reduced from 896
llm_output_size: 512  # Reduced from 896
spk_embed_dim: 128    # Reduced from 192
qwen_pretrain_path: ''
token_frame_rate: 25
token_mel_ratio: 2
token_latent_ratio: 3
use_speaker_encoder: True
speaker_encoder_path: '/tmp/checkpoints/llm/best_speaker_encoder.pt'

# stream related params
chunk_size: 16  # Reduced from 25
num_decoding_left_chunks: -1

speaker_encoder_config:
    mel_dim: 80
    model_dim: 256  # Reduced from 512
    output_dim: !ref <spk_embed_dim>
    num_blocks: 4   # Reduced from 6
    num_heads: 4    # Reduced from 8
    kernel_size: 1
    dropout: 0.1
    max_conditioning_inputs: 2  # Reduced from 3

# Smaller LLM model for HF
llm: !new:cosyvoice.llm.llm.Qwen2LM
    llm_input_size: !ref <llm_input_size>
    llm_output_size: !ref <llm_output_size>
    speech_token_size: 6561
    length_normalized_loss: True
    lsm_weight: 0
    mix_ratio: [3, 10]  # Reduced from [5, 15]
    use_speaker_encoder: !ref <use_speaker_encoder>
    spk_embed_dim: !ref <spk_embed_dim>
    max_conditioning_inputs: 2
    llm: !new:cosyvoice.llm.llm.Qwen2Encoder
        pretrain_path: !ref <qwen_pretrain_path>
    sampling: !name:cosyvoice.utils.common.ras_sampling
        top_p: 0.8
        top_k: 25
        win_size: 8  # Reduced from 10
        tau_r: 0.1

extract_reference_mel: !name:cosyvoice.dataset.processor.extract_reference_mel_from_speech
    feat_extractor: !ref <feat_extractor>
    min_length: 0.5
    max_length: 3.0  # Reduced from 4.0
    num_crops: 1
    training: True
    sample_rate: !ref <sample_rate>

# Smaller Flow model for HF
flow: !new:cosyvoice.flow.flow.CausalMaskedDiffWithXvec
    input_size: 256  # Reduced from 512
    output_size: 64
    spk_embed_dim: !ref <spk_embed_dim>
    output_type: 'mel'
    vocab_size: 6561
    input_frame_rate: !ref <token_frame_rate>
    only_mask_loss: True
    token_latent_ratio: !ref <token_latent_ratio>
    pre_lookahead_len: 2  # Reduced from 3
    use_speaker_encoder: !ref <use_speaker_encoder>
    freeze_speaker_encoder: True
    speaker_encoder_path: !ref <speaker_encoder_path>
    encoder: !new:cosyvoice.transformer.upsample_encoder.UpsampleConformerEncoder
        output_size: 256    # Reduced from 512
        attention_heads: 4  # Reduced from 8
        linear_units: 1024  # Reduced from 2048
        num_blocks: 4       # Reduced from 6
        dropout_rate: 0.1
        positional_dropout_rate: 0.1
        attention_dropout_rate: 0.1
        normalize_before: True
        input_layer: 'linear'
        pos_enc_layer_type: 'rel_pos_espnet'
        selfattention_layer_type: 'rel_selfattn'
        input_size: 256     # Reduced from 512
        use_cnn_module: False
        macaron_style: False
        static_chunk_size: !ref <chunk_size>
    decoder: !new:cosyvoice.flow.flow_matching.CausalConditionalCFM
        in_channels: 240
        n_spks: 1
        spk_emb_dim: 80
        cfm_params: !new:omegaconf.DictConfig
            content:
                sigma_min: 1e-06
                solver: 'euler'
                t_scheduler: 'cosine'
                training_cfg_rate: 0.1   # Reduced from 0.2
                inference_cfg_rate: 0.5  # Reduced from 0.7
                reg_loss_type: 'l1'
                use_immiscible: True
                immiscible_k: 4          # Reduced from 8
                use_contrastive_fm: True
                contrastive_lambda: 0.03 # Reduced from 0.05
        estimator: !new:cosyvoice.flow.decoder.CausalConditionalDecoder
            in_channels: 320
            out_channels: 64
            channels: [128]         # Reduced from [256]
            dropout: 0.0
            attention_head_dim: 32  # Reduced from 64
            n_blocks: 3             # Reduced from 4
            num_mid_blocks: 8       # Reduced from 12
            num_heads: 4            # Reduced from 8
            act_fn: 'gelu'
            static_chunk_size: !ref <chunk_size> * <token_latent_ratio>
            num_decoding_left_chunks: !ref <num_decoding_left_chunks>

# Processor functions (unchanged)
individual_file_opener: !name:cosyvoice.dataset.processor.individual_file_opener
parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
get_tokenizer: !name:cosyvoice.tokenizer.tokenizer.get_qwen_tokenizer
    token_path: !ref <qwen_pretrain_path>
    skip_special_tokens: True
allowed_special: 'all'
tokenize: !name:cosyvoice.dataset.processor.tokenize
    get_tokenizer: !ref <get_tokenizer>
    allowed_special: !ref <allowed_special>
filter: !name:cosyvoice.dataset.processor.filter
    max_length: 20480      # Reduced from 40960
    min_length: 100
    token_max_length: 150  # Reduced from 200
    token_min_length: 1
resample: !name:cosyvoice.dataset.processor.resample
    resample_rate: !ref <sample_rate>
truncate: !name:cosyvoice.dataset.processor.truncate
    truncate_length: 12240  # Reduced from 24480
feat_extractor: !name:matcha.utils.audio.mel_spectrogram
    n_fft: 1920
    num_mels: 80
    sampling_rate: !ref <sample_rate>
    hop_size: 480
    win_size: 1920
    fmin: 0
    fmax: 8000
    center: False
compute_fbank: !name:cosyvoice.dataset.processor.compute_fbank
    feat_extractor: !ref <feat_extractor>
    token_mel_ratio: !ref <token_mel_ratio>
shuffle: !name:cosyvoice.dataset.processor.shuffle
    shuffle_size: 500  # Reduced from 1000
sort: !name:cosyvoice.dataset.processor.sort
    sort_size: 250     # Reduced from 500
batch: !name:cosyvoice.dataset.processor.batch
    batch_type: 'dynamic'
    max_frames_in_batch: 2500  # Reduced from 5000
padding: !name:cosyvoice.dataset.processor.padding
    use_speaker_encoder: !ref <use_speaker_encoder>

# dataset processor pipeline
data_pipeline: [
    !ref <individual_file_opener>,
    !ref <tokenize>,
    !ref <filter>,
    !ref <resample>,
    !ref <extract_reference_mel>,
    !ref <compute_fbank>,
    !ref <shuffle>,
    !ref <sort>,
    !ref <batch>,
    !ref <padding>,
]

# HF optimized training configuration
train_conf:
    optim: adamw
    optim_conf:
        lr: 3e-5  # Reduced from 5e-5
    scheduler: constantlr
    scheduler_conf:
        warmup_steps: 200  # Reduced from 500
    max_epoch: 50          # Reduced from 2000
    grad_clip: 1
    accum_grad: 2          # Added gradient accumulation
    log_interval: 10       # Increased from 5
    save_per_step: 1000    # Reduced from 2000
    total_iters: 100000    # Reduced from 1000000000
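Configs in this `!new:`/`!ref` style are resolved with HyperPyYAML, which CosyVoice-derived code loads via `load_hyperpyyaml`. A minimal loading sketch, assuming the hyperpyyaml package is installed; the override value is illustrative, since `qwen_pretrain_path` is empty in the file:

```python
# Loading the config -- a sketch, assuming a CosyVoice-style setup.
from hyperpyyaml import load_hyperpyyaml

with open("speech/config_hf.yaml") as f:
    configs = load_hyperpyyaml(
        f,
        overrides={"qwen_pretrain_path": "pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN"},
    )

llm = configs["llm"]    # instantiated Qwen2LM
flow = configs["flow"]  # instantiated flow-matching model
```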