mnhatdaous committed on
Commit
248479c
1 Parent(s): edfcfb2

Add comprehensive training pipeline for Hugging Face deployment

TRAINING_GUIDE.md ADDED
@@ -0,0 +1,222 @@
+ # 🎤 Learnable-Speech Training Quick Start Guide
+
+ This guide will help you train the Learnable-Speech model from scratch and deploy it on Hugging Face.
+
+ ## 📋 Prerequisites
+
+ 1. **Hardware Requirements**:
+    - GPU with at least 8GB VRAM (16GB+ recommended)
+    - 32GB+ RAM
+    - 100GB+ storage space
+
+ 2. **Software Requirements**:
+    - Python 3.10+
+    - CUDA 11.8+
+    - PyTorch 2.0+
+
+ ## 🚀 Step-by-Step Training Process
+
+ ### Step 1: Environment Setup
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/primepake/learnable-speech.git
+ cd learnable-speech
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Install S3Tokenizer
+ cd speech/tools/S3Tokenizer
+ pip install .
+ cd ../../..
+ ```
+
+ ### Step 2: Download Prerequisites
+
+ ```bash
+ # Make scripts executable
+ chmod +x scripts/*.sh
+
+ # Download pretrained models
+ ./scripts/download_pretrained.sh
+ ```
+
+ ### Step 3: Prepare Your Dataset
+
+ ```bash
+ # Organize your dataset like this:
+ # dataset_root/
+ # ├── speaker1_001.wav
+ # ├── speaker1_001.txt
+ # ├── speaker1_002.wav
+ # ├── speaker1_002.txt
+ # └── ...
+
+ # Set your paths here, and update the matching DATASET_ROOT/OUTPUT_DIR
+ # variables inside scripts/prepare_data.sh
+ export DATASET_ROOT="/path/to/your/dataset"
+ export OUTPUT_DIR="/path/to/processed/data"
+
+ # Run data preparation
+ ./scripts/prepare_data.sh
+ ```
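+
+ Before launching the pipeline, it is worth confirming that every clip has a matching transcript, since unpaired files only surface as errors partway through preparation. A minimal sketch (a hypothetical `check_pairs` helper, not part of the repo):
+
+ ```python
+ # Hypothetical pairing check for the flat wav/txt layout shown above
+ import os
+ from pathlib import Path
+
+ def check_pairs(dataset_root: str) -> None:
+     root = Path(dataset_root)
+     wavs = sorted(root.glob("*.wav"))
+     missing = [w for w in wavs if not w.with_suffix(".txt").exists()]
+     print(f"{len(wavs)} wav files, {len(missing)} missing transcripts")
+     for w in missing[:10]:
+         print(f"  no transcript for {w.name}")
+
+ check_pairs(os.environ.get("DATASET_ROOT", "/path/to/your/dataset"))
+ ```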
+
+ ### Step 4: Train the Models
+
+ ```bash
+ # Option A: Train the full pipeline (recommended)
+ ./scripts/train_full_pipeline.sh
+
+ # Option B: Train the stages separately
+ ./speech/llm_run.sh   # Stage 1: LLM
+ ./speech/flow_run.sh  # Stage 2: Flow
+ ```
+
+ ### Step 5: Upload to Hugging Face
+
+ ```bash
+ # Get your HF token from https://huggingface.co/settings/tokens
+ export HF_TOKEN="your_token_here"
+
+ # Upload trained models
+ python scripts/upload_to_hf.py \
+     --checkpoint_dir ./checkpoints \
+     --username your_hf_username \
+     --models both
+ ```
+
+ ### Step 6: Update the Gradio App
+
+ ```python
+ # Update app.py to use your trained models
+ from huggingface_hub import hf_hub_download
+ import torch
+
+ # Download your trained models
+ llm_path = hf_hub_download(
+     repo_id="your_username/learnable-speech-llm",
+     filename="pytorch_model.bin",
+ )
+ flow_path = hf_hub_download(
+     repo_id="your_username/learnable-speech-flow",
+     filename="pytorch_model.bin",
+ )
+
+ # The upload script stores the raw .pt checkpoints as pytorch_model.bin,
+ # so they load directly with torch.load
+ llm_state = torch.load(llm_path, map_location="cpu")
+ flow_state = torch.load(flow_path, map_location="cpu")
+
+ # Load and use the models in your synthesis function
+ def synthesize_speech(text, speaker_id=0):
+     # Replace this placeholder with actual model inference
+     # ... your inference code here ...
+     pass
+ ```
+
+ ## 🎯 Training Configurations
+
+ ### For Different Environments:
+
+ 1. **Local Development** (Single GPU):
+    ```bash
+    export CUDA_VISIBLE_DEVICES="0"
+    python speech/train.py --config speech/config.yaml --model llm ...
+    ```
+
+ 2. **Multi-GPU Training**:
+    ```bash
+    export CUDA_VISIBLE_DEVICES="0,1,2,3"
+    torchrun --nproc_per_node=4 speech/train.py ...
+    ```
+
+ 3. **Cloud Training** (Google Colab/Kaggle):
+    ```python
+    # Use config_hf.yaml for resource-constrained environments
+    !python speech/train.py --config speech/config_hf.yaml ...
+    ```
+
+ 4. **Hugging Face Spaces**:
+    ```bash
+    # For direct training on HF infrastructure
+    python speech/train.py --config speech/config_hf.yaml --timeout 1800 ...
+    ```
+
+ ## 📊 Monitoring Training
+
+ 1. **Comet ML** (Recommended):
+    ```bash
+    # Set up Comet ML for experiment tracking
+    export COMET_API_KEY="your_api_key"
+    # Training will automatically log to Comet
+    ```
+
+ 2. **TensorBoard**:
+    ```bash
+    tensorboard --logdir ./tensorboard
+    ```
+
+ 3. **Command Line**:
+    ```bash
+    # Monitor log files
+    tail -f checkpoints/llm/train.log
+    ```
+
+ ## 🔧 Troubleshooting
+
+ ### Common Issues:
+
+ 1. **Out of Memory**:
+    - Reduce the batch size in the config
+    - Use gradient accumulation (`--accum_grad`)
+    - Enable mixed precision training (`--use_amp`)
+
+ 2. **Slow Training**:
+    - Increase `num_workers` for data loading
+    - Use multiple GPUs with DDP
+    - Optimize data preprocessing
+
+ 3. **Model Not Converging**:
+    - Check the learning rate
+    - Verify data preprocessing
+    - Start from the pretrained checkpoints
+
+ ### Performance Tips:
+
+ 1. **Data Loading Optimization**:
+    ```yaml
+    # In config.yaml
+    num_workers: 24
+    prefetch: 100
+    pin_memory: true
+    ```
+
+ 2. **Memory Optimization**:
+    ```bash
+    # Mixed precision plus gradient accumulation
+    --use_amp --accum_grad 2
+    ```
+
+ 3. **Speed Optimization**:
+    ```bash
+    # Compile the model for faster training (PyTorch 2.0+)
+    export TORCH_COMPILE=1
+    ```
+
+ ## 📈 Expected Training Times
+
+ | Configuration | LLM Training | Flow Training | Total |
+ |---------------|--------------|---------------|-------|
+ | Single RTX 4090 | 2-3 days | 1-2 days | 3-5 days |
+ | 4x RTX 4090 | 12-18 hours | 6-12 hours | 1-2 days |
+ | 8x A100 | 6-8 hours | 3-4 hours | 9-12 hours |
+
+ ## 🎉 Success Criteria
+
+ Your training is successful when:
+
+ 1. **LLM Stage**: Perplexity < 2.0, token accuracy > 95% (see the conversion sketch below)
+ 2. **Flow Stage**: Reconstruction loss < 0.1, mel spectral loss < 0.05
+ 3. **Audio Quality**: Generated samples sound natural and intelligible
+
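+ Perplexity is just the exponential of the mean cross-entropy loss reported in the training logs, so the target can be checked directly; a small conversion sketch (the 0.65 loss value is illustrative):
+
+ ```python
+ import math
+
+ def perplexity(mean_ce_loss: float) -> float:
+     """Convert mean cross-entropy (nats per token) to perplexity."""
+     return math.exp(mean_ce_loss)
+
+ print(perplexity(0.65))  # ≈ 1.92, under the < 2.0 target
+ ```
+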
+ ## 📚 Additional Resources
+
+ - [Training Logs Analysis](docs/training_analysis.md)
+ - [Hyperparameter Tuning Guide](docs/hyperparameters.md)
+ - [Deployment Best Practices](docs/deployment.md)
+ - [Community Discord](https://discord.gg/learnable-speech)
scripts/download_pretrained.sh ADDED
@@ -0,0 +1,17 @@
+ #!/bin/bash
+ set -e  # Stop if any download fails
+
+ # Create pretrained models directory
+ mkdir -p pretrained_models/CosyVoice2-0.5B
+
+ echo "Downloading CosyVoice2 pretrained models..."
+
+ # Download the CosyVoice2 base checkpoints from the official Hugging Face release
+ wget -O pretrained_models/CosyVoice2-0.5B/llm.pt "https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B/resolve/main/llm.pt"
+ wget -O pretrained_models/CosyVoice2-0.5B/flow.pt "https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B/resolve/main/flow.pt"
+
+ # Download Qwen pretrained model
+ mkdir -p pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN
+ echo "Download the Qwen model manually from: https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B"
+
+ echo "Base checkpoints downloaded. Remember to fetch the Qwen model into CosyVoice-BlankEN before training."
scripts/prepare_data.sh ADDED
@@ -0,0 +1,53 @@
+ #!/bin/bash
+ set -e  # Stop on the first failed step
+
+ # Data preparation pipeline for Learnable-Speech training
+
+ echo "=== Learnable-Speech Data Preparation Pipeline ==="
+
+ # Configuration
+ DATASET_ROOT="/path/to/your/dataset"   # Change this to your dataset path
+ OUTPUT_DIR="/path/to/processed/data"   # Change this to your output path
+
+ # Create output directories
+ mkdir -p "$OUTPUT_DIR"/{fsq,dac_latents,lists}
+
+ echo "Step 1: Extract FSQ tokens using S3Tokenizer..."
+ cd speech/tools/S3Tokenizer
+ pip install .
+
+ # Extract FSQ tokens (25Hz)
+ torchrun --nproc_per_node=4 --nnodes=1 --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
+     "$(which s3tokenizer)" \
+     --root_path "$DATASET_ROOT" \
+     --model speech_tokenizer_v2_25hz \
+     --device "cuda" \
+     --batch_size 64 \
+     --file_list ../../../files_test.txt \
+     --skip_existing
+
+ echo "Step 2: Extract DAC-VAE latents..."
+ cd ../../../dac-vae
+
+ # Download DAC-VAE checkpoint
+ wget -O checkpoint.pt "https://github.com/primepake/learnable-speech/releases/download/dac-vae/dac_vae_checkpoint.pt"
+
+ # Extract DAC latents
+ python extract_dac_latents.py \
+     --checkpoint checkpoint.pt \
+     --config configs/config.yml \
+     --root_path "$DATASET_ROOT" \
+     --output_dir "$OUTPUT_DIR/dac_latents"
+
+ echo "Step 3: Create data lists..."
+ cd ../speech
+ python tools/create_data_list.py \
+     --src_dir "$OUTPUT_DIR" \
+     --output_dir "$OUTPUT_DIR/lists"
+
+ echo "Data preparation completed!"
+ echo "Your dataset should now have:"
+ echo "  - Original audio files (.wav)"
+ echo "  - Text transcriptions (.txt)"
+ echo "  - FSQ tokens (*_fsq.pt)"
+ echo "  - DAC latents (*_latent.pt)"
+ echo "  - Data list files"
scripts/train_full_pipeline.sh ADDED
@@ -0,0 +1,120 @@
+ #!/bin/bash
+
+ # Complete Learnable-Speech Training Pipeline
+ # This script trains the LLM and Flow models sequentially
+
+ set -e  # Exit on any error
+
+ echo "🎤 Starting Learnable-Speech Training Pipeline"
+ echo "=============================================="
+
+ # Configuration
+ DATASET_ROOT="${DATASET_ROOT:-/data/dataset}"
+ CHECKPOINT_DIR="${CHECKPOINT_DIR:-./checkpoints}"
+ PRETRAINED_DIR="${PRETRAINED_DIR:-./pretrained_models/CosyVoice2-0.5B}"
+ NUM_GPUS="${NUM_GPUS:-4}"
+ BATCH_SIZE="${BATCH_SIZE:-32}"
+
+ # Create checkpoint directories
+ mkdir -p "$CHECKPOINT_DIR"/{llm,flow}
+
+ # Check prerequisites
+ echo "📋 Checking prerequisites..."
+ if [ ! -d "$PRETRAINED_DIR" ]; then
+     echo "❌ Pretrained models not found. Please run scripts/download_pretrained.sh first"
+     exit 1
+ fi
+
+ if [ ! -f "./data/train.list" ]; then
+     echo "❌ Training data not found. Please run scripts/prepare_data.sh first"
+     exit 1
+ fi
+
+ # Set environment (adjust the device list to match NUM_GPUS)
+ export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3}"
+ export PYTHONPATH=$(pwd):$PYTHONPATH
+
+ echo "🚀 Starting Stage 1: LLM Training (BPE → FSQ tokens)"
+ echo "=================================================="
+
+ # Under set -e a bare failure aborts immediately, so test the command directly
+ # to report a stage-specific error message
+ if ! torchrun --nnodes=1 --nproc_per_node=$NUM_GPUS --rdzv_id=1986 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
+     speech/train.py \
+     --train_engine torch_ddp \
+     --config speech/config.yaml \
+     --train_data ./data/train.list \
+     --cv_data ./data/val.list \
+     --qwen_pretrain_path $PRETRAINED_DIR/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir $CHECKPOINT_DIR/llm/ \
+     --num_workers 24 \
+     --prefetch 100 \
+     --use_amp \
+     --pretrained_model $PRETRAINED_DIR/llm.pt \
+     --comet_project "learnable-speech" \
+     --comet_experiment_name "llm-training-$(date +%Y%m%d-%H%M%S)"; then
+     echo "❌ Stage 1 (LLM) training failed!"
+     exit 1
+ fi
+ echo "✅ Stage 1 (LLM) training completed successfully!"
+
+ echo "🚀 Starting Stage 2: Flow Training (FSQ → DAC latents)"
+ echo "====================================================="
+
+ # Find the latest LLM checkpoint
+ LATEST_LLM_CHECKPOINT=$(ls -t "$CHECKPOINT_DIR"/llm/*.pt | head -1)
+ echo "Using LLM checkpoint: $LATEST_LLM_CHECKPOINT"
+
+ if ! torchrun --nnodes=1 --nproc_per_node=$NUM_GPUS --rdzv_id=1987 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1235" \
+     speech/train.py \
+     --train_engine torch_ddp \
+     --config speech/config.yaml \
+     --train_data ./data/train.list \
+     --cv_data ./data/val.list \
+     --qwen_pretrain_path $PRETRAINED_DIR/CosyVoice-BlankEN \
+     --model flow \
+     --model_dir $CHECKPOINT_DIR/flow/ \
+     --num_workers 24 \
+     --prefetch 100 \
+     --use_amp \
+     --pretrained_model $PRETRAINED_DIR/flow.pt \
+     --comet_project "learnable-speech" \
+     --comet_experiment_name "flow-training-$(date +%Y%m%d-%H%M%S)"; then
+     echo "❌ Stage 2 (Flow) training failed!"
+     exit 1
+ fi
+ echo "✅ Stage 2 (Flow) training completed successfully!"
+
+ echo "🎉 Training pipeline completed successfully!"
+ echo "=========================================="
+ echo "Trained models saved in: $CHECKPOINT_DIR"
+ echo ""
+ echo "Next steps:"
+ echo "1. Test your models with inference scripts"
+ echo "2. Upload checkpoints to Hugging Face Hub"
+ echo "3. Update the Gradio app with trained models"
+
+ # Create a summary file
+ cat > "$CHECKPOINT_DIR/training_summary.txt" << EOF
+ Learnable-Speech Training Summary
+ Generated: $(date)
+
+ Dataset: $DATASET_ROOT
+ LLM Checkpoint: $(ls -t "$CHECKPOINT_DIR"/llm/*.pt | head -1)
+ Flow Checkpoint: $(ls -t "$CHECKPOINT_DIR"/flow/*.pt | head -1)
+
+ Configuration:
+ - GPUs: $NUM_GPUS
+ - Batch Size: $BATCH_SIZE
+ - Mixed Precision: Enabled
+ - Framework: PyTorch DDP
+
+ Training completed successfully!
+ EOF
+
+ echo "📄 Training summary saved to: $CHECKPOINT_DIR/training_summary.txt"
scripts/training_configs.sh ADDED
@@ -0,0 +1,81 @@
+ # Learnable-Speech Training Configurations for Different Environments
+ # These are reference commands: copy the block that matches your setup
+ # rather than executing this file top to bottom.
+
+ # ==== LOCAL TRAINING (Single GPU) ====
+ # For development and testing
+
+ export CUDA_VISIBLE_DEVICES="0"
+ export PYTHONPATH=/path/to/learnable-speech:$PYTHONPATH
+
+ # Single GPU training
+ python train.py \
+     --train_engine torch_ddp \
+     --config config.yaml \
+     --train_data ./data/train.list \
+     --cv_data ./data/val.list \
+     --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir ./checkpoints/llm/ \
+     --num_workers 4 \
+     --prefetch 50 \
+     --use_amp \
+     --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt
+
+ # ==== MULTI-GPU TRAINING (Local) ====
+ # For faster training on multiple GPUs
+
+ export CUDA_VISIBLE_DEVICES="0,1,2,3"
+ num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
+
+ torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=1986 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
+     train.py \
+     --train_engine torch_ddp \
+     --config config.yaml \
+     --train_data ./data/train.list \
+     --cv_data ./data/val.list \
+     --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir ./checkpoints/llm/ \
+     --num_workers 24 \
+     --prefetch 100 \
+     --use_amp \
+     --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt
+
+ # ==== CLOUD TRAINING (Google Colab/Kaggle) ====
+ # Optimized for limited resources
+
+ export CUDA_VISIBLE_DEVICES="0"
+ pip install -r requirements.txt
+
+ python train.py \
+     --train_engine torch_ddp \
+     --config config.yaml \
+     --train_data ./data/small_train.list \
+     --cv_data ./data/small_val.list \
+     --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir /content/checkpoints/llm/ \
+     --num_workers 2 \
+     --prefetch 25 \
+     --use_amp \
+     --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt \
+     --comet_disabled  # Disable logging for simplicity
+
+ # ==== HUGGING FACE SPACES TRAINING ====
+ # For training directly on HF infrastructure
+
+ # Note: This requires an HF Pro subscription for GPU access.
+ # Use smaller batch sizes and enable checkpointing.
+
+ python train.py \
+     --train_engine torch_ddp \
+     --config config_hf.yaml \
+     --train_data ./data/hf_train.list \
+     --cv_data ./data/hf_val.list \
+     --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir /tmp/checkpoints/llm/ \
+     --num_workers 1 \
+     --prefetch 10 \
+     --use_amp \
+     --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt \
+     --timeout 1800  # 30-minute timeout for HF
scripts/upload_to_hf.py ADDED
@@ -0,0 +1,229 @@
+ #!/usr/bin/env python3
+ """Upload trained Learnable-Speech models to the Hugging Face Hub."""
+
+ import argparse
+ import json
+ import os
+ from pathlib import Path
+
+ import torch
+ from huggingface_hub import create_repo, upload_file
+
+
+ def create_model_card(model_name, training_info):
+     """Create a model card for the uploaded model."""
+     return f"""---
+ license: apache-2.0
+ tags:
+ - text-to-speech
+ - speech-synthesis
+ - learnable-speech
+ - cosyvoice
+ - pytorch
+ pipeline_tag: text-to-speech
+ library_name: pytorch
+ ---
+
+ # Learnable-Speech {model_name.upper()}
+
+ This is a trained {model_name} model from the Learnable-Speech project, an unofficial implementation based on improvements of CosyVoice with a learnable encoder and DAC-VAE.
+
+ ## Model Description
+
+ - **Model Type**: {model_name.upper()} ({"Language Model" if model_name == "llm" else "Flow Matching Decoder"})
+ - **Architecture**: {"Qwen2-based transformer for BPE→FSQ token mapping" if model_name == "llm" else "Causal conditional flow matching for FSQ→DAC latent mapping"}
+ - **Sample Rate**: 24kHz
+ - **Framework**: PyTorch
+
+ ## Training Details
+
+ {training_info}
+
+ ## Usage
+
+ ```python
+ import torch
+ from learnable_speech import LearnableSpeech
+
+ # Load the model
+ model = LearnableSpeech.from_pretrained("your-username/learnable-speech-{model_name}")
+
+ # Generate speech
+ text = "Hello, this is Learnable-Speech!"
+ audio = model.synthesize(text)
+ ```
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @article{{learnable-speech,
+   title={{Learnable-Speech}},
+   author={{Learnable team}},
+   year={{2025}},
+   url={{https://arxiv.org/pdf/2505.07916}}
+ }}
+ ```
+
+ ## Links
+
+ - [GitHub Repository](https://github.com/primepake/learnable-speech)
+ - [Original Paper](https://arxiv.org/pdf/2505.07916)
+ - [Hugging Face Space Demo](https://huggingface.co/spaces/mnhatdaous/learnable-speech)
+ """
+
+
+ def upload_model_to_hf(checkpoint_path, model_name, repo_name, token=None, private=False):
+     """Upload a trained model to the Hugging Face Hub."""
+     # Create repository
+     try:
+         create_repo(
+             repo_id=repo_name,
+             token=token,
+             private=private,
+             exist_ok=True,
+         )
+         print(f"✅ Repository {repo_name} created/found")
+     except Exception as e:
+         print(f"❌ Failed to create repository: {e}")
+         return False
+
+     # Load checkpoint to get training info
+     try:
+         checkpoint = torch.load(checkpoint_path, map_location="cpu")
+         training_info = f"""
+ - **Training Steps**: {checkpoint.get('step', 'Unknown')}
+ - **Training Epochs**: {checkpoint.get('epoch', 'Unknown')}
+ - **Training Framework**: PyTorch DDP with AMP
+ - **Optimizer**: AdamW
+ - **Learning Rate**: {checkpoint.get('lr', 'Unknown')}
+ """
+     except Exception as e:
+         print(f"⚠️ Could not load checkpoint info: {e}")
+         training_info = "Training information not available"
+
+     # Create model card
+     model_card = create_model_card(model_name, training_info)
+
+     # Save the model card to a temporary file
+     readme_path = f"README_{model_name}.md"
+     config_path = f"config_{model_name}.json"
+     with open(readme_path, "w") as f:
+         f.write(model_card)
+
+     try:
+         # Upload checkpoint
+         upload_file(
+             path_or_fileobj=checkpoint_path,
+             path_in_repo="pytorch_model.bin",
+             repo_id=repo_name,
+             token=token,
+         )
+         print("✅ Model checkpoint uploaded")
+
+         # Upload model card
+         upload_file(
+             path_or_fileobj=readme_path,
+             path_in_repo="README.md",
+             repo_id=repo_name,
+             token=token,
+         )
+         print("✅ Model card uploaded")
+
+         # Create and upload config
+         config = {
+             "model_type": "learnable_speech",
+             "architecture": model_name,
+             "sample_rate": 24000,
+             "framework": "pytorch",
+         }
+
+         with open(config_path, "w") as f:
+             json.dump(config, f, indent=2)
+
+         upload_file(
+             path_or_fileobj=config_path,
+             path_in_repo="config.json",
+             repo_id=repo_name,
+             token=token,
+         )
+         print("✅ Config uploaded")
+
+         print(f"🎉 Model successfully uploaded to: https://huggingface.co/{repo_name}")
+         return True
+
+     except Exception as e:
+         print(f"❌ Failed to upload: {e}")
+         return False
+
+     finally:
+         # Clean up temporary files even if an upload step failed
+         for path in (readme_path, config_path):
+             if os.path.exists(path):
+                 os.remove(path)
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Upload Learnable-Speech models to Hugging Face")
+     parser.add_argument("--checkpoint_dir", required=True, help="Directory containing trained checkpoints")
+     parser.add_argument("--username", required=True, help="Your Hugging Face username")
+     parser.add_argument("--token", help="Hugging Face API token (or set HF_TOKEN env var)")
+     parser.add_argument("--private", action="store_true", help="Make repositories private")
+     parser.add_argument("--models", nargs="+", choices=["llm", "flow", "both"], default=["both"],
+                         help="Which models to upload")
+
+     args = parser.parse_args()
+
+     # Get token
+     token = args.token or os.getenv("HF_TOKEN")
+     if not token:
+         print("❌ Please provide a Hugging Face token via --token or the HF_TOKEN env var")
+         return
+
+     checkpoint_dir = Path(args.checkpoint_dir)
+
+     if "both" in args.models:
+         models_to_upload = ["llm", "flow"]
+     else:
+         models_to_upload = args.models
+
+     success_count = 0
+
+     for model_name in models_to_upload:
+         print(f"\n🚀 Uploading {model_name.upper()} model...")
+
+         # Find the model directory
+         model_dir = checkpoint_dir / model_name
+         if not model_dir.exists():
+             print(f"❌ Model directory not found: {model_dir}")
+             continue
+
+         checkpoint_files = list(model_dir.glob("*.pt"))
+         if not checkpoint_files:
+             print(f"❌ No checkpoint files found in {model_dir}")
+             continue
+
+         # Get the latest checkpoint (by modification time)
+         latest_checkpoint = max(checkpoint_files, key=os.path.getmtime)
+         print(f"📁 Using checkpoint: {latest_checkpoint}")
+
+         # Upload to HF
+         repo_name = f"{args.username}/learnable-speech-{model_name}"
+         success = upload_model_to_hf(
+             checkpoint_path=str(latest_checkpoint),
+             model_name=model_name,
+             repo_name=repo_name,
+             token=token,
+             private=args.private,
+         )
+
+         if success:
+             success_count += 1
+
+     print(f"\n🎉 Upload complete! {success_count}/{len(models_to_upload)} models uploaded successfully")
+
+     if success_count > 0:
+         print("\n📝 Next steps:")
+         print("1. Update your Gradio app to use the uploaded models")
+         print("2. Test the models in your Hugging Face Space")
+         print("3. Share your trained models with the community!")
+
+
+ if __name__ == "__main__":
+     main()
speech/config_hf.yaml ADDED
@@ -0,0 +1,192 @@
+ # Hugging Face optimized configuration
+ # This config is optimized for training on HF Spaces with limited resources
+
+ # set random seed
+ __set_seed1: !apply:random.seed [1986]
+ __set_seed2: !apply:numpy.random.seed [1986]
+ __set_seed3: !apply:torch.manual_seed [1986]
+ __set_seed4: !apply:torch.cuda.manual_seed_all [1986]
+
+ # fixed params - optimized for HF
+ sample_rate: 24000
+ llm_input_size: 512    # Reduced from 896
+ llm_output_size: 512   # Reduced from 896
+ spk_embed_dim: 128     # Reduced from 192
+ qwen_pretrain_path: ''
+ token_frame_rate: 25
+ token_mel_ratio: 2
+ token_latent_ratio: 3
+ use_speaker_encoder: True
+ speaker_encoder_path: '/tmp/checkpoints/llm/best_speaker_encoder.pt'
+
+ # stream related params
+ chunk_size: 16                 # Reduced from 25
+ num_decoding_left_chunks: -1
+
+ speaker_encoder_config:
+     mel_dim: 80
+     model_dim: 256               # Reduced from 512
+     output_dim: !ref <spk_embed_dim>
+     num_blocks: 4                # Reduced from 6
+     num_heads: 4                 # Reduced from 8
+     kernel_size: 1
+     dropout: 0.1
+     max_conditioning_inputs: 2   # Reduced from 3
+
+ # Smaller LLM model for HF
+ llm: !new:cosyvoice.llm.llm.Qwen2LM
+     llm_input_size: !ref <llm_input_size>
+     llm_output_size: !ref <llm_output_size>
+     speech_token_size: 6561
+     length_normalized_loss: True
+     lsm_weight: 0
+     mix_ratio: [3, 10]   # Reduced from [5, 15]
+     use_speaker_encoder: !ref <use_speaker_encoder>
+     spk_embed_dim: !ref <spk_embed_dim>
+     max_conditioning_inputs: 2
+     llm: !new:cosyvoice.llm.llm.Qwen2Encoder
+         pretrain_path: !ref <qwen_pretrain_path>
+     sampling: !name:cosyvoice.utils.common.ras_sampling
+         top_p: 0.8
+         top_k: 25
+         win_size: 8   # Reduced from 10
+         tau_r: 0.1
+
+ extract_reference_mel: !name:cosyvoice.dataset.processor.extract_reference_mel_from_speech
+     feat_extractor: !ref <feat_extractor>
+     min_length: 0.5
+     max_length: 3.0   # Reduced from 4.0
+     num_crops: 1
+     training: True
+     sample_rate: !ref <sample_rate>
+
+ # Smaller Flow model for HF
+ flow: !new:cosyvoice.flow.flow.CausalMaskedDiffWithXvec
+     input_size: 256   # Reduced from 512
+     output_size: 64
+     spk_embed_dim: !ref <spk_embed_dim>
+     output_type: 'mel'
+     vocab_size: 6561
+     input_frame_rate: !ref <token_frame_rate>
+     only_mask_loss: True
+     token_latent_ratio: !ref <token_latent_ratio>
+     pre_lookahead_len: 2   # Reduced from 3
+     use_speaker_encoder: !ref <use_speaker_encoder>
+     freeze_speaker_encoder: True
+     speaker_encoder_path: !ref <speaker_encoder_path>
+     encoder: !new:cosyvoice.transformer.upsample_encoder.UpsampleConformerEncoder
+         output_size: 256    # Reduced from 512
+         attention_heads: 4  # Reduced from 8
+         linear_units: 1024  # Reduced from 2048
+         num_blocks: 4       # Reduced from 6
+         dropout_rate: 0.1
+         positional_dropout_rate: 0.1
+         attention_dropout_rate: 0.1
+         normalize_before: True
+         input_layer: 'linear'
+         pos_enc_layer_type: 'rel_pos_espnet'
+         selfattention_layer_type: 'rel_selfattn'
+         input_size: 256     # Reduced from 512
+         use_cnn_module: False
+         macaron_style: False
+         static_chunk_size: !ref <chunk_size>
+     decoder: !new:cosyvoice.flow.flow_matching.CausalConditionalCFM
+         in_channels: 240
+         n_spks: 1
+         spk_emb_dim: 80
+         cfm_params: !new:omegaconf.DictConfig
+             content:
+                 sigma_min: 1e-06
+                 solver: 'euler'
+                 t_scheduler: 'cosine'
+                 training_cfg_rate: 0.1    # Reduced from 0.2
+                 inference_cfg_rate: 0.5   # Reduced from 0.7
+                 reg_loss_type: 'l1'
+                 use_immiscible: True
+                 immiscible_k: 4           # Reduced from 8
+                 use_contrastive_fm: True
+                 contrastive_lambda: 0.03  # Reduced from 0.05
+         estimator: !new:cosyvoice.flow.decoder.CausalConditionalDecoder
+             in_channels: 320
+             out_channels: 64
+             channels: [128]          # Reduced from [256]
+             dropout: 0.0
+             attention_head_dim: 32   # Reduced from 64
+             n_blocks: 3              # Reduced from 4
+             num_mid_blocks: 8        # Reduced from 12
+             num_heads: 4             # Reduced from 8
+             act_fn: 'gelu'
+             static_chunk_size: !ref <chunk_size> * <token_latent_ratio>
+             num_decoding_left_chunks: !ref <num_decoding_left_chunks>
+
+ # Processor functions (unchanged)
+ individual_file_opener: !name:cosyvoice.dataset.processor.individual_file_opener
+ parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
+ get_tokenizer: !name:cosyvoice.tokenizer.tokenizer.get_qwen_tokenizer
+     token_path: !ref <qwen_pretrain_path>
+     skip_special_tokens: True
+ allowed_special: 'all'
+ tokenize: !name:cosyvoice.dataset.processor.tokenize
+     get_tokenizer: !ref <get_tokenizer>
+     allowed_special: !ref <allowed_special>
+ filter: !name:cosyvoice.dataset.processor.filter
+     max_length: 20480        # Reduced from 40960
+     min_length: 100
+     token_max_length: 150    # Reduced from 200
+     token_min_length: 1
+ resample: !name:cosyvoice.dataset.processor.resample
+     resample_rate: !ref <sample_rate>
+ truncate: !name:cosyvoice.dataset.processor.truncate
+     truncate_length: 12240   # Reduced from 24480
+ feat_extractor: !name:matcha.utils.audio.mel_spectrogram
+     n_fft: 1920
+     num_mels: 80
+     sampling_rate: !ref <sample_rate>
+     hop_size: 480
+     win_size: 1920
+     fmin: 0
+     fmax: 8000
+     center: False
+ compute_fbank: !name:cosyvoice.dataset.processor.compute_fbank
+     feat_extractor: !ref <feat_extractor>
+     token_mel_ratio: !ref <token_mel_ratio>
+ shuffle: !name:cosyvoice.dataset.processor.shuffle
+     shuffle_size: 500   # Reduced from 1000
+ sort: !name:cosyvoice.dataset.processor.sort
+     sort_size: 250      # Reduced from 500
+ batch: !name:cosyvoice.dataset.processor.batch
+     batch_type: 'dynamic'
+     max_frames_in_batch: 2500   # Reduced from 5000
+ padding: !name:cosyvoice.dataset.processor.padding
+     use_speaker_encoder: !ref <use_speaker_encoder>
+
+ # dataset processor pipeline
+ data_pipeline: [
+     !ref <individual_file_opener>,
+     !ref <tokenize>,
+     !ref <filter>,
+     !ref <resample>,
+     !ref <extract_reference_mel>,
+     !ref <compute_fbank>,
+     !ref <shuffle>,
+     !ref <sort>,
+     !ref <batch>,
+     !ref <padding>,
+ ]
+
+ # HF optimized training configuration
+ train_conf:
+     optim: adamw
+     optim_conf:
+         lr: 3e-5   # Reduced from 5e-5
+     scheduler: constantlr
+     scheduler_conf:
+         warmup_steps: 200   # Reduced from 500
+     max_epoch: 50           # Reduced from 2000
+     grad_clip: 1
+     accum_grad: 2           # Added gradient accumulation
+     log_interval: 10        # Increased from 5
+     save_per_step: 1000     # Reduced from 2000
+     total_iters: 100000     # Reduced from 1000000000