mnhatdaous committed on
Commit
248479c
1 Parent(s): edfcfb2

Add comprehensive training pipeline for Hugging Face deployment

TRAINING_GUIDE.md ADDED
@@ -0,0 +1,222 @@
+ # 🎤 Learnable-Speech Training Quick Start Guide
+
+ This guide will help you train the Learnable-Speech model from scratch and deploy it on Hugging Face.
+
+ ## 📋 Prerequisites
+
+ 1. **Hardware Requirements**:
+    - GPU with at least 8GB VRAM (16GB+ recommended)
+    - 32GB+ RAM
+    - 100GB+ storage space
+
+ 2. **Software Requirements**:
+    - Python 3.10+
+    - CUDA 11.8+
+    - PyTorch 2.0+
+
+ ## 🚀 Step-by-Step Training Process
+
+ ### Step 1: Environment Setup
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/primepake/learnable-speech.git
+ cd learnable-speech
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Install S3Tokenizer
+ cd speech/tools/S3Tokenizer
+ pip install .
+ cd ../../..
+ ```
+
+ ### Step 2: Download Prerequisites
+
+ ```bash
+ # Make scripts executable
+ chmod +x scripts/*.sh
+
+ # Download pretrained models
+ ./scripts/download_pretrained.sh
+ ```
+
+ ### Step 3: Prepare Your Dataset
+
+ ```bash
+ # Organize your dataset like this:
+ # dataset_root/
+ # ├── speaker1_001.wav
+ # ├── speaker1_001.txt
+ # ├── speaker1_002.wav
+ # ├── speaker1_002.txt
+ # └── ...
+
+ # Set your paths here, and update the matching DATASET_ROOT/OUTPUT_DIR
+ # variables inside scripts/prepare_data.sh
+ export DATASET_ROOT="/path/to/your/dataset"
+ export OUTPUT_DIR="/path/to/processed/data"
+
+ # Run data preparation
+ ./scripts/prepare_data.sh
+ ```
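+
+ Before launching the pipeline, it is worth confirming that every clip has a matching transcript, since unpaired files only surface as errors partway through preparation. A minimal sketch (a hypothetical `check_pairs` helper, not part of the repo):
+
+ ```python
+ # Hypothetical pairing check for the flat wav/txt layout shown above
+ import os
+ from pathlib import Path
+
+ def check_pairs(dataset_root: str) -> None:
+     root = Path(dataset_root)
+     wavs = sorted(root.glob("*.wav"))
+     missing = [w for w in wavs if not w.with_suffix(".txt").exists()]
+     print(f"{len(wavs)} wav files, {len(missing)} missing transcripts")
+     for w in missing[:10]:
+         print(f"  no transcript for {w.name}")
+
+ check_pairs(os.environ.get("DATASET_ROOT", "/path/to/your/dataset"))
+ ```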
+
+ ### Step 4: Train the Models
+
+ ```bash
+ # Option A: Train the full pipeline (recommended)
+ ./scripts/train_full_pipeline.sh
+
+ # Option B: Train the stages separately
+ ./speech/llm_run.sh   # Stage 1: LLM
+ ./speech/flow_run.sh  # Stage 2: Flow
+ ```
+
+ ### Step 5: Upload to Hugging Face
+
+ ```bash
+ # Get your HF token from https://huggingface.co/settings/tokens
+ export HF_TOKEN="your_token_here"
+
+ # Upload trained models
+ python scripts/upload_to_hf.py \
+     --checkpoint_dir ./checkpoints \
+     --username your_hf_username \
+     --models both
+ ```
+
+ ### Step 6: Update the Gradio App
+
+ ```python
+ # Update app.py to use your trained models
+ from huggingface_hub import hf_hub_download
+ import torch
+
+ # Download your trained models
+ llm_path = hf_hub_download(
+     repo_id="your_username/learnable-speech-llm",
+     filename="pytorch_model.bin",
+ )
+ flow_path = hf_hub_download(
+     repo_id="your_username/learnable-speech-flow",
+     filename="pytorch_model.bin",
+ )
+
+ # The upload script stores the raw .pt checkpoints as pytorch_model.bin,
+ # so they load directly with torch.load
+ llm_state = torch.load(llm_path, map_location="cpu")
+ flow_state = torch.load(flow_path, map_location="cpu")
+
+ # Load and use the models in your synthesis function
+ def synthesize_speech(text, speaker_id=0):
+     # Replace this placeholder with actual model inference
+     # ... your inference code here ...
+     pass
+ ```
+
+ ## 🎯 Training Configurations
+
+ ### For Different Environments:
+
+ 1. **Local Development** (Single GPU):
+    ```bash
+    export CUDA_VISIBLE_DEVICES="0"
+    python speech/train.py --config speech/config.yaml --model llm ...
+    ```
+
+ 2. **Multi-GPU Training**:
+    ```bash
+    export CUDA_VISIBLE_DEVICES="0,1,2,3"
+    torchrun --nproc_per_node=4 speech/train.py ...
+    ```
+
+ 3. **Cloud Training** (Google Colab/Kaggle):
+    ```python
+    # Use config_hf.yaml for resource-constrained environments
+    !python speech/train.py --config speech/config_hf.yaml ...
+    ```
+
+ 4. **Hugging Face Spaces**:
+    ```bash
+    # For direct training on HF infrastructure
+    python speech/train.py --config speech/config_hf.yaml --timeout 1800 ...
+    ```
+
+ ## 📊 Monitoring Training
+
+ 1. **Comet ML** (Recommended):
+    ```bash
+    # Set up Comet ML for experiment tracking
+    export COMET_API_KEY="your_api_key"
+    # Training will automatically log to Comet
+    ```
+
+ 2. **TensorBoard**:
+    ```bash
+    tensorboard --logdir ./tensorboard
+    ```
+
+ 3. **Command Line**:
+    ```bash
+    # Monitor log files
+    tail -f checkpoints/llm/train.log
+    ```
+
+ ## 🔧 Troubleshooting
+
+ ### Common Issues:
+
+ 1. **Out of Memory**:
+    - Reduce the batch size in the config
+    - Use gradient accumulation (`--accum_grad`)
+    - Enable mixed precision training (`--use_amp`)
+
+ 2. **Slow Training**:
+    - Increase `num_workers` for data loading
+    - Use multiple GPUs with DDP
+    - Optimize data preprocessing
+
+ 3. **Model Not Converging**:
+    - Check the learning rate
+    - Verify data preprocessing
+    - Start from the pretrained checkpoints
+
+ ### Performance Tips:
+
+ 1. **Data Loading Optimization**:
+    ```yaml
+    # In config.yaml
+    num_workers: 24
+    prefetch: 100
+    pin_memory: true
+    ```
+
+ 2. **Memory Optimization**:
+    ```bash
+    # Mixed precision plus gradient accumulation
+    --use_amp --accum_grad 2
+    ```
+
+ 3. **Speed Optimization**:
+    ```bash
+    # Compile the model for faster training (PyTorch 2.0+)
+    export TORCH_COMPILE=1
+    ```
+
+ ## 📈 Expected Training Times
+
+ | Configuration | LLM Training | Flow Training | Total |
+ |---------------|--------------|---------------|-------|
+ | Single RTX 4090 | 2-3 days | 1-2 days | 3-5 days |
+ | 4x RTX 4090 | 12-18 hours | 6-12 hours | 1-2 days |
+ | 8x A100 | 6-8 hours | 3-4 hours | 9-12 hours |
+
+ ## 🎉 Success Criteria
+
+ Your training is successful when:
+
+ 1. **LLM Stage**: Perplexity < 2.0, token accuracy > 95% (see the conversion sketch below)
+ 2. **Flow Stage**: Reconstruction loss < 0.1, mel spectral loss < 0.05
+ 3. **Audio Quality**: Generated samples sound natural and intelligible
+
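+ Perplexity is just the exponential of the mean cross-entropy loss reported in the training logs, so the target can be checked directly; a small conversion sketch (the 0.65 loss value is illustrative):
+
+ ```python
+ import math
+
+ def perplexity(mean_ce_loss: float) -> float:
+     """Convert mean cross-entropy (nats per token) to perplexity."""
+     return math.exp(mean_ce_loss)
+
+ print(perplexity(0.65))  # ≈ 1.92, under the < 2.0 target
+ ```
+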
+ ## 📚 Additional Resources
+
+ - [Training Logs Analysis](docs/training_analysis.md)
+ - [Hyperparameter Tuning Guide](docs/hyperparameters.md)
+ - [Deployment Best Practices](docs/deployment.md)
+ - [Community Discord](https://discord.gg/learnable-speech)
scripts/download_pretrained.sh ADDED
@@ -0,0 +1,17 @@
+ #!/bin/bash
+ set -e  # Stop if any download fails
+
+ # Create pretrained models directory
+ mkdir -p pretrained_models/CosyVoice2-0.5B
+
+ echo "Downloading CosyVoice2 pretrained models..."
+
+ # Download the CosyVoice2 base checkpoints from the official Hugging Face release
+ wget -O pretrained_models/CosyVoice2-0.5B/llm.pt "https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B/resolve/main/llm.pt"
+ wget -O pretrained_models/CosyVoice2-0.5B/flow.pt "https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B/resolve/main/flow.pt"
+
+ # Download Qwen pretrained model
+ mkdir -p pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN
+ echo "Download the Qwen model manually from: https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B"
+
+ echo "Base checkpoints downloaded. Remember to fetch the Qwen model into CosyVoice-BlankEN before training."
scripts/prepare_data.sh ADDED
@@ -0,0 +1,53 @@
+ #!/bin/bash
+ set -e  # Stop on the first failed step
+
+ # Data preparation pipeline for Learnable-Speech training
+
+ echo "=== Learnable-Speech Data Preparation Pipeline ==="
+
+ # Configuration
+ DATASET_ROOT="/path/to/your/dataset"   # Change this to your dataset path
+ OUTPUT_DIR="/path/to/processed/data"   # Change this to your output path
+
+ # Create output directories
+ mkdir -p "$OUTPUT_DIR"/{fsq,dac_latents,lists}
+
+ echo "Step 1: Extract FSQ tokens using S3Tokenizer..."
+ cd speech/tools/S3Tokenizer
+ pip install .
+
+ # Extract FSQ tokens (25Hz)
+ torchrun --nproc_per_node=4 --nnodes=1 --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
+     "$(which s3tokenizer)" \
+     --root_path "$DATASET_ROOT" \
+     --model speech_tokenizer_v2_25hz \
+     --device "cuda" \
+     --batch_size 64 \
+     --file_list ../../../files_test.txt \
+     --skip_existing
+
+ echo "Step 2: Extract DAC-VAE latents..."
+ cd ../../../dac-vae
+
+ # Download DAC-VAE checkpoint
+ wget -O checkpoint.pt "https://github.com/primepake/learnable-speech/releases/download/dac-vae/dac_vae_checkpoint.pt"
+
+ # Extract DAC latents
+ python extract_dac_latents.py \
+     --checkpoint checkpoint.pt \
+     --config configs/config.yml \
+     --root_path "$DATASET_ROOT" \
+     --output_dir "$OUTPUT_DIR/dac_latents"
+
+ echo "Step 3: Create data lists..."
+ cd ../speech
+ python tools/create_data_list.py \
+     --src_dir "$OUTPUT_DIR" \
+     --output_dir "$OUTPUT_DIR/lists"
+
+ echo "Data preparation completed!"
+ echo "Your dataset should now have:"
+ echo "  - Original audio files (.wav)"
+ echo "  - Text transcriptions (.txt)"
+ echo "  - FSQ tokens (*_fsq.pt)"
+ echo "  - DAC latents (*_latent.pt)"
+ echo "  - Data list files"
scripts/train_full_pipeline.sh ADDED
@@ -0,0 +1,120 @@
+ #!/bin/bash
+
+ # Complete Learnable-Speech Training Pipeline
+ # This script trains the LLM and Flow models sequentially
+
+ set -e  # Exit on any error
+
+ echo "🎤 Starting Learnable-Speech Training Pipeline"
+ echo "=============================================="
+
+ # Configuration
+ DATASET_ROOT="${DATASET_ROOT:-/data/dataset}"
+ CHECKPOINT_DIR="${CHECKPOINT_DIR:-./checkpoints}"
+ PRETRAINED_DIR="${PRETRAINED_DIR:-./pretrained_models/CosyVoice2-0.5B}"
+ NUM_GPUS="${NUM_GPUS:-4}"
+ BATCH_SIZE="${BATCH_SIZE:-32}"
+
+ # Create checkpoint directories
+ mkdir -p "$CHECKPOINT_DIR"/{llm,flow}
+
+ # Check prerequisites
+ echo "📋 Checking prerequisites..."
+ if [ ! -d "$PRETRAINED_DIR" ]; then
+     echo "❌ Pretrained models not found. Please run scripts/download_pretrained.sh first"
+     exit 1
+ fi
+
+ if [ ! -f "./data/train.list" ]; then
+     echo "❌ Training data not found. Please run scripts/prepare_data.sh first"
+     exit 1
+ fi
+
+ # Set environment (adjust the device list to match NUM_GPUS)
+ export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3}"
+ export PYTHONPATH=$(pwd):$PYTHONPATH
+
+ echo "🚀 Starting Stage 1: LLM Training (BPE → FSQ tokens)"
+ echo "=================================================="
+
+ # Under set -e a bare failure aborts immediately, so test the command directly
+ # to report a stage-specific error message
+ if ! torchrun --nnodes=1 --nproc_per_node=$NUM_GPUS --rdzv_id=1986 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
+     speech/train.py \
+     --train_engine torch_ddp \
+     --config speech/config.yaml \
+     --train_data ./data/train.list \
+     --cv_data ./data/val.list \
+     --qwen_pretrain_path $PRETRAINED_DIR/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir $CHECKPOINT_DIR/llm/ \
+     --num_workers 24 \
+     --prefetch 100 \
+     --use_amp \
+     --pretrained_model $PRETRAINED_DIR/llm.pt \
+     --comet_project "learnable-speech" \
+     --comet_experiment_name "llm-training-$(date +%Y%m%d-%H%M%S)"; then
+     echo "❌ Stage 1 (LLM) training failed!"
+     exit 1
+ fi
+ echo "✅ Stage 1 (LLM) training completed successfully!"
+
+ echo "🚀 Starting Stage 2: Flow Training (FSQ → DAC latents)"
+ echo "====================================================="
+
+ # Find the latest LLM checkpoint
+ LATEST_LLM_CHECKPOINT=$(ls -t "$CHECKPOINT_DIR"/llm/*.pt | head -1)
+ echo "Using LLM checkpoint: $LATEST_LLM_CHECKPOINT"
+
+ if ! torchrun --nnodes=1 --nproc_per_node=$NUM_GPUS --rdzv_id=1987 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1235" \
+     speech/train.py \
+     --train_engine torch_ddp \
+     --config speech/config.yaml \
+     --train_data ./data/train.list \
+     --cv_data ./data/val.list \
+     --qwen_pretrain_path $PRETRAINED_DIR/CosyVoice-BlankEN \
+     --model flow \
+     --model_dir $CHECKPOINT_DIR/flow/ \
+     --num_workers 24 \
+     --prefetch 100 \
+     --use_amp \
+     --pretrained_model $PRETRAINED_DIR/flow.pt \
+     --comet_project "learnable-speech" \
+     --comet_experiment_name "flow-training-$(date +%Y%m%d-%H%M%S)"; then
+     echo "❌ Stage 2 (Flow) training failed!"
+     exit 1
+ fi
+ echo "✅ Stage 2 (Flow) training completed successfully!"
+
+ echo "🎉 Training pipeline completed successfully!"
+ echo "=========================================="
+ echo "Trained models saved in: $CHECKPOINT_DIR"
+ echo ""
+ echo "Next steps:"
+ echo "1. Test your models with inference scripts"
+ echo "2. Upload checkpoints to Hugging Face Hub"
+ echo "3. Update the Gradio app with trained models"
+
+ # Create a summary file
+ cat > "$CHECKPOINT_DIR/training_summary.txt" << EOF
+ Learnable-Speech Training Summary
+ Generated: $(date)
+
+ Dataset: $DATASET_ROOT
+ LLM Checkpoint: $(ls -t "$CHECKPOINT_DIR"/llm/*.pt | head -1)
+ Flow Checkpoint: $(ls -t "$CHECKPOINT_DIR"/flow/*.pt | head -1)
+
+ Configuration:
+ - GPUs: $NUM_GPUS
+ - Batch Size: $BATCH_SIZE
+ - Mixed Precision: Enabled
+ - Framework: PyTorch DDP
+
+ Training completed successfully!
+ EOF
+
+ echo "📄 Training summary saved to: $CHECKPOINT_DIR/training_summary.txt"
scripts/training_configs.sh ADDED
@@ -0,0 +1,81 @@
+ # Learnable-Speech Training Configurations for Different Environments
+ # These are reference commands: copy the block that matches your setup
+ # rather than executing this file top to bottom.
+
+ # ==== LOCAL TRAINING (Single GPU) ====
+ # For development and testing
+
+ export CUDA_VISIBLE_DEVICES="0"
+ export PYTHONPATH=/path/to/learnable-speech:$PYTHONPATH
+
+ # Single GPU training
+ python train.py \
+     --train_engine torch_ddp \
+     --config config.yaml \
+     --train_data ./data/train.list \
+     --cv_data ./data/val.list \
+     --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir ./checkpoints/llm/ \
+     --num_workers 4 \
+     --prefetch 50 \
+     --use_amp \
+     --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt
+
+ # ==== MULTI-GPU TRAINING (Local) ====
+ # For faster training on multiple GPUs
+
+ export CUDA_VISIBLE_DEVICES="0,1,2,3"
+ num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
+
+ torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=1986 --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
+     train.py \
+     --train_engine torch_ddp \
+     --config config.yaml \
+     --train_data ./data/train.list \
+     --cv_data ./data/val.list \
+     --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir ./checkpoints/llm/ \
+     --num_workers 24 \
+     --prefetch 100 \
+     --use_amp \
+     --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt
+
+ # ==== CLOUD TRAINING (Google Colab/Kaggle) ====
+ # Optimized for limited resources
+
+ export CUDA_VISIBLE_DEVICES="0"
+ pip install -r requirements.txt
+
+ python train.py \
+     --train_engine torch_ddp \
+     --config config.yaml \
+     --train_data ./data/small_train.list \
+     --cv_data ./data/small_val.list \
+     --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir /content/checkpoints/llm/ \
+     --num_workers 2 \
+     --prefetch 25 \
+     --use_amp \
+     --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt \
+     --comet_disabled  # Disable logging for simplicity
+
+ # ==== HUGGING FACE SPACES TRAINING ====
+ # For training directly on HF infrastructure
+
+ # Note: This requires an HF Pro subscription for GPU access.
+ # Use smaller batch sizes and enable checkpointing.
+
+ python train.py \
+     --train_engine torch_ddp \
+     --config config_hf.yaml \
+     --train_data ./data/hf_train.list \
+     --cv_data ./data/hf_val.list \
+     --qwen_pretrain_path ./pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN \
+     --model llm \
+     --model_dir /tmp/checkpoints/llm/ \
+     --num_workers 1 \
+     --prefetch 10 \
+     --use_amp \
+     --pretrained_model ./pretrained_models/CosyVoice2-0.5B/llm.pt \
+     --timeout 1800  # 30-minute timeout for HF
scripts/upload_to_hf.py ADDED
@@ -0,0 +1,229 @@
+ #!/usr/bin/env python3
+ """Upload trained Learnable-Speech models to the Hugging Face Hub."""
+
+ import argparse
+ import json
+ import os
+ from pathlib import Path
+
+ import torch
+ from huggingface_hub import create_repo, upload_file
+
+
+ def create_model_card(model_name, training_info):
+     """Create a model card for the uploaded model."""
+     return f"""---
+ license: apache-2.0
+ tags:
+ - text-to-speech
+ - speech-synthesis
+ - learnable-speech
+ - cosyvoice
+ - pytorch
+ pipeline_tag: text-to-speech
+ library_name: pytorch
+ ---
+
+ # Learnable-Speech {model_name.upper()}
+
+ This is a trained {model_name} model from the Learnable-Speech project, an unofficial implementation based on improvements of CosyVoice with a learnable encoder and DAC-VAE.
+
+ ## Model Description
+
+ - **Model Type**: {model_name.upper()} ({"Language Model" if model_name == "llm" else "Flow Matching Decoder"})
+ - **Architecture**: {"Qwen2-based transformer for BPE→FSQ token mapping" if model_name == "llm" else "Causal conditional flow matching for FSQ→DAC latent mapping"}
+ - **Sample Rate**: 24kHz
+ - **Framework**: PyTorch
+
+ ## Training Details
+
+ {training_info}
+
+ ## Usage
+
+ ```python
+ import torch
+ from learnable_speech import LearnableSpeech
+
+ # Load the model
+ model = LearnableSpeech.from_pretrained("your-username/learnable-speech-{model_name}")
+
+ # Generate speech
+ text = "Hello, this is Learnable-Speech!"
+ audio = model.synthesize(text)
+ ```
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @article{{learnable-speech,
+   title={{Learnable-Speech}},
+   author={{Learnable team}},
+   year={{2025}},
+   url={{https://arxiv.org/pdf/2505.07916}}
+ }}
+ ```
+
+ ## Links
+
+ - [GitHub Repository](https://github.com/primepake/learnable-speech)
+ - [Original Paper](https://arxiv.org/pdf/2505.07916)
+ - [Hugging Face Space Demo](https://huggingface.co/spaces/mnhatdaous/learnable-speech)
+ """
+
+
+ def upload_model_to_hf(checkpoint_path, model_name, repo_name, token=None, private=False):
+     """Upload a trained model to the Hugging Face Hub."""
+     # Create repository
+     try:
+         create_repo(
+             repo_id=repo_name,
+             token=token,
+             private=private,
+             exist_ok=True,
+         )
+         print(f"✅ Repository {repo_name} created/found")
+     except Exception as e:
+         print(f"❌ Failed to create repository: {e}")
+         return False
+
+     # Load checkpoint to get training info
+     try:
+         checkpoint = torch.load(checkpoint_path, map_location="cpu")
+         training_info = f"""
+ - **Training Steps**: {checkpoint.get('step', 'Unknown')}
+ - **Training Epochs**: {checkpoint.get('epoch', 'Unknown')}
+ - **Training Framework**: PyTorch DDP with AMP
+ - **Optimizer**: AdamW
+ - **Learning Rate**: {checkpoint.get('lr', 'Unknown')}
+ """
+     except Exception as e:
+         print(f"⚠️ Could not load checkpoint info: {e}")
+         training_info = "Training information not available"
+
+     # Create model card
+     model_card = create_model_card(model_name, training_info)
+
+     # Save the model card to a temporary file
+     readme_path = f"README_{model_name}.md"
+     config_path = f"config_{model_name}.json"
+     with open(readme_path, "w") as f:
+         f.write(model_card)
+
+     try:
+         # Upload checkpoint
+         upload_file(
+             path_or_fileobj=checkpoint_path,
+             path_in_repo="pytorch_model.bin",
+             repo_id=repo_name,
+             token=token,
+         )
+         print("✅ Model checkpoint uploaded")
+
+         # Upload model card
+         upload_file(
+             path_or_fileobj=readme_path,
+             path_in_repo="README.md",
+             repo_id=repo_name,
+             token=token,
+         )
+         print("✅ Model card uploaded")
+
+         # Create and upload config
+         config = {
+             "model_type": "learnable_speech",
+             "architecture": model_name,
+             "sample_rate": 24000,
+             "framework": "pytorch",
+         }
+
+         with open(config_path, "w") as f:
+             json.dump(config, f, indent=2)
+
+         upload_file(
+             path_or_fileobj=config_path,
+             path_in_repo="config.json",
+             repo_id=repo_name,
+             token=token,
+         )
+         print("✅ Config uploaded")
+
+         print(f"🎉 Model successfully uploaded to: https://huggingface.co/{repo_name}")
+         return True
+
+     except Exception as e:
+         print(f"❌ Failed to upload: {e}")
+         return False
+
+     finally:
+         # Clean up temporary files even if an upload step failed
+         for path in (readme_path, config_path):
+             if os.path.exists(path):
+                 os.remove(path)
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Upload Learnable-Speech models to Hugging Face")
+     parser.add_argument("--checkpoint_dir", required=True, help="Directory containing trained checkpoints")
+     parser.add_argument("--username", required=True, help="Your Hugging Face username")
+     parser.add_argument("--token", help="Hugging Face API token (or set HF_TOKEN env var)")
+     parser.add_argument("--private", action="store_true", help="Make repositories private")
+     parser.add_argument("--models", nargs="+", choices=["llm", "flow", "both"], default=["both"],
+                         help="Which models to upload")
+
+     args = parser.parse_args()
+
+     # Get token
+     token = args.token or os.getenv("HF_TOKEN")
+     if not token:
+         print("❌ Please provide a Hugging Face token via --token or the HF_TOKEN env var")
+         return
+
+     checkpoint_dir = Path(args.checkpoint_dir)
+
+     if "both" in args.models:
+         models_to_upload = ["llm", "flow"]
+     else:
+         models_to_upload = args.models
+
+     success_count = 0
+
+     for model_name in models_to_upload:
+         print(f"\n🚀 Uploading {model_name.upper()} model...")
+
+         # Find the model directory
+         model_dir = checkpoint_dir / model_name
+         if not model_dir.exists():
+             print(f"❌ Model directory not found: {model_dir}")
+             continue
+
+         checkpoint_files = list(model_dir.glob("*.pt"))
+         if not checkpoint_files:
+             print(f"❌ No checkpoint files found in {model_dir}")
+             continue
+
+         # Get the latest checkpoint (by modification time)
+         latest_checkpoint = max(checkpoint_files, key=os.path.getmtime)
+         print(f"📁 Using checkpoint: {latest_checkpoint}")
+
+         # Upload to HF
+         repo_name = f"{args.username}/learnable-speech-{model_name}"
+         success = upload_model_to_hf(
+             checkpoint_path=str(latest_checkpoint),
+             model_name=model_name,
+             repo_name=repo_name,
+             token=token,
+             private=args.private,
+         )
+
+         if success:
+             success_count += 1
+
+     print(f"\n🎉 Upload complete! {success_count}/{len(models_to_upload)} models uploaded successfully")
+
+     if success_count > 0:
+         print("\n📝 Next steps:")
+         print("1. Update your Gradio app to use the uploaded models")
+         print("2. Test the models in your Hugging Face Space")
+         print("3. Share your trained models with the community!")
+
+
+ if __name__ == "__main__":
+     main()
speech/config_hf.yaml ADDED
@@ -0,0 +1,192 @@
+ # Hugging Face optimized configuration
+ # This config is optimized for training on HF Spaces with limited resources
+
+ # set random seed
+ __set_seed1: !apply:random.seed [1986]
+ __set_seed2: !apply:numpy.random.seed [1986]
+ __set_seed3: !apply:torch.manual_seed [1986]
+ __set_seed4: !apply:torch.cuda.manual_seed_all [1986]
+
+ # fixed params - optimized for HF
+ sample_rate: 24000
+ llm_input_size: 512    # Reduced from 896
+ llm_output_size: 512   # Reduced from 896
+ spk_embed_dim: 128     # Reduced from 192
+ qwen_pretrain_path: ''
+ token_frame_rate: 25
+ token_mel_ratio: 2
+ token_latent_ratio: 3
+ use_speaker_encoder: True
+ speaker_encoder_path: '/tmp/checkpoints/llm/best_speaker_encoder.pt'
+
+ # stream related params
+ chunk_size: 16                 # Reduced from 25
+ num_decoding_left_chunks: -1
+
+ speaker_encoder_config:
+     mel_dim: 80
+     model_dim: 256               # Reduced from 512
+     output_dim: !ref <spk_embed_dim>
+     num_blocks: 4                # Reduced from 6
+     num_heads: 4                 # Reduced from 8
+     kernel_size: 1
+     dropout: 0.1
+     max_conditioning_inputs: 2   # Reduced from 3
+
+ # Smaller LLM model for HF
+ llm: !new:cosyvoice.llm.llm.Qwen2LM
+     llm_input_size: !ref <llm_input_size>
+     llm_output_size: !ref <llm_output_size>
+     speech_token_size: 6561
+     length_normalized_loss: True
+     lsm_weight: 0
+     mix_ratio: [3, 10]   # Reduced from [5, 15]
+     use_speaker_encoder: !ref <use_speaker_encoder>
+     spk_embed_dim: !ref <spk_embed_dim>
+     max_conditioning_inputs: 2
+     llm: !new:cosyvoice.llm.llm.Qwen2Encoder
+         pretrain_path: !ref <qwen_pretrain_path>
+     sampling: !name:cosyvoice.utils.common.ras_sampling
+         top_p: 0.8
+         top_k: 25
+         win_size: 8   # Reduced from 10
+         tau_r: 0.1
+
+ extract_reference_mel: !name:cosyvoice.dataset.processor.extract_reference_mel_from_speech
+     feat_extractor: !ref <feat_extractor>
+     min_length: 0.5
+     max_length: 3.0   # Reduced from 4.0
+     num_crops: 1
+     training: True
+     sample_rate: !ref <sample_rate>
+
+ # Smaller Flow model for HF
+ flow: !new:cosyvoice.flow.flow.CausalMaskedDiffWithXvec
+     input_size: 256   # Reduced from 512
+     output_size: 64
+     spk_embed_dim: !ref <spk_embed_dim>
+     output_type: 'mel'
+     vocab_size: 6561
+     input_frame_rate: !ref <token_frame_rate>
+     only_mask_loss: True
+     token_latent_ratio: !ref <token_latent_ratio>
+     pre_lookahead_len: 2   # Reduced from 3
+     use_speaker_encoder: !ref <use_speaker_encoder>
+     freeze_speaker_encoder: True
+     speaker_encoder_path: !ref <speaker_encoder_path>
+     encoder: !new:cosyvoice.transformer.upsample_encoder.UpsampleConformerEncoder
+         output_size: 256    # Reduced from 512
+         attention_heads: 4  # Reduced from 8
+         linear_units: 1024  # Reduced from 2048
+         num_blocks: 4       # Reduced from 6
+         dropout_rate: 0.1
+         positional_dropout_rate: 0.1
+         attention_dropout_rate: 0.1
+         normalize_before: True
+         input_layer: 'linear'
+         pos_enc_layer_type: 'rel_pos_espnet'
+         selfattention_layer_type: 'rel_selfattn'
+         input_size: 256     # Reduced from 512
+         use_cnn_module: False
+         macaron_style: False
+         static_chunk_size: !ref <chunk_size>
+     decoder: !new:cosyvoice.flow.flow_matching.CausalConditionalCFM
+         in_channels: 240
+         n_spks: 1
+         spk_emb_dim: 80
+         cfm_params: !new:omegaconf.DictConfig
+             content:
+                 sigma_min: 1e-06
+                 solver: 'euler'
+                 t_scheduler: 'cosine'
+                 training_cfg_rate: 0.1    # Reduced from 0.2
+                 inference_cfg_rate: 0.5   # Reduced from 0.7
+                 reg_loss_type: 'l1'
+                 use_immiscible: True
+                 immiscible_k: 4           # Reduced from 8
+                 use_contrastive_fm: True
+                 contrastive_lambda: 0.03  # Reduced from 0.05
+         estimator: !new:cosyvoice.flow.decoder.CausalConditionalDecoder
+             in_channels: 320
+             out_channels: 64
+             channels: [128]          # Reduced from [256]
+             dropout: 0.0
+             attention_head_dim: 32   # Reduced from 64
+             n_blocks: 3              # Reduced from 4
+             num_mid_blocks: 8        # Reduced from 12
+             num_heads: 4             # Reduced from 8
+             act_fn: 'gelu'
+             static_chunk_size: !ref <chunk_size> * <token_latent_ratio>
+             num_decoding_left_chunks: !ref <num_decoding_left_chunks>
+
+ # Processor functions (unchanged)
+ individual_file_opener: !name:cosyvoice.dataset.processor.individual_file_opener
+ parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
+ get_tokenizer: !name:cosyvoice.tokenizer.tokenizer.get_qwen_tokenizer
+     token_path: !ref <qwen_pretrain_path>
+     skip_special_tokens: True
+ allowed_special: 'all'
+ tokenize: !name:cosyvoice.dataset.processor.tokenize
+     get_tokenizer: !ref <get_tokenizer>
+     allowed_special: !ref <allowed_special>
+ filter: !name:cosyvoice.dataset.processor.filter
+     max_length: 20480        # Reduced from 40960
+     min_length: 100
+     token_max_length: 150    # Reduced from 200
+     token_min_length: 1
+ resample: !name:cosyvoice.dataset.processor.resample
+     resample_rate: !ref <sample_rate>
+ truncate: !name:cosyvoice.dataset.processor.truncate
+     truncate_length: 12240   # Reduced from 24480
+ feat_extractor: !name:matcha.utils.audio.mel_spectrogram
+     n_fft: 1920
+     num_mels: 80
+     sampling_rate: !ref <sample_rate>
+     hop_size: 480
+     win_size: 1920
+     fmin: 0
+     fmax: 8000
+     center: False
+ compute_fbank: !name:cosyvoice.dataset.processor.compute_fbank
+     feat_extractor: !ref <feat_extractor>
+     token_mel_ratio: !ref <token_mel_ratio>
+ shuffle: !name:cosyvoice.dataset.processor.shuffle
+     shuffle_size: 500   # Reduced from 1000
+ sort: !name:cosyvoice.dataset.processor.sort
+     sort_size: 250      # Reduced from 500
+ batch: !name:cosyvoice.dataset.processor.batch
+     batch_type: 'dynamic'
+     max_frames_in_batch: 2500   # Reduced from 5000
+ padding: !name:cosyvoice.dataset.processor.padding
+     use_speaker_encoder: !ref <use_speaker_encoder>
+
+ # dataset processor pipeline
+ data_pipeline: [
+     !ref <individual_file_opener>,
+     !ref <tokenize>,
+     !ref <filter>,
+     !ref <resample>,
+     !ref <extract_reference_mel>,
+     !ref <compute_fbank>,
+     !ref <shuffle>,
+     !ref <sort>,
+     !ref <batch>,
+     !ref <padding>,
+ ]
+
+ # HF optimized training configuration
+ train_conf:
+     optim: adamw
+     optim_conf:
+         lr: 3e-5   # Reduced from 5e-5
+     scheduler: constantlr
+     scheduler_conf:
+         warmup_steps: 200   # Reduced from 500
+     max_epoch: 50           # Reduced from 2000
+     grad_clip: 1
+     accum_grad: 2           # Added gradient accumulation
+     log_interval: 10        # Increased from 5
+     save_per_step: 1000     # Reduced from 2000
+     total_iters: 100000     # Reduced from 1000000000