ASR-finetuning / legacy /Week1_Startup_Code.md
saadmannan's picture
HF space application - exclude binary PDFs
5554ef1

A newer version of the Gradio SDK is available: 6.3.0

Upgrade

Immediate Action: Week 1 Startup Code Templates

Your First Command (RIGHT NOW)

Open terminal and execute:

# Create workspace
mkdir ~/ai-career-project
cd ~/ai-career-project

# Create and activate conda environment
conda create -n voice_ai python=3.10 -y
conda activate voice_ai

# Install core packages
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125
pip install transformers datasets librosa soundfile accelerate wandb
pip install flash-attn --no-build-isolation
pip install bitsandbytes
pip install gradio streamlit fastapi uvicorn

# Initialize git
git init
git config user.name "Your Name"
git config user.email "your@email.com"

Project 1: Whisper Fine-tuning - Starter Template

File: project1_whisper_setup.py

#!/usr/bin/env python3
"""
Whisper Fine-tuning Setup
Purpose: Fine-tune Whisper-small on German Common Voice data
GPU: RTX 5060 Ti optimized
"""

import torch
import sys
from pathlib import Path

def check_environment():
    """Verify all dependencies are installed"""
    print("=" * 60)
    print("ENVIRONMENT CHECK")
    print("=" * 60)
    
    # PyTorch
    print(f"βœ“ PyTorch: {torch.__version__}")
    print(f"βœ“ CUDA available: {torch.cuda.is_available()}")
    
    if torch.cuda.is_available():
        print(f"βœ“ GPU: {torch.cuda.get_device_name(0)}")
        print(f"βœ“ CUDA Capability: {torch.cuda.get_device_capability(0)}")
        print(f"βœ“ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    
    # Check transformers
    try:
        from transformers import AutoModel
        print("βœ“ Transformers: Installed")
    except ImportError:
        print("βœ— Transformers: NOT INSTALLED")
        return False
    
    # Check datasets
    try:
        from datasets import load_dataset
        print("βœ“ Datasets: Installed")
    except ImportError:
        print("βœ— Datasets: NOT INSTALLED")
        return False
    
    # Check librosa
    try:
        import librosa
        print("βœ“ Librosa: Installed")
    except ImportError:
        print("βœ— Librosa: NOT INSTALLED")
        return False
    
    print("\nβœ… All checks passed! Ready to start.\n")
    return True

def download_data():
    """Download Common Voice German dataset"""
    print("=" * 60)
    print("DOWNLOADING COMMON VOICE GERMAN")
    print("=" * 60)
    print("This will download ~500MB of German speech data...")
    print("Estimated time: 5-10 minutes depending on internet")
    
    from datasets import load_dataset
    
    # Load Common Voice German
    print("\nLoading dataset... (this may take a few minutes)")
    dataset = load_dataset(
        "mozilla-foundation/common_voice_11_0",
        "de",
        split="train[:10%]",  # Start with 10% (faster for first run)
        trust_remote_code=True
    )
    
    print(f"\nβœ“ Dataset loaded: {len(dataset)} samples")
    print(f"  Sample audio file: {dataset[0]['audio']}")
    print(f"  Sample text: {dataset[0]['sentence']}")
    
    # Save locally for faster loading next time
    print("\nSaving dataset locally...")
    dataset.save_to_disk("./data/common_voice_de")
    print("βœ“ Saved to ./data/common_voice_de/")
    
    return dataset

def optimize_settings():
    """Configure PyTorch for RTX 5060 Ti"""
    print("=" * 60)
    print("OPTIMIZING FOR RTX 5060 Ti")
    print("=" * 60)
    
    # Enable optimizations
    torch.set_float32_matmul_precision('high')
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    
    print("βœ“ torch.set_float32_matmul_precision('high')")
    print("βœ“ torch.backends.cuda.matmul.allow_tf32 = True")
    print("βœ“ torch.backends.cudnn.benchmark = True")
    print("\nThese settings will:")
    print("  β€’ Use Tensor Float 32 (TF32) for faster matrix operations")
    print("  β€’ Enable cuDNN auto-tuning for optimal kernel selection")
    print("  β€’ Expected speedup: 10-20%")
    
    return True

def main():
    """Main setup function"""
    print("\n" + "=" * 60)
    print("WHISPER FINE-TUNING SETUP")
    print("Project: Multilingual ASR for German")
    print("GPU: RTX 5060 Ti (16GB VRAM)")
    print("=" * 60 + "\n")
    
    # Check environment
    if not check_environment():
        print("❌ Environment check failed. Please install missing packages.")
        return False
    
    # Optimize settings
    optimize_settings()
    
    # Download data
    try:
        dataset = download_data()
    except Exception as e:
        print(f"⚠️  Data download failed: {e}")
        print("You can retry later with: python project1_whisper_setup.py")
        return False
    
    print("\n" + "=" * 60)
    print("βœ… SETUP COMPLETE!")
    print("=" * 60)
    print("\nNext steps:")
    print("1. Review the dataset in ./data/common_voice_de/")
    print("2. Run: python project1_whisper_train.py")
    print("3. Fine-tuning will begin (expect 2-3 days on RTX 5060 Ti)")
    print("=" * 60 + "\n")
    
    return True

if __name__ == "__main__":
    success = main()
    sys.exit(0 if success else 1)

Run this:

python project1_whisper_setup.py

File: project1_whisper_train.py

#!/usr/bin/env python3
"""
Whisper Fine-training Script
Optimized for RTX 5060 Ti
"""

import torch
from transformers import (
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    WhisperProcessor
)
from datasets import load_from_disk, concatenate_datasets
import sys

def setup_training():
    """Configure training for RTX 5060 Ti"""
    
    print("\n" + "=" * 60)
    print("WHISPER FINE-TRAINING")
    print("=" * 60)
    
    # Load model
    print("\n1. Loading Whisper-small model...")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    print(f"   Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
    
    # Load datasets
    print("\n2. Loading Common Voice data...")
    german_data = load_from_disk("./data/common_voice_de")
    
    # Split: 80% train, 20% eval
    split = german_data.train_test_split(test_size=0.2, seed=42)
    train_dataset = split['train']
    eval_dataset = split['test']
    
    print(f"   Training samples: {len(train_dataset)}")
    print(f"   Evaluation samples: {len(eval_dataset)}")
    
    # Training arguments optimized for RTX 5060 Ti
    print("\n3. Setting up training arguments...")
    training_args = Seq2SeqTrainingArguments(
        output_dir="./whisper_fine_tuned",
        per_device_train_batch_size=8,      # RTX 5060 Ti can handle this
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=2,       # Simulate batch size of 32
        learning_rate=1e-5,
        warmup_steps=500,
        num_train_epochs=3,
        evaluation_strategy="steps",
        eval_steps=1000,
        save_steps=1000,
        logging_steps=25,
        save_total_limit=3,
        weight_decay=0.01,
        push_to_hub=False,
        mixed_precision="fp16",             # CRITICAL for RTX 5060 Ti
        gradient_checkpointing=True,         # Trade compute for memory
        report_to="none",
        generation_max_length=225,
        seed=42,
    )
    
    print(f"   Batch size: {training_args.per_device_train_batch_size}")
    print(f"   Effective batch: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
    print(f"   Mixed precision: FP16")
    print(f"   Gradient checkpointing: Enabled")
    print(f"   Total training steps: ~{len(train_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * 3}")
    
    # Create trainer
    print("\n4. Creating trainer...")
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        processing_class=processor,
    )
    
    print("βœ“ Trainer created")
    
    return trainer, model

def train():
    """Run training"""
    print("\n⏱️  STARTING TRAINING...")
    print("   Estimated time: 2-3 days on RTX 5060 Ti")
    print("   Estimated VRAM usage: 14-16 GB")
    print("   You can monitor GPU with: watch -n 1 nvidia-smi")
    
    trainer, model = setup_training()
    
    try:
        # Start training
        trainer.train()
        
        print("\nβœ… TRAINING COMPLETE!")
        print("   Model saved to: ./whisper_fine_tuned")
        
        # Save final model
        model.save_pretrained("./whisper_fine_tuned_final")
        print("   Final checkpoint saved")
        
        return True
        
    except KeyboardInterrupt:
        print("\n⚠️  Training interrupted by user")
        print("   You can resume training later")
        return False
    except RuntimeError as e:
        if "out of memory" in str(e):
            print("\n❌ Out of memory error!")
            print("   Solutions:")
            print("   1. Reduce batch size (currently 8)")
            print("   2. Increase gradient accumulation steps (currently 2)")
            print("   3. Use smaller Whisper model (base instead of small)")
            return False
        raise

if __name__ == "__main__":
    success = train()
    sys.exit(0 if success else 1)

Run this:

python project1_whisper_train.py

Project 2: VAD + Speaker Diarization - Quick Start

File: project2_vad_diarization.py

#!/usr/bin/env python3
"""
Voice Activity Detection + Speaker Diarization
Simple script to get started
"""

import torch
import librosa
import numpy as np
from pathlib import Path

def setup_vad():
    """Setup Silero VAD"""
    print("Setting up Voice Activity Detection...")
    
    from silero_vad import load_silero_vad, get_speech_timestamps, read_audio
    
    model = load_silero_vad(onnx=False)
    print("βœ“ Silero VAD loaded (40 MB)")
    
    return model

def setup_diarization():
    """Setup Speaker Diarization"""
    print("Setting up Speaker Diarization...")
    print("⚠️  First download requires 1GB+ bandwidth (one-time)")
    
    from pyannote.audio import Pipeline
    
    # You need Hugging Face token for this
    # Get it: https://huggingface.co/settings/tokens
    
    try:
        pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token="hf_YOUR_TOKEN_HERE"
        )
        print("βœ“ Diarization pipeline loaded")
        return pipeline
    except Exception as e:
        print(f"❌ Error: {e}")
        print("Get your HF token: https://huggingface.co/settings/tokens")
        return None

def demo_vad(audio_path, vad_model):
    """Demo VAD on an audio file"""
    print(f"\nVAD Analysis: {audio_path}")
    
    from silero_vad import get_speech_timestamps, read_audio
    
    wav = read_audio(audio_path, sr=16000)
    
    timestamps = get_speech_timestamps(
        wav,
        vad_model,
        num_steps_state=4,
        threshold=0.5,
        sampling_rate=16000
    )
    
    print(f"Found {len(timestamps)} speech segments:")
    for i, ts in enumerate(timestamps, 1):
        start_ms = ts['start']
        end_ms = ts['end']
        duration_ms = end_ms - start_ms
        print(f"  Segment {i}: {start_ms:6}ms - {end_ms:6}ms ({duration_ms:6}ms)")
    
    return timestamps

def demo_diarization(audio_path, diar_pipeline):
    """Demo Diarization on an audio file"""
    print(f"\nDiarization Analysis: {audio_path}")
    
    diarization = diar_pipeline(audio_path)
    
    print("Speaker timeline:")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"  {turn.start:6.2f}s - {turn.end:6.2f}s: {speaker}")

def create_test_audio():
    """Create a simple test audio file"""
    print("\nCreating test audio (10 seconds)...")
    
    import soundfile as sf
    
    # Generate simple sine wave
    sr = 16000
    duration = 10
    t = np.linspace(0, duration, int(sr * duration))
    
    # Mix of silence + speech-like patterns
    signal = np.zeros_like(t)
    signal[0:sr*2] = 0.1 * np.sin(2 * np.pi * 440 * t[0:sr*2])  # Tone
    signal[sr*3:sr*5] = 0  # Silence
    signal[sr*5:sr*7] = 0.1 * np.sin(2 * np.pi * 880 * t[0:sr*2])  # Different tone
    
    # Save
    sf.write("test_audio.wav", signal, sr)
    print("βœ“ Created test_audio.wav")
    
    return "test_audio.wav"

def main():
    print("\n" + "=" * 60)
    print("VOICE ACTIVITY DETECTION + SPEAKER DIARIZATION")
    print("=" * 60)
    
    # Setup VAD
    vad_model = setup_vad()
    
    # Setup Diarization (optional, requires HF token)
    diar_pipeline = setup_diarization()
    
    # Create test audio
    audio_path = create_test_audio()
    
    # Demo VAD
    demo_vad(audio_path, vad_model)
    
    # Demo Diarization
    if diar_pipeline:
        demo_diarization(audio_path, diar_pipeline)
    else:
        print("\n⚠️  Skipping diarization (no HF token)")
        print("   To enable: Get token at https://huggingface.co/settings/tokens")
        print("   Then update the script with: use_auth_token='your_token'")
    
    print("\n" + "=" * 60)
    print("βœ… Demo complete!")
    print("Next steps:")
    print("1. Get real audio files (use your FEARLESS STEPS data)")
    print("2. Process them with the functions above")
    print("3. Deploy with Gradio (see project2_gradio.py)")
    print("=" * 60 + "\n")

if __name__ == "__main__":
    main()

Run this:

python project2_vad_diarization.py

GitHub Repository Structure (Create this NOW)

# Create directory structure
mkdir -p whisper-german-asr/{data,notebooks,model,deployment,tests}
mkdir -p realtime-speaker-diarization/{data,notebooks,model,deployment,tests}
mkdir -p speech-emotion-recognition/{data,notebooks,model,deployment,tests}

# Create basic files for first project
cat > whisper-german-asr/README.md << 'EOF'
# Multilingual ASR Fine-tuning with Whisper

Fine-tuned OpenAI Whisper for German & English speech recognition

## Quick Start

```bash
pip install -r requirements.txt
python demo.py

Results

  • German WER: 8.2% (improved from 10.5% baseline)
  • English WER: 5.1%
  • Inference: Real-time on CPU, sub-second on GPU

Architecture

  1. Base Model: Whisper-small (244M parameters)
  2. Dataset: Common Voice German + English
  3. Training: Mixed precision (FP16) + gradient checkpointing
  4. Deployment: FastAPI + Docker

EOF

Create requirements file

cat > whisper-german-asr/requirements.txt << 'EOF' torch>=2.0.0 transformers>=4.30.0 datasets>=2.10.0 librosa>=0.10.0 soundfile>=0.12.0 accelerate>=0.20.0 gradio>=3.40.0 fastapi>=0.100.0 uvicorn>=0.23.0 EOF

Initialize git

cd whisper-german-asr git init git add README.md requirements.txt git commit -m "Initial commit: project structure"


---

## Week 1 Tasks (Checkbox)

IMMEDIATE (This Week): ☐ Install PyTorch 2.0 + CUDA 12.5 ☐ Run project1_whisper_setup.py (check environment) ☐ Download Common Voice German dataset ☐ Create GitHub repositories (3 projects) ☐ Push initial structure to GitHub ☐ Set up portfolio website (GitHub Pages template) ☐ Create LinkedIn profile update draft

OPTIONAL (If ahead of schedule): ☐ Start project2_vad_diarization.py ☐ Write first blog post draft ☐ Research target companies (ElevenLabs, voize, Parloa)


---

## Debugging Common Issues

### Issue: "CUDA out of memory"
**Solution:**
```python
# In training script, reduce batch size:
per_device_train_batch_size=4,  # Was 8
gradient_accumulation_steps=4,  # Increase to compensate

Issue: "Transformers not found"

Solution:

pip install transformers --upgrade

Issue: "Common Voice dataset won't download"

Solution:

# Check internet connection
# Try manually: https://commonvoice.mozilla.org/
# Or use cached version if available

Issue: "GPU not detected"

Solution:

python -c "import torch; print(torch.cuda.is_available())"
# If False, reinstall PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125

Success Checkpoints

Week 1 End:

  • Environment setup complete
  • Dataset downloaded
  • First training job started (or will start this weekend)

Week 2 End:

  • Project 1 (Whisper) training progress visible
  • Project 2 (VAD) demo working
  • GitHub repos initialized

Week 3 End:

  • All 3 projects deployed or near completion
  • Portfolio website live
  • First blog post published

What to Do RIGHT NOW (Today)

  1. Open terminal

    cd ~
    mkdir ai-career-project
    cd ai-career-project
    
  2. Run setup

    conda create -n voice_ai python=3.10 -y
    conda activate voice_ai
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125
    
  3. Clone this repo structure

    git clone YOUR-GITHUB-REPO
    cd whisper-german-asr
    pip install -r requirements.txt
    
  4. Test environment

    python project1_whisper_setup.py
    
  5. If successful:

    python project1_whisper_train.py
    

You now have everything you need to start. Execute immediately. No more planning. Ship! πŸš€