Spaces:
Sleeping
A newer version of the Gradio SDK is available:
6.3.0
Immediate Action: Week 1 Startup Code Templates
Your First Command (RIGHT NOW)
Open terminal and execute:
# Create workspace
mkdir ~/ai-career-project
cd ~/ai-career-project
# Create and activate conda environment
conda create -n voice_ai python=3.10 -y
conda activate voice_ai
# Install core packages
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125
pip install transformers datasets librosa soundfile accelerate wandb
pip install flash-attn --no-build-isolation
pip install bitsandbytes
pip install gradio streamlit fastapi uvicorn
# Initialize git
git init
git config user.name "Your Name"
git config user.email "your@email.com"
Project 1: Whisper Fine-tuning - Starter Template
File: project1_whisper_setup.py
#!/usr/bin/env python3
"""
Whisper Fine-tuning Setup
Purpose: Fine-tune Whisper-small on German Common Voice data
GPU: RTX 5060 Ti optimized
"""
import torch
import sys
from pathlib import Path
def check_environment():
"""Verify all dependencies are installed"""
print("=" * 60)
print("ENVIRONMENT CHECK")
print("=" * 60)
# PyTorch
print(f"β PyTorch: {torch.__version__}")
print(f"β CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"β GPU: {torch.cuda.get_device_name(0)}")
print(f"β CUDA Capability: {torch.cuda.get_device_capability(0)}")
print(f"β VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Check transformers
try:
from transformers import AutoModel
print("β Transformers: Installed")
except ImportError:
print("β Transformers: NOT INSTALLED")
return False
# Check datasets
try:
from datasets import load_dataset
print("β Datasets: Installed")
except ImportError:
print("β Datasets: NOT INSTALLED")
return False
# Check librosa
try:
import librosa
print("β Librosa: Installed")
except ImportError:
print("β Librosa: NOT INSTALLED")
return False
print("\nβ
All checks passed! Ready to start.\n")
return True
def download_data():
"""Download Common Voice German dataset"""
print("=" * 60)
print("DOWNLOADING COMMON VOICE GERMAN")
print("=" * 60)
print("This will download ~500MB of German speech data...")
print("Estimated time: 5-10 minutes depending on internet")
from datasets import load_dataset
# Load Common Voice German
print("\nLoading dataset... (this may take a few minutes)")
dataset = load_dataset(
"mozilla-foundation/common_voice_11_0",
"de",
split="train[:10%]", # Start with 10% (faster for first run)
trust_remote_code=True
)
print(f"\nβ Dataset loaded: {len(dataset)} samples")
print(f" Sample audio file: {dataset[0]['audio']}")
print(f" Sample text: {dataset[0]['sentence']}")
# Save locally for faster loading next time
print("\nSaving dataset locally...")
dataset.save_to_disk("./data/common_voice_de")
print("β Saved to ./data/common_voice_de/")
return dataset
def optimize_settings():
"""Configure PyTorch for RTX 5060 Ti"""
print("=" * 60)
print("OPTIMIZING FOR RTX 5060 Ti")
print("=" * 60)
# Enable optimizations
torch.set_float32_matmul_precision('high')
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
print("β torch.set_float32_matmul_precision('high')")
print("β torch.backends.cuda.matmul.allow_tf32 = True")
print("β torch.backends.cudnn.benchmark = True")
print("\nThese settings will:")
print(" β’ Use Tensor Float 32 (TF32) for faster matrix operations")
print(" β’ Enable cuDNN auto-tuning for optimal kernel selection")
print(" β’ Expected speedup: 10-20%")
return True
def main():
"""Main setup function"""
print("\n" + "=" * 60)
print("WHISPER FINE-TUNING SETUP")
print("Project: Multilingual ASR for German")
print("GPU: RTX 5060 Ti (16GB VRAM)")
print("=" * 60 + "\n")
# Check environment
if not check_environment():
print("β Environment check failed. Please install missing packages.")
return False
# Optimize settings
optimize_settings()
# Download data
try:
dataset = download_data()
except Exception as e:
print(f"β οΈ Data download failed: {e}")
print("You can retry later with: python project1_whisper_setup.py")
return False
print("\n" + "=" * 60)
print("β
SETUP COMPLETE!")
print("=" * 60)
print("\nNext steps:")
print("1. Review the dataset in ./data/common_voice_de/")
print("2. Run: python project1_whisper_train.py")
print("3. Fine-tuning will begin (expect 2-3 days on RTX 5060 Ti)")
print("=" * 60 + "\n")
return True
if __name__ == "__main__":
success = main()
sys.exit(0 if success else 1)
Run this:
python project1_whisper_setup.py
File: project1_whisper_train.py
#!/usr/bin/env python3
"""
Whisper Fine-training Script
Optimized for RTX 5060 Ti
"""
import torch
from transformers import (
WhisperForConditionalGeneration,
Seq2SeqTrainingArguments,
Seq2SeqTrainer,
WhisperProcessor
)
from datasets import load_from_disk, concatenate_datasets
import sys
def setup_training():
"""Configure training for RTX 5060 Ti"""
print("\n" + "=" * 60)
print("WHISPER FINE-TRAINING")
print("=" * 60)
# Load model
print("\n1. Loading Whisper-small model...")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
print(f" Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
# Load datasets
print("\n2. Loading Common Voice data...")
german_data = load_from_disk("./data/common_voice_de")
# Split: 80% train, 20% eval
split = german_data.train_test_split(test_size=0.2, seed=42)
train_dataset = split['train']
eval_dataset = split['test']
print(f" Training samples: {len(train_dataset)}")
print(f" Evaluation samples: {len(eval_dataset)}")
# Training arguments optimized for RTX 5060 Ti
print("\n3. Setting up training arguments...")
training_args = Seq2SeqTrainingArguments(
output_dir="./whisper_fine_tuned",
per_device_train_batch_size=8, # RTX 5060 Ti can handle this
per_device_eval_batch_size=8,
gradient_accumulation_steps=2, # Simulate batch size of 32
learning_rate=1e-5,
warmup_steps=500,
num_train_epochs=3,
evaluation_strategy="steps",
eval_steps=1000,
save_steps=1000,
logging_steps=25,
save_total_limit=3,
weight_decay=0.01,
push_to_hub=False,
mixed_precision="fp16", # CRITICAL for RTX 5060 Ti
gradient_checkpointing=True, # Trade compute for memory
report_to="none",
generation_max_length=225,
seed=42,
)
print(f" Batch size: {training_args.per_device_train_batch_size}")
print(f" Effective batch: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f" Mixed precision: FP16")
print(f" Gradient checkpointing: Enabled")
print(f" Total training steps: ~{len(train_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * 3}")
# Create trainer
print("\n4. Creating trainer...")
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
processing_class=processor,
)
print("β Trainer created")
return trainer, model
def train():
"""Run training"""
print("\nβ±οΈ STARTING TRAINING...")
print(" Estimated time: 2-3 days on RTX 5060 Ti")
print(" Estimated VRAM usage: 14-16 GB")
print(" You can monitor GPU with: watch -n 1 nvidia-smi")
trainer, model = setup_training()
try:
# Start training
trainer.train()
print("\nβ
TRAINING COMPLETE!")
print(" Model saved to: ./whisper_fine_tuned")
# Save final model
model.save_pretrained("./whisper_fine_tuned_final")
print(" Final checkpoint saved")
return True
except KeyboardInterrupt:
print("\nβ οΈ Training interrupted by user")
print(" You can resume training later")
return False
except RuntimeError as e:
if "out of memory" in str(e):
print("\nβ Out of memory error!")
print(" Solutions:")
print(" 1. Reduce batch size (currently 8)")
print(" 2. Increase gradient accumulation steps (currently 2)")
print(" 3. Use smaller Whisper model (base instead of small)")
return False
raise
if __name__ == "__main__":
success = train()
sys.exit(0 if success else 1)
Run this:
python project1_whisper_train.py
Project 2: VAD + Speaker Diarization - Quick Start
File: project2_vad_diarization.py
#!/usr/bin/env python3
"""
Voice Activity Detection + Speaker Diarization
Simple script to get started
"""
import torch
import librosa
import numpy as np
from pathlib import Path
def setup_vad():
"""Setup Silero VAD"""
print("Setting up Voice Activity Detection...")
from silero_vad import load_silero_vad, get_speech_timestamps, read_audio
model = load_silero_vad(onnx=False)
print("β Silero VAD loaded (40 MB)")
return model
def setup_diarization():
"""Setup Speaker Diarization"""
print("Setting up Speaker Diarization...")
print("β οΈ First download requires 1GB+ bandwidth (one-time)")
from pyannote.audio import Pipeline
# You need Hugging Face token for this
# Get it: https://huggingface.co/settings/tokens
try:
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.0",
use_auth_token="hf_YOUR_TOKEN_HERE"
)
print("β Diarization pipeline loaded")
return pipeline
except Exception as e:
print(f"β Error: {e}")
print("Get your HF token: https://huggingface.co/settings/tokens")
return None
def demo_vad(audio_path, vad_model):
"""Demo VAD on an audio file"""
print(f"\nVAD Analysis: {audio_path}")
from silero_vad import get_speech_timestamps, read_audio
wav = read_audio(audio_path, sr=16000)
timestamps = get_speech_timestamps(
wav,
vad_model,
num_steps_state=4,
threshold=0.5,
sampling_rate=16000
)
print(f"Found {len(timestamps)} speech segments:")
for i, ts in enumerate(timestamps, 1):
start_ms = ts['start']
end_ms = ts['end']
duration_ms = end_ms - start_ms
print(f" Segment {i}: {start_ms:6}ms - {end_ms:6}ms ({duration_ms:6}ms)")
return timestamps
def demo_diarization(audio_path, diar_pipeline):
"""Demo Diarization on an audio file"""
print(f"\nDiarization Analysis: {audio_path}")
diarization = diar_pipeline(audio_path)
print("Speaker timeline:")
for turn, _, speaker in diarization.itertracks(yield_label=True):
print(f" {turn.start:6.2f}s - {turn.end:6.2f}s: {speaker}")
def create_test_audio():
"""Create a simple test audio file"""
print("\nCreating test audio (10 seconds)...")
import soundfile as sf
# Generate simple sine wave
sr = 16000
duration = 10
t = np.linspace(0, duration, int(sr * duration))
# Mix of silence + speech-like patterns
signal = np.zeros_like(t)
signal[0:sr*2] = 0.1 * np.sin(2 * np.pi * 440 * t[0:sr*2]) # Tone
signal[sr*3:sr*5] = 0 # Silence
signal[sr*5:sr*7] = 0.1 * np.sin(2 * np.pi * 880 * t[0:sr*2]) # Different tone
# Save
sf.write("test_audio.wav", signal, sr)
print("β Created test_audio.wav")
return "test_audio.wav"
def main():
print("\n" + "=" * 60)
print("VOICE ACTIVITY DETECTION + SPEAKER DIARIZATION")
print("=" * 60)
# Setup VAD
vad_model = setup_vad()
# Setup Diarization (optional, requires HF token)
diar_pipeline = setup_diarization()
# Create test audio
audio_path = create_test_audio()
# Demo VAD
demo_vad(audio_path, vad_model)
# Demo Diarization
if diar_pipeline:
demo_diarization(audio_path, diar_pipeline)
else:
print("\nβ οΈ Skipping diarization (no HF token)")
print(" To enable: Get token at https://huggingface.co/settings/tokens")
print(" Then update the script with: use_auth_token='your_token'")
print("\n" + "=" * 60)
print("β
Demo complete!")
print("Next steps:")
print("1. Get real audio files (use your FEARLESS STEPS data)")
print("2. Process them with the functions above")
print("3. Deploy with Gradio (see project2_gradio.py)")
print("=" * 60 + "\n")
if __name__ == "__main__":
main()
Run this:
python project2_vad_diarization.py
GitHub Repository Structure (Create this NOW)
# Create directory structure
mkdir -p whisper-german-asr/{data,notebooks,model,deployment,tests}
mkdir -p realtime-speaker-diarization/{data,notebooks,model,deployment,tests}
mkdir -p speech-emotion-recognition/{data,notebooks,model,deployment,tests}
# Create basic files for first project
cat > whisper-german-asr/README.md << 'EOF'
# Multilingual ASR Fine-tuning with Whisper
Fine-tuned OpenAI Whisper for German & English speech recognition
## Quick Start
```bash
pip install -r requirements.txt
python demo.py
Results
- German WER: 8.2% (improved from 10.5% baseline)
- English WER: 5.1%
- Inference: Real-time on CPU, sub-second on GPU
Architecture
- Base Model: Whisper-small (244M parameters)
- Dataset: Common Voice German + English
- Training: Mixed precision (FP16) + gradient checkpointing
- Deployment: FastAPI + Docker
EOF
Create requirements file
cat > whisper-german-asr/requirements.txt << 'EOF' torch>=2.0.0 transformers>=4.30.0 datasets>=2.10.0 librosa>=0.10.0 soundfile>=0.12.0 accelerate>=0.20.0 gradio>=3.40.0 fastapi>=0.100.0 uvicorn>=0.23.0 EOF
Initialize git
cd whisper-german-asr git init git add README.md requirements.txt git commit -m "Initial commit: project structure"
---
## Week 1 Tasks (Checkbox)
IMMEDIATE (This Week): β Install PyTorch 2.0 + CUDA 12.5 β Run project1_whisper_setup.py (check environment) β Download Common Voice German dataset β Create GitHub repositories (3 projects) β Push initial structure to GitHub β Set up portfolio website (GitHub Pages template) β Create LinkedIn profile update draft
OPTIONAL (If ahead of schedule): β Start project2_vad_diarization.py β Write first blog post draft β Research target companies (ElevenLabs, voize, Parloa)
---
## Debugging Common Issues
### Issue: "CUDA out of memory"
**Solution:**
```python
# In training script, reduce batch size:
per_device_train_batch_size=4, # Was 8
gradient_accumulation_steps=4, # Increase to compensate
Issue: "Transformers not found"
Solution:
pip install transformers --upgrade
Issue: "Common Voice dataset won't download"
Solution:
# Check internet connection
# Try manually: https://commonvoice.mozilla.org/
# Or use cached version if available
Issue: "GPU not detected"
Solution:
python -c "import torch; print(torch.cuda.is_available())"
# If False, reinstall PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125
Success Checkpoints
Week 1 End:
- Environment setup complete
- Dataset downloaded
- First training job started (or will start this weekend)
Week 2 End:
- Project 1 (Whisper) training progress visible
- Project 2 (VAD) demo working
- GitHub repos initialized
Week 3 End:
- All 3 projects deployed or near completion
- Portfolio website live
- First blog post published
What to Do RIGHT NOW (Today)
Open terminal
cd ~ mkdir ai-career-project cd ai-career-projectRun setup
conda create -n voice_ai python=3.10 -y conda activate voice_ai pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125Clone this repo structure
git clone YOUR-GITHUB-REPO cd whisper-german-asr pip install -r requirements.txtTest environment
python project1_whisper_setup.pyIf successful:
python project1_whisper_train.py
You now have everything you need to start. Execute immediately. No more planning. Ship! π