# Immediate Action: Week 1 Startup Code Templates

## Your First Command (RIGHT NOW)

Open a terminal and run:

```bash
# Create workspace
mkdir -p ~/ai-career-project
cd ~/ai-career-project
# Create and activate conda environment
conda create -n voice_ai python=3.10 -y
conda activate voice_ai
# Install core packages
pip install --upgrade pip
# RTX 50-series (Blackwell) GPUs need the CUDA 12.8 wheels
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install transformers datasets librosa soundfile accelerate wandb
# Optional: flash-attn may compile from source, which can take a long time
pip install flash-attn --no-build-isolation
pip install bitsandbytes
pip install gradio streamlit fastapi uvicorn
# Needed for Project 2 (VAD + diarization)
pip install silero-vad pyannote.audio
# Initialize git
git init
git config user.name "Your Name"
git config user.email "your@email.com"
```

---

## Project 1: Whisper Fine-tuning - Starter Template

### File: `project1_whisper_setup.py`

```python
#!/usr/bin/env python3
"""
Whisper Fine-tuning Setup
Purpose: Fine-tune Whisper-small on German Common Voice data
GPU: RTX 5060 Ti optimized
"""
import sys

import torch


def check_environment():
    """Verify all dependencies are installed"""
    print("=" * 60)
    print("ENVIRONMENT CHECK")
    print("=" * 60)

    # PyTorch
    print(f"✓ PyTorch: {torch.__version__}")
    print(f"✓ CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
        print(f"✓ CUDA Capability: {torch.cuda.get_device_capability(0)}")
        print(f"✓ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    # Check transformers
    try:
        from transformers import AutoModel  # noqa: F401
        print("✓ Transformers: Installed")
    except ImportError:
        print("✗ Transformers: NOT INSTALLED")
        return False

    # Check datasets
    try:
        from datasets import load_dataset  # noqa: F401
        print("✓ Datasets: Installed")
    except ImportError:
        print("✗ Datasets: NOT INSTALLED")
        return False

    # Check librosa
    try:
        import librosa  # noqa: F401
        print("✓ Librosa: Installed")
    except ImportError:
        print("✗ Librosa: NOT INSTALLED")
        return False

    print("\n✓ All checks passed! Ready to start.\n")
    return True


def download_data():
    """Download Common Voice German dataset"""
    print("=" * 60)
    print("DOWNLOADING COMMON VOICE GERMAN")
    print("=" * 60)
    print("Note: train[:10%] limits the samples that are loaded;")
    print("the amount of data downloaded can be considerably larger.")
    print("Download time depends on your connection.")

    from datasets import load_dataset

    # Load Common Voice German.
    # This is a gated dataset: accept its terms on the Hugging Face page
    # and log in first (huggingface-cli login).
    print("\nLoading dataset... (this may take a few minutes)")
    dataset = load_dataset(
        "mozilla-foundation/common_voice_11_0",
        "de",
        split="train[:10%]",  # Start with 10% (faster for first run)
        trust_remote_code=True,
    )
    print(f"\n✓ Dataset loaded: {len(dataset)} samples")
    print(f"  Sample audio file: {dataset[0]['audio']['path']}")
    print(f"  Sample text: {dataset[0]['sentence']}")

    # Save locally for faster loading next time
    print("\nSaving dataset locally...")
    dataset.save_to_disk("./data/common_voice_de")
    print("✓ Saved to ./data/common_voice_de/")
    return dataset


def optimize_settings():
    """Configure PyTorch for RTX 5060 Ti"""
    print("=" * 60)
    print("OPTIMIZING FOR RTX 5060 Ti")
    print("=" * 60)

    # Enable optimizations
    torch.set_float32_matmul_precision('high')
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    print("✓ torch.set_float32_matmul_precision('high')")
    print("✓ torch.backends.cuda.matmul.allow_tf32 = True")
    print("✓ torch.backends.cudnn.benchmark = True")
    print("\nThese settings will:")
    print("  • Use Tensor Float 32 (TF32) for faster matrix operations")
    print("  • Enable cuDNN auto-tuning for optimal kernel selection")
    print("  • Expected speedup: 10-20%")
    return True


def main():
    """Main setup function"""
    print("\n" + "=" * 60)
    print("WHISPER FINE-TUNING SETUP")
    print("Project: Multilingual ASR for German")
    print("GPU: RTX 5060 Ti (16GB VRAM)")
    print("=" * 60 + "\n")

    # Check environment
    if not check_environment():
        print("✗ Environment check failed. Please install missing packages.")
        return False

    # Optimize settings
    optimize_settings()

    # Download data
    try:
        download_data()
    except Exception as e:
        print(f"⚠️ Data download failed: {e}")
        print("You can retry later with: python project1_whisper_setup.py")
        return False

    print("\n" + "=" * 60)
    print("✓ SETUP COMPLETE!")
    print("=" * 60)
    print("\nNext steps:")
    print("1. Review the dataset in ./data/common_voice_de/")
    print("2. Run: python project1_whisper_train.py")
    print("3. Fine-tuning will begin (expect 2-3 days on RTX 5060 Ti)")
    print("=" * 60 + "\n")
    return True


if __name__ == "__main__":
    success = main()
    sys.exit(0 if success else 1)
```

**Run this:**

```bash
python project1_whisper_setup.py
```
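
The three try/except import checks in `check_environment()` all follow the same pattern, so they could be collapsed into one loop. A minimal sketch using only the standard library; `find_missing` is a hypothetical helper name, not part of the script above:

```python
from importlib.util import find_spec


def find_missing(packages):
    """Return the top-level package names that cannot be imported."""
    return [name for name in packages if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(["transformers", "datasets", "librosa"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages found.")
```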

---

### File: `project1_whisper_train.py`

```python
#!/usr/bin/env python3
"""
Whisper Fine-tuning Script
Optimized for RTX 5060 Ti
"""
import sys
from dataclasses import dataclass
from typing import Any

from datasets import Audio, load_from_disk
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)


@dataclass
class SpeechSeq2SeqCollator:
    """Pad audio features and label ids into one training batch."""
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features):
        inputs = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(inputs, return_tensors="pt")
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
        )
        # Mask padding with -100 so the loss ignores it
        label_ids = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
        # Drop the start token if present; the model prepends it again
        if (label_ids[:, 0] == self.decoder_start_token_id).all():
            label_ids = label_ids[:, 1:]
        batch["labels"] = label_ids
        return batch


def setup_training():
    """Configure training for RTX 5060 Ti"""
    print("\n" + "=" * 60)
    print("WHISPER FINE-TUNING")
    print("=" * 60)

    # Load model
    print("\n1. Loading Whisper-small model...")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    processor = WhisperProcessor.from_pretrained(
        "openai/whisper-small", language="German", task="transcribe"
    )
    print(f"   Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")

    # Load datasets
    print("\n2. Loading Common Voice data...")
    german_data = load_from_disk("./data/common_voice_de")
    # Whisper expects 16 kHz audio
    german_data = german_data.cast_column("audio", Audio(sampling_rate=16000))

    def prepare(example):
        """Turn raw audio + sentence into model inputs and label ids."""
        audio = example["audio"]
        example["input_features"] = processor.feature_extractor(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        example["labels"] = processor.tokenizer(example["sentence"]).input_ids
        return example

    german_data = german_data.map(prepare, remove_columns=german_data.column_names)

    # Split: 80% train, 20% eval
    split = german_data.train_test_split(test_size=0.2, seed=42)
    train_dataset = split['train']
    eval_dataset = split['test']
    print(f"   Training samples: {len(train_dataset)}")
    print(f"   Evaluation samples: {len(eval_dataset)}")

    # Training arguments optimized for RTX 5060 Ti
    print("\n3. Setting up training arguments...")
    training_args = Seq2SeqTrainingArguments(
        output_dir="./whisper_fine_tuned",
        per_device_train_batch_size=8,  # RTX 5060 Ti can handle this
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=2,  # Effective batch size of 16 (8 x 2)
        learning_rate=1e-5,
        warmup_steps=500,
        num_train_epochs=3,
        eval_strategy="steps",
        eval_steps=1000,
        save_steps=1000,
        logging_steps=25,
        save_total_limit=3,
        weight_decay=0.01,
        push_to_hub=False,
        fp16=True,  # Mixed precision; CRITICAL for RTX 5060 Ti
        gradient_checkpointing=True,  # Trade compute for memory
        report_to="none",
        generation_max_length=225,
        seed=42,
    )
    effective_batch = (
        training_args.per_device_train_batch_size
        * training_args.gradient_accumulation_steps
    )
    epochs = int(training_args.num_train_epochs)
    print(f"   Batch size: {training_args.per_device_train_batch_size}")
    print(f"   Effective batch: {effective_batch}")
    print("   Mixed precision: FP16")
    print("   Gradient checkpointing: Enabled")
    print(f"   Total training steps: ~{len(train_dataset) // effective_batch * epochs}")

    # Create trainer
    print("\n4. Creating trainer...")
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=SpeechSeq2SeqCollator(
            processor=processor,
            decoder_start_token_id=model.config.decoder_start_token_id,
        ),
        processing_class=processor,
    )
    print("✓ Trainer created")
    return trainer, model


def train():
    """Run training"""
    print("\n⏱️ STARTING TRAINING...")
    print("   Estimated time: 2-3 days on RTX 5060 Ti")
    print("   Estimated VRAM usage: 14-16 GB")
    print("   You can monitor GPU with: watch -n 1 nvidia-smi")
    trainer, model = setup_training()
    try:
        # Start training
        trainer.train()
        print("\n✓ TRAINING COMPLETE!")
        print("   Model saved to: ./whisper_fine_tuned")
        # Save final model
        model.save_pretrained("./whisper_fine_tuned_final")
        print("   Final checkpoint saved")
        return True
    except KeyboardInterrupt:
        print("\n⚠️ Training interrupted by user")
        print("   You can resume training later")
        return False
    except RuntimeError as e:
        if "out of memory" in str(e):
            print("\n✗ Out of memory error!")
            print("   Solutions:")
            print("   1. Reduce batch size (currently 8)")
            print("   2. Increase gradient accumulation steps (currently 2)")
            print("   3. Use smaller Whisper model (base instead of small)")
            return False
        raise


if __name__ == "__main__":
    success = train()
    sys.exit(0 if success else 1)
```

**Run this:**

```bash
python project1_whisper_train.py
```
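
To sanity-check the step count the training script prints, here is the same arithmetic as a standalone helper. With a per-device batch of 8 and 2 accumulation steps, the effective batch is 16. This is only an estimate: `training_steps` is a hypothetical name, the 10,000-sample figure is made up for illustration, and the sketch assumes a single GPU and counts a final partial batch as a full step.

```python
import math


def training_steps(num_samples, batch_size, grad_accum, epochs):
    """Return (effective batch size, estimated optimizer steps)."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Illustrative numbers only: 10,000 samples, batch 8, accumulation 2, 3 epochs
effective, total = training_steps(10_000, batch_size=8, grad_accum=2, epochs=3)
print(f"effective batch = {effective}, total steps = {total}")
```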

---

## Project 2: VAD + Speaker Diarization - Quick Start

### File: `project2_vad_diarization.py`

```python
#!/usr/bin/env python3
"""
Voice Activity Detection + Speaker Diarization
Simple script to get started
"""
import numpy as np

SAMPLING_RATE = 16000


def setup_vad():
    """Setup Silero VAD"""
    print("Setting up Voice Activity Detection...")
    from silero_vad import load_silero_vad

    model = load_silero_vad(onnx=False)
    print("✓ Silero VAD loaded")
    return model


def setup_diarization():
    """Setup Speaker Diarization"""
    print("Setting up Speaker Diarization...")
    print("⚠️ First download requires 1GB+ bandwidth (one-time)")
    from pyannote.audio import Pipeline

    # You need a Hugging Face token for this
    # Get it: https://huggingface.co/settings/tokens
    # The pipeline is gated: accept its user conditions on the model page first
    try:
        pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token="hf_YOUR_TOKEN_HERE"
        )
        print("✓ Diarization pipeline loaded")
        return pipeline
    except Exception as e:
        print(f"✗ Error: {e}")
        print("Get your HF token: https://huggingface.co/settings/tokens")
        return None


def demo_vad(audio_path, vad_model):
    """Demo VAD on an audio file"""
    print(f"\nVAD Analysis: {audio_path}")
    from silero_vad import get_speech_timestamps, read_audio

    wav = read_audio(audio_path, sampling_rate=SAMPLING_RATE)
    timestamps = get_speech_timestamps(
        wav,
        vad_model,
        threshold=0.5,
        sampling_rate=SAMPLING_RATE
    )
    print(f"Found {len(timestamps)} speech segments:")
    for i, ts in enumerate(timestamps, 1):
        # Timestamps come back as sample indices; convert to milliseconds
        start_ms = ts['start'] * 1000 // SAMPLING_RATE
        end_ms = ts['end'] * 1000 // SAMPLING_RATE
        duration_ms = end_ms - start_ms
        print(f"  Segment {i}: {start_ms:6}ms - {end_ms:6}ms ({duration_ms:6}ms)")
    return timestamps


def demo_diarization(audio_path, diar_pipeline):
    """Demo Diarization on an audio file"""
    print(f"\nDiarization Analysis: {audio_path}")
    diarization = diar_pipeline(audio_path)
    print("Speaker timeline:")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"  {turn.start:6.2f}s - {turn.end:6.2f}s: {speaker}")


def create_test_audio():
    """Create a simple test audio file"""
    print("\nCreating test audio (10 seconds)...")
    import soundfile as sf

    # Generate simple sine waves
    sr = SAMPLING_RATE
    duration = 10
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)

    # Mix of silence + tones. Pure tones are not speech, so the VAD may
    # report zero segments here; use a real recording to see detections.
    signal = np.zeros_like(t)
    signal[0:sr*2] = 0.1 * np.sin(2 * np.pi * 440 * t[0:sr*2])  # Tone
    signal[sr*5:sr*7] = 0.1 * np.sin(2 * np.pi * 880 * t[0:sr*2])  # Different tone

    # Save
    sf.write("test_audio.wav", signal, sr)
    print("✓ Created test_audio.wav")
    return "test_audio.wav"


def main():
    print("\n" + "=" * 60)
    print("VOICE ACTIVITY DETECTION + SPEAKER DIARIZATION")
    print("=" * 60)

    # Setup VAD
    vad_model = setup_vad()
    # Setup Diarization (optional, requires HF token)
    diar_pipeline = setup_diarization()
    # Create test audio
    audio_path = create_test_audio()
    # Demo VAD
    demo_vad(audio_path, vad_model)
    # Demo Diarization
    if diar_pipeline:
        demo_diarization(audio_path, diar_pipeline)
    else:
        print("\n⚠️ Skipping diarization (no HF token)")
        print("   To enable: Get token at https://huggingface.co/settings/tokens")
        print("   Then update the script with: use_auth_token='your_token'")

    print("\n" + "=" * 60)
    print("✓ Demo complete!")
    print("Next steps:")
    print("1. Get real audio files (use your FEARLESS STEPS data)")
    print("2. Process them with the functions above")
    print("3. Deploy with Gradio (see project2_gradio.py)")
    print("=" * 60 + "\n")


if __name__ == "__main__":
    main()
```

**Run this:**

```bash
python project2_vad_diarization.py
```
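
Silero VAD reports segment boundaries as sample indices by default, so they need converting before they are shown as times. A minimal sketch of that conversion, assuming 16 kHz audio; `to_milliseconds` is a hypothetical helper and the example segment is made up for illustration:

```python
def to_milliseconds(segments, sampling_rate=16000):
    """Convert [{'start': samples, 'end': samples}, ...] to milliseconds."""
    return [
        {
            "start": seg["start"] * 1000 // sampling_rate,
            "end": seg["end"] * 1000 // sampling_rate,
        }
        for seg in segments
    ]


# Made-up segment: samples 8000..40000 at 16 kHz is 500 ms..2500 ms
example = [{"start": 8000, "end": 40000}]
print(to_milliseconds(example))
```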

---

## GitHub Repository Structure (Create this NOW)

````bash
# Create directory structure
mkdir -p whisper-german-asr/{data,notebooks,model,deployment,tests}
mkdir -p realtime-speaker-diarization/{data,notebooks,model,deployment,tests}
mkdir -p speech-emotion-recognition/{data,notebooks,model,deployment,tests}
# Create basic files for first project
cat > whisper-german-asr/README.md << 'EOF'
# Multilingual ASR Fine-tuning with Whisper

Fine-tuned OpenAI Whisper for German & English speech recognition

## Quick Start

```bash
pip install -r requirements.txt
python demo.py
```

## Results (targets; replace with measured values after evaluation)

- **German WER:** 8.2% (improved from 10.5% baseline)
- **English WER:** 5.1%
- **Inference:** Real-time on CPU, sub-second on GPU

## Architecture

1. Base Model: Whisper-small (244M parameters)
2. Dataset: Common Voice German + English
3. Training: Mixed precision (FP16) + gradient checkpointing
4. Deployment: FastAPI + Docker
EOF
# Create requirements file
cat > whisper-german-asr/requirements.txt << 'EOF'
torch>=2.7.0
transformers>=4.46.0
datasets>=2.10.0
librosa>=0.10.0
soundfile>=0.12.0
accelerate>=0.20.0
gradio>=3.40.0
fastapi>=0.100.0
uvicorn>=0.23.0
EOF
# Initialize git
cd whisper-german-asr
git init
git add README.md requirements.txt
git commit -m "Initial commit: project structure"
````
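
The README template reports word error rate (WER). For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. A minimal pure-Python sketch follows. The function name `wer` is illustrative, real evaluations usually normalize text first, and in practice a library such as `jiwer` or `evaluate` is the more common choice.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words
print(wer("das ist ein Test", "das ist Test"))
```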

---

## Week 1 Tasks (Checkbox)

```
IMMEDIATE (This Week):
☐ Install PyTorch with the CUDA 12.8 wheels
☐ Run project1_whisper_setup.py (check environment)
☐ Download Common Voice German dataset
☐ Create GitHub repositories (3 projects)
☐ Push initial structure to GitHub
☐ Set up portfolio website (GitHub Pages template)
☐ Create LinkedIn profile update draft
OPTIONAL (If ahead of schedule):
☐ Start project2_vad_diarization.py
☐ Write first blog post draft
☐ Research target companies (ElevenLabs, voize, Parloa)
```

---

## Debugging Common Issues

### Issue: "CUDA out of memory"

**Solution:**

```python
# In training script, reduce batch size:
per_device_train_batch_size=4,  # Was 8
gradient_accumulation_steps=4,  # Increase to compensate
```
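
Halving the batch size while doubling the accumulation steps keeps the effective batch constant. A small sketch that lists the fallback settings from a given starting point. `oom_fallbacks` is a hypothetical helper, not a Trainer feature, and it only halves while the batch size is even so that the product stays the same:

```python
def oom_fallbacks(batch_size, grad_accum):
    """List (batch_size, grad_accum) pairs with the same effective batch."""
    options = []
    while batch_size % 2 == 0:
        batch_size //= 2
        grad_accum *= 2
        options.append((batch_size, grad_accum))
    return options


for bs, accum in oom_fallbacks(8, 2):
    print(f"per_device_train_batch_size={bs}, gradient_accumulation_steps={accum}")
```

Try the options in order and stop at the first one that fits in VRAM; smaller batches with more accumulation run slower per optimizer step.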

### Issue: "Transformers not found"

**Solution:**

```bash
pip install transformers --upgrade
```

### Issue: "Common Voice dataset won't download"

**Solution:**

```bash
# Check internet connection
# Try manually: https://commonvoice.mozilla.org/
# Or use cached version if available
```
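
The setup script saves the dataset to `./data/common_voice_de`, so a re-run can skip the download when that directory already exists. A sketch of the check, using only the standard library; `dataset_source` is a hypothetical helper, and it assumes that a non-empty directory means the earlier save completed:

```python
from pathlib import Path


def dataset_source(cache_dir="./data/common_voice_de"):
    """Return 'cache' if a non-empty local copy exists, else 'download'."""
    path = Path(cache_dir)
    if path.is_dir() and any(path.iterdir()):
        return "cache"
    return "download"


if dataset_source() == "cache":
    print("Found a local copy; load it with datasets.load_from_disk(...)")
else:
    print("No local copy; the download step needs to run")
```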

### Issue: "GPU not detected"

**Solution:**

```bash
python -c "import torch; print(torch.cuda.is_available())"
# If False, reinstall PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

---

## Success Checkpoints

**Week 1 End:**

- [ ] Environment setup complete
- [ ] Dataset downloaded
- [ ] First training job started (or will start this weekend)

**Week 2 End:**

- [ ] Project 1 (Whisper) training progress visible
- [ ] Project 2 (VAD) demo working
- [ ] GitHub repos initialized

**Week 3 End:**

- [ ] All 3 projects deployed or near completion
- [ ] Portfolio website live
- [ ] First blog post published

---

## What to Do RIGHT NOW (Today)

1. **Open terminal**

   ```bash
   cd ~
   mkdir -p ai-career-project
   cd ai-career-project
   ```

2. **Run setup**

   ```bash
   conda create -n voice_ai python=3.10 -y
   conda activate voice_ai
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
   ```

3. **Clone this repo structure**

   ```bash
   git clone YOUR-GITHUB-REPO
   cd whisper-german-asr
   pip install -r requirements.txt
   ```

4. **Test environment**

   ```bash
   python project1_whisper_setup.py
   ```

5. **If successful:**

   ```bash
   python project1_whisper_train.py
   ```

---

**You now have everything you need to start. Execute immediately. No more planning. Ship! 🚀**