Spaces:

saadmannan
/

ASR-finetuning

Sleeping

App Files Files Community

ASR-finetuning / legacy /Week1_Startup_Code.md

saadmannan

HF space application - exclude binary PDFs

5554ef1 2 months ago

preview code

raw

history blame contribute delete

17.4 kB

	# Immediate Action: Week 1 Startup Code Templates

	## Your First Command (RIGHT NOW)

	Open terminal and execute:

	```bash
	# Create workspace
	mkdir ~/ai-career-project
	cd ~/ai-career-project

	# Create and activate conda environment
	conda create -n voice_ai python=3.10 -y
	conda activate voice_ai

	# Install core packages
	pip install --upgrade pip
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125
	pip install transformers datasets librosa soundfile accelerate wandb
	pip install flash-attn --no-build-isolation
	pip install bitsandbytes
	pip install gradio streamlit fastapi uvicorn

	# Initialize git
	git init
	git config user.name "Your Name"
	git config user.email "your@email.com"
	```

	---

	## Project 1: Whisper Fine-tuning - Starter Template

	### File: `project1_whisper_setup.py`

	```python
	#!/usr/bin/env python3
	"""
	Whisper Fine-tuning Setup
	Purpose: Fine-tune Whisper-small on German Common Voice data
	GPU: RTX 5060 Ti optimized
	"""

	import torch
	import sys
	from pathlib import Path

	def check_environment():
	"""Verify all dependencies are installed"""
	print("=" * 60)
	print("ENVIRONMENT CHECK")
	print("=" * 60)

	# PyTorch
	print(f"✓ PyTorch: {torch.__version__}")
	print(f"✓ CUDA available: {torch.cuda.is_available()}")

	if torch.cuda.is_available():
	print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
	print(f"✓ CUDA Capability: {torch.cuda.get_device_capability(0)}")
	print(f"✓ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

	# Check transformers
	try:
	from transformers import AutoModel
	print("✓ Transformers: Installed")
	except ImportError:
	print("✗ Transformers: NOT INSTALLED")
	return False

	# Check datasets
	try:
	from datasets import load_dataset
	print("✓ Datasets: Installed")
	except ImportError:
	print("✗ Datasets: NOT INSTALLED")
	return False

	# Check librosa
	try:
	import librosa
	print("✓ Librosa: Installed")
	except ImportError:
	print("✗ Librosa: NOT INSTALLED")
	return False

	print("\n✅ All checks passed! Ready to start.\n")
	return True

	def download_data():
	"""Download Common Voice German dataset"""
	print("=" * 60)
	print("DOWNLOADING COMMON VOICE GERMAN")
	print("=" * 60)
	print("This will download ~500MB of German speech data...")
	print("Estimated time: 5-10 minutes depending on internet")

	from datasets import load_dataset

	# Load Common Voice German
	print("\nLoading dataset... (this may take a few minutes)")
	dataset = load_dataset(
	"mozilla-foundation/common_voice_11_0",
	"de",
	split="train[:10%]", # Start with 10% (faster for first run)
	trust_remote_code=True
	)

	print(f"\n✓ Dataset loaded: {len(dataset)} samples")
	print(f" Sample audio file: {dataset[0]['audio']}")
	print(f" Sample text: {dataset[0]['sentence']}")

	# Save locally for faster loading next time
	print("\nSaving dataset locally...")
	dataset.save_to_disk("./data/common_voice_de")
	print("✓ Saved to ./data/common_voice_de/")

	return dataset

	def optimize_settings():
	"""Configure PyTorch for RTX 5060 Ti"""
	print("=" * 60)
	print("OPTIMIZING FOR RTX 5060 Ti")
	print("=" * 60)

	# Enable optimizations
	torch.set_float32_matmul_precision('high')
	torch.backends.cuda.matmul.allow_tf32 = True
	torch.backends.cudnn.benchmark = True

	print("✓ torch.set_float32_matmul_precision('high')")
	print("✓ torch.backends.cuda.matmul.allow_tf32 = True")
	print("✓ torch.backends.cudnn.benchmark = True")
	print("\nThese settings will:")
	print(" • Use Tensor Float 32 (TF32) for faster matrix operations")
	print(" • Enable cuDNN auto-tuning for optimal kernel selection")
	print(" • Expected speedup: 10-20%")

	return True

	def main():
	"""Main setup function"""
	print("\n" + "=" * 60)
	print("WHISPER FINE-TUNING SETUP")
	print("Project: Multilingual ASR for German")
	print("GPU: RTX 5060 Ti (16GB VRAM)")
	print("=" * 60 + "\n")

	# Check environment
	if not check_environment():
	print("❌ Environment check failed. Please install missing packages.")
	return False

	# Optimize settings
	optimize_settings()

	# Download data
	try:
	dataset = download_data()
	except Exception as e:
	print(f"⚠️ Data download failed: {e}")
	print("You can retry later with: python project1_whisper_setup.py")
	return False

	print("\n" + "=" * 60)
	print("✅ SETUP COMPLETE!")
	print("=" * 60)
	print("\nNext steps:")
	print("1. Review the dataset in ./data/common_voice_de/")
	print("2. Run: python project1_whisper_train.py")
	print("3. Fine-tuning will begin (expect 2-3 days on RTX 5060 Ti)")
	print("=" * 60 + "\n")

	return True

	if __name__ == "__main__":
	success = main()
	sys.exit(0 if success else 1)
	```

	Run this:
	```bash
	python project1_whisper_setup.py
	```

	---

	### File: `project1_whisper_train.py`

	```python
	#!/usr/bin/env python3
	"""
	Whisper Fine-training Script
	Optimized for RTX 5060 Ti
	"""

	import torch
	from transformers import (
	WhisperForConditionalGeneration,
	Seq2SeqTrainingArguments,
	Seq2SeqTrainer,
	WhisperProcessor
	)
	from datasets import load_from_disk, concatenate_datasets
	import sys

	def setup_training():
	"""Configure training for RTX 5060 Ti"""

	print("\n" + "=" * 60)
	print("WHISPER FINE-TRAINING")
	print("=" * 60)

	# Load model
	print("\n1. Loading Whisper-small model...")
	model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
	processor = WhisperProcessor.from_pretrained("openai/whisper-small")
	print(f" Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")

	# Load datasets
	print("\n2. Loading Common Voice data...")
	german_data = load_from_disk("./data/common_voice_de")

	# Split: 80% train, 20% eval
	split = german_data.train_test_split(test_size=0.2, seed=42)
	train_dataset = split['train']
	eval_dataset = split['test']

	print(f" Training samples: {len(train_dataset)}")
	print(f" Evaluation samples: {len(eval_dataset)}")

	# Training arguments optimized for RTX 5060 Ti
	print("\n3. Setting up training arguments...")
	training_args = Seq2SeqTrainingArguments(
	output_dir="./whisper_fine_tuned",
	per_device_train_batch_size=8, # RTX 5060 Ti can handle this
	per_device_eval_batch_size=8,
	gradient_accumulation_steps=2, # Simulate batch size of 32
	learning_rate=1e-5,
	warmup_steps=500,
	num_train_epochs=3,
	evaluation_strategy="steps",
	eval_steps=1000,
	save_steps=1000,
	logging_steps=25,
	save_total_limit=3,
	weight_decay=0.01,
	push_to_hub=False,
	mixed_precision="fp16", # CRITICAL for RTX 5060 Ti
	gradient_checkpointing=True, # Trade compute for memory
	report_to="none",
	generation_max_length=225,
	seed=42,
	)

	print(f" Batch size: {training_args.per_device_train_batch_size}")
	print(f" Effective batch: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
	print(f" Mixed precision: FP16")
	print(f" Gradient checkpointing: Enabled")
	print(f" Total training steps: ~{len(train_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * 3}")

	# Create trainer
	print("\n4. Creating trainer...")
	trainer = Seq2SeqTrainer(
	model=model,
	args=training_args,
	train_dataset=train_dataset,
	eval_dataset=eval_dataset,
	processing_class=processor,
	)

	print("✓ Trainer created")

	return trainer, model

	def train():
	"""Run training"""
	print("\n⏱️ STARTING TRAINING...")
	print(" Estimated time: 2-3 days on RTX 5060 Ti")
	print(" Estimated VRAM usage: 14-16 GB")
	print(" You can monitor GPU with: watch -n 1 nvidia-smi")

	trainer, model = setup_training()

	try:
	# Start training
	trainer.train()

	print("\n✅ TRAINING COMPLETE!")
	print(" Model saved to: ./whisper_fine_tuned")

	# Save final model
	model.save_pretrained("./whisper_fine_tuned_final")
	print(" Final checkpoint saved")

	return True

	except KeyboardInterrupt:
	print("\n⚠️ Training interrupted by user")
	print(" You can resume training later")
	return False
	except RuntimeError as e:
	if "out of memory" in str(e):
	print("\n❌ Out of memory error!")
	print(" Solutions:")
	print(" 1. Reduce batch size (currently 8)")
	print(" 2. Increase gradient accumulation steps (currently 2)")
	print(" 3. Use smaller Whisper model (base instead of small)")
	return False
	raise

	if __name__ == "__main__":
	success = train()
	sys.exit(0 if success else 1)
	```

	Run this:
	```bash
	python project1_whisper_train.py
	```

	---

	## Project 2: VAD + Speaker Diarization - Quick Start

	### File: `project2_vad_diarization.py`

	```python
	#!/usr/bin/env python3
	"""
	Voice Activity Detection + Speaker Diarization
	Simple script to get started
	"""

	import torch
	import librosa
	import numpy as np
	from pathlib import Path

	def setup_vad():
	"""Setup Silero VAD"""
	print("Setting up Voice Activity Detection...")

	from silero_vad import load_silero_vad, get_speech_timestamps, read_audio

	model = load_silero_vad(onnx=False)
	print("✓ Silero VAD loaded (40 MB)")

	return model

	def setup_diarization():
	"""Setup Speaker Diarization"""
	print("Setting up Speaker Diarization...")
	print("⚠️ First download requires 1GB+ bandwidth (one-time)")

	from pyannote.audio import Pipeline

	# You need Hugging Face token for this
	# Get it: https://huggingface.co/settings/tokens

	try:
	pipeline = Pipeline.from_pretrained(
	"pyannote/speaker-diarization-3.0",
	use_auth_token="hf_YOUR_TOKEN_HERE"
	)
	print("✓ Diarization pipeline loaded")
	return pipeline
	except Exception as e:
	print(f"❌ Error: {e}")
	print("Get your HF token: https://huggingface.co/settings/tokens")
	return None

	def demo_vad(audio_path, vad_model):
	"""Demo VAD on an audio file"""
	print(f"\nVAD Analysis: {audio_path}")

	from silero_vad import get_speech_timestamps, read_audio

	wav = read_audio(audio_path, sr=16000)

	timestamps = get_speech_timestamps(
	wav,
	vad_model,
	num_steps_state=4,
	threshold=0.5,
	sampling_rate=16000
	)

	print(f"Found {len(timestamps)} speech segments:")
	for i, ts in enumerate(timestamps, 1):
	start_ms = ts['start']
	end_ms = ts['end']
	duration_ms = end_ms - start_ms
	print(f" Segment {i}: {start_ms:6}ms - {end_ms:6}ms ({duration_ms:6}ms)")

	return timestamps

	def demo_diarization(audio_path, diar_pipeline):
	"""Demo Diarization on an audio file"""
	print(f"\nDiarization Analysis: {audio_path}")

	diarization = diar_pipeline(audio_path)

	print("Speaker timeline:")
	for turn, _, speaker in diarization.itertracks(yield_label=True):
	print(f" {turn.start:6.2f}s - {turn.end:6.2f}s: {speaker}")

	def create_test_audio():
	"""Create a simple test audio file"""
	print("\nCreating test audio (10 seconds)...")

	import soundfile as sf

	# Generate simple sine wave
	sr = 16000
	duration = 10
	t = np.linspace(0, duration, int(sr * duration))

	# Mix of silence + speech-like patterns
	signal = np.zeros_like(t)
	signal[0:sr2] = 0.1 np.sin(2 * np.pi * 440 * t[0:sr*2]) # Tone
	signal[sr3:sr5] = 0 # Silence
	signal[sr5:sr7] = 0.1 * np.sin(2 * np.pi * 880 * t[0:sr*2]) # Different tone

	# Save
	sf.write("test_audio.wav", signal, sr)
	print("✓ Created test_audio.wav")

	return "test_audio.wav"

	def main():
	print("\n" + "=" * 60)
	print("VOICE ACTIVITY DETECTION + SPEAKER DIARIZATION")
	print("=" * 60)

	# Setup VAD
	vad_model = setup_vad()

	# Setup Diarization (optional, requires HF token)
	diar_pipeline = setup_diarization()

	# Create test audio
	audio_path = create_test_audio()

	# Demo VAD
	demo_vad(audio_path, vad_model)

	# Demo Diarization
	if diar_pipeline:
	demo_diarization(audio_path, diar_pipeline)
	else:
	print("\n⚠️ Skipping diarization (no HF token)")
	print(" To enable: Get token at https://huggingface.co/settings/tokens")
	print(" Then update the script with: use_auth_token='your_token'")

	print("\n" + "=" * 60)
	print("✅ Demo complete!")
	print("Next steps:")
	print("1. Get real audio files (use your FEARLESS STEPS data)")
	print("2. Process them with the functions above")
	print("3. Deploy with Gradio (see project2_gradio.py)")
	print("=" * 60 + "\n")

	if __name__ == "__main__":
	main()
	```

	Run this:
	```bash
	python project2_vad_diarization.py
	```

	---

	## GitHub Repository Structure (Create this NOW)

	```bash
	# Create directory structure
	mkdir -p whisper-german-asr/{data,notebooks,model,deployment,tests}
	mkdir -p realtime-speaker-diarization/{data,notebooks,model,deployment,tests}
	mkdir -p speech-emotion-recognition/{data,notebooks,model,deployment,tests}

	# Create basic files for first project
	cat > whisper-german-asr/README.md << 'EOF'
	# Multilingual ASR Fine-tuning with Whisper

	Fine-tuned OpenAI Whisper for German & English speech recognition

	## Quick Start

	```bash
	pip install -r requirements.txt
	python demo.py
	```

	## Results

	- German WER: 8.2% (improved from 10.5% baseline)
	- English WER: 5.1%
	- Inference: Real-time on CPU, sub-second on GPU

	## Architecture

	1. Base Model: Whisper-small (244M parameters)
	2. Dataset: Common Voice German + English
	3. Training: Mixed precision (FP16) + gradient checkpointing
	4. Deployment: FastAPI + Docker

	EOF

	# Create requirements file
	cat > whisper-german-asr/requirements.txt << 'EOF'
	torch>=2.0.0
	transformers>=4.30.0
	datasets>=2.10.0
	librosa>=0.10.0
	soundfile>=0.12.0
	accelerate>=0.20.0
	gradio>=3.40.0
	fastapi>=0.100.0
	uvicorn>=0.23.0
	EOF

	# Initialize git
	cd whisper-german-asr
	git init
	git add README.md requirements.txt
	git commit -m "Initial commit: project structure"
	```

	---

	## Week 1 Tasks (Checkbox)

	```
	IMMEDIATE (This Week):
	☐ Install PyTorch 2.0 + CUDA 12.5
	☐ Run project1_whisper_setup.py (check environment)
	☐ Download Common Voice German dataset
	☐ Create GitHub repositories (3 projects)
	☐ Push initial structure to GitHub
	☐ Set up portfolio website (GitHub Pages template)
	☐ Create LinkedIn profile update draft

	OPTIONAL (If ahead of schedule):
	☐ Start project2_vad_diarization.py
	☐ Write first blog post draft
	☐ Research target companies (ElevenLabs, voize, Parloa)
	```

	---

	## Debugging Common Issues

	### Issue: "CUDA out of memory"
	Solution:
	```python
	# In training script, reduce batch size:
	per_device_train_batch_size=4, # Was 8
	gradient_accumulation_steps=4, # Increase to compensate
	```

	### Issue: "Transformers not found"
	Solution:
	```bash
	pip install transformers --upgrade
	```

	### Issue: "Common Voice dataset won't download"
	Solution:
	```bash
	# Check internet connection
	# Try manually: https://commonvoice.mozilla.org/
	# Or use cached version if available
	```

	### Issue: "GPU not detected"
	Solution:
	```bash
	python -c "import torch; print(torch.cuda.is_available())"
	# If False, reinstall PyTorch with CUDA support
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125
	```

	---

	## Success Checkpoints

	Week 1 End:
	- [ ] Environment setup complete
	- [ ] Dataset downloaded
	- [ ] First training job started (or will start this weekend)

	Week 2 End:
	- [ ] Project 1 (Whisper) training progress visible
	- [ ] Project 2 (VAD) demo working
	- [ ] GitHub repos initialized

	Week 3 End:
	- [ ] All 3 projects deployed or near completion
	- [ ] Portfolio website live
	- [ ] First blog post published

	---

	## What to Do RIGHT NOW (Today)

	1. Open terminal
	```bash
	cd ~
	mkdir ai-career-project
	cd ai-career-project
	```

	2. Run setup
	```bash
	conda create -n voice_ai python=3.10 -y
	conda activate voice_ai
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125
	```

	3. Clone this repo structure
	```bash
	git clone YOUR-GITHUB-REPO
	cd whisper-german-asr
	pip install -r requirements.txt
	```

	4. Test environment
	```bash
	python project1_whisper_setup.py
	```

	5. If successful:
	```bash
	python project1_whisper_train.py
	```

	---

	You now have everything you need to start. Execute immediately. No more planning. Ship! 🚀