# Whisper German ASR Fine-Tuning Project

## Project Overview

This project fine-tunes OpenAI's Whisper model for German Automatic Speech Recognition (ASR) using the PolyAI/minds14 dataset.

## Hardware Setup

- **GPU**: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
- **CUDA**: 13.0
- **PyTorch**: 2.9.0+cu130
- **Flash Attention 2**: Enabled (v2.8.3)

## Project Structure

```
ai-career-project/
├── project1_whisper_setup.py      # Dataset download and preparation
├── project1_whisper_train.py      # Model training script
├── project1_whisper_inference.py  # Inference and testing script
├── data/
│   └── minds14_small/             # Training dataset (122 samples)
└── whisper_test_tuned/            # Fine-tuned model checkpoints
    ├── checkpoint-28/
    └── checkpoint-224/            # Final checkpoint
```

## Dataset Options

| Size | Split | Samples | Training Time | VRAM Usage | Best For |
|------|-------|---------|---------------|------------|----------|
| **Tiny** | 5% | ~30 | 30 seconds | 8-10 GB | Quick testing |
| **Small** | 20% | ~120 | 2 minutes | 10-12 GB | Experiments ✅ |
| **Medium** | 50% | ~300 | 5-6 minutes | 12-14 GB | Good results |
| **Large** | 100% | ~600 | 10-12 minutes | 14-16 GB | Best performance |

## Training Results (Small Dataset)

### Configuration

- **Model**: Whisper-small (242M parameters)
- **Training samples**: 109
- **Evaluation samples**: 13
- **Batch size**: 4
- **Learning rate**: 2e-5
- **Epochs**: 8
- **Mixed precision**: BF16
- **Flash Attention 2**: Enabled
- **Gradient checkpointing**: Disabled

### Performance

- **Training time**: ~2 minutes (119 seconds)
- **Training speed**: 7.27 samples/second
- **Final training loss**: 4684.90
- **Final evaluation loss**: 2490.13

### Current Issues

⚠️ **Model Performance**: The model trained on the small dataset (109 samples) shows poor inference quality, generating repetitive outputs. This is expected with such a small dataset.
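Loss alone does not capture the transcription quality discussed above; WER is the metric to watch. As a reference, WER can be sanity-checked without any dependencies — this is a minimal sketch (word-level edit distance divided by reference length), not the project's actual jiwer-based evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # dp[j] = edit distance between the first i reference words and first j hypothesis words
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            tmp = dp[j]
            sub = prev_diag + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        sub)
            prev_diag = tmp
    return dp[-1] / len(ref)

print(wer("guten morgen", "guten abend"))  # 0.5: one substitution over two words
```

A WER of 0.5 means half the reference words were wrong; well-trained Whisper models on clean German audio typically reach WER well below 0.2.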
## Recommendations for Better Results

### 1. Use Larger Dataset ✅ **RECOMMENDED**

```bash
# Run setup with medium or large dataset
python project1_whisper_setup.py
# Select 'medium' or 'large' when prompted
```

**Expected improvements:**
- Medium (300 samples): 5-6 minutes training, significantly better quality
- Large (600 samples): 10-12 minutes training, best quality

### 2. Adjust Training Parameters

For larger datasets, the training script automatically adjusts:
- Batch size: 4
- Gradient accumulation: 2
- Learning rate: 1e-5
- Epochs: 5

### 3. Use Pre-trained Model for Inference

If you need immediate results, use the base Whisper model:

```python
from transformers import pipeline

# Use base Whisper model (no fine-tuning needed)
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=0,  # use GPU
)

result = pipe("audio.wav", generate_kwargs={"language": "german"})
print(result["text"])
```

## Recent Improvements (v2.0)

### Training Pipeline Enhancements

✅ **Fixed Trainer API Issues**
- Corrected `evaluation_strategy` parameter (was `eval_strategy`)
- Fixed `tokenizer` parameter (was `processing_class`)
- Added German language/task conditioning for proper decoder behavior

✅ **Improved Hyperparameters**
- Increased learning rates: 1e-5 to 2e-5 (was 5e-6)
- Added warmup ratio (3-5%) for better convergence
- Removed dtype conflicts (let the Trainer control precision)
- Optimized epochs by dataset size (8-15 epochs)

✅ **Data Quality & Processing**
- Duration filtering (0.5s - 30s)
- Transcript length validation
- Text normalization for consistent WER computation
- Grouping by length to reduce padding

✅ **Evaluation & Monitoring**
- WER (Word Error Rate) metric with jiwer
- TensorBoard logging for all metrics
- Best model selection by WER (not just loss)
- `predict_with_generate` for proper evaluation

### Why Training Should Improve Now

1. **Proper evaluation**: WER tracking shows actual quality improvements
2. **Better learning rate**: 2-4x higher LR enables faster convergence
3. **Language conditioning**: the model knows it is transcribing German
4. **Data filtering**: removes noisy/invalid samples that hurt training
5. **Best model selection**: saves the checkpoint with the lowest WER, not just the lowest loss

## Installation

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. (Optional) Install Flash Attention 2

For faster training (requires the CUDA toolkit):

```bash
pip install flash-attn --no-build-isolation
```

## Usage

### 1. Setup Dataset

```bash
python project1_whisper_setup.py
```

Select the dataset size when prompted ('medium' or 'large' is recommended).

### 2. Train Model

```bash
python project1_whisper_train.py
```

### 3. Monitor Training with TensorBoard

In a separate terminal, start TensorBoard:

```bash
tensorboard --logdir=./logs
```

Then open http://localhost:6006 in your browser to view:
- **Training/Evaluation Loss** - track model convergence
- **WER (Word Error Rate)** - monitor transcription quality
- **Learning Rate** - visualize warmup and decay
- **Gradient Norms** - check training stability

You can also monitor GPU usage:

```bash
nvidia-smi -l 1
```
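The duration filtering applied during setup amounts to a simple predicate over each sample; here is a minimal sketch. The field names (`audio["array"]`, `audio["sampling_rate"]`, `transcription`) follow the Hugging Face `datasets` convention for MInDS-14, not necessarily this project's exact code:

```python
MIN_SECONDS, MAX_SECONDS = 0.5, 30.0  # bounds used by the training script

def keep_sample(sample: dict) -> bool:
    """Keep samples between 0.5s and 30s that have a non-empty transcript."""
    audio = sample["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    transcript = sample.get("transcription", "").strip()
    return MIN_SECONDS <= duration <= MAX_SECONDS and bool(transcript)

# With Hugging Face datasets this predicate would be applied as:
# dataset = dataset.filter(keep_sample)
```

Filtering before feature extraction avoids wasting preprocessing time on clips that would be dropped anyway.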
### 4. Test Model

```bash
# Test with dataset samples
python project1_whisper_inference.py --test --num-samples 10

# Transcribe specific audio files
python project1_whisper_inference.py --audio file1.wav file2.wav

# Interactive mode
python project1_whisper_inference.py --interactive
```

## Key Features

### Flash Attention 2 Integration
- **Faster training**: 10-20% speedup
- **Memory efficient**: no gradient checkpointing needed
- **Stable training**: BF16 mixed precision

### Automatic Configuration
The training script automatically adjusts parameters based on dataset size:
- Batch size and gradient accumulation
- Learning rate (1e-5 to 2e-5) and warmup ratio
- Number of epochs (8-15)
- Training time estimation

### Data Quality Filtering
- **Duration filtering**: 0.5s to 30s audio clips
- **Transcript validation**: removes empty or overly long texts
- **Quality checks**: filters invalid audio samples
- **Automatic normalization**: consistent text preprocessing

### Evaluation & Metrics
- **WER (Word Error Rate)**: primary quality metric
- **TensorBoard logging**: real-time training visualization
- **Best model selection**: automatically saves the best checkpoint by WER
- **`predict_with_generate`**: proper sequence generation for evaluation

### Flexible Dataset Handling
- Automatic train/validation split
- Caches processed datasets
- Supports different dataset sizes
- Progress tracking and metrics
- Groups by length for efficient batching

## Performance Optimization

### Current Optimizations
- ✅ Flash Attention 2 enabled
- ✅ BF16 mixed precision
- ✅ TF32 matrix operations
- ✅ cuDNN auto-tuning
- ✅ Automatic device placement

### Training Speed
- **Small dataset (109 samples)**: ~2 minutes for 8 epochs
- **Estimated for medium (300 samples)**: ~5-6 minutes for 5 epochs
- **Estimated for large (600 samples)**: ~10-12 minutes for 5 epochs

## Next Steps

### Immediate Actions

1. **Retrain with larger dataset** (medium or large) for better results
2. **Evaluate model quality** with Word Error Rate (WER) metrics
3. **Test on real-world audio** samples

### Future Improvements

1. **Use larger Whisper model** (medium or large) for better accuracy
2. **Add data augmentation** (speed, pitch, noise)
3. **Create web interface** for easy testing
4. **Deploy model** as an API service
5. **Push to Hugging Face Hub** for sharing and deployment

## Troubleshooting

### Common Issues

**1. Model generates repetitive outputs**
- **Cause**: Dataset too small (< 200 samples)
- **Solution**: Use the medium or large dataset

**2. Out of memory errors**
- **Cause**: Batch size too large
- **Solution**: Reduce the batch size in the training script

**3. Slow training**
- **Cause**: Flash Attention 2 not enabled
- **Solution**: Verify that `flash-attn` is installed

**4. Poor transcription quality**
- **Cause**: Insufficient training data
- **Solution**: Use a larger dataset or more epochs

## Technical Details

### Model Architecture
- **Base model**: OpenAI Whisper-small
- **Parameters**: 242M
- **Input**: 16kHz mono audio
- **Output**: German text transcription

### Training Process
1. Load and preprocess audio (resample to 16kHz)
2. Extract mel-spectrogram features
3. Fine-tune the encoder-decoder with teacher forcing
4. Evaluate on the validation set each epoch
5. Save the best checkpoint based on WER

### Generation Parameters

```python
model.generate(
    input_features,
    max_length=448,
    num_beams=5,
    do_sample=False,  # deterministic beam search; temperature is unused here
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
```

## Resources

- **Whisper Paper**: https://arxiv.org/abs/2212.04356
- **Hugging Face Transformers**: https://huggingface.co/docs/transformers
- **Flash Attention 2**: https://github.com/Dao-AILab/flash-attention
- **Dataset**: https://huggingface.co/datasets/PolyAI/minds14

## License

This project uses the MIT License. The Whisper model is licensed under Apache 2.0.

## Contact

For questions or issues, please create an issue in the project repository.
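As a closing reference: the text normalization mentioned under Data Quality Filtering, which makes WER numbers comparable between predictions and references, can be as simple as lowercasing, stripping punctuation, and collapsing whitespace. This is a hedged sketch, not the project's exact normalizer:

```python
import re
import string

def normalize_text(text: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse whitespace for WER comparison."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Guten   Morgen, wie geht's?"))  # -> guten morgen wie gehts
```

Note that German umlauts (ä, ö, ü, ß) pass through unchanged; only casing, punctuation, and spacing are normalized.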