
Whisper German ASR Fine-Tuning Project

Project Overview

This project fine-tunes OpenAI's Whisper model for German Automatic Speech Recognition (ASR) using the PolyAI/minds14 dataset.

Hardware Setup

  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
  • CUDA: 13.0
  • PyTorch: 2.9.0+cu130
  • Flash Attention 2: Enabled (v2.8.3)

Project Structure

ai-career-project/
├── project1_whisper_setup.py      # Dataset download and preparation
├── project1_whisper_train.py      # Model training script
├── project1_whisper_inference.py  # Inference and testing script
├── data/
│   └── minds14_small/             # Training dataset (122 samples)
└── whisper_test_tuned/            # Fine-tuned model checkpoints
    ├── checkpoint-28/
    └── checkpoint-224/            # Final checkpoint

Dataset Options

| Size   | Split | Samples | Training Time | VRAM Usage | Best For         |
|--------|-------|---------|---------------|------------|------------------|
| Tiny   | 5%    | ~30     | 30 seconds    | 8-10 GB    | Quick testing    |
| Small  | 20%   | ~120    | 2 minutes     | 10-12 GB   | Experiments ✅   |
| Medium | 50%   | ~300    | 5-6 minutes   | 12-14 GB   | Good results     |
| Large  | 100%  | ~600    | 10-12 minutes | 14-16 GB   | Best performance |
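The split percentages map directly onto the `datasets` slicing syntax. A small helper, assuming the size names above (the `SPLITS` mapping and `split_spec` name are illustrative, not from the setup script):

```python
# Map the dataset-size names above to datasets' split-slicing strings.
SPLITS = {"tiny": 5, "small": 20, "medium": 50, "large": 100}

def split_spec(size: str) -> str:
    """Return a slicing string usable as load_dataset(..., split=...)."""
    pct = SPLITS[size.lower()]
    return "train" if pct == 100 else f"train[:{pct}%]"

# e.g. load_dataset("PolyAI/minds14", name="de-DE", split=split_spec("small"))
```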

Training Results (Small Dataset)

Configuration

  • Model: Whisper-small (242M parameters)
  • Training samples: 109
  • Evaluation samples: 13
  • Batch size: 4
  • Learning rate: 2e-05
  • Epochs: 8
  • Mixed precision: BF16
  • Flash Attention 2: Enabled
  • Gradient checkpointing: Disabled
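In `Seq2SeqTrainingArguments` terms, this configuration looks roughly like the following (the output directory is taken from the project tree; the rest mirrors the bullets above, so treat it as a sketch rather than the exact training script):

```python
# Keyword arguments mirroring the configuration above; pass them to
# transformers.Seq2SeqTrainingArguments(**training_args).
training_args = dict(
    output_dir="whisper_test_tuned",
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    num_train_epochs=8,
    bf16=True,                     # BF16 mixed precision
    gradient_checkpointing=False,  # not needed with Flash Attention 2
)
```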

Performance

  • Training time: ~2 minutes (119 seconds)
  • Training speed: 7.27 samples/second
  • Final training loss: 4684.90
  • Final evaluation loss: 2490.13

Current Issues

⚠️ Model Performance: The model trained on the small dataset (109 samples) shows poor inference quality, generating repetitive outputs. This is expected with such a small dataset.

Recommendations for Better Results

1. Use Larger Dataset ✅ RECOMMENDED

# Run setup with medium or large dataset
python project1_whisper_setup.py
# Select 'medium' or 'large' when prompted

Expected improvements:

  • Medium (300 samples): 5-6 minutes training, significantly better quality
  • Large (600 samples): 10-12 minutes training, best quality

2. Adjust Training Parameters

For larger datasets, the training script automatically adjusts:

  • Batch size: 4
  • Gradient accumulation: 2
  • Learning rate: 1e-5
  • Epochs: 5
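That adjustment can be sketched as a single lookup on sample count (the 200-sample threshold is an assumption; the values come from the bullets above and the small-dataset run):

```python
def auto_config(num_samples: int) -> dict:
    """Pick training hyperparameters from dataset size.

    Small datasets get more epochs and a higher learning rate; larger
    ones get gradient accumulation and a gentler rate, as listed above.
    """
    if num_samples < 200:  # e.g. the 109-sample "small" split
        return {"batch_size": 4, "grad_accum": 1, "lr": 2e-5, "epochs": 8}
    return {"batch_size": 4, "grad_accum": 2, "lr": 1e-5, "epochs": 5}
```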

3. Use Pre-trained Model for Inference

If you need immediate results, use the base Whisper model:

from transformers import pipeline

# Use base Whisper model (no fine-tuning needed)
pipe = pipeline("automatic-speech-recognition", 
                model="openai/whisper-small",
                device=0)  # Use GPU

result = pipe("audio.wav", generate_kwargs={"language": "german"})
print(result["text"])

Recent Improvements (v2.0)

Training Pipeline Enhancements

✅ Fixed Trainer API Issues

  • Corrected evaluation_strategy parameter (was eval_strategy)
  • Fixed tokenizer parameter (was processing_class)
  • Added German language/task conditioning for proper decoder behavior

✅ Improved Hyperparameters

  • Increased learning rates: 1e-5 to 2e-5 (was 5e-6)
  • Added warmup ratio (3-5%) for better convergence
  • Removed dtype conflicts (let Trainer control precision)
  • Optimized epochs by dataset size (8-15 epochs)
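A warmup ratio of 3-5% means the learning rate ramps up linearly over the first few percent of steps and then decays (linear decay after warmup is the `transformers` default scheduler). A self-contained sketch of that schedule:

```python
def lr_at(step: int, total_steps: int,
          peak_lr: float = 2e-5, warmup_ratio: float = 0.05) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (illustrative)."""
    warmup = max(1, int(total_steps * warmup_ratio))
    if step < warmup:
        return peak_lr * step / warmup                                  # ramp up
    return peak_lr * (total_steps - step) / (total_steps - warmup)      # decay
```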

✅ Data Quality & Processing

  • Duration filtering (0.5s - 30s)
  • Transcript length validation
  • Text normalization for consistent WER computation
  • Group by length for reduced padding
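Both the filtering and the normalization reduce to simple predicates. A sketch (the 400-character transcript limit is an assumed value, and the real script may normalize differently):

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so WER compares words, not formatting."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def keep_example(duration_s: float, transcript: str,
                 min_s: float = 0.5, max_s: float = 30.0,
                 max_chars: int = 400) -> bool:
    """Duration window and transcript-length check described above."""
    text = normalize(transcript)
    return min_s <= duration_s <= max_s and 0 < len(text) <= max_chars
```

In practice these would run inside `dataset.filter(...)` and `dataset.map(...)` before feature extraction.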

✅ Evaluation & Monitoring

  • WER (Word Error Rate) metric with jiwer
  • TensorBoard logging for all metrics
  • Best model selection by WER (not just loss)
  • Predict with generate for proper evaluation
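In the pipeline, WER comes from `jiwer`, but the metric itself is just a word-level edit distance divided by the reference word count. A dependency-free sketch of the computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(bool(hyp))
    d = list(range(len(hyp) + 1))      # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i      # d[0]: i deletions against empty hypothesis
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,              # deletion
                d[j - 1] + 1,          # insertion
                prev_diag + (r != h),  # substitution (free on a match)
            )
    return d[-1] / len(ref)
```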

Why Training Should Improve Now

  1. Proper evaluation: WER tracking shows actual quality improvements
  2. Better learning rate: 2-4x higher LR enables faster convergence
  3. Language conditioning: Model knows it's transcribing German
  4. Data filtering: Removes noisy/invalid samples that hurt training
  5. Best model selection: Saves checkpoint with lowest WER, not just loss

Installation

1. Install Dependencies

pip install -r requirements.txt

2. (Optional) Install Flash Attention 2

For faster training (requires CUDA toolkit):

pip install flash-attn --no-build-isolation

Usage

1. Setup Dataset

python project1_whisper_setup.py

Select dataset size when prompted (recommend 'medium' or 'large')

2. Train Model

python project1_whisper_train.py

3. Monitor Training with TensorBoard

In a separate terminal, start TensorBoard:

tensorboard --logdir=./logs

Then open http://localhost:6006 in your browser to view:

  • Training/Evaluation Loss - Track model convergence
  • WER (Word Error Rate) - Monitor transcription quality
  • Learning Rate - Visualize warmup and decay
  • Gradient Norms - Check training stability

You can also monitor GPU usage:

nvidia-smi -l 1

4. Test Model

# Test with dataset samples
python project1_whisper_inference.py --test --num-samples 10

# Transcribe specific audio files
python project1_whisper_inference.py --audio file1.wav file2.wav

# Interactive mode
python project1_whisper_inference.py --interactive

Key Features

Flash Attention 2 Integration

  • Faster training: 10-20% speedup
  • Memory efficient: No gradient checkpointing needed
  • Stable training: BF16 mixed precision

Automatic Configuration

The training script automatically adjusts parameters based on dataset size:

  • Batch size and gradient accumulation
  • Learning rate (1e-5 to 2e-5) and warmup ratio
  • Number of epochs (8-15)
  • Training time estimation

Data Quality Filtering

  • Duration filtering: 0.5s to 30s audio clips
  • Transcript validation: Removes empty or too-long texts
  • Quality checks: Filters invalid audio samples
  • Automatic normalization: Consistent text preprocessing

Evaluation & Metrics

  • WER (Word Error Rate): Primary quality metric
  • TensorBoard logging: Real-time training visualization
  • Best model selection: Automatically saves best checkpoint by WER
  • Predict with generate: Proper sequence generation for evaluation

Flexible Dataset Handling

  • Automatic train/validation split
  • Caches processed datasets
  • Supports different dataset sizes
  • Progress tracking and metrics
  • Group by length for efficient batching
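The train/validation split itself is just a seeded shuffle plus a slice. A sketch (the 10% evaluation fraction and the seed are assumptions; the actual 109/13 split suggests a slightly different fraction):

```python
import random

def split_indices(n: int, eval_frac: float = 0.1, seed: int = 42):
    """Seeded shuffle, then carve off an evaluation slice (fraction assumed)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    k = max(1, round(n * eval_frac))
    return idx[k:], idx[:k]  # (train_indices, eval_indices)
```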

Performance Optimization

Current Optimizations

  • ✅ Flash Attention 2 enabled
  • ✅ BF16 mixed precision
  • ✅ TF32 matrix operations
  • ✅ cuDNN auto-tuning
  • ✅ Automatic device placement

Training Speed

  • Small dataset (109 samples): ~2 minutes for 8 epochs
  • Estimated for medium (300 samples): ~5-6 minutes for 5 epochs
  • Estimated for large (600 samples): ~10-12 minutes for 5 epochs
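These estimates follow directly from the measured throughput (7.27 samples/second on the small run). The helper below ignores evaluation passes, so real wall-clock times run somewhat longer:

```python
def estimated_minutes(num_samples: int, epochs: int,
                      samples_per_second: float = 7.27) -> float:
    """Training-time estimate from throughput; evaluation overhead excluded."""
    return num_samples * epochs / samples_per_second / 60
```

For the small run, `estimated_minutes(109, 8)` comes out just under 2 minutes, matching the measured 119 seconds.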

Next Steps

Immediate Actions

  1. Retrain with larger dataset (medium or large) for better results
  2. Evaluate model quality with Word Error Rate (WER) metrics
  3. Test on real-world audio samples

Future Improvements

  1. Use larger Whisper model (medium or large) for better accuracy
  2. Add data augmentation (speed, pitch, noise)
  3. Create web interface for easy testing
  4. Deploy model as API service
  5. Push to Hugging Face Hub for sharing and deployment

Troubleshooting

Common Issues

1. Model generates repetitive outputs

  • Cause: Dataset too small (< 200 samples)
  • Solution: Use medium or large dataset

2. Out of memory errors

  • Cause: Batch size too large
  • Solution: Reduce batch size in training script

3. Slow training

  • Cause: Flash Attention 2 not enabled
  • Solution: Verify flash-attn is installed

4. Poor transcription quality

  • Cause: Insufficient training data
  • Solution: Use larger dataset or more epochs

Technical Details

Model Architecture

  • Base model: OpenAI Whisper-small
  • Parameters: 242M
  • Input: 16kHz mono audio
  • Output: German text transcription

Training Process

  1. Load and preprocess audio (resample to 16kHz)
  2. Extract mel-spectrogram features
  3. Fine-tune encoder-decoder with teacher forcing
  4. Evaluate on validation set each epoch
  5. Save best checkpoint based on WER (see Evaluation & Metrics)
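The checkpoint selection in step 5 is an argmin over per-epoch metrics, with WER as the primary criterion (per the Evaluation & Metrics section). A sketch, where the `history` structure is illustrative:

```python
def select_best(history: dict[int, dict]) -> int:
    """Epoch with the lowest WER, breaking ties by evaluation loss."""
    return min(history, key=lambda e: (history[e]["wer"], history[e]["eval_loss"]))
```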

Generation Parameters

model.generate(
    input_features,
    max_length=448,           # Whisper's maximum target sequence length
    num_beams=5,              # beam search for more stable decoding
    temperature=0.0,
    do_sample=False,          # deterministic decoding (temperature unused)
    repetition_penalty=1.2,   # discourage repeated tokens
    no_repeat_ngram_size=3    # block repeated trigrams in the output
)

License

This project uses the MIT License. The Whisper model is licensed under Apache 2.0.

Contact

For questions or issues, please create an issue in the project repository.