# Whisper German ASR Fine-Tuning Project
## Project Overview
This project fine-tunes OpenAI's Whisper model for German Automatic Speech Recognition (ASR) using the PolyAI/minds14 dataset.
## Hardware Setup
- **GPU**: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
- **CUDA**: 13.0
- **PyTorch**: 2.9.0+cu130
- **Flash Attention 2**: Enabled (v2.8.3)
## Project Structure
```
ai-career-project/
β”œβ”€β”€ project1_whisper_setup.py       # Dataset download and preparation
β”œβ”€β”€ project1_whisper_train.py       # Model training script
β”œβ”€β”€ project1_whisper_inference.py   # Inference and testing script
β”œβ”€β”€ data/
β”‚   └── minds14_small/              # Training dataset (122 samples)
└── whisper_test_tuned/             # Fine-tuned model checkpoints
    β”œβ”€β”€ checkpoint-28/
    └── checkpoint-224/             # Final checkpoint
```
## Dataset Options
| Size | Split | Samples | Training Time | VRAM Usage | Best For |
|------|-------|---------|---------------|------------|----------|
| **Tiny** | 5% | ~30 | 30 seconds | 8-10 GB | Quick testing |
| **Small** | 20% | ~120 | 2 minutes | 10-12 GB | Experiments βœ… |
| **Medium** | 50% | ~300 | 5-6 minutes | 12-14 GB | Good results |
| **Large** | 100% | ~600 | 10-12 minutes | 14-16 GB | Best performance |
## Training Results (Small Dataset)
### Configuration
- **Model**: Whisper-small (242M parameters)
- **Training samples**: 109
- **Evaluation samples**: 13
- **Batch size**: 4
- **Learning rate**: 2e-05
- **Epochs**: 8
- **Mixed precision**: BF16
- **Flash Attention 2**: Enabled
- **Gradient checkpointing**: Disabled
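The configuration above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a sketch rather than the project's actual script: `output_dir` and `logging_dir` are illustrative paths, and `eval_strategy` is the parameter name in recent transformers releases.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the small-dataset configuration listed above.
# Values mirror the bullet list: batch size 4, lr 2e-5, 8 epochs, BF16,
# gradient checkpointing disabled (Flash Attention 2 saves memory instead).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper_test_tuned",
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    num_train_epochs=8,
    bf16=True,
    gradient_checkpointing=False,
    predict_with_generate=True,   # generate sequences for WER evaluation
    eval_strategy="epoch",
    logging_dir="./logs",
    report_to=["tensorboard"],
)
```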
### Performance
- **Training time**: ~2 minutes (119 seconds)
- **Training speed**: 7.27 samples/second
- **Final training loss**: 4684.90
- **Final evaluation loss**: 2490.13
### Current Issues
⚠️ **Model Performance**: The model trained on the small dataset (109 samples) shows poor inference quality, generating repetitive outputs. This is expected with such a small dataset.
## Recommendations for Better Results
### 1. Use Larger Dataset βœ… **RECOMMENDED**
```bash
# Run setup with medium or large dataset
python project1_whisper_setup.py
# Select 'medium' or 'large' when prompted
```
**Expected improvements:**
- Medium (300 samples): 5-6 minutes training, significantly better quality
- Large (600 samples): 10-12 minutes training, best quality
### 2. Adjust Training Parameters
For larger datasets, the training script automatically adjusts:
- Batch size: 4
- Gradient accumulation: 2
- Learning rate: 1e-5
- Epochs: 5
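The auto-adjustment described above can be sketched as a simple size-based lookup. This is illustrative: the thresholds and the exact values in `project1_whisper_train.py` may differ.

```python
def auto_config(num_samples: int) -> dict:
    """Pick training hyperparameters from dataset size.

    Illustrative sketch of the auto-adjustment described above;
    the real training script may use different thresholds.
    """
    if num_samples < 200:
        # tiny/small datasets: more epochs, higher LR, no accumulation
        return {"batch_size": 4, "grad_accum": 1,
                "learning_rate": 2e-5, "epochs": 8}
    # medium/large datasets: accumulate gradients, lower LR, fewer epochs
    return {"batch_size": 4, "grad_accum": 2,
            "learning_rate": 1e-5, "epochs": 5}


print(auto_config(109))  # small dataset: 8 epochs at lr 2e-5
print(auto_config(300))  # medium dataset: 5 epochs at lr 1e-5
```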
### 3. Use Pre-trained Model for Inference
If you need immediate results, use the base Whisper model:
```python
from transformers import pipeline

# Use base Whisper model (no fine-tuning needed)
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=0,  # use GPU
)
result = pipe("audio.wav", generate_kwargs={"language": "german"})
print(result["text"])
```
## Recent Improvements (v2.0)
### Training Pipeline Enhancements
βœ… **Fixed Trainer API Issues**
- Corrected `eval_strategy` parameter (was `evaluation_strategy`)
- Fixed `processing_class` parameter (was `tokenizer`)
- Added German language/task conditioning for proper decoder behavior
βœ… **Improved Hyperparameters**
- Increased learning rates: 1e-5 to 2e-5 (was 5e-6)
- Added warmup ratio (3-5%) for better convergence
- Removed dtype conflicts (let Trainer control precision)
- Optimized epochs by dataset size (8-15 epochs)
βœ… **Data Quality & Processing**
- Duration filtering (0.5s - 30s)
- Transcript length validation
- Text normalization for consistent WER computation
- Group by length for reduced padding
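Without consistent normalization, superficial differences in casing and punctuation inflate WER. A minimal normalization of the kind described above might look like this (a sketch; the project's actual rules may differ):

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    reference and hypothesis strings are compared consistently for WER.
    Illustrative; the project's actual normalization rules may differ.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("  Hallo,   Welt! "))  # -> "hallo welt"
```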
βœ… **Evaluation & Monitoring**
- WER (Word Error Rate) metric with jiwer
- TensorBoard logging for all metrics
- Best model selection by WER (not just loss)
- Predict with generate for proper evaluation
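WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. jiwer computes this (plus normalization and alignment details); a minimal stdlib equivalent shows the core idea:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count.
    Minimal illustration of what jiwer computes (no text normalization here).
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words:
    # d[j] holds the distance between ref[:i] and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag = d[0]
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                # deletion
                       d[j - 1] + 1,            # insertion
                       prev_diag + (r != h))    # substitution or match
            prev_diag = cur
    return d[-1] / len(ref)


print(wer("das ist ein test", "das war ein test"))  # -> 0.25
```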
### Why Training Should Improve Now
1. **Proper evaluation**: WER tracking shows actual quality improvements
2. **Better learning rate**: 2-4x higher LR enables faster convergence
3. **Language conditioning**: Model knows it's transcribing German
4. **Data filtering**: Removes noisy/invalid samples that hurt training
5. **Best model selection**: Saves checkpoint with lowest WER, not just loss
## Installation
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. (Optional) Install Flash Attention 2
For faster training (requires CUDA toolkit):
```bash
pip install flash-attn --no-build-isolation
```
## Usage
### 1. Setup Dataset
```bash
python project1_whisper_setup.py
```
Select the dataset size when prompted ('medium' or 'large' recommended).
### 2. Train Model
```bash
python project1_whisper_train.py
```
### 3. Monitor Training with TensorBoard
In a separate terminal, start TensorBoard:
```bash
tensorboard --logdir=./logs
```
Then open http://localhost:6006 in your browser to view:
- **Training/Evaluation Loss** - Track model convergence
- **WER (Word Error Rate)** - Monitor transcription quality
- **Learning Rate** - Visualize warmup and decay
- **Gradient Norms** - Check training stability
You can also monitor GPU usage:
```bash
nvidia-smi -l 1
```
### 4. Test Model
```bash
# Test with dataset samples
python project1_whisper_inference.py --test --num-samples 10
# Transcribe specific audio files
python project1_whisper_inference.py --audio file1.wav file2.wav
# Interactive mode
python project1_whisper_inference.py --interactive
```
## Key Features
### Flash Attention 2 Integration
- **Faster training**: 10-20% speedup
- **Memory efficient**: No gradient checkpointing needed
- **Stable training**: BF16 mixed precision
### Automatic Configuration
The training script automatically adjusts parameters based on dataset size:
- Batch size and gradient accumulation
- Learning rate (1e-5 to 2e-5) and warmup ratio
- Number of epochs (8-15)
- Training time estimation
### Data Quality Filtering
- **Duration filtering**: 0.5s to 30s audio clips
- **Transcript validation**: Removes empty or too-long texts
- **Quality checks**: Filters invalid audio samples
- **Automatic normalization**: Consistent text preprocessing
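The filters above can be combined into a single predicate. The thresholds mirror the list (0.5s to 30s, non-empty transcript); `max_chars` is an illustrative cap, and the raw-sample/sampling-rate representation follows the Hugging Face datasets audio format.

```python
def keep_sample(audio_array, sampling_rate: int, transcript: str,
                min_sec: float = 0.5, max_sec: float = 30.0,
                max_chars: int = 400) -> bool:
    """Return True if a sample passes the quality filters described above.

    Thresholds mirror the bullet list (0.5s-30s, non-empty transcript);
    max_chars is an illustrative cap on transcript length.
    """
    duration = len(audio_array) / sampling_rate  # seconds
    if not (min_sec <= duration <= max_sec):
        return False
    text = transcript.strip()
    return 0 < len(text) <= max_chars


print(keep_sample([0.0] * 16000, 16000, "hallo welt"))  # 1s clip -> True
print(keep_sample([0.0] * 1600, 16000, "hallo"))        # 0.1s clip -> False
```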
### Evaluation & Metrics
- **WER (Word Error Rate)**: Primary quality metric
- **TensorBoard logging**: Real-time training visualization
- **Best model selection**: Automatically saves best checkpoint by WER
- **Predict with generate**: Proper sequence generation for evaluation
### Flexible Dataset Handling
- Automatic train/validation split
- Caches processed datasets
- Supports different dataset sizes
- Progress tracking and metrics
- Group by length for efficient batching
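Grouping by length can be approximated by sorting samples by duration before batching, so each batch pads to roughly its own longest clip rather than the global maximum. This is a sketch of the idea behind the Trainer's `group_by_length` option; the real sampler also shuffles between megabatches to avoid a fixed ordering.

```python
def length_grouped_batches(lengths, batch_size):
    """Yield batches of sample indices sorted by length, so each batch
    pads to roughly its own longest element rather than the global max.
    Sketch of the idea behind the Trainer's group_by_length option
    (the real sampler also shuffles between megabatches).
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


# Clips of 3s, 29s, 4s, 28s: grouping pairs short with short, long with long.
print(list(length_grouped_batches([3.0, 29.0, 4.0, 28.0], 2)))  # -> [[0, 2], [3, 1]]
```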
## Performance Optimization
### Current Optimizations
βœ… Flash Attention 2 enabled
βœ… BF16 mixed precision
βœ… TF32 matrix operations
βœ… cuDNN auto-tuning
βœ… Automatic device placement
### Training Speed
- **Small dataset (109 samples)**: ~2 minutes for 8 epochs
- **Estimated for medium (300 samples)**: ~5-6 minutes for 5 epochs
- **Estimated for large (600 samples)**: ~10-12 minutes for 5 epochs
## Next Steps
### Immediate Actions
1. **Retrain with larger dataset** (medium or large) for better results
2. **Evaluate model quality** with Word Error Rate (WER) metrics
3. **Test on real-world audio** samples
### Future Improvements
1. **Use larger Whisper model** (medium or large) for better accuracy
2. **Add data augmentation** (speed, pitch, noise)
3. **Create web interface** for easy testing
4. **Deploy model** as API service
5. **Push to Hugging Face Hub** for sharing and deployment
## Troubleshooting
### Common Issues
**1. Model generates repetitive outputs**
- **Cause**: Dataset too small (< 200 samples)
- **Solution**: Use medium or large dataset
**2. Out of memory errors**
- **Cause**: Batch size too large
- **Solution**: Reduce batch size in training script
**3. Slow training**
- **Cause**: Flash Attention 2 not enabled
- **Solution**: Verify `flash-attn` is installed
**4. Poor transcription quality**
- **Cause**: Insufficient training data
- **Solution**: Use larger dataset or more epochs
## Technical Details
### Model Architecture
- **Base model**: OpenAI Whisper-small
- **Parameters**: 242M
- **Input**: 16kHz mono audio
- **Output**: German text transcription
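Since Whisper expects 16 kHz mono input, incoming WAV files can be sanity-checked with the stdlib `wave` module. This is a sketch: real pipelines (including the datasets library) usually resample with librosa or torchaudio rather than rejecting files, and `is_whisper_ready` is an illustrative helper name.

```python
import io
import wave


def is_whisper_ready(wav_file) -> bool:
    """True if the WAV file is already 16 kHz mono, as Whisper expects."""
    with wave.open(wav_file, "rb") as w:
        return w.getframerate() == 16000 and w.getnchannels() == 1


# Build a 1-second 16 kHz mono WAV in memory to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
buf.seek(0)
print(is_whisper_ready(buf))  # -> True
```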
### Training Process
1. Load and preprocess audio (resample to 16kHz)
2. Extract mel-spectrogram features
3. Fine-tune encoder-decoder with teacher forcing
4. Evaluate on validation set each epoch
5. Save the best checkpoint based on WER
### Generation Parameters
```python
model.generate(
    input_features,
    max_length=448,
    num_beams=5,              # beam search for more accurate transcripts
    temperature=0.0,          # ignored when do_sample=False
    do_sample=False,          # deterministic decoding
    repetition_penalty=1.2,   # discourage repeated phrases
    no_repeat_ngram_size=3,   # block exact 3-gram repeats
)
```
## Resources
- **Whisper Paper**: https://arxiv.org/abs/2212.04356
- **Hugging Face Transformers**: https://huggingface.co/docs/transformers
- **Flash Attention 2**: https://github.com/Dao-AILab/flash-attention
- **Dataset**: https://huggingface.co/datasets/PolyAI/minds14
## License
This project uses the MIT License. The Whisper model is licensed under Apache 2.0.
## Contact
For questions or issues, please create an issue in the project repository.