# Whisper Training Pipeline - Improvements Summary

## Overview

This document summarizes the comprehensive improvements made to the Whisper fine-tuning pipeline to fix training issues and enable proper evaluation.

## Critical Fixes
### 1. Trainer API Issues (Breaking Bugs)

**Problem:** Training was using incorrect/deprecated API parameters.

**Fixes:**
- ✅ Changed `eval_strategy="epoch"` → `evaluation_strategy="epoch"`. Impact: evaluation was never running during training.
- ✅ Changed `processing_class=processor` → `tokenizer=processor`. Impact: the tokenizer wasn't being saved with checkpoints.
- ✅ Added `predict_with_generate=True`. Impact: enables proper sequence generation for WER evaluation.
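Taken together, the corrected arguments might look like the sketch below. The batch size, output path, and other values are illustrative placeholders, not taken from the pipeline; only the renamed and added keys reflect the fixes above.

```python
# Hedged sketch: the corrected trainer arguments collected as plain kwargs.
# output_dir and batch size are illustrative placeholders.
training_kwargs = dict(
    output_dir="./whisper-german",   # placeholder path
    evaluation_strategy="epoch",     # fixed: was eval_strategy (never ran)
    predict_with_generate=True,      # added: needed for WER evaluation
    per_device_train_batch_size=8,   # illustrative value
    learning_rate=1.5e-5,
    warmup_ratio=0.05,
)

# These kwargs would be passed to Seq2SeqTrainingArguments, and the
# processor to Seq2SeqTrainer via tokenizer=processor (not processing_class).
```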
### 2. Language/Task Conditioning (Critical for Non-English)

**Problem:** The model wasn't conditioned for German transcription.

**Fix:**

```python
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="german",
    task="transcribe"
)
model.config.suppress_tokens = []
```

**Impact:**
- Model now knows it's transcribing German
- Decoder generates German text consistently
- Training targets are properly aligned
### 3. Hyperparameter Issues

#### Learning Rate (Too Conservative)

- **Before:** `5e-6` for all dataset sizes
- **After:**
  - Large datasets (>400 samples): `2e-5`
  - Medium datasets (100-400 samples): `1.5e-5`
  - Small datasets (<100 samples): `1e-5`

**Impact:** A 2-4x higher learning rate enables actual learning with limited data.

#### Warmup Strategy

- **Before:** `warmup_steps=min(100, len(train)//10)` (could be 50%+ of training)
- **After:** `warmup_ratio=0.03`-`0.05` (3-5% of total steps)

**Impact:** More stable warmup that scales with dataset size.
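The size-dependent schedule above can be captured in a small helper. This is a sketch: the thresholds mirror the list above, but the function name and exact boundary handling are ours.

```python
def pick_learning_rate(num_train_samples: int) -> float:
    """Learning rate by dataset size, following the schedule above."""
    if num_train_samples > 400:      # large datasets
        return 2e-5
    if num_train_samples >= 100:     # medium datasets
        return 1.5e-5
    return 1e-5                      # small datasets
```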
#### Precision/Dtype Conflict

- **Before:** Model loaded with `torch_dtype=torch.float16` while the Trainer used `bf16=True`
- **After:** Let the Trainer control precision entirely

```python
# Model loading - no dtype specified
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    config=config,
    device_map="auto"
)

# Trainer handles precision
bf16=torch.cuda.is_bf16_supported()
```

**Impact:** Eliminates dtype mismatches and training instability.
### 4. Data Quality Filtering

**Added filters:**
- ✅ Duration: 0.5s ≤ audio ≤ 30s
- ✅ Transcript: not empty, 2+ chars, <500 chars
- ✅ Audio validation: valid array and sampling rate
- ✅ Text normalization: lowercase, remove punctuation, strip whitespace

**Impact:** Removes noisy samples that can dominate small datasets.
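A minimal sketch of these filters in plain Python. The function names and exact regex are ours; the actual pipeline may wire the checks differently (e.g. via `datasets.filter`).

```python
import re

MIN_DUR_S, MAX_DUR_S = 0.5, 30.0   # duration bounds in seconds
MIN_CHARS, MAX_CHARS = 2, 500      # transcript length bounds

def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def keep_sample(duration_s: float, transcript: str) -> bool:
    """True if the sample passes the duration and transcript filters."""
    cleaned = normalize_text(transcript)
    return (MIN_DUR_S <= duration_s <= MAX_DUR_S
            and MIN_CHARS <= len(cleaned) < MAX_CHARS)
```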
### 5. Evaluation & Metrics

**Added:**
- ✅ WER (Word Error Rate) computation with `jiwer`
- ✅ Text normalization for consistent metrics
- ✅ Best model selection by WER (not just loss)
- ✅ `load_best_model_at_end=True`
- ✅ `metric_for_best_model="wer"`

**Impact:** Can now track actual transcription quality improvements.
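The pipeline computes WER with `jiwer`; for intuition, WER is word-level edit distance divided by the reference length. A dependency-free sketch (not the pipeline's code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-row Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                        # deletion
                       d[j - 1] + 1,                    # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return d[-1] / max(len(ref), 1)
```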
### 6. TensorBoard Logging

**Added:**

```python
report_to=["tensorboard"]
logging_dir="./logs"
logging_steps=10
logging_first_step=True
```

**Metrics logged:**
- Training/evaluation loss
- WER (Word Error Rate)
- Learning rate schedule
- Gradient norms
- Training speed

**Usage:**

```bash
tensorboard --logdir=./logs
# Open http://localhost:6006
```
### 7. Additional Optimizations

- ✅ `group_by_length=True` - reduces padding overhead
- ✅ `generation_max_length=448` - full Whisper context (was 128)
- ✅ Data filtering before preprocessing
- ✅ Better epoch/batch size scaling by dataset size
## Expected Improvements

### Before (v1.0)

- ❌ No evaluation running (API bug)
- ❌ No language conditioning
- ❌ LR too low (5e-6)
- ❌ No WER tracking
- ❌ No data filtering
- ❌ Dtype conflicts
- ❌ Model selection by loss only

**Result:** Training appeared to run but the model didn't improve.

### After (v2.0)

- ✅ Evaluation runs every epoch
- ✅ German language/task conditioning
- ✅ Proper LR (1e-5 to 2e-5)
- ✅ WER metric tracking
- ✅ Quality data filtering
- ✅ Consistent precision
- ✅ Best model by WER

**Expected result:** Visible WER improvements, better transcription quality.
## Hugging Face Compatibility

**Current status:** ✅ Fully compatible

Using:
- `transformers.WhisperForConditionalGeneration`
- `transformers.WhisperProcessor`
- `transformers.Seq2SeqTrainer`
- `datasets.load_dataset` / `load_from_disk`
- Standard HF checkpoint format

**To push to the Hub:**

```python
# In TrainingArguments
push_to_hub=True
hub_model_id="your-username/whisper-small-german"
hub_token="your_hf_token"

# Or manually after training
model.push_to_hub("your-username/whisper-small-german")
processor.push_to_hub("your-username/whisper-small-german")
```
## GitHub Readiness

### Added Files

- ✅ `requirements.txt` - all dependencies with versions
- ✅ Updated `README_WHISPER_PROJECT.md` - installation, usage, TensorBoard
- ✅ `TRAINING_IMPROVEMENTS.md` - this document

### Reproducibility

- ✅ Pinned dependency versions
- ✅ Seed set to 42
- ✅ Clear installation instructions
- ✅ Dataset download script
- ✅ Training/inference scripts

### Missing (Optional)

- `.gitignore` for checkpoints/logs
- `LICENSE` file
- GitHub Actions for CI/CD
- Model card template
## Data Processing vs. the Whisper Paper

### Whisper Paper Approach

- 30-second audio chunks
- 80-channel log-mel spectrogram
- 16kHz sampling rate
- Padding/truncation to 30s

### Our Implementation: ✅ Matches the Paper

```python
# WhisperProcessor handles this automatically
input_features = processor(
    audio_array,            # Raw audio
    sampling_rate=16000,    # 16kHz ✅
    return_tensors="pt"
).input_features            # Returns an 80x3000 mel spectrogram ✅
```

**What happens:**
- Audio resampled to 16kHz ✅
- Converted to an 80-channel log-mel spectrogram ✅
- Padded/truncated to 3000 frames (30s at 100 frames/second) ✅
- Normalized ✅

**For longer audio:** a sliding window with stride would be needed (not required for MINDS14).
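Such a sliding window could be sketched as below. The window and overlap values are assumptions for illustration (a 30s window with 5s overlap); Whisper's own long-form decoding instead shifts the window based on predicted timestamps.

```python
def chunk_spans(num_samples: int, sr: int = 16000,
                window_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) sample indices covering the audio with overlap."""
    window = int(window_s * sr)
    stride = int((window_s - overlap_s) * sr)
    start = 0
    while True:
        end = min(start + window, num_samples)
        yield start, end
        if end >= num_samples:
            break
        start += stride
```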
## Next Steps

### Immediate

1. Install dependencies: `pip install -r requirements.txt`
2. Retrain the model: `python project1_whisper_train.py`
3. Monitor with TensorBoard: `tensorboard --logdir=./logs`
4. Check WER improvements: WER should decrease each epoch

### Recommended

- Use the medium or large dataset (300-600 samples)
- Monitor TensorBoard for convergence
- Compare WER across epochs
- Test on real-world German audio

### Advanced

- Try Whisper-medium for better quality
- Add data augmentation (SpecAugment)
- Push the best model to the Hugging Face Hub
- Create a demo/API endpoint
## Summary

**Root causes of "no learning":**
- Evaluation never ran (API typo)
- No language conditioning for German
- Learning rate too conservative
- No quality metrics (WER)
- Dtype conflicts

**All fixed:** Training should now show measurable WER improvements and produce usable German ASR models.