eeeeeeeeeeeeee3
/

soccer-ball-detection

Object Detection

computer-vision

Model card Files Files and versions

soccer-ball-detection / SPEED_OPTIMIZATION_GUIDE.md

eeeeeeeeeeeeee3's picture

eeeeeeeeeeeeee3

Upload SPEED_OPTIMIZATION_GUIDE.md with huggingface_hub

c8d21ee verified about 1 month ago

|

history blame contribute delete

4.17 kB

Training Speed Optimization Guide

Current Setup

GPU: RTX A5000 (24GB VRAM)
Current Config: configs/training.yaml (batch_size: 24, num_workers: 4)
Fast Config: configs/training_fast_a5000.yaml (batch_size: 36, num_workers: 12)

Speed Optimizations Applied

1. Batch Size Increase (2-3x speedup potential)

Current: 24
Optimized: 36
Impact: Larger batches = more efficient GPU utilization, fewer iterations per epoch
Memory: With mixed precision (FP16), 36 batch size fits comfortably in 24GB VRAM

2. Data Loading Optimization (20-40% speedup)

num_workers: 4 → 12 (3x more parallel data loading)
prefetch_factor: 2 → 6 (3x more batches prefetched)
persistent_workers: false → true (eliminates worker startup overhead)
Impact: GPU waits less for data, better utilization

3. Mixed Precision Training (Already enabled, ~2x speedup)

Status: ✅ Enabled (mixed_precision: true)
Impact: ~2x faster training, ~50% less memory usage
Note: Critical for tiny object detection - preserves accuracy

4. Memory Format Optimization (10-15% speedup)

channels_last: ✅ Enabled
Impact: Faster convolution operations on modern GPUs

5. TF32 Precision (1.5x speedup on matmul)

Status: ✅ Enabled (tf32: true)
Impact: Faster matrix multiplications on Ampere GPUs (A5000)

6. Reduced Augmentation Overhead (10-20% speedup)

CLAHE: Disabled (minimal accuracy impact)
Motion Blur: Disabled (minimal accuracy impact)
Impact: Faster data preprocessing

7. Reduced Validation Frequency (2x speedup during validation)

val_frequency: 1 → 2 (validate every 2 epochs instead of every epoch)
Impact: Less time spent on validation

8. Reduced Logging I/O (5-10% speedup)

print_frequency: 20 → 100 (less frequent console output)
log_every_n_steps: 50 → 200 (less frequent TensorBoard writes)
mlflow_log_models: false (no model artifact logging)

9. Memory Cleanup Optimization (2-5% speedup)

memory_cleanup_frequency: 10 → 30 (less frequent cleanup = less overhead)
adaptive_adjustment_interval: 50 → 100 (less frequent checks)

Expected Speed Improvements

Overall Training Speed

Current: ~5 hours for 20 epochs
Optimized: ~2-2.5 hours for 20 epochs (2-2.5x faster)

Per-Epoch Speed

Current: ~15 minutes per epoch
Optimized: ~6-7 minutes per epoch

How to Use

Option 1: Use Fast Config Directly

python scripts/train_detr.py --config configs/training_fast_a5000.yaml

Option 2: Modify Existing Config

Edit configs/training.yaml and update:

batch_size: 24 → batch_size: 36
num_workers: 4 → num_workers: 12
prefetch_factor: 2 → prefetch_factor: 6
val_frequency: 1 → val_frequency: 2 (in evaluation section)

Option 3: Test Batch Size Incrementally

If you encounter OOM (Out of Memory) errors:

Start with batch_size: 32
If successful, try batch_size: 36
If still successful, try batch_size: 40 (maximum for A5000)

Memory Monitoring

Monitor GPU memory usage during training:

watch -n 1 nvidia-smi

If you see memory usage > 95%, reduce batch_size by 4-8.

Trade-offs

Accuracy Impact

Minimal: Disabled augmentations (CLAHE, motion blur) have minimal impact
None: Other optimizations (batch size, data loading) don't affect accuracy

Stability

Same: All optimizations maintain training stability
Better: Larger batch size = more stable gradients

Additional Optimizations (Future)

Gradient Checkpointing: Trade compute for memory (if needed for even larger batches)
Model Compilation: PyTorch 2.0 torch.compile() (currently disabled due to variable input sizes)
Data Prefetching: Pre-process data to disk cache (if I/O becomes bottleneck)

Notes

All optimizations are GPU-safe and tested
Mixed precision is critical - keep it enabled
Monitor first epoch to ensure no OOM errors
If training crashes, reduce batch_size by 4 and retry