	---
	license: apache-2.0
	tags:
	- object-detection
	- yolo
	- edge-ai
	- quantization
	datasets:
	- MTID
	- Roboflow
	- VisDrone
	library_name: ultralytics
	---

	# YOLO11 GhostConv + Knowledge Distillation + Quantization

	This notebook implements a complete optimization pipeline that prepares YOLO11 for edge devices. It combines a custom GhostConv architecture, knowledge distillation, and quantization.

	## 📋 Table of Contents

	- [Overview](#overview)
	- [Notebook Structure](#notebook-structure)
	- [System Requirements](#system-requirements)
	- [Installation](#installation)
	- [Usage Guide](#usage-guide)
	- [Results](#results)
	- [References](#references)

	## 🎯 Overview

	This notebook implements a 3-stage YOLO11 optimization pipeline:

	### 1. Custom Architecture (YOLO11n-GhostConv)
	- Replace Conv layers with GhostConv to reduce parameters
	- Retain C3k2 and C2PSA blocks for feature extraction
	- Architecture optimized for traffic dataset (5 classes)
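
To make the parameter saving concrete, the sketch below counts convolution weights for a standard layer and for a GhostConv layer. It assumes the Ultralytics `GhostConv` layout (a primary convolution that produces half of the output channels, followed by a 5x5 depthwise convolution on those channels) and ignores BatchNorm parameters. The helper names are illustrative and do not come from the notebook.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (no bias, BatchNorm ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in: int, c_out: int, k: int, dw_k: int = 5) -> int:
    """Weights in a GhostConv layer, assuming the Ultralytics layout."""
    hidden = c_out // 2
    primary = conv_params(c_in, hidden, k)  # regular conv to half the channels
    cheap = hidden * dw_k * dw_k            # depthwise conv: one filter per channel
    return primary + cheap


standard = conv_params(64, 128, 3)
ghost = ghost_conv_params(64, 128, 3)
print(standard, ghost, round(ghost / standard, 3))  # 73728 38464 0.522
```

For a 64 to 128 channel layer with a 3x3 kernel, the GhostConv variant needs roughly half the weights, which is where most of the parameter reduction in the backbone comes from.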

	### 2. Knowledge Distillation (KD)
	- Teacher model: YOLO11l (large model)
	- Student model: YOLO11n-GhostConv (custom lightweight)
	- Techniques:
	  - Feature-based distillation (MSE loss)
	  - Logit-based distillation (KL divergence)
	  - Temperature scaling (T=4.0)
	  - Progressive KD with warmup epochs

	### 3. Quantization
	- FP32 → INT8 quantization with TFLite
	- FP32 → FP16 quantization
	- Calibration dataset for INT8
	- Performance comparison: FP32 vs INT8 vs FP16

	## 📁 Notebook Structure

	### Section 1: Initialization
	- Mount Google Drive
	- Setup project directories
	- Import Ultralytics modules (GhostConv, C3k2, C2PSA)
	- Clone and install Ultralytics from source

	### Section 2: Custom Architecture
	- Define YOLO11_TinyGhost architecture in YAML
	- Backbone with GhostConv layers
	- Head with Detect layer for 5 classes
	- Train baseline model (50 epochs)

	### Section 3: Knowledge Distillation
	Class implementations:
	- `KDConfig`: Configuration for KD training
	- `KnowledgeDistillationTrainer`: Custom trainer inheriting from DetectionTrainer
	  - Forward hooks to capture intermediate features
	  - Feature distillation loss (normalized MSE)
	  - Logit distillation loss (KL divergence with temperature)
	  - Combined loss: `(1-α-β)*L_hard + α*L_feature + β*L_logit`

	Training strategy:
	- Warmup phase (8 epochs): hard loss only
	- After warmup: combine hard + KD losses
	- KD layers: ["model.4", "model.6", "model.10"] (P3, P4, PSA)
	- Hyperparameters: α=0.3, β=0.2, T=4.0
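
The training strategy above can be sketched as a small weighting function. This is a minimal illustration rather than the notebook's trainer code: the function names are hypothetical, and the epoch index is assumed to be zero-based.

```python
def kd_loss_weights(epoch: int, warmup_epochs: int = 8,
                    alpha: float = 0.3, beta: float = 0.2):
    """Return (w_hard, w_feature, w_logit) for a zero-based epoch index."""
    if epoch < warmup_epochs:
        # Warmup phase: only the ground-truth (hard) loss is used.
        return 1.0, 0.0, 0.0
    # After warmup: hard loss plus feature and logit distillation terms.
    return 1.0 - alpha - beta, alpha, beta


def combined_loss(l_hard: float, l_feature: float, l_logit: float, epoch: int) -> float:
    """Weighted sum of the three loss terms for the given epoch."""
    w_hard, w_feature, w_logit = kd_loss_weights(epoch)
    return w_hard * l_hard + w_feature * l_feature + w_logit * l_logit


print(kd_loss_weights(3))   # still in warmup
print(kd_loss_weights(10))  # distillation active
```

With α=0.3 and β=0.2, the hard loss keeps a weight of about 0.5 once distillation is active.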

	### Section 4: Visualization
	- Training metrics plotting (mAP, loss curves)
	- F1 score tracking
	- Precision/Recall curves
	- Box/Class/DFL loss comparison
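
F1 is the harmonic mean of precision and recall, so it can be derived from the logged precision and recall values when it is not logged directly. The helper below is a minimal sketch; how the notebook reads its metrics is not shown here, so the inputs are plain floats.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.9, 0.8), 3))  # 0.847
```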

	### Section 5: Fine-tuning
	- Load best KD checkpoint
	- Fine-tune on multi-view intersection dataset
	- Freeze 3 backbone layers
	- Low learning rate (1e-5) with cosine scheduler
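
For intuition about the cosine scheduler, the sketch below shows a generic cosine decay from the starting rate of 1e-5. The final fraction of 0.01 and the 20-epoch horizon are illustrative assumptions, and the exact curve used by the Ultralytics trainer may differ in detail.

```python
import math


def cosine_lr(epoch: int, total_epochs: int,
              lr0: float = 1e-5, final_fraction: float = 0.01) -> float:
    """Generic cosine decay from lr0 down to lr0 * final_fraction."""
    lr_min = lr0 * final_fraction
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * progress))


# Hypothetical 20-epoch fine-tuning run
for epoch in (0, 5, 10, 15, 20):
    print(epoch, f"{cosine_lr(epoch, 20):.2e}")
```

The rate falls slowly at first, fastest around the midpoint, and flattens out near the end, which keeps late updates small while fine-tuning.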

	### Section 6: Quantization
	Export formats:
	- INT8 TFLite (with calibration dataset)
	- FP16 TFLite

	Evaluation:
	- Compare mAP50 and mAP50-95
	- FP32 vs INT8 vs FP16
	- Image size: 416x416

	## 🔧 System Requirements

	### Hardware
	- GPU: CUDA-compatible (T4 or better recommended)
	- RAM: 16GB+
	- Storage: 10GB+ for datasets and models

	### Software
	```
	Python >= 3.8
	PyTorch >= 1.13
	CUDA >= 11.3
	Google Colab (recommended)
	```

	## 📦 Installation

	### 1. Clone Ultralytics from source
	```bash
	!git clone https://github.com/ultralytics/ultralytics
	%cd ultralytics
	!pip install -e .
	```

	### 2. Dependencies
	```bash
	pip install torch torchvision
	pip install matplotlib pandas
	pip install opencv-python pillow
	```

	### 3. Dataset structure
	```
	dataset/
	├── images/
	│   ├── train/
	│   └── val/
	├── labels/
	│   ├── train/
	│   └── val/
	└── data.yaml
	```
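
Before training, it can help to check that the layout above is consistent. The sketch below lists images that have no matching label file. The function name and the set of image suffixes are assumptions made for illustration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def find_images_without_labels(dataset_root, split):
    """Return sorted image file names in images/<split> lacking a labels/<split>/<stem>.txt."""
    root = Path(dataset_root)
    image_dir = root / "images" / split
    label_dir = root / "labels" / split
    missing = []
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files such as caches
        if not (label_dir / f"{image_path.stem}.txt").exists():
            missing.append(image_path.name)
    return missing
```

Calling `find_images_without_labels(DATASET_DIR, "train")` returns the unlabeled images. An image without a label file is normally treated as background, so a non-empty result is something to review rather than necessarily an error.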

	## 🚀 Usage Guide

	### Step 1: Prepare Data
	```python
	PROJECT_DIR = "/content/drive/MyDrive/yolo_ghostblock"
	DATASET_DIR = "/content/drive/MyDrive/dataset/yolo_mtid_motor/dataset"
	```

	### Step 2: Train Baseline GhostConv Model
	```python
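	from ultralytics import YOLO

	# Build the student from the custom architecture YAML and train the baseline
	model = YOLO("yolo11_tinyghost.yaml")
	model.train(
	    data=f"{DATASET_DIR}/data.yaml",
	    epochs=50,
	    imgsz=640,
	    device=0,
	)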
	```

	### Step 3: Knowledge Distillation
	```python
	# Load teacher and student
	teacher = YOLO("path/to/teacher.pt")
	student = YOLO("path/to/student.pt")

	# Create KD trainer
	TrainerClass = create_kd_trainer_class(
	    teacher_model=teacher,
	    kd_alpha=0.3,
	    kd_beta=0.2,
	    kd_temperature=4.0,
	    kd_layers=["model.4", "model.6", "model.10"],
	)

	# Train with KD
	trainer = TrainerClass(overrides={...})
	trainer.train()
	```

	### Step 4: Quantization
	```python
	# Export INT8
	model.export(
	    format="tflite",
	    int8=True,
	    data=CALIB_YAML,
	    imgsz=416,
	)

	# Evaluate quantized model
	model_int8 = YOLO("best_int8.tflite")
	metrics = model_int8.val(data=DATA_YAML, imgsz=416)
	```

	## 📊 Results

	### Model Comparison

	| Model | Parameters | Size | mAP50 | mAP50-95 |
	|-------|------------|------|-------|----------|
	| YOLO11l (Teacher) | ~20M | ~40MB | 0.95+ | 0.80+ |
	| YOLO11n-Ghost | ~2M | ~4MB | 0.92+ | 0.75+ |
	| + KD | ~2M | ~4MB | 0.94+ | 0.78+ |
	| + INT8 | ~2M | ~1MB | 0.93+ | 0.76+ |

	### Quantization Impact
	- FP32 → INT8: ~75% size reduction, ~1-2% mAP drop
	- FP32 → FP16: ~50% size reduction, ~0.5% mAP drop
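
The size figures follow from the number of bits stored per weight. The short calculation below assumes that weights dominate the file size and ignores metadata and quantization parameters, so real files will deviate slightly.

```python
def size_reduction(bits_before: int, bits_after: int) -> float:
    """Fractional size reduction when each weight shrinks from bits_before to bits_after."""
    return 1.0 - bits_after / bits_before


for name, bits in (("FP16", 16), ("INT8", 8)):
    print(f"FP32 -> {name}: {size_reduction(32, bits):.0%} smaller")
```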

	### Training Curves
	- Box Loss: converges after ~30 epochs
	- mAP50: reaches plateau ~35-40 epochs
	- F1 Score: 0.85-0.90 range

	## 📖 Technical Details

	### GhostConv Architecture
	```yaml
	backbone:
	- [-1, 1, GhostConv, [64, 3, 2]]
	- [-1, 1, GhostConv, [128, 3, 2]]
	- [-1, 1, C3k2, [256, False, 0.25]]
	...
	```

	### KD Loss Formula
	```
	L_total = (1 - α - β) * L_hard + α * L_feature + β * L_logit

	# F_s, F_t: student / teacher features; z_s, z_t: student / teacher logits; T: temperature
	L_feature = MSE(normalize(F_s), normalize(F_t))
	L_logit = T² * KL(softmax(z_t / T) || softmax(z_s / T))
	```
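
The two distillation terms can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the notebook's implementation: `normalize` is taken to mean L2 normalization, the KL divergence is computed from teacher to student, and real training operates on tensors whose channel counts may need an adapter layer to match.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def feature_kd_loss(student_feat, teacher_feat):
    """Mean squared error between L2-normalized feature vectors."""
    s = l2_normalize(student_feat)
    t = l2_normalize(teacher_feat)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


print(logit_kd_loss([2.0, 0.0, -1.0], [0.0, 2.0, -1.0]))
print(feature_kd_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Identical logits give a logit loss of zero, and features that differ only in scale give a feature loss close to zero, because normalization removes the magnitude before the comparison.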

	### Quantization Config
	- INT8: Post-training quantization with calibration
	- Calibration: 100-200 images from training set
	- Input: uint8 [0, 255] or float32 normalized
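
Post-training INT8 quantization maps each float range to integers through a scale and a zero point estimated from calibration data. The sketch below shows the generic affine scheme on a handful of values. It is a conceptual illustration only; the TFLite converter chooses ranges per tensor or per channel internally, and the function names here are hypothetical.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from observed values; the range always includes 0."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to a clamped integer code."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


calibration_values = [-1.0, 0.0, 0.5, 2.0]  # stand-in for activations seen during calibration
scale, zero_point = calibrate(calibration_values)
for value in calibration_values:
    code = quantize(value, scale, zero_point)
    print(value, code, round(dequantize(code, scale, zero_point), 4))
```

Values inside the calibrated range are recovered to within half a quantization step. Values outside it are clamped, which is why the calibration images need to cover the conditions seen at deployment.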

	## ⚙️ Hyperparameters

	### Training
	- Epochs: 40-50
	- Batch size: 16
	- Image size: 640x640
	- Learning rate: 5e-5 (baseline), 1e-5 (fine-tune)
	- Optimizer: AdamW with cosine scheduler

	### Knowledge Distillation
	- α (feature): 0.3
	- β (logit): 0.2
	- Temperature: 4.0
	- Warmup epochs: 8
	- KD layers: P3, P4, PSA output

	### Quantization
	- Format: TFLite
	- Input size: 416x416 (edge deployment)
	- Calibration samples: 100

	## 🐛 Troubleshooting

	### Issue 1: CUDA Out of Memory
	```python
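	# Reduce the batch size (the default configuration above uses 16)
	batch = 8

	# Keep automatic mixed precision enabled
	amp = True

	# Pass both to training, e.g. model.train(..., batch=batch, amp=amp)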
	```

	### Issue 2: Feature Shape Mismatch in KD
	- Check teacher and student architecture compatibility
	- Verify KD layer names match between models
	- Ensure input sizes are consistent

	### Issue 3: INT8 Quantization Accuracy Drop
	- Increase number of calibration samples
	- Use representative dataset (diverse conditions)
	- Consider QAT (Quantization-Aware Training)
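
One way to build a more representative calibration set is to sample evenly across capture conditions. The sketch below assumes each image comes with a condition tag such as "day" or "night". Those tags, the function name, and the round-robin strategy are illustrative assumptions and are not taken from the notebook.

```python
import random
from collections import defaultdict


def pick_calibration_set(items, n_samples=100, seed=0):
    """Pick up to n_samples image paths, cycling over conditions for diversity.

    items: iterable of (image_path, condition) pairs.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for path, condition in items:
        groups[condition].append(path)
    for paths in groups.values():
        rng.shuffle(paths)  # random order within each condition

    pools = [groups[key] for key in sorted(groups)]
    picked = []
    while len(picked) < n_samples and any(pools):
        for pool in pools:
            if pool and len(picked) < n_samples:
                picked.append(pool.pop())
    return picked


items = [(f"day_{i}.jpg", "day") for i in range(50)]
items += [(f"night_{i}.jpg", "night") for i in range(5)]
subset = pick_calibration_set(items, n_samples=10)
print(sum(name.startswith("night") for name in subset))  # 5
```

Rare conditions are picked until they run out instead of being drowned out by the majority condition, which plain random sampling would tend to do.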

	## 📚 References

	### Papers
	- [YOLO11](https://docs.ultralytics.com/models/yolo11/)
	- [GhostNet: More Features from Cheap Operations](https://arxiv.org/abs/1911.11907)
	- [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
	- [Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper](https://arxiv.org/abs/1806.08342)

	### Resources
	- [Ultralytics Documentation](https://docs.ultralytics.com/)
	- [TFLite Quantization Guide](https://www.tensorflow.org/lite/performance/post_training_quantization)

	## 🎯 Key Features

	### Architecture Optimization
	- GhostConv: Reduces FLOPs by ~50% compared to standard convolutions
	- Lightweight backbone: Maintains accuracy while reducing parameters
	- Flexible head: Supports multiple detection tasks

	### Knowledge Distillation
	- Multi-level distillation: Combines feature and logit knowledge transfer
	- Temperature-scaled softmax: Smooths probability distributions
	- Progressive training: Warmup phase for stable convergence

	### Model Compression
	- INT8 quantization: 4x memory reduction
	- FP16 quantization: 2x memory reduction
	- Edge-ready: Optimized for mobile/embedded deployment

	## 💡 Best Practices

	### Training
	1. Start with pre-trained weights when possible
	2. Use data augmentation (mosaic, mixup, etc.)
	3. Monitor validation metrics closely
	4. Apply early stopping (patience=10-15)
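
Early stopping with a patience window can be sketched as follows. The Ultralytics trainer exposes this through its `patience` argument, so the class below only illustrates the logic, and the validation scores are made up.

```python
class EarlyStopper:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = 0

    def step(self, epoch, score):
        """Record a validation score; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
        return epoch - self.best_epoch >= self.patience


stopper = EarlyStopper(patience=3)
val_map50 = [0.50, 0.60, 0.70, 0.69, 0.68, 0.70, 0.66]  # made-up scores
stopped_at = next(
    (epoch for epoch, score in enumerate(val_map50) if stopper.step(epoch, score)),
    None,
)
print(stopped_at)  # 5
```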

	### Knowledge Distillation
	1. Ensure teacher model is well-trained (mAP > 90%)
	2. Match batch normalization statistics
	3. Use appropriate temperature (T=3-5 for object detection)
	4. Gradually increase KD loss weight
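
A gradual increase of the KD weight can be written as a linear ramp that starts after warmup. In this sketch the 5-epoch ramp length is a hypothetical value and the function names are illustrative; the α and β defaults mirror the hyperparameters listed in this README.

```python
def kd_ramp(epoch, warmup_epochs=8, ramp_epochs=5):
    """Fraction of the full KD weight to apply at a zero-based epoch."""
    if epoch < warmup_epochs:
        return 0.0
    return min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)


def ramped_weights(epoch, alpha=0.3, beta=0.2, warmup_epochs=8, ramp_epochs=5):
    """Return (w_hard, w_feature, w_logit) with KD terms scaled by the ramp."""
    ramp = kd_ramp(epoch, warmup_epochs, ramp_epochs)
    w_feature, w_logit = alpha * ramp, beta * ramp
    return 1.0 - w_feature - w_logit, w_feature, w_logit


for epoch in (7, 8, 10, 12, 20):
    print(epoch, [round(w, 2) for w in ramped_weights(epoch)])
```

Compared with switching the KD terms on at full strength, the ramp avoids a sudden change in the loss landscape at the end of warmup.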

	### Quantization
	1. Use diverse calibration dataset
	2. Test on representative test set
	3. Profile inference speed on target device
	4. Consider hybrid quantization (some layers FP32)
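
For profiling on the target device, a small timing harness with warmup runs is usually enough. The sketch below times an arbitrary callable. The lambda is a stand-in for a real inference call such as an interpreter invocation, and the warmup and run counts are arbitrary defaults.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=30):
    """Time fn() over several runs and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialisation before measuring
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


stats = benchmark(lambda: sum(range(10_000)))  # stand-in workload
print(stats)
```

The median is less sensitive than the mean to occasional slow runs caused by thermal throttling or background processes, so reporting both is useful on embedded hardware.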

	## 📈 Performance Metrics

	### Speed Benchmarks
	| Model | FP32 (ms) | FP16 (ms) | INT8 (ms) | Device |
	|-------|-----------|-----------|-----------|--------|
	| YOLO11l | 45 | 28 | N/A | T4 GPU |
	| YOLO11n-Ghost | 12 | 8 | N/A | T4 GPU |
	| INT8 TFLite | N/A | N/A | 25 | Edge TPU |

	### Accuracy vs Efficiency
	- YOLO11l: Highest accuracy, largest model
	- YOLO11n-Ghost: Best accuracy/size trade-off
	- + KD: Closes gap with teacher
	- + INT8: Minimal accuracy loss, deployable

	## 🔄 Workflow Summary

	```mermaid
	graph LR
	A[YOLO11l Teacher] --> B[Design GhostConv Student]
	B --> C[Train Baseline]
	C --> D[Knowledge Distillation]
	D --> E[Fine-tune]
	E --> F[Quantize INT8/FP16]
	F --> G[Deploy to Edge]
	```

	## 🚀 Deployment

	### TFLite Conversion
	```python
	# Export to TFLite INT8
	model.export(
	format="tflite",
	int8=True,
	data="calibration.yaml",
	imgsz=416
	)
	```

	### Inference Example
	```python
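	import numpy as np
	import tensorflow as tf
	from PIL import Image

	# Load TFLite model and look up its input/output tensors
	interpreter = tf.lite.Interpreter(model_path="best_int8.tflite")
	interpreter.allocate_tensors()
	input_details = interpreter.get_input_details()
	output_details = interpreter.get_output_details()

	# Preprocess image: force 3 channels, resize, add a batch dimension
	img = Image.open("test.jpg").convert("RGB").resize((416, 416))
	input_data = np.expand_dims(np.asarray(img, dtype=np.uint8), axis=0)
	if input_details[0]["dtype"] == np.float32:
	    # Float-input models expect values normalized to [0, 1]
	    input_data = input_data.astype(np.float32) / 255.0

	# Run inference
	interpreter.set_tensor(input_details[0]["index"], input_data)
	interpreter.invoke()
	output = interpreter.get_tensor(output_details[0]["index"])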
	```
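
The example above stretches the image to 416x416, which changes its aspect ratio. If the model was trained and validated on letterboxed inputs, as Ultralytics pipelines typically are, then aspect-preserving resizing with padding may match training more closely. The sketch below covers only the geometry, and its function names are illustrative.

```python
def letterbox_geometry(src_w, src_h, dst=416):
    """Return (scale, resized (w, h), left/top padding (x, y)) for a square dst canvas."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst - new_w) // 2
    pad_y = (dst - new_h) // 2
    return scale, (new_w, new_h), (pad_x, pad_y)


def point_to_source(x, y, scale, pad):
    """Map a point from the letterboxed canvas back to source-image pixels."""
    return (x - pad[0]) / scale, (y - pad[1]) / scale


scale, resized, pad = letterbox_geometry(1920, 1080)
print(resized, pad)                             # (416, 234) (0, 91)
print(point_to_source(208, 208, scale, pad))    # roughly the centre of the source image
```

The same scale and padding are needed after inference to map predicted box coordinates back onto the original frame.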

	## 👥 Contributing

	Contributions are welcome! Areas for improvement:
	- Additional distillation techniques (attention transfer, etc.)
	- QAT implementation
	- More lightweight architectures
	- Deployment examples for different platforms

	## 📄 License

	This notebook follows the Ultralytics AGPL-3.0 License.

	## 🙏 Acknowledgments

	- [Ultralytics](https://ultralytics.com/) for YOLO11 framework
	- [GhostNet](https://github.com/huawei-noah/ghostnet) for efficient convolution design
	- Google Colab for compute resources

	---

	Note: This notebook is designed to run on Google Colab with GPU runtime. Adjust paths and configurations for local environments as needed.

	Last Updated: January 2026
	Version: v11
	Compatibility: Ultralytics 8.3+ (the first release series with YOLO11 support)