	# Pose Classifier Guide

	## Overview

	The pose classifier predicts the orientation of animals (zebras, giraffes, etc.) relative to the camera in aerial drone footage. This orientation is a key input for navigation and behavior analysis.

	## 8-Class Pose Classification System

	### Pose Classes

	The classifier identifies 8 discrete pose orientations arranged in a circle around the animal:

	1. front - Animal facing directly toward camera
	2. front-left - Animal facing camera, angled to the left (~45°)
	3. left - Animal's left side visible, perpendicular to camera
	4. back-left - Animal facing away, angled to the left (~45°)
	5. back - Animal facing directly away from camera
	6. back-right - Animal facing away, angled to the right (~45°)
	7. right - Animal's right side visible, perpendicular to camera
	8. front-right - Animal facing camera, angled to the right (~45°)

	### Visual Reference

	![Pose Reference Diagram](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/_reference.png)

	The diagram shows the 8 pose classes arranged in a circle. The camera is positioned at the bottom, and the animal (zebra) is in the center. Each orange dot represents one of the 8 possible pose classifications.

	## Example Poses

	### Front Pose
	Label: `front`

	The animal is facing directly toward the camera, with the head and front body visible.

	![Front Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c0_004.jpg)

	---

	### Front-Left Pose
	Label: `front-left`

	The animal is facing toward the camera but angled to its left (camera's right), showing both the front and left side.

	![Front-Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front-left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000471_c0_005.jpg)

	---

	### Front-Right Pose
	Label: `front-right`

	The animal is facing toward the camera but angled to its right (camera's left), showing both the front and right side.

	![Front-Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front-right/mpala_session_2_DJI_0006_partition_2_DJI_0006_006552_c0_004.jpg)

	---

	### Left Pose
	Label: `left`

	The animal's left side is visible, perpendicular to the camera. This is a pure profile view.

	![Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000321_c1_001.jpg)

	---

	### Right Pose
	Label: `right`

	The animal's right side is visible, perpendicular to the camera. This is a pure profile view from the opposite side.

	![Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/right/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c0_002.jpg)

	---

	### Back-Left Pose
	Label: `back-left`

	The animal is facing away from the camera but angled to its left, showing the rear-left quarter.

	![Back-Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back-left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c1_001.jpg)

	---

	### Back-Right Pose
	Label: `back-right`

	The animal is facing away from the camera but angled to its right, showing the rear-right quarter.

	![Back-Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back-right/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c1_000.jpg)

	---

	### Back Pose
	Label: `back`

	The animal is facing directly away from the camera, with the rear and back visible.

	![Back Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back/mpala_session_1_DJI_0002_partition_1_DJI_0002_000321_c1_000.jpg)

	---

	## Model Architecture

	### DINOv2 + MLP Head

	The pose classifier uses a frozen DINOv2 backbone with a trainable MLP classification head:

	```
	Input Image (224×224)
	↓
	DINOv2 Vision Transformer (frozen)
	- Small: 384-dim features
	- Base: 768-dim features
	- Large: 1024-dim features
	↓
	MLP Head (trainable)
	- LayerNorm
	- Linear(feat_dim -> 256) + GELU + Dropout(0.3)
	- Linear(256 -> 128) + GELU + Dropout(0.3)
	- Linear(128 -> 8)
	↓
	Output Logits (8 classes)
	```
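The head described above could be written in PyTorch roughly as follows. This is a minimal sketch, not the repository's code: the class name `PoseHead` is hypothetical, and random tensors stand in for the frozen DINOv2-small features so the example runs without downloading a backbone.

```python
import torch
from torch import nn


class PoseHead(nn.Module):
    """Hypothetical MLP head mirroring the layer list in the diagram."""

    def __init__(self, feat_dim: int = 384, num_classes: int = 8, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, 256), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(128, num_classes),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feat_dim) embeddings from the frozen backbone
        return self.net(features)


head = PoseHead(feat_dim=384)       # 384 matches the DINOv2-small feature size
features = torch.randn(4, 384)      # stand-in for backbone outputs
logits = head(features)
print(logits.shape)  # torch.Size([4, 8])
```

Only the head's parameters would be passed to the optimizer; the backbone stays frozen.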

	### Why DINOv2?

	- Self-supervised learning on diverse images provides strong visual features
	- Frozen backbone reduces training time and prevents overfitting
	- Small memory footprint suitable for deployment
	- Robust to varying image quality from aerial footage

	## Training Pipeline

	### Data Organization

	Training data is organized in a folder structure, with one folder per pose class:
	```
	pose_labels/
	    _reference.png   # Visual guide
	    front/           # Front-facing animals
	    front-left/      # Front-left quarter
	    left/            # Left profile
	    back-left/       # Back-left quarter
	    back/            # Back-facing animals
	    back-right/      # Back-right quarter
	    right/           # Right profile
	    front-right/     # Front-right quarter

	Alternatively, labels can be supplied as CSV files with the columns `image_path` and `pose`.
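For the CSV option, a reader might look like the sketch below. It assumes a header row naming the two columns and uses only the standard library; the function name `read_pose_csv` and the demo paths are hypothetical, and the actual training script may parse its input differently.

```python
import csv
import io

VALID_POSES = {
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
}


def read_pose_csv(handle):
    """Return (image_path, pose) pairs, rejecting unknown pose labels."""
    samples = []
    # skipinitialspace tolerates a header written as "image_path, pose"
    for row in csv.DictReader(handle, skipinitialspace=True):
        pose = row["pose"].strip()
        if pose not in VALID_POSES:
            raise ValueError(f"unknown pose label: {pose!r}")
        samples.append((row["image_path"].strip(), pose))
    return samples


# In-memory demo file; real use would pass an open file handle instead.
demo = io.StringIO("image_path, pose\ncrops/a.jpg,front\ncrops/b.jpg,back-left\n")
samples = read_pose_csv(demo)
print(samples)
```

Validating labels at load time catches typos such as `frontleft` before they become a silent ninth class.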

	### Data Augmentation

	Geometric Augmentation with Label Swapping:
	- Horizontal flip applied with 50% probability
	- When flipped, pose labels are swapped according to symmetry:
	  - `left` <-> `right`
	  - `front-left` <-> `front-right`
	  - `back-left` <-> `back-right`
	  - `front` and `back` remain unchanged
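The label swap has to happen together with the image flip, otherwise a mirrored `left` crop would be trained as `left` while showing the animal's right side. The sketch below illustrates the idea with hypothetical names (`FLIP_SWAP`, `random_hflip`); the image flip itself is passed in as a callable so the example does not depend on a particular image library.

```python
import random

# Labels that change under a left-right mirror; front and back map to themselves.
FLIP_SWAP = {
    "left": "right", "right": "left",
    "front-left": "front-right", "front-right": "front-left",
    "back-left": "back-right", "back-right": "back-left",
}


def flip_label(label: str) -> str:
    """Pose label after mirroring the image horizontally."""
    return FLIP_SWAP.get(label, label)


def random_hflip(image, label, flip_image, p=0.5, rng=random):
    """Flip image and label together with probability p.

    `flip_image` is any callable that mirrors the image horizontally.
    """
    if rng.random() < p:
        return flip_image(image), flip_label(label)
    return image, label


# Toy "image": a single row of pixels, mirrored by reversing it.
row, label = random_hflip([1, 2, 3], "front-left", lambda im: im[::-1], p=1.0)
print(row, label)  # [3, 2, 1] front-right
```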

	Color/Transform Augmentation:
	- Random crop (256px -> 224px)
	- Color jitter: brightness (±30%), contrast (±30%), saturation (±20%)
	- Random rotation (±15°)

	Class Balancing:
	- Weighted random sampler ensures equal representation of all 8 classes during training
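One common way to drive a weighted sampler is to give every sample a weight inversely proportional to the size of its class. The sketch below shows that computation in plain Python; whether the training script uses exactly this weighting is an assumption, and `balanced_sample_weights` is a hypothetical name.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = ["front", "front", "front", "left"]
weights = balanced_sample_weights(labels)
print(weights)  # the three "front" samples share the weight that "left" gets alone
```

In PyTorch, a list like this can be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)` so that each class is drawn with roughly equal probability.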

	### Training Configuration

	```bash
	python train_pose_classifier.py \
	--data_dir ./pose_labels \
	--model_size small \
	--epochs 30 \
	--batch_size 32 \
	--lr 1e-3
	```

	Key Parameters:
	- Model size: `small`, `base`, or `large` (DINOv2 variant)
	- Optimizer: AdamW with weight decay 0.01
	- Loss: CrossEntropyLoss with label smoothing (0.1)
	- Scheduler: CosineAnnealingLR
	- Mixed precision: Automatic on GPU
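As a rough illustration of how these settings fit together in PyTorch, the sketch below wires up the optimizer, loss, and scheduler around a stand-in model and fake data. It is not the training script: the `nn.Linear` stand-in, the random batch, and the choice of `T_max=epochs` (one scheduler step per epoch) are assumptions, and mixed precision is omitted for brevity.

```python
import torch
from torch import nn

epochs = 30
model = nn.Linear(384, 8)  # stand-in for the trainable MLP head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# Assumption: the cosine schedule spans the whole run, stepped once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

features = torch.randn(32, 384)        # one fake batch of backbone features
targets = torch.randint(0, 8, (32,))   # fake class indices in [0, 8)

for epoch in range(epochs):
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()

final_lr = optimizer.param_groups[0]["lr"]
print(final_lr)  # the cosine schedule has decayed the rate to about zero
```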

	Training Output:
	- Best model saved to `checkpoints/best_pose_model.pth`
	- Includes confusion matrix and per-class accuracy
	- Optional ONNX export for deployment

	## Usage in Navigation

	### Integration with Detection Pipeline

	The pose classifier is used in the navigation system after animal detection:

	```python
	from navigation.policy.pose_classifier import ViewPointClassifier
	from PIL import Image

	# Initialize classifier
	classifier = ViewPointClassifier(
	    weight_path="model_weights/best_june_24_2025_IA_classifier_016.pth",
	    device="cpu",
	    threshold=0.5,
	)

	# Process detected animal crops.
	# `detection_crops` is the list of crop image paths produced by the detector.
	crops = [Image.open(path) for path in detection_crops]
	poses = classifier(crops)  # Returns list of pose strings

	# Use poses for navigation decisions
	for pose in poses:
	    if "front" in pose:
	        print("Animal is facing camera - approach with caution")
	    elif "back" in pose:
	        print("Animal is facing away - good for following")
	```

	### Multi-Label Pose System (Alternative)

	The `ViewPointClassifier` in `pose_classifier.py`, used in the example above, takes a different approach from the 8-class DINOv2 model:

	- 5 multi-label classes: `up, front, back, right, left`
	- EfficientNet-B4 backbone trained on zebra crops
	- Input size: 512×512 pixels
	- Output: Concatenated string (e.g., `"upfrontright"`)
	- Threshold: 0.5 (configurable)

	This allows detecting compound poses like "animal is facing front-right while looking up."
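The concatenated output can be pictured as thresholding one probability per label and joining the names that pass. The sketch below is an illustration under that assumption, not the `ViewPointClassifier` implementation; `decode_multilabel` is a hypothetical name, and treating a probability equal to the threshold as a hit is also an assumption.

```python
MULTI_LABELS = ["up", "front", "back", "right", "left"]


def decode_multilabel(probs, threshold=0.5):
    """Join every label whose probability reaches the threshold."""
    return "".join(
        name for name, p in zip(MULTI_LABELS, probs) if p >= threshold
    )


print(decode_multilabel([0.9, 0.8, 0.1, 0.7, 0.2]))  # upfrontright
```

Because the result is a plain concatenation, downstream code can test for a component with a substring check such as `"front" in pose`.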

	## Performance Considerations

	### Inference Speed
	- DINOv2-small: ~15-20ms per image (CPU)
	- DINOv2-base: ~30-40ms per image (CPU)
	- GPU acceleration: 5-10x faster

	### Accuracy Targets
	- Overall accuracy: >85% on validation set
	- Critical classes (front/back): >90% accuracy
	- Confusion: Most errors occur between adjacent classes (e.g., front vs. front-left)
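To check these targets on a validation set, per-class accuracy and the share of errors that fall on a neighboring class can be computed from two label lists. The sketch below uses only the standard library; the function names are hypothetical, and `POSE_CLASSES` follows the circular order of the class list in this guide, which may differ from the index order stored in a checkpoint.

```python
from collections import Counter

# Classes in circular order: each entry is adjacent to the next, and last to first.
POSE_CLASSES = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]


def per_class_accuracy(y_true, y_pred):
    """Accuracy for each class that appears in y_true."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


def adjacent_error_fraction(y_true, y_pred):
    """Fraction of errors that land on a neighboring class on the circle."""
    n = len(POSE_CLASSES)
    errors = adjacent = 0
    for t, p in zip(y_true, y_pred):
        if t == p:
            continue
        errors += 1
        diff = abs(POSE_CLASSES.index(t) - POSE_CLASSES.index(p))
        adjacent += int(min(diff, n - diff) == 1)
    return adjacent / errors if errors else 0.0


y_true = ["front", "front", "back", "left"]
y_pred = ["front", "front-left", "back", "right"]
print(per_class_accuracy(y_true, y_pred))       # {'front': 0.5, 'back': 1.0, 'left': 0.0}
print(adjacent_error_fraction(y_true, y_pred))  # 0.5
```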

	### Deployment Notes
	- Model checkpoint: ~150MB (small), ~350MB (base)
	- ONNX export available for optimized inference
	- Batch processing recommended for multiple detections

	## Common Issues & Tips

	### Issue: Poor performance on occluded animals
	Solution: Train with more occluded examples or use confidence thresholding
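Confidence thresholding can be sketched as a softmax over the 8 logits followed by a cutoff. The names and the 0.6 cutoff below are illustrative assumptions; a suitable threshold would need to be tuned on validation data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, classes, min_conf=0.6):
    """Return (label, confidence); label is None when confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = classes[best] if probs[best] >= min_conf else None
    return label, probs[best]


classes = ["front", "front-left", "left", "back-left",
           "back", "back-right", "right", "front-right"]

confident = predict_with_threshold([5, 0, 0, 0, 0, 0, 0, 0], classes)
ambiguous = predict_with_threshold([1, 1, 0, 0, 0, 0, 0, 0], classes)
print(confident[0], ambiguous[0])  # front None
```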

	### Issue: Confusion between adjacent poses
	Solution: This is expected due to the continuous nature of orientations; consider using pose groups (front-facing vs. side-facing vs. back-facing)
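A grouping could be as simple as a lookup table. In the sketch below, assigning the diagonal classes to the front-facing and back-facing groups is an assumption; a different split may suit a given application better.

```python
# Hypothetical grouping: diagonal poses are folded into front-facing / back-facing.
POSE_GROUPS = {
    "front": "front-facing",
    "front-left": "front-facing",
    "front-right": "front-facing",
    "left": "side-facing",
    "right": "side-facing",
    "back": "back-facing",
    "back-left": "back-facing",
    "back-right": "back-facing",
}


def to_group(pose: str) -> str:
    """Map one of the 8 pose labels to its coarse group."""
    return POSE_GROUPS[pose]


print(to_group("front-left"), to_group("right"))  # front-facing side-facing
```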

	### Issue: Inconsistent predictions across frames
	Solution: Apply temporal smoothing or majority voting across consecutive frames
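Majority voting over a sliding window is one simple form of temporal smoothing. The sketch below assumes one smoother instance per tracked animal; the class name and window size are illustrative.

```python
from collections import Counter, deque


class MajorityVoteSmoother:
    """Return the most frequent pose among the last `window` predictions."""

    def __init__(self, window: int = 5):
        self.history = deque(maxlen=window)

    def update(self, pose: str) -> str:
        self.history.append(pose)
        # On ties, Counter keeps first-seen order, so the older label wins.
        return Counter(self.history).most_common(1)[0][0]


smoother = MajorityVoteSmoother(window=3)
frames = ["left", "left", "front-left", "left", "left"]
smoothed = [smoother.update(p) for p in frames]
print(smoothed)  # ['left', 'left', 'left', 'left', 'left']
```

A larger window suppresses more flicker but reacts more slowly when the animal actually turns.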

	### Issue: Different performance on zebras vs. other species
	Solution: Retrain with balanced dataset across species, or train species-specific models

	## Dataset Statistics

	Current training data distribution (from folder structure):
	- Folders: `front`, `front-left`, `front-right`, `left`, `right`, `back-left`, `back-right`, `back`
	- Images per class: Variable (check with `train_pose_classifier.py --data_dir pose_labels`)
	- Species: Primarily zebras and giraffes
	- Source: Aerial drone footage from Mpala and OPC sessions

	## References

	- DINOv2 Paper: [https://arxiv.org/abs/2304.07193](https://arxiv.org/abs/2304.07193)
	- VARe-ID (ViewPoint Classifier): [https://github.com/ziesski/VARe-ID](https://github.com/ziesski/VARe-ID)
	- Individual identification of wildlife: [Review on methods used for wildlife species and individual identification](https://doi.org/10.1007/s10344-021-01549-4)
	- Training script: [train_pose_classifier.py](train_pose_classifier.py)
	- Navigation integration: [navigation/policy/pose_classifier.py](../navigation/policy/pose_classifier.py)

	## Quick Start

	1. Prepare data: Organize images in `pose_labels/` folders by class
	2. Train model: `python train_pose_classifier.py --data_dir ./pose_labels --epochs 30`
	3. Evaluate: Check confusion matrix and per-class accuracy in output
	4. Export: Use `--export_onnx` flag for optimized deployment
	5. Integrate: Load checkpoint and use for inference on detection crops