# Pose Classifier Guide
## Overview
The pose classifier predicts the orientation of animals (zebras, giraffes, etc.) relative to the camera in aerial drone footage. The predicted pose is used for navigation and behavior analysis.
## 8-Class Pose Classification System
### Pose Classes
The classifier identifies **8 discrete pose orientations** arranged in a circle around the animal:
1. **front** - Animal facing directly toward camera
2. **front-left** - Animal facing camera, angled to the left (~45°)
3. **left** - Animal's left side visible, perpendicular to camera
4. **back-left** - Animal facing away, angled to the left (~45°)
5. **back** - Animal facing directly away from camera
6. **back-right** - Animal facing away, angled to the right (~45°)
7. **right** - Animal's right side visible, perpendicular to camera
8. **front-right** - Animal facing camera, angled to the right (~45°)
### Visual Reference
![Pose Reference Diagram](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/_reference.png)
The diagram shows the 8 pose classes arranged in a circle. The camera is positioned at the bottom, and the animal (zebra) is in the center. Each orange dot represents one of the 8 possible pose classifications.
## Example Poses
### Front Pose
**Label:** `front`
The animal is facing directly toward the camera, with the head and front body visible.
![Front Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c0_004.jpg)
---
### Front-Left Pose
**Label:** `front-left`
The animal is facing toward the camera but angled to its left (camera's right), showing both the front and left side.
![Front-Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front-left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000471_c0_005.jpg)
---
### Front-Right Pose
**Label:** `front-right`
The animal is facing toward the camera but angled to its right (camera's left), showing both the front and right side.
![Front-Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front-right/mpala_session_2_DJI_0006_partition_2_DJI_0006_006552_c0_004.jpg)
---
### Left Pose
**Label:** `left`
The animal's left side is visible, perpendicular to the camera. This is a pure profile view.
![Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000321_c1_001.jpg)
---
### Right Pose
**Label:** `right`
The animal's right side is visible, perpendicular to the camera. This is a pure profile view from the opposite side.
![Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/right/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c0_002.jpg)
---
### Back-Left Pose
**Label:** `back-left`
The animal is facing away from the camera but angled to its left, showing the rear-left quarter.
![Back-Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back-left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c1_001.jpg)
---
### Back-Right Pose
**Label:** `back-right`
The animal is facing away from the camera but angled to its right, showing the rear-right quarter.
![Back-Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back-right/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c1_000.jpg)
---
### Back Pose
**Label:** `back`
The animal is facing directly away from the camera, with the rear and back visible.
![Back Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back/mpala_session_1_DJI_0002_partition_1_DJI_0002_000321_c1_000.jpg)
---
## Model Architecture
### DINOv2 + MLP Head
The pose classifier uses a **frozen DINOv2 backbone** with a **trainable MLP classification head**:
```
Input Image (224×224)
    ↓
DINOv2 Vision Transformer (frozen)
    - Small: 384-dim features
    - Base: 768-dim features
    - Large: 1024-dim features
    ↓
MLP Head (trainable)
    - LayerNorm
    - Linear(feat_dim -> 256) + GELU + Dropout(0.3)
    - Linear(256 -> 128) + GELU + Dropout(0.3)
    - Linear(128 -> 8)
    ↓
Output Logits (8 classes)
```
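The head described above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's code: the class name `PoseHead` is hypothetical, and random tensors stand in for the frozen DINOv2-small embeddings.

```python
import torch
import torch.nn as nn


class PoseHead(nn.Module):
    """Hypothetical MLP head following the layer list in the diagram above."""

    def __init__(self, feat_dim=384, num_classes=8, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, 256), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(128, num_classes),
        )

    def forward(self, features):
        # features: (batch, feat_dim) embeddings from the frozen backbone
        return self.net(features)


head = PoseHead(feat_dim=384)      # 384 matches the DINOv2-small feature size
head.eval()                        # disable dropout for inference
features = torch.randn(4, 384)     # stand-in for backbone embeddings
with torch.no_grad():
    logits = head(features)        # one row of 8 logits per image
```

Because only the head is trainable, the optimizer needs just `head.parameters()`; the backbone can run under `torch.no_grad()`.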
### Why DINOv2?
- **Self-supervised learning** on diverse images provides strong visual features
- **Frozen backbone** reduces training time and prevents overfitting
- **Small memory footprint** suitable for deployment
- **Robust to varying image quality** from aerial footage
## Training Pipeline
### Data Organization
Training data is organized in a folder structure, with one subfolder per pose class:
```
pose_labels/
    _reference.png    # Visual guide
    front/            # Front-facing animals
    front-left/       # Front-left quarter
    left/             # Left profile
    back-left/        # Back-left quarter
    back/             # Back-facing animals
    back-right/       # Back-right quarter
    right/            # Right profile
    front-right/      # Front-right quarter
```
Alternatively, labels can be supplied as CSV files with the columns `image_path` and `pose`.
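A minimal sketch of reading such a CSV, assuming a header row with exactly those two columns. The helper `load_pose_csv` and the sample rows are hypothetical and are not taken from the training script.

```python
import csv
import io

POSE_CLASSES = {
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
}


def load_pose_csv(handle):
    """Return (image_path, pose) pairs, rejecting unknown pose labels."""
    # skipinitialspace tolerates a header written as "image_path, pose".
    reader = csv.DictReader(handle, skipinitialspace=True)
    samples = []
    for row in reader:
        pose = row["pose"].strip()
        if pose not in POSE_CLASSES:
            raise ValueError(f"unknown pose label: {pose!r}")
        samples.append((row["image_path"].strip(), pose))
    return samples


# Hypothetical file contents, for illustration only.
example = io.StringIO(
    "image_path, pose\n"
    "crops/a.jpg, front\n"
    "crops/b.jpg, back-left\n"
)
samples = load_pose_csv(example)
```

Validating labels at load time catches typos such as `front_left` before they surface as a missing class during training.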
### Data Augmentation
**Geometric Augmentation with Label Swapping:**
- Horizontal flip applied with 50% probability
- When flipped, pose labels are swapped according to symmetry:
  - `left` <-> `right`
  - `front-left` <-> `front-right`
  - `back-left` <-> `back-right`
  - `front` and `back` remain unchanged
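The label swap can be sketched as a lookup table applied together with the flip. This is an illustration under assumptions: the names `FLIP_MAP` and `random_hflip` are hypothetical, and a nested list stands in for a real image.

```python
import random

# Mirror-image label for each pose; front and back map to themselves.
FLIP_MAP = {
    "left": "right", "right": "left",
    "front-left": "front-right", "front-right": "front-left",
    "back-left": "back-right", "back-right": "back-left",
    "front": "front", "back": "back",
}


def random_hflip(pixels, label, rng, p=0.5):
    """Mirror a row-major image with probability p and swap its label."""
    if rng.random() < p:
        return [row[::-1] for row in pixels], FLIP_MAP[label]
    return pixels, label


rng = random.Random(0)
image = [[1, 2, 3], [4, 5, 6]]  # toy 2x3 "image"
flipped, new_label = random_hflip(image, "front-left", rng, p=1.0)
```

The flip and the label swap must happen in the same step; flipping the image without updating the label would teach the model mirrored poses under the wrong class.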
**Color/Transform Augmentation:**
- Random crop (256px -> 224px)
- Color jitter: brightness (±30%), contrast (±30%), saturation (±20%)
- Random rotation (±15°)
**Class Balancing:**
- Weighted random sampler ensures equal representation of all 8 classes during training
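One common way to get equal class representation is to weight each sample by the inverse of its class frequency. The sketch below assumes that scheme; the helper name `balanced_sample_weights` and the toy labels are hypothetical, and the training script may compute its weights differently.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy, imbalanced label list for illustration.
labels = ["front", "front", "front", "left", "back", "back"]
weights = balanced_sample_weights(labels)

# Total weight per class is equal, so each class is drawn equally often.
per_class = Counter()
for label, weight in zip(labels, weights):
    per_class[label] += weight
```

Weights of this form can be passed to `torch.utils.data.WeightedRandomSampler` with `replacement=True`.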
### Training Configuration
```bash
python train_pose_classifier.py \
    --data_dir ./pose_labels \
    --model_size small \
    --epochs 30 \
    --batch_size 32 \
    --lr 1e-3
```
**Key Parameters:**
- **Model size**: `small`, `base`, or `large` (DINOv2 variant)
- **Optimizer**: AdamW with weight decay 0.01
- **Loss**: CrossEntropyLoss with label smoothing (0.1)
- **Scheduler**: CosineAnnealingLR
- **Mixed precision**: Automatic on GPU
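To make the scheduler and loss settings concrete, the sketch below computes the closed-form cosine-annealed learning rate and the soft target produced by label smoothing of 0.1 over 8 classes. It assumes one scheduler step per epoch over 30 epochs and a minimum learning rate of 0 (PyTorch's default); the script's actual scheduler arguments are not stated here. The function names are hypothetical.

```python
import math


def cosine_lr(epoch, base_lr=1e-3, t_max=30, eta_min=0.0):
    """Closed-form cosine annealing: decays base_lr to eta_min over t_max epochs."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


def smoothed_target(true_index, num_classes=8, smoothing=0.1):
    """Soft target: spread `smoothing` uniformly, put the rest on the true class."""
    off_value = smoothing / num_classes
    target = [off_value] * num_classes
    target[true_index] += 1.0 - smoothing
    return target


lrs = [cosine_lr(epoch) for epoch in (0, 15, 30)]  # start, midpoint, end
target = smoothed_target(true_index=0)             # true class is `front`
```

With these settings the true class receives a target of 0.9125 and every other class 0.0125, which discourages over-confident logits on ambiguous adjacent poses.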
**Training Output:**
- Best model saved to `checkpoints/best_pose_model.pth`
- Includes confusion matrix and per-class accuracy
- Optional ONNX export for deployment
## Usage in Navigation
### Integration with Detection Pipeline
The pose classifier is used in the navigation system after animal detection:
```python
from navigation.policy.pose_classifier import ViewPointClassifier
from PIL import Image

# Initialize classifier
classifier = ViewPointClassifier(
    weight_path="model_weights/best_june_24_2025_IA_classifier_016.pth",
    device="cpu",
    threshold=0.5,
)

# Process detected animal crops
# `detection_crops` is assumed to be a list of crop image paths from the detector
crops = [Image.open(path) for path in detection_crops]
poses = classifier(crops)  # Returns list of pose strings

# Use poses for navigation decisions
for pose in poses:
    if "front" in pose:
        print("Animal is facing camera - approach with caution")
    elif "back" in pose:
        print("Animal is facing away - good for following")
```
### Multi-Label Pose System (Alternative)
The `ViewPointClassifier` in `pose_classifier.py` takes a different approach from the 8-class DINOv2 classifier described above:
- **5 multi-label classes**: `up, front, back, right, left`
- **EfficientNet-B4** backbone trained on zebra crops
- **Input size**: 512×512 pixels
- **Output**: Concatenated string (e.g., `"upfrontright"`)
- **Threshold**: 0.5 (configurable)
This allows detecting compound poses like "animal is facing front-right while looking up."
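The thresholding and concatenation step can be sketched as below. This assumes per-label scores in the order listed above and an inclusive `>=` comparison; the function `decode_viewpoint` is hypothetical and is not the `ViewPointClassifier` implementation.

```python
# Label order taken from the class list above.
VIEW_LABELS = ["up", "front", "back", "right", "left"]


def decode_viewpoint(scores, threshold=0.5):
    """Concatenate every label whose score reaches the threshold."""
    return "".join(
        label for label, score in zip(VIEW_LABELS, scores) if score >= threshold
    )


# Hypothetical per-label scores for one crop.
pose = decode_viewpoint([0.9, 0.8, 0.1, 0.7, 0.2])
```

Because the output is a concatenated string, substring checks such as `"front" in pose` work for both simple and compound poses.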
## Performance Considerations
### Inference Speed
- **DINOv2-small**: ~15-20ms per image (CPU)
- **DINOv2-base**: ~30-40ms per image (CPU)
- **GPU acceleration**: 5-10x faster
### Accuracy Targets
- **Overall accuracy**: >85% on validation set
- **Critical classes** (front/back): >90% accuracy
- **Confusion**: Most errors occur between adjacent classes (e.g., front vs. front-left)
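Since the 8 classes lie on a circle, one way to quantify adjacent-class confusion is to score predictions by circular distance and report accuracy within one step. The sketch below is an assumed evaluation helper, not part of the training script; the example pairs are made up.

```python
# Classes in circular order, as listed under "Pose Classes".
POSES = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]


def circular_distance(a, b):
    """Number of 45-degree steps between two poses, going the short way round."""
    diff = abs(POSES.index(a) - POSES.index(b))
    return min(diff, len(POSES) - diff)


def accuracy_within(pairs, tolerance=0):
    """Fraction of (true, predicted) pairs at most `tolerance` steps apart."""
    hits = sum(circular_distance(t, p) <= tolerance for t, p in pairs)
    return hits / len(pairs)


# Made-up (true, predicted) pairs for illustration.
pairs = [
    ("front", "front"),
    ("front", "front-right"),
    ("left", "right"),
    ("back", "back-left"),
]
exact = accuracy_within(pairs)                 # only exact matches count
adjacent = accuracy_within(pairs, tolerance=1) # neighbours also count
```

A large gap between the two numbers indicates that most errors are near-misses rather than gross orientation mistakes.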
### Deployment Notes
- Model checkpoint: ~150MB (small), ~350MB (base)
- ONNX export available for optimized inference
- Batch processing recommended for multiple detections
## Common Issues & Tips
### Issue: Poor performance on occluded animals
**Solution**: Train with more occluded examples, or apply confidence thresholding to reject low-confidence predictions
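Confidence thresholding can be sketched as a softmax over the 8 logits followed by a rejection rule. The threshold of 0.6 and the function names are assumptions chosen for illustration; a suitable threshold would need to be tuned on validation data.

```python
import math

POSES = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(logits, threshold=0.6):
    """Return (pose, confidence), with pose=None when confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return POSES[best], probs[best]


# Hypothetical logits: one clear prediction, one ambiguous (e.g. occluded) crop.
clear_pose, clear_conf = predict_or_reject([5.0, 0, 0, 0, 0, 0, 0, 0])
vague_pose, vague_conf = predict_or_reject([1.0, 1.0, 0, 0, 0, 0, 0, 0])
```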
### Issue: Confusion between adjacent poses
**Solution**: This is expected because orientation is continuous while the classes are discrete; consider using pose groups (front-facing vs. side-facing vs. back-facing)
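A possible grouping is sketched below. Assigning the diagonal classes to the front-facing and back-facing groups is an assumption, as are the names `POSE_GROUPS` and `group_probabilities`; a different split may suit a given application better.

```python
# Assumed mapping from the 8 classes to 3 coarse groups.
POSE_GROUPS = {
    "front": "front-facing",
    "front-left": "front-facing",
    "front-right": "front-facing",
    "left": "side-facing",
    "right": "side-facing",
    "back": "back-facing",
    "back-left": "back-facing",
    "back-right": "back-facing",
}


def group_probabilities(pose_probs):
    """Sum per-class probabilities into per-group probabilities."""
    totals = {}
    for pose, prob in pose_probs.items():
        group = POSE_GROUPS[pose]
        totals[group] = totals.get(group, 0.0) + prob
    return totals


# Hypothetical classifier output split between two adjacent classes.
probs = {"front": 0.4, "front-left": 0.35, "left": 0.15, "back": 0.1}
groups = group_probabilities(probs)
best_group = max(groups, key=groups.get)
```

Summing probabilities before taking the maximum lets two uncertain adjacent classes reinforce each other instead of competing.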
### Issue: Inconsistent predictions across frames
**Solution**: Apply temporal smoothing or majority voting across consecutive frames
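Majority voting over a sliding window can be sketched as follows. It assumes per-frame predictions for a single tracked animal are available in order; the window size of 5 and the function name are hypothetical.

```python
from collections import Counter, deque


def smooth_predictions(frame_poses, window=5):
    """Replace each prediction with the majority vote over the last `window` frames."""
    recent = deque(maxlen=window)  # oldest frame drops out automatically
    smoothed = []
    for pose in frame_poses:
        recent.append(pose)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


# Made-up per-frame predictions with two isolated outliers.
frames = ["left", "left", "front-left", "left", "left", "back", "left"]
smoothed = smooth_predictions(frames)
```

A longer window suppresses more flicker but delays the response when the animal actually turns.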
### Issue: Different performance on zebras vs. other species
**Solution**: Retrain with a dataset balanced across species, or train species-specific models
## Dataset Statistics
Current training data distribution (from folder structure):
- Folders: `front`, `front-left`, `front-right`, `left`, `right`, `back-left`, `back-right`, `back`
- Images per class: Variable (check with `train_pose_classifier.py --data_dir pose_labels`)
- Species: Primarily zebras and giraffes
- Source: Aerial drone footage from Mpala and OPC sessions
## References
- DINOv2 Paper: [https://arxiv.org/abs/2304.07193](https://arxiv.org/abs/2304.07193)
- VARe-ID (ViewPoint Classifier): [https://github.com/ziesski/VARe-ID](https://github.com/ziesski/VARe-ID)
- Individual identification of wildlife: [Review on methods used for wildlife species and individual identification](https://doi.org/10.1007/s10344-021-01549-4)
- Training script: [train_pose_classifier.py](train_pose_classifier.py)
- Navigation integration: [navigation/policy/pose_classifier.py](../navigation/policy/pose_classifier.py)
## Quick Start
1. **Prepare data**: Organize images in `pose_labels/` folders by class
2. **Train model**: `python train_pose_classifier.py --data_dir ./pose_labels --epochs 30`
3. **Evaluate**: Check confusion matrix and per-class accuracy in output
4. **Export**: Use `--export_onnx` flag for optimized deployment
5. **Integrate**: Load checkpoint and use for inference on detection crops