Pose Classifier Guide
Overview
The pose classifier predicts the orientation of animals (zebras, giraffes, etc.) relative to the camera in aerial drone footage. The predicted pose feeds navigation decisions and behavior analysis.
8-Class Pose Classification System
Pose Classes
The classifier identifies 8 discrete pose orientations arranged in a circle around the animal:
- front - Animal facing directly toward camera
- front-left - Animal facing camera, angled to the left (~45°)
- left - Animal's left side visible, perpendicular to camera
- back-left - Animal facing away, angled to the left (~45°)
- back - Animal facing directly away from camera
- back-right - Animal facing away, angled to the right (~45°)
- right - Animal's right side visible, perpendicular to camera
- front-right - Animal facing camera, angled to the right (~45°)
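Because the eight classes sit at 45° steps around a circle, a continuous heading can be binned into a label with simple arithmetic. The sketch below assumes a convention this guide does not state: 0° corresponds to front, and each 45° step moves to the next label in the order of the list above. `POSE_CLASSES` and `angle_to_pose` are hypothetical names, not part of the training script.

```python
# Hypothetical helper: bins a heading angle into one of the 8 pose labels.
# Assumed convention: 0 deg = "front"; each +45 deg moves one step down the list.
POSE_CLASSES = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]

def angle_to_pose(angle_deg: float) -> str:
    """Return the pose label whose 45-degree sector contains angle_deg."""
    # Shift by half a sector so each label is centred on its nominal angle.
    sector = int(((angle_deg % 360.0) + 22.5) // 45.0) % 8
    return POSE_CLASSES[sector]

print(angle_to_pose(0))    # front
print(angle_to_pose(180))  # back
```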
Visual Reference
The diagram shows the 8 pose classes arranged in a circle. The camera is positioned at the bottom, and the animal (zebra) is in the center. Each orange dot represents one of the 8 possible pose classifications.
Example Poses
Front Pose
Label: front
The animal is facing directly toward the camera, with the head and front body visible.
Front-Left Pose
Label: front-left
The animal is facing toward the camera but angled to its left (camera's right), showing both the front and left side.
Front-Right Pose
Label: front-right
The animal is facing toward the camera but angled to its right (camera's left), showing both the front and right side.
Left Pose
Label: left
The animal's left side is visible, perpendicular to the camera. This is a pure profile view.
Right Pose
Label: right
The animal's right side is visible, perpendicular to the camera. This is a pure profile view from the opposite side.
Back-Left Pose
Label: back-left
The animal is facing away from the camera but angled to its left, showing the rear-left quarter.
Back-Right Pose
Label: back-right
The animal is facing away from the camera but angled to its right, showing the rear-right quarter.
Back Pose
Label: back
The animal is facing directly away from the camera, with the rear and back visible.
Model Architecture
DINOv2 + MLP Head
The pose classifier uses a frozen DINOv2 backbone with a trainable MLP classification head:
Input Image (224×224)
↓
DINOv2 Vision Transformer (frozen)
- Small: 384-dim features
- Base: 768-dim features
- Large: 1024-dim features
↓
MLP Head (trainable)
- LayerNorm
- Linear(feat_dim -> 256) + GELU + Dropout(0.3)
- Linear(256 -> 128) + GELU + Dropout(0.3)
- Linear(128 -> 8)
↓
Output Logits (8 classes)
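To give a sense of how small the trainable part is, the following sketch counts the parameters of the head for each backbone size. It assumes PyTorch defaults (LayerNorm with a learnable scale and shift, Linear layers with bias); GELU and Dropout carry no parameters. The function names are illustrative.

```python
# Counts trainable parameters of the MLP head described above.
# Assumption: LayerNorm has 2 * feat_dim parameters and every Linear layer has a bias.
FEAT_DIMS = {"small": 384, "base": 768, "large": 1024}

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def head_params(feat_dim: int, num_classes: int = 8) -> int:
    """Total parameters: LayerNorm + three Linear layers."""
    layer_norm = 2 * feat_dim
    return (
        layer_norm
        + linear_params(feat_dim, 256)
        + linear_params(256, 128)
        + linear_params(128, num_classes)
    )

for name, dim in FEAT_DIMS.items():
    print(f"{name}: {head_params(dim):,} trainable parameters")
# small: 133,256 trainable parameters
# base: 232,328 trainable parameters
# large: 298,376 trainable parameters
```

Under these assumptions the head stays well below one million parameters for every variant, which is consistent with the short training times expected from a frozen backbone.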
Why DINOv2?
- Self-supervised learning on diverse images provides strong visual features
- Frozen backbone reduces training time and helps limit overfitting, since only the MLP head is trained
- Small memory footprint suitable for deployment
- Robust to varying image quality from aerial footage
Training Pipeline
Data Organization
Training data is organized in a folder structure with one folder per pose class:
pose_labels/
    _reference.png   # Visual guide
    front/           # Front-facing animals
    front-left/      # Front-left quarter
    left/            # Left profile
    back-left/       # Back-left quarter
    back/            # Back-facing animals
    back-right/      # Back-right quarter
    right/           # Right profile
    front-right/     # Front-right quarter
Alternatively, labels can be supplied via CSV files with the columns image_path and pose.
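A minimal sketch of reading such a CSV and rejecting unknown labels is shown below. Only the column names image_path and pose come from this guide; the function name, the validation step, and the example file names are assumptions for illustration.

```python
import csv
import io

VALID_POSES = {
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
}

def load_pose_csv(file_obj):
    """Read (image_path, pose) pairs and raise on labels outside the 8 classes."""
    samples = []
    # Data rows start on line 2 because line 1 is the header.
    for row_num, row in enumerate(csv.DictReader(file_obj), start=2):
        pose = row["pose"].strip()
        if pose not in VALID_POSES:
            raise ValueError(f"line {row_num}: unknown pose {pose!r}")
        samples.append((row["image_path"].strip(), pose))
    return samples

# In-memory example; the file names are made up.
example = io.StringIO("image_path,pose\ncrops/a.jpg,front\ncrops/b.jpg,back-left\n")
samples = load_pose_csv(example)
print(samples)
```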
Data Augmentation
Geometric Augmentation with Label Swapping:
- Horizontal flip applied with 50% probability
- When flipped, pose labels are swapped according to symmetry:
  - left <-> right
  - front-left <-> front-right
  - back-left <-> back-right
  - front and back remain unchanged
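The label swap can be sketched as a lookup table applied together with the image flip. The example below represents an image as a list of pixel rows so it runs without any imaging library; `FLIP_MAP`, `flip_rows`, and `random_hflip` are hypothetical names, and the real training script may implement this differently.

```python
import random

# Mirror-image partner of each pose label under a horizontal flip.
FLIP_MAP = {
    "left": "right", "right": "left",
    "front-left": "front-right", "front-right": "front-left",
    "back-left": "back-right", "back-right": "back-left",
    "front": "front", "back": "back",
}

def flip_rows(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

def random_hflip(image, label, p=0.5, rng=random):
    """Flip image and label together with probability p."""
    if rng.random() < p:
        return flip_rows(image), FLIP_MAP[label]
    return image, label

flipped, new_label = random_hflip([[1, 2, 3]], "front-left", p=1.0)
print(flipped, new_label)  # [[3, 2, 1]] front-right
```

Flipping the image without swapping the label would teach the model that a mirrored left profile is still "left", so the two operations have to be applied as a pair.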
Color/Transform Augmentation:
- Random crop (256px -> 224px)
- Color jitter: brightness (±30%), contrast (±30%), saturation (±20%)
- Random rotation (±15°)
Class Balancing:
- Weighted random sampler ensures equal representation of all 8 classes during training
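One common way to get equal class representation is to weight each sample by the inverse of its class frequency and draw with replacement. The sketch below uses only the standard library to show the idea; PyTorch's `WeightedRandomSampler` accepts per-sample weights of this kind, but whether the training script computes them exactly this way is an assumption.

```python
import random
from collections import Counter

def sample_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Toy, deliberately imbalanced label list.
labels = ["front"] * 6 + ["left"] * 3 + ["back"] * 1
weights = sample_weights(labels)

# Drawing with replacement under these weights yields roughly equal class counts.
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=3000)
print(Counter(drawn))
```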
Training Configuration
python train_pose_classifier.py \
--data_dir ./pose_labels \
--model_size small \
--epochs 30 \
--batch_size 32 \
--lr 1e-3
Key Parameters:
- Model size: small, base, or large (DINOv2 variant)
- Optimizer: AdamW with weight decay 0.01
- Loss: CrossEntropyLoss with label smoothing (0.1)
- Scheduler: CosineAnnealingLR
- Mixed precision: Automatic on GPU
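Two of these settings are easy to illustrate numerically. The sketch below writes out the standard cosine annealing formula and cross-entropy with label smoothing in plain Python. It assumes the schedule spans all 30 epochs with a minimum learning rate of 0, which this guide does not state; the function names are illustrative.

```python
import math

def cosine_lr(epoch, base_lr=1e-3, total_epochs=30, min_lr=0.0):
    """Cosine annealing: decays from base_lr to min_lr over total_epochs."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a target softened by uniform label smoothing."""
    k = len(logits)
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Target distribution: (1 - smoothing) on the true class plus smoothing / k everywhere.
    q = [smoothing / k + (1.0 - smoothing) * (i == target) for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

for epoch in (0, 15, 30):
    print(epoch, cosine_lr(epoch))
print(smoothed_cross_entropy([4.0, 0, 0, 0, 0, 0, 0, 0], target=0))
```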
Training Output:
- Best model saved to checkpoints/best_pose_model.pth
- Includes confusion matrix and per-class accuracy
- Optional ONNX export for deployment
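For readers who want to reproduce the confusion matrix and per-class accuracy from their own predictions, a minimal sketch follows. It interprets per-class accuracy as the fraction of each true class that was predicted correctly (recall), which is an assumption about how the training script reports it.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_accuracy(matrix, classes):
    """Diagonal count divided by row total; NaN for classes with no samples."""
    result = {}
    for i, name in enumerate(classes):
        total = sum(matrix[i])
        result[name] = matrix[i][i] / total if total else float("nan")
    return result

classes = ["front", "left", "back"]  # shortened list for the example
y_true = ["front", "front", "left", "back"]
y_pred = ["front", "left", "left", "back"]
cm = confusion_matrix(y_true, y_pred, classes)
print(cm)                               # [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
print(per_class_accuracy(cm, classes))  # {'front': 0.5, 'left': 1.0, 'back': 1.0}
```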
Usage in Navigation
Integration with Detection Pipeline
The pose classifier is used in the navigation system after animal detection:
from navigation.policy.pose_classifier import ViewPointClassifier
from PIL import Image

# Initialize classifier
classifier = ViewPointClassifier(
    weight_path="model_weights/best_june_24_2025_IA_classifier_016.pth",
    device="cpu",
    threshold=0.5,
)

# Process detected animal crops
# detection_crops: paths to the crop images produced by the detector (defined upstream)
crops = [Image.open(path) for path in detection_crops]
poses = classifier(crops)  # Returns list of pose strings

# Use poses for navigation decisions
for pose in poses:
    if "front" in pose:
        print("Animal is facing camera - approach with caution")
    elif "back" in pose:
        print("Animal is facing away - good for following")
Multi-Label Pose System (Alternative)
The ViewPointClassifier in navigation/policy/pose_classifier.py takes a different approach from the 8-class DINOv2 classifier described above:
- 5 multi-label classes: up, front, back, right, left
- EfficientNet-B4 backbone trained on zebra crops
- Input size: 512×512 pixels
- Output: Concatenated string (e.g., "upfrontright")
- Threshold: 0.5 (configurable)
This allows detecting compound poses like "animal is facing front-right while looking up."
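Downstream code that expects one of the 8 labels needs to take the concatenated string apart. The sketch below splits a string such as "upfrontright" into its tokens and, as a purely hypothetical convenience, maps it to an 8-class label by dropping "up". The helper names and the mapping rule are assumptions, not part of ViewPointClassifier.

```python
VIEW_TOKENS = ("up", "front", "back", "right", "left")

def split_viewpoint(text):
    """Split e.g. 'upfrontright' into ['up', 'front', 'right']."""
    tokens, rest = [], text
    while rest:
        for token in VIEW_TOKENS:
            if rest.startswith(token):
                tokens.append(token)
                rest = rest[len(token):]
                break
        else:
            raise ValueError(f"unrecognised viewpoint text: {rest!r}")
    return tokens

def to_eight_class(text):
    """Hypothetical mapping to an 8-class label; returns None if ambiguous."""
    tokens = set(split_viewpoint(text))
    facing = [t for t in ("front", "back") if t in tokens]
    side = [t for t in ("left", "right") if t in tokens]
    if len(facing) > 1 or len(side) > 1 or not (facing or side):
        return None
    return "-".join(facing + side)

print(split_viewpoint("upfrontright"))  # ['up', 'front', 'right']
print(to_eight_class("upfrontright"))   # front-right
```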
Performance Considerations
Inference Speed
- DINOv2-small: ~15-20ms per image (CPU)
- DINOv2-base: ~30-40ms per image (CPU)
- GPU acceleration: 5-10x faster
Accuracy Targets
- Overall accuracy: >85% on validation set
- Critical classes (front/back): >90% accuracy
- Confusion: Most errors occur between adjacent classes (e.g., front vs. front-left)
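Since most errors fall on neighbouring classes, it can help to report how far off a prediction is on the circle, not only whether it is exact. The sketch below measures error in 45° steps and computes an accuracy that tolerates a chosen number of steps. The metric and its names are suggestions, not something the training script is known to report.

```python
POSE_CLASSES = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]

def circular_distance(a, b):
    """Number of 45-degree steps between two labels (0 to 4)."""
    diff = abs(POSE_CLASSES.index(a) - POSE_CLASSES.index(b))
    return min(diff, len(POSE_CLASSES) - diff)

def accuracy_within(y_true, y_pred, max_steps=0):
    """Share of predictions at most max_steps away from the true label."""
    hits = sum(circular_distance(t, p) <= max_steps for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = ["front", "left"]
y_pred = ["front-right", "back"]
print(accuracy_within(y_true, y_pred))               # 0.0
print(accuracy_within(y_true, y_pred, max_steps=1))  # 0.5
```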
Deployment Notes
- Model checkpoint: ~150MB (small), ~350MB (base)
- ONNX export available for optimized inference
- Batch processing recommended for multiple detections
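The usage example in this guide passes a whole list of crops to the classifier in one call. When a frame contains many detections, the list can be split into fixed-size chunks first. The helper below is a generic sketch; the name `batched` and the batch size are illustrative.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

crop_ids = list(range(5))  # stand-ins for detection crops
print(list(batched(crop_ids, 2)))  # [[0, 1], [2, 3], [4]]
```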
Common Issues & Tips
Issue: Poor performance on occluded animals
Solution: Train with more occluded examples or use confidence thresholding
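Confidence thresholding can be sketched as a softmax over the 8 logits followed by a cut-off on the top probability. The 0.6 cut-off below is an arbitrary placeholder that would need tuning on validation data, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, classes, min_confidence=0.6):
    """Return (label, confidence); label is None when confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return None, probs[best]
    return classes[best], probs[best]

classes = ["front", "front-left", "left", "back-left",
           "back", "back-right", "right", "front-right"]
print(predict_with_threshold([5.0, 0, 0, 0, 0, 0, 0, 0], classes))
print(predict_with_threshold([0.0] * 8, classes))
```

Crops rejected this way (for example, heavily occluded animals) can then be skipped or handled by a fallback rule rather than being forced into one of the 8 classes.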
Issue: Confusion between adjacent poses
Solution: This is expected because orientation is continuous and the class boundaries are arbitrary cut points; consider using pose groups (front-facing vs. side-facing vs. back-facing)
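Pose groups can be implemented as a simple mapping applied to both labels and predictions. How the diagonal classes are assigned is a design choice; the sketch below assumes front-left and front-right count as front-facing and back-left and back-right as back-facing.

```python
# Assumed grouping: diagonal poses join the front or back group they lean toward.
POSE_GROUPS = {
    "front": "front-facing", "front-left": "front-facing", "front-right": "front-facing",
    "left": "side-facing", "right": "side-facing",
    "back": "back-facing", "back-left": "back-facing", "back-right": "back-facing",
}

def group_accuracy(y_true, y_pred):
    """Accuracy after collapsing the 8 classes into 3 groups."""
    hits = sum(POSE_GROUPS[t] == POSE_GROUPS[p] for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = ["front", "back-left", "left"]
y_pred = ["front-left", "back", "back-left"]
print(group_accuracy(y_true, y_pred))
```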
Issue: Inconsistent predictions across frames
Solution: Apply temporal smoothing or majority voting across consecutive frames
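Majority voting over a sliding window of recent frames might look like the following. The window length of 5 and the tie-break rule (prefer the most recent of the tied labels) are assumptions chosen for illustration.

```python
from collections import Counter, deque

class PoseSmoother:
    """Keeps the last `window` predictions and returns the most common one."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, pose):
        self.history.append(pose)
        counts = Counter(self.history)
        top = max(counts.values())
        # Tie-break: among the most frequent labels, prefer the most recent.
        for label in reversed(self.history):
            if counts[label] == top:
                return label

smoother = PoseSmoother(window=3)
frames = ["front", "front", "front-left", "front-left"]
smoothed = [smoother.update(p) for p in frames]
print(smoothed)  # ['front', 'front', 'front', 'front-left']
```

A longer window suppresses more flicker but reacts more slowly when the animal actually turns.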
Issue: Different performance on zebras vs. other species
Solution: Retrain with balanced dataset across species, or train species-specific models
Dataset Statistics
Current training data distribution (from folder structure):
- Folders: front, front-left, front-right, left, right, back-left, back-right, back
- Images per class: Variable (check with train_pose_classifier.py --data_dir pose_labels)
- Species: Primarily zebras and giraffes
- Source: Aerial drone footage from Mpala and OPC sessions
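Class counts can also be checked directly from the folder layout without running the training script. The sketch below counts image files in each class folder; the list of accepted file extensions is an assumption.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image formats

def count_images(data_dir):
    """Return {class_folder_name: number_of_image_files} for data_dir."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        # Top-level files such as _reference.png are skipped.
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts

if Path("pose_labels").is_dir():
    for name, n in count_images("pose_labels").items():
        print(f"{name}: {n}")
```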
References
- DINOv2 Paper: https://arxiv.org/abs/2304.07193
- VARe-ID (ViewPoint Classifier): https://github.com/ziesski/VARe-ID
- Individual identification of wildlife (Review on methods used for wildlife species and individual identification): https://doi.org/10.1007/s10344-021-01549-4
- Training script: train_pose_classifier.py
- Navigation integration: navigation/policy/pose_classifier.py
Quick Start
- Prepare data: Organize images in pose_labels/ folders by class
- Train model: python train_pose_classifier.py --data_dir ./pose_labels --epochs 30
- Evaluate: Check confusion matrix and per-class accuracy in output
- Export: Use the --export_onnx flag for optimized deployment
- Integrate: Load checkpoint and use for inference on detection crops








