# Pose Classifier Guide

## Overview

The pose classifier predicts the orientation of animals (zebras, giraffes, etc.) relative to the camera position from aerial drone footage. This is critical for navigation and behavior analysis.

## 8-Class Pose Classification System

### Pose Classes

The classifier identifies **8 discrete pose orientations** arranged in a circle around the animal:

1. **front** - Animal facing directly toward camera
2. **front-left** - Animal facing camera, angled to its left (~45°)
3. **left** - Animal's left side visible, perpendicular to camera
4. **back-left** - Animal facing away, angled to its left (~45°)
5. **back** - Animal facing directly away from camera
6. **back-right** - Animal facing away, angled to its right (~45°)
7. **right** - Animal's right side visible, perpendicular to camera
8. **front-right** - Animal facing camera, angled to its right (~45°)

### Visual Reference

![Pose Reference Diagram](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/_reference.png)

The diagram shows the 8 pose classes arranged in a circle. The camera is positioned at the bottom, and the animal (zebra) is in the center. Each orange dot represents one of the 8 possible pose classifications.

## Example Poses

### Front Pose

**Label:** `front`

The animal is facing directly toward the camera, with the head and front body visible.

![Front Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c0_004.jpg)

---

### Front-Left Pose

**Label:** `front-left`

The animal is facing toward the camera but angled to its left (camera's right), showing both the front and left side.

![Front-Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front-left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000471_c0_005.jpg)

---

### Front-Right Pose

**Label:** `front-right`

The animal is facing toward the camera but angled to its right (camera's left), showing both the front and right side.

![Front-Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/front-right/mpala_session_2_DJI_0006_partition_2_DJI_0006_006552_c0_004.jpg)

---

### Left Pose

**Label:** `left`

The animal's left side is visible, perpendicular to the camera. This is a pure profile view.

![Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000321_c1_001.jpg)

---

### Right Pose

**Label:** `right`

The animal's right side is visible, perpendicular to the camera. This is a pure profile view from the opposite side.

![Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/right/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c0_002.jpg)

---

### Back-Left Pose

**Label:** `back-left`

The animal is facing away from the camera but angled to its left, showing the rear-left quarter.

![Back-Left Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back-left/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c1_001.jpg)

---

### Back-Right Pose

**Label:** `back-right`

The animal is facing away from the camera but angled to its right, showing the rear-right quarter.

![Back-Right Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back-right/mpala_session_1_DJI_0002_partition_1_DJI_0002_000171_c1_000.jpg)

---

### Back Pose

**Label:** `back`

The animal is facing directly away from the camera, with the rear and back visible.

![Back Pose Example](https://huggingface.co/imageomics/mmla-dino-pose/resolve/main/back/mpala_session_1_DJI_0002_partition_1_DJI_0002_000321_c1_000.jpg)

---

## Model Architecture

### DINOv2 + MLP Head

The pose classifier uses a **frozen DINOv2 backbone** with a **trainable MLP classification head**:

```
Input Image (224×224)
    ↓
DINOv2 Vision Transformer (frozen)
    - Small: 384-dim features
    - Base: 768-dim features
    - Large: 1024-dim features
    ↓
MLP Head (trainable)
    - LayerNorm
    - Linear(feat_dim -> 256) + GELU + Dropout(0.3)
    - Linear(256 -> 128) + GELU + Dropout(0.3)
    - Linear(128 -> 8)
    ↓
Output Logits (8 classes)
```

### Why DINOv2?

- **Self-supervised learning** on diverse images provides strong visual features
- **Frozen backbone** reduces training time and prevents overfitting
- **Small memory footprint** suitable for deployment
- **Robust to varying image quality** from aerial footage

## Training Pipeline

### Data Organization

Training data is organized in a folder structure, with one folder per class:

```
pose_labels/
    _reference.png     # Visual guide
    front/             # Front-facing animals
    front-left/        # Front-left quarter
    left/              # Left profile
    back-left/         # Back-left quarter
    back/              # Back-facing animals
    back-right/        # Back-right quarter
    right/             # Right profile
    front-right/       # Front-right quarter
```

Alternatively, labels can be supplied via CSV files with columns: `image_path, pose`

### Data Augmentation

**Geometric Augmentation with Label Swapping:**

- Horizontal flip applied with 50% probability
- When flipped, pose labels are swapped according to symmetry:
  - `left` <-> `right`
  - `front-left` <-> `front-right`
  - `back-left` <-> `back-right`
  - `front` and `back` remain unchanged

**Color/Transform Augmentation:**

- Random crop (256px -> 224px)
- Color jitter: brightness (±30%), contrast (±30%), saturation (±20%)
- Random rotation (±15°)

**Class Balancing:**

- Weighted random sampler ensures equal representation of all 8 classes during training

### Training Configuration

```bash
python train_pose_classifier.py \
    --data_dir ./pose_labels \
    --model_size small \
    --epochs 30 \
    --batch_size 32 \
    --lr 1e-3
```

**Key Parameters:**

- **Model size**: `small`, `base`, or `large` (DINOv2 variant)
- **Optimizer**: AdamW with weight decay 0.01
- **Loss**: CrossEntropyLoss with label smoothing (0.1)
- **Scheduler**: CosineAnnealingLR
- **Mixed precision**: Enabled automatically on GPU

**Training Output:**

- Best model saved to `checkpoints/best_pose_model.pth`
- Includes confusion matrix and per-class accuracy
- Optional ONNX export for deployment

## Usage in Navigation

### Integration with Detection Pipeline

The pose classifier runs in the navigation system after animal detection. The example below loads `ViewPointClassifier`, the multi-label model described in the next subsection, rather than the 8-class DINOv2 head:

```python
from navigation.policy.pose_classifier import ViewPointClassifier
from PIL import Image

# Initialize classifier
classifier = ViewPointClassifier(
    weight_path="model_weights/best_june_24_2025_IA_classifier_016.pth",
    device="cpu",
    threshold=0.5
)

# Process detected animal crops.
# detection_crops is assumed to be a list of crop image paths
# produced by the upstream detection stage.
crops = [Image.open(path) for path in detection_crops]
poses = classifier(crops)  # Returns list of pose strings

# Use poses for navigation decisions
for pose in poses:
    if "front" in pose:
        print("Animal is facing camera - approach with caution")
    elif "back" in pose:
        print("Animal is facing away - good for following")
```

### Multi-Label Pose System (Alternative)

The `ViewPointClassifier` in `pose_classifier.py` uses a different approach from the 8-class DINOv2 model:

- **5 multi-label classes**: `up, front, back, right, left`
- **EfficientNet-B4** backbone trained on zebra crops
- **Input size**: 512×512 pixels
- **Output**: Concatenated string (e.g., `"upfrontright"`)
- **Threshold**: 0.5 (configurable)

This allows detecting compound poses like "animal is facing front-right while looking up."

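To compare the multi-label output with the 8-class labels, the concatenated string can be split back into its components. The following is a minimal sketch, assuming the labels are joined without a separator and drawn only from the five classes listed above. The helper names `split_pose_string` and `to_eight_class` are hypothetical and are not part of `pose_classifier.py`.

```python
from typing import List, Optional

# The five multi-label classes listed above.
MULTI_LABELS = ("up", "front", "back", "right", "left")


def split_pose_string(pose: str) -> List[str]:
    """Split a separator-free pose string into its component labels.

    Assumption: the string is a plain concatenation of known labels.
    """
    found = []
    rest = pose
    while rest:
        for label in MULTI_LABELS:
            if rest.startswith(label):
                found.append(label)
                rest = rest[len(label):]
                break
        else:
            # No known label matches the remaining text.
            raise ValueError(f"unrecognized pose fragment: {rest!r}")
    return found


def to_eight_class(pose: str) -> Optional[str]:
    """Map a multi-label pose string to an 8-class name, or None.

    'up' is ignored; contradictory labels (e.g. front and back) give None.
    """
    labels = set(split_pose_string(pose))
    facing = labels & {"front", "back"}
    side = labels & {"left", "right"}
    if len(facing) > 1 or len(side) > 1:
        return None
    parts = sorted(facing) + sorted(side)  # facing direction first, then side
    return "-".join(parts) if parts else None


print(split_pose_string("upfrontright"))
print(to_eight_class("upfrontright"))
```

Here `up` is dropped because it does not change the horizontal orientation; whether that is appropriate depends on how the navigation policy uses it.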
## Performance Considerations

### Inference Speed

- **DINOv2-small**: ~15-20ms per image (CPU)
- **DINOv2-base**: ~30-40ms per image (CPU)
- **GPU acceleration**: 5-10x faster

### Accuracy Targets

- **Overall accuracy**: >85% on validation set
- **Critical classes** (front/back): >90% accuracy
- **Confusion**: Most errors occur between adjacent classes (e.g., front vs. front-left)

### Deployment Notes

- Model checkpoint: ~150MB (small), ~350MB (base)
- ONNX export available for optimized inference
- Batch processing recommended for multiple detections

## Common Issues & Tips

### Issue: Poor performance on occluded animals

**Solution**: Train with more occluded examples or use confidence thresholding

### Issue: Confusion between adjacent poses

**Solution**: This is expected due to the continuous nature of orientations; consider using pose groups (front-facing vs. side-facing vs. back-facing)

### Issue: Inconsistent predictions across frames

**Solution**: Apply temporal smoothing or majority voting across consecutive frames

### Issue: Different performance on zebras vs. other species

**Solution**: Retrain with a balanced dataset across species, or train species-specific models

## Dataset Statistics

Current training data distribution (from folder structure):

- Folders: `front`, `front-left`, `front-right`, `left`, `right`, `back-left`, `back-right`, `back`
- Images per class: Variable (check with `train_pose_classifier.py --data_dir pose_labels`)
- Species: Primarily zebras and giraffes
- Source: Aerial drone footage from Mpala and OPC sessions

## References

- DINOv2 Paper: [https://arxiv.org/abs/2304.07193](https://arxiv.org/abs/2304.07193)
- VARe-ID (ViewPoint Classifier): [https://github.com/ziesski/VARe-ID](https://github.com/ziesski/VARe-ID)
- Individual identification of wildlife: [Review on methods used for wildlife species and individual identification](https://doi.org/10.1007/s10344-021-01549-4)
- Training script: [train_pose_classifier.py](train_pose_classifier.py)
- Navigation integration: [navigation/policy/pose_classifier.py](../navigation/policy/pose_classifier.py)

## Quick Start

1. **Prepare data**: Organize images in `pose_labels/` folders by class
2. **Train model**: `python train_pose_classifier.py --data_dir ./pose_labels --epochs 30`
3. **Evaluate**: Check confusion matrix and per-class accuracy in output
4. **Export**: Use `--export_onnx` flag for optimized deployment
5. **Integrate**: Load checkpoint and use for inference on detection crops
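
Once the classifier is integrated, per-frame predictions for a tracked animal may flicker between adjacent classes. The majority voting suggested under Common Issues could be sketched as follows. The function name `smooth_poses`, the default window of 5 frames, and the tie-breaking rule (prefer the most recent label) are illustrative assumptions and not part of the training script or navigation code.

```python
from collections import Counter, deque
from typing import Deque, List


def smooth_poses(poses: List[str], window: int = 5) -> List[str]:
    """Replace each per-frame pose with the majority label over the
    last `window` frames (including the current one)."""
    recent: Deque[str] = deque(maxlen=window)
    smoothed = []
    for pose in poses:
        recent.append(pose)
        counts = Counter(recent)
        top = max(counts.values())
        # Tie-break: among labels sharing the top count, take the most recent.
        for candidate in reversed(recent):
            if counts[candidate] == top:
                smoothed.append(candidate)
                break
    return smoothed


# A single outlier frame ("left") is voted out by its neighbours.
frames = ["front", "front", "left", "front", "front"]
print(smooth_poses(frames, window=3))
```

A larger window suppresses more flicker but reacts more slowly when the animal actually turns, so the window size is worth tuning against the footage frame rate.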