
Pose Classifier Guide

Overview

The pose classifier predicts the orientation of animals (zebras, giraffes, etc.) relative to the camera in aerial drone footage. The predicted pose feeds into navigation decisions and behavior analysis.

8-Class Pose Classification System

Pose Classes

The classifier identifies 8 discrete pose orientations arranged in a circle around the animal:

  1. front - Animal facing directly toward the camera
  2. front-left - Animal facing the camera, angled to its left (~45°)
  3. left - Animal's left side visible, perpendicular to the camera
  4. back-left - Animal facing away, angled to its left (~45°)
  5. back - Animal facing directly away from the camera
  6. back-right - Animal facing away, angled to its right (~45°)
  7. right - Animal's right side visible, perpendicular to the camera
  8. front-right - Animal facing the camera, angled to its right (~45°)
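
The circular ordering above can be written down as a small lookup table. The sketch below assumes that class indices follow the listed order and that adjacent classes sit 45° apart, starting from `front`. The names `POSE_CLASSES`, `nominal_angle`, and `circular_distance` are illustrative and do not come from the training code.

```python
# Pose classes in the order listed above (the index order is an assumption).
POSE_CLASSES = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]


def nominal_angle(pose: str) -> int:
    """Return a nominal orientation in degrees, with front = 0 and 45-degree steps."""
    return POSE_CLASSES.index(pose) * 45


def circular_distance(a: str, b: str) -> int:
    """Return the number of 45-degree steps between two poses, going the short way round."""
    diff = abs(POSE_CLASSES.index(a) - POSE_CLASSES.index(b))
    return min(diff, len(POSE_CLASSES) - diff)
```

A distance of 1 marks two classes as neighbors on the circle, which is useful when judging how serious a misclassification is.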

Visual Reference

Pose Reference Diagram

The diagram shows the 8 pose classes arranged in a circle. The camera is positioned at the bottom, and the animal (zebra) is in the center. Each orange dot represents one of the 8 possible pose classifications.

Example Poses

Front Pose

Label: front

The animal is facing directly toward the camera, with the head and front body visible.

Front Pose Example


Front-Left Pose

Label: front-left

The animal is facing toward the camera but angled to its left (camera's right), showing both the front and left side.

Front-Left Pose Example


Front-Right Pose

Label: front-right

The animal is facing toward the camera but angled to its right (camera's left), showing both the front and right side.

Front-Right Pose Example


Left Pose

Label: left

The animal's left side is visible, perpendicular to the camera. This is a pure profile view.

Left Pose Example


Right Pose

Label: right

The animal's right side is visible, perpendicular to the camera. This is a pure profile view from the opposite side.

Right Pose Example


Back-Left Pose

Label: back-left

The animal is facing away from the camera but angled to its left, showing the rear-left quarter.

Back-Left Pose Example


Back-Right Pose

Label: back-right

The animal is facing away from the camera but angled to its right, showing the rear-right quarter.

Back-Right Pose Example


Back Pose

Label: back

The animal is facing directly away from the camera, with the rear and back visible.

Back Pose Example


Model Architecture

DINOv2 + MLP Head

The pose classifier uses a frozen DINOv2 backbone with a trainable MLP classification head:

Input Image (224×224)
    ↓
DINOv2 Vision Transformer (frozen)
    - Small: 384-dim features
    - Base: 768-dim features
    - Large: 1024-dim features
    ↓
MLP Head (trainable)
    - LayerNorm
    - Linear(feat_dim -> 256) + GELU + Dropout(0.3)
    - Linear(256 -> 128) + GELU + Dropout(0.3)
    - Linear(128 -> 8)
    ↓
Output Logits (8 classes)
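
To give a sense of how small the trainable part is, the layer sizes above can be turned into a parameter count. This is a rough sketch that assumes the LayerNorm has a learnable weight and bias and that every Linear layer has a bias term; the function name `head_parameter_count` is hypothetical.

```python
def head_parameter_count(feat_dim: int, hidden=(256, 128), num_classes: int = 8) -> int:
    """Count trainable parameters of the MLP head for a given backbone feature size."""
    total = 2 * feat_dim  # LayerNorm: one weight and one bias per feature (assumed)
    in_dim = feat_dim
    for out_dim in (*hidden, num_classes):
        total += in_dim * out_dim + out_dim  # Linear: weight matrix plus bias (assumed)
        in_dim = out_dim
    return total


for name, dim in [("small", 384), ("base", 768), ("large", 1024)]:
    print(name, head_parameter_count(dim))
```

Under these assumptions the head has 133,256 parameters for the small backbone, 232,328 for base, and 298,376 for large.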

Why DINOv2?

  • Self-supervised learning on diverse images provides strong visual features
  • Frozen backbone reduces training time and prevents overfitting
  • Small memory footprint suitable for deployment
  • Robust to varying image quality from aerial footage

Training Pipeline

Data Organization

Training data is organized in a folder structure with one folder per class:

pose_labels/
  _reference.png          # Visual guide
  front/                  # Front-facing animals
  front-left/             # Front-left quarter
  left/                   # Left profile
  back-left/              # Back-left quarter
  back/                   # Back-facing animals
  back-right/             # Back-right quarter
  right/                  # Right profile
  front-right/            # Front-right quarter

Alternatively, data can be supplied via CSV files with the columns image_path and pose.
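
The folder layout can be read into (image path, label) pairs with a few lines of standard-library Python. This is a minimal sketch, not the loader in train_pose_classifier.py; the helper names and the accepted file extensions are assumptions.

```python
from collections import Counter
from pathlib import Path

POSE_CLASSES = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def list_labeled_images(data_dir):
    """Return (path, pose) pairs for every image found in the class folders."""
    samples = []
    for pose in POSE_CLASSES:
        class_dir = Path(data_dir) / pose
        if not class_dir.is_dir():
            continue  # tolerate missing class folders
        for path in sorted(class_dir.iterdir()):
            if path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((str(path), pose))
    return samples


def count_per_class(samples):
    """Return a dict mapping each pose label to its number of images."""
    return dict(Counter(pose for _, pose in samples))
```

Because only the class folders are scanned, top-level files such as _reference.png are skipped.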

Data Augmentation

Geometric Augmentation with Label Swapping:

  • Horizontal flip applied with 50% probability
  • When flipped, pose labels are swapped according to symmetry:
    • left <-> right
    • front-left <-> front-right
    • back-left <-> back-right
    • front and back remain unchanged
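
The label swap can be sketched as a lookup table applied together with the flip. The example uses a nested list of pixel rows as a stand-in for an image so that it runs without any imaging library; `FLIP_SWAP` and `maybe_flip` are illustrative names, not the ones used in the training script.

```python
import random

# Label mapping under a horizontal flip, as listed above.
FLIP_SWAP = {
    "left": "right", "right": "left",
    "front-left": "front-right", "front-right": "front-left",
    "back-left": "back-right", "back-right": "back-left",
    "front": "front", "back": "back",
}


def maybe_flip(image_rows, pose, p=0.5, rng=random):
    """Mirror the image with probability p and swap the pose label to match."""
    if rng.random() < p:
        flipped = [row[::-1] for row in image_rows]  # reverse each pixel row
        return flipped, FLIP_SWAP[pose]
    return image_rows, pose
```

Keeping the flip and the label swap in one function prevents a mirrored image from ever being paired with its original label.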

Color/Transform Augmentation:

  • Random crop (256px -> 224px)
  • Color jitter: brightness (±30%), contrast (±30%), saturation (±20%)
  • Random rotation (±15°)

Class Balancing:

  • Weighted random sampler ensures equal representation of all 8 classes during training
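
One common way to drive a weighted sampler is to weight each sample by the inverse of its class frequency. The sketch below shows only that weight computation and assumes this is the weighting scheme in use; the function name is hypothetical.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Give each sample a weight of 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = ["front", "front", "front", "back"]
weights = balanced_sample_weights(labels)
```

With these weights every class carries the same total weight, so each class is drawn equally often in expectation. In PyTorch the list would typically be passed to `torch.utils.data.WeightedRandomSampler` with `replacement=True`.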

Training Configuration

python train_pose_classifier.py \
    --data_dir ./pose_labels \
    --model_size small \
    --epochs 30 \
    --batch_size 32 \
    --lr 1e-3

Key Parameters:

  • Model size: small, base, or large (DINOv2 variant)
  • Optimizer: AdamW with weight decay 0.01
  • Loss: CrossEntropyLoss with label smoothing (0.1)
  • Scheduler: CosineAnnealingLR
  • Mixed precision: Automatic on GPU
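
For intuition about the scheduler, the closed-form cosine annealing curve can be evaluated directly. This sketch assumes the schedule spans all 30 epochs and decays to a minimum learning rate of 0; neither setting is confirmed by the configuration above.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-3, total_epochs=30, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine annealing cycle."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


for epoch in (0, 15, 30):
    print(epoch, cosine_annealing_lr(epoch))
```

Under these assumptions the learning rate starts at 1e-3, reaches half that value at epoch 15, and approaches 0 at epoch 30.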

Training Output:

  • Best model saved to checkpoints/best_pose_model.pth
  • Includes confusion matrix and per-class accuracy
  • Optional ONNX export for deployment

Usage in Navigation

Integration with Detection Pipeline

In the navigation system, pose classification runs after animal detection. The example below uses the ViewPointClassifier from pose_classifier.py:

from PIL import Image

from navigation.policy.pose_classifier import ViewPointClassifier

# Initialize the classifier (CPU inference, decision threshold of 0.5)
classifier = ViewPointClassifier(
    weight_path="model_weights/best_june_24_2025_IA_classifier_016.pth",
    device="cpu",
    threshold=0.5,
)

# detection_crops is expected to hold the file paths of the detected animal crops
crops = [Image.open(path) for path in detection_crops]
poses = classifier(crops)  # Returns a list of pose strings, one per crop

# Use poses for navigation decisions
for pose in poses:
    if "front" in pose:
        print("Animal is facing camera - approach with caution")
    elif "back" in pose:
        print("Animal is facing away - good for following")

Multi-Label Pose System (Alternative)

The ViewPointClassifier in pose_classifier.py uses a different approach from the 8-class DINOv2 classifier described above:

  • 5 multi-label classes: up, front, back, right, left
  • EfficientNet-B4 backbone trained on zebra crops
  • Input size: 512×512 pixels
  • Output: Concatenated string (e.g., "upfrontright")
  • Threshold: 0.5 (configurable)

This allows detecting compound poses like "animal is facing front-right while looking up."
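
A concatenated output string can be split back into its component labels with a substring check, since none of the five label names can appear by accident across a boundary between two others. This is a sketch based only on the label list above; `VIEW_LABELS` and `split_view_string` are hypothetical names, and the actual output format of ViewPointClassifier may differ.

```python
VIEW_LABELS = ("up", "front", "back", "right", "left")


def split_view_string(view: str):
    """Return the labels contained in a concatenated pose string, in VIEW_LABELS order."""
    return [label for label in VIEW_LABELS if label in view]


print(split_view_string("upfrontright"))
```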

Performance Considerations

Inference Speed

  • DINOv2-small: ~15-20ms per image (CPU)
  • DINOv2-base: ~30-40ms per image (CPU)
  • GPU acceleration: 5-10x faster

Accuracy Targets

  • Overall accuracy: >85% on validation set
  • Critical classes (front/back): >90% accuracy
  • Confusion: Most errors occur between adjacent classes (e.g., front vs. front-left)
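
These targets can be checked from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes, with classes indexed in circular order so that neighboring indices are adjacent poses. The function name and the layout of the returned dictionary are illustrative.

```python
def summarize_confusion(matrix):
    """Compute overall accuracy, per-class accuracy, and the share of adjacent-class errors."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    per_class = [
        matrix[i][i] / sum(matrix[i]) if sum(matrix[i]) else 0.0
        for i in range(n)
    ]
    # Errors that land on one of the two neighboring classes on the circle.
    adjacent = sum(
        matrix[i][(i + 1) % n] + matrix[i][(i - 1) % n] for i in range(n)
    )
    errors = total - correct
    return {
        "overall": correct / total,
        "per_class": per_class,
        "adjacent_error_share": adjacent / errors if errors else 0.0,
    }
```

An adjacent-error share close to 1 indicates that most mistakes are near misses between neighboring orientations.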

Deployment Notes

  • Model checkpoint: ~150MB (small), ~350MB (base)
  • ONNX export available for optimized inference
  • Batch processing recommended for multiple detections

Common Issues & Tips

Issue: Poor performance on occluded animals

Solution: Train with more occluded examples or use confidence thresholding
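
Confidence thresholding can be sketched as a softmax over the 8 output logits followed by a cutoff. The 0.5 cutoff, the class order, and the function names below are assumptions for illustration.

```python
import math

POSE_CLASSES = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, classes=POSE_CLASSES, min_confidence=0.5):
    """Return (pose, confidence), with pose set to None when confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return None, probs[best]
    return classes[best], probs[best]
```

Crops that return None can be skipped or deferred to a later frame rather than acted on.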

Issue: Confusion between adjacent poses

Solution: This is expected due to the continuous nature of orientations; consider using pose groups (front-facing vs. side-facing vs. back-facing)
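
Pose groups can be implemented by summing class probabilities within each group. The assignment of the diagonal classes (for example, front-left counted as front-facing) is an assumption, as are the names `POSE_GROUPS` and `group_probabilities`.

```python
# Assumed grouping: diagonal poses are folded into front-facing or back-facing.
POSE_GROUPS = {
    "front": "front-facing",
    "front-left": "front-facing",
    "front-right": "front-facing",
    "left": "side-facing",
    "right": "side-facing",
    "back": "back-facing",
    "back-left": "back-facing",
    "back-right": "back-facing",
}


def group_probabilities(class_probs):
    """Sum per-class probabilities into per-group probabilities."""
    totals = {}
    for pose, prob in class_probs.items():
        group = POSE_GROUPS[pose]
        totals[group] = totals.get(group, 0.0) + prob
    return totals
```

Summing probabilities keeps a prediction split between front and front-left from looking uncertain at the group level.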

Issue: Inconsistent predictions across frames

Solution: Apply temporal smoothing or majority voting across consecutive frames
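
Majority voting over a sliding window is one simple form of temporal smoothing. The sketch below assumes one prediction per frame for a single tracked animal; the class name, the window size, and the tie-breaking rule (prefer the most recent tied label) are illustrative choices.

```python
from collections import Counter, deque


class MajorityVoteSmoother:
    """Smooth per-frame pose predictions by majority vote over recent frames."""

    def __init__(self, window: int = 5):
        self.history = deque(maxlen=window)  # oldest entries drop out automatically

    def update(self, pose: str) -> str:
        """Add the newest prediction and return the current majority label."""
        self.history.append(pose)
        counts = Counter(self.history)
        top = max(counts.values())
        # On a tie, return the most recently seen label among the tied ones.
        for candidate in reversed(self.history):
            if counts[candidate] == top:
                return candidate


smoother = MajorityVoteSmoother(window=3)
smoothed = [smoother.update(p) for p in
            ["front", "front", "left", "front", "left", "left"]]
```

A larger window suppresses more flicker but responds more slowly when the animal actually turns.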

Issue: Different performance on zebras vs. other species

Solution: Retrain with a dataset balanced across species, or train species-specific models

Dataset Statistics

Current training data distribution (from folder structure):

  • Folders: front, front-left, front-right, left, right, back-left, back-right, back
  • Images per class: Variable (check with train_pose_classifier.py --data_dir pose_labels)
  • Species: Primarily zebras and giraffes
  • Source: Aerial drone footage from Mpala and OPC sessions

Quick Start

  1. Prepare data: Organize images in pose_labels/ folders by class
  2. Train model: python train_pose_classifier.py --data_dir ./pose_labels --epochs 30
  3. Evaluate: Check confusion matrix and per-class accuracy in output
  4. Export: Use --export_onnx flag for optimized deployment
  5. Integrate: Load checkpoint and use for inference on detection crops