Pose Classifier Guide

Overview

The pose classifier predicts the orientation of animals (zebras, giraffes, etc.) relative to the camera in aerial drone footage. The predicted orientation feeds navigation decisions and behavior analysis.

8-Class Pose Classification System

Pose Classes

The classifier identifies 8 discrete pose orientations arranged in a circle around the animal:

front - Animal facing directly toward camera
front-left - Animal facing camera, angled to the left (~45°)
left - Animal's left side visible, perpendicular to camera
back-left - Animal facing away, angled to the left (~45°)
back - Animal facing directly away from camera
back-right - Animal facing away, angled to the right (~45°)
right - Animal's right side visible, perpendicular to camera
front-right - Animal facing camera, angled to the right (~45°)

Visual Reference

The diagram shows the 8 pose classes arranged in a circle. The camera is positioned at the bottom, and the animal (zebra) is in the center. Each orange dot represents one of the 8 possible pose classifications.

Example Poses

Front Pose

Label: front

The animal is facing directly toward the camera, with the head and front body visible.

Front-Left Pose

Label: front-left

The animal is facing toward the camera with its head angled toward the camera's left, showing both the front and left side.

Front-Right Pose

Label: front-right

The animal is facing toward the camera with its head angled toward the camera's right, showing both the front and right side.

Left Pose

Label: left

The animal's left side is visible, perpendicular to the camera. This is a pure profile view.

Right Pose

Label: right

The animal's right side is visible, perpendicular to the camera. This is a pure profile view from the opposite side.

Back-Left Pose

Label: back-left

The animal is facing away from the camera but angled to its left, showing the rear-left quarter.

Back-Right Pose

Label: back-right

The animal is facing away from the camera but angled to its right, showing the rear-right quarter.

Back Pose

Label: back

The animal is facing directly away from the camera, with the rear and back visible.

Model Architecture

DINOv2 + MLP Head

The pose classifier uses a frozen DINOv2 backbone with a trainable MLP classification head:

Input Image (224×224)
    ↓
DINOv2 Vision Transformer (frozen)
    - Small: 384-dim features
    - Base: 768-dim features
    - Large: 1024-dim features
    ↓
MLP Head (trainable)
    - LayerNorm
    - Linear(feat_dim -> 256) + GELU + Dropout(0.3)
    - Linear(256 -> 128) + GELU + Dropout(0.3)
    - Linear(128 -> 8)
    ↓
Output Logits (8 classes)
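
The head in the diagram above could be written in PyTorch roughly as follows. This is a minimal sketch, not the repository's code: the class name PoseHead and its argument names are assumptions, and the frozen DINOv2 backbone is represented only by a random tensor of the right feature width.

```python
import torch
from torch import nn


class PoseHead(nn.Module):
    """Hypothetical MLP head following the layer list in the diagram."""

    def __init__(self, feat_dim: int = 384, num_classes: int = 8, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, 256),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(256, 128),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(128, num_classes),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feat_dim) embeddings from the frozen backbone
        return self.net(features)


head = PoseHead(feat_dim=384)  # 384 matches the DINOv2-small feature width
head.eval()                    # disable dropout for inference
with torch.no_grad():
    logits = head(torch.randn(4, 384))  # random stand-in for backbone features
n_params = sum(p.numel() for p in head.parameters())
```

With a 384-dim input the head has on the order of 10^5 trainable parameters, which is why training only the head is cheap compared with fine-tuning the backbone.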

Why DINOv2?

Self-supervised learning on diverse images provides strong visual features
Frozen backbone reduces training time and limits overfitting on a small labeled dataset
Small memory footprint suitable for deployment
Robust to varying image quality from aerial footage

Training Pipeline

Data Organization

Training data is organized in a folder structure with one folder per pose class:

pose_labels/
  _reference.png          # Visual guide
  front/                  # Front-facing animals
  front-left/             # Front-left quarter
  left/                   # Left profile
  back-left/              # Back-left quarter
  back/                   # Back-facing animals
  back-right/             # Back-right quarter
  right/                  # Right profile
  front-right/            # Front-right quarter

Alternatively, data can be supplied as CSV files with the columns image_path and pose.
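
To illustrate how the two layouts relate, the sketch below walks a pose_labels-style folder tree and writes the equivalent CSV. The function names and the set of accepted file extensions are assumptions made for this example; the training script may load data differently.

```python
import csv
from pathlib import Path

POSE_CLASSES = ["front", "front-left", "left", "back-left",
                "back", "back-right", "right", "front-right"]
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def list_samples(data_dir):
    """Return (image_path, pose) pairs found under the class folders."""
    samples = []
    for pose in POSE_CLASSES:
        class_dir = Path(data_dir) / pose
        if not class_dir.is_dir():
            continue  # tolerate classes that have no folder yet
        for path in sorted(class_dir.iterdir()):
            if path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((str(path), pose))
    return samples


def write_csv(samples, csv_path):
    """Write the pairs as a CSV with the columns image_path, pose."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_path", "pose"])
        writer.writerows(samples)
```

Only the eight class folders are scanned, so top-level files such as _reference.png are ignored.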

Data Augmentation

Geometric Augmentation with Label Swapping:

Horizontal flip applied with 50% probability
When flipped, pose labels are swapped according to symmetry:
- left <-> right
- front-left <-> front-right
- back-left <-> back-right
- front and back remain unchanged
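
The label swap can be sketched as a lookup table applied together with the flip. In this minimal example the image is a plain list of pixel rows and hflip_rows is a stand-in for a real flip operation; both, along with the function names, are assumptions for illustration.

```python
import random

# Mirror-image partner of each pose under a horizontal flip
FLIP_SWAP = {
    "left": "right", "right": "left",
    "front-left": "front-right", "front-right": "front-left",
    "back-left": "back-right", "back-right": "back-left",
    "front": "front", "back": "back",
}


def hflip_rows(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def random_hflip(image, pose, p=0.5, rng=random, flip_fn=hflip_rows):
    """Flip with probability p and swap the label so it stays correct."""
    if rng.random() < p:
        return flip_fn(image), FLIP_SWAP[pose]
    return image, pose
```

Because the flip and the label swap happen in the same branch, an image can never be mirrored while keeping its old label, which would silently inject wrong left/right labels into training.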

Color/Transform Augmentation:

Random crop (256px -> 224px)
Color jitter: brightness (±30%), contrast (±30%), saturation (±20%)
Random rotation (±15°)

Class Balancing:

Weighted random sampler draws all 8 classes with roughly equal frequency during training, regardless of how many images each class folder contains
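
One common way to obtain such balancing is to weight every sample by the inverse of its class size. The sketch below shows only that weight computation; the function name and the toy label list are assumptions, and the training script may compute its weights differently.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy, deliberately imbalanced label list
labels = ["front"] * 6 + ["left"] * 3 + ["back"] * 1
weights = balanced_sample_weights(labels)
```

Each class then carries the same total weight, so sampling with replacement in proportion to these weights draws every class equally often in expectation. In PyTorch, a list like this can be passed to torch.utils.data.WeightedRandomSampler.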

Training Configuration

python train_pose_classifier.py \
    --data_dir ./pose_labels \
    --model_size small \
    --epochs 30 \
    --batch_size 32 \
    --lr 1e-3

Key Parameters:

Model size: small, base, or large (DINOv2 variant)
Optimizer: AdamW with weight decay 0.01
Loss: CrossEntropyLoss with label smoothing (0.1)
Scheduler: CosineAnnealingLR
Mixed precision: Automatic on GPU
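
Two of these settings are easy to illustrate numerically. The sketch below evaluates the closed-form cosine annealing curve and builds the soft target that label smoothing of 0.1 produces for 8 classes. It assumes the schedule spans all 30 epochs and decays to a minimum learning rate of 0, neither of which is stated above.

```python
import math


def cosine_lr(epoch, base_lr=1e-3, t_max=30, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


def smoothed_target(true_index, num_classes=8, smoothing=0.1):
    """Soft target: (1 - smoothing) on the true class plus smoothing / K on every class."""
    base = smoothing / num_classes
    return [base + (1.0 - smoothing) * (i == true_index) for i in range(num_classes)]


schedule = [cosine_lr(epoch) for epoch in range(31)]
target = smoothed_target(true_index=0)
```

Halfway through the schedule the learning rate is half of its starting value, and the smoothed target still sums to 1 while leaving a small probability mass on every wrong class.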

Training Output:

Best model saved to checkpoints/best_pose_model.pth
Includes confusion matrix and per-class accuracy
Optional ONNX export for deployment

Usage in Navigation

Integration with Detection Pipeline

In the navigation system, pose classification runs on animal crops after detection. The example below uses the ViewPointClassifier from navigation/policy/pose_classifier.py:

from navigation.policy.pose_classifier import ViewPointClassifier
from PIL import Image

# Initialize classifier
classifier = ViewPointClassifier(
    weight_path="model_weights/best_june_24_2025_IA_classifier_016.pth",
    device="cpu",
    threshold=0.5
)

# Process detected animal crops.
# detection_crops is assumed to be a list of crop image paths from the upstream detector.
crops = [Image.open(path) for path in detection_crops]
poses = classifier(crops)  # Returns list of pose strings

# Use poses for navigation decisions
for pose in poses:
    if "front" in pose:
        print("Animal is facing camera - approach with caution")
    elif "back" in pose:
        print("Animal is facing away - good for following")

Multi-Label Pose System (Alternative)

The ViewPointClassifier in pose_classifier.py uses a different approach from the 8-class DINOv2 classifier:

5 multi-label classes: up, front, back, right, left
EfficientNet-B4 backbone trained on zebra crops
Input size: 512×512 pixels
Output: Concatenated string (e.g., "upfrontright")
Threshold: 0.5 (configurable)

This allows detecting compound poses like "animal is facing front-right while looking up."
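
A concatenated output such as "upfrontright" can be split back into its labels, and optionally mapped onto the 8-class names used elsewhere in this guide. The sketch below assumes the string is a plain concatenation of the five label names with no separator, as in the example above. Both helper functions and the mapping are hypothetical and are not part of pose_classifier.py.

```python
VIEW_LABELS = ("up", "front", "back", "right", "left")


def split_viewpoint(pose):
    """Return the set of the five labels that occur in a concatenated pose string."""
    return {label for label in VIEW_LABELS if label in pose}


def to_eight_class(pose):
    """Map a multi-label string to an 8-class name, or None if it does not fit."""
    labels = split_viewpoint(pose)
    depth = [d for d in ("front", "back") if d in labels]
    side = [s for s in ("left", "right") if s in labels]
    if len(depth) > 1 or len(side) > 1 or not (depth or side):
        return None  # contradictory or empty prediction
    return "-".join(depth + side)  # the "up" label has no 8-class equivalent


labels = split_viewpoint("upfrontright")
coarse = to_eight_class("upfrontright")
```

Substring matching is safe here because none of the five label names can appear by accident across the boundary of two others.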

Performance Considerations

Inference Speed

DINOv2-small: ~15-20ms per image (CPU)
DINOv2-base: ~30-40ms per image (CPU)
GPU acceleration: 5-10x faster

Accuracy Targets

Overall accuracy: >85% on validation set
Critical classes (front/back): >90% accuracy
Confusion: Most errors occur between adjacent classes (e.g., front vs. front-left)
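
These targets can be checked directly from a confusion matrix. The sketch below computes overall accuracy, per-class accuracy, and the share of errors that land on a neighboring class in the circle of poses. The function name and the matrix layout (rows are true classes, columns are predictions) are assumptions for this example.

```python
# Classes listed in circular order, so neighbors in the list are adjacent poses
POSE_CLASSES = ["front", "front-left", "left", "back-left",
                "back", "back-right", "right", "front-right"]


def summarize_confusion(confusion):
    """confusion[i][j] = count of samples with true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    per_class = {
        POSE_CLASSES[i]: confusion[i][i] / sum(confusion[i])
        for i in range(n) if sum(confusion[i]) > 0
    }
    errors = total - correct
    # Errors that fall on the next or previous class around the circle
    adjacent = sum(confusion[i][(i + 1) % n] + confusion[i][(i - 1) % n] for i in range(n))
    return {
        "overall": correct / total,
        "per_class": per_class,
        "adjacent_error_share": adjacent / errors if errors else 0.0,
    }
```

A high adjacent_error_share means most mistakes are off by one 45° step, which is far less harmful for navigation than confusing front with back.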

Deployment Notes

Model checkpoint: ~150MB (small), ~350MB (base)
ONNX export available for optimized inference
Batch processing recommended for multiple detections
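
Batching can be sketched independently of the model. The helper below assumes only that the classifier is a callable taking a list of crops and returning one pose per crop, as in the integration example earlier. The function names and the default batch size of 32 are choices made for this illustration.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def classify_in_batches(classifier, crops, batch_size=32):
    """Run the classifier chunk by chunk and return poses in the input order."""
    poses = []
    for chunk in batched(crops, batch_size):
        poses.extend(classifier(chunk))
    return poses
```

Capping the batch size keeps memory bounded when a single frame contains many detections, while still amortizing per-call overhead across crops.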

Common Issues & Tips

Issue: Poor performance on occluded animals

Solution: Train with more occluded examples or use confidence thresholding
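
Confidence thresholding can be sketched as a softmax over the 8 logits followed by a cutoff. The function names and the 0.6 threshold are assumptions for this example; a suitable threshold would need to be tuned on validation data.

```python
import math

POSE_CLASSES = ["front", "front-left", "left", "back-left",
                "back", "back-right", "right", "front-right"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, threshold=0.6):
    """Return (pose, confidence); pose is None when confidence is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return POSE_CLASSES[best], probs[best]
```

Returning None lets downstream code skip or defer a decision for heavily occluded crops instead of acting on a guess.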

Issue: Confusion between adjacent poses

Solution: This is expected due to the continuous nature of orientations; consider using pose groups (front-facing vs. side-facing vs. back-facing)
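
Pose groups can be sketched as a fixed mapping from the 8 classes to 3 coarse groups. Assigning the diagonal classes to the front-facing and back-facing groups is an assumption made here; other groupings are equally plausible.

```python
POSE_GROUPS = {
    "front": "front-facing", "front-left": "front-facing", "front-right": "front-facing",
    "left": "side-facing", "right": "side-facing",
    "back": "back-facing", "back-left": "back-facing", "back-right": "back-facing",
}


def group_probabilities(class_probs):
    """Sum per-class probabilities (dict: pose -> probability) into the three groups."""
    totals = {"front-facing": 0.0, "side-facing": 0.0, "back-facing": 0.0}
    for pose, prob in class_probs.items():
        totals[POSE_GROUPS[pose]] += prob
    return totals
```

Summing probabilities within a group means that a prediction split between front and front-left still yields a confident front-facing decision.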

Issue: Inconsistent predictions across frames

Solution: Apply temporal smoothing or majority voting across consecutive frames
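
Majority voting over a sliding window can be sketched with a bounded deque. The class name and the window of 5 frames are assumptions for this example; in practice one smoother per tracked animal would be needed so that votes from different individuals do not mix.

```python
from collections import Counter, deque


class MajorityVoteSmoother:
    """Keep the last `window` predictions and report the most frequent one."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, pose):
        self.history.append(pose)
        # Ties are resolved in favor of the pose seen earliest in the window
        return Counter(self.history).most_common(1)[0][0]


smoother = MajorityVoteSmoother(window=5)
smoothed = [smoother.update(p) for p in ["front", "front", "left", "front", "front"]]
```

A single-frame flicker such as the isolated "left" above is suppressed, at the cost of a delay of a few frames before a genuine pose change shows up in the output.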

Issue: Different performance on zebras vs. other species

Solution: Retrain with a dataset balanced across species, or train species-specific models

Dataset Statistics

Current training data distribution (from folder structure):

Folders: front, front-left, front-right, left, right, back-left, back-right, back
Images per class: Variable (check with train_pose_classifier.py --data_dir pose_labels)
Species: Primarily zebras and giraffes
Source: Aerial drone footage from Mpala and OPC sessions

References

DINOv2 Paper: https://arxiv.org/abs/2304.07193
VARe-ID (ViewPoint Classifier): https://github.com/ziesski/VARe-ID
Individual identification of wildlife (Review on methods used for wildlife species and individual identification): https://doi.org/10.1007/s10344-021-01549-4
Training script: train_pose_classifier.py
Navigation integration: navigation/policy/pose_classifier.py

Quick Start

Prepare data: Organize images in pose_labels/ folders by class
Train model: python train_pose_classifier.py --data_dir ./pose_labels --epochs 30
Evaluate: Check confusion matrix and per-class accuracy in output
Export: Use --export_onnx flag for optimized deployment
Integrate: Load checkpoint and use for inference on detection crops