ARKit Integration Guide

Overview

The ARKit integration allows us to:

  1. Use ARKit poses as ground truth for evaluating DA3 and BA
  2. Compare DA3 poses vs ARKit poses (VIO-based)
  3. Compare BA poses vs ARKit poses
  4. Use ARKit intrinsics for more accurate BA

ARKit Data Structure

Metadata JSON Format

```json
{
  "frames": [
    {
      "camera": {
        "viewMatrix": [[...]],      // 4x4 camera-to-world transform
        "intrinsics": [[...]],      // 3x3 camera intrinsics
        "trackingState": "limited", // "normal", "limited", "notAvailable"
        "trackingStateReason": "initializing" // "normal", "initializing", "relocalizing"
      },
      "featurePointCount": 0,
      "worldMappingStatus": "notAvailable",
      "timestamp": 1764913298.01684,
      "frameIndex": 0
    }
  ]
}
```

Key Fields

  • viewMatrix: 4x4 camera-to-world transformation (ARKit convention)
  • intrinsics: 3x3 camera intrinsics matrix (encodes fx, fy, cx, cy)
  • trackingState: Overall tracking quality
  • trackingStateReason: Why tracking is in current state
  • featurePointCount: Number of tracked feature points (may be 0 in metadata)
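The metadata file can be loaded with the standard library alone. A minimal sketch of parsing it into arrays (the `load_arkit_metadata` helper and the filtering snippet are illustrative, not part of `ARKitProcessor`):

```python
import json
from pathlib import Path

import numpy as np

def load_arkit_metadata(metadata_path: Path):
    """Parse ARKit metadata JSON into per-frame pose/intrinsics arrays."""
    frames = json.loads(metadata_path.read_text())["frames"]
    poses_c2w = np.array([f["camera"]["viewMatrix"] for f in frames])   # (N, 4, 4)
    intrinsics = np.array([f["camera"]["intrinsics"] for f in frames])  # (N, 3, 3)
    states = [f["camera"]["trackingState"] for f in frames]
    return poses_c2w, intrinsics, states

# Keeping only well-tracked frames (what use_good_tracking_only=True implies):
# good = [i for i, s in enumerate(states) if s == "normal"]
```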

Usage

Basic Processing

```python
from ylff.arkit_processor import ARKitProcessor
from pathlib import Path

# Initialize processor
processor = ARKitProcessor(
    video_path=Path("arkit/video.MOV"),
    metadata_path=Path("arkit/metadata.json")
)

# Process for BA validation
arkit_data = processor.process_for_ba_validation(
    output_dir=Path("output"),
    max_frames=50,
    frame_interval=1,
    use_good_tracking_only=False,  # Use all frames if tracking is limited
)

# Extract data
image_paths = arkit_data['image_paths']
arkit_poses_c2w = arkit_data['arkit_poses_c2w']  # 4x4 camera-to-world
arkit_poses_w2c = arkit_data['arkit_poses_w2c']  # 3x4 world-to-camera (DA3 format)
arkit_intrinsics = arkit_data['arkit_intrinsics']  # 3x3
```

Running BA Validation

```bash
python scripts/run_arkit_ba_validation.py \
    --arkit-dir assets/examples/ARKit \
    --output-dir data/arkit_ba_validation \
    --max-frames 30 \
    --frame-interval 1 \
    --device cpu
```

This script will:

  1. Extract frames from ARKit video
  2. Parse ARKit poses and intrinsics
  3. Run DA3 inference
  4. Compare DA3 vs ARKit (ground truth)
  5. Run BA validation
  6. Compare BA vs ARKit (ground truth)
  7. Compare DA3 vs BA
  8. Save results to JSON
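Frame selection from `--max-frames` and `--frame-interval` amounts to stride-then-truncate sampling. A sketch (the helper name is illustrative, not the script's actual internals):

```python
def select_frame_indices(total_frames: int, max_frames: int, frame_interval: int) -> list[int]:
    """Take every frame_interval-th frame index, capped at max_frames."""
    return list(range(0, total_frames, frame_interval))[:max_frames]

# e.g. a 300-frame video with --frame-interval 1 --max-frames 30
# yields frames 0..29
```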

Coordinate System Conversion

ARKit uses camera-to-world (c2w) convention:

  • viewMatrix: 4x4 c2w transform
  • Right-handed coordinate system
  • Y-up convention

DA3 uses world-to-camera (w2c) convention:

  • extrinsics: 3x4 w2c transform
  • OpenCV convention (typically)

The ARKitProcessor automatically converts:

```python
w2c_poses = processor.convert_arkit_to_w2c(c2w_poses)  # (N, 3, 4)
```
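The conversion itself is just a rigid-transform inverse: R_w2c = R_c2w^T and t_w2c = -R_c2w^T t_c2w. A NumPy sketch of the operation (any ARKit-to-OpenCV axis flip is omitted here; the function name is illustrative):

```python
import numpy as np

def c2w_to_w2c(c2w_poses: np.ndarray) -> np.ndarray:
    """Invert (N, 4, 4) camera-to-world poses into (N, 3, 4) world-to-camera."""
    R = c2w_poses[:, :3, :3]                       # (N, 3, 3) rotations
    t = c2w_poses[:, :3, 3:]                       # (N, 3, 1) translations
    R_w2c = R.transpose(0, 2, 1)                   # R^T
    t_w2c = -R_w2c @ t                             # -R^T t
    return np.concatenate([R_w2c, t_w2c], axis=2)  # (N, 3, 4)
```

A quick sanity check: applying a converted pose to its own camera center must land on the origin in camera coordinates.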

Evaluation Metrics

The validation script computes:

  1. DA3 vs ARKit:

    • Rotation error (degrees)
    • Translation error
    • Shows how well DA3 matches ARKit VIO
  2. BA vs ARKit:

    • Rotation error (degrees)
    • Translation error
    • Shows how well BA matches ARKit VIO
  3. DA3 vs BA:

    • Rotation error (degrees)
    • Shows agreement between DA3 and BA
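A rotation error between two poses is conventionally the geodesic angle of the relative rotation, theta = arccos((trace(R1^T R2) - 1) / 2). A sketch of per-frame errors (function names are illustrative; translation error here is a plain Euclidean distance with no scale alignment, which may or may not match the script's metric):

```python
import numpy as np

def rotation_error_deg(R1: np.ndarray, R2: np.ndarray) -> float:
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t1: np.ndarray, t2: np.ndarray) -> float:
    """Euclidean distance between two translation vectors."""
    return float(np.linalg.norm(t1 - t2))
```

The `np.clip` guards against `trace` values marginally outside [-1, 1] from floating-point noise, which would otherwise make `arccos` return NaN.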

Notes

  • ARKit poses are VIO-based (Visual-Inertial Odometry)
  • They may drift over long sequences
  • For short sequences (< 1 minute), ARKit poses are very accurate
  • Feature point counts may be 0 in metadata (not always included)
  • Tracking state "limited" is acceptable for short sequences

Example Output

```text
=== Comparing DA3 vs ARKit (Ground Truth) ===
DA3 vs ARKit:
  Mean rotation error: 2.45°
  Max rotation error: 8.32°
  Mean translation error: 0.12

=== Comparing BA vs ARKit (Ground Truth) ===
BA vs ARKit:
  Mean rotation error: 1.23°
  Max rotation error: 3.45°
  Mean translation error: 0.08

=== Comparing DA3 vs BA ===
DA3 vs BA:
  Mean rotation error: 1.89°
  Max rotation error: 5.67°
```

This shows:

  • DA3 is within ~2.5° of ARKit (good)
  • BA is within ~1.2° of ARKit (better, as expected)
  • DA3 and BA agree within ~1.9° (reasonable)