ARKit Integration Guide
Overview
The ARKit integration allows us to:
- Use ARKit poses as ground truth for evaluating DA3 and BA
- Compare DA3 poses vs ARKit poses (VIO-based)
- Compare BA poses vs ARKit poses
- Use ARKit intrinsics for more accurate BA
ARKit Data Structure
Metadata JSON Format
{
  "frames": [
    {
      "camera": {
        "viewMatrix": [[...]],                 // 4x4 camera-to-world transform
        "intrinsics": [[...]],                 // 3x3 camera intrinsics
        "trackingState": "limited",            // "normal", "limited", "notAvailable"
        "trackingStateReason": "initializing"  // "normal", "initializing", "relocalizing"
      },
      "featurePointCount": 0,
      "worldMappingStatus": "notAvailable",
      "timestamp": 1764913298.01684,
      "frameIndex": 0
    }
  ]
}
Key Fields
- viewMatrix: 4x4 camera-to-world transformation (ARKit convention)
- intrinsics: 3x3 camera intrinsics matrix (fx, fy, cx, cy)
- trackingState: Overall tracking quality
- trackingStateReason: Why tracking is in current state
- featurePointCount: Number of tracked feature points (may be 0 in metadata)
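As a quick illustration of consuming this metadata directly (outside of ARKitProcessor), the sketch below loads the JSON and keeps only frames at or above a chosen tracking quality. The helper name `load_frames` and the worst-to-best ordering of states are illustrative assumptions, not part of the ylff API:

```python
import json
from pathlib import Path

# Rank the documented trackingState values from worst to best
# (assumed ordering: "notAvailable" < "limited" < "normal").
_STATE_RANK = {"notAvailable": 0, "limited": 1, "normal": 2}

def load_frames(metadata_path: Path, min_state: str = "limited") -> list:
    """Load frame records from the metadata JSON, keeping only frames whose
    camera.trackingState is at least `min_state`."""
    frames = json.loads(metadata_path.read_text())["frames"]
    return [
        f for f in frames
        if _STATE_RANK.get(f["camera"]["trackingState"], 0) >= _STATE_RANK[min_state]
    ]
```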
Usage
Basic Processing
from ylff.arkit_processor import ARKitProcessor
from pathlib import Path

# Initialize processor
processor = ARKitProcessor(
    video_path=Path("arkit/video.MOV"),
    metadata_path=Path("arkit/metadata.json"),
)

# Process for BA validation
arkit_data = processor.process_for_ba_validation(
    output_dir=Path("output"),
    max_frames=50,
    frame_interval=1,
    use_good_tracking_only=False,  # Use all frames even if tracking is "limited"
)

# Extract data
image_paths = arkit_data['image_paths']
arkit_poses_c2w = arkit_data['arkit_poses_c2w']    # (N, 4, 4) camera-to-world
arkit_poses_w2c = arkit_data['arkit_poses_w2c']    # (N, 3, 4) world-to-camera (DA3 format)
arkit_intrinsics = arkit_data['arkit_intrinsics']  # 3x3
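BA solvers usually want scalar focal lengths and a principal point rather than a full matrix, so it can be handy to unpack the 3x3 intrinsics returned above. The helper below is a hypothetical convenience, not part of ARKitProcessor:

```python
import numpy as np

def unpack_intrinsics(K: np.ndarray):
    """Return (fx, fy, cx, cy) from a 3x3 pinhole intrinsics matrix
    K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    K = np.asarray(K)
    return float(K[0, 0]), float(K[1, 1]), float(K[0, 2]), float(K[1, 2])
```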
Running BA Validation
python scripts/run_arkit_ba_validation.py \
    --arkit-dir assets/examples/ARKit \
    --output-dir data/arkit_ba_validation \
    --max-frames 30 \
    --frame-interval 1 \
    --device cpu
This script will:
- Extract frames from ARKit video
- Parse ARKit poses and intrinsics
- Run DA3 inference
- Compare DA3 vs ARKit (ground truth)
- Run BA validation
- Compare BA vs ARKit (ground truth)
- Compare DA3 vs BA
- Save results to JSON
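One point worth keeping in mind: DA3 and BA poses live in their own world frame (and, for monocular pipelines, an arbitrary scale), so pose comparisons like the ones above generally require aligning the estimated trajectory to the ARKit one first. Whether the script does this internally is not shown here; a standard Umeyama similarity alignment of camera centers can be sketched as follows (all names are illustrative):

```python
import numpy as np

def umeyama_align(src: np.ndarray, dst: np.ndarray):
    """Sim(3) alignment of src points to dst (Umeyama, 1991).
    Returns (s, R, t) such that dst ≈ s * R @ src_i + t for each point."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(0), dst.mean(0)
    X, Y = src - mu_s, dst - mu_d                  # centered point sets
    cov = Y.T @ X / len(src)                       # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (X ** 2).sum() / len(src)            # source variance
    s = np.trace(np.diag(D) @ S) / var_src         # optimal scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```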
Coordinate System Conversion
ARKit uses camera-to-world (c2w) convention:
- viewMatrix: 4x4 c2w transform
- Right-handed coordinate system
- Y-up convention
DA3 uses world-to-camera (w2c) convention:
- extrinsics: 3x4 w2c transform
- OpenCV convention (typically)
The ARKitProcessor automatically converts:
w2c_poses = processor.convert_arkit_to_w2c(c2w_poses) # (N, 3, 4)
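Under the hood this is a rigid-transform inversion. A minimal NumPy sketch is shown below, assuming the stored c2w matrices are rigid (rotation plus translation) and omitting any ARKit-to-OpenCV axis flip the processor may additionally apply:

```python
import numpy as np

def c2w_to_w2c(c2w_poses: np.ndarray) -> np.ndarray:
    """Invert (N, 4, 4) camera-to-world transforms into (N, 3, 4)
    world-to-camera extrinsics. For a rigid transform [R | t] the inverse
    is [R^T | -R^T t], which avoids a general 4x4 inversion."""
    R = c2w_poses[:, :3, :3]            # (N, 3, 3) rotations
    t = c2w_poses[:, :3, 3:]            # (N, 3, 1) translations
    R_inv = R.transpose(0, 2, 1)        # R^T
    t_inv = -R_inv @ t                  # -R^T t
    return np.concatenate([R_inv, t_inv], axis=2)  # (N, 3, 4)
```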
Evaluation Metrics
The validation script computes:
DA3 vs ARKit:
- Rotation error (degrees)
- Translation error
- Shows how well DA3 matches ARKit VIO
BA vs ARKit:
- Rotation error (degrees)
- Translation error
- Shows how well BA matches ARKit VIO
DA3 vs BA:
- Rotation error (degrees)
- Shows agreement between DA3 and BA
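The rotation metric above is the usual geodesic distance between two rotation matrices, and the translation metric a Euclidean norm. The exact formulas the script uses are not shown here; a minimal sketch:

```python
import numpy as np

def rotation_error_deg(R_a: np.ndarray, R_b: np.ndarray) -> float:
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    R_rel = R_a @ R_b.T
    # trace(R_rel) = 1 + 2*cos(theta); clip guards against numerical drift
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def translation_error(t_a, t_b) -> float:
    """Euclidean distance between two translation vectors."""
    return float(np.linalg.norm(np.asarray(t_a, float) - np.asarray(t_b, float)))
```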
Notes
- ARKit poses are VIO-based (Visual-Inertial Odometry)
- They may drift over long sequences
- For short sequences (< 1 minute), ARKit poses are very accurate
- Feature point counts may be 0 in metadata (not always included)
- Tracking state "limited" is acceptable for short sequences
Example Output
=== Comparing DA3 vs ARKit (Ground Truth) ===
DA3 vs ARKit:
  Mean rotation error: 2.45°
  Max rotation error: 8.32°
  Mean translation error: 0.12

=== Comparing BA vs ARKit (Ground Truth) ===
BA vs ARKit:
  Mean rotation error: 1.23°
  Max rotation error: 3.45°
  Mean translation error: 0.08

=== Comparing DA3 vs BA ===
DA3 vs BA:
  Mean rotation error: 1.89°
  Max rotation error: 5.67°
This shows:
- DA3 is within ~2.5° of ARKit (good)
- BA is within ~1.2° of ARKit (better, as expected)
- DA3 and BA agree within ~1.9° (reasonable)