# ARKit Integration Guide

## Overview

The ARKit integration allows us to:

1. Use ARKit poses as **ground truth** for evaluating DA3 and BA
2. Compare DA3 poses vs ARKit poses (VIO-based)
3. Compare BA poses vs ARKit poses
4. Use ARKit intrinsics for more accurate BA

## ARKit Data Structure

### Metadata JSON Format

```json
{
  "frames": [
    {
      "camera": {
        "viewMatrix": [[...]],        // 4x4 camera-to-world transform
        "intrinsics": [[...]],        // 3x3 camera intrinsics
        "trackingState": "limited",   // "normal", "limited", "notAvailable"
        "trackingStateReason": "initializing"  // "normal", "initializing", "relocalizing"
      },
      "featurePointCount": 0,
      "worldMappingStatus": "notAvailable",
      "timestamp": 1764913298.01684,
      "frameIndex": 0
    }
  ]
}
```

### Key Fields

- **viewMatrix**: 4x4 camera-to-world transformation (ARKit convention)
- **intrinsics**: 3x3 camera intrinsics matrix (fx, fy, cx, cy)
- **trackingState**: Overall tracking quality
- **trackingStateReason**: Why tracking is in its current state
- **featurePointCount**: Number of tracked feature points (may be 0 in metadata)

## Usage

### Basic Processing

```python
from pathlib import Path

from ylff.arkit_processor import ARKitProcessor

# Initialize processor
processor = ARKitProcessor(
    video_path=Path("arkit/video.MOV"),
    metadata_path=Path("arkit/metadata.json"),
)

# Process for BA validation
arkit_data = processor.process_for_ba_validation(
    output_dir=Path("output"),
    max_frames=50,
    frame_interval=1,
    use_good_tracking_only=False,  # Use all frames if tracking is limited
)

# Extract data
image_paths = arkit_data['image_paths']
arkit_poses_c2w = arkit_data['arkit_poses_c2w']  # 4x4 camera-to-world
arkit_poses_w2c = arkit_data['arkit_poses_w2c']  # 3x4 world-to-camera (DA3 format)
arkit_intrinsics = arkit_data['arkit_intrinsics']  # 3x3
```

### Running BA Validation

```bash
python scripts/run_arkit_ba_validation.py \
    --arkit-dir assets/examples/ARKit \
    --output-dir data/arkit_ba_validation \
    --max-frames 30 \
    --frame-interval 1 \
    --device cpu
```
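Before running the full validation, it can help to sanity-check the metadata directly. A minimal sketch, assuming only the JSON layout shown above; the helper name `summarize_tracking` is illustrative, not part of the codebase:

```python
from collections import Counter

def summarize_tracking(metadata: dict) -> Counter:
    """Count frames per trackingState in an ARKit metadata dict."""
    return Counter(
        frame["camera"]["trackingState"] for frame in metadata["frames"]
    )

# Inline sample mirroring the schema above; in practice, parse the file
# with json.loads(Path("arkit/metadata.json").read_text())
sample = {
    "frames": [
        {"camera": {"trackingState": "limited"}, "frameIndex": 0},
        {"camera": {"trackingState": "normal"}, "frameIndex": 1},
        {"camera": {"trackingState": "normal"}, "frameIndex": 2},
    ]
}
print(summarize_tracking(sample))  # Counter({'normal': 2, 'limited': 1})
```

A quick count like this tells you whether `use_good_tracking_only=True` would leave enough frames to work with.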
The `run_arkit_ba_validation.py` script will:

1. Extract frames from the ARKit video
2. Parse ARKit poses and intrinsics
3. Run DA3 inference
4. Compare DA3 vs ARKit (ground truth)
5. Run BA validation
6. Compare BA vs ARKit (ground truth)
7. Compare DA3 vs BA
8. Save results to JSON

## Coordinate System Conversion

ARKit uses the **camera-to-world** (c2w) convention:

- `viewMatrix`: 4x4 c2w transform
- Right-handed coordinate system
- Y-up convention

DA3 uses the **world-to-camera** (w2c) convention:

- `extrinsics`: 3x4 w2c transform
- OpenCV convention (typically)

The `ARKitProcessor` converts automatically:

```python
w2c_poses = processor.convert_arkit_to_w2c(c2w_poses)  # (N, 3, 4)
```

## Evaluation Metrics

The validation script computes:

1. **DA3 vs ARKit**:
   - Rotation error (degrees)
   - Translation error
   - Shows how well DA3 matches ARKit VIO

2. **BA vs ARKit**:
   - Rotation error (degrees)
   - Translation error
   - Shows how well BA matches ARKit VIO

3. **DA3 vs BA**:
   - Rotation error (degrees)
   - Shows agreement between DA3 and BA

## Notes

- ARKit poses come from visual-inertial odometry (VIO)
- They may drift over long sequences
- For short sequences (< 1 minute), ARKit poses are very accurate
- Feature point counts may be 0 in metadata (not always included)
- Tracking state "limited" is acceptable for short sequences

## Example Output

```
=== Comparing DA3 vs ARKit (Ground Truth) ===
DA3 vs ARKit:
  Mean rotation error: 2.45°
  Max rotation error: 8.32°
  Mean translation error: 0.12

=== Comparing BA vs ARKit (Ground Truth) ===
BA vs ARKit:
  Mean rotation error: 1.23°
  Max rotation error: 3.45°
  Mean translation error: 0.08

=== Comparing DA3 vs BA ===
DA3 vs BA:
  Mean rotation error: 1.89°
  Max rotation error: 5.67°
```

This shows:

- DA3 is within ~2.5° of ARKit (good)
- BA is within ~1.2° of ARKit (better, as expected)
- DA3 and BA agree within ~1.9° (reasonable)
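The rotation errors reported above are typically the geodesic angle between two rotation matrices. A minimal NumPy sketch; the helper name `rotation_error_deg` is illustrative, and the validation script's exact implementation may differ:

```python
import numpy as np

def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    R_rel = R_est.T @ R_gt
    # trace(R_rel) = 1 + 2*cos(theta); clip guards against numerical
    # drift slightly outside [-1, 1]
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

# Example: a 10° rotation about the z-axis vs the identity
theta = np.radians(10.0)
Rz = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
print(rotation_error_deg(np.eye(3), Rz))  # ~10.0
```

Note this metric is invariant to which pose set is "estimate" and which is "ground truth", since the angle of `R_est.T @ R_gt` equals that of `R_gt.T @ R_est`.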