
VGGT Pose Derivation: How the Model Learns Camera Parameters

Overview

VGGT learns to predict camera poses through supervised learning with ground truth camera parameters from multi-view datasets. The model doesn't "derive" pose from geometry—it learns to predict pose from visual features through training.

1. Ground Truth Data Sources

1.1 Training Datasets

VGGT is trained on datasets that provide ground truth camera parameters:

Primary Datasets:

  1. CO3D (Common Objects in 3D)

    • Provides extrinsics and intrinsics for object-centric scenes
    • Camera poses estimated via COLMAP/SfM
    • File: training/data/datasets/co3d.py
  2. vKITTI (Virtual KITTI)

    • Synthetic driving scenes with perfect camera parameters
    • File: training/data/datasets/vkitti.py

Data Format:

{
    "images": List[np.ndarray],           # RGB images
    "depths": List[np.ndarray],           # Depth maps
    "extrinsics": List[np.ndarray],       # Camera extrinsics (3×4, OpenCV convention)
    "intrinsics": List[np.ndarray],       # Camera intrinsics (3×3)
    "world_points": np.ndarray,           # 3D points in world coordinates
    "point_masks": np.ndarray,            # Validity masks for points
}
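
For concreteness, a dummy sample in this format might look like the following (the values of S, H, W, N and the per-array shapes are illustrative, not the repo's exact layout):

```python
import numpy as np

S, H, W, N = 2, 64, 64, 1024  # views, image size, point count (illustrative)
sample = {
    "images":       [np.zeros((H, W, 3), dtype=np.uint8) for _ in range(S)],
    "depths":       [np.zeros((H, W), dtype=np.float32) for _ in range(S)],
    "extrinsics":   [np.eye(3, 4, dtype=np.float32) for _ in range(S)],  # [R | t]
    "intrinsics":   [np.eye(3, dtype=np.float32) for _ in range(S)],
    "world_points": np.zeros((N, 3), dtype=np.float32),
    "point_masks":  np.ones(N, dtype=bool),
}
```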

1.2 Ground Truth Camera Parameters

Extrinsics (extri_opencv):

  • Format: [R | t] (3×4 matrix)
  • Convention: OpenCV camera-from-world transformation
  • Source: COLMAP/SfM reconstruction or synthetic data

Intrinsics (intri_opencv):

  • Format:
    [[fx, 0,  cx],
     [0,  fy, cy],
     [0,  0,  1 ]]
    
  • Source: Camera calibration or dataset metadata

Example from CO3D:

# training/data/datasets/co3d.py, lines 205-258
extrinsics = []
intrinsics = []

for frame_data in sequence_data:
    # Load camera parameters from CO3D annotation
    extri_opencv = frame_data['extrinsics']  # 3×4 matrix
    intri_opencv = frame_data['intrinsics']  # 3×3 matrix

    extrinsics.append(extri_opencv)
    intrinsics.append(intri_opencv)

2. Training Process: How Pose is Learned

2.1 Forward Pass

Input: Images [B, S, 3, H, W]

Processing:

  1. Aggregator (backbone) extracts visual features
  2. Camera Head predicts pose encoding from camera tokens
  3. Output: pose_enc_list - list of pose encodings (one per iteration)

File: vggt/models/vggt.py

# Forward pass
predictions = model(images)
pose_enc_list = predictions["pose_enc_list"]  # List of [B, S, 9] tensors
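
The 9D encoding can be unpacked at inference time. Below is a minimal sketch of the inverse mapping (the repo's pose_encoding_to_extri_intri handles the full conversion, including the quaternion-to-matrix step):

```python
import torch

def split_pose_encoding(pose_enc, image_hw):
    """Split a [B, S, 9] absT_quaR_FoV encoding and recover focal
    lengths in pixels from the predicted FOVs (illustrative sketch)."""
    T = pose_enc[..., :3]        # translation
    quat = pose_enc[..., 3:7]    # rotation quaternion
    fov_h = pose_enc[..., 7]     # vertical FOV (radians)
    fov_w = pose_enc[..., 8]     # horizontal FOV (radians)
    H, W = image_hw
    # Invert fov = 2 * atan((dim / 2) / f)  ->  f = (dim / 2) / tan(fov / 2)
    fy = (H / 2) / torch.tan(fov_h / 2)
    fx = (W / 2) / torch.tan(fov_w / 2)
    return T, quat, fx, fy

pose_enc = torch.zeros(1, 1, 9)
pose_enc[..., 3] = 1.0            # identity quaternion (w = 1)
pose_enc[..., 7:] = torch.pi / 2  # 90-degree FOVs
T, quat, fx, fy = split_pose_encoding(pose_enc, (518, 518))
print(fx.item())  # ~259 (focal length in pixels)
```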

2.2 Ground Truth Encoding

File: training/loss.py, lines 101-109

# Get ground truth camera extrinsics and intrinsics
gt_extrinsics = batch_data['extrinsics']      # [B, S, 3, 4]
gt_intrinsics = batch_data['intrinsics']      # [B, S, 3, 3]
image_hw = batch_data['images'].shape[-2:]    # (H, W)

# Encode ground truth pose to match predicted encoding format
gt_pose_encoding = extri_intri_to_pose_encoding(
    gt_extrinsics, gt_intrinsics, image_hw,
    pose_encoding_type="absT_quaR_FoV"
)  # [B, S, 9]

Encoding Process (vggt/utils/pose_enc.py):

# Extract components
R = extrinsics[:, :, :3, :3]  # Rotation matrix
T = extrinsics[:, :, :3, 3]   # Translation vector

# Convert rotation to quaternion
quat = mat_to_quat(R)  # [B, S, 4]

# Compute field of view from intrinsics
H, W = image_size_hw
fov_h = 2 * torch.atan((H / 2) / intrinsics[..., 1, 1])  # Vertical FOV
fov_w = 2 * torch.atan((W / 2) / intrinsics[..., 0, 0])  # Horizontal FOV

# Combine into 9D encoding
pose_encoding = torch.cat(
    [T, quat, fov_h[..., None], fov_w[..., None]], dim=-1
)  # [B, S, 9]: [tx, ty, tz, qw, qx, qy, qz, fov_h, fov_w]
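
Putting the steps above together, a self-contained sketch of the encoding (using a simplified matrix-to-quaternion formula in place of the repo's mat_to_quat; it assumes a well-conditioned rotation with trace > -1):

```python
import torch

def encode_pose(extrinsics, intrinsics, image_hw):
    """Encode [B, S, 3, 4] extrinsics and [B, S, 3, 3] intrinsics into
    the 9D absT_quaR_FoV vector (illustrative sketch)."""
    R = extrinsics[..., :3, :3]
    T = extrinsics[..., :3, 3]
    # Quaternion (w, x, y, z) from R; assumes trace(R) > -1 so w != 0
    w = torch.sqrt(1.0 + R[..., 0, 0] + R[..., 1, 1] + R[..., 2, 2]) / 2
    x = (R[..., 2, 1] - R[..., 1, 2]) / (4 * w)
    y = (R[..., 0, 2] - R[..., 2, 0]) / (4 * w)
    z = (R[..., 1, 0] - R[..., 0, 1]) / (4 * w)
    quat = torch.stack([w, x, y, z], dim=-1)
    H, W = image_hw
    fov_h = 2 * torch.atan((H / 2) / intrinsics[..., 1, 1])
    fov_w = 2 * torch.atan((W / 2) / intrinsics[..., 0, 0])
    return torch.cat([T, quat, fov_h[..., None], fov_w[..., None]], dim=-1)

# Identity camera, 518x518 image, focal length 259 px -> FOV = pi/2
extri = torch.eye(3, 4).expand(1, 1, 3, 4)
intri = torch.tensor([[259.0, 0, 259], [0, 259, 259], [0, 0, 1]]).expand(1, 1, 3, 3)
enc = encode_pose(extri, intri, (518, 518))
print(enc.shape)  # torch.Size([1, 1, 9])
```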

2.3 Loss Computation

File: training/loss.py, lines 81-155

Multi-Stage Loss (with temporal decay):

def compute_camera_loss(
    pred_dict,              # Contains 'pose_enc_list'
    batch_data,             # Contains 'extrinsics', 'intrinsics'
    loss_type="l1",
    gamma=0.6,              # Temporal decay weight
    weight_trans=1.0,       # Translation loss weight
    weight_rot=1.0,         # Rotation loss weight
    weight_focal=0.5,       # FOV loss weight
):
    pred_pose_encodings = pred_dict['pose_enc_list']  # List of [B, S, 9]
    n_stages = len(pred_pose_encodings)  # 4 iterations

    # Encode ground truth to the same 9D format
    gt_pose_encoding = extri_intri_to_pose_encoding(...)  # [B, S, 9]

    # Keep only frames with enough valid points (see Key Points below)
    valid_frame_mask = ...  # [B, S] boolean mask

    total_loss_T = total_loss_R = total_loss_FL = 0.0

    # Compute loss for each refinement iteration
    for stage_idx in range(n_stages):
        stage_weight = gamma ** (n_stages - stage_idx - 1)  # Later stages weighted more
        pred_pose_stage = pred_pose_encodings[stage_idx]

        # Component-wise losses for this stage
        loss_T_stage, loss_R_stage, loss_FL_stage = camera_loss_single(
            pred_pose_stage[valid_frame_mask],
            gt_pose_encoding[valid_frame_mask],
            loss_type=loss_type
        )

        # Accumulate decay-weighted losses
        total_loss_T += loss_T_stage * stage_weight
        total_loss_R += loss_R_stage * stage_weight
        total_loss_FL += loss_FL_stage * stage_weight

    # Average over stages, then combine with per-component weights
    avg_loss_T = total_loss_T / n_stages
    avg_loss_R = total_loss_R / n_stages
    avg_loss_FL = total_loss_FL / n_stages

    total_camera_loss = (
        avg_loss_T * weight_trans +
        avg_loss_R * weight_rot +
        avg_loss_FL * weight_focal
    )
    return total_camera_loss

Component Losses (camera_loss_single):

def camera_loss_single(pred_pose_enc, gt_pose_enc, loss_type="l1"):
    # L1 loss for each component
    loss_T = (pred_pose_enc[..., :3] - gt_pose_enc[..., :3]).abs()      # Translation
    loss_R = (pred_pose_enc[..., 3:7] - gt_pose_enc[..., 3:7]).abs()    # Rotation (quaternion)
    loss_FL = (pred_pose_enc[..., 7:] - gt_pose_enc[..., 7:]).abs()    # Field of view

    return loss_T.mean(), loss_R.mean(), loss_FL.mean()

Key Points:

  • L1 loss (absolute error) for each component
  • Separate losses for translation, rotation, and FOV
  • Temporal weighting: Later iterations weighted more (gamma=0.6)
  • Valid frame filtering: Only frames with >100 valid points
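
The temporal weighting is easy to verify in isolation; for gamma=0.6 over 4 iterations, the per-stage weights are:

```python
# Per-stage weights: stage_weight = gamma ** (n_stages - stage_idx - 1),
# so the final iteration always gets weight 1.0 and earlier ones decay.
gamma, n_stages = 0.6, 4
weights = [gamma ** (i) for i in range(n_stages - 1, -1, -1)]
print(weights)  # ~[0.216, 0.36, 0.6, 1.0]
```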

3. Data Normalization

3.1 Why Normalize?

Camera parameters have different scales:

  • Translation: Can be in meters (0.1-100m)
  • Rotation: Quaternion (unit norm)
  • FOV: Radians (0.1-3.0)

Normalization ensures stable training.

3.2 Normalization Process

File: training/train_utils/normalization.py, lines 27-122

def normalize_camera_extrinsics_and_points_batch(
    extrinsics: torch.Tensor,  # [B, S, 3, 4]
    world_points: torch.Tensor,
    cam_points: torch.Tensor,
    depths: torch.Tensor,
):
    """
    Normalize camera extrinsics and 3D points.

    Strategy:
    1. Set first camera as identity (reference frame)
    2. Normalize translation scale to unit average
    """
    B, S, _, _ = extrinsics.shape

    # Convert to homogeneous form [B, S, 4, 4]
    extrinsics_homog = torch.cat([
        extrinsics,
        torch.zeros(B, S, 1, 4, device=extrinsics.device)
    ], dim=2)
    extrinsics_homog[:, :, -1, -1] = 1.0

    # Set first camera as identity (reference frame):
    # new_E_i = E_i @ E_0^{-1}, so new_E_0 = [I | 0]
    first_cam_extrinsic_inv = closed_form_inverse_se3(extrinsics_homog[:, 0])
    new_extrinsics = torch.matmul(extrinsics_homog, first_cam_extrinsic_inv.unsqueeze(1))

    # Normalize translation scale: average translation magnitude -> 1
    avg_scale = new_extrinsics[:, :, :3, 3].norm(dim=-1).mean(dim=-1)  # [B]
    new_extrinsics[:, :, :3, 3] = new_extrinsics[:, :, :3, 3] / avg_scale.view(-1, 1, 1)

    # Transform world points into the first camera's frame (p' = R0 @ p + t0)
    R = extrinsics[:, 0, :3, :3]  # [B, 3, 3]
    t = extrinsics[:, 0, :3, 3]   # [B, 3]
    new_world_points = torch.einsum('bij,b...j->b...i', R, world_points) + t.view(B, 1, 1, 3)
    new_world_points = new_world_points / avg_scale.view(-1, 1, 1, 1)

    # Camera-frame points and depths only need the scale correction
    new_cam_points = cam_points / avg_scale.view(-1, 1, 1, 1)
    new_depths = depths / avg_scale.view(-1, 1, 1)

    return new_extrinsics[:, :, :3], new_cam_points, new_world_points, new_depths

Normalization Strategy:

  1. Reference frame: First camera set to identity [I | 0]
  2. Scale normalization: Translation magnitudes normalized to unit average
  3. Coordinate transform: All cameras and points transformed to normalized frame
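
The two extrinsics steps can be sketched end to end in plain PyTorch (using torch.linalg.inv in place of the repo's closed_form_inverse_se3):

```python
import torch

def normalize_extrinsics(extrinsics):
    """Re-express cameras relative to the first one, then rescale
    translations to unit average norm (illustrative sketch)."""
    B, S = extrinsics.shape[:2]
    # Lift [B, S, 3, 4] to homogeneous [B, S, 4, 4]
    bottom = torch.tensor([0.0, 0, 0, 1]).expand(B, S, 1, 4)
    homog = torch.cat([extrinsics, bottom], dim=2)
    # Step 1: the first camera becomes the identity reference frame
    first_inv = torch.linalg.inv(homog[:, 0])   # [B, 4, 4]
    new = homog @ first_inv.unsqueeze(1)        # [B, S, 4, 4]
    # Step 2: normalize translation scale to unit average
    scale = new[:, :, :3, 3].norm(dim=-1).mean(dim=-1).clamp(min=1e-8)
    new[:, :, :3, 3] /= scale.view(-1, 1, 1)
    return new[:, :, :3]                        # back to [B, S, 3, 4]

# Two cameras: identity, and a 2m translation along x
E = torch.eye(3, 4).repeat(1, 2, 1, 1)
E[0, 1, 0, 3] = 2.0
out = normalize_extrinsics(E)  # first camera is [I | 0] after normalization
```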

Why This Works:

  • Scale-invariant: Model learns relative poses, not absolute scales
  • Stable training: Normalized values prevent gradient issues
  • Generalization: Works across different scene scales

4. What the Model Actually Learns

4.1 Visual Features → Camera Parameters

The model learns a mapping from visual features to camera parameters:

Visual Features (from images)
    ↓
Camera Tokens (aggregated features)
    ↓
Iterative Refinement (4 steps)
    ↓
Pose Encoding [T(3), quat(4), FOV(2)]

Key Insight: The model doesn't solve geometry—it learns to recognize visual patterns that correlate with camera poses.

4.2 What Visual Cues Does It Use?

The model likely learns to recognize:

  1. Parallax: Relative motion between near/far objects
  2. Perspective: Vanishing points, horizon lines
  3. Multi-view consistency: How objects appear from different angles
  4. Depth cues: Occlusion, relative sizes, texture gradients

Evidence: The model uses alternating frame/global attention:

  • Frame attention: Processes each view independently
  • Global attention: Shares information across all views

Together, these two attention modes let the model learn multi-view geometric relationships.

4.3 Training Objective

Total Loss:

L_total = L_camera + L_depth + L_point + L_track

Camera Loss:

L_camera = weight_trans * L_T + weight_rot * L_R + weight_focal * L_FOV

Component Losses:

  • Translation loss: L1(pred_T - gt_T)
  • Rotation loss: L1(pred_quat - gt_quat)
  • FOV loss: L1(pred_FOV - gt_FOV)
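
On dummy tensors, the component split behaves as expected (re-implementing the L1 variant of camera_loss_single from Section 2.3 for illustration):

```python
import torch

def camera_loss_single(pred, gt):
    # L1 error split by slice of the 9D encoding: T = [:3], quat = [3:7], FOV = [7:]
    loss_T = (pred[..., :3] - gt[..., :3]).abs().mean()
    loss_R = (pred[..., 3:7] - gt[..., 3:7]).abs().mean()
    loss_FL = (pred[..., 7:] - gt[..., 7:]).abs().mean()
    return loss_T, loss_R, loss_FL

gt = torch.zeros(2, 4, 9)  # [B, S, 9] dummy ground truth
pred = gt.clone()
pred[..., :3] += 0.1       # constant 0.1 translation error only
loss_T, loss_R, loss_FL = camera_loss_single(pred, gt)
print(loss_T.item(), loss_R.item(), loss_FL.item())  # ~0.1, 0.0, 0.0
```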

5. Comparison: VGGT vs Traditional SfM

Traditional SfM (COLMAP)

  1. Feature detection: Detect keypoints (SIFT, etc.)
  2. Feature matching: Match keypoints across views
  3. Epipolar geometry: Compute fundamental/essential matrices
  4. Bundle adjustment: Optimize camera poses and 3D points jointly

Key: Solves geometry from correspondences

VGGT

  1. Visual feature extraction: Extract features from images
  2. Multi-view attention: Share information across views
  3. Direct prediction: Predict camera parameters directly
  4. Supervised learning: Learn from ground truth poses

Key: Learns to predict pose from visual patterns


6. Key Takeaways

How VGGT "Derives" Pose

  1. Training Phase:

    • Learns mapping: Visual Features → Camera Parameters
    • Supervised by ground truth poses from COLMAP/SfM
    • Uses L1 loss on translation, rotation, and FOV
  2. Inference Phase:

    • Extracts visual features from input images
    • Predicts pose encoding directly from features
    • No geometric solving—pure feed-forward prediction
  3. What It Learns:

    • Visual patterns that correlate with camera poses
    • Multi-view geometric relationships
    • Scale-invariant pose representations

Why This Works

  1. Large-scale training: Millions of images with ground truth poses
  2. Multi-view supervision: Learns from multiple views simultaneously
  3. Iterative refinement: Progressively improves predictions
  4. Normalization: Scale-invariant training enables generalization

Limitations

  1. Statistical consistency: Learns average behavior, not hard geometric constraints
  2. Generalization: May struggle on scenes very different from training data
  3. No explicit geometry: Doesn't solve epipolar geometry or bundle adjustment

7. Code References

Training Loss:

  • training/loss.py: compute_camera_loss(), camera_loss_single()

Data Loading:

  • training/data/datasets/co3d.py: CO3D dataset loader
  • training/data/datasets/vkitti.py: vKITTI dataset loader

Normalization:

  • training/train_utils/normalization.py: normalize_camera_extrinsics_and_points_batch()

Pose Encoding:

  • vggt/utils/pose_enc.py: extri_intri_to_pose_encoding(), pose_encoding_to_extri_intri()

Model:

  • vggt/models/vggt.py: Main model forward pass
  • vggt/heads/camera_head.py: Camera head prediction