jcompanion committed
Commit · 56015c7
Parent(s): 08b52d4

Fix: extrinsic is camera-to-world (not world-to-camera), update camera JSON output

- MULTI_ANGLE_STAGING.md +199 -0
- app.py +51 -8
MULTI_ANGLE_STAGING.md
ADDED
@@ -0,0 +1,199 @@
# Multi-Angle Virtual Staging Pipeline

## Project Goal

Project furniture from a staged room image into 3D space, then reproject onto other camera angles to generate masks for Nano Banana Pro inpainting.

```
Flow:
1. Upload 3 images (1 staged with furniture + 2 empty rooms, different angles)
2. Run Map Anything reconstruction → GLB + depth maps + camera poses
3. Extract furniture mask from staged image (white = furniture)
4. Project mask through 3D space to other camera views
5. Use projected masks with Nano Banana Pro to inpaint furniture into empty room photos
```

---

## What We Built (Dec 31, 2024)

### Added to `app.py`:

#### 1. `project_mask_to_cameras()` function (lines ~1127-1317)
Projects a furniture mask from one camera view to all other camera views using:
- Raw metric depth from `predictions.npz`
- Camera extrinsic matrices (4x4, camera-to-world)
- Camera intrinsic matrices (3x3, fx/fy/cx/cy)

```python
def project_mask_to_cameras(
    mask_image: np.ndarray,        # Binary mask - white (255) = furniture
    source_camera_index: int,      # Which camera captured the staged image
    target_dir: str,               # Directory with predictions.npz
    dilate_radius: int = 5,        # Fills gaps in sparse projection
    flip_y: bool = False           # Debug option for Y-axis issues
) -> dict:
```
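
To make the geometry concrete, here is a minimal sketch of the unproject → transform → reproject chain this function implements. The helper name and arguments are hypothetical (this is not the app.py code), and it assumes the camera-to-world extrinsic convention this commit establishes:

```python
import numpy as np

def project_points_sketch(mask, depth, K_src, src_cam_to_world,
                          K_tgt, tgt_cam_to_world):
    """Sketch: unproject masked pixels to 3D, reproject into a target view.

    Assumes extrinsics are camera-to-world (Map Anything's convention)
    and OpenCV-style 3x3 intrinsics.
    """
    ys, xs = np.nonzero(mask > 127)   # furniture pixels (white = 255)
    z = depth[ys, xs]                 # metric depth at those pixels

    # Unproject pixels to source-camera coordinates
    X = (xs - K_src[0, 2]) * z / K_src[0, 0]
    Y = (ys - K_src[1, 2]) * z / K_src[1, 1]
    pts_cam = np.stack([X, Y, z, np.ones_like(z)], axis=1)  # (N, 4) homogeneous

    # Camera -> world: the extrinsic is used directly (it IS camera-to-world)
    pts_world = (src_cam_to_world @ pts_cam.T).T

    # World -> target camera: invert the target's camera-to-world pose
    pts_tgt = (np.linalg.inv(tgt_cam_to_world) @ pts_world.T).T[:, :3]

    # Perspective divide; points with tiny Z blow up here (see the Camera 1 issue)
    ok = pts_tgt[:, 2] > 1e-6
    pts_tgt = pts_tgt[ok]
    u = K_tgt[0, 0] * pts_tgt[:, 0] / pts_tgt[:, 2] + K_tgt[0, 2]
    v = K_tgt[1, 1] * pts_tgt[:, 1] / pts_tgt[:, 2] + K_tgt[1, 2]
    return u, v
```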

#### 2. `gradio_project_mask()` wrapper (lines ~1320-1400)
Handles Gradio UI input/output, creates overlay visualizations.

#### 3. "Project Mask" UI Tab
- Upload furniture mask
- Select source camera index (0, 1, or 2)
- Dilation radius slider
- Flip Y checkbox (for debugging)
- Shows: projected masks gallery + overlays on the original images
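
In Gradio terms, the tab wiring is roughly this shape (a hypothetical sketch with a stub handler, not the actual app.py layout):

```python
import gradio as gr

# Stub stand-in for the real gradio_project_mask() wrapper in app.py
def gradio_project_mask(mask, source_index, dilate_radius, flip_y):
    return [], []  # (projected mask gallery, overlay gallery)

with gr.Blocks() as demo:
    with gr.Tab("Project Mask"):
        mask_in = gr.Image(label="Furniture mask (white = furniture)", type="numpy")
        cam_idx = gr.Number(label="Source camera index", value=0, precision=0)
        dilate = gr.Slider(0, 20, value=5, step=1, label="Dilation radius")
        flip_y = gr.Checkbox(label="Flip Y (debug)")
        run = gr.Button("Project")
        masks_out = gr.Gallery(label="Projected masks")
        overlays_out = gr.Gallery(label="Overlays on original images")
        run.click(gradio_project_mask,
                  inputs=[mask_in, cam_idx, dilate, flip_y],
                  outputs=[masks_out, overlays_out])
```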

---

## Current Issue: Camera 1 Projection is Wrong

### Symptoms
- **Camera 0**: Projection looks mostly correct, furniture lands in the right area
- **Camera 1**: Projection is wildly off, with extreme pixel coordinates

### Debug Output Shows
```
Camera 0: proj_x range=[-20, 369], proj_y range=[116, 292] ✅ Good
Camera 1: proj_x range=[-41559, 9230], proj_y range=[96, 16719] ❌ Extreme!
```

### Root Cause Hypothesis
The extreme projection values for Camera 1 suggest:
1. Points transformed into Camera 1's coordinate system have very small Z values (near the camera plane)
2. Division by small Z causes the projection to blow up (see the example below)
3. Possibly Camera 1 is oriented very differently than expected
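
A quick numeric illustration of point 2: with plausible intrinsics, shrinking Z drives the projected coordinate to the same order of magnitude as the debug output above (illustrative numbers only, not from the actual scene):

```python
fx, cx = 300.0, 184.0       # plausible intrinsics for a ~368px-wide image
x_cam = -1.5                # point 1.5 m to the camera's left

for z in [2.0, 0.1, 0.01]:  # depth shrinking toward the camera plane
    u = fx * x_cam / z + cx
    print(f"Z={z:>5}: u={u:.0f}")
# Z=  2.0: u=-41       (reasonable)
# Z=  0.1: u=-4316     (off-screen)
# Z= 0.01: u=-44816    (the -41559 seen for Camera 1 is this regime)
```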

### Key Discovery: Scene Alignment Transform

Looking at `visual_util.py`, the GLB scene is aligned using:

```python
# Lines 496-501 in visual_util.py
initial_transformation = (
    np.linalg.inv(extrinsics_matrices[0])  # Inverse of camera 0
    @ opengl_conversion_matrix             # Flips Y and Z
    @ align_rotation
)
scene_3d.apply_transform(initial_transformation)
```

**This means the GLB is transformed to be viewed from Camera 0's perspective!**

The OpenGL conversion matrix flips Y and Z:
```python
matrix[1, 1] = -1  # Flip Y
matrix[2, 2] = -1  # Flip Z
```

### The Problem
Our projection code uses **raw extrinsics** from `predictions.npz`, but the GLB visualization applies additional transforms. This mismatch could explain why:
- Camera 0 works (it's the reference)
- Camera 1 doesn't (it needs the same transforms applied)

---

## Plan to Fix

### Option 1: Apply Same Transforms as GLB

In `project_mask_to_cameras()`, apply the scene alignment transform to the world points before projecting:

```python
# After unprojecting to world coordinates, apply GLB scene alignment
opengl_matrix = np.diag([1, -1, -1, 1])  # Flip Y and Z
scene_alignment = np.linalg.inv(extrinsics[0]) @ opengl_matrix
points_world_aligned = (scene_alignment @ points_world_homo.T).T[:, :3]
```

### Option 2: Use world_points Directly

Map Anything already computes `predictions["world_points"]`, which are in world coordinates. We could (see the sketch after this list):
1. Use `world_points` directly instead of unprojecting from depth
2. Mask which points belong to furniture
3. Then project to target cameras
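
A rough sketch of Option 2, assuming the `predictions.npz` shapes listed under Key Files below (hypothetical helper, untested):

```python
import numpy as np

def masked_world_points(predictions, mask, source_index):
    """Option 2 sketch: take furniture points straight from world_points.

    Skips depth unprojection entirely; assumes mask is H x W with
    white (255) marking furniture.
    """
    world_points = predictions["world_points"][source_index]      # (H, W, 3)
    valid = predictions["final_mask"][source_index].astype(bool)  # (H, W)
    furniture = (mask > 127) & valid
    return world_points[furniture]                                # (N, 3) world coords
```

The returned points would then go through the same world-to-camera projection step as before.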

### Option 3: Debug Camera Orientations

Add more debug output to understand the camera relationships:
- Print full extrinsic matrices (see the dump sketched below)
- Visualize camera frustums and the point cloud
- Verify cameras are positioned as expected
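
A possible dump for the first bullet, assuming the `predictions.npz` layout below and the camera-to-world reading of `extrinsic`:

```python
import numpy as np

preds = np.load("predictions.npz")
for i, ext in enumerate(preds["extrinsic"]):
    pos = ext[:3, 3]  # camera center in world coords (camera-to-world convention)
    fwd = ext[:3, 2]  # camera +Z (viewing direction) expressed in world coords
    print(f"Camera {i}: position={np.round(pos, 2)}, forward={np.round(fwd, 2)}")
    print(ext)        # full 4x4 matrix for manual inspection
```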

---

## Key Files

### Map Anything Repo
- `app.py` - Main Gradio app with projection functions
- `mapanything/utils/hf_utils/visual_util.py` - GLB creation and camera placement
- `mapanything/utils/geometry.py` - Depth unprojection math

### Data in predictions.npz
```python
predictions["depth"]        # (S, H, W, 1) - raw metric depth in meters
predictions["extrinsic"]    # (S, 4, 4) - camera-to-world transforms (not world-to-camera)
predictions["intrinsic"]    # (S, 3, 3) - camera intrinsics [fx,0,cx; 0,fy,cy; 0,0,1]
predictions["world_points"] # (S, H, W, 3) - 3D points in world coords
predictions["images"]       # (S, H, W, 3) or (S, 3, H, W) - input images
predictions["final_mask"]   # (S, H, W) - valid depth mask
```
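
One way to confirm the `extrinsic` convention empirically: unproject camera 0's depth through its extrinsic and compare against the stored `world_points`. A sketch assuming exactly the keys and shapes above; if `extrinsic` really is camera-to-world, the error should be near zero:

```python
import numpy as np

preds = np.load("predictions.npz")
depth = preds["depth"][0, :, :, 0]     # (H, W) metric depth for camera 0
K = preds["intrinsic"][0]
cam_to_world = preds["extrinsic"][0]   # camera-to-world, per the note above

H, W = depth.shape
u, v = np.meshgrid(np.arange(W), np.arange(H))
X = (u - K[0, 2]) * depth / K[0, 0]
Y = (v - K[1, 2]) * depth / K[1, 1]
pts_cam = np.stack([X, Y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
pts_world = pts_cam @ cam_to_world.T   # row-vector form of (cam_to_world @ p)

valid = preds["final_mask"][0].reshape(-1).astype(bool)
err = np.abs(pts_world[:, :3] - preds["world_points"][0].reshape(-1, 3))[valid]
print("max |unprojected - world_points| =", err.max())
```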

---

## Coordinate Systems

### OpenCV (Map Anything internal)
- X: Right
- Y: Down
- Z: Forward (into scene)
- Extrinsic: camera-to-world transform (not world-to-camera, despite OpenCV's usual convention)

### OpenGL/GLB Viewer
- X: Right
- Y: Up
- Z: Backward (out of screen)
- Conversion: flip Y and Z

### GLB Scene Alignment
The entire scene is transformed by:
```
inv(camera_0_extrinsic) @ opengl_conversion
```
This puts Camera 0 at origin looking down -Z.
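
For reference, converting a single camera-to-world pose between the two conventions is a right-multiplication by the axis flip (the standard conversion, not code from the repo):

```python
import numpy as np

# OpenCV camera axes: X right, Y down, Z forward
# OpenGL camera axes: X right, Y up,   Z backward
CV_TO_GL = np.diag([1.0, -1.0, -1.0, 1.0])  # flips the camera's Y and Z axes

def opencv_pose_to_opengl(cam_to_world_cv: np.ndarray) -> np.ndarray:
    """Re-express a camera-to-world pose with OpenGL camera axes.

    Right-multiplying remaps the camera's local axes while leaving
    the camera's position in the world unchanged.
    """
    return cam_to_world_cv @ CV_TO_GL
```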

---

## Next Steps

1. [ ] Print full camera extrinsic matrices to understand orientations
2. [ ] Try applying the scene alignment transform to the projection
3. [ ] Consider using `world_points` directly instead of unprojecting
4. [ ] Test if the issue is specific to certain camera arrangements
5. [ ] Once projection works, test end-to-end with Nano Banana Pro

---

## Commands

```bash
# Run Map Anything locally
cd /Users/joshcompanion/Documents/personal/map-anything
python app.py

# Push to HuggingFace
git add -A && git commit -m "message" && git push

# View logs
# https://huggingface.co/spaces/joshcompanion/map-anything/logs
```

---

## Related Project

The frontend test bed is at:
`/Users/joshcompanion/Documents/personal/staging-test/apps/web/src/components/PipelineTestBed.tsx`

But we're moving the projection logic to Map Anything (Python) for better accuracy.
app.py
CHANGED

@@ -55,18 +55,57 @@ def get_logo_base64():
     return None
 
 def get_camera_poses_json(predictions):
-    """Convert camera poses to JSON-serializable format…
+    """Convert camera poses to JSON-serializable format
+
+    NOTE: In Map Anything, 'extrinsic' is actually a camera-to-world transform,
+    NOT world-to-camera as is common in OpenCV. This is confirmed by geometry.py
+    line 98 which uses: pts3d_world = einsum(camera_pose, pts3d_cam_homo, ...)
+    """
     cameras = []
 
     if "extrinsic" in predictions and "intrinsic" in predictions:
-        extrinsic = predictions["extrinsic"]  # Shape: (S, 4, 4)
+        extrinsic = predictions["extrinsic"]  # Shape: (S, 4, 4) - camera-to-world!
         intrinsic = predictions["intrinsic"]  # Shape: (S, 3, 3)
 
         for i in range(len(extrinsic)):
+            ext = extrinsic[i]
+            intr = intrinsic[i]
+
+            # Extract camera position (translation part of camera-to-world)
+            # Since extrinsic IS camera-to-world, position is just the translation column
+            cam_pos = ext[:3, 3].tolist()
+
+            # Extract camera forward direction (Z axis in world coords)
+            # The rotation part of camera-to-world transforms camera Z to world
+            cam_forward = ext[:3, 2].tolist()  # Third column of rotation
+
+            # Calculate FOV from intrinsics
+            fx = float(intr[0, 0])
+            fy = float(intr[1, 1])
+            cx = float(intr[0, 2])
+            cy = float(intr[1, 2])
+            # Assume image width/height from principal point (cx, cy are typically at center)
+            width = int(cx * 2)
+            height = int(cy * 2)
+            fov_x = float(2 * np.arctan(width / (2 * fx)) * 180 / np.pi)
+            fov_y = float(2 * np.arctan(height / (2 * fy)) * 180 / np.pi)
+
             cameras.append({
                 "index": i,
-                "…
-                "…
+                "cameraName": f"cam_{i}",
+                "extrinsic": ext.tolist(),  # camera-to-world transform
+                "intrinsic": intr.tolist(),
+                "position": cam_pos,
+                "forward": cam_forward,
+                "fov": fov_y,  # Vertical FOV
+                "fov_x": fov_x,
+                "width": width,
+                "height": height,
+                "fx": fx,
+                "fy": fy,
+                "cx": cx,
+                "cy": cy,
+                "note": "extrinsic is camera-to-world (not world-to-camera)"
             })
 
     return json.dumps(cameras)

@@ -1235,8 +1274,10 @@ def project_mask_to_cameras(
     points_cam = np.stack([X_cam, Y_cam, Z_cam, np.ones_like(X_cam)], axis=1)
 
     # Transform to world coordinates
-    # extrinsic is …
-    …
+    # NOTE: In Map Anything, "extrinsic" is actually camera-to-world (not world-to-camera!)
+    # See geometry.py line 98: pts3d_world = einsum(camera_pose, pts3d_cam_homo, ...)
+    # So we use it directly, NOT its inverse
+    cam_to_world = source_extrinsic  # Already camera-to-world!
     points_world = (cam_to_world @ points_cam.T).T[:, :3]  # (N, 3)
 
     print(f"[Project Mask] World points range: X=[{points_world[:,0].min():.2f}, {points_world[:,0].max():.2f}], "

@@ -1253,11 +1294,13 @@ def project_mask_to_cameras(
             continue
 
         target_intrinsic = intrinsics[target_idx]
-        target_extrinsic = extrinsics[target_idx]  # …
+        target_extrinsic = extrinsics[target_idx]  # This is camera-to-world!
 
         # Transform world points to target camera coordinates
+        # Since extrinsic is camera-to-world, we need its inverse for world-to-camera
+        world_to_target_cam = np.linalg.inv(target_extrinsic)
         points_world_homo = np.hstack([points_world, np.ones((len(points_world), 1))])
+        points_target_cam = (world_to_target_cam @ points_world_homo.T).T[:, :3]  # (N, 3)
 
         # Debug: Print Z range before filtering
         print(f"[Project Mask] Camera {target_idx}: Z range in camera coords = [{points_target_cam[:, 2].min():.3f}, {points_target_cam[:, 2].max():.3f}]")
|