jcompanion committed
Commit 56015c7 · 1 Parent(s): 08b52d4

Fix: extrinsic is camera-to-world (not world-to-camera), update camera JSON output

Files changed (2):
  1. MULTI_ANGLE_STAGING.md +199 -0
  2. app.py +51 -8
MULTI_ANGLE_STAGING.md ADDED
# Multi-Angle Virtual Staging Pipeline

## Project Goal

Project furniture from a staged room image into 3D space, then reproject it onto other camera angles to generate masks for Nano Banana Pro inpainting.

```
Flow:
1. Upload 3 images (1 staged with furniture + 2 empty rooms, different angles)
2. Run Map Anything reconstruction → GLB + depth maps + camera poses
3. Extract furniture mask from staged image (white = furniture)
4. Project mask through 3D space to other camera views
5. Use projected masks with Nano Banana Pro to inpaint furniture into empty room photos
```

---

## What We Built (Dec 31, 2024)

### Added to `app.py`:

#### 1. `project_mask_to_cameras()` function (lines ~1127-1317)
Projects a furniture mask from one camera view to all other camera views using:
- Raw metric depth from `predictions.npz`
- Camera extrinsic matrices (4x4, camera-to-world; see the fix in this commit)
- Camera intrinsic matrices (3x3, fx/fy/cx/cy)

```python
def project_mask_to_cameras(
    mask_image: np.ndarray,    # Binary mask - white (255) = furniture
    source_camera_index: int,  # Which camera captured the staged image
    target_dir: str,           # Directory with predictions.npz
    dilate_radius: int = 5,    # Fills gaps in sparse projection
    flip_y: bool = False       # Debug option for Y-axis issues
) -> dict:
```
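The core math behind the function can be sketched independently: unproject masked pixels into the source camera frame using depth and intrinsics, lift them to world space with the camera-to-world extrinsic, then project into the target camera. This is a minimal standalone sketch, not the actual `app.py` code; the function name and argument layout are illustrative:

```python
import numpy as np

def unproject_reproject(uv, depth, K_src, E_src_c2w, K_tgt, E_tgt_c2w):
    """Lift pixels (u, v) with metric depth into 3D, then project into a target camera.

    uv: (N, 2) pixel coords; depth: (N,) metric depth;
    E_*_c2w: (4, 4) camera-to-world transforms (Map Anything convention).
    """
    fx, fy = K_src[0, 0], K_src[1, 1]
    cx, cy = K_src[0, 2], K_src[1, 2]
    # Pinhole unprojection into the source camera frame
    X = (uv[:, 0] - cx) * depth / fx
    Y = (uv[:, 1] - cy) * depth / fy
    pts_cam = np.stack([X, Y, depth, np.ones_like(depth)], axis=1)  # (N, 4) homogeneous
    # Camera -> world: extrinsic is used directly, NOT inverted
    pts_world = (E_src_c2w @ pts_cam.T).T
    # World -> target camera: invert the target's camera-to-world
    pts_tgt = (np.linalg.inv(E_tgt_c2w) @ pts_world.T).T[:, :3]
    # Perspective divide, then apply target intrinsics
    z = pts_tgt[:, 2]
    u = K_tgt[0, 0] * pts_tgt[:, 0] / z + K_tgt[0, 2]
    v = K_tgt[1, 1] * pts_tgt[:, 1] / z + K_tgt[1, 2]
    return np.stack([u, v], axis=1), z
```

A useful sanity check: with identical source and target cameras, the round trip must return the original pixel coordinates.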

#### 2. `gradio_project_mask()` wrapper (lines ~1320-1400)
Handles Gradio UI input/output and creates overlay visualizations.

#### 3. "Project Mask" UI Tab
- Upload furniture mask
- Select source camera index (0, 1, or 2)
- Dilation radius slider
- Flip Y checkbox (for debugging)
- Shows: projected masks gallery + overlay on original images

---

## Current Issue: Camera 1 Projection is Wrong

### Symptoms
- **Camera 0**: Projection looks mostly correct; furniture lands in the right area
- **Camera 1**: Projection is wildly off, with extreme pixel coordinates

### Debug Output Shows
```
Camera 0: proj_x range=[-20, 369], proj_y range=[116, 292] ✅ Good
Camera 1: proj_x range=[-41559, 9230], proj_y range=[96, 16719] ❌ Extreme!
```

### Root Cause Hypothesis
The extreme projection values for Camera 1 suggest:
1. Points transformed into Camera 1's coordinate system have very small Z values (near the camera plane)
2. Division by a small Z makes the projected coordinates blow up
3. Camera 1 may be oriented very differently than expected

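The blow-up from a near-zero Z is easy to reproduce, and guarding the perspective divide with a near-plane threshold keeps pixel coordinates bounded. A hedged sketch of such a guard (the function name and `z_near` default are illustrative, not from `app.py`):

```python
import numpy as np

def project_with_near_plane(points_cam, fx, fy, cx, cy, z_near=0.05):
    """Project camera-frame points, dropping anything closer than z_near.

    Points with tiny positive Z produce huge u/v after the divide, which matches
    the Camera 1 symptom (proj_x in the tens of thousands).
    """
    z = points_cam[:, 2]
    valid = z > z_near  # discards behind-camera points and near-plane points
    pts = points_cam[valid]
    u = fx * pts[:, 0] / pts[:, 2] + cx
    v = fy * pts[:, 1] / pts[:, 2] + cy
    return np.stack([u, v], axis=1), valid
```

The `valid` mask can also be logged per camera to confirm how many points are being rejected.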
### Key Discovery: Scene Alignment Transform

Looking at `visual_util.py`, the GLB scene is aligned using:

```python
# Line 496-501 in visual_util.py
initial_transformation = (
    np.linalg.inv(extrinsics_matrices[0])  # Inverse of camera 0
    @ opengl_conversion_matrix             # Flips Y and Z
    @ align_rotation
)
scene_3d.apply_transform(initial_transformation)
```

**This means the GLB is transformed to be viewed from Camera 0's perspective!**

The OpenGL conversion matrix flips Y and Z:
```python
matrix[1, 1] = -1  # Flip Y
matrix[2, 2] = -1  # Flip Z
```
89
+
90
+ ### The Problem
91
+ Our projection code uses **raw extrinsics** from `predictions.npz`, but the GLB visualization applies additional transforms. This mismatch could explain why:
92
+ - Camera 0 works (it's the reference)
93
+ - Camera 1 doesn't (needs the same transforms applied)
94
+
95
+ ---

## Plan to Fix

### Option 1: Apply Same Transforms as GLB

In `project_mask_to_cameras()`, apply the scene alignment transform to world points before projecting:

```python
# After unprojecting to world coordinates, apply GLB scene alignment
opengl_matrix = np.diag([1, -1, -1, 1])  # Flip Y and Z
scene_alignment = np.linalg.inv(extrinsics[0]) @ opengl_matrix
points_world_aligned = (scene_alignment @ points_world_homo.T).T[:, :3]
```

### Option 2: Use world_points Directly

Map Anything already computes `predictions["world_points"]`, which are already in world coordinates. We could:
1. Use `world_points` directly instead of unprojecting from depth
2. Mask which points belong to furniture
3. Project those points to the target cameras

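The point-selection step of Option 2 can be sketched as a boolean lookup into the precomputed array, skipping the unprojection entirely. A minimal sketch; the function name is illustrative, and the array names mirror the `predictions.npz` keys documented below:

```python
import numpy as np

def masked_world_points(world_points, mask, final_mask=None):
    """Select the 3D points for furniture pixels from precomputed world_points.

    world_points: (H, W, 3) for one view; mask: (H, W), truthy = furniture;
    final_mask: optional (H, W) validity mask, as in predictions["final_mask"].
    """
    keep = mask.astype(bool)
    if final_mask is not None:
        keep &= final_mask.astype(bool)  # drop pixels with invalid depth
    return world_points[keep]  # (N, 3)
```

The returned `(N, 3)` array would then go through the same world-to-target-camera projection as the unprojected points.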
### Option 3: Debug Camera Orientations

Add more debug output to understand the camera relationships:
- Print full extrinsic matrices
- Visualize camera frustums and the point cloud
- Verify cameras are positioned as expected

---

## Key Files

### Map Anything Repo
- `app.py` - Main Gradio app with projection functions
- `mapanything/utils/hf_utils/visual_util.py` - GLB creation and camera placement
- `mapanything/utils/geometry.py` - Depth unprojection math

### Data in predictions.npz
```python
predictions["depth"]         # (S, H, W, 1) - raw metric depth in meters
predictions["extrinsic"]     # (S, 4, 4) - camera-to-world transforms (see fix in app.py)
predictions["intrinsic"]     # (S, 3, 3) - camera intrinsics [fx,0,cx; 0,fy,cy; 0,0,1]
predictions["world_points"]  # (S, H, W, 3) - 3D points in world coords
predictions["images"]        # (S, H, W, 3) or (S, 3, H, W) - input images
predictions["final_mask"]    # (S, H, W) - valid depth mask
```
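Those shapes can be sanity-checked by round-tripping an `.npz` with the documented layout. This sketch writes a tiny dummy file; a real run would load the `predictions.npz` produced by reconstruction, and the temp path here is only illustrative:

```python
import os
import tempfile
import numpy as np

S, H, W = 3, 4, 6  # tiny stand-in for 3 views

# Dummy arrays with the documented shapes and dtypes
dummy = {
    "depth": np.ones((S, H, W, 1), dtype=np.float32),
    "extrinsic": np.stack([np.eye(4, dtype=np.float32)] * S),
    "intrinsic": np.stack([np.eye(3, dtype=np.float32)] * S),
    "world_points": np.zeros((S, H, W, 3), dtype=np.float32),
    "final_mask": np.ones((S, H, W), dtype=bool),
}
path = os.path.join(tempfile.mkdtemp(), "predictions.npz")
np.savez(path, **dummy)

# Load it back the same way the projection code would
predictions = np.load(path)
```

Checking shapes up front catches the common mistake of indexing `(S, 3, H, W)` images as `(S, H, W, 3)`.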

---

## Coordinate Systems

### OpenCV (Map Anything internal)
- X: Right
- Y: Down
- Z: Forward (into scene)
- Extrinsic: stored as camera-to-world (note: the classic OpenCV convention is world-to-camera)

### OpenGL/GLB Viewer
- X: Right
- Y: Up
- Z: Backward (out of screen)
- Conversion: flip Y and Z

### GLB Scene Alignment
The entire scene is transformed by:
```
inv(camera_0_extrinsic) @ opengl_conversion
```
This puts Camera 0 at the origin looking down -Z.

---
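Two pieces of that alignment can be verified numerically: the OpenGL conversion `diag([1, -1, -1, 1])` negates Y and Z, and `inv(E0)` maps camera 0's own position to the origin when `E0` is camera-to-world. A sketch with a made-up pose (the 30-degree yaw and translation are arbitrary test values):

```python
import numpy as np

opengl_conversion = np.diag([1.0, -1.0, -1.0, 1.0])  # flips Y and Z

# A made-up camera-to-world pose for camera 0: 30 deg yaw plus a translation
theta = np.deg2rad(30)
E0 = np.eye(4)
E0[:3, :3] = [[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]]
E0[:3, 3] = [1.0, 2.0, 3.0]

# inv(E0) moves camera 0's own position to the origin...
cam0_at_origin = (np.linalg.inv(E0) @ np.append(E0[:3, 3], 1.0))[:3]

# ...and the OpenGL conversion negates Y and Z of any homogeneous point
flipped = opengl_conversion @ np.array([1.0, 2.0, 3.0, 1.0])
```

If the projection code adopts the same alignment, these two invariants are cheap regression checks.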

## Next Steps

1. [ ] Print full camera extrinsic matrices to understand orientations
2. [ ] Try applying the scene alignment transform to the projection
3. [ ] Consider using `world_points` directly instead of unprojecting
4. [ ] Test whether the issue is specific to certain camera arrangements
5. [ ] Once projection works, test end-to-end with Nano Banana Pro

---

## Commands

```bash
# Run Map Anything locally
cd /Users/joshcompanion/Documents/personal/map-anything
python app.py

# Push to HuggingFace
git add -A && git commit -m "message" && git push

# View logs
# https://huggingface.co/spaces/joshcompanion/map-anything/logs
```

---

## Related Project

The frontend test bed is at:
`/Users/joshcompanion/Documents/personal/staging-test/apps/web/src/components/PipelineTestBed.tsx`

But we're moving the projection logic into Map Anything (Python) for better accuracy.
app.py CHANGED
@@ -55,18 +55,57 @@ def get_logo_base64():
     return None
 
 def get_camera_poses_json(predictions):
-    """Convert camera poses to JSON-serializable format"""
+    """Convert camera poses to JSON-serializable format
+
+    NOTE: In Map Anything, 'extrinsic' is actually a camera-to-world transform,
+    NOT world-to-camera as is common in OpenCV. This is confirmed by geometry.py
+    line 98 which uses: pts3d_world = einsum(camera_pose, pts3d_cam_homo, ...)
+    """
     cameras = []
 
     if "extrinsic" in predictions and "intrinsic" in predictions:
-        extrinsic = predictions["extrinsic"]  # Shape: (S, 4, 4)
+        extrinsic = predictions["extrinsic"]  # Shape: (S, 4, 4) - camera-to-world!
         intrinsic = predictions["intrinsic"]  # Shape: (S, 3, 3)
 
         for i in range(len(extrinsic)):
+            ext = extrinsic[i]
+            intr = intrinsic[i]
+
+            # Extract camera position (translation part of camera-to-world)
+            # Since extrinsic IS camera-to-world, position is just the translation column
+            cam_pos = ext[:3, 3].tolist()
+
+            # Extract camera forward direction (Z axis in world coords)
+            # The rotation part of camera-to-world transforms camera Z to world
+            cam_forward = ext[:3, 2].tolist()  # Third column of rotation
+
+            # Calculate FOV from intrinsics
+            fx = float(intr[0, 0])
+            fy = float(intr[1, 1])
+            cx = float(intr[0, 2])
+            cy = float(intr[1, 2])
+            # Assume image width/height from principal point (cx, cy are typically at center)
+            width = int(cx * 2)
+            height = int(cy * 2)
+            fov_x = float(2 * np.arctan(width / (2 * fx)) * 180 / np.pi)
+            fov_y = float(2 * np.arctan(height / (2 * fy)) * 180 / np.pi)
+
             cameras.append({
                 "index": i,
-                "extrinsic": extrinsic[i].tolist(),
-                "intrinsic": intrinsic[i].tolist(),
+                "cameraName": f"cam_{i}",
+                "extrinsic": ext.tolist(),  # camera-to-world transform
+                "intrinsic": intr.tolist(),
+                "position": cam_pos,
+                "forward": cam_forward,
+                "fov": fov_y,  # Vertical FOV
+                "fov_x": fov_x,
+                "width": width,
+                "height": height,
+                "fx": fx,
+                "fy": fy,
+                "cx": cx,
+                "cy": cy,
+                "note": "extrinsic is camera-to-world (not world-to-camera)"
             })
 
     return json.dumps(cameras)
@@ -1235,8 +1274,10 @@ def project_mask_to_cameras(
     points_cam = np.stack([X_cam, Y_cam, Z_cam, np.ones_like(X_cam)], axis=1)
 
     # Transform to world coordinates
-    # extrinsic is world-to-camera, so we need its inverse
-    cam_to_world = np.linalg.inv(source_extrinsic)
+    # NOTE: In Map Anything, "extrinsic" is actually camera-to-world (not world-to-camera!)
+    # See geometry.py line 98: pts3d_world = einsum(camera_pose, pts3d_cam_homo, ...)
+    # So we use it directly, NOT its inverse
+    cam_to_world = source_extrinsic  # Already camera-to-world!
     points_world = (cam_to_world @ points_cam.T).T[:, :3]  # (N, 3)
 
     print(f"[Project Mask] World points range: X=[{points_world[:,0].min():.2f}, {points_world[:,0].max():.2f}], "
@@ -1253,11 +1294,13 @@ def project_mask_to_cameras(
             continue
 
         target_intrinsic = intrinsics[target_idx]
-        target_extrinsic = extrinsics[target_idx]  # world-to-camera
+        target_extrinsic = extrinsics[target_idx]  # This is camera-to-world!
 
         # Transform world points to target camera coordinates
+        # Since extrinsic is camera-to-world, we need its inverse for world-to-camera
+        world_to_target_cam = np.linalg.inv(target_extrinsic)
         points_world_homo = np.hstack([points_world, np.ones((len(points_world), 1))])
-        points_target_cam = (target_extrinsic @ points_world_homo.T).T[:, :3]  # (N, 3)
+        points_target_cam = (world_to_target_cam @ points_world_homo.T).T[:, :3]  # (N, 3)
 
         # Debug: Print Z range before filtering
         print(f"[Project Mask] Camera {target_idx}: Z range in camera coords = [{points_target_cam[:, 2].min():.3f}, {points_target_cam[:, 2].max():.3f}]")
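The FOV arithmetic added to `get_camera_poses_json` can be spot-checked on its own: with the principal point at the image center, fx equal to half the width gives a 90-degree horizontal FOV. A standalone replica of that computation (not importing `app.py`; the function name is ours):

```python
import numpy as np

def fov_from_intrinsics(fx, fy, cx, cy):
    """Replicates the commit's FOV computation.

    Assumes the principal point (cx, cy) sits at the image center, so the
    image dimensions can be recovered as 2*cx and 2*cy.
    """
    width, height = int(cx * 2), int(cy * 2)
    fov_x = float(2 * np.arctan(width / (2 * fx)) * 180 / np.pi)
    fov_y = float(2 * np.arctan(height / (2 * fy)) * 180 / np.pi)
    return fov_x, fov_y, width, height
```

Note the centered-principal-point assumption: if Map Anything crops images asymmetrically, the recovered width/height (and hence the FOVs) would be off.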