jcompanion committed
Commit · 56015c7
Parent(s): 08b52d4

Fix: extrinsic is camera-to-world (not world-to-camera), update camera JSON output

- MULTI_ANGLE_STAGING.md +199 -0
- app.py +51 -8
MULTI_ANGLE_STAGING.md
ADDED
@@ -0,0 +1,199 @@
# Multi-Angle Virtual Staging Pipeline

## Project Goal

Project furniture from a staged room image into 3D space, then reproject onto other camera angles to generate masks for Nano Banana Pro inpainting.

```
Flow:
1. Upload 3 images (1 staged with furniture + 2 empty rooms, different angles)
2. Run Map Anything reconstruction → GLB + depth maps + camera poses
3. Extract furniture mask from staged image (white = furniture)
4. Project mask through 3D space to other camera views
5. Use projected masks with Nano Banana Pro to inpaint furniture into empty room photos
```

---

## What We Built (Dec 31, 2024)

### Added to `app.py`:

#### 1. `project_mask_to_cameras()` function (lines ~1127-1317)
Projects a furniture mask from one camera view to all other camera views using:
- Raw metric depth from `predictions.npz`
- Camera extrinsic matrices (4x4, camera-to-world)
- Camera intrinsic matrices (3x3, fx/fy/cx/cy)

```python
def project_mask_to_cameras(
    mask_image: np.ndarray,        # Binary mask - white (255) = furniture
    source_camera_index: int,      # Which camera captured the staged image
    target_dir: str,               # Directory with predictions.npz
    dilate_radius: int = 5,        # Fills gaps in sparse projection
    flip_y: bool = False           # Debug option for Y-axis issues
) -> dict:
```
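
To make the geometry concrete, here is a minimal sketch of the unproject → transform → reproject chain this function implements. The helper name and arguments are hypothetical (this is not the app.py code), and it assumes the camera-to-world extrinsic convention this commit establishes:

```python
import numpy as np

def project_points_sketch(mask, depth, K_src, src_cam_to_world,
                          K_tgt, tgt_cam_to_world):
    """Sketch: unproject masked pixels to 3D, reproject into a target view.

    Assumes extrinsics are camera-to-world (Map Anything's convention)
    and OpenCV-style 3x3 intrinsics.
    """
    ys, xs = np.nonzero(mask > 127)   # furniture pixels (white = 255)
    z = depth[ys, xs]                 # metric depth at those pixels

    # Unproject pixels to source-camera coordinates
    X = (xs - K_src[0, 2]) * z / K_src[0, 0]
    Y = (ys - K_src[1, 2]) * z / K_src[1, 1]
    pts_cam = np.stack([X, Y, z, np.ones_like(z)], axis=1)  # (N, 4) homogeneous

    # Camera -> world: the extrinsic is used directly (it IS camera-to-world)
    pts_world = (src_cam_to_world @ pts_cam.T).T

    # World -> target camera: invert the target's camera-to-world pose
    pts_tgt = (np.linalg.inv(tgt_cam_to_world) @ pts_world.T).T[:, :3]

    # Perspective divide; points with tiny Z blow up here (see the Camera 1 issue)
    ok = pts_tgt[:, 2] > 1e-6
    pts_tgt = pts_tgt[ok]
    u = K_tgt[0, 0] * pts_tgt[:, 0] / pts_tgt[:, 2] + K_tgt[0, 2]
    v = K_tgt[1, 1] * pts_tgt[:, 1] / pts_tgt[:, 2] + K_tgt[1, 2]
    return u, v
```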

#### 2. `gradio_project_mask()` wrapper (lines ~1320-1400)
Handles Gradio UI input/output, creates overlay visualizations.

#### 3. "Project Mask" UI Tab
- Upload furniture mask
- Select source camera index (0, 1, or 2)
- Dilation radius slider
- Flip Y checkbox (for debugging)
- Shows: projected masks gallery + overlays on the original images
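
In Gradio terms, the tab wiring is roughly this shape (a hypothetical sketch with a stub handler, not the actual app.py layout):

```python
import gradio as gr

# Stub stand-in for the real gradio_project_mask() wrapper in app.py
def gradio_project_mask(mask, source_index, dilate_radius, flip_y):
    return [], []  # (projected mask gallery, overlay gallery)

with gr.Blocks() as demo:
    with gr.Tab("Project Mask"):
        mask_in = gr.Image(label="Furniture mask (white = furniture)", type="numpy")
        cam_idx = gr.Number(label="Source camera index", value=0, precision=0)
        dilate = gr.Slider(0, 20, value=5, step=1, label="Dilation radius")
        flip_y = gr.Checkbox(label="Flip Y (debug)")
        run = gr.Button("Project")
        masks_out = gr.Gallery(label="Projected masks")
        overlays_out = gr.Gallery(label="Overlays on original images")
        run.click(gradio_project_mask,
                  inputs=[mask_in, cam_idx, dilate, flip_y],
                  outputs=[masks_out, overlays_out])
```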

---

## Current Issue: Camera 1 Projection is Wrong

### Symptoms
- **Camera 0**: Projection looks mostly correct, furniture lands in the right area
- **Camera 1**: Projection is wildly off, with extreme pixel coordinates

### Debug Output Shows
```
Camera 0: proj_x range=[-20, 369], proj_y range=[116, 292] ✅ Good
Camera 1: proj_x range=[-41559, 9230], proj_y range=[96, 16719] ❌ Extreme!
```

### Root Cause Hypothesis
The extreme projection values for Camera 1 suggest:
1. Points transformed into Camera 1's coordinate system have very small Z values (near the camera plane)
2. Division by small Z causes the projection to blow up (see the example below)
3. Possibly Camera 1 is oriented very differently than expected
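
A quick numeric illustration of point 2: with plausible intrinsics, shrinking Z drives the projected coordinate to the same order of magnitude as the debug output above (illustrative numbers only, not from the actual scene):

```python
fx, cx = 300.0, 184.0       # plausible intrinsics for a ~368px-wide image
x_cam = -1.5                # point 1.5 m to the camera's left

for z in [2.0, 0.1, 0.01]:  # depth shrinking toward the camera plane
    u = fx * x_cam / z + cx
    print(f"Z={z:>5}: u={u:.0f}")
# Z=  2.0: u=-41       (reasonable)
# Z=  0.1: u=-4316     (off-screen)
# Z= 0.01: u=-44816    (the -41559 seen for Camera 1 is this regime)
```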

### Key Discovery: Scene Alignment Transform

Looking at `visual_util.py`, the GLB scene is aligned using:

```python
# Lines 496-501 in visual_util.py
initial_transformation = (
    np.linalg.inv(extrinsics_matrices[0])  # Inverse of camera 0
    @ opengl_conversion_matrix             # Flips Y and Z
    @ align_rotation
)
scene_3d.apply_transform(initial_transformation)
```

**This means the GLB is transformed to be viewed from Camera 0's perspective!**

The OpenGL conversion matrix flips Y and Z:
```python
matrix[1, 1] = -1  # Flip Y
matrix[2, 2] = -1  # Flip Z
```

### The Problem
Our projection code uses **raw extrinsics** from `predictions.npz`, but the GLB visualization applies additional transforms. This mismatch could explain why:
- Camera 0 works (it's the reference)
- Camera 1 doesn't (it needs the same transforms applied)

---

## Plan to Fix

### Option 1: Apply Same Transforms as GLB

In `project_mask_to_cameras()`, apply the scene alignment transform to the world points before projecting:

```python
# After unprojecting to world coordinates, apply GLB scene alignment
opengl_matrix = np.diag([1, -1, -1, 1])  # Flip Y and Z
scene_alignment = np.linalg.inv(extrinsics[0]) @ opengl_matrix
points_world_aligned = (scene_alignment @ points_world_homo.T).T[:, :3]
```

### Option 2: Use world_points Directly

Map Anything already computes `predictions["world_points"]`, which are in world coordinates. We could (see the sketch after this list):
1. Use `world_points` directly instead of unprojecting from depth
2. Mask which points belong to furniture
3. Then project to target cameras
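
A rough sketch of Option 2, assuming the `predictions.npz` shapes listed under Key Files below (hypothetical helper, untested):

```python
import numpy as np

def masked_world_points(predictions, mask, source_index):
    """Option 2 sketch: take furniture points straight from world_points.

    Skips depth unprojection entirely; assumes mask is H x W with
    white (255) marking furniture.
    """
    world_points = predictions["world_points"][source_index]      # (H, W, 3)
    valid = predictions["final_mask"][source_index].astype(bool)  # (H, W)
    furniture = (mask > 127) & valid
    return world_points[furniture]                                # (N, 3) world coords
```

The returned points would then go through the same world-to-camera projection step as before.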

### Option 3: Debug Camera Orientations

Add more debug output to understand the camera relationships:
- Print full extrinsic matrices (see the dump sketched below)
- Visualize camera frustums and the point cloud
- Verify cameras are positioned as expected
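
A possible dump for the first bullet, assuming the `predictions.npz` layout below and the camera-to-world reading of `extrinsic`:

```python
import numpy as np

preds = np.load("predictions.npz")
for i, ext in enumerate(preds["extrinsic"]):
    pos = ext[:3, 3]  # camera center in world coords (camera-to-world convention)
    fwd = ext[:3, 2]  # camera +Z (viewing direction) expressed in world coords
    print(f"Camera {i}: position={np.round(pos, 2)}, forward={np.round(fwd, 2)}")
    print(ext)        # full 4x4 matrix for manual inspection
```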

---

## Key Files

### Map Anything Repo
- `app.py` - Main Gradio app with projection functions
- `mapanything/utils/hf_utils/visual_util.py` - GLB creation and camera placement
- `mapanything/utils/geometry.py` - Depth unprojection math

### Data in predictions.npz
```python
predictions["depth"]        # (S, H, W, 1) - raw metric depth in meters
predictions["extrinsic"]    # (S, 4, 4) - camera-to-world transforms (not world-to-camera)
predictions["intrinsic"]    # (S, 3, 3) - camera intrinsics [fx,0,cx; 0,fy,cy; 0,0,1]
predictions["world_points"] # (S, H, W, 3) - 3D points in world coords
predictions["images"]       # (S, H, W, 3) or (S, 3, H, W) - input images
predictions["final_mask"]   # (S, H, W) - valid depth mask
```
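
One way to confirm the `extrinsic` convention empirically: unproject camera 0's depth through its extrinsic and compare against the stored `world_points`. A sketch assuming exactly the keys and shapes above; if `extrinsic` really is camera-to-world, the error should be near zero:

```python
import numpy as np

preds = np.load("predictions.npz")
depth = preds["depth"][0, :, :, 0]     # (H, W) metric depth for camera 0
K = preds["intrinsic"][0]
cam_to_world = preds["extrinsic"][0]   # camera-to-world, per the note above

H, W = depth.shape
u, v = np.meshgrid(np.arange(W), np.arange(H))
X = (u - K[0, 2]) * depth / K[0, 0]
Y = (v - K[1, 2]) * depth / K[1, 1]
pts_cam = np.stack([X, Y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
pts_world = pts_cam @ cam_to_world.T   # row-vector form of (cam_to_world @ p)

valid = preds["final_mask"][0].reshape(-1).astype(bool)
err = np.abs(pts_world[:, :3] - preds["world_points"][0].reshape(-1, 3))[valid]
print("max |unprojected - world_points| =", err.max())
```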

---

## Coordinate Systems

### OpenCV (Map Anything internal)
- X: Right
- Y: Down
- Z: Forward (into scene)
- Extrinsic: camera-to-world transform (not world-to-camera, despite OpenCV's usual convention)

### OpenGL/GLB Viewer
- X: Right
- Y: Up
- Z: Backward (out of screen)
- Conversion: flip Y and Z

### GLB Scene Alignment
The entire scene is transformed by:
```
inv(camera_0_extrinsic) @ opengl_conversion
```
This puts Camera 0 at origin looking down -Z.
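
For reference, converting a single camera-to-world pose between the two conventions is a right-multiplication by the axis flip (the standard conversion, not code from the repo):

```python
import numpy as np

# OpenCV camera axes: X right, Y down, Z forward
# OpenGL camera axes: X right, Y up,   Z backward
CV_TO_GL = np.diag([1.0, -1.0, -1.0, 1.0])  # flips the camera's Y and Z axes

def opencv_pose_to_opengl(cam_to_world_cv: np.ndarray) -> np.ndarray:
    """Re-express a camera-to-world pose with OpenGL camera axes.

    Right-multiplying remaps the camera's local axes while leaving
    the camera's position in the world unchanged.
    """
    return cam_to_world_cv @ CV_TO_GL
```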

---

## Next Steps

1. [ ] Print full camera extrinsic matrices to understand orientations
2. [ ] Try applying the scene alignment transform to the projection
3. [ ] Consider using `world_points` directly instead of unprojecting
4. [ ] Test if the issue is specific to certain camera arrangements
5. [ ] Once projection works, test end-to-end with Nano Banana Pro

---

## Commands

```bash
# Run Map Anything locally
cd /Users/joshcompanion/Documents/personal/map-anything
python app.py

# Push to HuggingFace
git add -A && git commit -m "message" && git push

# View logs
# https://huggingface.co/spaces/joshcompanion/map-anything/logs
```

---

## Related Project

The frontend test bed is at:
`/Users/joshcompanion/Documents/personal/staging-test/apps/web/src/components/PipelineTestBed.tsx`

But we're moving the projection logic to Map Anything (Python) for better accuracy.
app.py
CHANGED

@@ -55,18 +55,57 @@ def get_logo_base64():
     return None
 
 def get_camera_poses_json(predictions):
-    """Convert camera poses to JSON-serializable format…
+    """Convert camera poses to JSON-serializable format
+
+    NOTE: In Map Anything, 'extrinsic' is actually a camera-to-world transform,
+    NOT world-to-camera as is common in OpenCV. This is confirmed by geometry.py
+    line 98 which uses: pts3d_world = einsum(camera_pose, pts3d_cam_homo, ...)
+    """
     cameras = []
 
     if "extrinsic" in predictions and "intrinsic" in predictions:
-        extrinsic = predictions["extrinsic"]  # Shape: (S, 4, 4)
+        extrinsic = predictions["extrinsic"]  # Shape: (S, 4, 4) - camera-to-world!
         intrinsic = predictions["intrinsic"]  # Shape: (S, 3, 3)
 
         for i in range(len(extrinsic)):
+            ext = extrinsic[i]
+            intr = intrinsic[i]
+
+            # Extract camera position (translation part of camera-to-world)
+            # Since extrinsic IS camera-to-world, position is just the translation column
+            cam_pos = ext[:3, 3].tolist()
+
+            # Extract camera forward direction (Z axis in world coords)
+            # The rotation part of camera-to-world transforms camera Z to world
+            cam_forward = ext[:3, 2].tolist()  # Third column of rotation
+
+            # Calculate FOV from intrinsics
+            fx = float(intr[0, 0])
+            fy = float(intr[1, 1])
+            cx = float(intr[0, 2])
+            cy = float(intr[1, 2])
+            # Assume image width/height from principal point (cx, cy are typically at center)
+            width = int(cx * 2)
+            height = int(cy * 2)
+            fov_x = float(2 * np.arctan(width / (2 * fx)) * 180 / np.pi)
+            fov_y = float(2 * np.arctan(height / (2 * fy)) * 180 / np.pi)
+
             cameras.append({
                 "index": i,
-                "…
-                "…
+                "cameraName": f"cam_{i}",
+                "extrinsic": ext.tolist(),  # camera-to-world transform
+                "intrinsic": intr.tolist(),
+                "position": cam_pos,
+                "forward": cam_forward,
+                "fov": fov_y,  # Vertical FOV
+                "fov_x": fov_x,
+                "width": width,
+                "height": height,
+                "fx": fx,
+                "fy": fy,
+                "cx": cx,
+                "cy": cy,
+                "note": "extrinsic is camera-to-world (not world-to-camera)"
             })
 
     return json.dumps(cameras)

@@ -1235,8 +1274,10 @@ def project_mask_to_cameras(
     points_cam = np.stack([X_cam, Y_cam, Z_cam, np.ones_like(X_cam)], axis=1)
 
     # Transform to world coordinates
-    # extrinsic is …
-    …
+    # NOTE: In Map Anything, "extrinsic" is actually camera-to-world (not world-to-camera!)
+    # See geometry.py line 98: pts3d_world = einsum(camera_pose, pts3d_cam_homo, ...)
+    # So we use it directly, NOT its inverse
+    cam_to_world = source_extrinsic  # Already camera-to-world!
     points_world = (cam_to_world @ points_cam.T).T[:, :3]  # (N, 3)
 
     print(f"[Project Mask] World points range: X=[{points_world[:,0].min():.2f}, {points_world[:,0].max():.2f}], "

@@ -1253,11 +1294,13 @@ def project_mask_to_cameras(
             continue
 
         target_intrinsic = intrinsics[target_idx]
-        target_extrinsic = extrinsics[target_idx]  # …
+        target_extrinsic = extrinsics[target_idx]  # This is camera-to-world!
 
         # Transform world points to target camera coordinates
+        # Since extrinsic is camera-to-world, we need its inverse for world-to-camera
+        world_to_target_cam = np.linalg.inv(target_extrinsic)
         points_world_homo = np.hstack([points_world, np.ones((len(points_world), 1))])
+        points_target_cam = (world_to_target_cam @ points_world_homo.T).T[:, :3]  # (N, 3)
 
         # Debug: Print Z range before filtering
         print(f"[Project Mask] Camera {target_idx}: Z range in camera coords = [{points_target_cam[:, 2].min():.3f}, {points_target_cam[:, 2].max():.3f}]")
|