Argus-3D
Class-agnostic 3D bounding box detection on a frozen EUPE-ViT-B backbone. Given a posed RGB image and camera intrinsics, returns 7-DoF boxes (cx, cy, cz, w, h, d, theta) for the objects in the scene.
The detector pairs a non-linear (or linear) per-patch foreground head with feature-dim discovery for depth, k-means clustering for size priors derived from 20 CA-1M val scenes, and 3D-IoU-based multi-view fusion. No 2D bounding boxes, no class labels, no segmentation map as final output. Camera-frame 3D boxes only.
Architecture
Per-frame:
Image (1024x1024)
-> EUPE-ViT-B (frozen, reused from phanerozoic/argus)
-> patch tokens (4096, 768) on a 64x64 grid
-> instance head: 2-layer MLP (default) or linear ridge -> per-patch foreground score
depth head: ridge over 768 dims -> per-patch metric depth (m)
k-means modes: 8 cluster centers (20-scene) -> per-patch object-type assignment
-> threshold instance score, upsample mask to 1024x1024, connected components
-> for each component:
unproject to 3D using depth + K
DBSCAN-split for instance separation
PCA-on-xz for yaw, percentile extents for (w, h, d)
blend extents toward the matched cluster's size prior
-> camera-frame 7-DoF box list
Multi-view (per scene):
-> transform every per-frame box to world frame using camera RT
-> 3D-IoU-based clustering: union-find with edges where iou_3d_zup(box_i, box_j) > 0.2
-> filter clusters by min_obs and total inlier weight (per-cluster confidence)
-> per-cluster: weighted-median fuse to single 7-DoF box
Components
| Component | Parameters | Discovery / training |
|---|---|---|
| EUPE-ViT-B backbone (frozen, reused) | not part of this head | reused from phanerozoic/argus, run at 1024x1024 input (64x64 patch grid) |
| Instance head β MLP (default) | ~200 K | 2-layer MLP (768 β 256 β 1, ReLU + dropout 0.5), 30 epochs AdamW BCE-with-logits on 20-scene patch split at 1024 input |
| Instance head β linear ridge (fallback) | 769 floats + threshold | random K=20 subset search + hard-neg mining, AUC selection |
| Depth head (ridge over 768 dims) | 769 floats | random K=20 subset search, RMSE selection |
| K-means cluster centers | 8 Γ 768 floats | MiniBatchKMeans on foreground patches across 20 CA-1M val scenes |
| Per-cluster size priors (w, h, d) | 8 Γ 3 floats | median of observed extents per mode (8 800 instances aggregated) |
| OBB fitter (PCA + percentile + Tikhonov) | 0 | closed-form |
| Multi-view fusion (3D-IoU union-find + weighted median) | 0 | closed-form |
| Total head footprint (MLP) | ~200 K params / ~830 KB |
File layout
instance_head_mlp.safetensors # MLP fc1/fc2 weights
instance_head_mlp_meta.json # MLP config + threshold
instance_head.safetensors # linear ridge: dims + coef + intercept + threshold
depth_head.safetensors # depth ridge: dims + coef + intercept
size_priors.safetensors # 8 cluster centers + 8 (w, h, d) priors (20-scene)
config.json # input_res, patch_grid, prior_weight, fusion_iou, etc.
argus_3d.py # Argus3D class
infer.py # CLI dispatcher
Usage
from argus_3d import Argus3D
import numpy as np
model = Argus3D.from_pretrained("phanerozoic/argus-3d", device="cuda")
K = np.array([[850, 0, 395], [0, 850, 510], [0, 0, 1]])
boxes = model.detect("room.jpg", K) # list of Box3D
boxes = model.detect("room.jpg", K, depth=d) # supply RGBD sensor depth
out = model.perceive("room.jpg", K) # fg score map + depth map + boxes
for b in boxes:
print(b.cx, b.cy, b.cz, b.w, b.h, b.d, b.theta)
Eval
CA-1M val sequence ca1m-val-45662921. Class-agnostic per-scene 3D IoU after multi-view fusion across 284 frames (stride-4 sampling of 1135 total). The head produces its own instance hypotheses; no ground-truth 2D bounding boxes are used. Sensor depth is supplied; the discovered depth head can be used in its place.
mAP @ IoU thresholds (Boxer-comparable)
Class-agnostic AP at the same IoU thresholds Boxer reports:
| Threshold | Strict default (44 boxes) | Loose filter (246 boxes) |
|---|---|---|
| AP @ 0.05 IoU | 0.144 | 0.246 |
| AP @ 0.10 IoU | 0.116 | 0.172 |
| AP @ 0.15 IoU | 0.075 | 0.128 |
| AP @ 0.25 IoU | 0.035 | 0.049 |
| AP @ 0.50 IoU | 0.004 | 0.001 |
| mAP @ [0.05, 0.5] | 0.075 | 0.119 |
Boxer reports 0.43 mAP at IoU [0.05, 0.5] on CA-1M with GT 2D bounding boxes plus RGBD as input plus a learned 2Dβ3D lifting head. Our pipeline runs class-agnostic without GT 2D boxes and the OBB fitter is closed-form (PCA + percentile). The ~4Γ gap at AP @ 0.5 IoU is dominated by the percentile OBB fit β point-cloud percentile extents cap the high-IoU tail.
Headline result
1024-input MLP head + 20-scene size priors + 3D-IoU multi-view fusion + iterative refinement:
| Metric | Linear ridge (v0) + 768 + position fusion | MLP + 768 + IoU fusion | MLP + 1024 + IoU fusion (default) |
|---|---|---|---|
| Mean 3D IoU | 0.063 | 0.177 | 0.216 |
| Median 3D IoU | β | 0.146 | 0.175 |
| Fraction > 0.1 IoU | 27.8 % | 60.9 % | 68.2 % |
| Fraction > 0.25 IoU | 6.9 % | 34.8 % | 36.4 % |
| Fraction > 0.5 IoU | 0.0 % | 4.3 % | 9.1 % |
| Recall (matched / GT) | 19.8 % | 18.0 % | 17.5 % |
| Fused boxes per scene | 72 | 46 | 44 |
Default config: 1024 input resolution (64Γ64 patch grid), MLP head trained at 1024 features on 20 scenes, 20-scene size priors derived from foreground patches at 1024 resolution, 3D-IoU multi-view fusion at threshold 0.4 with iterative refinement (drop observations with IoU < 0.4 to the cluster consensus, re-fuse, up to 3 iterations). Per-cluster filter is min 4 observations and weight floor at the 80th percentile of nonzero cluster weights.
Resolution and fusion ablations (all on ca1m-val-45662921):
| Variant | Mean IoU | > 0.25 IoU | > 0.5 IoU | Recall |
|---|---|---|---|---|
| Linear ridge (v0) + position fusion at 768 | 0.063 | 6.9 % | 0.0 % | 19.8 % |
| MLP + position fusion at 768 | 0.076 | 11.5 % | 1.3 % | 24.4 % |
| MLP + IoU fusion (thresh 0.2) at 768 | 0.158 | 28.2 % | 2.8 % | 28.1 % |
| MLP + IoU fusion (thresh 0.4) at 768 | 0.177 | 34.8 % | 4.3 % | 18.0 % |
| MLP + IoU fusion at 1024 (default) | 0.216 | 36.4 % | 9.1 % | 17.5 % |
Per-stage discovery metrics (in-distribution random patch split, 4 scenes mixed)
| Discovery output | Metric | Linear ridge | MLP |
|---|---|---|---|
| Instance head, per-patch foreground | AUC | 0.860 | 0.980 |
| Instance head, per-patch foreground | F1 (tuned threshold) | 0.569 | 0.815 |
| Depth head, foreground patches in 0.1-3 m | RMSE | 0.190 m | 0.133 m (MLP variant) |
| Depth head, foreground patches in 0.1-3 m | delta1 (1.25Γ ratio) | 0.919 | 0.974 (MLP variant) |
Cross-scene held-out
The numbers above use a random patch split that mixes patches from 4 cached scenes. That setup overstates generalization because adjacent patches in the same room share scene-specific cues. A leave-one-scene-out eval gives the honest cross-scene head AUC:
| Head | In-distribution AUC | Cross-scene AUC | Cross-scene F1 |
|---|---|---|---|
| Linear ridge (all-768) | 0.860 | 0.604 | 0.301 |
| MLP (3 train scenes) | 0.980 | 0.566 (fold 45662921) | 0.312 |
| MLP (19 train scenes) | 0.980 | 0.780 (45662921 held out) | 0.465 |
Scaling the train set from 3 to 19 scenes lifts cross-scene AUC by 0.21 (0.566 β 0.780).
End-to-end cross-scene mean IoU on 45662921 with the 19-scene MLP head:
| Pipeline | Mean IoU | > 0.25 IoU | Recall |
|---|---|---|---|
| Position-only fusion | 0.046 | 1.7 % | 18.0 % |
| 3D-IoU fusion (this work) | 0.112 | 16.5 % | 28.1 % |
The 3D-IoU multi-view fusion improvement transfers to the cross-scene setting: 2.4Γ lift in mean IoU (0.046 β 0.112) and an order-of-magnitude lift in > 0.25 IoU. Per-frame box quality limits per-scene IoU on held-out scenes; the 0.20 in-distribution-to-cross-scene gap (0.154 vs 0.112 mean IoU) is the head still leaning on per-scene cues for box localization.
Backbone
EUPE-ViT-B from Meta FAIR (arXiv:2603.22387) via phanerozoic/argus. The backbone is frozen and not modified by this repo.
License
FAIR Research License (non-commercial), inherited via the EUPE-ViT-B backbone.
- Downloads last month
- 134