Argus-3D

Class-agnostic 3D bounding box detection on a frozen EUPE-ViT-B backbone. Given a posed RGB image and camera intrinsics, returns 7-DoF boxes (cx, cy, cz, w, h, d, theta) for the objects in the scene.

The detector pairs a non-linear (or linear) per-patch foreground head with feature-dim discovery for depth, k-means clustering for size priors derived from 20 CA-1M val scenes, and 3D-IoU-based multi-view fusion. No 2D bounding boxes, no class labels, no segmentation map as final output. Camera-frame 3D boxes only.

Architecture

Per-frame:
  Image (1024x1024)
    -> EUPE-ViT-B (frozen, reused from phanerozoic/argus)
    -> patch tokens (4096, 768) on a 64x64 grid
    -> instance head: 2-layer MLP (default) or linear ridge -> per-patch foreground score
       depth head:    ridge over 768 dims                   -> per-patch metric depth (m)
       k-means modes: 8 cluster centers (20-scene)          -> per-patch object-type assignment
    -> threshold instance score, upsample mask to 1024x1024, connected components
    -> for each component:
         unproject to 3D using depth + K
         DBSCAN-split for instance separation
         PCA-on-xz for yaw, percentile extents for (w, h, d)
         blend extents toward the matched cluster's size prior
    -> camera-frame 7-DoF box list

Multi-view (per scene):
  -> transform every per-frame box to world frame using camera RT
  -> 3D-IoU-based clustering: union-find with edges where iou_3d_zup(box_i, box_j) > 0.2
  -> filter clusters by min_obs and total inlier weight (per-cluster confidence)
  -> per-cluster: weighted-median fuse to single 7-DoF box

Components

Component Parameters Discovery / training
EUPE-ViT-B backbone (frozen, reused) not part of this head reused from phanerozoic/argus, run at 1024x1024 input (64x64 patch grid)
Instance head β€” MLP (default) ~200 K 2-layer MLP (768 β†’ 256 β†’ 1, ReLU + dropout 0.5), 30 epochs AdamW BCE-with-logits on 20-scene patch split at 1024 input
Instance head β€” linear ridge (fallback) 769 floats + threshold random K=20 subset search + hard-neg mining, AUC selection
Depth head (ridge over 768 dims) 769 floats random K=20 subset search, RMSE selection
K-means cluster centers 8 Γ— 768 floats MiniBatchKMeans on foreground patches across 20 CA-1M val scenes
Per-cluster size priors (w, h, d) 8 Γ— 3 floats median of observed extents per mode (8 800 instances aggregated)
OBB fitter (PCA + percentile + Tikhonov) 0 closed-form
Multi-view fusion (3D-IoU union-find + weighted median) 0 closed-form
Total head footprint (MLP) ~200 K params / ~830 KB

File layout

instance_head_mlp.safetensors   # MLP fc1/fc2 weights
instance_head_mlp_meta.json     # MLP config + threshold
instance_head.safetensors       # linear ridge: dims + coef + intercept + threshold
depth_head.safetensors          # depth ridge: dims + coef + intercept
size_priors.safetensors         # 8 cluster centers + 8 (w, h, d) priors (20-scene)
config.json                     # input_res, patch_grid, prior_weight, fusion_iou, etc.
argus_3d.py                     # Argus3D class
infer.py                        # CLI dispatcher

Usage

from argus_3d import Argus3D
import numpy as np

model = Argus3D.from_pretrained("phanerozoic/argus-3d", device="cuda")

K = np.array([[850, 0, 395], [0, 850, 510], [0, 0, 1]])
boxes = model.detect("room.jpg", K)               # list of Box3D
boxes = model.detect("room.jpg", K, depth=d)      # supply RGBD sensor depth
out   = model.perceive("room.jpg", K)             # fg score map + depth map + boxes

for b in boxes:
    print(b.cx, b.cy, b.cz, b.w, b.h, b.d, b.theta)

Eval

CA-1M val sequence ca1m-val-45662921. Class-agnostic per-scene 3D IoU after multi-view fusion across 284 frames (stride-4 sampling of 1135 total). The head produces its own instance hypotheses; no ground-truth 2D bounding boxes are used. Sensor depth is supplied; the discovered depth head can be used in its place.

mAP @ IoU thresholds (Boxer-comparable)

Class-agnostic AP at the same IoU thresholds Boxer reports:

Threshold Strict default (44 boxes) Loose filter (246 boxes)
AP @ 0.05 IoU 0.144 0.246
AP @ 0.10 IoU 0.116 0.172
AP @ 0.15 IoU 0.075 0.128
AP @ 0.25 IoU 0.035 0.049
AP @ 0.50 IoU 0.004 0.001
mAP @ [0.05, 0.5] 0.075 0.119

Boxer reports 0.43 mAP at IoU [0.05, 0.5] on CA-1M with GT 2D bounding boxes plus RGBD as input plus a learned 2D→3D lifting head. Our pipeline runs class-agnostic without GT 2D boxes and the OBB fitter is closed-form (PCA + percentile). The ~4× gap at AP @ 0.5 IoU is dominated by the percentile OBB fit — point-cloud percentile extents cap the high-IoU tail.

Headline result

1024-input MLP head + 20-scene size priors + 3D-IoU multi-view fusion + iterative refinement:

Metric Linear ridge (v0) + 768 + position fusion MLP + 768 + IoU fusion MLP + 1024 + IoU fusion (default)
Mean 3D IoU 0.063 0.177 0.216
Median 3D IoU β€” 0.146 0.175
Fraction > 0.1 IoU 27.8 % 60.9 % 68.2 %
Fraction > 0.25 IoU 6.9 % 34.8 % 36.4 %
Fraction > 0.5 IoU 0.0 % 4.3 % 9.1 %
Recall (matched / GT) 19.8 % 18.0 % 17.5 %
Fused boxes per scene 72 46 44

Default config: 1024 input resolution (64Γ—64 patch grid), MLP head trained at 1024 features on 20 scenes, 20-scene size priors derived from foreground patches at 1024 resolution, 3D-IoU multi-view fusion at threshold 0.4 with iterative refinement (drop observations with IoU < 0.4 to the cluster consensus, re-fuse, up to 3 iterations). Per-cluster filter is min 4 observations and weight floor at the 80th percentile of nonzero cluster weights.

Resolution and fusion ablations (all on ca1m-val-45662921):

Variant Mean IoU > 0.25 IoU > 0.5 IoU Recall
Linear ridge (v0) + position fusion at 768 0.063 6.9 % 0.0 % 19.8 %
MLP + position fusion at 768 0.076 11.5 % 1.3 % 24.4 %
MLP + IoU fusion (thresh 0.2) at 768 0.158 28.2 % 2.8 % 28.1 %
MLP + IoU fusion (thresh 0.4) at 768 0.177 34.8 % 4.3 % 18.0 %
MLP + IoU fusion at 1024 (default) 0.216 36.4 % 9.1 % 17.5 %

Per-stage discovery metrics (in-distribution random patch split, 4 scenes mixed)

Discovery output Metric Linear ridge MLP
Instance head, per-patch foreground AUC 0.860 0.980
Instance head, per-patch foreground F1 (tuned threshold) 0.569 0.815
Depth head, foreground patches in 0.1-3 m RMSE 0.190 m 0.133 m (MLP variant)
Depth head, foreground patches in 0.1-3 m delta1 (1.25Γ— ratio) 0.919 0.974 (MLP variant)

Cross-scene held-out

The numbers above use a random patch split that mixes patches from 4 cached scenes. That setup overstates generalization because adjacent patches in the same room share scene-specific cues. A leave-one-scene-out eval gives the honest cross-scene head AUC:

Head In-distribution AUC Cross-scene AUC Cross-scene F1
Linear ridge (all-768) 0.860 0.604 0.301
MLP (3 train scenes) 0.980 0.566 (fold 45662921) 0.312
MLP (19 train scenes) 0.980 0.780 (45662921 held out) 0.465

Scaling the train set from 3 to 19 scenes lifts cross-scene AUC by 0.21 (0.566 β†’ 0.780).

End-to-end cross-scene mean IoU on 45662921 with the 19-scene MLP head:

Pipeline Mean IoU > 0.25 IoU Recall
Position-only fusion 0.046 1.7 % 18.0 %
3D-IoU fusion (this work) 0.112 16.5 % 28.1 %

The 3D-IoU multi-view fusion improvement transfers to the cross-scene setting: 2.4Γ— lift in mean IoU (0.046 β†’ 0.112) and an order-of-magnitude lift in > 0.25 IoU. Per-frame box quality limits per-scene IoU on held-out scenes; the 0.20 in-distribution-to-cross-scene gap (0.154 vs 0.112 mean IoU) is the head still leaning on per-scene cues for box localization.

Backbone

EUPE-ViT-B from Meta FAIR (arXiv:2603.22387) via phanerozoic/argus. The backbone is frozen and not modified by this repo.

License

FAIR Research License (non-commercial), inherited via the EUPE-ViT-B backbone.

Downloads last month
134
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for phanerozoic/argus-3d

Finetuned
(1)
this model

Paper for phanerozoic/argus-3d