Argus-3D

Class-agnostic 3D bounding box detection on a frozen EUPE-ViT-B backbone. Given a posed RGB image and camera intrinsics, returns 7-DoF boxes (cx, cy, cz, w, h, d, theta) for the objects in the scene.

The detector pairs a non-linear (or linear) per-patch foreground head with feature-dim discovery for depth, k-means clustering for size priors derived from 20 CA-1M val scenes, and 3D-IoU-based multi-view fusion. No 2D bounding boxes, no class labels, no segmentation map as final output. Camera-frame 3D boxes only.

Architecture

Per-frame:
  Image (1024x1024)
    -> EUPE-ViT-B (frozen, reused from phanerozoic/argus)
    -> patch tokens (4096, 768) on a 64x64 grid
    -> instance head: 2-layer MLP (default) or linear ridge -> per-patch foreground score
       depth head:    ridge over 768 dims                   -> per-patch metric depth (m)
       k-means modes: 8 cluster centers (20-scene)          -> per-patch object-type assignment
    -> threshold instance score, upsample mask to 1024x1024, connected components
    -> for each component:
         unproject to 3D using depth + K
         DBSCAN-split for instance separation
         PCA-on-xz for yaw, percentile extents for (w, h, d)
         blend extents toward the matched cluster's size prior
    -> camera-frame 7-DoF box list

Multi-view (per scene):
  -> transform every per-frame box to world frame using camera RT
  -> 3D-IoU-based clustering: union-find with edges where iou_3d_zup(box_i, box_j) > 0.2
  -> filter clusters by min_obs and total inlier weight (per-cluster confidence)
  -> per-cluster: weighted-median fuse to single 7-DoF box

Components

Component	Parameters	Discovery / training
EUPE-ViT-B backbone (frozen, reused)	not part of this head	reused from phanerozoic/argus, run at 1024x1024 input (64x64 patch grid)
Instance head — MLP (default)	~200 K	2-layer MLP (768 → 256 → 1, ReLU + dropout 0.5), 30 epochs AdamW BCE-with-logits on 20-scene patch split at 1024 input
Instance head — linear ridge (fallback)	769 floats + threshold	random K=20 subset search + hard-neg mining, AUC selection
Depth head (ridge over 768 dims)	769 floats	random K=20 subset search, RMSE selection
K-means cluster centers	8 × 768 floats	MiniBatchKMeans on foreground patches across 20 CA-1M val scenes
Per-cluster size priors (w, h, d)	8 × 3 floats	median of observed extents per mode (8 800 instances aggregated)
OBB fitter (PCA + percentile + Tikhonov)	0	closed-form
Multi-view fusion (3D-IoU union-find + weighted median)	0	closed-form
Total head footprint (MLP)	~200 K params / ~830 KB

File layout

instance_head_mlp.safetensors   # MLP fc1/fc2 weights
instance_head_mlp_meta.json     # MLP config + threshold
instance_head.safetensors       # linear ridge: dims + coef + intercept + threshold
depth_head.safetensors          # depth ridge: dims + coef + intercept
size_priors.safetensors         # 8 cluster centers + 8 (w, h, d) priors (20-scene)
config.json                     # input_res, patch_grid, prior_weight, fusion_iou, etc.
argus_3d.py                     # Argus3D class
infer.py                        # CLI dispatcher

Usage

from argus_3d import Argus3D
import numpy as np

model = Argus3D.from_pretrained("phanerozoic/argus-3d", device="cuda")

K = np.array([[850, 0, 395], [0, 850, 510], [0, 0, 1]])
boxes = model.detect("room.jpg", K)               # list of Box3D
boxes = model.detect("room.jpg", K, depth=d)      # supply RGBD sensor depth
out   = model.perceive("room.jpg", K)             # fg score map + depth map + boxes

for b in boxes:
    print(b.cx, b.cy, b.cz, b.w, b.h, b.d, b.theta)

Eval

CA-1M val sequence ca1m-val-45662921. Class-agnostic per-scene 3D IoU after multi-view fusion across 284 frames (stride-4 sampling of 1135 total). The head produces its own instance hypotheses; no ground-truth 2D bounding boxes are used. Sensor depth is supplied; the discovered depth head can be used in its place.

mAP @ IoU thresholds (Boxer-comparable)

Class-agnostic AP at the same IoU thresholds Boxer reports:

Threshold	Strict default (44 boxes)	Loose filter (246 boxes)
AP @ 0.05 IoU	0.144	0.246
AP @ 0.10 IoU	0.116	0.172
AP @ 0.15 IoU	0.075	0.128
AP @ 0.25 IoU	0.035	0.049
AP @ 0.50 IoU	0.004	0.001
mAP @ [0.05, 0.5]	0.075	0.119

Boxer reports 0.43 mAP at IoU [0.05, 0.5] on CA-1M with GT 2D bounding boxes plus RGBD as input plus a learned 2D→3D lifting head. Our pipeline runs class-agnostic without GT 2D boxes and the OBB fitter is closed-form (PCA + percentile). The ~4× gap at AP @ 0.5 IoU is dominated by the percentile OBB fit — point-cloud percentile extents cap the high-IoU tail.

Headline result

1024-input MLP head + 20-scene size priors + 3D-IoU multi-view fusion + iterative refinement:

Metric	Linear ridge (v0) + 768 + position fusion	MLP + 768 + IoU fusion	MLP + 1024 + IoU fusion (default)
Mean 3D IoU	0.063	0.177	0.216
Median 3D IoU	—	0.146	0.175
Fraction > 0.1 IoU	27.8 %	60.9 %	68.2 %
Fraction > 0.25 IoU	6.9 %	34.8 %	36.4 %
Fraction > 0.5 IoU	0.0 %	4.3 %	9.1 %
Recall (matched / GT)	19.8 %	18.0 %	17.5 %
Fused boxes per scene	72	46	44

Default config: 1024 input resolution (64×64 patch grid), MLP head trained at 1024 features on 20 scenes, 20-scene size priors derived from foreground patches at 1024 resolution, 3D-IoU multi-view fusion at threshold 0.4 with iterative refinement (drop observations with IoU < 0.4 to the cluster consensus, re-fuse, up to 3 iterations). Per-cluster filter is min 4 observations and weight floor at the 80th percentile of nonzero cluster weights.

Resolution and fusion ablations (all on ca1m-val-45662921):

Variant	Mean IoU	> 0.25 IoU	> 0.5 IoU	Recall
Linear ridge (v0) + position fusion at 768	0.063	6.9 %	0.0 %	19.8 %
MLP + position fusion at 768	0.076	11.5 %	1.3 %	24.4 %
MLP + IoU fusion (thresh 0.2) at 768	0.158	28.2 %	2.8 %	28.1 %
MLP + IoU fusion (thresh 0.4) at 768	0.177	34.8 %	4.3 %	18.0 %
MLP + IoU fusion at 1024 (default)	0.216	36.4 %	9.1 %	17.5 %

Per-stage discovery metrics (in-distribution random patch split, 4 scenes mixed)

Discovery output	Metric	Linear ridge	MLP
Instance head, per-patch foreground	AUC	0.860	0.980
Instance head, per-patch foreground	F1 (tuned threshold)	0.569	0.815
Depth head, foreground patches in 0.1-3 m	RMSE	0.190 m	0.133 m (MLP variant)
Depth head, foreground patches in 0.1-3 m	delta1 (1.25× ratio)	0.919	0.974 (MLP variant)

Cross-scene held-out

The numbers above use a random patch split that mixes patches from 4 cached scenes. That setup overstates generalization because adjacent patches in the same room share scene-specific cues. A leave-one-scene-out eval gives the honest cross-scene head AUC:

Head	In-distribution AUC	Cross-scene AUC	Cross-scene F1
Linear ridge (all-768)	0.860	0.604	0.301
MLP (3 train scenes)	0.980	0.566 (fold 45662921)	0.312
MLP (19 train scenes)	0.980	0.780 (45662921 held out)	0.465

Scaling the train set from 3 to 19 scenes lifts cross-scene AUC by 0.21 (0.566 → 0.780).

End-to-end cross-scene mean IoU on 45662921 with the 19-scene MLP head:

Pipeline	Mean IoU	> 0.25 IoU	Recall
Position-only fusion	0.046	1.7 %	18.0 %
3D-IoU fusion (this work)	0.112	16.5 %	28.1 %

The 3D-IoU multi-view fusion improvement transfers to the cross-scene setting: 2.4× lift in mean IoU (0.046 → 0.112) and an order-of-magnitude lift in > 0.25 IoU. Per-frame box quality limits per-scene IoU on held-out scenes; the 0.20 in-distribution-to-cross-scene gap (0.154 vs 0.112 mean IoU) is the head still leaning on per-scene cues for box localization.

Backbone

EUPE-ViT-B from Meta FAIR (arXiv:2603.22387) via phanerozoic/argus. The backbone is frozen and not modified by this repo.

License

FAIR Research License (non-commercial), inherited via the EUPE-ViT-B backbone.

Downloads last month: 2

Model tree for phanerozoic/argus-3d

Base model

facebook/EUPE-ViT-B

Finetuned

phanerozoic/argus

Finetuned

(1)

this model

Paper for phanerozoic/argus-3d

Efficient Universal Perception Encoder

Paper • 2603.22387 • Published Mar 23 • 11