---
license: other
license_name: fair-research-license
license_link: LICENSE
tags:
- multi-task-perception
- computer-vision
- image-classification
- semantic-segmentation
- depth-estimation
- keypoint-correspondence
- vision-transformer
library_name: pytorch
datasets:
- imagenet-1k
- scene_parse_150
- sayakpaul/nyu_depth_v2
metrics:
- accuracy
---

# Argus

Argus is a multi-task perception system built on a single compact vision backbone. From one forward pass through the encoder, the model produces classification labels, semantic segmentation masks, metric depth maps, and dense keypoint correspondences, thereby collapsing four domain-specific pipelines into a unified package of roughly 86 million parameters. The system is named after Argus Panoptes, the many-eyed giant of Greek mythology who was tasked by Hera with watching over everything at once.

The underlying backbone is EUPE-ViT-B, which was introduced in *Efficient Universal Perception Encoder* (Zhu et al., Meta FAIR, arXiv:2603.22387, March 2026). That paper demonstrates that a small vision encoder can be distilled from a collection of larger specialist teachers, yielding features that transfer well to image understanding, dense prediction, and vision–language tasks simultaneously. Argus takes the released EUPE-ViT-B backbone, leaves its weights frozen, and attaches four lightweight task heads that were trained or constructed independently for this project.

## Architecture

```
Image → EUPE-ViT-B (frozen, 86M parameters) → shared features
  ├── Classification: two methods on the same CLS token
  │     kNN over 1000 class prototypes (default)
  │     trained linear softmax over 1000 classes
  ├── Segmentation: linear head, 150 ADE20K classes
  ├── Depth: linear head, 256 bins, trained on NYU
  ├── Detection: FCOS head with simple feature pyramid, 80 COCO classes
  └── Correspondence: training-free dense feature matching
```

The segmentation and depth heads each consist of a BatchNorm layer followed by a single 1×1 convolution, and they were trained with the backbone held frozen throughout.

Classification supports two methods operating on the same normalized CLS token: a kNN protocol that computes cosine similarity against a precomputed matrix of 1000 class prototypes built from the full ImageNet-1k training set, and a trained linear softmax classifier consisting of a single `Linear(768, 1000)` layer with bias. Both methods run from the same backbone forward pass and the caller selects between them via a `method` argument on `classify()`.

Detection uses an FCOS-style anchor-free detector built on a ViTDet-style simple feature pyramid that synthesizes five spatial scales (strides 8 through 128) from the backbone's stride-16 patch features, with shared four-layer convolutional towers for classification and box regression across all pyramid levels. The detection head runs at 640-pixel input with letterbox padding to preserve aspect ratio, and returns per-image lists of bounding boxes with class labels, confidence scores, and COCO class names.

Keypoint correspondence requires no trained parameters at all: source and target features are extracted from two images, upsampled to pixel resolution, and matched by cosine similarity at each source keypoint.

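The training-free correspondence step can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's implementation: the function name `match_keypoints`, the `[H, W, C]` feature layout, and the assumption that both feature maps have already been upsampled to pixel resolution are hypothetical.

```python
import numpy as np

def match_keypoints(src_feats, tgt_feats, src_points):
    """Match each source keypoint to the most similar target pixel.

    src_feats, tgt_feats: float arrays of shape [H, W, C], assumed to be
    already upsampled to pixel resolution (hypothetical layout).
    src_points: list of (x, y) pixel coordinates in the source image.
    Returns a list of (x, y) coordinates in the target image.
    """
    h, w, c = tgt_feats.shape
    # L2-normalize so that a dot product equals cosine similarity.
    tgt = tgt_feats.reshape(-1, c)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    matches = []
    for x, y in src_points:
        q = src_feats[y, x]
        q = q / np.linalg.norm(q)
        sim = tgt @ q                  # similarity to every target pixel
        best = int(np.argmax(sim))     # flat index of the best match
        matches.append((best % w, best // w))
    return matches

# Toy check: the target is the source shifted right by two pixels.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 8, 16))
tgt = np.roll(src, shift=2, axis=1)
print(match_keypoints(src, tgt, [(1, 3), (4, 5)]))
```

In the toy check each keypoint should land two pixels to the right of its source location, since the target features are an exact shifted copy.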
## Reproduction of the EUPE Paper

All four of the paper's reported benchmarks were reproduced as part of building Argus, and the results either matched the published numbers within rounding error or exceeded them modestly.

| Task           | Dataset      | Metric                 | Paper | Argus  | Delta   |
|----------------|--------------|------------------------|-------|--------|---------|
| Classification | ImageNet-1k  | kNN k=10 top-1         | 84.1  | 84.07  | −0.03   |
| Segmentation   | ADE20K       | mean IoU               | 52.4  | 52.72  | +0.32   |
| Depth          | NYU Depth v2 | RMSE (lower is better) | 0.391 | 0.3914 | +0.0004 |
| Correspondence | SPair-71k    | PCK@0.1                | 51.3  | 54.35  | +3.05   |

The classification evaluation used the full 1.28-million-image ImageNet-1k training set as the kNN reference and the 50,000-image validation set as the query. The segmentation and depth heads were trained using the same linear-probe configurations described in the EUPE repository. Correspondence was evaluated on the SPair-71k test split at 512-pixel resolution across all 12,234 test pairs, for a total of 88,328 keypoints, with no failures during the run.

The trained linear softmax classifier was added after the paper reproduction work and is not a paper benchmark. On ImageNet-1k val it reaches 85.53% top-1 and 97.69% top-5, which improves on the kNN reference by +1.46 points on top-1 and +3.70 points on top-5. The top-5 improvement is the more meaningful number: softmax is decisive where nearest-mean kNN is flat on visually similar classes.

| Classification method | Top-1   | Top-5   |
|-----------------------|---------|---------|
| kNN (k=10)            | 84.07 % | 93.99 % |
| Linear softmax        | 85.53 % | 97.69 % |

### Detection

The FCOS detection head was trained on COCO 2017 train (117,266 images) at 640-pixel input with the backbone frozen throughout.
Evaluation on COCO val2017 (5,000 images) with the standard pycocotools protocol:

| Metric | Value |
|--------|-------|
| mAP@[0.5:0.95] | **41.0** |
| mAP@0.50 | 64.8 |
| mAP@0.75 | 43.2 |
| mAP (small objects) | 21.4 |
| mAP (medium objects) | 44.9 |
| mAP (large objects) | 62.1 |

For context, FCOS with a fully-trained ResNet-50-FPN backbone achieves 39.1 mAP on the same benchmark. The frozen EUPE-ViT-B backbone at 41.0 mAP exceeds that baseline while sharing its features with four other task heads simultaneously.

## Comparison with Standard Baselines

As a sanity check, Argus was compared against several well-known models on the same 200-image COCO subset. The classification comparison uses a keyword cross-reference between each model's top-k ImageNet predictions and the COCO ground-truth detection labels on those images, which provides a consistent yardstick across differently-trained models despite the label-space mismatch.

**These hit rates measure agreement with COCO detection labels via keyword matching on the 200-image subset; they are not raw ImageNet accuracy.** For reference, all three classifiers exceed 80% top-1 on the full ImageNet validation set.

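One way to picture the keyword cross-reference is the sketch below. The matching rule (case-insensitive substring match), the keyword table, and the names `COCO_KEYWORDS`, `topk_hit`, and `hit_rate` are assumptions made for illustration; the exact rule behind the reported numbers is not specified here.

```python
# Hypothetical keyword table: COCO label -> substrings that count as a match
# inside an ImageNet class name. Illustrative only.
COCO_KEYWORDS = {
    "dog": ["dog", "retriever", "terrier", "spaniel"],
    "car": ["car", "cab", "limousine", "convertible"],
    "cup": ["cup", "mug"],
}

def topk_hit(predicted_names, coco_labels, k):
    """True if any of the top-k predicted names matches any COCO label."""
    keywords = [kw for label in coco_labels
                for kw in COCO_KEYWORDS.get(label, [label])]
    return any(kw in name.lower()
               for name in predicted_names[:k] for kw in keywords)

def hit_rate(samples, k):
    """samples: list of (predicted_names, coco_labels) pairs, one per image."""
    hits = sum(topk_hit(preds, labels, k) for preds, labels in samples)
    return hits / len(samples)

samples = [
    (["golden retriever", "tennis ball", "Labrador retriever"], ["dog", "sports ball"]),
    (["coffee mug", "espresso", "cup"], ["dining table", "cup"]),
    (["seashore", "sandbar", "convertible"], ["car", "person"]),
    (["volcano", "alp", "valley"], ["person"]),
]
print(hit_rate(samples, k=1), hit_rate(samples, k=3))  # 0.5 0.75
```

Plain substring matching is deliberately crude (for example, `"car"` would also match `"carousel"`), which is one reason such hit rates should be read as a relative yardstick rather than an accuracy figure.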
**Classification** (hit rate against COCO detection labels, 200 images):

| Model              | Parameters | Top-1 hit | Top-5 hit | Latency | Peak VRAM |
|--------------------|------------|-----------|-----------|---------|-----------|
| Argus (EUPE-ViT-B) | 86 M       | 42.2%     | 66.8%     | 13.1 ms | 0.34 GB   |
| ConvNeXt-Base      | 89 M       | 40.2%     | 71.4%     | 10.4 ms | 0.35 GB   |
| ResNet50           | 26 M       | 36.2%     | 61.8%     | 8.4 ms  | 0.12 GB   |

**Segmentation**:

| Model                      | Parameters | Classes | Latency | Peak VRAM |
|----------------------------|------------|---------|---------|-----------|
| Argus (EUPE + linear head) | 86 M       | 150     | 11.8 ms | 0.41 GB   |
| DeepLabV3-ResNet50         | 42 M       | 21      | 15.9 ms | 0.33 GB   |

**Depth**:

| Model                      | Parameters | Latency | Peak VRAM |
|----------------------------|------------|---------|-----------|
| Argus (EUPE + linear head) | 86 M       | 13.3 ms | 0.35 GB   |
| Depth-Anything-V2-Base     | 98 M       | 18.8 ms | 0.68 GB   |

Argus produces the highest top-1 hit rate of the three image classifiers, while ConvNeXt-Base leads on top-5 (71.4% versus 66.8%). The Argus row above uses the kNN classification method, which is decisive on top-1 but flatter on top-k than a trained softmax. Argus is faster than DeepLabV3 while predicting a much richer label space (150 classes versus 21), and it is faster than Depth-Anything-V2 while using roughly half the VRAM. Although these baselines and Argus were trained for different objectives on different datasets, the comparison is useful for understanding what the model delivers in practice.

## Usage

```python
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True)
image = Image.open("your_image.jpg").convert("RGB")

# Any single task can be called directly:
top5 = model.classify(image, top_k=5)                    # default: kNN method
top5 = model.classify(image, top_k=5, method="softmax")  # alternative: trained linear
seg = model.segment(image)    # returns [H, W] class indices
depth = model.depth(image)    # returns [H, W] metric depth in meters

# Detection runs at 640px with letterbox padding:
dets = model.detect(image, score_thresh=0.3)
# returns list of {"box": [x1,y1,x2,y2], "score": float, "label": int, "class_name": str}

# Classification, segmentation, and depth can be run at once:
result = model.perceive(image)
# result["classification"]: list of top-5 {"class_id", "class_name", "score"}
# result["segmentation"]:   numpy array of ADE20K class indices
# result["depth"]:          numpy array of depth values in meters
# result["timings_ms"]:     per-task latency breakdown

# Detection is called separately because it uses a different input resolution:
dets = model.detect(image)

# Keypoint correspondence requires two images and a set of source points:
target = Image.open("other_image.jpg").convert("RGB")
src_points = [[100, 100], [200, 200]]
predicted_target_points = model.correspond(image, target, src_points)
```

Every single-image method also accepts a list of images. When a list is passed, the return type becomes a list of per-image results in the same shape that a single call would produce:

```python
paths = ["first_image.jpg", "second_image.jpg"]  # placeholder file paths
images = [Image.open(p).convert("RGB") for p in paths]

top5_batch = model.classify(images, top_k=5)  # list of list-of-dict
seg_batch = model.segment(images)             # list of [H, W] tensors
depth_batch = model.depth(images)             # list of [H, W] tensors
perceive_batch = model.perceive(images)       # list of dicts
```

Per-task confidence and uncertainty are available as opt-in outputs.
Classification always carries a `margin` field (top-1 score minus top-2 score) on the first entry. Segmentation and depth expose confidence maps when `return_confidence=True` is passed:

```python
seg_map, seg_conf = model.segment(image, return_confidence=True)
# seg_conf is per-pixel max softmax probability in [0, 1]

depth_map, depth_std = model.depth(image, return_confidence=True)
# depth_std is per-pixel standard deviation of the 256-bin distribution

result = model.perceive(image, return_confidence=True)
# result["segmentation_confidence"] and result["depth_uncertainty"] are populated
```

The model can be exported to ONNX. This produces three separate graphs (backbone, segmentation head, and depth head), with verification against the PyTorch reference automatically performed when `verify=True`:

```python
paths = model.export_onnx("/path/to/out_dir", backbone_resolution=224, verify=True)
# paths["backbone"], paths["seg_head"], paths["depth_head"]
# paths["verification"]: max abs diff per component
```

Classification (kNN over class prototypes) and correspondence run as post-processing on top of the backbone output and need no separate graph.

The model uses HuggingFace's custom-code mechanism (`trust_remote_code=True`), so the loader code is fetched from the model repo automatically. No additional files need to be cloned.

## Training

The backbone is frozen for every task. Only the task heads are trained, and the class prototypes are extracted (not trained at all).

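Because the class prototypes are extracted rather than trained, building them reduces to averaging features. The following is a minimal NumPy sketch under the assumption that CLS features have already been computed by the frozen backbone; the function names and the toy 3-class, 4-dimensional data are hypothetical and stand in for the real 1000×768 case.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_prototypes(features, labels, num_classes):
    """Mean L2-normalized feature per class -> [num_classes, dim] matrix."""
    feats = l2_normalize(features)
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        protos[c] = feats[labels == c].mean(axis=0)
    return protos

def classify(query, prototypes, top_k=1):
    """Rank classes by cosine similarity between the query and each prototype."""
    sims = l2_normalize(prototypes) @ l2_normalize(query)
    order = np.argsort(-sims)[:top_k]
    return [(int(c), float(sims[c])) for c in order]

# Toy data: three well-separated class centers with small Gaussian noise.
rng = np.random.default_rng(0)
centers = np.eye(3, 4) * 5.0
labels = np.repeat(np.arange(3), 20)
features = centers[labels] + rng.normal(scale=0.1, size=(60, 4))

protos = build_prototypes(features, labels, num_classes=3)
print(classify(centers[1], protos, top_k=2))
```

No gradient step appears anywhere in this path, which is why the prototype matrix can be rebuilt from cached features without any training loop.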
### Heads

| Component                 | Source dataset                                           | Trained by                                                  |
|---------------------------|----------------------------------------------------------|-------------------------------------------------------------|
| EUPE-ViT-B backbone       | LVD-1689M (approximately 1.7 billion web images)         | Meta FAIR (used here frozen)                                |
| Segmentation head         | ADE20K (20,210 training images, 2,000 validation images) | This repository, 40,000 iterations of linear-probe training |
| Depth head                | NYU Depth V2 (24,231 training images)                    | This repository, 38,400 iterations of linear-probe training |
| Class prototypes (kNN)    | ImageNet-1k (1.28 million training images)               | This repository, mean CLS feature per class                 |
| Linear softmax classifier | ImageNet-1k (1.28 million training images)               | This repository, SGD over cached frozen features            |
| Detection head            | COCO 2017 (117,266 training images, 80 classes)          | This repository, FCOS with simple FPN, 8 epochs at 640px    |
| Correspondence            | None (training-free)                                     | n/a                                                         |

The trainable heads sum to approximately 17.2M parameters (seg 117K + depth 201K + linear classifier 769K + detection 16.14M), of which the detection head accounts for 94%. The unified `model.safetensors` is 396 MB.

### Precision variants

Two safetensors files are provided, holding the same weights at different on-disk precision. Inference behavior is identical; the smaller file is for users with limited bandwidth or storage.

| File | Size | Load |
|---|---|---|
| `model.safetensors` | 334 MB | `AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True)` |
| `model.bf16_backbone.safetensors` | 170 MB | `AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True, variant="bf16_backbone")` |

Both files load into the same FP32 model in memory; PyTorch automatically upcasts the bfloat16 stored weights at construction time. The smaller variant saves download bandwidth and disk space but does not reduce inference VRAM.

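The reason the smaller file still occupies FP32-sized memory once loaded is that a bfloat16 value is the upper 16 bits of an IEEE float32, so upcasting simply pads with zero bits. The sketch below simulates this with the standard library only. It is illustrative: the helper names are hypothetical, and it truncates when encoding, whereas real converters typically round to nearest even.

```python
import struct

def fp32_to_bf16_bits(x):
    """Keep the top 16 bits of the float32 encoding (truncation, for simplicity)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(b):
    """Upcast: pad the 16 stored bits with 16 zero bits. This step is exact."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

w = 0.123456789
stored = fp32_to_bf16_bits(w)        # 2 bytes on disk instead of 4
loaded = bf16_bits_to_fp32(stored)   # 4 bytes in memory again

print(loaded, abs(loaded - w) / w)

# Upcasting then re-encoding returns the same 16 bits, so nothing is lost
# between the file and the in-memory FP32 tensor.
assert fp32_to_bf16_bits(loaded) == stored
```

Any rounding happens once, when the file is written; after loading, each weight again takes four bytes, which matches the note that the variant does not reduce inference VRAM.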
### Architecture details

**Segmentation head** is `BatchNorm2d(768) → Conv2d(768, 150, 1×1)` (116,886 parameters, 1.4 MB on disk). Trained at 512×512 with cross-entropy loss, AdamW (lr 1e-3, weight decay 1e-3), WarmupOneCycleLR with 1500-step warmup, batch size 16.

**Depth head** is `BatchNorm2d(768) → Conv2d(768, 256, 1×1)`, with the 256 output channels treated as linear depth bins between 0.001 m and 10 m and combined into a metric prediction by weighted sum (200,961 parameters, 2.3 MB on disk). Trained on 416×544 crops with SigLoss, AdamW (lr 3e-4, weight decay 1e-3), WarmupOneCycleLR with 12,800-step warmup, batch size 16.

**Class prototypes (kNN path)** are produced by running the frozen backbone over the full ImageNet-1k training set at 224×224 resolution, computing the mean L2-normalized CLS feature per class, and saving the resulting 1000×768 matrix. No training, just feature extraction. At inference, the kNN path normalizes the query CLS token and computes cosine similarity against the prototype matrix.

**Linear softmax classifier** is a single `Linear(768, 1000)` layer with bias (769,000 parameters, about 3 MB on disk). Trained as a two-pass job: first the frozen backbone is run over the ImageNet-1k training set to cache a per-image CLS feature tensor (1,281,167 × 768, stored once at ~3.9 GB), then the linear layer is trained on the cached features alone. The training pass uses SGD with momentum 0.9, weight decay 0, batch size 4096, cosine schedule, 100 epochs, no augmentation, and the best checkpoint by validation top-1 is restored at the end. A small learning-rate sweep over `{0.5, 1.0, 3.0, 10.0, 30.0}` selects the best configuration; the L2-normalized CLS features and zero-initialized weights demand an unusually large learning rate to grow the weight scale to the point where softmax distributions become sharp. The best run used lr = 30.0 and produced 85.53% top-1 / 97.69% top-5 on ImageNet-1k val, beating the kNN protocol on both metrics.

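The learning-rate remark for the linear classifier can be made concrete with a little arithmetic. With L2-normalized features and the bias ignored, each logit is bounded in magnitude by the norm of its weight row, so the gap between the leading logit and the rest can only grow as the weights grow. The sketch below assumes an idealized case, in which one class leads all 999 others by the same logit gap, purely for illustration; `top1_probability` is a hypothetical helper.

```python
import math

def top1_probability(logit_gap, num_classes=1000):
    """Softmax probability of the leading class when it beats every other
    class by the same logit gap (an idealized, illustrative case)."""
    return 1.0 / (1.0 + (num_classes - 1) * math.exp(-logit_gap))

for gap in (0.0, 1.0, 5.0, 10.0, 15.0):
    print(f"gap={gap:5.1f}  p(top-1)={top1_probability(gap):.4f}")
```

At zero gap, which is where zero-initialized weights start, the leading class sits at exactly 1/1000. In this idealized setting the gap has to reach roughly 10 before the top-1 probability exceeds 0.95, and a gap of that size requires weight rows whose norm is far from zero, hence the large step sizes in the sweep.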
**Detection head** is an FCOS-style anchor-free detector on a ViTDet-style simple feature pyramid. The FPN takes the backbone's stride-16 spatial features and synthesizes five levels: P3 (stride 8, via transposed convolution), P4 (stride 16, identity with channel reduction), P5 (stride 32), P6 (stride 64), and P7 (stride 128), each with 256 channels and GroupNorm. Two shared four-layer convolutional towers (classification and regression) with GroupNorm and GELU process each level, followed by three prediction heads: 80 classification channels, 4 box regression channels (left/top/right/bottom distances, exponentiated with learned per-level scale), and 1 centerness channel. 16,138,074 parameters total, 61.6 MB on disk. Trained at 640×640 with letterbox padding, focal loss (alpha 0.25, gamma 2.0) for classification, GIoU loss for boxes, BCE for centerness, AdamW (lr 1e-3, weight decay 1e-4), cosine schedule with 3% warmup, batch size 64, 8 epochs.

**Correspondence** has no learned parameters. At inference time, dense patch features are extracted from both images, upsampled to 512×512 pixel resolution, and matched by cosine similarity per source keypoint.

### Compute

| Task                     | Iterations                | Wall time |
|--------------------------|---------------------------|-----------|
| Segmentation (ADE20K)    | 40,000                    | ~5 hours |
| Depth (NYU Depth V2)     | 38,400                    | ~3 hours |
| Class prototypes (IN1k)  | 1.28M images, single pass | ~45 minutes |
| Linear classifier (IN1k) | 100 epochs × 313 steps    | ~25 seconds (on cached features, extraction amortized with the kNN prototype pass) |
| Detection (COCO 2017)    | 8 epochs × 1,832 batches  | ~6 hours at batch 64, 640px, FP32, frozen backbone |
| Correspondence (SPair)   | training-free             | n/a |

Training was done on a single 48 GB workstation GPU.
Peak VRAM was approximately 10 GB during segmentation training, 7 GB during depth training, 2.5 GB during class prototype extraction, and 4 GB during the linear classifier training (once features are cached, the training loop only holds the 3.7 GB cached feature tensor on GPU).

### Why minimal heads

The decision to use BatchNorm + 1×1 convolution for segmentation and depth, and a single linear layer for classification, is the same one the EUPE paper makes. A minimal head means downstream performance can be attributed to the backbone's features rather than to a sophisticated decoder. The same backbone with a Mask2Former-style head would produce higher segmentation numbers, but those numbers wouldn't tell you anything about the backbone in isolation.

The linear softmax classifier added here follows the same principle: one fully-connected layer on top of the frozen CLS token, no intermediate hidden layers, no augmentation at training time.

## Notes

The segmentation head was trained on ADE20K's 150-class indoor-and-urban label space. The depth head was trained on NYU Depth v2 and is indoor-biased; outdoor metric depth should be treated as approximate. The detection head was trained on COCO 2017's 80-class label space at 640-pixel input; small-object detection (mAP 21.4) is the expected weakness because the stride-8 P3 level can only resolve objects roughly 12 pixels and larger at that resolution. Classification supports both a kNN protocol (decisive on top-1, flat on top-k because nearest-mean distances compress on visually similar classes) and a trained linear softmax (sharper top-k, calibrated probabilities); the caller picks between them per call via the `method` argument.

## License

The EUPE-ViT-B backbone weights inside this checkpoint were released by Meta FAIR under the [FAIR Research License](https://huggingface.co/facebook/EUPE-ViT-B/blob/main/LICENSE), which restricts use to non-commercial research and education.

The task heads and class prototypes in this checkpoint were trained independently by the author of this repository and would on their own be releasable under a permissive license. However, because they are inseparably bundled with the backbone weights in a single file, the unified checkpoint inherits the more restrictive license of its most restricted component. In practical terms, the entire checkpoint file should be treated as released under the FAIR Research License. See `LICENSE` for the full text.

## Citation

If you use Argus or the underlying EUPE backbone in academic work, please cite the original paper:

```bibtex
@misc{zhu2026eupe,
  title={Efficient Universal Perception Encoder},
  author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
  year={2026},
  eprint={2603.22387},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

## Acknowledgements

The EUPE backbone was trained and released by Meta FAIR. The dataset loading utilities are from the DINOv3 repository. The Argus task heads, benchmarks, and packaging were done by [phanerozoic](https://huggingface.co/phanerozoic).