Argus
Argus is a multi-task perception system built on a single compact vision backbone. From one forward pass through the encoder, the model produces classification labels, semantic segmentation masks, metric depth maps, and dense keypoint correspondences, thereby collapsing four domain-specific pipelines into a unified package of roughly 86 million parameters. The system is named after Argus Panoptes, the many-eyed giant of Greek mythology who was tasked by Hera with watching over everything at once.
The underlying backbone is EUPE-ViT-B, which was introduced in Efficient Universal Perception Encoder (Zhu et al., Meta FAIR, arXiv:2603.22387, March 2026). That paper demonstrates that a small vision encoder can be distilled from a collection of larger specialist teachers, yielding features that transfer well to image understanding, dense prediction, and vision-language tasks simultaneously. Argus takes the released EUPE-ViT-B backbone, leaves its weights frozen, and attaches four lightweight task heads that were trained or constructed independently for this project.
Architecture
Image → EUPE-ViT-B (frozen, 86M parameters) → shared features
├── Classification → kNN over 1000 class prototypes
├── Segmentation → linear head, 150 ADE20K classes
├── Depth → linear head, 256 bins, trained on NYU
└── Correspondence → training-free dense feature matching
The segmentation and depth heads each consist of a BatchNorm layer followed by a single 1Γ1 convolution, and they were trained with the backbone held frozen throughout. Classification is performed by extracting the backbone's CLS token, normalizing it, and computing cosine similarity against a precomputed matrix of 1000 class prototypes that were built from the full ImageNet-1k training set. Keypoint correspondence requires no trained parameters at all: source and target features are extracted from two images, upsampled to pixel resolution, and matched by cosine similarity at each source keypoint.
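The prototype-matching step described above can be sketched in a few lines. This is an illustrative reconstruction, not the repository's actual code: the function name is hypothetical, and it assumes both the CLS feature and the prototype rows are L2-normalized so that a dot product equals cosine similarity.

```python
import numpy as np

def classify_by_prototypes(cls_feature, prototypes, top_k=5):
    """Cosine-similarity classification against class prototypes.

    cls_feature: (D,) backbone CLS embedding for one image.
    prototypes:  (C, D) matrix of L2-normalized per-class mean features.
    """
    f = cls_feature / np.linalg.norm(cls_feature)
    scores = prototypes @ f                  # (C,) cosine similarities
    top = np.argsort(scores)[::-1][:top_k]   # indices of the best classes
    return [(int(i), float(scores[i])) for i in top]

# Toy check with 3 "classes" in a 4-dim feature space.
protos = np.eye(3, 4)                        # rows are already unit-norm
feat = np.array([0.1, 0.9, 0.0, 0.1])
print(classify_by_prototypes(feat, protos, top_k=2))  # class 1 ranks first
```

Because there is no trained softmax head, swapping in a different prototype matrix retargets the classifier without touching the backbone.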
Argus does not perform object detection. The four tasks above are what the model was built to do, and they are the scope in which its behavior has been validated. Detection would require a trained detection head on top of the backbone, which is out of scope for this release.
Reproduction of the EUPE Paper
All four of the paper's reported benchmarks were reproduced as part of building Argus, and the results either matched the published numbers within rounding error or exceeded them modestly.
| Task | Dataset | Metric | Paper | Argus | Delta |
|---|---|---|---|---|---|
| Classification | ImageNet-1k | kNN k=10 top-1 | 84.1 | 84.07 | −0.03 |
| Segmentation | ADE20K | mean IoU | 52.4 | 52.72 | +0.32 |
| Depth | NYU Depth v2 | RMSE (lower is better) | 0.391 | 0.3914 | +0.0004 |
| Correspondence | SPair-71k | PCK@0.1 | 51.3 | 54.35 | +3.05 |
The classification evaluation used the full 1.28-million-image ImageNet-1k training set as the kNN reference and the 50,000-image validation set as the query. The segmentation and depth heads were trained using the same linear-probe configurations described in the EUPE repository. Correspondence was evaluated on the SPair-71k test split at 512-pixel resolution across all 12,234 test pairs, for a total of 88,328 keypoints, with no failures during the run.
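For readers unfamiliar with the correspondence metric, PCK@0.1 counts a predicted keypoint as correct when it lands within 0.1 of the object bounding box's longer side from the ground truth. A minimal sketch, assuming the usual SPair-71k bounding-box convention (the function name is illustrative):

```python
import numpy as np

def pck(pred_pts, gt_pts, bbox_max_side, alpha=0.1):
    """Percentage of Correct Keypoints: a prediction is correct if it lies
    within alpha * max(bbox width, bbox height) of the ground truth."""
    pred = np.asarray(pred_pts, dtype=float)
    gt = np.asarray(gt_pts, dtype=float)
    dists = np.linalg.norm(pred - gt, axis=1)  # per-keypoint pixel error
    return float((dists <= alpha * bbox_max_side).mean())

# One keypoint 2 px off (correct at threshold 10), one 30 px off (incorrect).
print(pck([[10, 10], [50, 50]], [[12, 10], [80, 50]], bbox_max_side=100))  # 0.5
```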
Comparison with Standard Baselines
As a sanity check, Argus was compared against several well-known models on a shared 200-image COCO subset. Because the models predict different label spaces, the classification comparison uses a keyword cross-reference between each model's top-k ImageNet predictions and the COCO ground-truth detection labels on those images, which gives a consistent yardstick across differently trained models. These hit rates measure agreement with COCO detection labels via keyword matching on the 200-image subset; they are not raw ImageNet accuracy. For reference, all three classifiers exceed 80% top-1 on the full ImageNet validation set.
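The exact keyword-matching rule is not spelled out here; one plausible sketch (function name and tokenization are assumptions) is to split each ImageNet class name into words and count a hit when any word overlaps a COCO label:

```python
def keyword_hit(topk_prediction_names, coco_labels):
    """True if any predicted ImageNet class name shares a word
    with any COCO ground-truth label on the image."""
    pred_words = set()
    for name in topk_prediction_names:
        # ImageNet names are comma-separated synonym lists, e.g. "tabby, tabby cat"
        pred_words.update(w.strip().lower() for w in name.replace(",", " ").split())
    gt_words = {w.lower() for label in coco_labels for w in label.split()}
    return bool(pred_words & gt_words)

preds = ["tabby, tabby cat", "Egyptian cat"]
gt = ["cat", "couch"]
print(keyword_hit(preds, gt))  # True: "cat" appears on both sides
```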
Classification (hit rate against COCO detection labels, 200 images):
| Model | Parameters | Top-1 hit | Top-5 hit | Latency | Peak VRAM |
|---|---|---|---|---|---|
| Argus (EUPE-ViT-B) | 86 M | 42.2% | 66.8% | 13.1 ms | 0.34 GB |
| ConvNeXt-Base | 89 M | 40.2% | 71.4% | 10.4 ms | 0.35 GB |
| ResNet50 | 26 M | 36.2% | 61.8% | 8.4 ms | 0.12 GB |
Segmentation:
| Model | Parameters | Classes | Latency | Peak VRAM |
|---|---|---|---|---|
| Argus (EUPE + linear head) | 86 M | 150 | 11.8 ms | 0.41 GB |
| DeepLabV3-ResNet50 | 42 M | 21 | 15.9 ms | 0.33 GB |
Depth:
| Model | Parameters | Latency | Peak VRAM |
|---|---|---|---|
| Argus (EUPE + linear head) | 86 M | 13.3 ms | 0.35 GB |
| Depth-Anything-V2-Base | 98 M | 18.8 ms | 0.68 GB |
Argus posts the highest top-1 hit rate of the three classifiers, while ConvNeXt-Base leads on top-5 (which is characteristic of trained softmax heads relative to kNN). It is faster than DeepLabV3 while predicting a much richer label space, and faster than Depth-Anything-V2 while using roughly half the VRAM. Although these baselines and Argus were trained for different objectives on different datasets, the comparison is useful for understanding what the model delivers in practice.
Usage
```python
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True)
image = Image.open("your_image.jpg").convert("RGB")

# Any single task can be called directly:
top5 = model.classify(image, top_k=5)
seg = model.segment(image)    # returns [H, W] class indices
depth = model.depth(image)    # returns [H, W] metric depth in meters

# Or all three can be run at once in a single call:
result = model.perceive(image)
# result["classification"] -> list of top-5 {"class_id", "class_name", "score"}
# result["segmentation"]   -> numpy array of ADE20K class indices
# result["depth"]          -> numpy array of depth values in meters
# result["timings_ms"]     -> per-task latency breakdown

# Keypoint correspondence requires two images and a set of source points:
target = Image.open("other_image.jpg").convert("RGB")
src_points = [[100, 100], [200, 200]]
predicted_target_points = model.correspond(image, target, src_points)
```
The model uses HuggingFace's custom-code mechanism (trust_remote_code=True), so the loader code is fetched from the model repo automatically. No additional files need to be cloned.
Training
The backbone is frozen for every task. Only the task heads are trained, and the class prototypes are extracted (not trained at all).
Heads
| Component | Source dataset | Trained by |
|---|---|---|
| EUPE-ViT-B backbone | LVD-1689M (approximately 1.7 billion web images) | Meta FAIR (used here frozen) |
| Segmentation head | ADE20K (20,210 training images, 2,000 validation images) | This repository, 40,000 iterations of linear-probe training |
| Depth head | NYU Depth V2 (24,231 training images) | This repository, 38,400 iterations of linear-probe training |
| Class prototypes | ImageNet-1k (1.28 million training images) | This repository, mean CLS feature per class |
| Correspondence | None (training-free) | n/a |
The trainable heads sum to 317,847 parameters, about 0.37% of the 85.7M-parameter backbone. The unified model.safetensors is 332 MB; the backbone accounts for nearly all of it, with the prototype matrix at roughly 1.5% and the segmentation and depth heads together at about 1% (3.7 MB).
Architecture details
Segmentation head: BatchNorm2d(768) → Conv2d(768, 150, 1×1); 116,886 parameters, 1.4 MB on disk. Trained at 512×512 with cross-entropy loss, AdamW (lr 1e-3, weight decay 1e-3), WarmupOneCycleLR with 1500-step warmup, batch size 16.
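The head as described is small enough to write out directly. A minimal PyTorch sketch of the stated BatchNorm + 1×1 convolution structure (the variable name is illustrative):

```python
import torch.nn as nn

# BatchNorm over the 768-dim patch features, then a 1x1 convolution
# projecting to 150 ADE20K class logits.
seg_head = nn.Sequential(
    nn.BatchNorm2d(768),
    nn.Conv2d(768, 150, kernel_size=1),
)

# Parameter count matches the figure quoted above:
# BatchNorm2d affine: 2 * 768 = 1,536; Conv2d: 768 * 150 + 150 = 115,350.
n_params = sum(p.numel() for p in seg_head.parameters())
print(n_params)  # prints 116886
```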
Depth head: BatchNorm2d(768) → Conv2d(768, 256, 1×1), with the 256 output channels treated as linear depth bins between 0.001 m and 10 m and combined into a metric prediction by weighted sum; 200,961 parameters, 2.3 MB on disk. Trained on 416×544 crops with SigLoss, AdamW (lr 3e-4, weight decay 1e-3), WarmupOneCycleLR with 12,800-step warmup, batch size 16.
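The bin-to-metric conversion can be sketched as a softmax-weighted sum over linearly spaced bin centers. This is an assumption about the exact decoding (the source says "linear depth bins ... combined by weighted sum" but does not specify center placement), and the function name is illustrative:

```python
import torch

def decode_depth(bin_logits, d_min=0.001, d_max=10.0):
    """Convert per-pixel bin logits (B, 256, H, W) into metric depth (B, H, W)
    via a softmax-weighted sum over linearly spaced bin centers."""
    n_bins = bin_logits.shape[1]
    centers = torch.linspace(d_min, d_max, n_bins)          # (256,) depths in meters
    probs = bin_logits.softmax(dim=1)                       # (B, 256, H, W)
    return (probs * centers.view(1, -1, 1, 1)).sum(dim=1)   # (B, H, W)

logits = torch.zeros(1, 256, 2, 2)   # uniform logits -> mean of the bin centers
depth = decode_depth(logits)
print(depth.shape)  # torch.Size([1, 2, 2])
```

The soft weighting makes the prediction differentiable and sub-bin accurate, unlike a hard argmax over bins.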
Class prototypes are produced by running the frozen backbone over the full ImageNet-1k training set at 224×224 resolution, computing the mean L2-normalized CLS feature per class, and saving the resulting 1000×768 matrix. No training, just feature extraction.
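The per-class averaging step, sketched with numpy. This is an illustrative reconstruction (function name assumed); it normalizes each feature before averaging, matching the "mean L2-normalized CLS feature per class" recipe, and whether the resulting means are re-normalized afterwards is left unspecified here as in the source:

```python
import numpy as np

def build_prototypes(features, labels, num_classes=1000):
    """Mean of L2-normalized CLS features per class.

    features: (N, D) CLS embeddings; labels: (N,) integer class ids.
    Returns a (num_classes, D) prototype matrix.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        protos[c] = feats[labels == c].mean(axis=0)
    return protos

# Toy example: two classes in a 2-dim feature space.
feats = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 1.0]])
labels = np.array([0, 1, 1])
print(build_prototypes(feats, labels, num_classes=2))
```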
Correspondence has no learned parameters. At inference time, dense patch features are extracted from both images, upsampled to 512×512 pixel resolution, and matched by cosine similarity per source keypoint.
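The matching step reduces to one cosine-similarity argmax per source keypoint. A minimal sketch, assuming pixel-resolution feature maps are already available (the function name is illustrative):

```python
import numpy as np

def correspond(src_feats, tgt_feats, src_points):
    """Training-free dense matching: for each source keypoint, find the
    target pixel whose feature has the highest cosine similarity.

    src_feats, tgt_feats: (H, W, D) pixel-resolution feature maps
    (already upsampled from patch features); src_points: list of (x, y).
    """
    H, W, D = tgt_feats.shape
    tgt = tgt_feats.reshape(-1, D)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    matches = []
    for x, y in src_points:
        f = src_feats[y, x]
        f = f / np.linalg.norm(f)
        idx = int(np.argmax(tgt @ f))     # best-matching flattened target pixel
        matches.append((idx % W, idx // W))  # back to (x, y)
    return matches

np.random.seed(0)
src = np.random.rand(8, 8, 4)
tgt = np.roll(src, shift=2, axis=1)       # target = source shifted right by 2 px
print(correspond(src, tgt, [(1, 3)]))  # [(3, 3)]
```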
Compute
| Task | Iterations | Wall time |
|---|---|---|
| Segmentation (ADE20K) | 40,000 | ~5 hours |
| Depth (NYU Depth V2) | 38,400 | ~3 hours |
| Class prototypes (IN1k) | 1.28M images, single pass | ~45 minutes |
| Correspondence (SPair) | training-free | n/a |
Training was done on a single 48 GB workstation GPU. Peak VRAM was approximately 10 GB during segmentation training, 7 GB during depth training, and 2.5 GB during class prototype extraction.
Why minimal heads
The decision to use BatchNorm + 1×1 convolution instead of a heavier decoder is the same one the EUPE paper makes. A minimal head means downstream performance can be attributed to the backbone's features rather than to a sophisticated decoder. The same backbone with a Mask2Former-style head would produce higher segmentation numbers, but those numbers wouldn't tell you anything about the backbone in isolation.
Notes
The segmentation head was trained on ADE20K's 150-class indoor-and-urban label space, which does not align directly with COCO or other detection benchmarks. The depth head was trained on NYU Depth v2 and is indoor-biased; outdoor metric depth should be treated as approximate. Classification uses kNN over class prototypes rather than a trained softmax head, which produces more decisive top-1 predictions but flatter top-k distributions.
License
The EUPE-ViT-B backbone weights inside this checkpoint were released by Meta FAIR under the FAIR Research License, which restricts use to non-commercial research and education. The task heads and class prototypes in this checkpoint were trained independently by the author of this repository and would on their own be releasable under a permissive license. However, because they are inseparably bundled with the backbone weights in a single file, the unified checkpoint inherits the more restrictive license of its most restricted component. In practical terms, the entire unified checkpoint should be treated as released under the FAIR Research License. See LICENSE for the full text.
Citation
If you use Argus or the underlying EUPE backbone in academic work, please cite the original paper:
@misc{zhu2026eupe,
title={Efficient Universal Perception Encoder},
author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
year={2026},
eprint={2603.22387},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgements
The EUPE backbone was trained and released by Meta FAIR. The dataset loading utilities are from the DINOv3 repository. The Argus task heads, benchmarks, and packaging were done by phanerozoic.