Argus
Argus is a multi-task perception system built on a single compact vision backbone. From one forward pass through the encoder, the model produces classification labels, semantic segmentation masks, metric depth maps, and dense keypoint correspondences, thereby collapsing four domain-specific pipelines into a unified package of roughly 86 million parameters. The system is named after Argus Panoptes, the many-eyed giant of Greek mythology who was tasked by Hera with watching over everything at once.
The underlying backbone is EUPE-ViT-B, which was introduced in Efficient Universal Perception Encoder (Zhu et al., Meta FAIR, arXiv:2603.22387, March 2026). That paper demonstrates that a small vision encoder can be distilled from a collection of larger specialist teachers, yielding features that transfer well to image understanding, dense prediction, and vision-language tasks simultaneously. Argus takes the released EUPE-ViT-B backbone, leaves its weights frozen, and attaches four lightweight task heads that were trained or constructed independently for this project.
Architecture
Image → EUPE-ViT-B (frozen, 86M parameters) → shared features
├── Classification → kNN over 1000 class prototypes
├── Segmentation → linear head, 150 ADE20K classes
├── Depth → linear head, 256 bins, trained on NYU
└── Correspondence → training-free dense feature matching
The segmentation and depth heads each consist of a BatchNorm layer followed by a single 1Γ1 convolution, and they were trained with the backbone held frozen throughout. Classification is performed by extracting the backbone's CLS token, normalizing it, and computing cosine similarity against a precomputed matrix of 1000 class prototypes that were built from the full ImageNet-1k training set. Keypoint correspondence requires no trained parameters at all: source and target features are extracted from two images, upsampled to pixel resolution, and matched by cosine similarity at each source keypoint.
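The prototype-matching step described above can be sketched in a few lines. This is an illustrative reconstruction, not the repository's actual code: the function name is hypothetical, and it assumes both the CLS feature and the prototype rows are L2-normalized so that a dot product equals cosine similarity.

```python
import numpy as np

def classify_by_prototypes(cls_feature, prototypes, top_k=5):
    """Cosine-similarity classification against class prototypes.

    cls_feature: (D,) backbone CLS embedding for one image.
    prototypes:  (C, D) matrix of L2-normalized per-class mean features.
    """
    f = cls_feature / np.linalg.norm(cls_feature)
    scores = prototypes @ f                  # (C,) cosine similarities
    top = np.argsort(scores)[::-1][:top_k]   # indices of the best classes
    return [(int(i), float(scores[i])) for i in top]

# Toy check with 3 "classes" in a 4-dim feature space.
protos = np.eye(3, 4)                        # rows are already unit-norm
feat = np.array([0.1, 0.9, 0.0, 0.1])
print(classify_by_prototypes(feat, protos, top_k=2))  # class 1 ranks first
```

Because there is no trained softmax head, swapping in a different prototype matrix retargets the classifier without touching the backbone.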
Argus does not perform object detection. The four tasks above are what the model was built to do, and they are the scope in which its behavior has been validated. Detection would require a trained detection head on top of the backbone, which is out of scope for this release.
Reproduction of the EUPE Paper
All four of the paper's reported benchmarks were reproduced as part of building Argus, and the results either matched the published numbers within rounding error or exceeded them modestly.
| Task | Dataset | Metric | Paper | Argus | Delta |
|---|---|---|---|---|---|
| Classification | ImageNet-1k | kNN k=10 top-1 | 84.1 | 84.07 | −0.03 |
| Segmentation | ADE20K | mean IoU | 52.4 | 52.72 | +0.32 |
| Depth | NYU Depth v2 | RMSE (lower is better) | 0.391 | 0.3914 | +0.0004 |
| Correspondence | SPair-71k | PCK@0.1 | 51.3 | 54.35 | +3.05 |
The classification evaluation used the full 1.28-million-image ImageNet-1k training set as the kNN reference and the 50,000-image validation set as the query. The segmentation and depth heads were trained using the same linear-probe configurations described in the EUPE repository. Correspondence was evaluated on the SPair-71k test split at 512-pixel resolution across all 12,234 test pairs, for a total of 88,328 keypoints, with no failures during the run.
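For readers unfamiliar with the correspondence metric, PCK@0.1 counts a predicted keypoint as correct when it lands within 0.1 of the object bounding box's longer side from the ground truth. A minimal sketch, assuming the usual SPair-71k bounding-box convention (the function name is illustrative):

```python
import numpy as np

def pck(pred_pts, gt_pts, bbox_max_side, alpha=0.1):
    """Percentage of Correct Keypoints: a prediction is correct if it lies
    within alpha * max(bbox width, bbox height) of the ground truth."""
    pred = np.asarray(pred_pts, dtype=float)
    gt = np.asarray(gt_pts, dtype=float)
    dists = np.linalg.norm(pred - gt, axis=1)  # per-keypoint pixel error
    return float((dists <= alpha * bbox_max_side).mean())

# One keypoint 2 px off (correct at threshold 10), one 30 px off (incorrect).
print(pck([[10, 10], [50, 50]], [[12, 10], [80, 50]], bbox_max_side=100))  # 0.5
```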
Comparison with Standard Baselines
As a sanity check, Argus was compared against several well-known models on a shared 200-image COCO subset. Because the models predict different label spaces, the classification comparison uses a keyword cross-reference between each model's top-k ImageNet predictions and the COCO ground-truth detection labels on those images, which gives a consistent yardstick across differently trained models. These hit rates measure agreement with COCO detection labels via keyword matching on the 200-image subset; they are not raw ImageNet accuracy. For reference, all three classifiers exceed 80% top-1 on the full ImageNet validation set.
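The exact keyword-matching rule is not spelled out here; one plausible sketch (function name and tokenization are assumptions) is to split each ImageNet class name into words and count a hit when any word overlaps a COCO label:

```python
def keyword_hit(topk_prediction_names, coco_labels):
    """True if any predicted ImageNet class name shares a word
    with any COCO ground-truth label on the image."""
    pred_words = set()
    for name in topk_prediction_names:
        # ImageNet names are comma-separated synonym lists, e.g. "tabby, tabby cat"
        pred_words.update(w.strip().lower() for w in name.replace(",", " ").split())
    gt_words = {w.lower() for label in coco_labels for w in label.split()}
    return bool(pred_words & gt_words)

preds = ["tabby, tabby cat", "Egyptian cat"]
gt = ["cat", "couch"]
print(keyword_hit(preds, gt))  # True: "cat" appears on both sides
```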
Classification (hit rate against COCO detection labels, 200 images):
| Model | Parameters | Top-1 hit | Top-5 hit | Latency | Peak VRAM |
|---|---|---|---|---|---|
| Argus (EUPE-ViT-B) | 86 M | 42.2% | 66.8% | 13.1 ms | 0.34 GB |
| ConvNeXt-Base | 89 M | 40.2% | 71.4% | 10.4 ms | 0.35 GB |
| ResNet50 | 26 M | 36.2% | 61.8% | 8.4 ms | 0.12 GB |
Segmentation:
| Model | Parameters | Classes | Latency | Peak VRAM |
|---|---|---|---|---|
| Argus (EUPE + linear head) | 86 M | 150 | 11.8 ms | 0.41 GB |
| DeepLabV3-ResNet50 | 42 M | 21 | 15.9 ms | 0.33 GB |
Depth:
| Model | Parameters | Latency | Peak VRAM |
|---|---|---|---|
| Argus (EUPE + linear head) | 86 M | 13.3 ms | 0.35 GB |
| Depth-Anything-V2-Base | 98 M | 18.8 ms | 0.68 GB |
Argus posts the highest top-1 hit rate of the three classifiers, while ConvNeXt-Base leads on top-5 (which is characteristic of trained softmax heads relative to kNN). It is faster than DeepLabV3 while predicting a much richer label space, and faster than Depth-Anything-V2 while using roughly half the VRAM. Although these baselines and Argus were trained for different objectives on different datasets, the comparison is useful for understanding what the model delivers in practice.
Usage
```python
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True)
image = Image.open("your_image.jpg").convert("RGB")

# Any single task can be called directly:
top5 = model.classify(image, top_k=5)
seg = model.segment(image)    # returns [H, W] class indices
depth = model.depth(image)    # returns [H, W] metric depth in meters

# Or all three can be run at once in a single call:
result = model.perceive(image)
# result["classification"] -> list of top-5 {"class_id", "class_name", "score"}
# result["segmentation"]   -> numpy array of ADE20K class indices
# result["depth"]          -> numpy array of depth values in meters
# result["timings_ms"]     -> per-task latency breakdown

# Keypoint correspondence requires two images and a set of source points:
target = Image.open("other_image.jpg").convert("RGB")
src_points = [[100, 100], [200, 200]]
predicted_target_points = model.correspond(image, target, src_points)
```
The model uses HuggingFace's custom-code mechanism (trust_remote_code=True), so the loader code is fetched from the model repo automatically. No additional files need to be cloned.
Training
The backbone is frozen for every task. Only the task heads are trained, and the class prototypes are extracted (not trained at all).
Heads
| Component | Source dataset | Trained by |
|---|---|---|
| EUPE-ViT-B backbone | LVD-1689M (approximately 1.7 billion web images) | Meta FAIR (used here frozen) |
| Segmentation head | ADE20K (20,210 training images, 2,000 validation images) | This repository, 40,000 iterations of linear-probe training |
| Depth head | NYU Depth V2 (24,231 training images) | This repository, 38,400 iterations of linear-probe training |
| Class prototypes | ImageNet-1k (1.28 million training images) | This repository, mean CLS feature per class |
| Correspondence | None (training-free) | n/a |
The trainable heads sum to 317,847 parameters, about 0.37% of the 85.7M-parameter backbone. The unified model.safetensors is 332 MB; the backbone accounts for nearly all of it, with the prototype matrix at roughly 1.5% and the segmentation and depth heads together at about 1% (3.7 MB).
Architecture details
Segmentation head: BatchNorm2d(768) → Conv2d(768, 150, 1×1); 116,886 parameters, 1.4 MB on disk. Trained at 512×512 with cross-entropy loss, AdamW (lr 1e-3, weight decay 1e-3), WarmupOneCycleLR with 1500-step warmup, batch size 16.
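The head as described is small enough to write out directly. A minimal PyTorch sketch of the stated BatchNorm + 1×1 convolution structure (the variable name is illustrative):

```python
import torch.nn as nn

# BatchNorm over the 768-dim patch features, then a 1x1 convolution
# projecting to 150 ADE20K class logits.
seg_head = nn.Sequential(
    nn.BatchNorm2d(768),
    nn.Conv2d(768, 150, kernel_size=1),
)

# Parameter count matches the figure quoted above:
# BatchNorm2d affine: 2 * 768 = 1,536; Conv2d: 768 * 150 + 150 = 115,350.
n_params = sum(p.numel() for p in seg_head.parameters())
print(n_params)  # prints 116886
```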
Depth head: BatchNorm2d(768) → Conv2d(768, 256, 1×1), with the 256 output channels treated as linear depth bins between 0.001 m and 10 m and combined into a metric prediction by weighted sum; 200,961 parameters, 2.3 MB on disk. Trained on 416×544 crops with SigLoss, AdamW (lr 3e-4, weight decay 1e-3), WarmupOneCycleLR with 12,800-step warmup, batch size 16.
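The bin-to-metric conversion can be sketched as a softmax-weighted sum over linearly spaced bin centers. This is an assumption about the exact decoding (the source says "linear depth bins ... combined by weighted sum" but does not specify center placement), and the function name is illustrative:

```python
import torch

def decode_depth(bin_logits, d_min=0.001, d_max=10.0):
    """Convert per-pixel bin logits (B, 256, H, W) into metric depth (B, H, W)
    via a softmax-weighted sum over linearly spaced bin centers."""
    n_bins = bin_logits.shape[1]
    centers = torch.linspace(d_min, d_max, n_bins)          # (256,) depths in meters
    probs = bin_logits.softmax(dim=1)                       # (B, 256, H, W)
    return (probs * centers.view(1, -1, 1, 1)).sum(dim=1)   # (B, H, W)

logits = torch.zeros(1, 256, 2, 2)   # uniform logits -> mean of the bin centers
depth = decode_depth(logits)
print(depth.shape)  # torch.Size([1, 2, 2])
```

The soft weighting makes the prediction differentiable and sub-bin accurate, unlike a hard argmax over bins.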
Class prototypes are produced by running the frozen backbone over the full ImageNet-1k training set at 224×224 resolution, computing the mean L2-normalized CLS feature per class, and saving the resulting 1000×768 matrix. No training, just feature extraction.
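The per-class averaging step, sketched with numpy. This is an illustrative reconstruction (function name assumed); it normalizes each feature before averaging, matching the "mean L2-normalized CLS feature per class" recipe, and whether the resulting means are re-normalized afterwards is left unspecified here as in the source:

```python
import numpy as np

def build_prototypes(features, labels, num_classes=1000):
    """Mean of L2-normalized CLS features per class.

    features: (N, D) CLS embeddings; labels: (N,) integer class ids.
    Returns a (num_classes, D) prototype matrix.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        protos[c] = feats[labels == c].mean(axis=0)
    return protos

# Toy example: two classes in a 2-dim feature space.
feats = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 1.0]])
labels = np.array([0, 1, 1])
print(build_prototypes(feats, labels, num_classes=2))
```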
Correspondence has no learned parameters. At inference time, dense patch features are extracted from both images, upsampled to 512×512 pixel resolution, and matched by cosine similarity per source keypoint.
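The matching step reduces to one cosine-similarity argmax per source keypoint. A minimal sketch, assuming pixel-resolution feature maps are already available (the function name is illustrative):

```python
import numpy as np

def correspond(src_feats, tgt_feats, src_points):
    """Training-free dense matching: for each source keypoint, find the
    target pixel whose feature has the highest cosine similarity.

    src_feats, tgt_feats: (H, W, D) pixel-resolution feature maps
    (already upsampled from patch features); src_points: list of (x, y).
    """
    H, W, D = tgt_feats.shape
    tgt = tgt_feats.reshape(-1, D)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    matches = []
    for x, y in src_points:
        f = src_feats[y, x]
        f = f / np.linalg.norm(f)
        idx = int(np.argmax(tgt @ f))     # best-matching flattened target pixel
        matches.append((idx % W, idx // W))  # back to (x, y)
    return matches

np.random.seed(0)
src = np.random.rand(8, 8, 4)
tgt = np.roll(src, shift=2, axis=1)       # target = source shifted right by 2 px
print(correspond(src, tgt, [(1, 3)]))  # [(3, 3)]
```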
Compute
| Task | Iterations | Wall time |
|---|---|---|
| Segmentation (ADE20K) | 40,000 | ~5 hours |
| Depth (NYU Depth V2) | 38,400 | ~3 hours |
| Class prototypes (IN1k) | 1.28M images, single pass | ~45 minutes |
| Correspondence (SPair) | training-free | n/a |
Training was done on a single 48 GB workstation GPU. Peak VRAM was approximately 10 GB during segmentation training, 7 GB during depth training, and 2.5 GB during class prototype extraction.
Why minimal heads
The decision to use BatchNorm + 1×1 convolution instead of a heavier decoder is the same one the EUPE paper makes. A minimal head means downstream performance can be attributed to the backbone's features rather than to a sophisticated decoder. The same backbone with a Mask2Former-style head would produce higher segmentation numbers, but those numbers wouldn't tell you anything about the backbone in isolation.
Notes
The segmentation head was trained on ADE20K's 150-class indoor-and-urban label space, which does not align directly with COCO or other detection benchmarks. The depth head was trained on NYU Depth v2 and is indoor-biased; outdoor metric depth should be treated as approximate. Classification uses kNN over class prototypes rather than a trained softmax head, which produces more decisive top-1 predictions but flatter top-k distributions.
License
The EUPE-ViT-B backbone weights inside this checkpoint were released by Meta FAIR under the FAIR Research License, which restricts use to non-commercial research and education. The task heads and class prototypes in this checkpoint were trained independently by the author of this repository and would on their own be releasable under a permissive license. However, because they are inseparably bundled with the backbone weights in a single file, the unified checkpoint inherits the more restrictive license of its most restricted component. In practical terms, the entire unified checkpoint should be treated as released under the FAIR Research License. See LICENSE for the full text.
Citation
If you use Argus or the underlying EUPE backbone in academic work, please cite the original paper:
@misc{zhu2026eupe,
title={Efficient Universal Perception Encoder},
author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
year={2026},
eprint={2603.22387},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgements
The EUPE backbone was trained and released by Meta FAIR. The dataset loading utilities are from the DINOv3 repository. The Argus task heads, benchmarks, and packaging were done by phanerozoic.