phanerozoic committed
Commit 83014b9 · verified · 1 Parent(s): 0114ccd

Argus v1.1: add trained FCOS detection head as the fifth task


Adds object detection to Argus via an FCOS-style anchor-free detector
built on a ViTDet-style simple feature pyramid, trained on COCO 2017
train (117,266 images, 80 classes) with the EUPE-ViT-B backbone frozen.

Detection results on COCO val2017 (5,000 images)
-------------------------------------------------
mAP@[0.5:0.95] = 41.0
mAP@0.50 = 64.8
mAP@0.75 = 43.2
mAP small/med/lg = 21.4 / 44.9 / 62.1

For context, FCOS with a fully-trained ResNet-50-FPN backbone achieves
39.1 mAP on the same benchmark. The frozen EUPE-ViT-B backbone exceeds
that baseline at 41.0 mAP while sharing its features with four other
task heads simultaneously.

Architecture
------------
The simple feature pyramid takes the backbone stride-16 spatial features
and synthesizes five levels (P3 through P7, strides 8 through 128) via
a transposed convolution for P3, identity with channel reduction for P4,
and chained stride-2 convolutions for P5-P7, each with 256 channels and
GroupNorm. Two shared four-layer conv towers (classification and
regression) with GroupNorm and GELU process each level. Three prediction
heads output 80 classification channels, 4 box regression channels
(left/top/right/bottom distances, exponentiated with learned per-level
scale), and 1 centerness channel. 16.14M trainable parameters total.
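As a rough illustration, the pyramid construction described above can be sketched in PyTorch as follows. The class name, the 768-channel ViT-B input width, and the GroupNorm group count of 32 are assumptions for the sketch, not the shipped code:

```python
import torch
from torch import nn

class SimpleFeaturePyramidSketch(nn.Module):
    """Hypothetical sketch of a ViTDet-style simple feature pyramid:
    stride-16 backbone features are expanded to five levels P3-P7
    (strides 8-128), each with 256 GroupNorm'd channels."""

    def __init__(self, in_ch=768, ch=256):
        super().__init__()
        gn = lambda: nn.GroupNorm(32, ch)
        # P3: transposed conv doubles resolution (stride 16 -> 8)
        self.p3 = nn.Sequential(nn.ConvTranspose2d(in_ch, ch, 2, stride=2), gn())
        # P4: identity spatial size, channel reduction only
        self.p4 = nn.Sequential(nn.Conv2d(in_ch, ch, 1, bias=False), gn())
        # P5-P7: chained stride-2 convs, each halving resolution
        self.down = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, ch, 3, 2, 1, bias=False), gn())
            for _ in range(3))

    def forward(self, x):            # x: stride-16 features (B, in_ch, H, W)
        feats = [self.p3(x), self.p4(x)]
        y = feats[-1]
        for d in self.down:          # P5, P6, P7
            y = d(y)
            feats.append(y)
        return feats                 # strides 8, 16, 32, 64, 128
```

For a 640px input the stride-16 map is 40x40, so the five levels come out at 80, 40, 20, 10, and 5 pixels on a side.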

Training recipe
---------------
640px input with letterbox padding, batch 64, AdamW lr 1e-3, cosine
schedule with 3% warmup, weight decay 1e-4, gradient clipping at 10.0,
8 epochs, full FP32 throughout. Focal loss (alpha 0.25, gamma 2.0) for
classification, GIoU for boxes, BCE for centerness. ~6 hours wall clock
on a single RTX 6000 Ada at 0.7 it/s with 23 GB peak VRAM.
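The classification and box loss terms can be sketched as below. These are generic sigmoid-focal and GIoU implementations matching the stated hyperparameters (alpha 0.25, gamma 2.0), not the repository's actual code; centerness would simply use F.binary_cross_entropy_with_logits:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss: down-weights easy examples via (1 - p_t)^gamma
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

def giou_loss(pred, tgt):
    # pred, tgt: (N, 4) boxes as x1, y1, x2, y2; returns mean of 1 - GIoU
    x1 = torch.maximum(pred[:, 0], tgt[:, 0])
    y1 = torch.maximum(pred[:, 1], tgt[:, 1])
    x2 = torch.minimum(pred[:, 2], tgt[:, 2])
    y2 = torch.minimum(pred[:, 3], tgt[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (tgt[:, 2] - tgt[:, 0]) * (tgt[:, 3] - tgt[:, 1])
    union = area_p + area_t - inter
    iou = inter / union
    # smallest box enclosing both; GIoU penalizes its empty area
    ex1 = torch.minimum(pred[:, 0], tgt[:, 0])
    ey1 = torch.minimum(pred[:, 1], tgt[:, 1])
    ex2 = torch.maximum(pred[:, 2], tgt[:, 2])
    ey2 = torch.maximum(pred[:, 3], tgt[:, 3])
    enc = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enc - union) / enc
    return (1 - giou).mean()
```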

API
---
model.detect(image) returns a list of dicts:
[{"box": [x1,y1,x2,y2], "score": float, "label": int, "class_name": str}]

Detection uses a separate forward pass at 640px (the other tasks use
224/512/416), so it lives in its own method rather than in perceive().
Accepts single images or batches. Configurable score_thresh, nms_thresh,
and max_per_image.
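For reference, the greedy IoU-based suppression that nms_thresh controls can be sketched in NumPy. This is a generic implementation, not Argus's internal one:

```python
import numpy as np

def nms(boxes, scores, nms_thresh=0.6):
    # Greedy NMS: keep the highest-scoring box, drop overlapping rivals.
    # boxes: (N, 4) as x1, y1, x2, y2; returns kept indices.
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= nms_thresh]
    return keep
```

In a pipeline like detect()'s, boxes scoring below score_thresh would be dropped before this step and the kept list truncated to max_per_image afterward.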

Backward compatibility
----------------------
All existing methods (classify, segment, depth, perceive, correspond)
return identical results to v1.0. The detection head adds 16.14M
parameters and 62 MB to the checkpoint (334 MB to 396 MB). perceive()
does not include detection in its output.
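The reported size delta is consistent with the parameter count, assuming one fp32 value per trainable parameter and MiB-style units:

```python
# 16.14M fp32 parameters, 4 bytes each, converted to MiB
params = 16.14e6
mib = params * 4 / 2**20   # ~61.6 MiB, matching the ~62 MB checkpoint growth
```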

Files changed
-------------
argus.py: +SimpleFeaturePyramid, +FCOSHead, +DetectionHead,
+detect() method, +_make_locations, +_decode_detections,
+_letterbox_to_square, +COCO_CLASSES, +FPN_STRIDES,
extended _init_weights for Conv2d/GroupNorm
model.safetensors: +79 detection_head.* tensors (334 MB to 396 MB)
config.json: +detection_num_classes, +detection_fpn_channels,
+detection_num_convs
README.md: detection in architecture diagram, mAP table,
detect() usage example, head specs, training details

Files changed (1)
  1. config.json +3 -0

config.json CHANGED
@@ -2018,5 +2018,8 @@
   "ear, spike, capitulum",
   "toilet tissue, toilet paper, bathroom tissue"
 ],
+  "detection_num_classes": 80,
+  "detection_fpn_channels": 256,
+  "detection_num_convs": 4,
   "torch_dtype": "float32"
 }