# UniFormer-S_h14 – COCO-pretrained backbone (acaua mirror)
Backbone-only mirror. This is not a runnable detection or segmentation model on its own. It ships the UniFormer-S backbone weights as trained jointly with the upstream Mask R-CNN detector on COCO; the task head (FPN + RPN + ROI + mask head) has been stripped.
The mirror exists to gate acaua's Stage 1.5 UniFormer-dense-prediction spike: testing whether UniFormer-S can be hosted as a backbone inside `torchvision.models.detection.MaskRCNN` and `transformers.UperNetForSemanticSegmentation` without forcing users to download the full 165 MB upstream checkpoint from Google Drive every time.
## Provenance
| Field | Value |
|---|---|
| Upstream code | `Sense-X/UniFormer` @ `main` (Apache-2.0); specifically `object_detection/mmdet/models/backbones/uniformer.py` |
| Upstream weights | Google Drive file id `13KhBYkHKQg-CyhAgn1LQM1K0R4bwSpWT`, filename `mask_rcnn_1x_uniformer_s_h14.pth` (165 MB full Mask R-CNN; we stripped the head) |
| Upstream SHA256 (full checkpoint) | `aa1e6bbec1c83344de96705f0e1aee853f1eec78df365e41ec802c202f00d9cf` |
| Upstream report | box mAP 45.6 / mask mAP 41.6 on COCO val, 1x schedule, single-clip single-scale eval |
| Architecture | `depth=[3,4,8,3]`, `embed_dim=[64,128,320,512]`, `head_dim=64`, `hybrid=True`, `window_size=14` (upstream's standard UniFormer-S_h14 dense-prediction variant) |
| Backbone params | 21.04M (of 41M full Mask R-CNN total) |
| Mirrored on | 2026-04-24 |
| Mirrored by | `CondadosAI/acaua` |
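Two steps of the mirroring described above can be sketched in plain Python: verifying the downloaded full checkpoint against the upstream SHA256, and filtering the detector head out of the state dict. This is a sketch, not the exact script used for this mirror; the `backbone.` key prefix is an assumption based on the usual mmdet checkpoint layout (and mmdet checkpoints typically nest tensors under a `state_dict` key), so adjust if the upstream keys differ.

```python
import hashlib

UPSTREAM_SHA256 = "aa1e6bbec1c83344de96705f0e1aee853f1eec78df365e41ec802c202f00d9cf"


def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a 165 MB checkpoint never sits in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def strip_head(state_dict, prefix="backbone."):
    """Keep only backbone tensors and drop the FPN/RPN/ROI/mask-head entries.

    The 'backbone.' prefix is an assumption from the common mmdet layout;
    keys are re-rooted so the result loads directly into a bare backbone.
    """
    return {
        k[len(prefix):]: v
        for k, v in state_dict.items()
        if k.startswith(prefix)
    }
```

A full-checkpoint workflow would then be roughly: `assert sha256sum(pth_path) == UPSTREAM_SHA256`, load with `torch.load`, take its `"state_dict"` entry if present, and pass it through `strip_head` before saving as safetensors.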
## Usage via acaua
Not usable through `acaua.Model.from_pretrained` yet: this repo ships backbone-only infrastructure for the upcoming Stage 1.5.b spike. Direct usage:
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

from acaua.adapters.uniformer._backbone_dense import UniFormer2DDense
from acaua.adapters.uniformer._config import DENSE_VARIANTS

# Download once; hf_hub_download caches the file locally on later calls.
path = hf_hub_download(
    "CondadosAI/uniformer_s_h14_backbone_coco", "model.safetensors"
)
sd = load_file(path)

m = UniFormer2DDense(DENSE_VARIANTS["s_h14_det"])
m.load_state_dict(sd, strict=True)
m.eval()
# ...plug into a detection/segmentation head...
```
The backbone emits four multi-scale feature maps with strides 4/8/16/32. At 800×1280 input, the shapes are `(1, 64, 200, 320)`, `(1, 128, 100, 160)`, `(1, 320, 50, 80)`, and `(1, 512, 25, 40)`.
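The stride/shape relationship above is plain integer arithmetic: each output spatial size is the input size divided by the stage stride, and the channel widths are the `embed_dim` values from the Provenance table. A small sketch (the helper name is ours, not part of acaua):

```python
def dense_feature_shapes(
    height,
    width,
    channels=(64, 128, 320, 512),  # embed_dim per stage (UniFormer-S)
    strides=(4, 8, 16, 32),        # cumulative downsampling per stage
    batch=1,
):
    """Expected (N, C, H, W) shapes of the four backbone feature maps.

    Each stage downsamples by exactly its stride, so integer division
    matches the real output shapes for inputs divisible by 32.
    """
    return [(batch, c, height // s, width // s) for c, s in zip(channels, strides)]


print(dense_feature_shapes(800, 1280))
# -> [(1, 64, 200, 320), (1, 128, 100, 160), (1, 320, 50, 80), (1, 512, 25, 40)]
```

Inputs not divisible by 32 get padded by the detection pipeline before reaching the backbone, so the divisions above stay exact in practice.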
## Files
- `model.safetensors` – backbone weights (332 tensors, 21M params).
- `NOTICE` – attribution chain (code + weights).
- `LICENSE` – Apache-2.0.
## License
Apache-2.0. The upstream UniFormer code and weights are redistributed under their original Apache-2.0 declaration; see `NOTICE` for the attribution chain.
## Citation
```bibtex
@inproceedings{li2022uniformer,
  title     = {UniFormer: Unifying Convolution and Self-attention for Visual Recognition},
  author    = {Li, Kunchang and Wang, Yali and Zhang, Junhao and Gao, Peng and Song, Guanglu and Liu, Yu and Li, Hongsheng and Qiao, Yu},
  booktitle = {ICLR},
  year      = {2022},
}
```