Simple ViT - Object Detection on COCO

Faster R-CNN with Simple ViT-Tiny backbone (learned positional embeddings)

Model Details

Architecture: Faster R-CNN with ViT-Tiny backbone
Backbone: Simple ViT (192-dim, 12 layers, 3 heads)
Positional Embedding: SIMPLE (learned positional embeddings)
Training Resolution: 512x512
Dataset: COCO 2017
Framework: MMDetection

Training Configuration

Parameter	Value
Image Size	512x512
Patch Size	16x16
Hidden Dim	192
Layers	12
Heads	3
MLP Dim	768
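
The table values imply a few derived quantities that are useful when checking a config against this card. The sketch below only does the arithmetic; it assumes the backbone sees exactly a 512x512 input and that no extra tokens (such as a class token) are added, neither of which this card states explicitly.

```python
# Values copied from the Training Configuration table.
image_size = 512
patch_size = 16
hidden_dim = 192
num_heads = 3
mlp_dim = 768

# Derived quantities (assumption: square input, no extra tokens).
patches_per_side = image_size // patch_size  # patches along one image edge
num_patches = patches_per_side ** 2          # sequence length fed to the transformer
head_dim = hidden_dim // num_heads           # channels per attention head
mlp_ratio = mlp_dim // hidden_dim            # MLP expansion factor

print(patches_per_side, num_patches, head_dim, mlp_ratio)
```

If inference images are resized to a different resolution, the number of patches changes accordingly, while the per-head width and MLP ratio stay fixed.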

Checkpoint Info

Filename: best_coco_bbox_mAP_epoch_12.pth
Size: 114.9 MB
Epoch: 12

Usage

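from mmdet.apis import init_detector, inference_detector

# Paths to the model config and the trained weights (adjust to your local checkout)
config_file = 'detection/configs/faster_rcnn_simple_vit_tiny_coco.py'
checkpoint_file = 'best_coco_bbox_mAP_epoch_12.pth'

# Build the detector and load the checkpoint; pass device='cpu' if no GPU is available
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# Run detection on a single image file
result = inference_detector(model, 'test.jpg')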
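
To make use of the output, detections are usually filtered by confidence. The following is a minimal sketch, not part of this repository: `keep_confident` is a hypothetical helper, and the 0.5 threshold is an arbitrary choice. It assumes MMDetection 3.x, where `result.pred_instances` exposes `bboxes`, `scores`, and `labels`; in MMDetection 2.x the result is instead a per-class list of arrays and would need different handling. Stand-in values are used so the snippet runs without a model.

```python
def keep_confident(bboxes, scores, labels, score_thr=0.5):
    """Return (bbox, score, label) triples whose score is at least score_thr.

    Hypothetical helper for illustration; inputs are parallel sequences.
    """
    return [
        ([float(v) for v in box], float(score), int(label))
        for box, score, label in zip(bboxes, scores, labels)
        if float(score) >= score_thr
    ]


# Stand-in detections in (x1, y1, x2, y2) format. With MMDetection 3.x one
# would instead pass (assumption about the installed version):
#   pred = result.pred_instances
#   keep_confident(pred.bboxes.tolist(), pred.scores.tolist(), pred.labels.tolist())
example_bboxes = [[0, 0, 10, 10], [5, 5, 20, 20]]
example_scores = [0.9, 0.3]
example_labels = [0, 17]

detections = keep_confident(example_bboxes, example_scores, example_labels)
for box, score, label in detections:
    print(label, round(score, 2), box)
```

The labels are class indices; mapping them to COCO category names depends on the dataset metadata attached to the loaded model.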

Citation

If you use this model, please cite:

@misc{vit_detection_coco,
  title={Vision Transformer Object Detection with Simple ViT},
  year={2026},
  publisher={Hugging Face},
}

License

Apache 2.0
