Simple ViT - Object Detection on COCO

Faster R-CNN with Simple ViT-Tiny backbone (learned positional embeddings)

Model Details

  • Architecture: Faster R-CNN with ViT-Tiny backbone
  • Backbone: Simple ViT (192-dim, 12 layers, 3 heads)
  • Positional Embedding: SIMPLE (learned)
  • Training Resolution: 512x512
  • Dataset: COCO 2017
  • Framework: MMDetection

Training Configuration

Parameter     Value
Image Size    512x512
Patch Size    16x16
Hidden Dim    192
Layers        12
Heads         3
MLP Dim       768
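
The values in this table determine the token grid the backbone sees. The sketch below works through that arithmetic; the function name `vit_token_geometry` is hypothetical and not part of this repository, and the positional-embedding size assumes one learned vector per patch with no class token, which this card does not confirm.

```python
def vit_token_geometry(image_size, patch_size, hidden_dim, num_heads, mlp_dim):
    """Derive token-grid quantities from the training configuration.

    Hypothetical helper for illustration only; not part of the model code.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by num_heads")

    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return {
        "patches_per_side": patches_per_side,
        "num_patches": num_patches,
        # Channels handled by each attention head.
        "head_dim": hidden_dim // num_heads,
        # Expansion factor of the feed-forward block.
        "mlp_ratio": mlp_dim / hidden_dim,
        # Assumption: one learned vector per patch, no class token.
        "pos_embed_params": num_patches * hidden_dim,
    }


# Values taken from the table above.
geometry = vit_token_geometry(
    image_size=512, patch_size=16, hidden_dim=192, num_heads=3, mlp_dim=768
)
for name, value in geometry.items():
    print(f"{name}: {value}")
```

At the 512x512 training resolution this gives a 32x32 grid, i.e. 1024 tokens of width 192, with 64 channels per head.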

Checkpoint Info

  • Filename: best_coco_bbox_mAP_epoch_12.pth
  • Size: 114.9 MB
  • Epoch: 12

Usage

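from mmdet.apis import init_detector, inference_detector

# Paths are relative to the repository root; adjust them to your checkout.
config_file = 'detection/configs/faster_rcnn_simple_vit_tiny_coco.py'
checkpoint_file = 'best_coco_bbox_mAP_epoch_12.pth'

# Build the detector from the config and load the trained weights.
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# Run inference on a single image given by file path.
result = inference_detector(model, 'test.jpg')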
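
The raw result usually contains many low-confidence boxes, so a typical next step is to filter by score. The following is a minimal sketch of that step on plain Python lists. The helper `filter_detections` and the default threshold of 0.3 are hypothetical choices, and the comment showing how to extract the lists assumes MMDetection 3.x, where `inference_detector` returns a `DetDataSample` with a `pred_instances` field; MMDetection 2.x returns a different structure.

```python
def filter_detections(bboxes, scores, labels, score_thr=0.3):
    """Keep detections with score >= score_thr, sorted by descending score.

    Hypothetical helper for illustration; score_thr=0.3 is an arbitrary default.
    Returns a list of (bbox, score, label) tuples, bbox as [x1, y1, x2, y2].
    """
    if not (len(bboxes) == len(scores) == len(labels)):
        raise ValueError("bboxes, scores and labels must have the same length")

    kept = [
        (list(box), float(score), int(label))
        for box, score, label in zip(bboxes, scores, labels)
        if score >= score_thr
    ]
    kept.sort(key=lambda det: det[1], reverse=True)
    return kept


# Assumption (MMDetection 3.x): the inputs could be obtained with
#   pred = result.pred_instances
#   bboxes = pred.bboxes.cpu().tolist()
#   scores = pred.scores.cpu().tolist()
#   labels = pred.labels.cpu().tolist()
# The values below are made-up placeholders, not real model output.
example_bboxes = [[0, 0, 10, 10], [5, 5, 20, 20], [1, 1, 2, 2]]
example_scores = [0.2, 0.9, 0.5]
example_labels = [0, 1, 2]

for box, score, label in filter_detections(
    example_bboxes, example_scores, example_labels
):
    print(f"label={label} score={score:.2f} box={box}")
```

Keeping the filter independent of MMDetection types makes it easy to reuse if the result structure differs between versions.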

Citation

If you use this model, please cite:

@misc{vit_detection_coco,
  title={Vision Transformer Object Detection with Simple ViT},
  year={2026},
  publisher={Hugging Face},
}

License

Apache 2.0
