metadata
license: apache-2.0
tags:
  - object-detection
  - vision-transformer
  - coco
  - faster-rcnn
  - positional-embeddings
  - simple-vit
datasets:
  - coco
library_name: mmdetection

Simple ViT - Object Detection on COCO

Faster R-CNN with a Simple ViT-Tiny backbone and learned positional embeddings, trained on COCO 2017.

Model Details

  • Architecture: Faster R-CNN with ViT-Tiny backbone
  • Backbone: Simple ViT (192-dim, 12 layers, 3 heads)
  • Positional Embedding: SIMPLE (learned)
  • Training Resolution: 512x512
  • Dataset: COCO 2017
  • Framework: MMDetection

Training Configuration

| Parameter  | Value   |
|------------|---------|
| Image Size | 512x512 |
| Patch Size | 16x16   |
| Hidden Dim | 192     |
| Layers     | 12      |
| Heads      | 3       |
| MLP Dim    | 768     |
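
The values above imply a few derived quantities that are handy when checking a config against this checkpoint. The snippet below is only a back-of-the-envelope sketch: the variable names are illustrative and do not come from the training code, and the token count covers patch tokens only, not any extra tokens the backbone might add.

```python
# Values copied from the Training Configuration table.
image_size = 512   # input is 512x512
patch_size = 16    # patches are 16x16
hidden_dim = 192
num_heads = 3
mlp_dim = 768

# Number of non-overlapping patches along one side and in total.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Per-head channel width and MLP expansion ratio.
head_dim = hidden_dim // num_heads
mlp_ratio = mlp_dim // hidden_dim

print(f"patch grid: {patches_per_side}x{patches_per_side} ({num_patches} patch tokens)")
print(f"head dim: {head_dim}, MLP ratio: {mlp_ratio}")
```

At the training resolution this gives a 32x32 patch grid (1024 patch tokens), 64 channels per attention head, and an MLP expansion ratio of 4.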

Checkpoint Info

  • Filename: best_coco_bbox_mAP_epoch_12.pth
  • Size: 114.9 MB
  • Epoch: 12

Usage

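from mmdet.apis import init_detector, inference_detector

# Config describing the Faster R-CNN + Simple ViT-Tiny detector
config_file = 'detection/configs/faster_rcnn_simple_vit_tiny_coco.py'
# Checkpoint listed under Checkpoint Info
checkpoint_file = 'best_coco_bbox_mAP_epoch_12.pth'

# Build the model and load the weights (use device='cpu' if no GPU is available)
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# Run detection on a single image
result = inference_detector(model, 'test.jpg')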
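
After inference, the detections usually need a score threshold before they are useful. The helper below is a minimal sketch of that step and is not part of this repository: `filter_detections` and the default `score_thr` of 0.5 are hypothetical. It works on plain Python lists, so it does not depend on MMDetection. The commented lines assume MMDetection 3.x, where `result.pred_instances` holds `bboxes`, `scores`, and `labels` tensors; older versions return a different structure.

```python
def filter_detections(bboxes, scores, labels, score_thr=0.5):
    """Keep detections whose confidence score is at least score_thr."""
    kept = []
    for box, score, label in zip(bboxes, scores, labels):
        if score >= score_thr:
            kept.append({
                "bbox": [float(v) for v in box],  # [x1, y1, x2, y2]
                "score": float(score),
                "label": int(label),              # class index
            })
    return kept


# Toy stand-in data, not real model output.
detections = filter_detections(
    bboxes=[[10, 20, 110, 220], [5, 5, 50, 50]],
    scores=[0.92, 0.31],
    labels=[0, 17],
)
print(detections)

# With MMDetection 3.x (assumption), the inputs could come from:
# pred = result.pred_instances
# detections = filter_detections(
#     pred.bboxes.tolist(), pred.scores.tolist(), pred.labels.tolist())
```

With the toy inputs, only the first box passes the threshold. The right threshold depends on the application and is worth tuning on validation images.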

Citation

If you use this model, please cite:

@misc{vit_detection_coco,
  title={Vision Transformer Object Detection with Simple ViT},
  year={2026},
  publisher={Hugging Face},
}

License

Apache 2.0