metadata
license: apache-2.0
tags:
- object-detection
- vision-transformer
- coco
- faster-rcnn
- positional-embeddings
- simple-vit
datasets:
- coco
library_name: mmdetection
Simple ViT - Object Detection on COCO
Faster R-CNN with Simple ViT-Tiny backbone (learned positional embeddings)
Model Details
- Architecture: Faster R-CNN with ViT-Tiny backbone
- Backbone: Simple ViT (192-dim, 12 layers, 3 heads)
- Positional Embedding: SIMPLE
- Training Resolution: 512x512
- Dataset: COCO 2017
- Framework: MMDetection
Training Configuration
| Parameter | Value |
|---|---|
| Image Size | 512x512 |
| Patch Size | 16x16 |
| Hidden Dim | 192 |
| Layers | 12 |
| Heads | 3 |
| MLP Dim | 768 |
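The table values determine a few derived quantities that help when estimating memory use or adapting the config. The sketch below works them out with plain arithmetic; the variable names are illustrative and not taken from the training config, and the token count excludes any class token, since the card does not say whether one is used.

```python
# Values copied from the Training Configuration table.
image_size = 512   # pixels per side
patch_size = 16    # pixels per side
hidden_dim = 192
num_heads = 3
mlp_dim = 768

# Derived quantities (illustrative names, not from the actual config file).
patches_per_side = image_size // patch_size   # 512 / 16 = 32
num_patches = patches_per_side ** 2           # 32 * 32 = 1024 patch tokens
head_dim = hidden_dim // num_heads            # 192 / 3 = 64 dims per head
mlp_ratio = mlp_dim / hidden_dim              # 768 / 192 = 4.0

print(patches_per_side, num_patches, head_dim, mlp_ratio)
```

At the training resolution, each image is split into a 32x32 grid, so the backbone processes 1024 patch tokens per image.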
Checkpoint Info
- Filename: best_coco_bbox_mAP_epoch_12.pth
- Size: 114.9 MB
- Epoch: 12
Usage
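from mmdet.apis import init_detector, inference_detector

# Paths to the model config and the trained checkpoint
config_file = 'detection/configs/faster_rcnn_simple_vit_tiny_coco.py'
checkpoint_file = 'best_coco_bbox_mAP_epoch_12.pth'

# Build the detector and load the weights (use device='cpu' if no GPU is available)
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# Run inference on a single image
result = inference_detector(model, 'test.jpg')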
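Raw detector output usually contains many low-confidence boxes, so a common next step is to keep only detections above a score threshold. The structure of `result` depends on the MMDetection version, so the sketch below works on plain Python lists of boxes, scores, and labels that you would first extract from it. The helper `filter_detections`, its default threshold of 0.3, and the sample values are all hypothetical and not part of MMDetection or this model.

```python
def filter_detections(bboxes, scores, labels, score_thr=0.3):
    """Keep detections whose confidence is at least score_thr.

    bboxes: list of [x1, y1, x2, y2] boxes
    scores: list of confidence scores, one per box
    labels: list of integer class indices, one per box
    """
    kept = []
    for box, score, label in zip(bboxes, scores, labels):
        if score >= score_thr:
            kept.append(
                {"bbox": list(box), "score": float(score), "label": int(label)}
            )
    return kept


# Made-up values for illustration only, not real model output.
example = filter_detections(
    bboxes=[[10, 20, 110, 220], [5, 5, 50, 50]],
    scores=[0.92, 0.12],
    labels=[0, 17],
)
print(example)
```

With the sample values, only the first box passes the threshold. The right threshold depends on the application, so treat 0.3 as a starting point.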
Citation
If you use this model, please cite:
@misc{vit_detection_coco,
title={Vision Transformer Object Detection with Simple ViT},
year={2026},
publisher={Hugging Face},
}
License
Apache 2.0