---
license: apache-2.0
tags:
- object-detection
- vision-transformer
- coco
- faster-rcnn
- positional-embeddings
- simple-vit
datasets:
- coco
library_name: mmdetection
---

# Simple ViT - Object Detection on COCO

A Faster R-CNN object detector with a Simple ViT-Tiny backbone (learned positional embeddings), trained on COCO 2017.

## Model Details

- **Architecture**: Faster R-CNN with ViT-Tiny backbone
- **Backbone**: Simple ViT (192-dim, 12 layers, 3 heads)
- **Positional Embedding**: SIMPLE (learned)
- **Training Resolution**: 512x512
- **Dataset**: COCO 2017
- **Framework**: MMDetection

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Image Size | 512x512 |
| Patch Size | 16x16 |
| Hidden Dim | 192 |
| Layers | 12 |
| Heads | 3 |
| MLP Dim | 768 |
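
The table values imply a few derived quantities that are useful when adapting the config. The sketch below only does arithmetic on the numbers listed above; the variable names are illustrative and do not come from the training code.

```python
# Values copied from the Training Configuration table.
image_size = 512
patch_size = 16
hidden_dim = 192
num_heads = 3
mlp_dim = 768

# Number of non-overlapping patches along one side of the image.
patches_per_side = image_size // patch_size

# Total number of patch tokens the backbone sees per image.
num_patches = patches_per_side ** 2

# Per-head channel width in multi-head attention.
head_dim = hidden_dim // num_heads

# Expansion ratio of the MLP block relative to the hidden dimension.
mlp_ratio = mlp_dim // hidden_dim

print(f"patch grid: {patches_per_side}x{patches_per_side}")
print(f"patch tokens per image: {num_patches}")
print(f"head dim: {head_dim}, MLP ratio: {mlp_ratio}")
```

At the training resolution this gives a 32x32 patch grid (1024 tokens), a head dimension of 64, and an MLP ratio of 4.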

## Checkpoint Info

- **Filename**: `best_coco_bbox_mAP_epoch_12.pth`
- **Size**: 114.9 MB
- **Epoch**: 12
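
As a rough sanity check, the file size gives an upper bound on the number of stored values. The sketch below assumes the checkpoint holds 32-bit floats and that "MB" means 10^6 bytes; neither is stated in this card, and any extra state in the file (metadata, optimizer state) would lower the actual parameter count.

```python
# Checkpoint size as listed above.
size_mb = 114.9

# Assumption: 1 MB = 1,000,000 bytes.
size_bytes = round(size_mb * 1_000_000)

# Assumption: every stored value is a 4-byte float32.
bytes_per_value = 4
max_fp32_values = size_bytes // bytes_per_value

print(f"at most ~{max_fp32_values / 1e6:.1f}M float32 values")
```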

## Usage

```python
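from mmdet.apis import init_detector, inference_detector

# Config path is relative to the training repository; adjust it to your checkout.
config_file = 'detection/configs/faster_rcnn_simple_vit_tiny_coco.py'
# Checkpoint file distributed with this model card.
checkpoint_file = 'best_coco_bbox_mAP_epoch_12.pth'

# Build the detector and load the weights; use device='cpu' if no GPU is available.
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# Run inference on a single image given by its path.
result = inference_detector(model, 'test.jpg')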
```
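
Raw detector output usually includes many low-confidence boxes, so a score threshold is a common next step. The helper below is a minimal, framework-independent sketch: `filter_detections`, the 0.3 threshold, and the example values are hypothetical and not part of this repository. It assumes boxes, scores, and labels are available as plain Python lists.

```python
def filter_detections(bboxes, scores, labels, score_thr=0.3):
    """Keep detections whose confidence is at least score_thr."""
    kept = []
    for bbox, score, label in zip(bboxes, scores, labels):
        if score >= score_thr:
            kept.append({
                "bbox": list(bbox),      # (x1, y1, x2, y2) in pixels
                "score": float(score),   # detection confidence
                "label": int(label),     # class index
            })
    return kept


# Hypothetical detections, for illustration only.
example_bboxes = [[10, 20, 110, 220], [30, 40, 60, 80], [0, 0, 50, 50]]
example_scores = [0.92, 0.15, 0.47]
example_labels = [0, 2, 16]

kept = filter_detections(example_bboxes, example_scores, example_labels)
for det in kept:
    print(det)
```

With MMDetection 3.x, `result` is a `DetDataSample`, and the inputs can be obtained from `result.pred_instances` (for example `result.pred_instances.bboxes.tolist()`, and likewise for `scores` and `labels`). MMDetection 2.x returns per-class arrays instead and would need a different conversion.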

## Citation

If you use this model, please cite:

```bibtex
@misc{vit_detection_coco,
  title={Vision Transformer Object Detection with Simple ViT},
  year={2026},
  publisher={Hugging Face},
}
```

## License

Apache 2.0