---
license: apache-2.0
tags:
- object-detection
- vision-transformer
- coco
- faster-rcnn
- positional-embeddings
- simple-vit
datasets:
- coco
library_name: mmdetection
---

# Simple ViT - Object Detection on COCO

Faster R-CNN with Simple ViT-Tiny backbone (learned positional embeddings)

## Model Details

- **Architecture**: Faster R-CNN with ViT-Tiny backbone
- **Backbone**: Simple ViT (192-dim, 12 layers, 3 heads)
- **Positional Embedding**: SIMPLE
- **Training Resolution**: 512x512
- **Dataset**: COCO 2017
- **Framework**: MMDetection

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Image Size | 512x512 |
| Patch Size | 16x16 |
| Hidden Dim | 192 |
| Layers | 12 |
| Heads | 3 |
| MLP Dim | 768 |

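The table values fix the token layout the backbone works with. As a rough sketch of that arithmetic, the hypothetical helper `vit_token_geometry` below assumes non-overlapping square patches and no class token; neither assumption is confirmed by this card.

```python
def vit_token_geometry(image_size=512, patch_size=16, hidden_dim=192,
                       heads=3, mlp_dim=768):
    """Derive token-grid sizes from the table values (illustrative helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if hidden_dim % heads != 0:
        raise ValueError("hidden_dim must be divisible by heads")

    grid = image_size // patch_size   # patches per image side
    num_patches = grid * grid         # tokens per image (assumes no class token)
    head_dim = hidden_dim // heads    # channels per attention head
    mlp_ratio = mlp_dim / hidden_dim  # MLP expansion factor

    return {
        "grid": grid,
        "num_patches": num_patches,
        "head_dim": head_dim,
        "mlp_ratio": mlp_ratio,
        # A learned positional embedding would store one vector per token.
        "pos_embed_shape": (num_patches, hidden_dim),
    }


geometry = vit_token_geometry()
print(geometry)
```

With the defaults taken from the table, this gives a 32x32 patch grid, 1024 tokens per image, 64 channels per attention head, and an MLP expansion factor of 4.
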
## Checkpoint Info

- **Filename**: `best_coco_bbox_mAP_epoch_12.pth`
- **Size**: 114.9 MB
- **Epoch**: 12

## Usage

```python
from mmdet.apis import init_detector, inference_detector

# Config and checkpoint paths (adjust to your local copies)
config_file = 'detection/configs/faster_rcnn_simple_vit_tiny_coco.py'
checkpoint_file = 'best_coco_bbox_mAP_epoch_12.pth'

# Build the detector and load the weights; use device='cpu' if no GPU is available
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# Run inference on a single image
result = inference_detector(model, 'test.jpg')
```

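The structure of `result` depends on the installed MMDetection version, which this card does not state. Assuming MMDetection 3.x, where the returned `DetDataSample` exposes `pred_instances.bboxes`, `pred_instances.scores` and `pred_instances.labels`, those tensors could be converted with `.tolist()` and passed to a small helper such as the hypothetical `filter_detections` below. The sketch uses toy values in place of real model output, and the 0.5 threshold is an arbitrary choice.

```python
def filter_detections(boxes, scores, labels, score_thr=0.5):
    """Keep detections whose confidence is at least score_thr (illustrative helper)."""
    kept = []
    for box, score, label in zip(boxes, scores, labels):
        if score >= score_thr:
            kept.append({
                "box": [float(v) for v in box],  # [x1, y1, x2, y2] in pixels
                "score": float(score),
                "label": int(label),             # class index
            })
    return kept


# Toy values standing in for real model output.
boxes = [[10, 20, 110, 220], [5, 5, 50, 50]]
scores = [0.92, 0.30]
labels = [0, 17]

for det in filter_detections(boxes, scores, labels):
    print(det)
```
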
## Citation

If you use this model, please cite:

```bibtex
@misc{vit_detection_coco,
  title={Vision Transformer Object Detection with Simple ViT},
  year={2026},
  publisher={Hugging Face},
}
```

## License

Apache 2.0