---
license: apache-2.0
tags:
- object-detection
- vision-transformer
- coco
- faster-rcnn
- positional-embeddings
- simple-vit
datasets:
- coco
library_name: mmdetection
---

# Simple ViT - Object Detection on COCO

Faster R-CNN with a Simple ViT-Tiny backbone (learned positional embeddings).

## Model Details

- **Architecture**: Faster R-CNN with ViT-Tiny backbone
- **Backbone**: Simple ViT (192-dim, 12 layers, 3 heads)
- **Positional Embedding**: SIMPLE
- **Training Resolution**: 512x512
- **Dataset**: COCO 2017
- **Framework**: MMDetection

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Image Size | 512x512 |
| Patch Size | 16x16 |
| Hidden Dim | 192 |
| Layers | 12 |
| Heads | 3 |
| MLP Dim | 768 |

## Checkpoint Info

- **Filename**: `best_coco_bbox_mAP_epoch_12.pth`
- **Size**: 114.9 MB
- **Epoch**: 12

## Usage

```python
from mmdet.apis import init_detector, inference_detector

# Paths to the MMDetection config and the trained checkpoint
config_file = 'detection/configs/faster_rcnn_simple_vit_tiny_coco.py'
checkpoint_file = 'best_coco_bbox_mAP_epoch_12.pth'

# Build the detector and load the weights onto the first GPU
# (pass device='cpu' if no GPU is available)
model = init_detector(config_file, checkpoint_file, device='cuda:0')

# Run inference on a single image
result = inference_detector(model, 'test.jpg')
```

## Citation

If you use this model, please cite:

```bibtex
@misc{vit_detection_coco,
  title={Vision Transformer Object Detection with Simple ViT},
  year={2026},
  publisher={Hugging Face},
}
```

## License

Apache 2.0
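
## Backbone Geometry (Sketch)

The values in the Training Configuration table imply a few derived quantities that are useful when checking feature-map shapes. The snippet below is a minimal sketch of that arithmetic only. The function name `vit_token_geometry` is hypothetical and not part of this repository or of MMDetection, and the calculation assumes a non-overlapping patch grid with no extra tokens (for example a class token), which this card does not specify.

```python
def vit_token_geometry(image_size, patch_size, hidden_dim, num_heads, mlp_dim):
    """Derive patch-grid and per-head sizes from the ViT hyperparameters.

    Hypothetical helper for illustration; assumes square images and
    non-overlapping square patches.
    """
    patches_per_side = image_size // patch_size
    return {
        "patches_per_side": patches_per_side,
        "num_patches": patches_per_side ** 2,
        "head_dim": hidden_dim // num_heads,
        "mlp_ratio": mlp_dim // hidden_dim,
    }


# Values copied from the Training Configuration table.
geometry = vit_token_geometry(
    image_size=512, patch_size=16, hidden_dim=192, num_heads=3, mlp_dim=768
)
print(geometry)
```

Under these assumptions a 512x512 input yields a 32x32 grid of patches (1024 tokens), each attention head works on 64 channels, and the MLP expands the hidden dimension by a factor of 4.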