---
license: mit
tags:
- computer-vision
- object-detection
- semantic-segmentation
- instance-segmentation
- pascal-voc
- multi-task-learning
library_name: pytorch
pipeline_tag: image-segmentation
datasets:
- Pascal_VOC
---

# Pascal-TriheadNet: Joint Detection & Segmentation

**Single-stage unified perception model for Pascal VOC: detection, semantic segmentation, and instance segmentation in one forward pass.**

Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.

**[View Full Code & Documentation on GitHub](https://github.com/Sivamorgan/Pascal-TriheadNet)**

## Key Highlights

- **Detection mAP@50**: 75.6%
- **Semantic mIoU**: 87.3%
- **Instance Mask mAP@50**: 65.7%
- **Architecture**: one backbone, one neck, three heads (ViT + FPN)

## Model Checkpoints

Two versions of the model are provided:

| File | Description | Size |
| :--- | :--- | :--- |
| **`checkpoint_epoch_50.pth`** | Best-performing FP32 model | 826 MB |
| **`checkpoint_epoch_50_quantized.pth`** | Optimized INT8-quantized model | 136 MB |

> **Training context**: the model was fine-tuned on an **L4 GPU** in Google Colab.
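
Loading one of the released `.pth` files would follow the usual PyTorch state-dict pattern. A minimal sketch, using a stand-in module and an assumed `model_state_dict` key (the exact checkpoint format and model class are defined in the GitHub repository):

```python
import torch
import torch.nn as nn

# Stand-in for the real Pascal-TriheadNet class (hypothetical; the actual
# model definition lives in the GitHub repository).
model = nn.Linear(4, 2)

# Round-trip a checkpoint the way the released .pth files would typically
# be consumed: load on CPU first, then move to GPU if available.
torch.save({"model_state_dict": model.state_dict()}, "demo_ckpt.pth")
ckpt = torch.load("demo_ckpt.pth", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"])
model.eval()  # switch to inference mode for evaluation
```

Loading on CPU via `map_location` avoids device-mismatch errors when a checkpoint saved on GPU is opened on a CPU-only machine.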

## Performance Metrics

Evaluated on the Pascal VOC 2012 validation set:

| Task | Metric | Score |
| :--- | :--- | :--- |
| **Detection** | mAP (0.5:0.95) | **46.7%** |
| **Detection** | mAP@50 | **75.6%** |
| **Semantic** | mIoU | **87.3%** |
| **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
| **Instance** | Mask mAP@50 | **65.7%** |

*For detailed per-class analysis and ablation studies, please refer to the [GitHub repository](https://github.com/Sivamorgan/Pascal-TriheadNet).*

## Model Overview

The architecture uses a **Vision Transformer (ViT-Base)** backbone pretrained on ImageNet:

1. **Backbone**: `vit_base_patch16_224` with the last 6 blocks fine-tuned.
2. **Neck**: a **Simple Feature Pyramid** (ViTDet-style) that builds multi-scale feature maps (P2-P5) from the single-scale ViT output.
3. **Heads**:
   * **Detection**: FCOS-style anchor-free detector.
   * **Semantic**: Panoptic-FPN-style segmentation head.
   * **Instance**: Mask R-CNN-style head using RoIAlign.
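
The neck above can be sketched in a few lines. In the ViTDet-style design, the stride-16 ViT feature map is resampled up and down to produce the P2-P5 levels; the channel counts and layer choices below are illustrative assumptions, not the repository's exact implementation:

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """ViTDet-style neck sketch: build P2-P5 (strides 4, 8, 16, 32)
    from the single-scale, stride-16 ViT feature map."""

    def __init__(self, in_ch=768, out_ch=256):
        super().__init__()
        self.p2 = nn.Sequential(  # stride 16 -> 4 (upsample x4)
            nn.ConvTranspose2d(in_ch, in_ch // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(in_ch // 2, out_ch, 2, stride=2),
        )
        self.p3 = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)  # stride 8
        self.p4 = nn.Conv2d(in_ch, out_ch, 1)                     # stride 16
        self.p5 = nn.Sequential(                                  # stride 32
            nn.MaxPool2d(2), nn.Conv2d(in_ch, out_ch, 1)
        )

    def forward(self, x):
        return {"p2": self.p2(x), "p3": self.p3(x),
                "p4": self.p4(x), "p5": self.p5(x)}

# A 224x224 input with patch size 16 yields a 14x14 token grid.
feats = SimpleFeaturePyramid()(torch.randn(1, 768, 14, 14))
```

Each task head then consumes the pyramid levels it needs (e.g. the detector reads P3-P5, the segmentation heads lean on the higher-resolution P2).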

## Training Configuration

- **Epochs**: 50
- **Batch size**: 32
- **Optimizer**: AdamW (base LR: 2e-4)
- **Loss**: weighted sum of focal loss (detection), cross-entropy/Dice (semantic/instance), and GIoU (box regression).
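
The combined objective amounts to a weighted sum over per-task losses. A minimal sketch with illustrative weights and dummy loss values (the repository's tuned coefficients and loss implementations may differ):

```python
import torch

def trihead_loss(losses, weights=None):
    """Weighted sum of per-task losses. Weight values are
    illustrative defaults, not the repository's tuned coefficients."""
    if weights is None:
        weights = {"det": 1.0, "box": 1.0, "sem": 1.0, "inst": 1.0}
    return sum(weights[k] * losses[k] for k in losses)

# Dummy scalar losses standing in for the real per-head outputs.
losses = {
    "det": torch.tensor(0.5),   # focal loss (detection classification)
    "box": torch.tensor(0.2),   # GIoU loss (box regression)
    "sem": torch.tensor(0.8),   # cross-entropy/Dice (semantic)
    "inst": torch.tensor(0.4),  # mask loss (instance)
}
total = trihead_loss(losses)  # backpropagated through all three heads
```

Because a single scalar is backpropagated, gradients from all three heads flow into the shared backbone and neck, which is what makes the multi-task setup train jointly.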

---

### Model Details

- **Developed by:** Sivasubiramaniam Subbiah
- **Model type:** Multi-task vision model
- **Language(s):** Python (PyTorch)
- **License:** MIT
- **Fine-tuned from:** `vit_base_patch16_224` (ViT-Base)
|