---
license: mit
tags:
- computer-vision
- object-detection
- semantic-segmentation
- instance-segmentation
- pascal-voc
- multi-task-learning
library_name: pytorch
pipeline_tag: image-segmentation
datasets:
- Pascal_VOC
---
# Pascal-TriheadNet: Joint Detection & Segmentation
**Single-stage unified perception model for Pascal VOC: Detection, Semantic, and Instance Segmentation in one forward pass.**
Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.
πŸ”— **[View Full Code & Documentation on GitHub](https://github.com/Sivamorgan/Pascal-TriheadNet)**
## πŸš€ Key Highlights
- **Detection mAP@50**: 75.6%
- **Semantic mIoU**: 87.3%
- **Instance Mask mAP@50**: 65.7%
- **Architecture**: One Backbone, One Neck, Three Heads (ViT + FPN)
## πŸ“₯ Model Checkpoints
Two versions of the model are provided:
| File | Description | Size |
| :--- | :--- | :--- |
| **`checkpoint_epoch_50.pth`** | Best-performing FP32 model. | 826 MB |
| **`checkpoint_epoch_50_quantized.pth`** | Optimized INT8 quantized model. | 136 MB |
> **Training Context**: The model was fine-tuned on an **L4 GPU** in Google Colab.
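
For reference, a minimal loading sketch is shown below. The model class name (`PascalTriheadNet` here) and the checkpoint layout are assumptions; check the [GitHub repository](https://github.com/Sivamorgan/Pascal-TriheadNet) for the actual API.

```python
import torch

# Assumed import path; the real class lives in the GitHub repository.
# from pascal_triheadnet.model import PascalTriheadNet

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the FP32 checkpoint; map_location keeps this working on CPU-only machines.
checkpoint = torch.load("checkpoint_epoch_50.pth", map_location=device)

# Layout assumption: the file may be a raw state_dict or a dict that wraps it
# under a key such as "model_state_dict".
state_dict = checkpoint.get("model_state_dict", checkpoint)

# model = PascalTriheadNet(num_classes=21)  # 20 VOC classes + background
# model.load_state_dict(state_dict)
# model.to(device).eval()
```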
## πŸ“Š Performance Metrics
Evaluated on the Pascal VOC 2012 Validation set:
| Task | Metric | Score |
|------|--------|-------|
| **Detection** | mAP (0.5:0.95) | **46.7%** |
| **Detection** | mAP@50 | **75.6%** |
| **Semantic** | mIoU | **87.3%** |
| **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
| **Instance** | Mask mAP@50 | **65.7%** |
*For detailed per-class analysis and ablation studies, please refer to the [GitHub Repository](https://github.com/Sivamorgan/Pascal-TriheadNet).*
## πŸ— Model Overview
The architecture utilizes a **Vision Transformer (ViT-Base)** backbone pretrained on ImageNet.
1. **Backbone**: `vit_base_patch16_224` with the last 6 blocks fine-tuned.
2. **Neck**: A **Simple Feature Pyramid** (ViTDet-style) that creates multi-scale feature maps (P2-P5) from the single-scale ViT output.
3. **Heads**:
* **Detection**: FCOS-style anchor-free detector.
* **Semantic**: Panoptic FPN-style segmentation head.
* **Instance**: Mask R-CNN-style head using RoI Align.
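
To make the layout concrete, here is a schematic PyTorch sketch of the one-backbone / one-neck / three-heads wiring. It is a minimal illustration, not the repository's implementation: the neck and the three heads are stand-in modules, and only the `timm` backbone name (`vit_base_patch16_224`) comes from this card.

```python
import torch
import torch.nn as nn
import timm  # assumed available; the card names vit_base_patch16_224 as the backbone


class TriheadNetSketch(nn.Module):
    """Schematic only: one ViT backbone, one FPN-style neck, three task heads."""

    def __init__(self, num_classes: int = 21):
        super().__init__()
        # Backbone: ViT-Base/16 pretrained on ImageNet (downloads weights on first use),
        # with only the last 6 transformer blocks left trainable, as on this card.
        self.backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        for p in self.backbone.parameters():
            p.requires_grad = False
        for blk in self.backbone.blocks[-6:]:
            for p in blk.parameters():
                p.requires_grad = True

        # Neck: stand-in for a ViTDet-style Simple Feature Pyramid building P2-P5
        # from the single-scale ViT feature map.
        self.neck = nn.ModuleDict({
            "p2": nn.ConvTranspose2d(768, 256, kernel_size=4, stride=4),
            "p3": nn.ConvTranspose2d(768, 256, kernel_size=2, stride=2),
            "p4": nn.Conv2d(768, 256, kernel_size=1),
            "p5": nn.Conv2d(768, 256, kernel_size=2, stride=2),
        })

        # Heads: placeholders for the FCOS-style detector (class + box + centerness),
        # the Panoptic-FPN-style semantic head, and the Mask R-CNN-style instance head.
        self.det_head = nn.Conv2d(256, num_classes + 4 + 1, kernel_size=3, padding=1)
        self.sem_head = nn.Conv2d(256, num_classes, kernel_size=1)
        self.inst_head = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, images: torch.Tensor):
        # ViT yields a single-scale token sequence; drop the CLS token and
        # reshape the patch tokens back into a 2D feature map.
        tokens = self.backbone.forward_features(images)  # (B, 1 + N, 768)
        tokens = tokens[:, 1:, :]
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)                            # 14 x 14 for 224 input
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)

        pyramid = {name: layer(feat) for name, layer in self.neck.items()}

        return {
            "detection": self.det_head(pyramid["p4"]),
            "semantic": self.sem_head(pyramid["p2"]),
            "instance": self.inst_head(pyramid["p2"]),
        }


model = TriheadNetSketch()
out = model(torch.randn(1, 3, 224, 224))
print({k: v.shape for k, v in out.items()})
```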
## βš™οΈ Training Configuration
- **Epochs**: 50
- **Batch Size**: 32
- **Optimizer**: AdamW (Base LR: 2e-4)
- **Loss**: Weighted sum of Focal Loss (Det), Cross-Entropy/Dice (Sem/Inst), and GIoU (Box).
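
As a rough illustration of the loss composition, the sketch below combines the four terms named above with placeholder weights (the card does not state the actual values). `sigmoid_focal_loss` and `generalized_box_iou_loss` from `torchvision.ops` stand in for the repository's own implementations, and the tensor layouts are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

# Placeholder task weights: the card only says "weighted sum"; the real values
# are in the training config in the GitHub repository.
LOSS_WEIGHTS = {"det_cls": 1.0, "det_box": 1.0, "semantic": 1.0, "instance": 1.0}


def dice_loss(logits: torch.Tensor, targets: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss on per-instance binary mask logits."""
    probs = torch.sigmoid(logits).flatten(1)
    targets = targets.flatten(1)
    inter = (probs * targets).sum(dim=1)
    union = probs.sum(dim=1) + targets.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()


def total_loss(outputs: dict, targets: dict) -> torch.Tensor:
    # Detection classification: focal loss on per-location class logits.
    l_det_cls = sigmoid_focal_loss(
        outputs["det_cls_logits"], targets["det_cls_onehot"], reduction="mean"
    )
    # Detection boxes: GIoU loss between predicted and ground-truth boxes (xyxy).
    l_det_box = generalized_box_iou_loss(
        outputs["det_boxes"], targets["det_boxes"], reduction="mean"
    )
    # Semantic segmentation: per-pixel cross-entropy over the 21 VOC classes,
    # ignoring the standard VOC void label (255).
    l_sem = F.cross_entropy(outputs["sem_logits"], targets["sem_labels"], ignore_index=255)
    # Instance masks: Dice loss on RoI-aligned mask logits.
    l_inst = dice_loss(outputs["inst_mask_logits"], targets["inst_masks"])

    return (LOSS_WEIGHTS["det_cls"] * l_det_cls
            + LOSS_WEIGHTS["det_box"] * l_det_box
            + LOSS_WEIGHTS["semantic"] * l_sem
            + LOSS_WEIGHTS["instance"] * l_inst)
```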
---
### Model Details
- **Developed by:** Sivasubiramaniam Subbiah
- **Model type:** Multi-task Vision Model
- **Language(s)/Framework:** Python (PyTorch)
- **License:** MIT
- **Fine-tuned from:** ViT-Base (`vit_base_patch16_224`), ImageNet-pretrained