---
license: mit
tags:
- computer-vision
- object-detection
- semantic-segmentation
- instance-segmentation
- pascal-voc
- multi-task-learning
library_name: pytorch
pipeline_tag: image-segmentation
datasets:
- Pascal_VOC
---

# Pascal-TriheadNet: Joint Detection & Segmentation

**Single-stage unified perception model for Pascal VOC: Detection, Semantic, and Instance Segmentation in one forward pass.**

Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.

🔗 **[View Full Code & Documentation on GitHub](https://github.com/Sivamorgan/Pascal-TriheadNet)**

## 🚀 Key Highlights

- **Detection mAP@50**: 75.6%
- **Semantic mIoU**: 87.3%
- **Instance Mask mAP@50**: 65.7%
- **Architecture**: One Backbone, One Neck, Three Heads (ViT + FPN)

## 📥 Model Checkpoints

Two versions of the model are provided:

| File | Description | Size |
| :--- | :--- | :--- |
| **`checkpoint_epoch_50.pth`** | Best-performing FP32 model. | 826 MB |
| **`checkpoint_epoch_50_quantized.pth`** | Optimized INT8-quantized model. | 136 MB |

> **Training Context**: The model was fine-tuned on an **L4 GPU** in Google Colab.

## 📊 Performance Metrics

Evaluated on the Pascal VOC 2012 validation set:

| Task | Metric | Score |
|------|--------|-------|
| **Detection** | mAP (0.5:0.95) | **46.7%** |
| **Detection** | mAP@50 | **75.6%** |
| **Semantic** | mIoU | **87.3%** |
| **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
| **Instance** | Mask mAP@50 | **65.7%** |

*For detailed per-class analysis and ablation studies, please refer to the [GitHub Repository](https://github.com/Sivamorgan/Pascal-TriheadNet).*

## 🏗 Model Overview

The architecture uses a **Vision Transformer (ViT-Base)** backbone pretrained on ImageNet.

1. **Backbone**: `vit_base_patch16_224` with the last 6 blocks fine-tuned.
2. **Neck**: A **Simple Feature Pyramid** (ViTDet-style) that creates multi-scale feature maps (P2-P5) from the single-scale ViT output.
3. **Heads**:
   * **Detection**: FCOS-style anchor-free detector.
   * **Semantic**: Panoptic FPN-style segmentation head.
   * **Instance**: Mask R-CNN-style head using RoI Align.

## ⚙️ Training Configuration

- **Epochs**: 50
- **Batch Size**: 32
- **Optimizer**: AdamW (Base LR: 2e-4)
- **Loss**: Weighted sum of Focal Loss (Det), Cross-Entropy/Dice (Sem/Inst), and GIoU (Box).

---

### Model Details

- **Developed by:** Sivasubiramaniam Subbiah
- **Model type:** Multi-task Vision Model
- **Language(s):** Python, PyTorch
- **License:** MIT
- **Finetuned from:** Vision Transformer (ViT)
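The weighted multi-task objective from the Training Configuration above can be sketched as a single combining function. This is an illustrative sketch only: the function name and the default weight values are assumptions, not the coefficients actually used in training (those live in the GitHub repository).

```python
# Illustrative sketch of the multi-task objective: a weighted sum of
# Focal (detection), Cross-Entropy/Dice (semantic/instance), and GIoU (box) losses.
# The default weights below are hypothetical placeholders.
def total_loss(l_focal, l_sem, l_inst, l_giou,
               w_det=1.0, w_sem=1.0, w_inst=1.0, w_box=1.0):
    """Combine per-task losses into one scalar for backpropagation."""
    return (w_det * l_focal
            + w_sem * l_sem
            + w_inst * l_inst
            + w_box * l_giou)
```

In a PyTorch training loop, the per-task heads would each return their own scalar loss tensor, and backpropagation would run once on the combined value.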