---
license: mit
tags:
- computer-vision
- object-detection
- semantic-segmentation
- instance-segmentation
- pascal-voc
- multi-task-learning
library_name: pytorch
pipeline_tag: image-segmentation
datasets:
- Pascal_VOC
---

# Pascal-TriheadNet: Joint Detection & Segmentation

**Single-stage unified perception model for Pascal VOC: Detection, Semantic, and Instance Segmentation in one forward pass.**

Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.

🔗 **[View Full Code & Documentation on GitHub](https://github.com/Sivamorgan/Pascal-TriheadNet)**

## 🚀 Key Highlights

- **Detection mAP@50**: 75.6%
- **Semantic mIoU**: 87.3%
- **Instance Mask mAP@50**: 65.7%
- **Architecture**: One Backbone, One Neck, Three Heads (ViT + FPN)

## 📥 Model Checkpoints

Two versions of the model are provided:

| File | Description | Size |
| :--- | :--- | :--- |
| **`checkpoint_epoch_50.pth`** | Best-performing FP32 model. | 826 MB |
| **`checkpoint_epoch_50_quantized.pth`** | Optimized INT8-quantized model. | 136 MB |

> **Training Context**: The model was fine-tuned on an **L4 GPU** in Google Colab.

## 📊 Performance Metrics

Evaluated on the Pascal VOC 2012 validation set:

| Task | Metric | Score |
|------|--------|-------|
| **Detection** | mAP (0.5:0.95) | **46.7%** |
| **Detection** | mAP@50 | **75.6%** |
| **Semantic** | mIoU | **87.3%** |
| **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
| **Instance** | Mask mAP@50 | **65.7%** |

*For detailed per-class analysis and ablation studies, please refer to the [GitHub Repository](https://github.com/Sivamorgan/Pascal-TriheadNet).*

## 🏗 Model Overview

The architecture uses a **Vision Transformer (ViT-Base)** backbone pretrained on ImageNet.

1. **Backbone**: `vit_base_patch16_224` with the last 6 blocks fine-tuned.
2. **Neck**: A **Simple Feature Pyramid** (ViTDet-style) that creates multi-scale feature maps (P2-P5) from the single-scale ViT output.
3. **Heads**:
   * **Detection**: FCOS-style anchor-free detector.
   * **Semantic**: Panoptic FPN-style segmentation head.
   * **Instance**: Mask R-CNN-style head using RoI Align.

## ⚙️ Training Configuration

- **Epochs**: 50
- **Batch Size**: 32
- **Optimizer**: AdamW (Base LR: 2e-4)
- **Loss**: Weighted sum of Focal Loss (Det), Cross-Entropy/Dice (Sem/Inst), and GIoU (Box).

---

### Model Details

- **Developed by:** Sivasubiramaniam Subbiah
- **Model type:** Multi-task Vision Model
- **Language(s):** Python, PyTorch
- **License:** MIT
- **Finetuned from:** Vision Transformer (ViT)
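The weighted multi-task objective from the Training Configuration above can be sketched as a single combining function. This is an illustrative sketch only: the function name and the default weight values are assumptions, not the coefficients actually used in training (those live in the GitHub repository).

```python
# Illustrative sketch of the multi-task objective: a weighted sum of
# Focal (detection), Cross-Entropy/Dice (semantic/instance), and GIoU (box) losses.
# The default weights below are hypothetical placeholders.
def total_loss(l_focal, l_sem, l_inst, l_giou,
               w_det=1.0, w_sem=1.0, w_inst=1.0, w_box=1.0):
    """Combine per-task losses into one scalar for backpropagation."""
    return (w_det * l_focal
            + w_sem * l_sem
            + w_inst * l_inst
            + w_box * l_giou)
```

In a PyTorch training loop, the per-task heads would each return their own scalar loss tensor, and backpropagation would run once on the combined value.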