---
license: mit
tags:
- computer-vision
- object-detection
- semantic-segmentation
- instance-segmentation
- pascal-voc
- multi-task-learning
library_name: pytorch
pipeline_tag: image-segmentation
datasets:
- Pascal_VOC
---

# Pascal-TriheadNet: Joint Detection & Segmentation

**Single-stage unified perception model for Pascal VOC: detection, semantic segmentation, and instance segmentation in one forward pass.**

Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.

**[View Full Code & Documentation on GitHub](https://github.com/Sivamorgan/Pascal-TriheadNet)**

## Key Highlights

- **Detection mAP@50**: 75.6%
- **Semantic mIoU**: 87.3%
- **Instance Mask mAP@50**: 65.7%
- **Architecture**: one backbone, one neck, three heads (ViT + FPN)

## Model Checkpoints

Two versions of the model are provided:

| File | Description | Size |
| :--- | :--- | :--- |
| **`checkpoint_epoch_50.pth`** | Best-performing FP32 model | 826 MB |
| **`checkpoint_epoch_50_quantized.pth`** | Optimized INT8-quantized model | 136 MB |

> **Training context**: the model was fine-tuned on an **L4 GPU** in Google Colab.
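
Loading one of the released `.pth` files would follow the usual PyTorch state-dict pattern. A minimal sketch, using a stand-in module and an assumed `model_state_dict` key (the exact checkpoint format and model class are defined in the GitHub repository):

```python
import torch
import torch.nn as nn

# Stand-in for the real Pascal-TriheadNet class (hypothetical; the actual
# model definition lives in the GitHub repository).
model = nn.Linear(4, 2)

# Round-trip a checkpoint the way the released .pth files would typically
# be consumed: load on CPU first, then move to GPU if available.
torch.save({"model_state_dict": model.state_dict()}, "demo_ckpt.pth")
ckpt = torch.load("demo_ckpt.pth", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"])
model.eval()  # switch to inference mode for evaluation
```

Loading on CPU via `map_location` avoids device-mismatch errors when a checkpoint saved on GPU is opened on a CPU-only machine.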

## Performance Metrics

Evaluated on the Pascal VOC 2012 validation set:

| Task | Metric | Score |
| :--- | :--- | :--- |
| **Detection** | mAP (0.5:0.95) | **46.7%** |
| **Detection** | mAP@50 | **75.6%** |
| **Semantic** | mIoU | **87.3%** |
| **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
| **Instance** | Mask mAP@50 | **65.7%** |

*For detailed per-class analysis and ablation studies, please refer to the [GitHub repository](https://github.com/Sivamorgan/Pascal-TriheadNet).*

## Model Overview

The architecture uses a **Vision Transformer (ViT-Base)** backbone pretrained on ImageNet:

1. **Backbone**: `vit_base_patch16_224` with the last 6 blocks fine-tuned.
2. **Neck**: a **Simple Feature Pyramid** (ViTDet-style) that builds multi-scale feature maps (P2-P5) from the single-scale ViT output.
3. **Heads**:
   * **Detection**: FCOS-style anchor-free detector.
   * **Semantic**: Panoptic-FPN-style segmentation head.
   * **Instance**: Mask R-CNN-style head using RoIAlign.
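
The neck above can be sketched in a few lines. In the ViTDet-style design, the stride-16 ViT feature map is resampled up and down to produce the P2-P5 levels; the channel counts and layer choices below are illustrative assumptions, not the repository's exact implementation:

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """ViTDet-style neck sketch: build P2-P5 (strides 4, 8, 16, 32)
    from the single-scale, stride-16 ViT feature map."""

    def __init__(self, in_ch=768, out_ch=256):
        super().__init__()
        self.p2 = nn.Sequential(  # stride 16 -> 4 (upsample x4)
            nn.ConvTranspose2d(in_ch, in_ch // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(in_ch // 2, out_ch, 2, stride=2),
        )
        self.p3 = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)  # stride 8
        self.p4 = nn.Conv2d(in_ch, out_ch, 1)                     # stride 16
        self.p5 = nn.Sequential(                                  # stride 32
            nn.MaxPool2d(2), nn.Conv2d(in_ch, out_ch, 1)
        )

    def forward(self, x):
        return {"p2": self.p2(x), "p3": self.p3(x),
                "p4": self.p4(x), "p5": self.p5(x)}

# A 224x224 input with patch size 16 yields a 14x14 token grid.
feats = SimpleFeaturePyramid()(torch.randn(1, 768, 14, 14))
```

Each task head then consumes the pyramid levels it needs (e.g. the detector reads P3-P5, the segmentation heads lean on the higher-resolution P2).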

## Training Configuration

- **Epochs**: 50
- **Batch size**: 32
- **Optimizer**: AdamW (base LR: 2e-4)
- **Loss**: weighted sum of focal loss (detection), cross-entropy/Dice (semantic/instance), and GIoU (box regression).
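
The combined objective amounts to a weighted sum over per-task losses. A minimal sketch with illustrative weights and dummy loss values (the repository's tuned coefficients and loss implementations may differ):

```python
import torch

def trihead_loss(losses, weights=None):
    """Weighted sum of per-task losses. Weight values are
    illustrative defaults, not the repository's tuned coefficients."""
    if weights is None:
        weights = {"det": 1.0, "box": 1.0, "sem": 1.0, "inst": 1.0}
    return sum(weights[k] * losses[k] for k in losses)

# Dummy scalar losses standing in for the real per-head outputs.
losses = {
    "det": torch.tensor(0.5),   # focal loss (detection classification)
    "box": torch.tensor(0.2),   # GIoU loss (box regression)
    "sem": torch.tensor(0.8),   # cross-entropy/Dice (semantic)
    "inst": torch.tensor(0.4),  # mask loss (instance)
}
total = trihead_loss(losses)  # backpropagated through all three heads
```

Because a single scalar is backpropagated, gradients from all three heads flow into the shared backbone and neck, which is what makes the multi-task setup train jointly.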

---

### Model Details

- **Developed by:** Sivasubiramaniam Subbiah
- **Model type:** Multi-task vision model
- **Language(s):** Python (PyTorch)
- **License:** MIT
- **Fine-tuned from:** `vit_base_patch16_224` (ViT-Base)
|