Update README.md

README.md (CHANGED)

@@ -4,312 +4,77 @@ tags:
- computer-vision
- object-detection
- semantic-segmentation
- pascal-voc
- vision-transformer
- multi-task-learning
library_name: pytorch
metrics:
- mean_average_precision
- intersection_over_union
---

# Pascal-TriheadNet: Joint Detection & Segmentation

**Single-stage unified perception model for Pascal VOC: Detection, Semantic, and Instance Segmentation in one forward pass.**

Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads.

## 🚀 Key Highlights

- **Detection mAP@50**: 75.6%
- **Semantic mIoU**: 87.3%
- **Instance Mask mAP@50**: 65.7%
- **Architecture**: One Backbone, One Neck, Three Heads
- **Efficient**: Single forward pass for all three tasks

## Model Overview

### 1. Backbone: Vision Transformer

- **Model**: `vit_base_patch16_224` (pretrained on ImageNet)
- **Input Resolution**: 224×224 RGB
- **Output**: Single-scale feature map at 1/16 resolution
- **Fine-tuning**: Last 6 transformer blocks unfrozen

**Architecture Details:**

- Patch size: 16×16
- Hidden dimension: 768
- Attention heads: 12
- Transformer blocks: 12 (last 6 trainable)
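
A minimal sketch of this setup with `timm` (an assumption; the repository's actual backbone code may differ):

```python
import timm
import torch

# ViT-B/16 pretrained on ImageNet: 12 blocks, 768-dim, 12 heads.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True)

# Freeze everything, then unfreeze the last 6 transformer blocks.
for p in backbone.parameters():
    p.requires_grad = False
for block in backbone.blocks[-6:]:
    for p in block.parameters():
        p.requires_grad = True

# Tokens -> single-scale 1/16 feature map (224 / 16 = 14).
x = torch.randn(1, 3, 224, 224)
tokens = backbone.forward_features(x)  # (1, 197, 768) in recent timm versions
fmap = tokens[:, 1:, :].transpose(1, 2).reshape(1, 768, 14, 14)
```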

### 2. Neck: Simple Feature Pyramid (ViTDet-style)

Unlike a traditional FPN with top-down pathways, we use a **parallel multi-scale** approach optimized for Vision Transformers:

- **P2 (1/4)**: 4× Bilinear Upsample + Conv
- **P3 (1/8)**: 2× Bilinear Upsample + Conv
- **P4 (1/16)**: Conv (base scale)
- **P5 (1/32)**: 2× Stride Conv

All pyramid levels have **256 channels** for consistency.
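
A sketch of this neck under the stated shapes (768-channel 1/16 input, 256-channel outputs); the kernel sizes are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeaturePyramid(nn.Module):
    """Builds P2-P5 in parallel from the single 1/16-scale ViT feature map."""

    def __init__(self, in_ch=768, out_ch=256):
        super().__init__()
        self.p2 = nn.Conv2d(in_ch, out_ch, 3, padding=1)            # after 4x upsample
        self.p3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)            # after 2x upsample
        self.p4 = nn.Conv2d(in_ch, out_ch, 3, padding=1)            # base scale
        self.p5 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)  # 2x stride conv

    def forward(self, x):  # x: (N, 768, 14, 14) for a 224x224 input
        up4 = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        up2 = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return {"p2": self.p2(up4), "p3": self.p3(up2),
                "p4": self.p4(x), "p5": self.p5(x)}
```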

### 3. Task Heads

#### A. Detection Head (FCOS-style)

**Type**: Anchor-free, fully convolutional one-stage detector

**Outputs per FPN level (P2-P5)**:

1. **Classification**: (N, 20, H, W) - class logits
2. **Box Regression**: (N, 4, H, W) - LTRB offsets
3. **Centerness**: (N, 1, H, W) - quality score

**Loss Components**:

- Classification: Focal Loss
- Regression: GIoU Loss
- Centerness: Binary Cross-Entropy
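
A compact sketch of such a head; sharing one conv tower across all pyramid levels is an assumption:

```python
import torch.nn as nn

class FCOSHead(nn.Module):
    """Per-location class logits, LTRB box offsets, and centerness scores."""

    def __init__(self, in_ch=256, num_classes=20):
        super().__init__()
        self.tower = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.cls_logits = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # (N, 20, H, W)
        self.bbox_pred = nn.Conv2d(in_ch, 4, 3, padding=1)             # (N, 4, H, W)
        self.centerness = nn.Conv2d(in_ch, 1, 3, padding=1)            # (N, 1, H, W)

    def forward(self, feats):  # feats: dict of P2-P5, head shared across levels
        outs = {}
        for name, f in feats.items():
            t = self.tower(f)
            outs[name] = (self.cls_logits(t), self.bbox_pred(t), self.centerness(t))
        return outs
```

For the losses, `torchvision.ops` already provides `sigmoid_focal_loss` and `generalized_box_iou_loss`, so only the centerness term needs a plain `binary_cross_entropy_with_logits`.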

#### B. Semantic Segmentation Head (Panoptic FPN-style)

**Architecture**:

- Merges all FPN levels (P2-P5) via recursive upsampling
- P5 (1/32) provides global context
- Upsamples and fuses to match P2 (1/4)
- Final 4× upsample to native 224×224

**Output**: (N, 21, 224, 224) - 20 object classes + background

**Loss**:

- Cross-Entropy Loss
- Dice Loss
- Boundary-weighted loss (2.0× weight on boundaries)
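
A sketch of the fusion path; sum-fusion and the per-level conv are assumptions in the spirit of Panoptic FPN:

```python
import torch.nn as nn
import torch.nn.functional as F

class SemanticHead(nn.Module):
    """Fuses P2-P5 at 1/4 scale, then upsamples 4x to full resolution."""

    def __init__(self, in_ch=256, num_classes=21):
        super().__init__()
        self.convs = nn.ModuleDict(
            {k: nn.Conv2d(in_ch, in_ch, 3, padding=1) for k in ("p2", "p3", "p4", "p5")}
        )
        self.classifier = nn.Conv2d(in_ch, num_classes, 1)

    def forward(self, feats):
        size = feats["p2"].shape[-2:]  # 1/4-resolution target
        fused = sum(
            F.interpolate(self.convs[k](feats[k]), size=size,
                          mode="bilinear", align_corners=False)
            for k in ("p2", "p3", "p4", "p5")
        )
        logits = self.classifier(fused)  # (N, 21, 56, 56) for a 224x224 input
        return F.interpolate(logits, scale_factor=4,
                             mode="bilinear", align_corners=False)  # (N, 21, 224, 224)
```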

#### C. Instance Segmentation Head (Mask R-CNN-style)

**Pipeline**:

1. **Training**: Uses ground truth boxes
2. **Inference**: Uses predicted boxes from the detection head
3. **RoI Align**: Extracts 14×14 features per box from the appropriate FPN level (P2-P5) based on box scale
4. **Mask FCN**: Predicts 28×28 binary masks
5. **Post-processing**: Pastes masks into the full image based on box coordinates

**Loss**:

- Binary Cross-Entropy Loss
- Dice Loss
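
A sketch of steps 3-4 using `torchvision.ops.roi_align`; the mask FCN layers are illustrative, and per-box FPN-level assignment is omitted for brevity:

```python
import torch.nn as nn
from torchvision.ops import roi_align

class MaskHead(nn.Module):
    """RoI-aligns 14x14 features per box and predicts a 28x28 mask."""

    def __init__(self, in_ch=256):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(in_ch, in_ch, 2, stride=2), nn.ReLU(inplace=True),  # 14 -> 28
            nn.Conv2d(in_ch, 1, 1),  # per-RoI mask logits
        )

    def forward(self, fmap, boxes, stride):
        # boxes: list of (K_i, 4) xyxy tensors in image coordinates, one per image.
        rois = roi_align(fmap, boxes, output_size=(14, 14),
                         spatial_scale=1.0 / stride, aligned=True)
        return self.fcn(rois)  # (sum K_i, 1, 28, 28)
```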

## Training Configuration

The total objective is a weighted sum of the three task losses:

- Detection (λ_det): 1.0
- Semantic (λ_sem): 1.0
- Instance (λ_inst): 1.0
- Boundary weight: 2.0 (for semantic edges)
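
In code this weighting is a single line; the function below is a sketch using the document's λ values as defaults:

```python
def total_loss(l_det, l_sem, l_inst, lam_det=1.0, lam_sem=1.0, lam_inst=1.0):
    """Weighted multi-task objective; the 2.0x boundary weight lives inside l_sem."""
    return lam_det * l_det + lam_sem * l_sem + lam_inst * l_inst
```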

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| **Epochs** | 50 |
| **Batch Size** | 32 |
| **Base Learning Rate** | 2e-4 |
| **Backbone LR Multiplier** | 0.01 (2e-6 for ViT) |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **Warmup Epochs** | 5 |
| **LR Schedule** | Cosine Annealing (after warmup) |
| **Precision** | Mixed (FP16) |
| **Segmentation Ratio** | 0.15 (15% of batch has masks) |
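
A sketch of the optimizer and schedule these values imply; the parameter-group split and the `model.backbone` attribute name are assumptions:

```python
import torch

base_lr = 2e-4
param_groups = [
    # ViT backbone trains 100x slower than the neck and heads.
    {"params": [p for p in model.backbone.parameters() if p.requires_grad],
     "lr": base_lr * 0.01},
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("backbone") and p.requires_grad],
     "lr": base_lr},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)

# 5 warmup epochs, then cosine annealing for the remaining 45.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=45)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, [warmup, cosine], milestones=[5])
```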

## Performance Metrics

### Quantitative Results (Validation Set)

| Task | Metric | Score |
|------|--------|-------|
| **Detection** | mAP (0.5:0.95) | **46.7%** |
| **Detection** | mAP@50 | **75.6%** |
| **Detection** | mAP@75 | **49.5%** |
| **Semantic** | mIoU | **87.3%** |
| **Semantic** | Pixel Accuracy | **96.4%** |
| **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
| **Instance** | Mask mAP@50 | **65.7%** |

### Per-Class Results

| Class | Detection AP | Instance Mask AP |
|-------|:---------------:|:-------------------:|
| Aeroplane | 55.4% | 38.6% |
| Bicycle | 51.0% | 0.02% |
| Bird | 47.1% | 44.1% |
| Boat | 37.0% | 27.0% |
| Bottle | 25.6% | 27.8% |
| Bus | 62.0% | 56.1% |
| Car | 37.4% | 30.3% |
| Cat | 67.4% | 66.1% |
| Chair | 25.9% | 5.5% |
| Cow | 48.3% | 38.5% |
| Dining Table | 42.5% | 29.7% |
| Dog | 64.3% | 60.6% |
| Horse | 58.2% | 33.1% |
| Motorbike | 53.3% | 34.5% |
| Person | 40.9% | 25.6% |
| Potted Plant | 23.2% | 13.5% |
| Sheep | 43.1% | 33.2% |
| Sofa | 41.7% | 43.1% |
| Train | 61.0% | 60.9% |
| TV Monitor | 48.2% | 48.5% |

**Observations:**

- Strong performance on animals (cat, dog) and vehicles (bus, train)
- Challenging classes: bicycle (instance), chair, bottle, potted plant
- Person detection is competitive, but instance segmentation has room for improvement

### Model Details

- **Developed by:** Sivasubiramaniam Subbiah
- **Model type:** Multi-task Vision Model
- **Language(s):** Python, PyTorch
- **License:** MIT
- **Finetuned from:** Vision Transformer (ViT)

### Model Sources [optional]

- **Repository:** [https://github.com/Sivamorgan/Pascal-TriheadNet](https://github.com/Sivamorgan/Pascal-TriheadNet)
- **Demo [optional]:** [More Information Needed]

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]

**Updated README:**

tags:
- computer-vision
- object-detection
- semantic-segmentation
- instance-segmentation
- pascal-voc
- multi-task-learning
library_name: pytorch
pipeline_tag: image-segmentation
datasets: Pascal_VOC
---

# Pascal-TriheadNet: Joint Detection & Segmentation

**Single-stage unified perception model for Pascal VOC: Detection, Semantic, and Instance Segmentation in one forward pass.**

Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.

🔗 **[View Full Code & Documentation on GitHub](https://github.com/Sivamorgan/Pascal-TriheadNet)**

## 🚀 Key Highlights

- **Detection mAP@50**: 75.6%
- **Semantic mIoU**: 87.3%
- **Instance Mask mAP@50**: 65.7%
- **Architecture**: One Backbone, One Neck, Three Heads (ViT + FPN)

## 📥 Model Checkpoints

Two versions of the model are provided:

| File | Description | Size |
| :--- | :--- | :--- |
| **`checkpoint_epoch_50.pth`** | Best-performing FP32 model. | 826MB |
| **`checkpoint_epoch_50_quantized.pth`** | Optimized INT8 quantized model. | 136MB |

> **Training Context**: The model was fine-tuned on an **L4 GPU** in Google Colab.
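
Loading the FP32 checkpoint might look like the sketch below; the state-dict layout is an assumption, so adjust the key handling to the actual file:

```python
import torch

ckpt = torch.load("checkpoint_epoch_50.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # bare state_dict, or wrapped under "model"
model.load_state_dict(state_dict)     # model: the assembled TriheadNet
model.eval()
```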

## 📊 Performance Metrics

Evaluated on the Pascal VOC 2012 validation set:

| Task | Metric | Score |
|------|--------|-------|
| **Detection** | mAP (0.5:0.95) | **46.7%** |
| **Detection** | mAP@50 | **75.6%** |
| **Semantic** | mIoU | **87.3%** |
| **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
| **Instance** | Mask mAP@50 | **65.7%** |

*For detailed per-class analysis and ablation studies, please refer to the [GitHub Repository](https://github.com/Sivamorgan/Pascal-TriheadNet).*

## 🏗 Model Overview

The architecture utilizes a **Vision Transformer (ViT-Base)** backbone pretrained on ImageNet.

1. **Backbone**: `vit_base_patch16_224` with the last 6 blocks fine-tuned.
2. **Neck**: A **Simple Feature Pyramid** (ViTDet-style) that creates multi-scale feature maps (P2-P5) from the single-scale ViT output.
3. **Heads**:
   * **Detection**: FCOS-style anchor-free detector.
   * **Semantic**: Panoptic FPN-style segmentation head.
   * **Instance**: Mask R-CNN-style head using RoI Align (a wiring sketch follows below).
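
How the pieces compose in one forward pass, as a hypothetical sketch; `TriheadNet` and its attribute names are illustrative, not the repository's API:

```python
import torch
import torch.nn as nn

class TriheadNet(nn.Module):
    def __init__(self, backbone, neck, det_head, sem_head, mask_head):
        super().__init__()
        self.backbone, self.neck = backbone, neck
        self.det_head, self.sem_head, self.mask_head = det_head, sem_head, mask_head

    def forward(self, images, boxes=None):
        tokens = self.backbone.forward_features(images)  # (N, 197, 768)
        fmap = tokens[:, 1:, :].transpose(1, 2).reshape(
            images.size(0), 768, 14, 14)                 # 1/16 scale
        feats = self.neck(fmap)                          # P2-P5, 256 channels each
        det_out = self.det_head(feats)
        sem_out = self.sem_head(feats)
        # GT boxes during training, predicted boxes at inference.
        inst_out = self.mask_head(feats["p2"], boxes, stride=4) if boxes else None
        return det_out, sem_out, inst_out
```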

## ⚙️ Training Configuration

- **Epochs**: 50
- **Batch Size**: 32
- **Optimizer**: AdamW (Base LR: 2e-4)
- **Loss**: Weighted sum of Focal Loss (Det), Cross-Entropy/Dice (Sem/Inst), and GIoU (Box).

---

### Model Details

- **Developed by:** Sivasubiramaniam Subbiah
- **Model type:** Multi-task Vision Model
- **Language(s):** Python, PyTorch
- **License:** MIT
- **Finetuned from:** Vision Transformer (ViT)