Sivamorgan committed on commit 211153b (verified) · 1 parent: 0c29ede

Update README.md

Files changed (1):
  1. README.md +37 -272

README.md CHANGED
@@ -4,312 +4,77 @@ tags:
  - computer-vision
  - object-detection
  - semantic-segmentation
  - pascal-voc
- - vision-transformer
  - multi-task-learning
  library_name: pytorch
- datasets:
- - detection-datasets/pascal_voc
- metrics:
- - mean_average_precision
- - intersection_over_union
  ---
  # Pascal-TriheadNet: Joint Detection & Segmentation

  **Single-stage unified perception model for Pascal VOC: Detection, Semantic, and Instance Segmentation in one forward pass.**

- Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. The architecture achieves strong performance across all three tasks while maintaining efficient inference through shared feature computation.

  ## 🚀 Key Highlights

- - **Detection mAP@50**: 75.6% (Pascal VOC standard)
  - **Semantic mIoU**: 87.3%
  - **Instance Mask mAP@50**: 65.7%
- - **Architecture**: One Backbone, One Neck, Three Heads
- - **Efficient**: Single forward pass for all three tasks
-
- ## Model Overview
- ### 1. Backbone: Vision Transformer
-
- - **Model**: `vit_base_patch16_224` (pretrained on ImageNet)
- - **Input Resolution**: 224×224 RGB
- - **Output**: Single-scale feature map at 1/16 resolution
- - **Fine-tuning**: Last 6 transformer blocks unfrozen
-
- **Architecture Details:**
- - Patch size: 16×16
- - Hidden dimension: 768
- - Attention heads: 12
- - Transformer blocks: 12 (last 6 trainable)
-
- ### 2. Neck: Simple Feature Pyramid (ViTDet-style)
-
- Unlike traditional FPN with top-down pathways, we use a **parallel multi-scale** approach optimized for Vision Transformers:
-
- - **P2 (1/4)**: 4× Bilinear Upsample + Conv
- - **P3 (1/8)**: 2× Bilinear Upsample + Conv
- - **P4 (1/16)**: Conv (base scale)
- - **P5 (1/32)**: 2× Stride Conv
-
- All pyramid levels have **256 channels** for consistency.
-
- ### 3. Task Heads
-
- #### A. Detection Head (FCOS-style)
-
- **Type**: Anchor-free, fully convolutional one-stage detector
-
- **Outputs per FPN level (P2-P5)**:
- 1. **Classification**: (N, 20, H, W) - class logits
- 2. **Box Regression**: (N, 4, H, W) - LTRB offsets
- 3. **Centerness**: (N, 1, H, W) - quality score
-
- **Loss Components**:
- - Classification: Focal Loss
- - Regression: GIoU Loss
- - Centerness: Binary Cross-Entropy
-
- #### B. Semantic Segmentation Head (Panoptic FPN-style)
-
- **Architecture**:
- - Merges all FPN levels (P2-P5) via recursive upsampling
- - P5 (1/32) provides global context
- - Upsamples and fuses to match P2 (1/4)
- - Final 4× upsample to native 224×224
-
- **Output**: (N, 21, 224, 224) - 20 object classes + background
-
- **Loss**:
- - Cross-Entropy Loss
- - Dice Loss
- - Boundary-weighted loss (2.0× weight on boundaries)
-
- #### C. Instance Segmentation Head (Mask R-CNN-style)
-
- **Pipeline**:
- 1. **Training**: Uses ground truth boxes
- 2. **Inference**: Uses predicted boxes from detection head
- 3. **RoI Align**: Extracts 14×14 features per box from appropriate FPN level (P2-P5) based on box scale
- 4. **Mask FCN**: Predicts 28×28 binary masks
- 5. **Post-processing**: Pastes masks into full image based on box coordinates
-
- **Loss**:
- - Binary Cross-Entropy Loss
- - Dice Loss
-

- ## Training Details

- ### Loss Function

- Total loss is a weighted combination:

- $$L_{total} = \lambda_{det} L_{det} + \lambda_{sem} L_{sem} + \lambda_{inst} L_{inst}$$

- **Weights**:
- - Detection (λ_det): 1.0
- - Semantic (λ_sem): 1.0
- - Instance (λ_inst): 1.0
- - Boundary weight: 2.0 (for semantic edges)
-
- ### Hyperparameters

- | Parameter | Value |
- |-----------|-------|
- | **Epochs** | 50 |
- | **Batch Size** | 32 |
- | **Base Learning Rate** | 2e-4 |
- | **Backbone LR Multiplier** | 0.01 (2e-6 for ViT) |
- | **Optimizer** | AdamW |
- | **Weight Decay** | 0.01 |
- | **Warmup Epochs** | 5 |
- | **LR Schedule** | Cosine Annealing (after warmup) |
- | **Precision** | Mixed (FP16) |
- | **Segmentation Ratio** | 0.15 (15% of batch has masks) |
-
- ## Performance Metrics
-
- ### Quantitative Results (Validation Set)

  | Task | Metric | Score |
  |------|--------|-------|
  | **Detection** | mAP (0.5:0.95) | **46.7%** |
  | **Detection** | mAP@50 | **75.6%** |
- | **Detection** | mAP@75 | **49.5%** |
  | **Semantic** | mIoU | **87.3%** |
- | **Semantic** | Pixel Accuracy | **96.4%** |
  | **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
  | **Instance** | Mask mAP@50 | **65.7%** |

- ### Per-Class Performance

- | Class | Detection AP@50 | Instance Mask AP@50 |
- |-------|:---------------:|:-------------------:|
- | Aeroplane | 55.4% | 38.6% |
- | Bicycle | 51.0% | 0.02% |
- | Bird | 47.1% | 44.1% |
- | Boat | 37.0% | 27.0% |
- | Bottle | 25.6% | 27.8% |
- | Bus | 62.0% | 56.1% |
- | Car | 37.4% | 30.3% |
- | Cat | 67.4% | 66.1% |
- | Chair | 25.9% | 5.5% |
- | Cow | 48.3% | 38.5% |
- | Dining Table | 42.5% | 29.7% |
- | Dog | 64.3% | 60.6% |
- | Horse | 58.2% | 33.1% |
- | Motorbike | 53.3% | 34.5% |
- | Person | 40.9% | 25.6% |
- | Potted Plant | 23.2% | 13.5% |
- | Sheep | 43.1% | 33.2% |
- | Sofa | 41.7% | 43.1% |
- | Train | 61.0% | 60.9% |
- | TV Monitor | 48.2% | 48.5% |

- **Observations**:
- - Strong performance on animals (cat, dog) and vehicles (bus, train)
- - Challenging classes: bicycle (instance), chair, bottle, potted plant
- - Person detection competitive but instance segmentation room for improvement

- ## Ablation Study: Fine-Tuning Depth

- We compared unfreezing different numbers of ViT backbone layers:

- | Configuration | Detection mAP@50 | Semantic mIoU | Instance mAP@50 |
- |---------------|:----------------:|:-------------:|:---------------:|
- | **4 Layers** | 71.2% | 85.1% | 62.3% |
- | **6 Layers** | **75.6%** | **87.3%** | **65.7%** |

- **Conclusion**: Unfreezing 6 layers allows better adaptation to Pascal VOC geometry and semantics, yielding +4.4% detection improvement, +2.2% semantic improvement, and +3.4% instance improvement.

  - **Developed by:** Sivasubiramaniam Subbiah
- - **Model type:** Multi-task Vision Model (Detection + Semantic + Instance Segmentation)
- - **Language(s):** Python,Pytorch
  - **License:** MIT
- - **Finetuned from model [optional]:** Vision Transformer (ViT)
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** (https://github.com/Sivamorgan/Pascal-TriheadNet)
- - **Demo [optional]:** [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]

  - computer-vision
  - object-detection
  - semantic-segmentation
+ - instance-segmentation
  - pascal-voc
  - multi-task-learning
  library_name: pytorch
+ pipeline_tag: image-segmentation
+ datasets:
+ - Pascal_VOC
  ---
+
  # Pascal-TriheadNet: Joint Detection & Segmentation

  **Single-stage unified perception model for Pascal VOC: Detection, Semantic, and Instance Segmentation in one forward pass.**

+ Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.
+
+ 🔗 **[View Full Code & Documentation on GitHub](https://github.com/Sivamorgan/Pascal-TriheadNet)**

  ## 🚀 Key Highlights

+ - **Detection mAP@50**: 75.6%
  - **Semantic mIoU**: 87.3%
  - **Instance Mask mAP@50**: 65.7%
+ - **Architecture**: One Backbone, One Neck, Three Heads (ViT + FPN)

+ ## 📥 Model Checkpoints

+ Two versions of the model are provided:

+ | File | Description | Size |
+ | :--- | :--- | :--- |
+ | **`checkpoint_epoch_50.pth`** | Best-performing FP32 model. | 826 MB |
+ | **`checkpoint_epoch_50_quantized.pth`** | Optimized INT8-quantized model. | 136 MB |

+ > **Training Context**: The model was fine-tuned on an **L4 GPU** in Google Colab.
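The roughly 6× size gap between the two checkpoints is what INT8 quantization of FP32 weights typically yields. A minimal sketch of how such a reduction arises, using PyTorch post-training dynamic quantization on a toy model (the card does not state the exact recipe behind `checkpoint_epoch_50_quantized.pth`, so this is illustrative only):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in model; the real checkpoints hold the full TriheadNet weights.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 21))

# Post-training dynamic quantization: Linear weights are stored as INT8.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def saved_size(m: nn.Module) -> int:
    """Serialize the state dict to disk and return its size in bytes."""
    with tempfile.NamedTemporaryFile(suffix=".pth", delete=False) as f:
        path = f.name
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path)
    os.remove(path)
    return size

fp32_bytes, int8_bytes = saved_size(model), saved_size(qmodel)
print(f"fp32: {fp32_bytes} B, int8: {int8_bytes} B")  # int8 file is markedly smaller
```

The same idea scaled to the full model accounts for most of the 826 MB → 136 MB difference.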
+ ## 📊 Performance Metrics

+ Evaluated on the Pascal VOC 2012 Validation set:

  | Task | Metric | Score |
  |------|--------|-------|
  | **Detection** | mAP (0.5:0.95) | **46.7%** |
  | **Detection** | mAP@50 | **75.6%** |
  | **Semantic** | mIoU | **87.3%** |
  | **Instance** | Mask mAP (0.5:0.95) | **35.8%** |
  | **Instance** | Mask mAP@50 | **65.7%** |

+ *For detailed per-class analysis and ablation studies, please refer to the [GitHub Repository](https://github.com/Sivamorgan/Pascal-TriheadNet).*

+ ## 🏗 Model Overview

+ The architecture utilizes a **Vision Transformer (ViT-Base)** backbone pretrained on ImageNet.

+ 1. **Backbone**: `vit_base_patch16_224` with the last 6 blocks fine-tuned.
+ 2. **Neck**: A **Simple Feature Pyramid** (ViTDet-style) that creates multi-scale feature maps (P2-P5) from the single-scale ViT output.
+ 3. **Heads**:
+    * **Detection**: FCOS-style anchor-free detector.
+    * **Semantic**: Panoptic FPN-style segmentation head.
+    * **Instance**: Mask R-CNN-style head using RoI Align.
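The neck step above can be sketched in a few lines: a ViTDet-style Simple Feature Pyramid builds P2-P5 in parallel from the single 1/16-scale ViT feature map, each level with 256 channels. The layer choices here are illustrative, not the exact Pascal-TriheadNet implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeaturePyramid(nn.Module):
    """Parallel multi-scale neck over a single-scale ViT feature map (sketch)."""

    def __init__(self, in_ch: int = 768, out_ch: int = 256):
        super().__init__()
        self.p2 = nn.Conv2d(in_ch, out_ch, 3, padding=1)  # applied after 4x upsample -> 1/4
        self.p3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)  # applied after 2x upsample -> 1/8
        self.p4 = nn.Conv2d(in_ch, out_ch, 3, padding=1)  # base 1/16 scale
        self.p5 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)  # stride conv -> 1/32

    def forward(self, x: torch.Tensor) -> dict:
        # x: (N, 768, 14, 14) for a 224x224 input with 16x16 patches
        up4 = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        up2 = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return {"P2": self.p2(up4), "P3": self.p3(up2), "P4": self.p4(x), "P5": self.p5(x)}

feats = SimpleFeaturePyramid()(torch.randn(1, 768, 14, 14))
print({k: tuple(v.shape) for k, v in feats.items()})
# P2 (1,256,56,56), P3 (1,256,28,28), P4 (1,256,14,14), P5 (1,256,7,7)
```

All three heads then consume these shared P2-P5 maps, which is what makes the single forward pass possible.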
+ ## ⚙️ Training Configuration

+ - **Epochs**: 50
+ - **Batch Size**: 32
+ - **Optimizer**: AdamW (Base LR: 2e-4)
+ - **Loss**: Weighted sum of Focal Loss (Det), Cross-Entropy/Dice (Sem/Inst), and GIoU (Box).
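Two of the loss pieces named above can be sketched directly; the focal/dice hyperparameters below (alpha, gamma, smoothing) are common defaults, not values taken from the card, and the weighted sum mirrors the card's equal per-task weights:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    """Sigmoid focal loss (sketch); down-weights easy examples via (1 - p_t)^gamma."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps: float = 1.0):
    """Soft dice loss (sketch): 1 - 2|P∩T| / (|P| + |T|), per sample."""
    p, t = torch.sigmoid(logits).flatten(1), targets.flatten(1)
    inter = (p * t).sum(1)
    return (1 - (2 * inter + eps) / (p.sum(1) + t.sum(1) + eps)).mean()

# Dummy predictions/targets just to exercise the functions.
logits = torch.randn(2, 21, 8, 8)
targets = torch.randint(0, 2, (2, 21, 8, 8)).float()
l_det, l_sem = focal_loss(logits, targets), dice_loss(logits, targets)

# Weighted multi-task total; all task weights are 1.0 per the card.
total = 1.0 * l_det + 1.0 * l_sem
print(float(total))
```

In the actual model the sum also includes the GIoU box term and instance-mask losses, each with weight 1.0.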
+ ---

+ ### Model Details
+
  - **Developed by:** Sivasubiramaniam Subbiah
+ - **Model type:** Multi-task Vision Model
+ - **Language(s):** Python, PyTorch
  - **License:** MIT
+ - **Finetuned from:** Vision Transformer (ViT)