---
license: mit
language:
  - en
tags:
  - clipseg
  - segmentation
  - construction
  - drywall
  - quality-assurance
  - text-conditioned
  - binary-mask
library_name: transformers
base_model: CIDAS/clipseg-rd64-refined
pipeline_tag: image-segmentation
datasets:
  - roboflow/drywall-join-detect
  - roboflow/cracks-3ii36
metrics:
  - iou
  - dice
---

<p align="center">
  <h1 align="center">Prompted Segmentation for Drywall QA</h1>
  <p align="center">
    Text-conditioned binary mask prediction for construction defect detection
  </p>
</p>

![Python 3.11](https://img.shields.io/badge/python-3.11-3776AB?logo=python&logoColor=white) ![PyTorch](https://img.shields.io/badge/PyTorch-2.11-EE4C2C?logo=pytorch&logoColor=white) ![HuggingFace](https://img.shields.io/badge/HuggingFace-transformers-FFD21E?logo=huggingface&logoColor=black) ![CLIPSeg](https://img.shields.io/badge/model-CLIPSeg-blue) ![Typst](https://img.shields.io/badge/report-Typst_PDF-239DAD?logo=typst&logoColor=white) ![uv](https://img.shields.io/badge/package_manager-uv-DE5FE9?logo=uv&logoColor=white) ![MIT](https://img.shields.io/badge/license-MIT-green)

<p align="center">
  <a href="#1-methodology">Methodology</a> &bull;
  <a href="#2-data-preparation">Data Preparation</a> &bull;
  <a href="#3-results">Results</a> &bull;
  <a href="#4-failure-cases--potential-solutions">Failure Cases</a> &bull;
  <a href="#quick-start">Quick Start</a> &bull;
  Full Report (PDF)
</p>

---

Feed a construction photo and a text prompt. Get a binary segmentation mask back.

Two tasks – **crack detection** and **drywall taping/joint detection** – both driven by natural language at inference time. Change the prompt, change what gets segmented. No class heads, no retraining.

```
Input:  image.jpg  +  "segment wall crack"
Output: image__segment_wall_crack.png   (binary mask, {0, 255})
```
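
The mask filename above is the image stem joined to the prompt with spaces replaced by underscores. A sketch of that naming rule as implied by the example (the exact implementation in `src/predict.py` may differ):

```python
def mask_filename(image_path, prompt):
    """Derive '<image stem>__<prompt with underscores>.png' (assumed rule)."""
    stem = image_path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"{stem}__{prompt.replace(' ', '_')}.png"
```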

---

## 1. Methodology

### Model: CLIPSeg

We fine-tune [**CLIPSeg**](https://arxiv.org/abs/2112.10003) (Lüddecke & Ecker, CVPR 2022) – a text-conditioned segmentation model built on CLIP. The entire CLIP backbone (149.6M params) stays **frozen**. Only a lightweight 3-block transformer decoder with U-Net skip connections (**1.13M params**) is trained.

![CLIPSeg architecture: frozen CLIP backbone + trainable decoder](reports/diagrams/architecture.png)

The model takes an RGB image and a text prompt. The CLIP vision encoder (ViT-B/16) and text encoder independently produce embeddings. The decoder fuses these via cross-attention and generates logits at 352x352, which are thresholded at 0.5 to produce binary masks.

<details>
<summary><b>Why CLIPSeg over Grounded SAM, SEEM, X-Decoder?</b></summary>

<br>

| | CLIPSeg | Grounded SAM | SEEM | X-Decoder |
|:--|:--------|:-------------|:-----|:-----------|
| Text-to-mask | Direct | Two-stage (text → bbox → mask) | Multi-modal | Yes |
| Small-data fine-tuning | Proven | Moderate | Difficult | Not ideal |
| Consumer GPU (Apple M4) | Yes | Decoder only | No | No |
| HuggingFace native | Yes | Yes | GitHub only | Limited |

CLIPSeg is the only architecture that gives **direct** text-to-mask conditioning without bounding box intermediates, fine-tunes reliably on small datasets, and runs on consumer hardware with mature HuggingFace support.

</details>

### Training Configuration

| Parameter | Value |
|:----------|:------|
| Base model | [`CIDAS/clipseg-rd64-refined`](https://huggingface.co/CIDAS/clipseg-rd64-refined) |
| Trainable | 1,127,009 params (decoder only) |
| Frozen | 149,620,737 params (CLIP backbone) |
| Loss | `BCEDiceLoss` – 0.5 BCE + 0.5 Dice |
| Optimizer | AdamW (lr=1e-4, wd=1e-4) + CosineAnnealingLR |
| Early stopping | patience 7 on val mIoU |
| Device | Apple M4 (MPS backend) |
| Wall time | **97.2 min** (18 epochs, best at epoch 11) |

<details>
<summary><b>Why BCEDiceLoss instead of standard BCE?</b></summary>

<br>

Standard BCE alone fails on thin structures like cracks – the severe foreground/background imbalance means BCE happily predicts "all background" at low loss. Dice loss directly optimizes overlap, forcing the model to find crack pixels. The 50/50 blend gives gradient stability (BCE) and overlap-awareness (Dice).

</details>
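
The 50/50 blend can be sketched in NumPy (a conceptual sketch of the combined objective; the repo's actual `BCEDiceLoss` in `src/model/losses.py` is a PyTorch module, and the function name here is illustrative):

```python
import numpy as np

def bce_dice_loss(logits, target, eps=1e-7):
    """0.5 * BCE + 0.5 * Dice on sigmoid probabilities (illustrative sketch)."""
    p = 1.0 / (1.0 + np.exp(-logits))                     # sigmoid
    bce = -np.mean(target * np.log(p + eps) + (1 - target) * np.log(1 - p + eps))
    inter = np.sum(p * target)
    dice = 1.0 - (2 * inter + eps) / (np.sum(p) + np.sum(target) + eps)
    return 0.5 * bce + 0.5 * dice
```

An all-background prediction on a sparse crack mask gets a low BCE but a Dice term near 1, so the blended loss still penalizes it – which is exactly the failure mode plain BCE allows.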

### Training Pipeline

![Training loop: load pretrained → freeze backbone → train decoder → early stop → evaluate](reports/diagrams/Training%20Pipeline.png)

Training converged at epoch 11 (val mIoU 0.1605). The remaining 7 epochs showed no improvement before early stopping triggered at epoch 18.

All hyperparameters: [`configs/train_config.yaml`](configs/train_config.yaml)
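
The early-stopping rule (patience 7 on validation mIoU) is a small piece of per-epoch bookkeeping; a minimal sketch (class name illustrative – the repo's loop lives in `src/train.py`):

```python
class EarlyStopping:
    """Stop when val mIoU has not improved for `patience` epochs (sketch)."""

    def __init__(self, patience=7):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_miou):
        """Record one epoch's val mIoU; return True when training should stop."""
        if val_miou > self.best:
            self.best = val_miou
            self.bad_epochs = 0   # a new best checkpoint would be saved here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With the logged run, the best epoch was 11, so epochs 12–18 are seven non-improving epochs and training halts at epoch 18 – matching the reported wall time.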

---

## 2. Data Preparation

### Sources

Two datasets from [Roboflow Universe](https://universe.roboflow.com/), downloaded manually in COCO format:

| Dataset | Source | Images | Raw Annotation | Mask Strategy |
|:--------|:-------|:------:|:---------------|:--------------|
| Taping | [drywall-join-detect](https://universe.roboflow.com/objectdetect-pu6rn/drywall-join-detect) | 1,186 | Bounding boxes only | Filled rectangles |
| Cracks | [cracks-3ii36](https://universe.roboflow.com/fyp-ny1jt/cracks-3ii36) | 5,369 | COCO polygons | Pixel-accurate binary masks via `pycocotools` |

> **Note:** The cracks dataset had 0 generated Roboflow versions – the owner never created an exportable version, making API download impossible. The raw export was downloaded directly from the website.

### Mask Rendering

- **Cracks:** COCO polygon annotations rendered to pixel-accurate binary masks using `pycocotools.mask`. Some annotations had empty segmentation fields (edge case) – handled with a try/except fallback to bounding-box rendering.
- **Taping:** Only bounding box annotations available. Filled rectangles used as mask approximations. This is a known limitation – the rectangles include substantial background, which affects training signal quality.
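
The taping fallback amounts to a rectangle fill per COCO `[x, y, w, h]` box; a minimal NumPy sketch (function name illustrative – the real rendering is in `src/data/preprocess.py`):

```python
import numpy as np

def bboxes_to_mask(bboxes, height, width):
    """Render COCO [x, y, w, h] boxes as a filled {0, 255} mask (taping fallback)."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x, y, w, h in bboxes:
        x0, y0 = max(0, int(round(x))), max(0, int(round(y)))
        x1 = min(width, int(round(x + w)))
        y1 = min(height, int(round(y + h)))
        mask[y0:y1, x0:x1] = 255     # everything inside the box becomes foreground
    return mask
```

Every background pixel inside a box is labeled foreground here, which is the coarse-supervision problem discussed in the failure analysis below.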

### Prompt Augmentation

5 synonyms per class, randomly sampled each training iteration. This forces the decoder to learn semantic meaning from the text encoder rather than memorize exact strings:

| Class | Prompts |
|:------|:--------|
| Cracks | `"segment crack"` · `"segment wall crack"` · `"segment surface crack"` · `"segment drywall crack"` · `"segment fracture"` |
| Taping | `"segment taping area"` · `"segment joint tape"` · `"segment drywall seam"` · `"segment drywall joint"` · `"segment tape line"` |
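
Per-iteration prompt sampling is essentially a one-liner in the dataset's `__getitem__`; a sketch (names illustrative – the real implementation is in `src/data/dataset.py`):

```python
import random

PROMPTS = {
    "cracks": ["segment crack", "segment wall crack", "segment surface crack",
               "segment drywall crack", "segment fracture"],
    "taping": ["segment taping area", "segment joint tape", "segment drywall seam",
               "segment drywall joint", "segment tape line"],
}

def sample_prompt(class_name, rng=random):
    """Pick one of the 5 synonyms for this class, uniformly at random."""
    return rng.choice(PROMPTS[class_name])
```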

### Pipeline

![Data pipeline: Roboflow → inspect annotations → render masks → unified manifest → stratified split](reports/diagrams/pipeline.png)

### Splits

Stratified by class (taping vs cracks), seed 42:

| Train | Validation | Test |
|:-----:|:----------:|:----:|
| 4,588 (70%) | 982 (15%) | 985 (15%) |
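
A per-class 70/15/15 split with a fixed seed can be sketched in pure Python (a conceptual sketch – the repo's split logic in `src/data/preprocess.py` may slice slightly differently, which is why the real counts don't divide perfectly):

```python
import random

def stratified_split(items, train=0.70, val=0.15, seed=42):
    """Shuffle each class separately with a fixed seed, then slice 70/15/15.

    `items` is a list of (sample_id, class_name) pairs.
    """
    by_class = {}
    for sample_id, cls in items:
        by_class.setdefault(cls, []).append(sample_id)
    splits = {"train": [], "val": [], "test": []}
    rng = random.Random(seed)
    for cls, ids in sorted(by_class.items()):   # sorted for determinism
        rng.shuffle(ids)
        n_train = int(len(ids) * train)
        n_val = int(len(ids) * val)
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits
```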

Preprocessing code: [`src/data/preprocess.py`](src/data/preprocess.py) · Dataset class: [`src/data/dataset.py`](src/data/dataset.py)

---

## 3. Results

### Best Predictions

The model's strongest predictions reach **IoU 0.78** on both cracks and taping:

![Best test-set predictions ranked by IoU – 3 cracks + 3 taping](reports/figures/best_predictions.png)

### Test-Set Metrics (985 samples)

| Class | mIoU | Dice | Samples |
|:------|:----:|:----:|:-------:|
| Taping | 0.1917 | 0.2780 | 179 |
| Cracks | 0.1639 | 0.2434 | 806 |
| **Overall** | **0.1689** | **0.2497** | **985** |

Taping outperforms cracks because filled-rectangle masks provide a stronger supervision signal (larger contiguous regions) compared to thin crack annotations where minor spatial offsets cause disproportionate IoU drops.
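
For reference, the per-sample IoU and Dice behind these tables reduce to a few set operations on binary masks; a NumPy sketch (the repo computes them in `src/evaluate.py`; treating empty-vs-empty as a perfect score is one common convention, assumed here):

```python
import numpy as np

def iou_dice(pred, target):
    """IoU and Dice for binary {0, 255} masks (empty-vs-empty scores 1.0)."""
    p, t = pred > 0, target > 0
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    iou = inter / union if union else 1.0
    denom = p.sum() + t.sum()
    dice = 2 * inter / denom if denom else 1.0
    return float(iou), float(dice)
```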

### Inference

| Metric | Value |
|:-------|:------|
| Avg inference time | 58.7 ms / image |
| Model size | 575.1 MB |
| Output format | PNG, single-channel `{0, 255}`, resized to original dimensions |
| Threshold | 0.5 (sigmoid → binary) |
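
The logits-to-mask postprocessing in the last two rows is a few lines of NumPy (a sketch; resizing back to the original dimensions, done with an image library in practice, is elided):

```python
import numpy as np

def logits_to_mask(logits, threshold=0.5):
    """Sigmoid, threshold at 0.5, then map {False, True} -> {0, 255} uint8."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return np.where(probs > threshold, 255, 0).astype(np.uint8)
```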

---

## 4. Failure Cases & Potential Solutions

### Worst Predictions

The model's worst predictions (IoU near zero) reveal systematic failure patterns:

![Failure cases – worst test-set predictions by IoU, 3 cracks + 3 taping](reports/figures/failure_cases.png)

**What's going wrong in these examples:**

- **Cracks (rows 1–3):** The model activates over broad wall regions instead of tracing the thin crack lines. Fine cracks disappear at 352x352 resolution, and the frozen CLIP backbone has no features for hairline construction defects. The predictions show the model "knows something is there" but can't localize it precisely.
- **Taping (rows 4–6):** The model predicts large rectangular blobs that don't match the actual joint locations. This directly traces back to the filled-rectangle training masks – the model learned to predict rectangles because that's what it was supervised on.

### Root Causes

| # | Factor | Impact |
|:-:|:-------|:-------|
| 1 | **Coarse taping annotations** | Source dataset has bounding boxes, not pixel masks. Filled rectangles include background → model over-predicts. |
| 2 | **Thin crack IoU sensitivity** | A 1px crack shifted 2px = near-zero IoU despite visual similarity. Dominates aggregate. |
| 3 | **352x352 resolution ceiling** | CLIPSeg's fixed input size discards fine detail from high-res construction photos. |
| 4 | **Frozen backbone domain gap** | CLIP was trained on internet images, not construction imagery. Cannot adapt feature extraction. |
| 5 | **Small decoder (1.13M params)** | Limited capacity to learn construction-specific visual patterns. |

### Proposed Solutions

| Limitation | Solution | Expected Impact |
|:-----------|:---------|:----------------|
| Coarse taping masks | Use **SAM/SAM2** to generate pixel-accurate masks from bounding boxes before training | High – directly fixes the supervision signal |
| Frozen backbone | **Unfreeze last 2–3 ViT blocks** with 10x lower learning rate for domain adaptation | High – lets the model learn construction-specific features |
| 352x352 resolution | Switch to **SAM2 with text-prompt conditioning** or a higher-res architecture | High – preserves fine crack detail |
| Small decoder | Add decoder blocks or increase hidden dimension (monitor overfitting) | Medium – more capacity, but risk of overfitting on small data |
| Thin-crack metric sensitivity | Use **boundary IoU** or distance-tolerant evaluation instead of standard IoU | Low – doesn't improve the model, but gives fairer measurement |
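
The partial-unfreezing idea reduces to selecting the last vision-encoder blocks by parameter name and giving them their own optimizer group at 10x lower lr. A pure-Python name-filtering sketch (assuming the usual Hugging Face naming `...vision_model.encoder.layers.<i>...`; the selected group would then feed a second `AdamW` param group at lr=1e-5):

```python
def split_for_unfreezing(param_names, num_layers=12, unfreeze_last=3):
    """Split parameter names into a low-lr group (last ViT blocks) and a frozen group.

    Assumes HF-style names containing '...encoder.layers.<i>.' (illustrative).
    """
    unfrozen_ids = {str(i) for i in range(num_layers - unfreeze_last, num_layers)}
    low_lr, frozen = [], []
    for name in param_names:
        parts = name.split(".")
        if "layers" in parts and parts[parts.index("layers") + 1] in unfrozen_ids:
            low_lr.append(name)   # last blocks: trainable at 10x lower lr
        else:
            frozen.append(name)   # everything else stays frozen
    return low_lr, frozen
```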

---

## Repo Structure

![Repository structure – color-coded by module with data flow arrows](reports/diagrams/repo_structure.png)

<details>
<summary><b>File-by-file listing</b></summary>

<br>

| Path | Purpose |
|:-----|:--------|
| [`configs/train_config.yaml`](configs/train_config.yaml) | All hyperparameters in one file |
| [`src/data/preprocess.py`](src/data/preprocess.py) | Annotation inspection, mask rendering, stratified splits |
| [`src/data/dataset.py`](src/data/dataset.py) | PyTorch Dataset + CLIPSegProcessor collation |
| [`src/model/clipseg_wrapper.py`](src/model/clipseg_wrapper.py) | Model loading + backbone freezing |
| [`src/model/losses.py`](src/model/losses.py) | BCEDiceLoss implementation |
| [`src/train.py`](src/train.py) | Training loop with early stopping + logging |
| [`src/evaluate.py`](src/evaluate.py) | Test metrics, mask generation, visual comparisons |
| [`src/predict.py`](src/predict.py) | Single-image CLI inference |
| [`src/best_predictions.py`](src/best_predictions.py) | Per-sample IoU scoring, best/worst prediction figures |
| [`reports/report.typ`](reports/report.typ) | Typst source → [`report.pdf`](reports/report.pdf) |

</details>

---

## Quick Start

**Prerequisites:** Python 3.11+, [uv](https://docs.astral.sh/uv/), Homebrew (macOS)

```bash
brew install graphviz plantuml typst d2
uv sync
```

### 1. Get the data

Download both datasets from Roboflow Universe in COCO format → place under `data/raw/`:

```
data/raw/
├── taping/          # drywall-join-detect (COCO export)
│   ├── train/
│   └── valid/
└── cracks/          # cracks-3ii36 (COCO export)
    └── train/
```

### 2. Preprocess

```bash
uv run python -m src.data.preprocess
```

### 3. Train

```bash
uv run python -m src.train
```

### 4. Evaluate

```bash
uv run python -m src.evaluate
```

### 5. Predict on a single image

```bash
uv run python -m src.predict path/to/image.jpg "segment crack"
```

### 6. Build the report

```bash
d2 reports/diagrams/pipeline.d2 reports/diagrams/pipeline.png
plantuml -tpng reports/diagrams/training.puml
uv run python reports/diagrams/architecture.py
typst compile reports/report.typ reports/report.pdf
```

---

## Reproducibility

- All random state seeded with **42** (data splits, PyTorch, NumPy).
- Hyperparameters: [`configs/train_config.yaml`](configs/train_config.yaml). 
- Per-epoch training logs: [`outputs/logs/`](https://huggingface.co/youngPhilosopher/drywall-qa-clipseg/tree/main/outputs/logs).

---

<p align="center">
  <a href="https://huggingface.co/youngPhilosopher/drywall-qa-clipseg/blob/main/reports/report.pdf"><b>Read the full report (PDF)</b></a>
</p>