| #set document(title: "Prompted Segmentation for Drywall QA", author: "Karthik M Dani") |
| #set page(paper: "a4", margin: (x: 1cm, y: 1cm), numbering: "1") |
| #set text(font: "New Computer Modern", size: 11pt) |
| #set heading(numbering: "1.0") |
| #set par(justify: true) |
|
|
| // Title page |
| #align(center)[ |
| #v(3cm) |
| #text(size: 24pt, weight: "bold")[Prompted Segmentation for Drywall QA] |
| #v(1cm) |
| #text(size: 14pt)[Text-Conditioned Binary Mask Prediction] |
| #v(0.5cm) |
| #text(size: 12pt, fill: gray)[CLIPSeg Fine-Tuning on Construction Datasets] |
| #v(2cm) |
| #text(size: 12pt)[Karthik M Dani] |
| #v(0.3cm) |
| #text(size: 11pt, fill: gray)[April 2026] |
| #v(3cm) |
| ] |
|
|
| #pagebreak() |
|
|
| // Table of contents |
| #outline(indent: 1.5em) |
| #pagebreak() |
|
|
| = Goal Summary |
|
|
| Given an image and a natural-language prompt, produce a binary segmentation mask for: |
| - *"segment crack"* — identifying wall cracks |
| - *"segment taping area"* — identifying drywall joint/tape regions |
|
|
| The model must generalize across varied scenes and respond to text prompts at inference time, enabling flexible QA workflows. |
|
|
| = Approach |
|
|
| == Why CLIPSeg? |
|
|
| We evaluated four text-conditioned segmentation architectures: |
|
|
| #table( |
| columns: (1fr, 1fr, 1fr, 1fr, 1fr), |
| align: (left, center, center, center, center), |
| table.header( |
| [*Model*], [*Text Input*], [*Small Data*], [*Consumer GPU*], [*HF Support*], |
| ), |
| [CLIPSeg], [Direct], [Excellent], [Yes], [Native], |
| [Grounded SAM], [Via detector], [Moderate], [Decoder only], [Native], |
| [SEEM], [Multi-modal], [Difficult], [No], [GitHub], |
[X-Decoder], [Supported], [Not ideal], [No], [Limited],
| ) |
|
|
| *CLIPSeg* was selected because: |
| + Direct text-to-mask conditioning (no bounding box intermediate) |
| + Only 1.13M trainable decoder parameters on frozen 149.6M CLIP backbone |
| + Proven fine-tuning on small datasets (under 1,000 images) |
| + Native HuggingFace `transformers` support |
|
|
| == Architecture |
|
|
| #figure( |
| image("diagrams/architecture.png", width: 90%), |
| caption: [CLIPSeg architecture: frozen CLIP backbone with trainable decoder], |
| ) |
|
|
| The model takes an RGB image and a text prompt. The CLIP vision encoder (ViT-B/16) and text encoder produce embeddings. A lightweight 3-block transformer decoder with U-Net skip connections generates logits at 352×352, which are thresholded to produce binary masks. |
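The sigmoid-and-threshold step at the end of this pipeline can be sketched as follows (a minimal NumPy version; `logits_to_mask` is a hypothetical helper name, and in the real pipeline the logits come from the decoder output at 352×352):

```python
import numpy as np

def logits_to_mask(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Apply a sigmoid to raw decoder logits, then threshold to a binary mask."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs > threshold).astype(np.uint8)

# Toy 2x2 logit map; real CLIPSeg logits have spatial shape 352x352.
logits = np.array([[-2.0, 0.1], [3.0, -0.5]])
mask = logits_to_mask(logits)  # 1 wherever sigmoid(logit) > 0.5
```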
|
|
| = Data |
|
|
| == Sources |
|
|
| #table( |
| columns: (1fr, 2fr, 1fr, 1fr), |
| align: (left, left, center, center), |
| table.header( |
| [*Dataset*], [*Source*], [*Images*], [*Annotation*], |
| ), |
| [Taping], [Roboflow: drywall-join-detect], [1,186], [Bounding boxes], |
| [Cracks], [Roboflow: cracks-3ii36], [5,369], [Segmentation polygons], |
| ) |
|
|
| == Data Pipeline |
|
|
| #figure( |
| image("diagrams/pipeline.png", width: 65%), |
| caption: [Data preparation pipeline from download to train/val/test splits], |
| ) |
|
|
| - *Taping dataset*: Bounding box annotations converted to filled-rectangle binary masks |
| - *Cracks dataset*: COCO polygon annotations rendered to pixel-accurate binary masks via `pycocotools` |
| - Prompt augmentation: 5 synonyms per class, randomly sampled during training |
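The taping conversion above can be sketched as a simple rasterization step (the function name and the COCO-style `(x, y, w, h)` box format are assumptions; the cracks path uses `pycocotools` instead):

```python
import numpy as np

def bboxes_to_mask(height: int, width: int, boxes) -> np.ndarray:
    """Render (x, y, w, h) bounding boxes as a filled-rectangle binary mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x, y, bw, bh in boxes:
        mask[int(y):int(y + bh), int(x):int(x + bw)] = 1
    return mask

# One 4x5 box on a 10x10 canvas -> 20 foreground pixels.
mask = bboxes_to_mask(10, 10, [(2, 3, 4, 5)])
```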
|
|
| == Split Counts |
|
|
| #table( |
| columns: (1fr, 1fr, 1fr, 1fr), |
| align: (left, center, center, center), |
| table.header( |
| [*Split*], [*Train*], [*Validation*], [*Test*], |
| ), |
| [Count], [4,588], [982], [985], |
| [Ratio], [70%], [15%], [15%], |
| ) |
|
|
| Stratified by dataset class (taping vs. cracks), seed = 42. |
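The stratified split can be sketched as shuffling each class group independently under a fixed seed before slicing (a generic sketch, not the project's actual split code; `stratified_split` is a hypothetical name):

```python
import random

def stratified_split(items, labels, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split items into (train, val, test), stratified by class label."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    by_class = {}
    for item, lab in zip(items, labels):
        by_class.setdefault(lab, []).append(item)
    for group in by_class.values():
        rng.shuffle(group)  # deterministic given the seed
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])
    return train, val, test

# Two balanced classes of 50 items each -> 70 / 14 / 16 split.
train, val, test = stratified_split(list(range(100)), [i % 2 for i in range(100)])
```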
|
|
| = Training |
|
|
| == Pipeline |
|
|
| #figure( |
| image("diagrams/Training Pipeline.png", width: 40%), |
| caption: [Training loop with early stopping], |
| ) |
|
|
| == Hyperparameters |
|
|
| #table( |
| columns: (1fr, 1fr), |
| align: (left, left), |
| table.header( |
| [*Parameter*], [*Value*], |
| ), |
| [Model], [`CIDAS/clipseg-rd64-refined`], |
| [Trainable params], [1,127,009 (decoder only)], |
| [Frozen params], [149,620,737 (CLIP backbone)], |
| [Optimizer], [AdamW (lr=1e-4, weight_decay=1e-4)], |
| [Scheduler], [CosineAnnealingLR], |
| [Loss], [0.5 × BCE + 0.5 × Dice], |
| [Batch size], [8], |
| [Max epochs], [30], |
| [Early stopping], [patience = 7 on val mIoU], |
| [Seed], [42], |
| ) |
|
|
| == Training Results |
|
|
Training ran for 18 epochs before early stopping was triggered (patience = 7); the best validation mIoU was reached at epoch 11.
|
|
| #table( |
| columns: (1fr, 1fr, 1fr, 1fr, 1fr), |
| align: (center, center, center, center, center), |
| table.header( |
| [*Epoch*], [*Train Loss*], [*Val Loss*], [*Val mIoU*], [*Val Dice*], |
| ), |
| [1], [0.5512], [0.5339], [0.1186], [0.1895], |
| [4], [0.5196], [0.5213], [0.1539], [0.2312], |
| [8], [0.5113], [0.5135], [0.1543], [0.2300], |
| [*11*], [*0.5085*], [*0.5117*], [*0.1605*], [*0.2370*], |
| [14], [0.5056], [0.5077], [0.1501], [0.2237], |
| [18], [0.5033], [0.5068], [0.1531], [0.2273], |
| ) |
|
|
| The model showed steady improvement in the first 11 epochs, with diminishing returns and eventual plateau thereafter. The loss landscape appears relatively flat in this region, which is expected given the frozen backbone and small decoder. |
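The stopping rule (halt once val mIoU has not improved for 7 consecutive epochs) can be sketched as below; this is a generic sketch rather than the project's trainer code, and it reproduces the observed best epoch 11 / stop epoch 18 pattern:

```python
def run_with_early_stopping(val_mious, patience=7):
    """Return (best_epoch, stop_epoch) for a 1-indexed per-epoch mIoU series."""
    best, best_epoch = float("-inf"), 0
    for epoch, miou in enumerate(val_mious, start=1):
        if miou > best:
            best, best_epoch = miou, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # patience exhausted
    return best_epoch, len(val_mious)

# Improvement through epoch 11, then a plateau: stops at epoch 18.
result = run_with_early_stopping([0.1 + 0.005 * i for i in range(11)] + [0.14] * 7)
```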
|
|
| = Evaluation Results |
|
|
| == Metrics |
|
|
| #table( |
| columns: (1fr, 1fr, 1fr, 1fr), |
| align: (left, center, center, center), |
| table.header( |
| [*Class*], [*mIoU*], [*Dice*], [*Test Samples*], |
| ), |
| [Taping], [0.1917], [0.2780], [179], |
| [Cracks], [0.1639], [0.2434], [806], |
| [*Overall*], [*0.1689*], [*0.2497*], [*985*], |
| ) |
|
|
Taping detection outperforms crack detection, likely because filled-rectangle masks provide a stronger supervision signal (larger contiguous regions) than thin crack annotations. The class imbalance in the test set (179 taping vs. 806 cracks) reflects the original dataset sizes.
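The reported per-image metrics can be computed as below (a minimal NumPy sketch; `iou_and_dice` is a hypothetical helper, and mIoU averages the per-image IoU over the test set):

```python
import numpy as np

def iou_and_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """IoU and Dice between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = (inter + eps) / (union + eps)
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    return iou, dice

# Half-overlapping masks: IoU = 1/2, Dice = 2/3.
i, d = iou_and_dice(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]]))
```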
|
|
| == Visual Examples |
|
|
| The best individual predictions reach IoU 0.78 for both cracks and taping, demonstrating that the model has learned meaningful text-conditioned segmentation despite the low aggregate scores. The gap between best-sample and mean performance is driven primarily by thin-crack samples where minor spatial offsets cause disproportionate IoU drops. |
|
|
| #figure( |
| image("figures/best_predictions.png", width: 70%), |
| caption: [Best test-set predictions ranked by IoU: Original | Ground Truth | Model Prediction], |
| ) |
|
|
| = Failure Cases & Potential Solutions |
|
|
| == Worst Predictions |
|
|
| The following examples show the model's worst test-set predictions (IoU near zero). These failures reveal systematic patterns that inform targeted improvements. |
|
|
| #figure( |
| image("figures/failure_cases.png", width: 65%), |
| caption: [Failure cases — worst test-set predictions ranked by IoU (ascending)], |
| ) |
|
|
| == Root Causes |
|
|
| - *Taping annotations are coarse*: The source dataset provides bounding boxes, not pixel-level masks. Filled rectangles include substantial background, teaching the model to predict overly large regions. |
| - *Cracks are thin structures*: Even small positional errors in crack predictions cause significant IoU drops. A 1-pixel-wide crack shifted by 2 pixels yields near-zero IoU despite visual similarity. |
| - *Resolution bottleneck*: CLIPSeg operates at 352×352 fixed resolution. Fine crack details are lost during downscaling, particularly for high-resolution input images. |
| - *Decoder capacity*: With only 1.13M trainable parameters, the decoder has limited capacity to learn domain-specific features for construction imagery. |
| - *Domain gap*: The pretrained CLIP backbone was trained on internet images, not construction-specific content. The frozen backbone cannot adapt its feature extraction to this domain. |
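The thin-structure sensitivity is easy to demonstrate numerically: a 1-pixel-wide crack shifted by 2 pixels has zero overlap with the ground truth, so IoU collapses to 0 even though the two masks look nearly identical.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

crack = np.zeros((32, 32), dtype=bool)
crack[:, 10] = True                   # 1-pixel-wide vertical crack
shifted = np.roll(crack, 2, axis=1)   # same crack, offset by 2 pixels
exact = iou(crack, crack)     # 1.0
offset = iou(crack, shifted)  # 0.0 despite visual similarity
```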
|
|
| == Potential Solutions |
|
|
| #table( |
| columns: (1fr, 2fr), |
| align: (left, left), |
| table.header( |
| [*Limitation*], [*Proposed Solution*], |
| ), |
| [Coarse taping masks], [Use SAM or SAM2 to generate pixel-accurate masks from bounding boxes instead of filled rectangles], |
| [Frozen backbone domain gap], [Unfreeze the last 2--3 ViT blocks with a 10× lower learning rate for domain adaptation], |
| [352×352 resolution ceiling], [Switch to a higher-resolution architecture (e.g. SAM2 with text-prompt conditioning)], |
| [Small decoder], [Add more decoder blocks or increase hidden dimension while monitoring overfitting], |
| [Thin-crack IoU sensitivity], [Use boundary-aware metrics (e.g. boundary IoU) or distance-tolerant evaluation], |
| ) |
|
|
| = Runtime & Footprint |
|
|
| #table( |
| columns: (1fr, 1fr), |
| align: (left, left), |
| table.header( |
| [*Metric*], [*Value*], |
| ), |
| [Training time], [97.2 minutes (18 epochs)], |
| [Training device], [Apple M4 (MPS)], |
[Training speed], [≈2.1 iterations/second],
| [Avg inference time], [58.7 ms/image], |
| [Model size (full)], [575.1 MB], |
| [Trainable parameters], [1.13M (decoder)], |
| [Total parameters], [150.7M], |
| ) |
|
|
| = Audit Log |
|
|
| #table( |
| columns: (1fr, 2fr), |
| align: (left, left), |
| table.header( |
| [*Step*], [*Details*], |
| ), |
| [Environment], [Python 3.11, PyTorch 2.11, transformers 5.5.3, uv], |
| [Datasets], [Roboflow Universe: drywall-join-detect (v1), cracks-3ii36 (raw)], |
| [Annotation handling], [Taping: bbox→rectangle masks; Cracks: polygon→binary masks], |
| [Model], [CLIPSeg (CIDAS/clipseg-rd64-refined), decoder-only fine-tuning], |
| [Loss], [BCEDiceLoss (0.5/0.5)], |
| [Device], [Apple M4, MPS backend], |
| [Diagrams], [diagrams (Python), d2, PlantUML], |
| [Report], [Typst], |
| [Seeds], [42 (data splits, torch, numpy)], |
| ) |
|
|