uncleMehrzad/polyp-segmentation

 ---
+language: en
+tags:
+- medical-imaging
+- polyp-segmentation
+- dinov3
+- vision-transformer
+- kvasir-seg
+- colonoscopy
+- unet
+datasets:
+- kmader/kvasir-segmentation
+metrics:
+- dice
+- iou
+- precision
+- recall
+- hd95
+library_name: pytorch
+pipeline_tag: image-segmentation
 license: mit
 ---
+# DINOv3 Polyp Segmentation with U-Net Decoder
+## Model Description
+This model performs **polyp segmentation** in colonoscopy images using a frozen DINOv3-ViT-L/16 backbone with multi-scale feature extraction and a U-Net style decoder with skip connections. The model was trained on the Kvasir-SEG dataset.
+**Key Features:**
+- 🏗️ **U-Net architecture**: Skip connections from shallow stem for precise boundary detection
+- 📐 **Multi-scale features**: Extracts DINOv3 features from layers [5, 11, 17, 20, 23] for rich hierarchical representation
+- 🩺 **Task-specific design**: Built for binary polyp segmentation in colonoscopy images
+- 🔒 **Frozen backbone**: Reuses DINOv3's pretrained visual features and limits overfitting by training only the stem and decoder
+- 📊 **Comprehensive metrics**: Evaluated with Dice, IoU, Precision, Recall, and HD95
+- 🔄 **Cosine annealing**: Uses CosineAnnealingWarmRestarts for better convergence
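The restart schedule behind the last bullet can be traced without PyTorch. The sketch below evaluates the closed-form cosine rule used by `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, assuming one scheduler step per epoch and the values listed under Training Details (`T_0=10`, `T_mult=2`, learning rate 1e-4 with a floor of 1e-6). The function name `warm_restart_lr` is illustrative and not part of the training code.

```python
import math

def warm_restart_lr(epoch, base_lr=1e-4, min_lr=1e-6, t_0=10, t_mult=2):
    """Learning rate at `epoch` under cosine annealing with warm restarts."""
    t_cur, t_i = epoch, t_0
    # Skip over completed cycles; each cycle is t_mult times longer than the last.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Cosine decay from base_lr towards min_lr within the current cycle.
    return min_lr + (base_lr - min_lr) * (1 + math.cos(math.pi * t_cur / t_i)) / 2

# Epochs at which the rate jumps back to base_lr during a 100-epoch run.
restarts = [e for e in range(1, 100) if math.isclose(warm_restart_lr(e), 1e-4)]
print(restarts)  # [10, 30, 70]
```

With these settings the cycles last 10, 20, and 40 epochs, and the fourth cycle (80 epochs) is cut short when training stops at epoch 100.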
+## Model Architecture
+```
+Input Image (256×256×3)
+           ↓
+┌────────────────────────┬──────────────────────┐
+│ Shallow Stem           │ DINOv3 Encoder       │
+│ (Trainable)            │ (Frozen)             │
+│                        │                      │
+│ Conv 3→64 (3×3)        │ Layers [5,11,17,     │
+│ Conv 64→128 (stride2)  │ 20,23]               │
+│ Conv 128→256 (stride2) │ Multi-scale concat   │
+│ Conv 256→512 (stride2) │ 5 × 1024 = 5120      │
+└───────┬────────────────┴──────────┬───────────┘
+        │ Skip Connections          │
+        │ [512, 256, 128]           │
+        ↓                           ↓
+┌──────────────────────────────────────────┐
+│ U-Net Decoder (Trainable)                │
+│                                          │
+│ Conv 5120→256 + Skip(512) → ConvBlock    │
+│ Upsample → Conv 384→128 + Skip(256)      │
+│ Upsample → Conv 192→64 + Skip(128)       │
+│ Upsample → Final Conv 64→1 (1×1)         │
+└──────────────────┬───────────────────────┘
+                   ↓
+Segmentation Mask (256×256×1)
+```
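One way to read the diagram is to track tensor shapes through both branches. The sketch below does that arithmetic in plain Python. It assumes a 16-pixel patch size (from the `/16` in the backbone name), 1024-dimensional ViT-L tokens, a stride-1 first stem convolution, and stride-2 convolutions that exactly halve the spatial size. The function names are illustrative and not taken from the training code.

```python
def vit_feature_shape(image_size=256, patch_size=16, embed_dim=1024, num_layers=5):
    """(channels, height, width) of the concatenated multi-scale ViT features."""
    grid = image_size // patch_size  # patches per side
    return (embed_dim * num_layers, grid, grid)

def stem_feature_shapes(image_size=256, channels=(64, 128, 256, 512)):
    """Shapes after each stem conv; every conv after the first is assumed stride 2."""
    shapes, size = [], image_size
    for i, ch in enumerate(channels):
        if i > 0:
            size //= 2  # a stride-2 conv halves height and width
        shapes.append((ch, size, size))
    return shapes

print(vit_feature_shape())    # (5120, 16, 16)
print(stem_feature_shapes())  # [(64, 256, 256), (128, 128, 128), (256, 64, 64), (512, 32, 32)]
```

Under these assumptions the encoder output is a 16×16 grid while the deepest stem map is 32×32, so the decoder would need to resize one of them before the first concatenation. The card does not state how this is handled.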
+## Training Details
+| Hyperparameter | Value |
+|---------------|-------|
+| Backbone | DINOv3-ViT-L/16 (frozen) |
+| Multi-scale Layers | [5, 11, 17, 20, 23] |
+| Input Resolution | 256×256 |
+| Batch Size | 32 |
+| Epochs | 100 |
+| Learning Rate | 1e-4 (initial) |
+| Min Learning Rate | 1e-6 |
+| Weight Decay | 1e-4 |
+| Optimizer | AdamW |
+| Scheduler | CosineAnnealingWarmRestarts |
+| Scheduler Config | T_0=10, T_mult=2 |
+| Loss Function | Focal + Dice (0.7/0.3 weights) |
+| Focal Loss Gamma | 2.0 |
+| Focal Loss Alpha | 0.25 |
+| Trainable Parameters | ~8.5M (Stem + Decoder) |
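The combined loss in the table can be written out explicitly. The NumPy sketch below assumes the 0.7 weight applies to the focal term and 0.3 to the Dice term (the table lists them in that order), uses the standard binary focal loss with gamma 2.0 and alpha 0.25, and a soft Dice loss with a smoothing constant of 1. The function names and the smoothing value are assumptions for illustration; the actual training code may differ.

```python
import numpy as np

def focal_loss(prob, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss on predicted foreground probabilities."""
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)        # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def dice_loss(prob, target, smooth=1.0):
    """Soft Dice loss: 1 minus the Dice coefficient computed on probabilities."""
    intersection = np.sum(prob * target)
    return float(1.0 - (2.0 * intersection + smooth) / (np.sum(prob) + np.sum(target) + smooth))

def combined_loss(prob, target, w_focal=0.7, w_dice=0.3):
    """Weighted sum of focal and Dice losses (weight order is an assumption)."""
    return w_focal * focal_loss(prob, target) + w_dice * dice_loss(prob, target)

target = np.array([[1, 1], [0, 0]], dtype=np.float32)
good = np.array([[0.9, 0.8], [0.1, 0.2]], dtype=np.float32)
bad = 1.0 - good
print(combined_loss(good, target) < combined_loss(bad, target))  # True
```

The focal term down-weights pixels that are already classified confidently, while the Dice term scores overlap on the whole mask, which helps when polyps occupy a small fraction of the image.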
+### Data Augmentation
+- Random 90° rotation
+- Horizontal/Vertical flips
+- ShiftScaleRotate (shift=0.05, scale=0.05, rotate=15°)
+- MotionBlur/GaussianBlur
+- ColorJitter (brightness, contrast, saturation, hue)
+## Performance Metrics
+### Final Test Set Results
+| Metric | Score |
+|--------|-------|
+| **Dice Score** | **{test_dice:.4f} ± {test_dice_std:.4f}** |
+| **IoU** | **{test_iou:.4f} ± {test_iou_std:.4f}** |
+| **Precision** | {test_precision:.4f} ± {test_precision_std:.4f} |
+| **Recall** | {test_recall:.4f} ± {test_recall_std:.4f} |
+| **HD95 (pixels)** | {test_hd95:.2f} ± {test_hd95_std:.2f} |
+| **Best Validation Dice** | {best_dice:.4f} |
+### Validation Set Results
+| Metric | Score |
+|--------|-------|
+| **Dice Score** | {val_dice:.4f} ± {val_dice_std:.4f} |
+| **IoU** | {val_iou:.4f} ± {val_iou_std:.4f} |
+| **Precision** | {val_precision:.4f} ± {val_precision_std:.4f} |
+| **Recall** | {val_recall:.4f} ± {val_recall_std:.4f} |
+| **HD95 (pixels)** | {val_hd95:.2f} ± {val_hd95_std:.2f} |
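For reference, the overlap metrics in the tables above can all be derived from per-image confusion counts. The sketch below is a minimal NumPy version, assuming binary masks of equal shape and a small `eps` to avoid division by zero. It is not taken from the evaluation script, and the function name is illustrative.

```python
import numpy as np

def overlap_metrics(pred, target, eps=1e-6):
    """Dice, IoU, precision and recall for one pair of binary masks."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = np.sum(pred & target)   # polyp pixels predicted correctly
    fp = np.sum(pred & ~target)  # background predicted as polyp
    fn = np.sum(~pred & target)  # polyp pixels that were missed
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
    }

pred = np.array([[1, 1, 0], [0, 0, 0]])
target = np.array([[1, 0, 0], [1, 0, 0]])
print(overlap_metrics(pred, target))
```

In this toy case there is one true positive, one false positive, and one false negative, so Dice, precision, and recall are each about 0.5 and IoU is about 0.33.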
+## Usage
+### Installation
+```bash
+pip install torch transformers pillow matplotlib numpy opencv-python albumentations scipy scikit-learn
+```
+### Basic Inference
+```python
+import numpy as np
+import torch
+import matplotlib.pyplot as plt
+from PIL import Image
+# Import the model architecture (same as training)
+from model import DINOv3Encoder, ShallowStem, UNetDecoder, PolypSegmentationModel
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Load model. from_pretrained (defined at the end of this card) also expects the
+# training `config` object (model_name, local_model_path, multi_scale_layers),
+# which must be created before this call.
+model = PolypSegmentationModel.from_pretrained(
+    "your-username/dinov3-polyp-seg",
+    config,
+    device=device
+)
+# Preprocess image
+def preprocess_image(image_path, target_size=(256, 256)):
+    image = Image.open(image_path).convert('RGB')
+    image = image.resize(target_size, Image.Resampling.BILINEAR)
+    # Convert to numpy and normalize with ImageNet statistics (kept in float32)
+    image_array = np.array(image).astype(np.float32) / 255.0
+    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 1, 3)
+    std = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
+    image_array = (image_array - mean) / std
+    # Convert to tensor [B, C, H, W]
+    image_tensor = torch.from_numpy(image_array).permute(2, 0, 1).unsqueeze(0)
+    return image_tensor, image
+# Run inference
+image_tensor, original_image = preprocess_image("colonoscopy_image.jpg")
+with torch.no_grad():
+    prediction = model(image_tensor.to(device))  # input must be on the model's device
+    mask = torch.sigmoid(prediction)
+    binary_mask = (mask > 0.5).float()
+    mask_np = binary_mask.squeeze().cpu().numpy()
+# Visualize
+fig, axes = plt.subplots(1, 3, figsize=(15, 5))
+axes[0].imshow(original_image)
+axes[0].set_title("Input Image")
+axes[1].imshow(mask_np, cmap='gray')
+axes[1].set_title("Polyp Segmentation")
+axes[2].imshow(original_image)
+axes[2].imshow(mask_np, cmap='Reds', alpha=0.5)
+axes[2].set_title("Overlay")
+plt.show()
+```
+### Advanced Usage with Metrics
+```python
+import numpy as np
+import torch
+from scipy.ndimage import binary_erosion
+from scipy.spatial.distance import cdist
+from torch.utils.data import DataLoader
+def compute_hd95(pred, target):
+    """Symmetric 95th-percentile Hausdorff distance between two binary masks (pixels)"""
+    pred, target = pred.astype(bool), target.astype(bool)
+    if not pred.any() or not target.any():
+        return float('inf')  # undefined when either mask is empty
+    # Boundary pixels = mask minus its erosion
+    pred_coords = np.argwhere(pred & ~binary_erosion(pred))
+    target_coords = np.argwhere(target & ~binary_erosion(target))
+    # Nearest-boundary distances in both directions, so the result is symmetric
+    dists = cdist(pred_coords, target_coords)
+    surface_dists = np.hstack([dists.min(axis=1), dists.min(axis=0)])
+    return np.percentile(surface_dists, 95)
+# Batch inference (`model` is loaded as above; `dataset` yields (image, mask) tensor pairs)
+device = next(model.parameters()).device
+dataloader = DataLoader(dataset, batch_size=16, shuffle=False)
+all_metrics = {'dice': [], 'iou': [], 'hd95': []}
+for images, masks in dataloader:
+    with torch.no_grad():
+        predictions = model(images.to(device)).cpu()
+    # Calculate metrics for each image
+    for pred, mask in zip(predictions, masks):
+        pred_binary = (torch.sigmoid(pred) > 0.5).float()
+        mask = mask.float()
+        # Dice
+        intersection = (pred_binary * mask).sum()
+        dice = (2. * intersection) / (pred_binary.sum() + mask.sum() + 1e-6)
+        # IoU
+        union = pred_binary.sum() + mask.sum() - intersection
+        iou = intersection / (union + 1e-6)
+        # HD95
+        hd95 = compute_hd95(pred_binary.numpy().squeeze(), mask.numpy().squeeze())
+        all_metrics['dice'].append(dice.item())
+        all_metrics['iou'].append(iou.item())
+        all_metrics['hd95'].append(hd95)
+# Exclude infinite HD95 values (empty prediction or mask) before averaging
+finite_hd95 = [h for h in all_metrics['hd95'] if np.isfinite(h)]
+print(f"Average Dice: {np.mean(all_metrics['dice']):.4f} ± {np.std(all_metrics['dice']):.4f}")
+print(f"Average IoU: {np.mean(all_metrics['iou']):.4f} ± {np.std(all_metrics['iou']):.4f}")
+print(f"Average HD95: {np.mean(finite_hd95):.2f} ± {np.std(finite_hd95):.2f}")
+```
+## Model Limitations
+- **Input size**: Fixed to 256×256 pixels (resize your images accordingly)
+- **Domain**: Trained only on colonoscopy images from Kvasir-SEG
+- **Polyp types**: May not generalize to all polyp morphologies
+- **Image quality**: Best performance with standard white-light colonoscopy images
+## Dataset
+Trained on the Kvasir-SEG dataset, which contains 1000 polyp images with corresponding ground truth masks from colonoscopy procedures.
+## License
+This model is released under the MIT License.
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@software{dinov3_polyp_seg,
+  author = {Your Name},
+  title = {DINOv3 Polyp Segmentation with U-Net Decoder},
+  year = {2024},
+  url = {https://huggingface.co/your-username/dinov3-polyp-seg}
+}
+```
+## Acknowledgments
+- DINOv3 team for the powerful vision backbone
+- Kvasir-SEG dataset providers for the polyp segmentation data
+- HuggingFace for model hosting infrastructure
+```python
+import torch
+import torch.nn as nn
+# DINOv3Encoder, ShallowStem and UNetDecoder are expected to be defined in the same module
+class PolypSegmentationModel(nn.Module):
+    """Complete model wrapper matching training architecture"""
+    def __init__(self, encoder, stem, decoder):
+        super().__init__()
+        self.encoder = encoder
+        self.stem = stem
+        self.decoder = decoder
+    def forward(self, x):
+        vit_features = self.encoder(x)   # frozen multi-scale DINOv3 features
+        skip_features = self.stem(x)     # trainable high-resolution skip features
+        return self.decoder(vit_features, skip_features)
+    @classmethod
+    def from_pretrained(cls, model_path, config, device="cpu"):
+        """Load the complete model from checkpoint"""
+        checkpoint = torch.load(model_path, map_location=device)
+        # Initialize components
+        encoder = DINOv3Encoder(
+            model_name=config.model_name,
+            local_path=config.local_model_path,
+            freeze=True,
+            layers=config.multi_scale_layers
+        )
+        stem = ShallowStem(in_channels=3, base_channels=64)
+        decoder = UNetDecoder(
+            vit_channels=encoder.out_channels,
+            stem_channels=[512, 256, 128],
+            num_classes=1
+        )
+        # Load weights (the checkpoint holds only the trainable stem and decoder;
+        # the frozen encoder keeps its pretrained DINOv3 weights)
+        decoder.load_state_dict(checkpoint['decoder_state_dict'])
+        stem.load_state_dict(checkpoint['stem_state_dict'])
+        model = cls(encoder, stem, decoder)
+        model.to(device)
+        model.eval()
+        return model
+```