---
language: en
tags:
- medical-imaging
- polyp-segmentation
- dinov3
- vision-transformer
- kvasir-seg
- colonoscopy
- unet
datasets:
- kmader/kvasir-segmentation
metrics:
- dice
- iou
- precision
- recall
- hd95
library_name: pytorch
pipeline_tag: image-segmentation
license: mit
---

# DINOv3 Polyp Segmentation with U-Net Decoder

## Model Description

This model performs **polyp segmentation** in colonoscopy images using a frozen DINOv3-ViT-L/16 backbone with multi-scale feature extraction and a U-Net style decoder with skip connections. The model was trained on the Kvasir-SEG dataset.

**Key Features:**
- 🏗️ **U-Net architecture**: Skip connections from shallow stem for precise boundary detection
- 📐 **Multi-scale features**: Extracts DINOv3 features from layers [5, 11, 17, 20, 23] for rich hierarchical representation
- 🩺 **Medical-grade segmentation**: Specifically designed for polyp detection in colonoscopy
- 🔒 **Frozen backbone**: Leverages DINOv3's rich visual features without overfitting
- 📊 **Comprehensive metrics**: Evaluated with Dice, IoU, Precision, Recall, and HD95
- 🔄 **Cosine annealing**: Uses CosineAnnealingWarmRestarts for better convergence

## Model Architecture
Input Image (256×256×3)
↓
┌───────────────────────┬──────────────────────┐
│ Shallow Stem │ DINOv3 Encoder │
│ (Trainable) │ (Frozen) │
│ │ │
│ Conv 3→64 (3×3) │ Layers [5,11,17, │
│ Conv 64→128 (stride2)│ 20,23] │
│ Conv 128→256 (stride2)│ Multi-scale concat │
│ Conv 256→512 (stride2)│ 5 × 1024 = 5120 │
└───────┬───────────────┴──────────┬───────────┘
│ Skip Connections │
│ [512, 256, 128] │
↓ ↓
┌──────────────────────────────────────────┐
│ U-Net Decoder (Trainable) │
│ │
│ Conv 5120→256 + Skip(512) → ConvBlock │
│ Upsample → Conv 384→128 + Skip(256) │
│ Upsample → Conv 192→64 + Skip(128) │
│ Upsample → Final Conv 64→1 (1×1) │
└──────────────────┬───────────────────────┘
↓
Segmentation Mask (256×256×1)


## Training Details

| Hyperparameter | Value |
|---------------|-------|
| Backbone | DINOv3-ViT-L/16 (frozen) |
| Multi-scale Layers | [5, 11, 17, 20, 23] |
| Input Resolution | 256×256 |
| Batch Size | 96 |
| Epochs | 150 |
| Learning Rate | 1e-4 (initial) |
| Min Learning Rate | 1e-6 |
| Weight Decay | 1e-4 |
| Optimizer | AdamW |
| Scheduler | CosineAnnealingWarmRestarts |
| Scheduler Config | T_0=10, T_mult=2 |
| Loss Function | Focal + Dice (0.7/0.3 weights) |
| Focal Loss Gamma | 2.0 |
| Focal Loss Alpha | 0.25 |
| Trainable Parameters | ~21M (Stem + Decoder) |

### Data Augmentation
- Random 90° rotation
- Horizontal/Vertical flips
- ShiftScaleRotate (shift=0.05, scale=0.05, rotate=15°)
- MotionBlur/GaussianBlur
- ColorJitter (brightness, contrast, saturation, hue)

## Performance Metrics

### Final Test Set Results

| Metric | Score |
|--------|-------|
| **Dice Score** | **0.8289 ± 0.0000** |
| **IoU** | **0.7078 ± 0.0000** |
| **Precision** | 0.7910 ± 0.0000 |
| **Recall** | 0.8705 ± 0.0000 |
| **HD95 (pixels)** | 45.46 ± 0.00 |
| **Best Validation Dice** | 0.7327 |

### Validation Set Results

| Metric | Score |
|--------|-------|
| **Dice Score** | **0.8795 ± 0.0304** |
| **IoU** | **0.7862 ± 0.0485** |
| **Precision** | 0.8846 ± 0.0256 |
| **Recall** | 0.8744 ± 0.0351 |
| **HD95 (pixels)** | 30.85 ± 8.90 |

### Training Set Results (Final Epoch)

| Metric | Score |
|--------|-------|
| **Dice Score** | 0.8747 ± 0.0108 |
| **IoU** | 0.7775 ± 0.0170 |
| **Precision** | 0.8698 ± 0.0189 |
| **Recall** | 0.8801 ± 0.0136 |
| **HD95 (pixels)** | 33.91 ± 1.69 |

## Usage

### Installation

```bash
pip install torch transformers pillow matplotlib numpy opencv-python albumentations scipy scikit-learn
Basic Inference
python
import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Import the model architecture (same as training)
from model import DINOv3Encoder, ShallowStem, UNetDecoder, PolypSegmentationModel

# Load model
model = PolypSegmentationModel.from_pretrained(
    "your-username/dinov3-polyp-seg",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Preprocess image
def preprocess_image(image_path, target_size=(256, 256)):
    image = Image.open(image_path).convert('RGB')
    image = image.resize(target_size, Image.Resampling.BILINEAR)
    
    # Convert to numpy and normalize
    image_array = np.array(image).astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406]).reshape(1, 1, 3)
    std = np.array([0.229, 0.224, 0.225]).reshape(1, 1, 3)
    image_array = (image_array - mean) / std
    
    # Convert to tensor [B, C, H, W]
    image_tensor = torch.from_numpy(image_array).permute(2, 0, 1).unsqueeze(0)
    return image_tensor, image

# Run inference
image_tensor, original_image = preprocess_image("colonoscopy_image.jpg")

with torch.no_grad():
    prediction = model(image_tensor)
    mask = torch.sigmoid(prediction)
    binary_mask = (mask > 0.5).float()
    mask_np = binary_mask.squeeze().cpu().numpy()

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].imshow(original_image)
axes[0].set_title("Input Image")
axes[1].imshow(mask_np, cmap='gray')
axes[1].set_title("Polyp Segmentation")
axes[2].imshow(original_image)
axes[2].imshow(mask_np, cmap='Reds', alpha=0.5)
axes[2].set_title("Overlay")
plt.show()
Advanced Usage with Metrics
python
from scipy.ndimage import morphology

def compute_hd95(pred, target):
    """Compute Hausdorff Distance 95th percentile"""
    if pred.sum() == 0 or target.sum() == 0:
        return float('inf')
    
    pred_border = pred - morphology.binary_erosion(pred)
    target_border = target - morphology.binary_erosion(target)
    
    pred_coords = np.argwhere(pred_border > 0)
    target_coords = np.argwhere(target_border > 0)
    
    distances = []
    for p in pred_coords:
        dist = np.min(np.sqrt(np.sum((target_coords - p) ** 2, axis=1)))
        distances.append(dist)
    
    return np.percentile(distances, 95)

# Batch inference
dataloader = DataLoader(dataset, batch_size=16, shuffle=False)

all_metrics = {'dice': [], 'iou': [], 'hd95': []}
for images, masks in dataloader:
    with torch.no_grad():
        predictions = model(images)
    
    # Calculate metrics for each image
    for pred, mask in zip(predictions, masks):
        pred_binary = (torch.sigmoid(pred) > 0.5).float()
        
        # Dice
        intersection = (pred_binary * mask).sum()
        dice = (2. * intersection) / (pred_binary.sum() + mask.sum() + 1e-6)
        
        # IoU
        union = pred_binary.sum() + mask.sum() - intersection
        iou = intersection / (union + 1e-6)
        
        # HD95
        hd95 = compute_hd95(pred_binary.numpy().squeeze(), mask.numpy().squeeze())
        
        all_metrics['dice'].append(dice.item())
        all_metrics['iou'].append(iou.item())
        all_metrics['hd95'].append(hd95)

print(f"Average Dice: {np.mean(all_metrics['dice']):.4f} ± {np.std(all_metrics['dice']):.4f}")
print(f"Average IoU: {np.mean(all_metrics['iou']):.4f} ± {np.std(all_metrics['iou']):.4f}")
print(f"Average HD95: {np.mean(all_metrics['hd95']):.2f} ± {np.std(all_metrics['hd95']):.2f}")
Model Limitations
Input size: Fixed to 256×256 pixels (resize your images accordingly)

Domain: Trained only on colonoscopy images from Kvasir-SEG

Polyp types: May not generalize to all polyp morphologies

Image quality: Best performance with standard white-light colonoscopy images

## Dataset
Trained on the Kvasir-SEG dataset, which contains 1000 polyp images with corresponding ground truth masks from colonoscopy procedures.

## License
This model is released under the MIT License.

## Citation
If you use this model in your research, please cite:

bibtex
@software{dinov3_polyp_seg,
  author = {Amirreza Mehrzadian},
  title = {DINOv3 Polyp Segmentation with U-Net Decoder},
  year = {2024},
  url = {https://huggingface.co/uncleMehrzad/dinov3-polyp-seg}
}
## Acknowledgments
DINOv3 team for the powerful vision backbone

Kvasir-SEG dataset providers for the polyp segmentation data

HuggingFace for model hosting infrastructure


```python
class PolypSegmentationModel(nn.Module):
    """Complete model wrapper matching training architecture"""
    
    def __init__(self, encoder, stem, decoder):
        super().__init__()
        self.encoder = encoder
        self.stem = stem
        self.decoder = decoder
    
    def forward(self, x):
        vit_features = self.encoder(x)
        skip_features = self.stem(x)
        return self.decoder(vit_features, skip_features)
    
    @classmethod
    def from_pretrained(cls, model_path, config, device="cpu"):
        """Load the complete model from checkpoint"""
        checkpoint = torch.load(model_path, map_location=device)
        
        # Initialize components
        encoder = DINOv3Encoder(
            model_name=config.model_name,
            local_path=config.local_model_path,
            freeze=True,
            layers=config.multi_scale_layers
        )
        
        stem = ShallowStem(in_channels=3, base_channels=64)
        
        decoder = UNetDecoder(
            vit_channels=encoder.out_channels,
            stem_channels=[512, 256, 128],
            num_classes=1
        )
        
        # Load weights
        decoder.load_state_dict(checkpoint['decoder_state_dict'])
        stem.load_state_dict(checkpoint['stem_state_dict'])
        
        model = cls(encoder, stem, decoder)
        model.to(device)
        model.eval()
        
        return model