OmniRad: A General-Purpose Radiological Foundation Model


OmniRad is a self-supervised radiological foundation model designed to learn stable, transferable, and task-agnostic visual representations for medical imaging. It is pretrained on large-scale, heterogeneous radiological data and intended for reuse across classification, segmentation, and exploratory vision–language tasks without task-specific pretraining.

This repository provides the OmniRad-small variant, a compact Vision Transformer encoder that offers an excellent trade-off between computational efficiency and representational power.


Key Features

  • Radiology-focused foundation model pretrained on >1M radiological images
  • Self-supervised learning based on a customized DINOv2 framework
  • Task-agnostic encoder reusable across classification, segmentation, and multimodal pipelines
  • Strong transferability across modalities (CT, MRI, X-ray, ultrasound)
  • Radiomics-oriented design, emphasizing representation stability and reuse

Example Usage: Feature Extraction

from PIL import Image
from torchvision import transforms
import timm
import torch

# Load OmniRad-small from Hugging Face Hub
model = timm.create_model(
    "hf_hub:Snarcy/OmniRad-small",
    pretrained=True,
    num_classes=0  # return embeddings
)

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

# Load image
image = Image.open("path/to/radiology_image.png").convert("RGB")
x = transform(image).unsqueeze(0).to(device)

# Extract features
with torch.no_grad():
    embedding = model(x)  # shape: [1, 384]
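The pooled embeddings can be used directly, for example for image retrieval by cosine similarity. A minimal sketch, with random tensors standing in for real OmniRad embeddings (in practice these would come from `model(x)` as above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for OmniRad embeddings: one query and a small gallery
query = torch.randn(1, 384)
gallery = torch.randn(10, 384)

# L2-normalize so the dot product equals cosine similarity
query_n = F.normalize(query, dim=-1)
gallery_n = F.normalize(gallery, dim=-1)

scores = query_n @ gallery_n.T   # shape: [1, 10], values in [-1, 1]
best = scores.argmax(dim=-1)     # index of the most similar gallery image
```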


Available Downstream Code

The official OmniRad repository provides end-to-end implementations for all evaluated downstream tasks:

👉 https://github.com/unica-visual-intelligence-lab/OmniRad

These include:

  • Image-level classification (MedMNIST v2 benchmarks)
  • Dense medical image segmentation (MedSegBench, frozen encoder + lightweight decoders)
  • Radiological image captioning (BART-based vision–language framework)
  • Full training, evaluation, and ablation scripts
  • Reproducible experimental configurations matching the paper

Model Details

  • Architecture: Vision Transformer (ViT-S)
  • Patch size: 14
  • Embedding dimension: 384
  • Pretraining framework: Modified DINOv2 (global crops only)
  • Pretraining dataset: RadImageNet (~1.2M radiological images)
  • Input resolution: 224 × 224
  • Backbone type: Encoder-only (no task-specific heads)

Pretraining Notes

  • Local crops are removed to improve training stability and downstream transferability
  • No feature collapse observed during training
  • Same hyperparameter configuration used across small and base variants
  • Designed to support frozen-backbone adaptation and lightweight fine-tuning
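Frozen-backbone adaptation typically means training only a lightweight head on top of the 384-dim embeddings. A hypothetical linear-probe sketch, with random tensors standing in for embeddings extracted by the frozen encoder and a made-up 5-class task:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for frozen-encoder embeddings and their labels
features = torch.randn(256, 384)       # 256 samples, 384-dim embeddings
labels = torch.randint(0, 5, (256,))   # hypothetical 5-class task

# Linear probe: the only trainable parameters
probe = nn.Linear(384, 5)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

losses = []
for step in range(50):
    optimizer.zero_grad()
    loss = criterion(probe(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```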

Intended Use

OmniRad is intended as a general-purpose radiological image encoder for:

  • Image-level classification (e.g., disease or organ recognition)
  • Dense prediction (e.g., medical image segmentation via adapters or decoders)
  • Radiomics feature extraction
  • Representation transfer across datasets, modalities, and institutions
  • Exploratory vision–language research (e.g., radiological image captioning)

Not intended for direct clinical deployment without task-specific validation.
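For dense prediction, the patch tokens, rather than the pooled embedding, are usually what feeds a decoder. In timm, `model.forward_features(x)` returns the full token sequence; with a 224 × 224 input and patch size 14, that is (assuming a single class-token prefix) one class token plus a 16 × 16 grid of patch tokens. A sketch under those assumptions, with a random token tensor standing in for the encoder output and a hypothetical 1×1-conv decoder head:

```python
import torch
import torch.nn as nn

B, D, G = 1, 384, 16   # batch, embed dim, 16x16 patch grid (224 / 14)

# Stand-in for model.forward_features(x): class token + 256 patch tokens
tokens = torch.randn(B, 1 + G * G, D)

# Drop the class token and reshape patch tokens into a 2D feature map
patches = tokens[:, 1:, :]                           # [B, 256, 384]
fmap = patches.transpose(1, 2).reshape(B, D, G, G)   # [B, 384, 16, 16]

# Hypothetical lightweight decoder: 1x1 conv to per-pixel logits + upsampling
decoder = nn.Sequential(
    nn.Conv2d(D, 2, kernel_size=1),   # 2-class segmentation head
    nn.Upsample(size=(224, 224), mode="bilinear", align_corners=False),
)
logits = decoder(fmap)   # [B, 2, 224, 224]
```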


License

This project and the released model weights are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Made with ❤️ by UNICA Visual Intelligence Lab
