Model Description

This model is a CLIP-style vision–language model trained on the full MedTrinity dataset.

Technical Specifications:

  • Base model: facebook/metaclip-b16-400m (CLIP-like architecture)
  • Architecture: CLIPModel from the transformers library
  • Processor: CLIPProcessor (handles both image and text preprocessing)

Example Usage

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests
import torch

model_id = "Mihara-bot/metaclip-b16-400m-medtrinity_Full"  

processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

# Example image & text
url = "https://your-image-url"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["a medical image of ...", "a normal image of ..."]

inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=77,
)

with torch.no_grad():
    outputs = model(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

image_embeds = outputs.image_embeds      # (batch, dim)
text_embeds = outputs.text_embeds        # (batch, dim)

# Cosine similarity (CLIPModel.forward L2-normalizes both projections);
# outputs.logits_per_image yields the same scores scaled by the learned temperature
logits_per_image = image_embeds @ text_embeds.t()
probs = logits_per_image.softmax(dim=-1)
print(probs)
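The similarity step above is a plain matrix product of L2-normalized embeddings followed by a row-wise softmax. A self-contained sketch with dummy tensors (the batch sizes and 512-dim projection are illustrative, not taken from this checkpoint) makes the shapes explicit:

```python
import torch

# Dummy stand-ins for outputs.image_embeds / outputs.text_embeds,
# normalized the same way CLIPModel.forward normalizes its projections
image_embeds = torch.nn.functional.normalize(torch.randn(2, 512), dim=-1)
text_embeds = torch.nn.functional.normalize(torch.randn(3, 512), dim=-1)

# Rows index images, columns index candidate texts
logits_per_image = image_embeds @ text_embeds.t()  # (2, 3) cosine similarities
probs = logits_per_image.softmax(dim=-1)           # each row sums to 1
```

Each row of `probs` is a distribution over the candidate texts for one image, which is exactly how the zero-shot scores in the example above should be read.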

Intended Use

  • Vision–language tasks such as image–text retrieval, zero-shot classification, or image–text similarity in the biomedical/medical domain (depending on the specific dataset subset used).

  • Research on data selection, influence functions, and the efficient adaptation of CLIP models.
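For zero-shot classification, class names are usually expanded into short prompt templates before being passed as `text=` to the processor. A minimal sketch (the template strings and class names here are illustrative, not drawn from the training data):

```python
def build_prompts(class_names, templates):
    """Expand each class name into one prompt per template."""
    return [t.format(name) for name in class_names for t in templates]

templates = ["a medical image of {}", "an image showing {}"]
classes = ["pneumonia", "a normal lung"]
prompts = build_prompts(classes, templates)
# prompts[0] == "a medical image of pneumonia"
```

Per-class scores are then typically obtained by averaging the text embeddings (or logits) over each class's templates.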

Not Intended For

  • Any safety-critical clinical diagnosis or automated medical decision-making.

  • Any deployment without human oversight, especially in healthcare environments.

Limitations

  • The model is trained on MedTrinity; it may reflect the biases and coverage limitations of the underlying dataset.

  • Performance outside the target domain (e.g., general web images) is likely weaker than that of generic CLIP models.

  • Training text largely consists of short captions; performance on long, structured clinical narratives may be limited.
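One common workaround for CLIP's short text context (77 tokens, as reflected by `max_length=77` in the example above) is to split a long narrative into overlapping chunks, encode each, and average the embeddings. A minimal whitespace-based sketch (real usage would chunk by tokenizer tokens, not words, and the window sizes here are arbitrary):

```python
def chunk_words(text, max_words=60, overlap=10):
    """Split text into overlapping word windows that fit a short encoder context."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # slide forward, keeping `overlap` words
    return chunks

report = ("finding " * 130).strip()  # stand-in for a long clinical narrative
chunks = chunk_words(report)
```

Each chunk can then be encoded separately with the processor and model above, and the resulting text embeddings mean-pooled into a single vector.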

Citation

If you find this model useful, please cite the CHIPS paper:

@misc{zhuang2025chipsefficientclipadaptation,
      title={CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection}, 
      author={Xinlin Zhuang and Yichen Li and Xiwei Liu and Haolin Yang and Yifan Lu and Ziyun Zou and Yulong Li and Huifa Li and Dongliang Chen and Qinglei Wang and Weiyang Liu and Ying Qian and Jiangming Shi and Imran Razzak},
      year={2025},
      eprint={2511.18519},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.18519}, 
}