Envision Eye Imaging Classifier

SetFit few-shot classifier for identifying eye imaging datasets from scientific metadata.

Developed by: FAIR Data Innovations Hub, California Medical Innovations Institute (CalMI²), in collaboration with the EyeACT Study

Model Description

Uses sentence-transformers/all-mpnet-base-v2 as the backbone with a binary classification head:

  • EYE_IMAGING (1): Actual ophthalmic imaging datasets (fundus photography, OCT, OCTA, corneal imaging, slit-lamp, anterior segment)
  • NEGATIVE (0): Everything else (non-eye data, software/code, eye-adjacent non-imaging, non-eye medical imaging)

Results on Zenodo

Tested on 515 Zenodo datasets (filtered to resource_type=dataset only):

Class          Count
EYE_IMAGING       60
NEGATIVE         455
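
Eye-imaging records are a small minority of this corpus; the class balance follows directly from the counts above:

```python
# Class counts from the Zenodo evaluation set above
eye_imaging, negative = 60, 455
total = eye_imaging + negative  # 515 datasets

prevalence = eye_imaging / total  # fraction of positive records
print(f"EYE_IMAGING prevalence: {prevalence:.1%}")  # about 11.7%
```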

Training

  • Base Model: sentence-transformers/all-mpnet-base-v2 (768-dimensional embeddings)
  • Training Examples: 891 (262 EYE_IMAGING, 629 NEGATIVE)
  • Positive Data Sources: Multi-repository (Zenodo, Figshare, Dryad, Kaggle, NEI) — LLM-verified
  • Negative Data Sources: Real dataset records from discovery pipelines + targeted hard negatives (non-eye medical imaging, non-eye OCT, eye-adjacent non-imaging)
  • Epochs: 2
  • Batch Size: 16
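
SetFit trains in two stages: contrastive fine-tuning of the sentence embeddings, then fitting a lightweight classification head on the frozen embeddings. The head-fitting stage can be sketched as follows; this is a minimal illustration using random placeholder vectors in place of real all-mpnet-base-v2 embeddings, not the project's actual training script:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder 768-dim embeddings standing in for fine-tuned
# all-mpnet-base-v2 outputs (891 examples, ~262 positives,
# mirroring the training split described above).
X = rng.normal(size=(891, 768))
y = (rng.random(891) < 262 / 891).astype(float)

# Fit a logistic-regression head by plain gradient descent on log loss
w, b = np.zeros(768), 0.0
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    grad_w = X.T @ (p - y) / len(y)         # gradient w.r.t. weights
    grad_b = np.mean(p - y)                 # gradient w.r.t. bias
    w -= lr * grad_w
    b -= lr * grad_b

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
print("train accuracy:", np.mean(preds == y))
```

In the real pipeline this head is the `model_head.pkl` artifact loaded in the Usage section below.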

Validation

Held-out Test Set (20% stratified split)

Metric                  Value
Accuracy                0.961
Macro F1                0.954
EYE_IMAGING F1          0.936
EYE_IMAGING Precision   0.911
EYE_IMAGING Recall      0.962

Spot-Check Validation (33 expert-verified Zenodo records)

Metric                  Value
Accuracy                30/33 (90.9%)
EYE_IMAGING F1          0.824
EYE_IMAGING Precision   0.875
EYE_IMAGING Recall      0.778
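
These spot-check figures are internally consistent: with 9 eye-imaging records among the 33, a confusion matrix of 7 true positives, 1 false positive, 2 false negatives, and 23 true negatives reproduces every reported number:

```python
# Confusion matrix consistent with the reported spot-check figures
tp, fp, fn, tn = 7, 1, 2, 23  # 33 records total, 3 errors

precision = tp / (tp + fp)                          # 7/8
recall = tp / (tp + fn)                             # 7/9
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 30/33

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.875 recall=0.778 f1=0.824 accuracy=0.909
```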

Usage

from sentence_transformers import SentenceTransformer
from huggingface_hub import hf_hub_download
import joblib

# Load the embedding backbone from the Hub
model = SentenceTransformer("fairdataihub/envision-eye-imaging-classifier")

# Fetch and load the classification head (model_head.pkl ships in the repo)
head_path = hf_hub_download("fairdataihub/envision-eye-imaging-classifier", "model_head.pkl")
head = joblib.load(head_path)

# Embed the metadata text, then classify
embeddings = model.encode(["Retinal OCT dataset for diabetic retinopathy"])
predictions = head.predict(embeddings)
probabilities = head.predict_proba(embeddings)

labels = ["NEGATIVE", "EYE_IMAGING"]
print(f"Label: {labels[predictions[0]]}")
print(f"Confidence: {max(probabilities[0]):.3f}")

Data Pipeline

  • Harvests metadata from multiple scientific data repositories (Zenodo, Figshare, DataCite, Kaggle, Dryad, NEI)
  • Classifies records as eye imaging or not
  • Registers identified eye imaging datasets on the Envision Portal
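
The flow above can be roughly sketched as follows; the function names are hypothetical stand-ins for illustration, not the project's real API:

```python
# Hypothetical sketch of the harvest -> classify -> register pipeline.
# harvest_records, classify_record, and register_on_envision_portal
# are illustrative stubs, not real Envision functions.

def harvest_records(repository):
    """Stand-in for a repository metadata harvester."""
    return [{"title": "Retinal OCT scans", "repo": repository},
            {"title": "Soil chemistry tables", "repo": repository}]

def classify_record(record):
    """Stand-in for the SetFit classifier; a keyword check is used
    here purely for illustration."""
    title = record["title"].lower()
    return int("oct" in title or "retinal" in title)

def register_on_envision_portal(record):
    """Stand-in for registration on the Envision Portal."""
    print(f"registered: {record['title']} ({record['repo']})")

for repo in ["Zenodo", "Figshare", "DataCite", "Kaggle", "Dryad", "NEI"]:
    for rec in harvest_records(repo):
        if classify_record(rec) == 1:  # EYE_IMAGING
            register_on_envision_portal(rec)
```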

Citation

If you use this model in your work, please cite:

  • FAIR Data Innovations Hub (fairdataihub.org)
  • EyeACT Study (eyeactstudy.org)
  • Tunstall et al. (2022). "Efficient Few-Shot Learning Without Prompts" (SetFit)
  • sentence-transformers/all-mpnet-base-v2

Contact
