# Envision Eye Imaging Classifier

SetFit few-shot classifier for identifying eye imaging datasets from scientific metadata.

Developed by: FAIR Data Innovations Hub, California Medical Innovations Institute (CalMI²), in collaboration with the EyeACT Study
## Model Description

Uses `sentence-transformers/all-mpnet-base-v2` as the embedding backbone with a binary classification head:
- EYE_IMAGING (1): Actual ophthalmic imaging datasets (fundus photography, OCT, OCTA, corneal imaging, slit-lamp, anterior segment)
- NEGATIVE (0): Everything else (non-eye data, software/code, eye-adjacent non-imaging, non-eye medical imaging)
## Results on Zenodo

Tested on 515 Zenodo datasets (filtered to `resource_type=dataset` only):
| Class | Count |
|---|---|
| EYE_IMAGING | 60 |
| NEGATIVE | 455 |
## Training

- Base Model: `sentence-transformers/all-mpnet-base-v2` (768-dimensional embeddings)
- Training Examples: 891 (262 EYE_IMAGING, 629 NEGATIVE)
- Positive Data Sources: Multi-repository (Zenodo, Figshare, Dryad, Kaggle, NEI) — LLM-verified
- Negative Data Sources: Real dataset records from discovery pipelines + targeted hard negatives (non-eye medical imaging, non-eye OCT, eye-adjacent non-imaging)
- Epochs: 2
- Batch Size: 16
## Validation

### Held-out Test Set (20% stratified split)
| Metric | Value |
|---|---|
| Accuracy | 0.961 |
| Macro F1 | 0.954 |
| EYE_IMAGING F1 | 0.936 |
| EYE_IMAGING Precision | 0.911 |
| EYE_IMAGING Recall | 0.962 |
### Spot-Check Validation (33 expert-verified Zenodo records)
| Metric | Value |
|---|---|
| Accuracy | 30/33 (90.9%) |
| EYE_IMAGING F1 | 0.824 |
| EYE_IMAGING Precision | 0.875 |
| EYE_IMAGING Recall | 0.778 |
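As a sanity check, the reported F1 scores in both tables are consistent with their precision/recall pairs under the standard harmonic-mean formula F1 = 2PR/(P+R):

```python
def f1(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Held-out test set: precision 0.911, recall 0.962
print(round(f1(0.911, 0.962), 3))  # → 0.936

# Spot-check set: precision 0.875, recall 0.778
print(round(f1(0.875, 0.778), 3))  # → 0.824
```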
## Usage

```python
from sentence_transformers import SentenceTransformer
import joblib

# Load the fine-tuned embedding backbone and the classification head
model = SentenceTransformer("fairdataihub/envision-eye-imaging-classifier")
head = joblib.load("model_head.pkl")  # classification head from the model repository

# Embed the metadata text, then classify the embedding
embeddings = model.encode(["Retinal OCT dataset for diabetic retinopathy"])
predictions = head.predict(embeddings)
probabilities = head.predict_proba(embeddings)

labels = ["NEGATIVE", "EYE_IMAGING"]
print(f"Label: {labels[predictions[0]]}")
print(f"Confidence: {max(probabilities[0]):.3f}")
```
## Data Pipeline
- Harvests metadata from multiple scientific data repositories (Zenodo, Figshare, DataCite, Kaggle, Dryad, NEI)
- Classifies records as eye imaging or not
- Identified eye imaging datasets are registered on the Envision Portal
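The three steps above amount to a harvest-classify-register loop, sketched below. `harvest_records`, `classify`, and `register_on_envision_portal` are hypothetical placeholders for the real pipeline components, not published APIs.

```python
def harvest_records(repository: str) -> list[dict]:
    # Hypothetical stand-in for a repository harvester (e.g. a REST API client)
    return [{"title": "Retinal OCT scans", "description": "OCT volumes from 50 subjects"}]

def classify(record: dict) -> int:
    # Hypothetical stand-in for the SetFit classifier: 1 = EYE_IMAGING, 0 = NEGATIVE
    text = f"{record['title']} {record['description']}".lower()
    return int("oct" in text or "retinal" in text or "fundus" in text)

def register_on_envision_portal(record: dict) -> None:
    # Hypothetical placeholder for Envision Portal registration
    print(f"Registered: {record['title']}")

for repo in ["Zenodo", "Figshare", "DataCite", "Kaggle", "Dryad", "NEI"]:
    for rec in harvest_records(repo):
        if classify(rec) == 1:
            register_on_envision_portal(rec)
```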
## Citation
If you use this model in your work, please cite:
- FAIR Data Innovations Hub (fairdataihub.org)
- EyeACT Study (eyeactstudy.org)
- Tunstall et al. (2022). "Efficient Few-Shot Learning Without Prompts" (SetFit)
- `sentence-transformers/all-mpnet-base-v2`
## Contact
- James O'Neill (joneill@calmi2.org)
- Bhavesh Patel (bpatel@calmi2.org)