fairdataihub/envision-eye-imaging-classifier

@@ -9,95 +9,81 @@ tags:
 - medical-imaging
 - fair-data
 - eyeact
-- binary-classification
 ---
 # Envision Eye Imaging Classifier
-SetFit few-shot classifier for identifying eye imaging datasets from scientific metadata.
-**Developed by**: FAIR Data Innovations Hub, California Medical Innovations Institute (CalMI²), in collaboration with the EyeACT Study
 ## Model Description
 Uses `sentence-transformers/all-mpnet-base-v2` as backbone with binary classification:
-- **EYE_IMAGING (1)**: Actual ophthalmic imaging datasets (fundus photography, OCT, OCTA, corneal imaging, slit-lamp, anterior segment)
-- **NEGATIVE (0)**: Everything else (non-eye data, software/code, eye-adjacent non-imaging, non-eye medical imaging)
-## Results on Zenodo
-Tested on 515 Zenodo datasets (filtered to `resource_type=dataset` only):
-| Class | Count |
-|-------|-------|
-| EYE_IMAGING | 60 |
-| NEGATIVE | 455 |
-## Training
-- **Base Model**: `sentence-transformers/all-mpnet-base-v2` (768-dimensional embeddings)
-- **Training Examples**: 891 (262 EYE_IMAGING, 629 NEGATIVE)
-- **Positive Data Sources**: Multi-repository (Zenodo, Figshare, Dryad, Kaggle, NEI) — LLM-verified
-- **Negative Data Sources**: Real dataset records from discovery pipelines + targeted hard negatives (non-eye medical imaging, non-eye OCT, eye-adjacent non-imaging)
-- **Epochs**: 2
-- **Batch Size**: 16
 ## Validation
-### Held-out Test Set (20% stratified split)
-| Metric | Value |
 |--------|-------|
-| Accuracy | 0.961 |
-| Macro F1 | 0.954 |
-| EYE_IMAGING F1 | 0.936 |
-| EYE_IMAGING Precision | 0.911 |
-| EYE_IMAGING Recall | 0.962 |
-### Spot-Check Validation (33 expert-verified Zenodo records)
-| Metric | Value |
 |--------|-------|
-| Accuracy | 30/33 (90.9%) |
-| EYE_IMAGING F1 | 0.824 |
-| EYE_IMAGING Precision | 0.875 |
-| EYE_IMAGING Recall | 0.778 |
 ## Usage
 ```python
-from sentence_transformers import SentenceTransformer
-import joblib
-model = SentenceTransformer("fairdataihub/envision-eye-imaging-classifier")
-head = joblib.load("model_head.pkl")
-embeddings = model.encode(["Retinal OCT dataset for diabetic retinopathy"])
-predictions = head.predict(embeddings)
-probabilities = head.predict_proba(embeddings)
-labels = ["NEGATIVE", "EYE_IMAGING"]
-print(f"Label: {labels[predictions[0]]}")
-print(f"Confidence: {max(probabilities[0]):.3f}")
 ```
-## Data Pipeline
-- Harvests metadata from multiple scientific data repositories (Zenodo, Figshare, DataCite, Kaggle, Dryad, NEI)
-- Classifies records as eye imaging or not
-- Identified eye imaging datasets are registered on the [Envision Portal](https://envisionportal.org)
 ## Citation
-If you use this model in your work, please cite:
-- FAIR Data Innovations Hub ([fairdataihub.org](https://fairdataihub.org))
-- EyeACT Study ([eyeactstudy.org](https://eyeactstudy.org))
-- Tunstall et al. (2022). "Efficient Few-Shot Learning Without Prompts" (SetFit)
-- `sentence-transformers/all-mpnet-base-v2`
 ## Contact
-- James O'Neill (joneill@calmi2.org)
-- Bhavesh Patel (bpatel@calmi2.org)

 - medical-imaging
 - fair-data
 - eyeact
+datasets:
+- fairdataihub/envision-eye-imaging-training-data
 ---
 # Envision Eye Imaging Classifier
+SetFit binary classifier for identifying eye imaging datasets from scientific metadata.
+**Developed by**: FAIR Data Innovations Hub in collaboration with the EyeACT Study
 ## Model Description
 Uses `sentence-transformers/all-mpnet-base-v2` as backbone with binary classification:
+- **EYE_IMAGING (1)**: Actual ophthalmic imaging datasets (fundus, OCT, OCTA, cornea)
+- **NEGATIVE (0)**: Everything else (software, non-imaging eye data, unrelated)
 ## Validation
+### Spot-check (33 expert-verified Zenodo records)
+| Metric | Score |
 |--------|-------|
+| Accuracy | 0.939 (31/33) |
+| Macro F1 | 0.923 |
+| EYE_IMAGING F1 | 0.889 (P=0.889, R=0.889) |
+| NEGATIVE F1 | 0.958 (P=0.958, R=0.958) |
+### Held-out test set (20% stratified split)
+| Metric | Score |
 |--------|-------|
+| Accuracy | 0.940 |
+| Macro F1 | 0.936 |
+| EYE_IMAGING F1 | 0.922 (P=0.887, R=0.959) |
+| NEGATIVE F1 | 0.951 (P=0.975, R=0.929) |
+### Multi-repository spot-check (6,833 records across 6 sources)
+| Source | Records | EYE_IMAGING F1 | Precision | Recall |
+|--------|---------|----------------|-----------|--------|
+| Zenodo | 514 | 0.677 | 0.537 | 0.917 |
+| DataCite | 1,836 | 0.866 | 0.858 | 0.874 |
+| Figshare | 2,000 | 0.833 | 0.788 | 0.884 |
+| Kaggle | 732 | 0.739 | 0.939 | 0.610 |
+| Dryad | 89 | 0.764 | 0.750 | 0.778 |
+| NEI | 1,662 | 0.814 | 0.931 | 0.724 |
+| **Overall** | **6,833** | **0.822** | **0.845** | **0.800** |
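
The F1 column in the table above can be cross-checked against its precision and recall columns, since F1 is their harmonic mean. The following is a minimal sketch that copies the seven rows from the table and recomputes F1. The names `REPORTED`, `f1_score`, and `check_table` and the tolerance of 0.001 are illustrative assumptions, not part of the model repository. Because the inputs are already rounded to three decimals, small differences in the last digit are expected.

```python
# (precision, recall, reported F1) per source, copied from the table above.
REPORTED = {
    "Zenodo": (0.537, 0.917, 0.677),
    "DataCite": (0.858, 0.874, 0.866),
    "Figshare": (0.788, 0.884, 0.833),
    "Kaggle": (0.939, 0.610, 0.739),
    "Dryad": (0.750, 0.778, 0.764),
    "NEI": (0.931, 0.724, 0.814),
    "Overall": (0.845, 0.800, 0.822),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def check_table(tolerance: float = 0.001) -> dict:
    """Map each source to True if its reported F1 matches the recomputed one.

    The tolerance is an assumed allowance for three-decimal rounding.
    """
    return {
        source: abs(f1_score(p, r) - reported) <= tolerance
        for source, (p, r, reported) in REPORTED.items()
    }


if __name__ == "__main__":
    for source, ok in check_table().items():
        print(f"{source}: {'consistent' if ok else 'MISMATCH'}")
```

Under these assumptions every row is internally consistent. The low Zenodo F1 comes from precision (0.537), while the Kaggle and NEI scores are limited by recall.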
+## Training
+- **Base model**: sentence-transformers/all-mpnet-base-v2 (768-dimensional)
+- **Training data**: 994 examples (365 EYE_IMAGING, 629 NEGATIVE) from multi-repository sources (Zenodo, Figshare, Dryad, Kaggle, NEI)
+- **Dataset**: [fairdataihub/envision-eye-imaging-training-data](https://huggingface.co/datasets/fairdataihub/envision-eye-imaging-training-data)
+- **Epochs**: 10 (early stopping, patience=3)
+- **Batch size**: 16
+- **Learning rate**: 2e-5 (default)
+- **Scheduler**: linear with 10% warmup
 ## Usage
 ```python
+from setfit import SetFitModel
+model = SetFitModel.from_pretrained("fairdataihub/envision-eye-imaging-classifier")
+predictions = model.predict(["Retinal OCT dataset for diabetic retinopathy"])
 ```
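
The new usage snippet returns raw predictions, whereas the removed version printed a label name. A minimal sketch of restoring that mapping follows, using the class ids given in the Model Description (1 = EYE_IMAGING, 0 = NEGATIVE). The helpers `load_model` and `classify` are hypothetical names introduced here. The sketch assumes `model.predict` returns integer class ids, and passes string labels through unchanged in case the model was saved with label names.

```python
from typing import Any, List, Sequence

# Class ids as listed in the Model Description section.
LABELS = {0: "NEGATIVE", 1: "EYE_IMAGING"}


def load_model(repo_id: str = "fairdataihub/envision-eye-imaging-classifier") -> Any:
    """Load the classifier from the Hub (requires the `setfit` package)."""
    # Imported lazily so the rest of this sketch works without setfit installed.
    from setfit import SetFitModel

    return SetFitModel.from_pretrained(repo_id)


def classify(model: Any, texts: Sequence[str]) -> List[str]:
    """Return one label name per input text.

    Assumption: predictions are integer class ids (0 or 1). If the model
    already returns string labels, they are passed through as-is.
    """
    predictions = model.predict(list(texts))
    return [p if isinstance(p, str) else LABELS[int(p)] for p in predictions]
```

For example, `classify(load_model(), ["Retinal OCT dataset for diabetic retinopathy"])` would return a one-element list containing a label name. Loading the model downloads the weights from the Hub on first use.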
 ## Citation
+- EyeACT Envision project
+- FAIR Data Innovations Hub (fairdataihub.org)
+- sentence-transformers/all-mpnet-base-v2
 ## Contact
+EyeACT team: [eyeactstudy.org](https://eyeactstudy.org)

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9a7e50e685a80b5b4f937fb456cdc60099022e6e79b66b2f7a5af5da9bff4324
 size 437967672

 version https://git-lfs.github.com/spec/v1
+oid sha256:0d8c84e85ce3f286c1abdecb4eef2bb7fc6e879dc4ca51db0c5d177a6d2f3f4f
 size 437967672

model_head.pkl CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8e1bbbc718b53d3f1cb3f8649931df1a2aae6db926c8e2c0280fa39e7e854c94
 size 7007

 version https://git-lfs.github.com/spec/v1
+oid sha256:0949a5a2a839e79e6d2704fdf88ad6c86efc313060871789482f92abbfd8a7ee
 size 7007