jimnoneill committed on
Commit e28cb35 · verified · 1 Parent(s): 3ae6679

Update ENVISION eye imaging classifier v2.0

Files changed (3)
  1. README.md +47 -61
  2. model.safetensors +1 -1
  3. model_head.pkl +1 -1
README.md CHANGED
@@ -9,95 +9,81 @@ tags:
  - medical-imaging
  - fair-data
  - eyeact
- - binary-classification
  ---

  # Envision Eye Imaging Classifier

- SetFit few-shot classifier for identifying eye imaging datasets from scientific metadata.

- **Developed by**: FAIR Data Innovations Hub, California Medical Innovations Institute (CalMI²), in collaboration with the EyeACT Study

  ## Model Description

  Uses `sentence-transformers/all-mpnet-base-v2` as backbone with binary classification:

- - **EYE_IMAGING (1)**: Actual ophthalmic imaging datasets (fundus photography, OCT, OCTA, corneal imaging, slit-lamp, anterior segment)
- - **NEGATIVE (0)**: Everything else (non-eye data, software/code, eye-adjacent non-imaging, non-eye medical imaging)
-
- ## Results on Zenodo
-
- Tested on 515 Zenodo datasets (filtered to `resource_type=dataset` only):
-
- | Class | Count |
- |-------|-------|
- | EYE_IMAGING | 60 |
- | NEGATIVE | 455 |
-
- ## Training
-
- - **Base Model**: `sentence-transformers/all-mpnet-base-v2` (768-dimensional embeddings)
- - **Training Examples**: 891 (262 EYE_IMAGING, 629 NEGATIVE)
- - **Positive Data Sources**: Multi-repository (Zenodo, Figshare, Dryad, Kaggle, NEI) — LLM-verified
- - **Negative Data Sources**: Real dataset records from discovery pipelines + targeted hard negatives (non-eye medical imaging, non-eye OCT, eye-adjacent non-imaging)
- - **Epochs**: 2
- - **Batch Size**: 16

  ## Validation

- ### Held-out Test Set (20% stratified split)

- | Metric | Value |
  |--------|-------|
- | Accuracy | 0.961 |
- | Macro F1 | 0.954 |
- | EYE_IMAGING F1 | 0.936 |
- | EYE_IMAGING Precision | 0.911 |
- | EYE_IMAGING Recall | 0.962 |

- ### Spot-Check Validation (33 expert-verified Zenodo records)

- | Metric | Value |
  |--------|-------|
- | Accuracy | 30/33 (90.9%) |
- | EYE_IMAGING F1 | 0.824 |
- | EYE_IMAGING Precision | 0.875 |
- | EYE_IMAGING Recall | 0.778 |

  ## Usage

  ```python
- from sentence_transformers import SentenceTransformer
- import joblib
-
- model = SentenceTransformer("fairdataihub/envision-eye-imaging-classifier")
- head = joblib.load("model_head.pkl")

- embeddings = model.encode(["Retinal OCT dataset for diabetic retinopathy"])
- predictions = head.predict(embeddings)
- probabilities = head.predict_proba(embeddings)

- labels = ["NEGATIVE", "EYE_IMAGING"]
- print(f"Label: {labels[predictions[0]]}")
- print(f"Confidence: {max(probabilities[0]):.3f}")
  ```

- ## Data Pipeline
-
- - Harvests metadata from multiple scientific data repositories (Zenodo, Figshare, DataCite, Kaggle, Dryad, NEI)
- - Classifies records as eye imaging or not
- - Identified eye imaging datasets are registered on the [Envision Portal](https://envisionportal.org)
-
  ## Citation

- If you use this model in your work, please cite:
-
- - FAIR Data Innovations Hub ([fairdataihub.org](https://fairdataihub.org))
- - EyeACT Study ([eyeactstudy.org](https://eyeactstudy.org))
- - Tunstall et al. (2022). "Efficient Few-Shot Learning Without Prompts" (SetFit)
- - `sentence-transformers/all-mpnet-base-v2`

  ## Contact

- - James O'Neill (joneill@calmi2.org)
- - Bhavesh Patel (bpatel@calmi2.org)
 
  - medical-imaging
  - fair-data
  - eyeact
+ datasets:
+ - fairdataihub/envision-eye-imaging-training-data
  ---

  # Envision Eye Imaging Classifier

+ SetFit binary classifier for identifying eye imaging datasets from scientific metadata.

+ **Developed by**: FAIR Data Innovations Hub in collaboration with the EyeACT Study

  ## Model Description

  Uses `sentence-transformers/all-mpnet-base-v2` as backbone with binary classification:

+ - **EYE_IMAGING (1)**: Actual ophthalmic imaging datasets (fundus, OCT, OCTA, cornea)
+ - **NEGATIVE (0)**: Everything else (software, non-imaging eye data, unrelated)

  ## Validation

+ ### Spot-check (33 expert-verified Zenodo records)

+ | Metric | Score |
  |--------|-------|
+ | Accuracy | 0.939 (31/33) |
+ | Macro F1 | 0.923 |
+ | EYE_IMAGING F1 | 0.889 (P=0.889, R=0.889) |
+ | NEGATIVE F1 | 0.958 (P=0.958, R=0.958) |

+ ### Held-out test set (20% stratified split)

+ | Metric | Score |
  |--------|-------|
+ | Accuracy | 0.940 |
+ | Macro F1 | 0.936 |
+ | EYE_IMAGING F1 | 0.922 (P=0.887, R=0.959) |
+ | NEGATIVE F1 | 0.951 (P=0.975, R=0.929) |
+
+ ### Multi-repository spot-check (6,833 records across 6 sources)
+
+ | Source | Records | EYE_IMAGING F1 | Precision | Recall |
+ |--------|---------|----------------|-----------|--------|
+ | Zenodo | 514 | 0.677 | 0.537 | 0.917 |
+ | DataCite | 1,836 | 0.866 | 0.858 | 0.874 |
+ | Figshare | 2,000 | 0.833 | 0.788 | 0.884 |
+ | Kaggle | 732 | 0.739 | 0.939 | 0.610 |
+ | Dryad | 89 | 0.764 | 0.750 | 0.778 |
+ | NEI | 1,662 | 0.814 | 0.931 | 0.724 |
+ | **Overall** | **6,833** | **0.822** | **0.845** | **0.800** |
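
Review note: the F1 column in the multi-repository table can be cross-checked against its precision and recall columns using F1 = 2PR / (P + R). The sketch below is a minimal check, assuming the rounded table values are the only inputs available; `f1_from_pr` and `reported` are illustrative names and not part of the repository. Because the inputs are already rounded to three decimals, a recomputed value may differ from the reported one in the last digit.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the multi-repository table.
reported = {
    "Zenodo": (0.537, 0.917, 0.677),
    "DataCite": (0.858, 0.874, 0.866),
    "Figshare": (0.788, 0.884, 0.833),
    "Kaggle": (0.939, 0.610, 0.739),
    "Dryad": (0.750, 0.778, 0.764),
    "NEI": (0.931, 0.724, 0.814),
    "Overall": (0.845, 0.800, 0.822),
}

for source, (p, r, f1) in reported.items():
    recomputed = f1_from_pr(p, r)
    # Differences below about 0.001 are explained by rounding of P and R.
    print(f"{source:9s} reported={f1:.3f} recomputed={recomputed:.3f} "
          f"diff={abs(recomputed - f1):.4f}")
```
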
+
+ ## Training
+
+ - **Base model**: sentence-transformers/all-mpnet-base-v2 (768-dimensional)
+ - **Training data**: 994 examples (365 EYE_IMAGING, 629 NEGATIVE) from multi-repository sources (Zenodo, Figshare, Dryad, Kaggle, NEI)
+ - **Dataset**: [fairdataihub/envision-eye-imaging-training-data](https://huggingface.co/datasets/fairdataihub/envision-eye-imaging-training-data)
+ - **Epochs**: 10 (early stopping, patience=3)
+ - **Batch size**: 16
+ - **Learning rate**: 2e-5 (default)
+ - **Scheduler**: linear with 10% warmup
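
Review note: "linear with 10% warmup" is read here as the common schedule in which the learning rate ramps linearly from 0 to the base rate over the first 10% of optimizer steps and then decays linearly to 0. The sketch below illustrates that shape only. It assumes this interpretation is the one used in training, and `linear_warmup_lr` and the step count of 1,000 are hypothetical, not taken from the training script.

```python
def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 2e-5,
                     warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall from base_lr at the end of warmup to 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 optimizer steps (illustrative only).
for step in (0, 50, 100, 550, 1000):
    print(step, f"{linear_warmup_lr(step, 1000):.2e}")
```
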

  ## Usage

  ```python
+ from setfit import SetFitModel

+ model = SetFitModel.from_pretrained("fairdataihub/envision-eye-imaging-classifier")

+ predictions = model.predict(["Retinal OCT dataset for diabetic retinopathy"])
  ```
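
Review note: the new usage snippet no longer maps predictions back to class names, as the removed snippet did with its `labels` list. The sketch below is a minimal way to restore that step. It assumes `predict` yields integer class ids 0 and 1 (or values convertible with `int`, such as tensor or NumPy scalars), which may not hold if the model was saved with string label names. `ID2LABEL` and `to_label_names` are illustrative helpers, not part of the SetFit API.

```python
# Class ids as defined under "Model Description" in the README.
ID2LABEL = {0: "NEGATIVE", 1: "EYE_IMAGING"}


def to_label_names(predictions):
    """Map an iterable of class ids to their README label names."""
    return [ID2LABEL[int(p)] for p in predictions]


# Stand-in for the output of model.predict([...]); no model is loaded here.
example_predictions = [1, 0]
print(to_label_names(example_predictions))  # ['EYE_IMAGING', 'NEGATIVE']
```
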

  ## Citation

+ - EyeACT Envision project
+ - FAIR Data Innovations Hub (fairdataihub.org)
+ - sentence-transformers/all-mpnet-base-v2

  ## Contact

+ EyeACT team: [eyeactstudy.org](https://eyeactstudy.org)
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9a7e50e685a80b5b4f937fb456cdc60099022e6e79b66b2f7a5af5da9bff4324
  size 437967672

  version https://git-lfs.github.com/spec/v1
+ oid sha256:0d8c84e85ce3f286c1abdecb4eef2bb7fc6e879dc4ca51db0c5d177a6d2f3f4f
  size 437967672
model_head.pkl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8e1bbbc718b53d3f1cb3f8649931df1a2aae6db926c8e2c0280fa39e7e854c94
  size 7007

  version https://git-lfs.github.com/spec/v1
+ oid sha256:0949a5a2a839e79e6d2704fdf88ad6c86efc313060871789482f92abbfd8a7ee
  size 7007
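
Review note: both binary files are stored as Git LFS pointers, so each diff shows only the `oid sha256:` line changing while `size` stays the same. The sketch below is one way to check a downloaded file against such a pointer. It assumes the pointer text follows the three-line `version` / `oid` / `size` layout shown above; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any Git LFS tooling.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer_text: str) -> bool:
    """Return True if the file at `path` has the pointer's size and digest."""
    pointer = parse_lfs_pointer(pointer_text)
    data_path = Path(path)
    if data_path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with data_path.open("rb") as handle:
        # Read in 1 MiB chunks so large weight files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# New pointer for model_head.pkl, copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:0949a5a2a839e79e6d2704fdf88ad6c86efc313060871789482f92abbfd8a7ee
size 7007
"""
print(parse_lfs_pointer(POINTER)["size"])  # 7007
```

After downloading the file, `matches_pointer("model_head.pkl", POINTER)` would report whether the local copy corresponds to this commit.
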