jimnoneill committed on
Commit 8b01cd7 · verified · 1 Parent(s): d01c30c

Update model card with confidence distribution

Files changed (1):
  1. README.md +44 -131
README.md CHANGED
@@ -1,160 +1,73 @@
  ---
  license: mit
- language:
- - en
- library_name: setfit
  tags:
- - setfit
- - sentence-transformers
  - text-classification
- - ophthalmology
- - medical
  - eye-imaging
- - dataset-discovery
- pipeline_tag: text-classification
- datasets:
- - custom
- metrics:
- - accuracy
- base_model: Alibaba-NLP/gte-large-en-v1.5
- model-index:
- - name: envision-eye-imaging-classifier
-   results: []
  ---

- # ENVISION Eye Imaging Dataset Classifier

- A SetFit few-shot classifier for detecting ophthalmic imaging datasets from repository metadata (titles, descriptions, keywords).

- ## Model Description

- This model classifies scientific dataset metadata into four categories:

- | Label | Description | Examples |
- |-------|-------------|----------|
- | **EYE_IMAGING** | Actual ophthalmic imaging datasets | Fundus photos, OCT scans, OCTA, corneal imaging |
- | **EYE_SOFTWARE** | Code/tools for eye imaging (no data) | GitHub repos, model weights, toolboxes |
- | **EDGE_CASE** | Eye research but not imaging datasets | Review papers, clinical trials, animal studies |
- | **NEGATIVE** | Not eye-related | Other medical imaging, taxonomy papers, etc. |

- ## Intended Use

- - **Primary Use**: Automated discovery of eye imaging datasets from scientific repositories (Zenodo, Figshare, etc.)
- - **Input**: Dataset metadata text (title + description + keywords)
- - **Output**: Classification label and confidence scores

- ## Training Data

- | Class | Examples | Source |
- |-------|----------|--------|
- | EYE_IMAGING | 99 | Known benchmarks (DRIVE, IDRiD, REFUGE, OLIVES, etc.) |
- | EYE_SOFTWARE | 30 | GitHub eye imaging repositories |
- | EDGE_CASE | 90 | Eye research papers, reviews, adjacent imaging |
- | NEGATIVE | 233 | False positive patterns (cardiovascular OCT, taxonomy papers, industrial imaging) |

- **Total: 452 training examples**

- ## Model Architecture

- - **Base Model**: [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5)
- - **Framework**: SetFit (few-shot learning)
- - **Embedding Dimension**: 1024
- - **Max Sequence Length**: 8192 tokens
- - **Training**: 2 epochs, batch size 16
  ## Usage

  ```python
- from setfit import SetFitModel
-
- # Load model
- model = SetFitModel.from_pretrained("EyeACT/envision-eye-imaging-classifier")
-
- # Classify dataset metadata
- texts = [
-     "Fundus photography dataset for diabetic retinopathy detection with 10,000 images",
-     "Deep learning code for retinal vessel segmentation PyTorch implementation",
-     "Climate change impact on coral reef ecosystems"
- ]
-
- predictions = model.predict(texts)
- probabilities = model.predict_proba(texts)
-
- for text, pred, probs in zip(texts, predictions, probabilities):
-     print(f"Text: {text[:50]}...")
-     print(f"Prediction: {pred}")
-     print(f"Probabilities: {probs}")
-     print()
- ```

- ## Performance

- Initial validation on Zenodo metadata (30,439 records):

- | Metric | Value |
- |--------|-------|
- | Records with data files | 9,881 |
- | EYE_IMAGING detected | ~380 |
- | EYE_SOFTWARE detected | ~70 |
- | EDGE_CASE | ~2,500 |
- | NEGATIVE | ~6,900 |

- **Note**: Results require manual validation. The model is optimized for high precision on the EYE_IMAGING class.

- ## Limitations

- 1. **Training Data**: Based on synthetic/curated examples, not manually labeled records
- 2. **Domain Specificity**: Optimized for ophthalmic imaging; may not generalize to other medical domains
- 3. **False Positives**: Some categories (e.g., papers with "FIGURES" in title) may still be misclassified
- 4. **Language**: English only

- ## Ethical Considerations

- - This model is designed for dataset discovery, not clinical decision-making
- - Results should be manually validated before use in research
- - The model may reflect biases in the training data

- ## Citation

- If you use this model, please cite:

- ```bibtex
- @misc{envision2026,
-   title={ENVISION: Eye Imaging Dataset Discovery},
-   author={FAIR Data Innovations Hub and EyeACT Study Team},
-   year={2026},
-   publisher={Hugging Face},
-   url={https://huggingface.co/EyeACT/envision-eye-imaging-classifier}
- }
  ```
 
- ## About

- ### FAIR Data Innovations Hub

- The [FAIR Data Innovations Hub](https://fairdataihub.org/) develops open-source tools and practices to make biomedical research data Findable, Accessible, Interoperable, and Reusable (FAIR). We are part of the California Medical Innovations Institute (CALMI2).

- ### EyeACT Study

- The [Eye ACT study](https://eyeactstudy.org/) investigates the connection between eye health and brain function, pioneering research to uncover early indicators of Alzheimer's disease through ophthalmic imaging. The study is a collaboration between:

- - **California Medical Innovations Institute (CALMI2)**
- - **Kaiser Permanente Washington**
- - **University of Washington**

- ### ENVISION Project

- ENVISION (Eye Imaging Dataset Discovery) is a systematic effort to catalog and classify publicly available ophthalmic imaging datasets. This supports the broader goal of making eye imaging research data FAIR and accessible to the scientific community.

- ## Links

- - **GitHub**: [EyeACT/envision-discovery](https://github.com/EyeACT/envision-discovery)
- - **EyeACT Study**: [eyeactstudy.org](https://eyeactstudy.org/)
- - **FAIR Data Innovations Hub**: [fairdataihub.org](https://fairdataihub.org/)
- - **CALMI2**: [calmi2.org](https://calmi2.org/)

- ## License

- MIT License
  ---
  license: mit
  tags:
  - text-classification
+ - setfit
+ - sentence-embedding
  - eye-imaging
+ - ophthalmology
+ - medical-imaging
+ - fair-data
+ - eyeact
  ---

+ # Envision Eye Imaging Classifier

+ SetFit few-shot classifier for identifying eye imaging datasets from scientific metadata.

+ **Developed by**: FAIR Data Innovations Hub in collaboration with the EyeACT Study

+ ## Model Description

+ Uses `Alibaba-NLP/gte-large-en-v1.5` as backbone with 4-class classification:

+ - **EYE_IMAGING (3)**: Actual ophthalmic imaging datasets (fundus, OCT, OCTA, cornea)
+ - **EYE_SOFTWARE (2)**: Code, tools, models for eye imaging
+ - **EDGE_CASE (1)**: Eye research papers, reviews, non-imaging data
+ - **NEGATIVE (0)**: Not eye-related

+ ## Results on Zenodo

+ | Class | Count |
+ |-------|-------|
+ | EYE_IMAGING | 524 |
+ | EYE_SOFTWARE | 1,150 |
+ | EDGE_CASE | 99 |
+ | NEGATIVE | 7,675 |

+ ### Confidence Distribution (EYE_IMAGING)

+ | Confidence | Count |
+ |------------|-------|
+ | High (≥0.95) | 485 |
+ | Medium (0.80-0.95) | 20 |
+ | Lower (<0.80) | 19 |
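The confidence buckets in the table above map directly onto probability thresholds. A minimal post-processing sketch, assuming only the cut-offs stated in the table (the function name `confidence_bucket` is ours, not part of the model):

```python
def confidence_bucket(score: float) -> str:
    """Bucket an EYE_IMAGING class probability using the card's thresholds."""
    if score >= 0.95:
        return "High"
    if score >= 0.80:
        return "Medium"
    return "Lower"

# Example with three hypothetical classifier probabilities
print([confidence_bucket(s) for s in (0.99, 0.85, 0.42)])
# → ['High', 'Medium', 'Lower']
```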
 
+ ## Training

+ - **Examples**: 452 (99 positive, 30 software, 90 edge case, 233 negative)
+ - **Epochs**: 2
+ - **Batch Size**: 16

  ## Usage

  ```python
+ from sentence_transformers import SentenceTransformer
+ import joblib
+
+ model = SentenceTransformer("jimnoneill/envision-eye-imaging-classifier", trust_remote_code=True)
+ head = joblib.load("model_head.pkl")
+
+ embeddings = model.encode(["Retinal OCT dataset for diabetic retinopathy"])
+ predictions = head.predict(embeddings)
  ```
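The usage snippet returns integer classes. A minimal sketch for turning them into label names, assuming the 0-3 indices listed under Model Description (`to_labels` is a hypothetical helper, not part of the model):

```python
# Index-to-name mapping taken from the card's Model Description section
LABELS = {0: "NEGATIVE", 1: "EDGE_CASE", 2: "EYE_SOFTWARE", 3: "EYE_IMAGING"}

def to_labels(predictions):
    """Map integer class predictions to their label names."""
    return [LABELS[int(p)] for p in predictions]

print(to_labels([3, 0, 2]))
# → ['EYE_IMAGING', 'NEGATIVE', 'EYE_SOFTWARE']
```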

+ ## Citation

+ - EyeACT Envision project
+ - FAIR Data Innovations Hub (fairdataihub.org)
+ - Alibaba-NLP/gte-large-en-v1.5

+ ## Contact

+ EyeACT team: [eyeactstudy.org](https://eyeactstudy.org)