drAbreu committed on
Commit
092ba0c
·
verified ·
1 Parent(s): 3036acf

Update model card with training details and usage instructions

Files changed (1): README.md (+176 -0)
README.md ADDED

---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- biomedical
- embeddings
- life-sciences
- scientific-text
- SODA-VEC
- EMBO
datasets:
- EMBO/soda-vec-data-full_pmc_title_abstract_paired
metrics:
- cosine-similarity
---

# Dot Only Model

## Model Description

A SODA-VEC embedding model trained with a dot-product loss only: it learns biomedical text representations from L2-normalized embeddings using a purely contrastive (dot-product) objective.

This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

**Key Features:**
- Trained on **26.5M biomedical title-abstract pairs** from PubMed Central
- Based on the **ModernBERT-base** architecture
- Optimized for **biomedical text similarity** and **semantic search**
- Produces **768-dimensional embeddings** with mean pooling

## Training Details

### Training Data

- **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
- **Size**: 26,473,900 training pairs
- **Source**: Complete PubMed Central baseline (July 2024)
- **Format**: Paired title-abstract examples optimized for contrastive learning

### Training Procedure

**Loss Function**: dot-product loss only, applied to L2-normalized embeddings; it combines a diagonal term (matched title-abstract pairs) with an off-diagonal term (in-batch negatives)

**Coefficients**: dot=1.0

**Base Model**: `answerdotai/ModernBERT-base`
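
The exact objective is defined in the training script; as a rough, hypothetical sketch (the function name and exact form are ours, assuming in-batch negatives from the paired data), a dot-only loss of this shape could look like:

```python
import torch
import torch.nn.functional as F

def dot_only_loss(title_emb: torch.Tensor, abstract_emb: torch.Tensor,
                  coeff_dot: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of a dot-product-only contrastive loss."""
    # L2-normalize so dot products are cosine similarities in [-1, 1]
    a = F.normalize(title_emb, p=2, dim=1)
    b = F.normalize(abstract_emb, p=2, dim=1)
    sim = a @ b.T                            # (batch, batch) similarity matrix
    diag = sim.diagonal()                    # matched title-abstract pairs
    off_diag = sim - torch.diag_embed(diag)  # mismatched (in-batch negative) pairs
    # Pull matched pairs toward similarity 1, push mismatched pairs toward 0
    loss = (1.0 - diag).mean() + off_diag.pow(2).sum() / (sim.numel() - len(diag))
    return coeff_dot * loss
```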

**Training Configuration:**
- **GPUs**: 4
- **Batch Size per GPU**: 16
- **Gradient Accumulation**: 4
- **Effective Batch Size**: 256 (16 per GPU × 4 GPUs × 4 accumulation steps)
- **Learning Rate**: 2e-05
- **Warmup Steps**: 100
- **Pooling Strategy**: mean
- **Epochs**: 1 (full dataset pass)
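
As a sketch of how this configuration might map onto the Sentence Transformers v3 trainer (an assumption on our part; the actual `scripts/soda-vec-train.py` may configure training differently):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="dot_only",
    per_device_train_batch_size=16,  # 16 per GPU
    gradient_accumulation_steps=4,   # x4 accumulation
    learning_rate=2e-5,
    warmup_steps=100,
    num_train_epochs=1,              # single full pass over the 26.5M pairs
)
# With 4 GPUs: 16 x 4 GPUs x 4 accumulation steps = 256 effective batch size
```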

**Training Command:**
```bash
python scripts/soda-vec-train.py --config dot_only --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
```

### Model Architecture

- **Base Architecture**: ModernBERT-base (22 layers, 768 hidden size)
- **Pooling**: Mean pooling over token embeddings
- **Output Dimension**: 768
- **Normalization**: L2-normalized embeddings

## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("EMBO/dot_only")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
96
+
97
+ ### Using Hugging Face Transformers
98
+
99
+ ```python
100
+ from transformers import AutoTokenizer, AutoModel
101
+ import torch
102
+ import torch.nn.functional as F
103
+
104
+ # Load model and tokenizer
105
+ tokenizer = AutoTokenizer.from_pretrained("EMBO/dot_only")
106
+ model = AutoModel.from_pretrained("EMBO/dot_only")
107
+
108
+ # Encode sentences
109
+ sentences = [
110
+ "CRISPR-Cas9 gene editing in human cells",
111
+ "Genome editing using CRISPR technology"
112
+ ]
113
+
114
+ inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
115
+ with torch.no_grad():
116
+ outputs = model(**inputs)
117
+
118
+ # Mean pooling
119
+ embeddings = outputs.last_hidden_state.mean(dim=1)
120
+
121
+ # Normalize (for VICReg models)
122
+ embeddings = F.normalize(embeddings, p=2, dim=1)
123
+
124
+ # Compute similarity
125
+ similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
126
+ print(f"Similarity: {similarity.item():.4f}")
127
+ ```

## Evaluation

The model has been evaluated on comprehensive biomedical benchmarks, including:

- **Journal-Category Classification**: Matching journals to BioRxiv subject categories
- **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs (see the sketch after this list)
- **Field-Specific Separability**: Distinguishing between different biological fields
- **Semantic Search**: Retrieval quality on biomedical text corpora
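
As a minimal, hypothetical illustration of the title-abstract similarity setup (the example pairs below are ours, not drawn from the benchmark):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("EMBO/dot_only")

title = "CRISPR-Cas9 gene editing in human cells"
matched_abstract = "We describe genome editing of human cell lines using CRISPR-Cas9."
unrelated_abstract = "We report seasonal migration patterns of Arctic shorebirds."

emb = model.encode([title, matched_abstract, unrelated_abstract])
# A well-trained model should rank the matched pair above the unrelated one
print("matched:  ", cos_sim(emb[0], emb[1]).item())
print("unrelated:", cos_sim(emb[0], emb[2]).item())
```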
138
+ For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/EMBO/soda-vec).
139
+

## Intended Use

This model is designed for:

- **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages (a retrieval sketch follows this list)
- **Scientific Text Similarity**: Computing similarity between biomedical texts
- **Information Retrieval**: Building search systems for scientific literature
- **Downstream Tasks**: As a base for fine-tuning on specific biomedical tasks
- **Research Applications**: Academic and research use in life sciences
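
A minimal retrieval sketch using the `semantic_search` utility (the corpus here is a toy example of our own):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("EMBO/dot_only")

corpus = [
    "CRISPR-Cas9 gene editing in human cells",
    "Mitochondrial dynamics in neurodegeneration",
    "Single-cell RNA sequencing of tumor microenvironments",
]
query = "genome editing technologies"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Returns the top-k corpus entries ranked by cosine similarity
hits = semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```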

## Limitations

- **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general-domain text
- **Language**: English only
- **Text Length**: Optimized for titles and abstracts; longer documents may require chunking (see the sketch after this list)
- **Bias**: Inherits biases from the training data (the PubMed Central corpus)
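
One simple chunking strategy (a hypothetical helper, not part of the model) is to embed overlapping word windows and average the results:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("EMBO/dot_only")

def embed_long_text(text: str, window: int = 200, overlap: int = 50) -> np.ndarray:
    """Hypothetical helper: embed overlapping word windows and average them."""
    words = text.split()
    step = window - overlap
    chunks = [" ".join(words[i:i + window])
              for i in range(0, max(len(words) - overlap, 1), step)]
    chunk_emb = model.encode(chunks)  # (n_chunks, 768)
    emb = chunk_emb.mean(axis=0)      # average into one document vector
    return emb / np.linalg.norm(emb)  # re-normalize after averaging
```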

## Citation

If you use this model, please cite:

```bibtex
@software{soda_vec,
  title  = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year   = {2024},
  url    = {https://github.com/EMBO/soda-vec}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/EMBO/soda-vec).

---

**Model Card Generated**: 2025-11-10