Model Card for Indus SDE Sentence Transformer Stage 2
This model was first fine-tuned on a sentence embedding task, starting from the previous model (nasa-impact/indus-sde-st-v0.1), using the stage 2 dataset (a scientific dataset) for one epoch. It was then fine-tuned for 2 more epochs on the NASA SDE and NASA ADS corpora.
The initial stage of Indus-SDE-ST training focused on adapting the base Indus-SDE model to comprehend general-domain semantics and sentence-pair relationships. The stage 2 dataset was designed for scientific domain adaptation. The primary objective was to establish a broad linguistic foundation before specializing in scientific content (for subsequent stages). This was achieved using a diverse corpus comprising pairs from S2ORC, arXiv, PubMed, NASA ADS and NASA SDE, trained with a contrastive learning objective: the Multiple Negatives Ranking loss.
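To make the contrastive objective concrete, here is a minimal sketch of a Multiple Negatives Ranking loss in plain Python. It is not the training code used for this model. It assumes cosine similarity, a similarity scale of 20.0, and in-batch negatives only, and the names `cosine` and `mnr_loss` and the toy vectors are hypothetical. Each anchor is scored against every positive in the batch, and the loss is the cross-entropy of picking its own positive.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    Every other positive in the batch serves as a negative for anchor i.
    The scale value is an assumption for this sketch.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive sits near its anchor
shuffled = [aligned[1], aligned[0]]  # pairs deliberately mismatched

loss_aligned = mnr_loss(anchors, aligned)
loss_shuffled = mnr_loss(anchors, shuffled)
print(loss_aligned < loss_shuffled)
```

Because negatives come from the other pairs in the batch, larger batches provide more negatives per anchor, which is one reason this loss suits large pair datasets such as title-abstract or question-answer pairs.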
Dataset table
| Dataset Name | Data Points | Type | Link |
|---|---|---|---|
| S2ORC_title_abstract | ~41.8M | Title-Body | Link |
| S2ORC_abstract_citation | ~39.6M | Body-Body | Link |
| S2ORC_title_citation | ~51M | Title-Title | Link |
| arxiv_title_abstract | ~2.7M | Title-Body | Link |
| PubMed | ~24M | Title-Body | Link |
| specter | ~684K | Title-Body | Link |
| nasa_ads | ~2.66M | Title-Abstract | Link |
| SDE-syntisaized | 177486 | question-answer | Link |
| SDE-syntisaized | 194382 | search_terms-document | |
| CMR-natural | 53974 | Title-Description | |
| PDS-natural | 9832 | Title-Description | |
| CMR-syntisaized | 796097 | search_terms-document | |
| PDS-syntisaized | 52777 | search_terms-document | |
| Total | ~162.4M | | |
Evaluation
We evaluate the model on a variety of benchmark datasets, in particular the NASA SDE IR Benchmark, Nano BEIR, and the NASA SMD IR Benchmark.
We observe that the model from this stage performs better overall than the original INDUS Sentence Transformer and the ModernBERT-based sentence transformer.
The model uploaded to the Hugging Face Hub is indus-sde-st-v0.2_vocal-river-16.
models = {
"modernbert-embed-base": "ModernBERT based embedding model",
"nasa-smd-ibm-st-v2": "Original Indus Sentence Transformer",
"indus-sde-st-v0.1": "Indus-SDE Stage 1 Sentence Transformer",
"indus-sde-st-v0.2_whole-moon-14": "Indus-SDE Stage 2 Sentence Transformer (Trained on full dataset and faster learning rate)",
"indus-sde-st-v0.2_atomic-plasma-15": "Indus-SDE Stage 2 Sentence Transformer (Trained just on the sde/ads dataset)",
"indus-sde-st-v0.2_vocal-river-16": "Indus-SDE Stage 2 Sentence Transformer (Trained on top of model 14 with NASA SDE/ADS for 2 epochs)",
}
NASA SDE IR Benchmark
Nano BEIR
NASA SMD IR Benchmark


