Model Card for Indus SDE Sentence Transformer Stage 2

The model was first fine-tuned further on the sentence embedding task, starting from the previous model (nasa-impact/indus-sde-st-v0.1), using the stage 2 dataset (scientific dataset) for one epoch. It was then fine-tuned for 2 more epochs on the NASA SDE and NASA ADS corpora.

The initial stage of Indus-SDE-ST training focused on adapting the base Indus-SDE model to comprehend general-domain semantics and sentence-pair relationships. The stage 2 dataset was designed for scientific domain adaptation. The primary objective was to establish a broad linguistic foundation before specializing in scientific content (for subsequent stages). This was achieved using a diverse corpus comprising pairs from S2ORC, arXiv, PubMed, NASA ADS and NASA SDE, trained with a contrastive learning objective: Multiple Negatives Ranking loss.
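To make the objective concrete, here is a minimal pure-Python sketch of a Multiple Negatives Ranking loss on toy vectors. It is an illustration, not the training code used for this model: the cosine similarity, the scale of 20.0, and the helper names `cosine` and `mnr_loss` are assumptions chosen to mirror a common formulation, in which each anchor's paired text is the positive and every other positive in the batch serves as a negative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    All other positives in the batch act as in-batch negatives.
    The scale value is an assumption, not taken from the model card.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D "embeddings": two anchors (e.g. titles) and their paired bodies.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each pair points the same way
swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
```

With well-aligned pairs the loss is close to zero, while mismatched pairs give a large loss, which is the signal that pulls paired texts together and pushes unrelated ones apart.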

Dataset table

| Dataset Name | Data Points | Type | Link |
| --- | --- | --- | --- |
| S2ORC_title_abstract | ~41.8M | Title-Body | Link |
| S2ORC_abstract_citation | ~39.6M | Body-Body | Link |
| S2ORC_title_citation | ~51M | Title-Title | Link |
| arxiv_title_abstract | ~2.7M | Title-Body | Link |
| PubMed | ~24M | Title-Body | Link |
| specter | ~684K | Title-Body | Link |
| nasa_ads | ~2.66M | Title-Abstract | Link |
| SDE-syntisaized | 177486 | question-answer | Link |
| SDE-syntisaized | 194382 | search_terms-document | |
| CMR-natural | 53974 | Title-Description | |
| PDS-natural | 9832 | Title-Description | |
| CMR-syntisaized | 796097 | search_terms-document | |
| PDS-syntisaized | 52777 | search_terms-document | |
| Total | ~162.4M | | |

Evaluation

We evaluate the model on a variety of benchmark datasets, especially the following:

We observe that the model from this stage performs better overall than the original INDUS Sentence Transformer and the ModernBERT-based ST.

The model uploaded to the Hugging Face Hub is indus-sde-st-v0.2_vocal-river-16.
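A minimal usage sketch follows, assuming the checkpoint is published under the repository id nasa-impact/indus-sde-st-v0.2 and loads with the sentence-transformers library. The helpers `embed` and `rank_documents` are hypothetical names introduced for illustration, and the card does not say whether queries need any special prefix, so none is applied here.

```python
import math

MODEL_ID = "nasa-impact/indus-sde-st-v0.2"  # assumed Hub repository id

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

def embed(texts, model_id=MODEL_ID):
    """Encode texts into embedding vectors (hypothetical helper).

    The import is lazy so the ranking helper above works even when
    sentence-transformers is not installed.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_id)
    return model.encode(texts).tolist()

# Example (downloads the model, so it is left commented out):
# docs = ["Sea surface temperature data from MODIS.",
#         "Mars rover imaging archive."]
# vecs = embed(["ocean temperature datasets"] + docs)
# order = rank_documents(vecs[0], vecs[1:])
```

Keeping the ranking step separate from the encoding step makes it easy to check the retrieval logic on small hand-written vectors before loading the model.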

models = {
    "modernbert-embed-base": "ModernBERT based embedding model",
    "nasa-smd-ibm-st-v2": "Original Indus Sentence Transformer",
    "indus-sde-st-v0.1": "Indus-SDE Stage 1 Sentence Transformer",
    "indus-sde-st-v0.2_whole-moon-14": "Indus-SDE Stage 2 Sentence Transformer (Trained on full dataset and faster learning rate)",
    "indus-sde-st-v0.2_atomic-plasma-15": "Indus-SDE Stage 2 Sentence Transformer (Trained just on the sde/ads dataset)",
    "indus-sde-st-v0.2_vocal-river-16": "Indus-SDE Stage 2 Sentence Transformer (Trained on top of model 14 with nasa sde/ads for 2 epochs)",
}

NASA SDE IR Benchmark

[Figure: NASA SDE IR Benchmark results]

Nano BEIR

[Figure: Nano BEIR results]

NASA SMD IR Benchmark

[Figure: NASA SMD IR Benchmark results]
