Model Card for Indus SDE Sentence Transformer Stage 2

This model was obtained by further fine-tuning the previous model (nasa-impact/indus-sde-st-v0.1) on the sentence embedding task, using the stage 2 dataset (scientific dataset) for one epoch. The resulting model was then fine-tuned for 2 more epochs on the NASA SDE and NASA ADS corpora.

The initial stage of Indus-SDE-ST training focused on adapting the base Indus-SDE model to comprehend general-domain semantics and sentence-pair relationships. The primary objective of that stage was to establish a broad linguistic foundation before specializing in scientific content in subsequent stages. The stage 2 dataset was designed for scientific domain adaptation. This was achieved using a diverse corpus comprising pairs from S2ORC, arXiv, PubMed, NASA ADS, and NASA SDE, trained with a contrastive learning objective: Multiple Negatives Ranking loss.
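To make the Multiple Negatives Ranking objective concrete, here is a minimal pure-Python sketch. It is illustrative only and is not the training code used for this model. The function names (`cosine`, `mnr_loss`), the toy 2-dimensional embeddings, and the `scale=20.0` value are assumptions chosen for the example. For each anchor, its paired text is treated as the positive and every other positive in the batch acts as a negative; the loss is the cross-entropy over the scaled cosine similarities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    All other positives in the batch serve as in-batch negatives.
    `scale` is an assumed temperature-like multiplier on the similarities.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the true pair.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]  # deliberately mismatched pairs

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

Correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized heavily, which is what pushes paired texts (for example a title and its abstract) together in embedding space.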

Dataset table

Dataset Name	Data Points	Type	Link
S2ORC_title_abstract	~41.8M	Title-Body	Link
S2ORC_abstract_citation	~39.6M	Body-Body	Link
S2ORC_title_citation	~51M	Title-Title	Link
arxiv_title_abstract	~2.7M	Title-Body	Link
PubMed	~24M	Title-Body	Link
specter	~684K	Title-Body	Link
nasa_ads	~2.66M	Title-Abstract	Link
SDE-synthesized	177486	question-answer	Link
SDE-synthesized	194382	search_terms-document
CMR-natural	53974	Title-Description
PDS-natural	9832	Title-Description
CMR-synthesized	796097	search_terms-document
PDS-synthesized	52777	search_terms-document
Total	~162.4M

Evaluation

We evaluate the model on a variety of benchmark datasets, especially the following:

We observe that the model from this stage performs better overall than both the original INDUS Sentence Transformer and the ModernBERT-based sentence transformer.

The model uploaded to Hugging Face is indus-sde-st-v0.2_vocal-river-16.

models = {
    "modernbert-embed-base": "ModernBERT based embedding model",
    "nasa-smd-ibm-st-v2": "Original Indus Sentence Transformer",
    "indus-sde-st-v0.1": "Indus-SDE Stage 1 Sentence Transformer",
    "indus-sde-st-v0.2_whole-moon-14": "Indus-SDE Stage 2 Sentence Transformer (Trained on full dataset and faster learning rate)",
    "indus-sde-st-v0.2_atomic-plasma-15": "Indus-SDE Stage 2 Sentence Transformer (Trained just on the sde/ads dataset)",
    "indus-sde-st-v0.2_vocal-river-16": "Indus-SDE Stage 2 Sentence Transformer (Trained on top of model 14 with NASA SDE/ADS for 2 epochs)",
}