Model Card for Indus (modernbert-sde-v0.2)

This model is answerdotai/ModernBERT-base further pre-trained on the full Science Discovery Engine (SDE) website corpus.

Model Details

  • Base Model: answerdotai/ModernBERT-base
  • Tokenizer: answerdotai/ModernBERT-base
  • Parameters: 150M
  • Pretraining Strategy: Masked Language Modeling (MLM)
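Since the model keeps the ModernBERT-base architecture and MLM head, it can be queried with the standard `transformers` fill-mask pipeline. This is a minimal sketch assuming the published checkpoint id `nasa-impact/modernbert-sde-v0.2`; the example sentence is illustrative only.

```python
from transformers import pipeline

# Load the MLM checkpoint; requires transformers >= 4.48 for ModernBERT support.
pipe = pipeline("fill-mask", model="nasa-impact/modernbert-sde-v0.2")

# ModernBERT's tokenizer uses [MASK] as its mask token.
preds = pipe("Mars is the fourth [MASK] from the Sun.")
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```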

Training Data

  • Full Science Discovery Engine (SDE) Website Data

Training Procedure

  • transformers Version: 4.48.3
  • Strategy: Masked Language Modeling (MLM)
  • Masking Strategy:
    • Weighted Dynamic Masking based on Keyword Importance (YAKE) and Random Masking
      • Masking important keywords forces the model to learn the scientific terms that carry the strongest signal for a document, rather than only frequent filler words
    • Masked Language Model Probability: 30%
  • Batch Size: 7
  • Learning rate: 5e-5
  • Warmup ratio: 0.1
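The weighted dynamic masking above can be sketched as a biased sampler over token positions: positions whose tokens match extracted keywords (in training, scored by YAKE) are up-weighted relative to ordinary tokens, and roughly 30% of positions are masked per example. The boost factor, keyword set, and helper names below are illustrative assumptions, not the training code.

```python
import random

MASK_PROB = 0.30     # overall MLM probability from this card
KEYWORD_BOOST = 3.0  # hypothetical up-weighting factor for keyword tokens

def masking_weights(tokens, keywords):
    """Per-token sampling weight: tokens matching extracted keywords
    (e.g. YAKE output) are boosted over ordinary tokens."""
    return [KEYWORD_BOOST if t.lower() in keywords else 1.0 for t in tokens]

def weighted_dynamic_mask(tokens, keywords, rng=random):
    """Mask ~MASK_PROB of tokens, sampled without replacement and
    biased toward keyword positions."""
    n_mask = max(1, round(MASK_PROB * len(tokens)))
    weights = masking_weights(tokens, keywords)
    positions = list(range(len(tokens)))
    chosen = set()
    while len(chosen) < n_mask:
        (pos,) = rng.choices(positions, weights=weights, k=1)
        chosen.add(pos)
    return ["[MASK]" if i in chosen else t for i, t in enumerate(tokens)]

tokens = "the rover measured methane concentrations in the martian atmosphere".split()
keywords = {"rover", "methane", "martian", "atmosphere"}  # stand-in for YAKE output
masked = weighted_dynamic_mask(tokens, keywords, rng=random.Random(0))
```

Because masking is re-sampled each epoch ("dynamic"), the model sees different masked views of the same document over training.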

Dataset

  • Total Data Size: 545,717
  • Validation Data Size: 10% of total size
  • Test Data Size: 10% of total size