Model Card for Indus (modernbert-sde-v0.2)
This model (Indus) was initialized from answerdotai/ModernBERT-base and further pre-trained on the full Science Discovery Engine (SDE) website data.
Model Details
- Base Model: answerdotai/ModernBERT-base
- Tokenizer: answerdotai/ModernBERT-base
- Parameters: 150M
- Pretraining Strategy: Masked Language Modeling (MLM)
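The model loads with the standard Hugging Face `transformers` API. Below is a minimal usage sketch; note that the Hub repo ID is a placeholder, since the published ID is not stated in this card.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Placeholder repo ID (assumption) -- substitute the actual published Hub ID.
model_id = "your-org/modernbert-sde-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# ModernBERT uses the [MASK] token for masked-language-model inference.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("The rover collected [MASK] samples from the Martian surface."))
```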
Training Data
- Full Science Discovery Engine (SDE) Website Data
Training Procedure
- transformers Version: 4.48.3
- Strategy: Masked Language Modeling (MLM)
- Masking Strategy:
  - Weighted dynamic masking based on keyword importance (YAKE), combined with random masking
  - Masking important keywords forces the model to generalize over "science" keywords that carry high signal for a document (see the sketch after this list)
- Masking Probability: 30%
- Batch Size: 7
- Learning Rate: 5e-5
- Warmup Ratio: 0.1
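A minimal sketch of keyword-weighted dynamic masking follows, assuming per-example masking decisions at tokenization time. The exact weighting scheme used to train this model is not fully specified in this card; the YAKE settings and the boost factor below are illustrative assumptions, and the standard 80/10/10 mask/random/keep replacement rule is omitted for brevity.

```python
import torch
import yake
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
extractor = yake.KeywordExtractor(lan="en", n=2, top=10)  # assumed settings

BASE_P = 0.30  # masking probability from this card
BOOST = 2.0    # illustrative boost for keyword tokens (assumption)

def mask_example(text: str):
    # YAKE returns (keyword, score) pairs; lower score = more important.
    keywords = [kw.lower() for kw, _ in extractor.extract_keywords(text)]
    enc = tokenizer(text, return_offsets_mapping=True, truncation=True)
    input_ids = torch.tensor(enc["input_ids"])
    lower = text.lower()

    # Mark character spans covered by any keyword occurrence.
    is_kw_char = [False] * len(text)
    for kw in keywords:
        start = lower.find(kw)
        while start != -1:
            for i in range(start, start + len(kw)):
                is_kw_char[i] = True
            start = lower.find(kw, start + 1)

    # Per-token masking probability: boost tokens overlapping a keyword span.
    probs = torch.full_like(input_ids, BASE_P, dtype=torch.float)
    for idx, (s, e) in enumerate(enc["offset_mapping"]):
        if s == e:                       # special tokens: never mask
            probs[idx] = 0.0
        elif any(is_kw_char[s:e]):
            probs[idx] = min(1.0, BASE_P * BOOST)

    masked = torch.bernoulli(probs).bool()
    labels = input_ids.clone()
    labels[~masked] = -100               # loss only on masked positions
    input_ids[masked] = tokenizer.mask_token_id
    return input_ids, labels
```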
Dataset
- Total Size: 545,717
- Validation Split: 10% of the total
- Test Split: 10% of the total (leaving 80% for training; see the split sketch below)
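A minimal sketch of the 80/10/10 split using the `datasets` library, assuming the corpus is available as a local text file (the file name is a placeholder; the SDE corpus itself is not reproduced here).

```python
from datasets import load_dataset

# Placeholder file name (assumption) for the SDE corpus.
ds = load_dataset("text", data_files={"all": "sde_corpus.txt"})["all"]

# First carve off 10% of the total for test.
split = ds.train_test_split(test_size=0.10, seed=42)
train_val, test = split["train"], split["test"]

# 10% of the original total is 1/9 of the remaining 90%.
split2 = train_val.train_test_split(test_size=1 / 9, seed=42)
train, validation = split2["train"], split2["test"]

print(len(train), len(validation), len(test))  # ~80% / 10% / 10% of 545,717
```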