HelioBERT-XL
HelioBERT-XL is a BERT-based language model specialized for heliophysics text, developed by the NASA-ADS team at the Harvard-Smithsonian Center for Astrophysics. It was trained on approximately 80,000 full-text heliophysics research articles.
The suffix XL refers to the scale and scope of the pretraining corpus, not to the model architecture.
An independently developed model named HelioBERT was described in prior work by Khoo, F.S. et al. (2023) and was trained on curated definitions and abstracts. Although the names partially overlap, the two models were developed independently and differ substantially in training data.
Model Details
- Developed by: NASA-ADS team (Harvard-Smithsonian Center for Astrophysics)
- Base model: nasa-impact/nasa-smd-ibm-v0.1 (INDUS initialization)
- Training approach: Continued pre-training with Masked Language Modeling (MLM)
- Training epochs: 20
- Final validation loss: 0.915
- Final perplexity: 2.50
- Max sequence length: 512
- Training data: Heliophysics domain corpus (approximately 80,000 full-text research articles)
Training Configuration
- Learning rate: 5e-5
- Batch size: 16 (per device) × 2 (gradient accumulation) = 32 effective
- Warmup ratio: 0.06
- Weight decay: 0.01
- Mixed precision: FP16
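The hyperparameters listed above can be collected into a single configuration, as in the minimal sketch below. The key names mirror the field names of Hugging Face `TrainingArguments`, which is an assumption: this card does not state which training framework was used. The single-device default in `effective_batch_size` is likewise an assumption, chosen because it reproduces the effective batch size of 32 reported above.

```python
# Hyperparameters from this model card, keyed by Hugging Face TrainingArguments
# field names (an assumption: the card does not name the training framework).
training_kwargs = {
    "num_train_epochs": 20,
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "warmup_ratio": 0.06,
    "weight_decay": 0.01,
    "fp16": True,  # mixed precision; typically needs a CUDA-capable GPU
}


def effective_batch_size(kwargs, num_devices=1):
    """Number of sequences contributing to each optimizer step.

    num_devices=1 is an assumption; the card only reports the total of 32.
    """
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))  # 32 on a single device
```

With `transformers` installed, a dictionary like this could be unpacked into `TrainingArguments(output_dir="heliobert-xl-mlm", **training_kwargs)`, where the output directory name is hypothetical.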
Performance
- Initial perplexity: 2.93
- Final perplexity: 2.50
- Improvement: ~15% reduction in perplexity
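The reported figures are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below assumes the usual convention that perplexity is the exponential of the mean cross-entropy loss measured in nats; the card does not state this explicitly, but the numbers agree with it.

```python
import math

final_val_loss = 0.915      # final validation loss reported in this card
initial_perplexity = 2.93   # initial perplexity reported in this card

# Assumed convention: perplexity = exp(mean cross-entropy loss in nats)
final_perplexity = math.exp(final_val_loss)

# Relative drop in perplexity from the initial to the final checkpoint
relative_reduction = (initial_perplexity - final_perplexity) / initial_perplexity

print(f"final perplexity: {final_perplexity:.2f}")      # 2.50
print(f"relative reduction: {relative_reduction:.1%}")  # about 15%
```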
Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the model and its tokenizer from the Hugging Face Hub
model = AutoModelForMaskedLM.from_pretrained("adsabs/HelioBERT-XL")
tokenizer = AutoTokenizer.from_pretrained("adsabs/HelioBERT-XL")

# Example: Fill mask
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Use the tokenizer's own mask token rather than hard-coding "<mask>"
results = fill_mask(f"The solar {tokenizer.mask_token} affects space weather.")
print(results)
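For a single input with one mask, the fill-mask pipeline returns a list of dictionaries, each with `score`, `token`, `token_str`, and `sequence` keys. A small helper such as the one sketched below can reduce that output to the top candidates. The name `top_predictions` and its default `k` are hypothetical and not part of this model or of `transformers`.

```python
def top_predictions(results, k=3):
    """Return (token, score) pairs for the k highest-scoring fills.

    `results` is the list of dicts produced by a fill-mask pipeline call
    on one input string containing one mask token.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    # token_str may carry a leading space with byte-level BPE tokenizers
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]


# Usage with the `results` variable from the example above:
# for token, score in top_predictions(results):
#     print(token, score)
```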
Citation
If you use this model, please cite:
@misc{helioBERT2026,
title={HelioBERT-XL: a pre-trained language model for large-scale heliophysical literature mining},
author={{Alkan}, Atilla Kaan and {Accomazzi}, Alberto},
year={2026},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/adsabs/HelioBERT-XL}}
}
Contact
For questions or issues, please contact the ADS team or open an issue on the model repository.