---
license: apache-2.0
language:
- en
pipeline_tag: fill-mask
widget:
- text: >-
    The Standard Model (SM) of [MASK] physics has been tested by many
    experiments over the last four decades and has been shown to successfully
    describe high energy particle interactions.
  example_title: particle physics
- text: >-
    Clear evidence for the production of a neutral boson with a measured mass of
    [MASK].0 ± 0.4 (stat) ± 0.4 (sys) GeV is presented.
  example_title: 126.0 ± 0.4 (stat) ± 0.4 (sys) GeV
- text: >-
    An excess of [MASK] is observed above the expected background, with a local
    significance of 5.0 standard deviations, at a mass near 125 GeV, signalling
    the production of a new particle.
  example_title: excess of events
- text: >-
    On September 14, 2015 at 09:50:45 UTC the two [MASK] of the Laser
    Interferometer Gravitational-Wave Observatory simultaneously observed a
    transient gravitational-wave signal.
  example_title: two detectors
- text: >-
    These first images from the EHT achieve the highest [MASK] resolution in the
    history of ground-based VLBI.
  example_title: angular resolution
- text: >-
    We propose a comprehensive theory of [MASK] matter that explains the recent
    proliferation of unexpected observations in high-energy astrophysics.
  example_title: dark matter
- text: >-
    Formation of galaxy clusters corresponds to the collapse of the largest
    gravitationally bound overdensities in the initial [MASK] field and is
    accompanied by the most energetic phenomena since the Big Bang and by the
    complex interplay between gravity-induced dynamics of collapse and baryonic
    processes associated with galaxy formation.
  example_title: initial density field
- text: >-
    The Event [MASK] Telescope (EHT) has led to the first images of a
    supermassive black hole, revealing the central compact objects in the
    elliptical galaxy M87 and the Milky Way.
  example_title: Event Horizon Telescope
datasets:
- wikipedia
- bookcorpus
- arnosimons/astro-hep-corpus
tags:
- arXiv
- astrophysics
- conceptual analysis
- epistemic change
- high-energy physics (HEP)
- history of science
- semantic shift detection
- sociology of science
- philosophy of science
- physics
- word embeddings
---

# Model Card for Astro-HEP-BERT
|
**Astro-HEP-BERT** is a bidirectional transformer designed primarily to generate contextualized word embeddings for computational conceptual analysis in astrophysics and high-energy physics (HEP). Built on Google's `bert-base-uncased`, the model was further trained for three epochs on the <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/datasets/arnosimons/astro-hep-corpus">Astro-HEP Corpus</a>, which contains 21.84 million paragraphs from more than 600,000 scholarly articles on arXiv, all pertaining to astrophysics and/or high-energy physics. The sole training objective was **Masked Language Modeling (MLM)**.
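
Because the model was trained with MLM only, the most direct way to probe it is to ask it to fill a masked token. The following is a minimal sketch, not an official quick-start: it assumes the weights are published on the Hugging Face Hub under the ID `arnosimons/astro-hep-bert` (this card does not state the ID) and that the `transformers` library is installed. The helper names `load_fill_mask`, `top_predictions`, and `demo` are illustrative only.

```python
from typing import Callable, List, Tuple

# Assumption: the Hub ID below is not stated in this card; adjust it if needed.
MODEL_ID = "arnosimons/astro-hep-bert"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline (downloads the weights on first use)."""
    # Imported lazily so that top_predictions() works without transformers.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(
    fill_mask: Callable, text: str, k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring fillers for the single [MASK] in `text`."""
    if text.count("[MASK]") != 1:
        raise ValueError("expected exactly one [MASK] token")
    # For a single mask, the pipeline returns a list of dicts that include
    # the keys "token_str" and "score".
    results = fill_mask(text, top_k=k)
    return [(r["token_str"].strip(), float(r["score"])) for r in results]


def demo() -> None:
    """Query the model with one of the widget sentences from this card."""
    fill_mask = load_fill_mask()
    sentence = (
        "The Event [MASK] Telescope (EHT) has led to the first images of a "
        "supermassive black hole, revealing the central compact objects in "
        "the elliptical galaxy M87 and the Milky Way."
    )
    for token, score in top_predictions(fill_mask, sentence):
        print(f"{token:>12}  {score:.3f}")
```

Calling `demo()` prints the five highest-scoring candidates for the masked position. Nothing is downloaded until `load_fill_mask()` is called.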
|
To optimize the model's ability to embed domain-specific language, **training was conducted exclusively on entire paragraphs**, rather than on sequences packed with as many sentences as possible, as BERT tutorials often suggest. This "full-paragraphs format" preserves sentences within their original context, which is especially meaningful in academic writing, where a paragraph typically develops a single idea.
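
The difference between the two input formats can be illustrated with a small, self-contained sketch. It is not the actual preprocessing code: whitespace splitting stands in for WordPiece tokenization, truncating over-long paragraphs is an assumption (the card does not say how they were handled), and the function names are hypothetical.

```python
from typing import Iterable, List

MAX_TOKENS = 512  # input limit of bert-base models, including [CLS] and [SEP]


def paragraph_examples(
    paragraphs: Iterable[str], max_tokens: int = MAX_TOKENS
) -> List[List[str]]:
    """Full-paragraphs format: one training example per paragraph."""
    examples = []
    for paragraph in paragraphs:
        tokens = paragraph.split()  # stand-in for WordPiece tokenization
        if tokens:
            # Assumption: over-long paragraphs are truncated, not split.
            examples.append(["[CLS]"] + tokens[: max_tokens - 2] + ["[SEP]"])
    return examples


def packed_examples(
    paragraphs: Iterable[str], max_tokens: int = MAX_TOKENS
) -> List[List[str]]:
    """Packing, for contrast: concatenate everything, cut fixed-size chunks."""
    stream = [tok for paragraph in paragraphs for tok in paragraph.split()]
    body = max_tokens - 2
    return [
        ["[CLS]"] + stream[i : i + body] + ["[SEP]"]
        for i in range(0, len(stream), body)
    ]


paragraphs = [
    "Dark matter halos form hierarchically .",
    "The Higgs boson decays to two photons .",
]
# A tiny limit makes the difference visible on toy data.
for example in paragraph_examples(paragraphs, max_tokens=6):
    print("paragraph:", " ".join(example))
for example in packed_examples(paragraphs, max_tokens=6):
    print("packed:   ", " ".join(example))
```

In the packed output, the second chunk mixes the end of the first paragraph with the start of the second, so the model sees unrelated context in a single input. In the full-paragraphs output, every example stays within one paragraph.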
|
The Astro-HEP-BERT project demonstrates that training a customized bidirectional transformer for computational conceptual analysis in the history, philosophy, and sociology of science is feasible as an open-source endeavor without a substantial budget. Using only freely available code, weights, and text inputs, the entire training process ran on a single MacBook Pro laptop (M2/96GB).
|
For further insights into the model, the corpus, and the underlying research project (<a target="_blank" rel="noopener noreferrer" href="https://doi.org/10.3030/101044932">Network Epistemology in Practice</a>), please refer to the following three papers:
|
1) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2411.14877">Simons, A. (2024). Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics. arXiv:2411.14877.</a>

2) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2411.14073">Simons, A. (2024). Meaning at the Planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science. arXiv:2411.14073.</a>

3) <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2506.12242">Simons, A., Zichert, M., and Wüthrich, A. (2025). Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives. arXiv:2506.12242.</a>
|
## Model Details
|
- **Developer:** <a target="_blank" rel="noopener noreferrer" href="https://www.tu.berlin/en/hps-mod-sci/arno-simons">Arno Simons</a>
- **Funded by:** The European Union under Grant agreement ID: <a target="_blank" rel="noopener noreferrer" href="https://doi.org/10.3030/101044932">101044932</a>
- **Language (NLP):** English
- **License:** apache-2.0
- **Parent model:** Google's <a target="_blank" rel="noopener noreferrer" href="https://github.com/google-research/bert">`bert-base-uncased`</a>
|
<!---

## How to Get Started with the Model

Use the code below to get started with the model.

[Coming soon]

## Citation

**BibTeX:**

[Coming soon]

**APA:**

[Coming soon]

-->