|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: mit |
|
|
library_name: transformers |
|
|
tags: |
|
|
- materials-science |
|
|
- crystallography |
|
|
- generative-ai |
|
|
- inverse-design |
|
|
- chemistry |
|
|
datasets: |
|
|
- c-bone/mpdb-2prop_clean |
|
|
base_model: c-bone/CrystaLLM-pi_base |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# Model Card for CrystaLLM-pi_bandgap |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
**CrystaLLM-pi_bandgap** is a conditional generative model designed for the inverse design of inorganic crystalline materials. It is a fine-tuned version of the `CrystaLLM-pi` framework, based on a GPT-2 decoder-only architecture. This specific variant employs the **Property-Key-Value (PKV)** attention mechanism (referred to as "Prefix attention" in the associated preprint) to condition the generation of Crystallographic Information Files (CIFs) on specific electronic and thermodynamic properties. |
|
|
|
|
|
The model generates crystal structures (cell parameters and atomic positions) based on two target scalar properties: |
|
|
1. **Band gap** (eV) |
|
|
2. **Energy above convex hull** ($E_{hull}$, eV/atom) - a proxy for thermodynamic stability |
|
|
|
|
|
- **Developed by:** Bone et al. (University College London) |
|
|
- **Model type:** Autoregressive Transformer with Prefix Attention Conditioning |
|
|
- **Language(s):** CIF (Crystallographic Information File) syntax |
|
|
- **License:** MIT |
|
|
- **Finetuned from model:** `c-bone/CrystaLLM-pi_base` |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [GitHub: CrystaLLM-pi](https://github.com/C-Bone-UCL/CrystaLLM-pi) |
|
|
- **Paper:** [Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)](https://arxiv.org/abs/2511.21299) |
|
|
- **Dataset:** [HuggingFace: c-bone/mpdb-2prop_clean](https://huggingface.co/datasets/c-bone/mpdb-2prop_clean) |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
The model is intended for research in materials science, specifically for the exploration of chemical space targeting specific electronic properties. Users can input a desired band gap and a stability criterion to generate candidate crystal structures. |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- **Organic Materials:** The model was trained exclusively on inorganic crystal structures. |
|
|
- **Large Unit Cells:** Due to the context window limit of 1024 tokens, the model cannot reliably generate unit cells containing more than approximately 20 atoms. |
|
|
- **Disordered Systems:** The model currently generates ordered structures and does not natively handle partial occupancies. |
|
|
- **Production Deployment:** This is a research artifact. Generated structures must be validated via Density Functional Theory (DFT) or other simulation methods before synthesis attempts. |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- **Training Distribution Bias:** The model is trained on the Materials Project database. It exhibits higher performance in regions of chemical space well-represented in the training data (e.g., band gaps near 0 eV). Performance degrades in sparse regions of the property manifold. |
|
|
- **Validity:** As an autoregressive language model, it may generate syntactically incorrect CIFs or chemically implausible structures. Post-processing validation is required. |
|
|
- **Hallucination:** The model may generate "novel" compositions that are thermodynamically unstable. |
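Since malformed CIFs are an expected failure mode, a lightweight syntactic screen can filter candidates before heavier validation (a full CIF parser such as pymatgen, or DFT). The sketch below is illustrative only; the required-tag list is an assumption, not the validation procedure used in the paper.

```python
# Minimal syntactic sanity check for generated CIF text.
# NOTE: illustrative sketch only -- the tag list is an assumption, not the
# validation used for CrystaLLM-pi; real screening should use a proper CIF
# parser (e.g. pymatgen's Structure.from_str).

REQUIRED_TAGS = [
    "data_",
    "_cell_length_a",
    "_cell_length_b",
    "_cell_length_c",
    "_cell_angle_alpha",
    "_cell_angle_beta",
    "_cell_angle_gamma",
    "_atom_site_fract_x",
]

def looks_like_valid_cif(text: str) -> bool:
    """Return True if every required CIF tag appears in the generated text."""
    return all(tag in text for tag in REQUIRED_TAGS)

good = """data_NaCl
_cell_length_a 5.64
_cell_length_b 5.64
_cell_length_c 5.64
_cell_angle_alpha 90.0
_cell_angle_beta 90.0
_cell_angle_gamma 90.0
loop_
_atom_site_fract_x
"""
print(looks_like_valid_cif(good))        # True
print(looks_like_valid_cif("data_X"))    # False
```

A check like this only rejects obviously truncated or garbled output; chemically implausible but well-formed structures still require a surrogate model or DFT to catch.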
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
For instructions on how to load and run generation with this model, please refer to the `_load_and_generate.py` script in the [CrystaLLM-pi GitHub Repository](https://github.com/C-Bone-UCL/CrystaLLM-pi). This script handles the tokenization, property normalization, and prompt construction required to properly condition the model. |

|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was fine-tuned on the **MP Bandgap** dataset, a subset of the Materials Project containing approximately 53.3K inorganic structures labeled with PBE band gaps and $E_{hull}$ values. |
|
|
|
|
|
- **Source:** Materials Project (via `c-bone/mpdb-2prop_clean`) |
|
|
- **Preprocessing:** CIFs are augmented, tokenized, and property values are normalized before injection into the attention mechanism. |
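The exact normalization applied before properties are injected into the attention mechanism is defined in the repository's preprocessing code. As a hedged illustration only, a simple min-max scaling of the two conditioning scalars might look like the following; the ranges below are placeholders, not the training-set statistics actually used.

```python
# Illustrative min-max normalization of the two conditioning properties.
# The ranges are placeholders, NOT the statistics used by CrystaLLM-pi;
# the repository's preprocessing defines the actual scheme.

def min_max(value: float, lo: float, hi: float) -> float:
    """Scale value into [0, 1], clipping to the assumed training range."""
    x = (value - lo) / (hi - lo)
    return min(max(x, 0.0), 1.0)

# Hypothetical training-set ranges (assumptions for this sketch).
BANDGAP_RANGE = (0.0, 10.0)   # eV
EHULL_RANGE = (0.0, 0.5)      # eV/atom

def encode_condition(bandgap_ev: float, e_hull_ev_atom: float) -> list[float]:
    """Map raw (band gap, E_hull) targets to normalized conditioning values."""
    return [
        min_max(bandgap_ev, *BANDGAP_RANGE),
        min_max(e_hull_ev_atom, *EHULL_RANGE),
    ]

print(encode_condition(2.5, 0.05))  # [0.25, 0.1]
```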
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
- **Architecture:** GPT-2 Small with additional Property-Key-Value (PKV) encoder layers (~61.6M parameters). |
|
|
- **Mechanism:** Continuous property values are projected into the attention mechanism's key-value space (Prefix Tuning), allowing the model to attend to the target properties at every generation step. |
|
|
- **Optimization:** A dual optimization strategy was employed, using a lower learning rate for the pre-trained backbone and a higher learning rate for the condition encoder to prevent catastrophic forgetting. |
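The prefix-conditioning idea above can be sketched in miniature: a learned projection maps the continuous property values to extra key/value pairs that are prepended to each attention layer's keys and values, so every token attends to the targets at every step. The NumPy toy below illustrates the mechanism only; all dimensions, projections, and weights are made up and do not reflect the model's actual implementation.

```python
import numpy as np

# Toy sketch of Property-Key-Value (prefix) conditioning: project continuous
# property scalars into extra key/value pairs that every token can attend to.
# Dimensions and weights are illustrative stand-ins, not the real model's.

rng = np.random.default_rng(0)
d_model, n_props, seq_len = 16, 2, 5

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# "Learned" projections (random stand-ins): one prefix key/value per property.
W_pk = rng.normal(size=(1, d_model))
W_pv = rng.normal(size=(1, d_model))

props = np.array([[0.25], [0.10]])   # normalized (band gap, E_hull) targets
prefix_k = props @ W_pk              # (n_props, d_model)
prefix_v = props @ W_pv

# Ordinary token queries/keys/values for a short sequence.
q = rng.normal(size=(seq_len, d_model))
k = rng.normal(size=(seq_len, d_model))
v = rng.normal(size=(seq_len, d_model))

# Prepend property keys/values so each token also attends to the conditions.
k_full = np.concatenate([prefix_k, k], axis=0)   # (n_props + seq_len, d_model)
v_full = np.concatenate([prefix_v, v], axis=0)

attn = softmax(q @ k_full.T / np.sqrt(d_model))  # (seq_len, n_props + seq_len)
out = attn @ v_full                              # (seq_len, d_model)
print(out.shape)  # (5, 16)
```

The key design point is that the prefix pairs never produce queries of their own: they influence every generated token without occupying positions in the generated sequence.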
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Metrics |
|
|
|
|
|
The model is evaluated based on: |
|
|
|
|
|
1. **Validity:** Percentage of generated files that are valid CIFs. |
|
|
2. **Hit-Rate:** The fraction of generated structures where the predicted property (via surrogate model) falls within a tolerance of the target property. |
|
|
3. **VSUN:** A composite metric ensuring structures are Valid, Stable (low $E_{hull}$), Unique, and Novel. |
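The hit-rate in metric 2 reduces to a simple fraction once surrogate predictions are in hand. The sketch below assumes predictions are already computed; the 0.5 eV tolerance is a placeholder, not the value used in the paper.

```python
# Hit-rate: fraction of generated structures whose surrogate-predicted
# property falls within a tolerance of the target. The tolerance here is
# a placeholder, not the threshold used in the CrystaLLM-pi evaluation.

def hit_rate(predictions: list[float], target: float, tol: float) -> float:
    """Fraction of predictions p with |p - target| <= tol."""
    if not predictions:
        return 0.0
    hits = sum(abs(p - target) <= tol for p in predictions)
    return hits / len(predictions)

preds = [1.9, 2.4, 3.1, 0.2, 2.0]   # surrogate band gaps (eV) for 5 samples
print(hit_rate(preds, target=2.0, tol=0.5))  # 0.6
```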
|
|
|
|
|
### Results |
|
|
|
|
|
As detailed in *Figure 3* of the associated preprint, the PKV (Prefix) architecture demonstrates strong capability in steering generation toward target band gaps, particularly when compared to sequence-level conditioning baselines. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{bone2025discoveryrecoverycrystallinematerials, |
|
|
title={Discovery and recovery of crystalline materials with property-conditioned transformers}, |
|
|
author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler}, |
|
|
year={2025}, |
|
|
eprint={2511.21299}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cond-mat.mtrl-sci}, |
|
|
      url={https://arxiv.org/abs/2511.21299}, |
|
|
} |
``` |