---
language:
- en
license: mit
library_name: transformers
tags:
- materials-science
- crystallography
- generative-ai
- inverse-design
- chemistry
- xrd
datasets:
- c-bone/chili100k_strat
pipeline_tag: text-generation
---
# Model Card for CrystaLLM-pi\_Chili100K-XRD
## Model Details
### Model Description
**CrystaLLM-pi\_Chili100K-XRD** is a conditional generative model designed for the recovery of crystal structures from X-ray Diffraction (XRD) data. It is a fine-tuned version of the `CrystaLLM-pi` framework, utilizing a GPT-2 decoder-only architecture. This model employs a **Residual Attention (Slider)** mechanism to condition the generation of Crystallographic Information Files (CIFs) on heterogeneous X-ray diffraction data.
The model generates crystal structures based on an XRD pattern input vector consisting of the 20 most intense peaks:
1. **Peak Positions** ($2\theta$)
2. **Peak Intensities**
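As a rough illustration of the input described above, the sketch below keeps the 20 most intense peaks of a pattern and pads shorter patterns with -100, the padding value this card gives for missing peaks. The function name `build_xrd_condition` and the intensity-descending ordering are assumptions made for illustration. Normalization is deliberately left out, since the repository script is what defines it.

```python
from typing import List, Sequence, Tuple

N_PEAKS = 20        # the card states that the 20 most intense peaks are used
PAD_VALUE = -100.0  # the card states that missing peaks are padded with -100


def build_xrd_condition(
    two_theta: Sequence[float],
    intensity: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Keep the N_PEAKS most intense peaks and pad the rest with PAD_VALUE.

    Hypothetical helper: the peak ordering and the absence of normalization
    are assumptions, not the repository's actual preprocessing.
    """
    if len(two_theta) != len(intensity):
        raise ValueError("two_theta and intensity must have the same length")

    # Sort peaks by intensity (strongest first) and keep the top N_PEAKS.
    peaks = sorted(zip(two_theta, intensity), key=lambda p: p[1], reverse=True)
    peaks = peaks[:N_PEAKS]

    positions = [float(p[0]) for p in peaks]
    intensities = [float(p[1]) for p in peaks]

    # Pad short patterns so both vectors always have length N_PEAKS.
    n_missing = N_PEAKS - len(peaks)
    positions += [PAD_VALUE] * n_missing
    intensities += [PAD_VALUE] * n_missing
    return positions, intensities


# Example: a pattern with only three extracted peaks.
positions, intensities = build_xrd_condition(
    two_theta=[47.3, 28.4, 56.1],
    intensity=[55.0, 100.0, 30.0],
)
```

For real inference, prefer the preprocessing in the repository script, which also applies the normalization the model was trained with.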
The model is fine-tuned on the Chili-100K XRD dataset, which contains experimentally determined structures sourced from Chili-100K, a curated and filtered subset of the Crystallography Open Database (COD) covering inorganic experimental nanomaterials. Notably, this model features an extended context window of **1536 tokens**, enabling the generation of larger and more complex unit cells containing up to \~100 atoms.
- **Developed by:** Bone et al. (University College London)
- **Model type:** Autoregressive Transformer with Residual Attention Conditioning
- **Language(s):** CIF (Crystallographic Information File) syntax
- **License:** MIT
- **Finetuned from model:** `c-bone/CrystaLLM-pi_Mattergen-XRD`
### Model Sources
- **Repository:** [GitHub: CrystaLLM-pi](https://github.com/C-Bone-UCL/CrystaLLM-pi)
- **Paper:** [Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)](https://arxiv.org/abs/2511.21299)
- **Dataset:** [HuggingFace: c-bone/mattergen\_XRD](https://huggingface.co/datasets/c-bone/mattergen_XRD) (Stage 1), [HuggingFace: c-bone/chili100k\_strat](https://huggingface.co/datasets/c-bone/chili100k_strat) (Stage 2)
## Uses
### Direct Use
The model is intended for structure solution and recovery from powder XRD data. Researchers can input a list of peak positions and intensities derived from experimental diffraction patterns to generate candidate crystal structures that match the experimental signature.
### Out-of-Scope Use
- **Disordered Systems:** The model does not natively handle partial occupancies or significant disorder.
- **Organic/MOFs:** The training data was strictly filtered for inorganic nanomaterials, following the Chili-100K dataset methodology.
- **Extremely Large Unit Cells:** Although the context window is expanded to 1536 tokens, structures with many atoms per unit cell may suffer degraded generation quality.
## Bias, Risks, and Limitations
- **Experimental Noise:** Performance depends on the quality of the input peak extraction and on how rare the material is in the training data.
- **Missing Data:** The "Slider" mechanism handles missing peaks (padded with -100), but significant data loss degrades recovery rates.
- **Polymorphs:** In cases of strong structural similarity, the model may bias towards the polymorph most prevalent in the Chili-100K distribution.
## How to Get Started with the Model
For instructions on loading and running generation, refer to the `_load_and_generate.py` script in the [CrystaLLM-pi GitHub Repository](https://github.com/C-Bone-UCL/CrystaLLM-pi). This script handles XRD vector tokenization and normalization.
## Training Details
### Training Data
The model underwent a two-stage fine-tuning process:
1. **MatterGen XRD:** Theoretical XRD patterns generated from the MatterGen dataset.
2. **Chili-100K XRD (`c-bone/chili100k_strat`):** An experimentally determined, curated, and filtered subset of inorganic nanomaterials from the COD (accessed April 2026). After deduplication, this comprises \~14K materials derived from \~21K CIFs.
**Dataset Splitting (Chili-100K):**
- **Train:Val:Test Ratio:** 78.6:10.7:10.7
- **Leakage-Aware Test Set:** The test set was strictly stratified to evaluate generalization:
  - **500 materials:** Fully seen during training (LeMaterial, MatterGen XRD, or Chili-100K train/val).
  - **500 materials:** Reduced formula seen during training, but the specific structure was unseen (measured via Structure Novelty metric).
  - **500 materials:** Neither reduced formula nor structure seen in any training phase.
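The three test buckets above can be read as a simple decision rule, sketched below. The helper `leakage_bucket` and the string `structure_key` are hypothetical: the card measures structural novelty with a Structure Novelty metric, for which an exact-match key is only a stand-in here. The final lines check that three buckets of 500, if they make up the whole test set, are consistent with the stated 10.7% test share and the \~14K materials quoted above.

```python
from typing import Set


def leakage_bucket(
    reduced_formula: str,
    structure_key: str,
    seen_formulas: Set[str],
    seen_structures: Set[str],
) -> str:
    """Assign a test material to one of the three leakage buckets.

    Hypothetical helper: `structure_key` stands in for whatever structural
    comparison the Structure Novelty metric actually performs.
    """
    if structure_key in seen_structures:
        return "seen"  # structure appeared in some training phase
    if reduced_formula in seen_formulas:
        return "formula_seen_structure_unseen"
    return "unseen"  # neither formula nor structure was seen


# Toy training-side sets (made-up entries for illustration only).
seen_formulas = {"TiO2", "ZnO"}
seen_structures = {"TiO2/rutile", "ZnO/wurtzite"}

bucket_a = leakage_bucket("TiO2", "TiO2/rutile", seen_formulas, seen_structures)
bucket_b = leakage_bucket("TiO2", "TiO2/anatase", seen_formulas, seen_structures)
bucket_c = leakage_bucket("CeO2", "CeO2/fluorite", seen_formulas, seen_structures)

# Consistency check: 3 x 500 test materials at a 10.7% test share.
test_size = 3 * 500
implied_total = round(test_size / 0.107)  # about 14,019 materials
```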
### Training Procedure
- **Architecture:** GPT-2 with **Residual Attention (Slider)** layers (\~47.7M parameters).
- **Mechanism:** The Slider mechanism computes a parallel attention score for the conditioning vector, dynamically weighting it against base self-attention to robustly handle heterogeneous/missing diffraction data.
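To make the idea of a parallel, weighted conditioning path concrete, here is a toy sketch of a gated residual sum of two attention outputs for a single query vector. It is not the paper's formulation: the function names, the scalar `gate` argument, and the rule of dropping entries equal to -100 are all assumptions chosen for illustration. In a trained model the weighting would be computed by learned parameters rather than passed in.

```python
import math
from typing import List, Sequence

PAD_VALUE = -100.0  # padding value the card gives for missing peaks


def _softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _attend(query, keys, values) -> List[float]:
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = _softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def slider_like_attention(
    query, keys, values, cond_raw, cond_keys, cond_values, gate: float
) -> List[float]:
    """Toy gated residual attention (illustrative, not the paper's method).

    `cond_raw` holds the raw conditioning values; entries equal to PAD_VALUE
    are treated as missing and excluded from the conditioning path.
    """
    base = _attend(query, keys, values)  # ordinary attention over tokens

    present = [i for i, c in enumerate(cond_raw) if c != PAD_VALUE]
    if not present:
        return base  # no usable conditioning: fall back to base attention

    cond = _attend(
        query,
        [cond_keys[i] for i in present],
        [cond_values[i] for i in present],
    )
    # Residual combination: base output plus a weighted conditioning output.
    return [b + gate * c for b, c in zip(base, cond)]
```

With every conditioning entry padded, the function returns the base attention output unchanged, which is the behaviour one would want from a mechanism meant to tolerate missing diffraction data.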
## Evaluation
### Metrics
The model is evaluated on the leakage-aware test splits using:
1. **Match Rate:** Percentage of ground truth structures successfully recovered.
2. **RMS-d:** Root Mean Square distance between ground truth and generated structures.
3. **Lattice Parameter and Volume MAE:** Mean Absolute Error of predicted unit cell dimensions.
4. **N atoms match:** The average number of atoms in the unit cell of matched materials in the test set.
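Given per-structure results, the aggregate metrics above reduce to simple averages. The sketch below assumes the structural comparison has already produced a boolean match flag per test structure. The helper names are hypothetical, and RMS-d is omitted because it depends on the structure-matching procedure itself.

```python
from typing import Sequence


def match_rate(matched: Sequence[bool]) -> float:
    """Percentage of ground-truth structures flagged as recovered."""
    return 100.0 * sum(matched) / len(matched)


def mean_absolute_error(true: Sequence[float], pred: Sequence[float]) -> float:
    """MAE between ground-truth and predicted values (e.g. a lattice length)."""
    if len(true) != len(pred):
        raise ValueError("true and pred must have the same length")
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def mean_matched_atoms(n_atoms: Sequence[int], matched: Sequence[bool]) -> float:
    """Average unit-cell atom count over the matched structures only."""
    kept = [n for n, m in zip(n_atoms, matched) if m]
    return sum(kept) / len(kept)


# Toy results for four test structures (made-up numbers for illustration).
matched = [True, False, True, True]
rate = match_rate(matched)
a_mae = mean_absolute_error([4.0, 5.0], [4.1, 4.8])
avg_atoms = mean_matched_atoms([8, 20, 12, 4], matched)
```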
## Citation
**Primary Model Paper:**
```bibtex
@misc{bone2025discoveryrecoverycrystallinematerials,
title={Discovery and recovery of crystalline materials with property-conditioned transformers},
author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
year={2025},
eprint={2511.21299},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci},
url={https://arxiv.org/abs/2511.21299},
}
```
**CHILI Dataset:**
```bibtex
@inproceedings{10.1145/3637528.3671538,
author = {Friis-Jensen, Ulrik and Johansen, Frederik L. and Anker, Andy S. and Dam, Erik B. and Jensen, Kirsten M. \O{}. and Selvan, Raghavendra},
title = {CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning},
year = {2024},
isbn = {9798400704901},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3637528.3671538},
doi = {10.1145/3637528.3671538},
pages = {4962–4973},
numpages = {12},
keywords = {atomic structure, chemistry, datasets, deep learning, graph neural network, graphs, machine learning, nanomaterials, neutron, scattering, x-ray},
location = {Barcelona, Spain},
series = {KDD '24}
}
```