# Model Card for CrystaLLM-pi_COD-XRD
## Model Details

### Model Description
CrystaLLM-pi_COD-XRD is a conditional generative model designed for the recovery of crystal structures from X-ray Diffraction (XRD) data. It is a fine-tuned version of the CrystaLLM-pi framework, based on a GPT-2 decoder-only architecture. This variant employs the Residual Attention (Slider) mechanism to condition the generation of Crystallographic Information Files (CIFs) on high-dimensional experimental data.
The model generates crystal structures from an XRD input vector consisting of the 20 most intense peaks:
- Peak Positions ($2\theta$)
- Peak Intensities
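As a concrete sketch of this input format, the conditioning vector can be assembled from extracted $(2\theta, I)$ pairs, with missing peaks padded by the -100 sentinel noted under Limitations below. The helper name, peak ordering, and flat 40-entry layout are assumptions for illustration; the repository's preprocessing code defines the actual format and normalization.

```python
# Illustrative sketch only: build a fixed-length XRD conditioning vector
# from (2-theta, intensity) peak pairs. The -100 padding for missing peaks
# follows the model card; the exact layout and normalization used by
# CrystaLLM-pi live in the repository's preprocessing code.
PAD = -100.0
N_PEAKS = 20

def build_xrd_vector(peaks):
    """peaks: list of (two_theta, intensity) tuples from a powder pattern."""
    # keep the 20 most intense peaks, then order them by position
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:N_PEAKS]
    top.sort(key=lambda p: p[0])
    positions = [p[0] for p in top]
    intensities = [p[1] for p in top]
    # pad short patterns with the sentinel the Slider mechanism masks out
    pad = N_PEAKS - len(top)
    return positions + [PAD] * pad + intensities + [PAD] * pad

vec = build_xrd_vector([(28.4, 100.0), (47.3, 55.0), (56.1, 30.0)])
```

A pattern with fewer than 20 resolvable peaks still yields a full-length vector, which is what lets the conditioning mechanism handle heterogeneous inputs.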
- Developed by: Bone et al. (University College London)
- Model type: Autoregressive Transformer with Residual Attention Conditioning
- Language(s): CIF (Crystallographic Information File) syntax
- License: MIT
- Finetuned from model: c-bone/CrystaLLM-pi_base
### Model Sources
- Repository: GitHub: CrystaLLM-pi
- Paper: Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)
- Dataset: HuggingFace: c-bone/mattergen_XRD (Stage 1), HuggingFace: c-bone/COD_XRD_small_nohc (Stage 2)
## Uses

### Direct Use
The model is intended for structure solution and recovery from powder XRD data. Researchers can input a list of peak positions and intensities derived from experimental diffraction patterns to generate candidate crystal structures that match the experimental signature.
### Out-of-Scope Use
- Disordered Systems: The model was trained on ordered approximations of structures from the Crystallography Open Database (COD). It does not natively handle partial occupancies or significant disorder.
- Large Unit Cells: The model's context window restricts generation to structures of roughly 20 atoms per unit cell.
- Organic/MOFs: The training data was filtered to exclude hydrocarbon-containing compounds (inorganic only).
## Bias, Risks, and Limitations
- Experimental Noise: While the model is robust to some experimental deviations, its performance relies on the quality of the input peak extraction.
- Missing Data: The "Slider" mechanism is designed to handle missing peaks (padded with -100), but significant data loss will degrade recovery rates.
- Polymorphs: In cases of strong structural similarity or ambiguous diffraction patterns, the model may be biased towards the polymorph most represented in the training distribution.
## How to Get Started with the Model
For instructions on how to load the model and run generation, refer to the `_load_and_generate.py` script in the CrystaLLM-pi GitHub repository. This script handles the required tokenization and normalization of XRD input vectors.
## Training Details

### Training Data
The model underwent a two-stage fine-tuning process:
- Stage 1 (MatterGen XRD): Theoretical XRD patterns generated from the MatterGen dataset.
- Stage 2 (COD XRD): Experimental XRD data from the Crystallography Open Database (COD), filtered for inorganic structures and processed to remove partial occupancies.
### Training Procedure
- Architecture: GPT-2 with Residual Attention (Slider) layers. (~47.7M parameters)
- Mechanism: The Slider mechanism computes a parallel attention score for the conditioning vector and dynamically weights it against the base self-attention. This allows for "softer" conditioning and robust handling of heterogeneous or missing data points in the diffraction pattern.
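The blending idea described above can be sketched in miniature. The following is a toy, pure-Python illustration of a parallel attention path gated against base self-attention; the function names and the scalar gate are assumptions for exposition, not the model's actual implementation, which lives in the CrystaLLM-pi codebase.

```python
import math

# Toy illustration of the Residual Attention ("Slider") idea: a parallel
# attention output is computed over the conditioning vector and dynamically
# blended with the base self-attention output.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # scaled dot-product attention for a single query vector
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def slider_attention(query, tok_keys, tok_values, cond_keys, cond_values, gate):
    # base self-attention over the CIF token stream
    base = attend(query, tok_keys, tok_values)
    # parallel attention over entries of the XRD conditioning vector
    cond = attend(query, cond_keys, cond_values)
    # dynamic blend: gate in [0, 1] slides between unconditioned (0)
    # and fully conditioned (1) attention
    return [(1 - gate) * b + gate * c for b, c in zip(base, cond)]
```

In the real layer the weighting is learned rather than a fixed scalar, which is what allows "softer" conditioning and graceful degradation when conditioning entries are missing.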
## Evaluation

### Metrics
The model is evaluated based on:
- Match Rate: The percentage of ground truth structures successfully recovered (within structural similarity tolerances).
- RMS-d: Root Mean Square distance between the ground truth and generated structures.
- Lattice Parameter MAE: Mean Absolute Error of the predicted unit cell dimensions.
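As a simple illustration of the last metric, the lattice parameter MAE over matched structure pairs might be computed as below. This is a sketch under the assumption that all six cell parameters are averaged together; the paper's evaluation protocol defines the actual matching and tolerance scheme.

```python
# Illustrative only: mean absolute error over unit-cell parameters
# (a, b, c in angstroms; alpha, beta, gamma in degrees) for matched
# ground-truth / generated structure pairs.
def lattice_mae(true_cells, pred_cells):
    diffs = [
        abs(t - p)
        for tc, pc in zip(true_cells, pred_cells)
        for t, p in zip(tc, pc)
    ]
    return sum(diffs) / len(diffs)

mae = lattice_mae(
    [(5.64, 5.64, 5.64, 90.0, 90.0, 90.0)],  # e.g. a rock-salt-type cell
    [(5.60, 5.66, 5.64, 90.0, 90.0, 90.5)],
)
```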
### Results
The model achieves competitive performance on the MP-20 and experimental COD benchmarks, effectively recovering structures from experimental XRD data with reduced computational cost compared to traditional solution methods.
## Citation
@misc{bone2025discoveryrecoverycrystallinematerials,
  title={Discovery and recovery of crystalline materials with property-conditioned transformers},
  author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
  year={2025},
  eprint={2511.21299},
  archivePrefix={arXiv},
  primaryClass={cond-mat.mtrl-sci},
  url={https://arxiv.org/abs/2511.21299},
}