---
language:
- en
license: mit
library_name: transformers
tags:
- materials-science
- crystallography
- generative-ai
- inverse-design
- chemistry
- xrd
datasets:
- c-bone/chili100k_strat
pipeline_tag: text-generation
---

# Model Card for CrystaLLM-pi\_Chili100K-XRD

## Model Details

### Model Description

**CrystaLLM-pi\_Chili100K-XRD** is a conditional generative model for recovering crystal structures from X-ray diffraction (XRD) data. It is a fine-tuned version of the `CrystaLLM-pi` framework, utilizing a GPT-2 decoder-only architecture. This model employs a **Residual Attention (Slider)** mechanism to condition the generation of Crystallographic Information Files (CIFs) on heterogeneous X-ray diffraction data.

The model generates crystal structures based on an XRD pattern input vector consisting of the 20 most intense peaks:

1. **Peak Positions** ($2\theta$)
2. **Peak Intensities**

The Chili-100K XRD dataset used for fine-tuning contains experimentally determined structures sourced from Chili-100K, a curated and filtered subset of inorganic nanomaterial structures from the Crystallography Open Database (COD). Notably, this model features an extended context window of **1536 tokens**, enabling the generation of larger and more complex unit cells containing up to \~100 atoms.

- **Developed by:** Bone et al. (University College London)
- **Model type:** Autoregressive Transformer with Residual Attention Conditioning
- **Language(s):** CIF (Crystallographic Information File) syntax
- **License:** MIT
- **Finetuned from model:** `c-bone/CrystaLLM-pi_Mattergen-XRD`

### Model Sources

- **Repository:** [GitHub: CrystaLLM-pi](https://github.com/C-Bone-UCL/CrystaLLM-pi)
- **Paper:** [Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)](https://arxiv.org/abs/2511.21299)
- **Dataset:** [HuggingFace: c-bone/mattergen\_XRD](https://huggingface.co/datasets/c-bone/mattergen_XRD) (Stage 1), [HuggingFace: c-bone/chili100k\_strat](https://huggingface.co/datasets/c-bone/chili100k_strat) (Stage 2)

## Uses

### Direct Use

The model is intended for structure solution and recovery from powder XRD data. Researchers can input a list of peak positions and intensities derived from experimental diffraction patterns to generate candidate crystal structures that match the experimental signature.

### Out-of-Scope Use

- **Disordered Systems:** The model does not natively handle partial occupancies or significant disorder.
- **Organic/MOFs:** The training data was strictly filtered for inorganic nanomaterials, following the Chili-100K dataset methodology.
- **Extremely Large Unit Cells:** Although the context window is expanded to 1536 tokens, structures with many atoms per unit cell may suffer degraded generation quality.

## Bias, Risks, and Limitations

- **Experimental Noise:** Performance depends on the quality of the input peak extraction and on how rare the material is.
- **Missing Data:** The "Slider" mechanism handles missing peaks (padded with -100), but significant data loss degrades recovery rates.
- **Polymorphs:** In cases of strong structural similarity, the model may be biased towards the polymorph most prevalent in the Chili-100K distribution.

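
The input layout described in this card (the 20 most intense peaks, with missing peaks padded with -100) can be sketched as follows. This is a minimal illustration, not the repository's preprocessing: the function name `build_xrd_vector`, the ordering of peaks by position, the scaling of intensities to the strongest peak, and the positions-then-intensities layout are all assumptions. The script in the CrystaLLM-pi repository remains the reference for the actual tokenization and normalization.

```python
from typing import List, Sequence

N_PEAKS = 20        # number of peaks in the input vector, per this model card
PAD_VALUE = -100.0  # padding value for missing peaks, per this model card


def build_xrd_vector(two_theta: Sequence[float],
                     intensity: Sequence[float],
                     n_peaks: int = N_PEAKS,
                     pad: float = PAD_VALUE) -> List[float]:
    """Return [positions..., intensities...] for the n_peaks most intense peaks.

    Hypothetical helper: layout, ordering, and normalization are assumptions.
    """
    if len(two_theta) != len(intensity):
        raise ValueError("two_theta and intensity must have the same length")
    # Keep the most intense peaks, then order them by position (assumed ordering).
    peaks = sorted(zip(two_theta, intensity), key=lambda p: p[1], reverse=True)
    peaks = sorted(peaks[:n_peaks], key=lambda p: p[0])
    # Scale intensities so the strongest kept peak is 1.0 (assumed normalization).
    max_i = max((i for _, i in peaks), default=1.0) or 1.0
    positions = [float(t) for t, _ in peaks]
    intensities = [i / max_i for _, i in peaks]
    # Pad both halves so the vector always has 2 * n_peaks entries.
    n_missing = n_peaks - len(peaks)
    return positions + [pad] * n_missing + intensities + [pad] * n_missing


# Three observed peaks: 17 slots in each half are filled with the pad value.
vec = build_xrd_vector([28.4, 47.3, 56.1], [100.0, 55.0, 30.0])
print(len(vec))  # 40
```

Padding to a fixed length keeps the conditioning vector the same size whether a pattern has 3 peaks or 20, which is what lets missing peaks be flagged by a sentinel value rather than by a change in shape.
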
## How to Get Started with the Model

For instructions on loading and running generation, refer to the `_load_and_generate.py` script in the [CrystaLLM-pi GitHub Repository](https://github.com/C-Bone-UCL/CrystaLLM-pi). This script handles XRD vector tokenization and normalization.

## Training Details

### Training Data

The model underwent a two-stage fine-tuning process:

1. **MatterGen XRD:** Theoretical XRD patterns generated from the MatterGen dataset.
2. **Chili-100K XRD (`c-bone/chili100k_strat`):** An experimentally determined, curated, and filtered subset of inorganic nanomaterials from the COD (accessed April 2026). After deduplication, this comprises \~14K materials derived from \~21K CIFs.

**Dataset Splitting (Chili-100K):**

- **Train:Val:Test Ratio:** 78.6:10.7:10.7
- **Leakage-Aware Test Set:** The test set was strictly stratified to evaluate generalization:
  - **500 materials:** Fully seen during training (LeMaterial, MatterGen XRD, or Chili-100K train/val).
  - **500 materials:** Reduced formula seen during training, but the specific structure was unseen (measured via Structure Novelty metric).
  - **500 materials:** Neither reduced formula nor structure seen in any training phase.

### Training Procedure

- **Architecture:** GPT-2 with **Residual Attention (Slider)** layers (\~47.7M parameters).
- **Mechanism:** The Slider mechanism computes a parallel attention score for the conditioning vector, dynamically weighting it against base self-attention to robustly handle heterogeneous/missing diffraction data.

## Evaluation

### Metrics

The model is evaluated on the leakage-aware test splits using:

1. **Match Rate:** Percentage of ground truth structures successfully recovered.
2. **RMS-d:** Root Mean Square distance between ground truth and generated structures.
3. **Lattice Parameter and Volume MAE:** Mean Absolute Error of predicted unit cell dimensions.
4. **N atoms match:** The average number of atoms in the unit cell of matched materials in the test set.

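
Once each generated structure has been compared with its ground truth, the evaluation metrics could be aggregated along these lines. This is a hedged sketch rather than the authors' evaluation code: the `Result` record and `summarize` function are hypothetical names, the structure comparison that sets `matched` and `rms_d` is not shown, and averaging errors over all test entries and over the cell lengths a, b, c only are assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass
class Result:
    """One test-set entry; every field name here is hypothetical."""
    matched: bool                         # accepted by a structure matcher?
    rms_d: Optional[float]                # RMS distance, set only for matches
    true_abc: Tuple[float, float, float]  # ground-truth cell lengths
    pred_abc: Tuple[float, float, float]  # generated cell lengths
    true_volume: float
    pred_volume: float
    n_atoms: int                          # atoms in the ground-truth unit cell


def summarize(results: List[Result]) -> Dict[str, Optional[float]]:
    """Aggregate per-structure comparisons into the four reported metrics."""
    if not results:
        raise ValueError("results must not be empty")
    matched = [r for r in results if r.matched]
    # Absolute error of each cell length, pooled over all entries (assumption).
    abc_err = [abs(t - p) for r in results for t, p in zip(r.true_abc, r.pred_abc)]
    return {
        "match_rate_pct": 100.0 * len(matched) / len(results),
        "mean_rms_d": mean(r.rms_d for r in matched) if matched else None,
        "lattice_mae": mean(abc_err),
        "volume_mae": mean(abs(r.true_volume - r.pred_volume) for r in results),
        "mean_n_atoms_matched": mean(r.n_atoms for r in matched) if matched else None,
    }


results = [
    Result(True, 0.1, (4.0, 4.0, 4.0), (4.1, 4.0, 3.9), 64.0, 63.0, 8),
    Result(False, None, (5.0, 5.0, 5.0), (5.0, 5.0, 6.0), 125.0, 150.0, 12),
]
print(summarize(results))
```

Reporting the mean atom count of matched entries alongside the match rate shows whether recovery is concentrated in small unit cells.
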
## Citation

**Primary Model Paper:**

```bibtex
@misc{bone2025discoveryrecoverycrystallinematerials,
  title={Discovery and recovery of crystalline materials with property-conditioned transformers},
  author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
  year={2025},
  eprint={2511.21299},
  archivePrefix={arXiv},
  primaryClass={cond-mat.mtrl-sci},
  url={https://arxiv.org/abs/2511.21299},
}
```

**CHILI Dataset:**

```bibtex
@inproceedings{10.1145/3637528.3671538,
  author = {Friis-Jensen, Ulrik and Johansen, Frederik L. and Anker, Andy S. and Dam, Erik B. and Jensen, Kirsten M. \O{}. and Selvan, Raghavendra},
  title = {CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning},
  year = {2024},
  isbn = {9798400704901},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3637528.3671538},
  doi = {10.1145/3637528.3671538},
  pages = {4962–4973},
  numpages = {12},
  keywords = {atomic structure, chemistry, datasets, deep learning, graph neural network, graphs, machine learning, nanomaterials, neutron, scattering, x-ray},
  location = {Barcelona, Spain},
  series = {KDD '24}
}
```