---
language:
 - en
license: mit
library_name: transformers
tags:
- materials-science
- crystallography
- generative-ai
- inverse-design
- chemistry
- xrd
datasets:
- c-bone/chili100k_strat
pipeline_tag: text-generation
---

# Model Card for CrystaLLM-pi\_Chili100K-XRD

## Model Details

### Model Description

**CrystaLLM-pi\_Chili100K-XRD** is a conditional generative model designed for the recovery of crystal structures from X-ray Diffraction (XRD) data. It is a fine-tuned version of the `CrystaLLM-pi` framework, utilizing a GPT-2 decoder-only architecture. This model employs a **Residual Attention (Slider)** mechanism to condition the generation of Crystallographic Information Files (CIFs) on heterogeneous X-ray diffraction data.

The model generates crystal structures based on an XRD pattern input vector consisting of the 20 most intense peaks:

1.  **Peak Positions** ($2\theta$)
2.  **Peak Intensities**
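
As a rough illustration, the input vector could be assembled as follows. This is a minimal sketch, not the repository's preprocessing: the function name `build_xrd_vector`, the descending-intensity ordering, and the scaling of intensities to a maximum of 1.0 are assumptions made for this example. The only values taken from this card are the 20-peak limit and the `-100` padding for missing peaks.

```python
from typing import List, Sequence, Tuple

N_PEAKS = 20        # number of most intense peaks used for conditioning (from this card)
PAD_VALUE = -100.0  # padding value for missing peaks (from this card)


def build_xrd_vector(
    two_theta: Sequence[float],
    intensity: Sequence[float],
    n_peaks: int = N_PEAKS,
    pad_value: float = PAD_VALUE,
) -> Tuple[List[float], List[float]]:
    """Keep the n_peaks most intense peaks and pad the remainder.

    Returns (positions, intensities), each of length n_peaks.
    Hypothetical helper: ordering and normalization are assumptions.
    """
    if len(two_theta) != len(intensity):
        raise ValueError("two_theta and intensity must have the same length")

    # Sort by descending intensity and keep only the strongest peaks.
    peaks = sorted(zip(two_theta, intensity), key=lambda p: p[1], reverse=True)
    peaks = peaks[:n_peaks]

    if not peaks:
        return [pad_value] * n_peaks, [pad_value] * n_peaks

    max_intensity = peaks[0][1]
    if max_intensity <= 0:
        raise ValueError("peak intensities must be positive")

    positions = [float(pos) for pos, _ in peaks]
    # Assumed normalization: strongest peak scaled to 1.0.
    intensities = [float(val) / max_intensity for _, val in peaks]

    # Pad both lists so the vector always has a fixed length.
    n_missing = n_peaks - len(peaks)
    positions += [pad_value] * n_missing
    intensities += [pad_value] * n_missing
    return positions, intensities


positions, intensities = build_xrd_vector([10.0, 20.0, 30.0], [50.0, 100.0, 25.0])
```

For real inputs, the normalization applied by the repository's own script should take precedence over the scaling assumed here.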

The Chili-100K XRD dataset used for fine-tuning contains experimentally determined structures sourced from Chili-100K, a curated and filtered subset of the Crystallography Open Database (COD) restricted to inorganic nanomaterials. Notably, this model features an extended context window of **1536 tokens**, enabling the generation of larger and more complex unit cells containing up to \~100 atoms.

  - **Developed by:** Bone et al. (University College London)
  - **Model type:** Autoregressive Transformer with Residual Attention Conditioning
  - **Language(s):** CIF (Crystallographic Information File) syntax
  - **License:** MIT
  - **Finetuned from model:** `c-bone/CrystaLLM-pi_Mattergen-XRD`

### Model Sources

  - **Repository:** [GitHub: CrystaLLM-pi](https://github.com/C-Bone-UCL/CrystaLLM-pi)
  - **Paper:** [Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)](https://arxiv.org/abs/2511.21299)
  - **Dataset:** [HuggingFace: c-bone/mattergen\_XRD](https://huggingface.co/datasets/c-bone/mattergen_XRD) (Stage 1), [HuggingFace: c-bone/chili100k\_strat](https://huggingface.co/datasets/c-bone/chili100k_strat) (Stage 2)

## Uses

### Direct Use

The model is intended for structure solution and recovery from powder XRD data. Researchers can input a list of peak positions and intensities derived from experimental diffraction patterns to generate candidate crystal structures that match the experimental signature.

### Out-of-Scope Use

  - **Disordered Systems:** The model does not natively handle partial occupancies or significant disorder.
  - **Organic/MOFs:** The training data was strictly filtered for inorganic nanomaterials, following the Chili-100K dataset methodology.
  - **Extremely Large Unit Cells:** Although the context window is expanded to 1536 tokens, structures with many atoms per unit cell may show degraded generation quality.

## Bias, Risks, and Limitations

  - **Experimental Noise:** Performance depends on the quality of the input peak extraction and on how rare the material is in the training data.
  - **Missing Data:** The "Slider" mechanism handles missing peaks (padded with -100), but significant data loss degrades recovery rates.
  - **Polymorphs:** In cases of strong structural similarity, the model may bias towards the polymorph most prevalent in the Chili-100K distribution.

## How to Get Started with the Model

For instructions on loading and running generation, refer to the `_load_and_generate.py` script in the [CrystaLLM-pi GitHub Repository](https://github.com/C-Bone-UCL/CrystaLLM-pi). This script handles XRD vector tokenization and normalization.

## Training Details

### Training Data

The model underwent a two-stage fine-tuning process:

1.  **MatterGen XRD:** Theoretical XRD patterns generated from the MatterGen dataset.
2.  **Chili-100K XRD (`c-bone/chili100k_strat`):** An experimentally determined, curated, and filtered subset of inorganic nanomaterials from the COD (accessed April 2026). After deduplication, this comprises \~14K materials derived from \~21K CIFs.

**Dataset Splitting (Chili-100K):**

  - **Train:Val:Test Ratio:** 78.6:10.7:10.7
  - **Leakage-Aware Test Set:** The test set was strictly stratified to evaluate generalization:
      - **500 materials:** Fully seen during training (LeMaterial, MatterGen XRD, or Chili-100K train/val).
      - **500 materials:** Reduced formula seen during training, but the specific structure was unseen (measured via Structure Novelty metric).
      - **500 materials:** Neither reduced formula nor structure seen in any training phase.
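
The three-way stratification can be sketched as a simple bucketing step. This is an illustrative sketch only: the `(reduced_formula, structure_id)` representation and the function names are hypothetical, and a hashable structure identifier stands in for the structure-level novelty check, which in practice requires a proper structure comparison.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical representation: (reduced_formula, structure_id).
Material = Tuple[str, str]


def leakage_bucket(
    material: Material,
    seen_formulas: Set[str],
    seen_structures: Set[str],
) -> str:
    """Assign one material to a leakage category."""
    formula, structure_id = material
    if structure_id in seen_structures:
        return "seen"            # structure present in a training phase
    if formula in seen_formulas:
        return "formula_seen"    # formula known, structure new
    return "unseen"              # neither formula nor structure known


def stratify(
    materials: Iterable[Material],
    seen_formulas: Set[str],
    seen_structures: Set[str],
) -> Dict[str, List[Material]]:
    """Group candidate test materials by leakage category."""
    buckets: Dict[str, List[Material]] = {
        "seen": [],
        "formula_seen": [],
        "unseen": [],
    }
    for material in materials:
        key = leakage_bucket(material, seen_formulas, seen_structures)
        buckets[key].append(material)
    return buckets


# Toy example with made-up identifiers.
seen_formulas = {"NaCl", "TiO2"}
seen_structures = {"NaCl-rocksalt", "TiO2-rutile"}
candidates = [
    ("NaCl", "NaCl-rocksalt"),
    ("TiO2", "TiO2-anatase"),
    ("ZnO", "ZnO-wurtzite"),
]
buckets = stratify(candidates, seen_formulas, seen_structures)
```

Sampling an equal number of materials from each bucket would then give a test set shaped like the one described above.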

### Training Procedure

  - **Architecture:** GPT-2 with **Residual Attention (Slider)** layers (\~47.7M parameters).
  - **Mechanism:** The Slider mechanism computes a parallel attention score for the conditioning vector, dynamically weighting it against base self-attention to robustly handle heterogeneous/missing diffraction data.
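
The idea of a parallel conditioning attention added to base self-attention can be illustrated with a toy, single-query example. This is a conceptual sketch under stated assumptions, not the CrystaLLM-pi implementation: the function names, the scalar `gate`, and the boolean `cond_present` mask are all hypothetical. In a trained model the weighting would be learned or computed from the inputs, a detail this card does not specify.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> List[float]:
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def residual_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    cond_keys: Sequence[Sequence[float]],
    cond_values: Sequence[Sequence[float]],
    cond_present: Sequence[bool],
    gate: float,
) -> List[float]:
    """Base self-attention plus a gated attention over conditioning entries.

    Entries flagged as missing (cond_present False) are excluded, so a
    fully missing condition falls back to plain self-attention.
    """
    base = attend(query, keys, values)
    kept = [
        (k, v)
        for k, v, present in zip(cond_keys, cond_values, cond_present)
        if present
    ]
    if not kept:
        return base
    cond = attend(query, [k for k, _ in kept], [v for _, v in kept])
    # Assumed combination rule: residual sum weighted by a scalar gate.
    return [b + gate * c for b, c in zip(base, cond)]


out = residual_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0]],
    values=[[2.0, 3.0]],
    cond_keys=[[0.0, 1.0]],
    cond_values=[[1.0, 1.0]],
    cond_present=[True],
    gate=0.5,
)
```

The fallback branch shows why such a design tolerates missing peaks: when no conditioning entry is available, the output reduces to the base self-attention result.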

## Evaluation

### Metrics

The model is evaluated on the leakage-aware test splits using:

1.  **Match Rate:** Percentage of ground truth structures successfully recovered.
2.  **RMS-d:** Root Mean Square distance between ground truth and generated structures.
3.  **Lattice Parameter and Volume MAE:** Mean Absolute Error of predicted unit cell dimensions.
4.  **N atoms match:** The average number of atoms per unit cell among the matched materials in the test set.
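
Given per-structure results, these aggregates could be computed as in the sketch below. It assumes the structure comparison has already been run, with each test entry carrying either an RMS distance (matched) or `None` (no match). The function names and input layout are hypothetical, and the matching criterion itself is outside this example.

```python
from typing import List, Optional, Sequence


def match_rate(rms_d: Sequence[Optional[float]]) -> float:
    """Percentage of test structures with a matched generation."""
    if not rms_d:
        raise ValueError("no results to aggregate")
    matched = sum(1 for r in rms_d if r is not None)
    return 100.0 * matched / len(rms_d)


def mean_rms_d(rms_d: Sequence[Optional[float]]) -> Optional[float]:
    """Mean RMS distance over matched structures, or None if none matched."""
    matched = [r for r in rms_d if r is not None]
    if not matched:
        return None
    return sum(matched) / len(matched)


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """MAE between predicted and reference values (lattice lengths, volumes)."""
    if len(pred) != len(true) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def mean_atoms_matched(
    n_atoms: Sequence[int], rms_d: Sequence[Optional[float]]
) -> Optional[float]:
    """Average unit-cell atom count over matched structures only."""
    counts: List[int] = [n for n, r in zip(n_atoms, rms_d) if r is not None]
    if not counts:
        return None
    return sum(counts) / len(counts)


# Toy results: two of four test structures were matched.
rms = [0.02, None, 0.10, None]
rate = match_rate(rms)
avg_atoms = mean_atoms_matched([8, 40, 12, 96], rms)
lattice_mae = mean_absolute_error([4.0, 5.5], [4.2, 5.0])
```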

## Citation

**Primary Model Paper:**

```bibtex
@misc{bone2025discoveryrecoverycrystallinematerials,
      title={Discovery and recovery of crystalline materials with property-conditioned transformers}, 
      author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
      year={2025},
      eprint={2511.21299},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2511.21299}, 
}
```

**CHILI Dataset:**

```bibtex
@inproceedings{10.1145/3637528.3671538,
      author = {Friis-Jensen, Ulrik and Johansen, Frederik L. and Anker, Andy S. and Dam, Erik B. and Jensen, Kirsten M. \O{}. and Selvan, Raghavendra},
      title = {CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning},
      year = {2024},
      isbn = {9798400704901},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3637528.3671538},
      doi = {10.1145/3637528.3671538},
      pages = {4962--4973},
      numpages = {12},
      keywords = {atomic structure, chemistry, datasets, deep learning, graph neural network, graphs, machine learning, nanomaterials, neutron, scattering, x-ray},
      location = {Barcelona, Spain},
      series = {KDD '24}
}
```