---
language:
 - en
license: mit
library_name: transformers
tags:
- materials-science
- crystallography
- generative-ai
- inverse-design
- chemistry
- xrd
datasets:
- c-bone/chili100k_strat
pipeline_tag: text-generation
---

# Model Card for CrystaLLM-pi\_Chili100K-XRD

## Model Details

### Model Description

**CrystaLLM-pi\_Chili100K-XRD** is a conditional generative model designed for the recovery of crystal structures from X-ray Diffraction (XRD) data. It is a fine-tuned version of the `CrystaLLM-pi` framework, utilizing a GPT-2 decoder-only architecture. This model employs a **Residual Attention (Slider)** mechanism to condition the generation of Crystallographic Information Files (CIFs) on heterogeneous X-ray diffraction data.

The model generates crystal structures based on an XRD pattern input vector consisting of the 20 most intense peaks:

1.  **Peak Positions** ($2\theta$)
2.  **Peak Intensities**
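
As a rough illustration of how such an input could be prepared, the sketch below keeps the 20 most intense peaks from a list of extracted peaks. The function name `top_peaks`, the ordering by decreasing intensity, and the scaling of intensities to a maximum of 1.0 are assumptions made for this example; the repository's own script handles the actual tokenization and normalization.

```python
from typing import List, Tuple

N_PEAKS = 20  # the card states that the 20 most intense peaks are used


def top_peaks(two_theta: List[float], intensity: List[float],
              n_peaks: int = N_PEAKS) -> Tuple[List[float], List[float]]:
    """Keep the n_peaks most intense peaks, strongest first (hypothetical helper)."""
    if len(two_theta) != len(intensity):
        raise ValueError("two_theta and intensity must have the same length")
    # Sort peak indices by intensity, strongest first, and keep the top n_peaks.
    order = sorted(range(len(intensity)),
                   key=lambda i: intensity[i], reverse=True)[:n_peaks]
    # Fall back to 1.0 if the list is empty or the strongest peak is zero.
    max_i = max((intensity[i] for i in order), default=1.0) or 1.0
    positions = [two_theta[i] for i in order]
    # Assumption: intensities are scaled so that the strongest peak equals 1.0.
    scaled = [intensity[i] / max_i for i in order]
    return positions, scaled
```

For example, `top_peaks([10.0, 20.0, 30.0], [5.0, 50.0, 25.0], n_peaks=2)` keeps the peaks at 20.0 and 30.0 degrees with scaled intensities 1.0 and 0.5.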

The model is fine-tuned on the Chili-100K XRD dataset, which contains experimentally determined structures sourced from Chili-100K, a curated and filtered subset of the Crystallography Open Database (COD) covering experimental inorganic nanomaterials. Notably, this model features an extended context window of **1536 tokens**, enabling the generation of larger and more complex unit cells containing up to \~100 atoms.

  - **Developed by:** Bone et al. (University College London)
  - **Model type:** Autoregressive Transformer with Residual Attention Conditioning
  - **Language(s):** CIF (Crystallographic Information File) syntax
  - **License:** MIT
  - **Finetuned from model:** `c-bone/CrystaLLM-pi_Mattergen-XRD`

### Model Sources

  - **Repository:** [GitHub: CrystaLLM-pi](https://github.com/C-Bone-UCL/CrystaLLM-pi)
  - **Paper:** [Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)](https://arxiv.org/abs/2511.21299)
  - **Dataset:** [HuggingFace: c-bone/mattergen\_XRD](https://huggingface.co/datasets/c-bone/mattergen_XRD) (Stage 1), [HuggingFace: c-bone/chili100k\_strat](https://huggingface.co/datasets/c-bone/chili100k_strat) (Stage 2)

## Uses

### Direct Use

The model is intended for structure solution and recovery from powder XRD data. Researchers can input a list of peak positions and intensities derived from experimental diffraction patterns to generate candidate crystal structures that match the experimental signature.

### Out-of-Scope Use

  - **Disordered Systems:** The model does not natively handle partial occupancies or significant disorder.
  - **Organic/MOFs:** The training data was strictly filtered for inorganic nanomaterials as per the Chili-100K dataset methodology.
  - **Extremely Large Unit Cells:** Although the context window is expanded to 1536 tokens, structures with many atoms per unit cell may show degraded generation quality.

## Bias, Risks, and Limitations

  - **Experimental Noise:** Performance depends on the quality of the input peak extraction and on how rare the material is.
  - **Missing Data:** The "Slider" mechanism handles missing peaks (padded with -100), but significant data loss degrades recovery rates.
  - **Polymorphs:** In cases of strong structural similarity, the model may be biased towards the polymorph most prevalent in the Chili-100K distribution.
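
The padding convention for missing peaks can be sketched as follows. This is a minimal illustration, not the repository's code: the helper name `pad_peaks` and the boolean mask it returns are hypothetical, while the padding value of -100 and the vector length of 20 come from this card.

```python
from typing import List, Tuple

PAD_VALUE = -100.0  # padding value named in this card for missing peaks
N_PEAKS = 20        # number of peaks per input vector, from the model description


def pad_peaks(values: List[float], n_peaks: int = N_PEAKS,
              pad_value: float = PAD_VALUE) -> Tuple[List[float], List[bool]]:
    """Pad (or truncate) a peak list to n_peaks and flag which slots hold real data."""
    kept = list(values[:n_peaks])
    n_missing = n_peaks - len(kept)
    padded = kept + [pad_value] * n_missing
    # Hypothetical mask: True for measured peaks, False for padded slots.
    mask = [True] * len(kept) + [False] * n_missing
    return padded, mask
```

A pattern with only two usable peaks, padded to length four for brevity, would give `pad_peaks([12.5, 28.1], n_peaks=4)` equal to `([12.5, 28.1, -100.0, -100.0], [True, True, False, False])`.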

## How to Get Started with the Model

For instructions on loading and running generation, refer to the `_load_and_generate.py` script in the [CrystaLLM-pi GitHub Repository](https://github.com/C-Bone-UCL/CrystaLLM-pi). This script handles XRD vector tokenization and normalization.

## Training Details

### Training Data

The model underwent a two-stage fine-tuning process:

1.  **MatterGen XRD:** Theoretical XRD patterns generated from the MatterGen dataset.
2.  **Chili-100K XRD (`c-bone/chili100k_strat`):** An experimentally determined, curated, and filtered subset of inorganic nanomaterials from the COD (accessed April 2026). After deduplication, this comprises \~14K materials derived from \~21K CIFs.

**Dataset Splitting (Chili-100K):**

  - **Train:Val:Test Ratio:** 78.6:10.7:10.7
  - **Leakage-Aware Test Set:** The test set was strictly stratified to evaluate generalization:
      - **500 materials:** Fully seen during training (LeMaterial, MatterGen XRD, or Chili-100K train/val).
      - **500 materials:** Reduced formula seen during training, but the specific structure was unseen (measured via Structure Novelty metric).
      - **500 materials:** Neither reduced formula nor structure seen in any training phase.
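
The three strata can be expressed as a simple decision rule. The sketch below assumes each material has a reduced formula string and a hypothetical canonical structure key (`structure_id`); in practice the card states that structure novelty is measured with a Structure Novelty metric, which a set lookup only approximates.

```python
from typing import Set


def leakage_stratum(formula: str, structure_id: str,
                    seen_formulas: Set[str], seen_structures: Set[str]) -> str:
    """Assign a test material to one of the three strata described above."""
    if structure_id in seen_structures:
        return "seen"          # structure appeared in some training phase
    if formula in seen_formulas:
        return "formula_seen"  # reduced formula known, structure new
    return "unseen"            # neither reduced formula nor structure known
```

If the three strata make up the whole test set, 1,500 materials at a 10.7% share imply roughly 14,000 materials overall (1500 / 0.107 ≈ 14,019), which is consistent with the \~14K figure quoted above.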

### Training Procedure

  - **Architecture:** GPT-2 with **Residual Attention (Slider)** layers (\~47.7M parameters).
  - **Mechanism:** The Slider mechanism computes a parallel attention score for the conditioning vector, dynamically weighting it against base self-attention to robustly handle heterogeneous/missing diffraction data.
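
One way to picture a parallel conditioning branch is the NumPy sketch below. It is only an interpretation of the description above and not the authors' implementation: the function name, the single-head layout, the projected conditioning keys and values (`cond_k`, `cond_v`), and the sigmoid gate derived from the strongest conditioning score are all assumptions. See the paper and repository for the actual Slider layer.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax that tolerates -inf entries in masked positions."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def gated_conditioning_attention(q, k, v, cond_k, cond_v, cond_mask):
    """Single-head sketch: causal self-attention plus a gated conditioning branch.

    q, k, v:        (T, d) token queries, keys, values
    cond_k, cond_v: (C, d) keys/values for the C conditioning slots (assumed)
    cond_mask:      (C,) bool, False where a slot is missing (padded)
    """
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)

    # Base branch: standard causal self-attention over the token sequence.
    scores = (q @ k.T) * scale
    causal = np.tril(np.ones((T, T), dtype=bool))
    base = softmax(np.where(causal, scores, -np.inf)) @ v

    # With no usable conditioning slots, fall back to the base branch alone.
    if not cond_mask.any():
        return base

    # Parallel branch: every token attends over the valid conditioning slots.
    c_scores = np.where(cond_mask[None, :], (q @ cond_k.T) * scale, -np.inf)
    cond = softmax(c_scores) @ cond_v

    # Assumed gate: sigmoid of each token's strongest conditioning score.
    gate = 1.0 / (1.0 + np.exp(-c_scores.max(axis=-1, keepdims=True)))

    # Residual combination: conditioning is added on top of the base output.
    return base + gate * cond
```

The point of the sketch is the structure: missing slots are masked out of the conditioning branch, and the conditioning contribution is added residually with a per-token weight, so the base self-attention output is preserved when no conditioning data is available.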

## Evaluation

### Metrics

The model is evaluated on the leakage-aware test splits using:

1.  **Match Rate:** Percentage of ground truth structures successfully recovered.
2.  **RMS-d:** Root Mean Square distance between ground truth and generated structures.
3.  **Lattice Parameter and Volume MAE:** Mean Absolute Error of predicted unit cell dimensions.
4.  **N atoms match:** The average number of atoms per unit cell among the matched materials in the test set.
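
To make the aggregation concrete, the sketch below summarizes a list of per-structure results. The record layout (`rms_d`, `vol_pred`, `vol_true`, `n_atoms`) is hypothetical, and computing the volume MAE over matched structures only is an assumption of this example; the structure matching that produces `rms_d` is not shown.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize(results: List[Dict[str, Optional[float]]]) -> Dict[str, float]:
    """Aggregate per-structure results; 'rms_d' is None when no match was found."""
    if not results:
        raise ValueError("results must not be empty")
    matched = [r for r in results if r["rms_d"] is not None]
    # Match rate: percentage of ground-truth structures that were recovered.
    summary = {"match_rate": 100.0 * len(matched) / len(results)}
    if matched:
        summary["mean_rms_d"] = mean(r["rms_d"] for r in matched)
        # Assumption: volume MAE is taken over matched structures only.
        summary["volume_mae"] = mean(abs(r["vol_pred"] - r["vol_true"]) for r in matched)
        summary["n_atoms_match"] = mean(r["n_atoms"] for r in matched)
    return summary
```

Lattice parameter MAE would follow the same pattern as the volume term, applied to each of the cell lengths and angles.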

## Citation

**Primary Model Paper:**

```bibtex
@misc{bone2025discoveryrecoverycrystallinematerials,
      title={Discovery and recovery of crystalline materials with property-conditioned transformers}, 
      author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
      year={2025},
      eprint={2511.21299},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2511.21299}, 
}
```

**CHILI Dataset:**

```bibtex
@inproceedings{10.1145/3637528.3671538,
      author = {Friis-Jensen, Ulrik and Johansen, Frederik L. and Anker, Andy S. and Dam, Erik B. and Jensen, Kirsten M. \O{}. and Selvan, Raghavendra},
      title = {CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning},
      year = {2024},
      isbn = {9798400704901},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3637528.3671538},
      doi = {10.1145/3637528.3671538},
      pages = {4962--4973},
      numpages = {12},
      keywords = {atomic structure, chemistry, datasets, deep learning, graph neural network, graphs, machine learning, nanomaterials, neutron, scattering, x-ray},
      location = {Barcelona, Spain},
      series = {KDD '24}
}
```