---
language:
- en
license: mit
library_name: transformers
tags:
- materials-science
- crystallography
- generative-ai
- inverse-design
- chemistry
- xrd
datasets:
- c-bone/chili100k_strat
pipeline_tag: text-generation
---
# Model Card for CrystaLLM-pi\_Chili100K-XRD

## Model Details

### Model Description

**CrystaLLM-pi\_Chili100K-XRD** is a conditional generative model for recovering crystal structures from X-ray diffraction (XRD) data. It is a fine-tuned model from the `CrystaLLM-pi` framework and uses a GPT-2 decoder-only architecture. The model employs a **Residual Attention (Slider)** mechanism to condition the generation of Crystallographic Information Files (CIFs) on heterogeneous XRD data.

The model generates crystal structures from an XRD input vector built from the 20 most intense peaks:

1. **Peak Positions** ($2\theta$)
2. **Peak Intensities**
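
As a rough illustration of the input described above, the sketch below keeps the 20 most intense peaks and pads shorter patterns with -100, the padding value this card mentions for missing peaks. The helper name `build_xrd_condition`, the ordering by descending intensity, and the absence of any normalization are assumptions made for this example; the repository's own script handles the actual tokenization and normalization.

```python
from typing import List, Sequence, Tuple

N_PEAKS = 20        # the card states that the 20 most intense peaks are used
PAD_VALUE = -100.0  # the card states that missing peaks are padded with -100


def build_xrd_condition(
    two_theta: Sequence[float],
    intensity: Sequence[float],
    n_peaks: int = N_PEAKS,
    pad_value: float = PAD_VALUE,
) -> Tuple[List[float], List[float]]:
    """Keep the n_peaks most intense peaks and pad both lists to a fixed length.

    Hypothetical helper: sorting by descending intensity is an assumption,
    and no normalization is applied here.
    """
    if len(two_theta) != len(intensity):
        raise ValueError("two_theta and intensity must have the same length")

    # Sort peaks from most to least intense and keep the strongest n_peaks.
    peaks = sorted(zip(two_theta, intensity), key=lambda p: p[1], reverse=True)
    peaks = peaks[:n_peaks]

    positions = [float(p[0]) for p in peaks]
    intensities = [float(p[1]) for p in peaks]

    # Pad short patterns so the input always has the same length.
    missing = n_peaks - len(peaks)
    positions += [pad_value] * missing
    intensities += [pad_value] * missing
    return positions, intensities


# A pattern with only three peaks: the remaining 17 slots are padded.
positions, intensities = build_xrd_condition(
    two_theta=[28.4, 47.3, 56.1],
    intensity=[100.0, 55.0, 30.0],
)
```

Both returned lists always have length 20, so patterns with fewer peaks and patterns with more peaks end up in the same fixed-size form.
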

The model is fine-tuned on the Chili-100K XRD dataset, which contains experimentally determined structures sourced from Chili-100K, a curated and filtered subset of inorganic nanomaterial structures from the Crystallography Open Database (COD). Notably, the model has an extended context window of **1536 tokens**, which enables the generation of larger and more complex unit cells containing up to \~100 atoms.

- **Developed by:** Bone et al. (University College London)
- **Model type:** Autoregressive Transformer with Residual Attention Conditioning
- **Language(s):** CIF (Crystallographic Information File) syntax
- **License:** MIT
- **Finetuned from model:** `c-bone/CrystaLLM-pi_Mattergen-XRD`

### Model Sources

- **Repository:** [GitHub: CrystaLLM-pi](https://github.com/C-Bone-UCL/CrystaLLM-pi)
- **Paper:** [Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)](https://arxiv.org/abs/2511.21299)
- **Dataset:** [HuggingFace: c-bone/mattergen\_XRD](https://huggingface.co/datasets/c-bone/mattergen_XRD) (Stage 1), [HuggingFace: c-bone/chili100k\_strat](https://huggingface.co/datasets/c-bone/chili100k_strat) (Stage 2)

## Uses

### Direct Use

The model is intended for structure solution and recovery from powder XRD data. Researchers can input a list of peak positions and intensities derived from experimental diffraction patterns to generate candidate crystal structures that match the experimental signature.
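
To make "peak positions and intensities derived from experimental diffraction patterns" concrete, here is a deliberately naive peak picker. It is not part of CrystaLLM-pi: the function name `pick_peaks`, the local-maximum rule, the 5% height threshold, and the rescaling of the strongest peak to 100 are all assumptions for illustration. Real powder data would normally need background subtraction and smoothing first.

```python
from typing import List, Sequence, Tuple


def pick_peaks(
    two_theta: Sequence[float],
    counts: Sequence[float],
    min_rel_height: float = 0.05,
) -> List[Tuple[float, float]]:
    """Return (2theta, relative intensity) pairs for simple local maxima.

    Hypothetical helper: intensities are rescaled so that the strongest
    point is 100, and maxima below min_rel_height of it are dropped.
    """
    if len(two_theta) != len(counts):
        raise ValueError("two_theta and counts must have the same length")

    top = max(counts, default=0.0)
    if top <= 0:
        return []

    peaks = []
    for i in range(1, len(counts) - 1):
        # A point is a peak if it rises above its left neighbour and is
        # not exceeded by its right neighbour.
        is_local_max = counts[i] > counts[i - 1] and counts[i] >= counts[i + 1]
        if is_local_max and counts[i] >= min_rel_height * top:
            peaks.append((float(two_theta[i]), 100.0 * counts[i] / top))
    return peaks


# Toy pattern with two maxima, at 10.2 and 10.5 degrees.
peaks = pick_peaks(
    two_theta=[10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6],
    counts=[1, 5, 50, 6, 2, 20, 3],
)
```

The resulting list of positions and relative intensities is the kind of input the model is conditioned on.
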

### Out-of-Scope Use

- **Disordered Systems:** The model does not natively handle partial occupancies or significant disorder.
- **Organic/MOFs:** The training data was strictly filtered for inorganic nanomaterials, following the Chili-100K dataset methodology.
- **Extremely Large Unit Cells:** Although the context window is expanded to 1536 tokens, structures with many atoms per unit cell may face degradation in generation quality.

## Bias, Risks, and Limitations

- **Experimental Noise:** Performance depends on the quality of the input peak extraction and on the rarity of the material.
- **Missing Data:** The "Slider" mechanism handles missing peaks (padded with -100), but significant data loss degrades recovery rates.
- **Polymorphs:** In cases of strong structural similarity, the model may be biased towards the polymorph most prevalent in the Chili-100K distribution.

## How to Get Started with the Model

For instructions on loading and running generation, refer to the `_load_and_generate.py` script in the [CrystaLLM-pi GitHub Repository](https://github.com/C-Bone-UCL/CrystaLLM-pi). This script handles XRD vector tokenization and normalization.

## Training Details

### Training Data

The model underwent a two-stage fine-tuning process:

1. **MatterGen XRD:** Theoretical XRD patterns generated from the MatterGen dataset.
2. **Chili-100K XRD (`c-bone/chili100k_strat`):** An experimentally determined, curated, and filtered subset of inorganic nanomaterials from the COD (accessed April 2026). After deduplication, this comprises \~14K materials derived from \~21K CIFs.

**Dataset Splitting (Chili-100K):**

- **Train:Val:Test Ratio:** 78.6:10.7:10.7
- **Leakage-Aware Test Set:** The test set was strictly stratified into three groups to evaluate generalization:
  - **500 materials:** Fully seen during training (LeMaterial, MatterGen XRD, or Chili-100K train/val).
  - **500 materials:** Reduced formula seen during training, but the specific structure unseen (measured via the Structure Novelty metric).
  - **500 materials:** Neither reduced formula nor structure seen in any training phase.
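
The three test strata above can be read as a simple decision rule. The sketch below is one way to express that rule, not the authors' pipeline: the names `reduced_formula` and `leakage_bucket` are hypothetical, and set membership on a structure identifier stands in for the Structure Novelty metric, which is not reproduced here.

```python
from functools import reduce
from math import gcd
from typing import Dict, Set, Tuple

Formula = Tuple[Tuple[str, int], ...]


def reduced_formula(composition: Dict[str, int]) -> Formula:
    """Divide element counts by their greatest common divisor."""
    divisor = reduce(gcd, composition.values())
    return tuple(sorted((el, n // divisor) for el, n in composition.items()))


def leakage_bucket(
    composition: Dict[str, int],
    structure_id: str,
    seen_formulas: Set[Formula],
    seen_structures: Set[str],
) -> str:
    """Assign a test material to one of the three strata.

    Assumption: a structure counts as seen if its identifier is in
    seen_structures; the card uses a Structure Novelty metric instead.
    """
    if structure_id in seen_structures:
        return "seen"
    if reduced_formula(composition) in seen_formulas:
        return "formula_seen"
    return "unseen"


# Toy training-side records (illustrative values only).
seen_formulas = {reduced_formula({"Ti": 1, "O": 2})}
seen_structures = {"TiO2-rutile"}

print(leakage_bucket({"Ti": 2, "O": 4}, "TiO2-rutile", seen_formulas, seen_structures))
print(leakage_bucket({"Ti": 4, "O": 8}, "TiO2-anatase", seen_formulas, seen_structures))
print(leakage_bucket({"Zn": 1, "O": 1}, "ZnO-wurtzite", seen_formulas, seen_structures))
```

Reporting metrics separately per bucket keeps results on memorized materials from masking weaker performance on genuinely new ones.
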

### Training Procedure

- **Architecture:** GPT-2 with **Residual Attention (Slider)** layers (\~47.7M parameters).
- **Mechanism:** The Slider mechanism computes a parallel attention score for the conditioning vector, dynamically weighting it against base self-attention to robustly handle heterogeneous/missing diffraction data.
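
The following is a conceptual sketch of the idea described above, not the CrystaLLM-pi implementation. It assumes a single attention head, one query position, and a fixed scalar `gate` in place of whatever dynamic weighting the real layers use; the function names and the choice to skip padded entries entirely are likewise assumptions.

```python
import math
from typing import List, Sequence

PAD_VALUE = -100.0  # the card states that missing peaks are padded with -100


def attend(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    keep: Sequence[bool],
) -> List[float]:
    """Scaled dot-product attention for one query, skipping masked entries."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key, m in zip(keys, keep) if m
    ]
    kept = [v for v, m in zip(values, keep) if m]
    if not scores:
        # Nothing to attend to: contribute a zero vector (assumed behaviour).
        return [0.0] * len(query)
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, kept)) / total
            for j in range(len(kept[0]))]


def slider_attention(query, self_keys, self_values,
                     cond_keys, cond_values, cond_raw, gate):
    """Base self-attention plus a gated, parallel attention over the condition."""
    base = attend(query, self_keys, self_values, [True] * len(self_keys))
    keep = [c != PAD_VALUE for c in cond_raw]  # ignore padded (missing) peaks
    cond = attend(query, cond_keys, cond_values, keep)
    return [b + gate * c for b, c in zip(base, cond)]


query = [1.0, 0.0]
self_keys = [[1.0, 0.0], [0.0, 1.0]]
self_values = [[1.0, 0.0], [0.0, 1.0]]
cond_keys = [[1.0, 0.0], [0.0, 0.0]]
cond_values = [[2.0, 2.0], [9.0, 9.0]]

base = attend(query, self_keys, self_values, [True, True])
out = slider_attention(query, self_keys, self_values,
                       cond_keys, cond_values, [28.4, PAD_VALUE], gate=0.5)
all_missing = slider_attention(query, self_keys, self_values,
                               cond_keys, cond_values, [PAD_VALUE, PAD_VALUE], gate=0.5)
```

In this toy setup, a fully padded condition leaves the output equal to plain self-attention, which is one way a residual design can degrade gracefully when diffraction data is missing.
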

## Evaluation

### Metrics

The model is evaluated on the leakage-aware test splits using:

1. **Match Rate:** Percentage of ground truth structures successfully recovered.
2. **RMS-d:** Root-mean-square distance between ground truth and generated structures.
3. **Lattice Parameter and Volume MAE:** Mean absolute error of predicted unit cell dimensions.
4. **N atoms match:** Average number of atoms per unit cell among the matched materials in the test set.
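
The aggregate metrics above reduce to simple arithmetic once each generated structure has been compared with its ground truth. The sketch below assumes that comparison has already produced a per-material boolean `matched` flag; how a match is decided and how RMS-d is computed are not specified in this card and are not reproduced here. All function names and values are illustrative.

```python
from typing import Sequence


def match_rate(matched: Sequence[bool]) -> float:
    """Percentage of ground-truth structures flagged as recovered."""
    if not matched:
        raise ValueError("matched must not be empty")
    return 100.0 * sum(matched) / len(matched)


def mean_absolute_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute error between predicted and reference values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def mean_atoms_matched(n_atoms: Sequence[int], matched: Sequence[bool]) -> float:
    """Average unit-cell atom count over the matched materials only."""
    kept = [n for n, m in zip(n_atoms, matched) if m]
    if not kept:
        raise ValueError("no matched materials")
    return sum(kept) / len(kept)


# Toy values for four test materials (not real results).
matched = [True, False, True, True]
rate = match_rate(matched)                                   # 75.0
a_mae = mean_absolute_error([4.05, 5.10], [4.00, 5.00])      # lattice parameter a, in angstrom
avg_atoms = mean_atoms_matched([8, 40, 12, 4], matched)      # 8.0
```

Computing these per leakage bucket rather than over the whole test set keeps the seen and unseen strata comparable.
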

## Citation

**Primary Model Paper:**

```bibtex
@misc{bone2025discoveryrecoverycrystallinematerials,
  title={Discovery and recovery of crystalline materials with property-conditioned transformers},
  author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
  year={2025},
  eprint={2511.21299},
  archivePrefix={arXiv},
  primaryClass={cond-mat.mtrl-sci},
  url={https://arxiv.org/abs/2511.21299},
}
```

**CHILI Dataset:**

```bibtex
@inproceedings{10.1145/3637528.3671538,
  author = {Friis-Jensen, Ulrik and Johansen, Frederik L. and Anker, Andy S. and Dam, Erik B. and Jensen, Kirsten M. \O{}. and Selvan, Raghavendra},
  title = {CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning},
  year = {2024},
  isbn = {9798400704901},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3637528.3671538},
  doi = {10.1145/3637528.3671538},
  pages = {4962–4973},
  numpages = {12},
  keywords = {atomic structure, chemistry, datasets, deep learning, graph neural network, graphs, machine learning, nanomaterials, neutron, scattering, x-ray},
  location = {Barcelona, Spain},
  series = {KDD '24}
}
```