CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning
CLOUD (Crystal Language mOdel for Unified and Differentiable materials modeling) is a Transformer-based foundation model that learns crystal representations from string encodings of crystal structures. Crystals are serialized with a novel Symmetry-Consistent Ordered Parameter Encoding (SCOPE), a compact, coordinate-free representation that captures space-group symmetry, Wyckoff positions, and composition. The model can be fine-tuned for accurate, generalizable, and scalable property prediction, and can be combined with physical laws (e.g., the Debye model) for thermodynamically consistent predictions.
- Paper: CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning, Nature Communications 17, 4074 (2026); also available as an arXiv preprint
- Code: github.com/BattModels/CLOUD
- Authors: Changwen Xu, Shang Zhu, Venkatasubramanian Viswanathan (University of Michigan)
Model Details
| Field | Value |
| --- | --- |
| Architecture | BERT encoder (BertForMaskedLM) |
| Hidden size | 768 |
| Hidden layers | 12 |
| Attention heads | 12 |
| Intermediate size | 3072 |
| Max sequence length | 64 |
| Vocab size | 30522 (custom SCOPE tokenizer) |
| Parameters | ~110M |
| Precision | float32 |
| Pretraining objective | Masked language modeling on SCOPE strings |
| Pretraining data | ~6M crystal structures from OPTIMADE |
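As a quick sanity check, these hyperparameters can be read back from the released config (a minimal sketch using the standard transformers API; the printed values reflect the table above):

```python
from transformers import AutoConfig

# Load the released config and confirm the architecture hyperparameters.
config = AutoConfig.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
print(config.hidden_size)          # 768
print(config.num_hidden_layers)    # 12
print(config.num_attention_heads)  # 12
print(config.intermediate_size)    # 3072
print(config.vocab_size)           # 30522
```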
Repository Layout
ckpt/
├── config.json             # BertForMaskedLM config
├── generation_config.json
├── model.safetensors       # Pretrained weights (~530 MB)
├── training_args.bin       # HF Trainer arguments used for pretraining
├── tokenizer_config.json   # SCOPE tokenizer config
├── special_tokens_map.json
├── added_tokens.json
└── vocab.txt               # SCOPE vocabulary
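If you want the checkpoint files on disk rather than loading them through from_pretrained, a minimal sketch with huggingface_hub (the allow_patterns filter restricts the download to the ckpt/ subfolder):

```python
from huggingface_hub import snapshot_download

# Fetch only the ckpt/ subfolder of the model repo into the local cache.
local_dir = snapshot_download(repo_id="ChangwenXu/CLOUD", allow_patterns=["ckpt/*"])
print(local_dir)  # path to the cached snapshot containing ckpt/
```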
Usage
Load the pretrained model and tokenizer
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
model = AutoModelForMaskedLM.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
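Because the released checkpoint is the masked-language-model backbone, a quick way to verify that the weights loaded correctly is to mask one token of a SCOPE string and decode the model's top prediction. A minimal sketch, assuming the tokenizer defines the standard [MASK] token and that scope_string holds a real SCOPE string:

```python
import torch

scope_string = "..."  # SCOPE string produced by structure_to_str.py
inputs = tokenizer(scope_string, return_tensors="pt")
inputs["input_ids"][0, 1] = tokenizer.mask_token_id  # mask the first token after [CLS]

with torch.no_grad():
    logits = model(**inputs).logits
masked_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
print(tokenizer.decode(logits[masked_pos].argmax(-1)))  # model's top guess for the mask
```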
Encode a crystal structure to a SCOPE string
CLOUD operates on SCOPE string representations. Use the conversion utility from the code repository to turn a CIF file into a SCOPE string:
git clone https://github.com/BattModels/CLOUD.git
cd CLOUD
python structure_to_str.py --dir <path_to_cif> --out <output_path> \
--numproc <num_of_processes> --batchsize <batch_size>
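A SCOPE string is assembled from the structure's space group, Wyckoff positions, and composition. As an unofficial illustration of the symmetry information the encoder consumes (not a replacement for structure_to_str.py), pymatgen can extract those ingredients from a CIF; example.cif is a hypothetical file name:

```python
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

# Illustration only: shows the symmetry data that SCOPE encodes.
structure = Structure.from_file("example.cif")  # hypothetical input file
sga = SpacegroupAnalyzer(structure)
sym = sga.get_symmetrized_structure()

print(sga.get_space_group_symbol(), sga.get_space_group_number())
for sites, wyckoff in zip(sym.equivalent_sites, sym.wyckoff_symbols):
    print(sites[0].specie, wyckoff, len(sites))  # element, Wyckoff label, multiplicity
```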
Get crystal embeddings
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
encoder = AutoModel.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
encoder.eval()
scope_string = "..." # SCOPE representation produced by structure_to_str.py
inputs = tokenizer(scope_string, return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
outputs = encoder(**inputs)
# Embedding of the [CLS] token (first position) from the final hidden layer:
crystal_embedding = outputs.last_hidden_state[:, 0]
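The [CLS] vector is one pooling choice; depending on the downstream task, a masked mean over all token embeddings can serve as an alternative crystal embedding. A sketch continuing from the code above (not a recommendation from the paper):

```python
# Alternative: attention-mask-weighted mean pooling over all tokens.
mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
mean_embedding = summed / mask.sum(dim=1).clamp(min=1)  # (batch, hidden)
```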
Fine-tuning
Recipes for fine-tuning on MatBench, UnconvBench, MatBench Discovery / WBM, and the physics-informed CLOUD-DEBYE variant are provided in the GitHub repository (train.py, train_mp.py, wbm_predict.py, train_debye.py).
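Those scripts are the reference recipes. As a self-contained illustration of the general pattern (a sketch, not the paper's exact setup), the backbone can be wrapped with a standard regression head from transformers; the head weights are freshly initialized and must be trained:

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical single-target regression setup (e.g., formation energy)
# on SCOPE strings; see train.py in the repo for the actual recipe.
model = AutoModelForSequenceClassification.from_pretrained(
    "ChangwenXu/CLOUD",
    subfolder="ckpt",
    num_labels=1,               # one regression target
    problem_type="regression",  # use MSE loss
)
# Fine-tune with transformers.Trainer on (SCOPE string, property) pairs.
```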
Intended Use
- Pretrained backbone for downstream crystal property prediction (formation energy, bandgap, mechanical, thermodynamic properties, etc.)
- Featurizer for materials screening and discovery workflows (see the similarity-ranking sketch after this list)
- Backbone for physics-informed extensions such as CLOUD-DEBYE
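As a hypothetical example of the featurizer use case, candidate crystals can be ranked by embedding similarity to a known reference material, reusing the tokenizer and encoder loaded in the usage section; the "..." placeholders stand for real SCOPE strings:

```python
import torch
import torch.nn.functional as F

def embed(scope_strings):
    """[CLS] embeddings for a batch of SCOPE strings."""
    batch = tokenizer(scope_strings, return_tensors="pt", padding=True,
                      truncation=True, max_length=64)
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]

reference = embed(["..."])          # SCOPE string of a reference material
candidates = embed(["...", "..."])  # SCOPE strings of candidate materials
scores = F.cosine_similarity(candidates, reference)
print(scores.argsort(descending=True))  # candidates ranked by similarity
```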
Out-of-scope
- Direct generation of crystal structures from scratch
- Predicting properties of non-crystalline systems (molecules, amorphous solids)
- Use as a substitute for high-fidelity DFT/MD without task-specific fine-tuning and validation
Limitations
- Trained on equilibrium / known crystal structures from OPTIMADE; out-of-distribution behavior on highly disordered, defective, or hypothetical structures is not guaranteed.
- Maximum sequence length of 64 tokens; SCOPE strings for very large or low-symmetry unit cells may exceed this limit and be silently truncated (see the length-check sketch after this list).
- Property predictions require task-specific fine-tuning; the released checkpoint is the masked-language-model pretrained backbone only.
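To catch truncation before it drops information, a minimal length check, assuming the tokenizer and a scope_string from the usage section:

```python
# Warn when a SCOPE string exceeds the model's 64-token window.
ids = tokenizer(scope_string, truncation=False)["input_ids"]
if len(ids) > 64:
    print(f"Warning: {len(ids)} tokens; input will be truncated to 64.")
```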
Citation
If you find CLOUD useful in your research, please cite:
@article{xu2026cloud,
title = {{CLOUD}: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning},
author = {Xu, Changwen and Zhu, Shang and Viswanathan, Venkatasubramanian},
journal = {Nature Communications},
volume = {17},
number = {1},
pages = {4074},
year = {2026},
doi = {10.1038/s41467-026-70467-3}
}
@inproceedings{xu2024cloud,
title = {{CLOUD}: A Scalable Scientific Foundation Model for Crystal Representation Learning},
author = {Xu, Changwen and Zhu, Shang and Viswanathan, Venkatasubramanian},
booktitle = {NeurIPS 2024 Workshop on Foundation Models for Science: Progress, Opportunities, and Challenges},
year = {2024}
}
License
Released under the MIT License, © 2025 Changwen Xu.