---
license: mit
library_name: transformers
pipeline_tag: fill-mask
tags:
- materials-science
- crystal-structure
- foundation-model
- chemistry
- bert
- scope
- cloud
language:
- en
---

# CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning

CLOUD (**C**rystal **L**anguage m**O**del for **U**nified and **D**ifferentiable materials modeling) is a Transformer-based foundation model that learns crystal representations from string encodings of crystal structures. Crystals are serialized with a novel **Symmetry-Consistent Ordered Parameter Encoding (SCOPE)**, a compact, coordinate-free representation that captures space-group symmetry, Wyckoff positions, and composition. The model can be fine-tuned for accurate, generalizable, and scalable property prediction, and can be combined with physical laws (e.g. the Debye model) for thermodynamically consistent predictions.

- 📄 Paper (*Nature Communications* **17**, 4074, 2026): [CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning](https://doi.org/10.1038/s41467-026-70467-3) ([arXiv preprint](https://arxiv.org/abs/2506.17345))
- 💻 Code: [github.com/BattModels/CLOUD](https://github.com/BattModels/CLOUD)
- 🏛️ Authors: Changwen Xu, Shang Zhu, Venkatasubramanian Viswanathan (University of Michigan)

## Model Details

| | |
|---|---|
| Architecture | BERT encoder (`BertForMaskedLM`) |
| Hidden size | 768 |
| Hidden layers | 12 |
| Attention heads | 12 |
| Intermediate size | 3072 |
| Max sequence length | 64 |
| Vocab size | 30522 (custom SCOPE tokenizer) |
| Parameters | ~110M |
| Precision | float32 |
| Pretraining objective | Masked language modeling on SCOPE strings |
| Pretraining data | ~6M crystal structures from OPTIMADE |

## Repository Layout

```
ckpt/
├── config.json              # BertForMaskedLM config
├── generation_config.json
├── model.safetensors        # Pretrained weights (~530 MB)
├── training_args.bin        # HF Trainer arguments used for pretraining
├── tokenizer_config.json    # SCOPE tokenizer config
├── special_tokens_map.json
├── added_tokens.json
└── vocab.txt                # SCOPE vocabulary
```

## Usage

### Load the pretrained model and tokenizer

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
model = AutoModelForMaskedLM.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
```

### Encode a crystal structure to a SCOPE string

CLOUD operates on SCOPE string representations. Use the conversion utility from the [code repository](https://github.com/BattModels/CLOUD) to turn CIF files into SCOPE strings:

```bash
git clone https://github.com/BattModels/CLOUD.git
cd CLOUD
# Supply a value for each flag (the values are omitted here).
python structure_to_str.py --dir --out \
    --numproc --batchsize
```

### Get crystal embeddings

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
encoder = AutoModel.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
encoder.eval()

scope_string = "..."  # SCOPE representation produced by structure_to_str.py
inputs = tokenizer(scope_string, return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    outputs = encoder(**inputs)

# [CLS] token embedding (first position of the last hidden state), used as the crystal embedding:
crystal_embedding = outputs.last_hidden_state[:, 0]
```

### Fine-tuning

Recipes for fine-tuning on MatBench, UnconvBench, MatBench Discovery / WBM, and the physics-informed CLOUD-DEBYE variant are provided in the [GitHub repository](https://github.com/BattModels/CLOUD) (`train.py`, `train_mp.py`, `wbm_predict.py`, `train_debye.py`).

## Intended Use

- Pretrained backbone for downstream crystal property prediction (formation energy, bandgap, mechanical, thermodynamic properties, etc.)
- Featurizer for materials screening and discovery workflows
- Backbone for physics-informed extensions such as CLOUD-DEBYE

### Out-of-scope

- Direct generation of crystal structures from scratch
- Predicting properties of non-crystalline systems (molecules, amorphous solids)
- Use as a substitute for high-fidelity DFT/MD without task-specific fine-tuning and validation

## Limitations

- Trained on equilibrium / known crystal structures from OPTIMADE; out-of-distribution behavior on highly disordered, defective, or hypothetical structures is not guaranteed.
- Maximum sequence length of 64 tokens; very large or low-symmetry unit cells may be truncated by the SCOPE encoder.
- Property predictions require task-specific fine-tuning; the released checkpoint is the masked-language-model pretrained backbone only.

## Citation

If you find CLOUD useful in your research, please cite:

```bibtex
@article{xu2026cloud,
  title   = {{CLOUD}: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning},
  author  = {Xu, Changwen and Zhu, Shang and Viswanathan, Venkatasubramanian},
  journal = {Nature Communications},
  volume  = {17},
  number  = {1},
  pages   = {4074},
  year    = {2026},
  doi     = {10.1038/s41467-026-70467-3}
}

@inproceedings{xu2024cloud,
  title     = {{CLOUD}: A Scalable Scientific Foundation Model for Crystal Representation Learning},
  author    = {Xu, Changwen and Zhu, Shang and Viswanathan, Venkatasubramanian},
  booktitle = {NeurIPS 2024 Workshop on Foundation Models for Science: Progress, Opportunities, and Challenges},
  year      = {2024}
}
```

## License

Released under the [MIT License](https://github.com/BattModels/CLOUD/blob/master/LICENSE), © 2025 Changwen Xu.
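
## Appendix: Checking for truncation

The 64-token limit noted under Limitations means that an over-long SCOPE string is cut short when `truncation=True` is passed to the tokenizer. The following is a minimal sketch of a pre-check that flags such strings before they are embedded. It assumes BERT-style `[CLS]` and `[SEP]` tokens are added to every sequence; `find_truncated` and `whitespace_tokenize` are hypothetical names introduced here, and the whitespace tokenizer is only a stand-in so the example runs without downloading the checkpoint.

```python
from typing import Callable, Iterable, List, Tuple

MAX_LENGTH = 64          # maximum sequence length listed under Model Details
NUM_SPECIAL_TOKENS = 2   # assumption: BERT-style [CLS] and [SEP] are added


def find_truncated(
    scope_strings: Iterable[str],
    tokenize: Callable[[str], List[str]],
    max_length: int = MAX_LENGTH,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every string longer than max_length."""
    too_long = []
    for index, text in enumerate(scope_strings):
        # Count the content tokens plus the special tokens added at encoding time.
        token_count = len(tokenize(text)) + NUM_SPECIAL_TOKENS
        if token_count > max_length:
            too_long.append((index, token_count))
    return too_long


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for illustration only (NOT the SCOPE tokenizer)."""
    return text.split()


short_string = " ".join(["tok"] * 10)   # 10 tokens + 2 special = 12
long_string = " ".join(["tok"] * 70)    # 70 tokens + 2 special = 72
flagged = find_truncated([short_string, long_string], whitespace_tokenize)
print(flagged)  # [(1, 72)]
```

With the released tokenizer loaded as shown under Usage, passing `tokenizer.tokenize` as the `tokenize` argument should give the actual token counts for real SCOPE strings.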