	---
	license: mit
	library_name: transformers
	pipeline_tag: fill-mask
	tags:
	- materials-science
	- crystal-structure
	- foundation-model
	- chemistry
	- bert
	- scope
	- cloud
	language:
	- en
	---

	# CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning

	CLOUD (Crystal Language mOdel for Unified and Differentiable materials modeling) is a Transformer-based foundation model that learns crystal representations from string encodings of crystal structures. Crystals are serialized with a novel Symmetry-Consistent Ordered Parameter Encoding (SCOPE), a compact, coordinate-free representation that captures space-group symmetry, Wyckoff positions, and composition. The pretrained model can be fine-tuned for accurate, generalizable, and scalable property prediction, and can be combined with physical laws (e.g. the Debye model) to give thermodynamically consistent predictions.

	- 📄 Paper (Nature Communications 17, 4074, 2026): [CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning](https://doi.org/10.1038/s41467-026-70467-3) ([arXiv preprint](https://arxiv.org/abs/2506.17345))
	- 💻 Code: [github.com/BattModels/CLOUD](https://github.com/BattModels/CLOUD)
	- 🏛️ Authors: Changwen Xu, Shang Zhu, Venkatasubramanian Viswanathan (University of Michigan)

	## Model Details

	| | |
	|---|---|
	| Architecture | BERT encoder (`BertForMaskedLM`) |
	| Hidden size | 768 |
	| Hidden layers | 12 |
	| Attention heads | 12 |
	| Intermediate size | 3072 |
	| Max sequence length | 64 |
	| Vocab size | 30522 (custom SCOPE tokenizer) |
	| Parameters | ~110M |
	| Precision | float32 |
	| Pretraining objective | Masked language modeling on SCOPE strings |
	| Pretraining data | ~6M crystal structures from OPTIMADE |
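
As a rough cross-check of the table, the parameter count can be estimated from the listed hyperparameters. The sketch below assumes the standard BERT layout (learned position embeddings, two token types, a bias on every linear layer, no pooler in the masked-LM model) and a position table with 64 rows to match the maximum sequence length. Neither assumption has been checked against `config.json`, and the function name `count_bert_mlm_params` is illustrative.

```python
def count_bert_mlm_params(
    vocab_size: int = 30522,
    hidden: int = 768,
    layers: int = 12,
    intermediate: int = 3072,
    max_positions: int = 64,    # assumption: equals the max sequence length
    type_vocab: int = 2,        # assumption: BERT default
    tied_decoder: bool = True,  # assumption: output projection shares the word embeddings
) -> int:
    """Estimate the parameter count of a BERT masked-LM with the given sizes."""
    layer_norm = 2 * hidden  # scale and shift vectors

    def linear(n_in: int, n_out: int) -> int:
        return n_in * n_out + n_out  # weight matrix plus bias

    # Word, position, and token-type embedding tables, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # One encoder layer: Q, K, V, and output projections, then the feed-forward block.
    attention = 4 * linear(hidden, hidden) + layer_norm
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    encoder = layers * (attention + feed_forward)

    # Masked-LM head: dense transform, LayerNorm, and a per-token output bias.
    mlm_head = linear(hidden, hidden) + layer_norm + vocab_size
    if not tied_decoder:
        mlm_head += vocab_size * hidden  # separate output projection matrix

    return embeddings + encoder + mlm_head


tied = count_bert_mlm_params()
untied = count_bert_mlm_params(tied_decoder=False)
print(f"tied decoder:   {tied / 1e6:.1f}M parameters, {tied * 4 / 1e6:.0f} MB in float32")
print(f"untied decoder: {untied / 1e6:.1f}M parameters, {untied * 4 / 1e6:.0f} MB in float32")
```

Under these assumptions the tied count is about 109M, in line with the ~110M in the table, and the untied count is about 132.6M (roughly 530 MB in float32). Which of the two matches the released checkpoint is not verified here. The number of attention heads does not enter the total, because the heads split the 768-wide projections rather than adding to them.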

	## Repository Layout

	```
	ckpt/
	├── config.json # BertForMaskedLM config
	├── generation_config.json
	├── model.safetensors # Pretrained weights (~530 MB)
	├── training_args.bin # HF Trainer arguments used for pretraining
	├── tokenizer_config.json # SCOPE tokenizer config
	├── special_tokens_map.json
	├── added_tokens.json
	└── vocab.txt # SCOPE vocabulary
	```

	## Usage

	### Load the pretrained model and tokenizer

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
	model = AutoModelForMaskedLM.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
	```

	### Encode a crystal structure to a SCOPE string

	CLOUD operates on SCOPE string representations. Use the conversion utility from the [code repository](https://github.com/BattModels/CLOUD) to turn a CIF file into a SCOPE string:

	```bash
	git clone https://github.com/BattModels/CLOUD.git
	cd CLOUD
	python structure_to_str.py --dir <path_to_cif> --out <output_path> \
	    --numproc <num_of_processes> --batchsize <batch_size>
	```
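
To drive the conversion from Python, for example inside a data-preparation script, the same command can be assembled with `subprocess`. This is a minimal sketch: the helper names `build_scope_command` and `run_conversion` are hypothetical, only the four flags shown above are used, and the default values for `numproc` and `batchsize` are placeholders rather than recommendations.

```python
import subprocess
import sys
from pathlib import Path


def build_scope_command(cif_dir, out_path, numproc=4, batchsize=1000):
    """Return the argv list for structure_to_str.py, using the flags shown above."""
    return [
        sys.executable, "structure_to_str.py",
        "--dir", str(cif_dir),
        "--out", str(out_path),
        "--numproc", str(numproc),
        "--batchsize", str(batchsize),
    ]


def run_conversion(repo_dir, cif_dir, out_path, **kwargs):
    """Run the conversion from inside a local clone of the CLOUD repository."""
    # Resolve paths first, because the script runs with repo_dir as its working directory.
    cmd = build_scope_command(Path(cif_dir).resolve(), Path(out_path).resolve(), **kwargs)
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    subprocess.run(cmd, cwd=repo_dir, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.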

	### Get crystal embeddings

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
	# AutoModel loads the bare encoder; the masked-LM head stored in the checkpoint is not used.
	encoder = AutoModel.from_pretrained("ChangwenXu/CLOUD", subfolder="ckpt")
	encoder.eval()

	scope_string = "..." # SCOPE representation produced by structure_to_str.py
	inputs = tokenizer(scope_string, return_tensors="pt", padding=True, truncation=True, max_length=64)
	with torch.no_grad():
	    outputs = encoder(**inputs)
	# Final hidden state of the [CLS] token, used as the crystal embedding
	# (this is not the BERT pooler output):
	crystal_embedding = outputs.last_hidden_state[:, 0]
	```
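
Once embeddings are extracted, a common first use is to compare crystals by cosine similarity. The sketch below is one simple option, not a method prescribed by the authors. It works on plain Python lists (for example `crystal_embedding[0].tolist()`), and the helper names `cosine_similarity` and `rank_by_similarity` are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Sort candidate crystals from most to least similar to the query embedding."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

For large candidate sets, the same computation is better done as a single matrix product on normalized embeddings; the list-based version is kept here for readability.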

	### Fine-tuning

	Recipes for fine-tuning on MatBench, UnconvBench, MatBench Discovery / WBM, and the physics-informed CLOUD-DEBYE variant are provided in the [GitHub repository](https://github.com/BattModels/CLOUD) (`train.py`, `train_mp.py`, `wbm_predict.py`, `train_debye.py`).

	## Intended Use

	- Pretrained backbone for downstream crystal property prediction (e.g. formation energy, band gap, and mechanical and thermodynamic properties)
	- Featurizer for materials screening and discovery workflows
	- Backbone for physics-informed extensions such as CLOUD-DEBYE
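
For context on the physics-informed use case, the Debye model expresses the molar heat capacity at constant volume through a single material parameter, the Debye temperature. The sketch below evaluates the textbook formula numerically with Simpson's rule. How CLOUD-DEBYE couples the network to the Debye model is described in the paper and is not reproduced here; the function name, the integration scheme, and the per-mole-of-atoms normalization are choices made for this illustration.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol K)


def debye_heat_capacity(temperature, debye_temperature, n_intervals=2000):
    """Debye-model C_V in J/(mol K), per mole of atoms.

    C_V = 9 R (T / theta_D)^3 * integral_0^{theta_D / T} x^4 e^x / (e^x - 1)^2 dx
    """
    if temperature <= 0 or debye_temperature <= 0:
        raise ValueError("temperatures must be positive (in kelvin)")
    if n_intervals % 2:
        n_intervals += 1  # Simpson's rule needs an even number of intervals

    def integrand(x):
        if x == 0.0:
            return 0.0  # limit as x -> 0, where the integrand behaves like x^2
        # Same as x^4 e^x / (e^x - 1)^2, written with e^(-x) to avoid overflow.
        return x**4 * math.exp(-x) / math.expm1(-x) ** 2

    upper = debye_temperature / temperature
    h = upper / n_intervals
    total = integrand(0.0) + integrand(upper)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * integrand(i * h)
    integral = total * h / 3
    return 9 * R * (temperature / debye_temperature) ** 3 * integral
```

Well above the Debye temperature the result approaches the Dulong-Petit value of 3R, and well below it the heat capacity follows the T³ law, so any predictor that outputs a Debye temperature and feeds it through this formula inherits both limits by construction.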

	### Out-of-scope

	- Direct generation of crystal structures from scratch
	- Predicting properties of non-crystalline systems (molecules, amorphous solids)
	- Use as a substitute for high-fidelity DFT/MD without task-specific fine-tuning and validation

	## Limitations

	- Pretrained on equilibrium / known crystal structures from OPTIMADE; accuracy on highly disordered, defective, or hypothetical structures that fall outside this distribution is not guaranteed.
	- Maximum sequence length is 64 tokens; SCOPE strings for very large or low-symmetry unit cells may exceed this limit and be truncated.
	- Property predictions require task-specific fine-tuning; the released checkpoint is the masked-language-model pretrained backbone only.
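
Because truncation is silent when `truncation=True` is passed to the tokenizer, it can be worth checking token counts before embedding a dataset. The sketch below assumes a Hugging Face style tokenizer, that is, a callable that returns a mapping with an `input_ids` list for a single string. The helper name `find_overlong` is hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

MAX_LENGTH = 64  # maximum sequence length from the model details


def find_overlong(
    scope_strings: Iterable[str],
    tokenizer: Callable,
    max_length: int = MAX_LENGTH,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every string longer than max_length tokens."""
    overlong = []
    for index, text in enumerate(scope_strings):
        # No truncation here, so the count reflects the full string,
        # including any special tokens the tokenizer adds.
        n_tokens = len(tokenizer(text, truncation=False)["input_ids"])
        if n_tokens > max_length:
            overlong.append((index, n_tokens))
    return overlong
```

Structures flagged this way can be excluded, or their predictions treated with extra caution, since part of their SCOPE string never reaches the model.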

	## Citation

	If you find CLOUD useful in your research, please cite:

	```bibtex
	@article{xu2026cloud,
	  title = {{CLOUD}: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning},
	  author = {Xu, Changwen and Zhu, Shang and Viswanathan, Venkatasubramanian},
	  journal = {Nature Communications},
	  volume = {17},
	  number = {1},
	  pages = {4074},
	  year = {2026},
	  doi = {10.1038/s41467-026-70467-3}
	}

	@inproceedings{xu2024cloud,
	  title = {{CLOUD}: A Scalable Scientific Foundation Model for Crystal Representation Learning},
	  author = {Xu, Changwen and Zhu, Shang and Viswanathan, Venkatasubramanian},
	  booktitle = {NeurIPS 2024 Workshop on Foundation Models for Science: Progress, Opportunities, and Challenges},
	  year = {2024}
	}
	```

	## License

	Released under the [MIT License](https://github.com/BattModels/CLOUD/blob/master/LICENSE), © 2025 Changwen Xu.