CrysMTM Dataset Card
Dataset Description
- Repository: CrysMTM
- Paper: CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials
- Authors: Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban
- Point of Contact: Can Polat
Dataset Summary
CrysMTM is a comprehensive multiphase, temperature-resolved, multimodal dataset for crystalline materials research, specifically focused on titanium dioxide (TiOβ) polymorphs. The dataset is designed primarily for regression tasks to predict 9 key material properties from multimodal inputs. It contains three crystalline phases of TiOβ (anatase, brookite, and rutile) across a temperature range of 0-1000K, with multiple data modalities including atomic coordinates, visual representations, and textual descriptions.
Supported Tasks and Leaderboards
The dataset primarily supports regression tasks for materials property prediction:
- Main Task - Regression: Predict 9 material properties from multimodal inputs
- HOMO energy, LUMO energy, band gap, Fermi energy, total energy, energy per atom, atomic displacement, volumetric expansion, and bond length changes
- Main Task - LLM Property Prediction: Zero-shot and few-shot prediction of the 9 material properties using large language models
- Secondary Task - LLM Summary Generation: Generate textual summaries of crystal structures and properties using large language models
- Tertiary Task - Classification: Predict the crystalline phase (anatase, brookite, or rutile) from multimodal inputs
Languages
The dataset contains English text descriptions of crystal structures and properties.
Dataset Structure
Data Instances
Each data instance represents a TiOβ crystal structure at a specific temperature and rotation, containing:
- Phase: One of three TiOβ polymorphs (anatase, brookite, rutile)
- Temperature: Temperature in Kelvin (0-1000K, in 50K increments)
- Rotation: Rotation index for the crystal structure
- Modalities: Multiple data representations of the same structure
Data Fields
Core Metadata
phase(string): Crystal phase - "anatase", "brookite", or "rutile"temperature(integer): Temperature in Kelvin (0, 50, 100, ..., 1000)rotation(integer): Rotation index for the crystal structure
Multimodal Data
image(PIL.Image): Visual representation of the crystal structure (PNG format)xyz(torch.Tensor): Atomic coordinates in XYZ format (NΓ3 tensor)text(string): Textual description of the crystal structure and propertieselement(list): List of element symbols for each atom
Labels
Primary Labels - Regression:
regression_label(torch.Tensor): 9-dimensional tensor containing the main prediction targets:HOMO(float): HOMO energy (E_H) in eVLUMO(float): LUMO energy (E_L) in eVEg(float): Band gap energy (E_g) in eVEf(float): Fermi energy (E_f) in eVEt(float): Total energy of the system (E_T) in eVEta(float): Total energy per atom (E_Ta) in eVdisp(float): Maximum atomic displacement (Ξr_max) in Γvol(float): Volumetric expansion (ΞV) in Γ Β³bond(float): Ti-O bond length change (Ξd_Ti-O) in Γ
Secondary Labels - Classification:
label(integer): Phase label (0=anatase, 1=brookite, 2=rutile)
LLM Task Labels:
- Individual property values for zero-shot/few-shot prediction
- Text summaries for generation tasks
Data Splits
The dataset is organized by temperature ranges:
- Training Set: Temperatures 0-850K (excluding 250K, 450K, 650K, 750K, 800K)
- In-Distribution (ID) Test: Temperatures 250K, 450K, 650K, 750K, 800K
- Out-of-Distribution (OOD) Test: Temperatures 0K, 50K, 100K, 900K, 950K, 1000K
Citation Information
@dataset{crysmtm2024,
title={CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials},
author={Can Polat and Erchin Serpedin and Mustafa Kurban and Hasan Kurban},
year={2024},
url={https://github.com/KurbanIntelligenceLab/CrysMTM}
}
Usage Examples
Option 1: Download and Use Locally
- Download the dataset from https://huggingface.co/datasets/johnpolat/CrysMTM
- Use the provided loading script:
# Download load_dataset.py from the repository and place it in your data directory
from load_dataset import load_dataset
# Load the dataset
dataset = load_dataset(".")
# Access splits
train_dataset = dataset["train"] # 5,064 samples
test_id_dataset = dataset["test_id"] # 1,380 samples
test_ood_dataset = dataset["test_ood"] # 6,588 samples
# Get a sample
sample = train_dataset[0]
print(f"Phase: {sample['phase']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Image: {sample['image']}")
print(f"Regression labels: {sample['regression_labels']}")
Option 2: Use with Original Dataloaders
from dataloaders.regression_dataloader import RegressionLoader
# Load dataset for regression (main task)
dataset = RegressionLoader(
label_dir="data",
modalities=["image", "xyz", "text"],
normalize_labels=True
)
# Get a sample
sample = dataset[0]
print(f"Target Properties: {sample['regression_label']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
print(f"Image shape: {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
Main Task - LLM Property Prediction
from dataloaders.llm_regression_dataloader import LLMLoader
# Load dataset for LLM property prediction (main task)
dataset = LLMLoader(
label_dir="data",
modalities=["text", "image"]
)
# Get a sample for zero-shot/few-shot property prediction
sample = dataset[0]
print(f"HOMO: {sample['HOMO']}")
print(f"LUMO: {sample['LUMO']}")
print(f"Band gap: {sample['Eg']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
Secondary Task - LLM Summary Generation
from dataloaders.llm_regression_dataloader import LLMLoader
# Load dataset for LLM summary generation (secondary task)
dataset = LLMLoader(
label_dir="data",
modalities=["text", "image"]
)
# Get a sample for summary generation
sample = dataset[0]
print(f"Input text: {sample['text'][:200]}...")
print(f"Target properties: {sample['HOMO']}, {sample['LUMO']}, {sample['Eg']}")
Tertiary Task - Classification
from dataloaders.classification_dataloader import ClassificationLoader
# Load dataset for classification (tertiary task)
dataset = ClassificationLoader(
base_dir="data",
modalities=["image", "xyz", "text"],
max_rotations=10
)
# Get a sample
sample = dataset[0]
print(f"Phase: {sample['label']}")
print(f"Image shape: {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
print(f"Text: {sample['text'][:100]}...")
PyTorch Geometric Integration
# For graph neural networks
dataset = ClassificationLoader(
base_dir="data",
modalities=["xyz", "element"],
as_pyg_data=True
)
# Returns PyG Data objects
sample = dataset[0]
print(f"Node features: {sample.z}")
print(f"Positions: {sample.pos}")
print(f"Label: {sample.y}")
Technical Details
File Structure
data/
βββ anatase/
β βββ 0K/
β β βββ images/
β β β βββ rot_0.png
β β β βββ rot_1.png
β β β βββ ...
β β βββ xyz/
β β β βββ rot_0.xyz
β β β βββ rot_1.xyz
β β β βββ ...
β β βββ text/
β β βββ rot_0.txt
β β βββ rot_1.txt
β β βββ ...
β βββ 50K/
β βββ ...
βββ brookite/
βββ rutile/
βββ labels.csv
Data Formats
XYZ Files
Standard XYZ format with atomic coordinates:
[number of atoms]
[comment line]
[element] [x] [y] [z]
[element] [x] [y] [z]
...
Images
PNG format visualizations of crystal structures.
Text Files
Natural language descriptions of crystal structures and properties.
Labels CSV
Contains material properties for each phase-temperature combination:
Polymorph,Temperature,Parameter,Value
anatase,0K,HOMO,-7.2340
anatase,0K,LUMO,-4.1234
...
Supported Models
The dataset is compatible with various model architectures:
- Vision Models: ResNet, ViT
- Graph Neural Networks: SchNet, DimeNet, EGNN, FAENet, GoTenNet
- Language Models: LLMs for zero-shot/few-shot learning
- Multimodal Models: CLIP, Pure2DopeNet, ViSNet
Performance Metrics
Primary Task - Regression
- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
- RΒ² score
- Per-property evaluation metrics
Primary Task - LLM Property Prediction
- Property prediction accuracy
- Zero-shot vs few-shot performance comparison
- Out-of-distribution generalization
- Per-property evaluation metrics
Secondary Task - LLM Summary Generation
The nanoparticle summary task requires domain-specific evaluation beyond traditional string-based metrics like ROUGE or BLEU, which do not penalize incorrect numerical values. A more meaningful strategy is to extract structured keyβvalue pairs-such as particle size, center of mass, coordination numbers, or bond angles-and compare them to ground truth using:
- Information-level Fβ score that accepts only values within defined tolerances (e.g., 0.1 Γ or 1 degree)
- MAPE over all numeric entries
- Factual consistency score like BERTScore or QA-based faithfulness after masking numeric values
- Optional assessments of readability and clarity using expert judgment or coherence-based metrics (e.g., Coh-LM)
Tertiary Task - Classification
A three-class classification task to distinguish among the TiOβ polymorphs. While overall accuracy provides a general overview, it is important to also report:
- Class-wise precision, recall, and their harmonic mean (Fβ score), followed by macro-averaging to account for class imbalance
- Full 3Γ3 confusion matrix to identify systematic misclassifications between phase pairs
- Matthews correlation coefficient (MCC) and Cohen's ΞΊ statistic for chance-adjusted evaluations
- Cross-entropy loss and macro-averaged area under the ROC curve (AUROC) when class probabilities are available
Known Limitations
- Limited Chemical Space: Only covers TiOβ polymorphs
- Temperature Range: Limited to 0-1000K
- Computational Data: All properties are from DFT calculations
- Modality Dependencies: Some modalities may not be available for all samples
Future Work
- Extend to other materials systems
- Include experimental data
- Add more temperature points
- Incorporate additional material properties
- Support for more crystal structures