CrysMTM Dataset Card

Dataset Description

  • Repository: CrysMTM
  • Paper: CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials
  • Authors: Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban
  • Point of Contact: Can Polat

Dataset Summary

CrysMTM is a comprehensive multiphase, temperature-resolved, multimodal dataset for crystalline materials research, specifically focused on titanium dioxide (TiO₂) polymorphs. The dataset is designed primarily for regression tasks to predict 9 key material properties from multimodal inputs. It contains three crystalline phases of TiO₂ (anatase, brookite, and rutile) across a temperature range of 0-1000K, with multiple data modalities including atomic coordinates, visual representations, and textual descriptions.

Supported Tasks and Leaderboards

The dataset primarily supports regression tasks for materials property prediction:

  1. Main Task - Regression: Predict 9 material properties from multimodal inputs
    • HOMO energy, LUMO energy, band gap, Fermi energy, total energy, energy per atom, atomic displacement, volumetric expansion, and bond length changes
  2. Main Task - LLM Property Prediction: Zero-shot and few-shot prediction of the 9 material properties using large language models
  3. Secondary Task - LLM Summary Generation: Generate textual summaries of crystal structures and properties using large language models
  4. Tertiary Task - Classification: Predict the crystalline phase (anatase, brookite, or rutile) from multimodal inputs

Languages

The dataset contains English text descriptions of crystal structures and properties.

Dataset Structure

Data Instances

Each data instance represents a TiO₂ crystal structure at a specific temperature and rotation, containing:

  • Phase: One of three TiO₂ polymorphs (anatase, brookite, rutile)
  • Temperature: Temperature in Kelvin (0-1000K, in 50K increments)
  • Rotation: Rotation index for the crystal structure
  • Modalities: Multiple data representations of the same structure

Data Fields

Core Metadata

  • phase (string): Crystal phase - "anatase", "brookite", or "rutile"
  • temperature (integer): Temperature in Kelvin (0, 50, 100, ..., 1000)
  • rotation (integer): Rotation index for the crystal structure

Multimodal Data

  • image (PIL.Image): Visual representation of the crystal structure (PNG format)
  • xyz (torch.Tensor): Atomic coordinates in XYZ format (N×3 tensor)
  • text (string): Textual description of the crystal structure and properties
  • element (list): List of element symbols for each atom

Labels

Primary Labels - Regression:

  • regression_label (torch.Tensor): 9-dimensional tensor containing the main prediction targets:
    • HOMO (float): HOMO energy (E_H) in eV
    • LUMO (float): LUMO energy (E_L) in eV
    • Eg (float): Band gap energy (E_g) in eV
    • Ef (float): Fermi energy (E_f) in eV
    • Et (float): Total energy of the system (E_T) in eV
    • Eta (float): Total energy per atom (E_Ta) in eV
    • disp (float): Maximum atomic displacement (Δr_max) in Å
    • vol (float): Volumetric expansion (ΔV) in Å³
    • bond (float): Ti-O bond length change (Δd_Ti-O) in Å

Secondary Labels - Classification:

  • label (integer): Phase label (0=anatase, 1=brookite, 2=rutile)

LLM Task Labels:

  • Individual property values for zero-shot/few-shot prediction
  • Text summaries for generation tasks

Data Splits

The dataset is organized by temperature ranges:

  • Training Set: Temperatures 150-850K (excluding 250K, 450K, 650K, 750K, 800K)
  • In-Distribution (ID) Test: Temperatures 250K, 450K, 650K, 750K, 800K
  • Out-of-Distribution (OOD) Test: Temperatures 0K, 50K, 100K, 900K, 950K, 1000K
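
The split rule above can be sketched as a small lookup. This is a minimal illustration, not the repository's own code: the name assign_split is hypothetical, and it assumes every 50K grid point that is not listed in either test set belongs to training.

```python
# Hypothetical helper; the two test sets are copied from the split list above.
ID_TEST_TEMPS = {250, 450, 650, 750, 800}
OOD_TEST_TEMPS = {0, 50, 100, 900, 950, 1000}
ALL_TEMPS = set(range(0, 1001, 50))  # 0K to 1000K in 50K increments


def assign_split(temperature_k: int) -> str:
    """Return 'train', 'test_id', or 'test_ood' for a temperature in Kelvin."""
    if temperature_k not in ALL_TEMPS:
        raise ValueError(f"unexpected temperature: {temperature_k}K")
    if temperature_k in OOD_TEST_TEMPS:
        return "test_ood"
    if temperature_k in ID_TEST_TEMPS:
        return "test_id"
    # Assumption: everything else on the grid is training data.
    return "train"


train_temps = sorted(t for t in ALL_TEMPS if assign_split(t) == "train")
print(train_temps)
```

Under this assumption the OOD temperatures sit at both ends of the range, so the OOD test measures extrapolation while the ID test measures interpolation between training temperatures.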

Citation Information

@dataset{crysmtm2024,
  title={CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials},
  author={Can Polat and Erchin Serpedin and Mustafa Kurban and Hasan Kurban},
  year={2024},
  url={https://github.com/KurbanIntelligenceLab/CrysMTM}
}

Usage Examples

Option 1: Download and Use Locally

  1. Download the dataset from https://huggingface.co/datasets/johnpolat/CrysMTM
  2. Use the provided loading script:
# Download load_dataset.py from the repository and place it in your data directory
from load_dataset import load_dataset

# Load the dataset
dataset = load_dataset(".")

# Access splits
train_dataset = dataset["train"]      # 5,064 samples
test_id_dataset = dataset["test_id"]  # 1,380 samples
test_ood_dataset = dataset["test_ood"] # 6,588 samples

# Get a sample
sample = train_dataset[0]
print(f"Phase: {sample['phase']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Image: {sample['image']}")
print(f"Regression labels: {sample['regression_labels']}")

Option 2: Use with Original Dataloaders

from dataloaders.regression_dataloader import RegressionLoader

# Load dataset for regression (main task)
dataset = RegressionLoader(
    label_dir="data",
    modalities=["image", "xyz", "text"],
    normalize_labels=True
)

# Get a sample
sample = dataset[0]
print(f"Target Properties: {sample['regression_label']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
print(f"Image shape: {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")

Main Task - LLM Property Prediction

from dataloaders.llm_regression_dataloader import LLMLoader

# Load dataset for LLM property prediction (main task)
dataset = LLMLoader(
    label_dir="data",
    modalities=["text", "image"]
)

# Get a sample for zero-shot/few-shot property prediction
sample = dataset[0]
print(f"HOMO: {sample['HOMO']}")
print(f"LUMO: {sample['LUMO']}")
print(f"Band gap: {sample['Eg']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")

Secondary Task - LLM Summary Generation

from dataloaders.llm_regression_dataloader import LLMLoader

# Load dataset for LLM summary generation (secondary task)
dataset = LLMLoader(
    label_dir="data",
    modalities=["text", "image"]
)

# Get a sample for summary generation
sample = dataset[0]
print(f"Input text: {sample['text'][:200]}...")
print(f"Target properties: {sample['HOMO']}, {sample['LUMO']}, {sample['Eg']}")

Tertiary Task - Classification

from dataloaders.classification_dataloader import ClassificationLoader

# Load dataset for classification (tertiary task)
dataset = ClassificationLoader(
    base_dir="data",
    modalities=["image", "xyz", "text"],
    max_rotations=10
)

# Get a sample
sample = dataset[0]
print(f"Phase: {sample['label']}")
print(f"Image shape: {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
print(f"Text: {sample['text'][:100]}...")

PyTorch Geometric Integration

# For graph neural networks
from dataloaders.classification_dataloader import ClassificationLoader

dataset = ClassificationLoader(
    base_dir="data",
    modalities=["xyz", "element"],
    as_pyg_data=True
)

# Returns PyG Data objects
sample = dataset[0]
print(f"Node features: {sample.z}")
print(f"Positions: {sample.pos}")
print(f"Label: {sample.y}")

Technical Details

File Structure

data/
├── anatase/
│   ├── 0K/
│   │   ├── images/
│   │   │   ├── rot_0.png
│   │   │   ├── rot_1.png
│   │   │   └── ...
│   │   ├── xyz/
│   │   │   ├── rot_0.xyz
│   │   │   ├── rot_1.xyz
│   │   │   └── ...
│   │   └── text/
│   │       ├── rot_0.txt
│   │       ├── rot_1.txt
│   │       └── ...
│   ├── 50K/
│   └── ...
├── brookite/
├── rutile/
└── labels.csv
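
Given this layout, the three files belonging to one sample can be located from its phase, temperature, and rotation index. The sketch below is illustrative only: sample_paths is a hypothetical helper, not part of the provided dataloaders, and it assumes every temperature folder follows the same naming as the 0K folder shown above.

```python
from pathlib import Path

# Folder name -> file extension, as shown in the tree above.
EXTENSIONS = {"images": "png", "xyz": "xyz", "text": "txt"}


def sample_paths(root: str, phase: str, temperature_k: int, rotation: int) -> dict:
    """Build the image, XYZ, and text paths for one sample (hypothetical helper)."""
    base = Path(root) / phase / f"{temperature_k}K"
    return {
        folder: base / folder / f"rot_{rotation}.{ext}"
        for folder, ext in EXTENSIONS.items()
    }


paths = sample_paths("data", "anatase", 0, 1)
print(paths["xyz"].as_posix())
```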

Data Formats

XYZ Files

Standard XYZ format with atomic coordinates:

[number of atoms]
[comment line]
[element] [x] [y] [z]
[element] [x] [y] [z]
...
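
A file in this format can be read with a few lines of plain Python. The following is a minimal sketch rather than the loader shipped with the dataset: parse_xyz is a hypothetical name, and the coordinates in the example are made up for illustration and are not taken from CrysMTM.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]


def parse_xyz(text: str) -> Tuple[List[str], List[Coord]]:
    """Parse XYZ-format text into element symbols and Cartesian coordinates."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])  # line 1: number of atoms
    elements: List[str] = []            # line 2 is a comment and is skipped
    coords: List[Coord] = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(elements) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(elements)}")
    return elements, coords


# Illustrative input only; these coordinates are not from the dataset.
example = """3
illustrative TiO2 fragment
Ti 0.000 0.000 0.000
O 1.950 0.000 0.000
O -1.950 0.000 0.000
"""
elements, coords = parse_xyz(example)
print(elements, len(coords))
```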

Images

PNG format visualizations of crystal structures.

Text Files

Natural language descriptions of crystal structures and properties.

Labels CSV

Contains material properties for each phase-temperature combination:

Polymorph,Temperature,Parameter,Value
anatase,0K,HOMO,-7.2340
anatase,0K,LUMO,-4.1234
...
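
Because the CSV is in long format (one row per property), it usually has to be regrouped into one record per phase-temperature pair before use. The sketch below uses only the standard library and the two example rows shown above. load_labels is a hypothetical helper, and it assumes the Temperature column always carries a trailing "K".

```python
import csv
import io
from collections import defaultdict

# The two example rows shown above, used as in-memory input.
sample_csv = """Polymorph,Temperature,Parameter,Value
anatase,0K,HOMO,-7.2340
anatase,0K,LUMO,-4.1234
"""


def load_labels(handle) -> dict:
    """Group long-format rows into {(polymorph, temperature_K): {parameter: value}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(handle):
        temperature = int(row["Temperature"].rstrip("K"))  # "0K" -> 0
        table[(row["Polymorph"], temperature)][row["Parameter"]] = float(row["Value"])
    return dict(table)


labels = load_labels(io.StringIO(sample_csv))
print(labels[("anatase", 0)])
```

For the real file, the same function can be called on an open file handle, for example `load_labels(open("data/labels.csv", newline=""))`.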

Supported Models

The dataset is compatible with various model architectures:

  • Vision Models: ResNet, ViT
  • Graph Neural Networks: SchNet, DimeNet, EGNN, FAENet, GoTenNet
  • Language Models: LLMs for zero-shot/few-shot learning
  • Multimodal Models: CLIP, Pure2DopeNet, ViSNet

Performance Metrics

Primary Task - Regression

  • Mean Absolute Error (MAE)
  • Root Mean Square Error (RMSE)
  • R² score
  • Per-property evaluation metrics
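
These metrics can be computed per property with plain Python, as sketched below. The function names are hypothetical, the numbers are toy values rather than dataset results, and the sketch assumes targets and predictions are lists of rows with one column per property.

```python
import math
from typing import Dict, List, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return MAE, RMSE, and R² for one property."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R² is undefined when the targets are constant.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


def per_property_metrics(targets: List[Sequence[float]],
                         predictions: List[Sequence[float]],
                         names: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Apply regression_metrics column by column."""
    return {
        name: regression_metrics([row[i] for row in targets],
                                 [row[i] for row in predictions])
        for i, name in enumerate(names)
    }


# Toy values for illustration only.
targets = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
predictions = [[1.0, 0.0], [2.0, 1.0], [4.0, 2.0]]
report = per_property_metrics(targets, predictions, ["HOMO", "LUMO"])
print(report)
```

Reporting each property separately matters here because the nine targets have different units and scales, so an average over all of them would be dominated by the largest-magnitude property.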

Primary Task - LLM Property Prediction

  • Property prediction accuracy
  • Zero-shot vs few-shot performance comparison
  • Out-of-distribution generalization
  • Per-property evaluation metrics

Secondary Task - LLM Summary Generation

The nanoparticle summary task requires domain-specific evaluation beyond string-based metrics such as ROUGE or BLEU, which do not penalize incorrect numerical values. A more meaningful strategy is to extract structured key-value pairs (such as particle size, center of mass, coordination numbers, or bond angles) and compare them to ground truth using:

  • Information-level F₁ score that accepts only values within defined tolerances (e.g., 0.1 Å or 1 degree)
  • MAPE over all numeric entries
  • Factual consistency score like BERTScore or QA-based faithfulness after masking numeric values
  • Optional assessments of readability and clarity using expert judgment or coherence-based metrics (e.g., Coh-LM)
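
The first two metrics in this list can be sketched as follows, assuming the key-value pairs have already been extracted from the generated and reference summaries into dictionaries (the extraction step itself is not shown). The function names, keys, tolerances, and values are all hypothetical and serve only to illustrate the scoring logic.

```python
from typing import Dict


def info_f1(pred: Dict[str, float], truth: Dict[str, float], tol: Dict[str, float]) -> float:
    """F1 over key-value pairs; a predicted value counts only if within tolerance."""
    correct = sum(
        1 for key, value in pred.items()
        if key in truth and abs(value - truth[key]) <= tol.get(key, 0.0)
    )
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mape(pred: Dict[str, float], truth: Dict[str, float]) -> float:
    """Mean absolute percentage error over keys present in both dictionaries."""
    shared = [k for k in truth if k in pred and truth[k] != 0]
    if not shared:
        return float("nan")
    return 100.0 * sum(abs(pred[k] - truth[k]) / abs(truth[k]) for k in shared) / len(shared)


# Illustrative values only; keys and tolerances are assumptions.
truth = {"bond_length": 2.0, "bond_angle": 90.0, "size": 10.0}
pred = {"bond_length": 2.05, "bond_angle": 93.0}
tol = {"bond_length": 0.1, "bond_angle": 1.0, "size": 0.1}

f1 = info_f1(pred, truth, tol)
error = mape(pred, truth)
print(f1, error)
```

In this toy case the bond length is within tolerance, the bond angle is not, and the size is missing from the prediction, so both precision and recall are penalized in a way that ROUGE or BLEU would not capture.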

Tertiary Task - Classification

A three-class classification task to distinguish among the TiO₂ polymorphs. While overall accuracy provides a general overview, it is important to also report:

  • Class-wise precision, recall, and their harmonic mean (F₁ score), followed by macro-averaging to account for class imbalance
  • Full 3×3 confusion matrix to identify systematic misclassifications between phase pairs
  • Matthews correlation coefficient (MCC) and Cohen's κ statistic for chance-adjusted evaluations
  • Cross-entropy loss and macro-averaged area under the ROC curve (AUROC) when class probabilities are available
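
Several of these quantities follow directly from the confusion matrix. The sketch below computes the 3×3 matrix, macro-averaged F₁, and the multiclass MCC in plain Python. The function names are hypothetical and the labels are toy values, using the 0=anatase, 1=brookite, 2=rutile encoding from the Data Fields section.

```python
import math
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int = 3) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_f1(matrix: List[List[int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fp = sum(matrix[i][k] for i in range(n)) - tp
        fn = sum(matrix[k]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def mcc(matrix: List[List[int]]) -> float:
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n))
    true_counts = [sum(matrix[k]) for k in range(n)]
    pred_counts = [sum(matrix[i][k] for i in range(n)) for k in range(n)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    return numerator / denominator if denominator else 0.0


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred)
print(cm, macro_f1(cm), mcc(cm))
```

Here the single error is a brookite sample predicted as rutile, which shows up as the only off-diagonal entry in the matrix and lowers the F₁ of both classes involved.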

Known Limitations

  1. Limited Chemical Space: Only covers TiO₂ polymorphs
  2. Temperature Range: Limited to 0-1000K
  3. Computational Data: All properties are from DFT calculations
  4. Modality Dependencies: Some modalities may not be available for all samples

Future Work

  • Extend to other materials systems
  • Include experimental data
  • Add more temperature points
  • Incorporate additional material properties
  • Support for more crystal structures