# CrysMTM Dataset Card
## Dataset Description
- **Repository:** [CrysMTM](https://github.com/KurbanIntelligenceLab/CrysMTM)
- **Paper:** CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials
- **Authors:** Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban
- **Point of Contact:** [Can Polat](https://johnpolat.com)
### Dataset Summary
CrysMTM is a multiphase, temperature-resolved, multimodal dataset for crystalline materials research, focused on titanium dioxide (TiO₂) polymorphs. The dataset is designed primarily for regression tasks that predict 9 key material properties from multimodal inputs. It contains three crystalline phases of TiO₂ (anatase, brookite, and rutile) across a temperature range of 0-1000K, with multiple data modalities including atomic coordinates, visual representations, and textual descriptions.
### Supported Tasks and Leaderboards
The dataset primarily supports regression tasks for materials property prediction:
1. **Main Task - Regression**: Predict 9 material properties from multimodal inputs
   - HOMO energy, LUMO energy, band gap, Fermi energy, total energy, energy per atom, atomic displacement, volumetric expansion, and bond length changes
2. **Main Task - LLM Property Prediction**: Zero-shot and few-shot prediction of the 9 material properties using large language models
3. **Secondary Task - LLM Summary Generation**: Generate textual summaries of crystal structures and properties using large language models
4. **Tertiary Task - Classification**: Predict the crystalline phase (anatase, brookite, or rutile) from multimodal inputs
### Languages
The dataset contains English text descriptions of crystal structures and properties.
## Dataset Structure
### Data Instances
Each data instance represents a TiO₂ crystal structure at a specific temperature and rotation, containing:
- **Phase**: One of three TiO₂ polymorphs (anatase, brookite, rutile)
- **Temperature**: Temperature in Kelvin (0-1000K, in 50K increments)
- **Rotation**: Rotation index for the crystal structure
- **Modalities**: Multiple data representations of the same structure
### Data Fields
#### Core Metadata
- `phase` (string): Crystal phase - "anatase", "brookite", or "rutile"
- `temperature` (integer): Temperature in Kelvin (0, 50, 100, ..., 1000)
- `rotation` (integer): Rotation index for the crystal structure
#### Multimodal Data
- `image` (PIL.Image): Visual representation of the crystal structure (PNG format)
- `xyz` (torch.Tensor): Atomic coordinates in XYZ format (N×3 tensor)
- `text` (string): Textual description of the crystal structure and properties
- `element` (list): List of element symbols for each atom
#### Labels
**Primary Labels - Regression**:
- `regression_label` (torch.Tensor): 9-dimensional tensor containing the main prediction targets:
  - `HOMO` (float): HOMO energy (E_H) in eV
  - `LUMO` (float): LUMO energy (E_L) in eV
  - `Eg` (float): Band gap energy (E_g) in eV
  - `Ef` (float): Fermi energy (E_f) in eV
  - `Et` (float): Total energy of the system (E_T) in eV
  - `Eta` (float): Total energy per atom (E_Ta) in eV
  - `disp` (float): Maximum atomic displacement (Δr_max) in Å
  - `vol` (float): Volumetric expansion (ΔV) in Å³
  - `bond` (float): Ti-O bond length change (Δd_Ti-O) in Å
**Secondary Labels - Classification**:
- `label` (integer): Phase label (0=anatase, 1=brookite, 2=rutile)
**LLM Task Labels**:
- Individual property values for zero-shot/few-shot prediction
- Text summaries for generation tasks
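
The nine regression targets can be unpacked into named values for logging or per-property evaluation. The following is a minimal sketch, assuming the 9-dimensional tensor follows the order in which the properties are listed above (the card does not state the order explicitly). `name_regression_label` is a hypothetical helper and is not part of the repository.

```python
# Property names and units as listed in the card.
# Assumption: the tensor stores the values in this same order.
PROPERTY_NAMES = ["HOMO", "LUMO", "Eg", "Ef", "Et", "Eta", "disp", "vol", "bond"]
PROPERTY_UNITS = {
    "HOMO": "eV", "LUMO": "eV", "Eg": "eV", "Ef": "eV", "Et": "eV", "Eta": "eV",
    "disp": "Å", "vol": "Å^3", "bond": "Å",
}


def name_regression_label(values):
    """Map a length-9 sequence (list, tuple, or 1-D tensor) to a name -> float dict."""
    values = [float(v) for v in values]
    if len(values) != len(PROPERTY_NAMES):
        raise ValueError(f"expected {len(PROPERTY_NAMES)} values, got {len(values)}")
    return dict(zip(PROPERTY_NAMES, values))


# Placeholder numbers for illustration only; they are not taken from the dataset.
named = name_regression_label([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
for name, value in named.items():
    print(f"{name}: {value} {PROPERTY_UNITS[name]}")
```

If labels were loaded with `normalize_labels=True`, the values are normalized and the physical units no longer apply directly.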
### Data Splits
The dataset is organized by temperature ranges:
- **Training Set**: Temperatures 150-850K (excluding 250K, 450K, 650K, 750K, 800K)
- **In-Distribution (ID) Test**: Temperatures 250K, 450K, 650K, 750K, 800K
- **Out-of-Distribution (OOD) Test**: Temperatures 0K, 50K, 100K, 900K, 950K, 1000K
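
The split rule can be expressed as a lookup on temperature. This is a minimal sketch, assuming that every temperature on the 50K grid that is not held out for testing belongs to the training set. `split_for_temperature` is a hypothetical helper; the split names mirror the keys used in the usage examples.

```python
# Temperature grid from the card: 0 K to 1000 K in 50 K steps.
ALL_TEMPERATURES = list(range(0, 1001, 50))

# Held-out temperatures as listed in the card.
ID_TEST = {250, 450, 650, 750, 800}
OOD_TEST = {0, 50, 100, 900, 950, 1000}


def split_for_temperature(temperature_k: int) -> str:
    """Return 'train', 'test_id', or 'test_ood' for a grid temperature.

    Assumption: any grid temperature that is not held out is a training temperature.
    """
    if temperature_k not in ALL_TEMPERATURES:
        raise ValueError(f"{temperature_k} K is not on the 50 K grid")
    if temperature_k in ID_TEST:
        return "test_id"
    if temperature_k in OOD_TEST:
        return "test_ood"
    return "train"


train_temperatures = [t for t in ALL_TEMPERATURES if split_for_temperature(t) == "train"]
print(train_temperatures)
```

Splitting by temperature rather than by sample keeps all rotations of one structure in the same split, so the OOD test probes extrapolation beyond the temperature range covered by the training temperatures.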
### Citation Information
```bibtex
@dataset{crysmtm2024,
title={CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials},
author={Can Polat and Erchin Serpedin and Mustafa Kurban and Hasan Kurban},
year={2024},
url={https://github.com/KurbanIntelligenceLab/CrysMTM}
}
```
## Usage Examples
### Option 1: Download and Use Locally
1. **Download the dataset** from [https://huggingface.co/datasets/johnpolat/CrysMTM](https://huggingface.co/datasets/johnpolat/CrysMTM)
2. **Use the provided loading script**:
```python
# Download load_dataset.py from the repository and place it in your data directory
from load_dataset import load_dataset
# Load the dataset
dataset = load_dataset(".")
# Access splits
train_dataset = dataset["train"] # 5,064 samples
test_id_dataset = dataset["test_id"] # 1,380 samples
test_ood_dataset = dataset["test_ood"] # 6,588 samples
# Get a sample
sample = train_dataset[0]
print(f"Phase: {sample['phase']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Image: {sample['image']}")
print(f"Regression labels: {sample['regression_labels']}")
```
### Option 2: Use with Original Dataloaders
```python
from dataloaders.regression_dataloader import RegressionLoader
# Load dataset for regression (main task)
dataset = RegressionLoader(
    label_dir="data",
    modalities=["image", "xyz", "text"],
    normalize_labels=True
)
# Get a sample
sample = dataset[0]
print(f"Target Properties: {sample['regression_label']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
print(f"Image shape: {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
```
### Main Task - LLM Property Prediction
```python
from dataloaders.llm_regression_dataloader import LLMLoader
# Load dataset for LLM property prediction (main task)
dataset = LLMLoader(
    label_dir="data",
    modalities=["text", "image"]
)
# Get a sample for zero-shot/few-shot property prediction
sample = dataset[0]
print(f"HOMO: {sample['HOMO']}")
print(f"LUMO: {sample['LUMO']}")
print(f"Band gap: {sample['Eg']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
```
### Secondary Task - LLM Summary Generation
```python
from dataloaders.llm_regression_dataloader import LLMLoader
# Load dataset for LLM summary generation (secondary task)
dataset = LLMLoader(
    label_dir="data",
    modalities=["text", "image"]
)
# Get a sample for summary generation
sample = dataset[0]
print(f"Input text: {sample['text'][:200]}...")
print(f"Target properties: {sample['HOMO']}, {sample['LUMO']}, {sample['Eg']}")
```
### Tertiary Task - Classification
```python
from dataloaders.classification_dataloader import ClassificationLoader
# Load dataset for classification (tertiary task)
dataset = ClassificationLoader(
    base_dir="data",
    modalities=["image", "xyz", "text"],
    max_rotations=10
)
# Get a sample
sample = dataset[0]
print(f"Phase: {sample['label']}")
print(f"Image shape: {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
print(f"Text: {sample['text'][:100]}...")
```
### PyTorch Geometric Integration
```python
# For graph neural networks
from dataloaders.classification_dataloader import ClassificationLoader

dataset = ClassificationLoader(
    base_dir="data",
    modalities=["xyz", "element"],
    as_pyg_data=True
)
# Returns PyG Data objects
sample = dataset[0]
print(f"Node features: {sample.z}")
print(f"Positions: {sample.pos}")
print(f"Label: {sample.y}")
```
## Technical Details
### File Structure
```
data/
├── anatase/
│   ├── 0K/
│   │   ├── images/
│   │   │   ├── rot_0.png
│   │   │   ├── rot_1.png
│   │   │   └── ...
│   │   ├── xyz/
│   │   │   ├── rot_0.xyz
│   │   │   ├── rot_1.xyz
│   │   │   └── ...
│   │   └── text/
│   │       ├── rot_0.txt
│   │       ├── rot_1.txt
│   │       └── ...
│   ├── 50K/
│   └── ...
├── brookite/
├── rutile/
└── labels.csv
```
```
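
Given this layout, the three files that belong to one sample can be located from its phase, temperature, and rotation index. The sketch below assumes the directory and file naming shown in the tree; `sample_paths` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path

# Sub-directory and file extension per modality, following the tree above.
MODALITY_DIRS = {
    "image": ("images", ".png"),
    "xyz": ("xyz", ".xyz"),
    "text": ("text", ".txt"),
}


def sample_paths(root, phase, temperature_k, rotation):
    """Return a modality -> Path mapping for one (phase, temperature, rotation) sample."""
    base = Path(root) / phase / f"{temperature_k}K"
    return {
        modality: base / subdir / f"rot_{rotation}{ext}"
        for modality, (subdir, ext) in MODALITY_DIRS.items()
    }


paths = sample_paths("data", "anatase", 0, 1)
for modality, path in paths.items():
    print(modality, path)
```

The function only builds paths and does not check that the files exist, which matters because some modalities may be missing for some samples.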
### Data Formats
#### XYZ Files
Standard XYZ format with atomic coordinates:
```
[number of atoms]
[comment line]
[element] [x] [y] [z]
[element] [x] [y] [z]
...
```
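
A file in this format can be read with a few lines of standard-library Python. This is a minimal sketch that assumes whitespace-separated columns and ignores any extra columns after `z`; `parse_xyz` is a hypothetical helper, and the coordinates in the example are placeholders, not values from the dataset.

```python
def parse_xyz(text):
    """Parse XYZ-format text into (elements, coordinates, comment)."""
    lines = text.splitlines()
    n_atoms = int(lines[0].strip())   # line 1: number of atoms
    comment = lines[1].strip()        # line 2: free-form comment
    elements, coordinates = [], []
    for line in lines[2:2 + n_atoms]:
        parts = line.split()
        elements.append(parts[0])
        coordinates.append(tuple(float(value) for value in parts[1:4]))
    if len(elements) != n_atoms:
        raise ValueError(f"header declares {n_atoms} atoms, found {len(elements)}")
    return elements, coordinates, comment


# Placeholder three-atom fragment for illustration only.
example = "3\nplaceholder fragment\nTi 0.0 0.0 0.0\nO 1.9 0.0 0.0\nO -1.9 0.0 0.0\n"
elements, coordinates, comment = parse_xyz(example)
print(elements, coordinates, comment)
```

The `coordinates` list has one `(x, y, z)` tuple per atom, so it maps directly onto the N×3 layout of the `xyz` field, and `elements` corresponds to the `element` field.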
#### Images
PNG format visualizations of crystal structures.
#### Text Files
Natural language descriptions of crystal structures and properties.
#### Labels CSV
Contains material properties for each phase-temperature combination:
```csv
Polymorph,Temperature,Parameter,Value
anatase,0K,HOMO,-7.2340
anatase,0K,LUMO,-4.1234
...
```
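
Because the CSV is in long format (one row per property), it usually needs to be pivoted into one record per phase-temperature pair before training. The sketch below assumes the four column names shown above and a trailing `K` on each temperature; `load_labels` is a hypothetical helper, and the two data rows are the example rows from this card.

```python
import csv
import io
from collections import defaultdict


def load_labels(csv_file):
    """Pivot long-format rows into {(polymorph, temperature_K): {parameter: value}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        temperature = int(row["Temperature"].rstrip("K"))  # "0K" -> 0
        table[(row["Polymorph"], temperature)][row["Parameter"]] = float(row["Value"])
    return dict(table)


example = io.StringIO(
    "Polymorph,Temperature,Parameter,Value\n"
    "anatase,0K,HOMO,-7.2340\n"
    "anatase,0K,LUMO,-4.1234\n"
)
labels = load_labels(example)
print(labels[("anatase", 0)])
```

To read the real file, pass an open handle instead, for example `with open("data/labels.csv", newline="") as f: labels = load_labels(f)`.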
### Supported Models
The dataset is compatible with various model architectures:
- **Vision Models**: ResNet, ViT
- **Graph Neural Networks**: SchNet, DimeNet, EGNN, FAENet, GoTenNet
- **Language Models**: LLMs for zero-shot/few-shot learning
- **Multimodal Models**: CLIP, Pure2DopeNet, ViSNet
### Performance Metrics
#### Primary Task - Regression
- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
- R² score
- Per-property evaluation metrics
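
These metrics can be computed per property without external dependencies. The following is a minimal sketch using the standard definitions of MAE, RMSE, and R²; the function names are hypothetical, and the numbers in the example are placeholders rather than benchmark results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences of floats."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R^2 is undefined when the targets have zero variance.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


def per_property_metrics(true_rows, pred_rows, names):
    """Apply regression_metrics column by column (one column per property)."""
    return {
        name: regression_metrics([row[i] for row in true_rows],
                                 [row[i] for row in pred_rows])
        for i, name in enumerate(names)
    }


# Placeholder values for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

Reporting per property matters here because the targets differ by orders of magnitude (total energy versus bond length change), so a single pooled error would be dominated by the largest-valued property.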
#### Primary Task - LLM Property Prediction
- Property prediction accuracy
- Zero-shot vs few-shot performance comparison
- Out-of-distribution generalization
- Per-property evaluation metrics
#### Secondary Task - LLM Summary Generation
The nanoparticle summary task requires domain-specific evaluation beyond string-based metrics such as ROUGE or BLEU, which do not penalize incorrect numerical values. A more meaningful strategy is to extract structured key-value pairs (such as particle size, center of mass, coordination numbers, or bond angles) and compare them to ground truth using:
- Information-level F₁ score that accepts only values within defined tolerances (e.g., 0.1 Å or 1 degree)
- MAPE over all numeric entries
- Factual consistency score like BERTScore or QA-based faithfulness after masking numeric values
- Optional assessments of readability and clarity using expert judgment or coherence-based metrics (e.g., Coh-LM)
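
The first two metrics can be sketched as follows, assuming the key-value pairs have already been extracted from the generated and reference summaries (the extraction step itself is not shown). The function names, keys, values, and tolerances are hypothetical and serve only as an illustration.

```python
def information_f1(predicted, reference, tolerances, default_tolerance=0.0):
    """F1 over key-value pairs; a pair counts as correct only within its tolerance."""
    matched = sum(
        1
        for key, value in predicted.items()
        if key in reference
        and abs(value - reference[key]) <= tolerances.get(key, default_tolerance)
    )
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mape(predicted, reference):
    """Mean absolute percentage error over keys present in both dicts (nonzero references)."""
    shared = [k for k in reference if k in predicted and reference[k] != 0]
    if not shared:
        return float("nan")
    total = sum(abs(predicted[k] - reference[k]) / abs(reference[k]) for k in shared)
    return 100.0 * total / len(shared)


# Hypothetical extracted values: size within tolerance, angle outside it.
reference = {"particle_size": 12.0, "bond_angle": 90.0}
predicted = {"particle_size": 12.05, "bond_angle": 93.0}
tolerances = {"particle_size": 0.1, "bond_angle": 1.0}
print(information_f1(predicted, reference, tolerances), mape(predicted, reference))
```

Unlike ROUGE or BLEU, the F₁ score here drops as soon as a reported number falls outside its tolerance, even if the surrounding wording matches the reference exactly.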
#### Tertiary Task - Classification
A three-class classification task to distinguish among the TiO₂ polymorphs. While overall accuracy provides a general overview, it is important to also report:
- Class-wise precision, recall, and their harmonic mean (F₁ score), followed by macro-averaging to account for class imbalance
- Full 3×3 confusion matrix to identify systematic misclassifications between phase pairs
- Matthews correlation coefficient (MCC) and Cohen's κ statistic for chance-adjusted evaluations
- Cross-entropy loss and macro-averaged area under the ROC curve (AUROC) when class probabilities are available
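
Several of these metrics follow directly from the confusion matrix. The sketch below uses the integer phase labels from this card (0=anatase, 1=brookite, 2=rutile), computes macro-averaged F₁, and uses the standard multiclass generalization of MCC. The function names are hypothetical and the predictions are placeholders.

```python
import math


def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        n_predicted = sum(row[k] for row in cm)
        n_true = sum(cm[k])
        precision = tp / n_predicted if n_predicted else 0.0
        recall = tp / n_true if n_true else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    s = sum(sum(row) for row in cm)                              # total samples
    c = sum(cm[k][k] for k in range(len(cm)))                    # correct predictions
    t = [sum(row) for row in cm]                                 # true count per class
    p = [sum(row[k] for row in cm) for k in range(len(cm))]      # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                            (s * s - sum(tk * tk for tk in t)))
    return numerator / denominator if denominator else 0.0


# Placeholder predictions: one brookite sample misclassified as rutile.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
print(cm, macro_f1(cm), multiclass_mcc(cm))
```

The off-diagonal entries of `cm` show which phase pairs are confused, which overall accuracy alone does not reveal.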
### Known Limitations
1. **Limited Chemical Space**: Only covers TiO₂ polymorphs
2. **Temperature Range**: Limited to 0-1000K
3. **Computational Data**: All properties are from DFT calculations
4. **Modality Dependencies**: Some modalities may not be available for all samples
### Future Work
- Extend to other materials systems
- Include experimental data
- Add more temperature points
- Incorporate additional material properties
- Support for more crystal structures