# CrysMTM Dataset Card

## Dataset Description

- **Repository:** [CrysMTM](https://github.com/KurbanIntelligenceLab/CrysMTM)
- **Paper:** CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials
- **Authors:** Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban
- **Point of Contact:** [Can Polat](https://johnpolat.com)

### Dataset Summary

CrysMTM is a comprehensive multiphase, temperature-resolved, multimodal dataset for crystalline materials research, specifically focused on titanium dioxide (TiO₂) polymorphs. The dataset is designed primarily for regression tasks to predict 9 key material properties from multimodal inputs. It contains three crystalline phases of TiO₂ (anatase, brookite, and rutile) across a temperature range of 0-1000K, with multiple data modalities including atomic coordinates, visual representations, and textual descriptions.

### Supported Tasks and Leaderboards

The dataset primarily supports regression tasks for materials property prediction:

1. **Main Task - Regression**: Predict 9 material properties from multimodal inputs
   - HOMO energy, LUMO energy, band gap, Fermi energy, total energy, energy per atom, atomic displacement, volumetric expansion, and bond length changes
2. **Main Task - LLM Property Prediction**: Zero-shot and few-shot prediction of the 9 material properties using large language models
3. **Secondary Task - LLM Summary Generation**: Generate textual summaries of crystal structures and properties using large language models
4. **Tertiary Task - Classification**: Predict the crystalline phase (anatase, brookite, or rutile) from multimodal inputs

### Languages

The dataset contains English text descriptions of crystal structures and properties.

## Dataset Structure

### Data Instances

Each data instance represents a TiO₂ crystal structure at a specific temperature and rotation, containing:

- **Phase**: One of three TiO₂ polymorphs (anatase, brookite, rutile)
- **Temperature**: Temperature in Kelvin (0-1000K, in 50K increments)
- **Rotation**: Rotation index for the crystal structure
- **Modalities**: Multiple data representations of the same structure

### Data Fields

#### Core Metadata
- `phase` (string): Crystal phase - "anatase", "brookite", or "rutile"
- `temperature` (integer): Temperature in Kelvin (0, 50, 100, ..., 1000)
- `rotation` (integer): Rotation index for the crystal structure

#### Multimodal Data
- `image` (PIL.Image): Visual representation of the crystal structure (PNG format)
- `xyz` (torch.Tensor): Atomic coordinates in XYZ format (N×3 tensor)
- `text` (string): Textual description of the crystal structure and properties
- `element` (list): List of element symbols for each atom

#### Labels
**Primary Labels - Regression**:
- `regression_label` (torch.Tensor): 9-dimensional tensor containing the main prediction targets:
  - `HOMO` (float): HOMO energy (E_H) in eV
  - `LUMO` (float): LUMO energy (E_L) in eV
  - `Eg` (float): Band gap energy (E_g) in eV
  - `Ef` (float): Fermi energy (E_f) in eV
  - `Et` (float): Total energy of the system (E_T) in eV
  - `Eta` (float): Total energy per atom (E_Ta) in eV
  - `disp` (float): Maximum atomic displacement (Δr_max) in Å
  - `vol` (float): Volumetric expansion (ΔV) in Å³
  - `bond` (float): Ti-O bond length change (Δd_Ti-O) in Å

**Secondary Labels - Classification**:
- `label` (integer): Phase label (0=anatase, 1=brookite, 2=rutile)

**LLM Task Labels**:
- Individual property values for zero-shot/few-shot prediction
- Text summaries for generation tasks
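
The 9-dimensional `regression_label` is easier to work with when each entry carries its name. The sketch below assumes the tensor entries follow the same order as the property list above; that ordering is an assumption and should be checked against the dataloader. The helper `name_regression_labels` is hypothetical and the numbers are placeholders, not dataset values.

```python
# Assumed ordering: the order in which the properties are listed in this card.
PROPERTY_NAMES = ["HOMO", "LUMO", "Eg", "Ef", "Et", "Eta", "disp", "vol", "bond"]


def name_regression_labels(values):
    """Map a 9-element sequence of label values to a {name: float} dict."""
    values = [float(v) for v in values]
    if len(values) != len(PROPERTY_NAMES):
        raise ValueError(
            f"expected {len(PROPERTY_NAMES)} values, got {len(values)}"
        )
    return dict(zip(PROPERTY_NAMES, values))


# Placeholder numbers for illustration only, not real dataset values.
example = name_regression_labels(
    [-7.0, -4.0, 3.0, -5.5, -100.0, -10.0, 0.1, 0.2, 0.01]
)
print(example["Eg"])
```

A `torch.Tensor` can be passed after converting it with `.tolist()`. If `normalize_labels=True` was used, the values are in normalized units rather than eV or Å.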

### Data Splits

The dataset is organized by temperature ranges:

- **Training Set**: Temperatures 150-850K (excluding the ID test temperatures 250K, 450K, 650K, 750K, 800K)
- **In-Distribution (ID) Test**: Temperatures 250K, 450K, 650K, 750K, 800K
- **Out-of-Distribution (OOD) Test**: Temperatures 0K, 50K, 100K, 900K, 950K, 1000K
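
The split rule can be written as a small lookup. This is a minimal sketch based only on the temperatures listed above: the two test sets are taken verbatim, and every remaining temperature on the 50K grid is treated as training. The function name `assign_split` is hypothetical and is not part of the repository.

```python
# Temperatures (in K) taken from the split description in this card.
ID_TEST_TEMPS = {250, 450, 650, 750, 800}
OOD_TEST_TEMPS = {0, 50, 100, 900, 950, 1000}
ALL_TEMPS = range(0, 1001, 50)  # 0, 50, ..., 1000


def assign_split(temperature_k):
    """Return 'train', 'test_id', or 'test_ood' for a grid temperature."""
    if temperature_k not in ALL_TEMPS:
        raise ValueError(f"{temperature_k}K is not on the 50K grid from 0 to 1000K")
    if temperature_k in OOD_TEST_TEMPS:
        return "test_ood"
    if temperature_k in ID_TEST_TEMPS:
        return "test_id"
    return "train"


train_temps = [t for t in ALL_TEMPS if assign_split(t) == "train"]
print(train_temps)
```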

### Citation Information
```bibtex
@dataset{crysmtm2024,
  title={CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials},
  author={Can Polat and Erchin Serpedin and Mustafa Kurban and Hasan Kurban},
  year={2024},
  url={https://github.com/KurbanIntelligenceLab/CrysMTM}
}
```

## Usage Examples

### Option 1: Download and Use Locally

1. **Download the dataset** from [https://huggingface.co/datasets/johnpolat/CrysMTM](https://huggingface.co/datasets/johnpolat/CrysMTM)
2. **Use the provided loading script**:

```python
# Download load_dataset.py from the repository and place it in your data directory
from load_dataset import load_dataset

# Load the dataset
dataset = load_dataset(".")

# Access splits
train_dataset = dataset["train"]      # 5,064 samples
test_id_dataset = dataset["test_id"]  # 1,380 samples
test_ood_dataset = dataset["test_ood"] # 6,588 samples

# Get a sample
sample = train_dataset[0]
print(f"Phase: {sample['phase']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Image: {sample['image']}")
print(f"Regression labels: {sample['regression_label']}")
```

### Option 2: Use with Original Dataloaders

```python
from dataloaders.regression_dataloader import RegressionLoader

# Load dataset for regression (main task)
dataset = RegressionLoader(
    label_dir="data",
    modalities=["image", "xyz", "text"],
    normalize_labels=True
)

# Get a sample
sample = dataset[0]
print(f"Target Properties: {sample['regression_label']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
print(f"Image size (width, height): {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
```

### Main Task - LLM Property Prediction
```python
from dataloaders.llm_regression_dataloader import LLMLoader

# Load dataset for LLM property prediction (main task)
dataset = LLMLoader(
    label_dir="data",
    modalities=["text", "image"]
)

# Get a sample for zero-shot/few-shot property prediction
sample = dataset[0]
print(f"HOMO: {sample['HOMO']}")
print(f"LUMO: {sample['LUMO']}")
print(f"Band gap: {sample['Eg']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
```

### Secondary Task - LLM Summary Generation
```python
from dataloaders.llm_regression_dataloader import LLMLoader

# Load dataset for LLM summary generation (secondary task)
dataset = LLMLoader(
    label_dir="data",
    modalities=["text", "image"]
)

# Get a sample for summary generation
sample = dataset[0]
print(f"Input text: {sample['text'][:200]}...")
print(f"Target properties: {sample['HOMO']}, {sample['LUMO']}, {sample['Eg']}")
```

### Tertiary Task - Classification
```python
from dataloaders.classification_dataloader import ClassificationLoader

# Load dataset for classification (tertiary task)
dataset = ClassificationLoader(
    base_dir="data",
    modalities=["image", "xyz", "text"],
    max_rotations=10
)

# Get a sample
sample = dataset[0]
print(f"Phase: {sample['label']}")
print(f"Image size (width, height): {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
print(f"Text: {sample['text'][:100]}...")
```

### PyTorch Geometric Integration
```python
# For graph neural networks
from dataloaders.classification_dataloader import ClassificationLoader

dataset = ClassificationLoader(
    base_dir="data",
    modalities=["xyz", "element"],
    as_pyg_data=True
)

# Returns PyG Data objects
sample = dataset[0]
print(f"Node features: {sample.z}")
print(f"Positions: {sample.pos}")
print(f"Label: {sample.y}")
```

## Technical Details

### File Structure
```
data/
├── anatase/
│   ├── 0K/
│   │   ├── images/
│   │   │   ├── rot_0.png
│   │   │   ├── rot_1.png
│   │   │   └── ...
│   │   ├── xyz/
│   │   │   ├── rot_0.xyz
│   │   │   ├── rot_1.xyz
│   │   │   └── ...
│   │   └── text/
│   │       ├── rot_0.txt
│   │       ├── rot_1.txt
│   │       └── ...
│   ├── 50K/
│   └── ...
├── brookite/
├── rutile/
└── labels.csv
```
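
Given this layout, the three files for one sample can be located from its phase, temperature, and rotation index. The helper below is a hypothetical sketch that only mirrors the directory tree shown above; it does not check that the files exist and is not part of the repository.

```python
from pathlib import PurePosixPath


def sample_paths(base_dir, phase, temperature_k, rotation):
    """Build the image, xyz, and text paths for one sample."""
    folder = PurePosixPath(base_dir) / phase / f"{temperature_k}K"
    stem = f"rot_{rotation}"
    return {
        "image": folder / "images" / f"{stem}.png",
        "xyz": folder / "xyz" / f"{stem}.xyz",
        "text": folder / "text" / f"{stem}.txt",
    }


paths = sample_paths("data", "anatase", 0, 1)
print(paths["xyz"])
```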

### Data Formats

#### XYZ Files
Standard XYZ format with atomic coordinates:
```
[number of atoms]
[comment line]
[element] [x] [y] [z]
[element] [x] [y] [z]
...
```
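
A file in this format can be read with a few lines of plain Python. The parser below is a minimal sketch, not the repository's loader; `parse_xyz` is a hypothetical name, and the coordinates in the example are made up for illustration.

```python
def parse_xyz(text):
    """Parse XYZ-format text into (elements, coordinates, comment)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])          # line 1: number of atoms
    comment = lines[1] if len(lines) > 1 else ""  # line 2: free-form comment
    elements, coords = [], []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(elements) != n_atoms:
        raise ValueError(f"header says {n_atoms} atoms, found {len(elements)}")
    return elements, coords, comment


# Toy structure with made-up coordinates, not taken from the dataset.
toy_xyz = """3
toy example
Ti 0.0 0.0 0.0
O 1.9 0.0 0.0
O -1.9 0.0 0.0
"""
elements, coords, comment = parse_xyz(toy_xyz)
print(elements, coords[1])
```

The returned coordinate list has one `(x, y, z)` tuple per atom, matching the N×3 layout of the `xyz` field.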

#### Images
PNG format visualizations of crystal structures.

#### Text Files
Natural language descriptions of crystal structures and properties.

#### Labels CSV
Contains material properties for each phase-temperature combination:
```csv
Polymorph,Temperature,Parameter,Value
anatase,0K,HOMO,-7.2340
anatase,0K,LUMO,-4.1234
...
```
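
Because the CSV is in long format (one row per property), it is often convenient to group rows by phase and temperature. The sketch below uses only the standard library and the column names shown above; `load_labels` is a hypothetical helper, and the two data rows are the illustrative rows from the excerpt above.

```python
import csv
import io
from collections import defaultdict


def load_labels(csv_file):
    """Group long-format rows into {(polymorph, temperature_K): {parameter: value}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        temperature = int(row["Temperature"].rstrip("K"))  # "0K" -> 0
        key = (row["Polymorph"], temperature)
        table[key][row["Parameter"]] = float(row["Value"])
    return dict(table)


# In practice, pass an open file handle for labels.csv instead of this excerpt.
sample_csv = io.StringIO(
    "Polymorph,Temperature,Parameter,Value\n"
    "anatase,0K,HOMO,-7.2340\n"
    "anatase,0K,LUMO,-4.1234\n"
)
labels = load_labels(sample_csv)
print(labels[("anatase", 0)])
```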

### Supported Models

The dataset is compatible with various model architectures:

- **Vision Models**: ResNet, ViT
- **Graph Neural Networks**: SchNet, DimeNet, EGNN, FAENet, GoTenNet
- **Language Models**: LLMs for zero-shot/few-shot learning
- **Multimodal Models**: CLIP, Pure2DopeNet, ViSNet 

### Performance Metrics

#### Primary Task - Regression
- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
- R² score
- Per-property evaluation metrics
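
These metrics can be computed per property with standard formulas. The following is a minimal pure-Python sketch, not the evaluation code used for the paper; `regression_metrics` is a hypothetical name and the numbers are toy values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for one property."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R² is undefined when all targets are identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Toy values for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

Calling the function once per property column gives the per-property breakdown. Metrics computed on normalized labels are not in physical units, so state which scale is reported.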

#### Primary Task - LLM Property Prediction
- Property prediction accuracy
- Zero-shot vs few-shot performance comparison
- Out-of-distribution generalization
- Per-property evaluation metrics

#### Secondary Task - LLM Summary Generation
The nanoparticle summary task requires domain-specific evaluation beyond traditional string-based metrics like ROUGE or BLEU, which do not penalize incorrect numerical values. A more meaningful strategy is to extract structured key-value pairs (such as particle size, center of mass, coordination numbers, or bond angles) and compare them to ground truth using:
- Information-level F₁ score that accepts only values within defined tolerances (e.g., 0.1 Å or 1 degree)
- MAPE over all numeric entries
- Factual consistency score like BERTScore or QA-based faithfulness after masking numeric values
- Optional assessments of readability and clarity using expert judgment or coherence-based metrics (e.g., Coh-LM)
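
The first two metrics can be sketched once the key-value pairs have been extracted from the generated and reference summaries. The code below assumes that extraction has already happened and that both sides are plain `{key: float}` dictionaries; the function names, key names, values, and tolerances are all hypothetical.

```python
def tolerance_f1(predicted, reference, tolerances, default_tol=0.0):
    """F1 where a predicted value is correct only if it is within tolerance."""
    correct = sum(
        1
        for key, value in predicted.items()
        if key in reference
        and abs(value - reference[key]) <= tolerances.get(key, default_tol)
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mape(predicted, reference):
    """Mean absolute percentage error over keys present on both sides."""
    shared = [k for k in reference if k in predicted and reference[k] != 0]
    if not shared:
        return float("nan")
    total = sum(abs((predicted[k] - reference[k]) / reference[k]) for k in shared)
    return 100.0 * total / len(shared)


# Made-up values for illustration only.
reference = {"particle_size": 10.0, "ti_coordination": 6.0, "o_ti_o_angle": 90.0}
predicted = {"particle_size": 10.05, "ti_coordination": 5.0, "o_ti_o_angle": 90.5}
tolerances = {"particle_size": 0.1, "ti_coordination": 0.0, "o_ti_o_angle": 1.0}

f1 = tolerance_f1(predicted, reference, tolerances)
error = mape(predicted, reference)
print(f1, error)
```

Here two of the three predicted values fall within tolerance, so precision and recall are both 2/3. Keys with a reference value of zero are skipped in the MAPE because the percentage error is undefined for them.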

#### Tertiary Task - Classification
A three-class classification task to distinguish among the TiO₂ polymorphs. While overall accuracy provides a general overview, it is important to also report:
- Class-wise precision, recall, and their harmonic mean (F₁ score), followed by macro-averaging to account for class imbalance
- Full 3×3 confusion matrix to identify systematic misclassifications between phase pairs
- Matthews correlation coefficient (MCC) and Cohen's κ statistic for chance-adjusted evaluations
- Cross-entropy loss and macro-averaged area under the ROC curve (AUROC) when class probabilities are available
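
Several of these quantities follow directly from the confusion matrix. The sketch below computes the 3×3 matrix, macro-averaged F₁, and the multiclass MCC in plain Python, using the label encoding from this card (0=anatase, 1=brookite, 2=rutile). The function names are hypothetical and the predictions are toy values; established libraries such as scikit-learn offer equivalent, better-tested routines.

```python
import math


def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[r][k] for r in range(n))
        actual_k = sum(matrix[k])
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / n


def multiclass_mcc(matrix):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n))
    actual = [sum(matrix[k]) for k in range(n)]
    predicted = [sum(matrix[r][k] for r in range(n)) for k in range(n)]
    numerator = correct * total - sum(a * p for a, p in zip(actual, predicted))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in predicted))
        * (total ** 2 - sum(a * a for a in actual))
    )
    return numerator / denominator if denominator else 0.0


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred)
print(cm, macro_f1(cm), multiclass_mcc(cm))
```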

### Known Limitations

1. **Limited Chemical Space**: Only covers TiO₂ polymorphs
2. **Temperature Range**: Limited to 0-1000K
3. **Computational Data**: All properties are from DFT calculations
4. **Modality Dependencies**: Some modalities may not be available for all samples

### Future Work

- Extend to other materials systems
- Include experimental data
- Add more temperature points
- Incorporate additional material properties
- Support for more crystal structures