johnpolat committed on
Commit
a2b3944
·
verified ·
1 Parent(s): c144b01

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +285 -0
README.md ADDED
@@ -0,0 +1,285 @@
1
+ # CrysMTM Dataset Card
2
+
3
+ ## Dataset Description
4
+
5
+ - **Repository:** [CrysMTM](https://github.com/KurbanIntelligenceLab/CrysMTM)
6
+ - **Paper:** CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials
7
+ - **Authors:** Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban
8
+ - **Point of Contact:** [Can Polat](johnpolat.com)
9
+
10
+ ### Dataset Summary
11
+
12
+ CrysMTM is a multiphase, temperature-resolved, multimodal dataset for crystalline materials research, focused on titanium dioxide (TiO₂) polymorphs. It is designed primarily for regression tasks that predict 9 key material properties from multimodal inputs. It contains three crystalline phases of TiO₂ (anatase, brookite, and rutile) across a temperature range of 0-1000K, with multiple data modalities including atomic coordinates, visual representations, and textual descriptions.
13
+
14
+ ### Supported Tasks and Leaderboards
15
+
16
+ The dataset primarily supports regression tasks for materials property prediction:
17
+
18
+ 1. **Main Task - Regression**: Predict 9 material properties from multimodal inputs
19
+    - HOMO energy, LUMO energy, band gap, Fermi energy, total energy, energy per atom, atomic displacement, volumetric expansion, and bond length changes
20
+ 2. **Main Task - LLM Property Prediction**: Zero-shot and few-shot prediction of the 9 material properties using large language models
21
+ 3. **Secondary Task - LLM Summary Generation**: Generate textual summaries of crystal structures and properties using large language models
22
+ 4. **Tertiary Task - Classification**: Predict the crystalline phase (anatase, brookite, or rutile) from multimodal inputs
23
+
24
+ ### Languages
25
+
26
+ The dataset contains English text descriptions of crystal structures and properties.
27
+
28
+ ## Dataset Structure
29
+
30
+ ### Data Instances
31
+
32
+ Each data instance represents a TiO₂ crystal structure at a specific temperature and rotation, containing:
33
+
34
+ - **Phase**: One of three TiO₂ polymorphs (anatase, brookite, rutile)
35
+ - **Temperature**: Temperature in Kelvin (0-1000K, in 50K increments)
36
+ - **Rotation**: Rotation index for the crystal structure
37
+ - **Modalities**: Multiple data representations of the same structure
38
+
39
+ ### Data Fields
40
+
41
+ #### Core Metadata
42
+ - `phase` (string): Crystal phase - "anatase", "brookite", or "rutile"
43
+ - `temperature` (integer): Temperature in Kelvin (0, 50, 100, ..., 1000)
44
+ - `rotation` (integer): Rotation index for the crystal structure
45
+
46
+ #### Multimodal Data
47
+ - `image` (PIL.Image): Visual representation of the crystal structure (PNG format)
48
+ - `xyz` (torch.Tensor): Atomic coordinates in XYZ format (N×3 tensor)
49
+ - `text` (string): Textual description of the crystal structure and properties
50
+ - `element` (list): List of element symbols for each atom
51
+
52
+ #### Labels
53
+ **Primary Labels - Regression**:
54
+ - `regression_label` (torch.Tensor): 9-dimensional tensor containing the main prediction targets:
55
+   - `HOMO` (float): HOMO energy (E_H) in eV
56
+   - `LUMO` (float): LUMO energy (E_L) in eV
57
+   - `Eg` (float): Band gap energy (E_g) in eV
58
+   - `Ef` (float): Fermi energy (E_f) in eV
59
+   - `Et` (float): Total energy of the system (E_T) in eV
60
+   - `Eta` (float): Total energy per atom (E_Ta) in eV
61
+   - `disp` (float): Maximum atomic displacement (Δr_max) in Å
62
+   - `vol` (float): Volumetric expansion (ΔV) in ų
63
+   - `bond` (float): Ti-O bond length change (Δd_Ti-O) in Å
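
As a rough illustration of how the 9-dimensional `regression_label` could be mapped back to named properties, the sketch below pairs values with the field names above. It assumes the tensor follows the order in which this card lists the properties; the actual ordering used by the CrysMTM loaders is not confirmed here, and the helper name `name_regression_label` and the sample values are hypothetical.

```python
# Property names in the order this card lists them.
# ASSUMPTION: the loaders store the label tensor in this same order.
PROPERTY_NAMES = ["HOMO", "LUMO", "Eg", "Ef", "Et", "Eta", "disp", "vol", "bond"]


def name_regression_label(values):
    """Pair a 9-element label vector with the property names (hypothetical helper).

    Accepts any iterable of numbers, e.g. a list or a 1-D torch.Tensor.
    """
    values = [float(v) for v in values]
    if len(values) != len(PROPERTY_NAMES):
        raise ValueError(
            f"expected {len(PROPERTY_NAMES)} values, got {len(values)}"
        )
    return dict(zip(PROPERTY_NAMES, values))


# Made-up numbers, for illustration only (not taken from the dataset).
example = name_regression_label([-7.2, -4.1, 3.1, -5.6, -1000.0, -10.0, 0.05, 1.2, 0.01])
print(example["Eg"])
```

If labels were loaded with `normalize_labels=True`, the values would presumably be in normalized units rather than eV or Å.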
64
+
65
+ **Secondary Labels - Classification**:
66
+ - `label` (integer): Phase label (0=anatase, 1=brookite, 2=rutile)
67
+
68
+ **LLM Task Labels**:
69
+ - Individual property values for zero-shot/few-shot prediction
70
+ - Text summaries for generation tasks
71
+
72
+ ### Data Splits
73
+
74
+ The dataset is organized by temperature ranges:
75
+
76
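+ - **Training Set**: All remaining temperatures in the 0-850K range, i.e., excluding the ID test temperatures (250K, 450K, 650K, 750K, 800K) and the low-temperature OOD test temperatures (0K, 50K, 100K)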
77
+ - **In-Distribution (ID) Test**: Temperatures 250K, 450K, 650K, 750K, 800K
78
+ - **Out-of-Distribution (OOD) Test**: Temperatures 0K, 50K, 100K, 900K, 950K, 1000K
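
The split rule can be sketched as a small lookup. This is only an illustration: the function name `split_for` is hypothetical, and it assumes that every temperature on the 50K grid that appears in neither test list belongs to the training set.

```python
# Temperature lists copied from the split description above.
ID_TEST_TEMPS = {250, 450, 650, 750, 800}
OOD_TEST_TEMPS = {0, 50, 100, 900, 950, 1000}
ALL_TEMPS = range(0, 1001, 50)  # 0K to 1000K in 50K increments


def split_for(temperature_k):
    """Return the split name for a temperature in Kelvin (hypothetical helper).

    ASSUMPTION: any grid temperature outside the two test lists is training data.
    """
    if temperature_k not in ALL_TEMPS:
        raise ValueError(f"{temperature_k}K is not on the 50K grid")
    if temperature_k in OOD_TEST_TEMPS:
        return "ood_test"
    if temperature_k in ID_TEST_TEMPS:
        return "id_test"
    return "train"


# Group the whole temperature grid by split.
splits = {}
for t in ALL_TEMPS:
    splits.setdefault(split_for(t), []).append(t)

for name, temps in splits.items():
    print(name, temps)
```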
79
+
80
+ ### Citation Information
81
+ ```bibtex
82
+ @dataset{crysmtm2024,
83
+ title={CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials},
84
+ author={Can Polat and Erchin Serpedin and Mustafa Kurban and Hasan Kurban},
85
+ year={2024},
86
+ url={https://github.com/KurbanIntelligenceLab/CrysMTM}
87
+ }
88
+ ```
89
+
90
+ ## Usage Examples
91
+
92
+ ### Main Task - Regression
93
+ ```python
94
+ from dataloaders.regression_dataloader import RegressionLoader
95
+
96
+ # Load dataset for regression (main task)
97
+ dataset = RegressionLoader(
98
+     label_dir="data",
99
+     modalities=["image", "xyz", "text"],
100
+     normalize_labels=True
101
+ )
102
+
103
+ # Get a sample
104
+ sample = dataset[0]
105
+ print(f"Target Properties: {sample['regression_label']}")
106
+ print(f"Temperature: {sample['temperature']}K")
107
+ print(f"Phase: {sample['phase']}")
108
+ print(f"Image shape: {sample['image'].size}")
109
+ print(f"XYZ coordinates shape: {sample['xyz'].shape}")
110
+ ```
111
+
112
+ ### Main Task - LLM Property Prediction
113
+ ```python
114
+ from dataloaders.llm_regression_dataloader import LLMLoader
115
+
116
+ # Load dataset for LLM property prediction (main task)
117
+ dataset = LLMLoader(
118
+     label_dir="data",
119
+     modalities=["text", "image"]
120
+ )
121
+
122
+ # Get a sample for zero-shot/few-shot property prediction
123
+ sample = dataset[0]
124
+ print(f"HOMO: {sample['HOMO']}")
125
+ print(f"LUMO: {sample['LUMO']}")
126
+ print(f"Band gap: {sample['Eg']}")
127
+ print(f"Temperature: {sample['temperature']}K")
128
+ print(f"Phase: {sample['phase']}")
129
+ ```
130
+
131
+ ### Secondary Task - LLM Summary Generation
132
+ ```python
133
+ from dataloaders.llm_regression_dataloader import LLMLoader
134
+
135
+ # Load dataset for LLM summary generation (secondary task)
136
+ dataset = LLMLoader(
137
+     label_dir="data",
138
+     modalities=["text", "image"]
139
+ )
140
+
141
+ # Get a sample for summary generation
142
+ sample = dataset[0]
143
+ print(f"Input text: {sample['text'][:200]}...")
144
+ print(f"Target properties: {sample['HOMO']}, {sample['LUMO']}, {sample['Eg']}")
145
+ ```
146
+
147
+ ### Tertiary Task - Classification
148
+ ```python
149
+ from dataloaders.classification_dataloader import ClassificationLoader
150
+
151
+ # Load dataset for classification (tertiary task)
152
+ dataset = ClassificationLoader(
153
+     base_dir="data",
154
+     modalities=["image", "xyz", "text"],
155
+     max_rotations=10
156
+ )
157
+
158
+ # Get a sample
159
+ sample = dataset[0]
160
+ print(f"Phase: {sample['label']}")
161
+ print(f"Image shape: {sample['image'].size}")
162
+ print(f"XYZ coordinates shape: {sample['xyz'].shape}")
163
+ print(f"Text: {sample['text'][:100]}...")
164
+ ```
165
+
166
+ ### PyTorch Geometric Integration
167
+ ```python
168
+ from dataloaders.classification_dataloader import ClassificationLoader
+
+ # For graph neural networks
169
+ dataset = ClassificationLoader(
170
+     base_dir="data",
171
+     modalities=["xyz", "element"],
172
+     as_pyg_data=True
173
+ )
174
+
175
+ # Returns PyG Data objects
176
+ sample = dataset[0]
177
+ print(f"Node features: {sample.z}")
178
+ print(f"Positions: {sample.pos}")
179
+ print(f"Label: {sample.y}")
180
+ ```
181
+
182
+ ## Technical Details
183
+
184
+ ### File Structure
185
+ ```
186
+ data/
187
+ ├── anatase/
188
+ │   ├── 0K/
189
+ │   │   ├── images/
190
+ │   │   │   ├── rot_0.png
191
+ │   │   │   ├── rot_1.png
192
+ │   │   │   └── ...
193
+ │   │   ├── xyz/
194
+ │   │   │   ├── rot_0.xyz
195
+ │   │   │   ├── rot_1.xyz
196
+ │   │   │   └── ...
197
+ │   │   └── text/
198
+ │   │       ├── rot_0.txt
199
+ │   │       ├── rot_1.txt
200
+ │   │       └── ...
201
+ │   ├── 50K/
202
+ │   └── ...
203
+ ├── brookite/
204
+ ├── rutile/
205
+ └── labels.csv
206
+ ```
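
Given the directory layout above, the files for one sample can be located with a few lines of `pathlib`. This is a minimal sketch that assumes the `rot_<index>` naming and the `images/`, `xyz/`, and `text/` subfolders shown in the tree; `sample_paths` is a hypothetical helper, not part of the CrysMTM loaders.

```python
from pathlib import Path

# Modality name -> (subfolder, file extension), read off the tree above.
MODALITY_LAYOUT = {
    "image": ("images", ".png"),
    "xyz": ("xyz", ".xyz"),
    "text": ("text", ".txt"),
}


def sample_paths(base_dir, phase, temperature_k, rotation):
    """Build the expected file paths for one (phase, temperature, rotation) sample."""
    sample_dir = Path(base_dir) / phase / f"{temperature_k}K"
    return {
        modality: sample_dir / subfolder / f"rot_{rotation}{extension}"
        for modality, (subfolder, extension) in MODALITY_LAYOUT.items()
    }


paths = sample_paths("data", "anatase", 0, 1)
print(paths["xyz"].as_posix())
```

The paths are only constructed, not checked; calling `.exists()` on each one is a simple way to detect missing modalities.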
207
+
208
+ ### Data Formats
209
+
210
+ #### XYZ Files
211
+ Standard XYZ format with atomic coordinates:
212
+ ```
213
+ [number of atoms]
214
+ [comment line]
215
+ [element] [x] [y] [z]
216
+ [element] [x] [y] [z]
217
+ ...
218
+ ```
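
For readers who want to inspect the XYZ files without the provided loaders, a minimal parser for the layout above might look like the following. The function name `parse_xyz` and the example coordinates are illustrative assumptions; the coordinates are made up and do not come from the dataset.

```python
def parse_xyz(text):
    """Parse standard XYZ text into (elements, coordinates, comment)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].strip())          # first line: number of atoms
    comment = lines[1] if len(lines) > 1 else ""  # second line: free-form comment
    atom_lines = lines[2:2 + n_atoms]
    if len(atom_lines) != n_atoms:
        raise ValueError(f"expected {n_atoms} atom lines, found {len(atom_lines)}")

    elements, coordinates = [], []
    for line in atom_lines:
        parts = line.split()
        elements.append(parts[0])
        coordinates.append(tuple(float(value) for value in parts[1:4]))
    return elements, coordinates, comment


# Illustrative fragment with made-up coordinates (not from the dataset).
EXAMPLE_XYZ = "\n".join([
    "3",
    "illustrative TiO2 fragment",
    "Ti 0.000 0.000 0.000",
    "O 1.950 0.000 0.000",
    "O -1.950 0.000 0.000",
])

elements, coordinates, comment = parse_xyz(EXAMPLE_XYZ)
print(elements, coordinates[1])
```

The coordinate list has one 3-tuple per atom, so converting it to the N×3 tensor described under Data Fields is a single `torch.tensor(coordinates)` call if PyTorch is available.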
219
+
220
+ #### Images
221
+ PNG format visualizations of crystal structures.
222
+
223
+ #### Text Files
224
+ Natural language descriptions of crystal structures and properties.
225
+
226
+ #### Labels CSV
227
+ Contains material properties for each phase-temperature combination:
228
+ ```csv
229
+ Polymorph,Temperature,Parameter,Value
230
+ anatase,0K,HOMO,-7.2340
231
+ anatase,0K,LUMO,-4.1234
232
+ ...
233
+ ```
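
Because the CSV stores one parameter per row (long format), it is often convenient to regroup it by phase and temperature. The sketch below does this with the standard library, using only the two example rows shown above; `load_labels` is a hypothetical helper, and the assumption that `Temperature` always carries a trailing `K` is taken from the example rows.

```python
import csv
import io
from collections import defaultdict

# The two example rows from this card; a real run would open data/labels.csv instead.
EXAMPLE_CSV = """Polymorph,Temperature,Parameter,Value
anatase,0K,HOMO,-7.2340
anatase,0K,LUMO,-4.1234
"""


def load_labels(handle):
    """Regroup long-format rows into {(polymorph, temperature_K): {parameter: value}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(handle):
        # ASSUMPTION: temperatures are written like "0K", "50K", ...
        temperature_k = int(row["Temperature"].rstrip("K"))
        key = (row["Polymorph"], temperature_k)
        table[key][row["Parameter"]] = float(row["Value"])
    return dict(table)


labels = load_labels(io.StringIO(EXAMPLE_CSV))
print(labels[("anatase", 0)])
```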
234
+
235
+ ### Supported Models
236
+
237
+ The dataset is compatible with various model architectures:
238
+
239
+ - **Vision Models**: ResNet, ViT
240
+ - **Graph Neural Networks**: SchNet, DimeNet, EGNN, FAENet, GoTenNet
241
+ - **Language Models**: LLMs for zero-shot/few-shot learning
242
+ - **Multimodal Models**: CLIP, Pure2DopeNet, ViSNet
243
+
244
+ ### Performance Metrics
245
+
246
+ #### Primary Task - Regression
247
+ - Mean Absolute Error (MAE)
248
+ - Root Mean Square Error (RMSE)
249
+ - R² score
250
+ - Per-property evaluation metrics
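
As a reference for the regression metrics listed above, here is a dependency-free sketch that computes MAE, RMSE, and R² for a single property. The function name `regression_metrics` and the sample numbers are illustrative; per-property evaluation would simply call it once per target.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² for one property from two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R² is undefined when the targets are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up values for illustration.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```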
251
+
252
+ #### Primary Task - LLM Property Prediction
253
+ - Property prediction accuracy
254
+ - Zero-shot vs few-shot performance comparison
255
+ - Out-of-distribution generalization
256
+ - Per-property evaluation metrics
257
+
258
+ #### Secondary Task - LLM Summary Generation
259
+ The nanoparticle summary task requires domain-specific evaluation beyond string-based metrics such as ROUGE or BLEU, which do not penalize incorrect numerical values. A more meaningful strategy is to extract structured key–value pairs (such as particle size, center of mass, coordination numbers, or bond angles) and compare them to ground truth using:
260
+ - Information-level F₁ score that accepts only values within defined tolerances (e.g., 0.1 Å or 1 degree)
261
+ - MAPE over all numeric entries
262
+ - Factual consistency score like BERTScore or QA-based faithfulness after masking numeric values
263
+ - Optional assessments of readability and clarity using expert judgment or coherence-based metrics (e.g., Coh-LM)
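
One possible reading of the first two metrics is sketched below. It assumes the key–value pairs have already been extracted from the generated and reference summaries (the extraction step is not shown), counts a predicted value as correct only if its key exists in the reference and the value lies within that key's tolerance, and computes MAPE over keys present in both. The function names, key names, and numbers are all hypothetical.

```python
def information_f1(predicted, reference, tolerances, default_tolerance=0.0):
    """Tolerance-aware F1 over extracted key-value pairs (one possible definition)."""
    correct = sum(
        1
        for key, value in predicted.items()
        if key in reference
        and abs(value - reference[key]) <= tolerances.get(key, default_tolerance)
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def mape(predicted, reference):
    """Mean absolute percentage error over keys found in both dicts (non-zero references)."""
    shared = [k for k in reference if k in predicted and reference[k] != 0]
    if not shared:
        return float("nan")
    total = sum(abs(predicted[k] - reference[k]) / abs(reference[k]) for k in shared)
    return 100.0 * total / len(shared)


# Made-up values for illustration; tolerances follow the 0.1 Å / 1 degree example.
reference = {"ti_o_bond_A": 1.95, "o_ti_o_angle_deg": 90.0, "coordination_number": 6.0}
predicted = {"ti_o_bond_A": 2.00, "o_ti_o_angle_deg": 93.0, "coordination_number": 6.0}
tolerances = {"ti_o_bond_A": 0.1, "o_ti_o_angle_deg": 1.0}

f1 = information_f1(predicted, reference, tolerances)
error_pct = mape(predicted, reference)
print(f1, error_pct)
```

Here the bond length and coordination number fall within tolerance while the angle does not, so two of three entries count as correct.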
264
+
265
+ #### Tertiary Task - Classification
266
+ A three-class classification task to distinguish among the TiO₂ polymorphs. While overall accuracy provides a general overview, it is important to also report:
267
+ - Class-wise precision, recall, and their harmonic mean (F₁ score), followed by macro-averaging to account for class imbalance
268
+ - Full 3×3 confusion matrix to identify systematic misclassifications between phase pairs
269
+ - Matthews correlation coefficient (MCC) and Cohen's κ statistic for chance-adjusted evaluations
270
+ - Cross-entropy loss and macro-averaged area under the ROC curve (AUROC) when class probabilities are available
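
A few of these metrics are simple enough to compute by hand. The sketch below builds the 3×3 confusion matrix and derives macro-averaged F₁ and Cohen's κ from it, using the label encoding from this card (0=anatase, 1=brookite, 2=rutile). The function names and the toy predictions are illustrative; in practice a library such as scikit-learn would typically be used instead.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp
        fn = sum(matrix[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def cohen_kappa(matrix):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected) if expected != 1.0 else float("nan")


# Toy labels for illustration: 0=anatase, 1=brookite, 2=rutile.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(y_true, y_pred)
print(matrix, macro_f1(matrix), cohen_kappa(matrix))
```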
271
+
272
+ ### Known Limitations
273
+
274
+ 1. **Limited Chemical Space**: Only covers TiO₂ polymorphs
275
+ 2. **Temperature Range**: Limited to 0-1000K
276
+ 3. **Computational Data**: All properties are from DFT calculations
277
+ 4. **Modality Dependencies**: Some modalities may not be available for all samples
278
+
279
+ ### Future Work
280
+
281
+ - Extend to other materials systems
282
+ - Include experimental data
283
+ - Add more temperature points
284
+ - Incorporate additional material properties
285
+ - Support for more crystal structures