Upload README.md with huggingface_hub
# CrysMTM Dataset Card

## Dataset Description

- **Repository:** [CrysMTM](https://github.com/KurbanIntelligenceLab/CrysMTM)
- **Paper:** CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials
- **Authors:** Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban
- **Point of Contact:** [Can Polat](johnpolat.com)

### Dataset Summary

CrysMTM is a multiphase, temperature-resolved, multimodal dataset for crystalline materials research, focused on titanium dioxide (TiO₂) polymorphs. It is designed primarily for regression tasks that predict 9 key material properties from multimodal inputs. It covers three crystalline phases of TiO₂ (anatase, brookite, and rutile) across a temperature range of 0-1000K, with multiple data modalities, including atomic coordinates, visual representations, and textual descriptions.

### Supported Tasks and Leaderboards

The dataset primarily supports regression tasks for materials property prediction:

1. **Main Task - Regression**: Predict 9 material properties from multimodal inputs
   - HOMO energy, LUMO energy, band gap, Fermi energy, total energy, energy per atom, atomic displacement, volumetric expansion, and bond length changes
2. **Main Task - LLM Property Prediction**: Zero-shot and few-shot prediction of the 9 material properties using large language models
3. **Secondary Task - LLM Summary Generation**: Generate textual summaries of crystal structures and properties using large language models
4. **Tertiary Task - Classification**: Predict the crystalline phase (anatase, brookite, or rutile) from multimodal inputs

### Languages

The dataset contains English text descriptions of crystal structures and properties.

## Dataset Structure

### Data Instances

Each data instance represents a TiO₂ crystal structure at a specific temperature and rotation, containing:

- **Phase**: One of three TiO₂ polymorphs (anatase, brookite, rutile)
- **Temperature**: Temperature in Kelvin (0-1000K, in 50K increments)
- **Rotation**: Rotation index for the crystal structure
- **Modalities**: Multiple data representations of the same structure

### Data Fields

#### Core Metadata

- `phase` (string): Crystal phase - "anatase", "brookite", or "rutile"
- `temperature` (integer): Temperature in Kelvin (0, 50, 100, ..., 1000)
- `rotation` (integer): Rotation index for the crystal structure

#### Multimodal Data

- `image` (PIL.Image): Visual representation of the crystal structure (PNG format)
- `xyz` (torch.Tensor): Atomic coordinates in XYZ format (N×3 tensor)
- `text` (string): Textual description of the crystal structure and properties
- `element` (list): List of element symbols for each atom

#### Labels

**Primary Labels - Regression**:

- `regression_label` (torch.Tensor): 9-dimensional tensor containing the main prediction targets:
  - `HOMO` (float): HOMO energy (E_H) in eV
  - `LUMO` (float): LUMO energy (E_L) in eV
  - `Eg` (float): Band gap energy (E_g) in eV
  - `Ef` (float): Fermi energy (E_f) in eV
  - `Et` (float): Total energy of the system (E_T) in eV
  - `Eta` (float): Total energy per atom (E_Ta) in eV
  - `disp` (float): Maximum atomic displacement (Δr_max) in Å
  - `vol` (float): Volumetric expansion (ΔV) in Å³
  - `bond` (float): Ti-O bond length change (Δd_Ti-O) in Å

**Secondary Labels - Classification**:

- `label` (integer): Phase label (0=anatase, 1=brookite, 2=rutile)

**LLM Task Labels**:

- Individual property values for zero-shot/few-shot prediction
- Text summaries for generation tasks
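
As a small illustration of the label layout above, the sketch below pairs a 9-element label vector with property names and decodes the integer phase label. It assumes the entries of `regression_label` follow the order in which the properties are listed here, which this card does not state explicitly. `REGRESSION_KEYS`, `PHASE_NAMES`, and `name_labels` are hypothetical helper names, and the numbers are placeholders rather than dataset values.

```python
# Hypothetical helpers; the ordering below is an assumption taken from the
# order of the bullet list above, not a documented guarantee.
REGRESSION_KEYS = ["HOMO", "LUMO", "Eg", "Ef", "Et", "Eta", "disp", "vol", "bond"]
PHASE_NAMES = {0: "anatase", 1: "brookite", 2: "rutile"}


def name_labels(values):
    """Pair a 9-element sequence of label values with property names."""
    values = list(values)  # also accepts a 1-D tensor or array
    if len(values) != len(REGRESSION_KEYS):
        raise ValueError(f"expected {len(REGRESSION_KEYS)} values, got {len(values)}")
    return dict(zip(REGRESSION_KEYS, (float(v) for v in values)))


# Placeholder numbers for illustration only (not taken from the dataset).
example = name_labels([-7.0, -4.0, 3.0, -5.5, -100.0, -10.0, 0.1, 0.5, 0.01])
print(example["Eg"], PHASE_NAMES[2])
```

If the real tensor uses a different order, only `REGRESSION_KEYS` needs to change.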

### Data Splits

The dataset is organized by temperature ranges:

- **Training Set**: Temperatures 0-850K (excluding 250K, 450K, 650K, 750K, 800K)
- **In-Distribution (ID) Test**: Temperatures 250K, 450K, 650K, 750K, 800K
- **Out-of-Distribution (OOD) Test**: Temperatures 0K, 50K, 100K, 900K, 950K, 1000K
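
The split rule can be sketched as a lookup on temperature. The training range is stated as 0-850K, while 0K, 50K, and 100K are also listed under the OOD test; the sketch below assumes those three temperatures are held out of training, which is an interpretation rather than something this card confirms. `split_for` is a hypothetical helper name.

```python
# Temperature sets copied from the split description above (in Kelvin).
ID_TEST = {250, 450, 650, 750, 800}
OOD_TEST = {0, 50, 100, 900, 950, 1000}
ALL_TEMPERATURES = range(0, 1001, 50)  # 0, 50, ..., 1000


def split_for(temperature):
    """Return "train", "id_test", or "ood_test" for a temperature in K."""
    if temperature not in ALL_TEMPERATURES:
        raise ValueError(f"unexpected temperature: {temperature}")
    if temperature in OOD_TEST:  # assumption: OOD takes precedence over training
        return "ood_test"
    if temperature in ID_TEST:
        return "id_test"
    return "train"


train = [t for t in ALL_TEMPERATURES if split_for(t) == "train"]
print(train)
```

Under this assumption, 10 of the 21 temperature points remain for training.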

### Citation Information

```bibtex
@dataset{crysmtm2024,
  title={CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials},
  author={Can Polat and Erchin Serpedin and Mustafa Kurban and Hasan Kurban},
  year={2024},
  url={https://github.com/KurbanIntelligenceLab/CrysMTM}
}
```

## Usage Examples

### Main Task - Regression

```python
from dataloaders.regression_dataloader import RegressionLoader

# Load dataset for regression (main task)
dataset = RegressionLoader(
    label_dir="data",
    modalities=["image", "xyz", "text"],
    normalize_labels=True
)

# Get a sample
sample = dataset[0]
print(f"Target Properties: {sample['regression_label']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
# PIL images expose .size as (width, height)
print(f"Image size: {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
```

### Main Task - LLM Property Prediction

```python
from dataloaders.llm_regression_dataloader import LLMLoader

# Load dataset for LLM property prediction (main task)
dataset = LLMLoader(
    label_dir="data",
    modalities=["text", "image"]
)

# Get a sample for zero-shot/few-shot property prediction
sample = dataset[0]
print(f"HOMO: {sample['HOMO']}")
print(f"LUMO: {sample['LUMO']}")
print(f"Band gap: {sample['Eg']}")
print(f"Temperature: {sample['temperature']}K")
print(f"Phase: {sample['phase']}")
```

### Secondary Task - LLM Summary Generation

```python
from dataloaders.llm_regression_dataloader import LLMLoader

# Load dataset for LLM summary generation (secondary task)
dataset = LLMLoader(
    label_dir="data",
    modalities=["text", "image"]
)

# Get a sample for summary generation
sample = dataset[0]
print(f"Input text: {sample['text'][:200]}...")
print(f"Target properties: {sample['HOMO']}, {sample['LUMO']}, {sample['Eg']}")
```

### Tertiary Task - Classification

```python
from dataloaders.classification_dataloader import ClassificationLoader

# Load dataset for classification (tertiary task)
dataset = ClassificationLoader(
    base_dir="data",
    modalities=["image", "xyz", "text"],
    max_rotations=10
)

# Get a sample
sample = dataset[0]
# 'label' is the integer phase label (0=anatase, 1=brookite, 2=rutile)
print(f"Phase label: {sample['label']}")
# PIL images expose .size as (width, height)
print(f"Image size: {sample['image'].size}")
print(f"XYZ coordinates shape: {sample['xyz'].shape}")
print(f"Text: {sample['text'][:100]}...")
```

### PyTorch Geometric Integration

```python
from dataloaders.classification_dataloader import ClassificationLoader

# For graph neural networks
dataset = ClassificationLoader(
    base_dir="data",
    modalities=["xyz", "element"],
    as_pyg_data=True
)

# Returns PyG Data objects
sample = dataset[0]
print(f"Node features: {sample.z}")
print(f"Positions: {sample.pos}")
print(f"Label: {sample.y}")
```

## Technical Details

### File Structure

```
data/
├── anatase/
│   ├── 0K/
│   │   ├── images/
│   │   │   ├── rot_0.png
│   │   │   ├── rot_1.png
│   │   │   └── ...
│   │   ├── xyz/
│   │   │   ├── rot_0.xyz
│   │   │   ├── rot_1.xyz
│   │   │   └── ...
│   │   └── text/
│   │       ├── rot_0.txt
│   │       ├── rot_1.txt
│   │       └── ...
│   ├── 50K/
│   └── ...
├── brookite/
├── rutile/
└── labels.csv
```
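
The layout above maps each (phase, temperature, rotation) triple to one file per modality. A minimal sketch of that mapping follows; `MODALITY_LAYOUT` and `sample_paths` are hypothetical names and are not part of the repository's loaders.

```python
from pathlib import Path

# (subdirectory, file extension) per modality, as in the tree above.
MODALITY_LAYOUT = {
    "image": ("images", "png"),
    "xyz": ("xyz", "xyz"),
    "text": ("text", "txt"),
}


def sample_paths(base_dir, phase, temperature, rotation):
    """Build the per-modality file paths for one (phase, temperature, rotation)."""
    root = Path(base_dir) / phase / f"{temperature}K"
    return {
        name: root / subdir / f"rot_{rotation}.{ext}"
        for name, (subdir, ext) in MODALITY_LAYOUT.items()
    }


paths = sample_paths("data", "anatase", 0, 1)
print(paths["xyz"].as_posix())  # data/anatase/0K/xyz/rot_1.xyz
```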

### Data Formats

#### XYZ Files

Standard XYZ format with atomic coordinates:

```
[number of atoms]
[comment line]
[element] [x] [y] [z]
[element] [x] [y] [z]
...
```
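
For readers who want to read these files without the repository's loaders, here is a minimal parser sketch for the layout above. `parse_xyz` is a hypothetical helper, and the coordinates in the example are made up for illustration.

```python
def parse_xyz(text):
    """Parse XYZ-format text into (elements, coordinates)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    elements, coords = [], []
    for line in lines[2:2 + n_atoms]:  # lines[1] is the comment line
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(elements) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(elements)}")
    return elements, coords


# Made-up fragment, not taken from the dataset.
example = """3
illustrative TiO2 fragment
Ti  0.000 0.000 0.000
O   1.950 0.000 0.000
O  -1.950 0.000 0.000
"""
elements, coords = parse_xyz(example)
print(elements, coords[1])
```

The coordinate list converts directly to an N×3 tensor, matching the `xyz` field described earlier.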

#### Images

PNG format visualizations of crystal structures.

#### Text Files

Natural language descriptions of crystal structures and properties.

#### Labels CSV

`labels.csv` stores the material properties for each phase-temperature combination in long format, one row per polymorph, temperature, and parameter:

```csv
Polymorph,Temperature,Parameter,Value
anatase,0K,HOMO,-7.2340
anatase,0K,LUMO,-4.1234
...
```
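
Because the CSV is in long format, it usually needs pivoting before use. The sketch below groups rows by (polymorph, temperature) with the standard library; `load_labels` is a hypothetical helper, and it assumes every temperature is written with a trailing `K`, as in the rows shown.

```python
import csv
import io
from collections import defaultdict


def load_labels(csv_file):
    """Pivot long-format rows into {(polymorph, temperature_K): {parameter: value}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        temperature = int(row["Temperature"].rstrip("K"))  # "0K" -> 0
        table[(row["Polymorph"], temperature)][row["Parameter"]] = float(row["Value"])
    return dict(table)


# The two rows shown above, read from an in-memory string instead of a file.
sample_csv = (
    "Polymorph,Temperature,Parameter,Value\n"
    "anatase,0K,HOMO,-7.2340\n"
    "anatase,0K,LUMO,-4.1234\n"
)
labels = load_labels(io.StringIO(sample_csv))
print(labels[("anatase", 0)])
```

For the real file, pass an open file handle, for example `open("data/labels.csv", newline="")`.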

### Supported Models

The dataset is compatible with various model architectures:

- **Vision Models**: ResNet, ViT
- **Graph Neural Networks**: SchNet, DimeNet, EGNN, FAENet, GoTenNet
- **Language Models**: LLMs for zero-shot/few-shot learning
- **Multimodal Models**: CLIP, Pure2DopeNet, ViSNet

### Performance Metrics

#### Main Task - Regression

- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
- R² score
- Per-property evaluation metrics
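
The regression metrics listed above can be computed per property with a few lines of plain Python. This is a sketch rather than the benchmark's evaluation code; `regression_metrics` and `per_property_metrics` are hypothetical names, and the example values are placeholders.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² is undefined when all targets are identical.
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}


def per_property_metrics(targets, predictions):
    """Apply regression_metrics to each property in two {name: [values]} dicts."""
    return {name: regression_metrics(targets[name], predictions[name]) for name in targets}


# Placeholder values for illustration only.
results = per_property_metrics({"Eg": [3.0, 3.2, 3.4]}, {"Eg": [3.1, 3.2, 3.3]})
print(results["Eg"])
```

Reporting the metrics per property avoids mixing quantities with different units (eV, Å, Å³) in a single average.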

#### Main Task - LLM Property Prediction

- Property prediction accuracy
- Zero-shot vs few-shot performance comparison
- Out-of-distribution generalization
- Per-property evaluation metrics

#### Secondary Task - LLM Summary Generation

The nanoparticle summary task requires domain-specific evaluation beyond traditional string-based metrics such as ROUGE or BLEU, which do not penalize incorrect numerical values. A more meaningful strategy is to extract structured key-value pairs (such as particle size, center of mass, coordination numbers, or bond angles) and compare them to ground truth using:

- Information-level F₁ score that accepts only values within defined tolerances (e.g., 0.1 Å or 1 degree)
- MAPE over all numeric entries
- Factual consistency score, such as BERTScore or QA-based faithfulness, after masking numeric values
- Optional assessments of readability and clarity using expert judgment or coherence-based metrics (e.g., Coh-LM)
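
The first two metrics above can be sketched over key-value pairs that have already been extracted from the generated and reference summaries. The extraction step itself is out of scope here. `tolerant_f1` and `mape` are hypothetical helpers, and the key names and numbers in the example are invented for illustration.

```python
def tolerant_f1(predicted, reference, tolerances, default_tol=0.0):
    """Information-level F1: a predicted value counts as correct only if its key
    exists in the reference and the value lies within that key's tolerance."""
    correct = sum(
        1
        for key, value in predicted.items()
        if key in reference and abs(value - reference[key]) <= tolerances.get(key, default_tol)
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mape(predicted, reference):
    """Mean absolute percentage error over keys present in both dicts
    (reference values equal to zero are skipped)."""
    terms = [
        abs(predicted[k] - reference[k]) / abs(reference[k])
        for k in reference
        if k in predicted and reference[k] != 0
    ]
    return 100.0 * sum(terms) / len(terms) if terms else float("nan")


# Invented keys and values; tolerances follow the 0.1 Å / 1 degree example above.
reference = {"ti_o_bond_length": 1.95, "o_ti_o_angle": 90.0, "coordination_number": 6.0}
predicted = {"ti_o_bond_length": 2.00, "o_ti_o_angle": 92.5, "coordination_number": 6.0}
tolerances = {"ti_o_bond_length": 0.1, "o_ti_o_angle": 1.0}
print(tolerant_f1(predicted, reference, tolerances), mape(predicted, reference))
```

Here the angle misses its 1 degree tolerance, so two of the three entries count as correct.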

#### Tertiary Task - Classification

A three-class classification task to distinguish among the TiO₂ polymorphs. While overall accuracy provides a general overview, it is important to also report:

- Class-wise precision, recall, and their harmonic mean (F₁ score), followed by macro-averaging to account for class imbalance
- Full 3×3 confusion matrix to identify systematic misclassifications between phase pairs
- Matthews correlation coefficient (MCC) and Cohen's κ statistic for chance-adjusted evaluations
- Cross-entropy loss and macro-averaged area under the ROC curve (AUROC) when class probabilities are available
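
A few of these quantities (the confusion matrix, macro-averaged F₁, and Cohen's κ) can be sketched in plain Python as follows. The function names are hypothetical, and the labels in the example are invented and use the 0=anatase, 1=brookite, 2=rutile encoding.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp
        fn = sum(matrix[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def cohen_kappa(matrix):
    """Agreement between true and predicted labels, corrected for chance."""
    n = len(matrix)
    total = sum(map(sum, matrix))
    observed = sum(matrix[c][c] for c in range(n)) / total
    expected = sum(
        sum(matrix[c]) * sum(matrix[r][c] for r in range(n)) for c in range(n)
    ) / total**2
    return (observed - expected) / (1 - expected) if expected != 1 else float("nan")


# Invented labels for illustration only.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(cm, macro_f1(cm), cohen_kappa(cm))
```

Off-diagonal entries of the matrix show which phase pairs are confused with each other.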

### Known Limitations

1. **Limited Chemical Space**: Only covers TiO₂ polymorphs
2. **Temperature Range**: Limited to 0-1000K
3. **Computational Data**: All properties are from DFT calculations
4. **Modality Dependencies**: Some modalities may not be available for all samples

### Future Work

- Extend to other materials systems
- Include experimental data
- Add more temperature points
- Incorporate additional material properties
- Support for more crystal structures