| # CrysMTM Dataset Card | |
| ## Dataset Description | |
| - **Repository:** [CrysMTM](https://github.com/KurbanIntelligenceLab/CrysMTM) | |
| - **Paper:** CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials | |
| - **Authors:** Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban | |
| - **Point of Contact:** [Can Polat](https://johnpolat.com) | |
| ### Dataset Summary | |
| CrysMTM is a multiphase, temperature-resolved, multimodal dataset for crystalline materials research, focused on titanium dioxide (TiO₂) polymorphs. It is designed primarily for regression: predicting 9 key material properties from multimodal inputs. It covers three crystalline phases of TiO₂ (anatase, brookite, and rutile) over a temperature range of 0-1000K, with multiple data modalities including atomic coordinates, visual representations, and textual descriptions. | |
| ### Supported Tasks and Leaderboards | |
| The dataset primarily supports regression tasks for materials property prediction: | |
| 1. **Main Task - Regression**: Predict 9 material properties from multimodal inputs | |
|    - HOMO energy, LUMO energy, band gap, Fermi energy, total energy, energy per atom, atomic displacement, volumetric expansion, and bond length changes | |
| 2. **Main Task - LLM Property Prediction**: Zero-shot and few-shot prediction of the 9 material properties using large language models | |
| 3. **Secondary Task - LLM Summary Generation**: Generate textual summaries of crystal structures and properties using large language models | |
| 4. **Tertiary Task - Classification**: Predict the crystalline phase (anatase, brookite, or rutile) from multimodal inputs | |
| ### Languages | |
| The dataset contains English text descriptions of crystal structures and properties. | |
| ## Dataset Structure | |
| ### Data Instances | |
| Each data instance represents a TiO₂ crystal structure at a specific temperature and rotation, containing: | |
| - **Phase**: One of three TiO₂ polymorphs (anatase, brookite, rutile) | |
| - **Temperature**: Temperature in Kelvin (0-1000K, in 50K increments) | |
| - **Rotation**: Rotation index for the crystal structure | |
| - **Modalities**: Multiple data representations of the same structure | |
| ### Data Fields | |
| #### Core Metadata | |
| - `phase` (string): Crystal phase - "anatase", "brookite", or "rutile" | |
| - `temperature` (integer): Temperature in Kelvin (0, 50, 100, ..., 1000) | |
| - `rotation` (integer): Rotation index for the crystal structure | |
| #### Multimodal Data | |
| - `image` (PIL.Image): Visual representation of the crystal structure (PNG format) | |
| - `xyz` (torch.Tensor): Atomic coordinates in XYZ format (N×3 tensor) | |
| - `text` (string): Textual description of the crystal structure and properties | |
| - `element` (list): List of element symbols for each atom | |
| #### Labels | |
| **Primary Labels - Regression**: | |
| - `regression_label` (torch.Tensor): 9-dimensional tensor containing the main prediction targets: | |
|   - `HOMO` (float): HOMO energy (E_H) in eV | |
|   - `LUMO` (float): LUMO energy (E_L) in eV | |
|   - `Eg` (float): Band gap energy (E_g) in eV | |
|   - `Ef` (float): Fermi energy (E_f) in eV | |
|   - `Et` (float): Total energy of the system (E_T) in eV | |
|   - `Eta` (float): Total energy per atom (E_Ta) in eV | |
|   - `disp` (float): Maximum atomic displacement (Δr_max) in Å | |
|   - `vol` (float): Volumetric expansion (ΔV) in Å³ | |
|   - `bond` (float): Ti-O bond length change (Δd_Ti-O) in Å | |
| **Secondary Labels - Classification**: | |
| - `label` (integer): Phase label (0=anatase, 1=brookite, 2=rutile) | |
| **LLM Task Labels**: | |
| - Individual property values for zero-shot/few-shot prediction | |
| - Text summaries for generation tasks | |
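As a minimal sketch, the 9-dimensional `regression_label` can be unpacked into named properties. The ordering used here simply follows the order in which the properties are listed in this card; that ordering is an assumption and should be checked against the dataloader. The helper name `unpack_regression_label` is hypothetical and not part of the repository.

```python
# Property names in the order they are listed in this card.
# ASSUMPTION: the label tensor uses this same order; verify against the dataloader.
PROPERTY_NAMES = ["HOMO", "LUMO", "Eg", "Ef", "Et", "Eta", "disp", "vol", "bond"]

# Phase <-> integer label mapping as documented for the classification label.
PHASE_TO_LABEL = {"anatase": 0, "brookite": 1, "rutile": 2}
LABEL_TO_PHASE = {index: phase for phase, index in PHASE_TO_LABEL.items()}


def unpack_regression_label(values):
    """Map a length-9 sequence (list, tuple, or 1-D tensor) to a name -> float dict."""
    values = [float(v) for v in values]
    if len(values) != len(PROPERTY_NAMES):
        raise ValueError(f"expected {len(PROPERTY_NAMES)} values, got {len(values)}")
    return dict(zip(PROPERTY_NAMES, values))
```

Calling `unpack_regression_label(sample["regression_label"])` would then give per-property access by name, provided the ordering assumption holds.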
| ### Data Splits | |
| The dataset is split by temperature: | |
| - **Training Set**: Temperatures 150-850K (excluding 250K, 450K, 650K, 750K, 800K) | |
| - **In-Distribution (ID) Test**: Temperatures 250K, 450K, 650K, 750K, 800K | |
| - **Out-of-Distribution (OOD) Test**: Temperatures 0K, 50K, 100K, 900K, 950K, 1000K | |
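The split rule can be sketched as a small lookup. The held-out temperatures are copied from the lists above, and the split names mirror the keys used in the usage examples (`train`, `test_id`, `test_ood`). Treating every remaining grid temperature as training data is an assumption, and `split_for_temperature` is a hypothetical helper, not a function from the repository.

```python
# Temperature grid from this card: 0 K to 1000 K in 50 K steps (21 values).
TEMPERATURES = list(range(0, 1001, 50))

# Held-out temperatures exactly as listed in the split description.
ID_TEST = {250, 450, 650, 750, 800}
OOD_TEST = {0, 50, 100, 900, 950, 1000}


def split_for_temperature(temperature_k):
    """Return 'train', 'test_id', or 'test_ood' for a temperature in kelvin."""
    if temperature_k not in TEMPERATURES:
        raise ValueError(f"{temperature_k} K is not on the 50 K grid")
    if temperature_k in OOD_TEST:
        return "test_ood"
    if temperature_k in ID_TEST:
        return "test_id"
    # ASSUMPTION: every remaining grid temperature belongs to the training set.
    return "train"


train_temperatures = [t for t in TEMPERATURES if split_for_temperature(t) == "train"]
```

Under this rule the OOD temperatures lie at both ends of the grid, so OOD evaluation probes extrapolation, while the ID temperatures are interleaved with the training ones.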
| ### Citation Information | |
| ```bibtex | |
| @dataset{crysmtm2024, | |
|   title={CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials}, | |
|   author={Can Polat and Erchin Serpedin and Mustafa Kurban and Hasan Kurban}, | |
|   year={2024}, | |
|   url={https://github.com/KurbanIntelligenceLab/CrysMTM} | |
| } | |
| ``` | |
| ## Usage Examples | |
| ### Option 1: Download and Use Locally | |
| 1. **Download the dataset** from [https://huggingface.co/datasets/johnpolat/CrysMTM](https://huggingface.co/datasets/johnpolat/CrysMTM) | |
| 2. **Use the provided loading script**: | |
| ```python | |
| # Download load_dataset.py from the repository and place it in your data directory | |
| from load_dataset import load_dataset | |
| # Load the dataset | |
| dataset = load_dataset(".") | |
| # Access splits | |
| train_dataset = dataset["train"] # 5,064 samples | |
| test_id_dataset = dataset["test_id"] # 1,380 samples | |
| test_ood_dataset = dataset["test_ood"] # 6,588 samples | |
| # Get a sample | |
| sample = train_dataset[0] | |
| print(f"Phase: {sample['phase']}") | |
| print(f"Temperature: {sample['temperature']}K") | |
| print(f"Image: {sample['image']}") | |
| print(f"Regression labels: {sample['regression_labels']}") | |
| ``` | |
| ### Option 2: Use with Original Dataloaders | |
| ```python | |
| from dataloaders.regression_dataloader import RegressionLoader | |
| # Load dataset for regression (main task) | |
| dataset = RegressionLoader( | |
|     label_dir="data", | |
|     modalities=["image", "xyz", "text"], | |
|     normalize_labels=True | |
| ) | |
| # Get a sample | |
| sample = dataset[0] | |
| print(f"Target Properties: {sample['regression_label']}") | |
| print(f"Temperature: {sample['temperature']}K") | |
| print(f"Phase: {sample['phase']}") | |
| print(f"Image size (width, height): {sample['image'].size}") | |
| print(f"XYZ coordinates shape: {sample['xyz'].shape}") | |
| ``` | |
| ### Main Task - LLM Property Prediction | |
| ```python | |
| from dataloaders.llm_regression_dataloader import LLMLoader | |
| # Load dataset for LLM property prediction (main task) | |
| dataset = LLMLoader( | |
|     label_dir="data", | |
|     modalities=["text", "image"] | |
| ) | |
| # Get a sample for zero-shot/few-shot property prediction | |
| sample = dataset[0] | |
| print(f"HOMO: {sample['HOMO']}") | |
| print(f"LUMO: {sample['LUMO']}") | |
| print(f"Band gap: {sample['Eg']}") | |
| print(f"Temperature: {sample['temperature']}K") | |
| print(f"Phase: {sample['phase']}") | |
| ``` | |
| ### Secondary Task - LLM Summary Generation | |
| ```python | |
| from dataloaders.llm_regression_dataloader import LLMLoader | |
| # Load dataset for LLM summary generation (secondary task) | |
| dataset = LLMLoader( | |
|     label_dir="data", | |
|     modalities=["text", "image"] | |
| ) | |
| # Get a sample for summary generation | |
| sample = dataset[0] | |
| print(f"Input text: {sample['text'][:200]}...") | |
| print(f"Target properties: {sample['HOMO']}, {sample['LUMO']}, {sample['Eg']}") | |
| ``` | |
| ### Tertiary Task - Classification | |
| ```python | |
| from dataloaders.classification_dataloader import ClassificationLoader | |
| # Load dataset for classification (tertiary task) | |
| dataset = ClassificationLoader( | |
|     base_dir="data", | |
|     modalities=["image", "xyz", "text"], | |
|     max_rotations=10 | |
| ) | |
| # Get a sample | |
| sample = dataset[0] | |
| print(f"Phase: {sample['label']}") | |
| print(f"Image size (width, height): {sample['image'].size}") | |
| print(f"XYZ coordinates shape: {sample['xyz'].shape}") | |
| print(f"Text: {sample['text'][:100]}...") | |
| ``` | |
| ### PyTorch Geometric Integration | |
| ```python | |
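| from dataloaders.classification_dataloader import ClassificationLoader | |
| # For graph neural networks: request PyTorch Geometric Data objects | |
| dataset = ClassificationLoader( | |
|     base_dir="data", | |
|     modalities=["xyz", "element"], | |
|     as_pyg_data=True | |
| ) | |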
| # Returns PyG Data objects | |
| sample = dataset[0] | |
| print(f"Node features: {sample.z}") | |
| print(f"Positions: {sample.pos}") | |
| print(f"Label: {sample.y}") | |
| ``` | |
| ## Technical Details | |
| ### File Structure | |
| ``` | |
| data/ | |
| ├── anatase/ | |
| │   ├── 0K/ | |
| │   │   ├── images/ | |
| │   │   │   ├── rot_0.png | |
| │   │   │   ├── rot_1.png | |
| │   │   │   └── ... | |
| │   │   ├── xyz/ | |
| │   │   │   ├── rot_0.xyz | |
| │   │   │   ├── rot_1.xyz | |
| │   │   │   └── ... | |
| │   │   └── text/ | |
| │   │       ├── rot_0.txt | |
| │   │       ├── rot_1.txt | |
| │   │       └── ... | |
| │   ├── 50K/ | |
| │   └── ... | |
| ├── brookite/ | |
| ├── rutile/ | |
| └── labels.csv | |
| ``` | |
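Given the layout above, the path of any single file follows from its phase, temperature, rotation index, and modality. The following is a minimal sketch of that mapping; `sample_path` and `MODALITY_LAYOUT` are hypothetical names introduced for illustration, not part of the repository.

```python
from pathlib import Path

# Modality -> (sub-directory, file extension), following the directory tree.
MODALITY_LAYOUT = {
    "image": ("images", ".png"),
    "xyz": ("xyz", ".xyz"),
    "text": ("text", ".txt"),
}


def sample_path(root, phase, temperature_k, rotation, modality):
    """Build the path of one file, e.g. data/anatase/0K/images/rot_0.png."""
    subdir, extension = MODALITY_LAYOUT[modality]
    return Path(root) / phase / f"{temperature_k}K" / subdir / f"rot_{rotation}{extension}"


example = sample_path("data", "anatase", 0, 0, "image")
```

Because the three modalities share the same `rot_<index>` stem, one (phase, temperature, rotation) triple identifies the image, the coordinates, and the text of the same structure.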
| ### Data Formats | |
| #### XYZ Files | |
| Standard XYZ format with atomic coordinates: | |
| ``` | |
| [number of atoms] | |
| [comment line] | |
| [element] [x] [y] [z] | |
| [element] [x] [y] [z] | |
| ... | |
| ``` | |
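A file in this format can be read with a few lines of plain Python. The sketch below returns element symbols and coordinates separately, which matches the `element` and `xyz` fields described in this card; the function name `parse_xyz` is hypothetical, and the three-atom fragment is made up purely to exercise the parser (it is not CrysMTM data).

```python
def parse_xyz(text):
    """Parse XYZ-format text into (elements, coordinates, comment)."""
    lines = text.splitlines()
    n_atoms = int(lines[0].strip())                  # line 1: number of atoms
    comment = lines[1].strip() if len(lines) > 1 else ""  # line 2: free-form comment
    elements, coords = [], []
    for line in lines[2:2 + n_atoms]:                # one atom per following line
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(elements) != n_atoms:
        raise ValueError(f"header declares {n_atoms} atoms, found {len(elements)}")
    return elements, coords, comment


# Illustrative input only; the coordinates are invented for this example.
example = """3
illustrative fragment, not real CrysMTM data
Ti 0.000 0.000 0.000
O 1.950 0.000 0.000
O -1.950 0.000 0.000
"""
elements, coords, comment = parse_xyz(example)
```

The coordinate list can be converted to an N×3 tensor with `torch.tensor(coords)` if PyTorch is available.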
| #### Images | |
| PNG format visualizations of crystal structures. | |
| #### Text Files | |
| Natural language descriptions of crystal structures and properties. | |
| #### Labels CSV | |
| Contains material properties for each phase-temperature combination: | |
| ```csv | |
| Polymorph,Temperature,Parameter,Value | |
| anatase,0K,HOMO,-7.2340 | |
| anatase,0K,LUMO,-4.1234 | |
| ... | |
| ``` | |
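Since the CSV is in long format (one row per parameter), it is often convenient to pivot it so that each phase-temperature pair maps to all of its properties. The sketch below uses only the standard library and the column names shown in the header; `load_labels` is a hypothetical helper, and the in-memory sample reuses only the two rows shown above.

```python
import csv
import io
from collections import defaultdict


def load_labels(csv_file):
    """Pivot long-format rows into {(polymorph, temperature_K): {parameter: value}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        # Temperatures are stored as strings such as "0K"; strip the unit suffix.
        temperature_k = int(row["Temperature"].rstrip("K"))
        table[(row["Polymorph"], temperature_k)][row["Parameter"]] = float(row["Value"])
    return dict(table)


# In-memory stand-in for labels.csv containing the two example rows.
sample_csv = io.StringIO(
    "Polymorph,Temperature,Parameter,Value\n"
    "anatase,0K,HOMO,-7.2340\n"
    "anatase,0K,LUMO,-4.1234\n"
)
labels = load_labels(sample_csv)
```

For the real file, the same function can be applied to a handle opened with `open("data/labels.csv", newline="")`, assuming the file sits at the location shown in the file structure.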
| ### Supported Models | |
| The dataset is compatible with various model architectures: | |
| - **Vision Models**: ResNet, ViT | |
| - **Graph Neural Networks**: SchNet, DimeNet, EGNN, FAENet, GoTenNet | |
| - **Language Models**: LLMs for zero-shot/few-shot learning | |
| - **Multimodal Models**: CLIP, Pure2DopeNet, ViSNet | |
| ### Performance Metrics | |
| #### Primary Task - Regression | |
| - Mean Absolute Error (MAE) | |
| - Root Mean Square Error (RMSE) | |
| - R² score | |
| - Per-property evaluation metrics | |
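The regression metrics listed above can be computed per property with a short, dependency-free sketch. The function names are hypothetical, and inputs are assumed to be plain Python sequences (convert tensors with `.tolist()` first).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for two equal-length sequences of floats."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    ss_res = mse * n                                      # residual sum of squares
    # R² is undefined when the targets are constant; report NaN in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}


def per_property_metrics(y_true_rows, y_pred_rows, names):
    """Apply regression_metrics column by column (one column per property)."""
    return {
        name: regression_metrics(
            [row[i] for row in y_true_rows], [row[i] for row in y_pred_rows]
        )
        for i, name in enumerate(names)
    }
```

Reporting the metrics per property matters here because the nine targets have different units and scales (eV, Å, Å³), so a single pooled error would be dominated by the largest-magnitude property.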
| #### Primary Task - LLM Property Prediction | |
| - Property prediction accuracy | |
| - Zero-shot vs few-shot performance comparison | |
| - Out-of-distribution generalization | |
| - Per-property evaluation metrics | |
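Scoring LLM answers first requires turning free-text responses into numbers. The sketch below is one possible approach, not the evaluation protocol of the paper: it assumes the model was prompted to answer with a single numeric value (optionally followed by a unit), takes the first standalone number in the response, and reports the mean absolute error together with the fraction of responses that could be parsed. Both function names are hypothetical.

```python
import re

# Signed decimals and scientific notation, e.g. "-7.23" or "1.2e-3".
# The lookbehind skips digits glued to letters, such as the "2" in "TiO2".
_NUMBER = re.compile(r"(?<![A-Za-z\d.])[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def parse_prediction(response):
    """Return the first standalone number in an LLM response, or None."""
    match = _NUMBER.search(response)
    return float(match.group()) if match else None


def score_responses(responses, targets):
    """Return (MAE over parseable answers, fraction of answers that parsed)."""
    errors = []
    for response, target in zip(responses, targets):
        value = parse_prediction(response)
        if value is not None:
            errors.append(abs(value - target))
    mae = sum(errors) / len(errors) if errors else float("nan")
    parse_rate = len(errors) / len(responses) if responses else 0.0
    return mae, parse_rate
```

Running `score_responses` separately on zero-shot and few-shot outputs, and separately on the ID and OOD temperatures, gives the comparisons listed above. Reporting the parse rate alongside the error avoids silently rewarding a model that answers only the easy cases.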
| #### Secondary Task - LLM Summary Generation | |
| The nanoparticle summary task requires domain-specific evaluation beyond traditional string-based metrics like ROUGE or BLEU, which do not penalize incorrect numerical values. A more meaningful strategy is to extract structured key-value pairs (such as particle size, center of mass, coordination numbers, or bond angles) and compare them to ground truth using: | |
| - An information-level F₁ score that accepts only values within defined tolerances (e.g., 0.1 Å or 1 degree) | |
| - Mean absolute percentage error (MAPE) over all numeric entries | |
| - A factual-consistency score, such as BERTScore or QA-based faithfulness, computed after masking numeric values | |
| - Optional assessments of readability and clarity using expert judgment or coherence-based metrics (e.g., Coh-LM) | |
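The first two metrics above can be sketched once the key-value pairs have been extracted from the generated and reference summaries. The extraction step itself is not shown. The function names, key names, and numeric values below are hypothetical and chosen only to illustrate the computation.

```python
def tolerant_f1(predicted, reference, tolerances, default_tol=0.0):
    """Information-level F1 over key -> value dicts.

    A predicted entry counts as correct only if its key exists in the
    reference and |predicted - reference| <= the tolerance for that key.
    """
    correct = sum(
        1
        for key, value in predicted.items()
        if key in reference
        and abs(value - reference[key]) <= tolerances.get(key, default_tol)
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mape(predicted, reference):
    """Mean absolute percentage error over keys present in both dicts.

    Reference values equal to zero are skipped to avoid division by zero.
    """
    terms = [
        abs(predicted[k] - reference[k]) / abs(reference[k])
        for k in reference
        if k in predicted and reference[k] != 0
    ]
    return 100.0 * sum(terms) / len(terms) if terms else float("nan")


# Illustrative values only (bond length in Å, bond angle in degrees, size in Å).
reference = {"bond_length": 1.95, "bond_angle": 90.0, "size": 10.0}
predicted = {"bond_length": 1.99, "bond_angle": 95.0}
tolerances = {"bond_length": 0.1, "bond_angle": 1.0}

f1 = tolerant_f1(predicted, reference, tolerances)
error_pct = mape(predicted, reference)
```

In this example only the bond length falls within tolerance, so precision is 1/2 and recall is 1/3. The missing `size` entry lowers recall but does not enter the MAPE, which is why the two metrics are best reported together.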
| #### Tertiary Task - Classification | |
| This is a three-class classification task that distinguishes among the TiO₂ polymorphs. Overall accuracy gives a general overview, but it is important to also report: | |
| - Class-wise precision, recall, and their harmonic mean (F₁ score), followed by macro-averaging to account for class imbalance | |
| - Full 3×3 confusion matrix to identify systematic misclassifications between phase pairs | |
| - Matthews correlation coefficient (MCC) and Cohen's κ statistic for chance-adjusted evaluations | |
| - Cross-entropy loss and macro-averaged area under the ROC curve (AUROC) when class probabilities are available | |
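The count-based metrics above all derive from the confusion matrix, so they can be sketched together in plain Python. Labels are assumed to be the integers 0, 1, 2 as defined for the phase label; the function names are hypothetical, and the six predictions at the end are invented for illustration.

```python
import math


def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)   # column sum
        true_k = sum(matrix[k])                       # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / true_k if true_k else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


def multiclass_mcc(matrix):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n))
    true_counts = [sum(matrix[k]) for k in range(n)]
    pred_counts = [sum(row[k] for row in matrix) for k in range(n)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    return numerator / denominator if denominator else 0.0


def cohen_kappa(matrix):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[k][k] for k in range(n)) / total
    expected = sum(sum(matrix[k]) * sum(row[k] for row in matrix) for k in range(n)) / total**2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Illustrative labels only: 0 = anatase, 1 = brookite, 2 = rutile.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred)
```

Cross-entropy and AUROC are omitted because they need class probabilities rather than hard predictions.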
| ### Known Limitations | |
| 1. **Limited Chemical Space**: Only covers TiO₂ polymorphs | |
| 2. **Temperature Range**: Limited to 0-1000K | |
| 3. **Computational Data**: All properties are from DFT calculations | |
| 4. **Modality Dependencies**: Some modalities may not be available for all samples | |
| ### Future Work | |
| - Extend to other materials systems | |
| - Include experimental data | |
| - Add more temperature points | |
| - Incorporate additional material properties | |
| - Support for more crystal structures |