# CrysMTM Dataset Card ## Dataset Description - **Repository:** [CrysMTM](https://github.com/KurbanIntelligenceLab/CrysMTM) - **Paper:** CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials - **Authors:** Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban - **Point of Contact:** [Can Polat](johnpolat.com) ### Dataset Summary CrysMTM is a comprehensive multiphase, temperature-resolved, multimodal dataset for crystalline materials research, specifically focused on titanium dioxide (TiO₂) polymorphs. The dataset is designed primarily for regression tasks to predict 9 key material properties from multimodal inputs. It contains three crystalline phases of TiO₂ (anatase, brookite, and rutile) across a temperature range of 0-1000K, with multiple data modalities including atomic coordinates, visual representations, and textual descriptions. ### Supported Tasks and Leaderboards The dataset primarily supports regression tasks for materials property prediction: 1. **Main Task - Regression**: Predict 9 material properties from multimodal inputs - HOMO energy, LUMO energy, band gap, Fermi energy, total energy, energy per atom, atomic displacement, volumetric expansion, and bond length changes 2. **Main Task - LLM Property Prediction**: Zero-shot and few-shot prediction of the 9 material properties using large language models 3. **Secondary Task - LLM Summary Generation**: Generate textual summaries of crystal structures and properties using large language models 4. **Tertiary Task - Classification**: Predict the crystalline phase (anatase, brookite, or rutile) from multimodal inputs ### Languages The dataset contains English text descriptions of crystal structures and properties. ## Dataset Structure ### Data Instances Each data instance represents a TiO₂ crystal structure at a specific temperature and rotation, containing: - **Phase**: One of three TiO₂ polymorphs (anatase, brookite, rutile) - **Temperature**: Temperature in Kelvin (0-1000K, in 50K increments) - **Rotation**: Rotation index for the crystal structure - **Modalities**: Multiple data representations of the same structure ### Data Fields #### Core Metadata - `phase` (string): Crystal phase - "anatase", "brookite", or "rutile" - `temperature` (integer): Temperature in Kelvin (0, 50, 100, ..., 1000) - `rotation` (integer): Rotation index for the crystal structure #### Multimodal Data - `image` (PIL.Image): Visual representation of the crystal structure (PNG format) - `xyz` (torch.Tensor): Atomic coordinates in XYZ format (N×3 tensor) - `text` (string): Textual description of the crystal structure and properties - `element` (list): List of element symbols for each atom #### Labels **Primary Labels - Regression**: - `regression_label` (torch.Tensor): 9-dimensional tensor containing the main prediction targets: - `HOMO` (float): HOMO energy (E_H) in eV - `LUMO` (float): LUMO energy (E_L) in eV - `Eg` (float): Band gap energy (E_g) in eV - `Ef` (float): Fermi energy (E_f) in eV - `Et` (float): Total energy of the system (E_T) in eV - `Eta` (float): Total energy per atom (E_Ta) in eV - `disp` (float): Maximum atomic displacement (Δr_max) in Å - `vol` (float): Volumetric expansion (ΔV) in ų - `bond` (float): Ti-O bond length change (Δd_Ti-O) in Å **Secondary Labels - Classification**: - `label` (integer): Phase label (0=anatase, 1=brookite, 2=rutile) **LLM Task Labels**: - Individual property values for zero-shot/few-shot prediction - Text summaries for generation tasks ### Data Splits The dataset is organized by temperature ranges: - **Training Set**: Temperatures 0-850K (excluding 250K, 450K, 650K, 750K, 800K) - **In-Distribution (ID) Test**: Temperatures 250K, 450K, 650K, 750K, 800K - **Out-of-Distribution (OOD) Test**: Temperatures 0K, 50K, 100K, 900K, 950K, 1000K ### Citation Information ```bibtex @dataset{crysmtm2024, title={CrysMTM: A Multiphase, Temperature-Resolved, Multimodal Dataset for Crystalline Materials}, author={Can Polat and Erchin Serpedin and Mustafa Kurban and Hasan Kurban}, year={2024}, url={https://github.com/KurbanIntelligenceLab/CrysMTM} } ``` ## Usage Examples ### Option 1: Download and Use Locally 1. **Download the dataset** from [https://huggingface.co/datasets/johnpolat/CrysMTM](https://huggingface.co/datasets/johnpolat/CrysMTM) 2. **Use the provided loading script**: ```python # Download load_dataset.py from the repository and place it in your data directory from load_dataset import load_dataset # Load the dataset dataset = load_dataset(".") # Access splits train_dataset = dataset["train"] # 5,064 samples test_id_dataset = dataset["test_id"] # 1,380 samples test_ood_dataset = dataset["test_ood"] # 6,588 samples # Get a sample sample = train_dataset[0] print(f"Phase: {sample['phase']}") print(f"Temperature: {sample['temperature']}K") print(f"Image: {sample['image']}") print(f"Regression labels: {sample['regression_labels']}") ``` ### Option 2: Use with Original Dataloaders ```python from dataloaders.regression_dataloader import RegressionLoader # Load dataset for regression (main task) dataset = RegressionLoader( label_dir="data", modalities=["image", "xyz", "text"], normalize_labels=True ) # Get a sample sample = dataset[0] print(f"Target Properties: {sample['regression_label']}") print(f"Temperature: {sample['temperature']}K") print(f"Phase: {sample['phase']}") print(f"Image shape: {sample['image'].size}") print(f"XYZ coordinates shape: {sample['xyz'].shape}") ``` ### Main Task - LLM Property Prediction ```python from dataloaders.llm_regression_dataloader import LLMLoader # Load dataset for LLM property prediction (main task) dataset = LLMLoader( label_dir="data", modalities=["text", "image"] ) # Get a sample for zero-shot/few-shot property prediction sample = dataset[0] print(f"HOMO: {sample['HOMO']}") print(f"LUMO: {sample['LUMO']}") print(f"Band gap: {sample['Eg']}") print(f"Temperature: {sample['temperature']}K") print(f"Phase: {sample['phase']}") ``` ### Secondary Task - LLM Summary Generation ```python from dataloaders.llm_regression_dataloader import LLMLoader # Load dataset for LLM summary generation (secondary task) dataset = LLMLoader( label_dir="data", modalities=["text", "image"] ) # Get a sample for summary generation sample = dataset[0] print(f"Input text: {sample['text'][:200]}...") print(f"Target properties: {sample['HOMO']}, {sample['LUMO']}, {sample['Eg']}") ``` ### Tertiary Task - Classification ```python from dataloaders.classification_dataloader import ClassificationLoader # Load dataset for classification (tertiary task) dataset = ClassificationLoader( base_dir="data", modalities=["image", "xyz", "text"], max_rotations=10 ) # Get a sample sample = dataset[0] print(f"Phase: {sample['label']}") print(f"Image shape: {sample['image'].size}") print(f"XYZ coordinates shape: {sample['xyz'].shape}") print(f"Text: {sample['text'][:100]}...") ``` ### PyTorch Geometric Integration ```python # For graph neural networks dataset = ClassificationLoader( base_dir="data", modalities=["xyz", "element"], as_pyg_data=True ) # Returns PyG Data objects sample = dataset[0] print(f"Node features: {sample.z}") print(f"Positions: {sample.pos}") print(f"Label: {sample.y}") ``` ## Technical Details ### File Structure ``` data/ ├── anatase/ │ ├── 0K/ │ │ ├── images/ │ │ │ ├── rot_0.png │ │ │ ├── rot_1.png │ │ │ └── ... │ │ ├── xyz/ │ │ │ ├── rot_0.xyz │ │ │ ├── rot_1.xyz │ │ │ └── ... │ │ └── text/ │ │ ├── rot_0.txt │ │ ├── rot_1.txt │ │ └── ... │ ├── 50K/ │ └── ... ├── brookite/ ├── rutile/ └── labels.csv ``` ### Data Formats #### XYZ Files Standard XYZ format with atomic coordinates: ``` [number of atoms] [comment line] [element] [x] [y] [z] [element] [x] [y] [z] ... ``` #### Images PNG format visualizations of crystal structures. #### Text Files Natural language descriptions of crystal structures and properties. #### Labels CSV Contains material properties for each phase-temperature combination: ```csv Polymorph,Temperature,Parameter,Value anatase,0K,HOMO,-7.2340 anatase,0K,LUMO,-4.1234 ... ``` ### Supported Models The dataset is compatible with various model architectures: - **Vision Models**: ResNet, ViT - **Graph Neural Networks**: SchNet, DimeNet, EGNN, FAENet, GoTenNet - **Language Models**: LLMs for zero-shot/few-shot learning - **Multimodal Models**: CLIP, Pure2DopeNet, ViSNet ### Performance Metrics #### Primary Task - Regression - Mean Absolute Error (MAE) - Root Mean Square Error (RMSE) - R² score - Per-property evaluation metrics #### Primary Task - LLM Property Prediction - Property prediction accuracy - Zero-shot vs few-shot performance comparison - Out-of-distribution generalization - Per-property evaluation metrics #### Secondary Task - LLM Summary Generation The nanoparticle summary task requires domain-specific evaluation beyond traditional string-based metrics like ROUGE or BLEU, which do not penalize incorrect numerical values. A more meaningful strategy is to extract structured key–value pairs-such as particle size, center of mass, coordination numbers, or bond angles-and compare them to ground truth using: - Information-level F₁ score that accepts only values within defined tolerances (e.g., 0.1 Å or 1 degree) - MAPE over all numeric entries - Factual consistency score like BERTScore or QA-based faithfulness after masking numeric values - Optional assessments of readability and clarity using expert judgment or coherence-based metrics (e.g., Coh-LM) #### Tertiary Task - Classification A three-class classification task to distinguish among the TiO₂ polymorphs. While overall accuracy provides a general overview, it is important to also report: - Class-wise precision, recall, and their harmonic mean (F₁ score), followed by macro-averaging to account for class imbalance - Full 3×3 confusion matrix to identify systematic misclassifications between phase pairs - Matthews correlation coefficient (MCC) and Cohen's κ statistic for chance-adjusted evaluations - Cross-entropy loss and macro-averaged area under the ROC curve (AUROC) when class probabilities are available ### Known Limitations 1. **Limited Chemical Space**: Only covers TiO₂ polymorphs 2. **Temperature Range**: Limited to 0-1000K 3. **Computational Data**: All properties are from DFT calculations 4. **Modality Dependencies**: Some modalities may not be available for all samples ### Future Work - Extend to other materials systems - Include experimental data - Add more temperature points - Incorporate additional material properties - Support for more crystal structures