---
license: mit
pipeline_tag: text-generation
inference: true
---

# MatterGPT

MatterGPT is a generative pre-trained transformer model for the inverse design of inorganic materials. It uses the SLICES (Simplified Line-Input Crystal-Encoding System) representation to generate novel crystal structures with targeted properties.

## Model Description

- **Model type:** Generative Pre-trained Transformer (GPT2)
- **Language(s):** SLICES (crystal structure representation)
- **License:** MIT
- **Finetuned from model:** GPT2

## Intended Uses & Limitations

MatterGPT is designed for:
- Generating crystal structures with specified formation energies and band gaps
- Multi-property targeted material design
- Exploring novel inorganic materials

Note: This model is trained on structures with up to 20 atoms per unit cell and may not generalize well to larger structures.

## How to Use

You can use this model directly with the Hugging Face Inference API:

```python
from huggingface_hub import InferenceApi

# InferenceApi is the legacy client; recent versions of huggingface_hub
# recommend InferenceClient instead.
inference = InferenceApi("your-username/mattergpt")

# Generate a single crystal structure for target property values
result = inference({"formation_energy": -1.0, "band_gap": 2.0})
print(result)

# Generate multiple crystal structures
results = inference([
    {"formation_energy": -1.0, "band_gap": 2.0},
    {"formation_energy": -2.0, "band_gap": 3.0}
])
for crystal in results:
    print(crystal)
```

For local usage, please refer to the detailed instructions below.

## How to Use MatterGPT Locally

This guide will help you get started with generating crystal structures using MatterGPT.

### Setup

First, ensure you have the necessary dependencies installed:

```bash
pip install torch tqdm
```

You'll also need the `matter_gpt_wrapper` module, which should be provided with the model.

### Loading the Model and Tokenizer

```python
import torch
from matter_gpt_wrapper import MatterGPTWrapper, SimpleTokenizer

# Load the model from the directory containing config.json and pytorch_model.pt
model_path = "./"
model = MatterGPTWrapper.from_pretrained(model_path)
model.to('cuda' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer from the vocabulary file
tokenizer_path = "Voc_prior"
tokenizer = SimpleTokenizer(tokenizer_path)
```

Make sure the `config.json`, `pytorch_model.pt`, and `Voc_prior` files are in the correct locations.

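As a quick sanity check, you can confirm those files exist before loading; a minimal illustrative snippet:

```python
import os

# Illustrative check: confirm the expected model and vocabulary files
# are present in the working directory before loading.
for path in ["config.json", "pytorch_model.pt", "Voc_prior"]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing required file: {path}")
```
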
### Generating a Single Sequence

To generate a single crystal structure:

```python
def generate_single(condition):
    # '>' is the start token that prompts generation
    context = '>'
    x = torch.tensor([tokenizer.stoi[context]], dtype=torch.long)[None, ...].to(model.device)
    # Shape the property conditions as a (batch, 1, n_props) tensor
    p = torch.tensor([condition]).unsqueeze(1).to(model.device)

    generated = model.generate(x, prop=p, max_length=model.config.block_size,
                               temperature=1.2, do_sample=True, top_k=0, top_p=0.9)
    return tokenizer.decode(generated[0].tolist())

# Example usage
condition = [-1.0, 2.0]  # target formation energy and band gap
single_sequence = generate_single(condition)
print(single_sequence)
```

### Generating Multiple Sequences

To generate multiple crystal structures in batches:

```python
from tqdm import tqdm

def generate_multiple(condition, num_sequences, batch_size=32):
    all_sequences = []
    for _ in tqdm(range(0, num_sequences, batch_size)):
        # Don't overshoot the requested number of sequences in the last batch
        current_batch_size = min(batch_size, num_sequences - len(all_sequences))
        context = '>'
        x = torch.tensor([tokenizer.stoi[context]], dtype=torch.long)[None, ...].repeat(current_batch_size, 1).to(model.device)
        p = torch.tensor([condition]).repeat(current_batch_size, 1).unsqueeze(1).to(model.device)

        generated = model.generate(x, prop=p, max_length=model.config.block_size,
                                   temperature=1.2, do_sample=True, top_k=0, top_p=0.9)
        all_sequences.extend([tokenizer.decode(seq.tolist()) for seq in generated])

        if len(all_sequences) >= num_sequences:
            break

    return all_sequences[:num_sequences]

# Example usage
condition = [-1.0, 2.0]  # target formation energy and band gap
num_sequences = 10
multiple_sequences = generate_multiple(condition, num_sequences)
for seq in multiple_sequences:
    print(seq)
```

### Notes

- The `condition` parameter is a list containing the desired formation energy and band gap values.
- The generated sequences are SLICES representations of crystal structures.
- You may need to post-process the generated SLICES strings to convert them into actual crystal structures, as sketched below.

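For example, the open-source SLICES toolkit can rebuild a structure from a generated string. The sketch below assumes the `slices` package from the SLICES repository is installed and that the generated string is valid; class and method names may differ between versions, so consult that package's documentation:

```python
from slices.core import SLICES  # assumed import path from the SLICES toolkit

# Rebuild a crystal structure from a generated SLICES string.
# SLICES2structure is the toolkit's reconstruction routine (name may vary).
backend = SLICES()
slices_string = single_sequence.strip()  # output of generate_single above
structure, final_energy_per_atom = backend.SLICES2structure(slices_string)
print(structure)
```
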
For more detailed information on the SLICES format and how to convert it to crystal structures, please refer to the full documentation.

## Training Data

The model was trained on Alex-20, a dataset of 280,033 unique crystal structures with up to 20 atoms per unit cell, derived from the Alexandria database.

## Training Procedure

MatterGPT was trained for 50 epochs with the Adam optimizer, using an initial learning rate of 1e-4 and a cosine annealing schedule. The model has approximately 80 million trainable parameters.

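In PyTorch terms, that optimizer and schedule correspond roughly to the sketch below. This is illustrative only, not the authors' training script; the loop body is a placeholder:

```python
import torch

# Illustrative sketch of the reported setup: Adam with an initial
# learning rate of 1e-4 and cosine annealing over the 50-epoch run.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    # ... one pass over the training set: next-token loss, backward(),
    # optimizer.step() ...
    scheduler.step()
```
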
## Evaluation Results

Performance metrics on the test set:
- Validity: >90%
- Uniqueness: >90%
- Novelty: ~40-60%
- MAPE for formation energy: ~11-13%
- MAPE for band gap: ~31-51%

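The MAPE figures measure how closely the properties of generated structures match the conditioning targets (e.g., as recomputed with DFT). For reference, the metric itself is straightforward; the arrays below are hypothetical:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

# Hypothetical example: targeted vs. recomputed formation energies
targets = [-1.0, -1.0, -2.0]
recomputed = [-0.9, -1.1, -1.8]
print(f"MAPE: {mape(targets, recomputed):.1f}%")  # -> MAPE: 10.0%
```
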
## Citation

If you use this model in your research, please cite:

[Include citation information when available]

## Contact

[Provide contact information or link to the GitHub repository for issues and questions]