---
language: en
license: mit
library_name: transformers
tags:
- materials-science
- crystal-structure
- inverse-design
pipeline_tag: text-generation
inference: true
---

# MatterGPT

MatterGPT is a generative pre-trained transformer model for inverse design of inorganic materials. It uses the SLICES (Simplified Line-Input Crystal-Encoding System) representation to generate novel crystal structures with targeted properties.

## Model Description

- **Model type:** Generative Pre-trained Transformer (GPT2)
- **Language(s):** SLICES (crystal structure representation)
- **License:** MIT
- **Finetuned from model:** GPT2

## Intended Uses & Limitations

MatterGPT is designed for:
- Generating crystal structures with specified formation energies and band gaps
- Multi-property targeted material design
- Exploring novel inorganic materials

Note: This model was trained on structures with up to 20 atoms per unit cell and may not generalize well to larger structures.

## How to Use

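You can query this model through the Hugging Face Inference API. Replace `your-username/mattergpt` with the model's repository ID. Note that `InferenceApi` is deprecated in recent `huggingface_hub` releases in favor of `InferenceClient`, so the snippet below may require an older version of the library.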

```python
from huggingface_hub import InferenceApi

inference = InferenceApi("your-username/mattergpt")

# Generate a single crystal structure
result = inference({"formation_energy": -1.0, "band_gap": 2.0})
print(result)

# Generate multiple crystal structures
results = inference([
    {"formation_energy": -1.0, "band_gap": 2.0},
    {"formation_energy": -2.0, "band_gap": 3.0}
])
for crystal in results:
    print(crystal)
```

For local usage, follow the instructions below.

## How to Use MatterGPT Locally

This guide shows how to load MatterGPT on your own machine and generate crystal structures with it.

### Setup

First, ensure you have the necessary dependencies installed:

```bash
pip install torch tqdm
```

You'll also need the `matter_gpt_wrapper` module, which should be provided with the model.

### Loading the Model and Tokenizer

```python
from matter_gpt_wrapper import MatterGPTWrapper, SimpleTokenizer
import torch
import os

# Load the model
model_path = "./"  # Directory containing config.json and pytorch_model.pt
model = MatterGPTWrapper.from_pretrained(model_path)
model.to('cuda' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer_path = "Voc_prior"
tokenizer = SimpleTokenizer(tokenizer_path)
```

Make sure `config.json` and `pytorch_model.pt` are in the directory given by `model_path`, and that `tokenizer_path` points to the `Voc_prior` file.
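Before loading the model, it can help to check that these files are where the loading code expects them. The following is a minimal sketch, not part of `matter_gpt_wrapper`: the helper name `missing_model_files` is hypothetical, and it assumes the layout used in the loading example above (model files in `./`, vocabulary file at `Voc_prior`).

```python
from pathlib import Path

def missing_model_files(model_dir="./", tokenizer_path="Voc_prior"):
    """Return the expected files that are not present on disk.

    Hypothetical helper; the file names come from the loading example.
    """
    expected = [
        Path(model_dir) / "config.json",
        Path(model_dir) / "pytorch_model.pt",
        Path(tokenizer_path),
    ]
    return [str(p) for p in expected if not p.is_file()]

missing = missing_model_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All model files found.")
```

If anything is reported missing, adjust `model_path` or `tokenizer_path` rather than moving files, so that the paths in your script stay explicit.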

### Generating a Single Sequence

To generate a single crystal structure:

```python
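def generate_single(condition):
    """Generate one SLICES string for a [formation energy, band gap] target."""
    # '>' is the context token that seeds generation
    context = '>'
    x = torch.tensor([tokenizer.stoi[context]], dtype=torch.long)[None,...].to(model.device)
    # Target properties, shaped (1, 1, number of properties)
    p = torch.tensor([condition]).unsqueeze(1).to(model.device)

    # Sample up to the model's block size with temperature 1.2 and top_p 0.9
    generated = model.generate(x, prop=p, max_length=model.config.block_size,
                               temperature=1.2, do_sample=True, top_k=0, top_p=0.9)
    return tokenizer.decode(generated[0].tolist())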

# Example usage
condition = [-1.0, 2.0]  # [formation energy, band gap]
single_sequence = generate_single(condition)
print(single_sequence)
```

### Generating Multiple Sequences

To generate multiple crystal structures:

```python
from tqdm import tqdm

def generate_multiple(condition, num_sequences, batch_size=32):
    """Generate num_sequences SLICES strings for one target, in batches."""
    all_sequences = []
    for _ in tqdm(range(0, num_sequences, batch_size)):
        # The last batch may be smaller than batch_size
        current_batch_size = min(batch_size, num_sequences - len(all_sequences))
        context = '>'
        # Repeat the context token and the target properties once per batch item
        x = torch.tensor([tokenizer.stoi[context]], dtype=torch.long)[None,...].repeat(current_batch_size, 1).to(model.device)
        p = torch.tensor([condition]).repeat(current_batch_size, 1).unsqueeze(1).to(model.device)

        generated = model.generate(x, prop=p, max_length=model.config.block_size,
                                   temperature=1.2, do_sample=True, top_k=0, top_p=0.9)
        all_sequences.extend([tokenizer.decode(seq.tolist()) for seq in generated])

        if len(all_sequences) >= num_sequences:
            break

    return all_sequences[:num_sequences]

# Example usage
condition = [-1.0, 2.0]  # [formation energy, band gap]
num_sequences = 10
multiple_sequences = generate_multiple(condition, num_sequences)
for seq in multiple_sequences:
    print(seq)
```

### Notes

- The `condition` parameter is a list containing the target formation energy and band gap, in that order.
- The generated sequences are SLICES representations of crystal structures.
- You may need to post-process the generated SLICES to convert them into actual crystal structures.

For more detailed information on the SLICES format and how to convert it to crystal structures, please refer to the full documentation.
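Because sampling is stochastic, a batch of generated sequences can contain repeats. One simple first post-processing step is to remove exact duplicates before converting anything to a crystal structure. The sketch below is an illustration only: the helper `deduplicate` is hypothetical, it compares raw strings (two different strings that describe the same crystal are not merged), and the input strings are placeholders rather than real SLICES output.

```python
def deduplicate(sequences):
    """Drop empty strings and exact duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for seq in sequences:
        seq = seq.strip()
        if seq and seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique

# Placeholder strings stand in for the output of generate_multiple().
raw = ["seq_a", "seq_b", "seq_a", "  ", "seq_c "]
print(deduplicate(raw))
```

In practice you would pass the list returned by `generate_multiple` instead of `raw`.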

## Training Data

The model was trained on the Alex-20 dataset, derived from the Alexandria database, containing 280,033 unique crystal structures with up to 20 atoms per unit cell.

## Training Procedure

MatterGPT was trained for 50 epochs using the Adam optimizer with an initial learning rate of 0.0001 and a cosine annealing schedule. The model has approximately 80 million trainable parameters.
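To make the schedule concrete, the sketch below evaluates the standard cosine annealing formula with the values stated above (50 epochs, initial learning rate 0.0001). Two details are assumptions, since the model card does not state them: the learning rate decays to a minimum of 0, and it is updated once per epoch rather than per step. This is not the actual training script.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=50, initial_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    min_lr=0.0 and per-epoch updates are assumptions, not documented values.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (initial_lr - min_lr) * cos_term

# Learning rate at the start, midpoint, and end of training
for epoch in (0, 25, 50):
    print(epoch, cosine_annealing_lr(epoch))
```

Under these assumptions the rate starts at the initial value, passes through half of it at the midpoint, and reaches the minimum at the final epoch.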

## Evaluation Results

Performance metrics on the test set:
- Validity: >90%
- Uniqueness: >90%
- Novelty: ~40-60%
- MAPE for formation energy: ~11-13%
- MAPE for band gap: ~31-51%
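The model card does not spell out how these metrics are computed. The sketch below shows one common set of definitions, which may differ from those used to obtain the numbers above: MAPE as the mean of |target - predicted| / |target|, uniqueness as the fraction of distinct generated strings, and novelty as the fraction of distinct generated strings absent from the training set. The function names and toy inputs are illustrative.

```python
def mape(targets, predictions):
    """Mean absolute percentage error, in percent. Targets must be nonzero."""
    errors = [abs((t - p) / t) for t, p in zip(targets, predictions)]
    return 100.0 * sum(errors) / len(errors)

def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training_set):
    """Fraction of distinct generated strings not found in the training set."""
    unique = set(generated)
    return len(unique - set(training_set)) / len(unique)

# Toy values, not real model output
print(mape([2.0, 4.0], [1.0, 5.0]))
print(uniqueness(["a", "a", "b", "c"]))
print(novelty(["a", "a", "b", "c"], ["a"]))
```

Note that MAPE is undefined for a target of exactly zero, which matters for band gap targets of metallic materials.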

## Citation

If you use this model in your research, please cite:

[Include citation information when available]

## Contact

[Provide contact information or link to the GitHub repository for issues and questions]