---
license: mit
pipeline_tag: text-generation
inference: true
---

# MatterGPT

MatterGPT is a generative pre-trained transformer model for the inverse design of inorganic materials. It uses the SLICES (Simplified Line-Input Crystal-Encoding System) representation to generate novel crystal structures with targeted properties.

## Model Description

- **Model type:** Generative Pre-trained Transformer (GPT-2 architecture)
- **Language(s):** SLICES (crystal structure representation)
- **License:** MIT
- **Finetuned from model:** GPT2

## Intended Uses & Limitations

MatterGPT is designed for:
- Generating crystal structures with specified formation energies and band gaps
- Multi-property targeted material design
- Exploring novel inorganic materials

Note: This model is trained on structures with up to 20 atoms per unit cell and may not generalize well to larger structures.

## How to Use

You can use this model directly with the Hugging Face Inference API:

```python
from huggingface_hub import InferenceApi

inference = InferenceApi("your-username/mattergpt")

# Generate a single crystal structure
result = inference({"formation_energy": -1.0, "band_gap": 2.0})
print(result)

# Generate multiple crystal structures
results = inference([
    {"formation_energy": -1.0, "band_gap": 2.0},
    {"formation_energy": -2.0, "band_gap": 3.0},
])
for crystal in results:
    print(crystal)
```

For local usage, please refer to the detailed instructions below.

## How to Use MatterGPT Locally

This guide will help you get started with generating crystal structures using the MatterGPT model.

### Setup

First, ensure you have the necessary dependencies installed:

```bash
pip install torch tqdm
```

You'll also need the `matter_gpt_wrapper` module, which should be provided with the model.

### Loading the Model and Tokenizer

```python
import torch

from matter_gpt_wrapper import MatterGPTWrapper, SimpleTokenizer

# Load the model
model_path = "./"  # directory containing config.json and pytorch_model.pt
model = MatterGPTWrapper.from_pretrained(model_path)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer
tokenizer_path = "Voc_prior"
tokenizer = SimpleTokenizer(tokenizer_path)
```

Make sure the `config.json`, `pytorch_model.pt`, and `Voc_prior` files are in the correct locations.
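
The `Voc_prior` file holds the token vocabulary. As a rough illustration of what a tokenizer like `SimpleTokenizer` provides (`stoi`/`decode` as used below; the real class's internals may differ, and the token list here is made up), a minimal sketch:

```python
class ToyTokenizer:
    """Illustrative stand-in for SimpleTokenizer: a fixed list of
    SLICES tokens mapped to integer ids, with encode/decode helpers."""

    def __init__(self, tokens):
        self.stoi = {tok: i for i, tok in enumerate(tokens)}
        self.itos = {i: tok for tok, i in self.stoi.items()}

    def encode(self, tokens):
        return [self.stoi[t] for t in tokens]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

# SLICES strings are sequences of discrete tokens (element symbols,
# bond/edge codes, etc.); '>' serves as the start-of-sequence token.
tok = ToyTokenizer([">", "<", "Li", "O", "0", "1", "+", "-"])
ids = tok.encode([">", "Li", "O"])
assert tok.decode(ids) == ">LiO"
```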

### Generating a Single Sequence

To generate a single crystal structure:

```python
def generate_single(condition):
    # Start from the begin-of-sequence token '>'
    context = ">"
    x = torch.tensor([tokenizer.stoi[context]], dtype=torch.long)[None, ...].to(model.device)
    # Property conditioning vector, shape (1, 1, num_properties)
    p = torch.tensor([condition]).unsqueeze(1).to(model.device)

    generated = model.generate(x, prop=p, max_length=model.config.block_size,
                               temperature=1.2, do_sample=True, top_k=0, top_p=0.9)
    return tokenizer.decode(generated[0].tolist())

# Example usage
condition = [-1.0, 2.0]  # formation energy and band gap
single_sequence = generate_single(condition)
print(single_sequence)
```
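
The `generate` call above combines a temperature of 1.2 with nucleus (top-p) sampling (`top_p=0.9`, `top_k=0` disables top-k). Top-p sampling keeps only the smallest set of highest-probability tokens whose cumulative probability reaches `p`, then renormalizes before drawing. A minimal pure-Python sketch with made-up token probabilities:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

dist = {"Li": 0.5, "O": 0.3, "Si": 0.15, "+": 0.05}
filtered = top_p_filter(dist, p=0.9)
# "Li" and "O" only reach 0.8 < 0.9, so "Si" is also kept; "+" is dropped
```

Higher temperatures flatten the distribution before this cutoff, trading fidelity for diversity.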

### Generating Multiple Sequences

To generate multiple crystal structures in batches:

```python
from tqdm import tqdm

def generate_multiple(condition, num_sequences, batch_size=32):
    all_sequences = []
    for _ in tqdm(range(0, num_sequences, batch_size)):
        # The last batch may be smaller than batch_size
        current_batch_size = min(batch_size, num_sequences - len(all_sequences))
        context = ">"
        x = torch.tensor([tokenizer.stoi[context]], dtype=torch.long)[None, ...].repeat(current_batch_size, 1).to(model.device)
        p = torch.tensor([condition]).repeat(current_batch_size, 1).unsqueeze(1).to(model.device)

        generated = model.generate(x, prop=p, max_length=model.config.block_size,
                                   temperature=1.2, do_sample=True, top_k=0, top_p=0.9)
        all_sequences.extend(tokenizer.decode(seq.tolist()) for seq in generated)

    return all_sequences[:num_sequences]

# Example usage
condition = [-1.0, 2.0]  # formation energy and band gap
num_sequences = 10
multiple_sequences = generate_multiple(condition, num_sequences)
for seq in multiple_sequences:
    print(seq)
```
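
Because sampling is stochastic, repeated draws under the same condition can yield duplicate SLICES strings. Before post-processing, it can help to keep only unique generations; `dedupe` below is a hypothetical helper, not part of `matter_gpt_wrapper`:

```python
def dedupe(sequences):
    """Drop duplicate generated sequences while preserving order."""
    return list(dict.fromkeys(sequences))

samples = [">LiO", ">SiO", ">LiO", ">NaCl"]
assert dedupe(samples) == [">LiO", ">SiO", ">NaCl"]
```

Note that this only removes string-identical duplicates; two distinct SLICES strings can still decode to equivalent crystal structures.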

### Notes

- The `condition` parameter is a list containing the desired formation energy and band gap values.
- The generated sequences are SLICES representations of crystal structures.
- You may need to post-process the generated SLICES strings to convert them into actual crystal structures.

For more detailed information on the SLICES format and how to convert it to crystal structures, please refer to the full documentation.

## Training Data

The model was trained on the Alex-20 dataset, derived from the Alexandria database, which contains 280,033 unique crystal structures with up to 20 atoms per unit cell.

## Training Procedure

MatterGPT was trained for 50 epochs using the Adam optimizer with an initial learning rate of 0.0001 and a cosine annealing schedule. The model has approximately 80 million trainable parameters.

## Evaluation Results

Performance metrics on the test set:
- Validity: >90%
- Uniqueness: >90%
- Novelty: ~40-60%
- MAPE for formation energy: ~11-13%
- MAPE for band gap: ~31-51%
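
The MAPE figures compare the property values requested at generation time with those computed (e.g., via DFT) for the generated structures. For reference, MAPE is computed as follows (the numbers here are made up for illustration):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

requested = [-1.0, -2.0, -0.5]   # target formation energies (hypothetical)
achieved = [-1.1, -1.8, -0.55]   # values computed for generated structures (hypothetical)
print(round(mape(requested, achieved), 1))  # 10.0
```

MAPE divides by the true value, so it inflates for properties near zero; this partly explains why band-gap MAPE runs higher than formation-energy MAPE.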

## Citation

If you use this model in your research, please cite:

[Include citation information when available]

## Contact

[Provide contact information or link to the GitHub repository for issues and questions]