McClain committed on
Commit
04646d8
·
verified ·
1 Parent(s): c139f01

Upload 4 files

Files changed (4)
  1. README.md +73 -3
  2. config.json +283 -0
  3. generation_config.json +6 -0
  4. model.safetensors +3 -0
README.md CHANGED
@@ -1,3 +1,73 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - dna
+ tags:
+ - biology
+ - genomics
+ - foundation-model
+ license: apache-2.0
+ ---
+
+ # Evo 2 (1B Base) - Hugging Face Transformers Format
+
+ This repository contains the **Evo 2 (1B Base)** model, converted to the Hugging Face Transformers format.
+
+ **Original Repository:** [arcinstitute/evo2_1b_base](https://huggingface.co/arcinstitute/evo2_1b_base)
+ **Paper:** [Genome modeling and design across all domains of life with Evo 2](https://www.biorxiv.org/content/10.1101/2024.02.27.582234v1)
+ **Authors:** Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, et al.
+
+ ## Model Description
+
+ Evo 2 is a biological foundation model trained on 9.3 trillion DNA base pairs from a curated genomic atlas spanning all domains of life. It uses the StripedHyena architecture to process long sequences (up to 1 million base pairs) at nucleotide-level resolution. This model is designed for tasks such as predicting the functional effects of mutations and generating novel genomic sequences.
+
+ This version has been converted to be compatible with the `transformers` library, allowing for easy loading and inference.
+
+ ## Usage
+
+ You can load and run this model using the `transformers` library as follows:
+
+ ```python
+ import torch
+ from transformers import Evo2ForCausalLM, Evo2Tokenizer
+
+ # Replace with your local path or the Hub repo ID after uploading
+ model_path = "path/to/this/repo"
+
+ print(f"Loading model from {model_path}...")
+ model = Evo2ForCausalLM.from_pretrained(model_path)
+ tokenizer = Evo2Tokenizer.from_pretrained(model_path)
+
+ # Move to GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device)
+
+ # Input sequence (DNA)
+ sequence = "ACGTACGT"
+ print(f"Input: {sequence}")
+
+ # Tokenize
+ input_ids = tokenizer.encode(sequence, return_tensors="pt").to(device)
+
+ # Generate
+ print("Generating...")
+ with torch.no_grad():
+     output = model.generate(input_ids, max_new_tokens=20)
+
+ # Decode
+ generated_sequence = tokenizer.decode(output[0])
+ print(f"Output: {generated_sequence}")
+ ```
+
+ ## Citation
+
+ If you use this model, please cite the original paper:
+
+ ```bibtex
+ @article{brixi2024genome,
+   title={Genome modeling and design across all domains of life with Evo 2},
+   author={Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and Poli, Michael and others},
+   journal={bioRxiv},
+   year={2024},
+   publisher={Cold Spring Harbor Laboratory}
+ }
+ ```
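For the variant-effect prediction use case named in the model description, a common approach is to compare sequence log-likelihoods under a causal language model. The helper below is a minimal, self-contained sketch of that computation; it uses random logits in place of a real `model(input_ids).logits` call so it runs without downloading the weights, and the function name is illustrative, not part of this repository.

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits: torch.Tensor, input_ids: torch.Tensor) -> float:
    """Sum of log-probabilities of each token given the tokens before it.

    logits:    (1, seq_len, vocab_size) raw causal-LM outputs
    input_ids: (1, seq_len) token ids
    """
    # The prediction at position t scores the token at position t+1,
    # so drop the last logit and the first target token.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

# Demo with random logits standing in for model(input_ids).logits.
torch.manual_seed(0)
vocab_size, seq_len = 512, 8
logits = torch.randn(1, seq_len, vocab_size)
input_ids = torch.randint(0, vocab_size, (1, seq_len))
ll = sequence_log_likelihood(logits, input_ids)
print(f"log-likelihood: {ll:.2f}")
```

Scoring a mutation would then amount to running this on the reference and the mutated sequence and comparing the two totals; a lower likelihood for the mutant suggests a more disruptive change.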
config.json ADDED
@@ -0,0 +1,283 @@
+ {
+   "architectures": [
+     "Evo2ForCausalLM"
+   ],
+   "attn_dropout": 0.0,
+   "dtype": "float32",
+   "eos_token_id": 0,
+   "hidden_dropout": 0.0,
+   "hidden_size": 1920,
+   "hyena_filter_configurations": [
+     {
+       "h_shape": [
+         128,
+         1,
+         7
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "h_shape": [
+         128,
+         1,
+         128
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "log_poles_shape": [
+         1920,
+         16,
+         1
+       ],
+       "residues_shape": [
+         1920,
+         16
+       ]
+     },
+     {},
+     {
+       "h_shape": [
+         128,
+         1,
+         7
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "h_shape": [
+         128,
+         1,
+         128
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "log_poles_shape": [
+         1920,
+         16,
+         1
+       ],
+       "residues_shape": [
+         1920,
+         16
+       ]
+     },
+     {
+       "h_shape": [
+         128,
+         1,
+         7
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "h_shape": [
+         128,
+         1,
+         128
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "log_poles_shape": [
+         1920,
+         16,
+         1
+       ],
+       "residues_shape": [
+         1920,
+         16
+       ]
+     },
+     {},
+     {
+       "h_shape": [
+         128,
+         1,
+         7
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "h_shape": [
+         128,
+         1,
+         128
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "log_poles_shape": [
+         1920,
+         16,
+         1
+       ],
+       "residues_shape": [
+         1920,
+         16
+       ]
+     },
+     {
+       "h_shape": [
+         128,
+         1,
+         7
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "h_shape": [
+         128,
+         1,
+         128
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "log_poles_shape": [
+         1920,
+         16,
+         1
+       ],
+       "residues_shape": [
+         1920,
+         16
+       ]
+     },
+     {},
+     {
+       "h_shape": [
+         128,
+         1,
+         7
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "h_shape": [
+         128,
+         1,
+         128
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "log_poles_shape": [
+         1920,
+         16,
+         1
+       ],
+       "residues_shape": [
+         1920,
+         16
+       ]
+     },
+     {
+       "h_shape": [
+         128,
+         1,
+         7
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "h_shape": [
+         128,
+         1,
+         128
+       ]
+     },
+     {
+       "D_shape": [
+         1920
+       ],
+       "log_poles_shape": [
+         1920,
+         16,
+         1
+       ],
+       "residues_shape": [
+         1920,
+         16
+       ]
+     },
+     {}
+   ],
+   "hyena_filters": 128,
+   "hyena_flip_x1x2": false,
+   "hyena_hidden_size": 1920,
+   "hyena_kernel_size": 3,
+   "hyena_order": 3,
+   "initializer_range": 0.02,
+   "intermediate_size": 5120,
+   "layer_types": [
+     "hyena",
+     "hyena",
+     "hyena",
+     "attention",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "attention",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "attention",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "hyena",
+     "attention"
+   ],
+   "max_position_embeddings": 2048,
+   "mlp_dropout": 0.0,
+   "model_type": "evo2",
+   "num_attention_heads": 15,
+   "num_hidden_layers": 25,
+   "num_key_value_heads": 15,
+   "pad_token_id": 1,
+   "rms_norm_eps": 1e-06,
+   "rope_parameters": {
+     "rope_theta": 1000000.0,
+     "rope_type": "default"
+   },
+   "rope_theta": 1000000.0,
+   "transformers_version": "5.0.0.dev0",
+   "use_cache": true,
+   "vocab_size": 512
+ }
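A few relationships in the config above can be checked mechanically: `layer_types` must contain one entry per hidden layer, and `hidden_size` must divide evenly by `num_attention_heads`. A small sanity-check sketch, with the relevant values copied from the config:

```python
# Key fields copied from the config.json above.
config = {
    "hidden_size": 1920,
    "num_attention_heads": 15,
    "num_hidden_layers": 25,
    "layer_types": (
        ["hyena"] * 3 + ["attention"]
        + ["hyena"] * 6 + ["attention"]
        + ["hyena"] * 6 + ["attention"]
        + ["hyena"] * 6 + ["attention"]
    ),
}

# One layer type per hidden layer.
assert len(config["layer_types"]) == config["num_hidden_layers"]

# Attention layers sit at indices 3, 10, 17, 24 -- one closing each
# block of the repeating 3-hyena/6-hyena pattern.
attn_idx = [i for i, t in enumerate(config["layer_types"]) if t == "attention"]
print(attn_idx)  # [3, 10, 17, 24]

# Per-head dimension: 1920 / 15 = 128.
head_dim = config["hidden_size"] // config["num_attention_heads"]
print(head_dim)  # 128
```

So the 25 layers interleave 21 Hyena layers with 4 attention layers, and each attention head works on a 128-dimensional slice of the 1920-dimensional hidden state.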
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "eos_token_id": 0,
+   "pad_token_id": 1,
+   "transformers_version": "5.0.0.dev0"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f24757c711cc33668450ae2c81cc0207ddde947f742494301eb2fd193686fb08
+ size 4431961904