JuIm committed on
Commit 2d6c4e6 · verified · 1 Parent(s): 4f69217

Update README.md

Files changed (1)
  1. README.md +22 -4
README.md CHANGED
@@ -11,16 +11,34 @@ pipeline_tag: text-generation

  # ProGemma

- This is a custom configuration of Google's Gemma 2 model that was pre-trained on amino acid sequences of lengths 0 to 512.
+ This is a custom configuration of Google's Gemma 2 model that is being pre-trained on amino acid sequences of lengths 0 to 512.
+ I used the free version of Google Colab to train this model, so updates are made regularly as the model hits new checkpoints.
+ As of 07.28.2024, the model has been trained on about 5% of the dataset.

+ The model generates amino acids on a letter-by-letter basis.

- ## Model description
+ Current training loss is about 2.7. Preliminary evaluation of generated sequences with AlphaFold 3 shows pTM scores of ~0.4 and
+ average pLDDT scores of ~60. After training is complete, a proper evaluation will be done to see whether the sequences result in proteins with
+ a low free energy. Perplexity scores will also be calculated.
+
+ The purpose of this model was to see whether I could develop an alternative to ProtGPT2. ProGemma also serves as a stepping stone
+ to a new model that will also utilize control tags to generate proteins based on function.
+
+ To use this model yourself with the pipeline in the Transformers package, please see the code below:
+
+ from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained("JuIm/ProGemma")
+ tokenizer = AutoTokenizer.from_pretrained("JuIm/Amino-Acid-Sequence-Tokenizer")
+
+ progemma = pipeline("text-generation", model=model, tokenizer=tokenizer)
+
+ sequence = progemma("bosM", top_k=950, max_length=100, num_return_sequences=1, do_sample=True, repetition_penalty=1.2, eos_token_id=21, pad_token_id=222, bos_token_id=20)
+ print(sequence)



- ## Intended uses & limitations

- The purpose of this model was to

  ### Framework versions
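
The README text in this commit reports a training loss of about 2.7 and says perplexity will be calculated later. As a rough illustration of how those two numbers relate, the sketch below converts a loss into a perplexity. It assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training in Transformers but is not confirmed by the commit. `perplexity_from_loss` is a hypothetical helper, not part of the ProGemma repository.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) into a perplexity.

    Assumption: the loss is averaged over tokens and uses the natural log.
    """
    if mean_cross_entropy < 0:
        raise ValueError("cross-entropy loss cannot be negative")
    return math.exp(mean_cross_entropy)


# Loss value quoted in the README as of this commit.
reported_loss = 2.7
print(f"perplexity ~ {perplexity_from_loss(reported_loss):.1f}")  # perplexity ~ 14.9
```

Under that assumption, a loss of 2.7 corresponds to a perplexity of roughly 14.9, meaning the model is about as uncertain as a uniform choice among 15 tokens at each position. A perplexity measured on held-out sequences after training could differ from this figure, since it is derived from the training loss only.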