JuIm committed on
Commit d3ca0bb · verified · 1 Parent(s): 5827342

Update README.md

Files changed (1)
  1. README.md +12 -20
README.md CHANGED
@@ -13,37 +13,29 @@ should probably proofread and complete it, then remove this comment. -->
 
 # ProGemma2
 
- This model is a fine-tuned version of [JuIm/ProGemma2](https://huggingface.co/JuIm/ProGemma2) on an unknown dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.001
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.4
- - training_steps: 3500
-
- ### Training results
 
 ### Framework versions
 
 # ProGemma2
 
+ This is a custom configuration (336M parameters) of Google’s Gemma 2 LLM that is being pre-trained on amino acid sequences of 512 AA or fewer in length. Periodic updates are made to this page as training reaches new checkpoints.
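The exact architecture is not listed on this card; purely as an illustration of what a down-scaled Gemma 2 configuration looks like in the Transformers library, here is a minimal sketch. Every number below is an assumption for demonstration, not the actual ProGemma2 hyperparameters.

```python
from transformers import Gemma2Config, Gemma2ForCausalLM

# Illustrative, down-scaled Gemma 2 config -- NOT the real ProGemma2 settings.
config = Gemma2Config(
    vocab_size=25,                 # small amino-acid vocabulary (assumption)
    hidden_size=1024,
    intermediate_size=4096,
    num_hidden_layers=12,
    num_attention_heads=8,
    num_key_value_heads=4,
    max_position_embeddings=512,   # the card states sequences of <= 512 AA
)
model = Gemma2ForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```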
 
+ The purpose of this model was to investigate the differences between ProGemma and ProtGPT (GPT-2 architecture) as they pertain to sequence generation. Training loss is ~1.6. Perplexity scores, as well as AlphaFold 3’s pTM, pLDDT, and ipTM scores, are generally in line with ProtGPT’s scores for sequence lengths < 250 AA, although the testing phase is still very early. I have yet to test sequence lengths > 250 AA, and more robust testing is also required for lengths < 250 AA. In my very preliminary testing, HHblits e-values of ~0.1 are achieved relatively easily.
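For context on the perplexity comparison, here is a minimal sketch of how per-sequence perplexity can be computed from the causal-LM loss. It reuses the model and tokenizer names from the generation snippet below; it is an illustration, not an evaluation script from this repository, and the example sequence is arbitrary.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model/tokenizer names taken from the generation example later in this card.
model = AutoModelForCausalLM.from_pretrained("JuIm/ProGemma")
tokenizer = AutoTokenizer.from_pretrained("JuIm/Amino-Acid-Sequence-Tokenizer")
model.eval()

def perplexity(sequence: str) -> float:
    # Per-token perplexity = exp(mean cross-entropy) of the sequence under the model.
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return float(torch.exp(loss))

# Arbitrary example amino-acid sequence (not from the model card).
print(perplexity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```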
 
+ Controlled generation is not a capability of this model; adding it would be a way to significantly improve generation, since, in principle, a sequence that performs a given function or resides in a particular cellular location could then be generated directly.
 
+ In sequence generation, a top_k of 950 appears to work well, as it prevents repetition. This is also seen in ProtGPT.
 
+ Below is code using the Transformers library to generate sequences with ProGemma.
 
+ from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
 
+ model = AutoModelForCausalLM.from_pretrained("JuIm/ProGemma")
 
+ tokenizer = AutoTokenizer.from_pretrained("JuIm/Amino-Acid-Sequence-Tokenizer")
 
+ progemma = pipeline("text-generation", model=model, tokenizer=tokenizer)
 
+ sequence = progemma("<bos>", top_k=950, max_length=100, num_return_sequences=1, do_sample=True, repetition_penalty=1.2, eos_token_id=21, pad_token_id=22, bos_token_id=20)
 
+ s = sequence[0]['generated_text']
 
+ print(s)
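As a small follow-up sketch (not part of the original card), the raw generated text can be stripped of special tokens and whitespace and written out in FASTA format. The `<...>` token strings, the record name, and the file name are assumptions; `s` is the generated text from the snippet above.

```python
import re

def to_fasta(raw: str, name: str = "progemma_sample") -> str:
    # Strip special tokens such as <bos>/<eos> (token strings are an assumption)
    # and anything that is not an amino-acid letter.
    seq = re.sub(r"<[^>]*>", "", raw)
    seq = re.sub(r"[^A-Za-z]", "", seq).upper()
    return f">{name}\n{seq}\n"

with open("generated.fasta", "w") as fh:
    fh.write(to_fasta(s))
```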
 
 ### Framework versions