---
library_name: transformers
base_model: JuIm/ProGemma2
tags:
- generated_from_trainer
model-index:
- name: ProGemma2
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# ProteinLM

This is a custom 336M-parameter configuration of Google’s Gemma 2 LLM that is being pre-trained on amino acid sequences of 512 amino acids (AA) or fewer. This page is updated periodically as training reaches new checkpoints.


The purpose of this model is to investigate how ProGemma and ProtGPT (GPT-2 architecture) differ in sequence generation. Training loss is currently ~1.6. For sequence lengths < 250 AA, perplexity scores and AlphaFold 3’s pTM, pLDDT, and ipTM scores are generally in line with ProtGPT’s, although testing is still at a very early stage: more robust testing is needed for lengths < 250 AA, and I have not yet tested lengths > 250 AA. In very preliminary testing, HHblits E-values of ~0.1 are achieved without much guidance.
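As a rough guide to how a loss value relates to perplexity, here is a minimal sketch. It assumes the reported training loss is the mean per-token cross-entropy in nats, which this card does not state explicitly. The function names are illustrative and not part of any library. The result is a training-set figure, not the evaluation perplexity discussed above.

```python
import math

def perplexity_from_loss(mean_nll_nats):
    """Convert a mean per-token cross-entropy (assumed to be in nats) to perplexity."""
    return math.exp(mean_nll_nats)

def perplexity_from_token_logprobs(logprobs):
    """Perplexity of one sequence from its per-token natural-log probabilities."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(logprobs) / len(logprobs)
    return math.exp(mean_nll)

# Under the assumption above, a training loss of ~1.6 corresponds to:
print(round(perplexity_from_loss(1.6), 2))  # 4.95
```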

Controlled generation is not yet a capability of this model. Adding it would be one way to significantly improve generation since, in principle, a sequence that performs a given function or resides in a particular cellular location could then be generated.

In sequence generation, a top_k of 950 appears to work well, as it prevents repetition. The same behavior is seen with ProtGPT.
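For readers unfamiliar with top-k sampling, the sketch below shows the general idea in plain Python: keep the k highest-scoring tokens, renormalize, and sample among them. It illustrates the technique only and is not the Transformers implementation. The function name and the toy logits are hypothetical.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Clamp k to the vocabulary size: if k is larger, nothing is filtered out.
    k = min(k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy example: with k=2, only indices 0 and 2 can ever be drawn.
toy_logits = [5.0, 1.0, 4.0, -2.0]
print(top_k_sample(toy_logits, k=2))
```

In this sketch, a value of k at or above the vocabulary size samples from the full distribution. How 950 compares with this model's vocabulary size depends on the tokenizer, which the card does not describe.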

The code below uses the Transformers library to generate sequences with ProGemma.

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load the model and the amino acid tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("JuIm/ProGemma")
tokenizer = AutoTokenizer.from_pretrained("JuIm/Amino-Acid-Sequence-Tokenizer")

progemma = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Sample one sequence, starting from the beginning-of-sequence token.
sequence = progemma(
    "<bos>",
    top_k=950,
    max_length=100,
    num_return_sequences=1,
    do_sample=True,
    repetition_penalty=1.2,
    eos_token_id=21,
    pad_token_id=22,
    bos_token_id=20,
)

s = sequence[0]["generated_text"]
print(s)
```
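Before scoring a generated sequence with downstream tools, it can help to run a basic sanity check. The sketch below strips special tokens and whitespace, then checks that the result uses only the 20 standard amino acid letters and is no longer than 512 AA. The token strings `<eos>` and `<pad>`, the exact format of the pipeline's output text, and both function names are assumptions, so adjust them to match what the tokenizer actually produces.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
SPECIAL_TOKENS = ("<bos>", "<eos>", "<pad>")  # assumed token strings
MAX_LEN = 512  # training length limit stated in this card

def clean_generated(text):
    """Remove assumed special tokens and all whitespace from generated text."""
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    return "".join(text.split()).upper()

def is_plausible_protein(seq, max_len=MAX_LEN):
    """True if seq is non-empty, within max_len, and uses only standard residues."""
    return 0 < len(seq) <= max_len and set(seq) <= STANDARD_AA

# Hypothetical output string, for illustration only.
example = clean_generated("<bos>M K V L A<eos>")
print(example, is_plausible_protein(example))
```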

### Framework versions

- Transformers 4.44.2
- PyTorch 2.4.0+cu121
- Tokenizers 0.19.1