JuIm committed
Commit cef8c5b · verified · 1 Parent(s): 5fe2c0a

Update README.md

A custom configuration of Google's Gemma 2 LLM, pre-trained on amino acid sequences of up to 512 residues in length. This configuration has 275M parameters.
This model is being trained on Google Colab (free version), so regular updates are made upon hitting new checkpoints.
The dataset contains ~500k sequences in total; as of 7/28/2024, the model has been trained on about 5% of it, with a training loss of ~2.7 at that point.
The tokenizer uses bos, eos, and pad special tokens, and each sequence is padded to length 512.
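
The padding scheme described above can be sketched in plain Python. This is an illustrative sketch, not the tokenizer's actual implementation; the special-token ids (bos=20, eos=21, pad=22) are taken from the generation example later in this README.

```python
# Illustrative sketch of the bos/eos/pad padding scheme (not the real tokenizer).
# Special-token ids taken from the generation example in this README.
BOS, EOS, PAD = 20, 21, 22
MAX_LEN = 512

def pad_sequence(token_ids):
    """Wrap a tokenized amino acid sequence with bos/eos, then pad to MAX_LEN."""
    ids = [BOS] + token_ids + [EOS]
    return ids + [PAD] * (MAX_LEN - len(ids))

padded = pad_sequence([3, 7, 11])  # hypothetical residue token ids
print(len(padded))  # 512
```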

The purpose of this model was simply to build my own version of NVIDIA's ProtGPT.
Upon completion of training, the model will be properly evaluated, looking at perplexity, the energy of the generated proteins, and AlphaFold 3 pLDDT/pTM scores.
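
Of the planned metrics, perplexity is the simplest: the exponentiated mean per-token negative log-likelihood. A minimal sketch, assuming natural-log likelihoods:

```python
import math

def perplexity(neg_log_likelihoods):
    """Perplexity = exp(mean per-token negative log-likelihood, natural log)."""
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# A model assigning probability 1/2 to every token has perplexity ~2.
print(perplexity([math.log(2)] * 10))
```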

To try this model out for yourself, see the code below:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("JuIm/ProGemma")
tokenizer = AutoTokenizer.from_pretrained("JuIm/Amino-Acid-Sequence-Tokenizer")

progemma = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Sample one sequence, starting generation from the bos token.
sequence = progemma(
    "<bos>",
    max_length=150,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=1,
    eos_token_id=21,
    bos_token_id=20,
    pad_token_id=22,
)
print(sequence)
# [{'generated_text': '<bos>MLSLFSWFENKLDKTLKKISRIELFRKKITEVICDEHIYVMKPPFSEKTTLTREGYECGSRTMPNLARPDTYLLSRFKENCYGLHYTILGCSKNLLAPFGATFTSMLSVMVIFIFLFTKVEDFIKRCEGAGWVITEFGSTSGVPAVGPG'}]
```

### Framework versions