A custom configuration of Google's Gemma 2 LLM, pre-trained on amino acid sequences of up to 512 residues in length. This configuration has 275M parameters.

This model is being trained on Google Colab (free tier), so regular updates are made as new checkpoints are reached.

The dataset contains ~500k sequences in total; as of 7/28/2024, the model has been trained on about 5% of it, and training loss at that point is ~2.7.

The tokenizer uses `bos`, `eos`, and `pad` special tokens, and each sequence is padded to length 512.

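The framing and padding can be sketched in plain Python, using the special-token IDs from the generation example below (`bos=20`, `eos=21`, `pad=22`):

```python
# Sketch of how each sequence is framed with special tokens and
# padded to length 512 (token IDs taken from the generation example:
# bos=20, eos=21, pad=22).
BOS_ID, EOS_ID, PAD_ID = 20, 21, 22
MAX_LEN = 512

def pad_sequence(amino_acid_ids):
    framed = [BOS_ID] + amino_acid_ids + [EOS_ID]
    return framed + [PAD_ID] * (MAX_LEN - len(framed))

ids = pad_sequence([3, 7, 11])  # three amino-acid token IDs
print(len(ids))   # 512
print(ids[:5])    # [20, 3, 7, 11, 21]
```
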
The purpose of this model is simply to build my own version of NVIDIA's ProtGPT.

Upon completion of training, the model will be properly evaluated, looking at perplexity, the energy of generated proteins, and AlphaFold 3 pLDDT/pTM scores.

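Of these metrics, perplexity follows directly from the training objective: it is the exponential of the mean cross-entropy loss, so the reported loss of ~2.7 already implies a perplexity of roughly e^2.7 ≈ 15. A minimal sketch (the per-batch losses below are made-up illustrative numbers, not measured values):

```python
import math

# Perplexity = exp(mean cross-entropy loss).
# These per-batch losses are illustrative, not measured values.
losses = [2.7, 2.6, 2.8]
ppl = math.exp(sum(losses) / len(losses))
print(round(ppl, 2))  # ~14.88, i.e. exp(2.7)
```
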
To try this model out for yourself, see the code below:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("JuIm/ProGemma")
tokenizer = AutoTokenizer.from_pretrained("JuIm/Amino-Acid-Sequence-Tokenizer")

progemma = pipeline("text-generation", model=model, tokenizer=tokenizer)

sequence = progemma(
    "<bos>",
    max_length=150,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=1,
    eos_token_id=21,
    bos_token_id=20,
    pad_token_id=22,
)
print(sequence)
```

Example output:

```
[{'generated_text': '<bos>MLSLFSWFENKLDKTLKKISRIELFRKKITEVICDEHIYVMKPPFSEKTTLTREGYECGSRTMPNLARPDTYLLSRFKENCYGLHYTILGCSKNLLAPFGATFTSMLSVMVIFIFLFTKVEDFIKRCEGAGWVITEFGSTSGVPAVGPG'}]
```
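To recover a plain amino-acid sequence from the generated text, the special tokens can be stripped out. A minimal sketch (the `clean_sequence` helper is illustrative, not part of this repo):

```python
import re

# Illustrative helper (not part of the repo): strip the tokenizer's
# special tokens (<bos>, <eos>, <pad>) to get a plain amino-acid sequence.
def clean_sequence(generated_text):
    return re.sub(r"<bos>|<eos>|<pad>", "", generated_text)

print(clean_sequence("<bos>MLSLFSWFENK<eos><pad><pad>"))  # MLSLFSWFENK
```
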
### Framework versions