A custom configuration of Google's Gemma 2 LLM, pre-trained on amino acid sequences of up to 512 residues in length. This configuration has 275M parameters.

This model is being trained on Google Colab (free tier), so regular updates are made as new checkpoints are reached.

The dataset contains ~500k sequences in total; as of 7/28/2024, the model has been trained on about 5% of it, and training loss at that point is ~2.7.

The tokenizer uses `bos`, `eos`, and `pad` special tokens, and each sequence is padded to length 512.

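The framing and padding can be sketched in plain Python, using the special-token IDs from the generation example below (`bos=20`, `eos=21`, `pad=22`):

```python
# Sketch of how each sequence is framed with special tokens and
# padded to length 512 (token IDs taken from the generation example:
# bos=20, eos=21, pad=22).
BOS_ID, EOS_ID, PAD_ID = 20, 21, 22
MAX_LEN = 512

def pad_sequence(amino_acid_ids):
    framed = [BOS_ID] + amino_acid_ids + [EOS_ID]
    return framed + [PAD_ID] * (MAX_LEN - len(framed))

ids = pad_sequence([3, 7, 11])  # three amino-acid token IDs
print(len(ids))   # 512
print(ids[:5])    # [20, 3, 7, 11, 21]
```
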
The purpose of this model is simply to build my own version of NVIDIA's ProtGPT.

Upon completion of training, the model will be properly evaluated, looking at perplexity, the energy of generated proteins, and AlphaFold 3 pLDDT/pTM scores.

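Of these metrics, perplexity follows directly from the training objective: it is the exponential of the mean cross-entropy loss, so the reported loss of ~2.7 already implies a perplexity of roughly e^2.7 ≈ 15. A minimal sketch (the per-batch losses below are made-up illustrative numbers, not measured values):

```python
import math

# Perplexity = exp(mean cross-entropy loss).
# These per-batch losses are illustrative, not measured values.
losses = [2.7, 2.6, 2.8]
ppl = math.exp(sum(losses) / len(losses))
print(round(ppl, 2))  # ~14.88, i.e. exp(2.7)
```
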
To try this model out for yourself, see the code below:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("JuIm/ProGemma")
tokenizer = AutoTokenizer.from_pretrained("JuIm/Amino-Acid-Sequence-Tokenizer")

progemma = pipeline("text-generation", model=model, tokenizer=tokenizer)

sequence = progemma(
    "<bos>",
    max_length=150,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=1,
    eos_token_id=21,
    bos_token_id=20,
    pad_token_id=22,
)
print(sequence)
```

Example output:

```
[{'generated_text': '<bos>MLSLFSWFENKLDKTLKKISRIELFRKKITEVICDEHIYVMKPPFSEKTTLTREGYECGSRTMPNLARPDTYLLSRFKENCYGLHYTILGCSKNLLAPFGATFTSMLSVMVIFIFLFTKVEDFIKRCEGAGWVITEFGSTSGVPAVGPG'}]
```
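To recover a plain amino-acid sequence from the generated text, the special tokens can be stripped out. A minimal sketch (the `clean_sequence` helper is illustrative, not part of this repo):

```python
import re

# Illustrative helper (not part of the repo): strip the tokenizer's
# special tokens (<bos>, <eos>, <pad>) to get a plain amino-acid sequence.
def clean_sequence(generated_text):
    return re.sub(r"<bos>|<eos>|<pad>", "", generated_text)

print(clean_sequence("<bos>MLSLFSWFENK<eos><pad><pad>"))  # MLSLFSWFENK
```
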
### Framework versions