JuIm
/

ProGemma

Text Generation

Generated from Trainer

text-generation-inference


JuIm committed on Jul 29, 2024

Commit 2d6c4e6 · verified · 1 Parent(s): 4f69217

Update README.md

Files changed (1)

README.md +22 -4

@@ -11,16 +11,34 @@ pipeline_tag: text-generation
 # ProGemma
-This is a custom configuration of Google's Gemma 2 model that was pre-trained on amino acid sequences of lengths 0 to 512.
-## Model description
-## Intended uses & limitations
-The purpose of this model was to
 ### Framework versions

 # ProGemma
+This is a custom configuration of Google's Gemma 2 model that is being pre-trained on amino acid sequences of lengths 0 to 512.
+I used the free version of Google Colab to train this model, so updates are made regularly as the model hits new checkpoints.
+As of 07.28.2024, the model has been trained on about 5% of the dataset.
+The model generates amino acids on a letter-by-letter basis.
+Current training loss is about 2.7. Preliminary evaluation of generated sequences on AlphaFold 3 shows pTM scores of ~0.4 and
+average pLDDT scores of ~60. After training is complete, a proper evaluation will be done to see whether the sequences fold into proteins with
+a low free energy. Perplexity scores will also be calculated.
+The purpose of this model was to see whether I could develop an alternative to ProtGPT2. ProGemma also serves as a stepping stone
+to a new model that will also utilize control tags to generate proteins based on function.
+To use this model yourself with the pipeline from the Transformers package, see the code below:
+from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
+model = AutoModelForCausalLM.from_pretrained("JuIm/ProGemma")
+tokenizer = AutoTokenizer.from_pretrained("JuIm/Amino-Acid-Sequence-Tokenizer")
+progemma = pipeline("text-generation", model=model, tokenizer=tokenizer)
+sequence = progemma("bosM", top_k=950, max_length=100, num_return_sequences=1, do_sample=True, repetition_penalty=1.2, eos_token_id=21, pad_token_id=222, bos_token_id=20)
+print(sequence)
 ### Framework versions
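
The README quotes a training loss of about 2.7 and says perplexity will be calculated later. As a rough sketch of how the two relate, the snippet below converts a loss to perplexity, assuming the reported loss is a mean per-token cross-entropy in nats (the usual Trainer convention, but not confirmed by the source). The helper name `perplexity_from_loss` and the uniform 20-letter baseline are illustrative assumptions; the model's actual tokenizer vocabulary may be larger.

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_nll)


# Training loss quoted in the README (assumed to be mean cross-entropy in nats).
reported_loss = 2.7
ppl = perplexity_from_loss(reported_loss)

# Hypothetical baseline: uniform guessing over the 20 standard amino acids.
uniform_ppl = perplexity_from_loss(math.log(20))

print(f"perplexity at loss {reported_loss}: {ppl:.2f}")
print(f"uniform 20-letter baseline: {uniform_ppl:.2f}")
```

Under these assumptions, a perplexity below the uniform baseline would suggest the model has picked up some residue statistics, though a held-out evaluation set is still needed for a meaningful number.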