Finetuned model generates sequences far different from sequences in the finetune training set

#11

by atqamar - opened Sep 5, 2024


We finetuned a ZymCTRL model using EC 4.2.1.1 (carbonic anhydrase) as the context label, on 131 carbonic anhydrase sequences that are all highly similar and roughly 190 residues long.

However, when we generate sequences with the finetuned model using EC 4.2.1.1 as the context label, the resulting sequences differ significantly from the sequences in the training set. The generated sequences exhibit an average Levenshtein distance of ~62 from the sequences in the training set.
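For readers who want to reproduce this kind of measurement, here is a minimal sketch of one way to compute it. It is not necessarily how the ~62 figure above was obtained: the post does not say whether the average is over all generated/training pairs or over each generated sequence's nearest training sequence, so the sketch reports both. The function names and the toy sequences are hypothetical.

```python
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def distance_summary(generated, training):
    """Summarise distances from generated sequences to the training set.

    Returns the mean over all generated/training pairs and the mean distance
    from each generated sequence to its nearest training sequence.
    """
    all_pairs, nearest = [], []
    for gen_seq in generated:
        distances = [levenshtein(gen_seq, train_seq) for train_seq in training]
        all_pairs.extend(distances)
        nearest.append(min(distances))
    return {"mean_all_pairs": mean(all_pairs), "mean_nearest": mean(nearest)}


if __name__ == "__main__":
    # Toy placeholder sequences, not real carbonic anhydrase data.
    training = ["MKTAYIAKQR", "MKTAYIAKQK"]
    generated = ["MKTAYLAKQR", "MSTAYIAKQK"]
    print(distance_summary(generated, training))
```

The two numbers can differ a lot: a low nearest-neighbour distance with a high all-pairs distance would mean the generated sequences sit close to a few training sequences, while a high nearest-neighbour distance means they are far from every training sequence.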

What adjustments can we make to obtain generated sequences more similar to those used in the fine-tuning step?

nferruz

AI for protein design org Sep 23, 2024

Hi atqamar,

Sorry for the late response. I am surprised this is the case. How long did you train for, and do the training curves look good?
