New line characters in generated sequences

#20

by emrecicekyurt - opened Apr 19, 2023

Hello,

Firstly, thank you @nferruz for providing a freely accessible and well-documented model.

I am wondering how the model inserts newline (\n) characters into the generated sequences. It seems that a newline character is inserted after every 60 characters to mimic the format of a typical text document. However, in some cases, the model inserts a newline before reaching 60 characters.

I couldn't find a proper answer to these questions:

Are there any criteria for inserting newline characters into sequences?
Is "\n" used only for formatting purposes, and can it be removed to obtain the actual amino acid sequence?

Thanks in advance,

Emre

nferruz

Owner Apr 19, 2023

Hello!

Thanks for your post. Yes, those newline tokens are an artifact of the way I trained the model. I didn't notice them at the time of training, but of course, since the training data followed the FASTA format, they were there after every 60 characters. We trained several models after ProtGPT2, and I made sure they didn't have newline characters, as they only make generation more complicated.
In any case, for this model, I'd discard all sequences where the model generates a newline character within the first 60 amino acids; those are bad sequences. For the rest of the sequences, you can remove the newline characters to get the final string, although I'd leave them in if you are computing perplexity values, since the model expects them every 60 characters. Also, it has never happened to me, but if a newline character appeared anywhere other than at a 60-amino-acid boundary, I would discard that sequence too.
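
For anyone who wants to apply this filter in code, here is a minimal sketch in Python. It is not part of ProtGPT2 or any library: the function name `clean_generated_sequence` and the constant `LINE_WIDTH` are made up for illustration, and it assumes special tokens such as `<|endoftext|>` have already been stripped from the generated text. It leaves the length of the final line unconstrained, since only misplaced newlines are discussed above.

```python
from typing import Optional

LINE_WIDTH = 60  # line width mentioned in this thread (hypothetical constant name)


def clean_generated_sequence(text: str, line_width: int = LINE_WIDTH) -> Optional[str]:
    """Return the sequence with newlines removed, or None if it should be discarded.

    Assumes `text` holds only residues and newline characters
    (special tokens already removed by the caller).
    """
    lines = text.strip().split("\n")

    # Every line except the last must be exactly `line_width` residues long.
    # A shorter or longer line means a newline appeared off the 60-residue
    # boundary (including within the first 60 residues), so we discard.
    for line in lines[:-1]:
        if len(line) != line_width:
            return None

    # Join the lines to obtain the plain amino acid sequence.
    return "".join(lines)


if __name__ == "__main__":
    good = "A" * 60 + "\n" + "C" * 25
    early = "A" * 30 + "\n" + "C" * 60
    print(clean_generated_sequence(good) is not None)
    print(clean_generated_sequence(early) is None)
```

If you are computing perplexity, score the original text with its newlines and use the cleaned string only as the final sequence.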

Let me know if questions remain.
Noelia
