ElnaggarLab/ankh3-large

Feature Extraction

text2text-generation

protein language model

hazemessam committed on May 13, 2025

Commit f77377b · verified · 1 Parent(s): 9f7c1ad

Update README.md

Files changed (1)

README.md +20 -16

@@ -8,32 +8,36 @@ Ankh3 is a protein language model that is jointly optimized on two objectives:
 * Protein sequence completion.
 1. Masked Language Modeling:
-  The idea of this task is to intentionally 'corrupt' an input protein sequence by
-  masking a certain percentage (X%) of its individual tokens (amino acids),
-  and then train the model to reconstruct the original sequence.
-  Example on a protein sequence before and after corruption:
-  Original protein sequence: MKAYVLINSRGP
-  This sequence will be masked/corrupted using sentinel tokens as shown below:
-  Sequence after corruption: M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>
-  The decoder learns to correspond each sentinel token to the actual amino acid that was masked.
-  In this example: <extra_id_0> K means that <extra_id_0> corresponds to the "K" amino acid and so on.
-  Decoder output: <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P
 2. Protein Sequence Completion:
-  The idea of this task is to cut the input sequence into
   two segments, where the first segment is fed to the encoder
   and the decoder is tasked to auto-regressively generate the
   second segment conditioned on the first segment representation
   outputted from the encoder.
-  Example on protein sequence completion:
   Original sequence: MKAYVLINSRGP
   We will pass "MKAYVL" of it to the encoder, and the decoder is trained
   that given the representation of the first part provided by the encoder,
   it should output the second part which is: "INSRGP"

 * Protein sequence completion.
 1. Masked Language Modeling:
+  - The idea of this task is to intentionally 'corrupt' an input protein sequence by
+    masking a certain percentage (X%) of its individual tokens (amino acids),
+    and then train the model to reconstruct the original sequence.
+  - Example on a protein sequence before and after corruption:
+    Original protein sequence: MKAYVLINSRGP
+    This sequence will be masked/corrupted using sentinel tokens as shown below:
+    Sequence after corruption: M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>
+    The decoder learns to correspond each sentinel token to the actual amino acid that was masked.
+    In this example: <extra_id_0> K means that <extra_id_0> corresponds to the "K" amino acid and so on.
+    Decoder output: <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P
 2. Protein Sequence Completion:
+- The idea of this task is to cut the input sequence into
   two segments, where the first segment is fed to the encoder
   and the decoder is tasked to auto-regressively generate the
   second segment conditioned on the first segment representation
   outputted from the encoder.
+- Example on protein sequence completion:
   Original sequence: MKAYVLINSRGP
   We will pass "MKAYVL" of it to the encoder, and the decoder is trained
   that given the representation of the first part provided by the encoder,
   it should output the second part which is: "INSRGP"
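
The two worked examples in the README can be reproduced with a few lines of Python. This is a minimal sketch only: the helper names `corrupt_with_sentinels` and `split_for_completion` are hypothetical, the masked positions and the split point are hard-coded to match the examples above rather than sampled, and none of it is taken from the Ankh3 training code.

```python
def corrupt_with_sentinels(sequence, masked_positions):
    """Replace each masked residue with its own sentinel token.

    Returns the corrupted encoder input and the decoder target,
    both as space-separated token strings.
    Assumption: every masked residue gets a separate sentinel,
    as in the README example.
    """
    masked = set(masked_positions)
    encoder_tokens, decoder_tokens = [], []
    sentinel_id = 0
    for index, residue in enumerate(sequence):
        if index in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            encoder_tokens.append(sentinel)
            # The target pairs each sentinel with the residue it hides.
            decoder_tokens.extend([sentinel, residue])
            sentinel_id += 1
        else:
            encoder_tokens.append(residue)
    return " ".join(encoder_tokens), " ".join(decoder_tokens)


def split_for_completion(sequence, split_index):
    """Split a sequence into an encoder prefix and a decoder suffix."""
    return sequence[:split_index], sequence[split_index:]


sequence = "MKAYVLINSRGP"

# Masked Language Modeling: hide K, V, N and P (0-based positions).
corrupted, target = corrupt_with_sentinels(sequence, [1, 4, 7, 11])
print(corrupted)  # M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>
print(target)     # <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P

# Protein Sequence Completion: first half to the encoder, rest to the decoder.
encoder_part, decoder_part = split_for_completion(sequence, 6)
print(encoder_part, decoder_part)  # MKAYVL INSRGP
```

In actual training the masked positions would be drawn at random according to the masking percentage (X%), and the split point would presumably vary per sequence; both are fixed here only so the output matches the README text.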