Improve model card with pipeline tag and code link
This PR improves the model card by:
- Adding the `pipeline_tag: feature-extraction` to better categorize the model.
- Adding the `library_name: transformers` to specify the compatible library.
- Improving the structure and readability of the model card.
This enhances discoverability and usability of the Ankh3 model.
README.md
CHANGED
**Before:**

---
license: cc-by-nc-sa-4.0
---

# Ankh3

Ankh3 is a protein language model that is jointly optimized on two objectives:

* Masked language modeling with multiple masking probabilities
* Protein sequence completion

1. Masked Language Modeling:

- The idea of this task is to intentionally "corrupt" an input protein sequence by masking a certain percentage (X%) of its individual tokens (amino acids), and then train the model to reconstruct the original sequence.

- Example of a protein sequence before and after corruption:

Sequence before corruption: M K A Y V L I N S R G P

Sequence after corruption: M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>

In this example, "<extra_id_0> K" means that <extra_id_0> corresponds to the "K" amino acid, and so on.

2. Protein Sequence Completion:

- The idea of this task is to cut the input sequence into two segments: the first segment is fed to the encoder, and the decoder is tasked to auto-regressively generate the second segment, conditioned on the first segment's representation output by the encoder.

- For example, given the sequence "MKAYVLINSRGP", we pass "MKAYVL" to the encoder, and the decoder is trained so that, given the representation of the first part provided by the encoder, it outputs the second part: "INSRGP".

# How to use:

## For Embedding Extraction:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5EncoderModel
import torch

# ... (unchanged lines collapsed in the diff) ...

with torch.no_grad():
    embedding = encoder_model(**encoded_nlu_sequence)
```

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.generation import GenerationConfig

# ... (rest of the example not shown in the diff) ...
```
**After:**

---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
library_name: transformers
---

# Ankh3: Multi-Task Pretraining for Enhanced Protein Representations

The model was presented in the paper [Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations](https://huggingface.co/papers/2505.20052).

The abstract of the paper is the following:

Protein language models (PLMs) have emerged as powerful tools to detect complex patterns of protein sequences. However, the capability of PLMs to fully capture information on protein sequences might be limited by focusing on single pre-training tasks. Although adding data modalities or supervised objectives can improve the performance of PLMs, pre-training often remains focused on denoising corrupted sequences. To push the boundaries of PLMs, our research investigated a multi-task pre-training strategy. We developed Ankh3, a model jointly optimized on two objectives: masked language modeling with multiple masking probabilities and protein sequence completion relying only on protein sequences as input. This multi-task pre-training demonstrated that PLMs can learn richer and more generalizable representations solely from protein sequences. The results demonstrated improved performance in downstream tasks, such as secondary structure prediction, fluorescence, GB1 fitness, and contact prediction. The integration of multiple tasks gave the model a more comprehensive understanding of protein properties, leading to more robust and accurate predictions.

**Model Details:**

Ankh3 is a protein language model jointly optimized on two objectives:

1. **Masked Language Modeling:** Predicts masked amino acids in a protein sequence, trained with varying masking probabilities. The model masks a percentage (X%) of the amino acids and learns to reconstruct the original sequence from the masked input (the sketch after this list illustrates the input/target format).
2. **Protein Sequence Completion:** Predicts the second half of a protein sequence given the first half. The encoder processes the first segment, and the decoder autoregressively generates the second segment conditioned on the encoder's representation.
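As an illustration of the denoising format described in the list above, here is a minimal sketch in plain Python. The sequence and mask positions are taken from the worked example in the previous version of the card; the sentinel-token convention (`<extra_id_0>`, `<extra_id_1>`, ...) is the T5-style scheme that example uses.

```python
# Minimal sketch of T5-style span corruption on a protein sequence.
sequence = "MKAYVLINSRGP"
masked_positions = {1, 4, 7, 11}  # K, V, N, P (0-indexed)

input_tokens, target_tokens = [], []
sentinel = 0
for i, amino_acid in enumerate(sequence):
    if i in masked_positions:
        # Each masked residue is replaced by the next sentinel token...
        input_tokens.append(f"<extra_id_{sentinel}>")
        # ...and the target pairs that sentinel with the hidden residue.
        target_tokens += [f"<extra_id_{sentinel}>", amino_acid]
        sentinel += 1
    else:
        input_tokens.append(amino_acid)

print("model input: ", " ".join(input_tokens))
# model input:  M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>
print("model target:", " ".join(target_tokens))
# model target: <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P
```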
**How to use:**

**For Embedding Extraction:**
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5EncoderModel
import torch

# ... (unchanged lines collapsed in the diff) ...

with torch.no_grad():
    embedding = encoder_model(**encoded_nlu_sequence)
```
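Since the diff collapses the body of this example, here is a self-contained sketch of what the embedding-extraction flow typically looks like with a T5-style encoder. The checkpoint name `ElnaggarLab/ankh3-large` and the `[NLU]` task prefix are assumptions (the prefix is inferred from the variable name `encoded_nlu_sequence`), not confirmed by the diff.

```python
from transformers import T5Tokenizer, T5EncoderModel
import torch

ckpt = "ElnaggarLab/ankh3-large"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(ckpt)
encoder_model = T5EncoderModel.from_pretrained(ckpt).eval()

# The "[NLU]" prefix marking the encoder task is an assumption.
sequence = "[NLU]" + "MKAYVLINSRGP"
encoded_nlu_sequence = tokenizer(sequence, add_special_tokens=True, return_tensors="pt")

with torch.no_grad():
    embedding = encoder_model(**encoded_nlu_sequence)

# Per-residue embeddings of shape (batch, sequence_length, hidden_size).
residue_embeddings = embedding.last_hidden_state
```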
**For Sequence Completion:**

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.generation import GenerationConfig

# ... (rest of the example not shown in the diff) ...
```
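Only the imports of this example are visible in the diff. The following is a minimal sketch of how sequence completion could be run with these classes, using the "MKAYVL" → "INSRGP" example from the previous version of the card; the checkpoint name, the `[S2S]` task prefix, and the generation settings are assumptions rather than part of the original card.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.generation import GenerationConfig

ckpt = "ElnaggarLab/ankh3-large"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt).eval()

# First half of the example sequence; the "[S2S]" prefix is an assumption.
first_half = "[S2S]" + "MKAYVL"
inputs = tokenizer(first_half, add_special_tokens=True, return_tensors="pt")

generation_config = GenerationConfig(
    max_new_tokens=10,  # upper bound on the completion length
    num_beams=4,
    do_sample=False,
)

output_ids = model.generate(**inputs, generation_config=generation_config)
completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(completion)  # ideally close to "INSRGP" for this toy example
```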