---
license: cc-by-nc-sa-4.0
tags:
- biology
- protein
- protein language model
- protein embedding
datasets:
- agemagician/uniref50
---

# ANKH2-extended1 model

Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/2301.06568) and first released in
[this repository](https://github.com/agemagician/Ankh). The model was trained on uppercase amino acids and only works with capital-letter amino acid sequences.

## Model description

Ankh2-ext1 is based on the `ANKH-Large` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on raw protein sequences only, with no human labelling of any kind (which is why it can use lots of
publicly available data), using an automatic process to generate inputs and labels from those protein sequences.

Two important differences between this ANKH2-Large model and the original ANKH-Large version are:
1. The model was trained for more epochs.
2. The activation function was changed to SiLU.

It has been shown that the features extracted from this self-supervised model (LM embeddings) capture important biophysical properties governing protein shape.
This implies that the model has learned some of the grammar of the language of life as realized in protein sequences.

## Intended uses & limitations

The model can be used for protein feature extraction or be fine-tuned on downstream tasks.
We have noticed that on some tasks you can gain more accuracy by fine-tuning the model with the LoRA method (see the sketch below) rather than using it as a feature extractor.
We have also noticed that for feature extraction, it is better to use the features extracted from the encoder rather than from the decoder.

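As a rough illustration, a parameter-efficient fine-tuning setup with the `peft` library might look like the sketch below. The repository id, target modules, and LoRA hyperparameters here are assumptions for illustration, not values from the paper:

```python
from transformers import T5EncoderModel
from peft import LoraConfig, TaskType, get_peft_model

# assumed repository id; use this model card's Hugging Face repo id
base_model = T5EncoderModel.from_pretrained("ElnaggarLab/ankh2-ext1")

# attach LoRA adapters to the attention query/value projections of the T5 encoder
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16,                       # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5 attention projection module names
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are updated during fine-tuning
```
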
### How to use

Here is how to use this model to extract the features of a given protein sequence in PyTorch:

```python
from transformers import T5EncoderModel, AutoTokenizer
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load tokenizer and encoder (the repository id below is assumed; use this model card's repo id)
tokenizer = AutoTokenizer.from_pretrained("ElnaggarLab/ankh2-ext1")
model = T5EncoderModel.from_pretrained("ElnaggarLab/ankh2-ext1").to(device)
model.eval()

sequence_examples = ["PRTEINO", "SEQWENCE"]
# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0, :7]  # shape (7 x 1536)
print(f"Shape of per-residue embedding of first sequence: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1, :8]  # shape (8 x 1536)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1536)
print(f"Shape of per-protein embedding of first sequence: {emb_0_per_protein.shape}")
```

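If you need per-protein embeddings for a whole batch without hard-coding each sequence length, a small helper along these lines (illustrative, not part of the original example) can mean-pool over the valid residues using the attention mask:

```python
import torch

def per_protein_embeddings(last_hidden_state, attention_mask):
    """Mean-pool per-residue embeddings, ignoring padding and the trailing </s> special token."""
    pooled = []
    for states, mask in zip(last_hidden_state, attention_mask):
        length = int(mask.sum().item()) - 1  # number of residues (drop the </s> token)
        pooled.append(states[:length].mean(dim=0))
    return torch.stack(pooled)

per_protein = per_protein_embeddings(embedding_repr.last_hidden_state, attention_mask)
print(per_protein.shape)  # torch.Size([2, 1536])
```
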
## Training data

The ANKH2-Large model was pretrained on [UniRef50](https://www.uniprot.org/help/uniref), a dataset consisting of 60 million protein sequences.

## Training procedure

### Preprocessing

The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 25.
The inputs of the model are then of the form:

```
Protein Sequence </s>
```

The preprocessing step was performed on the fly, by cutting and padding the protein sequences up to 512 tokens.

The details of the masking procedure for each sequence are as follows:
- 20% of the amino acids are masked.
- In 100% of the cases, the masked amino acids are replaced by an `<extra_id_num>` token, where "num" is a number between 0 and 115.

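As a purely illustrative sketch of the sentinel mechanism (this is not the actual preprocessing code, and the corruption details are simplified), each masked amino acid is replaced in the input by a sentinel token that the decoder then has to recover:

```python
import random

def mask_sequence(sequence, mask_prob=0.20, seed=0):
    """Rough illustration: replace ~20% of amino acids with T5-style sentinel tokens."""
    rng = random.Random(seed)
    input_tokens, target_tokens, sentinel = [], [], 0
    for aa in sequence:
        if rng.random() < mask_prob and sentinel <= 115:
            input_tokens.append(f"<extra_id_{sentinel}>")          # masked position in the input
            target_tokens.extend([f"<extra_id_{sentinel}>", aa])   # decoder target recovers the amino acid
            sentinel += 1
        else:
            input_tokens.append(aa)
    return "".join(input_tokens) + "</s>", "".join(target_tokens) + "</s>"

masked_input, target = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(masked_input)  # input sequence with sentinel tokens at the masked positions
print(target)        # decoder target: sentinel tokens followed by the original amino acids
```
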
### Pretraining

The model was trained on a single TPU Pod V5-lite for 45 epochs in total, using a sequence length of 512 (batch size 1k).
It was trained using the ANKH-Large model as an initial checkpoint, rather than being trained from scratch.
It has a total of approximately 2B parameters and uses an encoder-decoder architecture.
The optimizer used for pre-training is Adafactor with a linear warmup followed by a linear decay learning-rate schedule.

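For readers who want to reproduce a similar optimizer setup in PyTorch, a hedged sketch is shown below; `model` refers to a loaded checkpoint, and the learning rate and step counts are placeholders, not the values used for pre-training:

```python
from transformers.optimization import Adafactor, get_linear_schedule_with_warmup

# illustrative hyperparameters only; the actual pre-training values are not listed on this card
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,   # required when supplying an explicit learning rate and external schedule
    relative_step=False,
    warmup_init=False,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,
    num_training_steps=100_000,
)
```
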
## Evaluation results

When the model is used for feature extraction ("FE") and for parameter-efficient fine-tuning ("LoRA"), it achieves the following results:

Test results:

| Task/Dataset | Method | Secondary structure (3-states) | Secondary structure (8-states) | Localization | Membrane | Solubility | Fluorescence |
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | FE | coming soon | coming soon | | | | |
| CASP12 | LoRA | coming soon | coming soon | | | | |
| TS115 | FE | coming soon | coming soon | | | | |
| TS115 | LoRA | coming soon | coming soon | | | | |
| CB513 | FE | coming soon | coming soon | | | | |
| CB513 | LoRA | coming soon | coming soon | | | | |
| DeepLoc | FE | | | coming soon | coming soon | | |
| DeepLoc | LoRA | | | coming soon | coming soon | | |
| Solubility | FE | | | | | coming soon | |
| Solubility | LoRA | | | | | 74% | |
| Fluorescence | FE | | | | | | coming soon |
| Fluorescence | LoRA | | | | | | 68% |

### BibTeX entry and citation info

```bibtex
@article{elnaggar2023ankh,
  title={Ankh☥: Optimized protein language model unlocks general-purpose modelling},
  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
  journal={bioRxiv},
  pages={2023--01},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```

> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)