---
license: mit
datasets:
- chandar-lab/UR100P
language:
- en
tags:
- biology
---

## AMPLIFY

AMPLIFY is an efficient, state-of-the-art protein language model pre-trained using masked language modeling on UniRef100, OAS, and SCOP ([UR100P](https://huggingface.co/datasets/chandar-lab/UR100P)). AMPLIFY can generate residue and protein embeddings, suggest mutations, differentiate disordered proteins from non-protein sequences, and much more. AMPLIFY is available in two sizes, 120M and 350M parameters; the `_base` models correspond to Stage 1 of pre-training and are not extended beyond 512 residues. The model architecture and pre-training procedure are detailed below; for more details, please refer to the [accompanying paper](https://www.biorxiv.org/content/10.1101/2024.09.23.614603v1).

- [`AMPLIFY_350M`](https://huggingface.co/chandar-lab/AMPLIFY_350M)
- [`AMPLIFY_350M_base`](https://huggingface.co/chandar-lab/AMPLIFY_350M_base)
- [`AMPLIFY_120M`](https://huggingface.co/chandar-lab/AMPLIFY_120M)
- [`AMPLIFY_120M_base`](https://huggingface.co/chandar-lab/AMPLIFY_120M_base)

### Model Description

|                                | AMPLIFY 120M | AMPLIFY 350M |
| :----------------------------- | -----------: | -----------: |
| `hidden-size`                  | 640          | 960          |
| `num-hidden-layers`            | 24           | 32           |
| `num-attention-heads`          | 10           | 15           |
| `intermediate-size`            | 2560         | 3840         |
| `max-position-embeddings`      | 2048         | 2048         |
| `vocab-size`                   | 27           | 27           |
| `rope-theta`                   | 10000        | 10000        |
| `dropout-prob`                 | 0            | 0            |
| `embedding-init-range`         | 0.02         | 0.02         |
| `norm-eps`                     | 1.0e-05      | 1.0e-05      |
| `hidden-act`                   | swiglu       | swiglu       |
| `pre-activation-layer-norm`    | true         | true         |
| `layer-norm-after-embedding`   | false        | false        |
| `layer-norm-before-last-layer` | true         | true         |
| `rms-norm`                     | true         | true         |
| `ffn-bias`                     | false        | false        |
| `attn-bias`                    | false        | false        |

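To double-check these hyperparameters against a downloaded checkpoint, you can print the model configuration. This is a minimal sketch that assumes only the standard `AutoConfig` API with `trust_remote_code`; the attribute names in the remote-code config may not match the hyphenated names in the table above exactly.

```python
from transformers import AutoConfig

# Minimal sketch: print the checkpoint's configuration to compare it with
# the table above. Attribute names in the remote-code config may differ
# from the hyphenated names used in this card.
config = AutoConfig.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
print(config)
```
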
### Training Description

|                     | Stage 1     | Stage 2                        |
| :------------------ | ----------: | -----------------------------: |
| `dataset`           | UR100P      | UR100P                         |
| `max-steps`         | 1,000,000   | 25,000 (120M) or 50,000 (350M) |
| `max-length`        | 512         | 2048                           |
| `optimizer`         | adamw       | adamw                          |
| `lr`                | 0.001       | 0.0001                         |
| `betas`             | (0.9, 0.95) | (0.9, 0.95)                    |
| `eps`               | 1.0e-08     | 1.0e-08                        |
| `weight-decay`      | 0.01        | 0.01                           |
| `scheduler`         | cosinedecay | none                           |
| `warmup-steps`      | 1,000       | none                           |
| `final-step`        | 900,000     | none                           |
| `gradient-clipping` | 1.0         | 1.0                            |
| `tf32`              | true        | true                           |
| `mixed-precision`   | bf16        | bf16                           |
| `padding`           | max-length  | max-length                     |
| `random-truncate`   | true        | true                           |
| `mask-probability`  | 0.15        | 0.15                           |
| `total-batch-size`  | 4096        | 4096                           |
| `deepspeed`         | true        | true                           |
| `zero-stage`        | 3           | 3                              |

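Both stages use the standard masked language modeling objective with a 15% mask probability. As an illustration only (not the exact AMPLIFY training pipeline), the generic `DataCollatorForLanguageModeling` from `transformers` reproduces this kind of masking, assuming the AMPLIFY tokenizer follows the standard callable interface and exposes a mask token:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustration of the 15% masking objective; NOT the exact AMPLIFY training
# pipeline. Assumes the tokenizer exposes a mask token and follows the
# standard transformers callable interface.
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_120M", trust_remote_code=True)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("MSVVGIDLGTTNSCVAVMEGKQ")])  # arbitrary example sequence
print(batch["input_ids"])  # ~15% of positions selected (most replaced by the mask token)
print(batch["labels"])     # original ids at selected positions, -100 elsewhere
```
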
## Get Started

```python
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load AMPLIFY and its tokenizer
model = AutoModel.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)

# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt test split of UR100P
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein name and sequence
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    inputs = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", inputs)

    # Move the tokens to the GPU and make a prediction
    inputs = inputs.to("cuda")
    output = model(inputs)
    print("Output: ", output)

    break
```

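The example above prints raw model outputs. To obtain the residue and protein embeddings mentioned in the introduction, one common recipe is to take the last hidden states and mean-pool them over the sequence. The sketch below reuses the `model` and `tokenizer` from the snippet above and assumes the remote-code model follows the usual `transformers` convention of returning `hidden_states` when called with `output_hidden_states=True`; check the model's `forward` signature if it differs.

```python
import torch

# Sketch: residue and protein embeddings, reusing model/tokenizer from above.
# Assumes the model returns hidden_states when output_hidden_states=True,
# per the usual transformers convention; verify against the remote code.
sequence = "MSVVGIDLGTTNSCVAVMEGKQ"  # arbitrary example sequence
inputs = tokenizer.encode(sequence, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model(inputs, output_hidden_states=True)

residue_embeddings = output.hidden_states[-1]       # shape: (1, length, hidden_size)
protein_embedding = residue_embeddings.mean(dim=1)  # mean-pool over residues
print(protein_embedding.shape)
```
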
## Citations

If you find the models useful in your research, we ask that you cite the paper:

```bibtex
@article{Fournier2024.09.23.614603,
    title = {Protein Language Models: Is Scaling Necessary?},
    author = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
    year = {2024},
    journal = {bioRxiv},
    publisher = {Cold Spring Harbor Laboratory},
    doi = {10.1101/2024.09.23.614603},
    url = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
    elocation-id = {2024.09.23.614603},
    eprint = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}
```