AMPLIFY 120M

FLAIR Lab · Website · GitHub · Paper

A 120M-parameter protein language model pre-trained on UR100P using masked language modeling. Trained for 1M steps (~2T tokens) with context length 2,048.

This model was trained using the AMPLIFY training codebase. The original models and code were released under chandar-lab/AMPLIFY. See also flair-bio/AMPLIFY_350M.

Property	Value
Architecture	BERT-style encoder (RoPE, SwiGLU, RMSNorm)
Parameters	120M
Training tokens	~2T
Vocabulary size	32 (amino acid alphabet + special tokens)
Context length	2,048
Training steps	1,000,000
License	Apache 2.0

Quick Start

from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("flair-bio/amplify-120m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("flair-bio/amplify-120m", trust_remote_code=True)
model.eval()

How to Use

Extract Embeddings

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flair-bio/amplify-120m", trust_remote_code=True)
model = AutoModel.from_pretrained("flair-bio/amplify-120m", trust_remote_code=True)

sequences = ["MKTAYIAK", "MVLSPADKTNVK"]
inputs = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=2048)

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state  # [batch, seq_len, 640]
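
Because the batch above is padded, a per-sequence embedding should usually ignore padding positions. The following is a minimal sketch of attention-mask-aware mean pooling. The helper name `masked_mean_pool` is hypothetical and is not part of the AMPLIFY codebase, and the toy tensors stand in for `outputs.last_hidden_state` and `inputs["attention_mask"]`.

```python
import torch

def masked_mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions.

    hidden:         [batch, seq_len, dim] float tensor
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # [batch, seq_len, 1]
    summed = (hidden * mask).sum(dim=1)                   # [batch, dim]
    counts = mask.sum(dim=1).clamp(min=1.0)               # [batch, 1], avoids division by zero
    return summed / counts

# Toy tensors (assumption: same layout as the model outputs above).
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)  # one vector per sequence
```

With the real model, the call would be `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. This average still includes any special tokens the tokenizer adds; excluding them is a design choice left to the downstream task.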

Masked Language Modeling

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("flair-bio/amplify-120m", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("flair-bio/amplify-120m", trust_remote_code=True)

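# "<mask>" marks the residue position the model should predict.
sequence = "MKTAY<mask>AKQRQISFVK"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # [batch, seq_len, vocab_size]

# Column indices of every masked position in the single input sequence.
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
# Most likely token at each masked position, decoded back to text.
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)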

Model Description

Architecture

AMPLIFY 120M is a BERT-style transformer encoder with 24 layers, 640-dimensional hidden states, and 10 attention heads. It uses rotary positional embeddings (RoPE), SwiGLU feed-forward blocks, and RMSNorm. Tokenization is at the amino acid level with a vocabulary of 32 tokens.

Config	Layers	Hidden dim	Heads	FFN dim	Context
AMPLIFY 120M	24	640	10	1,712	512 → 2,048
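
As a rough sanity check, the configuration above can be turned into an approximate parameter count. This sketch assumes bias-free projections, three weight matrices per SwiGLU block, two RMSNorm weight vectors per layer, and an untied output projection. None of these details are confirmed by this card, so the result is an estimate rather than the exact checkpoint size.

```python
def approx_param_count(layers=24, hidden=640, ffn=1712, vocab=32):
    """Estimate encoder parameters from the table above (assumptions in comments)."""
    attention = 4 * hidden * hidden   # Q, K, V, and output projections, no biases assumed
    swiglu = 3 * hidden * ffn         # gate, up, and down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer (assumed)
    per_layer = attention + swiglu + norms
    embeddings = vocab * hidden       # input token embedding
    lm_head = vocab * hidden          # output projection, assumed untied
    final_norm = hidden               # one final RMSNorm (assumed)
    return layers * per_layer + embeddings + lm_head + final_norm

total = approx_param_count()
head_dim = 640 // 10                  # hidden size divided by number of heads
print(f"{total / 1e6:.1f}M parameters, head dim = {head_dim}")
```

Under these assumptions the estimate lands just under 120M, which is consistent with the model name, and each attention head has dimension 64.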

Intended Use

This model is intended for extracting per-residue or per-sequence representations for downstream tasks, zero-shot variant effect prediction via pseudo-log-likelihood scoring, and fine-tuning on protein fitness, stability, binding, or functional annotation tasks.
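
Pseudo-log-likelihood scoring masks one position at a time and sums the log-probability the model assigns to the true token at each masked position. The sketch below is a generic illustration of that idea, not code from the AMPLIFY repository. The function name, the `positions` argument, and the `UniformToyModel` stand-in are hypothetical, and it assumes the model accepts `input_ids` and returns an object with a `.logits` attribute, as in the masked language modeling example.

```python
from types import SimpleNamespace

import torch

@torch.no_grad()
def pseudo_log_likelihood(model, input_ids, mask_token_id, positions):
    """Sum of log p(x_i | sequence with x_i masked) over the given positions.

    input_ids: 1-D LongTensor holding one tokenized sequence
    positions: indices to score (typically residues, excluding special tokens)
    """
    # Build one row per scored position, each with exactly one token masked.
    batch = input_ids.repeat(len(positions), 1)
    rows = torch.arange(len(positions))
    cols = torch.as_tensor(positions)
    batch[rows, cols] = mask_token_id

    logits = model(input_ids=batch).logits                     # [n_pos, seq_len, vocab]
    log_probs = torch.log_softmax(logits[rows, cols], dim=-1)  # [n_pos, vocab]
    true_ids = input_ids[cols]                                 # original tokens at scored positions
    return log_probs[rows, true_ids].sum().item()

class UniformToyModel(torch.nn.Module):
    """Hypothetical stand-in that predicts a uniform distribution over 32 tokens."""

    def forward(self, input_ids):
        return SimpleNamespace(logits=torch.zeros(*input_ids.shape, 32))

# Made-up token ids: a special token, three residues, a special token.
toy_ids = torch.tensor([1, 5, 6, 7, 2])
score = pseudo_log_likelihood(UniformToyModel(), toy_ids, mask_token_id=31, positions=[1, 2, 3])
print(score)
```

For zero-shot variant effect prediction, one common approach is to compare the score of a mutant sequence against the score of the wild type; a higher score for the mutant suggests the model finds it more plausible. Long sequences may need the masked rows split into smaller batches to fit in memory.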

Training

Data

Pre-trained on UR100P (chandar-lab/UR100P), a deduplicated union of UniRef100, OAS, and SCOPe.

Training Procedure

Hyperparameter	Value
Hardware	8× H100 80GB
Optimizer	AdamW
Learning rate	1e-3 (peak)
LR schedule	Linear warmup + cosine decay
Batch size (tokens)	~2M per step
Masking rate	15%
Training objective	Masked language modeling
Precision	BF16
Framework	PyTorch + HuggingFace Transformers

Training logs are available on Weights & Biases.
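
To make the schedule and token budget in the table concrete, here is a small sketch of linear warmup followed by cosine decay, plus the arithmetic behind the ~2T token figure. The peak learning rate, step count, batch size, and masking rate come from the table; `warmup_steps` and `final_lr` are placeholder assumptions, since this card does not state them.

```python
import math

def lr_at(step, peak_lr=1e-3, warmup_steps=1_000, total_steps=1_000_000, final_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    warmup_steps and final_lr are illustrative assumptions, not documented values.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine

# Token budget implied by the table: steps x tokens per step.
steps = 1_000_000
tokens_per_step = 2_000_000
total_tokens = steps * tokens_per_step
masked_per_step = round(0.15 * tokens_per_step)  # tokens selected for masking each step

for s in (0, 500, 1_000, 500_500, 1_000_000):
    print(s, lr_at(s))
print(total_tokens, masked_per_step)
```

The product of 1,000,000 steps and roughly 2M tokens per step gives the ~2T training tokens reported above, with about 300k tokens selected for masking in each step.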

Citation

If you use this model in your work, please cite:

@article{Fournier2024.09.23.614603,
  title        = {Protein Language Models: Is Scaling Necessary?},
  author       = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
  year         = {2024},
  journal      = {bioRxiv},
  publisher    = {Cold Spring Harbor Laboratory},
  doi          = {10.1101/2024.09.23.614603},
  url          = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603}
}
