---
language:
- en
license: apache-2.0
tags:
- biology
- protein
- longevity
- aging
- ESM-2
- LoRA
- sequence-classification
datasets:
- GenAge
- SwissProt
metrics:
- auprc
- roc_auc
base_model: facebook/esm2_t30_150M_UR50D
---
# Longevity Protein Classifier v6
Fine-tuned ESM-2 150M for binary classification of protein sequences
as longevity-associated or not, trained on multi-species GenAge data
with LoRA adapters.
Built as part of a personal ML learning arc — Week 3 of 8 —
connecting protein language models to longevity biology.
---
## Model Description
- **Model type:** ESM-2 150M + LoRA (r=16) sequence classifier
- **Base model:** facebook/esm2_t30_150M_UR50D
- **Task:** Binary classification — longevity-associated vs non-longevity
- **Developed by:** Mo Elzek
- **License:** Apache 2.0
---
## Performance
| Metric | Value |
|--------|-------|
| Test AUPRC | 0.335 |
| Test AUC-ROC | 0.696 |
| Random AUPRC baseline | 0.061 |
| Improvement over random | 5.5x |
| Training epochs | 10 (early stopping) |
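
The last two rows of the table are linked by simple arithmetic. A classifier that ranks proteins at random has an expected AUPRC close to the fraction of positives in the test set, so the improvement factor is the ratio of the two AUPRC values. A minimal sketch, assuming the 0.061 baseline is that positive fraction (the variable names are illustrative):

```python
# Values copied from the Performance table.
test_auprc = 0.335
random_auprc = 0.061  # assumed to equal the positive fraction of the test set

# Improvement over random is the ratio of the two AUPRC values.
lift = test_auprc / random_auprc
print(f"Improvement over random: {lift:.1f}x")
```
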
---
## Benchmark Results
| Protein | Score | Expected | Notes |
|---------|-------|----------|-------|
| SIRT1 | 0.996 | HIGH | NAD+ deacetylase, caloric restriction mediator |
| SIRT3 | 0.998 | HIGH | Mitochondrial sirtuin |
| TP53 | 0.974 | HIGH | Tumour suppressor, aging roles |
| MYH9 | 0.000 | LOW | Structural myosin — negative control |
| ACTB | 0.000 | LOW | Beta actin — negative control |
| ALB | 0.000 | LOW | Serum albumin — negative control |
| FOXO3 | 0.000 | HIGH | **Fails** — see limitations |
| MTOR | 0.000 | HIGH | **Fails** — see limitations |
| TERT | 0.000 | HIGH | **Fails** — see limitations |
---
## Novel Predictions Not in GenAge
Proteins scoring above 0.50 that are not present in the GenAge human
database. These are the model's predictions of longevity-relevant
proteins not yet catalogued; they are not validated findings.
| Protein | Score | Biological relevance |
|---------|-------|----------------------|
| TFEB | 0.502 | Master regulator of autophagy and lysosomal biogenesis. Overexpression extends lifespan in C. elegans. Regulated by mTOR. Best-supported novel prediction in prior literature, despite a borderline score. |
| NEIL1 | 0.951 | DNA glycosylase, base excision repair of oxidative damage. DNA repair capacity correlates with species lifespan. |
| GSTA1 | 0.871 | Glutathione S-transferase. Antioxidant defence. GST family implicated in longevity across multiple species. |
| GRHL1 | 0.880 | Grainyhead-like transcription factor. Epithelial barrier maintenance — tissue integrity declines with age. |
| EXO1 | 0.550 | Exonuclease involved in DNA mismatch repair and double-strand break repair. |
| MSH4 | 0.546 | DNA mismatch repair. Related family members (MSH2, MSH6) are established longevity-associated genes. |
---
## Recommended Thresholds
| Use case | Threshold | Precision | Recall |
|----------|-----------|-----------|--------|
| Screening (cast a wide net) | 0.05 | ~0.20 | ~0.29 |
| Balanced | 0.06 | ~0.41 | ~0.29 |
| High confidence hits only | 0.50 | ~0.61 | ~0.24 |
Optimised threshold from val set: **0.06** (F1: 0.358)
The model produces a bimodal score distribution: proteins it recognises
score very high (above 0.50), and proteins it does not recognise score
near zero. The flat recall curve from 0.05 to 0.70 reflects this: most
longevity proteins are either clearly found or clearly missed.
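
One way a validation-set threshold such as 0.06 could be selected is by scanning candidate thresholds and keeping the one with the highest F1. The sketch below is an assumption about the procedure, not the author's script; `best_f1_threshold` is a hypothetical helper and the toy labels and scores are made up for illustration.

```python
def best_f1_threshold(labels, scores, candidates):
    """Return (threshold, f1) maximising F1 over the candidate thresholds."""
    best = (None, -1.0)
    for t in candidates:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[1]:  # ties keep the earlier (lower) threshold
            best = (t, f1)
    return best


# Toy validation data (invented, not taken from the real validation set).
val_labels = [1, 1, 1, 0, 0, 0, 0, 0]
val_scores = [0.97, 0.91, 0.01, 0.04, 0.055, 0.02, 0.00, 0.60]
threshold, f1 = best_f1_threshold(val_labels, val_scores, [0.05, 0.06, 0.50])
print(threshold, round(f1, 3))
```
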
---
## Known Limitations — Read Before Use
### 1. Protein length truncation
Sequences longer than 512 amino acids are truncated from the
C-terminus. This causes systematic failures on long proteins where
the functional domain sits in the C-terminal half:
- **MTOR** (2,549 aa): kinase domain at residues 2181-2431 — truncated away
- **TERT** (1,132 aa): reverse transcriptase domain at 600-900 — truncated away
Do not use this model to score proteins above 800 amino acids
without validating on known examples from that protein family first.
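
Before scoring a long protein, it can help to check how much of the domain of interest survives the 512-position window. The sketch below uses the two domain ranges quoted above; `domain_coverage` is a hypothetical helper, and it treats the window as 512 residues for simplicity (any special tokens added by the tokenizer would make the real window slightly shorter).

```python
MAX_LENGTH = 512  # truncation window stated in this section


def domain_coverage(domain_start, domain_end, window=MAX_LENGTH):
    """Fraction of a 1-based, inclusive residue range that survives truncation."""
    kept = max(0, min(domain_end, window) - domain_start + 1)
    return kept / (domain_end - domain_start + 1)


# Domain coordinates taken from the MTOR and TERT examples in this section.
for name, start, end in [("MTOR kinase", 2181, 2431), ("TERT RT", 600, 900)]:
    print(name, domain_coverage(start, end))
```
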
### 2. Family-specific blind spots
The model learned sirtuin and tumour suppressor sequence features
well but has insufficient training examples to generalise to:
- **Forkhead transcription factors** (FOXO3 scores 0.000 despite
being a canonical longevity gene; at 673 aa it is truncated, but its forkhead domain lies within the 512 aa window)
- **Large kinases** (truncation compounds this)
- **Telomerase complex** proteins
### 3. Direction of effect not captured
The model cannot distinguish between:
- Pro-longevity proteins (overexpression extends lifespan)
- Anti-aging-disease proteins (loss of function accelerates aging)
Both may score high. A high score means "associated with longevity
biology" not "activating this protein extends lifespan."
### 4. Not validated experimentally
Novel predictions are model outputs only. No wet lab validation has
been performed. TFEB is the strongest prediction based on prior
literature but this model did not discover TFEB — it independently
ranked it highly, consistent with existing biology.
### 5. Not for clinical use
This is a research screening tool. Do not use for any clinical,
diagnostic, or therapeutic decision-making.
---
## Training Data
**Positive set:** GenAge database (genomics.senescence.info)
- Human GenAge: 306 human longevity-associated genes
- Model organism GenAge: Pro-Longevity genes only from 4 species
  - C. elegans: 283 genes
  - D. melanogaster: 125 genes
  - M. musculus: 85 genes
- Total positives: ~574
**Negative set:** Swiss-Prot reviewed proteins from the same species
- Sampled proportionally per species (NEG_RATIO=10)
- Species weights applied: human 2.0x, mouse 1.5x, worm/fly 1.0x
- "Necessary for fitness" genes excluded from universe entirely
- Anti-Longevity genes excluded from positives
**Filtering:**
- Sequence length: 50-1500 amino acids
- Swiss-Prot reviewed only (manually curated)
---
## Training Procedure
**Architecture:** ESM-2 150M + LoRA adapters
- LoRA rank: r=16, alpha=32, dropout=0.15
- Target modules: query, value attention projections
- Trainable parameters: ~4.7M of 150M total (3.1%)
**Loss function:** Focal loss with contrastive margin penalty
- gamma=1.0 (softer than standard gamma=2.0)
- Label smoothing=0.1
- Contrastive margin=0.30 (explicit separation penalty)
- Class weights: balanced
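
To illustrate why gamma=1.0 is described as softer than gamma=2.0, the sketch below evaluates the standard focal-loss term for a single example. It covers only the focal term; label smoothing, class weights, and the contrastive margin penalty are left out because their exact form is not specified here. `focal_loss` is an illustrative helper, not the training code.

```python
import math


def focal_loss(p_true, gamma=1.0):
    """Focal loss for one example, given the probability assigned to the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Compare gamma=1.0 (used for this model) with the common default gamma=2.0.
# A lower gamma down-weights easy, well-classified examples less aggressively.
for p in (0.9, 0.6, 0.2):
    print(p, round(focal_loss(p, 1.0), 4), round(focal_loss(p, 2.0), 4))
```
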
**Optimiser:** AdamW, lr=2e-4, weight_decay=0.01
**Schedule:** Cosine with warmup (10% warmup steps)
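
The shape of a cosine schedule with 10% warmup can be sketched as a plain function of the step index. This assumes linear warmup from zero and decay to zero at the final step, which is a common convention but is not confirmed by the card; `lr_at_step` and the step count are hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_frac=0.10):
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimiser steps
for s in (0, 50, 100, 550, 1000):
    print(s, f"{lr_at_step(s, total):.2e}")
```
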
**Early stopping:** Patience=4 on val AUPRC
**Best epoch:** 10 of 20
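
Early stopping with patience can be sketched as follows: training stops once validation AUPRC has failed to improve for `patience` consecutive epochs, and the best epoch's checkpoint is kept. The rule and the validation curve below are assumptions made for illustration; the real per-epoch values are not given in this card.

```python
def run_early_stopping(val_auprc_per_epoch, patience=4):
    """Return (best_epoch, stop_epoch) using 1-based epoch numbers."""
    best_score, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_auprc_per_epoch, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # patience exhausted
    return best_epoch, len(val_auprc_per_epoch)


# Invented validation curve that peaks at epoch 10, for illustration only.
curve = [0.10, 0.15, 0.18, 0.22, 0.25, 0.27, 0.28, 0.30, 0.31, 0.33,
         0.32, 0.31, 0.32, 0.30, 0.29, 0.28, 0.27, 0.26, 0.25, 0.24]
print(run_early_stopping(curve))
```
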
**Hardware:** NVIDIA T4 16GB (Kaggle)
**Training time:** ~2 hours
---
## How to Use
```python
from transformers import AutoTokenizer, EsmForSequenceClassification
from peft import PeftModel
import torch

# Load the base model with a 2-class head, then attach the LoRA adapter
base = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t30_150M_UR50D",
    num_labels=2,
    ignore_mismatched_sizes=True
)
model = PeftModel.from_pretrained(base, "YOUR_USERNAME/longevity-esm2-v6")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/longevity-esm2-v6")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

MAX_LENGTH = 512  # tokens, including the <cls> and <eos> tokens the tokenizer adds


def score_sequence(sequence, threshold=0.06):
    """Return the longevity probability and thresholded label for one sequence."""
    inputs = tokenizer(
        sequence,
        max_length=MAX_LENGTH,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device)
        )
    # Probability of class 1 (longevity-associated)
    prob = torch.softmax(outputs.logits, dim=1)[:, 1].item()
    # Two positions go to special tokens, so residues beyond MAX_LENGTH - 2 are dropped
    truncated = len(sequence) > MAX_LENGTH - 2
    return {
        "probability": round(prob, 4),
        "prediction": "Longevity" if prob >= threshold else "Non-longevity",
        "threshold": threshold,
        "warning": f"Truncated to {MAX_LENGTH} tokens" if truncated else None
    }


# Example
result = score_sequence("MKTAYIAKQRQISFVK...")
print(result)
```
**Recommended thresholds:**
- 0.05-0.06 for screening (maximise recall)
- 0.50 for high-confidence hits only
---
## Experiment History
This model is v6 in a series of iterative experiments:
| Version | Key change | Test AUPRC |
|---------|-----------|------------|
| v1 | Frozen encoder, 186 positives | Collapsed |
| v2 | LoRA r=8, 277 positives | 0.027 |
| v3 | ESM-2 150M, multi-species, ~2000 positives | 0.302 |
| v4 | Pro-Longevity filter, focal loss gamma=2 | 0.250 |
| v5 | Cleaned species, gamma=1, label smoothing | 0.323 |
| v6 (this) | Pathway-stratified split, contrastive margin | **0.335** |
---
## Citation
If you use this model in research, please cite:
@misc{elzek2026longevity,
author = {Elzek, Mo},
title = {Longevity Protein Classifier: Multi-species ESM-2 Fine-tuning},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/YOUR_USERNAME/longevity-esm2-v6}
}
---
## Contact
Built by Mo Elzek as part of the London Longevity Network ML project arc.
Feedback and collaboration welcome.