---
language:
- en
license: apache-2.0
tags:
- biology
- protein
- longevity
- aging
- ESM-2
- LoRA
- sequence-classification
datasets:
- GenAge
- SwissProt
metrics:
- auprc
- roc_auc
base_model: facebook/esm2_t30_150M_UR50D
---

# Longevity Protein Classifier v6

Fine-tuned ESM-2 150M for binary classification of protein sequences
as longevity-associated or not, trained on multi-species GenAge data
with LoRA adapters.

Built as part of a personal ML learning arc (Week 3 of 8)
connecting protein language models to longevity biology.

---

## Model Description

- **Model type:** ESM-2 150M + LoRA (r=16) sequence classifier
- **Base model:** facebook/esm2_t30_150M_UR50D
- **Task:** Binary classification: longevity-associated vs non-longevity
- **Developed by:** Mo Elzek
- **License:** Apache 2.0

---

## Performance

| Metric | Value |
|--------|-------|
| Test AUPRC | 0.335 |
| Test AUC-ROC | 0.696 |
| Random AUPRC baseline | 0.061 |
| Improvement over random | 5.5x |
| Training epochs | 10 (early stopping) |
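
The "improvement over random" figure follows from the fact that a random ranker's expected AUPRC is approximately the fraction of positives in the test set. The sketch below shows that arithmetic alongside a plain-Python average precision, one common AUPRC estimator. The function names are illustrative, and whether this card's AUPRC was computed with this exact estimator is an assumption.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision@k taken at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def lift_over_random(auprc, positive_fraction):
    """Ratio of a model's AUPRC to the random baseline (the positive fraction)."""
    return auprc / positive_fraction


# Values taken from the table above
print(f"lift over random: {lift_over_random(0.335, 0.061):.1f}x")  # lift over random: 5.5x
```

Because the baseline depends on class balance, AUPRC values from test sets with different positive fractions are not directly comparable; the lift is the more portable number.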

---

## Benchmark Results

| Protein | Score | Expected | Notes |
|---------|-------|----------|-------|
| SIRT1 | 0.996 | HIGH | NAD+ deacetylase, caloric restriction mediator |
| SIRT3 | 0.998 | HIGH | Mitochondrial sirtuin |
| TP53 | 0.974 | HIGH | Tumour suppressor, aging roles |
| MYH9 | 0.000 | LOW | Structural myosin; negative control |
| ACTB | 0.000 | LOW | Beta actin; negative control |
| ALB | 0.000 | LOW | Serum albumin; negative control |
| FOXO3 | 0.000 | HIGH | **Fails**; see limitations |
| MTOR | 0.000 | HIGH | **Fails**; see limitations |
| TERT | 0.000 | HIGH | **Fails**; see limitations |

---

## Novel Predictions Not in GenAge

Proteins scoring above 0.50 that are not present in the GenAge human
database. These are model predictions of longevity-relevant proteins
not yet catalogued, not validated findings.

| Protein | Score | Biological relevance |
|---------|-------|----------------------|
| TFEB | 0.502 | Master regulator of autophagy and lysosomal biogenesis. Overexpression extends lifespan in C. elegans. Regulated by mTOR. Best supported by prior literature among these predictions, although its score is the lowest listed. |
| NEIL1 | 0.951 | DNA glycosylase, base excision repair of oxidative damage. DNA repair capacity correlates with species lifespan. |
| GSTA1 | 0.871 | Glutathione S-transferase. Antioxidant defence. GST family implicated in longevity across multiple species. |
| GRHL1 | 0.880 | Grainyhead-like transcription factor. Epithelial barrier maintenance; tissue integrity declines with age. |
| EXO1 | 0.550 | Exonuclease involved in DNA mismatch repair and double-strand break repair. |
| MSH4 | 0.546 | DNA mismatch repair. Related family members (MSH2, MSH6) are established longevity-associated genes. |

---

## Recommended Thresholds

| Use case | Threshold | Precision | Recall |
|----------|-----------|-----------|--------|
| Screening (cast a wide net) | 0.05 | ~0.20 | ~0.29 |
| Balanced | 0.06 | ~0.41 | ~0.29 |
| High-confidence hits only | 0.50 | ~0.61 | ~0.24 |

Optimised threshold from the validation set: **0.06** (F1: 0.358)

The model produces a bimodal score distribution: proteins it recognises
score very high (above 0.50), and proteins it does not recognise score
near zero. The flat recall curve from 0.05 to 0.70 reflects this: most
longevity proteins are either clearly found or clearly missed.
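
A threshold like the one above can be chosen by sweeping candidate values on a validation set and keeping the one with the best F1. The following is a minimal sketch of that sweep in plain Python. The function names and the toy validation data are made up for illustration and are not the data behind the table.

```python
def precision_recall_f1(labels, probs, threshold):
    """Precision, recall and F1 when predicting positive for prob >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(labels, probs, candidates):
    """Candidate threshold with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: precision_recall_f1(labels, probs, t)[2])


# Toy, bimodal validation scores (illustrative only)
val_labels = [1, 1, 1, 0, 0, 0, 0, 0]
val_probs = [0.98, 0.91, 0.01, 0.30, 0.04, 0.02, 0.00, 0.00]

for t in (0.05, 0.06, 0.50):
    p, r, f1 = precision_recall_f1(val_labels, val_probs, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print("best:", best_f1_threshold(val_labels, val_probs, [0.05, 0.06, 0.50]))
```

With bimodal scores, recall barely moves across thresholds because the missed positives sit near zero; only precision changes as low-scoring false positives drop out.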

---

## Known Limitations: Read Before Use

### 1. Protein length truncation
Sequences longer than 512 amino acids are truncated at the C-terminal
end, so only the N-terminal portion is scored (the 512-token window
holds 510 residues plus two special tokens). This causes systematic
failures on long proteins where the functional domain sits in the
C-terminal half:

- **MTOR** (2,549 aa): kinase domain at residues 2181-2431, truncated away
- **TERT** (1,132 aa): reverse transcriptase domain at residues 600-900, truncated away

Do not use this model to score proteins above 800 amino acids
without first validating on known examples from that protein family.
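
Before scoring a long protein, it can help to check how much of its key domain survives truncation. The sketch below does that for the two examples above, using the 512-residue window and the domain coordinates stated in this card. The `Domain` type and `domain_coverage` function are hypothetical helpers, not part of the released model.

```python
from typing import NamedTuple

WINDOW = 512  # truncation length stated in this card


class Domain(NamedTuple):
    protein: str
    length: int  # full protein length in residues
    start: int   # 1-based first residue of the domain
    end: int     # 1-based last residue of the domain


def domain_coverage(domain: Domain, window: int = WINDOW) -> float:
    """Fraction of the domain retained when only residues 1..window are kept."""
    kept = max(0, min(domain.end, window) - domain.start + 1)
    return kept / (domain.end - domain.start + 1)


domains = [
    Domain("MTOR", 2549, 2181, 2431),
    Domain("TERT", 1132, 600, 900),
]
for d in domains:
    print(f"{d.protein}: {domain_coverage(d):.0%} of the domain is inside the window")
```

Both examples report 0%, which matches the failures in the benchmark table: the model never sees the residues that define these proteins' function.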

### 2. Family-specific blind spots
The model learned sirtuin and tumour suppressor sequence features
well but has insufficient training examples to generalise to:

- **Forkhead transcription factors** (FOXO3 scores 0.000 despite
  being a canonical longevity gene; human FOXO3 is 673 aa, but its
  forkhead DNA-binding domain sits in the N-terminal half, inside the
  512 aa window)
- **Large kinases** (truncation compounds this)
- **Telomerase complex** proteins

### 3. Direction of effect not captured
The model cannot distinguish between:
- Pro-longevity proteins (overexpression extends lifespan)
- Proteins that protect against age-related disease (loss of function accelerates aging)

Both may score high. A high score means "associated with longevity
biology", not "activating this protein extends lifespan."

### 4. Not validated experimentally
Novel predictions are model outputs only. No wet lab validation has
been performed. TFEB is the prediction best supported by prior
literature, but this model did not discover TFEB; it independently
scored it above the 0.50 threshold, consistent with existing biology.

### 5. Not for clinical use
This is a research screening tool. Do not use it for any clinical,
diagnostic, or therapeutic decision-making.

---

## Training Data

**Positive set:** GenAge database (genomics.senescence.info)
- Human GenAge: 306 human longevity-associated genes
- Model organism GenAge: Pro-Longevity genes only, from 3 model organisms (4 species in total, counting human)
  - C. elegans: 283 genes
  - D. melanogaster: 125 genes
  - M. musculus: 85 genes
- Total positives: ~574 (the counts above sum to 799, so this total presumably reflects the filtering described below)

**Negative set:** Swiss-Prot reviewed proteins from the same species
- Sampled proportionally per species (NEG_RATIO=10)
- Species weights applied: human 2.0x, mouse 1.5x, worm/fly 1.0x
- "Necessary for fitness" genes excluded from the candidate universe entirely
- Anti-Longevity genes excluded from positives

**Filtering:**
- Sequence length: 50-1500 amino acids
- Swiss-Prot reviewed only (manually curated)
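
The length filter and per-species negative sampling described above could look roughly like the sketch below. It assumes "sampled proportionally per species" means drawing `NEG_RATIO` negatives per positive within each species; the function names and toy data are hypothetical, and the species weights are left out because the card does not say how they were applied.

```python
import random
from collections import Counter

NEG_RATIO = 10               # negatives drawn per positive, as stated above
MIN_LEN, MAX_LEN = 50, 1500  # sequence length filter, in amino acids


def length_ok(sequence: str) -> bool:
    return MIN_LEN <= len(sequence) <= MAX_LEN


def sample_negatives(positives, candidates, neg_ratio=NEG_RATIO, seed=0):
    """Sample negatives per species, in proportion to that species' positives.

    positives and candidates are lists of (species, sequence) tuples.
    """
    rng = random.Random(seed)
    positive_seqs = {seq for _, seq in positives}
    pos_per_species = Counter(sp for sp, seq in positives if length_ok(seq))
    negatives = []
    for species, n_pos in sorted(pos_per_species.items()):
        pool = [
            (sp, seq)
            for sp, seq in candidates
            if sp == species and length_ok(seq) and seq not in positive_seqs
        ]
        # Never ask for more negatives than the pool holds
        negatives.extend(rng.sample(pool, min(len(pool), neg_ratio * n_pos)))
    return negatives


# Toy data (illustrative only): one positive per species
positives = [("human", "M" * 120), ("worm", "M" * 80)]
candidates = (
    [("human", "A" * (50 + i)) for i in range(30)]
    + [("worm", "A" * (50 + i)) for i in range(5)]
    + [("human", "A" * 10)]  # too short, removed by the length filter
)
negatives = sample_negatives(positives, candidates)
print(Counter(species for species, _ in negatives))
```

On the toy data this yields 10 human negatives and 5 worm negatives, since the worm pool only holds 5 eligible sequences.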

---

## Training Procedure

**Architecture:** ESM-2 150M + LoRA adapters
- LoRA rank: r=16, alpha=32, dropout=0.15
- Target modules: query, value attention projections
- Trainable parameters: ~4.7M of 150M total (3.1%)

**Loss function:** Focal loss with contrastive margin penalty
- gamma=1.0 (softer than standard gamma=2.0)
- Label smoothing=0.1
- Contrastive margin=0.30 (explicit separation penalty)
- Class weights: balanced
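
The loss described above can be illustrated for a single example as follows. This is a sketch under stated assumptions, not the training code: the way label smoothing is combined with the focal term and the exact form of the margin penalty (a hinge on the gap between mean positive and mean negative scores) are guesses, and class weighting is omitted.

```python
import math


def focal_loss(p_pos, label, gamma=1.0, smoothing=0.1):
    """Binary focal loss with label smoothing for one example.

    p_pos is the predicted probability of the positive class.
    Assumption: the smoothed target is (1 - smoothing) * one_hot + smoothing / 2.
    """
    eps = 1e-7
    p = min(max(p_pos, eps), 1 - eps)
    q_pos = label * (1 - smoothing) + smoothing / 2
    q_neg = 1 - q_pos
    # Each term is down-weighted when that class is already predicted confidently
    loss_pos = -q_pos * (1 - p) ** gamma * math.log(p)
    loss_neg = -q_neg * p ** gamma * math.log(1 - p)
    return loss_pos + loss_neg


def margin_penalty(pos_probs, neg_probs, margin=0.30):
    """Hypothetical separation penalty: zero once mean scores differ by >= margin."""
    if not pos_probs or not neg_probs:
        return 0.0
    gap = sum(pos_probs) / len(pos_probs) - sum(neg_probs) / len(neg_probs)
    return max(0.0, margin - gap)


print(focal_loss(0.9, 1), focal_loss(0.1, 1))
print(margin_penalty([0.40], [0.30]))
```

A lower gamma keeps more gradient on easy examples than the standard gamma=2.0, which is the sense in which the card calls it "softer".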

- **Optimiser:** AdamW, lr=2e-4, weight_decay=0.01
- **Schedule:** Cosine with warmup (10% warmup steps)
- **Early stopping:** Patience=4 on val AUPRC
- **Best epoch:** 10 of 20
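
For reference, the schedule and stopping rule above can be written out in plain Python. This is a sketch of the standard forms (linear warmup followed by a half-cosine decay to zero, and patience counted in epochs since the best validation score); the function names are illustrative, and the actual training loop may differ in details.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_frac=0.10):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def should_stop(val_auprc_history, patience=4):
    """True once `patience` epochs have passed without beating the best val AUPRC."""
    if not val_auprc_history:
        return False
    best_epoch = max(range(len(val_auprc_history)), key=val_auprc_history.__getitem__)
    return len(val_auprc_history) - 1 - best_epoch >= patience


for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, 1000):.2e}")
```

Under this rule, a best epoch of 10 with patience 4 means training halts after epoch 14 rather than running the full 20.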

- **Hardware:** NVIDIA T4 16GB (Kaggle)
- **Training time:** ~2 hours

---

## How to Use

```python
from transformers import AutoTokenizer, EsmForSequenceClassification
from peft import PeftModel
import torch

# Load the base ESM-2 model, then attach the LoRA adapter
base = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t30_150M_UR50D",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
model = PeftModel.from_pretrained(base, "YOUR_USERNAME/longevity-esm2-v6")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/longevity-esm2-v6")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

MAX_TOKENS = 512
MAX_RESIDUES = MAX_TOKENS - 2  # the tokenizer adds <cls> and <eos> tokens


def score_sequence(sequence, threshold=0.06):
    """Score one amino acid sequence and apply a decision threshold."""
    inputs = tokenizer(
        sequence,
        max_length=MAX_TOKENS,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
        )
    # Probability of class 1 (longevity-associated)
    prob = torch.softmax(outputs.logits, dim=1)[:, 1].item()
    truncated = len(sequence) > MAX_RESIDUES
    return {
        "probability": round(prob, 4),
        "prediction": "Longevity" if prob >= threshold else "Non-longevity",
        "threshold": threshold,
        "warning": f"Truncated to {MAX_RESIDUES} residues" if truncated else None,
    }


# Example
result = score_sequence("MKTAYIAKQRQISFVK...")
print(result)
```

**Recommended thresholds:**
- 0.05-0.06 for screening (maximise recall)
- 0.50 for high-confidence hits only

---

## Experiment History

This model is v6 in a series of iterative experiments:

| Version | Key change | Test AUPRC |
|---------|-----------|------------|
| v1 | Frozen encoder, 186 positives | Collapsed |
| v2 | LoRA r=8, 277 positives | 0.027 |
| v3 | ESM-2 150M, multi-species, ~2000 positives | 0.302 |
| v4 | Pro-Longevity filter, focal loss gamma=2 | 0.250 |
| v5 | Cleaned species, gamma=1, label smoothing | 0.323 |
| v6 (this) | Pathway-stratified split, contrastive margin | **0.335** |

---

## Citation

If you use this model in research, please cite:

    @misc{elzek2026longevity,
      author = {Elzek, Mo},
      title = {Longevity Protein Classifier: Multi-species ESM-2 Fine-tuning},
      year = {2026},
      publisher = {HuggingFace},
      url = {https://huggingface.co/YOUR_USERNAME/longevity-esm2-v6}
    }

---

## Contact

Built by Mo Elzek as part of the London Longevity Network ML project arc.
Feedback and collaboration welcome.