	---
	language:
	- en
	license: apache-2.0
	tags:
	- biology
	- protein
	- longevity
	- aging
	- ESM-2
	- LoRA
	- sequence-classification
	datasets:
	- GenAge
	- SwissProt
	metrics:
	- auprc
	- roc_auc
	base_model: facebook/esm2_t30_150M_UR50D
	---

	# Longevity Protein Classifier v6

	Fine-tuned ESM-2 150M for binary classification of protein sequences
	as longevity-associated or not, trained on multi-species GenAge data
	with LoRA adapters.

	Built as part of a personal ML learning arc — Week 3 of 8 —
	connecting protein language models to longevity biology.

	---

	## Model Description

	- Model type: ESM-2 150M + LoRA (r=16) sequence classifier
	- Base model: facebook/esm2_t30_150M_UR50D
	- Task: Binary classification — longevity-associated vs non-longevity
	- Developed by: Mo Elzek
	- License: Apache 2.0

	---

	## Performance

	| Metric | Value |
	|--------|-------|
	| Test AUPRC | 0.335 |
	| Test AUC-ROC | 0.696 |
	| Random AUPRC baseline | 0.061 |
	| Improvement over random | 5.5x |
	| Training epochs | 10 (early stopping) |
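
The expected AUPRC of a random ranker equals the fraction of positives in the evaluation set, so the improvement figure is the ratio of the two AUPRC rows. A minimal check of that arithmetic, with variable names chosen only for illustration:

```python
# Values copied from the Performance table.
test_auprc = 0.335
random_auprc = 0.061  # a random ranker's expected AUPRC is the positive-class prevalence

lift = test_auprc / random_auprc
print(f"Improvement over random: {lift:.1f}x")
```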

	---

	## Benchmark Results

	| Protein | Score | Expected | Notes |
	|---------|-------|----------|-------|
	| SIRT1 | 0.996 | HIGH | NAD+ deacetylase, caloric restriction mediator |
	| SIRT3 | 0.998 | HIGH | Mitochondrial sirtuin |
	| TP53 | 0.974 | HIGH | Tumour suppressor, aging roles |
	| MYH9 | 0.000 | LOW | Structural myosin, negative control |
	| ACTB | 0.000 | LOW | Beta actin, negative control |
	| ALB | 0.000 | LOW | Serum albumin, negative control |
	| FOXO3 | 0.000 | HIGH | Fails; see limitations |
	| MTOR | 0.000 | HIGH | Fails; see limitations |
	| TERT | 0.000 | HIGH | Fails; see limitations |

	---

	## Novel Predictions Not in GenAge

	Proteins scoring above 0.50 that are not present in the GenAge human
	database. These are model predictions of longevity-relevant proteins
	not yet catalogued, not validated findings.

	| Protein | Score | Biological relevance |
	|---------|-------|----------------------|
	| TFEB | 0.502 | Master regulator of autophagy and lysosomal biogenesis. Overexpression extends lifespan in C. elegans. Regulated by mTOR. Best supported by prior literature among the novel predictions, although it has the lowest score in this table. |
	| NEIL1 | 0.951 | DNA glycosylase, base excision repair of oxidative damage. DNA repair capacity correlates with species lifespan. |
	| GSTA1 | 0.871 | Glutathione S-transferase. Antioxidant defence. GST family implicated in longevity across multiple species. |
	| GRHL1 | 0.880 | Grainyhead-like transcription factor. Epithelial barrier maintenance; tissue integrity declines with age. |
	| EXO1 | 0.550 | Exonuclease involved in DNA mismatch repair and double-strand break repair. |
	| MSH4 | 0.546 | DNA mismatch repair. Related family members (MSH2, MSH6) are established longevity-associated genes. |
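
The selection rule described above (score above 0.50, absent from GenAge human) can be sketched as a simple filter. The scores are copied from the tables in this card; the `genage_human` set is a small hypothetical stand-in for the real database, and `novel_hits` is an illustrative helper, not part of the released code.

```python
# Scores copied from the tables in this card.
scores = {
    "SIRT1": 0.996, "TFEB": 0.502, "NEIL1": 0.951, "GSTA1": 0.871,
    "GRHL1": 0.880, "EXO1": 0.550, "MSH4": 0.546, "ACTB": 0.000,
}
# Hypothetical stand-in for the full GenAge human gene list.
genage_human = {"SIRT1", "SIRT3", "TP53"}

def novel_hits(scores, known, cutoff=0.50):
    """Return (protein, score) pairs above `cutoff` and absent from `known`, best first."""
    hits = [(name, s) for name, s in scores.items() if s > cutoff and name not in known]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

ranked = novel_hits(scores, genage_human)
for name, s in ranked:
    print(f"{name}\t{s:.3f}")
```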

	---

	## Recommended Thresholds

	| Use case | Threshold | Precision | Recall |
	|----------|-----------|-----------|--------|
	| Screening (cast a wide net) | 0.05 | ~0.20 | ~0.29 |
	| Balanced | 0.06 | ~0.41 | ~0.29 |
	| High confidence hits only | 0.50 | ~0.61 | ~0.24 |

	Optimised threshold from val set: 0.06 (F1: 0.358)

	The model produces a bimodal score distribution: proteins it recognises
	score very high (above 0.50), and proteins it does not recognise score
	near zero. The recall curve is therefore nearly flat from 0.05 to 0.70,
	because most longevity proteins are either clearly found or clearly missed.
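
To see why recall stays flat while precision moves, it can help to compute both at several thresholds. The sketch below uses made-up bimodal scores, not model outputs; `precision_recall` is an illustrative helper.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy bimodal scores (invented for illustration): most values sit near 0 or near 1.
toy_scores = [0.99, 0.97, 0.95, 0.055, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00]
toy_labels = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

for t in (0.05, 0.06, 0.50):
    p, r = precision_recall(toy_scores, toy_labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy data, raising the threshold from 0.05 to 0.06 removes one low-scoring false positive, so precision rises while recall is unchanged.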

	---

	## Known Limitations — Read Before Use

	### 1. Protein length truncation
	Sequences longer than 512 amino acids are truncated from the
	C-terminus. This causes systematic failures on long proteins where
	the functional domain sits in the C-terminal half:

	- MTOR (2,549 aa): kinase domain at residues 2181-2431 — truncated away
	- TERT (1,132 aa): reverse transcriptase domain at 600-900 — truncated away

	Do not use this model to score proteins above 800 amino acids
	without validating on known examples from that protein family first.
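
Before scoring a long protein, a quick check of whether its functional domain survives truncation can be sketched as follows. The domain coordinates are the ones listed above; `domain_retained` is a hypothetical helper. If the tokenizer adds start and end tokens, as the ESM tokenizer in `transformers` does, `max_length=512` leaves room for 510 residues, so `window` can be lowered accordingly.

```python
def domain_retained(domain_start, domain_end, window=512):
    """True if a 1-based residue range lies entirely within the first `window` residues."""
    return 1 <= domain_start <= domain_end <= window

# Domain coordinates as listed in this section.
examples = {
    "MTOR kinase domain": (2181, 2431),
    "TERT reverse transcriptase domain": (600, 900),
}
for name, (start, end) in examples.items():
    status = "kept" if domain_retained(start, end) else "truncated away"
    print(f"{name} ({start}-{end}): {status}")
```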

	### 2. Family-specific blind spots
	The model learned sirtuin and tumour suppressor sequence features
	well but has insufficient training examples to generalise to:

	- Forkhead transcription factors (FOXO3 scores 0.000 despite
	being a canonical longevity gene; human FOXO3 is 673 aa, so its
	C-terminus is also truncated, but its forkhead DNA-binding domain
	lies within the 512 aa window)
	- Large kinases (truncation compounds this)
	- Telomerase complex proteins

	### 3. Direction of effect not captured
	The model cannot distinguish between:
	- Pro-longevity proteins (overexpression extends lifespan)
	- Proteins that protect against age-related decline (loss of function accelerates aging)

	Both may score high. A high score means "associated with longevity
	biology", not "activating this protein extends lifespan".

	### 4. Not validated experimentally
	Novel predictions are model outputs only. No wet lab validation has
	been performed. TFEB is the novel prediction best supported by prior
	literature, but this model did not discover TFEB; it independently
	scored it above the 0.50 cut-off, consistent with existing biology.

	### 5. Not for clinical use
	This is a research screening tool. Do not use for any clinical,
	diagnostic, or therapeutic decision-making.

	---

	## Training Data

	Positive set: GenAge database (genomics.senescence.info)
	- Human GenAge: 306 human longevity-associated genes
	- Model organism GenAge: Pro-Longevity genes only from 4 species
	  - C. elegans: 283 genes
	  - D. melanogaster: 125 genes
	  - M. musculus: 85 genes
	- Total positives: ~574

	Negative set: Swiss-Prot reviewed proteins from same species
	- Sampled proportionally per species (NEG_RATIO=10)
	- Species weights applied: human 2.0x, mouse 1.5x, worm/fly 1.0x
	- "Necessary for fitness" genes excluded from universe entirely
	- Anti-Longevity genes excluded from positives

	Filtering:
	- Sequence length: 50-1500 amino acids
	- Swiss-Prot reviewed only (manually curated)
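
The negative sampling and length filter can be sketched roughly as below. This is a minimal illustration under stated assumptions: the length bounds are treated as inclusive, sampling is done per species from a pre-built pool, and the species weights are left out because the card does not say where they are applied. `length_ok` and `sample_negatives` are hypothetical names.

```python
import random

NEG_RATIO = 10               # negatives per positive, from the card
MIN_LEN, MAX_LEN = 50, 1500  # length filter, from the card (assumed inclusive)

def length_ok(sequence):
    """Keep sequences of 50 to 1500 amino acids."""
    return MIN_LEN <= len(sequence) <= MAX_LEN

def sample_negatives(n_positives, candidate_pool, seed=0):
    """Draw up to NEG_RATIO negatives per positive from one species' pool."""
    eligible = [seq for seq in candidate_pool if length_ok(seq)]
    k = min(NEG_RATIO * n_positives, len(eligible))
    return random.Random(seed).sample(eligible, k)

# Toy pool of placeholder sequences with lengths 31 to 130.
pool = ["M" + "A" * n for n in range(30, 130)]
negatives = sample_negatives(n_positives=3, candidate_pool=pool)
print(len(negatives), min(len(s) for s in negatives))
```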

	---

	## Training Procedure

	Architecture: ESM-2 150M + LoRA adapters
	- LoRA rank: r=16, alpha=32, dropout=0.15
	- Target modules: query, value attention projections
	- Trainable parameters: ~4.7M of 150M total (3.1%)
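
Expressed as a PEFT configuration, the adapter settings above might look like the fragment below. The hyperparameters come from this card; `task_type` is an assumption, and the module names `query` and `value` are taken from the description of the target projections.

```python
from peft import LoraConfig, TaskType

# Hyperparameters from the card; task_type is an assumption.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.15,
    target_modules=["query", "value"],
)
```

Wrapping the base model with `peft.get_peft_model(base, lora_config)` and calling `print_trainable_parameters()` on the result reports the trainable fraction.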

	Loss function: Focal loss with contrastive margin penalty
	- gamma=1.0 (softer than standard gamma=2.0)
	- Label smoothing=0.1
	- Contrastive margin=0.30 (explicit separation penalty)
	- Class weights: balanced
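
The effect of gamma can be illustrated with a plain binary focal loss. This sketch covers only the focal term; label smoothing, class weights, and the contrastive margin penalty are omitted because the card does not give their exact form. `focal_loss` is an illustrative helper, not the training code.

```python
import math

def focal_loss(p_positive, label, gamma=1.0):
    """Binary focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = p_positive if label == 1 else 1.0 - p_positive
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Compare an easy positive (p=0.9) with a hard positive (p=0.1).
for gamma in (0.0, 1.0, 2.0):
    easy = focal_loss(0.9, 1, gamma)
    hard = focal_loss(0.1, 1, gamma)
    print(f"gamma={gamma}: easy={easy:.4f}  hard={hard:.4f}")
```

With gamma=0 this reduces to cross-entropy. Larger gamma shrinks the loss on easy examples more strongly, so gamma=1.0 down-weights them less aggressively than gamma=2.0.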

	Optimiser: AdamW, lr=2e-4, weight_decay=0.01
	Schedule: Cosine with warmup (10% warmup steps)
	Early stopping: Patience=4 on val AUPRC
	Best epoch: 10 of 20
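
The schedule and stopping rule can be sketched in a few lines. The card does not name the exact scheduler, so `lr_at` assumes the common form of linear warmup followed by cosine decay to zero; `early_stop` and the validation curve are hypothetical.

```python
import math

def lr_at(step, total_steps, base_lr=2e-4, warmup_frac=0.10):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def early_stop(val_scores, patience=4):
    """Return (best_epoch, last_epoch_run), both 1-based, for a metric to maximise."""
    best_epoch, best = 0, float("-inf")
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best_epoch, best = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores)

# Hypothetical val AUPRC curve: improves for 10 epochs, then plateaus lower.
toy_curve = [0.1 * i for i in range(1, 11)] + [0.5] * 10
print(early_stop(toy_curve))
print([round(lr_at(s, 1000), 6) for s in (0, 50, 100, 550, 1000)])
```

Under this rule, a run whose best epoch is 10 with patience 4 would stop after epoch 14 rather than running all 20.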

	Hardware: NVIDIA T4 16GB (Kaggle)
	Training time: ~2 hours

	---

	## How to Use
	```python
	from transformers import AutoTokenizer, EsmForSequenceClassification
	from peft import PeftModel
	import torch

	# Load the base model with a 2-class head, then attach the LoRA adapter
	base = EsmForSequenceClassification.from_pretrained(
	    "facebook/esm2_t30_150M_UR50D",
	    num_labels=2,
	    ignore_mismatched_sizes=True,
	)
	model = PeftModel.from_pretrained(base, "YOUR_USERNAME/longevity-esm2-v6")
	tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/longevity-esm2-v6")

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model = model.to(device)
	model.eval()

	def score_sequence(sequence, threshold=0.06):
	    """Return the longevity probability and thresholded label for one sequence."""
	    inputs = tokenizer(
	        sequence,
	        max_length=512,
	        padding="max_length",
	        truncation=True,
	        return_tensors="pt",
	    )
	    with torch.no_grad():
	        outputs = model(
	            input_ids=inputs["input_ids"].to(device),
	            attention_mask=inputs["attention_mask"].to(device),
	        )
	    # Class index 1 is the longevity-associated class
	    prob = torch.softmax(outputs.logits, dim=1)[:, 1].item()
	    return {
	        "probability": round(prob, 4),
	        "prediction": "Longevity" if prob >= threshold else "Non-longevity",
	        "threshold": threshold,
	        "warning": "Truncated to 512 aa" if len(sequence) > 512 else None,
	    }

	# Example
	result = score_sequence("MKTAYIAKQRQISFVK...")
	print(result)
	```

	Recommended thresholds:
	- 0.05-0.06 for screening (maximise recall)
	- 0.50 for high-confidence hits only

	---

	## Experiment History

	This model is v6 in a series of iterative experiments:

	| Version | Key change | Test AUPRC |
	|---------|-----------|------------|
	| v1 | Frozen encoder, 186 positives | Collapsed |
	| v2 | LoRA r=8, 277 positives | 0.027 |
	| v3 | ESM-2 150M, multi-species, ~2000 positives | 0.302 |
	| v4 | Pro-Longevity filter, focal loss gamma=2 | 0.250 |
	| v5 | Cleaned species, gamma=1, label smoothing | 0.323 |
	| v6 (this) | Pathway-stratified split, contrastive margin | 0.335 |

	---

	## Citation

	If you use this model in research, please cite:
	@misc{elzek2026longevity,
	  author = {Elzek, Mo},
	  title = {Longevity Protein Classifier: Multi-species ESM-2 Fine-tuning},
	  year = {2026},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/YOUR_USERNAME/longevity-esm2-v6}
	}

	---

	## Contact

	Built by Mo Elzek as part of the London Longevity Network ML project arc.
	Feedback and collaboration welcome.