license: mit
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall

🧬 CAFA 6 Protein Function Prediction

"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."

BioBERT, I'm coming for you! 🔥

Model Description

Multi-label protein function prediction built on ESM-2 embeddings. From a protein's amino acid sequence, the model predicts Gene Ontology (GO) terms in all three GO ontologies.

What This Model Does

Given a protein sequence like:

MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...

It predicts:

  • Molecular Function (MFO): What the protein DOES (e.g., "protein binding", "kinase activity")
  • Biological Process (BPO): What pathways it's involved in (e.g., "signal transduction")
  • Cellular Component (CCO): WHERE it's located (e.g., "nucleus", "membrane")

Files in This Repository

  • train_esm2_embeddings.pkl (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
  • test_esm2_embeddings.pkl (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
  • go_parser.pkl (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
  • .gitattributes - Git LFS configuration for large files

Dataset Statistics

Training Data

  • Total proteins: 82,404
  • Total annotations: 537,027
  • Unique GO terms: 26,125

Selected Terms for Prediction

  • MFO: 500 most frequent terms
  • BPO: 800 most frequent terms
  • CCO: 400 most frequent terms
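The selection above amounts to a frequency count per ontology. The sketch below is an illustration rather than the author's code: the `(protein_id, go_term)` input format, the `select_top_terms` helper, and the toy annotations are all assumptions made for the example.

```python
from collections import Counter

def select_top_terms(annotations, k):
    """Return the k most frequent GO terms, most frequent first.

    annotations: iterable of (protein_id, go_term) pairs for ONE ontology
    (hypothetical input format, assumed for this sketch).
    """
    counts = Counter(term for _, term in annotations)
    # Sort by descending count, then by term ID so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Toy annotations; the real run would use k=500 (MFO), 800 (BPO), 400 (CCO).
toy = [
    ("P1", "GO:0005515"), ("P2", "GO:0005515"), ("P3", "GO:0005515"),
    ("P1", "GO:0016301"), ("P2", "GO:0016301"),
    ("P3", "GO:0003677"),
]
top_terms = select_top_terms(toy, k=2)
print(top_terms)
```

Breaking ties by term ID keeps the selected label set stable between runs, which matters when a saved model's output columns must line up with the term list.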

Label Distribution

| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
|---|---|---|---|
| MFO | 49,751 (60.4%) | 54.2 | 89.2% |
| BPO | 44,382 (53.9%) | 6.6 | 99.2% |
| CCO | 58,505 (71.0%) | 36.5 | 90.9% |
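The Sparsity column is consistent with the share of zero entries in each binary label matrix: one minus the average number of labels per protein divided by the number of selected terms. That definition is inferred from the numbers, not stated by the author, so the check below is a sanity calculation under that assumption.

```python
# Values copied from the Selected Terms list and the Label Distribution table.
selected_terms = {"MFO": 500, "BPO": 800, "CCO": 400}
avg_labels = {"MFO": 54.2, "BPO": 6.6, "CCO": 36.5}
reported_sparsity = {"MFO": 89.2, "BPO": 99.2, "CCO": 90.9}

def sparsity_percent(avg_labels_per_protein, n_terms):
    """Percentage of zeros in a binary label matrix with n_terms columns."""
    density = avg_labels_per_protein / n_terms
    return 100.0 * (1.0 - density)

sparsity = {
    name: sparsity_percent(avg_labels[name], selected_terms[name])
    for name in selected_terms
}
for name, value in sparsity.items():
    print(f"{name}: computed {value:.2f}%, reported {reported_sparsity[name]}%")
```

The computed values agree with the reported ones to within rounding, which suggests the averages and the sparsity figures were taken over the same set of rows.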

Usage

Requirements

pip install torch biopython transformers huggingface_hub numpy

Quick Start - Load Embeddings

from huggingface_hub import hf_hub_download
import pickle

# Download embeddings
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl"
)

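# Load embeddings (pickle can execute code while loading, so only open files you trust)
with open(embeddings_path, 'rb') as f:
    embeddings = pickle.load(f)

# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
# Peek at one vector without copying every value into a list
print(f"Embedding dimension: {next(iter(embeddings.values())).shape}")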
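For training, the dictionary usually has to become a dense matrix with a fixed row order. A minimal sketch, assuming every value is a 1-D NumPy array (or something `np.asarray` can convert) of the same length; `to_matrix` is a hypothetical helper and the toy dictionary stands in for the downloaded file.

```python
import numpy as np

def to_matrix(embeddings):
    """Stack {protein_id: vector} into (ids, X) with rows in sorted-ID order."""
    ids = sorted(embeddings)
    X = np.stack([np.asarray(embeddings[pid], dtype=np.float32) for pid in ids])
    return ids, X

# Toy stand-in for the unpickled dict (4-dim instead of 1280-dim vectors).
toy_embeddings = {
    "P2": np.ones(4, dtype=np.float32),
    "P1": np.zeros(4, dtype=np.float32),
}
ids, X = to_matrix(toy_embeddings)
print(ids, X.shape)
```

Keeping `ids` alongside `X` makes it possible to build the label matrix in the same row order later.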

Generate New Embeddings for Your Protein

from transformers import AutoTokenizer, EsmModel
import torch

# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Your protein sequence (elided here; pass the complete amino acid string)
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."

# Generate embedding
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    # Mean over all token positions, including the <cls> and <eos> special tokens
    embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]

print(f"Generated embedding shape: {embedding.shape}")
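The snippet above embeds a single sequence, so nothing is padded. If several sequences of different lengths are tokenized together, a plain `mean(dim=1)` also averages the padding positions. One common remedy is a mean weighted by the attention mask. The sketch below shows that arithmetic in NumPy on toy arrays; the same calculation applies to `outputs.last_hidden_state` and `inputs["attention_mask"]`. How the precomputed embeddings in this repository were pooled is not stated, so treat this as an illustration, not a description of them.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden: float array of shape (batch, tokens, dim)
    mask:   0/1 array of shape (batch, tokens); 1 marks a real token
    """
    weights = mask[..., None].astype(hidden.dtype)   # (batch, tokens, 1)
    summed = (hidden * weights).sum(axis=1)          # (batch, dim)
    counts = np.maximum(weights.sum(axis=1), 1.0)    # guard against empty rows
    return summed / counts

# Two toy "sequences" with 3 token slots and 2 dims; the second has one padded slot.
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]],  # last slot is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = masked_mean(hidden, mask)
print(pooled)
```

Whatever pooling is chosen, new embeddings should be pooled the same way as the ones the classifier was trained on.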

Load GO Parser

from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl"
)

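# Load parser (pickle stores only a reference to the object's class,
# so the parser's class definition must be importable in your session)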
with open(parser_path, 'rb') as f:
    go_parser = pickle.load(f)

# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")

Model Architecture

The prediction model uses a Multi-Layer Perceptron (MLP):

Input: ESM-2 Embeddings (1280-dim)
    ↓
[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense Output] → Sigmoid
    ↓
Multi-label Predictions
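A PyTorch sketch of the stack in the diagram. The class name, argument names, and the choice to keep the sigmoid outside the module are assumptions made for illustration; the author's training code is not shown here. Returning raw logits is the arrangement a with-logits loss, as listed under Training Details, expects.

```python
import torch
from torch import nn

class ProteinMLP(nn.Module):
    """MLP over pooled ESM-2 embeddings; returns one logit per GO term."""

    def __init__(self, n_terms, in_dim=1280, hidden=(2048, 1024, 512), dropout=0.3):
        super().__init__()
        layers = []
        prev = in_dim
        for width in hidden:
            layers += [
                nn.Linear(prev, width),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            prev = width
        layers.append(nn.Linear(prev, n_terms))  # output layer, no activation here
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # raw logits

model = ProteinMLP(n_terms=500)  # e.g. the 500 selected MFO terms
model.eval()                     # disable dropout, use BatchNorm running statistics
with torch.no_grad():
    logits = model(torch.randn(4, 1280))
    probs = torch.sigmoid(logits)  # per-term probabilities in [0, 1]
print(probs.shape)
```

Because the pipeline trains separate models per ontology, only `n_terms` would change between the MFO, BPO, and CCO instances in this sketch.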

Training Details:

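  • Loss: Binary Cross-Entropy with Logits. This loss applies the sigmoid internally, so it should receive the raw output of the last Dense layer; the Sigmoid in the diagram is then needed only at inference.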
  • Optimizer: Adam
  • Learning Rate: 0.001 with ReduceLROnPlateau
  • Early Stopping: Patience of 10 epochs
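The settings above can be wired together roughly as follows. This is a sketch on synthetic data with a linear layer standing in for the MLP. The scheduler's `factor` and `patience`, the full-batch updates, and the 200-epoch cap are assumptions, since the source gives only the loss, optimizer, learning rate, and early-stopping patience.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-ins: 64 "proteins", 16-dim features, 5 labels.
X = torch.randn(64, 16)
Y = (torch.rand(64, 5) < 0.2).float()
X_train, Y_train, X_val, Y_val = X[:48], Y[:48], X[48:], Y[48:]

model = nn.Linear(16, 5)                      # stand-in for the MLP
criterion = nn.BCEWithLogitsLoss()            # expects raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3  # assumed values
)

patience, best_val, bad_epochs = 10, float("inf"), 0
history = []
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train), Y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), Y_val).item()
    history.append(val_loss)
    scheduler.step(val_loss)                  # lower the LR when val loss plateaus

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # early stopping
            break

print(f"stopped after {len(history)} epochs, best val loss {best_val:.4f}")
```

Giving the scheduler a shorter patience than early stopping lets the learning rate drop at least once before training is abandoned.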

Data Processing Pipeline

  1. Raw Sequences (FASTA format) → Parse protein IDs and sequences
  2. ESM-2 Encoding → Generate 1280-dim embeddings using facebook/esm2_t33_650M_UR50D
  3. GO Annotations → Load and normalize GO terms
  4. Label Preparation → Create multi-label binary matrices with term propagation
  5. Model Training → Train separate models for MFO, BPO, CCO
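Steps 1 and 4 of the pipeline above can be illustrated in plain Python. Everything here is a toy under stated assumptions: the ID is taken as the first word of the FASTA header, the hierarchy is a hand-written child-to-parents map with made-up term names, and propagation follows every listed parent. Which GO relations the author actually propagates over is not stated.

```python
def parse_fasta(text):
    """Parse FASTA text into {protein_id: sequence}; ID = first word after '>'."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {pid: "".join(parts) for pid, parts in records.items()}

def propagate(terms, parents):
    """Return the terms plus all of their ancestors in the hierarchy."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed

def label_matrix(protein_ids, annotations, parents, selected_terms):
    """Binary rows (one per protein) over the selected term columns."""
    rows = []
    for pid in protein_ids:
        closed = propagate(annotations.get(pid, ()), parents)
        rows.append([1 if term in closed else 0 for term in selected_terms])
    return rows

fasta = ">P1 toy protein\nMKTAYIAK\nQRQISF\n>P2\nMSLE\n"
sequences = parse_fasta(fasta)

# Toy hierarchy with made-up term names: child -> parents.
parents = {"T:leaf": ["T:mid"], "T:mid": ["T:root"], "T:root": []}
annotations = {"P1": ["T:leaf"], "P2": ["T:mid"]}
selected = ["T:root", "T:mid", "T:leaf"]

Y = label_matrix(list(sequences), annotations, parents, selected)
print(sequences)
print(Y)
```

Propagation is why a protein annotated only with a specific term still gets a 1 in the columns of that term's ancestors, which raises the average number of labels per protein.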

Citation

@misc{nl45_cafa6_2026,
  title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
  author={nl45},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nl45/Protein1}}
}

Acknowledgments

  • CAFA Challenge: Critical Assessment of Functional Annotation
  • ESM-2: Evolutionary Scale Modeling from Meta AI
  • Gene Ontology Consortium: For GO term annotations

License

MIT License

Contact

For questions or collaboration, open an issue on the repository.

"BioBERT, I'm coming for you!" 🔥🧬