---
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
license: mit
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---
🧬 CAFA 6 Protein Function Prediction
"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."
BioBERT, I'm coming for you! 🔥
Model Description
Multi-label protein function prediction from ESM-2 embeddings. Given a protein's amino acid sequence, the model predicts Gene Ontology (GO) terms in each of the three GO ontologies: Molecular Function (MFO), Biological Process (BPO), and Cellular Component (CCO).
What This Model Does
Given a protein sequence like:
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...
It predicts:
- Molecular Function (MFO): What the protein DOES (e.g., "protein binding", "kinase activity")
- Biological Process (BPO): What pathways it's involved in (e.g., "signal transduction")
- Cellular Component (CCO): WHERE it's located (e.g., "nucleus", "membrane")
Files in This Repository
- train_esm2_embeddings.pkl (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
- test_esm2_embeddings.pkl (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
- go_parser.pkl (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
- .gitattributes - Git LFS configuration for large files
Dataset Statistics
Training Data
- Total proteins: 82,404
- Total annotations: 537,027
- Unique GO terms: 26,125
Selected Terms for Prediction
- MFO: 500 most frequent terms
- BPO: 800 most frequent terms
- CCO: 400 most frequent terms
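A minimal sketch of how "most frequent terms" could be selected per ontology. The `(protein_id, go_term, ontology)` tuple layout and the `select_top_terms` helper are assumptions for illustration, not the repository's actual code.

```python
from collections import Counter, defaultdict


def select_top_terms(annotations, top_k):
    """Return the most frequent GO terms per ontology.

    annotations: iterable of (protein_id, go_term, ontology) tuples
                 (hypothetical layout, chosen for this example).
    top_k: dict mapping ontology name to the number of terms to keep.
    """
    counts = defaultdict(Counter)
    for _protein_id, go_term, ontology in annotations:
        counts[ontology][go_term] += 1
    return {
        ontology: [term for term, _ in counts[ontology].most_common(k)]
        for ontology, k in top_k.items()
    }


# Toy annotations; real data would use top_k = {"MFO": 500, "BPO": 800, "CCO": 400}
toy_annotations = [
    ("P1", "GO:0005515", "MFO"),
    ("P2", "GO:0005515", "MFO"),
    ("P2", "GO:0016301", "MFO"),
    ("P1", "GO:0005634", "CCO"),
    ("P3", "GO:0005634", "CCO"),
    ("P3", "GO:0016020", "CCO"),
]
selected = select_top_terms(toy_annotations, {"MFO": 1, "CCO": 1})
print(selected)
```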
Label Distribution
| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
|---|---|---|---|
| MFO | 49,751 (60.4%) | 54.2 | 89.2% |
| BPO | 44,382 (53.9%) | 6.6 | 99.2% |
| CCO | 58,505 (71.0%) | 36.5 | 90.9% |
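The sparsity column appears to follow from the other numbers. Assuming sparsity means the fraction of zero entries in the label matrix, and that the average label count is taken over the same rows, it equals one minus the average labels per protein divided by the number of selected terms. The check below is a sketch under that assumption; `label_sparsity` is a hypothetical helper.

```python
def label_sparsity(avg_labels_per_protein, num_terms):
    """Fraction of zero entries in a binary label matrix (assumed definition)."""
    return 1.0 - avg_labels_per_protein / num_terms


# (average labels per protein, number of selected terms) from the sections above
table = {"MFO": (54.2, 500), "BPO": (6.6, 800), "CCO": (36.5, 400)}

sparsity_pct = {
    name: round(100 * label_sparsity(avg, n_terms), 1)
    for name, (avg, n_terms) in table.items()
}
print(sparsity_pct)
```

Under this definition the computed values agree with the table (89.2%, 99.2%, 90.9%).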
Usage
Requirements
pip install torch biopython transformers huggingface_hub numpy
Quick Start - Load Embeddings
from huggingface_hub import hf_hub_download
import pickle
# Download embeddings
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl"
)
# Load embeddings (pickle can execute arbitrary code, so only load files you trust)
with open(embeddings_path, 'rb') as f:
    embeddings = pickle.load(f)
# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
print(f"Embedding dimension: {list(embeddings.values())[0].shape}")
Generate New Embeddings for Your Protein
from transformers import AutoTokenizer, EsmModel
import torch
# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
# Your protein sequence (replace the truncated placeholder below with the full sequence)
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."
# Generate embedding
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool over the token dimension (special tokens are included in the average)
embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]
print(f"Generated embedding shape: {embedding.shape}")
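When several sequences of different lengths are embedded in one batch, `padding=True` adds pad tokens, and a plain `mean(dim=1)` would average over them too. The sketch below shows one common alternative, attention-mask-weighted mean pooling. It is not necessarily how the published embeddings were computed, `masked_mean_pool` is a hypothetical helper, and the toy tensors stand in for real model output. Special tokens still count toward the average because their mask value is 1.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring positions where attention_mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [batch, seq, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1.0)                          # [batch, 1]
    return summed / counts


# Toy stand-in for model output: batch of 2, 4 tokens, hidden size 3
hidden = torch.ones(2, 4, 3)
hidden[1, 2:] = 100.0  # values at padded positions that should be ignored
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

With the ESM-2 snippet above, the call would be `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.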
Load GO Parser
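from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl"
)
# Load parser (if the parser is a custom class, its definition must be importable to unpickle it)
with open(parser_path, 'rb') as f:
    go_parser = pickle.load(f)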
# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")
Model Architecture
The prediction model uses a Multi-Layer Perceptron (MLP):
Input: ESM-2 Embeddings (1280-dim)
↓
[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
↓
[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
↓
[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
↓
[Dense Output] → Sigmoid
↓
Multi-label Predictions
Training Details:
- Loss: Binary Cross-Entropy with Logits
- Optimizer: Adam
- Learning Rate: 0.001 with ReduceLROnPlateau
- Early Stopping: Patience of 10 epochs
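The architecture and training settings above could be sketched in PyTorch roughly as follows. This is an illustration, not the repository's training code: the class name `GOTermMLP`, the batch size, and the random data are assumptions. Because the loss is Binary Cross-Entropy with Logits, the sketch has the network return raw logits and applies the sigmoid only at inference time. Early stopping is omitted for brevity.

```python
import torch
from torch import nn


class GOTermMLP(nn.Module):
    """Hypothetical MLP matching the layer sizes listed above."""

    def __init__(self, num_terms, input_dim=1280,
                 hidden_dims=(2048, 1024, 512), dropout=0.3):
        super().__init__()
        layers = []
        prev = input_dim
        for width in hidden_dims:
            layers += [nn.Linear(prev, width), nn.BatchNorm1d(width),
                       nn.ReLU(), nn.Dropout(dropout)]
            prev = width
        layers.append(nn.Linear(prev, num_terms))  # raw logits, no sigmoid here
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


mlp = GOTermMLP(num_terms=500)  # e.g. the 500 selected MFO terms
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

# Random stand-ins for a batch of embeddings and multi-label targets
features = torch.randn(8, 1280)
targets = torch.randint(0, 2, (8, 500)).float()

# One training step
mlp.train()
optimizer.zero_grad()
logits = mlp(features)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()
scheduler.step(loss.item())  # normally called with the validation loss per epoch

# Inference: convert logits to per-term probabilities
mlp.eval()
with torch.no_grad():
    probs = torch.sigmoid(mlp(features))
print(probs.shape)
```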
Data Processing Pipeline
- Raw Sequences (FASTA format) → Parse protein IDs and sequences
- ESM-2 Encoding → Generate 1280-dim embeddings using facebook/esm2_t33_650M_UR50D
- GO Annotations → Load and normalize GO terms
- Label Preparation → Create multi-label binary matrices with term propagation
- Model Training → Train separate models for MFO, BPO, CCO
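The label preparation step can be illustrated with a toy example. Term propagation usually means that a protein annotated with a GO term is also labeled with all of that term's ancestors. The sketch below assumes the hierarchy is available as a child-to-parents dictionary. The term names, the helper functions, and the data layout are placeholders, not the repository's API.

```python
def ancestors(term, parents):
    """Return all ancestors of a term in a DAG given as {child: [parents]}."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents):
    """Add every ancestor term to each protein's annotation set."""
    propagated = {}
    for protein, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        propagated[protein] = full
    return propagated


def to_binary_matrix(propagated, protein_ids, selected_terms):
    """Build one row per protein and one column per selected term."""
    return [
        [1 if term in propagated.get(pid, set()) else 0 for term in selected_terms]
        for pid in protein_ids
    ]


# Toy hierarchy with placeholder term names
toy_parents = {"child_a": ["mid"], "child_b": ["mid"], "mid": ["root"]}
toy_annotations = {"P1": {"child_a"}, "P2": {"root"}}
selected_terms = ["root", "mid", "child_a", "child_b"]

labels = to_binary_matrix(
    propagate(toy_annotations, toy_parents), ["P1", "P2"], selected_terms
)
print(labels)
```

In this toy case P1 is annotated only with `child_a` but also receives `mid` and `root` through propagation, while P2 keeps just `root`.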
Citation
@misc{nl45_cafa6_2026,
title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
author={nl45},
year={2026},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/nl45/Protein1}}
}
Acknowledgments
- CAFA Challenge: Critical Assessment of Functional Annotation
- ESM-2: Evolutionary Scale Modeling from Meta AI
- Gene Ontology Consortium: For GO term annotations
License
MIT License
Contact
For questions or collaboration: Create an issue
"BioBERT, I'm coming for you!" 🔥🧬