---
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
license: mit
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---

🧬 CAFA 6 Protein Function Prediction

"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."

BioBERT, I'm coming for you! 🔥

Model Description

Multi-label protein function prediction built on ESM-2 embeddings. Given a protein's amino acid sequence, the model predicts Gene Ontology (GO) terms in each of the three GO sub-ontologies.

What This Model Does

Given a protein sequence like:

MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...

It predicts:

Molecular Function (MFO): What the protein DOES (e.g., "protein binding", "kinase activity")
Biological Process (BPO): What pathways it's involved in (e.g., "signal transduction")
Cellular Component (CCO): WHERE it's located (e.g., "nucleus", "membrane")

Files in This Repository

train_esm2_embeddings.pkl (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
test_esm2_embeddings.pkl (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
go_parser.pkl (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
.gitattributes - Git LFS configuration for large files

Dataset Statistics

Training Data

Total proteins: 82,404
Total annotations: 537,027
Unique GO terms: 26,125

Selected Terms for Prediction

MFO: 500 most frequent terms
BPO: 800 most frequent terms
CCO: 400 most frequent terms
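
Selecting the most frequent terms per ontology can be sketched with a counter. This is a minimal illustration, not the repository's actual code: the layout of `annotations` as (protein ID, GO term, ontology) tuples and the helper name `select_top_terms` are assumptions made for the example.

```python
from collections import Counter

def select_top_terms(annotations, ontology, k):
    """Return the k most frequent GO terms for one ontology.

    `annotations` is assumed to be an iterable of
    (protein_id, go_term, ontology) tuples; this layout is hypothetical.
    """
    counts = Counter(term for _, term, onto in annotations if onto == ontology)
    return [term for term, _ in counts.most_common(k)]

# Toy data for illustration only; the GO IDs are used as plain labels here.
toy = [
    ("P1", "GO:0005515", "MFO"),
    ("P2", "GO:0005515", "MFO"),
    ("P2", "GO:0016301", "MFO"),
    ("P3", "GO:0005634", "CCO"),
]
top_mfo = select_top_terms(toy, "MFO", k=1)
print(top_mfo)
```

With the real annotation table, `k` would be 500, 800, and 400 for MFO, BPO, and CCO respectively.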

Label Distribution

Ontology	Proteins with Labels	Avg Labels/Protein	Sparsity
MFO	49,751 (60.4%)	54.2	89.2%
BPO	44,382 (53.9%)	6.6	99.2%
CCO	58,505 (71.0%)	36.5	90.9%
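
The Sparsity column appears consistent with one minus the average label count divided by the number of selected terms for that ontology. The short check below reproduces the column under that assumption; the formula is inferred from the numbers and is not stated by the author.

```python
# Selected-term counts and average labels per protein, taken from the tables above.
selected_terms = {"MFO": 500, "BPO": 800, "CCO": 400}
avg_labels = {"MFO": 54.2, "BPO": 6.6, "CCO": 36.5}

def sparsity_percent(avg, n_terms):
    """Share of zero entries in a label matrix, as a percentage (assumed definition)."""
    return 100.0 * (1.0 - avg / n_terms)

sparsity = {
    onto: round(sparsity_percent(avg_labels[onto], selected_terms[onto]), 1)
    for onto in selected_terms
}
print(sparsity)
```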

Usage

Requirements

pip install torch biopython transformers huggingface_hub numpy

Quick Start - Load Embeddings

from huggingface_hub import hf_hub_download
import pickle

# Download embeddings
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl"
)

# Load embeddings
with open(embeddings_path, 'rb') as f:
    embeddings = pickle.load(f)

# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
print(f"Embedding dimension: {list(embeddings.values())[0].shape}")

Generate New Embeddings for Your Protein

from transformers import AutoTokenizer, EsmModel
import torch

# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Your protein sequence (the trailing "..." marks a truncated example; use the full sequence)
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."

# Generate embedding
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    # Mean over all token positions, special tokens included; shape: [1, 1280]
    embedding = outputs.last_hidden_state.mean(dim=1)

print(f"Generated embedding shape: {embedding.shape}")
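
The snippet above averages every token position, which is fine for a single sequence. When several sequences of different lengths are embedded in one batch, padded positions would be averaged in as well. The following is a minimal sketch of mask-aware mean pooling for that case; the helper name `masked_mean_pool` is hypothetical, and whether the precomputed embeddings in this repository were pooled this way is not stated.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions.

    last_hidden_state: [batch, seq_len, hidden]
    attention_mask:    [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Synthetic check: the second sequence has one padded position.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0]],
                       [[2.0, 2.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1], [1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With the tokenizer output above, this would be called as `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. The attention mask still marks the special start and end tokens as real tokens, so they remain part of the average.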

Load GO Parser

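from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl"
)

# Load parser (unpickling a custom object requires its class definition to be importable)
with open(parser_path, 'rb') as f:
    go_parser = pickle.load(f)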

# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")

Model Architecture

The prediction model uses a Multi-Layer Perceptron (MLP):

Input: ESM-2 Embeddings (1280-dim)
    ↓
[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense Output] → Sigmoid
    ↓
Multi-label Predictions
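
The diagram above can be sketched in PyTorch roughly as follows. This is an illustration under assumptions, not the author's training code: the function name `build_mlp` and the `num_terms` parameter are hypothetical, and the network returns raw logits with the sigmoid applied separately, which is the usual pairing with a logits-based BCE loss.

```python
import torch
from torch import nn

def build_mlp(input_dim=1280, num_terms=500, dropout=0.3):
    """MLP mirroring the diagram; returns raw logits (no final Sigmoid)."""
    layers = []
    prev = input_dim
    for width in (2048, 1024, 512):
        layers += [
            nn.Linear(prev, width),
            nn.BatchNorm1d(width),
            nn.ReLU(),
            nn.Dropout(dropout),
        ]
        prev = width
    layers.append(nn.Linear(prev, num_terms))
    return nn.Sequential(*layers)

model = build_mlp(num_terms=500)   # e.g. the 500 selected MFO terms
model.eval()                       # disable dropout, use BatchNorm running statistics
with torch.no_grad():
    logits = model(torch.randn(4, 1280))
    probs = torch.sigmoid(logits)  # sigmoid applied at inference time
print(probs.shape)
```

Since separate models are trained per ontology, `num_terms` would be 500, 800, or 400 depending on the ontology.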

Training Details:

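Loss: Binary Cross-Entropy with Logits, which applies the sigmoid internally; the Sigmoid in the diagram above would then be applied only at inference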
Optimizer: Adam
Learning Rate: 0.001 with ReduceLROnPlateau
Early Stopping: Patience of 10 epochs
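
The early-stopping rule can be sketched as a small helper. This is a minimal version under assumptions: the class name `EarlyStopping`, the choice of validation loss as the monitored quantity, and the `min_delta` parameter are not taken from the repository.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run: the loss improves for three epochs, then plateaus.
stopper = EarlyStopping(patience=10)
losses = [0.9, 0.7, 0.6] + [0.6] * 20
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)
```

In a real training loop, `step` would be called once per epoch after validation, alongside the learning-rate scheduler.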

Data Processing Pipeline

Raw Sequences (FASTA format) → Parse protein IDs and sequences
ESM-2 Encoding → Generate 1280-dim embeddings using facebook/esm2_t33_650M_UR50D
GO Annotations → Load and normalize GO terms
Label Preparation → Create multi-label binary matrices with term propagation
Model Training → Train separate models for MFO, BPO, CCO
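
The label preparation step, with term propagation, can be illustrated as follows. This is a sketch under assumptions rather than the repository's code: the `parents` mapping, the function names, and the toy term names are all made up for the example. Propagation here means that a protein annotated with a term also receives every ancestor of that term in the GO graph.

```python
def propagate(terms, parents):
    """Return `terms` plus all of their ancestors in the GO graph.

    `parents` maps a term to its direct parents (hypothetical structure).
    """
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen

def build_label_matrix(protein_terms, parents, selected_terms):
    """Build one 0/1 row per protein (sorted by ID) over `selected_terms` columns."""
    column = {term: i for i, term in enumerate(selected_terms)}
    matrix = []
    for protein_id in sorted(protein_terms):
        row = [0] * len(selected_terms)
        for term in propagate(protein_terms[protein_id], parents):
            if term in column:
                row[column[term]] = 1
        matrix.append(row)
    return matrix

# Toy hierarchy with made-up term names: C is a child of B, B is a child of A.
parents = {"C": ["B"], "B": ["A"]}
protein_terms = {"P1": {"C"}, "P2": {"B"}}
labels = build_label_matrix(protein_terms, parents, ["A", "B", "C"])
print(labels)
```

Terms outside the selected set are dropped after propagation, so each ontology's matrix has a fixed number of columns.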

Citation

@misc{nl45_cafa6_2026,
  title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
  author={nl45},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nl45/Protein1}}
}

Acknowledgments

CAFA Challenge: Critical Assessment of Functional Annotation
ESM-2: Evolutionary Scale Modeling from Meta AI
Gene Ontology Consortium: For GO term annotations

License

MIT License

Contact

For questions or collaboration, open an issue on this repository.

"BioBERT, I'm coming for you!" 🔥🧬