---
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
license: mit
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---
# 🧬 CAFA 6 Protein Function Prediction
> *"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."*
**BioBERT, I'm coming for you!** πŸ”₯
## Model Description
State-of-the-art multi-label protein function prediction using ESM-2 embeddings. Predicts Gene Ontology (GO) terms across three ontologies from protein amino acid sequences.
### What This Model Does
Given a protein sequence like:
```
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...
```
It predicts:
- **Molecular Function (MFO)**: What the protein DOES (e.g., "protein binding", "kinase activity")
- **Biological Process (BPO)**: What pathways it's involved in (e.g., "signal transduction")
- **Cellular Component (CCO)**: WHERE it's located (e.g., "nucleus", "membrane")
## Files in This Repository
- `train_esm2_embeddings.pkl` (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
- `test_esm2_embeddings.pkl` (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
- `go_parser.pkl` (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
- `.gitattributes` - Git LFS configuration for large files
## Dataset Statistics
### Training Data
- **Total proteins**: 82,404
- **Total annotations**: 537,027
- **Unique GO terms**: 26,125
### Selected Terms for Prediction
- **MFO**: 500 most frequent terms
- **BPO**: 800 most frequent terms
- **CCO**: 400 most frequent terms
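Selecting the most frequent terms per ontology can be sketched with a frequency count over the annotation pairs. A minimal illustration (the helper `select_top_terms` and the toy annotations are hypothetical, not from the repository):

```python
from collections import Counter

def select_top_terms(annotations, n):
    """Return the n most frequent GO terms from (protein_id, go_term) pairs."""
    counts = Counter(term for _, term in annotations)
    return [term for term, _ in counts.most_common(n)]

# Toy (protein_id, GO term) annotation pairs for illustration
annotations = [
    ("P1", "GO:0005515"), ("P2", "GO:0005515"),
    ("P1", "GO:0016301"), ("P3", "GO:0005515"),
]
top = select_top_terms(annotations, 1)
```

For the real data, `n` would be 500, 800, or 400 depending on the ontology.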
### Label Distribution
| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
|----------|---------------------|-------------------|----------|
| MFO | 49,751 (60.4%) | 54.2 | 89.2% |
| BPO | 44,382 (53.9%) | 6.6 | 99.2% |
| CCO | 58,505 (71.0%) | 36.5 | 90.9% |
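The sparsity column is the fraction of zeros in the protein × term binary label matrix. A quick sanity check on a toy matrix (illustrative values, not real data):

```python
import numpy as np

# 2 proteins × 4 terms; each protein carries one label,
# so 6 of 8 entries are zero -> sparsity 0.75
labels = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
])
sparsity = 1.0 - labels.mean()
```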
## Usage
### Requirements
```bash
pip install torch biopython transformers huggingface_hub numpy
```
### Quick Start - Load Embeddings
```python
from huggingface_hub import hf_hub_download
import pickle
# Download embeddings
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl",
)

# Load embeddings
with open(embeddings_path, "rb") as f:
    embeddings = pickle.load(f)

# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
print(f"Embedding dimension: {next(iter(embeddings.values())).shape}")
```
### Generate New Embeddings for Your Protein
```python
from transformers import AutoTokenizer, EsmModel
import torch
# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
# Your protein sequence
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."
# Generate embedding
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]
print(f"Generated embedding shape: {embedding.shape}")
```
### Load GO Parser
```python
from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl",
)

# Load parser (unpickling requires the parser's class definition on your import path)
with open(parser_path, "rb") as f:
    go_parser = pickle.load(f)
# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")
```
## Model Architecture
The prediction model uses a Multi-Layer Perceptron (MLP):
```
Input: ESM-2 Embeddings (1280-dim)
↓
[Dense 2048] β†’ BatchNorm β†’ ReLU β†’ Dropout(0.3)
↓
[Dense 1024] β†’ BatchNorm β†’ ReLU β†’ Dropout(0.3)
↓
[Dense 512] β†’ BatchNorm β†’ ReLU β†’ Dropout(0.3)
↓
[Dense Output] β†’ Sigmoid
↓
Multi-label Predictions
```
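The diagram above can be sketched in PyTorch as follows. This is a minimal illustration: the class name `GOTermMLP` and its constructor arguments are assumptions, not code from the repository; only the layer widths, BatchNorm/ReLU/Dropout pattern, and 1280-dim input come from the diagram.

```python
import torch
import torch.nn as nn

class GOTermMLP(nn.Module):
    """Sketch of the MLP above; num_terms is the number of selected
    GO terms for one ontology (e.g. 500 for MFO)."""
    def __init__(self, num_terms, embed_dim=1280, dropout=0.3):
        super().__init__()
        layers, in_dim = [], embed_dim
        for width in (2048, 1024, 512):
            layers += [
                nn.Linear(in_dim, width),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            in_dim = width
        layers.append(nn.Linear(in_dim, num_terms))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Returns raw logits; apply torch.sigmoid() at inference time.
        return self.net(x)

model = GOTermMLP(num_terms=500)
logits = model(torch.randn(4, 1280))  # batch of 4 ESM-2 embeddings
```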
**Training Details:**
- Loss: Binary Cross-Entropy with Logits (the model outputs raw logits during training; the sigmoid shown above is applied at inference)
- Optimizer: Adam
- Learning Rate: 0.001 with ReduceLROnPlateau
- Early Stopping: Patience of 10 epochs
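The training settings above combine as sketched below. The `train` helper, its loaders, and the smoke-test shapes are placeholders for illustration; only the loss, optimizer, scheduler, and patience come from the list above.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=100, patience=10):
    criterion = nn.BCEWithLogitsLoss()  # sigmoid folded into the loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    best_val, stale = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(x), y).item() for x, y in val_loader)
        scheduler.step(val)  # reduce LR when validation loss plateaus
        if val < best_val:
            best_val, stale = val, 0
        else:
            stale += 1
            if stale >= patience:  # early stopping
                break
    return best_val

# Smoke run with synthetic data (toy shapes, not the real task)
toy_model = nn.Sequential(nn.Linear(8, 3))
batch = [(torch.randn(4, 8), torch.randint(0, 2, (4, 3)).float())]
best = train(toy_model, batch, batch, epochs=2)
```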
## Data Processing Pipeline
1. **Raw Sequences** (FASTA format) β†’ Parse protein IDs and sequences
2. **ESM-2 Encoding** β†’ Generate 1280-dim embeddings using `facebook/esm2_t33_650M_UR50D`
3. **GO Annotations** β†’ Load and normalize GO terms
4. **Label Preparation** β†’ Create multi-label binary matrices with term propagation
5. **Model Training** β†’ Train separate models for MFO, BPO, CCO
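Term propagation in step 4 follows the GO true-path rule: a protein annotated with a term implicitly carries every ancestor of that term, so labels are propagated up the hierarchy before building the binary matrices. A minimal sketch (the `propagate` helper and toy parent map are illustrative, not the real ontology):

```python
def propagate(terms, parents):
    """Return the input GO terms plus all of their ancestors."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))  # walk up the hierarchy
    return seen

# Toy hierarchy: GO:C is_a GO:B is_a GO:A
parents = {"GO:C": ["GO:B"], "GO:B": ["GO:A"]}
labels = propagate({"GO:C"}, parents)
```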
## Citation
```bibtex
@misc{nl45_cafa6_2026,
  title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
  author={nl45},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nl45/Protein1}}
}
```
## Acknowledgments
- **CAFA Challenge**: Critical Assessment of Functional Annotation
- **ESM-2**: Evolutionary Scale Modeling from Meta AI
- **Gene Ontology Consortium**: For GO term annotations
## License
MIT License
## Contact
For questions or collaboration: [Create an issue](https://huggingface.co/nl45/Protein1/discussions)
---
**"BioBERT, I'm coming for you!"** πŸ”₯🧬