---
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
license: mit
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---

# 🧬 CAFA 6 Protein Function Prediction

> *"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."*

**BioBERT, I'm coming for you!** 🔥

## Model Description

Multi-label protein function prediction using ESM-2 embeddings. Given a protein's amino acid sequence, the model predicts Gene Ontology (GO) terms across the three GO ontologies.

### What This Model Does

Given a protein sequence like:
```
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...
```

It predicts:
- **Molecular Function (MFO)**: What the protein DOES (e.g., "protein binding", "kinase activity")
- **Biological Process (BPO)**: What pathways it's involved in (e.g., "signal transduction")
- **Cellular Component (CCO)**: WHERE it's located (e.g., "nucleus", "membrane")

## Files in This Repository

- `train_esm2_embeddings.pkl` (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
- `test_esm2_embeddings.pkl` (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins  
- `go_parser.pkl` (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
- `.gitattributes` - Git LFS configuration for large files

## Dataset Statistics

### Training Data
- **Total proteins**: 82,404
- **Total annotations**: 537,027
- **Unique GO terms**: 26,125

### Selected Terms for Prediction
- **MFO**: 500 most frequent terms
- **BPO**: 800 most frequent terms  
- **CCO**: 400 most frequent terms
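
A minimal sketch of how the "most frequent terms" per ontology could be selected. It assumes annotations are available as `(protein_id, go_term, ontology)` tuples; that layout, the function name `select_top_terms`, and the toy protein IDs are illustrative assumptions, not the repository's actual code.

```python
from collections import Counter, defaultdict

# Number of terms kept per ontology (values from the list above).
TOP_K = {"MFO": 500, "BPO": 800, "CCO": 400}


def select_top_terms(annotations, top_k=TOP_K):
    """Return {ontology: [go_term, ...]} with the most frequent terms first.

    `annotations` is an iterable of (protein_id, go_term, ontology) tuples;
    this layout is an assumption made for illustration.
    """
    counts = defaultdict(Counter)
    for _protein_id, go_term, ontology in annotations:
        counts[ontology][go_term] += 1
    return {
        ontology: [term for term, _ in counter.most_common(top_k[ontology])]
        for ontology, counter in counts.items()
    }


# Toy example with made-up protein IDs and a cutoff of 1 term per ontology.
toy = [
    ("P1", "GO:0005515", "MFO"),
    ("P2", "GO:0005515", "MFO"),
    ("P2", "GO:0016301", "MFO"),
    ("P1", "GO:0005634", "CCO"),
]
selected = select_top_terms(toy, top_k={"MFO": 1, "CCO": 1})
print(selected)  # {'MFO': ['GO:0005515'], 'CCO': ['GO:0005634']}
```

Restricting each ontology to its most frequent terms keeps the output layer small, at the cost of never predicting rarer terms.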

### Label Distribution
| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
|----------|---------------------|-------------------|----------|
| MFO      | 49,751 (60.4%)      | 54.2              | 89.2%    |
| BPO      | 44,382 (53.9%)      | 6.6               | 99.2%    |
| CCO      | 58,505 (71.0%)      | 36.5              | 90.9%    |
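
The sparsity column is consistent with the fraction of zero entries per label row, i.e. `1 - avg_labels / num_selected_terms`. The short check below reproduces the table values under that assumption; the formula is inferred from the numbers, not stated by the source.

```python
# Values copied from the table and the "Selected Terms for Prediction" list above.
stats = {
    "MFO": {"avg_labels": 54.2, "num_terms": 500},
    "BPO": {"avg_labels": 6.6, "num_terms": 800},
    "CCO": {"avg_labels": 36.5, "num_terms": 400},
}


def sparsity_percent(avg_labels, num_terms):
    """Average percentage of zero entries in one row of the binary label matrix."""
    return 100.0 * (1.0 - avg_labels / num_terms)


for name, s in stats.items():
    print(f"{name}: {sparsity_percent(**s):.1f}%")
# MFO: 89.2%
# BPO: 99.2%
# CCO: 90.9%
```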

## Usage

### Requirements

```bash
pip install torch biopython transformers huggingface_hub numpy
```

### Quick Start - Load Embeddings

```python
from huggingface_hub import hf_hub_download
import pickle

# Download embeddings
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl"
)

# Load embeddings
with open(embeddings_path, 'rb') as f:
    embeddings = pickle.load(f)

# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
print(f"Embedding dimension: {next(iter(embeddings.values())).shape}")
```

### Generate New Embeddings for Your Protein

```python
from transformers import AutoTokenizer, EsmModel
import torch

# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Your protein sequence (replace the truncated placeholder with the full sequence)
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."

# Generate embedding: mean over all token positions, special tokens included
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]

print(f"Generated embedding shape: {embedding.shape}")
```
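
When several sequences are embedded in one batch, `padding=True` adds padding tokens, and a plain `.mean(dim=1)` averages over them too. The sketch below shows mask-aware mean pooling on dummy tensors. `masked_mean_pool` is a hypothetical helper, and the source does not say how the pre-computed embeddings in this repository were pooled.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over positions where attention_mask == 1.

    hidden_states: [batch, seq_len, dim]
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq_len, 1]
    summed = (hidden_states * mask).sum(dim=1)                   # [batch, dim]
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts


# Dummy batch: 2 sequences, 4 positions, 3-dim embeddings.
# The second sequence has 2 padded positions filled with values that must be ignored.
hidden = torch.ones(2, 4, 3)
hidden[1, 2:] = 100.0
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 3])
```

With the real model, the call would be `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. The attention mask still marks the tokenizer's special tokens as real tokens, so they remain part of the average.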

### Load GO Parser

```python
from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl"
)

# Load parser (unpickling a custom object requires its class definition to be importable)
with open(parser_path, 'rb') as f:
    go_parser = pickle.load(f)

# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")
```

## Model Architecture

The prediction model uses a Multi-Layer Perceptron (MLP):

```
Input: ESM-2 Embeddings (1280-dim)
    ↓
[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense Output] → Sigmoid
    ↓
Multi-label Predictions
```

**Training Details:**
- Loss: Binary Cross-Entropy with Logits (this loss expects raw logits, so the sigmoid in the diagram applies at inference time)
- Optimizer: Adam
- Learning Rate: 0.001 with ReduceLROnPlateau
- Early Stopping: Patience of 10 epochs
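
A minimal PyTorch sketch of the architecture and training setup described above. The class name `GOTermMLP`, the batch size, and the random stand-in data are assumptions for illustration; the repository's actual training code is not shown in this card. The sigmoid is left out of the module because `nn.BCEWithLogitsLoss` applies it internally, and early stopping is omitted for brevity.

```python
import torch
from torch import nn


class GOTermMLP(nn.Module):
    """MLP over ESM-2 embeddings; returns raw logits (no sigmoid)."""

    def __init__(self, input_dim: int = 1280, num_terms: int = 500, dropout: float = 0.3):
        super().__init__()
        layers = []
        prev = input_dim
        for width in (2048, 1024, 512):
            layers += [nn.Linear(prev, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(dropout)]
            prev = width
        layers.append(nn.Linear(prev, num_terms))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = GOTermMLP(num_terms=500)  # e.g. one model for the 500 MFO terms
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

# One training step on random stand-in data (not real embeddings or labels).
x = torch.randn(8, 1280)
y = torch.randint(0, 2, (8, 500)).float()
model.train()
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step(loss.item())  # in practice, pass the validation loss here

# Inference: apply the sigmoid explicitly to get per-term probabilities.
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(x))
print(probs.shape)  # torch.Size([8, 500])
```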

## Data Processing Pipeline

1. **Raw Sequences** (FASTA format) → Parse protein IDs and sequences
2. **ESM-2 Encoding** → Generate 1280-dim embeddings using `facebook/esm2_t33_650M_UR50D`
3. **GO Annotations** → Load and normalize GO terms
4. **Label Preparation** → Create multi-label binary matrices with term propagation
5. **Model Training** → Train separate models for MFO, BPO, CCO
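
Step 4 can be sketched as follows, assuming "term propagation" means that a protein annotated with a GO term also receives that term's ancestors as labels. The helper names `ancestors` and `build_label_matrix`, the plain-dict hierarchy, and the placeholder term IDs are hypothetical; the repository's `go_parser` may expose this differently.

```python
def ancestors(term, parents):
    """Return all ancestors of `term` given a {term: set_of_parents} mapping."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_label_matrix(protein_terms, selected_terms, parents):
    """Build a binary label matrix (list of rows) with ancestor propagation.

    Rows follow the iteration order of `protein_terms`;
    columns follow the order of `selected_terms`.
    """
    column = {term: i for i, term in enumerate(selected_terms)}
    matrix = []
    for terms in protein_terms.values():
        propagated = set(terms)
        for term in terms:
            propagated |= ancestors(term, parents)
        row = [0] * len(selected_terms)
        for term in propagated:
            if term in column:  # terms outside the selected set are dropped
                row[column[term]] = 1
        matrix.append(row)
    return matrix


# Toy hierarchy with placeholder IDs: term_leaf -> term_mid -> term_root.
parents = {"term_leaf": {"term_mid"}, "term_mid": {"term_root"}}
protein_terms = {"P1": ["term_leaf"], "P2": ["term_mid"]}
selected_terms = ["term_root", "term_mid", "term_leaf"]
labels = build_label_matrix(protein_terms, selected_terms, parents)
print(labels)  # [[1, 1, 1], [1, 1, 0]]
```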

## Citation

```bibtex
@misc{nl45_cafa6_2026,
  title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
  author={nl45},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nl45/Protein1}}
}
```

## Acknowledgments

- **CAFA Challenge**: Critical Assessment of Functional Annotation
- **ESM-2**: Evolutionary Scale Modeling from Meta AI
- **Gene Ontology Consortium**: For GO term annotations

## License

MIT License

## Contact

For questions or collaboration: [Open a discussion](https://huggingface.co/nl45/Protein1/discussions)

---

**"BioBERT, I'm coming for you!"** 🔥🧬