---
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
license: mit
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---
# 🧬 CAFA 6 Protein Function Prediction
> *"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."*
**BioBERT, I'm coming for you!** 🔥
## Model Description
Multi-label protein function prediction built on ESM-2 embeddings (`facebook/esm2_t33_650M_UR50D`). Given a protein's amino acid sequence, the model predicts Gene Ontology (GO) terms in each of the three GO ontologies.
### What This Model Does
Given a protein sequence like:
```
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...
```
It predicts:
- **Molecular Function (MFO)**: What the protein DOES (e.g., "protein binding", "kinase activity")
- **Biological Process (BPO)**: What pathways it's involved in (e.g., "signal transduction")
- **Cellular Component (CCO)**: WHERE it's located (e.g., "nucleus", "membrane")
## Files in This Repository
- `train_esm2_embeddings.pkl` (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
- `test_esm2_embeddings.pkl` (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
- `go_parser.pkl` (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
- `.gitattributes` - Git LFS configuration for large files
## Dataset Statistics
### Training Data
- **Total proteins**: 82,404
- **Total annotations**: 537,027
- **Unique GO terms**: 26,125
### Selected Terms for Prediction
- **MFO**: 500 most frequent terms
- **BPO**: 800 most frequent terms
- **CCO**: 400 most frequent terms
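Selecting the most frequent terms per ontology can be sketched as follows. This is a minimal illustration, not the repository's code: the `annotations` list, its `(protein_id, go_term, ontology)` layout, and the `select_top_terms` helper are assumptions made for the example, and the handful of records is toy data.

```python
from collections import Counter

# Hypothetical annotation records: (protein_id, go_term, ontology)
annotations = [
    ("P1", "GO:0005515", "MFO"),
    ("P2", "GO:0005515", "MFO"),
    ("P3", "GO:0005515", "MFO"),
    ("P1", "GO:0016301", "MFO"),
    ("P2", "GO:0016301", "MFO"),
    ("P3", "GO:0003677", "MFO"),
    ("P1", "GO:0005634", "CCO"),
    ("P2", "GO:0005634", "CCO"),
    ("P3", "GO:0016020", "CCO"),
]


def select_top_terms(records, ontology, k):
    """Return the k most frequently annotated GO terms in one ontology."""
    counts = Counter(term for _, term, ont in records if ont == ontology)
    return [term for term, _ in counts.most_common(k)]


top_mfo = select_top_terms(annotations, "MFO", k=2)
top_cco = select_top_terms(annotations, "CCO", k=1)
print(top_mfo)
print(top_cco)
```

With the settings listed above, `k` would be 500, 800, and 400 for MFO, BPO, and CCO respectively.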
### Label Distribution
| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
|----------|---------------------|-------------------|----------|
| MFO | 49,751 (60.4%) | 54.2 | 89.2% |
| BPO | 44,382 (53.9%) | 6.6 | 99.2% |
| CCO | 58,505 (71.0%) | 36.5 | 90.9% |
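The sparsity column appears consistent with 1 − (average labels per protein ÷ number of selected terms), using the term counts from the "Selected Terms for Prediction" section. The short check below compares that formula against the reported values. The formula is inferred from the numbers in this card, not taken from the training code, and `sparsity_percent` is a hypothetical helper.

```python
# (selected terms, average labels per protein, reported sparsity %) from this card
ontologies = {
    "MFO": (500, 54.2, 89.2),
    "BPO": (800, 6.6, 99.2),
    "CCO": (400, 36.5, 90.9),
}


def sparsity_percent(num_terms, avg_labels):
    """Share of zero entries in a proteins-by-terms label matrix, in percent."""
    return 100.0 * (1.0 - avg_labels / num_terms)


differences = {}
for name, (num_terms, avg_labels, reported) in ontologies.items():
    computed = sparsity_percent(num_terms, avg_labels)
    differences[name] = abs(computed - reported)
    print(f"{name}: computed {computed:.2f}%, reported {reported}%")
```

Under this assumption, all three computed values agree with the table to within rounding.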
## Usage
### Requirements
```bash
pip install torch biopython transformers huggingface_hub numpy
```
### Quick Start - Load Embeddings
```python
from huggingface_hub import hf_hub_download
import pickle

# Download the training embeddings from the Hub (cached locally after the first call)
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl",
)

# Load embeddings
with open(embeddings_path, "rb") as f:
    embeddings = pickle.load(f)

# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
print(f"Embedding dimension: {next(iter(embeddings.values())).shape}")
```
### Generate New Embeddings for Your Protein
```python
from transformers import AutoTokenizer, EsmModel
import torch

# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Your protein sequence (replace with the full amino acid string)
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."

# Generate embedding
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over the token dimension (special tokens included)
embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]
print(f"Generated embedding shape: {embedding.shape}")
```
### Load GO Parser
```python
from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl",
)

# Load parser
with open(parser_path, "rb") as f:
    go_parser = pickle.load(f)

# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")
```
## Model Architecture
The prediction model uses a Multi-Layer Perceptron (MLP):
```
Input: ESM-2 Embeddings (1280-dim)
        ↓
[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense Output] → Sigmoid
        ↓
Multi-label Predictions
```
**Training Details:**
- Loss: Binary Cross-Entropy with Logits
- Optimizer: Adam
- Learning Rate: 0.001 with ReduceLROnPlateau
- Early Stopping: Patience of 10 epochs
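The architecture and training settings above could be written in PyTorch roughly as follows. This is a sketch under stated assumptions, not the repository's training code: the class name `GOTermMLP`, the batch of random inputs, and the 500-term output size (the MFO setting) are illustrative. One assumption is worth flagging: because the listed loss is binary cross-entropy with logits, the sketch returns raw logits from `forward` and applies the sigmoid only at inference; the actual code may handle the final Sigmoid differently.

```python
import torch
import torch.nn as nn


class GOTermMLP(nn.Module):
    """Hypothetical MLP head: 1280 -> 2048 -> 1024 -> 512 -> num_terms."""

    def __init__(self, input_dim=1280, num_terms=500, dropout=0.3):
        super().__init__()
        layers = []
        prev = input_dim
        for width in (2048, 1024, 512):
            layers += [
                nn.Linear(prev, width),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            prev = width
        layers.append(nn.Linear(prev, num_terms))  # one logit per GO term
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # raw logits; sigmoid is applied outside


torch.manual_seed(0)
model = GOTermMLP(num_terms=500)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

# One training step on random stand-in data (8 proteins, 500 candidate terms)
x = torch.randn(8, 1280)
y = (torch.rand(8, 500) < 0.1).float()

model.train()
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step(loss.item())  # in practice this would be the validation loss

# Inference: switch to eval mode and convert logits to per-term probabilities
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(x))
print(probs.shape)
```

Early stopping with a patience of 10 epochs would wrap this step in an epoch loop that tracks the best validation loss; it is omitted here for brevity.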
## Data Processing Pipeline
1. **Raw Sequences** (FASTA format) → Parse protein IDs and sequences
2. **ESM-2 Encoding** → Generate 1280-dim embeddings using `facebook/esm2_t33_650M_UR50D`
3. **GO Annotations** → Load and normalize GO terms
4. **Label Preparation** → Create multi-label binary matrices with term propagation
5. **Model Training** → Train separate models for MFO, BPO, CCO
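Step 4 can be illustrated with a small sketch. It assumes "term propagation" means the usual GO convention of adding every ancestor of an annotated term. The `parents` map is a simplified toy hierarchy (not the real GO graph), and all helper names are hypothetical; the actual pipeline presumably reads the hierarchy from `go_parser.pkl`.

```python
# Toy child -> parents map, simplified for illustration
parents = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "protein binding": ["binding"],
    "binding": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, parents):
    """Collect all ancestors of a term by walking parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(terms, parents):
    """Return the annotated terms plus all of their ancestors."""
    result = set(terms)
    for term in terms:
        result |= ancestors(term, parents)
    return result


def build_label_matrix(protein_terms, selected_terms, parents):
    """Build one 0/1 row per protein (sorted by ID), one column per selected term."""
    index = {term: i for i, term in enumerate(selected_terms)}
    matrix = []
    for protein_id in sorted(protein_terms):
        row = [0] * len(selected_terms)
        for term in propagate(protein_terms[protein_id], parents):
            if term in index:  # terms outside the selected set are dropped
                row[index[term]] = 1
        matrix.append(row)
    return matrix


protein_terms = {"P1": {"kinase activity"}, "P2": {"protein binding"}}
selected_terms = ["molecular_function", "catalytic activity", "binding", "kinase activity"]
labels = build_label_matrix(protein_terms, selected_terms, parents)
for row in labels:
    print(row)
```

Propagation is one reason a protein can carry many more labels than it has direct annotations: each annotated term also switches on every selected ancestor.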
## Citation
```bibtex
@misc{nl45_cafa6_2026,
title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
author={nl45},
year={2026},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/nl45/Protein1}}
}
```
## Acknowledgments
- **CAFA Challenge**: Critical Assessment of Functional Annotation
- **ESM-2**: Evolutionary Scale Modeling from Meta AI
- **Gene Ontology Consortium**: For GO term annotations
## License
MIT License
## Contact
For questions or collaboration: [Open a discussion](https://huggingface.co/nl45/Protein1/discussions)
---
**"BioBERT, I'm coming for you!"** 🔥🧬