---
language: en
license: mit
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---
|
|
|
|
|
# 🧬 CAFA 6 Protein Function Prediction
|
|
|
|
|
> *"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."* |
|
|
|
|
|
**BioBERT, I'm coming for you!** 🔥
|
|
|
|
|
## Model Description |
|
|
|
|
|
Multi-label protein function prediction using ESM-2 embeddings. Predicts Gene Ontology (GO) terms across all three GO ontologies from protein amino acid sequences.
|
|
|
|
|
### What This Model Does |
|
|
|
|
|
Given a protein sequence like: |
|
|
``` |
|
|
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH... |
|
|
``` |
|
|
|
|
|
It predicts: |
|
|
- **Molecular Function (MFO)**: What the protein DOES (e.g., "protein binding", "kinase activity") |
|
|
- **Biological Process (BPO)**: What pathways it's involved in (e.g., "signal transduction") |
|
|
- **Cellular Component (CCO)**: WHERE it's located (e.g., "nucleus", "membrane") |
|
|
|
|
|
## Files in This Repository |
|
|
|
|
|
- `train_esm2_embeddings.pkl` (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins |
|
|
- `test_esm2_embeddings.pkl` (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins |
|
|
- `go_parser.pkl` (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms |
|
|
- `.gitattributes` - Git LFS configuration for large files |
|
|
|
|
|
## Dataset Statistics |
|
|
|
|
|
### Training Data |
|
|
- **Total proteins**: 82,404 |
|
|
- **Total annotations**: 537,027 |
|
|
- **Unique GO terms**: 26,125 |
|
|
|
|
|
### Selected Terms for Prediction |
|
|
- **MFO**: 500 most frequent terms |
|
|
- **BPO**: 800 most frequent terms |
|
|
- **CCO**: 400 most frequent terms |
|
|
|
|
|
### Label Distribution |
|
|
| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity | |
|
|
|----------|---------------------|-------------------|----------| |
|
|
| MFO | 49,751 (60.4%) | 54.2 | 89.2% | |
|
|
| BPO | 44,382 (53.9%) | 6.6 | 99.2% | |
|
|
| CCO | 58,505 (71.0%) | 36.5 | 90.9% | |
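As a consistency check on the table, sparsity here appears to be the fraction of zero entries among labeled proteins, so sparsity ≈ 1 − (avg labels)/(num selected terms): e.g. 1 − 54.2/500 ≈ 89.2% for MFO and 1 − 36.5/400 ≈ 90.9% for CCO. A minimal sketch of how such statistics can be computed from a binary label matrix (toy data; the real matrices are 82,404 proteins × 500/800/400 terms):

```python
import numpy as np

# Toy binary label matrix (rows = proteins, columns = selected GO terms)
Y = np.array([[1, 0, 1],
              [0, 0, 0],
              [1, 1, 1]], dtype=np.float32)

labeled = Y.any(axis=1)                     # proteins with at least one label
coverage = labeled.mean()                   # "Proteins with Labels" fraction
avg_labels = Y[labeled].sum(axis=1).mean()  # "Avg Labels/Protein" (labeled only)
sparsity = 1.0 - Y[labeled].mean()          # fraction of zeros among labeled rows

print(coverage, avg_labels, sparsity)
```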
|
|
|
|
|
## Usage |
|
|
|
|
|
### Requirements |
|
|
|
|
|
```bash |
|
|
pip install torch biopython transformers huggingface_hub numpy |
|
|
``` |
|
|
|
|
|
### Quick Start - Load Embeddings |
|
|
|
|
|
```python |
|
|
from huggingface_hub import hf_hub_download |
|
|
import pickle |
|
|
|
|
|
# Download embeddings |
|
|
embeddings_path = hf_hub_download( |
|
|
repo_id="nl45/Protein1", |
|
|
filename="train_esm2_embeddings.pkl" |
|
|
) |
|
|
|
|
|
# Load embeddings |
|
|
with open(embeddings_path, 'rb') as f: |
|
|
embeddings = pickle.load(f) |
|
|
|
|
|
# embeddings is a dict: {protein_id: embedding_vector} |
|
|
print(f"Loaded embeddings for {len(embeddings)} proteins") |
|
|
print(f"Embedding dimension: {next(iter(embeddings.values())).shape}")
|
|
``` |
|
|
|
|
|
### Generate New Embeddings for Your Protein |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, EsmModel |
|
|
import torch |
|
|
|
|
|
# Load ESM-2 model |
|
|
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D") |
|
|
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()  # disable dropout so embeddings are deterministic
|
|
|
|
|
# Your protein sequence |
|
|
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..." |
|
|
|
|
|
# Generate embedding by mean-pooling the token representations
# (note: this simple mean also includes the BOS/EOS special tokens)
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]
|
|
|
|
|
print(f"Generated embedding shape: {embedding.shape}") |
|
|
``` |
|
|
|
|
|
### Load GO Parser |
|
|
|
|
|
```python |
|
|
# Download GO parser |
|
|
parser_path = hf_hub_download( |
|
|
repo_id="nl45/Protein1", |
|
|
filename="go_parser.pkl" |
|
|
) |
|
|
|
|
|
# Load parser |
|
|
with open(parser_path, 'rb') as f: |
|
|
go_parser = pickle.load(f) |
|
|
|
|
|
# Example: look up the MFO root term ("molecular_function")
term_info = go_parser.get_term_info("GO:0003674")
|
|
print(f"Term: {term_info['name']}") |
|
|
print(f"Namespace: {term_info['namespace']}") |
|
|
``` |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The prediction model uses a Multi-Layer Perceptron (MLP): |
|
|
|
|
|
``` |
|
|
Input: ESM-2 Embeddings (1280-dim)
        ↓
[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense Output] → Sigmoid
        ↓
Multi-label Predictions
|
|
``` |
|
|
|
|
|
**Training Details:** |
|
|
- Loss: Binary Cross-Entropy with logits (`BCEWithLogitsLoss`); the Sigmoid in the diagram is applied at inference time
|
|
- Optimizer: Adam |
|
|
- Learning Rate: 0.001 with ReduceLROnPlateau |
|
|
- Early Stopping: Patience of 10 epochs |
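The architecture above can be sketched in PyTorch as follows (class and variable names are illustrative, not taken from a released training script; one such model is trained per ontology, with `num_labels` set to 500 for MFO, 800 for BPO, or 400 for CCO):

```python
import torch
import torch.nn as nn

class GOTermMLP(nn.Module):
    """MLP head over ESM-2 embeddings, per the diagram above."""
    def __init__(self, embed_dim=1280, num_labels=500, dropout=0.3):
        super().__init__()
        layers, in_dim = [], embed_dim
        for hidden in (2048, 1024, 512):
            layers += [
                nn.Linear(in_dim, hidden),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, num_labels))  # raw logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Returns logits: use BCEWithLogitsLoss during training,
        # and apply torch.sigmoid() at inference.
        return self.net(x)

model = GOTermMLP(num_labels=500)
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(torch.randn(4, 1280)))
print(probs.shape)  # torch.Size([4, 500])
```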
|
|
|
|
|
## Data Processing Pipeline |
|
|
|
|
|
1. **Raw Sequences** (FASTA format) → Parse protein IDs and sequences
2. **ESM-2 Encoding** → Generate 1280-dim embeddings using `facebook/esm2_t33_650M_UR50D`
3. **GO Annotations** → Load and normalize GO terms
4. **Label Preparation** → Create multi-label binary matrices with term propagation
5. **Model Training** → Train separate models for MFO, BPO, CCO
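The term-propagation step in the pipeline follows the GO true-path rule: a protein annotated with a term is implicitly annotated with every ancestor of that term. A minimal sketch with a toy ancestor map (in practice the ancestor relations come from `go_parser`; the term IDs below are made up):

```python
import numpy as np

# Toy ancestor map standing in for the real GO hierarchy
ancestors = {
    "GO:B": {"GO:A"},          # B is_a A
    "GO:C": {"GO:B", "GO:A"},  # C is_a B is_a A
    "GO:A": set(),
}

def propagate(terms, ancestors):
    """Expand a term set with all ancestors (true-path rule)."""
    full = set(terms)
    for t in terms:
        full |= ancestors.get(t, set())
    return full

selected = ["GO:A", "GO:B", "GO:C"]              # terms kept for prediction
col = {t: i for i, t in enumerate(selected)}

annotations = {"P1": {"GO:C"}, "P2": {"GO:B"}}   # raw (leaf) annotations
Y = np.zeros((len(annotations), len(selected)), dtype=np.float32)
for row, (pid, terms) in enumerate(sorted(annotations.items())):
    for t in propagate(terms, ancestors):
        if t in col:
            Y[row, col[t]] = 1.0

print(Y)  # P1 -> [1, 1, 1], P2 -> [1, 1, 0]
```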
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{nl45_cafa6_2026, |
|
|
title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings}, |
|
|
author={nl45}, |
|
|
year={2026}, |
|
|
publisher={Hugging Face}, |
|
|
howpublished={\url{https://huggingface.co/nl45/Protein1}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **CAFA Challenge**: Critical Assessment of Functional Annotation |
|
|
- **ESM-2**: Evolutionary Scale Modeling from Meta AI |
|
|
- **Gene Ontology Consortium**: For GO term annotations |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or collaboration: [Create an issue](https://huggingface.co/nl45/Protein1/discussions) |
|
|
|
|
|
--- |
|
|
|
|
|
**"BioBERT, I'm coming for you!"** 🔥🧬