---
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
license: mit
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---

# 🧬 CAFA 6 Protein Function Prediction

> *"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."*

**BioBERT, I'm coming for you!** 🔥

## Model Description

Multi-label protein function prediction using ESM-2 embeddings. Predicts Gene Ontology (GO) terms across three ontologies from protein amino acid sequences.

### What This Model Does

Given a protein sequence like:

```
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...
```

It predicts:

- **Molecular Function (MFO)**: What the protein DOES (e.g., "protein binding", "kinase activity")
- **Biological Process (BPO)**: What pathways it's involved in (e.g., "signal transduction")
- **Cellular Component (CCO)**: WHERE it's located (e.g., "nucleus", "membrane")

## Files in This Repository

- `train_esm2_embeddings.pkl` (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
- `test_esm2_embeddings.pkl` (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
- `go_parser.pkl` (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
- `.gitattributes` - Git LFS configuration for large files

## Dataset Statistics

### Training Data

- **Total proteins**: 82,404
- **Total annotations**: 537,027
- **Unique GO terms**: 26,125

### Selected Terms for Prediction

- **MFO**: 500 most frequent terms
- **BPO**: 800 most frequent terms
- **CCO**: 400 most frequent terms

### Label Distribution

| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
|----------|----------------------|--------------------|----------|
| MFO      | 49,751 (60.4%)       | 54.2               | 89.2%    |
| BPO      | 44,382 (53.9%)       | 6.6                | 99.2%    |
| CCO      | 58,505 (71.0%)       | 36.5               | 90.9%    |

## Usage

### Requirements

```bash
pip install torch biopython transformers huggingface_hub numpy
```

### Quick Start - Load Embeddings

```python
from huggingface_hub import hf_hub_download
import pickle

# Download embeddings
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl"
)

# Load embeddings
with open(embeddings_path, 'rb') as f:
    embeddings = pickle.load(f)

# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
print(f"Embedding dimension: {next(iter(embeddings.values())).shape}")
```

### Generate New Embeddings for Your Protein

```python
from transformers import AutoTokenizer, EsmModel
import torch

# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()  # disable dropout for deterministic embeddings

# Your protein sequence (replace with the full sequence)
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."

# Generate embedding by mean-pooling the last hidden state over all tokens
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]

print(f"Generated embedding shape: {embedding.shape}")
```

### Load GO Parser

```python
from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl"
)

# Load parser
with open(parser_path, 'rb') as f:
    go_parser = pickle.load(f)

# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")
```

## Model Architecture

The prediction model uses a Multi-Layer Perceptron (MLP):

```
Input: ESM-2 Embeddings (1280-dim)
    ↓
[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
    ↓
[Dense Output] → Sigmoid
    ↓
Multi-label Predictions
```

**Training Details:**

- Loss: Binary Cross-Entropy with Logits
- Optimizer: Adam
- Learning Rate: 0.001 with ReduceLROnPlateau
- Early Stopping: Patience of 10 epochs

## Data Processing Pipeline

1. **Raw Sequences** (FASTA format) → Parse protein IDs and sequences
2. **ESM-2 Encoding** → Generate 1280-dim embeddings using `facebook/esm2_t33_650M_UR50D`
3. **GO Annotations** → Load and normalize GO terms
4. **Label Preparation** → Create multi-label binary matrices with term propagation
5. **Model Training** → Train separate models for MFO, BPO, CCO

## Citation

```bibtex
@misc{nl45_cafa6_2026,
  title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
  author={nl45},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nl45/Protein1}}
}
```

## Acknowledgments

- **CAFA Challenge**: Critical Assessment of Functional Annotation
- **ESM-2**: Evolutionary Scale Modeling from Meta AI
- **Gene Ontology Consortium**: For GO term annotations

## License

MIT License

## Contact

For questions or collaboration: [Open a discussion](https://huggingface.co/nl45/Protein1/discussions)

---

**"BioBERT, I'm coming for you!"** 🔥🧬
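
## Appendix: Label Preparation Sketch

Step 4 of the data processing pipeline (multi-label binary matrices with term propagation) may be easier to follow with a small example. The sketch below is not the repository's code: `PARENTS`, `propagate`, `build_label_matrix`, and the three toy proteins are hypothetical names and data introduced for illustration, and the parent map is a hand-written excerpt of the MFO hierarchy rather than anything read from `go_parser.pkl`. It also assumes that annotations are propagated to ancestors before the most frequent terms are selected, and that root terms are kept; the card does not state either choice.

```python
from collections import Counter

# Toy excerpt of the MFO hierarchy: term -> direct parents.
# Hand-written for illustration; the real hierarchy would come from the GO parser.
PARENTS = {
    "GO:0003674": [],              # molecular_function (root)
    "GO:0005488": ["GO:0003674"],  # binding
    "GO:0005515": ["GO:0005488"],  # protein binding
    "GO:0003824": ["GO:0003674"],  # catalytic activity
}


def propagate(terms, parents):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(parents.get(term, []))
    return seen


def build_label_matrix(annotations, parents, top_k):
    """Propagate labels, keep the top_k most frequent terms, and binarize."""
    propagated = {pid: propagate(terms, parents) for pid, terms in annotations.items()}
    counts = Counter(term for terms in propagated.values() for term in terms)
    # Sort by descending frequency, breaking ties by term ID for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    selected = [term for term, _ in ranked[:top_k]]
    protein_ids = sorted(propagated)
    matrix = [
        [1 if term in propagated[pid] else 0 for term in selected]
        for pid in protein_ids
    ]
    return protein_ids, selected, matrix


# Hypothetical annotations: protein ID -> directly annotated GO terms.
annotations = {
    "P1": ["GO:0005515"],
    "P2": ["GO:0003824"],
    "P3": ["GO:0005488"],
}

protein_ids, selected, matrix = build_label_matrix(annotations, PARENTS, top_k=3)
print(selected)  # ['GO:0003674', 'GO:0005488', 'GO:0003824']
for pid, row in zip(protein_ids, matrix):
    print(pid, row)  # P1 [1, 1, 0] / P2 [1, 0, 1] / P3 [1, 1, 0]
```

Propagation makes each row consistent with the ontology: a protein annotated with "protein binding" is also labeled with "binding" and the root. In the real setting, `top_k` would correspond to the 500, 800, and 400 terms selected for MFO, BPO, and CCO.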