---
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
license: mit
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---
# 🧬 CAFA 6 Protein Function Prediction
> *"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."*
**BioBERT, I'm coming for you!** πŸ”₯
## Model Description
State-of-the-art multi-label protein function prediction using ESM-2 embeddings. Predicts Gene Ontology (GO) terms across three ontologies from protein amino acid sequences.
### What This Model Does
Given a protein sequence like:
```
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...
```
It predicts:
- **Molecular Function (MFO)**: What the protein DOES (e.g., "protein binding", "kinase activity")
- **Biological Process (BPO)**: What pathways it's involved in (e.g., "signal transduction")
- **Cellular Component (CCO)**: WHERE it's located (e.g., "nucleus", "membrane")
## Files in This Repository
- `train_esm2_embeddings.pkl` (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
- `test_esm2_embeddings.pkl` (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
- `go_parser.pkl` (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
- `.gitattributes` - Git LFS configuration for large files
## Dataset Statistics
### Training Data
- **Total proteins**: 82,404
- **Total annotations**: 537,027
- **Unique GO terms**: 26,125
### Selected Terms for Prediction
- **MFO**: 500 most frequent terms
- **BPO**: 800 most frequent terms
- **CCO**: 400 most frequent terms
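Selecting the most frequent terms per ontology can be sketched with a frequency count over the annotation pairs. A minimal illustration (the helper `select_top_terms` and the toy annotations are hypothetical, not from the repository):

```python
from collections import Counter

def select_top_terms(annotations, n):
    """Return the n most frequent GO terms from (protein_id, go_term) pairs."""
    counts = Counter(term for _, term in annotations)
    return [term for term, _ in counts.most_common(n)]

# Toy (protein_id, GO term) annotation pairs for illustration
annotations = [
    ("P1", "GO:0005515"), ("P2", "GO:0005515"),
    ("P1", "GO:0016301"), ("P3", "GO:0005515"),
]
top = select_top_terms(annotations, 1)
```

For the real data, `n` would be 500, 800, or 400 depending on the ontology.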
### Label Distribution
| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
|----------|---------------------|-------------------|----------|
| MFO | 49,751 (60.4%) | 54.2 | 89.2% |
| BPO | 44,382 (53.9%) | 6.6 | 99.2% |
| CCO | 58,505 (71.0%) | 36.5 | 90.9% |
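The sparsity column is the fraction of zeros in the protein × term binary label matrix. A quick sanity check on a toy matrix (illustrative values, not real data):

```python
import numpy as np

# 2 proteins × 4 terms; each protein carries one label,
# so 6 of 8 entries are zero -> sparsity 0.75
labels = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
])
sparsity = 1.0 - labels.mean()
```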
## Usage
### Requirements
```bash
pip install torch biopython transformers huggingface_hub numpy
```
### Quick Start - Load Embeddings
```python
from huggingface_hub import hf_hub_download
import pickle
# Download embeddings
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl",
)

# Load embeddings
with open(embeddings_path, "rb") as f:
    embeddings = pickle.load(f)

# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
print(f"Embedding dimension: {next(iter(embeddings.values())).shape}")
```
### Generate New Embeddings for Your Protein
```python
from transformers import AutoTokenizer, EsmModel
import torch
# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
# Your protein sequence
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."
# Generate embedding
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]
print(f"Generated embedding shape: {embedding.shape}")
```
### Load GO Parser
```python
from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl",
)

# Load parser (unpickling requires the parser's class definition on your import path)
with open(parser_path, "rb") as f:
    go_parser = pickle.load(f)
# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")
```
## Model Architecture
The prediction model uses a Multi-Layer Perceptron (MLP):
```
Input: ESM-2 Embeddings (1280-dim)
↓
[Dense 2048] β†’ BatchNorm β†’ ReLU β†’ Dropout(0.3)
↓
[Dense 1024] β†’ BatchNorm β†’ ReLU β†’ Dropout(0.3)
↓
[Dense 512] β†’ BatchNorm β†’ ReLU β†’ Dropout(0.3)
↓
[Dense Output] β†’ Sigmoid
↓
Multi-label Predictions
```
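The diagram above can be sketched in PyTorch as follows. This is a minimal illustration: the class name `GOTermMLP` and its constructor arguments are assumptions, not code from the repository; only the layer widths, BatchNorm/ReLU/Dropout pattern, and 1280-dim input come from the diagram.

```python
import torch
import torch.nn as nn

class GOTermMLP(nn.Module):
    """Sketch of the MLP above; num_terms is the number of selected
    GO terms for one ontology (e.g. 500 for MFO)."""
    def __init__(self, num_terms, embed_dim=1280, dropout=0.3):
        super().__init__()
        layers, in_dim = [], embed_dim
        for width in (2048, 1024, 512):
            layers += [
                nn.Linear(in_dim, width),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            in_dim = width
        layers.append(nn.Linear(in_dim, num_terms))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Returns raw logits; apply torch.sigmoid() at inference time.
        return self.net(x)

model = GOTermMLP(num_terms=500)
logits = model(torch.randn(4, 1280))  # batch of 4 ESM-2 embeddings
```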
**Training Details:**
- Loss: Binary Cross-Entropy with Logits (the model outputs raw logits during training; the sigmoid shown above is applied at inference)
- Optimizer: Adam
- Learning Rate: 0.001 with ReduceLROnPlateau
- Early Stopping: Patience of 10 epochs
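The training settings above combine as sketched below. The `train` helper, its loaders, and the smoke-test shapes are placeholders for illustration; only the loss, optimizer, scheduler, and patience come from the list above.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=100, patience=10):
    criterion = nn.BCEWithLogitsLoss()  # sigmoid folded into the loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    best_val, stale = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(x), y).item() for x, y in val_loader)
        scheduler.step(val)  # reduce LR when validation loss plateaus
        if val < best_val:
            best_val, stale = val, 0
        else:
            stale += 1
            if stale >= patience:  # early stopping
                break
    return best_val

# Smoke run with synthetic data (toy shapes, not the real task)
toy_model = nn.Sequential(nn.Linear(8, 3))
batch = [(torch.randn(4, 8), torch.randint(0, 2, (4, 3)).float())]
best = train(toy_model, batch, batch, epochs=2)
```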
## Data Processing Pipeline
1. **Raw Sequences** (FASTA format) β†’ Parse protein IDs and sequences
2. **ESM-2 Encoding** β†’ Generate 1280-dim embeddings using `facebook/esm2_t33_650M_UR50D`
3. **GO Annotations** β†’ Load and normalize GO terms
4. **Label Preparation** β†’ Create multi-label binary matrices with term propagation
5. **Model Training** β†’ Train separate models for MFO, BPO, CCO
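Term propagation in step 4 follows the GO true-path rule: a protein annotated with a term implicitly carries every ancestor of that term, so labels are propagated up the hierarchy before building the binary matrices. A minimal sketch (the `propagate` helper and toy parent map are illustrative, not the real ontology):

```python
def propagate(terms, parents):
    """Return the input GO terms plus all of their ancestors."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))  # walk up the hierarchy
    return seen

# Toy hierarchy: GO:C is_a GO:B is_a GO:A
parents = {"GO:C": ["GO:B"], "GO:B": ["GO:A"]}
labels = propagate({"GO:C"}, parents)
```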
## Citation
```bibtex
@misc{nl45_cafa6_2026,
  title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
  author={nl45},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nl45/Protein1}}
}
```
## Acknowledgments
- **CAFA Challenge**: Critical Assessment of Functional Annotation
- **ESM-2**: Evolutionary Scale Modeling from Meta AI
- **Gene Ontology Consortium**: For GO term annotations
## License
MIT License
## Contact
For questions or collaboration: [Create an issue](https://huggingface.co/nl45/Protein1/discussions)
---
**"BioBERT, I'm coming for you!"** πŸ”₯🧬