prot2func-hf / README.md
midnightoatmeal's picture
Initial upload from GitHub
f470400 verified
Prot2Func: Predicting Enzyme Function from Protein Sequences
Prot2Func is a machine learning project that explores the feasibility of predicting enzymatic activity (enzyme vs. non-enzyme) from protein sequences using only their amino acid composition.
Update: This was an early experimental attempt at protein function prediction using shallow models. The results were modest, but the goal was to test data preprocessing and pipeline logic. I plan to improve performance in future iterations with attention-based architectures or pretrained embeddings
Background:
Predicting whether a protein acts as an enzyme is a fundamental problem in computational biology, with applications in drug discovery, metabolic engineering, and synthetic biology. This project attempts a first-principles approach using amino acid composition as a basic feature set.
Dataset:
• Source: Subset of 500 proteins from the UniProt Swiss-Prot database
• Representation: Each protein is a string of amino acids (e.g., "MVKVGVNGFGRIGRL...")
• Labeling: Proteins were queried via the UniProt REST API for Catalytic Activity (EC Number) to assign binary labels:
• 1 → Enzyme
• 0 → Non-enzyme
• Class Distribution:
• Enzymes: 140
• Non-enzymes: 360
Feature Engineering:
Protein sequences were featurized using amino acid composition—a simple 20-dimensional vector representing the relative frequency of each standard amino acid.
from collections import Counter
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
def aa_composition(seq):
count = Counter(seq)
total = len(seq)
return [count.get(aa, 0) / total for aa in AMINO_ACIDS]
Models Trained:
1. Logistic Regression (Sklearn)
• Input: 20-dimensional amino acid frequency vector
• Performance:
• Accuracy: 84%
• Precision (Enzyme): 0.00
• Recall (Enzyme): 0.00
• Observation: Strong class imbalance led to a degenerate classifier (predicting all as non-enzymes).
2. Feedforward Neural Network (PyTorch)
• Architecture:
• Input layer: 20 features
• Two hidden layers (ReLU)
• Output: 2 logits (enzyme vs non-enzyme)
• Loss Function: CrossEntropyLoss
• Epochs: 20
• Performance:
• Accuracy: 21%
• Precision: 16%
• Recall: 100% (predicts all as enzyme)
• F1 Score: 28%
Challenges & Key Learnings
• Protein function is not linearly separable by amino acid composition alone.
• The dataset suffers from label imbalance and potential noise in UniProt annotations.
• Even small neural networks overfit or collapse into trivial predictions under these conditions.
• There is significant potential in exploring sequence-aware models like:
• Convolutional Neural Networks (CNNs)
• Transformers (e.g., ProtBERT, ESM)
• Embedding-based representations