Prot2Func: Predicting Enzyme Function from Protein Sequences

Prot2Func is a machine learning project that explores the feasibility of predicting enzymatic activity (enzyme vs. non-enzyme) from protein sequences using only their amino acid composition.

Update: This was an early experimental attempt at protein function prediction using shallow models. The results were modest; the main goal was to test the data preprocessing and pipeline logic. I plan to improve performance in future iterations with attention-based architectures or pretrained embeddings.

Background:

Predicting whether a protein acts as an enzyme is a fundamental problem in computational biology, with applications in drug discovery, metabolic engineering, and synthetic biology. This project attempts a deliberately simple baseline that uses amino acid composition as its only feature set.


Dataset:
	•	Source: Subset of 500 proteins from the UniProt Swiss-Prot database
	•	Representation: Each protein is a string of amino acids (e.g., "MVKVGVNGFGRIGRL...")
	•	Labeling: Each protein was queried through the UniProt REST API for a Catalytic Activity (EC number) annotation to assign a binary label:
		•	1 → Enzyme
		•	0 → Non-enzyme
	•	Class Distribution:
		•	Enzymes: 140 (28%)
		•	Non-enzymes: 360 (72%)
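
The labeling rule above can be sketched as a small function over one parsed UniProt entry. This is a sketch under assumptions, not the project's actual script: it assumes each entry is a dictionary parsed from UniProtKB JSON, with annotations stored in a `comments` list and catalytic activity annotations marked by `commentType` equal to `"CATALYTIC ACTIVITY"`. The helper name `label_from_entry` and the two toy entries are hypothetical.

```python
def label_from_entry(entry: dict) -> int:
    """Return 1 (enzyme) if the entry has a catalytic activity annotation, else 0.

    Assumes `entry` follows the UniProtKB JSON layout, where annotations
    live in a "comments" list and each one carries a "commentType" field.
    """
    for comment in entry.get("comments", []):
        if comment.get("commentType") == "CATALYTIC ACTIVITY":
            return 1
    return 0


# Toy entries for illustration only (not real UniProt records).
enzyme_like = {
    "primaryAccession": "TOY_1",
    "comments": [
        {"commentType": "FUNCTION"},
        {"commentType": "CATALYTIC ACTIVITY", "reaction": {"ecNumber": "1.1.1.1"}},
    ],
}
non_enzyme_like = {
    "primaryAccession": "TOY_2",
    "comments": [{"commentType": "FUNCTION"}],
}

labels = [label_from_entry(e) for e in (enzyme_like, non_enzyme_like)]
print(labels)
```

An entry with no `comments` field at all is labeled 0, which means missing annotations are treated as "non-enzyme". That is one possible source of the label noise mentioned later in this README.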


Feature Engineering:

Protein sequences were featurized using amino acid composition: a 20-dimensional vector holding the relative frequency of each standard amino acid.

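from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Return the relative frequency of each standard amino acid in seq."""
    count = Counter(seq)
    total = len(seq)
    if total == 0:
        # Avoid division by zero on an empty sequence.
        return [0.0] * len(AMINO_ACIDS)
    return [count.get(aa, 0) / total for aa in AMINO_ACIDS]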
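
Swiss-Prot sequences occasionally contain non-standard or ambiguous residue codes such as X, B, Z, or U. When the frequencies are divided by the full sequence length, those residues make the 20 values sum to less than 1. The variant below is a minimal sketch of one way to handle this, by normalizing over standard residues only. The function name `aa_composition_standard_only` is hypothetical and this is not necessarily how the project handled such sequences.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition_standard_only(seq: str) -> list[float]:
    """Relative frequency of each standard amino acid, ignoring other symbols."""
    counts = Counter(seq.upper())
    # Count only the 20 standard residues so the vector sums to 1.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence, or no standard residues at all.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# "X" marks an unknown residue; it is skipped rather than counted.
vec = aa_composition_standard_only("MKXXAC")
print(len(vec), sum(vec))
```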


Models Trained:

1. Logistic Regression (scikit-learn)
	•	Input: 20-dimensional amino acid frequency vector
	•	Performance:
		•	Accuracy: 84%
		•	Precision (Enzyme): 0%
		•	Recall (Enzyme): 0%
	•	Observation: Class imbalance led to a degenerate classifier that predicted every protein as a non-enzyme, so the 84% accuracy only reflects the share of non-enzymes in the evaluation set.
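
One common mitigation for this kind of collapse is to weight each class inversely to its frequency. The sketch below computes the "balanced" weights `n_samples / (n_classes * class_count)` from the class counts listed in the Dataset section. It illustrates the idea only and is not something this project reports having tried. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts: dict[int, int]) -> dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class counts from the Dataset section: 140 enzymes (1), 360 non-enzymes (0).
weights = balanced_class_weights({1: 140, 0: 360})
print(weights)
```

The minority class (enzyme) receives the larger weight, so errors on enzymes cost more during training. scikit-learn applies the same heuristic when `LogisticRegression` is given `class_weight="balanced"`.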

2. Feedforward Neural Network (PyTorch)
	•	Architecture:
		•	Input layer: 20 features
		•	Two hidden layers (ReLU)
		•	Output: 2 logits (enzyme vs. non-enzyme)
	•	Loss Function: CrossEntropyLoss
	•	Epochs: 20
	•	Performance:
		•	Accuracy: 21%
		•	Precision (Enzyme): 16%
		•	Recall (Enzyme): 100% (predicts all as enzyme)
		•	F1 Score (Enzyme): 28%
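
To make the two failure modes concrete, the sketch below computes accuracy, precision, recall, and F1 for a classifier that predicts only non-enzymes and for one that predicts only enzymes. The evaluation split of 16 enzymes and 84 non-enzymes is an assumption, chosen to be consistent with the 84% accuracy and 16% precision reported above; it is not the project's actual test set. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Define the ratios as 0 when their denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed split for illustration: 16 enzymes, 84 non-enzymes.
y_true = [1] * 16 + [0] * 84

all_negative = binary_metrics(y_true, [0] * 100)  # always "non-enzyme"
all_positive = binary_metrics(y_true, [1] * 100)  # always "enzyme"
print(all_negative)
print(all_positive)
```

A classifier that labels every sample as an enzyme has accuracy equal to its precision. The reported 21% accuracy next to 16% precision therefore suggests that the network labeled nearly all, rather than strictly all, samples as enzymes, or that the two figures came from different runs.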


Challenges & Key Learnings:
	•	In this dataset, amino acid composition alone did not linearly separate enzymes from non-enzymes.
	•	The dataset suffers from label imbalance and potential noise in UniProt annotations.
	•	Even small neural networks overfit or collapse into trivial predictions under these conditions.
	•	There is significant potential in exploring sequence-aware models such as:
		•	Convolutional Neural Networks (CNNs)
		•	Transformers (e.g., ProtBERT, ESM)
		•	Embedding-based representations
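
Sequence-aware models need an input that preserves residue order, which the composition vector discards. As a minimal sketch of such an input, the function below one-hot encodes a sequence into a fixed-length matrix of the kind a CNN could consume. The function name `one_hot_encode` and the `max_len` parameter are assumptions made for illustration, not part of this project.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq: str, max_len: int) -> list[list[int]]:
    """Encode seq as a max_len x 20 matrix, truncating or zero-padding as needed."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:  # non-standard residues stay as all-zero rows
            matrix[pos][idx] = 1
    return matrix


encoded = one_hot_encode("MVK", max_len=5)
print(len(encoded), len(encoded[0]))

# "MVK" and "KVM" have identical amino acid composition,
# but their one-hot encodings differ because order is preserved.
order_matters = one_hot_encode("MVK", 3) != one_hot_encode("KVM", 3)
print(order_matters)
```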