Commit f470400 by midnightoatmeal · verified · 1 Parent(s): 3016fe8

Initial upload from GitHub

Files changed (5)
  1. .gitattributes +1 -35
  2. LICENSE +21 -0
  3. README.md +68 -0
  4. main_prot2func.ipynb +0 -0
  5. protein_sequences.csv +3 -0
.gitattributes CHANGED
@@ -1,35 +1 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ protein_sequences.csv filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Lionel Rozario
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,68 @@
+ Prot2Func: Predicting Enzyme Function from Protein Sequences
+
+ Prot2Func is a machine learning project that explores the feasibility of predicting enzymatic activity (enzyme vs. non-enzyme) from protein sequences using only their amino acid composition.
+
+ Update: This was an early experimental attempt at protein function prediction using shallow models. The results were modest, but the goal was to test data preprocessing and pipeline logic. I plan to improve performance in future iterations with attention-based architectures or pretrained embeddings.
+
+ Background:
+
+ Predicting whether a protein acts as an enzyme is a fundamental problem in computational biology, with applications in drug discovery, metabolic engineering, and synthetic biology. This project attempts a first-principles approach using amino acid composition as a basic feature set.
+
+
+ Dataset:
+ • Source: Subset of 500 proteins from the UniProt Swiss-Prot database
+ • Representation: Each protein is a string of amino acids (e.g., "MVKVGVNGFGRIGRL...")
+ • Labeling: Proteins were queried via the UniProt REST API for Catalytic Activity (EC Number) to assign binary labels:
+     • 1 → Enzyme
+     • 0 → Non-enzyme
+ • Class Distribution:
+     • Enzymes: 140
+     • Non-enzymes: 360
+
+
+ Feature Engineering:
+
+ Protein sequences were featurized using amino acid composition, a simple 20-dimensional vector representing the relative frequency of each standard amino acid.
+
+ from collections import Counter
+
+ AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
+
+ def aa_composition(seq):
+     count = Counter(seq)
+     total = len(seq)
+     return [count.get(aa, 0) / total for aa in AMINO_ACIDS]
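As a quick illustration of the featurizer above, the sketch below applies it to the visible prefix of the README's example sequence. The full sequence is truncated in the README, so only the 15 residues shown there are used; the variable names `example` and `vec` are illustrative and not taken from the notebook.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Relative frequency of each standard amino acid, as defined in the README."""
    count = Counter(seq)
    total = len(seq)
    return [count.get(aa, 0) / total for aa in AMINO_ACIDS]

# Visible prefix of the README's example sequence (the original is truncated).
example = "MVKVGVNGFGRIGRL"
vec = aa_composition(example)

# Print only the residues that actually occur in the sequence.
for aa, freq in zip(AMINO_ACIDS, vec):
    if freq > 0:
        print(f"{aa}: {freq:.3f}")
```

Two properties of this function are worth keeping in mind when feeding it real UniProt data: an empty string raises `ZeroDivisionError`, and non-standard residue codes (for example `X` or `U`) count toward `total` but not toward any of the 20 features, so the vector then sums to less than 1.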
+
+
+ Models Trained:
+
+ 1. Logistic Regression (scikit-learn)
+     • Input: 20-dimensional amino acid frequency vector
+     • Performance:
+         • Accuracy: 84%
+         • Precision (Enzyme): 0.00
+         • Recall (Enzyme): 0.00
+     • Observation: Strong class imbalance led to a degenerate classifier (predicting all as non-enzymes).
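One common way to counter this kind of collapse is to reweight the classes inversely to their frequency. The sketch below is not what the notebook does; it only works out the weights that the widely used "balanced" heuristic, `n_samples / (n_classes * class_count)`, would assign to the README's class distribution (140 enzymes, 360 non-enzymes). The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count) (the 'balanced' heuristic)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class distribution reported in the README: 1 = enzyme, 0 = non-enzyme.
counts = {0: 360, 1: 140}
weights = balanced_class_weights(counts)

for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.3f}")
```

scikit-learn's `LogisticRegression` applies the same formula when constructed with `class_weight="balanced"`, so the minority (enzyme) class would contribute roughly 2.6 times as much per example to the loss as the majority class. Whether that is enough to produce useful enzyme predictions from composition features alone is an open question here.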
+
+ 2. Feedforward Neural Network (PyTorch)
+     • Architecture:
+         • Input layer: 20 features
+         • Two hidden layers (ReLU)
+         • Output: 2 logits (enzyme vs. non-enzyme)
+     • Loss Function: CrossEntropyLoss
+     • Epochs: 20
+     • Performance:
+         • Accuracy: 21%
+         • Precision: 16%
+         • Recall: 100% (predicts all as enzyme)
+         • F1 Score: 28%
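The two sets of metrics can be sanity-checked with a little confusion-matrix arithmetic. The sketch below assumes a hypothetical test split of 100 proteins containing 16 enzymes; the README does not state the split size, and these counts are chosen only because they reproduce the logistic regression's 84% accuracy when everything is predicted as non-enzyme.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1); undefined ratios are reported as 0.0."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Hypothetical test split (an assumption, not taken from the notebook).
N_ENZYME, N_NON_ENZYME = 16, 84

# Classifier that labels every protein as non-enzyme.
all_negative = binary_metrics(tp=0, fp=0, fn=N_ENZYME, tn=N_NON_ENZYME)
# Classifier that labels every protein as enzyme.
all_positive = binary_metrics(tp=N_ENZYME, fp=N_NON_ENZYME, fn=0, tn=0)

print("all non-enzyme:", [round(m, 2) for m in all_negative])
print("all enzyme:    ", [round(m, 2) for m in all_positive])
```

Under this assumption the all-enzyme classifier has precision 0.16, recall 1.00 and F1 of about 0.28, matching the reported neural network figures, but its accuracy would also be 0.16 rather than the reported 21%. That gap suggests the network either did not label literally every test protein as an enzyme or was evaluated on a different split; the README does not say which.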
+
+
+ Challenges & Key Learnings:
+ • Protein function is not linearly separable by amino acid composition alone.
+ • The dataset suffers from label imbalance and potential noise in UniProt annotations.
+ • Even small neural networks overfit or collapse into trivial predictions under these conditions.
+ • There is significant potential in exploring sequence-aware models such as:
+     • Convolutional Neural Networks (CNNs)
+     • Transformers (e.g., ProtBERT, ESM)
+     • Embedding-based representations
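To make the case for sequence-aware models concrete, the sketch below shows that composition features cannot distinguish two sequences that contain the same residues in a different order, whereas a position-preserving one-hot encoding (a typical input for a CNN) can. The helper names `one_hot` and `composition`, the padding length, and the toy sequences are all illustrative assumptions, not part of the project.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq, max_len):
    """Encode seq as max_len rows of 20 values, zero-padded or truncated.

    Residues outside the 20 standard amino acids become all-zero rows.
    """
    rows = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(seq) and seq[pos] in AA_INDEX:
            row[AA_INDEX[seq[pos]]] = 1
        rows.append(row)
    return rows

def composition(seq):
    """Order-free relative frequency of each standard amino acid."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

# Two toy sequences with identical residues in reverse order.
a, b = "MKVL", "LVKM"

print(composition(a) == composition(b))    # True: composition ignores order
print(one_hot(a, 6) == one_hot(b, 6))      # False: one-hot keeps positions
```

The fixed `max_len` is a design choice that real pipelines have to tune, since protein lengths vary widely; pretrained embeddings such as those mentioned in the list above sidestep this by producing per-residue or pooled vectors instead.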
main_prot2func.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
protein_sequences.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3c2d3e382fd04fb0a0a9b6151ee74a980908146afd006aa23acb424c1258347f
+ size 212337999