Initial upload from GitHub

Files changed (5)

.gitattributes +1 -35
LICENSE +21 -0
README.md +68 -0
main_prot2func.ipynb +0 -0
protein_sequences.csv +3 -0

.gitattributes CHANGED

@@ -1,35 +1 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text


+protein_sequences.csv filter=lfs diff=lfs merge=lfs -text

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2025 Lionel Rozario
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED

	@@ -0,0 +1,68 @@

+Prot2Func: Predicting Enzyme Function from Protein Sequences
+Prot2Func is a machine learning project that explores the feasibility of predicting enzymatic activity (enzyme vs. non-enzyme) from protein sequences using only their amino acid composition.
+Update: This was an early experimental attempt at protein function prediction using shallow models. The results were modest, but the goal was to test data preprocessing and pipeline logic. I plan to improve performance in future iterations with attention-based architectures or pretrained embeddings.
+Background:
+Predicting whether a protein acts as an enzyme is a fundamental problem in computational biology, with applications in drug discovery, metabolic engineering, and synthetic biology. This project attempts a first-principles approach using amino acid composition as a basic feature set.
+Dataset:
+	•	Source: Subset of 500 proteins from the UniProt Swiss-Prot database
+	•	Representation: Each protein is a string of amino acids (e.g., "MVKVGVNGFGRIGRL...")
+	•	Labeling: Proteins were queried via the UniProt REST API for Catalytic Activity (EC Number) to assign binary labels:
+	•	1 → Enzyme
+	•	0 → Non-enzyme
+	•	Class Distribution:
+	•	Enzymes: 140
+	•	Non-enzymes: 360
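The class counts above imply an imbalance of roughly 2.6 to 1. The sketch below is a minimal illustration, assuming the inverse-frequency heuristic `n_samples / (n_classes * count)` (the formula scikit-learn documents for `class_weight="balanced"`), of the per-class weights those counts would produce. The README does not say whether the notebook applied any weighting, and the variable names here are hypothetical.

```python
# Class counts taken from the README; everything else is illustrative.
counts = {"enzyme": 140, "non_enzyme": 360}

n_samples = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: rarer classes get proportionally larger weights.
class_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / n_samples

print(class_weights)
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

On the full 500-protein set, any model has to beat the majority-class baseline printed here before its accuracy says anything about learned signal.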
+Feature Engineering:
+Protein sequences were featurized using amino acid composition—a simple 20-dimensional vector representing the relative frequency of each standard amino acid.
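+from collections import Counter
+# The 20 standard amino acids; this order defines the feature columns
+AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
+def aa_composition(seq):
+    """Return the relative frequency of each standard amino acid in seq."""
+    if not seq:
+        return [0.0] * len(AMINO_ACIDS)  # avoid division by zero on empty input
+    count = Counter(seq)
+    total = len(seq)  # also counts any non-standard residues (e.g. X, U)
+    return [count.get(aa, 0) / total for aa in AMINO_ACIDS]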
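As a quick check of the featurizer, the sketch below applies the same composition logic to the 15-residue prefix quoted in the Dataset section. That prefix is only a toy input (the full sequence is elided in the README), and the function is redefined here so the snippet runs on its own.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Relative frequency of each standard amino acid, in AMINO_ACIDS order."""
    count = Counter(seq)
    total = len(seq)
    return [count.get(aa, 0) / total for aa in AMINO_ACIDS]

# Toy input: the truncated example prefix from the README, not a full protein.
fragment = "MVKVGVNGFGRIGRL"
features = aa_composition(fragment)

# Pair each residue with its frequency and keep only the non-zero entries.
nonzero = {aa: round(f, 3) for aa, f in zip(AMINO_ACIDS, features) if f > 0}
print(len(features), nonzero)
```

Because the vector only records frequencies, any two sequences with the same residue counts map to the same features regardless of residue order.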
+Models Trained:
+1. Logistic Regression (Sklearn)
+	•	Input: 20-dimensional amino acid frequency vector
+	•	Performance:
+	•	Accuracy: 84%
+	•	Precision (Enzyme): 0.00
+	•	Recall (Enzyme): 0.00
+	•	Observation: Strong class imbalance led to a degenerate classifier (predicting all as non-enzymes).
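The pattern above (high accuracy with zero precision and recall) is what an all-negative predictor produces on an imbalanced test set. The sketch below assumes a hypothetical 100-sample test split containing 16 enzymes, chosen only because it reproduces the 84% figure. The README does not give the actual split, and `binary_metrics` is an illustrative helper rather than code from the notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for one positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical test split: 16 enzymes (1) and 84 non-enzymes (0).
y_true = [1] * 16 + [0] * 84
# Degenerate classifier: every sample is predicted as non-enzyme.
y_pred = [0] * 100

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```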
+2. Feedforward Neural Network (PyTorch)
+	•	Architecture:
+	•	Input layer: 20 features
+	•	Two hidden layers (ReLU)
+	•	Output: 2 logits (enzyme vs non-enzyme)
+	•	Loss Function: CrossEntropyLoss
+	•	Epochs: 20
+	•	Performance:
+	•	Accuracy: 21%
+	•	Precision: 16%
+	•	Recall: 100% (predicts nearly all samples as enzyme)
+	•	F1 Score: 28%
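The reported F1 score follows from the reported precision and recall through the harmonic mean, F1 = 2PR / (P + R). A small check using only the figures listed above (`f1_from_pr` is an illustrative helper name):

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the README for the neural network.
f1 = f1_from_pr(0.16, 1.00)
print(f"F1 = {f1:.2f}")
```

A classifier that labeled literally every sample as enzyme would have accuracy equal to precision, so the gap between 21% and 16% suggests that a small number of samples were predicted as non-enzyme.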
+Challenges & Key Learnings
+	•	In this experiment, amino acid composition alone did not linearly separate enzymes from non-enzymes.
+	•	The dataset suffers from label imbalance and potential noise in UniProt annotations.
+	•	Even small neural networks overfit or collapse into trivial predictions under these conditions.
+	•	There is significant potential in exploring sequence-aware models like:
+	•	Convolutional Neural Networks (CNNs)
+	•	Transformers (e.g., ProtBERT, ESM)
+	•	Embedding-based representations

main_prot2func.ipynb ADDED

The diff for this file is too large to render.

protein_sequences.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3c2d3e382fd04fb0a0a9b6151ee74a980908146afd006aa23acb424c1258347f
+size 212337999