Update README.md
README.md
CHANGED
|
@@ -1,3 +1,22 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
---
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
---
|
| 4 |
+
# Struct2Seq-GNN
|
| 5 |
+
|
| 6 |
+
## Model Description
|
| 7 |
+
Struct2Seq-GNN is a lightweight, 6-layer graph neural network for inverse protein folding (structure-to-sequence prediction). Given the 3D coordinates of a protein backbone, it predicts an amino acid sequence for that structure, which makes it a building block for computational protein engineering and structural bioinformatics workflows.
|
| 8 |
+
|
| 9 |
+
## Intended Uses & Limitations
|
| 10 |
+
* **Primary Use:** Computational protein design: generating plausible sequences for novel or heavily modified protein backbones.
|
| 11 |
+
* **Limitations:** This is a lightweight architecture built as an independent research project. It reaches ~33% native sequence recovery on validation data, but its designs should not be used for clinical therapeutics without further validation and optimization.
|
| 12 |
+
|
| 13 |
+
## Training Data & Procedure
|
| 14 |
+
* **Dataset:** Trained on biological protein assemblies from the Protein Data Bank (PDB), clustered at a 30% sequence identity cutoff to reduce data leakage between closely related sequences.
|
| 15 |
+
* **Data Augmentation:** During training, Gaussian noise with a standard deviation of 0.1 Å was added to all input atomic coordinates. This augmentation discourages the model from "reading out" the native sequence from fine coordinate details of over-refined crystal structures and pushes it to rely on the overall geometry of the fold.
|
| 16 |
+
* **Hardware:** Trained for ~65 epochs on a 4-GPU HPC cluster.
|
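The clustering step above only limits leakage if whole clusters are kept on one side of the train/validation split. The README does not describe how the split was made, so the following is a minimal sketch under that assumption. The function name `split_by_cluster`, the `chain_to_cluster` mapping, and the 10% validation fraction are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_cluster(chain_to_cluster, val_fraction=0.1, seed=0):
    """Assign whole clusters to train or validation so no cluster spans both.

    chain_to_cluster: hypothetical dict mapping a chain ID to the ID of its
    30%-identity cluster (produced by whatever clustering tool was used).
    """
    # Group chain IDs by their cluster.
    clusters = defaultdict(list)
    for chain_id, cluster_id in chain_to_cluster.items():
        clusters[cluster_id].append(chain_id)

    # Shuffle cluster IDs (not chains) with a fixed seed for reproducibility.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    # Hold out a fraction of clusters, at least one, for validation.
    n_val = max(1, round(val_fraction * len(cluster_ids)))
    val_clusters = set(cluster_ids[:n_val])

    train, val = [], []
    for cluster_id in cluster_ids:
        target = val if cluster_id in val_clusters else train
        target.extend(clusters[cluster_id])
    return train, val
```

Splitting at the cluster level rather than the chain level means every validation chain sits in a different 30%-identity cluster from every training chain.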
| 17 |
+
|
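The coordinate-noise augmentation described under Training Data & Procedure can be sketched as follows. This is an illustration, not the project's training code: `add_coordinate_noise` is a hypothetical helper, and coordinates are assumed to be a list of `[x, y, z]` values in Å.

```python
import random


def add_coordinate_noise(coords, sigma=0.1, rng=None):
    """Return a copy of coords with independent Gaussian noise on every value.

    coords: list of [x, y, z] positions in Angstroms (assumed layout).
    sigma:  standard deviation of the noise in Angstroms (0.1 per the README).
    rng:    optional random.Random instance for reproducible noise.
    """
    if rng is None:
        rng = random.Random()
    # Each coordinate gets its own zero-mean draw; the input is not modified.
    return [[value + rng.gauss(0.0, sigma) for value in atom] for atom in coords]


# Usage: perturb two backbone atoms with a seeded generator.
backbone = [[12.1, 4.3, -7.8], [13.4, 4.9, -7.1]]
noisy_backbone = add_coordinate_noise(backbone, sigma=0.1, rng=random.Random(42))
```

Noise of this kind is normally drawn fresh for each training example and switched off at evaluation time. With an array library the same operation is a single vectorized addition.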
| 18 |
+
## Evaluation Metrics
|
| 19 |
+
Results on the validation set:
|
| 20 |
+
* **Global Sequence Recovery:** ~33% validation accuracy across all residues. (Natural proteins with more than ~30% sequence identity usually share a fold, so recovery at this level is encouraging, but it does not by itself show that a designed sequence will adopt the target 3D fold.)
|
| 21 |
+
* **Convergence:** Validation loss plateaued smoothly at ~2.236.
|
| 22 |
+
* **Binding Pocket Recovery (5.0 Å):** Not yet calculated.
|
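The metrics above can be computed along these lines. This is a sketch under stated assumptions, not the project's evaluation script. All three function names are hypothetical. A "pocket" residue is assumed to be one whose representative atom lies within 5.0 Å of any ligand atom. The reported loss is assumed to be a mean per-residue cross-entropy in nats.

```python
import math


def sequence_recovery(native, predicted, mask=None):
    """Fraction of (optionally masked) positions where predicted == native."""
    if len(native) != len(predicted):
        raise ValueError("native and predicted must have the same length")
    if mask is None:
        mask = [True] * len(native)
    positions = [i for i, keep in enumerate(mask) if keep]
    if not positions:
        raise ValueError("mask selects no positions")
    matches = sum(native[i] == predicted[i] for i in positions)
    return matches / len(positions)


def pocket_mask(residue_coords, ligand_coords, cutoff=5.0):
    """True for residues within `cutoff` Angstroms of any ligand atom.

    residue_coords: one representative (x, y, z) per residue (assumption).
    """
    return [
        any(math.dist(res, lig) <= cutoff for lig in ligand_coords)
        for res in residue_coords
    ]


def perplexity(mean_cross_entropy_nats):
    """Convert a mean cross-entropy in nats to perplexity."""
    return math.exp(mean_cross_entropy_nats)


# Usage: global recovery, then recovery restricted to pocket residues.
global_rec = sequence_recovery("MKTAYIAK", "MKSAYLAR")
mask = pocket_mask([(0, 0, 0), (3, 0, 0), (10, 0, 0)], [(0, 0, 1)])
pocket_rec = sequence_recovery("MKT", "MAT", mask)
```

If the reported validation loss of ~2.236 is indeed a per-residue cross-entropy in nats, it corresponds to a perplexity of about 9.4, compared with 20 for a uniform guess over the 20 standard amino acids.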