Commit 063bb10 by gsstec · verified · 1 Parent(s): 4617124

Upload README.md for CPU-based Protein Structure Predictor

  1. README.md +182 -10
README.md CHANGED
@@ -1,10 +1,182 @@
- ---
- title: Protein Predictor
- emoji: 🐢
- colorFrom: blue
- colorTo: red
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: Protein Structure Predictor
+ emoji: 🧬
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ pinned: false
+ ---
+
+ # Protein Structure Predictor - CPU-based Analysis
+
+ AI-powered protein structure prediction using established bioinformatics methods and machine learning, optimized for CPU execution.
+
+ ## 🧬 Features
+
+ - **🔬 Secondary Structure Prediction**: Predict helix, sheet, and coil regions using ML
+ - **⚔️ Protease Site Analysis**: Identify potential cleavage sites for common proteases
+ - **📊 Protein Properties**: Calculate molecular weight, pI, instability index, and more
+ - **🧪 Interactive Interface**: User-friendly web interface for researchers
+ - **📚 PDB Generation**: Create structure files for visualization
+ - **🖥️ CPU Optimized**: Fast execution without GPU requirements
+
23
+ ## πŸ—οΈ Technology Stack
24
+
25
+ ```
26
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
27
+ β”‚ Protein Structure Predictor β”‚
28
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
29
+ β”‚ Gradio Frontend (Port 7860) β”‚
30
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
31
+ β”‚ BioPython + scikit-learn ML β”‚
32
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
33
+ β”‚ CPU-based Prediction Pipeline β”‚
34
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
35
+ β”‚ Python 3.10 + Scientific Libraries β”‚
36
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
37
+ ```
38
+
+ ## 📦 Method Information
+
+ ### Prediction Approach
+ - **Type**: Machine learning-based structure prediction
+ - **Libraries**: BioPython, scikit-learn, NumPy, Pandas
+ - **Input**: Amino acid sequences (10-2000 residues)
+ - **Output**: Secondary structure, protease sites, PDB files, protein properties
+ - **Performance**: Fast CPU execution, ~1-5 seconds per sequence
+
+ ### Supported Features
+ - Secondary structure prediction (α-helix, β-sheet, coil)
+ - Protease cleavage site prediction (Trypsin, Chymotrypsin, Pepsin)
+ - Protein property analysis (MW, pI, instability, hydrophobicity)
+ - Simple PDB structure generation
+ - Confidence scoring for predictions
+
+ ## 🚀 Usage
+
+ ### Input Requirements
+ - **Format**: Single-letter amino acid codes (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y)
+ - **Length**: 10-2000 amino acids
+ - **Examples**:
+ - Short peptide: `MKFLVNVALVFMVVYISYIYA`
+ - Protein domain: `MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWUQTPACVTYFTQSSLASRQGFVDWDDAASRPAINVGLYPTLNTVGGHQAAMQMLKETINEEAAEWDRVHPVHAGPIAPGQMREPRGTHGTWTIMHPSPSTEEGHAIPQRQTPSPGDGPVVPSASLYAVSPAILPKDGPVVVSQVKQWRQEFGWVLTPWVQTIIDGRGEEQTFLPGQHFLRELQJKHNLNHEFRLQTLLLTCDENGKGPLPQIVIRGQGDSREQAPGQWLEQPGWASPATCSPGPPRPPRPPPPPPPPPPPPPPP`
+
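The input requirements above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the Space's actual validation code: the function name `validate_sequence` is hypothetical, and stripping whitespace and upper-casing the input are assumptions about how pasted sequences might be normalized.

```python
# The 20 single-letter codes and the length limits stated in the README.
VALID_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH, MAX_LENGTH = 10, 2000


def validate_sequence(raw):
    """Return a cleaned, upper-case sequence or raise ValueError (hypothetical helper)."""
    # Assumption: whitespace is dropped so sequences pasted over several lines still work.
    sequence = "".join(raw.split()).upper()
    if not MIN_LENGTH <= len(sequence) <= MAX_LENGTH:
        raise ValueError(
            f"length {len(sequence)} is outside {MIN_LENGTH}-{MAX_LENGTH} residues"
        )
    # Any letter outside the 20 listed codes is reported, sorted alphabetically.
    invalid = sorted(set(sequence) - VALID_RESIDUES)
    if invalid:
        raise ValueError(f"unsupported residue codes: {', '.join(invalid)}")
    return sequence
```

Under this strict check the short peptide example passes, while the longer "Protein domain" example would be rejected, because it contains the letters U and J, which are not among the 20 listed codes.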
+ ### Workflow
+ 1. **Load Models**: Click "🚀 Load Prediction Models" to initialize the system
+ 2. **Input Sequence**: Enter or paste your protein sequence
+ 3. **Predict Structure**: Click "🔬 Predict Structure" to run analysis
+ 4. **Review Results**: Examine predictions, properties, and PDB structure
+ 5. **Export Data**: Download PDB files for further analysis
+
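The README does not say how the exported PDB files are built, so the following is only an illustration of what a "simple" PDB file can look like. It writes one CA atom per residue in the fixed-column `ATOM` record layout. The function name `ca_trace_pdb` is hypothetical, and the coordinates are placeholders on a straight line (3.8 Å apart, roughly the distance between consecutive CA atoms), not a predicted fold.

```python
# Three-letter residue names used in PDB ATOM records.
THREE_LETTER = {
    "A": "ALA", "C": "CYS", "D": "ASP", "E": "GLU", "F": "PHE",
    "G": "GLY", "H": "HIS", "I": "ILE", "K": "LYS", "L": "LEU",
    "M": "MET", "N": "ASN", "P": "PRO", "Q": "GLN", "R": "ARG",
    "S": "SER", "T": "THR", "V": "VAL", "W": "TRP", "Y": "TYR",
}

CA_SPACING = 3.8  # angstroms; placeholder spacing along the x axis


def ca_trace_pdb(sequence, chain="A"):
    """Return PDB text with one CA atom per residue (hypothetical helper)."""
    lines = []
    for index, residue in enumerate(sequence, start=1):
        x = (index - 1) * CA_SPACING  # assumption: straight line, not a real structure
        lines.append(
            # Columns: record name, serial, atom name, residue name, chain, residue number
            f"ATOM  {index:5d}  CA  {THREE_LETTER[residue]} {chain}{index:4d}    "
            # Columns: x, y, z, occupancy, B-factor, padding, element symbol
            f"{x:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}" + " " * 10 + " C"
        )
    lines.append("END")
    return "\n".join(lines) + "\n"
```

Calling `print(ca_trace_pdb("MKF"))` produces three 78-column `ATOM` lines followed by `END`, which most molecular viewers should accept as a CA-only trace.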
+ ## 📊 Output Information
+
+ ### Secondary Structure Prediction
+ - **Helix (H)**: α-helical regions with confidence scores
+ - **Sheet (E)**: β-sheet regions with structural context
+ - **Coil (C)**: Random coil and loop regions
+
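A per-residue string of H, E, and C labels is easier to read once it is collapsed into contiguous segments. The sketch below assumes the prediction is available as a plain string with one label per residue, which the README does not confirm. The helper name `summarize_segments` is hypothetical.

```python
from itertools import groupby


def summarize_segments(labels):
    """Collapse a label string such as 'CCHHH' into (state, start, end) tuples.

    Positions are 1-based and inclusive (hypothetical helper).
    """
    segments = []
    position = 1
    # groupby yields each run of identical consecutive labels.
    for state, run in groupby(labels):
        length = len(list(run))
        segments.append((state, position, position + length - 1))
        position += length
    return segments
```

For example, `summarize_segments("CCHHHHHCCEEEC")` reports a helix spanning residues 3 to 7 and a strand spanning residues 10 to 12, with coil segments in between.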
+ ### Protein Properties
+ - **Molecular Weight**: Calculated from amino acid composition
+ - **Isoelectric Point**: pH at which the protein carries no net charge
+ - **Instability Index**: Estimate of protein stability in solution (values above 40 suggest an unstable protein)
+ - **GRAVY Score**: Grand average of hydropathy (hydrophobicity)
+ - **Aromaticity**: Fraction of aromatic amino acids
+
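Two of the properties above are simple enough to compute by hand. The sketch below uses the Kyte-Doolittle hydropathy scale for GRAVY and counts F, W, and Y for aromaticity. It is a standard-library illustration with hypothetical function names, not the Space's code. BioPython's `Bio.SeqUtils.ProtParam.ProteinAnalysis` class offers `gravy()`, `aromaticity()`, `molecular_weight()`, `isoelectric_point()`, and `instability_index()`, though the README does not say whether the tool calls them.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


def aromaticity(sequence):
    """Fraction of residues that are aromatic (F, W, or Y)."""
    return sum(sequence.count(residue) for residue in "FWY") / len(sequence)
```

A positive GRAVY value indicates a hydrophobic sequence on average, and a negative value a hydrophilic one.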
+ ### Protease Analysis
+ - **Cleavage Sites**: Predicted positions where proteases may cut
+ - **Site Context**: Amino acids surrounding cleavage sites
+ - **Protease Types**: Trypsin, Chymotrypsin, Pepsin predictions
+
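The README does not state which cleavage rules the tool applies. As an illustration, the sketch below uses the commonly cited simplified rules: trypsin cuts after K or R, chymotrypsin (high specificity) cuts after F, W, or Y, and neither cuts when the next residue is P. Pepsin is left out because its preferences are broader and pH-dependent. The names `CLEAVAGE_RULES` and `find_cleavage_sites` are hypothetical.

```python
# Residues after which each protease cuts, under the simplified rules assumed here.
CLEAVAGE_RULES = {
    "Trypsin": "KR",
    "Chymotrypsin": "FWY",
}


def find_cleavage_sites(sequence, protease, context=3):
    """Return predicted cut sites with flanking residues (hypothetical helper).

    'position' is the 1-based index of the residue just before the cut.
    """
    residues = CLEAVAGE_RULES[protease]
    sites = []
    # The last residue is skipped: there is no peptide bond after it.
    for i, residue in enumerate(sequence[:-1]):
        if residue in residues and sequence[i + 1] != "P":
            before = sequence[max(0, i + 1 - context): i + 1]
            after = sequence[i + 1: i + 1 + context]
            sites.append({"position": i + 1, "context": f"{before}|{after}"})
    return sites
```

In the returned context string the `|` marks the scissile bond, so `MK|FLR` means the cut falls between K and F.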
+ ## 🔧 Technical Details
+
+ ### Machine Learning Approach
+ - **Algorithm**: Random Forest classifier for secondary structure
+ - **Features**: Amino acid properties in sliding windows
+ - **Training**: Synthetic data for demonstration (real implementation would use PDB data)
+ - **Validation**: Cross-validation and confidence scoring
+
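The feature step can be pictured as follows: for each residue, the properties of its neighbours inside a fixed window are concatenated into one feature row. This is a minimal sketch under stated assumptions. The two properties used (charge at neutral pH and a coarse hydrophobic flag), the window size of 7, and zero-padding at the sequence ends are all illustrative choices, not the tool's actual feature set.

```python
PAD = "-"  # placeholder for window positions beyond the sequence ends

# Illustrative per-residue properties (assumptions, not the tool's feature set).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}
HYDROPHOBIC = set("AFILMVW")


def residue_properties(residue):
    """Return [charge, hydrophobic flag] for one residue; padding maps to zeros."""
    if residue == PAD:
        return [0.0, 0.0]
    return [CHARGE.get(residue, 0.0), 1.0 if residue in HYDROPHOBIC else 0.0]


def window_features(sequence, window=7):
    """Build one feature row per residue from a centred sliding window."""
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a centre residue")
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    rows = []
    for center in range(len(sequence)):
        row = []
        # padded[center:center + window] is the window centred on this residue.
        for residue in padded[center:center + window]:
            row.extend(residue_properties(residue))
        rows.append(row)
    return rows
```

The result has one row per residue and `window * 2` columns. A matrix of this shape, paired with one H/E/C label per row, is the kind of input that scikit-learn's `RandomForestClassifier.fit(X, y)` accepts.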
+ ### Computational Requirements
+ - **Memory**: ~100-500 MB RAM for typical sequences
+ - **Processing Time**: 1-5 seconds depending on sequence length
+ - **CPU Usage**: Single-threaded, optimized for HF Spaces
+
+ ## 🧪 Research Applications
+
+ ### Structural Biology
+ - **Protein Characterization**: Analyze unknown protein sequences
+ - **Domain Analysis**: Identify structural domains and motifs
+ - **Comparative Studies**: Compare structures across species
+
+ ### Drug Discovery
+ - **Target Analysis**: Understand protein structure for drug design
+ - **Binding Site Prediction**: Identify potential drug binding regions
+ - **Stability Assessment**: Evaluate protein stability for therapeutics
+
+ ### Biotechnology
+ - **Protein Engineering**: Design proteins with desired properties
+ - **Enzyme Analysis**: Study enzyme structure-function relationships
+ - **Biomarker Discovery**: Identify structural features for diagnostics
+
+ ## 📚 Example Use Cases
+
+ ### Case 1: Enzyme Analysis
+ ```
+ Input: Protease enzyme sequence
+ Output: Active site prediction, substrate specificity
+ Application: Industrial enzyme optimization
+ ```
+
+ ### Case 2: Therapeutic Protein
+ ```
+ Input: Antibody or hormone sequence
+ Output: Stability analysis, potential degradation sites
+ Application: Biopharmaceutical development
+ ```
+
+ ### Case 3: Membrane Protein
+ ```
+ Input: Transmembrane protein sequence
+ Output: Secondary structure, hydrophobic regions
+ Application: Drug target analysis
+ ```
+
+ ## 🔗 Related Resources
+
+ - **🧬 BioPython Documentation**: https://biopython.org/
+ - **📊 scikit-learn**: https://scikit-learn.org/
+ - **📚 Protein Structure Databases**: PDB, UniProt, SCOP
+ - **🔬 Protease Databases**: MEROPS, CutDB
+
+ ## 🤝 Contributing
+
+ We welcome contributions to improve the protein structure predictor:
+
+ - **Algorithm Improvements**: Enhance prediction accuracy
+ - **Feature Additions**: Add new analysis capabilities
+ - **Performance Optimization**: Improve speed and efficiency
+ - **Documentation**: Help improve user guides and examples
+
+ ## 📄 Citation
+
+ If you use this tool in your research, please cite:
+
+ ```bibtex
+ @misc{protein-predictor-2024,
+   title={CPU-based Protein Structure Predictor},
+   author={gsstec},
+   year={2024},
+   url={https://huggingface.co/spaces/gsstec/protein-predictor}
+ }
+ ```
+
+ ## 📞 Support
+
+ For questions, issues, or collaboration opportunities:
+
+ - **GitHub Issues**: Report bugs and request features
+ - **HuggingFace Discussions**: Community support and discussions
+ - **Email**: Contact for research collaborations
+
+ ---
+
+ **Disclaimer**: This tool is for research purposes. Predictions should be validated experimentally for critical applications. The current implementation uses simplified models for demonstration - production use would require training on actual structural databases.