gsstec committed on
Commit 57fd2f0 · verified · 1 Parent(s): e970142

Upload 4 files

Files changed (4)
  1. README.md +165 -13
  2. app.py +115 -0
  3. packages.txt +3 -0
  4. requirements.txt +10 -0
README.md CHANGED
@@ -1,13 +1,165 @@
- ---
- title: Gss Diffdock Engine
- emoji: 🚀
- colorFrom: gray
- colorTo: gray
- sdk: gradio
- sdk_version: 6.15.2
- python_version: '3.13'
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # DiffDock API Layer for Window 8 Drug Development
+
+ ## Overview
+ This directory contains the optimized DiffDock molecular docking engine designed to run on Hugging Face's **free CPU Basic tier** (2 vCPUs). It provides protein-ligand binding affinity predictions for drug development analysis.
+
+ ## Architecture
+ - **Platform**: Hugging Face Spaces (Gradio SDK)
+ - **Hardware**: CPU Basic (Free Tier - 2 vCPUs)
+ - **Framework**: DiffDock neural network for molecular docking
+ - **API**: RESTful endpoint for Cloudflare Worker integration
+
+ ## Files
+
+ ### 1. `packages.txt`
+ System-level dependencies installed before Python setup:
+ - `unzip` - Archive extraction
+ - `wget` - File downloads
+ - `libgl1-mesa-glx` - OpenGL support for molecular visualization
+
+ ### 2. `requirements.txt`
+ Python dependencies optimized for CPU execution:
+ - **PyTorch 2.2.1** (CPU-only build)
+ - **torch-geometric 2.5.2** - Graph neural networks
+ - **biopython 1.83** - Biological computation
+ - **rdkit 2023.9.5** - Chemical informatics
+ - **gradio 4.19.2** - Web interface and API
+ - **pandas 2.2.1** - Data manipulation
+ - **pyyaml 6.0.1** - Configuration parsing
+ - **scipy 1.12.0** - Scientific computing
+ - **networkx 3.2.1** - Graph algorithms
+
+ ### 3. `app.py`
+ Main application with three key components:
+
+ #### CPU Optimization
+ ```python
+ torch.set_num_threads(2)
+ os.environ["OMP_NUM_THREADS"] = "2"
+ os.environ["MKL_NUM_THREADS"] = "2"
+ ```
+ Limits thread usage to match free tier allocation.
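OpenMP and MKL typically read `OMP_NUM_THREADS` and `MKL_NUM_THREADS` when the native libraries are first loaded, so a variant that exports them before `import torch` may be more reliable than setting them afterwards. A minimal sketch, assuming a hypothetical helper `configure_cpu_threads` that is not part of the uploaded `app.py`; the torch call is treated as optional so the snippet also runs where torch is absent:

```python
import os


def configure_cpu_threads(num_threads: int = 2) -> dict:
    """Export thread caps, then apply them to torch if it is installed.

    Hypothetical helper for illustration; not part of the uploaded app.py.
    """
    value = str(num_threads)
    # Set the variables first so native libraries can see them when they load.
    os.environ["OMP_NUM_THREADS"] = value
    os.environ["MKL_NUM_THREADS"] = value

    torch_configured = False
    try:
        import torch  # imported only after the environment is prepared

        torch.set_num_threads(num_threads)
        torch_configured = True
    except ImportError:
        # torch is missing (for example in a local dry run); env caps still apply.
        pass

    return {"threads": num_threads, "torch_configured": torch_configured}


settings = configure_cpu_threads(2)
```

Calling the helper at the very top of the entry point, before any other heavy import, keeps the ordering explicit.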
+
+ #### Automated Setup
+ - Clones DiffDock repository
+ - Downloads pre-trained weights from Zenodo
+ - Configures inference pipeline
+
+ #### API Endpoint
+ - **Function**: `run_diffdock_inference(protein_pdb_content, ligand_smiles_string)`
+ - **Input**:
+   - Protein structure (PDB format)
+   - Ligand molecule (SMILES string)
+ - **Output**: JSON with confidence score
+ - **API Name**: `execute_diffdock_prediction`
+
+ ## Deployment Steps
+
+ ### 1. Create Hugging Face Space
+ 1. Go to https://huggingface.co/spaces
+ 2. Click **"Create a New Space"**
+ 3. Name: `gss-diffdock-engine` (or your preferred name)
+ 4. SDK: **Gradio**
+ 5. Hardware: **CPU Basic** (Free)
+ 6. Visibility: Public or Private
+
+ ### 2. Upload Files
+ Upload these three files to your Space repository:
+ - `packages.txt`
+ - `requirements.txt`
+ - `app.py`
+
+ ### 3. Wait for Build
+ Hugging Face will:
+ 1. Install system packages (1-2 minutes)
+ 2. Install Python dependencies (3-5 minutes)
+ 3. Clone DiffDock and download weights (5-10 minutes)
+ 4. Start the application
+
+ Total build time: **10-15 minutes**
+
+ ### 4. Verify Deployment
+ Once status shows **"Running"**:
+ - The Space URL will be active
+ - API endpoint will be available at: `https://YOUR-USERNAME-gss-diffdock-engine.hf.space/api/execute_diffdock_prediction`
+
+ ## API Usage
+
+ ### Request Format
+ ```bash
+ curl -X POST "https://YOUR-USERNAME-gss-diffdock-engine.hf.space/api/execute_diffdock_prediction" \
+   -H "Content-Type: application/json" \
+   -d '{
+     "data": [
+       "PROTEIN_PDB_CONTENT_HERE",
+       "LIGAND_SMILES_STRING_HERE"
+     ]
+   }'
+ ```
+
+ ### Response Format
+ ```json
+ {
+   "data": [{
+     "success": true,
+     "diffdock_confidence_score": 0.85,
+     "hardware_allocation": "HF_FREE_CPU_TIER"
+   }]
+ }
+ ```
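The request and response shapes above can be exercised from Python with only the standard library. This is a sketch under assumptions: the helper names `build_request`, `parse_prediction`, and `predict` are hypothetical, the `/api/<api_name>` route is taken from this README as written (whether it is served depends on the Gradio version in use), and the 600-second timeout is an arbitrary placeholder.

```python
import json
import urllib.request

API_NAME = "execute_diffdock_prediction"  # api_name registered in app.py


def build_request(base_url: str, protein_pdb: str, ligand_smiles: str) -> urllib.request.Request:
    """Build the POST request described under 'Request Format'."""
    payload = json.dumps({"data": [protein_pdb, ligand_smiles]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/{API_NAME}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body: str) -> dict:
    """Extract the result object described under 'Response Format'."""
    result = json.loads(body)["data"][0]
    if not result.get("success"):
        # app.py reports failures as {"success": False, "error_log": "..."}.
        raise RuntimeError(result.get("error_log", "unknown DiffDock failure"))
    return result


def predict(base_url: str, protein_pdb: str, ligand_smiles: str, timeout: float = 600.0) -> dict:
    """Send one docking request and return the parsed result (needs network access)."""
    request = build_request(base_url, protein_pdb, ligand_smiles)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prediction(response.read().decode("utf-8"))
```

Note that `app.py` returns `"success": true` with a score of `-1.0` when no summary file is found, so a caller may want to treat that exact value with caution rather than as a real prediction.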
+
+ ## Performance Optimizations
+
+ ### Memory Management
+ - **Inference steps**: Limited to 10 (vs default 20)
+ - **Samples per complex**: 1 (vs default 40)
+ - **Cleanup**: Automatic removal of temporary files
+
+ ### CPU Constraints
+ - Thread count capped at 2
+ - Single pose generation
+ - Aggressive memory cleanup
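These limits correspond to the command-line flags that `app.py` passes to `DiffDock/inference.py`. A minimal sketch of keeping them in one place, assuming a hypothetical helper `build_inference_command`; the flag names and paths are copied from `app.py` in this commit, and nothing here runs DiffDock itself:

```python
import sys
from typing import List


def build_inference_command(
    csv_path: str,
    out_dir: str,
    inference_steps: int = 10,
    samples_per_complex: int = 1,
    python_executable: str = sys.executable,
) -> List[str]:
    """Assemble the argument list that app.py hands to subprocess.run.

    Hypothetical helper; only the flags already used in app.py appear here.
    """
    if inference_steps < 1 or samples_per_complex < 1:
        raise ValueError("inference_steps and samples_per_complex must be >= 1")
    return [
        python_executable, "DiffDock/inference.py",
        "--config", "DiffDock/default_inference_args.yaml",
        "--protein_ligand_csv", csv_path,
        "--out_dir", out_dir,
        "--inference_steps", str(inference_steps),
        "--samples_per_complex", str(samples_per_complex),
        # app.py sets actual_steps to the same value as inference_steps.
        "--actual_steps", str(inference_steps),
    ]


command = build_inference_command("input_manifest.csv", "results")
```

Lowering `inference_steps` further, as the troubleshooting notes suggest, then becomes a single argument change.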
+
+ ## Integration with Cloudflare Worker
+
+ The next step is to create a Cloudflare Worker handler that:
+ 1. Receives drug development requests from Window 8
+ 2. Formats protein/ligand data
+ 3. Calls this Hugging Face API
+ 4. Stores results in D1 database
+ 5. Returns predictions to frontend
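The storage step (step 4) can be prototyped locally before any Worker code exists. The sketch below uses Python's standard `sqlite3` module as a stand-in for D1, which is SQLite-based. The table name `drug_predictions`, its columns, and the helper `store_prediction` are all hypothetical; no schema is defined in this commit.

```python
import sqlite3

# Hypothetical schema; D1 would receive equivalent SQL through its own tooling.
SCHEMA = """
CREATE TABLE IF NOT EXISTS drug_predictions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ligand_smiles TEXT NOT NULL,
    confidence_score REAL,
    success INTEGER NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def store_prediction(conn: sqlite3.Connection, ligand_smiles: str, result: dict) -> int:
    """Insert one API result (the object inside "data") and return its row id."""
    cursor = conn.execute(
        "INSERT INTO drug_predictions (ligand_smiles, confidence_score, success) "
        "VALUES (?, ?, ?)",
        (
            ligand_smiles,
            result.get("diffdock_confidence_score"),  # None when the call failed
            int(bool(result.get("success"))),
        ),
    )
    conn.commit()
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
row_id = store_prediction(
    conn, "CCO", {"success": True, "diffdock_confidence_score": 0.85}
)
```

Parameterized queries are used so that SMILES strings, which contain characters such as `(`, `=`, and `#`, are never spliced into the SQL text.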
+
+ ## Troubleshooting
+
+ ### Build Failures
+ - Check logs for missing dependencies
+ - Verify file names are exact (case-sensitive)
+ - Ensure no extra whitespace in files
+
+ ### Timeout Errors
+ - Inference is limited to 10 steps for speed
+ - Consider upgrading to paid tier for faster processing
+
+ ### Memory Issues
+ - Current config optimized for 16GB RAM limit
+ - Reduce inference steps if needed
+
+ ## Next Steps
+
+ 1. ✅ Deploy to Hugging Face Spaces
+ 2. ⏳ Create Cloudflare Worker integration
+ 3. ⏳ Add D1 database schema for drug predictions
+ 4. ⏳ Build Window 8 frontend interface
+ 5. ⏳ Implement result visualization
+
+ ## Support
+
+ For issues or questions:
+ - Hugging Face Docs: https://huggingface.co/docs/hub/spaces
+ - DiffDock Paper: https://arxiv.org/abs/2210.01776
+ - DiffDock Repo: https://github.com/gcorso/DiffDock
+
+ ---
+
+ **Gaston Software Solutions LLP**
+ Window 8: Drug Development & Molecular Docking Engine
app.py ADDED
@@ -0,0 +1,115 @@
+ import gradio as gr
+ import os
+ import sys
+ import subprocess
+ import torch
+
+ # 🛠️ Step 1: Optimize execution matrix for Hugging Face's 2 free CPU threads
+ torch.set_num_threads(2)
+ os.environ["OMP_NUM_THREADS"] = "2"
+ os.environ["MKL_NUM_THREADS"] = "2"
+
+ # 🧬 Step 2: Automated setup for the core DiffDock Neural Architecture layers
+ if not os.path.exists("DiffDock"):
+     print("[GSS LOG] Initializing DiffDock repo architectures...")
+     subprocess.run(["git", "clone", "https://github.com/gcorso/DiffDock.git"])
+
+     print("[GSS LOG] Fetching foundational pretrained academic weight structures...")
+     # Pulling the public, pre-computed spatial scoring weights
+     subprocess.run(["wget", "https://zenodo.org/record/7651515/files/workdir.zip"])
+     subprocess.run(["unzip", "workdir.zip", "-d", "DiffDock/"])
+
+ sys.path.append(os.path.abspath("DiffDock"))
+
+ def run_diffdock_inference(protein_pdb_content, ligand_smiles_string):
+     """
+     Ingests raw target pathogen protein text from Window 7 and the candidate
+     chemical SMILES sequence, mapping the docking coordinates entirely via CPU.
+     """
+     pid = os.getpid()
+     protein_path = f"target_pathogen_{pid}.pdb"
+     csv_path = f"input_manifest_{pid}.csv"
+     output_dir = f"results_{pid}"
+
+     try:
+         # 1. Output edge node payloads directly to physical system space files
+         with open(protein_path, "w") as f:
+             f.write(protein_pdb_content)
+
+         # 2. Build the task manifest table file matching DiffDock's intake expectations
+         import pandas as pd
+         manifest_df = pd.DataFrame([{
+             "complex_name": "gss_candidate",
+             "protein_path": protein_path,
+             "ligand_description": ligand_smiles_string,
+             "protein_sequence": ""
+         }])
+         manifest_df.to_csv(csv_path, index=False)
+
+         # 3. Construct the execution array with aggressive CPU concessions
+         # We clamp inference steps to 10 and output poses to 1 to stay inside memory lines
+         cmd = [
+             sys.executable, "DiffDock/inference.py",
+             "--config", "DiffDock/default_inference_args.yaml",
+             "--protein_ligand_csv", csv_path,
+             "--out_dir", output_dir,
+             "--inference_steps", "10",
+             "--samples_per_complex", "1",
+             "--actual_steps", "10"
+         ]
+
+         # Run execution loop through python process mapping pipelines
+         execution_run = subprocess.run(cmd, capture_output=True, text=True)
+
+         # 4. Parse the output results table to locate the match confidence metric
+         confidence_metric = -1.0  # Fallback default value
+         summary_sheet = os.path.join(output_dir, "summary.csv")
+
+         if os.path.exists(summary_sheet):
+             summary_df = pd.read_csv(summary_sheet)
+             if "confidence" in summary_df.columns and not summary_df.empty:
+                 confidence_metric = float(summary_df.iloc[0]["confidence"])
+
+         return {
+             "success": True,
+             "diffdock_confidence_score": confidence_metric,
+             "hardware_allocation": "HF_FREE_CPU_TIER"
+         }
+
+     except Exception as runtime_fault:
+         return {
+             "success": False,
+             "error_log": str(runtime_fault)
+         }
+
+     finally:
+         # Clean up temporary generation artifacts from physical memory storage
+         for temp_file in [protein_path, csv_path]:
+             if os.path.exists(temp_file):
+                 os.remove(temp_file)
+         if os.path.exists(output_dir):
+             import shutil
+             shutil.rmtree(output_dir)
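Every request served by one Gradio process sees the same `os.getpid()` value, so if two requests ever overlap (for example, if the event's concurrency limit is raised) they would share the same file names and the `finally` block of one could delete the other's files. A minimal sketch of per-request isolation with the standard `tempfile` module; the context manager `request_workspace` and its key names are hypothetical and only mirror the variable names used above:

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def request_workspace(prefix: str = "diffdock_") -> Iterator[Dict[str, str]]:
    """Yield unique paths for one request and delete them all on exit.

    Hypothetical helper; key names mirror the variables in run_diffdock_inference.
    """
    # TemporaryDirectory picks a unique directory name and removes the whole
    # tree when the with-block ends, even if an exception is raised.
    with tempfile.TemporaryDirectory(prefix=prefix) as root:
        paths = {
            "protein_path": os.path.join(root, "target_pathogen.pdb"),
            "csv_path": os.path.join(root, "input_manifest.csv"),
            "output_dir": os.path.join(root, "results"),
        }
        os.makedirs(paths["output_dir"])
        yield paths
```

Inside the handler, the body of the `try` block would run within `with request_workspace() as paths:`, which also makes the manual cleanup loop unnecessary.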
+
+ # 🌐 Step 3: Instantiate the App Dashboard and expose the API schema
+ with gr.Blocks() as demo:
+     gr.Markdown("# Gaston Software Solutions LLP — Window 8 Engine")
+     gr.Markdown("Active Mode: Decentralized Independent CPU Inference Matrix.")
+
+     # Hidden registration endpoints to receive data programmatically from Cloudflare
+     protein_input_field = gr.Textbox(visible=False, label="Protein Data Stream")
+     ligand_input_field = gr.Textbox(visible=False, label="Ligand SMILES Chain")
+     json_output_response = gr.JSON()
+
+     # Named API route link mapped to your cloud architecture hook
+     api_trigger_node = gr.Button("Execute Processing", visible=False)
+     api_trigger_node.click(
+         run_diffdock_inference,
+         inputs=[protein_input_field, ligand_input_field],
+         outputs=json_output_response,
+         api_name="execute_diffdock_prediction"
+     )
+
+ demo.launch()
+
+ # Made with Bob
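Step 4 of `run_diffdock_inference` reads the score from `summary.csv`. If the cloned DiffDock revision does not produce that file, the score may be recoverable from the ranked pose file names instead. The sketch below assumes names of the form `rank1_confidence<score>.sdf` somewhere under the output directory; both that naming and the helper `read_rank1_confidence` are assumptions to verify against the actual output, not documented behavior.

```python
import os
import re
from typing import Optional

# Assumed naming: rank1_confidence<signed decimal>.sdf, e.g. rank1_confidence-1.23.sdf
_RANK1_PATTERN = re.compile(r"^rank1_confidence(-?\d+(?:\.\d+)?)\.sdf$")


def read_rank1_confidence(output_dir: str) -> Optional[float]:
    """Return the score embedded in the top-ranked pose file name, if any.

    Hypothetical helper; returns None instead of a sentinel such as -1.0 so
    that a missing result cannot be mistaken for a real (possibly negative) score.
    """
    for _dirpath, _dirnames, filenames in os.walk(output_dir):
        for name in filenames:
            match = _RANK1_PATTERN.match(name)
            if match:
                return float(match.group(1))
    return None
```

A caller could fall back to this helper only when `summary.csv` is absent, and report `"success": False` when both lookups come back empty.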
packages.txt ADDED
@@ -0,0 +1,3 @@
+ unzip
+ wget
+ libgl1-mesa-glx
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ --extra-index-url https://download.pytorch.org/whl/cpu
+ torch==2.2.1
+ torch-geometric==2.5.2
+ biopython==1.83
+ rdkit==2023.9.5
+ gradio==4.19.2
+ pandas==2.2.1
+ pyyaml==6.0.1
+ scipy==1.12.0
+ networkx==3.2.1