---
# Compound Batch Query Tool
## Project Overview
This project is a batch query tool for annotating Contaminants of Emerging Concern (CECs) through database and API interactions. It includes a Dify-based annotating agent, a Flask-based SQL query service, and a Tkinter-based graphical user interface for batch annotating CECs.
### Key Features
1. **CECs annotating agent**:
- Uses Dify's visual workflow orchestration engine to chain queries against multiple databases (such as PubChem Lite and InVitroDB) into an automated pipeline.
- Annotates CECs with the following fields: `Category`, `EndpointName`, `XLogP`, `BioPathway`, `ToxicityInfo`, `KnownUse`, `DisorderDisease`.
2. **SQL Query Service**:
- Provides a RESTful API (via Flask) to execute `SELECT` queries on PubChem Lite and InVitroDB databases.
- Supports switching between the two databases.
- Ensures safe SQL operations by restricting queries to `SELECT` only.
3. **Batch Compound Classification Tool**:
- A desktop GUI tool (built using Tkinter) that processes compound names from CSV files.
- Uses Dify's API to annotate each compound with a main category, subcategories, biological pathways, toxicity information, and related fields.
- Saves the results as CSV files with detailed logs for reference.
---
## File Structure
```
.
├── step1_pubchemlite_invitro_to_dify_en.py
└── step2_CECs annotating_agent_v1.0.py
```
### File Details
#### 1. `pubchemlite_invitro_to_dify_en.py`
This is a Flask-based SQL query API service with the following key functionalities:
- Allows users to execute SQL queries via HTTP POST requests.
- Provides dual-database support for PubChem Lite and InVitroDB.
- Ensures safety by restricting operations to `SELECT` queries only (disallows `INSERT`, `DELETE`, `UPDATE`, `DROP`, etc.).
- Includes robust error handling with detailed feedback.
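The `SELECT`-only restriction described above could be enforced with a check along the following lines. This is a minimal sketch, not the service's actual code: the function name `is_select_only` and the deny-list are hypothetical, and the check is deliberately conservative (it also rejects a forbidden keyword that appears inside a string literal).

```python
import re

# Hypothetical deny-list for illustration; the real service may use a different set.
FORBIDDEN = ("INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE", "TRUNCATE")


def is_select_only(sql: str) -> bool:
    """Return True if sql is a single statement that starts with SELECT."""
    # Drop surrounding whitespace and any trailing semicolons.
    statement = sql.strip().rstrip(";").strip()
    # A remaining semicolon means stacked statements, which are rejected.
    if ";" in statement:
        return False
    # The statement must begin with the SELECT keyword (case-insensitive).
    if not re.match(r"select\b", statement, flags=re.IGNORECASE):
        return False
    # Reject any data- or schema-modifying keyword appearing as a whole word.
    pattern = r"\b(" + "|".join(FORBIDDEN) + r")\b"
    return re.search(pattern, statement, flags=re.IGNORECASE) is None


print(is_select_only("SELECT * FROM compounds LIMIT 5;"))
print(is_select_only("SELECT 1; DELETE FROM compounds"))
```

A keyword check like this is only a first line of defence; connecting with a read-only database account would be a sensible complement.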
**How to Run**:
```bash
python pubchemlite_invitro_to_dify_en.py
```
The service runs on `http://127.0.0.1:5000` by default.
#### 2. `CECs annotating_agent_v1.0.py`
This is a Tkinter-based batch compound classification tool with the following key functionalities:
- Allows users to select a CSV file and configure parameters through a graphical interface.
- Uses Dify's API to classify compounds into predefined categories.
- Supports batch processing and saves results as CSV files.
- Provides detailed logging and error messages for each step.
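The batch processing and per-step logging listed above might look roughly like the loop below. It is a sketch under stated assumptions: `annotate_batch`, the `classify` callable, and the extra `Error` field are hypothetical names introduced for illustration, and the real tool may structure this differently.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("batch_annotation")


def annotate_batch(names, classify):
    """Call classify(name) for each compound and collect one record per name.

    A failure on one compound is logged and recorded, so the batch continues.
    """
    results = []
    for index, name in enumerate(names, start=1):
        try:
            record = dict(classify(name))
            record["CompoundName"] = name
            record["Error"] = ""  # hypothetical field marking success
            logger.info("[%d/%d] annotated %s", index, len(names), name)
        except Exception as exc:
            logger.error("[%d/%d] failed on %s: %s", index, len(names), name, exc)
            record = {"CompoundName": name, "Error": str(exc)}
        results.append(record)
    return results
```

Passing the classifier in as a callable keeps the loop testable without a running Dify instance.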
**How to Run**:
```bash
python "CECs annotating_agent_v1.0.py"
```
**Key Dependencies**:
- `tkinter`: For the graphical user interface.
- `pandas`: For loading and saving CSV files.
- `requests`: For making RESTful API calls.
- `json`: For parsing and generating JSON data.
---
## Usage Guide
### 1. Environment Setup
Ensure you have the following Python packages installed:
```bash
pip install flask pandas sqlalchemy requests pymysql
```
### 2. SQL Query Service
- Modify the database connection details in `pubchemlite_invitro_to_dify_en.py`:
```python
DB_CONFIGS = {
    "pubchemlite": {
        "uri": "mysql+pymysql://<username>:<password>@<host>:<port>/<database>"
    },
    "invitrodb_v4_3": {
        "uri": "mysql+pymysql://<username>:<password>@<host>:<port>/<database>"
    }
}
```
- Start the service, then test the API by sending a `SELECT` query in an HTTP POST request.
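This README does not document the route or the request schema, so the client sketch below rests on assumptions: the `/query` path and the JSON keys `database` and `sql` are hypothetical and should be replaced with whatever the service actually expects. The database names are taken from `DB_CONFIGS`.

```python
import json

# Hypothetical route; only the host and port come from this README.
SERVICE_URL = "http://127.0.0.1:5000/query"


def build_query_payload(database: str, sql: str) -> dict:
    """Assemble the JSON body for one SELECT request (key names are assumed)."""
    return {"database": database, "sql": sql}


def post_query(database: str, sql: str, timeout: float = 30.0) -> dict:
    """Send the query to the running service and return the decoded JSON reply."""
    import requests  # imported lazily so the payload helper works without it

    response = requests.post(
        SERVICE_URL, json=build_query_payload(database, sql), timeout=timeout
    )
    response.raise_for_status()
    return response.json()


payload = build_query_payload("pubchemlite", "SELECT 1")
print(json.dumps(payload))
```

With the service running, `post_query("invitrodb_v4_3", "SELECT 1")` would issue the same request against the second database.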
### 3. Batch Compound Classification Tool
- Update the default configuration in `CECs annotating_agent_v1.0.py`:
```python
self.default_api_key = "<DIFY_API_KEY>"
self.default_base_url = "http://<DIFY_HOST>:<PORT>/v1"
self.default_csv_path = "./path_to_your_data.csv"
```
- Run the program and use the GUI to upload a CSV file and execute batch classification.
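For orientation, a single Dify call from the tool might be assembled as shown below. The sketch assumes the agent is published as a Dify workflow app (hence the `/workflows/run` endpoint) and that it takes one input variable named `compound_name`; both are assumptions, and the helper name `build_workflow_request` is hypothetical.

```python
def build_workflow_request(base_url: str, api_key: str, compound: str,
                           user: str = "batch-tool"):
    """Return (url, headers, body) for one blocking Dify workflow run."""
    # Assumes a workflow app; a chat app would use a different endpoint.
    url = base_url.rstrip("/") + "/workflows/run"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "inputs": {"compound_name": compound},  # hypothetical input variable name
        "response_mode": "blocking",
        "user": user,
    }
    return url, headers, body


def run_workflow(base_url: str, api_key: str, compound: str, timeout: float = 120.0):
    """Send the request and return the decoded JSON response."""
    import requests  # imported lazily so the builder works without it

    url, headers, body = build_workflow_request(base_url, api_key, compound)
    response = requests.post(url, headers=headers, json=body, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing the tool at a live Dify instance.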
---
## Example Data
### Input File Format
The input CSV file should contain a column with compound names. For example:
```csv
IUPAC_name
Methanol
Ethanol
Acetone
```
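A file in this format could be read as sketched below. The example uses the standard-library `csv` module to stay dependency-free (the tool itself lists `pandas` for this step), and the helper name `read_compound_names` is hypothetical.

```python
import csv
import io


def read_compound_names(handle, column="IUPAC_name"):
    """Return the non-empty, stripped values of one column from an open CSV file."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in CSV header")
    names = []
    for row in reader:
        name = (row[column] or "").strip()
        if name:  # skip rows where the cell is blank
            names.append(name)
    return names


sample = io.StringIO("IUPAC_name\nMethanol\nEthanol\nAcetone\n")
print(read_compound_names(sample))  # ['Methanol', 'Ethanol', 'Acetone']
```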
### Output File Format
The output file will be in CSV format and include the following fields:
- `CompoundName`: The compound name.
- `MainCategory`: The main classification category.
- `AdditionalCategory1`: Subcategory 1.
- `AdditionalCategory2`: Subcategory 2.
- `EndpointName`: Expanded endpoint classification.
- `XLogP`: XLogP value.
- `BioPathway`: Biological pathway information.
- `ToxicityInfo`: Toxicity information.
- `KnownUse`: Known uses of the compound.
- `DisorderDisease`: Associated disorders or diseases.
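Writing results with exactly these columns could be done as in the sketch below. It again uses the standard-library `csv` module rather than `pandas`; the helper name `write_results` is hypothetical, and the row values are placeholders rather than real annotations.

```python
import csv
import io

# Column order follows the field list above.
OUTPUT_FIELDS = [
    "CompoundName", "MainCategory", "AdditionalCategory1", "AdditionalCategory2",
    "EndpointName", "XLogP", "BioPathway", "ToxicityInfo", "KnownUse",
    "DisorderDisease",
]


def write_results(handle, rows):
    """Write annotation records as CSV; missing fields become empty cells."""
    writer = csv.DictWriter(
        handle,
        fieldnames=OUTPUT_FIELDS,
        restval="",             # fill fields absent from a record
        extrasaction="ignore",  # drop keys that are not output columns
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_results(buffer, [{"CompoundName": "Methanol", "MainCategory": "placeholder"}])
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` documentation recommends.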
---
## Contributors
We welcome contributions! If you are interested in improving this project, feel free to submit pull requests or suggestions.
---
## License
This project is licensed under the CC BY-NC 4.0 license.
---