	---

	# Compound Batch Query Tool

	## Project Overview
	This project is a compound batch query tool designed to annotate Contaminants of Emerging Concern (CECs) through database queries and API interactions. It includes a Dify-based annotating agent, a Flask-based SQL query service, and a Tkinter-based graphical user interface for batch annotation of CECs.

	### Key Features
	1. CECs Annotating Agent:
	- Uses Dify's visual workflow orchestration engine to chain together the logic for querying multiple databases (such as PubChem Lite and InVitroDB) into an automated pipeline.
	- Supports CECs annotation with the following fields: `Category`, `EndpointName`, `XLogP`, `BioPathway`, `ToxicityInfo`, `KnownUse`, `DisorderDisease`.

	2. SQL Query Service:
	- Provides a RESTful API (via Flask) to execute `SELECT` queries on PubChem Lite and InVitroDB databases.
	- Supports switching between the two databases.
	- Ensures safe SQL operations by restricting queries to `SELECT` only.

	3. Batch Compound Classification Tool:
	- A desktop GUI tool (built using Tkinter) that processes compound names from CSV files.
	- Uses Dify's API to annotate each compound with a main category, subcategories, biological pathways, toxicity information, and related fields.
	- Saves the results as CSV files with detailed logs for reference.

	---

	## File Structure

	```
	.
	├── step1_pubchemlite_invitro_to_dify_en.py
	└── step2_CECs annotating_agent_v1.0.py
	```

	### File Details

	#### 1. `pubchemlite_invitro_to_dify_en.py`

	This is a Flask-based SQL query API service with the following key functionalities:
	- Allows users to execute SQL queries via HTTP POST requests.
	- Provides dual-database support for PubChem Lite and InVitroDB.
	- Ensures safety by restricting operations to `SELECT` queries only (disallows `INSERT`, `DELETE`, `UPDATE`, `DROP`, etc.).
	- Includes robust error handling with detailed feedback.
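The `SELECT`-only restriction described above can be illustrated with a small validation function. This is a minimal sketch, not the service's actual code: the function name `is_select_only` and the keyword list are assumptions. The check is deliberately conservative, so it also rejects a harmless query whose string literal happens to contain a forbidden word.

```python
import re

# Keywords that must never appear in an accepted statement (assumed list).
FORBIDDEN = {
    "insert", "update", "delete", "drop", "alter",
    "truncate", "create", "replace", "grant",
}


def is_select_only(sql: str) -> bool:
    """Return True if `sql` looks like a single read-only SELECT statement."""
    statement = sql.strip().rstrip(";").strip()
    # Reject stacked statements such as "SELECT 1; DROP TABLE t".
    if ";" in statement:
        return False
    # The statement must begin with the SELECT keyword.
    if not re.match(r"(?i)select\b", statement):
        return False
    # Reject any forbidden keyword that appears as a whole word.
    words = set(re.findall(r"[a-z_]+", statement.lower()))
    return not words & FORBIDDEN
```

A keyword filter like this is only one layer of protection. Connecting to the databases with a read-only account would guard against anything the filter misses.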

	How to Run:
	```bash
	python pubchemlite_invitro_to_dify_en.py
	```

	The service runs on `http://127.0.0.1:5000` by default.
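Once the service is running, a client could send a query along the following lines. The route `/query` and the JSON keys `database` and `sql` are hypothetical placeholders, so check the script for the actual route and field names. The sketch uses only the standard library so it runs without extra packages.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # default address of the service


def build_query_request(database, sql, route="/query"):
    """Build a JSON POST request; `route` and the payload keys are assumptions."""
    payload = json.dumps({"database": database, "sql": sql}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + route,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(database, sql):
    """Send the request and decode the JSON reply (needs a running service)."""
    request = build_query_request(database, sql)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```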


	#### 2. `CECs annotating_agent_v1.0.py`

	This is a Tkinter-based batch compound classification tool with the following key functionalities:
	- Allows users to select a CSV file and configure parameters through a graphical interface.
	- Uses Dify's API to classify compounds into predefined categories.
	- Supports batch processing and saves results as CSV files.
	- Provides detailed logging and error messages for each step.

	How to Run:
	```bash
	python "CECs annotating_agent_v1.0.py"
	```

	Key Dependencies:
	- `tkinter`: For the graphical user interface.
	- `pandas`: For loading and saving CSV files.
	- `requests`: For making RESTful API calls.
	- `json`: For parsing and generating JSON data.

	---

	## Usage Guide

	### 1. Environment Setup
	Ensure you have the following Python packages installed:
	```bash
	pip install flask pandas sqlalchemy requests pymysql
	```

	### 2. SQL Query Service
	- Modify the database connection details in `pubchemlite_invitro_to_dify_en.py`:
	```python
	DB_CONFIGS = {
	    "pubchemlite": {
	        "uri": "mysql+pymysql://<username>:<password>@<host>:<port>/<database>"
	    },
	    "invitrodb_v4_3": {
	        "uri": "mysql+pymysql://<username>:<password>@<host>:<port>/<database>"
	    },
	}
	```
	- Start the service and test the API by sending a `SELECT` query in an HTTP POST request.
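When filling in the URI placeholders, note that special characters in the username or password (such as `@` or `/`) must be URL-encoded, otherwise the URI is parsed incorrectly. A small helper along these lines could assemble the URI. The function name `build_mysql_uri` is hypothetical, and the credentials in the usage note are made-up sample values.

```python
from urllib.parse import quote_plus


def build_mysql_uri(user, password, host, port, database):
    """Assemble a SQLAlchemy-style MySQL URI with URL-encoded credentials."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )
```

For example, `build_mysql_uri("reader", "p@ss/word", "localhost", 3306, "pubchemlite")` encodes the password as `p%40ss%2Fword`, so the `@` inside it is not mistaken for the host separator.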

	### 3. Batch Compound Classification Tool
	- Update the default configuration in `CECs annotating_agent_v1.0.py`:
	```python
	self.default_api_key = "<DIFY_API_KEY>"
	self.default_base_url = "http://<DIFY_HOST>:<PORT>/v1"
	self.default_csv_path = "./path_to_your_data.csv"
	```
	- Run the program and use the GUI to upload a CSV file and execute batch classification.
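For orientation, a single call to the agent might look like the sketch below. It assumes the agent is published as a Dify workflow app reached through the `workflows/run` endpoint, and that the workflow takes an input variable named `compound_name`. Both are assumptions to verify against the actual Dify app, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_dify_request(base_url, api_key, compound_name):
    """Build a blocking workflow-run request for one compound."""
    body = {
        "inputs": {"compound_name": compound_name},  # variable name is hypothetical
        "response_mode": "blocking",
        "user": "batch-annotator",  # arbitrary end-user identifier
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/workflows/run",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def annotate_compound(base_url, api_key, compound_name):
    """Run the workflow and return its outputs (needs a reachable Dify server)."""
    request = build_dify_request(base_url, api_key, compound_name)
    with urllib.request.urlopen(request, timeout=120) as response:
        result = json.loads(response.read().decode("utf-8"))
    # Fall back to an empty dict if the run produced no outputs.
    return (result.get("data") or {}).get("outputs") or {}
```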

	---

	## Example Data

	### Input File Format
	The input CSV file should contain a column with compound names. For example:
	```csv
	IUPAC_name
	Methanol
	Ethanol
	Acetone
	```

	### Output File Format
	The output file will be in CSV format and include the following fields:
	- `CompoundName`: The compound name.
	- `MainCategory`: The main classification category.
	- `AdditionalCategory1`: Subcategory 1.
	- `AdditionalCategory2`: Subcategory 2.
	- `EndpointName`: Expanded endpoint classification.
	- `XLogP`: XLogP value.
	- `BioPathway`: Biological pathway information.
	- `ToxicityInfo`: Toxicity information.
	- `KnownUse`: Known uses of the compound.
	- `DisorderDisease`: Associated disorders or diseases.
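The input and output formats above suggest a simple batch loop: read each compound name, annotate it, and write one output row per compound. The sketch below uses the standard `csv` module rather than pandas so that it has no dependencies, and takes the annotation step as a callable `annotate` (a stand-in for the Dify call). The function name `annotate_csv`, the default column name, and the choice to leave missing fields blank are assumptions.

```python
import csv

OUTPUT_FIELDS = [
    "CompoundName", "MainCategory", "AdditionalCategory1", "AdditionalCategory2",
    "EndpointName", "XLogP", "BioPathway", "ToxicityInfo", "KnownUse",
    "DisorderDisease",
]


def annotate_csv(input_path, output_path, annotate, name_column="IUPAC_name"):
    """Annotate each compound in `input_path`; return the number of rows written.

    `annotate` maps a compound name to a dict of output fields.
    """
    count = 0
    with open(input_path, newline="", encoding="utf-8") as src, \
            open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # Missing fields are written as blanks; unknown keys are ignored.
        writer = csv.DictWriter(
            dst, fieldnames=OUTPUT_FIELDS, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        for row in reader:
            name = (row.get(name_column) or "").strip()
            if not name:
                continue  # skip rows without a compound name
            try:
                fields = dict(annotate(name))
            except Exception as exc:  # one failure should not stop the batch
                print(f"Annotation failed for {name}: {exc}")
                fields = {}
            fields["CompoundName"] = name
            writer.writerow(fields)
            count += 1
    return count
```

Passing the annotation step as a callable keeps the file handling testable without a live Dify server.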

	---

	## Contributing

	We welcome contributions! If you are interested in improving this project, feel free to submit pull requests or suggestions.

	---

	## License

	This project is licensed under the CC BY-NC 4.0 license.

	---