Create README.md
README.md
ADDED
|
@@ -0,0 +1,81 @@
| 1 |
+
---
|
| 2 |
+
license: cc-by-nc-4.0
|
| 3 |
+
tags:
|
| 4 |
+
- agent
|
| 5 |
+
- chemistry
|
| 6 |
+
---
|
| 7 |
+
## Model Overview
|
| 8 |
+
This model is a **Contaminants of Emerging Concern Annotation Intelligent Agent** built on the Dify platform and integrated with the Norman knowledge base, the Pubchemlite_exposomics database, and the Invitrodb_v4.3 database. Given the IUPAC name of a target contaminant, it supports high-throughput, large-scale annotation of emerging contaminants: **usage classification**, **toxicity endpoints**, and the corresponding **AC50 values**.
|
| 9 |
+
|
| 10 |
+
## Model Purpose
|
| 11 |
+
The system builds a specialized knowledge database for classifying the usage of emerging contaminants by combining multi-source chemical and toxicological databases with AI agents. The core goals are:
|
| 12 |
+
1. Annotate the usage categories of emerging contaminants quickly and at large scale.
|
| 13 |
+
2. Provide efficient query services for toxicity endpoints and their corresponding AC50 values.
|
| 14 |
+
3. Support high-throughput data analysis scenarios for emerging contaminants in environmental chemistry and toxicology research.
|
| 15 |
+
|
| 16 |
+
## Key Definitions
|
| 17 |
+
| Term | Definition |
|
| 18 |
+
|------|------------|
|
| 19 |
+
| **System** | Refers to the *Emerging Contaminants Annotation Intelligent Agent Based on Dify Platform* |
|
| 20 |
+
| **User** | Anyone authorized to use the functions of this system |
|
| 21 |
+
| **IUPAC Name** | A systematic chemical name that follows the nomenclature rules formulated by IUPAC and accurately describes the composition and structure of a chemical substance |
|
| 22 |
+
| **AI Agent** | A system based on large language models (LLMs) that understands user intentions and invokes multiple tools to solve complex tasks; in this system, it accepts IUPAC names of emerging contaminants and outputs usage classification, toxicity endpoints, and AC50 values |
|
| 23 |
+
| **Norman Database** | The database of the Norman network, which monitors and evaluates environmental pollutants and facilitates European and international cooperation and data sharing in environmental pollution monitoring; classifies over 100,000 chemicals |
|
| 24 |
+
| **Pubchemlite_exposomics Database** | An open-source organic molecule information database derived from PubChem, applicable for mass spectrometry analysis and non-targeted identification of unknown pollutants |
|
| 25 |
+
| **Invitrodb_v4.3 Database** | The core database of US EPA ToxCast, storing a large amount of biological activity data, analysis workflows, and metadata of compounds generated by high-throughput screening (HTS) |
|
| 26 |
+
|
| 27 |
+
## System Architecture & Components
|
| 28 |
+
### Core Databases Deployment
|
| 29 |
+
The system integrates three core databases with differentiated deployment strategies:
|
| 30 |
+
1. **Norman Chemical Classification Database**
|
| 31 |
+
- Serves as a relational knowledge base, uploaded and parsed on the FastGPT platform, then embedded into the Dify platform.
|
| 32 |
+
- Optimized classification: redundant categories were merged or removed, leaving chemicals grouped into **9 classes** (see Figure 1).
|
| 33 |
+
2. **Pubchemlite_exposomics & Invitrodb_v4.3 Databases**
|
| 34 |
+
- Deployed in local SQL databases to support efficient local query and invocation.
|
| 35 |
+
- Query workflow: GPT-4o generates SQL statements → valid SQL queries are extracted from the model output → the backend executes the queries against the databases and returns the results.
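
The query workflow above can be sketched roughly as follows. This is a minimal illustration, not the system's actual backend: it assumes the model wraps its SQL in a fenced block, uses an in-memory SQLite database as a stand-in for the local SQL deployment, and the table name `toxicity`, its columns, and the placeholder row are all hypothetical.

```python
import re
import sqlite3

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability

def extract_sql(llm_output: str) -> str:
    """Pull a single read-only SELECT statement out of free-form model output."""
    # Prefer the contents of a fenced block if the model produced one.
    fenced = re.search(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, llm_output,
                       re.DOTALL | re.IGNORECASE)
    candidate = fenced.group(1) if fenced else llm_output
    match = re.search(r"\bSELECT\b.*?(?:;|$)", candidate, re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("no SELECT statement found in model output")
    sql = match.group(0).strip().rstrip(";")
    # Reject anything that could modify the database.
    if re.search(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER)\b", sql, re.IGNORECASE):
        raise ValueError("only read-only queries are allowed")
    return sql

def run_query(conn: sqlite3.Connection, sql: str) -> list:
    """Execute the query and return each row as a dict keyed by column name."""
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Stand-in database with a hypothetical schema and one placeholder row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toxicity (iupac_name TEXT, endpoint TEXT, ac50 REAL)")
conn.execute("INSERT INTO toxicity VALUES ('example-compound', 'EXAMPLE_ASSAY', 1.0)")

# Simulated model output; in the real system this text would come from GPT-4o.
model_output = (
    "Here is the query:\n"
    + FENCE + "sql\n"
    + "SELECT endpoint, ac50 FROM toxicity WHERE iupac_name = 'example-compound';\n"
    + FENCE
)

sql = extract_sql(model_output)
rows = run_query(conn, sql)
print(rows)
```

Restricting the extracted statement to a single `SELECT` is one simple way to keep model-generated SQL from modifying the local databases; a production setup would likely also use a read-only database account.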
|
| 36 |
+
|
| 37 |
+
### AI Workflow Design (Dify Chatflow)
|
| 38 |
+
1. **Base Model**: GPT-4o is used to generate SQL query statements and organize output data in JSON format for subsequent data extraction.
|
| 39 |
+
2. **Custom Schema Tool**
|
| 40 |
+
- Created on the Dify platform to standardize SQL statement generation and API invocation logic.
|
| 41 |
+
- Implementation steps: Create custom tool → Configure tool name and Schema rules (see Appendix for details).
|
| 42 |
+
3. **Knowledge Base Integration (FastGPT + Dify)**
|
| 43 |
+
- **FastGPT Knowledge Base Construction**
|
| 44 |
+
1. Log in to FastGPT (https://fastgpt.aiown.top/) and enter the main interface.
|
| 45 |
+
2. Import dataset (50,000+ chemicals with IUPAC names and categories from the Norman database).
|
| 46 |
+
3. Connect the dataset to a FastGPT application and configure prompts (consistent with Few-shot prompts).
|
| 47 |
+
4. Publish the application and export the API key for subsequent calls.
|
| 48 |
+
- **Fast-Dify Adaptor (FDA) Plugin**
|
| 49 |
+
- Resolves API incompatibility between FastGPT and Dify.
|
| 50 |
+
- Deployment steps: Create `docker-compose.yml` for FDA → Run `docker-compose up -d` in the configuration file directory to deploy the plugin.
|
| 51 |
+
- **Dify External Knowledge Base Connection**: Link the trained FastGPT knowledge base to Dify by importing the FastGPT API key and knowledge base ID.
|
| 52 |
+
|
| 53 |
+
## System Requirements
|
| 54 |
+
| Category | Specification |
|
| 55 |
+
|----------|---------------|
|
| 56 |
+
| Operating System | Windows 10 |
|
| 57 |
+
| Python Version | Python 3.8 or higher |
|
| 58 |
+
| Dependencies | Docker (for FDA plugin deployment), FastGPT access rights, Dify platform account |
|
| 59 |
+
|
| 60 |
+
## Usage Instructions
|
| 61 |
+
### Preparations
|
| 62 |
+
1. Optimize the Norman database classification (merge the categories into 9 classes).
|
| 63 |
+
2. Deploy FDA plugin to connect FastGPT and Dify.
|
| 64 |
+
3. Import the Norman dataset into FastGPT, configure the application, and export the API key.
|
| 65 |
+
4. Create a custom Schema tool on Dify and configure SQL invocation rules.
|
| 66 |
+
5. Deploy the Pubchemlite_exposomics and Invitrodb_v4.3 databases to local SQL databases and test query connectivity.
|
| 67 |
+
|
| 68 |
+
### Inference Workflow
|
| 69 |
+
1. Input the **IUPAC name** of the emerging contaminant into the Dify chat interface.
|
| 70 |
+
2. The AI agent invokes:
|
| 71 |
+
- FastGPT knowledge base for **usage classification** via FDA plugin.
|
| 72 |
+
- Local SQL databases for **toxicity endpoints and AC50 values** via GPT-4o-generated SQL queries.
|
| 73 |
+
3. Receive the structured output (JSON format) containing usage category, toxicity endpoints, and corresponding AC50 values.
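
For downstream analysis, the JSON output from step 3 can be loaded into typed records. The sketch below is only an illustration: the README does not specify the output schema, so the field names `usage_category`, `toxicity_endpoints`, `endpoint`, and `ac50` are assumptions, and the example payload contains placeholder values rather than real results.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Endpoint:
    name: str
    ac50: Optional[float]  # None when no AC50 value was returned

@dataclass
class Annotation:
    usage_category: str
    endpoints: List[Endpoint]

def parse_annotation(answer: str) -> Annotation:
    """Convert the agent's JSON answer into an Annotation record.

    Field names are hypothetical; adjust them to the actual output schema.
    """
    data = json.loads(answer)
    endpoints = [
        Endpoint(name=item["endpoint"], ac50=item.get("ac50"))
        for item in data.get("toxicity_endpoints", [])  # may be empty for unrecorded chemicals
    ]
    return Annotation(usage_category=data["usage_category"], endpoints=endpoints)

# Placeholder payload, not real agent output.
example = (
    '{"usage_category": "EXAMPLE_CATEGORY", '
    '"toxicity_endpoints": [{"endpoint": "EXAMPLE_ASSAY", "ac50": 1.0}]}'
)
annotation = parse_annotation(example)
print(annotation)
```

Treating a missing endpoint list as empty keeps batch runs from failing on contaminants that are absent from the toxicity databases.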
|
| 74 |
+
|
| 75 |
+
## Limitations
|
| 76 |
+
1. The accuracy of annotations depends on the completeness of the Norman, Pubchemlite_exposomics, and Invitrodb_v4.3 databases; unrecorded emerging contaminants may return empty results.
|
| 77 |
+
2. Requires local deployment of the SQL databases and the FDA plugin, so setting up the environment takes some configuration effort.
|
| 78 |
+
3. Currently only supports input of **IUPAC names**; other naming formats (e.g., common names) are not supported.
|
| 79 |
+
|
| 80 |
+
## Contact
|
| 81 |
+
laquh1086@163.com
|