Qianhui19 committed on
Commit
e571fc9
·
verified ·
1 Parent(s): e62c393

Create README.md

Files changed (1)
  1. README.md +81 -0
README.md ADDED
@@ -0,0 +1,81 @@
+ ---
+ license: cc-by-nc-4.0
+ tags:
+ - agent
+ - chemistry
+ ---
+ ## Model Overview
+ This model is a **Contaminants of Emerging Concern Annotation Intelligent Agent** built on the Dify platform and integrated with the Norman knowledge base, the Pubchemlite_exposomics database, and the Invitrodb_v4.3 database. Given the IUPAC name of a target contaminant, it returns the **usage classification**, the **toxicity endpoints**, and the corresponding **AC50 values**, enabling high-throughput, large-scale annotation of emerging contaminants.
+
+ ## Model Purpose
+ The system builds a specialized knowledge base for classifying the usage of emerging contaminants by combining multi-source chemical and toxicological databases with AI agents. The core goals are:
+ 1. Annotate the usage categories of emerging contaminants quickly and at scale.
+ 2. Provide efficient queries for toxicity endpoints and their corresponding AC50 values.
+ 3. Support high-throughput data analysis of emerging contaminants in environmental chemistry and toxicology research.
+
+ ## Key Definitions
+ | Term | Definition |
+ |------|------------|
+ | **System** | The *Emerging Contaminants Annotation Intelligent Agent Based on Dify Platform* |
+ | **User** | Anyone authorized to use the functions of this system |
+ | **IUPAC Name** | A name formed under the systematic naming convention formulated by IUPAC, which describes the composition and structure of a chemical substance precisely |
+ | **AI Agent** | A system based on large language models (LLMs) that understands user intentions and invokes multiple tools to solve complex tasks; in this system, it accepts IUPAC names of emerging contaminants and outputs usage classification, toxicity endpoints, and AC50 values |
+ | **Norman Database** | The database of the NORMAN network, which monitors and evaluates environmental pollutants and facilitates European and international cooperation and data sharing in environmental pollution monitoring; classifies over 100,000 chemicals |
+ | **Pubchemlite_exposomics Database** | An open-source database of organic molecule information derived from PubChem, suited to mass spectrometry analysis and non-targeted identification of unknown pollutants |
+ | **Invitrodb_v4.3 Database** | The core database of US EPA ToxCast, storing biological activity data, analysis workflows, and compound metadata generated by high-throughput screening (HTS) |
+
+ ## System Architecture & Components
+ ### Core Databases Deployment
+ The system integrates three core databases, deployed in two different ways:
+ 1. **Norman Chemical Classification Database**
+    - Serves as a relational knowledge base: it is uploaded and parsed on the FastGPT platform, then embedded into the Dify platform.
+    - Optimized classification: redundant categories were merged or removed, leaving **9 classes** (see Figure 1).
+ 2. **Pubchemlite_exposomics & Invitrodb_v4.3 Databases**
+    - Deployed in local SQL databases to support efficient local querying and invocation.
+    - Query workflow: GPT-4o generates SQL statements → valid SQL queries are extracted → the backend executes them against the database and returns the results.
+
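The query workflow above (model reply → extracted SQL → backend execution) can be sketched roughly as follows. This is a minimal illustration and not the project's backend: the `assay_hits` table, its columns, the placeholder row, and the use of an in-memory SQLite database are all assumptions made for the example, and the real Invitrodb_v4.3 schema is different.

```python
import re
import sqlite3

FENCE = "`" * 3  # Markdown code-fence marker, built at runtime to keep this listing fence-safe
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.IGNORECASE
)


def extract_select(reply: str) -> str:
    """Return the first SELECT statement found in a model reply, or raise ValueError."""
    # If the model wrapped its SQL in a fenced block, look only inside that block.
    fenced = re.search(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, reply, re.DOTALL | re.IGNORECASE)
    text = fenced.group(1) if fenced else reply
    # Take everything from SELECT up to the first semicolon or the end of the text.
    match = re.search(r"\bselect\b.*?(?=;|\Z)", text, re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("no SELECT statement found")
    sql = match.group(0).strip()
    # Crude read-only guard: reject statements that mention write or DDL keywords.
    if FORBIDDEN.search(sql):
        raise ValueError("statement is not read-only")
    return sql


def run_query(conn: sqlite3.Connection, reply: str) -> list:
    """Extract the SQL from a model reply and execute it against the local database."""
    return conn.execute(extract_select(reply)).fetchall()


# Hypothetical stand-in for a local toxicity table, filled with placeholder data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assay_hits (iupac_name TEXT, endpoint TEXT, ac50 REAL)")
conn.execute("INSERT INTO assay_hits VALUES ('example-compound', 'EXAMPLE_ENDPOINT', 1.5)")

reply = (
    "Here is the query:\n" + FENCE + "sql\n"
    "SELECT endpoint, ac50 FROM assay_hits WHERE iupac_name = 'example-compound';\n" + FENCE
)
rows = run_query(conn, reply)
print(rows)
```

Cutting the statement at the first semicolon and rejecting write keywords keeps a malformed or over-eager model reply from modifying the local databases; a production setup would more likely rely on a read-only database account.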
+ ### AI Workflow Design (Dify Chatflow)
+ 1. **Base Model**: GPT-4o generates the SQL query statements and organizes the output data as JSON for subsequent data extraction.
+ 2. **Custom Schema Tool**
+    - Created on the Dify platform to standardize SQL statement generation and API invocation logic.
+    - Implementation steps: create a custom tool → configure the tool name and Schema rules (see Appendix for details).
+ 3. **Knowledge Base Integration (FastGPT + Dify)**
+    - **FastGPT Knowledge Base Construction**
+      1. Log in to FastGPT (https://fastgpt.aiown.top/) and enter the main interface.
+      2. Import the dataset (50,000+ chemicals with IUPAC names and categories from the Norman database).
+      3. Connect the dataset to a FastGPT application and configure the prompts (consistent with the few-shot prompts).
+      4. Publish the application and export the API key for subsequent calls.
+    - **Fast-Dify Adaptor (FDA) Plugin**
+      - Resolves API incompatibility between FastGPT and Dify.
+      - Deployment steps: create a `docker-compose.yml` for FDA → run `docker-compose up -d` in the directory containing that file to deploy the plugin.
+    - **Dify External Knowledge Base Connection**: Link the trained FastGPT knowledge base to Dify by importing the FastGPT API key and knowledge base ID.
+
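To give a feel for the role an adaptor such as FDA plays, the sketch below handles a Dify-style external knowledge retrieval request and reshapes upstream search hits into a `records` list. It is not the FDA plugin's code. The request and response shapes follow the author's reading of Dify's external knowledge API and should be checked against the Dify documentation; the `search` callable, `fake_search`, and the hit fields (`title`, `text`, `score`) are hypothetical stand-ins for the call to the FastGPT knowledge base.

```python
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]


def handle_retrieval(
    request: Dict[str, Any],
    search: Callable[[str, str], List[Hit]],
) -> Dict[str, Any]:
    """Answer a Dify-style retrieval request using an upstream search function."""
    settings = request.get("retrieval_setting", {})
    top_k = int(settings.get("top_k", 3))
    threshold = float(settings.get("score_threshold", 0.0))

    # Delegate the actual lookup to the upstream knowledge base (hypothetical callable).
    hits = search(request["knowledge_id"], request["query"])

    # Keep hits at or above the threshold, best first, at most top_k of them.
    kept = sorted(
        (hit for hit in hits if hit["score"] >= threshold),
        key=lambda hit: hit["score"],
        reverse=True,
    )[:top_k]

    return {
        "records": [
            {"title": hit["title"], "content": hit["text"], "score": hit["score"], "metadata": {}}
            for hit in kept
        ]
    }


def fake_search(knowledge_id: str, query: str) -> List[Hit]:
    """Placeholder upstream search returning made-up hits for illustration only."""
    return [
        {"title": "norman_categories", "text": query + ": Category A", "score": 0.9},
        {"title": "norman_categories", "text": "unrelated entry", "score": 0.2},
    ]


response = handle_retrieval(
    {
        "knowledge_id": "kb-placeholder",
        "query": "example-compound",
        "retrieval_setting": {"top_k": 2, "score_threshold": 0.5},
    },
    fake_search,
)
print(response)
```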
+ ## System Requirements
+ | Category | Specification |
+ |----------|---------------|
+ | Operating System | Windows 10 |
+ | Python Version | Python 3.8 or higher |
+ | Dependencies | Docker (for FDA plugin deployment), FastGPT access rights, Dify platform account |
+
+ ## Usage Instructions
+ ### Preparations
+ 1. Optimize the Norman database classification by merging it into 9 categories.
+ 2. Deploy the FDA plugin to connect FastGPT and Dify.
+ 3. Import the Norman dataset into FastGPT, configure the application, and export the API key.
+ 4. Create a custom Schema tool on Dify and configure the SQL invocation rules.
+ 5. Deploy the Pubchemlite_exposomics and Invitrodb_v4.3 databases to a local SQL database and test query connectivity.
+
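For the custom Schema tool in step 4, Dify custom tools are described with an OpenAPI document. The sketch below builds one as a Python dictionary and prints it as JSON for pasting into the tool editor. The server URL, the `/query` path, the `runSqlQuery` operation ID, and the request fields are hypothetical; the project's actual Schema rules are the ones given in its Appendix.

```python
import json
from typing import Any, Dict


def build_tool_schema(server_url: str) -> Dict[str, Any]:
    """Build a minimal OpenAPI description of a hypothetical SQL query endpoint."""
    return {
        "openapi": "3.1.0",
        "info": {"title": "Local SQL query tool", "version": "0.1.0"},
        "servers": [{"url": server_url}],
        "paths": {
            "/query": {
                "post": {
                    "operationId": "runSqlQuery",
                    "summary": "Run a read-only SQL query against the local databases",
                    "requestBody": {
                        "required": True,
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "object",
                                    "properties": {
                                        "sql": {
                                            "type": "string",
                                            "description": "A single SELECT statement",
                                        }
                                    },
                                    "required": ["sql"],
                                }
                            }
                        },
                    },
                    "responses": {"200": {"description": "Query rows as JSON"}},
                }
            }
        },
    }


# Placeholder address for the backend that executes the queries.
schema = build_tool_schema("http://localhost:8000")
print(json.dumps(schema, indent=2))
```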
+ ### Inference Workflow
+ 1. Enter the **IUPAC name** of the emerging contaminant in the Dify chat interface.
+ 2. The AI agent invokes:
+    - The FastGPT knowledge base for **usage classification**, via the FDA plugin.
+    - The local SQL databases for **toxicity endpoints and AC50 values**, via GPT-4o-generated SQL queries.
+ 3. Receive the structured output (JSON format) containing the usage category, toxicity endpoints, and corresponding AC50 values.
+
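Downstream code can load the structured output from step 3 into typed records. The README does not specify the JSON layout, so the keys used here (`iupac_name`, `usage_category`, `toxicity`, `endpoint`, `ac50`) and the sample values are assumptions for illustration only.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EndpointHit:
    endpoint: str
    ac50: Optional[float]  # None when no AC50 value was returned


@dataclass
class Annotation:
    iupac_name: str
    usage_category: Optional[str]  # None when the contaminant was not classified
    hits: List[EndpointHit] = field(default_factory=list)


def parse_annotation(raw: str) -> Annotation:
    """Parse the agent's JSON reply, tolerating missing categories and empty results."""
    data = json.loads(raw)
    hits = [
        EndpointHit(endpoint=item["endpoint"], ac50=item.get("ac50"))
        for item in data.get("toxicity", [])
    ]
    return Annotation(
        iupac_name=data["iupac_name"],
        usage_category=data.get("usage_category"),
        hits=hits,
    )


# Placeholder reply, not real annotation data.
raw_reply = json.dumps(
    {
        "iupac_name": "example-compound",
        "usage_category": "Category A",
        "toxicity": [{"endpoint": "EXAMPLE_ENDPOINT", "ac50": 1.5}],
    }
)
annotation = parse_annotation(raw_reply)
print(annotation)
```

Treating the category and the endpoint list as optional lets the same parser handle contaminants that are missing from one of the databases and therefore come back with empty results.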
+ ## Limitations
+ 1. Annotation accuracy depends on the completeness of the Norman, Pubchemlite_exposomics, and Invitrodb_v4.3 databases; emerging contaminants that are not recorded in them may return empty results.
+ 2. The system requires local deployment of the SQL databases and the FDA plugin, which raises the barrier to setting up the environment.
+ 3. Only **IUPAC names** are currently supported as input; other naming formats (e.g., common names) are not.
+
+ ## Contact
+ laquh1086@163.com