---
license: cc-by-nc-4.0
tags:
- agent
- chemistry
- environment
---

## Model Overview

This model is a **Contaminants of Emerging Concern (CECs) Annotation Intelligent Agent** built on the Dify platform and integrated with the Norman knowledge base, the Pubchemlite_exposomics database, and the Invitrodb_v4.3 database. Given the IUPAC name of a target contaminant, it returns the contaminant's **usage classification** and **toxicity endpoints**, enabling high-throughput, large-scale annotation of emerging contaminants.

## Model Purpose

The system builds a specialized knowledge database for the usage classification of emerging contaminants by combining multi-source chemical and toxicological databases with AI agents. The core goals are:

1. Annotate the usage categories of emerging contaminants quickly and at scale.
2. Provide efficient query services for toxicity endpoints.
3. Support high-throughput data analysis of emerging contaminants in environmental chemistry and toxicology research.

## Key Definitions

| Term | Definition |
|------|------------|
| **System** | The *Emerging Contaminants Annotation Intelligent Agent Based on Dify Platform* |
| **User** | Anyone authorized to use the functions of this system |
| **IUPAC Name** | A name assigned under the systematic nomenclature formulated by IUPAC to describe the composition and structure of a chemical substance accurately |
| **AI Agent** | A system based on large language models (LLMs) that understands user intentions and invokes multiple tools to solve complex tasks; in this system, it accepts IUPAC names of emerging contaminants and outputs usage classification, toxicity endpoints, and AC50 values |
| **Norman Database** | The database of the Norman network, which monitors and evaluates environmental pollutants and facilitates European and international cooperation and data sharing in environmental pollution monitoring; classifies over 100,000 chemicals |
| **Pubchemlite_exposomics Database** | An open-source database of organic molecule information derived from PubChem, suited to mass spectrometry analysis and non-targeted identification of unknown pollutants |
| **Invitrodb_v4.3 Database** | The core database of US EPA ToxCast, storing biological activity data, analysis workflows, and compound metadata generated by high-throughput screening (HTS) |

## System Architecture & Components

### Core Databases Deployment

The system integrates three core databases, deployed in two different ways:

1. **Norman Chemical Classification Database**
   - Serves as a relational knowledge base: it is uploaded and parsed on the FastGPT platform, then embedded into the Dify platform.
   - Optimized classification: redundant categories were merged or removed, leaving chemicals grouped into **9 classes**.

   <img src="figure/Norman_category.png" alt="Norman_category" width="400">

2. **Pubchemlite_exposomics & Invitrodb_v4.3 Databases**
   - Deployed in local SQL databases to support efficient local querying.
   - Query workflow: GPT-4o generates SQL statements → valid SQL queries are extracted → the backend executes them against the databases and returns the results.

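The query workflow above (generate SQL, extract the valid query, execute it) can be sketched as follows. This is a minimal illustration, not the project's backend: it uses Python's built-in `sqlite3` as a stand-in for the local SQL database, and the table `toxicity`, its columns, the helper names `extract_select` and `run_query`, and the sample model reply are all hypothetical.

```python
import re
import sqlite3

FENCE = "`" * 3  # Markdown code fence that a model may wrap around its SQL

# First SELECT statement, up to its terminating semicolon or the end of the text.
# Limitation: a semicolon inside a string literal would cut the statement short.
_SELECT_RE = re.compile(r"\bSELECT\b.*?(?:;|$)", re.IGNORECASE | re.DOTALL)


def extract_select(llm_text: str) -> str:
    """Return the first SELECT statement found in the model output."""
    cleaned = llm_text.replace(FENCE, "")
    match = _SELECT_RE.search(cleaned)
    if match is None:
        raise ValueError("no SELECT statement found in model output")
    return match.group(0).rstrip(";").strip()


def run_query(conn: sqlite3.Connection, statement: str) -> list:
    """Execute one statement and return the rows as dictionaries."""
    cursor = conn.execute(statement)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical in-memory table standing in for the local toxicity database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toxicity (iupac_name TEXT, endpoint TEXT, ac50_um REAL)")
conn.execute("INSERT INTO toxicity VALUES ('propan-2-one', 'demo_endpoint', 12.5)")

# Hypothetical model reply: prose plus a fenced SQL query.
reply = (
    "Here is the query:\n" + FENCE + "sql\n"
    "SELECT endpoint, ac50_um FROM toxicity WHERE iupac_name = 'propan-2-one';\n"
    + FENCE
)
rows = run_query(conn, extract_select(reply))
print(rows)
```

Accepting only a single `SELECT` statement keeps model-generated SQL read-only in this sketch; a production backend would likely also use a database account without write privileges.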
### Agent Workflow Design (Dify Chatflow)

1. **Base Model**: GPT-4o generates the SQL query statements and organizes the output data in JSON format for subsequent data extraction.
2. **Custom Schema Tool**
   - Created on the Dify platform to standardize SQL statement generation and API invocation logic.
   - Implementation steps: create a custom tool → configure the tool name and Schema rules (see `schema_tool.txt` for details).
3. **Knowledge Base Integration (FastGPT + Dify)**
   - **FastGPT Knowledge Base Construction**
     1. Log in to FastGPT (https://fastgpt.aiown.top/) and enter the main interface.
     2. Import the dataset (50,000+ chemicals with IUPAC names and categories from the Norman database).
     3. Connect the dataset to a FastGPT application and configure the prompts (consistent with the few-shot prompts).
     4. Publish the application and export the API key for subsequent calls.
   - **Fast-Dify Adaptor (FDA) Plugin**
     - Resolves the API incompatibility between FastGPT and Dify.
     - Deployment steps: create a `docker-compose.yml` for FDA → run `docker-compose up -d` in the directory containing that file to deploy the plugin.
   - **Dify External Knowledge Base Connection**: Link the trained FastGPT knowledge base to Dify by importing the FastGPT API key and knowledge base ID.

<div align="center">

<img src="figure/pipeline.jpg" alt="Norman_category" width="800">

</div>

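Because the base model returns its results as JSON for later extraction, the downstream step has to pull a JSON object out of a reply that may also contain prose. The sketch below shows one way to do that with the standard library. The function name `extract_json_object`, the keys `usage_category` and `toxicity_endpoints`, and the sample reply are placeholders; the actual output schema is defined by the chatflow prompts and is not reproduced here.

```python
import json


def extract_json_object(reply: str) -> dict:
    """Return the first JSON object embedded in a model reply."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            parsed, _end = decoder.raw_decode(reply, start)
            return parsed
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in reply")


# Hypothetical reply with placeholder keys and values.
reply = (
    "Result:\n"
    '{"usage_category": "example_category", '
    '"toxicity_endpoints": [{"endpoint": "demo_endpoint", "ac50_um": 12.5}]}\n'
    "Done."
)
record = extract_json_object(reply)
print(record["usage_category"], len(record["toxicity_endpoints"]))
```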
## CECs BatchAnnotator (Desktop Version)

CECs BatchAnnotator is a desktop tool that lets chemists and researchers automate batch retrieval of compound information from CSV files (default column: `IUPAC_name`). It outputs standardized CSV files, failure logs, and raw API records for analysis and debugging. It requires a valid Dify API key and an accessible backend.

<div align="center">

<img src="figure/CECs_BatchAnnotator_v1.0.png" alt="Norman_category" width="400">

</div>

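The batch behaviour described above (read names from the `IUPAC_name` column, collect results, log failures separately) can be approximated with the short sketch below. It is not the BatchAnnotator source code: `annotate_batch`, `to_csv`, and the stand-in `fake_annotate` function are hypothetical, and a real run would replace `fake_annotate` with a call to the Dify backend.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Tuple


def annotate_batch(
    rows: Iterable[Dict[str, str]],
    annotate: Callable[[str], Dict[str, str]],
    name_column: str = "IUPAC_name",
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Annotate every row; return (results, failures) without aborting the batch."""
    results, failures = [], []
    for row in rows:
        name = (row.get(name_column) or "").strip()
        if not name:
            failures.append({name_column: name, "error": "empty name"})
            continue
        try:
            annotation = annotate(name)
        except Exception as exc:  # one bad compound must not stop the batch
            failures.append({name_column: name, "error": str(exc)})
            continue
        results.append({name_column: name, **annotation})
    return results, failures


def to_csv(records: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize records to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def fake_annotate(name: str) -> Dict[str, str]:
    """Stand-in for the real backend call (hypothetical)."""
    if name == "unknown-compound":
        raise LookupError("not found in knowledge base")
    return {"usage_category": "example_category"}


source = io.StringIO("IUPAC_name\npropan-2-one\nunknown-compound\n")
results, failures = annotate_batch(csv.DictReader(source), fake_annotate)
output_csv = to_csv(results, ["IUPAC_name", "usage_category"])
failure_log = to_csv(failures, ["IUPAC_name", "error"])
print(output_csv)
print(failure_log)
```

Keeping failures in their own list mirrors the separate failure log mentioned above and makes it easy to retry only the compounds that failed.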
## System Requirements

| Category | Specification |
|----------|---------------|
| Operating System | Windows 10 |
| Python Version | Python 3.8 or higher |
| Dependencies | Docker, FastGPT access rights, Dify platform account |

## Usage Instructions

### Preparations

1. Complete the Norman database classification optimization (merge into 9 categories).
2. Deploy the FDA plugin to connect FastGPT and Dify.
3. Import the Norman dataset (`knowledge_database_input_iupac.csv`) into FastGPT, configure the application, and export the API key.
4. Create a custom Schema tool on Dify and configure the SQL invocation rules.
5. Deploy the Pubchemlite_exposomics (`PubChemLite_exposomics_20251226.csv`) and Invitrodb_v4.3 databases to a local SQL database and test query connectivity.
6. Create a backend program (`step1_pubchemlite_invitrodb_to_dify_en.py`) to connect Dify and the SQL databases.

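Step 5 (loading a CSV export into a local SQL database and checking that it can be queried) might look like the sketch below. It assumes SQLite through Python's built-in `sqlite3` purely for illustration; the README does not name the SQL engine. The table name `pubchemlite`, the sample column names, and the helper functions are hypothetical, and every column is stored as text for simplicity.

```python
import csv
import io
import sqlite3
from typing import TextIO


def quote_ident(name: str) -> str:
    """Quote a table or column name for SQLite."""
    return '"' + name.replace('"', '""') + '"'


def load_csv_into_table(conn: sqlite3.Connection, table: str, csv_file: TextIO) -> int:
    """Create `table` from the CSV header, insert all rows, return the row count."""
    reader = csv.reader(csv_file)
    header = next(reader)
    column_defs = ", ".join(quote_ident(col) + " TEXT" for col in header)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS {} ({})".format(quote_ident(table), column_defs)
    )
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quote_ident(table), placeholders), reader
    )
    conn.commit()
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM {}".format(quote_ident(table))
    ).fetchone()
    return count


# Small in-memory sample standing in for the real CSV file (hypothetical columns).
sample = io.StringIO("Identifier,CompoundName\n1,demo compound A\n2,demo compound B\n")
conn = sqlite3.connect(":memory:")
loaded = load_csv_into_table(conn, "pubchemlite", sample)

# Connectivity check: one parameterized lookup against the loaded table.
hit = conn.execute(
    'SELECT "CompoundName" FROM "pubchemlite" WHERE "Identifier" = ?', ("1",)
).fetchone()
print(loaded, hit)
```

For the real file, `sample` would be replaced by `open(path, newline="", encoding="utf-8")` and the connection would point at a database file instead of `:memory:`.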
### Inference Workflow

1. Enter the **IUPAC name** of the emerging contaminant in the Dify chat interface.
2. The AI agent invokes:
   - The FastGPT knowledge base for **usage classification**, via the FDA plugin.
   - The local SQL databases for **toxicity endpoints**, via GPT-4o-generated SQL queries.
3. Receive the structured output (JSON format) containing the usage category and toxicity endpoints.

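The same workflow can be driven programmatically instead of through the chat interface. The sketch below builds a request for Dify's `chat-messages` endpoint in blocking mode using only the standard library. The base URL `http://localhost/v1`, the `user` identifier, and the placeholder API key are assumptions to adapt to your deployment, and the exact shape of the answer text depends on how the chatflow is configured.

```python
import json
import urllib.request


def build_chat_request(
    api_key: str,
    iupac_name: str,
    base_url: str = "http://localhost/v1",  # assumed self-hosted Dify address
    user: str = "batch-annotator",          # arbitrary end-user identifier
) -> urllib.request.Request:
    """Build a blocking chat-messages request carrying one IUPAC name."""
    payload = {
        "inputs": {},
        "query": iupac_name,
        "response_mode": "blocking",
        "conversation_id": "",
        "user": user,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat-messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the agent's answer text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["answer"]


# Build (but do not send) a request; "app-your-key" is a placeholder.
request = build_chat_request("app-your-key", "propan-2-one")
print(request.get_method(), request.full_url)
# answer = send(request)  # uncomment once the backend is reachable
```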
## Limitations

1. Annotation accuracy depends on the completeness of the Norman, Pubchemlite_exposomics, and Invitrodb_v4.3 databases; emerging contaminants that are not recorded in them may return empty results.
2. The SQL databases and the FDA plugin must be deployed locally, which requires some experience with environment configuration.
3. Only **IUPAC names** are currently supported as input; other naming formats (e.g., common names) are not.

## Contact

laquh1086@163.com