---
license: cc-by-nc-4.0
tags:
- agent
- chemistry
- environment
---
## Model Overview
This model is a **Contaminants of Emerging Concern Annotation Intelligent Agent** built on the Dify platform and integrated with the Norman knowledge base, the Pubchemlite_exposomics database, and the Invitrodb_v4.3 database. Given the IUPAC name of a target contaminant, it performs high-throughput, large-scale annotation of emerging contaminants, covering **usage classification** and **toxicity endpoints**.

## Model Purpose
The model aims to build a specialized knowledge database for usage classification of emerging contaminants by combining multi-source chemical and toxicological databases with AI agents. The core goals are:
1. Enable fast, large-scale annotation of usage categories for emerging contaminants.
2. Provide efficient query services for toxicity endpoints.
3. Support high-throughput data analysis scenarios for emerging contaminants in environmental chemistry and toxicology research.

## Key Definitions
| Term | Definition |
|------|------------|
| **System** | Refers to the *Emerging Contaminants Annotation Intelligent Agent Based on Dify Platform* |
| **User** | Anyone authorized to use the functions of this system |
| **IUPAC Name** | A systematic naming convention formulated by IUPAC for accurately describing the composition and structure of chemical substances |
| **AI Agent** | A system based on large language models (LLMs) that understands user intentions and invokes multiple tools to solve complex tasks; in this system, it accepts IUPAC names of emerging contaminants and outputs usage classification, toxicity endpoints, and AC50 values |
| **Norman Database** | A network for monitoring and evaluating environmental pollutants, facilitating European and international cooperation and data sharing in environmental pollution monitoring; classifies over 100,000 chemicals |
| **Pubchemlite_exposomics Database** | An open-source organic molecule information database derived from PubChem, applicable for mass spectrometry analysis and non-targeted identification of unknown pollutants |
| **Invitrodb_v4.3 Database** | The core database of US EPA ToxCast, storing a large amount of biological activity data, analysis workflows, and metadata of compounds generated by high-throughput screening (HTS) |

## System Architecture & Components
### Core Databases Deployment
The system integrates three core databases with differentiated deployment strategies:
1. **Norman Chemical Classification Database**
   - Serves as a relational knowledge base, uploaded and parsed on the FastGPT platform, then embedded into the Dify platform.
   - Classification optimization: redundant categories were merged or removed, leaving chemicals grouped into **9 classes**.
    <img src="figure/Norman_category.png" alt="Norman_category" width="400">
2. **Pubchemlite_exposomics & Invitrodb_v4.3 Databases**
   - Deployed in local SQL databases to support efficient local query and invocation.
   - Query workflow: GPT-4o generates SQL statements → Extract valid SQL queries → Backend executes database queries and returns results.
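
The query workflow above (generate SQL, extract a valid query, execute it) can be sketched as follows. This is a minimal illustration, not the project's backend: it assumes SQLite as the local SQL engine, and the function names `extract_select` and `run_query` are hypothetical.

```python
import re
import sqlite3

FENCE = "`" * 3  # Markdown code fence, built programmatically
WRITE_KEYWORDS = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE)\b", re.IGNORECASE)


def extract_select(llm_text):
    """Return the first SELECT statement found in the LLM output, or None."""
    # Prefer the contents of a fenced block; fall back to the raw text.
    fenced = re.search(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, llm_text,
                       re.DOTALL | re.IGNORECASE)
    candidate = (fenced.group(1) if fenced else llm_text).strip()
    # Take everything from SELECT up to the first semicolon or end of text.
    match = re.search(r"\bSELECT\b.*?(?:;|$)", candidate, re.DOTALL | re.IGNORECASE)
    if not match:
        return None
    sql = match.group(0).rstrip(";").strip()
    # Conservative filter: reject statements that mention a write keyword.
    if WRITE_KEYWORDS.search(sql):
        return None
    return sql


def run_query(conn, sql):
    """Execute a read-only query and return the rows as a list of dicts."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]
```

Filtering on write keywords is deliberately strict: it may reject a harmless query whose literal values contain words such as `update`, which is usually an acceptable trade-off when the SQL text comes from a language model.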

### Agent Workflow Design (Dify Chatflow)
1. **Base Model**: GPT-4o is used to generate SQL query statements and organize output data in JSON format for subsequent data extraction.
2. **Custom Schema Tool**
   - Created on the Dify platform to standardize SQL statement generation and API invocation logic.
   - Implementation steps: Create custom tool → Configure tool name and Schema rules (see schema_tool.txt for details).
3. **Knowledge Base Integration (FastGPT + Dify)**
   - **FastGPT Knowledge Base Construction**
     1. Log in to FastGPT (https://fastgpt.aiown.top/) and enter the main interface.
     2. Import dataset (50,000+ chemicals with IUPAC names and categories from the Norman database).
     3. Connect the dataset to a FastGPT application and configure prompts (consistent with Few-shot prompts).
     4. Publish the application and export the API key for subsequent calls.
   - **Fast-Dify Adaptor (FDA) Plugin**
     - Resolves API incompatibility between FastGPT and Dify.
     - Deployment steps: Create `docker-compose.yml` for FDA → Run `docker-compose up -d` in the configuration file directory to deploy the plugin.
   - **Dify External Knowledge Base Connection**: Link the trained FastGPT knowledge base to Dify by importing the FastGPT API key and knowledge base ID.
<div align="center">
<img src="figure/pipeline.jpg" alt="pipeline" width="800">
</div>


## CECs BatchAnnotator (Desktop Version)
CECs BatchAnnotator lets chemists and researchers automate batch retrieval of compound information from CSV files (default column: `IUPAC_name`). It outputs standardized CSVs, failure logs, and raw API records for analysis and debugging. It requires a valid Dify API key and an accessible backend.
<div align="center">
<img src="figure/CECs_BatchAnnotator_v1.0.png" alt="CECs_BatchAnnotator_v1.0" width="400">
</div>
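
The batch loop behind such a tool might look like the sketch below. It is not the desktop application's code: `ask_dify` and `annotate_rows` are hypothetical names, and the request assumes the standard Dify chat application endpoint (`POST <base_url>/chat-messages` in blocking mode), which should be checked against the deployed Dify version.

```python
import json
import urllib.request


def ask_dify(iupac_name, api_key, base_url):
    """Send one IUPAC name to a Dify chat app and return the answer text."""
    payload = {
        "inputs": {},
        "query": iupac_name,
        "response_mode": "blocking",
        "user": "batch-annotator",  # arbitrary end-user identifier
    }
    request = urllib.request.Request(
        base_url.rstrip("/") + "/chat-messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]


def annotate_rows(rows, ask, column="IUPAC_name"):
    """Call ask(name) for each row; collect answers and failures separately."""
    results, failures = [], []
    for row in rows:
        name = (row.get(column) or "").strip()
        if not name:
            failures.append({"name": name, "error": "empty name"})
            continue
        try:
            results.append({"name": name, "answer": ask(name)})
        except Exception as exc:  # keep the batch running after one failure
            failures.append({"name": name, "error": str(exc)})
    return results, failures
```

Rows can come from `csv.DictReader`, and `ask` can be `lambda name: ask_dify(name, api_key, base_url)`. Passing `ask` in as a parameter keeps the loop testable without a live backend.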

## System Requirements
| Category | Specification |
|----------|---------------|
| Operating System | Windows 10 |
| Python Version | Python 3.8 or higher |
| Dependencies | Docker, FastGPT access rights, Dify platform account |

## Usage Instructions
### Preparations
1. Complete Norman database classification optimization (merge into 9 categories).
2. Deploy FDA plugin to connect FastGPT and Dify.
3. Import the Norman dataset (knowledge_database_input_iupac.csv) into FastGPT, configure the application, and export the API key.
4. Create a custom Schema tool on Dify and configure SQL invocation rules.
5. Deploy Pubchemlite_exposomics (PubChemLite_exposomics_20251226.csv) and Invitrodb_v4.3 databases to local SQL and test query connectivity.
6. Create a backend program (step1_pubchemlite_invitrodb_to_dify_en.py) to connect Dify and the SQL databases.
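
As an illustration of step 5, the sketch below loads a CSV file into a local SQL table. It assumes SQLite, which this document does not specify, stores every column as text, and uses a hypothetical helper name `load_csv`; the real deployment may use a different engine and typed columns.

```python
import csv
import sqlite3


def load_csv(conn, csv_file, table):
    """Create `table` from the CSV header and insert all rows; return the row count."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers so headers with spaces or symbols remain valid column names.
    columns = ", ".join('"{}" TEXT'.format(c.replace('"', '""')) for c in header)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
    )
    conn.commit()
    return conn.execute('SELECT COUNT(*) FROM "{}"'.format(table)).fetchone()[0]


# Example call (file name taken from step 5; database path is an assumption):
# conn = sqlite3.connect("cecs_local.db")
# with open("PubChemLite_exposomics_20251226.csv", newline="", encoding="utf-8") as f:
#     print(load_csv(conn, f, "pubchemlite"))
```

Checking the returned row count against the number of data lines in the CSV is a quick way to test query connectivity after loading.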

### Inference Workflow
1. Input the **IUPAC name** of the emerging contaminant into the Dify chat interface.
2. The AI agent invokes:
   - FastGPT knowledge base for **usage classification** via FDA plugin.
   - Local SQL databases for **toxicity endpoints** via GPT-4o-generated SQL queries.
3. Receive the structured output (JSON format) containing usage category and toxicity endpoints.
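
A client consuming the structured output in step 3 could extract the JSON object from the answer text as sketched below. The helper name `parse_agent_answer` is hypothetical, and the actual key names in the agent's JSON are not documented here, so the code makes no assumption about them.

```python
import json


def parse_agent_answer(answer_text):
    """Return the JSON object embedded in the answer text, or None if absent or invalid."""
    # The model may wrap the JSON in extra prose, so slice from the first
    # opening brace to the last closing brace before decoding.
    start = answer_text.find("{")
    end = answer_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(answer_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` instead of raising lets a batch job log the raw answer as a failure and continue with the next compound.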

## Limitations
1. The accuracy of annotations depends on the completeness of the Norman, Pubchemlite_exposomics, and Invitrodb_v4.3 databases; unrecorded emerging contaminants may return empty results.
2. The system requires local deployment of the SQL databases and the FDA plugin, which makes environment configuration non-trivial.
3. The system currently accepts only **IUPAC names** as input; other naming formats (e.g., common names) are not supported.

## Contact
laquh1086@163.com