Model Overview

This model is a Contaminants of Emerging Concern (CECs) Annotation Intelligent Agent built on the Dify platform and integrated with the Norman knowledge base, the Pubchemlite_exposomics database, and the Invitrodb_v4.3 database. Given the IUPAC name of a target contaminant, it performs high-throughput, large-scale annotation of emerging contaminants, including usage classification and toxicity endpoints.

Model Purpose

The model aims to build a specialized knowledge database for classifying the usage of emerging contaminants by combining multi-source chemical and toxicological databases with AI agents. The core goals are:

- Enable fast, large-scale annotation of emerging contaminants' usage categories.
- Provide efficient query services for toxicity endpoints.
- Support high-throughput data analysis of emerging contaminants in environmental chemistry and toxicology research.

Key Definitions

Term	Definition
System	Refers to the Contaminants of Emerging Concern Annotation Intelligent Agent
User	Anyone authorized to use the functions of this system
IUPAC Name	A systematic naming convention formulated by IUPAC for accurately describing the composition and structure of chemical substances
AI Agent	A system based on large language models (LLMs) that understands user intentions and invokes multiple tools to solve complex tasks; in this system, it accepts IUPAC names of emerging contaminants and outputs usage classification and toxicity endpoints
Norman Database	A network for monitoring and evaluating environmental pollutants, facilitating European and international cooperation and data sharing in environmental pollution monitoring; classifies over 100,000 chemicals
Pubchemlite_exposomics Database	An open-source organic molecule information database derived from PubChem, applicable for mass spectrometry analysis and non-targeted identification of unknown pollutants
Invitrodb_v4.3 Database	The core database of US EPA ToxCast, storing a large amount of biological activity data, analysis workflows, and metadata of compounds generated by high-throughput screening (HTS)

Quick Start

1. Install Navicat on your computer and import the invitrodb_v4_3.sql and pubchemlite_exposomics_20251226.sql databases.
2. Deploy Docker and Dify. Configure the Docker engine according to dockerengine_setting.txt.
3. Create a custom tool named sql_executor_pubchemlite_invitrodb_en on the Dify platform. The schema for this tool is stored in schema_tool.txt. Change the URL in the schema to your local address.
4. Train the knowledge base on the FastGPT platform. The knowledge base file is knowledge_database_input_iupac.csv.
5. Deploy the FDA docker-compose.yml and import the external knowledge base into Dify.
6. On the Dify platform, select Import DSL File and upload dify_CECs_annotating.yml.
7. Before using the agent, start Docker and run step1_pubchemlite_invitrodb_to_dify_en.py. In that script, set DB_CONFIGS to the username and password of your local SQL database.
8. Once the setup is complete, the agent is ready to use.
9. For batch queries, run the mini-program step2_CECs_annotating_agent_v1.0.py.

System Architecture & Components

Core Databases Deployment

The system integrates three core databases with differentiated deployment strategies:

Norman Chemical Classification Database
- Serves as a relational knowledge base, uploaded and parsed on the FastGPT platform, then embedded into the Dify platform.
- Optimized classification: redundant categories were merged or removed, leaving 9 final chemical classes.
Pubchemlite_exposomics & Invitrodb_v4.3 Databases
- Deployed in local SQL databases to support efficient local query and invocation.
- Query workflow: GPT-4o generates SQL statements → Extract valid SQL queries → Backend executes database queries and returns results.
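
The query workflow above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual backend: the helper names `extract_sql` and `run_readonly_query` are hypothetical, "valid" is assumed to mean a single SELECT statement, and an in-memory SQLite table with made-up contents stands in for the local SQL databases.

```python
import re
import sqlite3

TICKS = "`" * 3  # Markdown code-fence delimiter that often wraps SQL in a model reply
FENCE = re.compile(TICKS + r"(?:sql)?\s*(.*?)" + TICKS, re.DOTALL | re.IGNORECASE)


def extract_sql(reply):
    """Return the single SELECT statement in a model reply, or raise ValueError."""
    match = FENCE.search(reply)
    # Fall back to the whole reply when the model did not use a code fence.
    candidate = (match.group(1) if match else reply).strip().rstrip(";").strip()
    if not re.match(r"(?i)^select\b", candidate):
        raise ValueError("only SELECT statements are accepted")
    if ";" in candidate:
        # Conservative: this also rejects a semicolon inside a string literal.
        raise ValueError("multiple statements are not accepted")
    return candidate


def run_readonly_query(conn, sql):
    """Execute the query and return the rows as a list of dicts keyed by column name."""
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in database; table, column, and values are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_assay (iupac_name TEXT, endpoint TEXT)")
conn.execute("INSERT INTO demo_assay VALUES ('propan-2-one', 'demo_endpoint')")

reply = (
    "Here is the query:\n" + TICKS + "sql\n"
    "SELECT endpoint FROM demo_assay WHERE iupac_name = 'propan-2-one';\n" + TICKS
)
rows = run_readonly_query(conn, extract_sql(reply))
print(rows)
```

Rejecting anything other than one SELECT keeps model-generated text from modifying the databases; a production backend would likely also use a read-only database account.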

Agent Workflow Design (Dify Chatflow)

Base Model: GPT-4o is used to generate SQL query statements and organize output data in JSON format for subsequent data extraction.
Custom Schema Tool
- Created on the Dify platform to standardize SQL statement generation and API invocation logic.
- Implementation steps: Create custom tool → Configure tool name and Schema rules (see schema_tool.txt for details).
Knowledge Base Integration (FastGPT + Dify)
- FastGPT Knowledge Base Construction
  1. Log in to FastGPT (https://fastgpt.aiown.top/) and enter the main interface.
  2. Import dataset (50,000+ chemicals with IUPAC names and categories from the Norman database).
  3. Connect the dataset to a FastGPT application and configure prompts (consistent with Few-shot prompts).
  4. Publish the application and export the API key for subsequent calls.
- FastGPT-Dify Adaptor (FDA) Plugin
  - Resolves API incompatibility between FastGPT and Dify.
  - Deployment steps: Create docker-compose.yml for FDA → Run docker-compose up -d in the configuration file directory to deploy the plugin.
- Dify External Knowledge Base Connection: Link the trained FastGPT knowledge base to Dify by importing the FastGPT API key and knowledge base ID.

CECs BatchAnnotator (Desktop Version)

A desktop tool that lets chemists and researchers automate batch retrieval of compound information from CSV files (default column: IUPAC_name). It outputs standardized CSVs, failure logs, and raw API records for analysis and debugging. It requires a valid Dify API key and an accessible backend.
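
A batch loop of this kind might look like the sketch below. It is not the code of step2_CECs_annotating_agent_v1.0.py: the function names `post_to_dify` and `annotate_batch`, the `user` value, and the result layout are assumptions made for illustration. The request follows Dify's chat-messages endpoint in blocking mode, and the demo uses a stub sender so it runs without a live backend.

```python
import csv
import io
import json
import urllib.request


def post_to_dify(base_url, api_key, iupac_name, timeout=120):
    """Send one blocking chat message to a Dify app and return the parsed JSON reply."""
    payload = {
        "inputs": {},
        "query": iupac_name,
        "response_mode": "blocking",
        "user": "batch-annotator",  # assumed identifier for the calling user
    }
    request = urllib.request.Request(
        base_url.rstrip("/") + "/chat-messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def annotate_batch(csv_file, send, column="IUPAC_name"):
    """Call send(name) for every non-empty row; return (results, failures)."""
    results, failures = [], []
    for row in csv.DictReader(csv_file):
        name = (row.get(column) or "").strip()
        if not name:
            continue
        try:
            reply = send(name)
            results.append({"IUPAC_name": name, "answer": reply.get("answer", "")})
        except Exception as exc:  # record the failure and continue with the next row
            failures.append({"IUPAC_name": name, "error": str(exc)})
    return results, failures


def fake_send(name):
    """Stub sender used only for this demo; simulates one failing compound."""
    if name == "bad-name":
        raise RuntimeError("simulated API error")
    return {"answer": "stub reply for " + name}


sample = io.StringIO("IUPAC_name\npropan-2-one\nbad-name\n")
results, failures = annotate_batch(sample, fake_send)
print(results)
print(failures)
```

For a real run, the stub would be replaced by something like `lambda name: post_to_dify(BASE_URL, API_KEY, name)`, where `BASE_URL` and `API_KEY` are placeholders for your own Dify API address and key.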

System Requirements

Category	Specification
Operating System	Windows 10
Python Version	Python 3.8 or higher
Dependencies	Docker, FastGPT access rights, Dify platform account

Usage Instructions

Preparations

Complete Norman database classification optimization (merge into 9 categories).
Deploy FDA plugin to connect FastGPT and Dify.
Import the Norman dataset (knowledge_database_input_iupac.csv) into FastGPT, configure the application, and export the API key.
Create a custom Schema tool on Dify and configure SQL invocation rules.
Deploy Pubchemlite_exposomics (PubChemLite_exposomics_20251226.csv) and Invitrodb_v4.3 databases to local SQL and test query connectivity.
Create a backend program (step1_pubchemlite_invitrodb_to_dify_en.py) to connect Dify and the SQL databases.

Inference Workflow

Input the IUPAC name of the emerging contaminant into the Dify chat interface.
The AI agent invokes:
- FastGPT knowledge base for usage classification via FDA plugin.
- Local SQL databases for toxicity endpoints via GPT-4o-generated SQL queries.
Receive the structured output (JSON format) containing usage category and toxicity endpoints.
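
Downstream scripts usually need to pull the JSON object out of the agent's reply text. The sketch below shows one way to do that. The field names `usage_category` and `toxicity_endpoints` and the sample values are assumptions for illustration only; the actual keys are defined by the prompts in dify_CECs_annotating.yml.

```python
import json


def parse_agent_answer(answer):
    """Return the first JSON object embedded in the agent's answer text."""
    start = answer.find("{")
    if start == -1:
        raise ValueError("no JSON object found in the answer")
    # raw_decode parses one JSON value and ignores any trailing text.
    record, _ = json.JSONDecoder().raw_decode(answer[start:])
    return record


# Invented reply; real category names and endpoints come from the databases.
answer = (
    "Result:\n"
    '{"usage_category": "demo_category", '
    '"toxicity_endpoints": ["demo_endpoint_a", "demo_endpoint_b"]}\n'
    "Done."
)
record = parse_agent_answer(answer)

# Flatten the record into one CSV-friendly row.
row = {
    "usage_category": record.get("usage_category", ""),
    "toxicity_endpoints": "; ".join(record.get("toxicity_endpoints", [])),
}
print(row)
```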

Limitations

The accuracy of annotations depends on the completeness of the Norman, Pubchemlite_exposomics, and Invitrodb_v4.3 databases; unrecorded emerging contaminants may return empty results.
Requires local deployment of the SQL databases and the FDA plugin, so setup demands some experience with environment configuration.
Currently only supports input of IUPAC names; other naming formats (e.g., common names) are not supported.

Contact

laquh1086@163.com
