Model Overview

This model is a Contaminants of Emerging Concern (CECs) Annotation Intelligent Agent built on the Dify platform and integrated with the Norman knowledge base, the Pubchemlite_exposomics database, and the Invitrodb_v4.3 database. Given the IUPAC name of a target contaminant, it performs high-throughput, large-scale annotation of emerging contaminants, including usage classification and toxicity endpoints.

Model Purpose

The system builds a specialized knowledge database for usage classification of emerging contaminants by combining multi-source chemical and toxicological databases with AI agents. The core goals are:

  1. Annotate the usage categories of emerging contaminants quickly and at scale.
  2. Provide efficient query services for toxicity endpoints.
  3. Support high-throughput data analysis scenarios for emerging contaminants in environmental chemistry and toxicology research.

Key Definitions

| Term | Definition |
| --- | --- |
| System | Refers to the Emerging Contaminants Annotation Intelligent Agent Based on Dify Platform |
| User | Anyone authorized to use the functions of this system |
| IUPAC Name | A systematic naming convention formulated by IUPAC for accurately describing the composition and structure of chemical substances |
| AI Agent | A system based on large language models (LLMs) that understands user intentions and invokes multiple tools to solve complex tasks; in this system, it accepts IUPAC names of emerging contaminants and outputs usage classification, toxicity endpoints, and AC50 values |
| Norman Database | A network for monitoring and evaluating environmental pollutants, facilitating European and international cooperation and data sharing in environmental pollution monitoring; classifies over 100,000 chemicals |
| Pubchemlite_exposomics Database | An open-source organic molecule information database derived from PubChem, applicable for mass spectrometry analysis and non-targeted identification of unknown pollutants |
| Invitrodb_v4.3 Database | The core database of US EPA ToxCast, storing a large amount of biological activity data, analysis workflows, and metadata of compounds generated by high-throughput screening (HTS) |

System Architecture & Components

Core Databases Deployment

The system integrates three core databases with differentiated deployment strategies:

  1. Norman Chemical Classification Database
    • Serves as a relational knowledge base, uploaded and parsed on the FastGPT platform, then embedded into the Dify platform.
    • Optimized classification: redundant categories were merged or removed, leaving 9 chemical classes.
  2. Pubchemlite_exposomics & Invitrodb_v4.3 Databases
    • Deployed in local SQL databases to support efficient local query and invocation.
    • Query workflow: GPT-4o generates SQL statements → Extract valid SQL queries → Backend executes database queries and returns results.
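The "extract valid SQL, then execute" part of this workflow can be sketched as follows. This is a minimal illustration, not the system's actual backend: it assumes the model wraps its SQL in a Markdown code fence, uses an in-memory SQLite database as a stand-in for the local SQL deployment, and the `toxicity` table, its columns, and the inserted row are hypothetical placeholders.

```python
import re
import sqlite3
from typing import List, Optional, Tuple

FENCE = "`" * 3  # Markdown code-fence marker that chat models often wrap SQL in
FENCED_SQL = re.compile(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_sql(llm_text: str) -> Optional[str]:
    """Return a single read-only SELECT statement from model output, or None."""
    match = FENCED_SQL.search(llm_text)
    candidate = match.group(1) if match else llm_text
    candidate = candidate.strip().rstrip(";").strip()
    # Conservative check: any remaining ';' is treated as a second statement.
    if ";" in candidate:
        return None
    # Only SELECT is accepted, so generated text cannot modify the database.
    if not re.match(r"select\b", candidate, re.IGNORECASE):
        return None
    return candidate


def run_query(conn: sqlite3.Connection, sql: str) -> List[Tuple]:
    """Execute an already validated query and return all rows."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    # Hypothetical schema and dummy row, used only to make the sketch runnable.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE toxicity (iupac_name TEXT, endpoint TEXT, ac50_um REAL)")
    conn.execute("INSERT INTO toxicity VALUES ('phenol', 'example_endpoint', 12.5)")
    reply = (
        "Here is the query:\n" + FENCE + "sql\n"
        "SELECT endpoint, ac50_um FROM toxicity WHERE iupac_name = 'phenol';\n" + FENCE
    )
    sql = extract_sql(reply)
    if sql is not None:
        print(run_query(conn, sql))
```

Rejecting anything other than a single SELECT is a design choice of this sketch. It is deliberately strict, so a query containing a semicolon inside a string literal would also be rejected.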

Agent Workflow Design (Dify Chatflow)

  1. Base Model: GPT-4o is used to generate SQL query statements and organize output data in JSON format for subsequent data extraction.
  2. Custom Schema Tool
    • Created on the Dify platform to standardize SQL statement generation and API invocation logic.
    • Implementation steps: Create custom tool → Configure tool name and Schema rules (see schema_tool.txt for details).
  3. Knowledge Base Integration (FastGPT + Dify)
    • FastGPT Knowledge Base Construction
      1. Log in to FastGPT (https://fastgpt.aiown.top/) and enter the main interface.
      2. Import dataset (50,000+ chemicals with IUPAC names and categories from the Norman database).
      3. Connect the dataset to a FastGPT application and configure prompts (consistent with Few-shot prompts).
      4. Publish the application and export the API key for subsequent calls.
    • Fast-Dify Adaptor (FDA) Plugin
      • Resolves API incompatibility between FastGPT and Dify.
      • Deployment steps: Create docker-compose.yml for FDA → Run docker-compose up -d in the configuration file directory to deploy the plugin.
    • Dify External Knowledge Base Connection: Link the trained FastGPT knowledge base to Dify by importing the FastGPT API key and knowledge base ID.

CECs BatchAnnotator (Desktop Version)

A desktop tool for chemists and researchers that automates batch retrieval of compound information from CSV files (default column: IUPAC_name). It outputs standardized CSVs, failure logs, and raw API records for analysis and debugging. It requires a valid Dify API key and an accessible backend.
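The batch behavior described here (read names from a CSV column, annotate each one, and keep failures separate from results) might look like the sketch below. It is not the BatchAnnotator's real code: the `annotate` callable stands in for whatever function calls the Dify backend, and the output column names `annotation` and `error` are assumptions.

```python
import csv
from typing import Callable, Dict, Iterable, List, Tuple


def annotate_rows(
    rows: Iterable[Dict[str, str]],
    annotate: Callable[[str], str],
    column: str = "IUPAC_name",
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Annotate each row's name; return (results, failures) as lists of pairs."""
    results, failures = [], []
    for row in rows:
        name = (row.get(column) or "").strip()
        if not name:
            failures.append(("", "missing " + column))
            continue
        try:
            results.append((name, annotate(name)))
        except Exception as exc:  # one failed compound must not stop the batch
            failures.append((name, str(exc)))
    return results, failures


def annotate_csv(
    in_path: str,
    out_path: str,
    fail_path: str,
    annotate: Callable[[str], str],
    column: str = "IUPAC_name",
) -> Tuple[int, int]:
    """Read in_path, write a results CSV and a failure log, return their sizes."""
    with open(in_path, newline="", encoding="utf-8") as src:
        results, failures = annotate_rows(csv.DictReader(src), annotate, column)
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow([column, "annotation"])
        writer.writerows(results)
    with open(fail_path, "w", newline="", encoding="utf-8") as fail:
        writer = csv.writer(fail)
        writer.writerow([column, "error"])
        writer.writerows(failures)
    return len(results), len(failures)
```

Keeping the per-row logic in `annotate_rows` separate from file handling makes it possible to exercise the failure path with plain dictionaries, without a backend or files on disk.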


System Requirements

| Category | Specification |
| --- | --- |
| Operating System | Windows 10 |
| Python Version | Python 3.8 or higher |
| Dependencies | Docker, FastGPT access rights, Dify platform account |

Usage Instructions

Preparations

  1. Complete Norman database classification optimization (merge into 9 categories).
  2. Deploy FDA plugin to connect FastGPT and Dify.
  3. Import the Norman dataset (knowledge_database_input_iupac.csv) into FastGPT, configure the application, and export the API key.
  4. Create a custom Schema tool on Dify and configure SQL invocation rules.
  5. Deploy Pubchemlite_exposomics (PubChemLite_exposomics_20251226.csv) and Invitrodb_v4.3 databases to local SQL and test query connectivity.
  6. Create a backend program (step1_pubchemlite_invitrodb_to_dify_en.py) to connect Dify and the SQL databases.
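Step 5 (deploy a CSV to local SQL and test query connectivity) can be sketched as below. The document does not name the SQL engine, so this example assumes SQLite purely as a stand-in. The table name and the sample header in the usage note are hypothetical, and every column is stored as TEXT for simplicity.

```python
import csv
import io
import sqlite3
from typing import TextIO


def quote_ident(name: str) -> str:
    """Quote a table or column name so arbitrary CSV headers are safe in SQL."""
    return '"' + name.replace('"', '""') + '"'


def load_csv(conn: sqlite3.Connection, table: str, handle: TextIO) -> int:
    """Create `table` from the CSV header and insert all well-formed rows."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = ", ".join(quote_ident(h) + " TEXT" for h in header)
    conn.execute("CREATE TABLE IF NOT EXISTS %s (%s)" % (quote_ident(table), columns))
    placeholders = ", ".join("?" for _ in header)
    # Skip rows whose field count does not match the header.
    rows = [r for r in reader if len(r) == len(header)]
    conn.executemany(
        "INSERT INTO %s VALUES (%s)" % (quote_ident(table), placeholders), rows
    )
    conn.commit()
    return len(rows)


def check_connectivity(conn: sqlite3.Connection, table: str) -> int:
    """Return the row count of `table`; raises if the table cannot be queried."""
    return conn.execute("SELECT COUNT(*) FROM %s" % quote_ident(table)).fetchone()[0]
```

For a real file, the handle would come from something like `open("PubChemLite_exposomics_20251226.csv", newline="", encoding="utf-8")`. For a quick check, an `io.StringIO` holding a two-line sample works the same way.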

Inference Workflow

  1. Input the IUPAC name of the emerging contaminant into the Dify chat interface.
  2. The AI agent invokes:
    • FastGPT knowledge base for usage classification via FDA plugin.
    • Local SQL databases for toxicity endpoints via GPT-4o-generated SQL queries.
  3. Receive the structured output (JSON format) containing usage category and toxicity endpoints.
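For programmatic use, the same workflow could be driven through the Dify API instead of the chat interface. The sketch below only builds the request and parses a reply. It assumes Dify's `chat-messages` endpoint in blocking mode, a hypothetical base URL and API key, and hypothetical JSON field names in the agent's answer, none of which are specified by this document.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_request(
    base_url: str, api_key: str, iupac_name: str, user: str = "batch-annotator"
) -> urllib.request.Request:
    """Build (but do not send) a blocking chat request carrying the IUPAC name."""
    payload = {
        "inputs": {},
        "query": iupac_name,
        "response_mode": "blocking",
        "conversation_id": "",
        "user": user,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat-messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_answer(answer: str) -> Optional[Dict[str, Any]]:
    """Extract the JSON object embedded in the agent's answer text, if any."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(answer[start:end + 1])
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return parsed if isinstance(parsed, dict) else None
```

Sending the request would be a call to `urllib.request.urlopen(req)`, after which the agent's text is expected in the `answer` field of the JSON response. Both points should be checked against the deployed Dify version.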

Limitations

  1. The accuracy of annotations depends on the completeness of the Norman, Pubchemlite_exposomics, and Invitrodb_v4.3 databases; unrecorded emerging contaminants may return empty results.
  2. Requires local deployment of the SQL databases and the FDA plugin, which raises the barrier to setting up the environment.
  3. Currently only supports input of IUPAC names; other naming formats (e.g., common names) are not supported.

Contact

laquh1086@163.com
