	---
	license: cc-by-nc-4.0
	tags:
	- agent
	- chemistry
	- environment
	---
	## Model Overview
	This model is a Contaminants of Emerging Concern (CECs) annotation agent built on the Dify platform and integrated with the Norman knowledge base, the Pubchemlite_exposomics database, and the Invitrodb_v4.3 database. Given the IUPAC name of a target contaminant, it returns the compound's usage classification and toxicity endpoints, which enables high-throughput, large-scale annotation of emerging contaminants.

	## Model Purpose
	The system builds a specialized knowledge base for classifying the usage of emerging contaminants by combining multi-source chemical and toxicological databases with AI agents. The core goals are:
	1. Enable fast, large-scale annotation of the usage categories of emerging contaminants.
	2. Provide efficient query services for toxicity endpoints.
	3. Support high-throughput data analysis scenarios for emerging contaminants in environmental chemistry and toxicology research.

	## Key Definitions
	| Term | Definition |
	|------|------------|
	| System | Refers to the Emerging Contaminants Annotation Intelligent Agent Based on Dify Platform |
	| User | Anyone authorized to use the functions of this system |
	| IUPAC Name | A systematic naming convention formulated by IUPAC for accurately describing the composition and structure of chemical substances |
	| AI Agent | A system based on large language models (LLMs) that understands user intentions and invokes multiple tools to solve complex tasks; in this system, it accepts IUPAC names of emerging contaminants and outputs usage classification, toxicity endpoints, and AC50 values |
	| Norman Database | A network for monitoring and evaluating environmental pollutants, facilitating European and international cooperation and data sharing in environmental pollution monitoring; classifies over 100,000 chemicals |
	| Pubchemlite_exposomics Database | An open-source organic molecule information database derived from PubChem, applicable for mass spectrometry analysis and non-targeted identification of unknown pollutants |
	| Invitrodb_v4.3 Database | The core database of US EPA ToxCast, storing a large amount of biological activity data, analysis workflows, and metadata of compounds generated by high-throughput screening (HTS) |

	## System Architecture & Components
	### Core Databases Deployment
	The system integrates three core databases with differentiated deployment strategies:
	1. Norman Chemical Classification Database
	   - Serves as a relational knowledge base: uploaded and parsed on the FastGPT platform, then embedded into the Dify platform.
	   - Optimized classification: redundant categories were merged or removed, leaving 9 final classes.

	   <img src="figure/Norman_category.png" alt="Norman_category" width="400">
	2. Pubchemlite_exposomics & Invitrodb_v4.3 Databases
	   - Deployed in local SQL databases to support efficient local query and invocation.
	   - Query workflow: GPT-4o generates SQL statements → Extract valid SQL queries → Backend executes database queries and returns results.
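
The query workflow above can be sketched as follows. This is a minimal illustration, not the project's backend: it assumes the model wraps its SQL in a fenced block, uses SQLite as a stand-in for the local SQL database (the engine is not specified here), and the function names `extract_select` and `run_query`, the `toxicity` table, and its columns are all hypothetical.

```python
import re
import sqlite3

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker
# Captures the body of the first fenced block, with an optional "sql" tag.
_FENCED_SQL = re.compile(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_select(llm_reply: str) -> str:
    """Return a single read-only SELECT statement from the reply, or raise ValueError."""
    match = _FENCED_SQL.search(llm_reply)
    candidate = (match.group(1) if match else llm_reply).strip().rstrip(";").strip()
    if not candidate.lower().startswith("select"):
        raise ValueError("reply does not contain a SELECT statement")
    if ";" in candidate:
        # Conservative check: also rejects a semicolon inside a string literal.
        raise ValueError("multiple statements are not allowed")
    return candidate


def run_query(conn: sqlite3.Connection, sql: str) -> list:
    """Execute the validated query and return rows as dictionaries."""
    cursor = conn.execute(sql)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


if __name__ == "__main__":
    # Made-up table and values, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE toxicity (iupac_name TEXT, endpoint TEXT, ac50_um REAL)")
    conn.execute("INSERT INTO toxicity VALUES ('phenol', 'demo_endpoint', 12.5)")
    reply = (
        "Here is the query:\n" + FENCE + "sql\n"
        "SELECT endpoint, ac50_um FROM toxicity WHERE iupac_name = 'phenol';\n" + FENCE
    )
    print(run_query(conn, extract_select(reply)))
```

Restricting execution to a single `SELECT` is one simple way to keep model-generated SQL read-only; a production backend would likely also use a database account without write permissions.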

	### Agent Workflow Design (Dify Chatflow)
	1. Base Model: GPT-4o is used to generate SQL query statements and organize output data in JSON format for subsequent data extraction.
	2. Custom Schema Tool
	   - Created on the Dify platform to standardize SQL statement generation and API invocation logic.
	   - Implementation steps: Create custom tool → Configure tool name and Schema rules (see schema_tool.txt for details).
	3. Knowledge Base Integration (FastGPT + Dify)
	   - FastGPT Knowledge Base Construction
	     1. Log in to FastGPT (https://fastgpt.aiown.top/) and enter the main interface.
	     2. Import dataset (50,000+ chemicals with IUPAC names and categories from the Norman database).
	     3. Connect the dataset to a FastGPT application and configure prompts (consistent with Few-shot prompts).
	     4. Publish the application and export the API key for subsequent calls.
	   - Fast-Dify Adaptor (FDA) Plugin
	     - Resolves API incompatibility between FastGPT and Dify.
	     - Deployment steps: Create `docker-compose.yml` for FDA → Run `docker-compose up -d` in the configuration file directory to deploy the plugin.
	   - Dify External Knowledge Base Connection: Link the trained FastGPT knowledge base to Dify by importing the FastGPT API key and knowledge base ID.
	<div align="center">
	<img src="figure/pipeline.jpg" alt="pipeline" width="800">
	</div>
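
To illustrate the kind of translation an adaptor between FastGPT and Dify has to perform, the sketch below reshapes a list of search hits into the `records` payload that Dify expects from an external knowledge base (`content`, `score`, `title`, `metadata`), honoring the `top_k` and `score_threshold` values Dify sends in `retrieval_setting`. It is not the FDA plugin's code: the function name `to_dify_records` is hypothetical, and the hit keys `q`, `a`, `score`, and `sourceName` are assumptions about what the FastGPT side returns.

```python
from typing import Any, Dict, List


def to_dify_records(request: Dict[str, Any], hits: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Filter, rank, and reshape search hits into a Dify-style `records` payload."""
    setting = request.get("retrieval_setting") or {}
    top_k = int(setting.get("top_k", 3))
    threshold = float(setting.get("score_threshold", 0.0))

    # Pair each hit with its numeric score; hits without a score count as 0.0.
    scored = [(float(hit.get("score", 0.0)), hit) for hit in hits]
    kept = sorted(
        (pair for pair in scored if pair[0] >= threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )

    records = []
    for score, hit in kept[:top_k]:
        # Join the question and answer parts of the hit, skipping empty parts.
        text = "\n".join(part for part in (hit.get("q", ""), hit.get("a", "")) if part)
        records.append(
            {
                "content": text,
                "score": score,
                "title": hit.get("sourceName", ""),
                "metadata": {"knowledge_id": request.get("knowledge_id", "")},
            }
        )
    return {"records": records}


if __name__ == "__main__":
    # Made-up request and hits, for illustration only.
    request = {
        "knowledge_id": "kb-demo",
        "query": "phenol",
        "retrieval_setting": {"top_k": 1, "score_threshold": 0.5},
    }
    hits = [
        {"q": "phenol", "a": "demo category", "score": 0.9, "sourceName": "norman.csv"},
        {"q": "other", "a": "demo category", "score": 0.2, "sourceName": "norman.csv"},
    ]
    print(to_dify_records(request, hits))
```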


	## CECs BatchAnnotator (Desktop Version)
	A desktop tool for chemists and researchers that automates batch retrieval of compound information from CSV files (default column: `IUPAC_name`). It outputs standardized CSV files, failure logs, and raw API records for analysis and debugging. It requires a valid Dify API key and an accessible backend.
	<div align="center">
	<img src="figure/CECs_BatchAnnotator_v1.0.png" alt="CECs_BatchAnnotator" width="400">
	</div>
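
The batch behavior described above (read a CSV, annotate each name, write results, failures, and raw records) can be sketched roughly as below. This is not the desktop tool's source: `annotate_csv` is a hypothetical name, `annotate` is any callable you supply that maps an IUPAC name to a dictionary, and the output columns `usage_category` and `toxicity_endpoints` are assumed for illustration.

```python
import csv
import json
from typing import Callable, Dict, Tuple


def annotate_csv(
    in_path: str,
    out_path: str,
    fail_path: str,
    raw_path: str,
    annotate: Callable[[str], Dict],
    column: str = "IUPAC_name",
) -> Tuple[int, int]:
    """Annotate every name in `column`; return (succeeded, failed) counts."""
    succeeded = failed = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as out, \
            open(fail_path, "w", newline="", encoding="utf-8") as fail, \
            open(raw_path, "w", encoding="utf-8") as raw:
        results = csv.writer(out)
        results.writerow([column, "usage_category", "toxicity_endpoints"])
        failures = csv.writer(fail)
        failures.writerow([column, "error"])

        for row in csv.DictReader(src):
            name = (row.get(column) or "").strip()
            if not name:
                continue  # skip rows with an empty name cell
            try:
                response = annotate(name)
                # Keep the raw response (one JSON object per line) for debugging.
                raw.write(json.dumps({"name": name, "response": response}) + "\n")
                results.writerow(
                    [
                        name,
                        response["usage_category"],
                        json.dumps(response.get("toxicity_endpoints", [])),
                    ]
                )
                succeeded += 1
            except Exception as exc:
                # One bad compound should not stop the batch; log it and move on.
                failures.writerow([name, repr(exc)])
                failed += 1
    return succeeded, failed
```

Passing the annotation step in as a callable keeps the file handling testable without a live Dify backend; a stub function is enough to check the CSV and log outputs.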

	## System Requirements
	| Category | Specification |
	|----------|---------------|
	| Operating System | Windows 10 |
	| Python Version | Python 3.8 or higher |
	| Dependencies | Docker, FastGPT access rights, Dify platform account |

	## Usage Instructions
	### Preparations
	1. Complete Norman database classification optimization (merge into 9 categories).
	2. Deploy FDA plugin to connect FastGPT and Dify.
	3. Import the Norman dataset (knowledge_database_input_iupac.csv) into FastGPT, configure the application, and export the API key.
	4. Create a custom Schema tool on Dify and configure SQL invocation rules.
	5. Deploy Pubchemlite_exposomics (PubChemLite_exposomics_20251226.csv) and Invitrodb_v4.3 databases to local SQL and test query connectivity.
	6. Create a backend program (step1_pubchemlite_invitrodb_to_dify_en.py) to connect Dify and the SQL databases.
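
As an illustration of step 5, the sketch below loads a CSV file into a local table and runs a row count as a basic connectivity test. It assumes SQLite purely as a stand-in, since the SQL engine is not specified here; `load_csv` is a hypothetical helper, every column is stored as text, and the file and table names in the usage lines are placeholders.

```python
import csv
import sqlite3


def load_csv(conn: sqlite3.Connection, csv_path: str, table: str) -> int:
    """Create `table` from the CSV header, insert all rows, and return the row count."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Quote identifiers so column names with spaces or symbols still work.
        columns = ", ".join('"{}" TEXT'.format(name) for name in header)
        conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
        )
    conn.commit()
    # A simple count doubles as a connectivity check after loading.
    return conn.execute('SELECT COUNT(*) FROM "{}"'.format(table)).fetchone()[0]


if __name__ == "__main__":
    # Placeholder paths and table name; adjust to your local setup.
    connection = sqlite3.connect("cec_local.db")
    total = load_csv(connection, "PubChemLite_exposomics_20251226.csv", "pubchemlite")
    print("rows loaded:", total)
```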

	### Inference Workflow
	1. Input the IUPAC name of the emerging contaminant into the Dify chat interface.
	2. The AI agent invokes:
	   - FastGPT knowledge base for usage classification via FDA plugin.
	   - Local SQL databases for toxicity endpoints via GPT-4o-generated SQL queries.
	3. Receive the structured output (JSON format) containing usage category and toxicity endpoints.
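
The same workflow can also be driven programmatically. The sketch below sends one IUPAC name to Dify's `chat-messages` endpoint in blocking mode and parses the JSON object out of the returned `answer` text. Treat it as a starting point under assumptions: the base URL and API key are placeholders, the helper names are hypothetical, and the keys `usage_category` and `toxicity_endpoints` are assumed, since the exact output schema is defined by the chatflow.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, iupac_name: str,
                       user: str = "cec-annotator") -> urllib.request.Request:
    """Build a blocking chat-messages request carrying the IUPAC name as the query."""
    payload = {
        "inputs": {},
        "query": iupac_name,
        "response_mode": "blocking",
        "user": user,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat-messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_answer(answer: str) -> dict:
    """Extract the outermost JSON object from the agent's answer text."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in answer")
    data = json.loads(answer[start:end + 1])
    # Assumed keys; adjust to the schema your chatflow actually emits.
    for key in ("usage_category", "toxicity_endpoints"):
        if key not in data:
            raise ValueError("missing key: " + key)
    return data


def annotate(base_url: str, api_key: str, iupac_name: str, timeout: float = 120.0) -> dict:
    """Send one name to the agent and return the parsed annotation."""
    request = build_chat_request(base_url, api_key, iupac_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return parse_answer(body["answer"])
```

For example, `annotate("http://localhost/v1", "app-your-key", "phenol")` would return the parsed dictionary if a backend is reachable at that placeholder address.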

	## Limitations
	1. The accuracy of annotations depends on the completeness of the Norman, Pubchemlite_exposomics, and Invitrodb_v4.3 databases; unrecorded emerging contaminants may return empty results.
	2. The system requires local deployment of the SQL databases and the FDA plugin, so setting up the environment takes some configuration effort.
	3. Currently only supports input of IUPAC names; other naming formats (e.g., common names) are not supported.

	## Contact
	laquh1086@163.com