metadata
title: PFAS AI Analyzer
emoji: 🧪
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 7860
tags:
- chemistry
- pfas
- bert
- toxicology
- streamlit
pinned: false
short_description: AI-powered PFAS detection and risk assessment pipeline.
🧪 PFAS AI Analyzer (BERT Enhanced)
This application is an end-to-end AI pipeline designed to identify, classify, and assess the environmental risks of Per- and Polyfluoroalkyl Substances (PFAS).
It uses a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model to generate molecular embeddings, which feed a Random Forest classifier for subclass assignment and Random Forest regressors for property prediction.
Key Features
- Advanced PFAS Detection: Uses the OECD-aligned "Chain Rule" logic to distinguish industrial PFAS from other fluorinated compounds, such as the pharmaceutical fluoxetine (Prozac) and the insecticide fipronil.
- Subclass Classification: Automatically categorizes molecules into PFCA (perfluoroalkyl carboxylic acids), PFSA (perfluoroalkyl sulfonic acids), or General PFAS.
- Risk Assessment: Predicts key environmental properties:
  - Persistence: Estimated half-life / biodegradation potential.
  - Mobility: Soil adsorption coefficient ($K_{oc}$).
  - Bioaccumulation: Bioconcentration factor (BCF) / LogP.
- BERT Embeddings: Utilizes a transformer model trained on ChEMBL data to understand deep molecular features beyond simple fingerprints.
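The "Chain Rule" idea from the detection feature can be illustrated with a small sketch. The exact rule used by the app is not stated here, so everything below is an assumption: the function names, the definition of a perfluorinated carbon (a carbon with at least two fluorine neighbours), and the `min_chain_length=2` threshold are all hypothetical. The molecule is passed as a plain atom list and bond list; in practice these would likely come from a cheminformatics toolkit rather than being typed by hand.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Sequence, Set, Tuple


def longest_perfluoro_chain(atoms: Sequence[str],
                            bonds: Sequence[Tuple[int, int]]) -> int:
    """Length of the longest path of directly bonded CF2/CF3 carbons."""
    neighbours: Dict[int, Set[int]] = defaultdict(set)
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    # Assumption: a carbon counts as perfluorinated if it carries >= 2 fluorines.
    perfluoro = {
        idx for idx, element in enumerate(atoms)
        if element == "C"
        and sum(atoms[n] == "F" for n in neighbours[idx]) >= 2
    }

    def walk(node: int, visited: FrozenSet[int]) -> int:
        # Depth-first search restricted to perfluorinated carbons.
        best = 1
        for nxt in neighbours[node]:
            if nxt in perfluoro and nxt not in visited:
                best = max(best, 1 + walk(nxt, visited | {nxt}))
        return best

    return max((walk(s, frozenset({s})) for s in perfluoro), default=0)


def is_chain_pfas(atoms: Sequence[str],
                  bonds: Sequence[Tuple[int, int]],
                  min_chain_length: int = 2) -> bool:
    """Hypothetical chain rule: require a run of perfluorinated carbons."""
    return longest_perfluoro_chain(atoms, bonds) >= min_chain_length


# An isolated CF3 group attached to one other carbon (as in fluoxetine).
cf3_fragment = (["C", "C", "F", "F", "F"],
                [(0, 1), (1, 2), (1, 3), (1, 4)])

# Perfluorobutanoic acid heavy atoms: CF3-CF2-CF2-C(=O)O.
pfba = (["C", "F", "F", "F", "C", "F", "F", "C", "F", "F", "C", "O", "O"],
        [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (4, 6), (4, 7),
         (7, 8), (7, 9), (7, 10), (10, 11), (10, 12)])
```

Under these assumptions the isolated CF3 fragment is rejected while the perfluorobutanoic acid skeleton is accepted, which is the kind of separation the feature describes.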
🧠 How It Works
- Input: The user provides a SMILES string (Simplified Molecular Input Line Entry System).
- Tokenization: The SMILES string is tokenized using a specialized Spe_Tokenizer.
- Embedding: The SMILE-to-BERT model converts the tokens into a 113-dimensional dense vector representation.
- Inference:
  - A Random Forest Classifier determines the PFAS subclass.
  - Random Forest Regressors predict environmental properties.
- Validation: A rule-based sanity checker applies chemical structure rules to prevent false positives.
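The steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the code in `src/app.py`: the regex is a common atom-level SMILES pattern used here as a simplified stand-in for `Spe_Tokenizer`, and `analyze`, `embed`, `classify`, `regressors`, and `passes_rules` are hypothetical names for the BERT embedding step, the Random Forest models, and the rule-based checker.

```python
import re
from typing import Callable, Dict, List, Mapping

# Atom-level SMILES pattern; a simplified stand-in, NOT the project's Spe_Tokenizer.
_SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

Vector = List[float]


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = _SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        # Some character was not covered by the pattern.
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def analyze(smiles: str,
            embed: Callable[[List[str]], Vector],
            classify: Callable[[Vector], str],
            regressors: Mapping[str, Callable[[Vector], float]],
            passes_rules: Callable[[str], bool]) -> Dict[str, object]:
    """Run tokenization, embedding, inference, and validation in order."""
    tokens = tokenize_smiles(smiles)           # Tokenization
    vector = embed(tokens)                     # Embedding (113-dim in the app)
    subclass = classify(vector)                # Inference: subclass
    properties = {name: predict(vector)        # Inference: properties
                  for name, predict in regressors.items()}
    is_pfas = passes_rules(smiles)             # Validation: rule-based check
    # Assumption: a failed rule check discards the model predictions.
    return {
        "is_pfas": is_pfas,
        "subclass": subclass if is_pfas else None,
        "properties": properties if is_pfas else {},
    }
```

Passing the models in as callables keeps the sketch independent of any particular model-loading code; in the real app these would wrap the BERT model and the trained Random Forest estimators.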
File Structure
- src/app.py: Main Streamlit application.
- src/pfas_assets.zip: Contains the BERT model weights and tokenizer data.
- src/*.pkl: Trained scikit-learn models.
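A minimal sketch of how these files might be loaded at startup is shown below. It assumes the `.pkl` files were written with Python's standard `pickle` module (they could equally have been saved with `joblib`), and the helper names `extract_assets` and `load_pickled_models` are hypothetical. Only unpickle files from a source you trust, since unpickling can execute arbitrary code.

```python
import pickle
import zipfile
from pathlib import Path
from typing import Any, Dict


def extract_assets(zip_path: Path, target_dir: Path) -> Path:
    """Unpack the asset archive (model weights, tokenizer data) into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return target_dir


def load_pickled_models(src_dir: Path) -> Dict[str, Any]:
    """Load every *.pkl file in src_dir, keyed by file name without extension."""
    models: Dict[str, Any] = {}
    for pkl_path in sorted(src_dir.glob("*.pkl")):
        with pkl_path.open("rb") as handle:
            models[pkl_path.stem] = pickle.load(handle)
    return models
```

For example, `load_pickled_models(Path("src"))` would return a dictionary with one entry per `.pkl` file, and `extract_assets(Path("src/pfas_assets.zip"), Path("assets"))` would unpack the archive into an `assets` directory (a directory name chosen here only for illustration).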