title: PFAS AI Analyzer
emoji: 🧪
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 7860
tags:
  - chemistry
  - pfas
  - bert
  - toxicology
  - streamlit
pinned: false
short_description: AI-powered PFAS detection and risk assessment pipeline.

🧪 PFAS AI Analyzer (BERT Enhanced)

This application is an end-to-end AI pipeline designed to identify, classify, and assess the environmental risks of Per- and Polyfluoroalkyl Substances (PFAS).

It uses a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model to generate molecular embeddings, which feed a Random Forest classifier for subclass assignment and Random Forest regressors for property prediction.

🚀 Key Features

  1. Advanced PFAS Detection: Uses the OECD-aligned "Chain Rule" logic to distinguish industrial PFAS from other fluorinated compounds such as pharmaceuticals and pesticides (e.g., fluoxetine (Prozac), fipronil).
  2. Subclass Classification: Automatically categorizes molecules into PFCA, PFSA, or General PFAS.
  3. Risk Assessment: Predicts key environmental properties:
    • Persistence: Estimated half-life / biodegradation potential.
    • Mobility: Soil adsorption coefficient ($K_{oc}$).
    • Bioaccumulation: Bioconcentration factor (BCF) / LogP.
  4. BERT Embeddings: Utilizes a transformer model trained on ChEMBL data to understand deep molecular features beyond simple fingerprints.

🧠 How It Works

  1. Input: The user provides a SMILES (Simplified Molecular-Input Line-Entry System) string.
  2. Tokenization: The SMILES string is tokenized using a specialized Spe_Tokenizer.
  3. Embedding: The SMILE-to-BERT model converts the tokens into a 113-dimensional dense vector representation.
  4. Inference:
    • A Random Forest Classifier determines the PFAS subclass.
    • Random Forest Regressors predict environmental properties.
  5. Validation: A rule-based sanity checker applies chemical structure rules to prevent false positives.
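
To make the tokenization step concrete, here is a minimal sketch of atom-level SMILES tokenization with a regular expression. It is not the Spe_Tokenizer the app uses, which may split or merge tokens differently; it only illustrates how a SMILES string becomes a token sequence before embedding. The names `SMILES_TOKEN_PATTERN` and `tokenize_smiles` are hypothetical.

```python
import re
from typing import List

# Atom-level token pattern (an assumption for illustration only; the app's
# Spe_Tokenizer is a separate component with its own vocabulary).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens.

    Raises ValueError if any character is not covered by the pattern.
    """
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # findall silently skips unmatched characters, so verify full coverage.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


# Perfluorooctanoic acid (PFOA), a typical PFCA.
pfoa = "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F"
print(tokenize_smiles(pfoa))
```

Listing `Br` and `Cl` before the single-letter atoms keeps two-letter halogens as one token, which matters when a downstream rule needs to tell chlorine apart from carbon.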

📂 File Structure

  • src/app.py: Main Streamlit application.
  • src/pfas_assets.zip: Contains the BERT model weights and tokenizer data.
  • src/*.pkl: Trained scikit-learn models (the Random Forest classifier and regressors).
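
As a rough sketch of how these files could be loaded at startup, the snippet below extracts the archive and unpickles every `.pkl` file in a directory. The actual loading code in `src/app.py` is not shown in this README, so the helper names (`extract_assets`, `load_pickled_models`) and the idea of keying models by file stem are assumptions; the app may well use `joblib` instead of `pickle`.

```python
import pickle
import zipfile
from pathlib import Path
from typing import Any, Dict


def extract_assets(zip_path: Path, dest_dir: Path) -> Path:
    """Extract an asset archive (e.g. pfas_assets.zip) into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return dest_dir


def load_pickled_models(model_dir: Path) -> Dict[str, Any]:
    """Load every *.pkl file in model_dir, keyed by file name without extension."""
    models: Dict[str, Any] = {}
    for pkl_path in sorted(model_dir.glob("*.pkl")):
        with pkl_path.open("rb") as handle:
            models[pkl_path.stem] = pickle.load(handle)
    return models
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code. Unpickling scikit-learn models also requires scikit-learn to be installed, ideally in the same version used for training.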