---
title: PFAS AI Analyzer
emoji: 🧪
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 7860
tags:
- chemistry
- pfas
- bert
- toxicology
- streamlit
pinned: false
short_description: AI-powered PFAS detection and risk assessment pipeline.
---
# 🧪 PFAS AI Analyzer (BERT Enhanced)
This application is an end-to-end AI pipeline designed to identify, classify, and assess the environmental risks of **Per- and Polyfluoroalkyl Substances (PFAS)**.
It leverages a fine-tuned **BERT (Bidirectional Encoder Representations from Transformers)** model to generate molecular embeddings, which feed a Random Forest classifier for subclass prediction and Random Forest regressors for property prediction.
## 🚀 Key Features
1. **Advanced PFAS Detection:** Uses the OECD-aligned "Chain Rule" logic to distinguish industrial PFAS from other fluorinated compounds such as pharmaceuticals and pesticides (e.g., Prozac, Fipronil).
2. **Subclass Classification:** Automatically categorizes molecules into PFCA, PFSA, or General PFAS.
3. **Risk Assessment:** Predicts key environmental properties:
    * **Persistence:** Estimated half-life / biodegradation potential.
    * **Mobility:** Soil adsorption coefficient ($K_{oc}$).
    * **Bioaccumulation:** Bioconcentration factor (BCF) / LogP.
4. **BERT Embeddings:** Utilizes a transformer model trained on ChEMBL data to understand deep molecular features beyond simple fingerprints.
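
The README does not spell out the "Chain Rule", so the following is only a minimal sketch of one plausible reading: require a run of consecutive perfluorinated carbons, so that an isolated CF3 group on a drug-like molecule does not count. The function names, the `min_run` threshold, and the string-matching approach are all assumptions for illustration. The app's real check may differ and would more likely use proper substructure matching with a cheminformatics toolkit.

```python
import re

# Assumption: the SMILES is written linearly, so each perfluorinated carbon
# appears literally as "C(F)(F)". Equivalent SMILES written in another atom
# order would be missed by this string-level heuristic.
_CF2_UNIT = "C(F)(F)"
_CF2_RUN = re.compile(r"(?:C\(F\)\(F\))+")


def longest_perfluoro_run(smiles: str) -> int:
    """Return the longest run of consecutive 'C(F)(F)' units in the string."""
    runs = _CF2_RUN.findall(smiles)
    return max((len(run) // len(_CF2_UNIT) for run in runs), default=0)


def passes_chain_rule(smiles: str, min_run: int = 2) -> bool:
    """Hypothetical chain rule: at least `min_run` perfluorinated carbons in a row."""
    return longest_perfluoro_run(smiles) >= min_run


PFOA = "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F"
FLUOXETINE = "CNCCC(c1ccccc1)Oc1ccc(cc1)C(F)(F)F"  # active ingredient of Prozac

print(passes_chain_rule(PFOA))        # True: seven fluorinated carbons in a row
print(passes_chain_rule(FLUOXETINE))  # False: a single isolated CF3 group
```

Keeping the threshold as a parameter makes it easy to compare stricter and looser definitions on the same set of molecules.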
## 🧠 How It Works
1. **Input:** The user provides a molecule as a SMILES (Simplified Molecular Input Line Entry System) string.
2. **Tokenization:** The SMILES string is tokenized using a specialized `Spe_Tokenizer`.
3. **Embedding:** The **SMILE-to-BERT** model converts the tokens into a 113-dimensional dense vector representation.
4. **Inference:**
    * A **Random Forest Classifier** determines the PFAS subclass.
    * **Random Forest Regressors** predict environmental properties.
5. **Validation:** A rule-based sanity checker applies chemical structure rules to prevent false positives.
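
The control flow of the steps above can be sketched as follows. This is a hedged outline, not the code in `src/app.py`: `analyze`, `Report`, and the callable parameters (`embed`, `classify`, `regressors`, `structure_ok`) are hypothetical names, and the stub lambdas stand in for the real tokenizer, BERT model, and Random Forest models. Only the 113-dimensional embedding size and the order of the steps come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence

EMBEDDING_DIM = 113  # embedding size stated in the steps above

Vector = Sequence[float]


@dataclass
class Report:
    is_pfas: bool
    subclass: Optional[str] = None
    properties: Dict[str, float] = field(default_factory=dict)


def analyze(
    smiles: str,
    embed: Callable[[str], Vector],
    classify: Callable[[Vector], str],
    regressors: Dict[str, Callable[[Vector], float]],
    structure_ok: Callable[[str], bool],
) -> Report:
    """Embed, run the models, then let the rule-based check veto the result."""
    vector = embed(smiles)  # tokenization + embedding (steps 2 and 3)
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} features, got {len(vector)}")
    subclass = classify(vector)  # step 4: subclass prediction
    properties = {name: predict(vector) for name, predict in regressors.items()}
    if not structure_ok(smiles):  # step 5: sanity check against false positives
        return Report(is_pfas=False)
    return Report(is_pfas=True, subclass=subclass, properties=properties)


# Stub components, for illustration only.
report = analyze(
    "OC(=O)C(F)(F)C(F)(F)F",
    embed=lambda s: [0.0] * EMBEDDING_DIM,
    classify=lambda v: "PFCA",
    regressors={"log_koc": lambda v: 2.5},
    structure_ok=lambda s: "C(F)(F)" in s,
)
print(report)
```

Passing the components in as callables keeps the sketch independent of any particular model library and makes the validation step easy to test on its own.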
## 📂 File Structure
* `src/app.py`: Main Streamlit application.
* `src/pfas_assets.zip`: Contains the BERT model weights and tokenizer data.
* `src/*.pkl`: Trained scikit-learn models (the subclass classifier and property regressors).
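
As a rough illustration of how these files might be consumed at startup, the sketch below unpacks the asset archive and loads every `.pkl` file in a directory. The helper names are hypothetical, and it assumes the models were saved with the standard `pickle` module. If they were saved with `joblib` instead, the loading call would need to change. Unpickling scikit-learn models also requires a compatible scikit-learn version to be installed, and pickle files should only be loaded from trusted sources.

```python
import pickle
import zipfile
from pathlib import Path
from typing import Any, Dict


def extract_assets(zip_path: Path, target_dir: Path) -> Path:
    """Unpack an asset archive (e.g. src/pfas_assets.zip) into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return target_dir


def load_pickled_models(model_dir: Path) -> Dict[str, Any]:
    """Load every *.pkl file in model_dir, keyed by file name without extension."""
    models: Dict[str, Any] = {}
    for path in sorted(model_dir.glob("*.pkl")):
        with path.open("rb") as handle:
            models[path.stem] = pickle.load(handle)
    return models
```

A call such as `load_pickled_models(Path("src"))` would then return a dictionary of model objects keyed by file name.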