---
title: PFAS AI Analyzer
emoji: 🧪
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 7860
tags:
- chemistry
- pfas
- bert
- toxicology
- streamlit
pinned: false
short_description: AI-powered PFAS detection and risk assessment pipeline.
---

# 🧪 PFAS AI Analyzer (BERT Enhanced)

This application is an end-to-end AI pipeline that identifies, classifies, and assesses the environmental risks of **per- and polyfluoroalkyl substances (PFAS)**.

It uses a fine-tuned **BERT (Bidirectional Encoder Representations from Transformers)** model to generate molecular embeddings, which feed a Random Forest classifier for subclass assignment and Random Forest regressors for property prediction.

## 🚀 Key Features

1. **Advanced PFAS Detection:** Uses OECD-aligned "Chain Rule" logic to distinguish industrial PFAS from other fluorinated compounds such as pharmaceuticals and pesticides (e.g., fluoxetine (Prozac), fipronil).
2. **Subclass Classification:** Automatically categorizes molecules as PFCA (perfluoroalkyl carboxylic acids), PFSA (perfluoroalkane sulfonic acids), or General PFAS.
3. **Risk Assessment:** Predicts key environmental properties:
    * **Persistence:** Estimated half-life / biodegradation potential.
    * **Mobility:** Soil adsorption coefficient ($K_{oc}$).
    * **Bioaccumulation:** Bioconcentration factor (BCF) / LogP.
4. **BERT Embeddings:** Uses a transformer model trained on ChEMBL data to capture molecular features beyond simple fingerprints.
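
The README does not spell out the "Chain Rule", so the following is only a minimal sketch of one plausible reading: a molecule is flagged when it contains a run of consecutive carbons that each carry at least two fluorine atoms. The hand-built atom and bond lists, the function names, and the `min_chain=2` threshold are all illustrative assumptions, not the app's actual code; a real implementation would more likely use substructure matching in a cheminformatics toolkit.

```python
from collections import defaultdict


def longest_perfluoro_chain(atoms, bonds):
    """Length of the longest path through carbons that carry >= 2 fluorines.

    `atoms` is a list of element symbols; `bonds` is a list of (i, j) index
    pairs. Both are a hypothetical input format used only for this sketch.
    """
    neighbours = defaultdict(set)
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Simplification: a carbon counts as perfluorinated if it is bonded to at
    # least two fluorine atoms (covers -CF2- and -CF3 groups).
    pf = {
        i for i, sym in enumerate(atoms)
        if sym == "C" and sum(atoms[j] == "F" for j in neighbours[i]) >= 2
    }

    def walk(node, seen):
        # Depth-first search for the longest simple path starting at `node`.
        best = 1
        for nxt in neighbours[node]:
            if nxt in pf and nxt not in seen:
                best = max(best, 1 + walk(nxt, seen | {nxt}))
        return best

    return max((walk(i, {i}) for i in pf), default=0)


def is_chain_rule_pfas(atoms, bonds, min_chain=2):
    # `min_chain` is an assumed threshold, not a value taken from the app.
    return longest_perfluoro_chain(atoms, bonds) >= min_chain


# Perfluorobutanoic acid skeleton: CF3-CF2-CF2-C(=O)O (hydrogens omitted).
PFBA_ATOMS = ["C", "C", "C", "C", "F", "F", "F", "F", "F", "F", "F", "O", "O"]
PFBA_BONDS = [(0, 1), (1, 2), (2, 3), (0, 4), (0, 5), (0, 6),
              (1, 7), (1, 8), (2, 9), (2, 10), (3, 11), (3, 12)]

# A lone CF3 group on another carbon, as found in many fluorinated drugs.
CF3_ATOMS = ["C", "C", "F", "F", "F"]
CF3_BONDS = [(0, 1), (1, 2), (1, 3), (1, 4)]
```

Under these assumptions the PFBA skeleton is flagged, while the isolated CF3 fragment is not, which mirrors the industrial-versus-pharmaceutical distinction the feature list describes.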

## 🧠 How It Works

1. **Input:** The user provides a SMILES (Simplified Molecular-Input Line-Entry System) string.
2. **Tokenization:** The SMILES string is tokenized with a specialized `Spe_Tokenizer`.
3. **Embedding:** The **SMILE-to-BERT** model converts the tokens into a 113-dimensional dense vector.
4. **Inference:**
    * A **Random Forest Classifier** determines the PFAS subclass.
    * **Random Forest Regressors** predict the environmental properties.
5. **Validation:** A rule-based sanity checker applies chemical structure rules to prevent false positives.
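
The data flow in the steps above can be sketched as a single function that takes each stage as a callable. Everything here is a hypothetical illustration: the `analyze` function, the `Analysis` record, the argument names, and the choice to relabel a rejected molecule as "Not PFAS" are assumptions, and the real tokenizer, BERT encoder, and Random Forest models are stood in for by plain Python callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence


@dataclass
class Analysis:
    smiles: str
    is_pfas: bool
    subclass: str
    properties: Dict[str, float] = field(default_factory=dict)


def analyze(
    smiles: str,
    tokenize: Callable[[str], Sequence[str]],
    embed: Callable[[Sequence[str]], Sequence[float]],
    classify: Callable[[Sequence[float]], str],
    regressors: Dict[str, Callable[[Sequence[float]], float]],
    passes_rules: Callable[[str], bool],
) -> Analysis:
    tokens = tokenize(smiles)        # step 2: SMILES -> tokens
    vector = embed(tokens)           # step 3: tokens -> dense embedding
    subclass = classify(vector)      # step 4a: subclass prediction
    properties = {                   # step 4b: one regressor per property
        name: predict(vector) for name, predict in regressors.items()
    }
    # Step 5: the rule-based check can veto the model's positive call.
    # Relabelling as "Not PFAS" is an assumed policy for this sketch.
    if not passes_rules(smiles):
        return Analysis(smiles, False, "Not PFAS", properties)
    return Analysis(smiles, True, subclass, properties)
```

Passing the stages in as arguments keeps the sketch independent of any particular model files, and it makes the validation step easy to exercise with stub functions.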

## 📂 File Structure

* `src/app.py`: Main Streamlit application.
* `src/pfas_assets.zip`: Contains the BERT model weights and tokenizer data.
* `src/*.pkl`: Trained scikit-learn models.
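
Given this layout, start-up loading might look roughly like the sketch below. It assumes the `.pkl` files were written with the standard `pickle` module (they could equally have been saved with `joblib`), and the `load_assets` name, its arguments, and the extraction directory are hypothetical rather than taken from `src/app.py`.

```python
import pickle
import zipfile
from pathlib import Path


def load_assets(src_dir, extract_to):
    """Unpack pfas_assets.zip and load every *.pkl model found in src_dir."""
    src = Path(src_dir)
    out = Path(extract_to)

    # Extract the BERT weights and tokenizer data next to the app.
    with zipfile.ZipFile(src / "pfas_assets.zip") as archive:
        archive.extractall(out)

    # Load each pickled model, keyed by file name without the extension.
    # Only unpickle files you trust: pickle can execute arbitrary code.
    models = {}
    for path in sorted(src.glob("*.pkl")):
        with path.open("rb") as fh:
            models[path.stem] = pickle.load(fh)

    return out, models
```

Unpickling real scikit-learn models also requires scikit-learn to be installed, ideally at the version used for training. In a Streamlit app, a loader like this is commonly wrapped in `st.cache_resource` so that it runs once per process rather than on every rerun.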