---
title: PFAS AI Analyzer
emoji: 🧪
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 7860
tags:
  - chemistry
  - pfas
  - bert
  - toxicology
  - streamlit
pinned: false
short_description: AI-powered PFAS detection and risk assessment pipeline.
---

# 🧪 PFAS AI Analyzer (BERT Enhanced)

This application is an end-to-end AI pipeline that identifies and classifies **Per- and Polyfluoroalkyl Substances (PFAS)** and assesses their environmental risk.

It uses a fine-tuned **BERT (Bidirectional Encoder Representations from Transformers)** model to generate molecular embeddings, which feed a Random Forest classifier for subclass assignment and Random Forest regressors for property prediction.

## 🚀 Key Features

1.  **Advanced PFAS Detection:** Uses OECD-aligned "Chain Rule" logic to distinguish industrial PFAS from other fluorinated compounds, such as the pharmaceutical fluoxetine (Prozac) and the insecticide fipronil.
2.  **Subclass Classification:** Automatically categorizes molecules into PFCA, PFSA, or General PFAS.
3.  **Risk Assessment:** Predicts key environmental properties:
    * **Persistence:** Estimated half-life / biodegradation potential.
    * **Mobility:** Soil adsorption coefficient ($K_{oc}$).
    * **Bioaccumulation:** Bioconcentration factor (BCF) / LogP.
4.  **BERT Embeddings:** Utilizes a transformer model trained on ChEMBL data to understand deep molecular features beyond simple fingerprints.
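
The README does not spell out the "Chain Rule", so the following is only one plausible reading of it, written as a minimal sketch. It assumes a carbon counts as fully fluorinated when it carries at least two fluorines and no H, Cl, Br, or I, and that a molecule passes when it contains a chain of at least `min_chain` such carbons bonded to one another. The threshold of 2, the function names, and the hand-built molecular graphs are all illustrative assumptions; the app itself may use a cheminformatics toolkit and a different cutoff.

```python
from collections import defaultdict

# Substituents that stop a carbon from counting as fully fluorinated.
BLOCKING = {"H", "Cl", "Br", "I"}


def longest_perfluoro_chain(atoms, bonds):
    """Length of the longest path of bonded, fully fluorinated carbons.

    atoms: element symbols, with hydrogens listed explicitly.
    bonds: (i, j) index pairs into atoms (bond order is ignored).
    """
    nbrs = defaultdict(set)
    for i, j in bonds:
        nbrs[i].add(j)
        nbrs[j].add(i)

    def is_perfluoro(i):
        if atoms[i] != "C":
            return False
        elems = [atoms[j] for j in nbrs[i]]
        return elems.count("F") >= 2 and not BLOCKING.intersection(elems)

    pf = {i for i in range(len(atoms)) if is_perfluoro(i)}

    def walk(i, seen):
        # Depth-first search restricted to fully fluorinated carbons.
        best = 1
        for j in nbrs[i]:
            if j in pf and j not in seen:
                best = max(best, 1 + walk(j, seen | {j}))
        return best

    return max((walk(i, {i}) for i in pf), default=0)


def passes_chain_rule(atoms, bonds, min_chain=2):
    # min_chain=2 is an assumed threshold, not a value taken from the app.
    return longest_perfluoro_chain(atoms, bonds) >= min_chain


def perfluoro_acid(n):
    """Hand-built graph for CF3-(CF2)(n-1)-COOH with n >= 1; n=7 gives PFOA."""
    atoms, bonds = [], []
    prev = None
    for k in range(n):
        c = len(atoms)
        atoms.append("C")
        if prev is not None:
            bonds.append((prev, c))
        for _ in range(3 if k == 0 else 2):
            atoms.append("F")
            bonds.append((c, len(atoms) - 1))
        prev = c
    c = len(atoms)
    atoms += ["C", "O", "O", "H"]  # carboxylic acid head group
    bonds += [(prev, c), (c, c + 1), (c, c + 2), (c + 2, c + 3)]
    return atoms, bonds


# An aryl-CF3 fragment of the kind found in fluoxetine: a single CF3
# attached to a ring carbon (the rest of the ring is omitted).
aryl_cf3 = (["C", "F", "F", "F", "C"], [(0, 1), (0, 2), (0, 3), (0, 4)])

print(passes_chain_rule(*perfluoro_acid(7)))  # PFOA-like chain
print(passes_chain_rule(*aryl_cf3))           # isolated CF3 group
```

Under these assumptions the PFOA-like chain passes and the isolated CF3 group does not, which is the kind of separation between industrial PFAS and fluorinated drugs that the feature list describes.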

## 🧠 How It Works

1.  **Input:** The user provides a SMILES string (Simplified Molecular Input Line Entry System).
2.  **Tokenization:** The SMILES string is tokenized using a specialized `Spe_Tokenizer`.
3.  **Embedding:** The **SMILE-to-BERT** model converts the tokens into a 113-dimensional dense vector representation.
4.  **Inference:**
    * A **Random Forest Classifier** determines the PFAS subclass.
    * **Random Forest Regressors** predict environmental properties.
5.  **Validation:** A rule-based sanity checker applies chemical structure rules to prevent false positives.
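
The five steps above can be sketched as a single orchestration function. This is a hedged outline rather than the code in `src/app.py`: the `analyze` function, the `Prediction` dataclass, the `"Not PFAS"` label, and the property name `log_koc` are hypothetical, and the tokenizer, embedder, and models are passed in as plain callables so the example runs without the real model files.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Prediction:
    smiles: str
    subclass: str
    properties: Dict[str, float] = field(default_factory=dict)
    overridden: bool = False  # True when the rule check vetoed the model


def analyze(
    smiles: str,
    tokenize: Callable[[str], List[str]],
    embed: Callable[[List[str]], Sequence[float]],
    classify: Callable[[Sequence[float]], str],
    regressors: Dict[str, Callable[[Sequence[float]], float]],
    is_plausible_pfas: Callable[[str], bool],
) -> Prediction:
    tokens = tokenize(smiles)        # step 2: tokenization
    vector = embed(tokens)           # step 3: embedding
    subclass = classify(vector)      # step 4: subclass prediction
    props = {name: reg(vector) for name, reg in regressors.items()}
    # Step 5: the structural rule check can override a positive model call.
    if subclass != "Not PFAS" and not is_plausible_pfas(smiles):
        return Prediction(smiles, "Not PFAS", props, overridden=True)
    return Prediction(smiles, subclass, props)


# Toy stand-ins for the real components (all assumptions for illustration).
stubs = dict(
    tokenize=list,                                # one token per character
    embed=lambda toks: [float(len(toks))] * 113,  # 113-dimensional vector
    classify=lambda vec: "PFCA",
    regressors={"log_koc": lambda vec: 0.0},
    is_plausible_pfas=lambda s: s.count("F") >= 5,
)

result = analyze("OC(=O)C(F)(F)C(F)(F)C(F)(F)F", **stubs)
print(result.subclass, result.overridden)
```

Keeping the rule check as the last step means a confident but wrong model prediction can still be rejected on structural grounds, which matches the stated purpose of the validation stage.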

## 📂 File Structure

* `src/app.py`: Main Streamlit application.
* `src/pfas_assets.zip`: Contains the BERT model weights and tokenizer data.
* `src/*.pkl`: Trained scikit-learn models.
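
A minimal sketch of how these files might be loaded at startup, using only the standard library. The `load_assets` helper and the extraction directory are hypothetical, and it assumes the `.pkl` files were written with `pickle`; if they were saved with another tool such as joblib, the matching loader would be needed instead.

```python
import pickle
import zipfile
from pathlib import Path


def load_assets(src_dir, extract_to):
    """Unpack pfas_assets.zip and load every .pkl file found in src_dir.

    Returns the extraction directory and a dict keyed by file stem.
    """
    src = Path(src_dir)
    out = Path(extract_to)
    archive = src / "pfas_assets.zip"
    if archive.exists():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out)  # model weights and tokenizer data

    models = {}
    for path in sorted(src.glob("*.pkl")):
        # Unpickling executes code from the file, so only load trusted
        # files. Loading scikit-learn models also requires scikit-learn
        # to be installed.
        with path.open("rb") as fh:
            models[path.stem] = pickle.load(fh)
    return out, models
```

A call such as `load_assets("src", "assets")` would then return the unpacked asset directory together with the loaded models.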