# PolyFusionAgent: a multimodal foundation model and an autonomous AI assistant for polymer informatics

**PolyFusionAgent** is an interactive framework that couples a **multimodal polymer foundation model (PolyFusion)** with a **tool-augmented, literature-grounded design agent (PolyAgent)** for polymer property prediction, inverse design, and evidence-linked scientific reasoning.

> **PolyFusion** aligns complementary polymer views—**PSMILES sequence**, **2D topology**, **3D structural proxies**, and **chemical fingerprints**—into a shared latent space that transfers across chemistries and data regimes.
> **PolyAgent** closes the design loop by connecting **prediction + generation + retrieval + visualization** so recommendations are contextualized with explicit supporting precedent.

---

## Authors & Affiliation

**Manpreet Kaur**¹, **Qian Liu**¹\*

¹ Department of Applied Computer Science, The University of Winnipeg, Winnipeg, MB, Canada

### Contact
- **Qian Liu** — qi.liu@uwinnipeg.ca

---

## Abstract

Polymers underpin technologies from energy storage to biomedicine, yet discovery remains constrained by an astronomically large design space and fragmented representations of polymer structure, properties, and prior knowledge. Although machine learning has advanced property prediction and candidate generation, most models remain disconnected from the physical and experimental context needed for actionable materials design.

Here we introduce **PolyFusionAgent**, an interactive framework that couples a multimodal polymer foundation model (**PolyFusion**) with a tool-augmented, literature-grounded design agent (**PolyAgent**). PolyFusion aligns complementary polymer views—sequence, topology, three-dimensional structural proxies, and chemical fingerprints—across millions of polymers to learn a shared latent space that transfers across chemistries and data regimes. Using this unified representation, PolyFusion improves prediction of key thermophysical properties and enables property-conditioned generation of chemically valid, structurally novel polymers that extend beyond the reference design space.

PolyAgent closes the design loop by coupling prediction and inverse design to evidence retrieval from the polymer literature, so that hypotheses are proposed, evaluated, and contextualized with explicit supporting precedent in a single workflow. Together, **PolyFusionAgent** establishes a route toward interactive, evidence-linked polymer discovery that combines large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.

---

<p align="center">
  <img src="assets/PolyFusionAgent_overview.png" alt="PolyFusionAgent overview" width="850"/>
</p>

## Contents

- [1. Repository Overview](#1-repository-overview)
- [2. Dependencies & Environment](#2-dependencies--environment)
  - [2.1 Installation](#21-installation)
  - [2.2 Optional Chemistry & GPU Notes](#22-optional-chemistry--gpu-notes)
- [3. Data, Modalities, and Preprocessing](#3-data-modalities-and-preprocessing)
  - [3.1 Input CSV schema](#31-input-csv-schema)
  - [3.2 Generate multimodal columns (graph/geometry/fingerprints)](#32-generate-multimodal-columns-graphgeometryfingerprints)
  - [3.3 What “graph”, “geometry”, and “fingerprints” look like](#33-what-graph-geometry-and-fingerprints-look-like)
- [4. Models & Artifacts](#4-models--artifacts)
- [5. Running the Code](#5-running-the-code)
  - [5.1 Multimodal contrastive pretraining (PolyFusion)](#51-multimodal-contrastive-pretraining-polyfusion)
  - [5.2 Downstream property prediction](#52-downstream-property-prediction)
  - [5.3 Inverse design / polymer generation](#53-inverse-design--polymer-generation)
  - [5.4 PolyAgent (Gradio UI)](#54-polyagent-gradio-ui)
- [6. Results & Reproducibility](#6-results--reproducibility)
- [7. Citation](#7-citation)
- [8. Contact](#8-contact)
- [9. License & Disclaimer](#9-license--disclaimer)

---

## 1. Repository Overview

This repository contains three major components:

### **(A) PolyFusion** — multimodal polymer foundation model
PolyFusion learns a shared embedding space by aligning polymer modalities with **multimodal contrastive learning**:
- **PSMILES encoder**: DeBERTaV2-style sequence encoder (`PolyFusion/DeBERTav2.py`)
- **2D graph encoder**: GINE (Graph Isomorphism Network with edge features) (`PolyFusion/GINE.py`)
- **3D proxy encoder**: SchNet (`PolyFusion/SchNet.py`)
- **Fingerprint encoder**: Transformer encoder for Morgan bits (`PolyFusion/Transformer.py`)
- **Pretraining script**: `PolyFusion/CL.py`

### **(B) Downstream Tasks** — prediction + inverse design
- **Property prediction** (multi-property evaluation with per-property CV): `Downstream Tasks/Property_Prediction.py`
- **Inverse design / generation** (property-conditioned generation using SELFIES-TED decoding + latent guidance): `Downstream Tasks/Polymer_Generation.py`

### **(C) PolyAgent** — tool-augmented design assistant
A modular orchestrator that can:
- extract multimodal polymer data
- encode PolyFusion embeddings
- predict properties using the best downstream heads
- generate candidates via an inverse-design generator
- retrieve literature via local RAG + web search
- visualize polymer renderings and explainability maps
- compose a grounded, citation-linked final response

Files:
- `PolyAgent/orchestrator.py`
- `PolyAgent/rag_pipeline.py`
- `PolyAgent/gradio_interface.py`

---

## 2. Dependencies & Environment

### 2.1 Installation

```bash
git clone https://github.com/manpreet88/PolyFusionAgent.git
cd PolyFusionAgent

# Recommended: create a fresh environment (conda or venv), then:
pip install -r requirements.txt
```

### 2.2 Optional Chemistry & GPU Notes

**RDKit (recommended).** `Data_Modalities.py` and many optional visualization/validation steps in the generation and agent workflows work best with RDKit. Recommended installation:

```bash
conda install -c conda-forge rdkit
```

**GPU (recommended for training and large runs).** Your PyTorch + CUDA versions should match your GPU driver. If you use `torch-geometric`, install it following the official wheels for your CUDA/PyTorch build.

## 3. Data, Modalities, and Preprocessing

### 3.1 Input CSV schema

At minimum, your dataset CSV should include a polymer string column:

- `psmiles` (required): polymer SMILES / PSMILES string (often contains `[*]` endpoints)

Optional columns:

- `source`: any identifier/source tag
- property columns: e.g., density, Tg, Tm, Td, etc. (names vary—see the downstream scripts’ column matching)

Example:

```csv
psmiles,source,density,glass transition,melting,thermal decomposition
[*]CC(=O)OCCO[*],PI1M,1.21,55,155,350
...
```

**Wildcard handling:** this code replaces `*` (atomic number 0) with astatine (`At`, Z = 85) internally for RDKit robustness, while preserving endpoint semantics.

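The endpoint substitution described above amounts to a reversible string rewrite. A minimal sketch (the helper names `to_internal` / `to_psmiles` are hypothetical, not functions in this repo):

```python
# Illustrative sketch of the [*] <-> [At] endpoint substitution described
# above; helper names are hypothetical, not part of this repository.
def to_internal(psmiles: str) -> str:
    """Replace wildcard endpoints with astatine for RDKit-friendly parsing."""
    return psmiles.replace("[*]", "[At]")

def to_psmiles(internal: str) -> str:
    """Restore the polymer endpoint markers."""
    return internal.replace("[At]", "[*]")

print(to_internal("[*]CC(=O)OCCO[*]"))  # [At]CC(=O)OCCO[At]
```

Because astatine essentially never appears in real polymer datasets, the round trip is lossless in practice.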
### 3.2 Generate multimodal columns (graph/geometry/fingerprints)

Use `Data_Modalities.py` to process a CSV and append JSON blobs for:

- `graph`
- `geometry`
- `fingerprints`

```bash
python Data_Modalities.py \
  --csv_file /path/to/your/polymers.csv \
  --chunk_size 1000 \
  --num_workers 24
```

Outputs:

- `/path/to/your/polymers_processed.csv` (same rows + new modality columns)
- `/path/to/your/polymers_failures.jsonl` (failures with index/smiles/error)

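The failures file is line-delimited JSON, so post-mortems are easy to script. A minimal sketch, assuming the `index`/`smiles`/`error` fields named above (the record shown is fabricated for illustration):

```python
import json

# Sketch: tally failure reasons from the *_failures.jsonl file described above.
# The field names ("index", "smiles", "error") follow the README; treat the
# exact layout as an assumption.
def summarize_failures(lines):
    counts = {}
    for line in lines:
        rec = json.loads(line)
        reason = rec.get("error", "unknown")
        counts[reason] = counts.get(reason, 0) + 1
    return counts

sample = ['{"index": 7, "smiles": "[*]C(", "error": "rdkit_parse_failed"}']
print(summarize_failures(sample))  # {'rdkit_parse_failed': 1}
```

In a real run you would pass `open("polymers_failures.jsonl")` instead of the in-memory sample.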
### 3.3 What “graph”, “geometry”, and “fingerprints” look like

Each processed row stores modalities as JSON strings.

`graph` contains:

- `node_features`: atomic_num, degree, formal_charge, hybridization, aromatic/ring flags, chirality, etc.
- `edge_indices` + `edge_features` (bond_type, stereo, conjugation, etc.)
- `adjacency_matrix`
- `graph_features` (MolWt, LogP, TPSA, rings, rotatable bonds, HBA/HBD, ...)

`geometry` contains:

- ETKDG-generated conformers, optimized via MMFF/UFF (best energy chosen)
- `best_conformer`: atomic_numbers + coordinates + energy + optional 3D descriptors
- a fallback to 2D coordinates if 3D embedding fails

`fingerprints` contains:

- Morgan fingerprints (bitstrings + counts) for radii up to 3 (default)
- e.g., `morgan_r3_bits`, `morgan_r3_counts`, plus smaller radii

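Because each modality is stored as a JSON string, downstream code can inspect it with the standard library alone. A sketch, assuming the `morgan_r3_bits` key holds a 0/1 bitstring as described above (the blob contents here are fabricated for illustration):

```python
import json

# Sketch: read one processed row's "fingerprints" JSON blob and count the set
# bits of the radius-3 Morgan bitstring. The key name follows the README; the
# blob layout is an assumption for illustration.
row_fingerprints = json.dumps(
    {"morgan_r3_bits": "0101100", "morgan_r3_counts": {"12": 2}}
)

fp = json.loads(row_fingerprints)
on_bits = fp["morgan_r3_bits"].count("1")
print(on_bits)  # 3
```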
## 4. Models & Artifacts

This repo is organized so you can train and export artifacts for:

**PolyFusion (pretraining)**
- multimodal CL checkpoint bundle (e.g., `multimodal_output/best/...`)
- unimodal encoder checkpoints (optional, used by some scripts)

**Downstream (best weights per property)**
- saved best checkpoint per property (CV selection)
- directory example: `multimodal_downstream_bestweights/...`

**Inverse design generator artifacts**
- decoder bundles + scalers + (optionally) SentencePiece tokenizer assets
- directory example: `multimodal_inverse_design_output/.../best_models`

> **Important:** Several scripts include placeholder paths at the top (e.g., `/path/to/...`). You must update them for your filesystem.

## 5. Running the Code

### 5.1 Multimodal contrastive pretraining (PolyFusion)

Main entry: `PolyFusion/CL.py`

What it does (high level):

- Streams a large CSV (`CSV_PATH`) and writes per-sample `.pt` files to avoid RAM spikes.
- Encodes polymer modalities with DeBERTaV2 (PSMILES), GINE (2D), SchNet (3D), and a Transformer (fingerprints).
- Projects each modality embedding into a shared space.
- Trains with contrastive alignment (InfoNCE) + optional reconstruction objectives.

**Steps**

1. Edit the path placeholders in `PolyFusion/CL.py`, e.g.:
   - `CSV_PATH`
   - `SPM_MODEL`
   - `PREPROC_DIR`
   - `OUTPUT_DIR` and `BEST_*_DIR` locations (if used)
2. Run:

   ```bash
   python PolyFusion/CL.py
   ```

> **Tip:** Start with a smaller `TARGET_ROWS` (e.g., 100k) to validate pipeline correctness before scaling.

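For intuition, here is a minimal NumPy illustration of symmetric InfoNCE between two modality embedding batches. This is not the repository's implementation (that lives in `PolyFusion/CL.py`); it only shows the objective's shape: matched rows are positives, all other pairs are negatives.

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.07):
    """Symmetric InfoNCE: row i of z_a and row i of z_b form the positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                              # cosine-similarity logits
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_ab = -np.mean(np.diag(log_p))                      # a -> b direction
    logits_t = logits.T
    log_p_t = logits_t - np.log(np.exp(logits_t).sum(axis=1, keepdims=True))
    loss_ba = -np.mean(np.diag(log_p_t))                    # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# Perfectly aligned "views" score much better than misaligned ones.
aligned = info_nce(z, z)
shuffled = info_nce(z, z[::-1])
print(aligned < shuffled)  # True
```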
### 5.2 Downstream property prediction

Script: `Downstream Tasks/Property_Prediction.py`

This script:

- loads your dataset CSV with modalities (e.g., `polyinfo_with_modalities.csv`)
- loads the pretrained encoders / CL fused backbone
- trains a fusion + regression head for each requested property
- evaluates with K-fold cross-validation (`NUM_RUNS = 5`) and saves the best weights

**Steps**

1. Update the placeholders near the top of the script:
   - `POLYINFO_PATH`
   - `PRETRAINED_MULTIMODAL_DIR`
   - optional: `BEST_*_DIR` (if needed)
   - output paths: `OUTPUT_RESULTS`, `BEST_WEIGHTS_DIR`
2. Run:

   ```bash
   python "Downstream Tasks/Property_Prediction.py"
   ```

**Requested properties (default)**

```python
REQUESTED_PROPERTIES = [
    "density",
    "glass transition",
    "melting",
    "specific volume",
    "thermal decomposition",
]
```

The script includes a robust column-matching function that tries to map these names to your dataframe’s actual column headers.

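Column matching of this kind can be approximated with the standard library. The sketch below (header normalization plus a `difflib` fuzzy fallback) is illustrative only; the actual matcher in `Property_Prediction.py` may behave differently, and the example column names are invented:

```python
import difflib

def _norm(s: str) -> str:
    """Lowercase and strip non-alphanumerics so header variants collide."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def match_column(requested, columns):
    """Map a requested property name to the closest dataframe column, or None."""
    table = {_norm(c): c for c in columns}
    key = _norm(requested)
    if key in table:                         # exact match after normalization
        return table[key]
    close = difflib.get_close_matches(key, list(table.keys()), n=1, cutoff=0.6)
    return table[close[0]] if close else None

cols = ["psmiles", "Density (g/cm3)", "Glass_Transition_Temp"]
print(match_column("density", cols))           # Density (g/cm3)
print(match_column("glass transition", cols))  # Glass_Transition_Temp
```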
### 5.3 Inverse design / polymer generation

Script: `Downstream Tasks/Polymer_Generation.py`

Core idea:

- condition a SELFIES-TED-style decoder on PolyFusion embeddings
- guide sampling toward target property values (with optional latent noise and verification)

**Steps**

1. Update the placeholders in the `Config` dataclass:
   - `POLYINFO_PATH`
   - pretrained weights directories (CL + downstream + tokenizer)
   - output directory `OUTPUT_DIR`
2. Run:

   ```bash
   python "Downstream Tasks/Polymer_Generation.py"
   ```

**Notes**

If RDKit and SELFIES are installed, the script can:

- validate chemistry constraints more robustly
- convert polymer endpoints safely (e.g., the `[*]` ↔ `[At]` internal representation)

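At a high level, latent guidance is a propose-score-select loop. The sketch below caricatures it with a stand-in property head and invented noise settings; the real script decodes candidates through the SELFIES-TED decoder and verifies them, all of which is omitted here:

```python
import numpy as np

# Caricature of property-guided latent sampling: perturb a seed embedding,
# score each candidate with a (stand-in) property predictor, and keep the
# candidate closest to the target value.
def guided_sample(seed_z, predict, target, n_candidates=64, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    cands = seed_z + noise * rng.normal(size=(n_candidates, seed_z.shape[0]))
    errors = np.abs(predict(cands) - target)
    return cands[np.argmin(errors)]

predict = lambda z: z.sum(axis=1)        # stand-in property head, not the repo's
seed_z = np.zeros(16)
best = guided_sample(seed_z, predict, target=0.5)

# The selected candidate should sit closer to the target than the seed does.
improved = abs(float(best.sum()) - 0.5) < abs(float(seed_z.sum()) - 0.5)
print(improved)
```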
### 5.4 PolyAgent (Gradio UI)

Files:

- `PolyAgent/orchestrator.py` (core engine)
- `PolyAgent/gradio_interface.py` (UI)
- `PolyAgent/rag_pipeline.py` (local RAG utilities)

**What you configure**

In `PolyAgent/orchestrator.py`, update the `PathsConfig` placeholders, e.g.:

- `cl_weights_path`
- `downstream_bestweights_5m_dir`
- `inverse_design_5m_dir`
- `spm_model_path`, `spm_vocab_path`
- `chroma_db_path` (if using a local RAG store)

**Environment variables**

- `OPENAI_API_KEY` (required for planning/composition)

Optional (improves retrieval coverage):

- `OPENAI_MODEL` (defaults set in config)
- `HF_TOKEN` (if pulling HF artifacts)
- `SPRINGER_NATURE_API_KEY`, `SEMANTIC_SCHOLAR_API_KEY`

**Run the UI**

```bash
cd PolyAgent
python gradio_interface.py --server-name 0.0.0.0 --server-port 7860
```

**Prompting tips**

- To trigger inverse design, include “generate” / “inverse design” and a target value, e.g. `target_value=60` or `Tg 60`.
- Provide a seed polymer PSMILES in a code block:

  ```
  [*]CC(=O)OCCOCCOC(=O)C[*]
  ```

- If you need more citations, ask explicitly, e.g. “cite 10 papers”.

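The target-value trigger formats above are simple enough to parse with a regular expression. The sketch below is illustrative of those formats only, not PolyAgent's actual prompt parser:

```python
import re

# Sketch: extract an inverse-design target from a prompt. The two patterns
# mirror the trigger formats noted above ("target_value=60", "Tg 60"); this
# is an illustration, not the agent's real parsing logic.
def extract_target(prompt: str):
    m = re.search(r"target_value\s*=\s*(-?\d+(?:\.\d+)?)", prompt, re.I)
    if not m:
        m = re.search(r"\bTg\s*(-?\d+(?:\.\d+)?)", prompt, re.I)
    return float(m.group(1)) if m else None

print(extract_target("generate a polymer with target_value=60"))  # 60.0
print(extract_target("inverse design for Tg 105"))                # 105.0
```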
## 6. Results & Reproducibility

- PolyFusion is designed for scalable multimodal alignment across large polymer corpora.
- Downstream scripts perform K-fold evaluation per property and save the best weights.
- PolyAgent produces evidence-linked answers with tool outputs and DOI-style links (when available).

> **Reproducibility reminder:** Several scripts currently use in-file configuration constants (placeholders). For a clean workflow, keep a consistent folder layout for datasets and checkpoints and update paths in one place (or refactor into a shared config module).

## 7. Citation

If you use this repository in your work, please cite the accompanying manuscript:

```bibtex
@article{kaur2026polyfusionagent,
  title  = {PolyFusionAgent: a multimodal foundation model and an autonomous AI assistant for polymer informatics},
  author = {Kaur, Manpreet and Liu, Qian},
  year   = {2026},
  note   = {Manuscript / preprint},
}
```

Replace the BibTeX entry above with the final venue DOI/citation when available.

## 8. Contact

- **Corresponding author:** Qian Liu — qi.liu@uwinnipeg.ca
- **Contributing author:** Manpreet Kaur — kaur-m43@webmail.uwinnipeg.ca

## 9. License & Disclaimer

**License:** (Add your license file here; e.g., MIT / Apache-2.0 / CC BY-NC for models)

**Disclaimer:** This codebase is provided for research and development use. Polymer generation outputs and suggested candidates should be validated with domain expertise, safety constraints, and experimental verification before deployment.