manpreet88 committed
Commit 9c35dc0 · 1 Parent(s): e33a144

Rewrite README structure and formatting

Files changed (1): README.md (+219 −400)
# PolyFusionAgent: A Multimodal Foundation Model and an Autonomous AI Assistant for Polymer Informatics

**PolyFusionAgent** is an interactive framework that couples a **multimodal polymer foundation model (PolyFusion)** with a **tool-augmented, literature-grounded design agent (PolyAgent)** for polymer property prediction, inverse design, and evidence-linked scientific reasoning.
## Authors & Affiliation

**Manpreet Kaur**¹, **Qian Liu**¹*

¹ Department of Applied Computer Science, The University of Winnipeg, Winnipeg, MB, Canada

### Contact
## Abstract

Polymers underpin technologies from energy storage to biomedicine, yet discovery remains constrained by an astronomically large design space and fragmented representations of polymer structure, properties, and prior knowledge. Although machine learning has advanced property prediction and candidate generation, most models remain disconnected from the physical and experimental context needed for actionable materials design.

Here we introduce **PolyFusionAgent**, an interactive framework that couples a multimodal polymer foundation model (**PolyFusion**) with a tool-augmented, literature-grounded design agent (**PolyAgent**). PolyFusion aligns complementary polymer views—sequence, topology, three-dimensional structural proxies, and chemical fingerprints—across millions of polymers to learn a shared latent space that transfers across chemistries and data regimes. Using this unified representation, PolyFusion improves prediction of key thermophysical properties and enables property-conditioned generation of chemically valid, structurally novel polymers that extend beyond the reference design space.

PolyAgent closes the design loop by coupling prediction and inverse design to evidence retrieval from the polymer literature, so that hypotheses are proposed, evaluated, and contextualized with explicit supporting precedent in a single workflow. Together, **PolyFusionAgent** establishes a route toward interactive, evidence-linked polymer discovery that combines large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.
<img src="assets/PP1.png" alt="PolyFusionAgent Overview" width="800" height="1000"/>
</p>

---

## Contents

- [1. Repository Overview](#1-repository-overview)
## 1. Repository Overview

PolyFusionAgent has three tightly coupled layers:
- **(i) PolyFusion** learns a transferable multimodal embedding space.
- **(ii) Task heads** perform property prediction and property-conditioned generation using that embedding.
- **(iii) PolyAgent** orchestrates tools (prediction, generation, retrieval, visualization) to produce grounded, audit-ready design outputs.

### A. PolyFusion — multimodal polymer foundation model (FM)

**Modalities + encoders:**
- **PSMILES (D)** → DeBERTaV2-style encoder (`PolyFusion/DeBERTav2.py`)
- **2D molecular graph (G)** → **GINE** (Graph Isomorphism Network with Edge features) (`PolyFusion/GINE.py`)
- **3D geometry proxy (S)** → **SchNet** (continuous-filter network for 3D structures) (`PolyFusion/SchNet.py`)
- **Fingerprints (T)** → Transformer encoder (`PolyFusion/Transformer.py`)

**Pretraining objective:**
PolyFusion forms a **fused structural anchor** from (D, G, S) and contrastively aligns it to the **fingerprint target** (T) using an **InfoNCE** loss over a cross-similarity matrix (`PolyFusion/CL.py`).

**Key entrypoint:**
- `PolyFusion/CL.py` — multimodal contrastive pretraining (anchor–target InfoNCE)
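The anchor–target alignment can be sketched as follows. This is a minimal NumPy illustration of InfoNCE over a cross-similarity matrix, not the `PolyFusion/CL.py` implementation; batch size and embedding dimension are illustrative, and τ = 0.07 follows the description above.

```python
import numpy as np

def info_nce(anchor, target, tau=0.07):
    """Anchor-target InfoNCE over a cross-similarity matrix.

    anchor: (B, d) fused structural embeddings (from D, G, S)
    target: (B, d) fingerprint embeddings (T)
    Matching rows are positives; all other pairs in the batch are negatives.
    """
    # L2-normalize so dot products are cosine similarities
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = a @ t.T / tau                              # (B, B) cross-similarity matrix
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    # Cross-entropy with the diagonal as the correct class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 600))
# Perfectly aligned anchor/target pairs give a low loss; unrelated pairs do not
loss_aligned = info_nce(z, z)
loss_random = info_nce(z, rng.normal(size=(16, 600)))
```

In the real pipeline the anchor comes from fusing the three structural encoders and the target from the fingerprint encoder; here both are random vectors purely to exercise the loss.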
### B. Downstream tasks — prediction + inverse design

These scripts adapt PolyFusion embeddings for two core tasks:

- **Property prediction (structure → properties)**
  `Downstream Tasks/Property_Prediction.py`
  Trains lightweight regression heads on top of (typically frozen) PolyFusion embeddings for thermophysical properties (e.g., density (ρ), glass transition temperature (Tg), melting temperature (Tm), and thermal decomposition temperature (Td)).

- **Inverse design / generation (target properties → candidate polymers)**
  `Downstream Tasks/Polymer_Generation.py`
  Performs property-conditioned generation using PolyFusion embeddings as the conditioning interface with a pretrained SELFIES-based encoder–decoder (SELFIES-TED) and latent guidance.
### C. PolyAgent — tool-augmented AI assistant (controller + tools)

**Goal:** Convert open-ended polymer design prompts into **grounded, constraint-consistent, evidence-linked** outputs by coupling PolyFusion with tool-mediated verification and retrieval.

**What PolyAgent does (system-level):**
- Decomposes a user request into typed sub-tasks.
- Calls tools for **prediction**, **generation**, **retrieval (local RAG + web)**, and **visualization**.
- Returns a final response with explicit evidence/citations and an experiment-ready validation plan.

**Main files:**
- `PolyAgent/orchestrator.py` — planning + tool routing (controller)
- `PolyAgent/rag_pipeline.py` — local retrieval utilities (PDF → chunks → embeddings → vector store)
- `PolyAgent/gradio_interface.py` — Gradio UI entrypoint

### D. Datasets
This repo is designed to work with large-scale pretraining corpora (for PolyFusion) plus experiment-backed downstream sets (for finetuning/evaluation). It does not redistribute these datasets—please download them from the original sources and follow their licenses/terms.

**Pretraining corpora (examples used in the paper):**
- **PI1M:** “PI1M: A Benchmark Database for Polymer Informatics.”
  DOI page: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00726
  (Often mirrored/linked via PubMed)
- **polyOne:** “polyOne Data Set – 100 million hypothetical polymers …” (Zenodo record).
  Zenodo: https://zenodo.org/records/7766806

**Downstream / evaluation data (example):**
- **PoLyInfo (NIMS Polymer Database)** provides experimental/literature polymer properties and metadata.
  Main site: https://polymer.nims.go.jp/en/
  Overview/help: https://polymer.nims.go.jp/PoLyInfo/guide/en/what_is_polyinfo.html

**Tip:** For reproducibility, document export queries, filtering rules, property units/conditions, and train/val/test splits in `data/README.md`.

---
## 2. Dependencies & Environment

PolyFusionAgent spans three compute modes:
- **Data preprocessing** (RDKit-heavy; CPU-friendly but parallelizable)
- **Model training/inference** (PyTorch; GPU strongly recommended for PolyFusion pretraining)
- **PolyAgent runtime** (Gradio UI + retrieval stack; GPU optional but helpful for throughput)

### 2.1 Supported platforms

- **OS:** Linux recommended (Ubuntu 20.04/22.04 tested most commonly in similar stacks). macOS/Windows are supported for lightweight inference but may require extra care for RDKit/FAISS.
- **Python:** 3.9–3.11 recommended (keep Python/PyTorch/CUDA consistent for reproducibility).
- **GPU:** NVIDIA recommended for training. Manuscript pretraining used mixed precision and ran on NVIDIA A100 GPUs.
### 2.2 Installation (base)

```bash
git clone https://github.com/manpreet88/PolyFusionAgent.git
cd PolyFusionAgent

pip install --upgrade pip
# conda activate polyfusion

pip install -r requirements.txt
```

**Tip (recommended):** split installs by “extras” so users don’t pull GPU/RAG dependencies unless needed.

- `requirements.txt` → core + inference
- `requirements-train.txt` → training + distributed / acceleration
- `requirements-agent.txt` → gradio + retrieval + PDF tooling

(If you keep a single requirements file, clearly label optional dependencies as such.)
### 2.3 Core ML stack (PolyFusion / downstream)

**Required:**
- torch (GPU build strongly recommended for training)
- numpy, pandas, scikit-learn (downstream regression uses standard scaling + CV; manuscript uses 5-fold CV)
- transformers (PSMILES encoder + assorted NLP utilities)

**Recommended:**
- accelerate (multi-GPU / fp16 ergonomics)
- sentencepiece (PSMILES tokenization uses SentencePiece with a fixed 265-token vocab)
- tqdm, rich (logging)

**GPU check:**
```bash
nvidia-smi
python -c "import torch; print('cuda:', torch.cuda.is_available(), '| torch:', torch.__version__, '| cuda_ver:', torch.version.cuda)"
```
### 2.4 Chemistry stack (strongly recommended)

A large fraction of the pipeline depends on RDKit:
- building graphs / fingerprints
- conformer generation
- canonicalization + validity checks
- PolyAgent visualization

**Install RDKit via conda-forge:**
```bash
conda install -c conda-forge rdkit -y
```

**Wildcard endpoint handling (important):**
For RDKit-derived modalities, the pipeline converts polymer repeat units into a pseudo-molecule by replacing the repeat-unit wildcard attachment token `[*]` with `[At]` (Astatine) to ensure chemical sanitization and tool compatibility.
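The endpoint substitution itself is plain string handling, sketched below. The helper names are illustrative (not from the repo); in the actual pipeline the substitution happens before RDKit sanitization.

```python
def to_pseudo_molecule(psmiles: str) -> str:
    """Replace repeat-unit wildcard endpoints [*] with [At] so RDKit
    can sanitize the repeat unit as an ordinary molecule."""
    return psmiles.replace("[*]", "[At]")

def to_repeat_unit(smiles: str) -> str:
    """Invert the substitution after RDKit processing."""
    return smiles.replace("[At]", "[*]")

pseudo = to_pseudo_molecule("[*]CC(=O)OCCO[*]")
restored = to_repeat_unit(pseudo)
```

Astatine works as the stand-in because it is monovalent and essentially never appears in real polymer chemistries, so the round trip is unambiguous.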
### 2.5 Graph / 3D stacks (optional, depending on your implementation)

If your GINE implementation uses PyTorch Geometric, install the wheels that match your exact PyTorch + CUDA combination. PyG install instructions differ by CUDA version; pin your environment carefully.

If you use SchNet via a third-party implementation, confirm the dependency (e.g., schnetpack, torchmd-net, or a local SchNet module). In the manuscript, SchNet uses a neighbor list with radial cutoff 10 Å and ≤64 neighbors/atom, with 6 interaction layers and hidden size 600.
### 2.6 Retrieval stack (PolyAgent)

PolyAgent combines:
- Local RAG over PDFs (chunking + embeddings + vector index)
- Web augmentation (optional)
- Reranking (cross-encoder)

In the manuscript implementation, the local knowledge base is constructed from 1108 PDFs, chunked at 512/256/128 tokens with overlaps 64/48/32, embedded with OpenAI text-embedding-3-small (1536-d), and indexed using FAISS HNSW (M=64, efConstruction=200). Retrieved chunks are reranked with ms-marco-MiniLM-L-12-v2.

**Typical dependencies:**
- gradio
- faiss-cpu (or faiss-gpu if desired)
- pypdf / pdfminer.six (PDF text extraction)
- tiktoken (chunking tokens; manuscript references TikToken cl100k)
- trafilatura (web page extraction; used in manuscript web augmentation)
- transformers (reranker and query rewrite model; manuscript uses T5 for rewriting in web augmentation)
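The multi-granularity chunking above reduces to a sliding window. A minimal sketch (sizes and overlaps follow the manuscript; for simplicity it splits on whitespace rather than tiktoken cl100k tokens, which is what the real pipeline would count):

```python
def chunk_tokens(tokens, size, overlap):
    """Sliding-window chunks of `size` tokens, with `overlap` tokens
    shared between consecutive chunks."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def multi_granularity_chunks(text):
    # Manuscript granularities: 512/256/128 tokens, overlaps 64/48/32
    tokens = text.split()
    return {size: chunk_tokens(tokens, size, overlap)
            for size, overlap in [(512, 64), (256, 48), (128, 32)]}

granularities = multi_granularity_chunks("polymer " * 1000)
```

Indexing all three granularities lets retrieval trade recall (small chunks) against context (large chunks) at query time.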
### 2.7 Environment variables

PolyAgent is a tool-orchestrated system. At minimum, set:
```bash
export OPENAI_API_KEY="YOUR_KEY"
```

Optional (if your configs support them):
```bash
export OPENAI_MODEL="gpt-4.1"   # controller model (manuscript uses GPT-4.1)
export HF_TOKEN="YOUR_HF_TOKEN" # to pull hosted weights/tokenizers if applicable
```

**Recommended .env pattern:** Create a `.env` (do not commit) and load it in the Gradio entrypoint:
```
OPENAI_API_KEY=...
OPENAI_MODEL=gpt-4.1
```

---
## 3. Data, Modalities, and Preprocessing

### 3.1 Datasets (what the manuscript uses)

- Pretraining uses PI1M + polyOne, at two scales: 2M and 5M polymers.
- Downstream fine-tuning / evaluation uses PolyInfo (≈ 1.8×10⁴ experimental polymers).
- PolyInfo is held out from pretraining.

**Where are the links?** The uploaded manuscript describes these datasets but does not include canonical URLs in the excerpted sections available here. Add the official dataset links in this README once you finalize where you host or reference them.
### 3.2 Minimum CSV schema

Your raw CSV must include:
- `psmiles` (required) — polymer repeat unit string with `[*]` endpoints

Optional:
- `source` — dataset tag (PI1M/polyOne/PolyInfo/custom)
- property columns — e.g., density, Tg, Tm, Td (names can be mapped)

**Example:**
```csv
psmiles,source,density,Tg,Tm,Td
[*]CC(=O)OCCO[*],PolyInfo,1.21,55,155,350
```

**Endpoint note:** When generating RDKit-dependent modalities, the code may internally replace `[*]` with `[At]` to sanitize repeat-unit molecules.
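A quick stdlib-only schema check before preprocessing can catch malformed rows early. This is an illustrative sketch (the helper and the two-endpoint assumption for linear repeat units are ours, not from the repo):

```python
import csv
import io

REQUIRED = {"psmiles"}

def validate_rows(csv_text):
    """Yield CSV rows, raising if the required column is missing or a
    PSMILES string lacks its two [*] endpoints (linear repeat units)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for row in reader:
        if row["psmiles"].count("[*]") != 2:
            raise ValueError(f"expected two [*] endpoints: {row['psmiles']}")
        yield row

sample = "psmiles,source,density,Tg,Tm,Td\n[*]CC(=O)OCCO[*],PolyInfo,1.21,55,155,350\n"
rows = list(validate_rows(sample))
```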
### 3.3 Modalities produced per polymer

PolyFusion represents each polymer using four complementary modalities:

- **PSMILES sequences (D)**
  SentencePiece tokenization with fixed vocab size 265 (kept fixed during downstream).
- **2D molecular graph (G)**
  Nodes = atoms, edges = bonds, with chemically meaningful node/edge features.
- **3D conformational proxy (S)**
  Conformer embedding + optimization pipeline (ETKDG/UFF described in Methods); SchNet neighbor cutoff and layer specs given in Supplementary.
- **Fingerprints (T)**
  ECFP6 (radius r=3) with 2048 bits.
### 3.4 Preprocessing script

Use your preprocessing utility (e.g., `Data_Modalities.py`) to append multimodal columns:

```bash
python Data_Modalities.py \
  --csv_file /path/to/polymers.csv \
  --chunk_size 1000 \
  --num_workers 24
```

**Expected outputs:**
- `*_processed.csv` with new columns: graph, geometry, fingerprints (as JSON blobs)
- `*_failures.jsonl` for failed rows (index + error)

---
## 4. Models & Artifacts

This repository typically produces three artifact families:

### 4.1 PolyFusion checkpoints (pretraining)

PolyFusion maps each modality into a shared embedding space of dimension d=600. Pretraining uses:
- unified masking with pmask = 0.15 and an 80/10/10 corruption rule
- anchor–target contrastive learning where the fused structural anchor is aligned to the fingerprint target (InfoNCE with τ = 0.07)

**Store:**
- encoder weights per modality
- projection heads
- training config + tokenizer artifacts (SentencePiece model)
441
- 4.2 Downstream predictors (property regression)
442
 
443
  Downstream uses:
 
 
444
 
445
- fused 600-d embedding
446
-
447
- a lightweight regressor (2-layer MLP, hidden width 300, dropout 0.1)
448
-
449
-
450
-
451
- Training protocol:
452
-
453
- 5-fold CV, inner validation (10%) with early stopping
454
-
455
-
456
-
457
- Save:
458
-
459
- best weights per property per fold
460
-
461
- scalers used for standardization
462
-
463
- 4.3 Inverse design generator (SELFIES-TED conditioning)
464
-
465
- Inverse design conditions a SELFIES-based encoder–decoder (SELFIES-TED) on PolyFusion’s 600-d embedding
466
-
467
-
468
-
469
- .
470
- Implementation details from the manuscript include:
471
-
472
- conditioning via K=4 learned memory tokens
473
-
474
-
475
 
476
- training-time latent noise σtrain = 0.10
 
 
477
 
478
-
479
 
480
- decoding uses top-p (0.92), temperature 1.0, repetition penalty 1.05, max length 256
 
 
 
 
481
 
482
-
 
 
 
483
 
484
- property targeting via generate-then-filter using a GP oracle and acceptance threshold τs = 0.5 (standardized units)
485
-
486
-
487
-
488
- Save:
489
-
490
- decoder weights + conditioning projection
491
-
492
- tokenization assets (if applicable)
493
-
494
- property oracle artifacts (GP models / scalers)
495
 
496
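For reference, the nucleus (top-p) filtering applied at each decoding step can be sketched in NumPy (p = 0.92 as above; the real decoder applies this inside SELFIES-TED sampling together with temperature and repetition penalty):

```python
import numpy as np

def top_p_filter(probs, p=0.92):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    csum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(csum, p)) + 1  # first index reaching mass p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
filtered = top_p_filter(probs, p=0.92)  # the 0.05 tail token is dropped
```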
## 5. Running the Code

Several scripts may contain path placeholders. Centralize them into one config file (recommended) or update the constants in each entrypoint.

### 5.1 Multimodal contrastive pretraining (PolyFusion)

**Entrypoint:** `PolyFusion/CL.py`

**Manuscript-grounded defaults:**
- AdamW, lr=1e-4, weight_decay=1e-2, batch=16, grad accum=4 (effective 64), up to 25 epochs, early stopping patience 10, FP16

**Run:**
```bash
python PolyFusion/CL.py
```

**Sanity tip:** start with a smaller subset (e.g., 50k–200k rows) to validate preprocessing + training stability before scaling to millions.
### 5.2 Downstream property prediction

**Entrypoint:** `Downstream Tasks/Property_Prediction.py`

**What it does:**
- loads a modality-augmented CSV
- loads pretrained PolyFusion weights
- trains property heads with K-fold CV

**Run:**
```bash
python "Downstream Tasks/Property_Prediction.py"
```
### 5.3 Inverse design / polymer generation

**Entrypoint:** `Downstream Tasks/Polymer_Generation.py`

**What it does:**
- conditions SELFIES-TED on PolyFusion embeddings
- generates candidates and filters to target using the manuscript-style oracle loop

**Run:**
```bash
python "Downstream Tasks/Polymer_Generation.py"
```
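The accept/reject step of the generate-then-filter loop can be sketched as follows. This is illustrative only: a toy stand-in replaces the GP property oracle, while τs = 0.5 in standardized units follows the manuscript.

```python
def generate_then_filter(candidates, oracle, target, tau_s=0.5):
    """Keep candidates whose oracle-predicted property (standardized
    units) lies within tau_s of the standardized target value."""
    return [c for c in candidates if abs(oracle(c) - target) <= tau_s]

# Toy stand-in oracle: "predicted standardized Tg" is candidate length mod 3.
# A real run would call the trained GP property model here.
toy_oracle = lambda c: len(c) % 3
kept = generate_then_filter(
    ["[*]CC[*]", "[*]CCC[*]", "[*]CCCC[*]"], toy_oracle, target=1.0)
```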
### 5.4 PolyAgent (Gradio UI)

**Core components:**
- `PolyAgent/orchestrator.py` (controller + tool router)
- `PolyAgent/rag_pipeline.py` (local RAG)
- `PolyAgent/gradio_interface.py` (UI)

**Manuscript controller:** GPT-4.1 with planning temperature τplan=0.2

**Run:**
```bash
cd PolyAgent
python gradio_interface.py --server-name 0.0.0.0 --server-port 7860
```

---
## 6. Results & Reproducibility

### 6.1 What “reproducible” means in this repo

To help others reproduce your paper-level results:
- Pin versions: Python, PyTorch, CUDA, RDKit, FAISS, Transformers
- Fix seeds across Python/NumPy/Torch
- Log configs per run (JSON/YAML dumped beside checkpoints)
- Record dataset snapshots (hashes of CSVs and modality JSON columns)

### 6.2 Manuscript training protocol highlights

- PolyFusion shared latent dimension: 600
- Unified corruption: pmask = 0.15, 80/10/10 rule
- Contrastive alignment uses InfoNCE with τ = 0.07
- Pretraining optimization and schedule: AdamW, lr 1e-4, wd 1e-2, eff batch 64, FP16, early stopping
- PolyAgent retrieval index: 1108 PDFs; chunking and FAISS HNSW params as described

---
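A seed helper along these lines covers the Python/NumPy part; add the analogous `torch.manual_seed(seed)` and `torch.cuda.manual_seed_all(seed)` calls when PyTorch is installed (omitted here so the sketch stays dependency-light):

```python
import random

import numpy as np

def set_seed(seed: int) -> None:
    """Fix RNG state for Python and NumPy; extend with the torch
    seeding calls in environments where PyTorch is present."""
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)
run_a = (random.random(), float(np.random.rand()))
set_seed(42)
run_b = (random.random(), float(np.random.rand()))
# Re-seeding reproduces the identical draws
```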
## 7. Citation

If you use this repository in your work, please cite the accompanying manuscript:

```bibtex
@article{kaur2026polyfusionagent,
  title  = {PolyFusionAgent: a multimodal foundation model and autonomous AI assistant for polymer informatics},
  author = {Kaur, Manpreet and Liu, Qian},
  year   = {2026},
  note   = {Manuscript / preprint}
}
```

**Dataset links:**
- PI1M (JCIM): https://pubs.acs.org/doi/10.1021/acs.jcim.0c00726
- polyOne (Zenodo): https://zenodo.org/records/7766806
1
+ # PolyFusionAgent: A Multimodal Foundation Model and an Autonomous AI Assistant for Polymer Informatics
2
 
3
  **PolyFusionAgent** is an interactive framework that couples a **multimodal polymer foundation model (PolyFusion)** with a **tool-augmented, literature-grounded design agent (PolyAgent)** for polymer property prediction, inverse design, and evidence-linked scientific reasoning.
4
 
 
9
 
10
  ## Authors & Affiliation
11
 
12
+ **Manpreet Kaur**¹, **Qian Liu**¹*
 
13
  ¹ Department of Applied Computer Science, The University of Winnipeg, Winnipeg, MB, Canada
14
 
15
  ### Contact
 
19
 
20
  ## Abstract
21
 
22
+ Polymers underpin technologies from energy storage to biomedicine, yet discovery remains constrained by an astronomically large design space and fragmented representations of polymer structure, properties, and prior knowledge. Although machine learning has advanced property prediction and candidate generation, most models remain disconnected from the physical and experimental context needed for actionable materials design.
23
 
24
+ Here we introduce **PolyFusionAgent**, an interactive framework that couples a multimodal polymer foundation model (**PolyFusion**) with a tool-augmented, literature-grounded design agent (**PolyAgent**). PolyFusion aligns complementary polymer views—sequence, topology, three-dimensional structural proxies, and chemical fingerprints—across millions of polymers to learn a shared latent space that transfers across chemistries and data regimes. Using this unified representation, PolyFusion improves prediction of key thermophysical properties and enables property-conditioned generation of chemically valid, structurally novel polymers that extend beyond the reference design space.
25
 
26
  PolyAgent closes the design loop by coupling prediction and inverse design to evidence retrieval from the polymer literature, so that hypotheses are proposed, evaluated, and contextualized with explicit supporting precedent in a single workflow. Together, **PolyFusionAgent** establishes a route toward interactive, evidence-linked polymer discovery that combines large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.
27
 
 
31
  <img src="assets/PP1.png" alt="PolyFusionAgent Overview" width="800" height="1000"/>
32
  </p>
33
 
34
+ ---
35
+
36
  ## Contents
37
 
38
  - [1. Repository Overview](#1-repository-overview)
 
56
 
57
  ## 1. Repository Overview
58
 
59
+ PolyFusionAgent has three tightly coupled layers:
60
+ - **(i) PolyFusion** learns a transferable multimodal embedding space.
61
+ - **(ii) Task heads** perform property prediction and property-conditioned generation using that embedding.
62
+ - **(iii) PolyAgent** orchestrates tools (prediction, generation, retrieval, visualization) to produce grounded, audit-ready design outputs.
63
+
64
+ ### A. PolyFusion — multimodal polymer foundation model (FM)
65
 
66
+ **Modalities + encoders:**
67
+ - **PSMILES (D)** DeBERTaV2-style encoder (`PolyFusion/DeBERTav2.py`)
68
+ - **2D molecular graph (G)** → **GINE** (Graph Isomorphism Network with Edge features) (`PolyFusion/GINE.py`)
69
+ - **3D geometry proxy (S)** → **SchNet** (continuous-filter network for 3D structures) (`PolyFusion/SchNet.py`)
 
70
  - **Fingerprints (T)** → Transformer encoder (`PolyFusion/Transformer.py`)
71
 
72
  **Pretraining objective:**
73
+ PolyFusion forms a **fused structural anchor** from (D, G, S) and contrastively aligns it to the **fingerprint target** (T) using an **InfoNCE** loss over a cross-similarity matrix (`PolyFusion/CL.py`).
74
 
75
+ **Key entrypoint:**
76
  - `PolyFusion/CL.py` — multimodal contrastive pretraining (anchor–target InfoNCE)
 
77
 
78
  ### B. Downstream tasks — prediction + inverse design
79
+
80
  These scripts adapt PolyFusion embeddings for two core tasks:
81
 
82
  - **Property prediction (structure → properties)**
83
  `Downstream Tasks/Property_Prediction.py`
84
+ Trains lightweight regression heads on top of (typically frozen) PolyFusion embeddings for thermophysical properties (e.g., density (ρ), glass transition temperature (Tg), melting temperature (Tm), and thermal decomposition temperature (Td)).
85
 
86
  - **Inverse design / generation (target properties → candidate polymers)**
87
  `Downstream Tasks/Polymer_Generation.py`
88
  Performs property-conditioned generation using PolyFusion embeddings as the conditioning interface with a pretrained SELFIES-based encoder–decoder (SELFIES-TED) and latent guidance.
 
89
 
90
  ### C. PolyAgent — tool-augmented AI assistant (controller + tools)
91
 
92
+ **Goal:** Convert open-ended polymer design prompts into **grounded, constraint-consistent, evidence-linked** outputs by coupling PolyFusion with tool-mediated verification and retrieval.
93
 
94
+ **What PolyAgent does (system-level):**
95
+ - Decomposes a user request into typed sub-tasks.
96
+ - Calls tools for **prediction**, **generation**, **retrieval (local RAG + web)**, and **visualization**.
97
+ - Returns a final response with explicit evidence/citations and an experiment-ready validation plan.
98
 
99
+ **Main files:**
100
  - `PolyAgent/orchestrator.py` — planning + tool routing (controller)
101
  - `PolyAgent/rag_pipeline.py` — local retrieval utilities (PDF → chunks → embeddings → vector store)
102
  - `PolyAgent/gradio_interface.py` — Gradio UI entrypoint
103
 
104
+ ### D. Datasets
 
105
 
106
  This repo is designed to work with large-scale pretraining corpora (for PolyFusion) plus experiment-backed downstream sets (for finetuning/evaluation). It does not redistribute these datasets—please download them from the original sources and follow their licenses/terms.
107
 
108
+ **Pretraining corpora (examples used in the paper):**
109
+ - **PI1M:** “PI1M: A Benchmark Database for Polymer Informatics.”
110
+ DOI page: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00726
111
+ (Often mirrored/linked via PubMed)
112
+ - **polyOne:** “polyOne Data Set – 100 million hypothetical polymers …” (Zenodo record).
113
+ Zenodo: https://zenodo.org/records/7766806
 
114
 
115
+ **Downstream / evaluation data (example):**
116
+ - **PoLyInfo (NIMS Polymer Database)** provides experimental/literature polymer properties and metadata.
117
+ Main site: https://polymer.nims.go.jp/en/
118
+ Overview/help: https://polymer.nims.go.jp/PoLyInfo/guide/en/what_is_polyinfo.html
119
 
120
+ **Tip:** For reproducibility, document export queries, filtering rules, property units/conditions, and train/val/test splits in `data/README.md`.
121
 
122
+ ---
 
 
 
 
 
 
 
 
123
 
124
+ ## 2. Dependencies & Environment
125
 
126
  PolyFusionAgent spans three compute modes:
127
+ - **Data preprocessing** (RDKit-heavy; CPU-friendly but parallelizable)
128
+ - **Model training/inference** (PyTorch; GPU strongly recommended for PolyFusion pretraining)
129
+ - **PolyAgent runtime** (Gradio UI + retrieval stack; GPU optional but helpful for throughput)
130
 
131
+ ### 2.1 Supported platforms
 
 
 
 
 
 
 
 
 
 
 
 
132
 
133
+ - **OS:** Linux recommended (Ubuntu 20.04/22.04 tested most commonly in similar stacks). macOS/Windows are supported for lightweight inference but may require extra care for RDKit/FAISS.
134
+ - **Python:** 3.9–3.11 recommended (keep Python/PyTorch/CUDA consistent for reproducibility).
135
+ - **GPU:** NVIDIA recommended for training. Manuscript pretraining used mixed precision and ran on NVIDIA A100 GPUs.
136
 
137
+ ### 2.2 Installation (base)
138
 
139
+ ```bash
140
  git clone https://github.com/manpreet88/PolyFusionAgent.git
141
  cd PolyFusionAgent
142
 
 
150
  # conda activate polyfusion
151
 
152
  pip install -r requirements.txt
153
+ ```
154
 
155
+ **Tip (recommended):** split installs by “extras” so users don’t pull GPU/RAG dependencies unless needed.
156
 
157
+ - `requirements.txt` core + inference
158
+ - `requirements-train.txt` → training + distributed / acceleration
159
+ - `requirements-agent.txt`gradio + retrieval + PDF tooling
 
 
 
 
160
 
161
  (If you keep a single requirements file, clearly label optional dependencies as such.)
162
 
163
+ ### 2.3 Core ML stack (PolyFusion / downstream)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
164
 
165
+ **Required:**
166
+ - torch (GPU build strongly recommended for training)
167
+ - numpy, pandas, scikit-learn (downstream regression uses standard scaling + CV; manuscript uses 5-fold CV)
168
+ - transformers (PSMILES encoder + assorted NLP utilities)
169
 
170
+ **Recommended:**
171
+ - accelerate (multi-GPU / fp16 ergonomics)
172
+ - sentencepiece (PSMILES tokenization uses SentencePiece with a fixed 265-token vocab)
173
+ - tqdm, rich (logging)
 
 
 
174
 
175
+ **GPU check:**
176
+ ```bash
177
  nvidia-smi
178
  python -c "import torch; print('cuda:', torch.cuda.is_available(), '| torch:', torch.__version__, '| cuda_ver:', torch.version.cuda)"
179
+ ```
180
 
181
+ ### 2.4 Chemistry stack (strongly recommended)
182
 
183
  A large fraction of the pipeline depends on RDKit:
184
+ - building graphs / fingerprints
185
+ - conformer generation
186
+ - canonicalization + validity checks
187
+ - PolyAgent visualization
188
 
189
+ **Install RDKit via conda-forge:**
190
+ ```bash
 
 
 
 
 
 
 
 
191
  conda install -c conda-forge rdkit -y
192
+ ```
193
 
194
+ **Wildcard endpoint handling (important):**
195
+ For RDKit-derived modalities, the pipeline converts polymer repeat units into a pseudo-molecule by replacing the repeat-unit wildcard attachment token `[*]` with `[At]` (Astatine) to ensure chemical sanitization and tool compatibility.
196
 
197
+ ### 2.5 Graph / 3D stacks (optional, depending on your implementation)
 
 
 
 
 
198
 
199
+ If your GINE implementation uses PyTorch Geometric, install the wheels that match your exact PyTorch + CUDA combination. PyG install instructions differ by CUDA version; pin your environment carefully.
200
 
201
+ If you use SchNet via a third-party implementation, confirm the dependency (e.g., schnetpack, torchmd-net, or a local SchNet module). In the manuscript, SchNet uses a neighbor list with radial cutoff 10 Å and ≤64 neighbors/atom, with 6 interaction layers and hidden size 600.
202
 
203
+ ### 2.6 Retrieval stack (PolyAgent)
 
 
 
 
 
 
 
 
 
 
 
 
204
 
205
  PolyAgent combines:
206
+ - Local RAG over PDFs (chunking + embeddings + vector index)
207
+ - Web augmentation (optional)
208
+ - Reranking (cross-encoder)
209
 
210
+ In the manuscript implementation, the local knowledge base is constructed from 1108 PDFs, chunked at 512/256/128 tokens with overlaps 64/48/32, embedded with OpenAI text-embedding-3-small (1536-d), and indexed using FAISS HNSW (M=64, efconstruction=200). Retrieved chunks are reranked with ms-marco-MiniLM-L-12-v2.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
211
 
212
+ **Typical dependencies:**
213
+ - gradio
214
+ - faiss-cpu (or faiss-gpu if desired)
215
+ - pypdf / pdfminer.six (PDF text extraction)
216
+ - tiktoken (chunking tokens; manuscript references TikToken cl100k)
217
+ - trafilatura (web page extraction; used in manuscript web augmentation)
218
+ - transformers (reranker and query rewrite model; manuscript uses T5 for rewriting in web augmentation)
219
 

### 2.7 Environment variables

PolyAgent is a tool-orchestrated system. At minimum, set:

```bash
export OPENAI_API_KEY="YOUR_KEY"
```

Optional (if your configs support them):

```bash
export OPENAI_MODEL="gpt-4.1"   # controller model (manuscript uses GPT-4.1)
export HF_TOKEN="YOUR_HF_TOKEN" # to pull hosted weights/tokenizers if applicable
```

**Recommended .env pattern:** create a `.env` file (do not commit it) and load it in the Gradio entrypoint:

```
OPENAI_API_KEY=...
OPENAI_MODEL=gpt-4.1
```
 
---

## 3. Data, Modalities, and Preprocessing

### 3.1 Datasets (what the manuscript uses)

- Pretraining uses PI1M + polyOne, at two scales: 2M and 5M polymers.
- Downstream fine-tuning / evaluation uses PolyInfo (≈ 1.8×10⁴ experimental polymers).
- PolyInfo is held out from pretraining.

**Where are the links?** The uploaded manuscript describes these datasets but does not include canonical URLs in the excerpted sections available here. Add the official dataset links to this README once you finalize where you host or reference them.

### 3.2 Minimum CSV schema

Your raw CSV must include:
- `psmiles` (required) — polymer repeat-unit string with `[*]` endpoints

Optional:
- `source` — dataset tag (PI1M/polyOne/PolyInfo/custom)
- property columns — e.g., density, Tg, Tm, Td (names can be mapped)

**Example:**
```csv
psmiles,source,density,Tg,Tm,Td
[*]CC(=O)OCCO[*],PolyInfo,1.21,55,155,350
```

**Endpoint note:** When generating RDKit-dependent modalities, the code may internally replace `[*]` with `[At]` to sanitize repeat-unit molecules.
 
### 3.3 Modalities produced per polymer

PolyFusion represents each polymer using four complementary modalities:

- **PSMILES sequences (D)**
  SentencePiece tokenization with a fixed vocabulary size of 265 (kept fixed during downstream training).
- **2D molecular graph (G)**
  Nodes = atoms, edges = bonds, with chemically meaningful node/edge features.
- **3D conformational proxy (S)**
  Conformer embedding + optimization pipeline (ETKDG/UFF, described in Methods); SchNet neighbor cutoff and layer specs are given in the Supplementary Information.
- **Fingerprints (T)**
  ECFP6 (radius r=3) with 2048 bits.
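The fingerprint modality maps directly onto RDKit's Morgan fingerprint; a sketch assuming RDKit is installed (`ecfp6` is an illustrative helper, using the `[At]` endpoint substitution noted above):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp6(psmiles: str, n_bits: int = 2048) -> list[int]:
    """ECFP6 = Morgan fingerprint with radius 3; [*] endpoints are swapped
    for [At] so RDKit can sanitize the repeat unit."""
    mol = Chem.MolFromSmiles(psmiles.replace("[*]", "[At]"))
    if mol is None:
        raise ValueError(f"unparseable PSMILES: {psmiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=n_bits)
    return list(fp)

bits = ecfp6("[*]CC(=O)OCCO[*]")
print(len(bits))  # 2048
```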
 

### 3.4 Preprocessing script

Use your preprocessing utility (e.g., `Data_Modalities.py`) to append the multimodal columns:

```bash
python Data_Modalities.py \
    --csv_file /path/to/polymers.csv \
    --chunk_size 1000 \
    --num_workers 24
```

**Expected outputs:**
- `*_processed.csv` with new columns `graph`, `geometry`, `fingerprints` (as JSON blobs)
- `*_failures.jsonl` for failed rows (index + error)
---

## 4. Models & Artifacts

This repository typically produces three artifact families:

### 4.1 PolyFusion checkpoints (pretraining)

PolyFusion maps each modality into a shared embedding space of dimension **d = 600**. Pretraining uses:
- unified masking with **pmask = 0.15** and an **80/10/10** corruption rule
- anchor–target contrastive learning where the fused structural anchor is aligned to the fingerprint target (InfoNCE with **τ = 0.07**)

**Store:**
- encoder weights per modality
- projection heads
- training config + tokenizer artifacts (SentencePiece model)
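For reference, InfoNCE with positives on the batch diagonal can be written in a few lines of NumPy (an illustrative re-implementation, not the training code):

```python
import numpy as np

def info_nce(anchors: np.ndarray, targets: np.ndarray, tau: float = 0.07) -> float:
    """InfoNCE over a batch: the i-th anchor's positive is the i-th target;
    all other targets in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ t.T / tau                       # (B, B) cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 600))
print(info_nce(z, z))                          # aligned anchor/target: near-zero loss
print(info_nce(z, rng.normal(size=(8, 600))))  # random pairing: roughly log(batch size)
```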
 

### 4.2 Downstream predictors (property regression)

Downstream uses:
- the fused 600-d embedding
- a lightweight regressor (2-layer MLP, hidden width 300, dropout 0.1)

**Training protocol:**
- 5-fold CV with an inner validation split (10%) and early stopping

**Save:**
- best weights per property per fold
- the scalers used for standardization
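The regressor described above is small enough to sketch directly (assuming PyTorch; `PropertyHead` is an illustrative name, not the repository's class):

```python
import torch
import torch.nn as nn

class PropertyHead(nn.Module):
    """Lightweight regressor over the fused 600-d PolyFusion embedding:
    2-layer MLP, hidden width 300, dropout 0.1 (per the manuscript)."""
    def __init__(self, in_dim: int = 600, hidden: int = 300, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).squeeze(-1)  # (B,) predicted property values

head = PropertyHead()
y = head(torch.randn(4, 600))
print(y.shape)  # torch.Size([4])
```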
 

### 4.3 Inverse design generator (SELFIES-TED conditioning)

Inverse design conditions a SELFIES-based encoder–decoder (SELFIES-TED) on PolyFusion’s 600-d embedding. Implementation details from the manuscript include:
- conditioning via **K = 4** learned memory tokens
- training-time latent noise **σ_train = 0.10**
- decoding with top-p (0.92), temperature 1.0, repetition penalty 1.05, max length 256
- property targeting via generate-then-filter using a GP oracle and acceptance threshold **τ_s = 0.5** (standardized units)

**Save:**
- decoder weights + conditioning projection
- tokenization assets (if applicable)
- property oracle artifacts (GP models / scalers)
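The generate-then-filter step reduces to a simple acceptance test; a toy sketch with stand-ins for the conditioned decoder and the GP oracle:

```python
import random

def generate_then_filter(generate, oracle, target_std, tau_s=0.5, n_draws=1000):
    """Draw candidates and accept those whose oracle-predicted property lies
    within tau_s of the target in standardized units. `generate` and `oracle`
    stand in for the conditioned decoder and the GP property model."""
    accepted = []
    for _ in range(n_draws):
        candidate = generate()
        if abs(oracle(candidate) - target_std) <= tau_s:
            accepted.append(candidate)
    return accepted

# Toy stand-ins: candidates are floats and the "oracle" is the identity map.
rng = random.Random(0)
hits = generate_then_filter(lambda: rng.gauss(0.0, 1.0), lambda x: x, target_std=1.0)
print(len(hits))  # only draws within tau_s of the target survive
```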
 
---

## 5. Running the Code

Several scripts may contain path placeholders. Centralize them into one config file (recommended) or update the constants in each entrypoint.

### 5.1 Multimodal contrastive pretraining (PolyFusion)

**Entrypoint:**
- `PolyFusion/CL.py`

**Manuscript-grounded defaults:**
- AdamW, lr=1e-4, weight_decay=1e-2, batch=16, grad accum=4 (effective batch 64), up to 25 epochs, early stopping patience 10, FP16

**Run:**
```bash
python PolyFusion/CL.py
```

**Sanity tip:** Start with a smaller subset (e.g., 50k–200k rows) to validate preprocessing and training stability before scaling to millions.
 

### 5.2 Downstream property prediction

**Entrypoint:**
- `Downstream Tasks/Property_Prediction.py`

**What it does:**
- loads a modality-augmented CSV
- loads pretrained PolyFusion weights
- trains property heads with K-fold CV

**Run:**
```bash
python "Downstream Tasks/Property_Prediction.py"
```

### 5.3 Inverse design / polymer generation

**Entrypoint:**
- `Downstream Tasks/Polymer_Generation.py`

**What it does:**
- conditions SELFIES-TED on PolyFusion embeddings
- generates candidates and filters toward the target using the manuscript-style oracle loop

**Run:**
```bash
python "Downstream Tasks/Polymer_Generation.py"
```

### 5.4 PolyAgent (Gradio UI)

**Core components:**
- `PolyAgent/orchestrator.py` (controller + tool router)
- `PolyAgent/rag_pipeline.py` (local RAG)
- `PolyAgent/gradio_interface.py` (UI)

**Manuscript controller:**
- GPT-4.1 controller with planning temperature τ_plan = 0.2

**Run:**
```bash
cd PolyAgent
python gradio_interface.py --server-name 0.0.0.0 --server-port 7860
```

---

## 6. Results & Reproducibility

### 6.1 What “reproducible” means in this repo

To help others reproduce your paper-level results:
- Pin versions: Python, PyTorch, CUDA, RDKit, FAISS, Transformers
- Fix seeds across Python/NumPy/Torch
- Log configs per run (JSON/YAML dumped beside checkpoints)
- Record dataset snapshots (hashes of CSVs and modality JSON columns)
 
### 6.2 Manuscript training protocol highlights

- PolyFusion shared latent dimension: **600**
- Unified corruption: **pmask = 0.15**, **80/10/10** rule
- Contrastive alignment uses InfoNCE with **τ = 0.07**
- Pretraining optimization and schedule: AdamW, lr 1e-4, wd 1e-2, effective batch 64, FP16, early stopping
- PolyAgent retrieval index: 1108 PDFs; chunking and FAISS HNSW parameters as described above
 
---

## 7. Citation

If you use this repository in your work, please cite the accompanying manuscript:

```bibtex
@article{kaur2026polyfusionagent,
  title  = {PolyFusionAgent: a multimodal foundation model and autonomous AI assistant for polymer informatics},
  author = {Kaur, Manpreet and Liu, Qian},
  year   = {2026},
  note   = {Manuscript / preprint}
}
```

PI1M (JCIM): https://pubs.acs.org/doi/10.1021/acs.jcim.0c00726

polyOne (Zenodo): https://zenodo.org/records/7766806