manpreet88 committed on
Commit
e33a144
·
1 Parent(s): 415578e

Update README.md

Files changed (1)
  1. README.md +368 -172
README.md CHANGED
@@ -50,8 +50,6 @@ PolyAgent closes the design loop by coupling prediction and inverse design to ev
  - [5.4 PolyAgent (Gradio UI)](#54-polyagent-gradio-ui)
  - [6. Results & Reproducibility](#6-results--reproducibility)
  - [7. Citation](#7-citation)
- - [8. Contact](#8-contact)
- - [9. License & Disclaimer](#9-license--disclaimer)

  ---
 
@@ -101,329 +99,527 @@ Main files:
  - `PolyAgent/rag_pipeline.py` — local retrieval utilities (PDF → chunks → embeddings → vector store)
  - `PolyAgent/gradio_interface.py` — Gradio UI entrypoint

- ## 2. Dependencies & Environment

- ### 2.1 Installation

- ```bash
  git clone https://github.com/manpreet88/PolyFusionAgent.git
  cd PolyFusionAgent

- # Recommended: create a clean environment (conda or venv), then:
  pip install -r requirements.txt
- ```

- Recommended Python: 3.9–3.11 (keep your Python/PyTorch/CUDA versions consistent across machines for reproducibility).

- ### 2.2 Optional chemistry stack (recommended)

- Many core pipelines rely on RDKit for multimodal data acquisition (2D graph, 3D proxy, and fingerprint construction), canonicalization, validity checks, and visualization; RDKit is also used to render polymer depictions in the PolyAgent visualizer.

- ```bash
- conda install -c conda-forge rdkit
- ```

- ### 2.3 GPU acceleration (recommended for training and large runs)

- Training PolyFusion and running large-scale downstream experiments is significantly faster on a GPU.

- PyTorch + CUDA

- Install a CUDA-enabled PyTorch build that matches your NVIDIA driver / CUDA runtime.

- Verify GPU visibility:
  nvidia-smi
  python -c "import torch; print('cuda:', torch.cuda.is_available(), '| torch:', torch.__version__, '| cuda_ver:', torch.version.cuda)"

- torch-geometric (if enabled/required in your setup)
- If you use torch-geometric for GINE, install wheels that match your exact PyTorch + CUDA versions (follow the official PyG install instructions).

- ### 2.4 PolyAgent runtime (UI + retrieval)

- PolyAgent typically adds dependencies for:

- UI: gradio

- retrieval / vector DB: faiss (or another vector store, depending on your configuration)

- NLP utilities: transformers (reranking/models), tokenization helpers, etc.
- You will generally need:

  export OPENAI_API_KEY="YOUR_KEY"

- Optional environment variables (if supported in your config):

- OPENAI_MODEL (controller/composition model)

- HF_TOKEN (to pull HF-hosted artifacts)

  3. Data, Modalities, and Preprocessing
- ### 3.1 Input CSV schema
- At minimum, your dataset CSV should include a polymer string column:

- psmiles (required): polymer SMILES / PSMILES string (often contains [*] endpoints)

  Optional:

- source (optional): any identifier/source tag

- property columns (optional): e.g., ρ, Tg, Tm, Td, etc.

  Example:

- psmiles,source,density,glass transition,melting,thermal decomposition
- [*]CC(=O)OCCO[*],PI1M,1.21,55,155,350
- ...
- Wildcard handling: this code replaces * (atomic number 0) with Astatine (At, Z=85) internally for RDKit robustness, while preserving endpoint semantics.

- ### 3.2 Generate multimodal columns (graph/geometry/fingerprints)
- Use Data_Modalities.py to process a CSV and append JSON blobs for:

- graph

- geometry

- fingerprints

  python Data_Modalities.py \
- --csv_file /path/to/your/polymers.csv \
  --chunk_size 1000 \
  --num_workers 24
- Outputs:

- /path/to/your/polymers_processed.csv (same rows + new modality columns)

- /path/to/your/polymers_failures.jsonl (failures with index/smiles/error)

- ### 3.3 What graph, geometry, and fingerprints look like
- Each processed row stores modalities as JSON strings.

- graph contains:

- node_features: atomic_num, degree, formal_charge, hybridization, aromatic/ring flags, chirality, etc.

- edge_indices + edge_features (bond_type, stereo, conjugation, etc.)

- adjacency_matrix

- graph_features (MolWt, LogP, TPSA, rings, rotatable bonds, HBA/HBD, ...)

- geometry contains:

- ETKDG-generated conformers, optimized via MMFF/UFF (best energy chosen)

- best_conformer: atomic_numbers + coordinates + energy + optional 3D descriptors

- falls back to 2D coords if 3D fails

- fingerprints contains:

- Morgan fingerprints (bitstrings + counts) for radii up to 3 (default)

- e.g., morgan_r3_bits, morgan_r3_counts, plus smaller radii

- ## 4. Models & Artifacts
- This repo is organized so you can train and export artifacts for:

- PolyFusion (pretraining)
- multimodal CL checkpoint bundle (e.g., multimodal_output/best/...)

- unimodal encoder checkpoints (optional, used by some scripts)

- Downstream (best weights per property)
- saved best checkpoint per property (CV selection)

- directory example: multimodal_downstream_bestweights/...

- Inverse design generator artifacts
- decoder bundles + scalers + (optionally) SentencePiece tokenizer assets

- directory example: multimodal_inverse_design_output/.../best_models

- Important: Several scripts include placeholder paths at the top (e.g., /path/to/...). You must update them for your filesystem.

- ## 5. Running the Code
- ### 5.1 Multimodal contrastive pretraining (PolyFusion)
- Main entry:

- PolyFusion/CL.py

- What it does (high-level):

- Streams a large CSV (CSV_PATH) and writes per-sample .pt files to avoid RAM spikes.

- Encodes polymer modalities with DeBERTaV2 (PSMILES), GINE (2D), SchNet (3D), and a Transformer (fingerprints).

- Projects each modality embedding into a shared space.

- Trains with contrastive alignment (InfoNCE) + optional reconstruction objectives.

- Steps

- Edit the path placeholders in PolyFusion/CL.py, e.g.:

- CSV_PATH

- SPM_MODEL

- PREPROC_DIR

- OUTPUT_DIR and BEST_*_DIR locations (if used)

- Run:

- python PolyFusion/CL.py
- Tip: Start with a smaller TARGET_ROWS (e.g., 100k) to validate pipeline correctness before scaling.

- ### 5.2 Downstream property prediction
- Script:

- Downstream Tasks/Property_Prediction.py

- This script:

- loads your dataset CSV with modalities (e.g., polyinfo_with_modalities.csv)

- loads the pretrained encoders / CL fused backbone

- trains a fusion + regression head for each requested property

- evaluates using true K-fold CV (NUM_RUNS = 5) and saves the best weights

- Steps

- Update the placeholders near the top of the script:

- POLYINFO_PATH

- PRETRAINED_MULTIMODAL_DIR

- optional: BEST_*_DIR (if needed)

- output paths: OUTPUT_RESULTS, BEST_WEIGHTS_DIR

- Run:

- python "Downstream Tasks/Property_Prediction.py"
- Requested properties (default)

- REQUESTED_PROPERTIES = [
-     "density",
-     "glass transition",
-     "melting",
-     "specific volume",
-     "thermal decomposition"
- ]
- The script includes a robust column-matching function that tries to map these names to your dataframe’s actual column headers.

- ### 5.3 Inverse design / polymer generation
- Script:

- Downstream Tasks/Polymer_Generation.py

- Core idea:

- condition a SELFIES-TED-style decoder on PolyFusion embeddings,

- guide sampling toward target property values (with optional latent noise and verification)

- Steps

- Update the placeholders in the Config dataclass:

- POLYINFO_PATH

- pretrained weights directories (CL + downstream + tokenizer)

- output directory OUTPUT_DIR

  Run:

- python "Downstream Tasks/Polymer_Generation.py"
- Notes

- If RDKit and SELFIES are installed, the script can:

- validate chemistry constraints more robustly

- convert polymer endpoints safely (e.g., the [*] → [At] internal representation)

  5.4 PolyAgent (Gradio UI)
- Files:

- PolyAgent/orchestrator.py (core engine)

- PolyAgent/gradio_interface.py (UI)

- PolyAgent/rag_pipeline.py (local RAG utilities)

- What you configure
- In PolyAgent/orchestrator.py, update the PathsConfig placeholders, e.g.:

- cl_weights_path

- downstream_bestweights_5m_dir

- inverse_design_5m_dir

- spm_model_path, spm_vocab_path

- chroma_db_path (if using local RAG store)

- Environment variables

- OPENAI_API_KEY (required for planning/composition)

- Optional (improves retrieval coverage):

- OPENAI_MODEL (defaults set in config)

- HF_TOKEN (if pulling HF artifacts)

- SPRINGER_NATURE_API_KEY, SEMANTIC_SCHOLAR_API_KEY

- Run the UI

- cd PolyAgent
- python gradio_interface.py --server-name 0.0.0.0 --server-port 7860
- Prompting tips

- To trigger inverse design: include “generate” / “inverse design” and a target value:

- target_value=60 or Tg 60

- Provide a seed polymer pSMILES in a code block:

- [*]CC(=O)OCCOCCOC(=O)C[*]
- If you need more citations, ask explicitly:

- “cite 10 papers”

- ## 6. Results & Reproducibility
- PolyFusion is designed for scalable multimodal alignment across large polymer corpora.

- Downstream scripts perform K-fold evaluation per property and save best weights.

- PolyAgent produces evidence-linked answers with tool outputs and DOI-style links (when available).

- Reproducibility reminder: Several scripts currently use in-file configuration constants (placeholders). For a clean workflow, keep a consistent folder layout for datasets and checkpoints and update paths in one place (or refactor into a shared config module).

  7. Citation

  If you use this repository in your work, please cite the accompanying manuscript:

  @article{kaur2026polyfusionagent,
  title = {PolyFusionAgent: a multimodal foundation model and autonomous AI assistant for polymer informatics},
  author = {Kaur, Manpreet and Liu, Qian},
  year = {2026},
- note = {Manuscript / preprint},
  }
- Replace the BibTeX entry above with the final venue DOI/citation when available.
-
- ## 8. Contact
- Corresponding author: Qian Liu — qi.liu@uwinnipeg.ca
-
- Contributing author: Manpreet Kaur — kaur-m43@webmail.uwinnipeg.ca

- ## 9. License & Disclaimer
- License: (Add your license file here; e.g., MIT / Apache-2.0 / CC BY-NC for models)

- Disclaimer: This codebase is provided for research and development use. Polymer generation outputs and suggested candidates should be validated with domain expertise, safety constraints, and experimental verification before deployment.
 
+ ## Datasets
+
+ This repo is designed to work with large-scale pretraining corpora (for PolyFusion) plus experiment-backed downstream sets (for fine-tuning/evaluation). It does not redistribute these datasets; please download them from the original sources and follow their licenses/terms.
+
+ Pretraining corpora (examples used in the paper):
+
+ - PI1M: “PI1M: A Benchmark Database for Polymer Informatics.”
+   DOI page: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00726 (often mirrored/linked via PubMed)
+ - polyOne: “polyOne Data Set – 100 million hypothetical polymers …” (Zenodo record).
+   Zenodo: https://zenodo.org/records/7766806
+
+ Downstream / evaluation data (example):
+
+ - PoLyInfo (NIMS Polymer Database) provides experimental/literature polymer properties and metadata.
+   Main site: https://polymer.nims.go.jp/en/
+   Overview/help: https://polymer.nims.go.jp/PoLyInfo/guide/en/what_is_polyinfo.html
+
+ Tip: for reproducibility, document the export query, filtering rules, property units/conditions, and train/val/test splits in data/README.md.
+
+ ## 2. Dependencies & Environment
+
+ PolyFusionAgent spans three compute modes:
+
+ - Data preprocessing (RDKit-heavy; CPU-friendly but parallelizable)
+ - Model training/inference (PyTorch; GPU strongly recommended for PolyFusion pretraining)
+ - PolyAgent runtime (Gradio UI + retrieval stack; GPU optional but helpful for throughput)
+
+ ### 2.1 Supported platforms
+
+ - OS: Linux recommended (Ubuntu 20.04/22.04 most commonly tested in similar stacks); macOS/Windows work for lightweight inference but may require extra care for RDKit/FAISS.
+ - Python: 3.9–3.11 recommended (keep Python/PyTorch/CUDA versions consistent for reproducibility).
+ - GPU: NVIDIA recommended for training. Manuscript pretraining used mixed precision and ran on NVIDIA A100 GPUs.
+
+ ### 2.2 Installation (base)
+
  git clone https://github.com/manpreet88/PolyFusionAgent.git
  cd PolyFusionAgent

+ # Option A: venv
+ python -m venv .venv
+ source .venv/bin/activate
+ pip install --upgrade pip
+
+ # Option B: conda (recommended if you use RDKit/FAISS)
+ # conda create -n polyfusion python=3.10 -y
+ # conda activate polyfusion

  pip install -r requirements.txt

+ Tip (recommended): split installs by “extras” so users don’t pull GPU/RAG dependencies unless needed:
+
+ - requirements.txt → core + inference
+ - requirements-train.txt → training + distributed / acceleration
+ - requirements-agent.txt → gradio + retrieval + PDF tooling
+
+ (If you keep a single requirements file, clearly label optional dependencies as such.)
+
+ ### 2.3 Core ML stack (PolyFusion / downstream)
+
+ Required:
+
+ - torch (GPU build strongly recommended for training)
+ - numpy, pandas, scikit-learn (downstream regression uses standard scaling + CV; the manuscript uses 5-fold CV)
+ - transformers (PSMILES encoder + assorted NLP utilities)
+
+ Recommended:
+
+ - accelerate (multi-GPU / fp16 ergonomics)
+ - sentencepiece (PSMILES tokenization uses SentencePiece with a fixed 265-token vocab)
+ - tqdm, rich (logging)
+
+ GPU check:
  nvidia-smi
  python -c "import torch; print('cuda:', torch.cuda.is_available(), '| torch:', torch.__version__, '| cuda_ver:', torch.version.cuda)"

+ ### 2.4 Chemistry stack (strongly recommended)
+
+ A large fraction of the pipeline depends on RDKit:
+
+ - building graphs / fingerprints
+ - conformer generation
+ - canonicalization + validity checks
+ - PolyAgent visualization
+
+ Install RDKit via conda-forge:
+
+ conda install -c conda-forge rdkit -y
+
+ Wildcard endpoint handling (important): for RDKit-derived modalities, the pipeline converts polymer repeat units into a pseudo-molecule by replacing the repeat-unit wildcard attachment token [*] with [At] (Astatine) to ensure chemical sanitization and tool compatibility.
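
The endpoint swap itself is a plain token substitution; below is a minimal sketch in pure string handling (no RDKit needed for the swap itself; the real pipeline additionally sanitizes the result with RDKit, and the helper names here are illustrative, not repo functions):

```python
# Map the polymer wildcard endpoint [*] to an [At] proxy and back.
# The actual pipeline performs this swap before RDKit sanitization.
WILDCARD = "[*]"
PROXY = "[At]"  # Astatine (Z=85): a halogen essentially absent from real polymers

def to_rdkit_safe(psmiles: str) -> str:
    """Replace wildcard endpoints so RDKit can sanitize the repeat unit."""
    return psmiles.replace(WILDCARD, PROXY)

def from_rdkit_safe(smiles: str) -> str:
    """Restore wildcard endpoints after RDKit processing."""
    return smiles.replace(PROXY, WILDCARD)

print(to_rdkit_safe("[*]CC(=O)OCCO[*]"))  # [At]CC(=O)OCCO[At]
```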

+ ### 2.5 Graph / 3D stacks (optional, depending on your implementation)
+
+ - If your GINE implementation uses PyTorch Geometric, install the wheels that match your exact PyTorch + CUDA combination. PyG install instructions differ by CUDA version; pin your environment carefully.
+ - If you use SchNet via a third-party implementation, confirm the dependency (e.g., schnetpack, torchmd-net, or a local SchNet module). In the manuscript, SchNet uses a neighbor list with radial cutoff 10 Å and ≤64 neighbors/atom, with 6 interaction layers and hidden size 600.

+ ### 2.6 Retrieval stack (PolyAgent)
+
+ PolyAgent combines:
+
+ - Local RAG over PDFs (chunking + embeddings + vector index)
+ - Web augmentation (optional)
+ - Reranking (cross-encoder)
+
+ In the manuscript implementation, the local knowledge base is constructed from 1108 PDFs, chunked at 512/256/128 tokens with overlaps 64/48/32, embedded with OpenAI text-embedding-3-small (1536-d), and indexed using FAISS HNSW (M=64, efConstruction=200). Retrieved chunks are reranked with ms-marco-MiniLM-L-12-v2.
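
The multi-granularity chunking step can be sketched as a sliding token window (the integer list below stands in for tokenizer output such as tiktoken cl100k IDs; the sizes and overlaps are the manuscript's, the function name is illustrative):

```python
def chunk_tokens(tokens, size, overlap):
    """Slide a window of `size` tokens; neighboring chunks share `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Manuscript granularities: 512/256/128 tokens with overlaps 64/48/32.
doc = list(range(1000))  # stand-in for a tokenized PDF
chunks = {size: chunk_tokens(doc, size, ov) for size, ov in [(512, 64), (256, 48), (128, 32)]}
print({size: len(c) for size, c in chunks.items()})
```

Each granularity is indexed separately, so a query can match both broad context (512-token chunks) and precise passages (128-token chunks).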

+ Typical dependencies:
+
+ - gradio
+ - faiss-cpu (or faiss-gpu if desired)
+ - pypdf / pdfminer.six (PDF text extraction)
+ - tiktoken (token-based chunking; the manuscript references TikToken cl100k)
+ - trafilatura (web page extraction; used in manuscript web augmentation)
+ - transformers (reranker and query rewrite model; the manuscript uses T5 for query rewriting in web augmentation)

+ ### 2.7 Environment variables
+
+ PolyAgent is a tool-orchestrated system. At minimum, set:

  export OPENAI_API_KEY="YOUR_KEY"

+ Optional (if your configs support them):
+
+ export OPENAI_MODEL="gpt-4.1"   # controller model (the manuscript uses GPT-4.1)
+ export HF_TOKEN="YOUR_HF_TOKEN" # to pull hosted weights/tokenizers if applicable
+
+ Recommended .env pattern: create a .env file (do not commit it) and load it in the Gradio entrypoint:
+
+ OPENAI_API_KEY=...
+ OPENAI_MODEL=gpt-4.1
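
A minimal stdlib-only loader sketch for that pattern (python-dotenv is the usual production choice; `load_env` is a hypothetical helper, not a repo function):

```python
import os

def load_env(path: str = ".env") -> None:
    """Read KEY=VALUE lines from a .env file into os.environ (existing vars win)."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, malformed lines
            key, value = line.split("=", 1)
            os.environ.setdefault(key.strip(), value.strip())
```

Call it at the top of gradio_interface.py before the orchestrator reads its configuration.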

  3. Data, Modalities, and Preprocessing
+
+ ### 3.1 Datasets (what the manuscript uses)
+
+ - Pretraining uses PI1M + polyOne, at two scales: 2M and 5M polymers.
+ - Downstream fine-tuning / evaluation uses PoLyInfo (≈ 1.8×10⁴ experimental polymers).
+ - PoLyInfo is held out from pretraining.
+
+ Official dataset links are listed in the Datasets section above.

+ ### 3.2 Minimum CSV schema
+
+ Your raw CSV must include:
+
+ - psmiles (required) — polymer repeat unit string with [*] endpoints

  Optional:

+ - source: dataset tag (PI1M/polyOne/PoLyInfo/custom)
+ - property columns: e.g., density, Tg, Tm, Td (names can be mapped)

  Example:

+ psmiles,source,density,Tg,Tm,Td
+ [*]CC(=O)OCCO[*],PolyInfo,1.21,55,155,350
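
Because property column names vary across exports (e.g. “glass transition” vs “Tg”), a small normalizing matcher helps map requested names onto actual headers; this is an illustrative sketch, not the repo's actual column-matching function, and the alias table is an assumption:

```python
ALIASES = {  # hypothetical alias table; extend to match your export
    "tg": "glass transition",
    "tm": "melting",
    "td": "thermal decomposition",
}

def match_column(requested, columns):
    """Return the column matching `requested`, ignoring case/spacing, else None."""
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    want = norm(ALIASES.get(norm(requested), requested))
    for col in columns:
        if norm(col) == want or norm(ALIASES.get(norm(col), col)) == want:
            return col
    return None

print(match_column("Tg", ["psmiles", "glass transition", "density"]))  # glass transition
```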

+ Endpoint note: when generating RDKit-dependent modalities, the code may internally replace [*] with [At] to sanitize repeat-unit molecules.

+ ### 3.3 Modalities produced per polymer
+
+ PolyFusion represents each polymer using four complementary modalities:
+
+ - PSMILES sequences (D): SentencePiece tokenization with a fixed vocab size of 265 (kept fixed during downstream)
+ - 2D molecular graph (G): nodes = atoms, edges = bonds, with chemically meaningful node/edge features
+ - 3D conformational proxy (S): conformer embedding + optimization pipeline (ETKDG/UFF, described in Methods); SchNet neighbor cutoff and layer specs are given in the Supplementary
+ - Fingerprints (T): ECFP6 (radius r=3) with 2048 bits

+ ### 3.4 Preprocessing script
+
+ Use the preprocessing utility (Data_Modalities.py) to append multimodal columns:
+
  python Data_Modalities.py \
+   --csv_file /path/to/polymers.csv \
    --chunk_size 1000 \
    --num_workers 24
+
+ Expected outputs:
+
+ - *_processed.csv with new columns: graph, geometry, fingerprints (as JSON blobs)
+ - *_failures.jsonl for failed rows (index + error)
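
Each modality column is a JSON string, so downstream loaders simply json.loads it per row; a sketch with a synthetic one-row file (the graph schema shown is abbreviated for illustration):

```python
import csv
import io
import json

# Synthetic one-row *_processed.csv with a JSON `graph` column (schema abbreviated).
raw = io.StringIO(
    'psmiles,graph\n'
    '"[*]CC(=O)OCCO[*]","{""node_features"": [[6, 4]], ""edge_indices"": []}"\n'
)
for row in csv.DictReader(raw):
    graph = json.loads(row["graph"])  # JSON blob -> dict
    print(row["psmiles"], len(graph["node_features"]))
```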

+ ## 4. Models & Artifacts
+
+ This repository typically produces three artifact families:
+
+ ### 4.1 PolyFusion checkpoints (pretraining)
+
+ PolyFusion maps each modality into a shared embedding space of dimension d=600.
+ Pretraining uses:
+
+ - unified masking with pmask = 0.15 and an 80/10/10 corruption rule
+ - anchor–target contrastive learning, where the fused structural anchor is aligned to the fingerprint target (InfoNCE with τ = 0.07)
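
The InfoNCE objective above can be sketched numerically without any ML framework (real training uses batched tensors; this toy version takes a plain similarity matrix with positives on the diagonal):

```python
import math

def info_nce(sim, tau=0.07):
    """Mean InfoNCE loss for a square similarity matrix; positives on the diagonal."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        m = max(logits)  # subtract the max to stabilize the softmax
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the positive pair
    return sum(losses) / len(losses)

aligned = [[1.0, 0.0], [0.0, 1.0]]   # anchors match their own targets
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # positives have low similarity
print(info_nce(aligned) < info_nce(shuffled))  # True
```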

+ Store:
+
+ - encoder weights per modality
+ - projection heads
+ - training config + tokenizer artifacts (SentencePiece model)

+ ### 4.2 Downstream predictors (property regression)
+
+ Downstream uses:
+
+ - the fused 600-d embedding
+ - a lightweight regressor (2-layer MLP, hidden width 300, dropout 0.1)
+
+ Training protocol: 5-fold CV with an inner validation split (10%) and early stopping.
+
+ Save:
+
+ - best weights per property per fold
+ - scalers used for standardization
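
The 5-fold split can be sketched without sklearn (sklearn.model_selection.KFold is the usual production choice; `kfold_indices` is a hypothetical helper returning index lists only):

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold CV over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin assignment after shuffling
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in kfold_indices(10, k=5):
    print(len(train), len(val))  # 8 2 on each fold
```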

+ ### 4.3 Inverse design generator (SELFIES-TED conditioning)
+
+ Inverse design conditions a SELFIES-based encoder–decoder (SELFIES-TED) on PolyFusion’s 600-d embedding.
+ Implementation details from the manuscript include:
+
+ - conditioning via K=4 learned memory tokens
+ - training-time latent noise σ_train = 0.10
+ - decoding with top-p (0.92), temperature 1.0, repetition penalty 1.05, max length 256
+ - property targeting via generate-then-filter using a GP oracle and acceptance threshold τ_s = 0.5 (standardized units)
+
+ Save:
+
+ - decoder weights + conditioning projection
+ - tokenization assets (if applicable)
+ - property oracle artifacts (GP models / scalers)
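
Nucleus (top-p) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, then samples within it; an illustrative sketch with the manuscript's p value (actual decoding runs inside the Hugging Face generate loop, and the SELFIES tokens below are just examples):

```python
import random

def top_p_sample(probs, p=0.92, rng=random.Random(0)):
    """Sample one token from the nucleus: the top tokens covering mass >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        mass += pr
        if mass >= p:
            break
    r = rng.random() * mass  # renormalize by sampling within the kept mass
    acc = 0.0
    for tok, pr in nucleus:
        acc += pr
        if r <= acc:
            return tok
    return nucleus[-1][0]

probs = {"[C]": 0.6, "[O]": 0.25, "[Branch1]": 0.1, "[At]": 0.05}
print(top_p_sample(probs, p=0.5))  # always "[C]": it alone covers p
```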

+ ## 5. Running the Code
+
+ Several scripts may contain path placeholders. Centralize them into one config file (recommended) or update the constants in each entrypoint.
+
+ ### 5.1 Multimodal contrastive pretraining (PolyFusion)
+
+ Entrypoint: PolyFusion/CL.py
+
+ Manuscript-grounded defaults: AdamW, lr=1e-4, weight_decay=1e-2, batch=16, grad accum=4 (effective 64), up to 25 epochs, early stopping patience 10, FP16.
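
Those defaults fit naturally in one dataclass, which also documents them in code (field names here are illustrative, not CL.py's actual constants):

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    # Manuscript-grounded defaults for PolyFusion pretraining
    lr: float = 1e-4
    weight_decay: float = 1e-2
    batch_size: int = 16
    grad_accum: int = 4   # optimizer steps once every 4 batches
    max_epochs: int = 25
    patience: int = 10    # early-stopping patience (epochs)
    fp16: bool = True

    @property
    def effective_batch(self) -> int:
        return self.batch_size * self.grad_accum

print(PretrainConfig().effective_batch)  # 64
```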

+ Run:
+
+ python PolyFusion/CL.py
+
+ Sanity tip: start with a smaller subset (e.g., 50k–200k rows) to validate preprocessing + training stability before scaling to millions.

+ ### 5.2 Downstream property prediction
+
+ Entrypoint: Downstream Tasks/Property_Prediction.py
+
+ What it does:
+
+ - loads a modality-augmented CSV
+ - loads pretrained PolyFusion weights
+ - trains property heads with K-fold CV

  Run:

+ python "Downstream Tasks/Property_Prediction.py"

+ ### 5.3 Inverse design / polymer generation
+
+ Entrypoint: Downstream Tasks/Polymer_Generation.py
+
+ What it does:
+
+ - conditions SELFIES-TED on PolyFusion embeddings
+ - generates candidates and filters to the target using the manuscript-style oracle loop
+
+ Run:
+
+ python "Downstream Tasks/Polymer_Generation.py"

  5.4 PolyAgent (Gradio UI)

+ Core components:
+
+ - PolyAgent/orchestrator.py (controller + tool router)
+ - PolyAgent/rag_pipeline.py (local RAG)
+ - PolyAgent/gradio_interface.py (UI)
+
+ Manuscript controller: GPT-4.1 with planning temperature τ_plan=0.2.
+
+ Run:
+
+ cd PolyAgent
+ python gradio_interface.py --server-name 0.0.0.0 --server-port 7860

+ ## 6. Results & Reproducibility
+
+ ### 6.1 What “reproducible” means in this repo
+
+ To help others reproduce paper-level results:
+
+ - Pin versions: Python, PyTorch, CUDA, RDKit, FAISS, Transformers
+ - Fix seeds across Python/NumPy/Torch
+ - Log configs per run (JSON/YAML dumped beside checkpoints)
+ - Record dataset snapshots (hashes of CSVs and modality JSON columns)
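
Seed fixing and dataset hashing can be sketched as follows (numpy/torch seeding is guarded since those packages may be absent in lightweight installs; helper names are illustrative):

```python
import hashlib
import random

def set_seeds(seed: int = 42) -> None:
    """Fix RNG seeds; numpy/torch are seeded only if installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

def sha256_of_file(path: str, chunk: int = 1 << 20) -> str:
    """Streaming SHA-256 so multi-GB CSVs hash without loading into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Record sha256_of_file("polymers_processed.csv") beside each checkpoint so results can be tied to an exact dataset snapshot.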

+ ### 6.2 Manuscript training protocol highlights
+
+ - PolyFusion shared latent dimension: 600
+ - Unified corruption: pmask = 0.15, 80/10/10 rule
+ - Contrastive alignment: InfoNCE with τ = 0.07
+ - Pretraining optimization and schedule: AdamW, lr 1e-4, wd 1e-2, effective batch 64, FP16, early stopping
+ - PolyAgent retrieval index: 1108 PDFs; chunking and FAISS HNSW params as described above
 
  7. Citation

  If you use this repository in your work, please cite the accompanying manuscript:

  @article{kaur2026polyfusionagent,
  title = {PolyFusionAgent: a multimodal foundation model and autonomous AI assistant for polymer informatics},
  author = {Kaur, Manpreet and Liu, Qian},
  year = {2026},
+ note = {Manuscript / preprint}
  }
+
+ Dataset links:
+
+ - PI1M (JCIM): https://pubs.acs.org/doi/10.1021/acs.jcim.0c00726
+ - polyOne (Zenodo): https://zenodo.org/records/7766806
+ - PoLyInfo (NIMS): https://polymer.nims.go.jp/en/