Commit e33a144 (parent: 415578e) — manpreet88 committed: Update README.md
- [5.4 PolyAgent (Gradio UI)](#54-polyagent-gradio-ui)
- [6. Results & Reproducibility](#6-results--reproducibility)
- [7. Citation](#7-citation)

---

Main files:

- `PolyAgent/rag_pipeline.py` — local retrieval utilities (PDF → chunks → embeddings → vector store)
- `PolyAgent/gradio_interface.py` — Gradio UI entrypoint
## D. Datasets

This repo is designed to work with large-scale pretraining corpora (for PolyFusion) plus experiment-backed downstream sets (for finetuning/evaluation). It does not redistribute these datasets; please download them from the original sources and follow their licenses/terms.

Pretraining corpora (examples used in the paper):

- PI1M: “PI1M: A Benchmark Database for Polymer Informatics.”
  - DOI page: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00726 (often mirrored/linked via PubMed)
- polyOne: “polyOne Data Set – 100 million hypothetical polymers …” (Zenodo record).
  - Zenodo: https://zenodo.org/records/7766806

Downstream / evaluation data (example):

- PoLyInfo (NIMS Polymer Database) provides experimental/literature polymer properties and metadata.
  - Main site: https://polymer.nims.go.jp/en/
  - Overview/help: https://polymer.nims.go.jp/PoLyInfo/guide/en/what_is_polyinfo.html

Tip: for reproducibility, document the export query, filtering rules, property units/conditions, and train/val/test splits in `data/README.md`.
## 2. Dependencies & Environment

PolyFusionAgent spans three compute modes:

- Data preprocessing (RDKit-heavy; CPU-friendly but parallelizable)
- Model training/inference (PyTorch; GPU strongly recommended for PolyFusion pretraining)
- PolyAgent runtime (Gradio UI + retrieval stack; GPU optional but helpful for throughput)

### 2.1 Supported platforms

- OS: Linux recommended (Ubuntu 20.04/22.04 most commonly tested in similar stacks); macOS/Windows supported for lightweight inference but may require extra care for RDKit/FAISS.
- Python: 3.9–3.11 recommended (keep Python/PyTorch/CUDA versions consistent for reproducibility).
- GPU: NVIDIA recommended for training. Manuscript pretraining used mixed precision and ran on NVIDIA A100 GPUs.

### 2.2 Installation (base)

```bash
git clone https://github.com/manpreet88/PolyFusionAgent.git
cd PolyFusionAgent

# Option A: venv
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

# Option B: conda (recommended if you use RDKit/FAISS)
# conda create -n polyfusion python=3.10 -y
# conda activate polyfusion

pip install -r requirements.txt
```

Tip (recommended): split installs by “extras” so users don’t pull GPU/RAG dependencies unless needed:

- `requirements.txt` → core + inference
- `requirements-train.txt` → training + distributed / acceleration
- `requirements-agent.txt` → gradio + retrieval + PDF tooling

(If you keep a single requirements file, clearly label optional dependencies as such.)

### 2.3 Core ML stack (PolyFusion / downstream)

Required:

- `torch` (GPU build strongly recommended for training)
- `numpy`, `pandas`, `scikit-learn` (downstream regression uses standard scaling + CV; the manuscript uses 5-fold CV)
- `transformers` (PSMILES encoder + assorted NLP utilities)

Recommended:

- `accelerate` (multi-GPU / fp16 ergonomics)
- `sentencepiece` (PSMILES tokenization uses SentencePiece with a fixed 265-token vocab)
- `tqdm`, `rich` (logging)

GPU check:

```bash
nvidia-smi
python -c "import torch; print('cuda:', torch.cuda.is_available(), '| torch:', torch.__version__, '| cuda_ver:', torch.version.cuda)"
```
### 2.4 Chemistry stack (strongly recommended)

A large fraction of the pipeline depends on RDKit:

- building graphs / fingerprints
- conformer generation
- canonicalization + validity checks
- PolyAgent visualization

Install RDKit via conda-forge:

```bash
conda install -c conda-forge rdkit -y
```

Wildcard endpoint handling (important): for RDKit-derived modalities, the pipeline converts polymer repeat units into a pseudo-molecule by replacing the repeat-unit wildcard attachment token `[*]` with `[At]` (astatine) to ensure chemical sanitization and tool compatibility.
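The `[*]` → `[At]` substitution described above can be sketched as a small helper. This is a hedged illustration using plain string handling (the function name `sanitize_repeat_unit` is hypothetical, not the repo's actual API); real code would hand the result to RDKit for sanitization:

```python
# Hypothetical helper illustrating the [*] -> [At] endpoint substitution
# described above; the repo's actual implementation may differ.

def sanitize_repeat_unit(psmiles: str) -> str:
    """Replace repeat-unit wildcard endpoints [*] with [At] so the
    string parses as an ordinary (pseudo-)molecule."""
    n = psmiles.count("[*]")
    if n != 2:
        raise ValueError(f"expected exactly 2 [*] endpoints, got {n}")
    return psmiles.replace("[*]", "[At]")
```

For example, `sanitize_repeat_unit("[*]CC(=O)OCCO[*]")` yields a string RDKit can parse without wildcard atoms.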
### 2.5 Graph / 3D stacks (optional, depending on your implementation)

- If your GINE implementation uses PyTorch Geometric, install the wheels that match your exact PyTorch + CUDA combination. PyG install instructions differ by CUDA version; pin your environment carefully.
- If you use SchNet via a third-party implementation, confirm the dependency (e.g., `schnetpack`, `torchmd-net`, or a local SchNet module). In the manuscript, SchNet uses a neighbor list with a 10 Å radial cutoff and ≤64 neighbors/atom, with 6 interaction layers and hidden size 600.
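As a rough illustration of the neighbor-list construction mentioned above (10 Å cutoff, ≤64 neighbors per atom), here is a brute-force stdlib sketch; real SchNet implementations use optimized cell lists, and `neighbor_list` is a hypothetical name:

```python
# Hedged sketch of a SchNet-style neighbor list: radial cutoff 10 angstrom,
# at most 64 neighbors per atom. Brute-force O(N^2), for illustration only.
import math

def neighbor_list(coords, cutoff=10.0, max_neighbors=64):
    """Return, per atom, the indices of its nearest atoms within `cutoff`."""
    out = []
    for i, a in enumerate(coords):
        cand = []
        for j, b in enumerate(coords):
            if i == j:
                continue
            d = math.dist(a, b)
            if d <= cutoff:
                cand.append((d, j))
        cand.sort()
        out.append([j for _, j in cand[:max_neighbors]])
    return out
```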
### 2.6 Retrieval stack (PolyAgent)

PolyAgent combines:

- local RAG over PDFs (chunking + embeddings + vector index)
- web augmentation (optional)
- reranking (cross-encoder)

In the manuscript implementation, the local knowledge base is constructed from 1108 PDFs, chunked at 512/256/128 tokens with overlaps of 64/48/32, embedded with OpenAI `text-embedding-3-small` (1536-d), and indexed with FAISS HNSW (M=64, efConstruction=200). Retrieved chunks are reranked with `ms-marco-MiniLM-L-12-v2`.

Typical dependencies:

- `gradio`
- `faiss-cpu` (or `faiss-gpu` if desired)
- `pypdf` / `pdfminer.six` (PDF text extraction)
- `tiktoken` (token-based chunking; the manuscript references the TikToken `cl100k` encoding)
- `trafilatura` (web page extraction; used in the manuscript’s web augmentation)
- `transformers` (reranker and query-rewrite model; the manuscript uses T5 for rewriting in web augmentation)
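The multi-granularity chunking scheme above (512/256/128-token windows with 64/48/32-token overlaps) can be sketched as follows. This is an assumption-laden illustration on plain token lists, not the repo's actual chunker:

```python
# Sketch of multi-granularity chunking with overlap, mirroring the
# 512/256/128-token sizes and 64/48/32-token overlaps described above.
# Token IDs are stand-ins; a real pipeline would tokenize PDF text first.

def chunk_tokens(tokens, size, overlap):
    """Yield fixed-size windows with the given overlap (stride = size - overlap)."""
    stride = size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunk = tokens[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(tokens):
            break
    return chunks

GRANULARITIES = [(512, 64), (256, 48), (128, 32)]

def multi_granularity_chunks(tokens):
    """One chunk list per (size, overlap) granularity, keyed by chunk size."""
    return {size: chunk_tokens(tokens, size, ov) for size, ov in GRANULARITIES}
```

Consecutive windows share exactly `overlap` tokens, so no sentence boundary is lost between chunks.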
### 2.7 Environment variables

PolyAgent is a tool-orchestrated system. At minimum, set:

```bash
export OPENAI_API_KEY="YOUR_KEY"
```

Optional (if your configs support them):

```bash
export OPENAI_MODEL="gpt-4.1"    # controller model (manuscript uses GPT-4.1)
export HF_TOKEN="YOUR_HF_TOKEN"  # to pull hosted weights/tokenizers if applicable
```

Recommended `.env` pattern: create a `.env` file (do not commit it) and load it in the Gradio entrypoint:

```
OPENAI_API_KEY=...
OPENAI_MODEL=gpt-4.1
```
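A minimal stdlib sketch of the `.env` loading pattern above (in practice you might prefer the `python-dotenv` package; `load_dotenv` here is a local, hypothetical helper):

```python
# Minimal .env loader (stdlib only) matching the pattern above; a hedged
# sketch -- for production use, the python-dotenv package is more robust.
import os

def load_dotenv(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines, skip blanks/comments, export into os.environ."""
    loaded = {}
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                loaded[key.strip()] = value.strip().strip('"')
    except FileNotFoundError:
        return loaded
    os.environ.update(loaded)
    return loaded
```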
## 3. Data, Modalities, and Preprocessing

### 3.1 Datasets (what the manuscript uses)

- Pretraining uses PI1M + polyOne, at two scales: 2M and 5M polymers.
- Downstream fine-tuning / evaluation uses PoLyInfo (≈ 1.8×10⁴ experimental polymers).
- PoLyInfo is held out from pretraining.

Where are the links? The uploaded manuscript describes these datasets but does not include canonical URLs in the excerpted sections available here. Add the official dataset links to this README once you finalize where you host or reference them.
### 3.2 Minimum CSV schema

Your raw CSV must include:

- `psmiles` (required) — polymer repeat-unit string with `[*]` endpoints

Optional:

- `source` — dataset tag (PI1M/polyOne/PoLyInfo/custom)
- property columns — e.g., density, Tg, Tm, Td (names can be mapped)

Example:

```
psmiles,source,density,Tg,Tm,Td
[*]CC(=O)OCCO[*],PolyInfo,1.21,55,155,350
```

Endpoint note: when generating RDKit-dependent modalities, the code may internally replace `[*]` with `[At]` to sanitize repeat-unit molecules.
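A hedged sketch of validating the minimum schema above with the standard library; `check_schema` is a hypothetical helper, and the repo may enforce the schema differently:

```python
# Hedged sketch: validate the minimum CSV schema described above using
# only the standard library. Column names follow the README's example.
import csv
import io

REQUIRED = {"psmiles"}

def check_schema(csv_text: str) -> list:
    """Return rows as dicts, raising if the required column is missing
    or a psmiles entry lacks its two [*] endpoints."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    rows = []
    for row in reader:
        if row["psmiles"].count("[*]") != 2:
            raise ValueError(f"bad endpoints in: {row['psmiles']}")
        rows.append(row)
    return rows
```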
### 3.3 Modalities produced per polymer

PolyFusion represents each polymer using four complementary modalities:

- PSMILES sequences (D): SentencePiece tokenization with a fixed vocab size of 265 (kept fixed during downstream training)
- 2D molecular graph (G): nodes = atoms, edges = bonds, with chemically meaningful node/edge features
- 3D conformational proxy (S): conformer embedding + optimization pipeline (ETKDG/UFF, described in Methods); SchNet neighbor cutoff and layer specs are given in the Supplementary
- Fingerprints (T): ECFP6 (radius r=3) with 2048 bits
### 3.4 Preprocessing script

Use your preprocessing utility (e.g., `Data_Modalities.py`) to append multimodal columns:

```bash
python Data_Modalities.py \
  --csv_file /path/to/polymers.csv \
  --chunk_size 1000 \
  --num_workers 24
```

Expected outputs:

- `*_processed.csv` with new columns `graph`, `geometry`, `fingerprints` (as JSON blobs)
- `*_failures.jsonl` for failed rows (index + error)
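The JSON-blob convention above might look like the following sketch. The field layout (`nodes`, `xyz`, `bits`) is purely illustrative; the repo's actual serialization format may differ:

```python
# Hedged sketch of the JSON-blob storage convention for modality columns.
# The field layout here is illustrative, not the repo's exact format.
import json

def pack_modalities(graph_nodes, geometry_xyz, fingerprint_bits):
    """Serialize modality payloads to JSON strings suitable for CSV columns."""
    return {
        "graph": json.dumps({"nodes": graph_nodes}),
        "geometry": json.dumps({"xyz": geometry_xyz}),
        "fingerprints": json.dumps({"bits": fingerprint_bits}),
    }
```

Each value is a plain string, so the processed CSV round-trips through any CSV reader, and `json.loads` recovers the structure per row.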
## 4. Models & Artifacts

This repository typically produces three artifact families.

### 4.1 PolyFusion checkpoints (pretraining)

PolyFusion maps each modality into a shared embedding space of dimension d=600. Pretraining uses:

- unified masking with p_mask = 0.15 and an 80/10/10 corruption rule
- anchor–target contrastive learning, in which the fused structural anchor is aligned to the fingerprint target (InfoNCE with τ = 0.07)

Store:

- encoder weights per modality
- projection heads
- training config + tokenizer artifacts (SentencePiece model)
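For intuition, the InfoNCE objective above (anchor aligned to fingerprint target, τ = 0.07) can be written out in a toy, stdlib-only form; real training operates on batched tensors, and these helper names are hypothetical:

```python
# Hedged sketch of InfoNCE with temperature tau = 0.07, as used to align
# fused structural anchors to fingerprint targets. Pure stdlib, toy-sized.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def info_nce(anchors, targets, tau=0.07):
    """Mean cross-entropy of matching anchor i to target i among all targets."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, t) / tau for t in targets]
        m = max(logits)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

Lower loss means each anchor's matching target dominates the similarity distribution at the given temperature.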
### 4.2 Downstream predictors (property regression)

Downstream uses:

- the fused 600-d embedding
- a lightweight regressor (2-layer MLP, hidden width 300, dropout 0.1)

Training protocol: 5-fold CV with an inner validation split (10%) and early stopping.

Save:

- best weights per property per fold
- the scalers used for standardization
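The 5-fold CV protocol with an inner 10% validation split can be sketched as index bookkeeping; `five_fold_splits` is a hypothetical helper, shown only to make the split structure concrete:

```python
# Hedged sketch of the 5-fold CV index bookkeeping with a 10% inner
# validation split carved out of each training fold (for early stopping).
import random

def five_fold_splits(n, seed=0, inner_val_frac=0.10, k=5):
    """Yield (train, inner_val, test) index lists for each of k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        n_val = max(1, int(len(train) * inner_val_frac))
        yield train[n_val:], train[:n_val], test
```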
### 4.3 Inverse design generator (SELFIES-TED conditioning)

Inverse design conditions a SELFIES-based encoder–decoder (SELFIES-TED) on PolyFusion’s 600-d embedding. Implementation details from the manuscript include:

- conditioning via K=4 learned memory tokens
- training-time latent noise σ_train = 0.10
- decoding with top-p (0.92), temperature 1.0, repetition penalty 1.05, max length 256
- property targeting via generate-then-filter, using a GP oracle and acceptance threshold τ_s = 0.5 (standardized units)

Save:

- decoder weights + conditioning projection
- tokenization assets (if applicable)
- property oracle artifacts (GP models / scalers)
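For reference, nucleus (top-p) filtering at p = 0.92, the decoding setting listed above, reduces to keeping the smallest high-probability token set whose cumulative mass reaches p. A toy sketch on a plain probability list (not the Hugging Face `transformers` implementation):

```python
# Hedged sketch of nucleus (top-p) filtering at p = 0.92, the decoding
# setting listed above. Operates on a plain probability list.
def top_p_filter(probs, p=0.92):
    """Keep the smallest set of tokens whose cumulative mass >= p,
    then renormalize. Returns (kept_indices, renormalized_probs)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    renorm = {i: probs[i] / mass for i in kept}
    return kept, renorm
```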
## 5. Running the Code

Several scripts may contain path placeholders. Centralize them into one config file (recommended) or update the constants in each entrypoint.

### 5.1 Multimodal contrastive pretraining (PolyFusion)

Entrypoint: `PolyFusion/CL.py`

Manuscript-grounded defaults: AdamW, lr=1e-4, weight_decay=1e-2, batch=16, grad accum=4 (effective 64), up to 25 epochs, early-stopping patience 10, FP16.

Run:

```bash
python PolyFusion/CL.py
```

Sanity tip: start with a smaller subset (e.g., 50k–200k rows) to validate preprocessing + training stability before scaling to millions.

### 5.2 Downstream property prediction

Entrypoint: `Downstream Tasks/Property_Prediction.py`

What it does:

- loads a modality-augmented CSV
- loads pretrained PolyFusion weights
- trains property heads with K-fold CV

Run:

```bash
python "Downstream Tasks/Property_Prediction.py"
```

### 5.3 Inverse design / polymer generation

Entrypoint: `Downstream Tasks/Polymer_Generation.py`

What it does:

- conditions SELFIES-TED on PolyFusion embeddings
- generates candidates and filters to the target using the manuscript-style oracle loop

Run:

```bash
python "Downstream Tasks/Polymer_Generation.py"
```

### 5.4 PolyAgent (Gradio UI)

Core components:

- `PolyAgent/orchestrator.py` (controller + tool router)
- `PolyAgent/rag_pipeline.py` (local RAG)
- `PolyAgent/gradio_interface.py` (UI)

Manuscript controller: GPT-4.1 with planning temperature τ_plan = 0.2.

Run:

```bash
cd PolyAgent
python gradio_interface.py --server-name 0.0.0.0 --server-port 7860
```
## 6. Results & Reproducibility

### 6.1 What “reproducible” means in this repo

To help others reproduce paper-level results:

- Pin versions: Python, PyTorch, CUDA, RDKit, FAISS, Transformers
- Fix seeds across Python/NumPy/Torch
- Log configs per run (JSON/YAML dumped beside checkpoints)
- Record dataset snapshots (hashes of CSVs and modality JSON columns)

### 6.2 Manuscript training protocol highlights

- PolyFusion shared latent dimension: 600
- Unified corruption: p_mask = 0.15, 80/10/10 rule
- Contrastive alignment: InfoNCE with τ = 0.07
- Pretraining optimization and schedule: AdamW, lr 1e-4, wd 1e-2, effective batch 64, FP16, early stopping
- PolyAgent retrieval index: 1108 PDFs; chunking and FAISS HNSW params as described above
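The checklist above can be supported with a few stdlib utilities: fixed seeds, a per-run config dump, and content hashes for dataset snapshots. A hedged sketch (helper names are illustrative):

```python
# Hedged sketch of the reproducibility checklist above: fixed seeds, a
# per-run config dump, and content hashes for dataset snapshots.
import hashlib
import json
import random

def set_seeds(seed=42):
    random.seed(seed)
    # also: numpy.random.seed(seed); torch.manual_seed(seed) when available

def dump_config(config: dict, path: str):
    """Write the run config beside the checkpoint, sorted for stable diffs."""
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2, sort_keys=True)

def file_sha256(path: str) -> str:
    """Content hash for a dataset snapshot (CSV or JSON-column dump)."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()
```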
## 7. Citation

If you use this repository in your work, please cite the accompanying manuscript:

```bibtex
@article{kaur2026polyfusionagent,
  title  = {PolyFusionAgent: a multimodal foundation model and autonomous AI assistant for polymer informatics},
  author = {Kaur, Manpreet and Liu, Qian},
  year   = {2026},
  note   = {Manuscript / preprint}
}
```

Dataset links:

- PI1M (JCIM): https://pubs.acs.org/doi/10.1021/acs.jcim.0c00726
- polyOne (Zenodo): https://zenodo.org/records/7766806
- PoLyInfo (NIMS): https://polymer.nims.go.jp/en/