Commit f17aa94 by manpreet88 (1 parent: 8a22245)

Create rag_pipeline.py

PolyAgent/rag_pipeline.py (added, +1319 lines)
from __future__ import annotations

import os
import re
import time
import json
import hashlib
import pathlib
import tempfile
from typing import List, Optional, Dict, Any, Union
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict

import requests
from tqdm import tqdm

# --------------------------------------------------------------------------------------
# Vector store, loaders, splitters
# --------------------------------------------------------------------------------------
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

# --------------------------------------------------------------------------------------
# OpenAI embeddings
# --------------------------------------------------------------------------------------
from langchain_openai import OpenAIEmbeddings

# --------------------------------------------------------------------------------------
# Tokenizer for true token-based multi-scale segmentation
# --------------------------------------------------------------------------------------
import tiktoken


def sanitize_text(text: str) -> str:
    """
    Remove surrogate pairs and invalid Unicode characters.
    Prevents UnicodeEncodeError when adding documents to ChromaDB.
    """
    if not text:
        return text
    # Drop surrogates and invalid chars via a UTF-8 round trip
    return text.encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")

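# Illustrative check (an assumption added for clarity, not part of the original
# pipeline): the encode/decode round trip above silently drops a lone surrogate
# instead of raising UnicodeEncodeError.
if __name__ == "__main__":
    _cleaned = "poly\ud800mer".encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")
    assert _cleaned == "polymer"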

# --------------------------------------------------------------------------------------
# ARXIV, OPENALEX, EPMC API URLS
# --------------------------------------------------------------------------------------
ARXIV_SEARCH_URL = "http://export.arxiv.org/api/query"
OPENALEX_WORKS_URL = "https://api.openalex.org/works"
EPMC_SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

DEFAULT_PERSIST_DIR = "chroma_polymer_db"
DEFAULT_TMP_DOWNLOAD_DIR = os.path.join(tempfile.gettempdir(), "polymer_rag_pdfs")
MANIFEST_NAME = "manifest.jsonl"

# --------------------------------------------------------------------------------------
# Balanced target distribution (total ~2000 PDFs)
# --------------------------------------------------------------------------------------
TARGET_CURATED = 100
TARGET_JOURNALS = 200
TARGET_ARXIV = 800
TARGET_OPENALEX = 600
TARGET_EPMC = 200
TARGET_DATABASES = 100

# --------------------------------------------------------------------------------------
# Polymer keywords (expandable)
# --------------------------------------------------------------------------------------
POLYMER_KEYWORDS = [
    "polymer",
    "macromolecule",
    "macromolecular",
    "polymeric",
    "polymer informatics",
    "polymer chemistry",
    "polymer physics",
    "PSMILES",
    "pSMILES",
    "BigSMILES",
    "polymer SMILES",
    "polymer sequence",
    "polymer electrolyte",
    "polymer morphology",
    "polymer dielectric",
    "polymer electrolyte membrane",
    "block copolymer",
    "biopolymer",
    "polymer nanocomposite",
    "polymer foundation model",
    "self-supervised polymer",
    "masked language model polymer",
    "polymer transformer",
    "generative polymer",
    "copolymer",
    "polymerization",
    "polymer synthesis",
    "polymer characterization",
]

# --------------------------------------------------------------------------------------
# IUPAC Guidelines & Standards (polymer nomenclature and terminology standards)
# --------------------------------------------------------------------------------------
CURATED_IUPAC_STANDARDS: List[Dict[str, Any]] = [
    {
        "url": "https://iupac.org/wp-content/uploads/2019/07/140-Brief-Guide-to-Polymer-Nomenclature-Web-Final-d.pdf",
        "name": "IUPAC - Brief Guide to Polymer Nomenclature",
        "meta": {
            "title": "A Brief Guide to Polymer Nomenclature (IUPAC Technical Report)",
            "year": "2012",
            "venue": "IUPAC Pure and Applied Chemistry",
            "source": "curated_iupac_standard",
        },
    },
    {
        "url": "https://rseq.org/wp-content/uploads/2022/10/20220816-English-BriefGuidePolymerTerminology-IUPAC.pdf",
        "name": "IUPAC - Brief Guide to Polymerization Terminology",
        "meta": {
            "title": "A Brief Guide to Polymerization Terminology (IUPAC Recommendations)",
            "year": "2022",
            "venue": "IUPAC",
            "source": "curated_iupac_standard",
        },
    },
    {
        "url": "https://www.rsc.org/images/richard-jones-naming-polymers_tcm18-243646.pdf",
        "name": "RSC - Naming Polymers",
        "meta": {
            "title": "Naming Polymers (RSC Educational Resource)",
            "year": "2020",
            "venue": "Royal Society of Chemistry",
            "source": "curated_iupac_standard",
        },
    },
]

# --------------------------------------------------------------------------------------
# ISO/ASTM Standards (polymer testing and characterization standards)
# --------------------------------------------------------------------------------------
CURATED_ISO_ASTM_STANDARDS: List[Dict[str, Any]] = [
    {
        "url": "https://cdn.standards.iteh.ai/samples/76910/29c8e7af07bd4188b297c39684ada79e/ISO-ASTM-52925-2022.pdf",
        "name": "ISO/ASTM 52925:2022 - Additive Manufacturing Polymers",
        "meta": {
            "title": "ISO/ASTM 52925:2022 Additive manufacturing of polymers - Feedstock materials",
            "year": "2022",
            "venue": "ISO/ASTM",
            "source": "curated_iso_astm_standard",
        },
    },
    {
        "url": "https://cdn.standards.iteh.ai/samples/76909/b9883b2f204248aca175e2f574bd879c/ISO-ASTM-52924-2023.pdf",
        "name": "ISO/ASTM 52924:2023 - Additive Manufacturing Qualification",
        "meta": {
            "title": "ISO/ASTM 52924:2023 Additive manufacturing of polymers - Qualification principles",
            "year": "2023",
            "venue": "ISO/ASTM",
            "source": "curated_iso_astm_standard",
        },
    },
    {
        "url": "https://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8059.pdf",
        "name": "NIST IR 8059 - Materials Testing Standards for Additive Manufacturing",
        "meta": {
            "title": "Materials Testing Standards for Additive Manufacturing of Polymer Materials",
            "year": "2015",
            "venue": "NIST",
            "source": "curated_iso_astm_standard",
        },
    },
]

# --------------------------------------------------------------------------------------
# Foundational polymer informatics papers
# --------------------------------------------------------------------------------------
CURATED_POLYMER_INFORMATICS: List[Dict[str, Any]] = [
    {
        "url": "https://ramprasad.mse.gatech.edu/wp-content/uploads/2021/01/polymer-informatics.pdf",
        "name": "Polymer Informatics - Current Status and Critical Next Steps",
        "meta": {
            "title": "Polymer informatics: Current status and critical next steps",
            "year": "2020",
            "venue": "Materials Science and Engineering: R",
            "source": "curated_review_informatics",
        },
    },
    {
        "url": "https://arxiv.org/pdf/2011.00508.pdf",
        "name": "Polymer Informatics - Current Status (arXiv)",
        "meta": {
            "title": "Polymer Informatics: Current Status and Critical Next Steps",
            "year": "2020",
            "venue": "arXiv:2011.00508",
            "source": "curated_review_informatics",
        },
    },
]

# --------------------------------------------------------------------------------------
# BigSMILES notation papers (polymer representation standards)
# --------------------------------------------------------------------------------------
CURATED_BIGSMILES: List[Dict[str, Any]] = [
    {
        "url": "https://pubs.acs.org/doi/pdf/10.1021/acscentsci.9b00476",
        "name": "BigSMILES - Structurally-Based Line Notation",
        "meta": {
            "title": "BigSMILES: A Structurally-Based Line Notation for Describing Macromolecules",
            "year": "2019",
            "venue": "ACS Central Science",
            "source": "curated_bigsmiles",
        },
    },
    {
        "url": "https://www.rsc.org/suppdata/d3/dd/d3dd00147d/d3dd00147d1.pdf",
        "name": "Generative BigSMILES - Supplementary Information",
        "meta": {
            "title": "Generative BigSMILES: an extension for polymer informatics (SI)",
            "year": "2024",
            "venue": "RSC Digital Discovery",
            "source": "curated_bigsmiles",
        },
    },
]

# --------------------------------------------------------------------------------------
# Combine all curated sources
# --------------------------------------------------------------------------------------
CURATED_POLYMER_PDF_SOURCES = (
    CURATED_IUPAC_STANDARDS
    + CURATED_ISO_ASTM_STANDARDS
    + CURATED_POLYMER_INFORMATICS
    + CURATED_BIGSMILES
)

# --------------------------------------------------------------------------------------
# Major polymer journals with OA content
# --------------------------------------------------------------------------------------
POLYMER_JOURNAL_QUERIES = [
    # ACS journals
    {"journal": "Macromolecules", "issn": "0024-9297", "publisher": "ACS"},
    {"journal": "ACS Polymers Au", "issn": "2768-1939", "publisher": "ACS"},
    {"journal": "ACS Applied Polymer Materials", "issn": "2637-6105", "publisher": "ACS"},
    {"journal": "Biomacromolecules", "issn": "1525-7797", "publisher": "ACS"},
    {"journal": "ACS Macro Letters", "issn": "2161-1653", "publisher": "ACS"},
    # RSC journals
    {"journal": "Polymer Chemistry", "issn": "1759-9954", "publisher": "RSC"},
    {"journal": "RSC Applied Polymers", "issn": "2755-0656", "publisher": "RSC"},
    {"journal": "Soft Matter", "issn": "1744-683X", "publisher": "RSC"},
    # Springer/Nature and Wiley journals
    {"journal": "Polymer Journal", "issn": "0032-3896", "publisher": "Nature"},
    {"journal": "Journal of Polymer Science", "issn": "2642-4169", "publisher": "Wiley"},
    # Additional OA journals
    {"journal": "Polymer Science and Technology", "issn": "2837-0341", "publisher": "ACS"},
    {"journal": "Polymers", "issn": "2073-4360", "publisher": "MDPI"},
]

DEFAULT_MAILTO = "kaur-m43@webmail.uwinnipeg.ca"  # polite default contact for API User-Agent strings

# --------------------------------------------------------------------------------------
# DEDUPLICATION, DOWNLOAD, MANIFEST HELPERS
# --------------------------------------------------------------------------------------
def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def safe_filename(name: str) -> str:
    name = str(name or "").strip().replace("/", "_").replace("\\", "_")
    name = re.sub(r"[^a-zA-Z0-9._\-]", "_", name)
    return name[:200]


def is_probably_pdf(raw: bytes, content_type: str) -> bool:
    if not raw:
        return False
    if raw[:4] == b"%PDF":
        return True
    return "pdf" in (content_type or "").lower()


def ensure_dir(path: str) -> None:
    os.makedirs(path, exist_ok=True)


def append_manifest(out_dir: str, record: Dict[str, Any]) -> None:
    try:
        ensure_dir(out_dir)
        with open(os.path.join(out_dir, MANIFEST_NAME), "a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    except Exception:
        pass


def load_manifest(out_dir: str) -> Dict[str, Dict[str, Any]]:
    data: Dict[str, Dict[str, Any]] = {}
    try:
        mpath = os.path.join(out_dir, MANIFEST_NAME)
        if not os.path.exists(mpath):
            return data
        with open(mpath, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                    p = rec.get("path")
                    sha = rec.get("sha256")
                    if p:
                        data[p] = rec
                    if sha:
                        data[sha] = rec
                except Exception:
                    continue
    except Exception:
        pass
    return data

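# Quick illustration (added for clarity; it mirrors the helpers above rather
# than calling them, so it stands alone): safe_filename() replaces characters
# that are unsafe on common filesystems with "_", and downloaded files are
# prefixed with the first 16 hex chars of their SHA-256 content hash.
if __name__ == "__main__":
    _name = re.sub(r"[^a-zA-Z0-9._\-]", "_", "a/b: c.pdf".replace("/", "_").replace("\\", "_"))
    assert _name == "a_b__c.pdf"  # "/" , ":" and " " all become "_"
    assert len(hashlib.sha256(b"%PDF-1.4").hexdigest()[:16]) == 16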

# --------------------------------------------------------------------------------------
# DOWNLOAD SINGLE PDF
# --------------------------------------------------------------------------------------
def download_pdf(
    url: str,
    out_dir: str,
    suggested_name: Optional[str] = None,
    timeout: int = 60,
    meta: Optional[Dict[str, Any]] = None,
    manifest: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Optional[str]:
    """
    Download a PDF and return the local file path, or None on failure.
    Deduplicates by SHA256 content hash.
    Writes a manifest record if meta is provided.
    """
    try:
        headers = {"User-Agent": f"polymer-rag/1.0 ({DEFAULT_MAILTO})"}
        with requests.get(
            url, headers=headers, timeout=timeout, stream=True, allow_redirects=True
        ) as r:
            r.raise_for_status()
            content_type = r.headers.get("Content-Type", "")
            raw = r.content
        if not raw or not is_probably_pdf(raw, content_type):
            return None

        sha = sha256_bytes(raw)
        ensure_dir(out_dir)

        # Check manifest for existing SHA
        if manifest and sha in manifest:
            existing_path = manifest[sha].get("path")
            if existing_path and os.path.exists(existing_path):
                return existing_path

        # Check filesystem for existing files with this hash
        existing = list(pathlib.Path(out_dir).glob(f"{sha[:16]}*.pdf"))
        if existing:
            path = str(existing[0])
            if meta:
                rec = dict(meta)
                rec.update({"sha256": sha, "path": path})
                append_manifest(out_dir, rec)
            return path

        base = suggested_name or pathlib.Path(url).name or "paper.pdf"
        base = safe_filename(base)
        if not base.lower().endswith(".pdf"):
            base += ".pdf"
        fname = f"{sha[:16]}_{base}"
        fpath = os.path.join(out_dir, fname)

        with open(fpath, "wb") as f:
            f.write(raw)

        if meta:
            rec = dict(meta)
            rec.update({"sha256": sha, "path": fpath})
            append_manifest(out_dir, rec)

        return fpath
    except Exception:
        return None

def retry(fn, args, retries=3, sleep=0.6, **kwargs):
    for i in range(retries):
        out = fn(*args, **kwargs)
        if out:
            return out
        time.sleep(sleep * (2 ** i))  # exponential backoff
    return None


def download_one(entry: Union[str, Dict[str, Any]], out_dir: str, manifest: Dict):
    if isinstance(entry, dict):
        return download_pdf(
            entry["url"],
            out_dir,
            suggested_name=entry.get("name"),
            meta=entry.get("meta"),
            manifest=manifest,
        )
    return download_pdf(entry, out_dir, manifest=manifest)


def parallel_download_pdfs(
    entries: List[Union[str, Dict[str, Any]]],
    out_dir: str,
    manifest: Dict[str, Dict[str, Any]],
    max_workers: int = 12,
    desc: str = "Downloading PDFs",
) -> List[str]:
    ensure_dir(out_dir)
    results: List[str] = []
    if not entries:
        return results
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futs = [ex.submit(retry, download_one, (e, out_dir, manifest)) for e in entries]
        for f in tqdm(as_completed(futs), total=len(futs), desc=desc):
            p = f.result()
            if p:
                results.append(p)
    return results


# --------------------------------------------------------------------------------------
# ARXIV
# --------------------------------------------------------------------------------------
def arxiv_query_from_keywords(keywords: List[str]) -> str:
    kw = [k.replace(" ", "+") for k in keywords]
    terms = " OR ".join([f"ti:{k}" for k in kw] + [f"abs:{k}" for k in kw])
    cats = (
        "cat:cond-mat.mtrl-sci OR cat:cond-mat.soft OR cat:physics.chem-ph OR cat:cs.LG OR cat:stat.ML"
    )
    return f"({terms}) AND ({cats})"


def fetch_arxiv_pdf_urls(keywords: List[str], max_results: int = 800) -> List[str]:
    """
    Extract explicit PDF links from the Atom feed, falling back to building
    them from <id> entries.
    """
    query = arxiv_query_from_keywords(keywords)
    params = {
        "search_query": query,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    headers = {"User-Agent": f"polymer-rag/1.0 ({DEFAULT_MAILTO})"}
    try:
        resp = requests.get(ARXIV_SEARCH_URL, params=params, headers=headers, timeout=60)
        resp.raise_for_status()
        xml = resp.text
    except Exception:
        return []

    pdfs: List[str] = []
    seen = set()

    # explicit pdf hrefs
    for p in re.findall(r'href="(https?://arxiv\.org/pdf[^"]*)"', xml):
        if p not in seen:
            pdfs.append(p)
            seen.add(p)

    # fallback: build from id entries
    for aid in re.findall(r'<id>(https?://arxiv\.org/abs[^<]*)</id>', xml):
        m = re.search(r"arxiv\.org/abs/([^?v]+)", aid)
        if m:
            identifier = m.group(1)
            pdf = f"https://arxiv.org/pdf/{identifier}.pdf"
            if pdf not in seen:
                pdfs.append(pdf)
                seen.add(pdf)

    return pdfs


def fetch_arxiv_pdfs(
    keywords: List[str],
    out_dir: str,
    manifest: Dict[str, Dict[str, Any]],
    max_results: int = 800,
) -> List[str]:
    urls = fetch_arxiv_pdf_urls(keywords, max_results=max_results)
    entries = [
        {
            "url": u,
            "name": u.rstrip("/").split("/")[-1],
            "meta": {"source": "arxiv", "url": u},
        }
        for u in urls
    ]
    paths = parallel_download_pdfs(entries, out_dir, manifest, max_workers=8, desc="arXiv PDFs")
    return paths

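# Worked example (added for clarity; mirrors the fallback regex above rather
# than calling it): an Atom <id> entry is rewritten to a direct PDF URL. Note
# the [^?v] class also truncates the version suffix, so a versioned abs URL
# maps to the unversioned PDF.
if __name__ == "__main__":
    _m = re.search(r"arxiv\.org/abs/([^?v]+)", "http://arxiv.org/abs/2011.00508v2")
    assert _m is not None
    assert f"https://arxiv.org/pdf/{_m.group(1)}.pdf" == "https://arxiv.org/pdf/2011.00508.pdf"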

# --------------------------------------------------------------------------------------
# OPENALEX
# --------------------------------------------------------------------------------------
def openalex_fetch_works_try(
    search: str,
    filter_str: str,
    per_page: int,
    page: int,
    mailto: Optional[str],
) -> Dict[str, Any]:
    headers = {"User-Agent": f"polymer-rag/1.0 ({mailto or DEFAULT_MAILTO})"}
    params: Dict[str, Any] = {
        "search": search,
        "per-page": per_page,  # documented OpenAlex parameter
        "per_page": per_page,  # underscore form kept defensively
        "page": page,
        "sort": "publication_date:desc",
    }
    if filter_str:
        params["filter"] = filter_str
    if mailto:
        params["mailto"] = mailto

    resp = requests.get(OPENALEX_WORKS_URL, params=params, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.json()


def openalex_fetch_works(
    keywords: List[str],
    max_results: int = 600,
    per_page: int = 200,
    mailto: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """
    Try multiple query forms, relaxing the filters if earlier attempts return nothing.
    """
    kws = sorted(set(keywords or []), key=str.lower)
    combined = " ".join(kws)
    or_query = " OR ".join(kws)

    attempts = [
        {"q": combined, "filter": "is_oa:true,language:en"},
        {"q": or_query, "filter": "is_oa:true,language:en"},
        {"q": or_query, "filter": "is_oa:true"},
        {"q": or_query, "filter": ""},
    ]

    works: List[Dict[str, Any]] = []
    for attempt in attempts:
        search = attempt["q"]
        filter_str = attempt["filter"]
        page = 1
        while len(works) < max_results:
            try:
                data = openalex_fetch_works_try(
                    search, filter_str, per_page, page, mailto or DEFAULT_MAILTO
                )
            except Exception as e:
                print(f"[WARN] OpenAlex request failed: {e}")
                break

            results = data.get("results", [])
            if not results:
                break

            works.extend(results)
            if len(results) < per_page:
                break
            page += 1
            time.sleep(0.12)

        if len(works) >= max_results:
            break
        if works:
            break

    return works[:max_results]


def openalex_extract_pdf_entries(
    works: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """
    Extract candidate PDF URLs and metadata from OpenAlex works.
    """
    out: List[Dict[str, Any]] = []
    seen_urls = set()

    for w in works:
        pdf = ""
        best = w.get("best_oa_location") or {}
        if isinstance(best, dict):
            pdf = best.get("pdf_url") or best.get("url_for_pdf") or best.get("url") or ""
        if not pdf:
            pl = w.get("primary_location") or {}
            if isinstance(pl, dict):
                pdf = (
                    pl.get("pdf_url")
                    or pl.get("url_for_pdf")
                    or pl.get("landing_page_url")
                    or ""
                )
        if not pdf:
            oa = w.get("open_access") or {}
            if isinstance(oa, dict):
                pdf = oa.get("oa_url") or oa.get("oa_url_for_pdf") or ""
        if not pdf or pdf in seen_urls:
            continue
        seen_urls.add(pdf)

        title = (w.get("title") or w.get("display_name") or "").strip()
        year = w.get("publication_year") or w.get("publication_date") or ""
        venue = ""
        pl = w.get("primary_location") or {}
        if isinstance(pl, dict):
            venue = (pl.get("source") or {}).get("display_name") or ""
        if not venue:
            venue = ((w.get("host_venue") or {}).get("display_name") or "").strip()

        name = " - ".join([s for s in [title, venue, str(year)] if s])

        meta = {"title": title, "year": year, "venue": venue, "source": "openalex"}
        out.append({"url": pdf, "name": name, "meta": meta})

    return out


def fetch_openalex_pdfs(
    keywords: List[str],
    out_dir: str,
    manifest: Dict[str, Dict[str, Any]],
    max_results: int = 600,
    mailto: Optional[str] = None,
) -> List[str]:
    works = openalex_fetch_works(keywords, max_results=max_results, mailto=mailto)
    if not works:
        print("[INFO] OpenAlex returned no works for given queries/filters.")
        return []

    entries = openalex_extract_pdf_entries(works)
    if not entries:
        print("[INFO] OpenAlex works found, but no PDF links extracted.")
        return []

    paths = parallel_download_pdfs(
        entries, out_dir, manifest, max_workers=16, desc="OpenAlex PDFs"
    )
    return paths


# --------------------------------------------------------------------------------------
# EUROPE PMC
# --------------------------------------------------------------------------------------
def epmc_query_from_keywords(keywords: List[str]) -> str:
    return " OR ".join([f'"{k}"' for k in keywords])


def epmc_extract_pdf_entries_from_results(
    results: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    out: List[Dict[str, Any]] = []
    seen = set()

    for r in results:
        ftl = r.get("fullTextUrlList") or {}
        urls: List[str] = []
        if isinstance(ftl, dict):
            for ful in ftl.get("fullTextUrl") or []:
                if isinstance(ful, dict):
                    u = ful.get("url") or ""
                    if u:
                        urls.append(u)
        if not urls:
            fu = r.get("fullTextUrl")
            if isinstance(fu, str) and fu:
                urls.append(fu)

        for u in urls:
            if not u or u in seen:
                continue
            seen.add(u)

            title = (r.get("title") or "").strip()
            year = r.get("firstPublicationDate") or r.get("pubYear") or ""
            name = " - ".join([s for s in [title, str(year)] if s])

            out.append(
                {
                    "url": u,
                    "name": name,
                    "meta": {"title": title, "year": year, "source": "epmc"},
                }
            )

    return out


def fetch_epmc_pdfs(
    keywords: List[str],
    out_dir: str,
    manifest: Dict[str, Dict[str, Any]],
    max_results: int = 200,
    page_size: int = 25,
) -> List[str]:
    """
    Query Europe PMC and extract fullTextUrlList entries.
    """
    q = epmc_query_from_keywords(keywords)
    params = {
        "query": q,
        "format": "json",
        "pageSize": page_size,
        "sort": "FIRST_PDATE desc",
    }
    headers = {"User-Agent": f"polymer-rag/1.0 ({DEFAULT_MAILTO})"}
    saved: List[str] = []
    cursor = 1
    total_fetched = 0

    while total_fetched < max_results:
        params["page"] = cursor
        try:
            resp = requests.get(EPMC_SEARCH_URL, params=params, headers=headers, timeout=30)
            resp.raise_for_status()
            data = resp.json()
        except Exception as e:
            print(f"[WARN] Europe PMC request failed: {e}")
            break

        results = (data.get("resultList") or {}).get("result") or []
        if not results:
            break

        entries = epmc_extract_pdf_entries_from_results(results)
        if not entries:
            cursor += 1
            total_fetched += len(results)
            time.sleep(0.2)
            continue

        paths = parallel_download_pdfs(entries, out_dir, manifest, max_workers=8, desc="Europe PMC PDFs")
        saved.extend(paths)

        total_fetched += len(results)
        cursor += 1
        time.sleep(0.2)

    return saved


# --------------------------------------------------------------------------------------
# POLYMER JOURNALS OA
# --------------------------------------------------------------------------------------
def fetch_polymer_journal_pdfs(
    journal_queries: List[Dict[str, Any]],
    out_dir: str,
    manifest: Dict[str, Dict[str, Any]],
    max_per_journal: int = 50,
    mailto: Optional[str] = None,
) -> List[str]:
    """
    Fetch OA papers from specific polymer journals via OpenAlex.
    """
    all_paths: List[str] = []
    for jq in journal_queries:
        journal_name = jq["journal"]
        issn = jq.get("issn", "")
        publisher = jq.get("publisher", "")
        print(f"→ Fetching from {journal_name} ({publisher})...")

        # Build OpenAlex filter for this journal
        filter_parts = ["is_oa:true", "language:en"]
        if issn:
            filter_parts.append(f"primary_location.source.issn:{issn}")
        filter_str = ",".join(filter_parts)

        # Search for polymer-related content in this journal
        search_query = "polymer OR macromolecule OR copolymer"
        page = 1
        journal_works = []
        while len(journal_works) < max_per_journal:
            try:
                data = openalex_fetch_works_try(
                    search_query, filter_str, 25, page, mailto or DEFAULT_MAILTO
                )
            except Exception as e:
                print(f"[WARN] Failed to fetch {journal_name}: {e}")
                break

            results = data.get("results", [])
            if not results:
                break
            journal_works.extend(results)
            if len(results) < 25:
                break
            page += 1
            time.sleep(0.15)

        if journal_works:
            entries = openalex_extract_pdf_entries(journal_works[:max_per_journal])
            # Tag with journal source
            for e in entries:
                e["meta"]["journal"] = journal_name
                e["meta"]["publisher"] = publisher
                e["meta"]["source"] = f"{journal_name}_{publisher}".lower()

            paths = parallel_download_pdfs(
                entries, out_dir, manifest, max_workers=8, desc=f"{journal_name} PDFs"
            )
            all_paths.extend(paths)
            print(f"  → Downloaded {len(paths)} PDFs from {journal_name}")
        time.sleep(0.3)

    return all_paths


# --------------------------------------------------------------------------------------
# WRAPPER FOR OPENAI EMBEDDINGS (POLYMER STYLE)
# --------------------------------------------------------------------------------------
class PolymerStyleOpenAIEmbeddings(OpenAIEmbeddings):
    """
    OpenAI embeddings wrapper for polymer RAG.
    Default model: text-embedding-3-small (1536-dimensional).
    """

    def __init__(self, model: str = "text-embedding-3-small", **kwargs):
        super().__init__(model=model, **kwargs)


# --------------------------------------------------------------------------------------
# TOKENIZER FOR TRUE TOKEN-BASED SEGMENTATION
# --------------------------------------------------------------------------------------
TOKENIZER = tiktoken.get_encoding("cl100k_base")


def token_length(text: str) -> int:
    if not text:
        return 0
    return len(TOKENIZER.encode(text))

+# --------------------------------------------------------------------------------------
+# METADATA ENRICHMENT FROM MANIFEST
+# --------------------------------------------------------------------------------------
+def attach_extra_metadata_from_manifest(
+    docs: List[Any], manifest: Dict[str, Dict[str, Any]]
+) -> None:
+    """
+    Enrich Document metadata with manifest data for later citation.
+    """
+    for d in docs:
+        src_path = d.metadata.get("source", "")
+        if not src_path:
+            continue
+
+        rec = manifest.get(src_path)
+        if not rec:
+            # Fall back to basename matching when loader and manifest paths differ
+            for k, v in manifest.items():
+                if os.path.basename(k) == os.path.basename(src_path):
+                    rec = v
+                    break
+        if rec:
+            for k in ["title", "year", "venue", "url", "source", "journal", "publisher"]:
+                if k in rec:
+                    d.metadata[k] = rec[k]
+
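The basename fallback above matters because the PDF loader may record relative paths while manifest keys are absolute. The lookup logic can be exercised stand-alone (toy paths and records, not real manifest data):

```python
import os

# Toy manifest keyed by absolute path; lookup falls back to basename
# matching when the exact path key is missing (illustrative data only).
manifest = {
    "/abs/dir/paper_001.pdf": {"title": "Polymer electrolytes", "year": 2021},
}


def lookup(src_path):
    rec = manifest.get(src_path)
    if not rec:
        for k, v in manifest.items():
            if os.path.basename(k) == os.path.basename(src_path):
                return v
    return rec


print(lookup("downloads/paper_001.pdf"))  # → {'title': 'Polymer electrolytes', 'year': 2021}
```

A path that matches neither exactly nor by basename returns `None`, which is why the enrichment loop guards with `if rec:`.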
+
+# --------------------------------------------------------------------------------------
+# MULTI-SCALE CHUNKING
+# --------------------------------------------------------------------------------------
+def multiscale_chunk_documents(
+    docs: List[Any], min_chunk_tokens: int = 32
+) -> List[Any]:
+    """
+    Multi-scale segmentation at TOKEN level: 512, 256, 128 token windows.
+    """
+    splitter_specs = [
+        ("tokens=512", 512, 64),  # 64-token (12.5%) overlap
+        ("tokens=256", 256, 48),
+        ("tokens=128", 128, 32),
+    ]
+
+    all_chunks: List[Any] = []
+    seg_id = 0
+
+    for scale_label, chunk_size, overlap in splitter_specs:
+        splitter = RecursiveCharacterTextSplitter(
+            chunk_size=chunk_size,
+            chunk_overlap=overlap,
+            length_function=token_length,
+            separators=["\n\n", "\n", ". ", " ", ""],
+        )
+        splits = splitter.split_documents(docs)
+        for d in splits:
+            if token_length(d.page_content or "") < min_chunk_tokens:
+                continue
+            d.metadata = dict(d.metadata or {})
+            d.metadata["segment_scale"] = scale_label
+            d.metadata["segment_id"] = seg_id
+            seg_id += 1
+            all_chunks.append(d)
+
+    return all_chunks
+
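Each splitter effectively advances by `chunk_size - chunk_overlap` tokens per window (the real `RecursiveCharacterTextSplitter` also respects separator boundaries, so window edges shift in practice). The core windowing arithmetic, including the `min_chunk_tokens` filter, can be sketched over a plain token list with toy sizes:

```python
def windows(tokens, size, overlap, min_tokens=2):
    # Stride = size - overlap, mirroring chunk_size/chunk_overlap;
    # windows shorter than min_tokens are dropped, like min_chunk_tokens.
    step = size - overlap
    out = [tokens[i:i + size] for i in range(0, len(tokens), step)]
    return [w for w in out if len(w) >= min_tokens]


print(windows(list(range(10)), size=4, overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The trailing single-token window `[9]` is filtered out, just as sub-`min_chunk_tokens` fragments are skipped above.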
+
+# --------------------------------------------------------------------------------------
+# BUILD RETRIEVER FROM LOCAL PDFs
+# --------------------------------------------------------------------------------------
+def _split_and_build_retriever(
+    documents_dir: str,
+    persist_dir: Optional[str] = None,
+    k: int = 10,
+    embedding_model: str = "text-embedding-3-small",
+    vector_backend: str = "chroma",
+    min_chunk_tokens: int = 32,
+    api_key: Optional[str] = None,
+):
+    """
+    Load PDFs, chunk multi-scale, build dense retriever.
+    Always uses text-embedding-3-small (1536-dim) and rebuilds any existing DB
+    to avoid embedding-dimension mismatches.
+    """
+    print(f"→ Loading PDFs from {documents_dir}...")
+    try:
+        loader = DirectoryLoader(
+            documents_dir,
+            glob="*.pdf",
+            loader_cls=PyPDFLoader,
+            show_progress=True,
+            use_multithreading=True,
+            silent_errors=True,
+        )
+    except TypeError:
+        # Older DirectoryLoader versions do not accept silent_errors
+        loader = DirectoryLoader(
+            documents_dir,
+            glob="*.pdf",
+            loader_cls=PyPDFLoader,
+            show_progress=True,
+            use_multithreading=True,
+        )
+
+    docs = loader.load()
+    if not docs:
+        raise RuntimeError("No PDF documents found to index.")
+
+    manifest = load_manifest(documents_dir)
+    attach_extra_metadata_from_manifest(docs, manifest)
+
+    documents = multiscale_chunk_documents(docs, min_chunk_tokens=min_chunk_tokens)
+    print(
+        f"→ Created {len(documents)} multi-scale segments from {len(docs)} PDFs (512/256/128-token windows)."
+    )
+
+    print(f"→ Using OpenAI embeddings model: {embedding_model}")
+    embeddings = PolymerStyleOpenAIEmbeddings(model=embedding_model, api_key=api_key)
+
+    if vector_backend.lower() == "chroma":
+        # Delete any existing DB first to prevent embedding-dimension mismatch
+        if persist_dir and os.path.exists(persist_dir):
+            print(f"→ Deleting existing Chroma database at {persist_dir} to prevent dimension mismatch...")
+            import shutil
+            shutil.rmtree(persist_dir)
+            print("→ Existing database deleted. Creating fresh database...")
+
+        # Sanitize all text content to prevent Unicode errors
+        for doc in documents:
+            doc.page_content = sanitize_text(doc.page_content or "")
+            for key, value in doc.metadata.items():
+                if isinstance(value, str):
+                    doc.metadata[key] = sanitize_text(value)
+
+        # Process in batches to avoid rate limiting and memory issues
+        batch_size = 500  # Safe for most document sizes; adjust as needed
+        total_batches = (len(documents) + batch_size - 1) // batch_size
+        print(f"→ Processing {len(documents)} documents in {total_batches} batches of {batch_size}...")
+
+        vector_store = None
+        for i in range(0, len(documents), batch_size):
+            batch = documents[i : i + batch_size]
+            batch_num = (i // batch_size) + 1
+            print(f"  → Embedding batch {batch_num}/{total_batches} ({len(batch)} documents)...")
+
+            if vector_store is None:
+                # First batch: create the vector store
+                if persist_dir:
+                    print(f"  → Creating new Chroma database at {persist_dir}")
+                    vector_store = Chroma.from_documents(
+                        batch, embeddings, persist_directory=persist_dir
+                    )
+                else:
+                    # In-memory mode also needs batching
+                    vector_store = Chroma.from_documents(batch, embeddings)
+            else:
+                # Subsequent batches: add to the existing store
+                vector_store.add_documents(batch)
+
+            time.sleep(0.5)  # Small delay to avoid rate limiting
+
+        print("→ All batches embedded and persisted!")
+
+    elif vector_backend.lower() == "faiss":
+        try:
+            from langchain_community.vectorstores import FAISS
+        except Exception as e:
+            raise RuntimeError("FAISS requested but not available") from e
+
+        # Sanitize all text content
+        for doc in documents:
+            doc.page_content = sanitize_text(doc.page_content or "")
+            for key, value in doc.metadata.items():
+                if isinstance(value, str):
+                    doc.metadata[key] = sanitize_text(value)
+
+        # FAISS also needs batching
+        batch_size = 500
+        total_batches = (len(documents) + batch_size - 1) // batch_size
+        print(f"→ Processing {len(documents)} documents in {total_batches} batches of {batch_size}...")
+
+        vector_store = None
+        for i in range(0, len(documents), batch_size):
+            batch = documents[i : i + batch_size]
+            batch_num = (i // batch_size) + 1
+            print(f"  → Embedding batch {batch_num}/{total_batches} ({len(batch)} documents)...")
+
+            if vector_store is None:
+                vector_store = FAISS.from_documents(batch, embeddings)
+            else:
+                batch_store = FAISS.from_documents(batch, embeddings)
+                vector_store.merge_from(batch_store)
+
+            time.sleep(0.5)
+
+    else:
+        raise ValueError("vector_backend must be 'chroma' or 'faiss'")
+
+    vector_retriever = vector_store.as_retriever(search_kwargs={"k": k})
+    print("→ RAG KB ready (dense retriever over multi-scale segments).")
+    return vector_retriever
+
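Both backends rely on the same ceiling-division batch count, `(n + batch_size - 1) // batch_size`, and a create-then-extend loop. That arithmetic can be checked in isolation with toy counts (no embedding calls involved):

```python
def batch_plan(n_docs, batch_size=500):
    # Ceiling division: number of batches needed to cover n_docs
    total_batches = (n_docs + batch_size - 1) // batch_size
    # Size of each batch as the loop range(0, n_docs, batch_size) would see it
    sizes = [min(batch_size, n_docs - i) for i in range(0, n_docs, batch_size)]
    return total_batches, sizes


print(batch_plan(1201))  # → (3, [500, 500, 201])
```

The last batch is simply whatever remains, which is why `documents[i : i + batch_size]` never over-runs the list.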
+
+# --------------------------------------------------------------------------------------
+# PUBLIC API: BUILD RETRIEVER FROM WEB
+# --------------------------------------------------------------------------------------
+def build_retriever_from_web(
+    polymer_keywords: Optional[List[str]] = None,
+    target_curated: int = TARGET_CURATED,
+    target_journals: int = TARGET_JOURNALS,
+    target_arxiv: int = TARGET_ARXIV,
+    target_openalex: int = TARGET_OPENALEX,
+    target_epmc: int = TARGET_EPMC,
+    extra_pdf_urls: Optional[List[str]] = None,
+    persist_dir: str = DEFAULT_PERSIST_DIR,
+    tmp_download_dir: str = DEFAULT_TMP_DOWNLOAD_DIR,
+    k: int = 10,
+    embedding_model: str = "text-embedding-3-small",
+    vector_backend: str = "chroma",
+    mailto: Optional[str] = None,
+    include_curated: bool = True,
+):
+    """
+    Fetch a balanced polymer corpus across multiple sources.
+
+    Target distribution (~2000 PDFs):
+      - Curated guidelines/standards: 100
+      - Polymer journals OA: 200
+      - arXiv: 800
+      - OpenAlex: 600
+      - Europe PMC: 200
+      - Extra/databases: 100
+    """
+    polymer_keywords = sorted(set(polymer_keywords or POLYMER_KEYWORDS), key=str.lower)
+    print("=" * 70)
+    print("Fetching polymer PDFs from balanced sources...")
+    print(
+        f"Target: {target_curated} curated + {target_journals} journals + "
+        f"{target_arxiv} arXiv + {target_openalex} OpenAlex + {target_epmc} EPMC"
+    )
+
+    ensure_dir(tmp_download_dir)
+    manifest = load_manifest(tmp_download_dir)
+    source_stats = defaultdict(int)
+    all_paths: List[str] = []
+
+    # 1) Curated sources (IUPAC, ISO/ASTM, polymer informatics reviews)
+    if include_curated and CURATED_POLYMER_PDF_SOURCES:
+        print(f"[1/6] Downloading {len(CURATED_POLYMER_PDF_SOURCES)} curated PDFs...")
+        curated_paths = parallel_download_pdfs(
+            CURATED_POLYMER_PDF_SOURCES[:target_curated],
+            tmp_download_dir,
+            manifest,
+            max_workers=4,
+            desc="Curated PDFs",
+        )
+        for p in curated_paths:
+            if p not in all_paths:
+                all_paths.append(p)
+                source_stats["curated"] += 1
+        print(f"  → {len(curated_paths)} curated PDFs downloaded")
+
+    # 2) Polymer journals OA
+    try:
+        print(f"[2/6] Fetching polymer journal PDFs (target: {target_journals})...")
+        journal_paths = fetch_polymer_journal_pdfs(
+            POLYMER_JOURNAL_QUERIES,
+            tmp_download_dir,
+            manifest,
+            max_per_journal=target_journals // len(POLYMER_JOURNAL_QUERIES) + 1,
+            mailto=mailto,
+        )
+        for p in journal_paths:
+            if p not in all_paths:
+                all_paths.append(p)
+                source_stats["journal"] += 1
+        print(f"  → {len(journal_paths)} journal PDFs downloaded")
+    except Exception as e:
+        print(f"[WARN] Polymer journal fetch error: {e}")
+
+    # 3) arXiv polymer-focused categories
+    try:
+        print(f"[3/6] Fetching arXiv PDFs (target: {target_arxiv})...")
+        arxiv_paths = fetch_arxiv_pdfs(
+            polymer_keywords, tmp_download_dir, manifest, max_results=target_arxiv
+        )
+        for p in arxiv_paths:
+            if p not in all_paths:
+                all_paths.append(p)
+                source_stats["arxiv"] += 1
+        print(f"  → {len(arxiv_paths)} arXiv PDFs downloaded")
+    except Exception as e:
+        print(f"[WARN] arXiv fetch error: {e}")
+
+    # 4) OpenAlex broad polymer search
+    try:
+        print(f"[4/6] Fetching OpenAlex PDFs (target: {target_openalex})...")
+        openalex_paths = fetch_openalex_pdfs(
+            polymer_keywords,
+            tmp_download_dir,
+            manifest,
+            max_results=target_openalex,
+            mailto=mailto,
+        )
+        for p in openalex_paths:
+            if p not in all_paths:
+                all_paths.append(p)
+                source_stats["openalex"] += 1
+        print(f"  → {len(openalex_paths)} OpenAlex PDFs downloaded")
+    except Exception as e:
+        print(f"[WARN] OpenAlex fetch error: {e}")
+
+    # 5) Europe PMC biopolymers/materials
+    try:
+        print(f"[5/6] Fetching Europe PMC PDFs (target: {target_epmc})...")
+        epmc_paths = fetch_epmc_pdfs(
+            polymer_keywords, tmp_download_dir, manifest, max_results=target_epmc
+        )
+        for p in epmc_paths:
+            if p not in all_paths:
+                all_paths.append(p)
+                source_stats["epmc"] += 1
+        print(f"  → {len(epmc_paths)} Europe PMC PDFs downloaded")
+    except Exception as e:
+        print(f"[WARN] Europe PMC fetch error: {e}")
+
+    # 6) Extra URLs (user-provided, database exports, etc.)
+    if extra_pdf_urls:
+        print(f"[6/6] Downloading {len(extra_pdf_urls)} extra PDFs...")
+        extra_entries = [
+            {"url": u, "name": None, "meta": {"url": u, "source": "extra"}}
+            for u in extra_pdf_urls
+        ]
+        extra_paths = parallel_download_pdfs(
+            extra_entries, tmp_download_dir, manifest, max_workers=8, desc="Extra PDFs"
+        )
+        for p in extra_paths:
+            if p not in all_paths:
+                all_paths.append(p)
+                source_stats["extra"] += 1
+        print(f"  → {len(extra_paths)} extra PDFs downloaded")
+
+    # Summary
+    total = len(all_paths)
+    print("=" * 70)
+    print("DOWNLOAD SUMMARY")
+    print("=" * 70)
+    print(f"Total unique PDFs downloaded: {total}")
+    print("  by source:")
+    for source, count in sorted(source_stats.items()):
+        pct = (count / total * 100) if total > 0 else 0
+        print(f"    {source:20s} {count:4d} PDFs ({pct:5.1f}%)")
+    print("=" * 70)
+
+    if total == 0:
+        raise RuntimeError(
+            "No PDFs fetched. Adjust keywords, targets, or add extra_pdf_urls."
+        )
+
+    print("Building knowledge base from downloaded PDFs...")
+    retriever = _split_and_build_retriever(
+        documents_dir=tmp_download_dir,
+        persist_dir=persist_dir,
+        k=k,
+        embedding_model=embedding_model,
+        vector_backend=vector_backend,
+    )
+
+    return retriever
+
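The per-source tally in the download summary is just a `defaultdict(int)` incremented for each newly seen path, then rendered as a percentage of the total. A stand-alone sketch with made-up source labels:

```python
from collections import defaultdict

# Toy download log: one entry per newly downloaded PDF (illustrative only)
source_stats = defaultdict(int)
for src in ["arxiv", "arxiv", "openalex", "epmc", "arxiv"]:
    source_stats[src] += 1

total = sum(source_stats.values())
for source, count in sorted(source_stats.items()):
    # Guard against division by zero when nothing was downloaded
    pct = (count / total * 100) if total > 0 else 0
    print(f"  {source:20s} {count:4d} PDFs ({pct:5.1f}%)")
# prints one aligned line per source with its share of the total
```

`defaultdict(int)` avoids the `if src not in stats` boilerplate; unseen keys start at 0 on first increment.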
+
+# --------------------------------------------------------------------------------------
+# PUBLIC API: BUILD RETRIEVER FROM LOCAL PAPERS
+# --------------------------------------------------------------------------------------
+def build_retriever(
+    papers_path: str,
+    persist_dir: Optional[str] = DEFAULT_PERSIST_DIR,
+    k: int = 10,
+    embedding_model: str = "text-embedding-3-small",
+    vector_backend: str = "chroma",
+):
+    """
+    Build polymer RAG KB from local PDFs.
+    """
+    print("Building RAG knowledge base from local PDFs...")
+    return _split_and_build_retriever(
+        documents_dir=papers_path,
+        persist_dir=persist_dir,
+        k=k,
+        embedding_model=embedding_model,
+        vector_backend=vector_backend,
+    )
+
+
+# --------------------------------------------------------------------------------------
+# CONVENIENCE WRAPPER: POLYMER FOUNDATION MODELS
+# --------------------------------------------------------------------------------------
+def build_retriever_polymer_foundation_models(
+    persist_dir: str = DEFAULT_PERSIST_DIR,
+    k: int = 10,
+    vector_backend: str = "chroma",
+):
+    """
+    Convenience wrapper for polymer foundation model corpus.
+    """
+    fm_kw = list(
+        set(POLYMER_KEYWORDS)
+        | {
+            "BigSMILES",
+            "PSMILES",
+            "polymer SMILES",
+            "polymer language model",
+            "foundation model polymer",
+            "masked language model polymer",
+            "self-supervised polymer",
+            "generative polymer",
+            "polymer sequence modeling",
+            "representation learning polymer",
+        }
+    )
+    return build_retriever_from_web(
+        polymer_keywords=fm_kw,
+        target_curated=100,
+        target_journals=200,
+        target_arxiv=800,
+        target_openalex=600,
+        target_epmc=200,
+        persist_dir=persist_dir,
+        k=k,
+        embedding_model="text-embedding-3-small",
+        vector_backend=vector_backend,
+    )
+
+
+# --------------------------------------------------------------------------------------
+# MAIN
+# --------------------------------------------------------------------------------------
+if __name__ == "__main__":
+    retriever = build_retriever_from_web(
+        polymer_keywords=POLYMER_KEYWORDS,
+        target_curated=100,
+        target_journals=200,
+        target_arxiv=800,
+        target_openalex=600,
+        target_epmc=200,
+        persist_dir="chroma_polymer_db_balanced",
+        tmp_download_dir=DEFAULT_TMP_DOWNLOAD_DIR,
+        k=10,
+        embedding_model="text-embedding-3-small",
+        vector_backend="chroma",
+        mailto=DEFAULT_MAILTO,
+        include_curated=True,
+    )
+
+    print("\n" + "=" * 70)
+    print("Testing retrieval with sample query")
+    docs = retriever.get_relevant_documents("PSMILES polymer electrolyte design")
+    for i, d in enumerate(docs, 1):
+        meta = d.metadata or {}
+        title = meta.get("title") or os.path.basename(meta.get("source", "")) or "document"
+        year = meta.get("year", "")
+        src = meta.get("source", "unknown")
+        journal = meta.get("journal", "")
+        scale = meta.get("segment_scale", "")
+        source_str = f"{src}"
+        if journal:
+            source_str = f"{journal} ({src})"
+        print(f"\n[{i}] {title}")
+        print(f"    Year: {year} | Source: {source_str} | Scale: {scale}")
+        print(f"    Content: {(d.page_content or '')[:200]}...")