fierce74 committed
Commit 7529164 · 1 Parent(s): e66ab71

Add application file
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 patient74
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,14 +1,71 @@
- ---
- title: Microbiome Immunotherapy CDS
- emoji: 📚
- colorFrom: indigo
- colorTo: gray
- sdk: gradio
- sdk_version: 6.6.0
- app_file: app.py
- pinned: false
- license: mit
- short_description: ' An Evidence-Based Clinical Decision Support Tool'
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Microbiome-Immunotherapy Clinical Decision Support System
+ ### An Evidence-Based Clinical Decision Support Tool
+
+ This project provides a clinical decision support system that optimizes immunotherapy (ICI/ACT) treatment based on a patient's gut microbiome profile. It leverages MedGemma 1.5 4B and a specialized RAG pipeline to generate evidence-based clinical reports.
+
+ ## Architecture Overview
+
+ The system processes patient data and clinical evidence through a modular pipeline to produce a six-section clinical report:
+
+ 1. **Microbiome Composition**: Profile of diversity and key taxa.
+ 2. **Metabolite Landscape**: Analysis of SCFAs, bile acids, and tryptophan metabolites.
+ 3. **Drug-Microbiome Interaction**: Core interpretation of the microbiome's impact on drug efficacy.
+ 4. **Confounding Factors**: Impact of antibiotics, PPIs, and prior treatments.
+ 5. **Intervention Considerations**: Evidence-based dietary or probiotic suggestions.
+ 6. **Data Quality & Limitations**: Assessment of report confidence.
+
+ Each section is generated using targeted RAG retrieval from a database of peer-reviewed medical literature.
+
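+ As an illustration, the per-section retrieval step might look like the sketch below. The function and collection names here are hypothetical (the real query construction lives in `src/`); only the ChromaDB calls are standard API:
+
+ ```python
+ # Illustrative sketch; `retrieve_evidence` and the collection name are hypothetical.
+ import chromadb
+
+ client = chromadb.PersistentClient(path="./chroma_db")
+ collection = client.get_collection("research_papers")  # assumed collection name
+
+ def retrieve_evidence(section_topic: str, patient_summary: str, k: int = 5) -> list:
+     """Fetch the k most relevant literature chunks for one report section."""
+     query = f"{section_topic}: {patient_summary}"
+     results = collection.query(query_texts=[query], n_results=k)
+     return results["documents"][0]
+ ```
+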
+ ## Key Features
+
+ - **EHR Extraction**: Automatically parses raw Electronic Health Record (EHR) text into structured patient data using MedGemma (see the example below).
+ - **Medical RAG**: Domain-specific retrieval system using PubMedBERT embeddings and table-aware chunking.
+ - **Multi-Model Support**: Designed for MedGemma 1.5 but extensible to other LLMs.
+
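+ For example, EHR extraction can be invoked programmatically through `ReportAssembler`, the same call `app.py` makes (the input path here is illustrative):
+
+ ```python
+ from src.report_assembler import ReportAssembler
+
+ assembler = ReportAssembler()  # loads MedGemma + PubMedBERT once
+ # Parse a raw EHR text file into the structured patient schema
+ patient_data = assembler.load_patient_data_from_ehr("data/input/patient_ehr_1.txt")
+ ```
+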
+ ## Project Structure
+
+ ```text
+ ├── rag/                 # RAG pipeline for indexing medical papers
+ ├── src/                 # Core application logic (models, prompts, generators)
+ ├── data/                # Patient data and clinical inputs
+ ├── outputs/             # Generated clinical reports (Markdown)
+ ├── generate_report.py   # Main CLI entry point
+ └── requirements.txt     # Project dependencies
+ ```
+
+ ## Getting Started
+
+ ### Prerequisites
+
+ - A built RAG index (see `rag/README.md`)
+ - Python 3.10+
+ - CUDA-compatible GPU (recommended for MedGemma and PubMedBERT)
+ - HuggingFace access to `google/medgemma-1.5-4b-it`
+
+ ### Installation
+
+ 1. Clone the repository.
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ pip install -r rag/requirements.txt
+ ```
+
+ ## Usage
+
+ Generate a report from structured patient JSON
+ (see `data/templates/patient_schema_template.json`):
+ ```bash
+ python generate_report.py data/patient_example.json
+ ```
+
+ Generate a report from raw EHR text:
+ ```bash
+ python generate_report.py data/patient_ehr.txt --save-ehr-json outputs/patient_profile.json
+ ```
+
+ ## Examples
+
+ See `data/sample_input/` for example EHR inputs and `outputs/` for the corresponding generated reports.
+
+ ## Configuration
+
+ Settings for model IDs, device selection (CPU/GPU), and RAG parameters can be found in `src/config.py`.
app.py ADDED
@@ -0,0 +1,258 @@
+ """
+ app.py — Gradio entrypoint for the
+ Microbiome-Immunotherapy Clinical Decision Support System
+
+ Startup sequence:
+     1. Download / verify ChromaDB from HuggingFace (chroma_loader)
+     2. Load MedGemma + PubMedBERT once into memory (ReportAssembler.__init__)
+     3. Launch Gradio UI
+
+ Usage:
+     python app.py
+ """
+
+ import logging
+ import sys
+ from pathlib import Path
+ from typing import Generator, Tuple
+
+ import gradio as gr
+
+ # ---------------------------------------------------------------------------
+ # Logging
+ # ---------------------------------------------------------------------------
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s — %(name)s — %(levelname)s — %(message)s",
+     handlers=[logging.StreamHandler(sys.stdout)],
+ )
+ logger = logging.getLogger(__name__)
+
+ # ---------------------------------------------------------------------------
+ # Step 1: Ensure ChromaDB is available locally before anything else
+ # ---------------------------------------------------------------------------
+ logger.info("=" * 70)
+ logger.info("Microbiome-Immunotherapy CDS — starting up")
+ logger.info("=" * 70)
+
+ from src.chroma_loader import ensure_chroma_db
+ ensure_chroma_db()
+
+ # ---------------------------------------------------------------------------
+ # Step 2: Load models once (expensive — happens here, not per request)
+ # ---------------------------------------------------------------------------
+ logger.info("Loading models — this may take a few minutes on first run...")
+ from src.report_assembler import ReportAssembler
+ assembler = ReportAssembler()
+ logger.info("Models loaded successfully.")
+
+ # ---------------------------------------------------------------------------
+ # Step 3: Discover EHR files in data/input/
+ # ---------------------------------------------------------------------------
+ EHR_DIR = Path("data/input")
+
+ def _discover_ehr_files() -> dict:
+     """
+     Scan data/input/ for .txt and .ehr files.
+     Returns a dict mapping display label -> absolute path string,
+     e.g. {"patient_ehr_1": "/abs/path/data/input/patient_ehr_1.txt"}
+     """
+     files = {}
+     for ext in ("*.txt", "*.ehr"):
+         for p in sorted(EHR_DIR.glob(ext)):
+             label = p.stem
+             files[label] = str(p.resolve())
+     return files
+
+ EHR_FILES = _discover_ehr_files()
+
+ if not EHR_FILES:
+     logger.warning(
+         f"No EHR files found in {EHR_DIR}. "
+         "Add .txt or .ehr files there before generating reports."
+     )
+
+ # ---------------------------------------------------------------------------
+ # Step 4: Report generation function (generator — streams section by section)
+ # ---------------------------------------------------------------------------
+
+ def generate_report(ehr_label: str) -> Generator[Tuple[str, str], None, None]:
+     """
+     Gradio generator function.
+
+     Args:
+         ehr_label: The dropdown display label corresponding to an EHR file.
+
+     Yields:
+         Tuple of (report_markdown: str, status_message: str).
+         Each yield updates the UI immediately.
+     """
+     if not ehr_label:
+         yield "", "⚠️ Please select a patient EHR file."
+         return
+
+     ehr_path = EHR_FILES.get(ehr_label)
+     if not ehr_path:
+         yield "", f"⚠️ EHR file not found for selection: '{ehr_label}'"
+         return
+
+     logger.info(f"Report requested for: {ehr_label} → {ehr_path}")
+
+     # Extract patient JSON from EHR text via MedGemma
+     yield "", f"⏳ Extracting structured data from EHR: {ehr_label}..."
+     try:
+         patient_data = assembler.load_patient_data_from_ehr(ehr_path)
+     except Exception as exc:
+         logger.error(f"EHR extraction failed: {exc}", exc_info=True)
+         yield "", f"❌ EHR extraction failed: {exc}"
+         return
+
+     # Stream the report section by section
+     try:
+         for report_so_far, status in assembler.generate_full_report_streaming(patient_data):
+             yield report_so_far, status
+     except Exception as exc:
+         logger.error(f"Report generation failed: {exc}", exc_info=True)
+         yield "", f"❌ Report generation failed: {exc}"
+         return
+
+
+ def clear_outputs():
+     """Reset the output panel when a new EHR is selected from the dropdown."""
+     return "", "Select a patient EHR and click Generate Report."
+
+
+ # ---------------------------------------------------------------------------
+ # Step 5: Build the Gradio UI
+ # ---------------------------------------------------------------------------
+
+ DISCLAIMER_HTML = """
+ <div style="
+     background: #fff3cd;
+     border: 1px solid #ffc107;
+     border-radius: 6px;
+     padding: 12px 18px;
+     margin-bottom: 4px;
+     font-size: 0.92em;
+     color: #664d03;
+     line-height: 1.5;
+ ">
+ <strong>⚠️ Clinical Decision Support Tool — For Healthcare Professional Use Only</strong><br>
+ This system is intended for use by qualified oncologists and clinical teams as a
+ <em>decision support aid</em>. It does not constitute medical advice and must be
+ interpreted in conjunction with comprehensive clinical evaluation. All outputs are
+ AI-generated, evidence-based summaries sourced from peer-reviewed literature;
+ they do not replace clinical judgement.
+ </div>
+ """
+
+ TITLE_HTML = """
+ <div style="text-align: center; padding: 10px 0 4px 0;">
+     <h1 style="font-size: 1.5em; margin: 0; color: #1a1a2e;">
+         🧬 Microbiome–Immunotherapy Clinical Decision Support System
+     </h1>
+     <p style="color: #555; margin: 4px 0 0 0; font-size: 0.95em;">
+         Evidence-based microbiome analysis for ICI &amp; ACT treatment planning
+     </p>
+ </div>
+ """
+
+ with gr.Blocks(
+     title="Microbiome-Immunotherapy CDS",
+     theme=gr.themes.Soft(primary_hue="blue", neutral_hue="slate"),
+ ) as demo:
+
+     # -----------------------------------------------------------------------
+     # Header
+     # -----------------------------------------------------------------------
+     gr.HTML(TITLE_HTML)
+     gr.HTML(DISCLAIMER_HTML)
+
+     # -----------------------------------------------------------------------
+     # Main layout: left control panel | right report output
+     # -----------------------------------------------------------------------
+     with gr.Row(equal_height=False):
+
+         # -------------------------------------------------------------------
+         # LEFT: Controls
+         # -------------------------------------------------------------------
+         with gr.Column(scale=1, min_width=260):
+
+             gr.Markdown("### Patient Selection")
+
+             ehr_dropdown = gr.Dropdown(
+                 choices=list(EHR_FILES.keys()),
+                 label="Select Patient EHR",
+                 info="EHR reports available in data/input/",
+                 interactive=True,
+                 value=None,
+             )
+
+             generate_btn = gr.Button(
+                 "Generate Report",
+                 variant="primary",
+                 size="lg",
+             )
+
+             gr.Markdown("---")
+             gr.Markdown("### Status")
+
+             status_box = gr.Textbox(
+                 value="Select a patient EHR and click Generate Report.",
+                 label="",
+                 interactive=False,
+                 lines=2,
+                 max_lines=3,
+             )
+
+             gr.Markdown("---")
+             gr.Markdown(
+                 "<small style='color:#888;'>"
+                 "**Model:** MedGemma 1.5 4B &nbsp;|&nbsp; "
+                 "**RAG:** PubMedBERT + ChromaDB<br>"
+                 "**Evidence base:** Peer-reviewed literature on microbiome × immunotherapy"
+                 "</small>"
+             )
+
+         # -------------------------------------------------------------------
+         # RIGHT: Report output
+         # -------------------------------------------------------------------
+         with gr.Column(scale=3):
+
+             gr.Markdown("### Clinical Report")
+
+             report_output = gr.Markdown(
+                 value="*The generated report will appear here, section by section, as it is being written.*",
+                 label="",
+                 # height keeps the panel stable rather than jumping as content grows
+                 height=820,
+             )
+
+     # -----------------------------------------------------------------------
+     # Event wiring
+     # -----------------------------------------------------------------------
+
+     # When a new EHR is selected from the dropdown, clear the output panel
+     ehr_dropdown.change(
+         fn=clear_outputs,
+         inputs=[],
+         outputs=[report_output, status_box],
+     )
+
+     # Generate button — streams the report section by section
+     generate_btn.click(
+         fn=generate_report,
+         inputs=[ehr_dropdown],
+         outputs=[report_output, status_box],
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Launch
+ # ---------------------------------------------------------------------------
+ if __name__ == "__main__":
+     demo.launch(
+         server_name="0.0.0.0",  # bind to all interfaces (accessible on LAN)
+         server_port=7860,
+         show_error=True,
+     )
data/sample_input/demo_ehr_act_dlbcl.txt ADDED
@@ -0,0 +1,268 @@
+ ================================================================================
+ UNIVERSITY MEDICAL CENTER - CELLULAR THERAPY
+ ================================================================================
+ Patient: Marcus Johnson | MRN: UMC-556219 | DOB: 06/08/1965 (Age: 60)
+ Gender: Male | Visit Date: February 8, 2026
+
+ CHIEF COMPLAINT:
+ CAR-T therapy consultation for relapsed/refractory DLBCL.
+
+ HISTORY:
+ 60-year-old male with DLBCL initially diagnosed March 2024. Failed R-CHOP (CR
+ 6 months, relapsed) and R-ICE (PR, progressed at 4 months). Now stage IV with
+ bone marrow involvement. Planned for axicabtagene ciloleucel (CD19 CAR-T).
+
+ LYMPHOMA DIAGNOSIS:
+ - Histology: Diffuse Large B-Cell Lymphoma (non-GCB, ABC type)
+ - Stage: IV (bone marrow 15% involvement)
+ - Initial Diagnosis: March 18, 2024
+ - CD19: Positive (95% tumor cells) - CAR-T eligible
+ - High-risk features: BCL2+ (70%), TP53 mutation, Revised IPI 4
+
+ PRIOR TREATMENTS:
+ 1. R-CHOP × 6 (Apr-Aug 2024) → CR 6 months → Relapsed
+ 2. R-ICE × 3 (Apr-Jun 2025) → PR → Progressed Oct 2025
+ 3. Gemcitabine bridging (weekly, Jan 2026) → Stable disease
+
+ CAR-T PLAN:
+ - Product: Axicabtagene ciloleucel (Yescarta) - CD19-targeted
+ - Leukapheresis: February 14, 2026
+ - Manufacturing: ~17-21 days
+ - Lymphodepletion: Fludarabine/Cyclophosphamide (March 7-9, 2026)
+ - CAR-T Infusion: March 10, 2026 (Day 0)
+
+ CURRENT MEDICATIONS:
+ 1. Levofloxacin 500mg daily (bacterial prophylaxis - ACTIVE)
+ 2. Acyclovir 400mg BID (viral prophylaxis)
+ 3. Fluconazole 200mg daily (fungal prophylaxis)
+ 4. Lisinopril 20mg daily (hypertension)
+ 5. Atorvastatin 40mg daily (hyperlipidemia)
+
+ ANTIBIOTIC EXPOSURE:
+ Current Prophylaxis (ACTIVE):
+ - Levofloxacin 500mg daily since January 25, 2026 (14 days at microbiome sampling)
+ - Indication: Neutropenia prophylaxis post-chemotherapy
+ - Will continue through CAR-T per protocol
+
+ Recent Therapeutic:
+ - Piperacillin-tazobactam 4.5g IV Q6H × 10 days (Dec 18-27, 2025)
+ - Indication: Febrile neutropenia with pneumonia
+ - Days before CAR-T: ~73 days
+
+ PPI USE: None
+
+ PAST MEDICAL HISTORY:
+ Hypertension, hyperlipidemia, Type 2 diabetes (HbA1c 6.8%), chronic back pain.
+ No autoimmune disease. No seizure history (neurotoxicity risk assessment).
+
+ PERFORMANCE STATUS: ECOG 2 | Karnofsky 70%
+
+ VITALS: BP 142/88 | HR 88 | SpO2 97% RA | Temp 98.8°F
+ Weight 198 lbs (90 kg, down 22 lbs) | Height 5'11" (180 cm) | BMI 27.6 | BSA 2.11 m²
+
+ KEY LABS (February 6, 2026):
+ - WBC 3.2, Hgb 10.8, Plt 142, ANC 1.8, ALC 0.9 (lymphopenia post-chemo)
+ - Creatinine 1.1, eGFR 76 (adequate for CAR-T)
+ - AST/ALT 34/42, normal bilirubin
+ - LDH 345 (elevated - disease burden), CRP 24.8, Ferritin 485
+
+ IMAGING (February 2026):
+ - PET-CT: Retroperitoneal nodes 3.2cm SUVmax 18.4, Mesenteric mass 4.8cm SUVmax 22.1
+ - Bone marrow: Diffuse uptake (involvement)
+ - Deauville Score: 5 (very active disease)
+ - MRI Brain: No CNS involvement
+ - Echo: LVEF 58% (normal)
+
+ ================================================================================
+ MICROBIOME ANALYSIS - SUBOPTIMAL BUT MODIFIABLE
+ ================================================================================
+ Sample Date: February 5, 2026 (33 days before planned CAR-T infusion)
+ Method: Shotgun metagenomic sequencing (Illumina NovaSeq, 12M reads)
+ Lab: Precision Microbiome Diagnostics
+
+ CLINICAL CONTEXT:
+ - On prophylactic levofloxacin × 14 days at sampling
+ - Recent piperacillin-tazobactam 49 days prior
+ - Multiple prior chemotherapy regimens
+ - Immunosuppressed (lymphopenia, post-R-CHOP/R-ICE)
+
+ DIVERSITY (LOW-MODERATE - SUBOPTIMAL):
+ - Shannon Index: 2.7 [LOW - risk for poor CAR-T outcomes]
+ - Simpson Index: 0.82
+ - Observed Species: 164 (reduced)
+ - Interpretation: Reduced diversity from antibiotics + chemotherapy
+
+ COMPOSITION:
+ Firmicutes 46.2% | Bacteroidetes 38.8% | Proteobacteria 8.4% ↑ | Actinobacteria 4.2%
+ F/B Ratio: 1.19 (low - dysbiosis)
+
+ KEY TAXA (% relative abundance):
+
+ FAVORABLE (present but suboptimal):
+ - Akkermansia muciniphila: 1.8% [BORDERLINE - reduced]
+ - Faecalibacterium prausnitzii: 3.6% [LOW - key SCFA producer depleted]
+ - Ruminococcaceae: 9.2% [REDUCED]
+ - Ruminococcus lactaris: 1.8% [CAR-T biomarker - Smith 2022]
+ - Lachnospiraceae: 8.8% [REDUCED]
+ - Lachnospira pectinoschiza: 1.2% [CAR-T favorable - Stein-Thoeringer]
+ - Roseburia: 1.8%
+ - Bifidobacterium spp.: 2.2% total [LOW]
+ - Bacteroides eggerthii: 2.4% [CAR-T response predictor]
+
+ CONCERNING (elevated):
+ - Proteobacteria: 8.4% [ELEVATED - dysbiosis marker]
+ - E. coli: 4.2% [ELEVATED - antibiotic selection]
+ - Klebsiella pneumoniae: 1.8% [ELEVATED]
+ - Enterococcus faecalis: 2.1% [ELEVATED - pathobiont]
+ - Bacteroides uniformis: 3.6% [CRS risk - Stein-Thoeringer 2023]
+ - Blautia spp.: 4.2% [High Blautia → worse CAR-T outcomes]
+ - Bacteroides ovatus: 4.8%
+
+ METABOLITES (Measured):
+
+ SCFAs (GC-MS):
+ - Butyrate: 15.8 μM [LOW - CRITICAL for CD8+ T-cell function]
+ - Propionate: 11.2 μM [LOW-MODERATE]
+ - Acetate: 38.4 μM [MODERATE]
+ - Total SCFA: 71.0 μM [LOW - suboptimal for T-cell support]
+
+ Interpretation: REDUCED SCFA production is a major concern. Butyrate is critical
+ for CAR-T cell cytotoxicity. Low levels correlate with reduced Faecalibacterium.
+
+ Bile Acids (LC-MS/MS):
+ - Secondary bile acids: 18.2 μM [LOW]
+ - Secondary/Primary: 0.81 [LOW - reduced microbial conversion]
+
+ Tryptophan Metabolites (LC-MS/MS):
+ - NOT DETECTED - pathway significantly disrupted
+
+ FUNCTIONAL PATHWAYS:
+ - SCFA biosynthesis: 2.8% [SIGNIFICANTLY REDUCED]
+ - Butyrate production: 1.2% [CRITICALLY LOW]
+ - Vitamin B synthesis: 2.1% [REDUCED]
+ - LPS biosynthesis: 2.8% [ELEVATED - inflammation risk]
+ - Antibiotic resistance genes: Detected (fluoroquinolone markers)
+
+ CLINICAL INTERPRETATION FOR CAR-T:
+
+ OVERALL: SUBOPTIMAL BUT MODIFIABLE (33-day intervention window)
+
+ MAJOR CONCERNS (worse CAR-T outcomes in literature):
+ 1. LOW diversity (2.7) - Smith 2022: associated with worse response
+ 2. LOW butyrate (15.8 μM) - Luu 2021: critical for CD8+ T-cell cytotoxicity
+ 3. REDUCED Faecalibacterium (3.6%) - key SCFA producer depleted
+ 4. ACTIVE fluoroquinolone - Prasad 2024: ongoing disruption
+ 5. ELEVATED Proteobacteria (8.4%) - dysbiosis/inflammation
+ 6. ELEVATED Bacteroides uniformis (3.6%) - CRS risk marker
+ 7. High Blautia (4.2%) - associated with worse outcomes
+
+ FAVORABLE ASPECTS:
+ - Akkermansia present (1.8%) - not completely depleted
+ - Ruminococcus lactaris present (1.8%) - CAR-T biomarker
+ - Bacteroides eggerthii present (2.4%) - response predictor
+ - NO PPI use - not adding dysbiosis
+ - 33 DAYS until CAR-T - TIME TO INTERVENE
+
+ CRS RISK (microbiome-based): MODERATE-HIGH
+ - Low diversity, elevated B. uniformis/Blautia, pro-inflammatory dysbiosis
+
+ NEUROTOXICITY RISK: MODERATE
+ - Low SCFA (neuroprotective effects diminished)
+
+ ================================================================================
+ URGENT RECOMMENDATIONS (33-Day Optimization Window):
+
+ 1. DISCONTINUE LEVOFLOXACIN (consult ID)
+    - Ongoing fluoroquinolone perpetuating dysbiosis
+    - Risk/Benefit: Infection risk vs CAR-T efficacy
+
+ 2. HIGH-DOSE PROBIOTICS (start immediately):
+    - Multi-strain: Lactobacillus + Bifidobacterium
+    - Example: Visbiome 900B CFU BID
+    - Rationale: Restore beneficial bacteria
+
+ 3. BUTYRATE ENHANCEMENT (critical):
+    - Clostridium butyricum 2B CFU daily
+    - High-fiber diet (25-30g/day)
+    - Resistant starch 20g/day
+    - Target: Restore butyrate before CAR-T
+
+ 4. DIETARY INTERVENTION (nutritionist consult):
+    - High-fiber, plant-based foods
+    - Prebiotics: Inulin, pectin, resistant starch
+    - Fermented foods: Yogurt, kefir (if ANC >1.5)
+
+ 5. REPEAT MICROBIOME (March 1, 2026):
+    - 1 week before lymphodepletion
+    - Goal: Shannon >3.0, Butyrate >25 μM
+    - Consider FMT if persistent dysbiosis
+
+ 6. CONSIDER FMT:
+    - If repeat shows persistent severe dysbiosis
+    - Timing: ≥1 week before lymphodepletion
+
+ DATA QUALITY:
+ - Sample integrity: Good, adequate depth
+ - Limitations: Single time-point during active antibiotics
+ - Context: Results reflect antibiotic-disrupted state
+
+ Report: Dr. Jennifer Park, PhD | Clinical: Dr. Rachel Kim, PharmD
+ Date: February 7, 2026
+
+ ================================================================================
+
+ ASSESSMENT & PLAN:
+
+ 60yo male with R/R DLBCL (stage IV, BM+) post R-CHOP/R-ICE, planned for CD19
+ CAR-T (axicabtagene ciloleucel) on March 10, 2026.
+
+ CAR-T CANDIDACY: ✓ APPROVED
+ - CD19+ disease, no contraindications
+ - ECOG 2 acceptable, cardiac/CNS cleared
+
+ MICROBIOME: SUBOPTIMAL BUT MODIFIABLE
+ - Active antibiotic dysbiosis
+ - LOW butyrate → CD8+ T-cell concern
+ - Moderate-high CRS risk based on microbiome
+ - 33-day intervention window
+
+ PLAN:
+
+ 1. LEUKAPHERESIS: February 14, 2026
+
+ 2. MICROBIOME OPTIMIZATION (URGENT):
+    A. ID Consult: Discontinue levofloxacin if safe
+    B. Probiotics: Visbiome 900B CFU BID (start now)
+    C. Butyrate: C. butyricum + resistant starch
+    D. Diet: High-fiber (nutritionist Feb 10)
+    E. Repeat microbiome: March 1 (pre-lymphodepletion)
+
+ 3. LYMPHODEPLETION: March 7-9 (Flu/Cy)
+
+ 4. CAR-T INFUSION: March 10 (Day 0)
+
+ 5. POST-CAR-T MONITORING (high CRS/neurotoxicity risk):
+    - ICU-level monitoring × 48hr
+    - Daily neuro assessments (ICE score)
+    - Tocilizumab/dexamethasone available bedside
+    - CBC/CMP daily × 14 days
+
+ 6. RESPONSE ASSESSMENT:
+    - Day +30: PET-CT, MRD
+    - Microbiome correlation at Days +30, +100
+
+ PROGNOSIS: Guardedly optimistic. CAR-T shows 50-60% durable CR in R/R DLBCL.
+ Microbiome optimization may improve outcomes and reduce toxicity.
+
+ FOLLOW-UP:
+ - Leukapheresis: Feb 14
+ - Nutritionist: Feb 10
+ - ID consult: Feb 9 (re: levofloxacin)
+ - Repeat microbiome: Mar 1
+ - Admission: Mar 7
+
+ ________________________________________________________________________________
+ Dr. Rachel Kim, MD | Cellular Therapy & Hematologic Malignancies
+ Co-signed: Dr. David Martinez, MD, PhD (CAR-T Program Director)
+ February 8, 2026
+ ================================================================================
data/sample_input/demo_ehr_ici_nsclc.txt ADDED
@@ -0,0 +1,178 @@
+ ================================================================================
+ COMPREHENSIVE CANCER CENTER - THORACIC ONCOLOGY
+ ================================================================================
+ Patient: Sarah Chen | MRN: CCC-782934 | DOB: 03/22/1959 (Age: 66)
+ Gender: Female | Visit Date: February 12, 2026
+
+ CHIEF COMPLAINT:
+ First-line immunotherapy consultation for metastatic NSCLC.
+
+ HISTORY:
+ 66-year-old never-smoker with stage IV lung adenocarcinoma diagnosed December 2025.
+ Presented with cough and dyspnea. Imaging showed 4.2cm right upper lobe mass with
+ mediastinal nodes and malignant pleural effusion.
+
+ CANCER DIAGNOSIS:
+ - Primary: Right upper lobe adenocarcinoma (TTF-1+, Napsin-A+)
+ - Stage: IVA (T3N2M1a) - Malignant pleural effusion
+ - Diagnosis Date: December 8, 2025
+ - Molecular: KRAS G12C mutation; EGFR/ALK/ROS1/BRAF negative
+
+ BIOMARKERS (Immunotherapy-Relevant):
+ - PD-L1: 65% TPS (22C3 assay) - HIGH, favorable for pembrolizumab monotherapy
+ - TMB: 18 mutations/megabase - HIGH
+ - MSI: Stable (MSS)
+
+ TREATMENT PLAN:
+ Pembrolizumab 200mg IV Q3W starting February 19, 2026 (first-line monotherapy).
+
+ CURRENT MEDICATIONS:
+ 1. Metoprolol 50mg daily (atrial fibrillation)
+ 2. Apixaban 5mg BID (anticoagulation)
+ 3. Levothyroxine 75mcg daily (hypothyroidism)
+ 4. Vitamin D3 1000 IU daily
+ 5. Calcium carbonate 500mg BID
+
+ ANTIBIOTIC EXPOSURE:
+ - Recent: Levofloxacin 750mg daily × 5 days (Dec 28, 2025 - Jan 1, 2026)
+ - Indication: Community-acquired pneumonia
+ - Days before immunotherapy: 49 days (FULLY RECOVERED per microbiome)
+ - No other antibiotics in past 6 months
+
+ PPI USE: None (no GERD symptoms)
+
+ PAST MEDICAL HISTORY:
+ Atrial fibrillation, hypothyroidism, osteopenia, hypertension. No autoimmune disease.
+
+ SOCIAL HISTORY:
+ Never-smoker, rare alcohol use. Retired librarian. Diet: Predominantly plant-based,
+ high in vegetables and whole grains (relevant to favorable microbiome).
+
+ PERFORMANCE STATUS: ECOG 1 | Karnofsky 80%
+
+ VITALS: BP 118/76 | HR 68 (irregular) | SpO2 94% RA | Temp 98.2°F
+ Weight 128 lbs (58 kg) | Height 5'4" (163 cm) | BMI 22.0
+
+ KEY LABS (February 10, 2026):
+ - WBC 6.8, Hgb 12.4, Plt 278, ANC 4.2
+ - Creatinine 0.8, eGFR >90
+ - AST/ALT 24/28, normal LFTs
+ - TSH 2.8 (on levothyroxine - stable)
+ - LDH 198 (normal), CRP 12.5, CEA 8.4
+
+ IMAGING (February 2026):
+ - CT: 4.2cm RUL mass, mediastinal nodes, moderate pleural effusion
+ - MRI Brain: No metastases
+ - PET: Primary SUVmax 14.2, nodes SUVmax 8.7
+
+ ================================================================================
+ MICROBIOME ANALYSIS - HIGHLY FAVORABLE PROFILE
+ ================================================================================
+ Sample Date: February 7, 2026
+ Method: Shotgun metagenomic sequencing (Illumina NovaSeq, 15M reads)
+ Lab: Precision Microbiome Diagnostics
+
+ DIVERSITY (HIGH - FAVORABLE):
+ - Shannon Index: 3.6
+ - Simpson Index: 0.91
+ - Observed Species: 247
+ - Interpretation: High diversity consistently associated with better ICI response
+
+ COMPOSITION:
+ Firmicutes 52.8% | Bacteroidetes 34.6% | Actinobacteria 8.2% | Proteobacteria 2.8%
+ F/B Ratio: 1.53 (balanced)
+
+ KEY TAXA (% relative abundance):
+
+ FAVORABLE BACTERIA (HIGH - excellent for anti-PD-1):
+ - Akkermansia muciniphila: 4.8% [VERY HIGH - strongest NSCLC predictor]
+ - Faecalibacterium prausnitzii: 8.9% [HIGH - SCFA producer]
+ - Ruminococcaceae family: 14.2% [HIGH - favorable]
+ - Bifidobacterium longum: 3.2% + B. adolescentis: 1.8% [FAVORABLE]
+ - Lachnospiraceae family: 12.8% [HIGH]
+ - Roseburia intestinalis: 3.8% [butyrate producer]
+ - Alistipes putredinis: 2.6% [favorable in NSCLC]
+ - Prevotella copri: 4.2%
+
+ LESS FAVORABLE (LOW - good):
+ - Bacteroides spp.: 11.3% total (moderate)
+ - E. coli: 1.2%, Enterococcus: 0.4%, Fusobacterium: <0.1% (all low - favorable)
+
+ METABOLITES (Measured):
+
+ SCFAs (GC-MS):
+ - Butyrate: 32.8 μM [HIGH - excellent for CD8+ T cells]
+ - Propionate: 18.6 μM
+ - Acetate: 58.4 μM
+ - Total SCFA: 118.0 μM [HIGH]
+
+ Bile Acids (LC-MS/MS):
+ - Secondary bile acids: 42.8 μM (high conversion - favorable)
+ - Secondary/Primary ratio: 2.33
+
+ Tryptophan Metabolites (LC-MS/MS):
+ - Indole: 5.2 μM [AhR ligand]
+ - Indole-3-aldehyde: 2.8 μM
+ - Kynurenine/Tryptophan: 0.34 (moderate IDO activity)
+
+ Other:
+ - Inosine: 2.4 μM [T-cell activation]
+
+ FUNCTIONAL PATHWAYS:
+ - SCFA biosynthesis: HIGH
+ - Butyrate production: HIGH
+ - Vitamin B synthesis: HIGH
+ - Bile salt hydrolase: Moderate-High
+
+ CLINICAL INTERPRETATION:
+
+ OVERALL: HIGHLY FAVORABLE PROFILE FOR PEMBROLIZUMAB
+
+ Strengths:
+ 1. Very high diversity (Shannon 3.6) - predicts ICI response
+ 2. VERY HIGH Akkermansia (4.8%) - strongest predictor in NSCLC (Routy/Derosa)
+ 3. High Faecalibacterium (8.9%) - responder taxon
+ 4. High Ruminococcaceae (14.2%) - favorable in multiple ICI studies
+ 5. Robust SCFA production - supports CD8+ T-cell function
+ 6. Plant-based diet correlation with favorable microbiome
+ 7. No PPI use - microbiome not disrupted
+ 8. Full antibiotic recovery (49 days post-levofloxacin)
+
+ Literature Alignment:
+ - Routy 2018: Akkermansia >1% favorable → Patient: 4.8%
+ - Gopalakrishnan 2018: High diversity favorable → Patient: High
+ - Derosa 2022: Akkermansia predicts NSCLC response → Patient: Very high
+
+ Predicted Response: FAVORABLE microbiome signature for pembrolizumab in NSCLC.
+
+ RECOMMENDATIONS:
+ 1. Continue high-fiber, plant-based diet
+ 2. Avoid antibiotics during treatment if possible
+ 3. Maintain PPI-free status
+ 4. Repeat microbiome at Week 12 with response assessment
+
+ ================================================================================
+ ASSESSMENT & PLAN:
+
+ 66yo never-smoker with stage IVA lung adenocarcinoma (high PD-L1 65%, high TMB 18)
+ initiating first-line pembrolizumab monotherapy.
+
+ FAVORABLE FACTORS:
+ - High PD-L1 (65%), High TMB (18)
+ - HIGHLY FAVORABLE microbiome (high diversity, very high Akkermansia, high butyrate)
+ - Good performance status (ECOG 1)
+ - No PPI use, full antibiotic recovery
+
+ PLAN:
+ 1. Pembrolizumab 200mg IV Q3W starting Feb 19, 2026
+ 2. Monitor for irAEs (pneumonitis risk given lung disease)
+ 3. Continue plant-based high-fiber diet
+ 4. Repeat CT at Week 12 (May 2026)
+ 5. Repeat microbiome at Week 12 to correlate with response
+
+ PROGNOSIS: Guardedly optimistic given favorable biomarkers and microbiome profile.
+
+ ________________________________________________________________________________
+ Dr. Michael Torres, MD, PhD | Thoracic Oncology
+ February 12, 2026
+ ================================================================================
data/templates/patient_schema_template.json ADDED
@@ -0,0 +1,93 @@
+ {
+   "extraction_version": "1.0",
+   "extraction_date": "YYYY-MM-DD",
+
+   "patient": {
+     "id": "",
+     "age": 0,
+     "gender": ""
+   },
+
+   "cancer": {
+     "type": "",
+     "subtype": "",
+     "stage": "",
+     "primary_site": "",
+     "metastases": [],
+     "biomarkers": {
+       "pdl1_expression": "",
+       "tmb": "",
+       "msi_status": ""
+     },
+     "diagnosis_date": "YYYY-MM-DD"
+   },
+
+   "immunotherapy": {
+     "therapy_type": "",
+     "drug_name": "",
+     "drug_class": "",
+     "treatment_setting": "",
+     "line_of_therapy": "",
+     "planned_start_date": "YYYY-MM-DD",
+     "ici_details": null,
+     "act_details": null
+   },
+
+   "prior_treatments": {
+     "chemotherapy": {
+       "received": false,
+       "regimens": [],
+       "response": ""
+     },
+     "prior_immunotherapy": {
+       "received": false,
+       "drugs": [],
+       "response": ""
+     }
+   },
+
+   "medications": {
+     "ppi_use": {
+       "currently_on_ppi": false,
+       "ppi_name": "",
+       "duration_months": 0
+     },
+     "antibiotic_history": {
+       "recent_antibiotics": false,
+       "exposures": []
+     }
+   },
+
+   "comorbidities": [],
+
+   "microbiome": {
+     "sample_date": "YYYY-MM-DD",
+     "sequencing_method": "",
+     "diversity": {
+       "shannon_index": 0.0,
+       "simpson_index": 0.0,
+       "observed_species": 0
+     },
+     "key_bacteria": {},
+     "metabolites": {
+       "scfa": {
+         "butyrate_uM": null,
+         "propionate_uM": null,
+         "acetate_uM": null
+       },
+       "bile_acids_available": false,
+       "tryptophan_metabolites_available": false
+     },
+     "data_quality": {
+       "completeness": "",
+       "source": "",
+       "limitations": []
+     }
+   },
+
+   "clinical_context": {
+     "urgency": "",
+     "patient_goals": [],
+     "specific_concerns": []
+   }
+ }
generate_report.py ADDED
@@ -0,0 +1,121 @@
+ #!/usr/bin/env python3
+ """
+ Main CLI entry point
+
+ Accepts either:
+     - A pre-built patient JSON file (.json; see the template in data/templates/)
+     - A raw EHR text file
+
+ Examples:
+     python generate_report.py patient_example.json
+     python generate_report.py patient_ehr.txt
+     python generate_report.py patient_ehr.txt --save-ehr-json extracted_patient.json
+ """
+
+ import argparse
+ import logging
+ import sys
+ from pathlib import Path
+
+ from src.report_assembler import ReportAssembler
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+     handlers=[
+         logging.StreamHandler(sys.stdout)
+     ]
+ )
+
+ logger = logging.getLogger(__name__)
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description=(
+             "Generate microbiome-immunotherapy clinical decision support report. "
+             "Accepts either a pre-built patient JSON file or a raw EHR text file."
+         )
+     )
+
+     parser.add_argument(
+         "patient_input",
+         type=str,
+         help=(
+             "Path to patient data file. "
+             "Use a .json file for pre-extracted patient data, "
+             "or a .txt/.ehr file to extract from a raw EHR report first."
+         )
+     )
+
+     parser.add_argument(
+         "-o", "--output-dir",
+         type=str,
+         default=None,
+         help="Output directory for generated report (default: ./outputs)"
+     )
+
+     parser.add_argument(
+         "--save-ehr-json",
+         type=str,
+         default=None,
+         metavar="PATH",
+         help=(
+             "[EHR mode only] Save the MedGemma-extracted patient JSON to this path. "
+             "Useful for inspecting or reusing the extraction without re-running the model."
+         )
+     )
+
+     args = parser.parse_args()
+
+     # Validate input file
+     input_path = Path(args.patient_input)
+     if not input_path.exists():
+         logger.error(f"Input file not found: {input_path}")
+         sys.exit(1)
+
+     # Detect input mode by extension
+     is_json = input_path.suffix.lower() == ".json"
+
+     if not is_json and args.save_ehr_json is None:
+         logger.info(
+             "Tip: use --save-ehr-json <path> to save the extracted patient JSON "
+             "and skip re-extraction on future runs."
+         )
+
+     logger.info("=" * 80)
+     logger.info("Microbiome-ICI Clinical Report Generator v1.0")
+     logger.info("=" * 80)
+
+     if is_json:
+         logger.info(f"Input mode: pre-built patient JSON → {input_path}")
+     else:
+         logger.info(f"Input mode: raw EHR text (MedGemma extraction) → {input_path}")
+
+     try:
+         assembler = ReportAssembler()
+
+         if is_json:
+             output_path = assembler.generate_and_save(
+                 patient_json_path=str(input_path),
+                 output_dir=args.output_dir,
+             )
+         else:
+             output_path = assembler.generate_and_save_from_ehr(
+                 ehr_path=str(input_path),
+                 output_dir=args.output_dir,
+                 save_json_path=args.save_ehr_json,
+             )
+
+         logger.info("=" * 80)
+         logger.info(f"✓ Report generated successfully: {output_path}")
+         logger.info("=" * 80)
+
+     except Exception as e:
+         logger.error(f"Report generation failed: {e}", exc_info=True)
+         sys.exit(1)
+
+
+ if __name__ == "__main__":
+     main()
rag/README.md ADDED
@@ -0,0 +1,65 @@
+ # Medical RAG Pipeline for Research Papers
+
+ This directory contains the Retrieval-Augmented Generation (RAG) pipeline for extracting and processing medical research papers to support clinical decision-making in immunotherapy.
+
+ ## Pipeline Overview
+
+ The pipeline transforms raw PDF research papers into a searchable vector database (ChromaDB), optimized for medical context and evidence retrieval.
+
+ - **PDF Extraction**: Uses `docling` for accurate markdown extraction from complex medical PDFs.
+ - **Cleaning**: `rag_md_cleaner.py` removes unnecessary metadata, reference sections, and figures while preserving essential tables and text.
+ - **Chunking**: `SectionAwareChunker` implements section-aware splitting with specific handling for tables to ensure context is preserved.
+ - **Embedding**: Uses `pritamdeka/S-PubMedBert-MS-MARCO`, a domain-specific transformer model optimized for medical literature.
+ - **Storage**: Persists vectors and metadata in `ChromaDB`.
+
+ ## Components
+
+ - `pdf_to_chromadb_pipeline.py`: The main entry point for the ingestion pipeline.
+ - `rag_md_cleaner.py`: Utility for cleaning extracted markdown.
+ - `research_papers.json`: Metadata registry (filename stem mapping to paper titles/citations).
+
+ ## Data Requirements
+
+ To ensure accurate citations, provide a `research_papers.json` file in the same directory as the script. Each top-level key is the filename stem of the corresponding PDF (e.g. `"1"` for `1.pdf`). The format should be:
+
+ ```json
+ {
+     "1": {
+         "reference_id": "25",
+         "citation": "Takada et al., Int J Cancer 2021",
+         "title": "Clinical impact of probiotics on the efficacy of anti-PD-1 monotherapy in patients with NSCLC",
+         "year": 2021,
+         "tags": {
+             "treatment": ["PD-1/PD-L1 Blockade"],
+             "cancer": ["NSCLC"],
+             "biology": ["Gut microbiome composition", "Alpha diversity"],
+             "intervention": ["Probiotics"]
+         }
+     }
+ }
+ ```
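+
+ A quick sanity check that the registry parses and covers every PDF might look like this (illustrative snippet, assuming it is run from this directory):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ # Load the citation registry and flag PDFs without metadata
+ registry = json.loads(Path("research_papers.json").read_text())
+ for pdf in Path("./pdfs").glob("*.pdf"):
+     if pdf.stem not in registry:
+         print(f"Missing citation metadata for {pdf.name}")
+ ```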
+
+ ## Usage
+
+ To run the pipeline and index a folder of PDFs:
+
+ ```bash
+ python pdf_to_chromadb_pipeline.py --input-folder ./pdfs --db-path ./chroma_db
+ ```
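+
+ Once indexed, the database can be queried with the same PubMedBERT model used at ingestion. A minimal sketch (the collection name below is an assumption; check the pipeline output for the actual name):
+
+ ```python
+ import chromadb
+ from sentence_transformers import SentenceTransformer
+
+ # Embed the query with the same model used to index the papers
+ model = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")
+ client = chromadb.PersistentClient(path="./chroma_db")
+ collection = client.get_collection("research_papers")  # assumed name
+
+ query_vec = model.encode("Akkermansia muciniphila and anti-PD-1 response").tolist()
+ results = collection.query(query_embeddings=[query_vec], n_results=5)
+ for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
+     print(meta.get("citation", "?"), "->", doc[:120])
+ ```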
+
+ ## Dependencies
+
+ Requires `docling`, `transformers`, `sentence-transformers`, and `chromadb`. See `requirements_pipeline.txt` for details.
rag/pdf_to_chromadb_pipeline.py ADDED
@@ -0,0 +1,1044 @@
+ """
+ pdf_to_chromadb_pipeline.py
+
+ Complete pipeline: PDF -> Docling extraction -> Cleaning -> Chunking -> ChromaDB
+ Optimized for research papers with PubMedBERT embeddings.
+
+ Pipeline steps:
+     1. Extract markdown from PDF using docling
+     2. Clean markdown (remove refs, metadata, figures; keep tables)
+     3. Chunk with section-awareness and table splitting
+     4. Embed using PubMedBERT tokenizer
+     5. Store in ChromaDB with metadata
+
+ Usage:
+     python pdf_to_chromadb_pipeline.py --input-folder ./pdfs --db-path ./chroma_db
+ """
+
+ import os
+ import json
+ import argparse
+ from pathlib import Path
+ from typing import List, Dict, Optional
+ from datetime import datetime
+ import re
+ import unicodedata
+ from collections import defaultdict
+
+ # Import cleaning function
+ from rag_md_cleaner import clean_markdown_for_rag
+
+ # Docling imports
+ try:
+     from docling.document_converter import DocumentConverter
+     DOCLING_AVAILABLE = True
+ except ImportError:
+     DOCLING_AVAILABLE = False
+     print("Warning: docling not installed. Install with: pip install docling")
+
+ # Transformers for tokenizer
+ try:
+     from transformers import AutoTokenizer
+     TRANSFORMERS_AVAILABLE = True
+ except ImportError:
+     TRANSFORMERS_AVAILABLE = False
+     print("Warning: transformers not installed. Install with: pip install transformers torch")
+
+ # ChromaDB
+ try:
+     import chromadb
+     from chromadb.config import Settings
+     CHROMADB_AVAILABLE = True
+ except ImportError:
+     CHROMADB_AVAILABLE = False
+     print("Warning: chromadb not installed. Install with: pip install chromadb")
+
+ # Sentence transformers for embeddings
+ try:
+     from sentence_transformers import SentenceTransformer
+     SENTENCE_TRANSFORMERS_AVAILABLE = True
+ except ImportError:
+     SENTENCE_TRANSFORMERS_AVAILABLE = False
+     print("Warning: sentence-transformers not installed. Install with: pip install sentence-transformers")
+
+
+ # ========================================
+ # TABLE-AWARE MARKDOWN CHUNKER (embedded)
+ # ========================================
+
+ class TableAwareMarkdownSplitter:
+     """Split markdown by headers while keeping tables intact."""
+
+     def __init__(self, headers_to_split_on: List[tuple]):
+         self.headers_to_split_on = sorted(
+             headers_to_split_on,
+             key=lambda x: len(x[0]),
+             reverse=True
+         )
+
+     def split_text(self, text: str) -> List[Dict]:
+         """Split text by headers while preserving table structure."""
+         lines = text.split('\n')
+         documents = []
+         current_content = []
+         current_metadata = {}
+         in_table = False
+         table_buffer = []
+
+         for i, line in enumerate(lines):
+             is_table_row = self._is_table_row(line)
+             header_info = self._parse_header(line)
+
+             if header_info and not in_table:
+                 if current_content:
+                     documents.append({
+                         'content': '\n'.join(current_content),
+                         'metadata': current_metadata.copy(),
+                         'has_table': False
+                     })
+                     current_content = []
+
+                 level, header_text = header_info
+                 current_metadata[level] = header_text
+                 self._clear_lower_headers(current_metadata, level)
+                 current_content.append(line)
+
+             elif is_table_row:
+                 if not in_table:
+                     if current_content:
+                         caption = self._get_table_caption(current_content)
+                         if caption:
+                             current_content = current_content[:-1]
+                             if current_content:
+                                 documents.append({
+                                     'content': '\n'.join(current_content),
+                                     'metadata': current_metadata.copy(),
+                                     'has_table': False
+                                 })
+                             current_content = []
+                             table_buffer = [caption]
+                         else:
+                             documents.append({
+                                 'content': '\n'.join(current_content),
+                                 'metadata': current_metadata.copy(),
+                                 'has_table': False
+                             })
+                             current_content = []
+                             table_buffer = []
+                     in_table = True
+
+                 table_buffer.append(line)
+
+             elif in_table and not is_table_row:
+                 in_table = False
+                 if table_buffer:
+                     documents.append({
+                         'content': '\n'.join(table_buffer),
+                         'metadata': current_metadata.copy(),
+                         'has_table': True
+                     })
+                     table_buffer = []
+                 current_content.append(line)
+
+             else:
+                 current_content.append(line)
+
+         if table_buffer:
+             documents.append({
+                 'content': '\n'.join(table_buffer),
+                 'metadata': current_metadata.copy(),
+                 'has_table': True
+             })
+
+         if current_content:
+             documents.append({
+                 'content': '\n'.join(current_content),
+                 'metadata': current_metadata.copy(),
+                 'has_table': False
+             })
+
+         return documents
+
+     def _is_table_row(self, line: str) -> bool:
+         stripped = line.strip()
+         if not stripped:
+             return False
+         return stripped.startswith('|') or ('|' in stripped and stripped.count('|') >= 2)
+
+     def _get_table_caption(self, content_lines: List[str]) -> Optional[str]:
+         if not content_lines:
+             return None
+         last_line = content_lines[-1].strip()
+         if re.match(r'^Table \d+[\.:].+', last_line, re.IGNORECASE):
+             return last_line
+         return None
+
+     def _parse_header(self, line: str) -> Optional[tuple]:
+         line = line.strip()
+         for header_marker, level_name in self.headers_to_split_on:
+             if line.startswith(header_marker + ' '):
+                 header_text = line[len(header_marker):].strip()
+                 return level_name, header_text
+         return None
+
+     def _clear_lower_headers(self, metadata: Dict, current_level: str):
+         # headers_to_split_on is sorted deepest-first, so deeper levels
+         # precede current_level in levels_order and are the ones to clear.
+         levels_order = [h[1] for h in self.headers_to_split_on]
+         try:
+             current_idx = levels_order.index(current_level)
+             for level in levels_order[:current_idx]:
+                 metadata.pop(level, None)
+         except ValueError:
+             pass
+
+
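+ # Example of the splitter's contract (illustrative only; not called by the pipeline):
+ #
+ #     splitter = TableAwareMarkdownSplitter([("#", "h1"), ("##", "h2")])
+ #     docs = splitter.split_text("# Results\n| a | b |\n| - | - |\n| 1 | 2 |")
+ #     # -> a text chunk ({'has_table': False}) followed by a table chunk
+ #     #    ({'has_table': True}), both carrying metadata {'h1': 'Results'}
+
+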
194
+ class SectionAwareChunker:
195
+ """Chunk markdown with section awareness, table handling, and token limits."""
196
+
197
+ def __init__(
198
+ self,
199
+ model_name: str = "pritamdeka/S-PubMedBert-MS-MARCO",
200
+ max_tokens: int = 330,
201
+ chunk_overlap_tokens: int = 30,
202
+ split_tables: bool = True
203
+ ):
204
+ """
205
+ Initialize the chunker.
206
+
207
+ Args:
208
+ model_name: HuggingFace model name for tokenizer
209
+ max_tokens: Maximum tokens per chunk
210
+ chunk_overlap_tokens: Overlap between chunks in tokens
211
+ split_tables: If True, split large tables by rows
212
+ """
213
+ if not TRANSFORMERS_AVAILABLE:
214
+ raise ImportError("transformers library required. Install: pip install transformers torch")
215
+
216
+ print(f"Loading tokenizer from {model_name}...")
217
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
218
+ print("✓ Tokenizer loaded successfully")
219
+
220
+ self.max_tokens = max_tokens
221
+ self.chunk_overlap_tokens = chunk_overlap_tokens
222
+ self.split_tables = split_tables
223
+
224
+ self.headers_to_split_on = [
225
+ ("#", "h1"),
226
+ ("##", "h2"),
227
+ ("###", "h3"),
228
+ ("####", "h4"),
229
+ ]
230
+
231
+ self.header_splitter = TableAwareMarkdownSplitter(self.headers_to_split_on)
232
+
233
+ def count_tokens(self, text: str) -> int:
234
+ """Count tokens using the model's tokenizer."""
235
+ return len(self.tokenizer.encode(text, add_special_tokens=True))
236
+
237
+ def split_table_by_rows(self, table_text: str, max_tokens: int) -> List[str]:
238
+ """Split a table into smaller chunks by rows."""
239
+ lines = table_text.split('\n')
240
+
241
+ caption = None
242
+ header_row = None
243
+ separator_row = None
244
+ data_rows = []
245
+
246
+ for i, line in enumerate(lines):
247
+ line = line.strip()
248
+ if not line:
249
+ continue
250
+
251
+ if re.match(r'^Table \d+[\.:].+', line, re.IGNORECASE):
252
+ caption = line
253
+ elif '|' in line:
254
+ if header_row is None:
255
+ header_row = line
256
+ elif separator_row is None and re.match(r'^\|[\s\-:|]+\|', line):
257
+ separator_row = line
258
+ else:
259
+ data_rows.append(line)
260
+
261
+ if not data_rows:
262
+ return [table_text]
263
+
264
+ # Build header template
265
+ header_template = []
266
+ if caption:
267
+ header_template.append(caption)
268
+ if header_row:
269
+ header_template.append(header_row)
270
+ if separator_row:
271
+ header_template.append(separator_row)
272
+
273
+ header_tokens = self.count_tokens('\n'.join(header_template)) if header_template else 0
274
+
275
+ # Check if a single row exceeds the limit
276
+ max_row_tokens = max(self.count_tokens(row) for row in data_rows)
277
+
278
+ if header_tokens + max_row_tokens > max_tokens:
279
+ # Even a single row is too large - need to split columns
280
+ print(f"Table row exceeds limit ({max_row_tokens} tokens), splitting columns...")
281
+ return self._split_table_by_columns(table_text, max_tokens)
282
+
283
+ # Split by rows normally
284
+ chunks = []
285
+ current_chunk_rows = []
286
+
287
+ for row in data_rows:
288
+ row_tokens = self.count_tokens(row)
289
+ current_tokens = self.count_tokens('\n'.join(current_chunk_rows)) if current_chunk_rows else 0
290
+
291
+ if header_tokens + current_tokens + row_tokens <= max_tokens:
292
+ current_chunk_rows.append(row)
293
+ else:
294
+ if current_chunk_rows:
295
+ chunk = '\n'.join(header_template + current_chunk_rows)
296
+ chunks.append(chunk)
297
+ # Start new chunk with this row
298
+ current_chunk_rows = [row]
299
+
300
+ if current_chunk_rows:
301
+ chunk = '\n'.join(header_template + current_chunk_rows)
302
+ chunks.append(chunk)
303
+
304
+ return chunks if chunks else [table_text]
305
+
306
+ def _split_table_by_columns(self, table_text: str, max_tokens: int) -> List[str]:
307
+ """
308
+ Split a wide table by columns when rows are too long.
309
+ Creates multiple narrower tables, each with first column + subset of other columns.
310
+ """
311
+ lines = table_text.split('\n')
312
+
313
+ caption = None
314
+ header_row = None
315
+ separator_row = None
316
+ data_rows = []
317
+
318
+ for line in lines:
319
+ line = line.strip()
320
+ if not line:
321
+ continue
322
+
323
+ if re.match(r'^Table \d+[\.:].+', line, re.IGNORECASE):
324
+ caption = line
325
+ elif '|' in line:
326
+ if header_row is None:
327
+ header_row = line
328
+ elif separator_row is None and re.match(r'^\|[\s\-:|]+\|', line):
329
+ separator_row = line
330
+ else:
331
+ data_rows.append(line)
332
+
333
+ if not header_row or not data_rows:
334
+ # Can't split intelligently, just return as text chunks
335
+ return self.split_by_tokens(table_text, max_tokens, 0)
336
+
337
+ # Parse header columns
338
+ header_cells = [c.strip() for c in header_row.split('|')[1:-1]]
339
+ n_cols = len(header_cells)
340
+
341
+ if n_cols <= 2:
342
+ # Too few columns to split, fall back to text splitting
343
+ return self.split_by_tokens(table_text, max_tokens, 0)
344
+
345
+ # Parse all rows into cells
346
+ parsed_rows = []
347
+ for row in data_rows:
348
+ cells = [c.strip() for c in row.split('|')[1:-1]]
349
+ # Pad or trim to match header length
350
+ while len(cells) < n_cols:
351
+ cells.append('')
352
+ cells = cells[:n_cols]
353
+ parsed_rows.append(cells)
354
+
355
+ # Strategy: Keep first column (usually ID/key), split remaining columns into groups
356
+ chunks = []
357
+
358
+ # Try to fit columns into chunks
359
+ first_col_idx = 0
360
+ col_groups = []
361
+ current_group = [first_col_idx] # Always include first column
362
+
363
+ for col_idx in range(1, n_cols):
364
+ # Test if adding this column fits
365
+ test_group = current_group + [col_idx]
366
+ test_chunk = self._build_table_chunk(
367
+ caption, header_cells, parsed_rows, test_group
368
+ )
369
+ test_tokens = self.count_tokens(test_chunk)
370
+
371
+ if test_tokens <= max_tokens:
372
+ current_group.append(col_idx)
373
+ else:
374
+ # Current group is full, save it
375
+ if len(current_group) > 1: # Has more than just first column
376
+ col_groups.append(current_group)
377
+ current_group = [first_col_idx, col_idx]
378
+
379
+ # Add remaining group
380
+ if len(current_group) > 1:
381
+ col_groups.append(current_group)
382
+
383
+ # Build chunks from column groups
384
+ for group_idx, col_indices in enumerate(col_groups):
385
+ chunk = self._build_table_chunk(caption, header_cells, parsed_rows, col_indices)
386
+
387
+ # Add note about which columns
388
+ if len(col_groups) > 1:
389
+ col_names = [header_cells[i] for i in col_indices[1:]] # Skip first col (ID)
390
+ note = f"\n[Table part {group_idx + 1}/{len(col_groups)}: columns {', '.join(col_names)}]"
391
+ chunk = chunk + note
392
+
393
+ chunks.append(chunk)
394
+
395
+ return chunks if chunks else [table_text]
396
+
397
+ def _build_table_chunk(
398
+ self,
399
+ caption: Optional[str],
400
+ header_cells: List[str],
401
+ data_rows: List[List[str]],
402
+ col_indices: List[int]
403
+ ) -> str:
404
+ """Build a table chunk with selected columns."""
405
+ lines = []
406
+
407
+ if caption:
408
+ lines.append(caption)
409
+
410
+ # Header row with selected columns
411
+ selected_headers = [header_cells[i] for i in col_indices]
412
+ header_line = '| ' + ' | '.join(selected_headers) + ' |'
413
+ lines.append(header_line)
414
+
415
+ # Separator row
416
+ separator = '| ' + ' | '.join(['---'] * len(col_indices)) + ' |'
417
+ lines.append(separator)
418
+
419
+ # Data rows with selected columns
420
+ for row in data_rows:
421
+ selected_cells = [row[i] if i < len(row) else '' for i in col_indices]
422
+ row_line = '| ' + ' | '.join(selected_cells) + ' |'
423
+ lines.append(row_line)
424
+
425
+ return '\n'.join(lines)
426
+
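+ # Illustrative sketch (assumed numbers, not real pipeline output): a wide
+ # 5-column results table whose single rows exceed max_tokens is rebuilt as
+ # two narrower tables, e.g. columns [Study, OS, PFS] then [Study, ORR, N],
+ # each repeating the first key column and tagged "[Table part i/2: ...]".
+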
427
+ def split_by_tokens(self, text: str, max_tokens: int, overlap_tokens: int = 0) -> List[str]:
428
+ """Split text into chunks by token count."""
429
+ sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])|\n\n+', text)
430
+
431
+ chunks = []
432
+ current_chunk = []
433
+ current_tokens = 0
434
+
435
+ for sentence in sentences:
436
+ sentence = sentence.strip()
437
+ if not sentence:
438
+ continue
439
+
440
+ sentence_tokens = self.count_tokens(sentence)
441
+
442
+ if sentence_tokens > max_tokens:
443
+ if current_chunk:
444
+ chunks.append(' '.join(current_chunk))
445
+ current_chunk = []
446
+ current_tokens = 0
447
+
448
+ word_chunks = self._split_by_words(sentence, max_tokens)
449
+ if len(word_chunks) > 1:
450
+ chunks.extend(word_chunks[:-1])
451
+ current_chunk = [word_chunks[-1]]
452
+ current_tokens = self.count_tokens(word_chunks[-1])
453
+ else:
454
+ chunks.extend(word_chunks)
455
+ continue
456
+
457
+ potential_tokens = current_tokens + sentence_tokens
458
+
459
+ if potential_tokens <= max_tokens:
460
+ current_chunk.append(sentence)
461
+ current_tokens = potential_tokens
462
+ else:
463
+ if current_chunk:
464
+ chunks.append(' '.join(current_chunk))
465
+
466
+ if overlap_tokens > 0 and current_chunk:
467
+ overlap_chunk = []
468
+ overlap_count = 0
469
+ for sent in reversed(current_chunk):
470
+ sent_tokens = self.count_tokens(sent)
471
+ if overlap_count + sent_tokens <= overlap_tokens:
472
+ overlap_chunk.insert(0, sent)
473
+ overlap_count += sent_tokens
474
+ else:
475
+ break
476
+ current_chunk = overlap_chunk
477
+ current_tokens = overlap_count
478
+ else:
479
+ current_chunk = []
480
+ current_tokens = 0
481
+
482
+ current_chunk.append(sentence)
483
+ current_tokens = current_tokens + sentence_tokens
484
+
485
+ if current_chunk:
486
+ chunks.append(' '.join(current_chunk))
487
+
488
+ return chunks
489
+
490
+ def _split_by_words(self, text: str, max_tokens: int) -> List[str]:
491
+ """Split text by words when sentences are too long."""
492
+ words = text.split()
493
+ chunks = []
494
+ current_chunk = []
495
+ current_tokens = 0
496
+
497
+ for word in words:
498
+ word_tokens = self.count_tokens(word + ' ')
499
+
500
+ if current_tokens + word_tokens <= max_tokens:
501
+ current_chunk.append(word)
502
+ current_tokens += word_tokens
503
+ else:
504
+ if current_chunk:
505
+ chunks.append(' '.join(current_chunk))
506
+ current_chunk = [word]
507
+ current_tokens = word_tokens
508
+
509
+ if current_chunk:
510
+ chunks.append(' '.join(current_chunk))
511
+
512
+ return chunks
513
+
514
+ def chunk_markdown(self, markdown_text: str, source_file: str = "unknown") -> List[Dict]:
515
+ """Chunk markdown with section awareness and table handling."""
516
+ header_splits = self.header_splitter.split_text(markdown_text)
517
+
518
+ final_chunks = []
519
+
520
+ for doc in header_splits:
521
+ section_metadata = self._extract_section_info(doc['metadata'])
522
+ is_table = doc.get('has_table', False)
523
+ token_count = self.count_tokens(doc['content'])
524
+
525
+ if token_count <= self.max_tokens:
526
+ final_chunks.append({
527
+ "text": doc['content'],
528
+ "metadata": {
529
+ **section_metadata,
530
+ "token_count": token_count,
531
+ "is_table": is_table,
532
+ "chunk_index": 0,
533
+ "total_chunks_in_section": 1,
534
+ "source_file": source_file
535
+ }
536
+ })
537
+ elif is_table and self.split_tables:
538
+ table_chunks = self.split_table_by_rows(doc['content'], self.max_tokens)
539
+ for idx, chunk_text in enumerate(table_chunks):
540
+ final_chunks.append({
541
+ "text": chunk_text,
542
+ "metadata": {
543
+ **section_metadata,
544
+ "token_count": self.count_tokens(chunk_text),
545
+ "is_table": True,
546
+ "chunk_index": idx,
547
+ "total_chunks_in_section": len(table_chunks),
548
+ "source_file": source_file
549
+ }
550
+ })
551
+ elif is_table:
552
+ # Keep table intact even if exceeds limit
553
+ final_chunks.append({
554
+ "text": doc['content'],
555
+ "metadata": {
556
+ **section_metadata,
557
+ "token_count": token_count,
558
+ "is_table": True,
559
+ "chunk_index": 0,
560
+ "total_chunks_in_section": 1,
561
+ "source_file": source_file,
562
+ "exceeds_limit": True
563
+ }
564
+ })
565
+ else:
566
+ sub_chunks = self.split_by_tokens(
567
+ doc['content'],
568
+ self.max_tokens,
569
+ self.chunk_overlap_tokens
570
+ )
571
+
572
+ for idx, chunk_text in enumerate(sub_chunks):
573
+ final_chunks.append({
574
+ "text": chunk_text,
575
+ "metadata": {
576
+ **section_metadata,
577
+ "token_count": self.count_tokens(chunk_text),
578
+ "is_table": False,
579
+ "chunk_index": idx,
580
+ "total_chunks_in_section": len(sub_chunks),
581
+ "source_file": source_file
582
+ }
583
+ })
584
+
585
+ return final_chunks
586
+
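+ # A produced chunk looks like this (illustrative values, not real output):
+ # {"text": "...", "metadata": {"h2": "Results", "section_type": "results",
+ #   "token_count": 212, "is_table": False, "chunk_index": 0,
+ #   "total_chunks_in_section": 3, "source_file": "1"}}
+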
587
+ def _extract_section_info(self, metadata: Dict) -> Dict:
588
+ """Extract section information from metadata."""
589
+ section_info = {}
590
+
591
+ for level in ['h1', 'h2', 'h3', 'h4']:
592
+ if level in metadata:
593
+ section_info[level] = metadata[level]
594
+
595
+ section_type = self._identify_section_type(metadata)
596
+ if section_type:
597
+ section_info['section_type'] = section_type
598
+
599
+ return section_info
600
+
601
+ def _identify_section_type(self, metadata: Dict) -> str:
602
+ """Identify section type based on header text."""
603
+ all_headers = ' '.join([
604
+ metadata.get('h1', ''),
605
+ metadata.get('h2', ''),
606
+ metadata.get('h3', ''),
607
+ metadata.get('h4', '')
608
+ ]).lower()
609
+
610
+ section_patterns = {
611
+ 'abstract': r'\babstract\b',
612
+ 'introduction': r'\bintroduction\b',
613
+ 'background': r'\bbackground\b',
614
+ 'literature_review': r'\bliterature review\b|\brelated work\b',
615
+ 'methodology': r'\bmethodology\b|\bmethods\b|\bmaterials and methods\b',
616
+ 'results': r'\bresults\b',
617
+ 'discussion': r'\bdiscussion\b',
618
+ 'conclusion': r'\bconclusion\b|\bconcluding remarks\b',
619
+ 'references': r'\breferences\b|\bbibliography\b',
620
+ 'appendix': r'\bappendix\b',
621
+ 'acknowledgments': r'\backnowledgments\b|\backnowledgements\b',
622
+ 'abbreviations': r'\babbreviations\b',
623
+ 'data_availability': r'\bdata availability\b'
624
+ }
625
+
626
+ for section_type, pattern in section_patterns.items():
627
+ if re.search(pattern, all_headers):
628
+ return section_type
629
+
630
+ return 'other'
631
+
632
+
633
+ # ========================================
634
+ # PIPELINE CLASS
635
+ # ========================================
636
+
637
+ class PDFToChromaDBPipeline:
638
+ """Complete pipeline from PDF to ChromaDB."""
639
+
640
+ def __init__(
641
+ self,
642
+ db_path: str = "./chroma_db",
643
+ collection_name: str = "research_papers",
644
+ embedding_model: str = "pritamdeka/S-PubMedBert-MS-MARCO",
645
+ max_tokens: int = 330,
646
+ chunk_overlap: int = 30,
647
+ papers_json: Optional[str] = None,
648
+ ):
649
+ """
650
+ Initialize the pipeline.
651
+
652
+ Args:
653
+ db_path: Path to ChromaDB database
654
+ collection_name: Name of ChromaDB collection
655
+ embedding_model: HuggingFace model for embeddings
656
+ max_tokens: Maximum tokens per chunk
657
+ chunk_overlap: Overlap between chunks
658
+ papers_json: Path to filename-keyed JSON with paper metadata.
659
+ Keys are PDF stems (no extension), e.g. {"1": {...}, "2": {...}}
660
+ """
661
+ self.db_path = db_path
662
+ self.collection_name = collection_name
663
+ self.embedding_model_name = embedding_model
664
+ self.max_tokens = max_tokens
665
+ self.chunk_overlap = chunk_overlap
666
+
667
+ # Load paper registry (filename stem -> paper info dict)
668
+ self.paper_registry = self._load_paper_registry(papers_json)
669
+
670
+ # Initialize components
671
+ self._init_docling()
672
+ self._init_chunker()
673
+ self._init_embedder()
674
+ self._init_chromadb()
675
+
676
+ def _load_paper_registry(self, papers_json: Optional[str]) -> Dict:
677
+ """Load the filename-keyed paper metadata JSON.
678
+
679
+ Args:
680
+ papers_json: Path to JSON file. Expected format:
681
+ { "<pdf_stem>": { <paper fields> }, ... }
682
+ Returns:
683
+ Dict mapping pdf stem -> paper info, or empty dict if not provided.
684
+ """
685
+ if not papers_json:
686
+ print("ℹ No papers JSON provided — paper metadata will not be attached to chunks.")
687
+ return {}
688
+
689
+ try:
690
+ with open(papers_json, 'r', encoding='utf-8') as f:
691
+ registry = json.load(f)
692
+ print(f"✓ Loaded paper registry from {papers_json} ({len(registry)} entries)")
693
+ return registry
694
+ except FileNotFoundError:
695
+ print(f"⚠ Papers JSON not found: {papers_json} — continuing without paper metadata.")
696
+ return {}
697
+ except json.JSONDecodeError as e:
698
+ print(f"⚠ Failed to parse papers JSON: {e} — continuing without paper metadata.")
699
+ return {}
700
+
701
+ def _get_paper_info(self, pdf_stem: str) -> Dict:
702
+ """Look up paper metadata by PDF filename stem.
703
+
704
+ Args:
705
+ pdf_stem: PDF filename without extension, e.g. '1' for '1.pdf'
706
+ Returns:
707
+ Paper info dict, or empty dict if not found.
708
+ """
709
+ info = self.paper_registry.get(pdf_stem, {})
710
+ if not info:
711
+ print(f" ⚠ No paper metadata found for '{pdf_stem}' in registry.")
712
+ return info
713
+
714
+ def _init_docling(self):
715
+ """Initialize docling converter."""
716
+ if not DOCLING_AVAILABLE:
717
+ raise ImportError("docling required. Install: pip install docling")
718
+ self.converter = DocumentConverter()
719
+ print("✓ Docling converter initialized")
720
+
721
+ def _init_chunker(self):
722
+ """Initialize chunker with tokenizer."""
723
+ self.chunker = SectionAwareChunker(
724
+ model_name=self.embedding_model_name,
725
+ max_tokens=self.max_tokens,
726
+ chunk_overlap_tokens=self.chunk_overlap,
727
+ split_tables=True
728
+ )
729
+ print("✓ Chunker initialized")
730
+
731
+ def _init_embedder(self):
732
+ """Initialize embedding model."""
733
+ if not SENTENCE_TRANSFORMERS_AVAILABLE:
734
+ raise ImportError("sentence-transformers required. Install: pip install sentence-transformers")
735
+
736
+ print(f"Loading embedding model: {self.embedding_model_name}")
737
+ self.embedding_model = SentenceTransformer(self.embedding_model_name)
738
+ print("✓ Embedding model loaded")
739
+
740
+ def _init_chromadb(self):
741
+ """Initialize ChromaDB client and collection."""
742
+ if not CHROMADB_AVAILABLE:
743
+ raise ImportError("chromadb required. Install: pip install chromadb")
744
+
745
+ # Create persistent client
746
+ self.chroma_client = chromadb.PersistentClient(path=self.db_path)
747
+
748
+ # Get or create collection
749
+ self.collection = self.chroma_client.get_or_create_collection(
750
+ name=self.collection_name,
751
+ metadata={"hnsw:space": "cosine"}
752
+ )
753
+ print(f"✓ ChromaDB initialized at {self.db_path}")
754
+ print(f"✓ Collection '{self.collection_name}' ready (existing docs: {self.collection.count()})")
755
+
756
+ def extract_pdf(self, pdf_path: str) -> str:
757
+ """Extract markdown from PDF using docling."""
758
+ print(f" Extracting PDF: {pdf_path}")
759
+ result = self.converter.convert(pdf_path)
760
+ markdown_text = result.document.export_to_markdown()
761
+ print(f" ✓ Extracted {len(markdown_text)} characters")
762
+ return markdown_text
763
+
764
+ def clean_markdown(self, markdown_text: str) -> str:
765
+ """Clean markdown using rag_md_cleaner."""
766
+ print(f" Cleaning markdown...")
767
+ cleaned = clean_markdown_for_rag(
768
+ markdown_text,
769
+ remove_tables=False, # Keep tables
770
+ remove_figures=True,
771
+ remove_references=True,
772
+ reference_mode="conservative",
773
+ remove_metadata=True,
774
+ )
775
+ print(f" ✓ Cleaned to {len(cleaned)} characters")
776
+ return cleaned
777
+
778
+ def chunk_text(self, text: str, source_file: str) -> List[Dict]:
779
+ """Chunk text with section awareness, attaching paper metadata."""
780
+ print(f" Chunking text...")
781
+ chunks = self.chunker.chunk_markdown(text, source_file=source_file)
782
+
783
+ # Attach paper metadata to every chunk
784
+ paper_info = self._get_paper_info(source_file)
785
+ for chunk in chunks:
786
+ chunk['metadata']['paper'] = paper_info
787
+
788
+ # Validation
789
+ max_tokens = max(c['metadata']['token_count'] for c in chunks)
790
+ table_chunks = sum(1 for c in chunks if c['metadata'].get('is_table'))
791
+
792
+ print(f" ✓ Created {len(chunks)} chunks")
793
+ # print(f" - Text chunks: {len(chunks) - table_chunks}")
794
+ # print(f" - Table chunks: {table_chunks}")
795
+ print(f" - Max tokens: {max_tokens}")
796
+ if paper_info:
797
+ print(f" - Paper: {paper_info.get('title', paper_info.get('reference_id', '?'))}")
798
+
799
+ return chunks
800
+
801
+ def embed_chunks(self, chunks: List[Dict]) -> List[List[float]]:
802
+ """Create embeddings for chunks."""
803
+ print(f" Embedding {len(chunks)} chunks...")
804
+ texts = [chunk['text'] for chunk in chunks]
805
+ embeddings = self.embedding_model.encode(
806
+ texts,
807
+ show_progress_bar=True,
808
+ normalize_embeddings=True
809
+ )
810
+ print(f" ✓ Created embeddings")
811
+ return embeddings.tolist()
812
+
813
+ def store_in_chromadb(
814
+ self,
815
+ chunks: List[Dict],
816
+ embeddings: List[List[float]],
817
+ pdf_filename: str
818
+ ):
819
+ """Store chunks and embeddings in ChromaDB."""
820
+ print(f" Storing in ChromaDB...")
821
+
822
+ # Prepare data
823
+ ids = [f"{pdf_filename}_{i}" for i in range(len(chunks))]
824
+ documents = [chunk['text'] for chunk in chunks]
825
+ metadatas = []
826
+
827
+ for chunk in chunks:
828
+ # Flatten metadata for ChromaDB (ChromaDB only accepts scalar values)
829
+ metadata = {
830
+ 'source_file': chunk['metadata'].get('source_file', ''),
831
+ 'section_type': chunk['metadata'].get('section_type', 'other'),
832
+ 'is_table': str(chunk['metadata'].get('is_table', False)),
833
+ 'token_count': chunk['metadata']['token_count'],
834
+ 'chunk_index': chunk['metadata']['chunk_index'],
835
+ 'timestamp': datetime.now().isoformat(),
836
+ }
837
+
838
+ # Add header hierarchy
839
+ for level in ['h1', 'h2', 'h3', 'h4']:
840
+ if level in chunk['metadata']:
841
+ metadata[level] = chunk['metadata'][level]
842
+
843
+ # Attach paper metadata — serialised as JSON string so ChromaDB accepts it.
844
+ # To retrieve: json.loads(chunk_metadata['paper'])
845
+ paper_info = chunk['metadata'].get('paper', {})
846
+ metadata['paper'] = json.dumps(paper_info) if paper_info else '{}'
847
+
848
+ # Also write each tag array as a flat pipe-delimited scalar field, since
849
+ # ChromaDB metadata values must be scalars.
850
+ # e.g. paper_tag_cancer = "NSCLC|Renal Cell Carcinoma|Bladder Cancer"
851
+ # Note: ChromaDB metadata where-clauses support only equality/comparison
+ # operators ($eq, $in, ...), so substring tag matching is done after
+ # retrieval, e.g. "NSCLC" in metadata["paper_tag_cancer"].split("|")
852
+ tags = paper_info.get('tags', {}) if paper_info else {}
853
+ for tag_key, tag_values in tags.items():
854
+ if isinstance(tag_values, list) and tag_values:
855
+ metadata[f'paper_tag_{tag_key}'] = '|'.join(str(v) for v in tag_values)
856
+
857
+ metadatas.append(metadata)
858
+
859
+ # Add to collection
860
+ self.collection.add(
861
+ ids=ids,
862
+ embeddings=embeddings,
863
+ documents=documents,
864
+ metadatas=metadatas
865
+ )
866
+
867
+ print(f" ✓ Stored {len(chunks)} chunks in ChromaDB")
868
+ print(f" ✓ Total documents in collection: {self.collection.count()}")
869
+
870
+ def process_pdf(self, pdf_path: str) -> Dict:
871
+ """Process a single PDF through the entire pipeline."""
872
+ pdf_filename = Path(pdf_path).stem
873
+ print(f"\n{'='*80}")
874
+ print(f"Processing: {pdf_filename}")
875
+ print(f"{'='*80}")
876
+
877
+ try:
878
+ # Extract
879
+ markdown = self.extract_pdf(pdf_path)
880
+
881
+ # Clean
882
+ cleaned = self.clean_markdown(markdown)
883
+
884
+ # Chunk (paper metadata is attached inside chunk_text)
885
+ chunks = self.chunk_text(cleaned, source_file=pdf_filename)
886
+
887
+ # with open("/content/files/chunks/" + pdf_filename + ".json", "w", encoding="utf-8") as f:
888
+ # json.dump(chunks, f, indent=2, ensure_ascii=False)
889
+ # print(f"\n✓ Saved {len(chunks)} chunks")
890
+
891
+ # Embed
892
+ embeddings = self.embed_chunks(chunks)
893
+
894
+ # Store
895
+ self.store_in_chromadb(chunks, embeddings, pdf_filename)
896
+
897
+ result = {
898
+ 'status': 'success',
899
+ 'pdf_file': pdf_filename,
900
+ 'num_chunks': len(chunks),
901
+ 'max_tokens': max(c['metadata']['token_count'] for c in chunks),
902
+ }
903
+
904
+ print(f"✓ Successfully processed {pdf_filename}")
905
+ return result
906
+
907
+ except Exception as e:
908
+ print(f"✗ Error processing {pdf_filename}: {str(e)}")
909
+ return {
910
+ 'status': 'error',
911
+ 'pdf_file': pdf_filename,
912
+ 'error': str(e)
913
+ }
914
+
915
+ def process_folder(self, input_folder: str) -> List[Dict]:
916
+ """Process all PDFs in a folder."""
917
+ pdf_files = list(Path(input_folder).glob("*.pdf"))
918
+
919
+ if not pdf_files:
920
+ print(f"No PDF files found in {input_folder}")
921
+ return []
922
+
923
+ print(f"\nFound {len(pdf_files)} PDF files to process")
924
+
925
+ results = []
926
+ for pdf_path in pdf_files:
927
+ result = self.process_pdf(str(pdf_path))
928
+ results.append(result)
929
+
930
+ # Summary
931
+ successful = sum(1 for r in results if r['status'] == 'success')
932
+ failed = len(results) - successful
933
+
934
+ print(f"\n{'='*80}")
935
+ print("PIPELINE SUMMARY")
936
+ print(f"{'='*80}")
937
+ print(f"Total PDFs: {len(results)}")
938
+ print(f"Successful: {successful}")
939
+ print(f"Failed: {failed}")
940
+
941
+ if successful > 0:
942
+ total_chunks = sum(r.get('num_chunks', 0) for r in results if r['status'] == 'success')
943
+ print(f"Total chunks created: {total_chunks}")
944
+ print(f"ChromaDB collection size: {self.collection.count()}")
945
+
946
+ return results
947
+
948
+ def query(
949
+ self,
950
+ query_text: str,
951
+ n_results: int = 5,
952
+ filter_section: Optional[str] = None
953
+ ) -> Dict:
954
+ """Query the ChromaDB collection."""
955
+ # Create query embedding
956
+ query_embedding = self.embedding_model.encode(
957
+ [query_text],
958
+ normalize_embeddings=True
959
+ )[0].tolist()
960
+
961
+ # Build filter
962
+ where = {}
963
+ if filter_section:
964
+ where['section_type'] = filter_section
965
+
966
+ # Query
967
+ results = self.collection.query(
968
+ query_embeddings=[query_embedding],
969
+ n_results=n_results,
970
+ where=where if where else None
971
+ )
972
+
973
+ return results
974
+
975
+
976
+ def main():
977
+ parser = argparse.ArgumentParser(
978
+ description='PDF to ChromaDB Pipeline with PubMedBERT'
979
+ )
980
+ parser.add_argument(
981
+ '--input-folder',
982
+ required=True,
983
+ help='Folder containing PDF files'
984
+ )
985
+ parser.add_argument(
986
+ '--db-path',
987
+ default='./chroma_db',
988
+ help='Path to ChromaDB database (default: ./chroma_db)'
989
+ )
990
+ parser.add_argument(
991
+ '--collection-name',
992
+ default='research_papers',
993
+ help='ChromaDB collection name (default: research_papers)'
994
+ )
995
+ parser.add_argument(
996
+ '--max-tokens',
997
+ type=int,
998
+ default=330,
999
+ help='Maximum tokens per chunk (default: 330)'
1000
+ )
1001
+ parser.add_argument(
1002
+ '--overlap',
1003
+ type=int,
1004
+ default=30,
1005
+ help='Chunk overlap in tokens (default: 30)'
1006
+ )
1007
+ parser.add_argument(
1008
+ '--embedding-model',
1009
+ default='pritamdeka/S-PubMedBert-MS-MARCO',
1010
+ help='HuggingFace embedding model'
1011
+ )
1012
+ parser.add_argument(
1013
+ '--papers-json',
1014
+ default=None,
1015
+ help='Path to filename-keyed paper metadata JSON (e.g. research_papers.json)'
1016
+ )
1017
+
1018
+ args = parser.parse_args()
1019
+
1020
+ # Initialize pipeline
1021
+ print("\nInitializing PDF to ChromaDB Pipeline...")
1022
+ print(f"{'='*80}")
1023
+
1024
+ pipeline = PDFToChromaDBPipeline(
1025
+ db_path=args.db_path,
1026
+ collection_name=args.collection_name,
1027
+ embedding_model=args.embedding_model,
1028
+ max_tokens=args.max_tokens,
1029
+ chunk_overlap=args.overlap,
1030
+ papers_json=args.papers_json,
1031
+ )
1032
+
1033
+ # Process folder
1034
+ results = pipeline.process_folder(args.input_folder)
1035
+
1036
+ # Save results log
1037
+ log_file = f"pipeline_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
1038
+ with open(log_file, 'w') as f:
1039
+ json.dump(results, f, indent=2)
1040
+ print(f"\n✓ Results saved to {log_file}")
1041
+
1042
+
1043
+ if __name__ == "__main__":
1044
+ main()
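
For orientation, a minimal usage sketch of this pipeline. The module name `pdf_to_chromadb` and all paths here are assumptions; adjust the import to wherever this file lives:

```python
# Minimal sketch; module name and paths are assumptions, not part of the repo.
import json

from pdf_to_chromadb import PDFToChromaDBPipeline  # hypothetical module name

pipeline = PDFToChromaDBPipeline(
    db_path="./chroma_db",
    collection_name="research_papers",
    papers_json="rag/research_papers.json",  # keys are PDF stems: "1" -> 1.pdf
)
pipeline.process_folder("./papers")

# Retrieve evidence, optionally restricted to a section type
results = pipeline.query(
    "Akkermansia and PD-1 response", n_results=5, filter_section="results"
)

for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    paper = json.loads(meta["paper"])  # paper metadata was stored as a JSON string
    print(paper.get("citation", "?"), "->", doc[:80])
```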
rag/rag_md_cleaner.py ADDED
@@ -0,0 +1,278 @@
1
+ """
2
+ rag_md_cleaner.py
3
+
4
+ Markdown cleaner optimized for PDF -> Markdown extraction
5
+ (docling style) intended for RAG ingestion.
6
+
7
+ Primary concerns addressed:
8
+ - HTML comments like <!-- image -->
9
+ - Ligature /uniFB01 /uniFB02 /uniFB03 artifacts and unicode normalization
10
+ - Broken hyphen spacing "immune - related" => "immune-related"
11
+ - Standalone pipe lines " | "
12
+ - Tables (optional removal)
13
+ - Figure captions (optional removal)
14
+ - Reference removal (explicit headers, then conservative/aggressive heuristics)
15
+ - Removal of common metadata sections (Funding, Author contributions, Conflict of interest, Publisher's note)
16
+ """
17
+
18
+ import re
19
+ import unicodedata
21
+
22
+ def normalize_unicode(text: str) -> str:
23
+ text = unicodedata.normalize("NFKC", text)
24
+ # common PDF extraction broken ligature tokens and odd markers
25
+ ligature_map = {
26
+ "/uniFB01": "fi",
27
+ "/uniFB02": "fl",
28
+ "/uniFB03": "ffi",
29
+ "/uniFB04": "ffl",
30
+ "\ufb01": "fi",
31
+ "\ufb02": "fl",
32
+ }
33
+ for k, v in ligature_map.items():
34
+ text = text.replace(k, v)
35
+ return text
36
+
37
+ def is_reference_like_line(line: str) -> bool:
38
+ s = line.strip()
39
+ if not s:
40
+ return False
41
+
42
+ # common signals of a reference line
43
+ patterns = [
44
+ r"\bdoi\s*:\s*10\.",
45
+ r"\bdoi\.?/?10\.",
46
+ r"\(\s*\d{4}\s*\)",
47
+ r"^\s*\d+\.\s+",
48
+ r"\bet al\.",
49
+ r"\bPMID\b|\bPMC\b",
50
+ ]
51
+ for p in patterns:
52
+ if re.search(p, s, flags=re.IGNORECASE):
53
+ return True
54
+
55
+ comma_count = s.count(",")
56
+ if comma_count >= 3 and len(s) < 300:
57
+ # heuristic: author lines usually have at least one capitalized surname-like token
58
+ if re.search(r"[A-Z][a-z]{2,}\s+[A-Z]\b", s) or re.search(r"[A-Z][a-z]{2,},\s+[A-Z]", s):
59
+ return True
60
+
61
+ # journal-like end pattern: volume:pages or year;volume:pages
62
+ if re.search(r"\d{4}\).*?\d{1,4}[:](\d|–|-)", s) or re.search(r"\b\d{1,4}:\d{1,4}\b", s):
63
+ return True
64
+
65
+ return False
66
+
67
+
68
+
69
+ def remove_references_section(
70
+ text: str,
71
+ mode: str = "conservative",
72
+ consecutive_threshold_conservative: int = 5,
73
+ consecutive_threshold_aggressive: int = 3,
74
+ window_tail_fraction: float = 0.35,
75
+ ) -> str:
76
+ """
77
+ Remove references from text.
78
+
79
+ Parameters
80
+ ----------
81
+ text : str
82
+ Raw markdown text
83
+ mode : str
84
+ "conservative" (safer, fewer false positives) or "aggressive" (more likely to remove refs)
85
+ consecutive_threshold_conservative : int
86
+ number of consecutive ref-like lines required for conservative mode
87
+ consecutive_threshold_aggressive : int
88
+ threshold for aggressive mode
89
+ window_tail_fraction : float
90
+ fraction of document considered the "tail" for additional detection
91
+
92
+ Returns
93
+ -------
94
+ str
95
+ Text trimmed before detected references (or original if none detected)
96
+ """
97
+
98
+ # 1) Try explicit headers
99
+ header_regexes = [
100
+ r"\n##\s*REFERENCES\b", r"\n##\s*References\b", r"\nREFERENCES\b", r"\nReferences\b"
101
+ ]
102
+ for hdr in header_regexes:
103
+ m = re.search(hdr, text, flags=re.IGNORECASE)
104
+ if m:
105
+ return text[: m.start()]
106
+
107
+ # 2) Heuristic scan for consecutive reference-like lines
108
+ lines = text.splitlines()
109
+ threshold = consecutive_threshold_conservative if mode == "conservative" else consecutive_threshold_aggressive
110
+
111
+ consecutive = 0
112
+ for idx, line in enumerate(lines):
113
+ if is_reference_like_line(line):
114
+ consecutive += 1
115
+ else:
116
+ consecutive = 0
117
+
118
+ if consecutive >= threshold:
119
+ # find the first line index where this consecutive block started
120
+ start_idx = idx - consecutive + 1
121
+ # return up to before that block
122
+ return "\n".join(lines[:start_idx]).rstrip()
123
+
124
+ # 3) Tail-window detection: maybe refs are at the end but not consecutive enough earlier
125
+ n_lines = len(lines)
126
+ tail_start = int(n_lines * (1.0 - window_tail_fraction))
127
+ tail = lines[tail_start:]
128
+ # compute fraction of ref-like lines in tail
129
+ ref_like_count = sum(1 for L in tail if is_reference_like_line(L))
130
+ if len(tail) > 10 and (ref_like_count / len(tail)) > 0.25:
131
+ # find first ref-like line in tail and cut before it
132
+ for i, L in enumerate(tail):
133
+ if is_reference_like_line(L):
134
+ return "\n".join(lines[: tail_start + i]).rstrip()
135
+
136
+ # 4) No reliable reference block found -> return original
137
+ return text
138
+
139
+
140
+
141
+ def remove_metadata_sections(text: str) -> str:
142
+ """
143
+ Remove common metadata sections by header names.
144
+ Cuts text at the first occurrence of any of these headers.
145
+ """
146
+ meta_headers = [
147
+ "AUTHOR CONTRIBUTIONS",
148
+ "Author contributions",
149
+ "FUNDING",
150
+ "Funding",
151
+ "CONFLICT OF INTEREST",
152
+ "CONFLICT OF INTEREST STATEMENT",
153
+ "Conflict of interest",
154
+ "CONFLICT OF INTEREST STATEMENT",
155
+ "Publisher's note",
156
+ "Publisher ' s note",
157
+ "Publisher?s note",
158
+ "ACKNOWLEDGMENTS",
159
+ "Acknowledgments",
160
+ "DATA AVAILABILITY STATEMENT",
161
+ "Data availability statement",
162
+ "ORCID",
163
+ "DATA AVAILABILITY",
164
+ "Data Availability",
165
+ ]
166
+ pattern = r"\n(?:" + "|".join([re.escape(h) for h in meta_headers]) + r")\b"
167
+ m = re.search(pattern, text, flags=re.IGNORECASE)
168
+ if m:
169
+ return text[: m.start()].rstrip()
170
+ return text
171
+
172
+
173
+ # ---------------------------
174
+ # Main cleaning function
175
+ # ---------------------------
176
+ def clean_markdown_for_rag(
177
+ markdown_text: str,
178
+ remove_tables: bool = False,
179
+ remove_figures: bool = True,
180
+ remove_references: bool = True,
181
+ reference_mode: str = "conservative",
182
+ remove_metadata: bool = True,
183
+ collapse_multiblank: bool = True,
184
+ ) -> str:
185
+ """
186
+ Clean markdown text extracted from PDFs to produce RAG-friendly text.
187
+
188
+ Parameters
189
+ ----------
190
+ markdown_text : str
191
+ Raw markdown
192
+ remove_tables : bool
193
+ Remove markdown tables (default False).
194
+ remove_figures : bool
195
+ Remove figure captions / figure blocks (default True)
196
+ remove_references : bool
197
+ Attempt to remove references (default True)
198
+ reference_mode : str
199
+ 'conservative' or 'aggressive'
200
+ remove_metadata : bool
201
+ Remove author contributions/funding/conflict etc (default True)
202
+ collapse_multiblank : bool
203
+ Collapse >2 newlines to 2 newlines (default True)
204
+
205
+ Returns
206
+ -------
207
+ str
208
+ Cleaned markdown
209
+ """
210
+ text = markdown_text
211
+
212
+ # normalize unicode & ligatures
213
+ text = normalize_unicode(text)
214
+
215
+ # remove HTML comments like <!-- image -->
216
+ text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
217
+
218
+ # remove explicit image placeholders (variants)
219
+ text = re.sub(r"<\s*--\s*image\s*--\s*>", "", text, flags=re.IGNORECASE)
220
+ text = re.sub(r"\[image:\s*.*?\]", "", text, flags=re.IGNORECASE)
221
+
222
+ # remove standalone pipes lines
223
+ text = re.sub(r"^\s*\|\s*$", "", text, flags=re.MULTILINE)
224
+
225
+ # Optionally remove markdown tables (entire blocks). Conservative removal:
226
+ if remove_tables:
227
+ # Remove contiguous table-like blocks beginning and ending with pipes or table dividers
228
+ text = re.sub(
229
+ r"\n(?:\|[^\n]*\|\s*\n(?:\|[-:\s|]+\|\s*\n)?(?:\|[^\n]*\|\s*\n)+)",
230
+ "\n",
231
+ text,
232
+ flags=re.DOTALL,
233
+ )
234
+ # Also remove inline table fragments
235
+ text = re.sub(r"\n\|[-:\s|]+\|\n", "\n", text)
236
+
237
+ else:
238
+ # clean duplicate pipes
239
+ text = re.sub(r"\|{2,}", "|", text)
240
+
241
+ # Remove figure captions/blocks like FIGURE 1 ... Figure 1: ... or Fig. 1
242
+ if remove_figures:
243
+ text = re.sub(r"(?is)\bFIGURE\s*\d+[:.\s\S]*?(?=\n##|\n[A-Z]{2,}|$)", "", text)
244
+ text = re.sub(r"(?is)\bFigure\s*\d+[:.\s\S]*?(?=\n##|\n[A-Z]{2,}|$)", "", text)
245
+ text = re.sub(r"(?is)\bFig\.\s*\d+[:.\s\S]*?(?=\n##|\n[A-Z]{2,}|$)", "", text)
246
+
247
+ # Fix hyphen spacing artifacts: "immune - related" -> "immune-related"
248
+ text = re.sub(r"\s-\s+", "-", text)
249
+
250
+ # Fix spaced punctuation: "word ," -> "word,"
251
+ text = re.sub(r"\s+([,.;:])", r"\1", text)
252
+
253
+ # Remove common publisher footer / 'Downloaded from...' blocks by looking for typical phrases
254
+ text = re.sub(
255
+ r"Downloaded from .*?Terms and Conditions.*?(?:\n|$)",
256
+ "",
257
+ text,
258
+ flags=re.IGNORECASE | re.DOTALL,
259
+ )
260
+ # generic residual footer lines with DOI-like trailing
261
+ text = re.sub(r"\n\d{6,}x?,?\s*\d{4}.*$", "", text, flags=re.DOTALL)
262
+
263
+ # Remove references (robust)
264
+ if remove_references:
265
+ text = remove_references_section(text, mode=reference_mode)
266
+
267
+ # Remove metadata sections (Funding, Author contributions, Conflicts, ORCID, etc.)
268
+ if remove_metadata:
269
+ text = remove_metadata_sections(text)
270
+
271
+ # Collapse excessive blank lines
272
+ if collapse_multiblank:
273
+ text = re.sub(r"\n{3,}", "\n\n", text)
274
+
275
+ # Trim leading/trailing whitespace
276
+ text = text.strip()
277
+
278
+ return text
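
As a quick end-to-end illustration of the cleaner (a sketch; the sample string is invented to exercise the main fixes):

```python
# Sketch only; the input below is synthetic.
from rag_md_cleaner import clean_markdown_for_rag

raw = (
    "<!-- image -->\n"
    "The e/uniFB03cacy of immune - related biomarkers was signi\ufb01cant.\n\n"
    "FUNDING\nThis work was supported by grant XYZ."
)
print(clean_markdown_for_rag(raw))
# -> "The efficacy of immune-related biomarkers was significant."
```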
rag/requirements.txt ADDED
@@ -0,0 +1,17 @@
1
+ # PDF to ChromaDB Pipeline Requirements
2
+ # For research paper processing with PubMedBERT embeddings
3
+
4
+ # Core dependencies
5
+ docling>=1.0.0 # PDF extraction
6
+ transformers>=4.40.0 # PubMedBERT tokenizer (aligned with root)
7
+ torch>=2.0.0 # Required by transformers
8
+ sentence-transformers>=2.5.0 # Embedding generation (aligned with root)
9
+ chromadb>=0.4.0 # Vector database
10
+
11
+ # Optional but recommended
12
+ numpy>=1.24.0 # Array operations
13
+ tqdm>=4.65.0 # Progress bars (used by sentence-transformers)
14
+
15
+ # For development/testing
16
+ ipython>=8.0.0 # Interactive shell
17
+ jupyter>=1.0.0 # Notebook support
rag/research_papers.json ADDED
@@ -0,0 +1,555 @@
1
+ {
2
+ "1": {
3
+ "reference_id": "25",
4
+ "citation": "Takada et al., Int J Cancer 2021",
5
+ "title": "Clinical impact of probiotics on the efficacy of anti-PD-1 monotherapy in patients with NSCLC",
6
+ "year": 2021,
7
+ "tags": {
8
+ "treatment": [
9
+ "PD-1/PD-L1 Blockade"
10
+ ],
11
+ "cancer": [
12
+ "NSCLC"
13
+ ],
14
+ "biology": [
15
+ "Gut microbiome composition",
16
+ "Alpha diversity"
17
+ ],
18
+ "intervention": [
19
+ "Probiotics"
20
+ ]
21
+ }
22
+ },
23
+ "2": {
24
+ "reference_id": "2",
25
+ "citation": "Maynard et al., Nature 2012",
26
+ "title": "Reciprocal interactions of the intestinal microbiota and immune system",
27
+ "year": 2012,
28
+ "tags": {
29
+ "treatment": [],
30
+ "cancer": [],
31
+ "biology": [
32
+ "Innate immunity",
33
+ "Adaptive immunity",
34
+ "T cell-microbiota interactions",
35
+ "Immune homeostasis"
36
+ ],
37
+ "intervention": [
38
+ "Review article"
39
+ ]
40
+ }
41
+ },
42
+ "3": {
43
+ "reference_id": "3",
44
+ "citation": "Dzutsev et al., Annu Rev Immunol 2017",
45
+ "title": "Microbes and cancer",
46
+ "year": 2017,
47
+ "tags": {
48
+ "treatment": [],
49
+ "cancer": [
50
+ "General"
51
+ ],
52
+ "biology": [
53
+ "Tumorigenesis",
54
+ "Microbiome-cancer interactions",
55
+ "Inflammation"
56
+ ],
57
+ "intervention": [
58
+ "Review article"
59
+ ]
60
+ }
61
+ },
62
+ "4": {
63
+ "reference_id": "4",
64
+ "citation": "Zheng et al., Cell Res 2020",
65
+ "title": "Interaction between microbiota and immunity in health and disease",
66
+ "year": 2020,
67
+ "tags": {
68
+ "treatment": [],
69
+ "cancer": [],
70
+ "biology": [
71
+ "Immune modulation",
72
+ "Inflammation",
73
+ "Host-microbiota interactions"
74
+ ],
75
+ "intervention": [
76
+ "Review article"
77
+ ]
78
+ }
79
+ },
80
+ "5": {
81
+ "reference_id": "5",
82
+ "citation": "Cullin et al., Cancer Cell 2021",
83
+ "title": "Microbiome and cancer",
84
+ "year": 2021,
85
+ "tags": {
86
+ "treatment": [],
87
+ "cancer": [
88
+ "General"
89
+ ],
90
+ "biology": [
91
+ "Tumor microenvironment",
92
+ "Microbiome-cancer interactions",
93
+ "Immune regulation"
94
+ ],
95
+ "intervention": [
96
+ "Review article"
97
+ ]
98
+ }
99
+ },
100
+ "44": {
101
+ "reference_id": "44",
102
+ "citation": "Han et al., Nat Biomed Eng 2021",
103
+ "title": "Generation of systemic anti-tumour immunity via the in situ modulation of the gut microbiome by an orally administered inulin gel",
104
+ "year": 2021,
105
+ "tags": {
106
+ "treatment": [
107
+ "PD-1/PD-L1 Blockade"
108
+ ],
109
+ "cancer": [
110
+ "Preclinical tumor models"
111
+ ],
112
+ "biology": [
113
+ "Memory T cells",
114
+ "Microbiome modulation"
115
+ ],
116
+ "intervention": [
117
+ "Prebiotics",
118
+ "Inulin"
119
+ ]
120
+ }
121
+ },
122
+ "46": {
123
+ "reference_id": "46",
124
+ "citation": "Wastyk et al., Cell 2021",
125
+ "title": "Gut-microbiota-targeted diets modulate human immune status",
126
+ "year": 2021,
127
+ "tags": {
128
+ "treatment": [],
129
+ "cancer": [],
130
+ "biology": [
131
+ "Fiber",
132
+ "Fermented foods",
133
+ "Microbial diversity",
134
+ "Inflammatory markers"
135
+ ],
136
+ "intervention": [
137
+ "Diet",
138
+ "High-fiber diet",
139
+ "Fermented food diet"
140
+ ]
141
+ }
142
+ },
143
+ "47": {
144
+ "reference_id": "47",
145
+ "citation": "Huang et al., Gut 2022",
146
+ "title": "Ginseng polysaccharides alter the gut microbiota and kynurenine/tryptophan ratio, potentiating anti-PD-1/PD-L1 immunotherapy",
147
+ "year": 2022,
148
+ "tags": {
149
+ "treatment": [
150
+ "PD-1/PD-L1 Blockade"
151
+ ],
152
+ "cancer": [
153
+ "Preclinical tumor models"
154
+ ],
155
+ "biology": [
156
+ "Tryptophan metabolism",
157
+ "Kynurenine pathway",
158
+ "Microbiome modulation"
159
+ ],
160
+ "intervention": [
161
+ "Prebiotics",
162
+ "Polysaccharides"
163
+ ]
164
+ }
165
+ },
168
+ "6": {
169
+ "reference_id": "6",
170
+ "citation": "Routy et al., Science 2018",
171
+ "title": "Gut microbiome influences efficacy of PD-1-based immunotherapy against epithelial tumors",
172
+ "year": 2018,
173
+ "tags": {
174
+ "treatment": ["PD-1/PD-L1 Blockade"],
175
+ "cancer": ["NSCLC", "RCC"],
176
+ "biology": ["Akkermansia muciniphila", "Alpha diversity", "FMT validation"],
177
+ "intervention": ["Fecal microbiota transplantation"]
178
+ }
179
+ },
180
+ "8": {
181
+ "reference_id": "8",
182
+ "citation": "Matson et al., Science 2018",
183
+ "title": "The commensal microbiome is associated with anti-PD-1 efficacy in metastatic melanoma patients",
184
+ "year": 2018,
185
+ "tags": {
186
+ "treatment": ["PD-1/PD-L1 Blockade"],
187
+ "cancer": ["Melanoma"],
188
+ "biology": ["Bifidobacterium", "Faecalibacterium", "FMT validation"],
189
+ "intervention": ["Fecal microbiota transplantation"]
190
+ }
191
+ },
192
+ "9": {
193
+ "reference_id": "9",
194
+ "citation": "Jin et al., J Thorac Oncol 2019",
195
+ "title": "The diversity of gut microbiome is associated with favorable responses to anti-PD-1 immunotherapy in Chinese NSCLC patients",
196
+ "year": 2019,
197
+ "tags": {
198
+ "treatment": ["PD-1/PD-L1 Blockade"],
199
+ "cancer": ["NSCLC"],
200
+ "biology": ["Alpha diversity", "Alistipes", "Prevotella"],
201
+ "intervention": []
202
+ }
203
+ },
204
+ "10": {
205
+ "reference_id": "10",
206
+ "citation": "Lee et al., Nat Med 2022",
207
+ "title": "Cross-cohort gut microbiome associations with immune checkpoint inhibitor response in advanced melanoma",
208
+ "year": 2022,
209
+ "tags": {
210
+ "treatment": ["PD-1/PD-L1 Blockade"],
211
+ "cancer": ["Melanoma"],
212
+ "biology": ["Roseburia", "Akkermansia", "Bifidobacterium", "Meta-analysis"],
213
+ "intervention": []
214
+ }
215
+ },
216
+ "11": {
217
+ "reference_id": "11",
218
+ "citation": "Smith et al., Nat Med 2022",
219
+ "title": "Gut microbiome correlates of response and toxicity following anti-CD19 CAR T cell therapy",
220
+ "year": 2022,
221
+ "tags": {
222
+ "treatment": ["CAR-T"],
223
+ "cancer": ["B-cell lymphoma"],
224
+ "biology": ["Ruminococcus", "Bacteroides", "Faecalibacterium", "Cytokine release syndrome"],
225
+ "intervention": []
226
+ }
227
+ },
228
+ "12": {
229
+ "reference_id": "12",
230
+ "citation": "Stein-Thoeringer et al., Nat Med 2023",
231
+ "title": "A non-antibiotic-disrupted gut microbiome is associated with clinical responses to CD19-CAR-T cell cancer immunotherapy",
232
+ "year": 2023,
233
+ "tags": {
234
+ "treatment": ["CAR-T"],
235
+ "cancer": ["Lymphoma"],
236
+ "biology": ["Akkermansia", "Ruminococcus lactaris", "Alpha diversity"],
237
+ "intervention": ["Antibiotic exposure"]
238
+ }
239
+ },
240
+ "13": {
241
+ "reference_id": "13",
242
+ "citation": "Hu et al., Nat Commun 2022",
243
+ "title": "CAR-T cell therapy-related cytokine release syndrome and therapeutic response is modulated by the gut microbiome in hematologic malignancies",
244
+ "year": 2022,
245
+ "tags": {
246
+ "treatment": ["CAR-T"],
247
+ "cancer": ["Hematologic malignancies"],
248
+ "biology": ["Faecalibacterium", "Roseburia", "Cytokine release syndrome"],
249
+ "intervention": []
250
+ }
251
+ },
252
+ "15": {
253
+ "reference_id": "15",
254
+ "citation": "Luu et al., Nat Commun 2021",
255
+ "title": "Microbial short-chain fatty acids modulate CD8+ T cell responses and improve adoptive immunotherapy for cancer",
256
+ "year": 2021,
257
+ "tags": {
258
+ "treatment": ["ACT"],
259
+ "cancer": ["Preclinical tumor models"],
260
+ "biology": ["SCFAs", "Butyrate", "CD8+ T cells"],
261
+ "intervention": ["Short-chain fatty acids supplementation"]
262
+ }
263
+ },
264
+ "16": {
265
+ "reference_id": "16",
266
+ "citation": "He et al., Cell Metab 2021",
267
+ "title": "Gut microbial metabolites facilitate anticancer therapy efficacy by modulating cytotoxic CD8+ T cell immunity",
268
+ "year": 2021,
269
+ "tags": {
270
+ "treatment": ["Immunotherapy"],
271
+ "cancer": ["Preclinical tumor models"],
272
+ "biology": ["SCFAs", "Microbial metabolites", "CD8+ T cells"],
273
+ "intervention": []
274
+ }
275
+ },
276
+ "18": {
277
+ "reference_id": "18",
278
+ "citation": "Paik et al., Nature 2022",
279
+ "title": "Human gut bacteria produce TH17-modulating bile acid metabolites",
280
+ "year": 2022,
281
+ "tags": {
282
+ "treatment": [],
283
+ "cancer": [],
284
+ "biology": ["Bile acids", "Th17 cells", "Microbial metabolites"],
285
+ "intervention": []
286
+ }
287
+ },
288
+ "19": {
289
+ "reference_id": "19",
290
+ "citation": "Bender et al., Cell 2023",
291
+ "title": "Dietary tryptophan metabolite released by intratumoral Lactobacillus reuteri facilitates immune checkpoint inhibitor treatment",
292
+ "year": 2023,
293
+ "tags": {
294
+ "treatment": ["PD-1/PD-L1 Blockade"],
295
+ "cancer": ["Melanoma"],
296
+ "biology": ["Tryptophan metabolism", "Indole derivatives", "Lactobacillus reuteri"],
297
+ "intervention": ["Dietary tryptophan modulation"]
298
+ }
299
+ },
300
+ "20": {
301
+ "reference_id": "20",
302
+ "citation": "Hezaveh et al., Immunity 2022",
303
+ "title": "Tryptophan-derived microbial metabolites activate the aryl hydrocarbon receptor in tumor-associated macrophages to suppress anti-tumor immunity",
304
+ "year": 2022,
305
+ "tags": {
306
+ "treatment": [],
307
+ "cancer": ["Preclinical tumor models"],
308
+ "biology": ["Tryptophan metabolism", "Aryl hydrocarbon receptor", "Tumor-associated macrophages"],
309
+ "intervention": []
310
+ }
311
+ },
312
+ "23": {
313
+ "reference_id": "23",
314
+ "citation": "McCulloch et al., Nat Med 2022",
315
+ "title": "Intestinal microbiota signatures of clinical response and immune-related adverse events in melanoma patients treated with anti-PD-1",
316
+ "year": 2022,
317
+ "tags": {
318
+ "treatment": ["PD-1/PD-L1 Blockade"],
319
+ "cancer": ["Melanoma"],
320
+ "biology": ["Immune-related adverse events", "Bacteroides", "Microbiome composition"],
321
+ "intervention": []
322
+ }
323
+ },
324
+ "24": {
325
+ "reference_id": "24",
326
+ "citation": "Derosa et al., Nat Med 2022",
327
+ "title": "Intestinal Akkermansia muciniphila predicts clinical response to PD-1 blockade in patients with advanced NSCLC",
328
+ "year": 2022,
329
+ "tags": {
330
+ "treatment": ["PD-1/PD-L1 Blockade"],
331
+ "cancer": ["NSCLC"],
332
+ "biology": ["Akkermansia muciniphila", "Antibiotic exposure", "Microbiome composition"],
333
+ "intervention": ["Antibiotics"]
334
+ }
335
+ },
336
+ "26": {
337
+ "reference_id": "26",
338
+ "citation": "Zheng et al., J Immunother Cancer 2019",
339
+ "title": "Gut microbiome affects the response to anti-PD-1 immunotherapy in patients with HCC",
340
+ "year": 2019,
341
+ "tags": {
342
+ "treatment": ["PD-1/PD-L1 Blockade"],
343
+ "cancer": ["HCC"],
344
+ "biology": ["Akkermansia", "Ruminococcaceae", "Microbiome composition"],
345
+ "intervention": []
346
+ }
347
+ },
348
+ "27": {
349
+ "reference_id": "27",
350
+ "citation": "Lee et al., J Immunother Cancer 2022",
351
+ "title": "Gut microbiota and metabolites associate with outcomes of immune checkpoint inhibitor-treated unresectable HCC",
352
+ "year": 2022,
353
+ "tags": {
354
+ "treatment": ["PD-1/PD-L1 Blockade"],
355
+ "cancer": ["HCC"],
356
+ "biology": ["Microbial metabolites", "Lachnospiraceae", "Metabolomics"],
357
+ "intervention": []
358
+ }
359
+ },
360
+ "28": {
361
+ "reference_id": "28",
362
+ "citation": "Limeta et al., JCI Insight 2020",
363
+ "title": "Meta-analysis of the gut microbiota in predicting response to cancer immunotherapy in metastatic melanoma",
364
+ "year": 2020,
365
+ "tags": {
366
+ "treatment": ["PD-1/PD-L1 Blockade"],
367
+ "cancer": ["Melanoma"],
368
+ "biology": ["Meta-analysis", "Microbiome composition", "Biomarker discovery"],
369
+ "intervention": []
370
+ }
371
+ },
372
+ "29": {
373
+ "reference_id": "29",
374
+ "citation": "Spencer et al., Science 2021",
375
+ "title": "Dietary fiber and probiotics influence the gut microbiome and melanoma immunotherapy response",
376
+ "year": 2021,
377
+ "tags": {
378
+ "treatment": ["PD-1/PD-L1 Blockade"],
379
+ "cancer": ["Melanoma"],
380
+ "biology": ["Dietary fiber", "Microbiome diversity", "Immune modulation"],
381
+ "intervention": ["Diet", "Probiotics"]
382
+ }
383
+ },
384
+ "30": {
385
+ "reference_id": "30",
386
+ "citation": "Simpson et al., Nat Med 2022",
387
+ "title": "Diet-driven microbial ecology underpins associations between cancer immunotherapy outcomes and the gut microbiome",
388
+ "year": 2022,
389
+ "tags": {
390
+ "treatment": ["PD-1/PD-L1 Blockade"],
391
+ "cancer": ["Melanoma"],
392
+ "biology": ["Dietary fiber", "Microbial ecology", "Microbiome composition"],
393
+ "intervention": ["Diet"]
394
+ }
395
+ },
396
+ "31": {
397
+ "reference_id": "31",
398
+ "citation": "Paulos et al., J Clin Invest 2007",
399
+ "title": "Microbial translocation augments the function of adoptively transferred self/tumor-specific CD8+ T cells via TLR4 signaling",
400
+ "year": 2007,
401
+ "tags": {
402
+ "treatment": ["ACT"],
403
+ "cancer": ["Preclinical tumor models"],
404
+ "biology": ["TLR4 signaling", "LPS", "CD8+ T cells"],
405
+ "intervention": []
406
+ }
407
+ },
408
+ "32": {
409
+ "reference_id": "32",
410
+ "citation": "Uribe-Herranz et al., JCI Insight 2018",
411
+ "title": "Gut microbiota modulates adoptive cell therapy via CD8α dendritic cells and IL-12",
412
+ "year": 2018,
413
+ "tags": {
414
+ "treatment": ["ACT"],
415
+ "cancer": ["Preclinical tumor models"],
416
+ "biology": ["Dendritic cells", "IL-12", "Antibiotic exposure"],
417
+ "intervention": ["Vancomycin"]
418
+ }
419
+ },
420
+ "33": {
421
+ "reference_id": "33",
422
+ "citation": "Luu et al., Sci Rep 2018",
423
+ "title": "Regulation of the effector function of CD8+ T cells by gut microbiota-derived metabolite butyrate",
424
+ "year": 2018,
425
+ "tags": {
426
+ "treatment": ["ACT"],
427
+ "cancer": ["Preclinical tumor models"],
428
+ "biology": ["Butyrate", "HDAC inhibition", "CD8+ T cells"],
429
+ "intervention": ["Short-chain fatty acids supplementation"]
430
+ }
431
+ },
432
+ "34": {
433
+ "reference_id": "34",
434
+ "citation": "Yang et al., Oncoimmunology 2021",
435
+ "title": "Blood microbiota diversity determines response of advanced CRC to chemotherapy combined with adoptive T cell immunotherapy",
436
+ "year": 2021,
437
+ "tags": {
438
+ "treatment": ["ACT"],
439
+ "cancer": ["CRC"],
440
+ "biology": ["Blood microbiome", "Bifidobacterium", "Microbial diversity"],
441
+ "intervention": ["Chemotherapy"]
442
+ }
443
+ },
444
+ "37": {
445
+ "reference_id": "37",
446
+ "citation": "Derosa et al., Ann Oncol 2018",
447
+ "title": "Negative association of antibiotics on clinical activity of immune checkpoint inhibitors in patients with advanced RCC and NSCLC",
448
+ "year": 2018,
449
+ "tags": {
450
+ "treatment": ["PD-1/PD-L1 Blockade"],
451
+ "cancer": ["RCC", "NSCLC"],
452
+ "biology": ["Microbiome disruption"],
453
+ "intervention": ["Antibiotics"]
454
+ }
455
+ },
456
+ "38": {
457
+ "reference_id": "38",
458
+ "citation": "Wilson et al., Cancer Immunol Immunother 2020",
459
+ "title": "The effect of antibiotics on clinical outcomes in immune-checkpoint blockade: a systematic review and meta-analysis",
460
+ "year": 2020,
461
+ "tags": {
462
+ "treatment": ["ICI"],
463
+ "cancer": ["Multiple cancers"],
464
+ "biology": ["Meta-analysis", "Microbiome disruption"],
465
+ "intervention": ["Antibiotics"]
466
+ }
467
+ },
468
+ "39": {
469
+ "reference_id": "39",
470
+ "citation": "Peiffer et al., Neoplasia 2022",
471
+ "title": "Composition of gastrointestinal microbiota in association with treatment response in individuals with metastatic CRPC receiving pembrolizumab",
472
+ "year": 2022,
473
+ "tags": {
474
+ "treatment": ["PD-1/PD-L1 Blockade"],
475
+ "cancer": ["Prostate cancer"],
476
+ "biology": ["Microbiome composition", "Treatment response"],
477
+ "intervention": ["Antibiotics"]
478
+ }
479
+ },
480
+ "40": {
481
+ "reference_id": "40",
482
+ "citation": "Elkrief et al., Oncoimmunology 2019",
483
+ "title": "Antibiotics are associated with decreased PFS in advanced melanoma patients treated with ICI",
484
+ "year": 2019,
485
+ "tags": {
486
+ "treatment": ["ICI"],
487
+ "cancer": ["Melanoma"],
488
+ "biology": ["Microbiome disruption", "Progression-free survival"],
489
+ "intervention": ["Antibiotics"]
490
+ }
491
+ },
492
+ "41": {
493
+ "reference_id": "41",
494
+ "citation": "Chalabi et al., Ann Oncol 2020",
495
+ "title": "Efficacy of chemotherapy and atezolizumab in patients with NSCLC receiving antibiotics and PPIs",
496
+ "year": 2020,
497
+ "tags": {
498
+ "treatment": ["PD-1/PD-L1 Blockade"],
499
+ "cancer": ["NSCLC"],
500
+ "biology": ["Microbiome disruption"],
501
+ "intervention": ["Proton pump inhibitors", "Antibiotics"]
502
+ }
503
+ },
504
+ "42": {
505
+ "reference_id": "42",
506
+ "citation": "Tomita et al., Oncoimmunology 2022",
507
+ "title": "Clostridium butyricum therapy restores the decreased efficacy of ICI in lung cancer patients receiving PPIs",
508
+ "year": 2022,
509
+ "tags": {
510
+ "treatment": ["PD-1/PD-L1 Blockade"],
511
+ "cancer": ["NSCLC"],
512
+ "biology": ["Clostridium butyricum", "Microbiome modulation"],
513
+ "intervention": ["Probiotics", "Proton pump inhibitors"]
514
+ }
515
+ },
516
+ "43": {
517
+ "reference_id": "43",
518
+ "citation": "Terrisse et al., Cell Death Differ 2021",
519
+ "title": "Intestinal microbiota influences clinical outcome and side effects of early breast cancer treatment",
520
+ "year": 2021,
521
+ "tags": {
522
+ "treatment": ["Chemotherapy"],
523
+ "cancer": ["Breast cancer"],
524
+ "biology": ["Dysbiosis", "Treatment toxicity"],
525
+ "intervention": []
526
+ }
527
+ },
528
+ "45": {
529
+ "reference_id": "45",
530
+ "citation": "Zhang et al., Theranostics 2021",
531
+ "title": "Pectin supplement significantly enhanced the anti-PD-1 efficacy in tumor-bearing mice humanized with gut microbiota from CRC patients",
532
+ "year": 2021,
533
+ "tags": {
534
+ "treatment": ["PD-1/PD-L1 Blockade"],
535
+ "cancer": ["CRC", "Preclinical tumor models"],
536
+ "biology": ["Butyrate production", "Microbiome modulation"],
537
+ "intervention": ["Prebiotics", "Pectin"]
538
+ }
539
+ },
540
+ "48": {
541
+ "reference_id": "48",
542
+ "citation": "Dizman et al., Nat Med 2022",
543
+ "title": "Nivolumab plus ipilimumab with or without live bacterial supplementation (CBM588) in metastatic RCC: a randomized phase 1 trial",
544
+ "year": 2022,
545
+ "tags": {
546
+ "treatment": ["PD-1/PD-L1 Blockade", "CTLA-4 Blockade"],
547
+ "cancer": ["RCC"],
548
+ "biology": ["Clostridium butyricum", "Microbiome modulation"],
549
+ "intervention": ["Probiotics", "Live bacterial supplementation"]
550
+ }
551
+ }
552
+ }
553
+
554
+
555
+
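
A small sketch of how one registry entry flows into chunk metadata, mirroring `_load_paper_registry` and the tag flattening in `store_in_chromadb` above:

```python
import json

with open("rag/research_papers.json", encoding="utf-8") as f:
    registry = json.load(f)

paper = registry["1"]  # metadata for 1.pdf; keys are PDF filename stems

flat = {"paper": json.dumps(paper)}  # whole record stored as one JSON string
for tag_key, tag_values in paper.get("tags", {}).items():
    if tag_values:
        flat[f"paper_tag_{tag_key}"] = "|".join(tag_values)

print(flat["paper_tag_cancer"])  # -> "NSCLC"
```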
requirements.txt ADDED
@@ -0,0 +1,16 @@
1
+ # Core dependencies
2
+ torch>=2.0.0
3
+ transformers>=4.40.0
4
+ accelerate>=0.27.0
5
+ sentence-transformers>=2.5.0
6
+ bitsandbytes>=0.43.0
7
+
8
+ # RAG and vector database
9
+ chromadb>=0.4.0
10
+
11
+ # Utilities
12
+ tqdm>=4.65.0
13
+
14
+ # For gradio
15
+ huggingface-hub>=0.23.0
16
+ gradio
src/__init__.py ADDED
@@ -0,0 +1,10 @@
1
+ """
2
+ Microbiome-ICI Report Generator Package
3
+ """
4
+
5
+ __version__ = "1.0.0"
6
+
7
+ from .report_assembler import ReportAssembler
8
+
9
+ __all__ = ["ReportAssembler"]
10
+
src/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (352 Bytes).
src/__pycache__/config.cpython-311.pyc ADDED
Binary file (2.1 kB).
src/__pycache__/models.cpython-311.pyc ADDED
Binary file (7.64 kB).
src/__pycache__/report_assembler.cpython-311.pyc ADDED
Binary file (10.7 kB).
src/__pycache__/section_generators.cpython-311.pyc ADDED
Binary file (24.6 kB).
src/chroma_loader.py ADDED
@@ -0,0 +1,103 @@
"""
ChromaDB loader — downloads the vector database from a HuggingFace dataset
at startup if it is not already present locally.

Replace HF_REPO_ID with your actual dataset repo once it is uploaded.
"""

import logging
import os
from pathlib import Path

from . import config

logger = logging.getLogger(__name__)

# ---------------------------------------------------------------------------
# Configuration — update HF_REPO_ID before deployment
# ---------------------------------------------------------------------------

HF_REPO_ID = "your-username/your-chroma-db-dataset"  # <-- replace this
LOCAL_CHROMA_DIR = Path("./chroma_db")


def ensure_chroma_db() -> str:
    """
    Ensure the ChromaDB is available locally.

    If the local directory already exists and contains ChromaDB files,
    this is a no-op. Otherwise the dataset is downloaded from HuggingFace
    Hub into LOCAL_CHROMA_DIR.

    Returns:
        The absolute path to the local ChromaDB directory (str).

    Raises:
        RuntimeError: If the download fails for any reason.
    """
    chroma_path = LOCAL_CHROMA_DIR.resolve()

    # -----------------------------------------------------------------------
    # Check if a valid ChromaDB already exists locally
    # -----------------------------------------------------------------------
    if _chroma_db_exists(chroma_path):
        logger.info(f"ChromaDB already present at {chroma_path} — skipping download.")
        _update_config(str(chroma_path))
        return str(chroma_path)

    # -----------------------------------------------------------------------
    # Download from HuggingFace Hub
    # -----------------------------------------------------------------------
    logger.info(
        f"ChromaDB not found locally. Downloading from HuggingFace: {HF_REPO_ID}"
    )

    try:
        from huggingface_hub import snapshot_download
    except ImportError:
        raise RuntimeError(
            "huggingface_hub is not installed. "
            "Add it to requirements.txt: huggingface-hub>=0.23.0"
        )

    try:
        downloaded_path = snapshot_download(
            repo_id=HF_REPO_ID,
            repo_type="dataset",
            local_dir=str(chroma_path),
        )
        logger.info(f"ChromaDB downloaded successfully to: {downloaded_path}")

    except Exception as exc:
        raise RuntimeError(
            f"Failed to download ChromaDB from HuggingFace repo '{HF_REPO_ID}': {exc}"
        ) from exc

    if not _chroma_db_exists(chroma_path):
        raise RuntimeError(
            f"Download appeared to succeed but no ChromaDB files were found in "
            f"{chroma_path}. Check that the HuggingFace dataset contains a "
            f"ChromaDB at its root."
        )

    _update_config(str(chroma_path))
    return str(chroma_path)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _chroma_db_exists(path: Path) -> bool:
    """
    Return True if the path looks like a populated ChromaDB directory.
    ChromaDB always writes a 'chroma.sqlite3' file at the root.
    """
    return path.is_dir() and (path / "chroma.sqlite3").exists()


def _update_config(path: str) -> None:
    """Point config.CHROMADB_PERSIST_DIRECTORY at the resolved local path."""
    config.CHROMADB_PERSIST_DIRECTORY = path
    os.environ["CHROMA_DB_PATH"] = path
    logger.info(f"config.CHROMADB_PERSIST_DIRECTORY set to: {path}")
src/config.py ADDED
@@ -0,0 +1,158 @@
"""
Configuration for the Microbiome-ICI Report Generator
"""
import os

# =============================================================================
# Model Configuration
# =============================================================================

# MedGemma 1.5 4B model
MEDGEMMA_MODEL_ID = "google/medgemma-1.5-4b-it"
MEDGEMMA_DEVICE = "cuda"  # Change to "cpu" if no GPU available

# PubMedBERT embedding model
EMBEDDING_MODEL_ID = "pritamdeka/S-PubMedBert-MS-MARCO"
EMBEDDING_DEVICE = "cuda"

# =============================================================================
# Generation Parameters
# =============================================================================

# Greedy decoding for reproducible clinical output. Sampling knobs are pinned
# to neutral values so upstream defaults cannot re-enable sampling; beam-search
# options (e.g. early_stopping) are omitted because num_beams=1 means no beams.
GENERATION_CONFIG = {
    "temperature": 0.0,
    "do_sample": False,
    "top_p": 1.0,
    "top_k": 0,
    "repetition_penalty": 1.0,
    "num_beams": 1,
}


SECTION_MAX_NEW_TOKENS = {
    "section_1": 2200,
    "section_2": 2400,
    "section_3": 3000,
    "section_4": 2200,
    "section_5": 2400,
    "section_6": 1000,
}

# =============================================================================
# ChromaDB Configuration
# =============================================================================

CHROMADB_COLLECTION_NAME = "research_papers"
CHROMADB_PERSIST_DIRECTORY = os.getenv("CHROMA_DB_PATH", "./chroma_db")

# =============================================================================
# RAG Retrieval Configuration
# =============================================================================

# Number of chunks to retrieve per section (Section 6 uses no retrieval)
RAG_TOP_K = {
    "section_1": 5,   # Composition Profile
    "section_2": 8,   # Metabolite Landscape
    "section_3": 12,  # Drug-Microbiome Interaction (most evidence-dense)
    "section_4": 7,   # Confounding Factors
    "section_5": 7,   # Intervention Considerations
}

# Metadata filtering strategy
# Options: "semantic_only", "metadata_only", "hybrid"
RETRIEVAL_STRATEGY = "hybrid"

# =============================================================================
# Report Configuration
# =============================================================================

OUTPUT_DIR = "./outputs"
REPORT_FILENAME_TEMPLATE = "microbiome_immunotherapy_report_{patient_id}_{timestamp}.md"

# =============================================================================
# Clinical Context Windows (days before therapy start)
# =============================================================================

ANTIBIOTIC_WINDOW_DAYS = 42      # Critical window for antibiotic impact (ICI)
ANTIBIOTIC_WINDOW_DAYS_ACT = 28  # Critical window before CAR-T infusion
PPI_CONCERN_DURATION_MONTHS = 3  # Duration after which PPI use is flagged

# ACT-specific toxicity windows
CRS_ONSET_DAYS = 14            # CRS typically occurs within 2 weeks of CAR-T infusion
NEUROTOXICITY_ONSET_DAYS = 21  # Neurotoxicity can occur up to 3 weeks post-infusion

# =============================================================================
# Taxa of Interest (for targeted retrieval)
# =============================================================================

KEY_TAXA = [
    "Akkermansia muciniphila",
    "Bifidobacterium",
    "Faecalibacterium prausnitzii",
    "Ruminococcaceae",
    "Lachnospiraceae",
    "Bacteroides",
    "Collinsella aerofaciens",
    "Alistipes",
    "Clostridium butyricum",
]

# =============================================================================
# Therapy Type Detection
# =============================================================================

THERAPY_TYPE_MAP = {
    # ICI drugs
    "pembrolizumab": "ICI",
    "nivolumab": "ICI",
    "atezolizumab": "ICI",
    "durvalumab": "ICI",
    "avelumab": "ICI",
    "ipilimumab": "ICI",
    "tremelimumab": "ICI",
    "cemiplimab": "ICI",

    # ACT drugs
    "tisagenlecleucel": "ACT",
    "axicabtagene ciloleucel": "ACT",
    "brexucabtagene autoleucel": "ACT",
    "lisocabtagene maraleucel": "ACT",
    "idecabtagene vicleucel": "ACT",
    "ciltacabtagene autoleucel": "ACT",
}

# =============================================================================
# ICI Drug Classes (for metadata filtering)
# =============================================================================

ICI_DRUG_CLASS_MAP = {
    "pembrolizumab": "PD-1/PD-L1 Blockade",
    "nivolumab": "PD-1/PD-L1 Blockade",
    "atezolizumab": "PD-1/PD-L1 Blockade",
    "durvalumab": "PD-1/PD-L1 Blockade",
    "avelumab": "PD-1/PD-L1 Blockade",
    "ipilimumab": "CTLA-4 Blockade",
    "tremelimumab": "CTLA-4 Blockade",
    "cemiplimab": "PD-1/PD-L1 Blockade",
}

# =============================================================================
# ACT Drug Classes (for metadata filtering)
# =============================================================================

ACT_DRUG_CLASS_MAP = {
    "tisagenlecleucel": "CAR-T (CD19-targeted)",
    "axicabtagene ciloleucel": "CAR-T (CD19-targeted)",
    "brexucabtagene autoleucel": "CAR-T (CD19-targeted)",
    "lisocabtagene maraleucel": "CAR-T (CD19-targeted)",
    "idecabtagene vicleucel": "CAR-T (BCMA-targeted)",
    "ciltacabtagene autoleucel": "CAR-T (BCMA-targeted)",
}
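The three maps are designed to be chained: resolve the therapy type first, then look up the drug class in the matching map. A small sketch of that lookup (the helper name is illustrative, not part of this commit; fallbacks mirror the defaults used in `src/rag.py`):

```python
# Illustrative helper, not part of this commit.
def resolve_drug_class(drug_name: str) -> str:
    drug = drug_name.lower()
    therapy = THERAPY_TYPE_MAP.get(drug, "ICI")
    if therapy == "ACT":
        return ACT_DRUG_CLASS_MAP.get(drug, "Adoptive Cell Therapy")
    return ICI_DRUG_CLASS_MAP.get(drug, "Immune Checkpoint Inhibitor")

resolve_drug_class("Pembrolizumab")     # -> "PD-1/PD-L1 Blockade"
resolve_drug_class("tisagenlecleucel")  # -> "CAR-T (CD19-targeted)"
```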
src/ehr_extractor.py ADDED
@@ -0,0 +1,257 @@
import json
import logging
import re
from datetime import date
from pathlib import Path
from typing import Dict

from .models import get_medgemma

logger = logging.getLogger(__name__)


# =============================================================================
# JSON Schema Template
# =============================================================================

# Helper to locate the template relative to this file
_BASE_DIR = Path(__file__).parent.parent
_SCHEMA_TEMPLATE_PATH = _BASE_DIR / "data" / "templates" / "patient_schema_template.json"


def _load_json_template() -> str:
    """Load the JSON schema template from the external file."""
    if not _SCHEMA_TEMPLATE_PATH.exists():
        logger.warning(
            f"Schema template not found at {_SCHEMA_TEMPLATE_PATH}. "
            f"Extraction may fail or be inaccurate."
        )
        return "{}"

    with open(_SCHEMA_TEMPLATE_PATH, "r", encoding="utf-8") as f:
        return f.read()


def _build_prompt(ehr_text: str) -> str:
    """
    Build the combined system + user prompt to pass to MedGemmaGenerator.generate().
    """
    today = date.today().isoformat()
    json_template = _load_json_template()

    system_instruction = f"""You are a clinical data extraction specialist for cancer immunotherapy. Extract structured data from EHRs covering both immune checkpoint inhibitors (ICI) and adoptive cell therapy (ACT, including CAR-T).

=== OUTPUT FORMAT ===
- Return ONLY the filled JSON object. No explanation, no preamble, no markdown fences.
- Do NOT add fields not in the template. Do NOT remove template fields.
- Valid JSON: no trailing commas, no comments, no extra keys.
- Set "extraction_date" to today: {today}.

=== DATA RULES ===
- Extract only explicitly stated values. Do not infer beyond specified rules.
- Dates: ISO 8601 (YYYY-MM-DD). If only month/year, use 1st (e.g. "March 2024" → "2024-03-01").
- Numbers: numeric type, not strings. Percentages as plain floats (4.8 not "4.8%").
- Missing optional fields: null.
- Missing required strings: "".
- Missing required arrays: [].
- Missing required booleans: false.

=== PATIENT ===
- "id": MRN exactly as written.

=== CANCER ===
- "type": Full name (e.g. "Diffuse Large B-Cell Lymphoma", "NSCLC", "Melanoma").
- "subtype": Histological subtype (e.g. "Non-GCB (ABC type)", "Adenocarcinoma").
- "stage": Use stage label only (e.g. "Stage IV", "IVA", "IIIB") — not full TNM.
- "metastases": List ANATOMICAL SITES with optional details in parentheses.
  Examples: ["Lung", "Liver"], ["Bone marrow (15% involvement)", "Pleural effusion (malignant)"].
  If M0 or no metastases, use [].
- "biomarkers.pdl1_expression": Use format from report. If percentage with TPS, use "<value>% TPS".
  If just percentage, use "<value>%". If N/A for non-relevant cancer types, use "N/A".
- "biomarkers.tmb": "<value> mutations/megabase" or "N/A" if not applicable.
- "biomarkers.msi_status": Full label (e.g. "MSS", "Microsatellite stable (MSS)") or "N/A".

=== IMMUNOTHERAPY (CRITICAL SECTION) ===
- "therapy_type": "ICI" for checkpoint inhibitors (pembrolizumab, nivolumab, ipilimumab, atezolizumab, durvalumab).
  "ACT" for adoptive cell therapy (CAR-T, TIL, TCR-T, etc.).
- "drug_name": Full drug name (e.g. "Pembrolizumab", "Axicabtagene ciloleucel").
- "drug_class": For ICI, use checkpoint target (e.g. "PD-1/PD-L1 Blockade", "PD-1 inhibitor").
  For ACT, use "CAR-T", "TIL therapy", "TCR-T", etc.
- "treatment_setting": "First-line", "Relapsed/Refractory", "consolidation", "metastatic", "adjuvant", "neoadjuvant".
- "line_of_therapy": "First-line", "Second-line", "Third-line", "consolidation", etc.
- "planned_start_date": Date therapy begins (for CAR-T, this is infusion date, not leukapheresis).

IF therapy_type is "ICI":
- "ici_details": {{"ici_target": "PD-1", "PD-L1", "CTLA-4", or "PD-1, CTLA-4" for combinations}}
- "act_details": null

IF therapy_type is "ACT":
- "ici_details": null
- "act_details": {{
    "act_type": "CAR-T", "TIL therapy", "TCR-T", etc.
    "target_antigen": e.g. "CD19", "CD22", "BCMA"
    "cell_source": "autologous" or "allogeneic"
    "preconditioning_regimen": e.g. "Fludarabine + Cyclophosphamide"
    "t_cell_harvest_date": Date of leukapheresis (YYYY-MM-DD)
    "expected_crs_risk": "low", "moderate", "moderate-high", "high"
    "expected_neurotoxicity_risk": "low", "moderate", "moderate-high", "high"
  }}

=== PRIOR TREATMENTS ===
- "chemotherapy.received": TRUE if any chemo regimen described (even if completed before current therapy).
- "chemotherapy.regimens": List each as string (e.g. ["R-CHOP", "R-ICE", "Gemcitabine (bridging)"]).
- "chemotherapy.response": Describe response to each regimen if stated.
- "prior_immunotherapy.received": TRUE only if immunotherapy given BEFORE current planned regimen.

=== MEDICATIONS ===
- "ppi_use.currently_on_ppi": true if on any PPI.
- "ppi_use.ppi_name": name (e.g. "Omeprazole").
- "ppi_use.duration_months": numeric months if stated; 0 if unknown.
- "antibiotic_history.recent_antibiotics": TRUE if any antibiotic within 90 days of planned therapy start.
- "antibiotic_history.exposures": List EVERY antibiotic course mentioned. Never leave [] if antibiotics documented.
  Each exposure object:
  - "antibiotic_name": name + dose (e.g. "Levofloxacin 500mg").
  - "antibiotic_class": use mappings (levofloxacin→fluoroquinolone, azithromycin→macrolide,
    piperacillin-tazobactam→beta-lactam, amoxicillin-clavulanate→beta-lactam/penicillin combination).
  - "start_date", "end_date": YYYY-MM-DD. If ongoing at report date, use "ongoing" for end_date.
  - "days_before_ici": Days from antibiotic END (or report date if ongoing) to planned therapy start.
  - "note" (optional): Add if context needed.

=== COMORBIDITIES ===
- List all conditions from Past Medical History as plain strings.
- Never use [] if a PMH section exists — scan fully.
- Include diet-controlled or asymptomatic conditions if listed.
- Do NOT include surgical history, family history, or social history.

=== MICROBIOME ===
- "sequencing_method": Exact method from report.
- "diversity.observed_species": Use "Observed OTUs" or "Observed Species" value.
- "key_bacteria": DYNAMIC object. Extract ALL bacterial species mentioned with abundance percentages.
  - Keys: lowercase, underscores for spaces (e.g. "akkermansia_muciniphila").
  - Create SEPARATE keys for each Bifidobacterium species — do NOT sum into bifidobacterium_spp unless explicitly stated.
  - Values: plain floats (percentages).
- "metabolites.scfa": butyrate, propionate, acetate as floats in μM. null if not measured.
- "metabolites.bile_acids_available": true if ANY bile acid data reported.
- "metabolites.tryptophan_metabolites_available": true if ANY tryptophan metabolite reported.
- "data_quality.completeness": "high" if metabolites + diversity + species all present; "moderate" if some missing; "low" if sparse.
- "data_quality.source": Lab name if stated; "" if unknown.
- "data_quality.limitations": Extract any explicitly noted limitations as strings.

=== CLINICAL CONTEXT ===
- "urgency": Extract urgency statements (e.g. "High - 33-day intervention window", "Standard"). "" if not stated.
- "patient_goals": List goals patient explicitly expressed.
- "specific_concerns": List clinical concerns from assessment.

=== FINAL CHECK ===
Before outputting, verify:
- therapy_type determines which of ici_details/act_details is populated (the other must be null).
- All antibiotic exposures logged in array.
- key_bacteria contains species-level entries from report (not a fixed schema).
- JSON is valid (no trailing commas, proper null usage)."""

    user_prompt = f"""EHR REPORT:
{ehr_text}

JSON TEMPLATE TO FILL:
{json_template}

Return the completed JSON object now."""

    return f"{system_instruction}\n\n{user_prompt}"


def _parse_output(raw_output: str) -> Dict:
    """
    Extract and parse a JSON object from raw model output.
    """
    # Strip any residual markdown fences (```json ... ``` or ``` ... ```)
    fenced = re.search(r"```(?:json)?\s*([\s\S]+?)\s*```", raw_output)
    if fenced:
        raw_output = fenced.group(1)

    # Find the outermost JSON object
    start = raw_output.find("{")
    end = raw_output.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError(
            "No JSON object found in model output.\n"
            f"Raw output (first 500 chars):\n{raw_output[:500]}"
        )

    json_str = raw_output[start:end]
    return json.loads(json_str)


class EHRExtractor:
    """
    Extracts structured patient JSON from free-text EHR reports using MedGemma.

    Usage:
        extractor = EHRExtractor()
        patient_data = extractor.extract(ehr_text)
        # or
        patient_data = extractor.extract_from_file("path/to/report.txt")
    """

    def __init__(self):
        self._llm = get_medgemma()

    def extract(self, ehr_text: str) -> Dict:
        """
        Run EHR extraction and return the parsed patient data dictionary.

        Args:
            ehr_text: Raw EHR report as a string.

        Returns:
            Patient data dict matching the pipeline's expected JSON schema.

        Raises:
            ValueError: If no valid JSON could be found in the model output.
            json.JSONDecodeError: If the extracted JSON string is malformed.
        """
        logger.info("Starting EHR extraction via MedGemma")

        prompt = _build_prompt(ehr_text)
        logger.info(f"EHR prompt length: {len(prompt)} characters")

        raw_output = self._llm.generate(prompt, max_new_tokens=6000)

        logger.debug(f"Raw EHR extraction output:\n{raw_output[:1000]}...")

        try:
            patient_data = _parse_output(raw_output)
        except (ValueError, json.JSONDecodeError) as exc:
            logger.error(
                f"EHR extraction failed — could not parse JSON ({exc}).\n"
                f"Raw output:\n{raw_output}"
            )
            raise

        logger.info(
            f"EHR extraction complete. "
            f"Patient ID: {patient_data.get('patient', {}).get('id', 'unknown')}"
        )
        return patient_data

    def extract_from_file(self, ehr_path: str) -> Dict:
        """
        Load an EHR text file and extract patient data.

        Args:
            ehr_path: Path to the EHR text file.

        Returns:
            Patient data dict.
        """
        logger.info(f"Loading EHR from file: {ehr_path}")
        with open(ehr_path, "r", encoding="utf-8") as f:
            ehr_text = f.read()
        return self.extract(ehr_text)
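Typical use of the extractor, assuming a plain-text EHR export on disk (the example path is illustrative, not part of this commit):

```python
# Illustrative usage — the path is a placeholder.
extractor = EHRExtractor()  # loads MedGemma on first instantiation
patient_data = extractor.extract_from_file("data/examples/ehr_report.txt")
print(patient_data["immunotherapy"]["therapy_type"])  # "ICI" or "ACT"
```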
src/models.py ADDED
@@ -0,0 +1,185 @@
"""
Model loading and inference utilities
"""

import logging
import re
from typing import List, Optional

import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoProcessor, AutoModelForImageTextToText

from . import config

logger = logging.getLogger(__name__)


class MedGemmaGenerator:
    """Wrapper for MedGemma 1.5 4B model"""

    def __init__(self):
        logger.info(f"Loading MedGemma model: {config.MEDGEMMA_MODEL_ID}")

        # MedGemma 1.5 is multimodal: use AutoProcessor (not AutoTokenizer)
        # and AutoModelForImageTextToText (not AutoModelForCausalLM)
        self.processor = AutoProcessor.from_pretrained(config.MEDGEMMA_MODEL_ID)
        self.model = AutoModelForImageTextToText.from_pretrained(
            config.MEDGEMMA_MODEL_ID,
            torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
            device_map="auto" if config.MEDGEMMA_DEVICE == "cuda" else None,
        )

        if config.MEDGEMMA_DEVICE == "cpu":
            self.model = self.model.to("cpu")

        self.model.eval()
        logger.info("MedGemma model loaded successfully")

    def _strip_thinking_block(self, text: str) -> str:
        """
        Remove the thinking/reasoning block that Gemma 3-based models emit.
        MedGemma 1.5 uses <unused94>thought...<unused95> tokens.
        """
        # Closed thinking blocks
        text = re.sub(
            r"<unused94>thought[\s\S]*?<unused95>",
            "",
            text,
            flags=re.IGNORECASE,
        )
        text = re.sub(
            r"<think>[\s\S]*?</think>",
            "",
            text,
            flags=re.IGNORECASE,
        )

        # Unterminated thinking blocks running to the end of the output
        text = re.sub(
            r"<unused94>thought[\s\S]*$",
            "",
            text,
            flags=re.IGNORECASE,
        )
        text = re.sub(
            r"<think>[\s\S]*$",
            "",
            text,
            flags=re.IGNORECASE,
        )

        return text.strip()

    def generate(self, prompt: str, max_new_tokens: Optional[int] = None) -> str:
        """
        Generate text from prompt using MedGemma

        Args:
            prompt: Input prompt
            max_new_tokens: Override default max tokens if provided

        Returns:
            Generated text (with thinking block removed)
        """
        gen_config = config.GENERATION_CONFIG.copy()
        if max_new_tokens:
            gen_config["max_new_tokens"] = max_new_tokens

        # Use proper message format for MedGemma 1.5 4B-IT
        messages = [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}]
            }
        ]

        # Apply chat template properly; the returned BatchFeature's .to() only
        # casts floating-point tensors, so input_ids stay integer-typed
        inputs = self.processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        ).to(self.model.device, dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32)

        input_len = inputs["input_ids"].shape[-1]

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                **gen_config,
                pad_token_id=self.processor.tokenizer.pad_token_id
                if hasattr(self.processor, "tokenizer")
                else self.processor.pad_token_id,
            )

        # Extract only the generated portion (after the input)
        generated_tokens = outputs[0][input_len:]
        generated_text = self.processor.decode(generated_tokens, skip_special_tokens=True)

        # Strip the thinking block before returning
        return self._strip_thinking_block(generated_text)


class EmbeddingModel:
    """Wrapper for PubMedBERT embedding model"""

    def __init__(self):
        logger.info(f"Loading embedding model: {config.EMBEDDING_MODEL_ID}")

        self.model = SentenceTransformer(
            config.EMBEDDING_MODEL_ID,
            device=config.EMBEDDING_DEVICE
        )

        logger.info("Embedding model loaded successfully")

    def encode(self, texts: List[str]) -> List[List[float]]:
        """
        Encode texts to embeddings

        Args:
            texts: List of text strings to encode

        Returns:
            List of embedding vectors
        """
        embeddings = self.model.encode(
            texts,
            convert_to_tensor=False,
            show_progress_bar=False
        )
        return embeddings.tolist()

    def encode_single(self, text: str) -> List[float]:
        """
        Encode a single text to embedding

        Args:
            text: Text string to encode

        Returns:
            Embedding vector
        """
        return self.encode([text])[0]


# Global model instances (loaded once)
_medgemma_instance = None
_embedding_instance = None


def get_medgemma() -> MedGemmaGenerator:
    """Get or create MedGemma model instance"""
    global _medgemma_instance
    if _medgemma_instance is None:
        _medgemma_instance = MedGemmaGenerator()
    return _medgemma_instance


def get_embedding_model() -> EmbeddingModel:
    """Get or create embedding model instance"""
    global _embedding_instance
    if _embedding_instance is None:
        _embedding_instance = EmbeddingModel()
    return _embedding_instance
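Because both getters cache a module-level singleton, any module can request the models without paying the load cost twice:

```python
from src.models import get_medgemma, get_embedding_model

llm = get_medgemma()            # loaded once; later calls reuse the instance
embedder = get_embedding_model()
vec = embedder.encode_single("Akkermansia muciniphila and anti-PD-1 response")
len(vec)  # 768 for this BERT-base PubMedBERT checkpoint
```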
src/prompts.py ADDED
@@ -0,0 +1,323 @@
"""
Section-specific prompt templates for clinical report generation.
Optimized for instruction-following in smaller models (e.g. MedGemma 4B IT).

Key design principles applied:
- Single flat instruction block per section (no nested lists inside lists)
- Positive framing: tell the model what TO do, not what NOT to do
- Evidence anchor placed immediately before the generation task
- Citation format stated once, clearly, close to where citations are used
- Section headers kept inside the prompt so the model knows its structural role
- Global instruction kept minimal; section prompts are self-contained
"""

# =============================================================================
# Global Instruction (prepended to all section prompts)
# Kept short — section prompts carry the detailed guidance
# =============================================================================

GLOBAL_INSTRUCTION = """You are a clinical report writer assisting an oncologist.
Your output will become one section of a microbiome-immunotherapy report used to inform treatment decisions.

Two rules apply to every section you write:
- Every factual claim must come directly from the retrieved evidence provided. If the evidence does not address a topic, omit that topic.
- Every claim must be followed by an inline citation in this exact format: (Author et al., Journal Year). Cite only sources listed in the Retrieved evidence section below; each evidence entry supplies its citation in its "Citation:" field.

Write in formal clinical prose. Do not use bullet points unless explicitly instructed.
"""

# =============================================================================
# Section 1: Microbiome Diversity & Composition Profile
# =============================================================================

SECTION_1_PROMPT = """{global_instruction}

---
SECTION 1: Microbiome Diversity & Composition Profile
---

Patient context:
- Cancer type: {cancer_type} | Stage: {cancer_stage}
- Planned therapy: {drug_name} ({drug_class})

Patient microbiome data:
- Shannon Diversity Index: {shannon_index}
- Simpson Diversity Index: {simpson_index}
- Observed Species: {observed_species}
- Detected taxa (% relative abundance):
{detected_taxa}

Retrieved evidence:
{evidence}

Task:
Write this section in two parts.

Part 1 — Diversity characterization: Describe the patient's alpha diversity level. Use the retrieved evidence to characterize whether this diversity profile has been associated with favorable or unfavorable outcomes in this cancer and immunotherapy context. Cite the evidence.

Part 2 — Taxa characterization: For each detected taxon above that appears in the retrieved evidence, describe its observed relative abundance and what the evidence associates it with in this clinical context. Cover only taxa that have retrieved evidence. Cite each association.

Write in descriptive, factual prose. Do not predict this patient's individual outcome.

Begin writing Section 1 now:
"""

# =============================================================================
# Section 2: Metabolite Landscape
# =============================================================================

SECTION_2_PROMPT = """{global_instruction}

---
SECTION 2: Metabolite Landscape
---

Patient context:
- Cancer type: {cancer_type}
- Planned therapy: {drug_name} ({drug_class})

Patient metabolite data:
{metabolite_data}

Retrieved evidence:
{evidence}

Task:
Write a functional interpretation of the patient's metabolite profile. For each metabolite class present in the patient data (e.g. short-chain fatty acids, bile acids, tryptophan metabolites), do the following in sequence:
1. State the observed level from the patient data.
2. Describe what the retrieved evidence says about that metabolite class in the context of immune function and this therapy type. Cite the evidence.

If a metabolite class is present in the patient data but absent from the retrieved evidence, omit it entirely.
Frame this section as bridging microbiome composition to immune activity. Reserve response predictions for Section 3.

Begin writing Section 2 now:
"""

# =============================================================================
# Section 3: Drug–Microbiome Interaction Outlook (ICI version)
# =============================================================================

SECTION_3_ICI_PROMPT = """{global_instruction}

---
SECTION 3: Drug–Microbiome Interaction Outlook
---

Patient context:
- Cancer type: {cancer_type} | Stage: {cancer_stage}
- Planned therapy: {drug_name} ({drug_class}) | Line: {line_of_therapy}
- Tumor biomarkers: PD-L1 {pdl1} | TMB {tmb} | MSI {msi}

Patient microbiome summary:
- Shannon {shannon_index} | Simpson {simpson_index}
- Key taxa: {key_taxa_summary}
- Metabolite context: {metabolite_summary}

Retrieved evidence:
{evidence}

Task:
Write this section in three parts.

Part 1 — Overall microbiome-ICI context: Describe what the retrieved evidence says about how this patient's microbiome profile (diversity level and dominant taxa) compares to patterns observed in comparable cohorts treated with this ICI class. Use phrases such as "the evidence suggests" or "studies in comparable cohorts found". Cite all claims.

Part 2 — Individual taxa associations: For each taxon in the patient's key taxa list that appears in the retrieved evidence for this ICI class, describe the association the evidence reports (favorable, unfavorable, or bidirectional). If evidence reports both efficacy and immune-related adverse event (irAE) associations for the same taxon, state both explicitly. Cite each association.

Part 3 — Alpha diversity in this treatment setting: Describe what the retrieved evidence specifically says about alpha diversity and outcomes in this ICI and cancer type context. Cite the evidence.

Do not predict this individual patient's outcome. Attribute all findings to the evidence source.

Begin writing Section 3 now:
"""

# =============================================================================
# Section 3: Drug–Microbiome Interaction Outlook (ACT version)
# =============================================================================

SECTION_3_ACT_PROMPT = """{global_instruction}

---
SECTION 3: Microbiome–ACT Interaction Outlook
---

Patient context:
- Cancer type: {cancer_type} | Stage: {cancer_stage}
- Planned therapy: {drug_name} ({drug_class})
- ACT type: {act_type} | Target antigen: {target_antigen} | Cell source: {cell_source}
- Expected CRS risk: {crs_risk} | Expected neurotoxicity risk: {neurotoxicity_risk}
- Line of therapy: {line_of_therapy}

Patient microbiome summary:
- Shannon {shannon_index} | Simpson {simpson_index}
- Key taxa: {key_taxa_summary}
- Metabolite context: {metabolite_summary}

Retrieved evidence:
{evidence}

Task:
Write this section in four parts.

Part 1 — Overall microbiome-ACT context: Describe what the retrieved evidence says about how this patient's microbiome profile relates to outcomes observed in comparable ACT cohorts. Use phrases such as "the evidence suggests" or "studies in ACT cohorts found". Cite all claims.

Part 2 — Efficacy-related taxa: For each taxon in the patient's key taxa list where the retrieved evidence links it to CAR-T cell expansion, persistence, or anti-tumor cytotoxicity, describe that association. Cite each claim.

Part 3 — Toxicity-related taxa and metabolites: Describe what the retrieved evidence says about microbiota associations with CRS or ICANS risk. If the evidence links specific taxa or metabolites (particularly SCFAs) to T-cell function or inflammatory tone relevant to ACT toxicity, include those findings. Cite each claim.

Part 4 — Metabolite context for T-cell function: Describe what the retrieved evidence says about microbiota-derived metabolites, especially SCFAs, in modulating T-cell function in the ACT setting. Cite the evidence.

Do not predict this individual patient's outcome. Attribute all findings to the evidence source.

Begin writing Section 3 now:
"""

# =============================================================================
# Section 4: Confounding Factors
# =============================================================================

SECTION_4_PROMPT = """{global_instruction}

---
SECTION 4: Confounding Factors
---

Patient context:
- Cancer type: {cancer_type}
- Planned therapy: {drug_name} ({drug_class})

Patient confounding factor data:
{confounding_data}

Retrieved evidence:
{evidence}

Task:
For each confounding factor present in the patient data above, write one paragraph using only the retrieved evidence.

Antibiotic exposure: If present, describe what the retrieved evidence says about antibiotic timing relative to immunotherapy initiation and its documented interactions with microbiome-mediated treatment efficacy. If the evidence distinguishes by antibiotic class, include that distinction. Cite the evidence.

PPI use: If present, describe what the retrieved evidence says about proton pump inhibitor effects on the microbiome in the immunotherapy context. Cite the evidence.

Prior treatments: If prior chemotherapy or immunotherapy is recorded, describe any retrieved evidence connecting those treatments to microbiome changes relevant to subsequent immunotherapy response. Cite the evidence.

Comorbidities: Include only if the retrieved evidence directly links the recorded comorbidity to microbiome-immunotherapy interactions. Cite the evidence.

If no confounding factors are present in the patient data, or if no retrieved evidence addresses the recorded factors, output exactly this sentence:
"No significant confounding factors with established microbiome-immunotherapy interactions were identified in the available data."

Begin writing Section 4 now:
"""

# =============================================================================
# Section 5: Microbiota-Modulation Intervention Considerations
# =============================================================================

SECTION_5_PROMPT = """{global_instruction}

---
SECTION 5: Microbiota-Modulation Intervention Considerations
---

Patient context:
- Cancer type: {cancer_type}
- Planned therapy: {drug_name} ({drug_class})
- Microbiome context: {microbiome_summary}

Retrieved evidence by intervention type:
{evidence}

Task:
Generate a sub-section for each intervention type that has supporting evidence in the retrieved chunks above. Skip any intervention type with no retrieved evidence — do not note its absence.

For each sub-section that has evidence, use this structure:

Sub-section title (e.g., "Dietary and Prebiotic Approaches" or "Probiotic Supplementation")
Write 2–4 sentences describing what the retrieved evidence found about this intervention in this cancer and therapy context. State the finding, cite it, and note any caveat the evidence itself raises. Close with one sentence framing it as a consideration for clinical discussion rather than a recommendation.

Tone: exploratory and evidence-grounded. This section informs discussion; it does not prescribe.

Begin writing Section 5 now:
"""

# =============================================================================
# Section 6: Data Quality & Interpretive Limitations
# =============================================================================

SECTION_6_FIXED_CAVEATS = (
    "Microbiome composition is highly individual and dynamic; this report reflects a "
    "single time-point sample. Associations between microbiome features and immunotherapy response "
    "are derived from cohort-level studies and may not predict individual outcomes. "
    "The evidence base is evolving; findings should be interpreted in the context of "
    "current clinical judgment."
)

SECTION_6_PROMPT = """
---
SECTION 6: Data Quality & Interpretive Limitations
---

Patient sample data quality:
{data_quality}

Task:
Write 1–3 sentences addressing only what is specific to this patient's sample based on the data quality fields above.

Follow these rules in order:
- If completeness is "high" and no limitations are listed, write exactly: "No significant data quality limitations were identified for this sample."
- If completeness is below "high" or limitations are listed: name each affected data domain and state which section (Section 2 or Section 3) is consequently limited in its interpretation.
- If a metabolite class is listed as unavailable in the data quality fields, name it and describe the resulting interpretive gap. Only name classes that are explicitly listed as unavailable.

Do not add general statements about microbiome variability or evolving evidence — those appear in a separate fixed caveats section.

Begin writing Section 6 now:
"""


# =============================================================================
# Helper function to build full prompts
# =============================================================================

def build_prompt(section_name: str, patient_data: dict, evidence: str, **kwargs) -> str:
    """
    Build a complete prompt for a given section.

    Args:
        section_name: One of "section_1", "section_2", "section_3",
            "section_4", "section_5", "section_6"
        patient_data: Patient JSON data dictionary
        evidence: Formatted evidence string from RAG
        **kwargs: Additional template variables (e.g. detected_taxa,
            metabolite_data, confounding_data, data_quality, etc.)

    Returns:
        Complete formatted prompt string
    """
    therapy_type = patient_data["immunotherapy"].get("therapy_type", "ICI")

    # Select template
    if section_name == "section_3":
        template = SECTION_3_ACT_PROMPT if therapy_type == "ACT" else SECTION_3_ICI_PROMPT
    else:
        prompt_templates = {
            "section_1": SECTION_1_PROMPT,
            "section_2": SECTION_2_PROMPT,
            "section_4": SECTION_4_PROMPT,
            "section_5": SECTION_5_PROMPT,
            "section_6": SECTION_6_PROMPT,
        }
        template = prompt_templates.get(section_name)

    if not template:
        raise ValueError(f"Unknown section: {section_name}")

    # Inject global instruction (section_6 is standalone, doesn't use it)
    kwargs["global_instruction"] = GLOBAL_INSTRUCTION if section_name != "section_6" else ""
    kwargs["evidence"] = evidence

    # Inject shared patient context fields
    if section_name in ["section_1", "section_2", "section_3", "section_4", "section_5"]:
        kwargs["cancer_type"] = patient_data["cancer"]["type"]
        kwargs["drug_name"] = patient_data["immunotherapy"]["drug_name"]
        kwargs["drug_class"] = patient_data["immunotherapy"]["drug_class"]

    return template.format(**kwargs)
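A hedged example of how a section generator might call `build_prompt`. The keyword values are abbreviated placeholders, and in the real pipeline `evidence` comes from `RAGRetriever.format_chunks_for_llm`; note that `cancer_type`, `drug_name`, and `drug_class` are injected by `build_prompt` itself and should not be passed:

```python
# Field values are illustrative placeholders.
prompt = build_prompt(
    "section_1",
    patient_data,
    evidence=formatted_evidence,
    cancer_stage="Stage IV",
    shannon_index=3.2,
    simpson_index=0.91,
    observed_species=142,
    detected_taxa="- Akkermansia muciniphila: 4.8%\n- Faecalibacterium prausnitzii: 6.1%",
)
```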
src/rag.py ADDED
@@ -0,0 +1,471 @@
"""
RAG retrieval logic with ChromaDB
"""

import json
import logging
from typing import Dict, List, Optional, Set

import chromadb

from . import config
from .models import get_embedding_model

logger = logging.getLogger(__name__)


class RAGRetriever:
    """Handles retrieval from ChromaDB with metadata filtering"""

    def __init__(self):
        logger.info(f"Connecting to ChromaDB at {config.CHROMADB_PERSIST_DIRECTORY}")

        self.client = chromadb.PersistentClient(path=config.CHROMADB_PERSIST_DIRECTORY)
        self.collection = self.client.get_collection(name=config.CHROMADB_COLLECTION_NAME)
        self.embedding_model = get_embedding_model()

        logger.info(f"Connected to collection: {config.CHROMADB_COLLECTION_NAME}")

    def retrieve(
        self,
        query_text: str,
        top_k: int,
        metadata_filters: Optional[Dict] = None,
        exclude_filters: Optional[Dict] = None,
        strategy: str = config.RETRIEVAL_STRATEGY,  # "semantic_only", "metadata_only", or "hybrid"
    ) -> List[Dict]:
        """
        Retrieve chunks from ChromaDB according to strategy:
        - semantic_only: ignores metadata filters, purely vector search
        - metadata_only: filters metadata, no fallback
        - hybrid: filters metadata, then fills remaining with semantic-only search
        """
        # Encode query
        query_embedding = self.embedding_model.encode_single(query_text)

        # Helper function to query Chroma
        def _query_chroma(where_clause=None, n_results=top_k):
            query_params = {
                "query_embeddings": [query_embedding],
                "n_results": n_results,
            }
            if where_clause:
                query_params["where"] = where_clause
            results = self.collection.query(**query_params)
            chunks = []
            if results and results["documents"]:
                for i in range(len(results["documents"][0])):
                    chunk = {
                        "text": results["documents"][0][i],
                        "metadata": results["metadatas"][0][i] if results["metadatas"] else {},
                        "distance": results["distances"][0][i] if results["distances"] else None,
                    }
                    chunks.append(chunk)
            return chunks

        # Build metadata filter clauses if needed
        include_clause = self._build_where_clause(metadata_filters)
        exclude_clause = self._build_exclude_clause(exclude_filters)
        if include_clause and exclude_clause:
            where_clause = {"$and": [include_clause, exclude_clause]}
        elif include_clause:
            where_clause = include_clause
        elif exclude_clause:
            where_clause = exclude_clause
        else:
            where_clause = None

        # --- Handle strategies ---
        if strategy == "semantic_only":
            return _query_chroma(where_clause=None, n_results=top_k)

        elif strategy == "metadata_only":
            # Only filter-based search, no fallback
            return _query_chroma(where_clause=where_clause, n_results=top_k)

        elif strategy == "hybrid":
            # Step 1: filtered search
            filtered_chunks = _query_chroma(where_clause=where_clause, n_results=top_k)

            # Step 2: fallback to semantic-only if not enough
            if len(filtered_chunks) < top_k:
                remaining_k = top_k - len(filtered_chunks)
                semantic_chunks = _query_chroma(where_clause=None, n_results=remaining_k)

                # Remove duplicates based on text
                existing_texts = {c["text"] for c in filtered_chunks}
                semantic_chunks = [c for c in semantic_chunks if c["text"] not in existing_texts]

                filtered_chunks.extend(semantic_chunks)

            return filtered_chunks

        else:
            raise ValueError(f"Unknown strategy: {strategy}")

    def _build_where_clause(self, filters: Optional[Dict]) -> Optional[Dict]:
        """
        Build a ChromaDB WHERE clause from an include-filter dictionary.

        Filters map tag category names (matching keys in research_papers.json "tags")
        to one or more values. The pipeline stores tags as flat pipe-delimited string
        fields named paper_tag_{category}, e.g.:
            paper_tag_cancer = "NSCLC|Renal Cell Carcinoma|Bladder Cancer"
            paper_tag_treatment = "PD-1/PD-L1 Blockade"

        We use $contains for substring matching against the pipe-delimited string.

        Examples:
            {"cancer": "NSCLC"}
                -> {"paper_tag_cancer": {"$contains": "NSCLC"}}

            {"cancer": "NSCLC", "treatment": ["PD-1/PD-L1 Blockade"]}
                -> {"$and": [
                       {"paper_tag_cancer": {"$contains": "NSCLC"}},
                       {"paper_tag_treatment": {"$contains": "PD-1/PD-L1 Blockade"}},
                   ]}

        Note: For list values, only the FIRST element is used for $contains filtering.
        If you need to match any of several values, call retrieve() once per value and
        merge results, or fetch unfiltered and post-filter in Python.
        """
        if not filters:
            return None

        where_conditions = []

        for key, value in filters.items():
            field = f"paper_tag_{key}"
            # Use the first element if a list was provided; $contains substring-matches
            # against the pipe-delimited string stored in the metadata field.
            match_value = value[0] if isinstance(value, list) else value
            where_conditions.append({field: {"$contains": match_value}})

        if len(where_conditions) == 1:
            return where_conditions[0]
        elif len(where_conditions) > 1:
            return {"$and": where_conditions}

        return None

    def _build_exclude_clause(self, filters: Optional[Dict]) -> Optional[Dict]:
        """
        Build a ChromaDB WHERE clause that EXCLUDES documents matching the filters.

        Uses $not_contains so chunks whose tag field contains the given value are
        filtered out.

        Examples:
            {"section_type": "references"}
                -> {"section_type": {"$not_contains": "references"}}

            {"cancer": "NSCLC"}
                -> {"paper_tag_cancer": {"$not_contains": "NSCLC"}}
        """
        if not filters:
            return None

        where_conditions = []

        for key, value in filters.items():
            # Allow filtering on plain metadata fields (e.g. section_type)
            # as well as tag fields.
            if key.startswith("paper_tag_") or key in ("source_file", "section_type", "is_table"):
                field = key
            else:
                field = f"paper_tag_{key}"

            match_value = value[0] if isinstance(value, list) else value
            where_conditions.append({field: {"$not_contains": match_value}})

        if len(where_conditions) == 1:
            return where_conditions[0]
        elif len(where_conditions) > 1:
            return {"$and": where_conditions}

        return None

    def _parse_paper_meta(self, metadata: Dict) -> Dict:
        """
        Safely deserialise the paper metadata stored as a JSON string in ChromaDB.

        ChromaDB only stores scalar values, so the pipeline serialised the full
        paper dict as json.dumps(). This helper reverses that.
        """
        raw = metadata.get("paper", "{}")
        try:
            return json.loads(raw) if raw else {}
        except (json.JSONDecodeError, TypeError):
            logger.warning("Could not deserialise paper metadata: %s", raw)
            return {}
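For concreteness, this is the shape of clause the include builder produces (values assumed). One caveat worth verifying against the pinned chromadb version: core ChromaDB documents `$contains`/`$not_contains` as `where_document` operators, so their acceptance inside a metadata `where` clause, as used here, depends on the installed release.

```python
retriever = RAGRetriever()
clause = retriever._build_where_clause(
    {"cancer": "NSCLC", "treatment": ["PD-1/PD-L1 Blockade"]}
)
# clause ==
# {"$and": [
#     {"paper_tag_cancer": {"$contains": "NSCLC"}},
#     {"paper_tag_treatment": {"$contains": "PD-1/PD-L1 Blockade"}},
# ]}
```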
201
+ # ------------------------------------------------------------------
202
+ # Section-specific retrieval methods
203
+ # ------------------------------------------------------------------
204
+
205
+ def retrieve_for_section_1(self, patient_data: Dict) -> List[Dict]:
206
+ """
207
+ Retrieve chunks for Section 1: Microbiome Composition Profile
208
+
209
+ Focus on: diversity, detected taxa, cancer type, ICI class
210
+ """
211
+ cancer_type = patient_data["cancer"]["type"]
212
+ ici_class = self._get_ici_class(patient_data["immunotherapy"]["drug_name"])
213
+
214
+ detected_taxa = [
215
+ taxon for taxon, abundance in patient_data["microbiome"]["key_bacteria"].items()
216
+ if abundance is not None and abundance > 0
217
+ ]
218
+
219
+ query = f"""
220
+ Microbiome composition and diversity in {cancer_type} patients receiving {ici_class} therapy.
221
+ Taxa of interest: {', '.join(detected_taxa[:5])}.
222
+ Alpha diversity and response to immunotherapy.
223
+ """
224
+
225
+ filters = {
226
+ "cancer": cancer_type,
227
+ "treatment": ici_class,
228
+ }
229
+
230
+ return self.retrieve(
231
+ query_text=query,
232
+ top_k=config.RAG_TOP_K["section_1"],
233
+ metadata_filters=filters,
234
+ )
235
+
236
+ def retrieve_for_section_2(self, patient_data: Dict) -> List[Dict]:
237
+ """
238
+ Retrieve chunks for Section 2: Metabolite Landscape
239
+
240
+ Focus on: SCFAs, bile acids, tryptophan metabolites
241
+ """
242
+ cancer_type = patient_data["cancer"]["type"]
243
+ ici_class = self._get_ici_class(patient_data["immunotherapy"]["drug_name"])
244
+
245
+ metabolites = patient_data["microbiome"]["metabolites"]
246
+
247
+ metabolite_terms = []
248
+ if metabolites["scfa"]["butyrate_uM"] is not None:
249
+ metabolite_terms.append("short-chain fatty acids")
250
+ metabolite_terms.append("butyrate")
251
+ if metabolites["bile_acids_available"]:
252
+ metabolite_terms.append("bile acids")
253
+ if metabolites["tryptophan_metabolites_available"]:
254
+ metabolite_terms.append("tryptophan metabolism")
255
+
256
+ if not metabolite_terms:
257
+ return []
258
+
259
+ query = f"""
260
+ Microbial metabolites and immune function in {cancer_type}.
261
+ {', '.join(metabolite_terms)} and their role in immunotherapy response.
262
+ CD8+ T cell function, regulatory T cells, mucosal immunity.
263
+ """
264
+
265
+ # Metabolite section: semantic search only (biology tags are too broad for
266
+ # reliable filtering here).
267
+ return self.retrieve(
268
+ query_text=query,
269
+ top_k=config.RAG_TOP_K["section_2"],
270
+ metadata_filters=None,
271
+ strategy="semantic_only"
272
+ )
273
+
274
+ def retrieve_for_section_3(self, patient_data: Dict) -> List[Dict]:
275
+ """
276
+ Retrieve chunks for Section 3: Drug-Microbiome Interaction Outlook
277
+ """
278
+ cancer_type = patient_data["cancer"]["type"]
279
+ drug_name = patient_data["immunotherapy"]["drug_name"]
280
+ therapy_type = self._get_therapy_type(patient_data)
281
+
282
+ key_bacteria = patient_data["microbiome"]["key_bacteria"]
283
+ detected_taxa = sorted(
284
+ [(k, v) for k, v in key_bacteria.items() if v is not None and v > 0],
285
+ key=lambda x: x[1],
286
+ reverse=True
287
+ )[:5]
288
+ taxa_names = [taxon for taxon, _ in detected_taxa]
289
+
290
+ if therapy_type == "ICI":
291
+ ici_class = self._get_ici_class(drug_name)
292
+ query = f"""
293
+ {ici_class} response prediction in {cancer_type} based on gut microbiome composition.
294
+ Specific bacteria: {', '.join(taxa_names)}.
295
+ Clinical outcomes, progression-free survival, response rates.
296
+ Immune-related adverse events and microbiome associations.
297
+ """
298
+ filters = {"cancer": cancer_type, "treatment": ici_class}
299
+
300
+ elif therapy_type == "ACT":
301
+ act_details = patient_data["immunotherapy"].get("act_details", {})
302
+ act_type = act_details.get("act_type", "CAR-T")
303
+ target_antigen = act_details.get("target_antigen", "CD19")
304
+ query = f"""
305
+ {act_type} therapy efficacy in {cancer_type} and gut microbiome composition.
306
+ Target antigen: {target_antigen}.
307
+ Specific bacteria: {', '.join(taxa_names)}.
308
+ CAR-T cell expansion, persistence, and anti-tumor activity.
309
+ Cytokine release syndrome (CRS) and neurotoxicity associations with microbiome.
310
+ T-cell function and microbiota-derived metabolites.
311
+ """
312
+ filters = {"cancer": cancer_type, "treatment": "CAR-T"}
313
+
314
+ else:
315
+ query = f"""
316
+ Immunotherapy response in {cancer_type} and gut microbiome.
317
+ Bacteria: {', '.join(taxa_names)}.
318
+ """
319
+ filters = {"cancer": cancer_type}
320
+
321
+ return self.retrieve(
322
+ query_text=query,
323
+ top_k=config.RAG_TOP_K["section_3"],
324
+ metadata_filters=filters
325
+ )
326
+
327
+ def retrieve_for_section_4(self, patient_data: Dict) -> List[Dict]:
328
+ """
329
+ Retrieve chunks for Section 4: Confounding Factors
330
+ """
331
+ cancer_type = patient_data["cancer"]["type"]
332
+ therapy_type = self._get_therapy_type(patient_data)
333
+
334
+ query_terms = []
335
+
336
+ if patient_data["medications"]["antibiotic_history"]["recent_antibiotics"]:
337
+ if therapy_type == "ACT":
338
+ query_terms.append("antibiotic exposure before CAR-T therapy and outcomes")
339
+ query_terms.append("gut microbiota disruption and CAR-T efficacy")
340
+ else:
341
+ query_terms.append("antibiotic exposure and immunotherapy outcomes")
342
+
343
+ if patient_data["medications"]["ppi_use"]["currently_on_ppi"]:
344
+ query_terms.append("proton pump inhibitors and microbiome")
345
+
346
+ if patient_data["prior_treatments"]["chemotherapy"]["received"]:
347
+ if therapy_type == "ACT":
348
+ query_terms.append("prior chemotherapy effects on gut microbiota before CAR-T")
349
+ query_terms.append("lymphodepleting chemotherapy and microbiome")
350
+ else:
351
+ query_terms.append("prior chemotherapy effects on gut microbiota")
352
+
353
+ if not query_terms:
354
+ return []
355
+
356
+ query = f"""
357
+ {' '.join(query_terms)} in {cancer_type} patients.
358
+ Impact on immunotherapy efficacy and toxicity.
359
+ """
360
+
361
+ # Broader search for confounders — no cancer-type filter
362
+ return self.retrieve(
363
+ query_text=query,
364
+ top_k=config.RAG_TOP_K["section_4"],
365
+ metadata_filters=None,
366
+ strategy="semantic_only"
367
+ )
368
+
369
+ def retrieve_for_section_5(self, patient_data: Dict) -> Dict[str, List[Dict]]:
370
+ """
371
+ Retrieve chunks for Section 5: Intervention Considerations
372
+ """
373
+ cancer_type = patient_data["cancer"]["type"]
374
+ ici_class = self._get_ici_class(patient_data["immunotherapy"]["drug_name"])
375
+
376
+ intervention_chunks = {}
377
+
378
+ # Sub-section 5a: Dietary & Prebiotics
379
+ diet_query = f"""
380
+ Dietary interventions, prebiotics, fiber supplementation in {cancer_type}.
381
+ High-fiber diet, inulin, pectin, polyphenols and immunotherapy response.
382
+ """
383
+ intervention_chunks["diet"] = self.retrieve(
384
+ query_text=diet_query,
385
+ top_k=10,
386
+ metadata_filters=None,
387
+ strategy="semantic_only"
388
+ )
389
+
390
+ # Sub-section 5b: Probiotics
391
+ probiotics_query = f"""
392
+ Probiotic supplementation in {cancer_type} patients receiving {ici_class}.
393
+ Lactobacillus, Bifidobacterium, Akkermansia, Clostridium butyricum.
394
+ Clinical trials and efficacy data.
395
+ """
396
+ intervention_chunks["probiotics"] = self.retrieve(
397
+ query_text=probiotics_query,
398
+ top_k=10,
399
+ metadata_filters={"cancer": cancer_type}
400
+ )
401
+
402
+ return intervention_chunks
+
+     # ------------------------------------------------------------------
+     # Helpers
+     # ------------------------------------------------------------------
+
+     def _get_ici_class(self, drug_name: str) -> str:
+         """Map drug name to ICI class"""
+         return config.ICI_DRUG_CLASS_MAP.get(drug_name.lower(), "Immune Checkpoint Inhibitor")
+
+     def _get_act_class(self, drug_name: str) -> str:
+         """Map drug name to ACT class"""
+         return config.ACT_DRUG_CLASS_MAP.get(drug_name.lower(), "Adoptive Cell Therapy")
+
+     def _get_therapy_type(self, patient_data: Dict) -> str:
+         """Determine therapy type from patient data"""
+         if "therapy_type" in patient_data["immunotherapy"]:
+             return patient_data["immunotherapy"]["therapy_type"]
+         drug_name = patient_data["immunotherapy"]["drug_name"].lower()
+         return config.THERAPY_TYPE_MAP.get(drug_name, "ICI")
+
+     def format_chunks_for_llm(self, chunks: List[Dict]) -> str:
+         """
+         Format retrieved chunks into a structured string for LLM context.
+         Returns markdown-formatted evidence with citations.
+         """
+         if not chunks:
+             return "No relevant evidence retrieved."
+
+         formatted = "# Retrieved Evidence\n\n"
+
+         for i, chunk in enumerate(chunks, 1):
+             # paper is stored as a JSON string in ChromaDB — deserialise it first
+             paper_meta = self._parse_paper_meta(chunk["metadata"])
+
+             citation = paper_meta.get("citation", "Unknown source")
+             text = chunk["text"]
+
+             formatted += f"## Evidence {i}\n"
+             formatted += f"**Citation:** {citation}\n"
+             formatted += f"**Content:** {text}\n\n"
+
+         return formatted
+
+     def get_unique_citations(self, chunks: List[Dict]) -> Set[str]:
+         """Extract unique citations from chunks for a references section."""
+         citations = set()
+         for chunk in chunks:
+             # paper is stored as a JSON string in ChromaDB — deserialise it first
+             paper_meta = self._parse_paper_meta(chunk["metadata"])
+             citation = paper_meta.get("citation")
+             if citation:
+                 citations.add(citation)
+         return citations
+
+     def get_unique_citation_metadata(self, chunks: List[Dict]) -> Set[tuple]:
+         """
+         Extract unique (citation, title) tuples from chunks.
+         Used for the final References section to show paper titles.
+         """
+         meta = set()
+         for chunk in chunks:
+             # paper is stored as a JSON string in ChromaDB — deserialise it first
+             paper_meta = self._parse_paper_meta(chunk["metadata"])
+             citation = paper_meta.get("citation")
+             if citation:
+                 # Get title, falling back to citation if missing
+                 title = paper_meta.get("title", citation)
+                 meta.add((citation, title))
+         return meta
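
The retriever's public surface composes in three steps: fetch chunks for a section, render them as an evidence block for the prompt, and collect deduplicated citations for the References section. A minimal sketch of that flow (the patient JSON path is illustrative, not a file shipped in this commit):

```python
import json
from src.rag import RAGRetriever

# Assumed: a patient record matching the pipeline schema used in this diff
with open("examples/patient.json", encoding="utf-8") as f:
    patient_data = json.load(f)

retriever = RAGRetriever()
chunks = retriever.retrieve_for_section_4(patient_data)  # [] when no confounders apply

evidence = retriever.format_chunks_for_llm(chunks)           # markdown block for the LLM prompt
references = retriever.get_unique_citation_metadata(chunks)  # {(citation, title), ...}
```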
src/report_assembler.py ADDED
@@ -0,0 +1,333 @@
+ """
+ Report assembler - combines sections into final markdown report
+ """
+
+ import json
+ import logging
+ from datetime import datetime
+ from pathlib import Path
+ from typing import Dict
+
+ from . import config
+ from .section_generators import SectionGenerator
+
+ logger = logging.getLogger(__name__)
+
+
+ class ReportAssembler:
+     """Assembles complete clinical report from individual sections.
+
+     Supports two input modes:
+     - JSON: load_patient_data() / generate_and_save() (existing path)
+     - EHR: load_patient_data_from_ehr() / generate_and_save_from_ehr() (new path)
+     """
+
+     def __init__(self):
+         self.generator = SectionGenerator()
+
+     def load_patient_data(self, json_path: str) -> Dict:
+         """Load patient JSON data from file"""
+         logger.info(f"Loading patient data from {json_path}")
+
+         with open(json_path, 'r', encoding='utf-8') as f:
+             patient_data = json.load(f)
+
+         return patient_data
+
+     def load_patient_data_from_ehr(self, ehr_path: str) -> Dict:
+         """Extract patient JSON from a raw EHR text file using MedGemma.
+
+         Args:
+             ehr_path: Path to the plain-text EHR report.
+
+         Returns:
+             Patient data dictionary matching the pipeline schema.
+         """
+         # Imported here so the JSON-only code path has zero extra import cost
+         from .ehr_extractor import EHRExtractor
+
+         logger.info(f"Extracting patient data from EHR: {ehr_path}")
+         extractor = EHRExtractor()
+         return extractor.extract_from_file(ehr_path)
+
+     def generate_full_report(self, patient_data: Dict) -> str:
+         """
+         Generate complete clinical report
+
+         Args:
+             patient_data: Patient JSON dictionary
+
+         Returns:
+             Complete report as markdown string
+         """
+         logger.info("Starting full report generation")
+
+         report_sections = []
+
+         # Section 0: Preamble (always included, not LLM-generated)
+         logger.info("Generating preamble")
+         preamble = self.generator.generate_preamble(patient_data)
+         report_sections.append(preamble)
+
+         # Sections 1-5 are appended only when their generator returns content;
+         # a None return (missing data or no retrieved evidence) omits the section.
+
+         # Section 1: Microbiome Composition Profile
+         section_1 = self.generator.generate_section_1(patient_data)
+         if section_1:
+             report_sections.append(section_1)
+
+         # Section 2: Metabolite Landscape
+         section_2 = self.generator.generate_section_2(patient_data)
+         if section_2:
+             report_sections.append(section_2)
+
+         # Section 3: Drug-Microbiome Interaction Outlook
+         section_3 = self.generator.generate_section_3(patient_data)
+         if section_3:
+             report_sections.append(section_3)
+
+         # Section 4: Confounding Factors
+         section_4 = self.generator.generate_section_4(patient_data)
+         if section_4:
+             report_sections.append(section_4)
+
+         # Section 5: Intervention Considerations
+         section_5 = self.generator.generate_section_5(patient_data)
+         if section_5:
+             report_sections.append(section_5)
+
+         # Section 6: Data Quality & Limitations (always included)
+         section_6 = self.generator.generate_section_6(patient_data)
+         report_sections.append(section_6)
+
+         # References section
+         references = self._generate_references_section()
+         report_sections.append(references)
+
+         # Footer
+         footer = self._generate_footer()
+         report_sections.append(footer)
+
+         # Combine all sections
+         full_report = "\n".join(report_sections)
+
+         logger.info("Report generation complete")
+         return full_report
+
+     def generate_full_report_streaming(self, patient_data: Dict):
+         """
+         Generate the complete clinical report section by section, yielding the
+         cumulative markdown string after each section completes.
+
+         Designed for Gradio generator functions: each yield replaces the current
+         content of the output gr.Markdown component, so the clinician sees the
+         report grow in real time.
+
+         Args:
+             patient_data: Patient JSON dictionary.
+
+         Yields:
+             Tuple of (cumulative_report: str, status_message: str) after each
+             section is appended.
+         """
+         logger.info("Starting streaming report generation")
+         accumulated = ""
+
+         # ------------------------------------------------------------------
+         # Section 0: Preamble (no LLM — instant)
+         # ------------------------------------------------------------------
+         logger.info("Generating preamble")
+         preamble = self.generator.generate_preamble(patient_data)
+         accumulated += preamble + "\n"
+         yield accumulated, "⏳ Generating Section 1: Microbiome Composition Profile..."
+
+         # ------------------------------------------------------------------
+         # Section 1: Microbiome Composition Profile
+         # ------------------------------------------------------------------
+         logger.info("Generating section 1")
+         section_1 = self.generator.generate_section_1(patient_data)
+         if section_1:
+             accumulated += section_1 + "\n"
+         yield accumulated, "⏳ Generating Section 2: Metabolite Landscape..."
+
+         # ------------------------------------------------------------------
+         # Section 2: Metabolite Landscape
+         # ------------------------------------------------------------------
+         logger.info("Generating section 2")
+         section_2 = self.generator.generate_section_2(patient_data)
+         if section_2:
+             accumulated += section_2 + "\n"
+         yield accumulated, "⏳ Generating Section 3: Drug–Microbiome Interaction Outlook..."
+
+         # ------------------------------------------------------------------
+         # Section 3: Drug–Microbiome Interaction Outlook
+         # ------------------------------------------------------------------
+         logger.info("Generating section 3")
+         section_3 = self.generator.generate_section_3(patient_data)
+         if section_3:
+             accumulated += section_3 + "\n"
+         yield accumulated, "⏳ Generating Section 4: Confounding Factors..."
+
+         # ------------------------------------------------------------------
+         # Section 4: Confounding Factors
+         # ------------------------------------------------------------------
+         logger.info("Generating section 4")
+         section_4 = self.generator.generate_section_4(patient_data)
+         if section_4:
+             accumulated += section_4 + "\n"
+         yield accumulated, "⏳ Generating Section 5: Intervention Considerations..."
+
+         # ------------------------------------------------------------------
+         # Section 5: Intervention Considerations
+         # ------------------------------------------------------------------
+         logger.info("Generating section 5")
+         section_5 = self.generator.generate_section_5(patient_data)
+         if section_5:
+             accumulated += section_5 + "\n"
+         yield accumulated, "⏳ Generating Section 6: Data Quality & Limitations..."
+
+         # ------------------------------------------------------------------
+         # Section 6: Data Quality & Limitations (always included)
+         # ------------------------------------------------------------------
+         logger.info("Generating section 6")
+         section_6 = self.generator.generate_section_6(patient_data)
+         accumulated += section_6 + "\n"
+         yield accumulated, "⏳ Compiling references and finalising report..."
+
+         # ------------------------------------------------------------------
+         # References + Footer (no LLM — instant)
+         # ------------------------------------------------------------------
+         logger.info("Generating references and footer")
+         references = self._generate_references_section()
+         footer = self._generate_footer()
+         accumulated += references + footer
+
+         logger.info("Streaming report generation complete")
+         yield accumulated, "✅ Report complete"
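
Because every yield pairs the cumulative report with a status message, the generator drops straight into a Gradio Blocks app. A minimal wiring sketch, assuming a JSON textbox as input (component names are illustrative; the actual app.py in this commit may wire things differently):

```python
import json
import gradio as gr
from src.report_assembler import ReportAssembler

assembler = ReportAssembler()

def run_report(patient_json: str):
    patient_data = json.loads(patient_json)
    # Re-yield each (report, status) pair; Gradio replaces both outputs per yield
    yield from assembler.generate_full_report_streaming(patient_data)

with gr.Blocks() as demo:
    patient_box = gr.Textbox(label="Patient JSON", lines=10)
    status = gr.Markdown()
    report_md = gr.Markdown()
    gr.Button("Generate report").click(
        run_report, inputs=patient_box, outputs=[report_md, status]
    )

demo.launch()
```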
+
+     def _generate_references_section(self) -> str:
+         """Generate references section from all citations and titles used"""
+         # get_all_citations returns a sorted list of (citation, title) tuples
+         references_data = self.generator.get_all_citations()
+
+         if not references_data:
+             return ""
+
+         references = "## References\n\n"
+         references += "The following peer-reviewed publications were cited in this report:\n\n"
+
+         for i, (citation, title) in enumerate(references_data, 1):
+             if title and title != citation:
+                 references += f"{i}. {citation}: {title}\n"
+             else:
+                 references += f"{i}. {citation}\n"
+
+         references += "\n"
+         return references
+
+     def _generate_footer(self) -> str:
+         """Generate report footer with metadata"""
+         timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+
+         footer = f"""---
+
+ **Report Generated:** {timestamp}
+ **Model:** MedGemma 1.5 4B
+ **System:** Microbiome-Immunotherapy Clinical Decision Support v1.0
+
+ *This report is intended for use by qualified healthcare professionals as a clinical decision support tool. It does not constitute medical advice and should be interpreted in conjunction with comprehensive clinical evaluation.*
+ """
+         return footer
+
+     def save_report(self, report: str, patient_id: str, output_dir: str = None) -> str:
+         """
+         Save report to markdown file
+
+         Args:
+             report: Complete report markdown string
+             patient_id: Patient identifier for filename
+             output_dir: Output directory (uses config default if not provided)
+
+         Returns:
+             Path to saved report file
+         """
+         if output_dir is None:
+             output_dir = config.OUTPUT_DIR
+
+         # Create output directory if it doesn't exist
+         output_path = Path(output_dir)
+         output_path.mkdir(parents=True, exist_ok=True)
+
+         # Generate filename
+         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+         filename = f"microbiome_ici_report_{patient_id}_{timestamp}.md"
+         filepath = output_path / filename
+
+         # Save report
+         with open(filepath, 'w', encoding='utf-8') as f:
+             f.write(report)
+
+         logger.info(f"Report saved to: {filepath}")
+         return str(filepath)
+
+     def generate_and_save(self, patient_json_path: str, output_dir: str = None) -> str:
+         """
+         Complete workflow: load data, generate report, save to file
+
+         Args:
+             patient_json_path: Path to patient JSON file
+             output_dir: Optional output directory override
+
+         Returns:
+             Path to saved report file
+         """
+         # Load patient data
+         patient_data = self.load_patient_data(patient_json_path)
+         patient_id = patient_data["patient"]["id"]
+
+         # Generate report
+         report = self.generate_full_report(patient_data)
+
+         # Save report
+         output_path = self.save_report(report, patient_id, output_dir)
+
+         return output_path
+
+     def generate_and_save_from_ehr(
+         self,
+         ehr_path: str,
+         output_dir: str = None,
+         save_json_path: str = None,
+     ) -> str:
+         """
+         Complete EHR workflow: extract JSON from EHR, generate report, save to file.
+
+         Args:
+             ehr_path: Path to the plain-text EHR report.
+             output_dir: Optional output directory override.
+             save_json_path: If provided, save the extracted patient JSON to this path
+                 so it can be inspected or reused without re-running extraction.
+
+         Returns:
+             Path to the saved report markdown file.
+         """
+         # Step 1: Extract patient data from EHR
+         patient_data = self.load_patient_data_from_ehr(ehr_path)
+         patient_id = patient_data["patient"]["id"]
+
+         # Step 2: Optionally save the extracted JSON
+         # (json and Path are already imported at module level)
+         if save_json_path:
+             Path(save_json_path).parent.mkdir(parents=True, exist_ok=True)
+             with open(save_json_path, "w", encoding="utf-8") as f:
+                 json.dump(patient_data, f, indent=2)
+             logger.info(f"Extracted patient JSON saved to: {save_json_path}")
+
+         # Step 3: Generate report
+         report = self.generate_full_report(patient_data)
+
+         # Step 4: Save report
+         output_path = self.save_report(report, patient_id, output_dir)
+
+         return output_path
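
Both entry points wrap the same generate-and-save workflow; only the loading step differs. A short sketch of the two modes (file paths are illustrative):

```python
from src.report_assembler import ReportAssembler

assembler = ReportAssembler()

# Structured-JSON mode: the patient record already exists
report_path = assembler.generate_and_save("data/patient_001.json")

# Raw-EHR mode: MedGemma extracts the record first; the intermediate
# JSON is kept on disk so extraction can be inspected or reused
report_path = assembler.generate_and_save_from_ehr(
    "data/patient_001_ehr.txt",
    save_json_path="output/patient_001_extracted.json",
)
```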
src/section_generators.py ADDED
@@ -0,0 +1,489 @@
+ """
+ Section generation functions for each report section
+ """
+
+ import logging
+ from typing import Dict, Optional, List
+
+ from .models import get_medgemma
+ from .rag import RAGRetriever
+ from .prompts import build_prompt, SECTION_6_PROMPT, SECTION_6_FIXED_CAVEATS
+ from . import config
+
+ logger = logging.getLogger(__name__)
+
+
+ class SectionGenerator:
+     """Handles generation of individual report sections"""
+
+     def __init__(self):
+         self.llm = get_medgemma()
+         self.rag = RAGRetriever()
+         self.all_citations = {}  # Map citation -> title across all sections
+
+     def generate_preamble(self, patient_data: Dict) -> str:
+         """
+         Generate Section 0: Clinical Preamble (auto-populated, no LLM)
+         """
+         p = patient_data["patient"]
+         c = patient_data["cancer"]
+         i = patient_data["immunotherapy"]
+         m = patient_data["microbiome"]
+
+         # Format metastases
+         metastases_str = ", ".join(c["metastases"]) if c["metastases"] else "none"
+
+         # Determine therapy type
+         therapy_type = i.get("therapy_type", "ICI")
+
+         preamble = f"""# Microbiome-Immunotherapy Clinical Report
+
+ **Patient ID:** {p['id']}
+ **Age:** {p['age']} years
+ **Gender:** {p['gender']}
+
+ ## Clinical Context
+
+ **Cancer Diagnosis:** {c['stage']} {c['type']}"""
+
+         if c.get('subtype'):
+             preamble += f" ({c['subtype']})"
+
+         preamble += f"""
+ **Primary Site:** {c['primary_site']}
+ **Metastases:** {metastases_str}
+ **Diagnosis Date:** {c['diagnosis_date']}
+
+ **Tumor Biomarkers:**
+ - PD-L1 Expression: {c['biomarkers']['pdl1_expression']}
+ - Tumor Mutational Burden (TMB): {c['biomarkers']['tmb']}
+ - Microsatellite Instability (MSI): {c['biomarkers']['msi_status']}
+
+ ## Planned Immunotherapy
+
+ **Therapy Type:** {therapy_type}
+ **Drug:** {i['drug_name']} ({i['drug_class']})
+ **Treatment Setting:** {i['treatment_setting']}
+ **Line of Therapy:** {i['line_of_therapy']}
+ **Planned Start Date:** {i['planned_start_date']}
+ """
+
+         # Add ACT-specific details if present
+         if therapy_type == "ACT" and i.get("act_details"):
+             act = i["act_details"]
+             preamble += f"""
+ **ACT Details:**
+ - ACT Type: {act.get('act_type', 'N/A')}
+ - Target Antigen: {act.get('target_antigen', 'N/A')}
+ - Cell Source: {act.get('cell_source', 'N/A')}
+ - Preconditioning Regimen: {act.get('preconditioning_regimen', 'N/A')}
+ - T-Cell Harvest Date: {act.get('t_cell_harvest_date', 'N/A')}
+ - Expected CRS Risk: {act.get('expected_crs_risk', 'N/A')}
+ - Expected Neurotoxicity Risk: {act.get('expected_neurotoxicity_risk', 'N/A')}
+ """
+
+         preamble += f"""
+ ## Microbiome Profile Overview
+
+ **Sample Date:** {m['sample_date']}
+ **Sequencing Method:** {m['sequencing_method']}
+
+ This report summarizes gut microbiome findings relevant to anticipated immunotherapy response based on current evidence from peer-reviewed literature.
+
+ ---
+ """
+         return preamble
+
+     def generate_section_1(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 1: Microbiome Diversity & Composition Profile
+         """
+         logger.info("Generating Section 1: Microbiome Diversity & Composition Profile")
+
+         # Retrieve evidence
+         chunks = self.rag.retrieve_for_section_1(patient_data)
+
+         if not chunks:
+             logger.warning("No evidence retrieved for Section 1, omitting section")
+             return None
+
+         # Track citations
+         for citation, title in self.rag.get_unique_citation_metadata(chunks):
+             self.all_citations[citation] = title
+
+         # Format evidence
+         evidence = self.rag.format_chunks_for_llm(chunks)
+
+         # Prepare detected taxa string
+         key_bacteria = patient_data["microbiome"]["key_bacteria"]
+         detected_taxa_lines = []
+         for taxon, abundance in key_bacteria.items():
+             if abundance is not None and abundance > 0:
+                 taxon_display = taxon.replace("_", " ").title()
+                 detected_taxa_lines.append(f"- {taxon_display}: {abundance}%")
+
+         detected_taxa_str = "\n".join(detected_taxa_lines) if detected_taxa_lines else "None detected above threshold"
+
+         # Build prompt
+         diversity = patient_data["microbiome"]["diversity"]
+         prompt = build_prompt(
+             "section_1",
+             patient_data,
+             evidence,
+             cancer_stage=patient_data["cancer"]["stage"],
+             shannon_index=diversity["shannon_index"],
+             simpson_index=diversity["simpson_index"],
+             observed_species=diversity["observed_species"],
+             detected_taxa=detected_taxa_str,
+         )
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_1"])
+
+         return f"## 1. Microbiome Diversity & Composition Profile\n\n{content}\n\n"
+
+     def generate_section_2(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 2: Metabolite Landscape
+         """
+         logger.info("Generating Section 2: Metabolite Landscape")
+
+         # Check if metabolite data is available
+         metabolites = patient_data["microbiome"]["metabolites"]
+         has_scfa = any(v is not None for v in metabolites["scfa"].values())
+         has_metabolites = has_scfa or metabolites["bile_acids_available"] or metabolites["tryptophan_metabolites_available"]
+
+         if not has_metabolites:
+             logger.info("No metabolite data available, omitting Section 2")
+             return None
+
+         # Retrieve evidence
+         chunks = self.rag.retrieve_for_section_2(patient_data)
+
+         if not chunks:
+             logger.warning("No evidence retrieved for Section 2, omitting section")
+             return None
+
+         # Track citations
+         for citation, title in self.rag.get_unique_citation_metadata(chunks):
+             self.all_citations[citation] = title
+
+         # Format evidence
+         evidence = self.rag.format_chunks_for_llm(chunks)
+
+         # Prepare metabolite data string
+         metabolite_lines = []
+
+         if has_scfa:
+             metabolite_lines.append("**Short-Chain Fatty Acids:**")
+             scfa = metabolites["scfa"]
+             if scfa["butyrate_uM"] is not None:
+                 metabolite_lines.append(f"- Butyrate: {scfa['butyrate_uM']} μM")
+             if scfa["propionate_uM"] is not None:
+                 metabolite_lines.append(f"- Propionate: {scfa['propionate_uM']} μM")
+             if scfa["acetate_uM"] is not None:
+                 metabolite_lines.append(f"- Acetate: {scfa['acetate_uM']} μM")
+
+         if metabolites["bile_acids_available"]:
+             metabolite_lines.append("**Bile Acids:** Analysis available")
+
+         if metabolites["tryptophan_metabolites_available"]:
+             metabolite_lines.append("**Tryptophan Metabolites:** Analysis available")
+
+         metabolite_data_str = "\n".join(metabolite_lines)
+
+         # Build prompt
+         prompt = build_prompt(
+             "section_2",
+             patient_data,
+             evidence,
+             metabolite_data=metabolite_data_str,
+         )
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_2"])
+
+         return f"## 2. Metabolite Landscape\n\n{content}\n\n"
+
+     def generate_section_3(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 3: Drug–Microbiome Interaction Outlook
+         Supports both ICI and ACT therapies
+         """
+         logger.info("Generating Section 3: Drug–Microbiome Interaction Outlook")
+
+         # Retrieve evidence
+         chunks = self.rag.retrieve_for_section_3(patient_data)
+
+         if not chunks:
+             logger.warning("No evidence retrieved for Section 3, omitting section")
+             return None
+
+         # Track citations
+         for citation, title in self.rag.get_unique_citation_metadata(chunks):
+             self.all_citations[citation] = title
+
+         # Format evidence
+         evidence = self.rag.format_chunks_for_llm(chunks)
+
+         # Prepare summaries
+         diversity = patient_data["microbiome"]["diversity"]
+         key_bacteria = patient_data["microbiome"]["key_bacteria"]
+
+         # Key taxa summary: top five detected taxa by relative abundance
+         detected_taxa = [
+             (k.replace("_", " ").title(), v)
+             for k, v in key_bacteria.items()
+             if v is not None and v > 0
+         ]
+         detected_taxa.sort(key=lambda x: x[1], reverse=True)
+         key_taxa_summary = ", ".join([f"{t} ({a}%)" for t, a in detected_taxa[:5]])
+
+         # Metabolite summary
+         metabolites = patient_data["microbiome"]["metabolites"]
+         metabolite_flags = []
+         if any(v is not None for v in metabolites["scfa"].values()):
+             metabolite_flags.append("SCFAs measured")
+         if metabolites["bile_acids_available"]:
+             metabolite_flags.append("bile acids available")
+         if metabolites["tryptophan_metabolites_available"]:
+             metabolite_flags.append("tryptophan metabolites available")
+         metabolite_summary = ", ".join(metabolite_flags) if metabolite_flags else "limited metabolite data"
+
+         # Determine therapy type
+         therapy_type = patient_data["immunotherapy"].get("therapy_type", "ICI")
+
+         # Build prompt based on therapy type
+         if therapy_type == "ACT":
+             act_details = patient_data["immunotherapy"].get("act_details", {})
+             prompt = build_prompt(
+                 "section_3",
+                 patient_data,
+                 evidence,
+                 cancer_stage=patient_data["cancer"]["stage"],
+                 act_type=act_details.get("act_type", "CAR-T"),
+                 target_antigen=act_details.get("target_antigen", "CD19"),
+                 cell_source=act_details.get("cell_source", "autologous"),
+                 crs_risk=act_details.get("expected_crs_risk", "unknown"),
+                 neurotoxicity_risk=act_details.get("expected_neurotoxicity_risk", "unknown"),
+                 line_of_therapy=patient_data["immunotherapy"]["line_of_therapy"],
+                 shannon_index=diversity["shannon_index"],
+                 simpson_index=diversity["simpson_index"],
+                 key_taxa_summary=key_taxa_summary,
+                 metabolite_summary=metabolite_summary,
+             )
+         else:  # ICI
+             biomarkers = patient_data["cancer"]["biomarkers"]
+             prompt = build_prompt(
+                 "section_3",
+                 patient_data,
+                 evidence,
+                 cancer_stage=patient_data["cancer"]["stage"],
+                 pdl1=biomarkers["pdl1_expression"],
+                 tmb=biomarkers["tmb"],
+                 msi=biomarkers["msi_status"],
+                 line_of_therapy=patient_data["immunotherapy"]["line_of_therapy"],
+                 shannon_index=diversity["shannon_index"],
+                 simpson_index=diversity["simpson_index"],
+                 key_taxa_summary=key_taxa_summary,
+                 metabolite_summary=metabolite_summary,
+             )
+
+         # Generate (largest token budget: this is the core interpretive section)
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_3"])
+
+         # Section title varies by therapy type
+         if therapy_type == "ACT":
+             section_title = "3. Microbiome–ACT Interaction Outlook"
+         else:
+             section_title = "3. Drug–Microbiome Interaction Outlook"
+
+         return f"## {section_title}\n\n{content}\n\n"
+
+     def generate_section_4(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 4: Confounding Factors
+         """
+         logger.info("Generating Section 4: Confounding Factors")
+
+         # Check if any confounding factors are present
+         meds = patient_data["medications"]
+         prior = patient_data["prior_treatments"]
+
+         has_confounders = (
+             meds["antibiotic_history"]["recent_antibiotics"] or
+             meds["ppi_use"]["currently_on_ppi"] or
+             prior["chemotherapy"]["received"] or
+             prior["prior_immunotherapy"]["received"] or
+             len(patient_data.get("comorbidities", [])) > 0
+         )
+
+         if not has_confounders:
+             logger.info("No confounding factors present, omitting Section 4")
+             return None
+
+         # Retrieve evidence
+         chunks = self.rag.retrieve_for_section_4(patient_data)
+
+         if not chunks:
+             logger.warning("No evidence retrieved for Section 4, omitting section")
+             return None
+
+         # Track citations
+         for citation, title in self.rag.get_unique_citation_metadata(chunks):
+             self.all_citations[citation] = title
+
+         # Format evidence
+         evidence = self.rag.format_chunks_for_llm(chunks)
+
+         # Prepare confounding data string
+         confounding_lines = []
+
+         # Antibiotic history
+         if meds["antibiotic_history"]["recent_antibiotics"]:
+             confounding_lines.append("**Antibiotic Exposure:**")
+             for exp in meds["antibiotic_history"]["exposures"]:
+                 confounding_lines.append(
+                     f"- {exp['antibiotic_name']} ({exp['antibiotic_class']}): "
+                     f"{exp['start_date']} to {exp['end_date']} "
+                     f"({exp['days_before_ici']} days before ICI start)"
+                 )
+
+         # PPI use
+         if meds["ppi_use"]["currently_on_ppi"]:
+             ppi = meds["ppi_use"]
+             confounding_lines.append("**Proton Pump Inhibitor Use:**")
+             confounding_lines.append(f"- {ppi['ppi_name']}, duration: {ppi['duration_months']} months")
+
+         # Prior chemotherapy
+         if prior["chemotherapy"]["received"]:
+             chemo = prior["chemotherapy"]
+             regimens_str = ", ".join(chemo["regimens"])
+             confounding_lines.append("**Prior Chemotherapy:**")
+             confounding_lines.append(f"- Regimens: {regimens_str}")
+             confounding_lines.append(f"- Response: {chemo['response']}")
+
+         # Prior immunotherapy
+         if prior["prior_immunotherapy"]["received"]:
+             prior_ici = prior["prior_immunotherapy"]
+             drugs_str = ", ".join(prior_ici["drugs"])
+             confounding_lines.append("**Prior Immunotherapy:**")
+             confounding_lines.append(f"- Drugs: {drugs_str}")
+             confounding_lines.append(f"- Response: {prior_ici['response']}")
+
+         # Comorbidities
+         if patient_data.get("comorbidities"):
+             comorbidities_str = ", ".join(patient_data["comorbidities"])
+             confounding_lines.append(f"**Comorbidities:** {comorbidities_str}")
+
+         confounding_data_str = "\n".join(confounding_lines)
+
+         # Build prompt
+         prompt = build_prompt(
+             "section_4",
+             patient_data,
+             evidence,
+             confounding_data=confounding_data_str,
+         )
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_4"])
+
+         return f"## 4. Confounding Factors\n\n{content}\n\n"
+
+     def generate_section_5(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 5: Microbiota-Modulation Intervention Considerations
+         """
+         logger.info("Generating Section 5: Microbiota-Modulation Intervention Considerations")
+
+         # Retrieve evidence for each intervention type
+         intervention_chunks = self.rag.retrieve_for_section_5(patient_data)
+
+         # Check if any intervention evidence was retrieved
+         total_chunks = sum(len(chunks) for chunks in intervention_chunks.values())
+         if total_chunks == 0:
+             logger.warning("No intervention evidence retrieved for Section 5, omitting section")
+             return None
+
+         # Track citations from all intervention types
+         for chunks in intervention_chunks.values():
+             for citation, title in self.rag.get_unique_citation_metadata(chunks):
+                 self.all_citations[citation] = title
+
+         # Format evidence for each intervention type
+         evidence_str = "## Diet and Prebiotics Evidence\n\n"
+         evidence_str += self.rag.format_chunks_for_llm(intervention_chunks.get("diet", []))
+         evidence_str += "\n\n## Probiotics Evidence\n\n"
+         evidence_str += self.rag.format_chunks_for_llm(intervention_chunks.get("probiotics", []))
+
+         # Prepare microbiome summary
+         key_bacteria = patient_data["microbiome"]["key_bacteria"]
+         detected_taxa = [
+             k.replace("_", " ").title()
+             for k, v in key_bacteria.items()
+             if v is not None and v > 0
+         ]
+         microbiome_summary = f"Detected taxa: {', '.join(detected_taxa[:5])}"
+
+         # Build prompt
+         prompt = build_prompt(
+             "section_5",
+             patient_data,
+             evidence_str,
+             microbiome_summary=microbiome_summary,
+         )
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_5"])
+
+         return f"## 5. Microbiota-Modulation Intervention Considerations\n\n{content}\n\n"
+
+     def generate_section_6(self, patient_data: Dict) -> str:
+         """
+         Generate Section 6: Data Quality & Interpretive Limitations
+         """
+         logger.info("Generating Section 6: Data Quality & Interpretive Limitations")
+
+         # This section doesn't use RAG, just data quality fields
+         data_quality = patient_data["microbiome"]["data_quality"]
+         metabolites = patient_data["microbiome"]["metabolites"]
+
+         # Prepare data quality string
+         dq_lines = [
+             f"**Sequencing Method:** {patient_data['microbiome']['sequencing_method']}",
+             f"**Data Completeness:** {data_quality['completeness']}",
+             f"**Data Source:** {data_quality['source']}",
+         ]
+
+         if data_quality.get("limitations"):
+             dq_lines.append(f"**Limitations:** {', '.join(data_quality['limitations'])}")
+
+         # Note missing metabolite data
+         missing_metabolites = []
+         if not any(v is not None for v in metabolites["scfa"].values()):
+             missing_metabolites.append("Short-chain fatty acids")
+         if not metabolites["bile_acids_available"]:
+             missing_metabolites.append("Bile acids")
+         if not metabolites["tryptophan_metabolites_available"]:
+             missing_metabolites.append("Tryptophan metabolites")
+
+         if missing_metabolites:
+             dq_lines.append(f"**Missing Metabolite Data:** {', '.join(missing_metabolites)}")
+
+         data_quality_str = "\n".join(dq_lines)
+
+         # Build prompt (no RAG evidence needed)
+         prompt = SECTION_6_PROMPT.format(data_quality=data_quality_str)
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_6"])
+
+         # Prepend the fixed caveats so they appear even if the model omits them
+         full_content = f"{SECTION_6_FIXED_CAVEATS}\n\n{content}"
+         return f"## 6. Data Quality & Interpretive Limitations\n\n{full_content}\n\n"
+
+     def get_all_citations(self) -> List[tuple]:
+         """Return sorted list of all unique (citation, title) tuples used in the report"""
+         return sorted(self.all_citations.items())
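
A standalone usage sketch of the generator (assumes `patient_data` is loaded as in the earlier examples); it shows the None-means-omit contract and the citation accumulator that feeds the assembler's References section:

```python
from src.section_generators import SectionGenerator

generator = SectionGenerator()

# Returns None when no evidence is retrieved, which is how
# the assembler decides to omit a section entirely
section_md = generator.generate_section_3(patient_data)
if section_md:
    print(section_md)

# Citations accumulate across every section generated by this instance
for citation, title in generator.get_all_citations():
    print(f"{citation}: {title}")
```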