fierce74 committed
Commit 7529164 · 1 Parent(s): e66ab71

Add application file
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 patient74
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,14 +1,71 @@
- ---
- title: Microbiome Immunotherapy CDS
- emoji: 📚
- colorFrom: indigo
- colorTo: gray
- sdk: gradio
- sdk_version: 6.6.0
- app_file: app.py
- pinned: false
- license: mit
- short_description: ' An Evidence-Based Clinical Decision Support Tool'
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Microbiome-Immunotherapy Clinical Decision Support System
+ ### An Evidence-Based Clinical Decision Support Tool
+
+ This project provides a clinical decision support system that optimizes immunotherapy (ICI/ACT) treatment based on a patient's gut microbiome profile. It leverages MedGemma 1.5 4B and a specialized RAG pipeline to generate evidence-based clinical reports.
+
+ ## Architecture Overview
+
+ The system processes patient data and clinical evidence through a modular pipeline to produce a six-section clinical report:
+
+ 1. **Microbiome Composition**: Profile of diversity and key taxa.
+ 2. **Metabolite Landscape**: Analysis of SCFAs, bile acids, and tryptophan metabolites.
+ 3. **Drug-Microbiome Interaction**: Core interpretation of the microbiome's impact on drug efficacy.
+ 4. **Confounding Factors**: Impact of antibiotics, PPIs, and prior treatments.
+ 5. **Intervention Considerations**: Evidence-based dietary or probiotic suggestions.
+ 6. **Data Quality & Limitations**: Assessment of report confidence.
+
+ Each section is generated using targeted RAG retrieval from a database of peer-reviewed medical literature.
+
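+ As an illustration, the per-section retrieval step might look like the sketch below. The function and collection names here are hypothetical (the real query construction lives in `src/`); only the ChromaDB calls are standard API:
+
+ ```python
+ # Illustrative sketch; `retrieve_evidence` and the collection name are hypothetical.
+ import chromadb
+
+ client = chromadb.PersistentClient(path="./chroma_db")
+ collection = client.get_collection("research_papers")  # assumed collection name
+
+ def retrieve_evidence(section_topic: str, patient_summary: str, k: int = 5) -> list:
+     """Fetch the k most relevant literature chunks for one report section."""
+     query = f"{section_topic}: {patient_summary}"
+     results = collection.query(query_texts=[query], n_results=k)
+     return results["documents"][0]
+ ```
+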
+ ## Key Features
+
+ - **EHR Extraction**: Automatically parses raw Electronic Health Record (EHR) text into structured patient data using MedGemma (see the example below).
+ - **Medical RAG**: Domain-specific retrieval system using PubMedBERT embeddings and table-aware chunking.
+ - **Multi-Model Support**: Designed for MedGemma 1.5 but extensible to other LLMs.
+
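+ For example, EHR extraction can be invoked programmatically through `ReportAssembler`, the same call `app.py` makes (the input path here is illustrative):
+
+ ```python
+ from src.report_assembler import ReportAssembler
+
+ assembler = ReportAssembler()  # loads MedGemma + PubMedBERT once
+ # Parse a raw EHR text file into the structured patient schema
+ patient_data = assembler.load_patient_data_from_ehr("data/input/patient_ehr_1.txt")
+ ```
+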
+ ## Project Structure
+
+ ```text
+ ├── rag/                 # RAG pipeline for indexing medical papers
+ ├── src/                 # Core application logic (models, prompts, generators)
+ ├── data/                # Patient data and clinical inputs
+ ├── outputs/             # Generated clinical reports (Markdown)
+ ├── generate_report.py   # Main CLI entry point
+ └── requirements.txt     # Project dependencies
+ ```
+
+ ## Getting Started
+
+ ### Prerequisites
+
+ - A built RAG index (see `rag/README.md`)
+ - Python 3.10+
+ - CUDA-compatible GPU (recommended for MedGemma and PubMedBERT)
+ - HuggingFace access to `google/medgemma-1.5-4b-it`
+
+ ### Installation
+
+ 1. Clone the repository.
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ pip install -r rag/requirements.txt
+ ```
+
+ ## Usage
+
+ Generate a report from structured patient JSON
+ (see `data/templates/patient_schema_template.json`):
+ ```bash
+ python generate_report.py data/patient_example.json
+ ```
+
+ Generate a report from raw EHR text:
+ ```bash
+ python generate_report.py data/patient_ehr.txt --save-ehr-json outputs/patient_profile.json
+ ```
+
+ ## Examples
+
+ See `data/sample_input/` for example EHR inputs and `outputs/` for the corresponding generated reports.
+
+ ## Configuration
+
+ Settings for model IDs, device selection (CPU/GPU), and RAG parameters can be found in `src/config.py`.
app.py ADDED
@@ -0,0 +1,258 @@
+ """
+ app.py — Gradio entrypoint for the
+ Microbiome-Immunotherapy Clinical Decision Support System
+
+ Startup sequence:
+     1. Download / verify ChromaDB from HuggingFace (chroma_loader)
+     2. Load MedGemma + PubMedBERT once into memory (ReportAssembler.__init__)
+     3. Launch Gradio UI
+
+ Usage:
+     python app.py
+ """
+
+ import logging
+ import sys
+ from pathlib import Path
+ from typing import Generator, Tuple
+
+ import gradio as gr
+
+ # ---------------------------------------------------------------------------
+ # Logging
+ # ---------------------------------------------------------------------------
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s — %(name)s — %(levelname)s — %(message)s",
+     handlers=[logging.StreamHandler(sys.stdout)],
+ )
+ logger = logging.getLogger(__name__)
+
+ # ---------------------------------------------------------------------------
+ # Step 1: Ensure ChromaDB is available locally before anything else
+ # ---------------------------------------------------------------------------
+ logger.info("=" * 70)
+ logger.info("Microbiome-Immunotherapy CDS — starting up")
+ logger.info("=" * 70)
+
+ from src.chroma_loader import ensure_chroma_db
+ ensure_chroma_db()
+
+ # ---------------------------------------------------------------------------
+ # Step 2: Load models once (expensive — happens here, not per request)
+ # ---------------------------------------------------------------------------
+ logger.info("Loading models — this may take a few minutes on first run...")
+ from src.report_assembler import ReportAssembler
+ assembler = ReportAssembler()
+ logger.info("Models loaded successfully.")
+
+ # ---------------------------------------------------------------------------
+ # Step 3: Discover EHR files in data/input/
+ # ---------------------------------------------------------------------------
+ EHR_DIR = Path("data/input")
+
+ def _discover_ehr_files() -> dict:
+     """
+     Scan data/input/ for .txt and .ehr files.
+     Returns a dict mapping display label -> absolute path string,
+     e.g. {"patient_ehr_1": "/abs/path/data/input/patient_ehr_1.txt"}
+     """
+     files = {}
+     for ext in ("*.txt", "*.ehr"):
+         for p in sorted(EHR_DIR.glob(ext)):
+             label = p.stem
+             files[label] = str(p.resolve())
+     return files
+
+ EHR_FILES = _discover_ehr_files()
+
+ if not EHR_FILES:
+     logger.warning(
+         f"No EHR files found in {EHR_DIR}. "
+         "Add .txt or .ehr files there before generating reports."
+     )
+
+ # ---------------------------------------------------------------------------
+ # Step 4: Report generation function (generator — streams section by section)
+ # ---------------------------------------------------------------------------
+
+ def generate_report(ehr_label: str) -> Generator[Tuple[str, str], None, None]:
+     """
+     Gradio generator function.
+
+     Args:
+         ehr_label: The dropdown display label corresponding to an EHR file.
+
+     Yields:
+         Tuple of (report_markdown: str, status_message: str).
+         Each yield updates the UI immediately.
+     """
+     if not ehr_label:
+         yield "", "⚠️ Please select a patient EHR file."
+         return
+
+     ehr_path = EHR_FILES.get(ehr_label)
+     if not ehr_path:
+         yield "", f"⚠️ EHR file not found for selection: '{ehr_label}'"
+         return
+
+     logger.info(f"Report requested for: {ehr_label} → {ehr_path}")
+
+     # Extract patient JSON from EHR text via MedGemma
+     yield "", f"⏳ Extracting structured data from EHR: {ehr_label}..."
+     try:
+         patient_data = assembler.load_patient_data_from_ehr(ehr_path)
+     except Exception as exc:
+         logger.error(f"EHR extraction failed: {exc}", exc_info=True)
+         yield "", f"❌ EHR extraction failed: {exc}"
+         return
+
+     # Stream the report section by section
+     try:
+         for report_so_far, status in assembler.generate_full_report_streaming(patient_data):
+             yield report_so_far, status
+     except Exception as exc:
+         logger.error(f"Report generation failed: {exc}", exc_info=True)
+         yield "", f"❌ Report generation failed: {exc}"
+         return
+
+
+ def clear_outputs():
+     """Reset the output panel when a new EHR is selected from the dropdown."""
+     return "", "Select a patient EHR and click Generate Report."
+
+
+ # ---------------------------------------------------------------------------
+ # Step 5: Build the Gradio UI
+ # ---------------------------------------------------------------------------
+
+ DISCLAIMER_HTML = """
+ <div style="
+     background: #fff3cd;
+     border: 1px solid #ffc107;
+     border-radius: 6px;
+     padding: 12px 18px;
+     margin-bottom: 4px;
+     font-size: 0.92em;
+     color: #664d03;
+     line-height: 1.5;
+ ">
+ <strong>⚠️ Clinical Decision Support Tool — For Healthcare Professional Use Only</strong><br>
+ This system is intended for use by qualified oncologists and clinical teams as a
+ <em>decision support aid</em>. It does not constitute medical advice and must be
+ interpreted in conjunction with comprehensive clinical evaluation. All outputs are
+ AI-generated, evidence-based summaries sourced from peer-reviewed literature;
+ they do not replace clinical judgement.
+ </div>
+ """
+
+ TITLE_HTML = """
+ <div style="text-align: center; padding: 10px 0 4px 0;">
+     <h1 style="font-size: 1.5em; margin: 0; color: #1a1a2e;">
+         🧬 Microbiome–Immunotherapy Clinical Decision Support System
+     </h1>
+     <p style="color: #555; margin: 4px 0 0 0; font-size: 0.95em;">
+         Evidence-based microbiome analysis for ICI &amp; ACT treatment planning
+     </p>
+ </div>
+ """
+
+ with gr.Blocks(
+     title="Microbiome-Immunotherapy CDS",
+     theme=gr.themes.Soft(primary_hue="blue", neutral_hue="slate"),
+ ) as demo:
+
+     # -----------------------------------------------------------------------
+     # Header
+     # -----------------------------------------------------------------------
+     gr.HTML(TITLE_HTML)
+     gr.HTML(DISCLAIMER_HTML)
+
+     # -----------------------------------------------------------------------
+     # Main layout: left control panel | right report output
+     # -----------------------------------------------------------------------
+     with gr.Row(equal_height=False):
+
+         # -------------------------------------------------------------------
+         # LEFT: Controls
+         # -------------------------------------------------------------------
+         with gr.Column(scale=1, min_width=260):
+
+             gr.Markdown("### Patient Selection")
+
+             ehr_dropdown = gr.Dropdown(
+                 choices=list(EHR_FILES.keys()),
+                 label="Select Patient EHR",
+                 info="EHR reports available in data/input/",
+                 interactive=True,
+                 value=None,
+             )
+
+             generate_btn = gr.Button(
+                 "Generate Report",
+                 variant="primary",
+                 size="lg",
+             )
+
+             gr.Markdown("---")
+             gr.Markdown("### Status")
+
+             status_box = gr.Textbox(
+                 value="Select a patient EHR and click Generate Report.",
+                 label="",
+                 interactive=False,
+                 lines=2,
+                 max_lines=3,
+             )
+
+             gr.Markdown("---")
+             gr.Markdown(
+                 "<small style='color:#888;'>"
+                 "**Model:** MedGemma 1.5 4B &nbsp;|&nbsp; "
+                 "**RAG:** PubMedBERT + ChromaDB<br>"
+                 "**Evidence base:** Peer-reviewed literature on microbiome × immunotherapy"
+                 "</small>"
+             )
+
+         # -------------------------------------------------------------------
+         # RIGHT: Report output
+         # -------------------------------------------------------------------
+         with gr.Column(scale=3):
+
+             gr.Markdown("### Clinical Report")
+
+             report_output = gr.Markdown(
+                 value="*The generated report will appear here, section by section, as it is being written.*",
+                 label="",
+                 # height keeps the panel stable rather than jumping as content grows
+                 height=820,
+             )
+
+     # -----------------------------------------------------------------------
+     # Event wiring
+     # -----------------------------------------------------------------------
+
+     # When a new EHR is selected from the dropdown, clear the output panel
+     ehr_dropdown.change(
+         fn=clear_outputs,
+         inputs=[],
+         outputs=[report_output, status_box],
+     )
+
+     # Generate button — streams the report section by section
+     generate_btn.click(
+         fn=generate_report,
+         inputs=[ehr_dropdown],
+         outputs=[report_output, status_box],
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Launch
+ # ---------------------------------------------------------------------------
+ if __name__ == "__main__":
+     demo.launch(
+         server_name="0.0.0.0",  # bind to all interfaces (accessible on LAN)
+         server_port=7860,
+         show_error=True,
+     )
data/sample_input/demo_ehr_act_dlbcl.txt ADDED
@@ -0,0 +1,268 @@
+ ================================================================================
+ UNIVERSITY MEDICAL CENTER - CELLULAR THERAPY
+ ================================================================================
+ Patient: Marcus Johnson | MRN: UMC-556219 | DOB: 06/08/1965 (Age: 60)
+ Gender: Male | Visit Date: February 8, 2026
+
+ CHIEF COMPLAINT:
+ CAR-T therapy consultation for relapsed/refractory DLBCL.
+
+ HISTORY:
+ 60-year-old male with DLBCL initially diagnosed March 2024. Failed R-CHOP (CR
+ 6 months, relapsed) and R-ICE (PR, progressed at 4 months). Now stage IV with
+ bone marrow involvement. Planned for axicabtagene ciloleucel (CD19 CAR-T).
+
+ LYMPHOMA DIAGNOSIS:
+ - Histology: Diffuse Large B-Cell Lymphoma (non-GCB, ABC type)
+ - Stage: IV (bone marrow 15% involvement)
+ - Initial Diagnosis: March 18, 2024
+ - CD19: Positive (95% tumor cells) - CAR-T eligible
+ - High-risk features: BCL2+ (70%), TP53 mutation, Revised IPI 4
+
+ PRIOR TREATMENTS:
+ 1. R-CHOP × 6 (Apr-Aug 2024) → CR 6 months → Relapsed
+ 2. R-ICE × 3 (Apr-Jun 2025) → PR → Progressed Oct 2025
+ 3. Gemcitabine bridging (weekly, Jan 2026) → Stable disease
+
+ CAR-T PLAN:
+ - Product: Axicabtagene ciloleucel (Yescarta) - CD19-targeted
+ - Leukapheresis: February 14, 2026
+ - Manufacturing: ~17-21 days
+ - Lymphodepletion: Fludarabine/Cyclophosphamide (March 7-9, 2026)
+ - CAR-T Infusion: March 10, 2026 (Day 0)
+
+ CURRENT MEDICATIONS:
+ 1. Levofloxacin 500mg daily (bacterial prophylaxis - ACTIVE)
+ 2. Acyclovir 400mg BID (viral prophylaxis)
+ 3. Fluconazole 200mg daily (fungal prophylaxis)
+ 4. Lisinopril 20mg daily (hypertension)
+ 5. Atorvastatin 40mg daily (hyperlipidemia)
+
+ ANTIBIOTIC EXPOSURE:
+ Current Prophylaxis (ACTIVE):
+ - Levofloxacin 500mg daily since January 25, 2026 (14 days at microbiome sampling)
+ - Indication: Neutropenia prophylaxis post-chemotherapy
+ - Will continue through CAR-T per protocol
+
+ Recent Therapeutic:
+ - Piperacillin-tazobactam 4.5g IV Q6H × 10 days (Dec 18-27, 2025)
+ - Indication: Febrile neutropenia with pneumonia
+ - Days before CAR-T: ~73 days
+
+ PPI USE: None
+
+ PAST MEDICAL HISTORY:
+ Hypertension, hyperlipidemia, Type 2 diabetes (HbA1c 6.8%), chronic back pain.
+ No autoimmune disease. No seizure history (neurotoxicity risk assessment).
+
+ PERFORMANCE STATUS: ECOG 2 | Karnofsky 70%
+
+ VITALS: BP 142/88 | HR 88 | SpO2 97% RA | Temp 98.8°F
+ Weight 198 lbs (90 kg, down 22 lbs) | Height 5'11" (180 cm) | BMI 27.6 | BSA 2.11 m²
+
+ KEY LABS (February 6, 2026):
+ - WBC 3.2, Hgb 10.8, Plt 142, ANC 1.8, ALC 0.9 (lymphopenia post-chemo)
+ - Creatinine 1.1, eGFR 76 (adequate for CAR-T)
+ - AST/ALT 34/42, normal bilirubin
+ - LDH 345 (elevated - disease burden), CRP 24.8, Ferritin 485
+
+ IMAGING (February 2026):
+ - PET-CT: Retroperitoneal nodes 3.2cm SUVmax 18.4, Mesenteric mass 4.8cm SUVmax 22.1
+ - Bone marrow: Diffuse uptake (involvement)
+ - Deauville Score: 5 (very active disease)
+ - MRI Brain: No CNS involvement
+ - Echo: LVEF 58% (normal)
+
+ ================================================================================
+ MICROBIOME ANALYSIS - SUBOPTIMAL BUT MODIFIABLE
+ ================================================================================
+ Sample Date: February 5, 2026 (33 days before planned CAR-T infusion)
+ Method: Shotgun metagenomic sequencing (Illumina NovaSeq, 12M reads)
+ Lab: Precision Microbiome Diagnostics
+
+ CLINICAL CONTEXT:
+ - On prophylactic levofloxacin × 14 days at sampling
+ - Recent piperacillin-tazobactam 49 days prior
+ - Multiple prior chemotherapy regimens
+ - Immunosuppressed (lymphopenia, post-R-CHOP/R-ICE)
+
+ DIVERSITY (LOW-MODERATE - SUBOPTIMAL):
+ - Shannon Index: 2.7 [LOW - risk for poor CAR-T outcomes]
+ - Simpson Index: 0.82
+ - Observed Species: 164 (reduced)
+ - Interpretation: Reduced diversity from antibiotics + chemotherapy
+
+ COMPOSITION:
+ Firmicutes 46.2% | Bacteroidetes 38.8% | Proteobacteria 8.4% ↑ | Actinobacteria 4.2%
+ F/B Ratio: 1.19 (low - dysbiosis)
+
+ KEY TAXA (% relative abundance):
+
+ FAVORABLE (present but suboptimal):
+ - Akkermansia muciniphila: 1.8% [BORDERLINE - reduced]
+ - Faecalibacterium prausnitzii: 3.6% [LOW - key SCFA producer depleted]
+ - Ruminococcaceae: 9.2% [REDUCED]
+ - Ruminococcus lactaris: 1.8% [CAR-T biomarker - Smith 2022]
+ - Lachnospiraceae: 8.8% [REDUCED]
+ - Lachnospira pectinoschiza: 1.2% [CAR-T favorable - Stein-Thoeringer]
+ - Roseburia: 1.8%
+ - Bifidobacterium spp.: 2.2% total [LOW]
+ - Bacteroides eggerthii: 2.4% [CAR-T response predictor]
+
+ CONCERNING (elevated):
+ - Proteobacteria: 8.4% [ELEVATED - dysbiosis marker]
+ - E. coli: 4.2% [ELEVATED - antibiotic selection]
+ - Klebsiella pneumoniae: 1.8% [ELEVATED]
+ - Enterococcus faecalis: 2.1% [ELEVATED - pathobiont]
+ - Bacteroides uniformis: 3.6% [CRS risk - Stein-Thoeringer 2023]
+ - Blautia spp.: 4.2% [High Blautia → worse CAR-T outcomes]
+ - Bacteroides ovatus: 4.8%
+
+ METABOLITES (Measured):
+
+ SCFAs (GC-MS):
+ - Butyrate: 15.8 μM [LOW - CRITICAL for CD8+ T-cell function]
+ - Propionate: 11.2 μM [LOW-MODERATE]
+ - Acetate: 38.4 μM [MODERATE]
+ - Total SCFA: 71.0 μM [LOW - suboptimal for T-cell support]
+
+ Interpretation: REDUCED SCFA production is a major concern. Butyrate is critical
+ for CAR-T cell cytotoxicity. Low levels correlate with reduced Faecalibacterium.
+
+ Bile Acids (LC-MS/MS):
+ - Secondary bile acids: 18.2 μM [LOW]
+ - Secondary/Primary: 0.81 [LOW - reduced microbial conversion]
+
+ Tryptophan Metabolites (LC-MS/MS):
+ - NOT DETECTED - pathway significantly disrupted
+
+ FUNCTIONAL PATHWAYS:
+ - SCFA biosynthesis: 2.8% [SIGNIFICANTLY REDUCED]
+ - Butyrate production: 1.2% [CRITICALLY LOW]
+ - Vitamin B synthesis: 2.1% [REDUCED]
+ - LPS biosynthesis: 2.8% [ELEVATED - inflammation risk]
+ - Antibiotic resistance genes: Detected (fluoroquinolone markers)
+
+ CLINICAL INTERPRETATION FOR CAR-T:
+
+ OVERALL: SUBOPTIMAL BUT MODIFIABLE (33-day intervention window)
+
+ MAJOR CONCERNS (worse CAR-T outcomes in literature):
+ 1. LOW diversity (2.7) - Smith 2022: associated with worse response
+ 2. LOW butyrate (15.8 μM) - Luu 2021: critical for CD8+ T-cell cytotoxicity
+ 3. REDUCED Faecalibacterium (3.6%) - key SCFA producer depleted
+ 4. ACTIVE fluoroquinolone - Prasad 2024: ongoing disruption
+ 5. ELEVATED Proteobacteria (8.4%) - dysbiosis/inflammation
+ 6. ELEVATED Bacteroides uniformis (3.6%) - CRS risk marker
+ 7. High Blautia (4.2%) - associated with worse outcomes
+
+ FAVORABLE ASPECTS:
+ - Akkermansia present (1.8%) - not completely depleted
+ - Ruminococcus lactaris present (1.8%) - CAR-T biomarker
+ - Bacteroides eggerthii present (2.4%) - response predictor
+ - NO PPI use - not adding dysbiosis
+ - 33 DAYS until CAR-T - TIME TO INTERVENE
+
+ CRS RISK (microbiome-based): MODERATE-HIGH
+ - Low diversity, elevated B. uniformis/Blautia, pro-inflammatory dysbiosis
+
+ NEUROTOXICITY RISK: MODERATE
+ - Low SCFA (neuroprotective effects diminished)
+
+ ================================================================================
+ URGENT RECOMMENDATIONS (33-Day Optimization Window):
+
+ 1. DISCONTINUE LEVOFLOXACIN (consult ID)
+    - Ongoing fluoroquinolone perpetuating dysbiosis
+    - Risk/Benefit: Infection risk vs CAR-T efficacy
+
+ 2. HIGH-DOSE PROBIOTICS (start immediately):
+    - Multi-strain: Lactobacillus + Bifidobacterium
+    - Example: Visbiome 900B CFU BID
+    - Rationale: Restore beneficial bacteria
+
+ 3. BUTYRATE ENHANCEMENT (critical):
+    - Clostridium butyricum 2B CFU daily
+    - High-fiber diet (25-30g/day)
+    - Resistant starch 20g/day
+    - Target: Restore butyrate before CAR-T
+
+ 4. DIETARY INTERVENTION (nutritionist consult):
+    - High-fiber, plant-based foods
+    - Prebiotics: Inulin, pectin, resistant starch
+    - Fermented foods: Yogurt, kefir (if ANC >1.5)
+
+ 5. REPEAT MICROBIOME (March 1, 2026):
+    - 1 week before lymphodepletion
+    - Goal: Shannon >3.0, Butyrate >25 μM
+    - Consider FMT if persistent dysbiosis
+
+ 6. CONSIDER FMT:
+    - If repeat shows persistent severe dysbiosis
+    - Timing: ≥1 week before lymphodepletion
+
+ DATA QUALITY:
+ - Sample integrity: Good, adequate depth
+ - Limitations: Single time-point during active antibiotics
+ - Context: Results reflect antibiotic-disrupted state
+
+ Report: Dr. Jennifer Park, PhD | Clinical: Dr. Rachel Kim, PharmD
+ Date: February 7, 2026
+
+ ================================================================================
+
+ ASSESSMENT & PLAN:
+
+ 60yo male with R/R DLBCL (stage IV, BM+) post R-CHOP/R-ICE, planned for CD19
+ CAR-T (axicabtagene ciloleucel) on March 10, 2026.
+
+ CAR-T CANDIDACY: ✓ APPROVED
+ - CD19+ disease, no contraindications
+ - ECOG 2 acceptable, cardiac/CNS cleared
+
+ MICROBIOME: SUBOPTIMAL BUT MODIFIABLE
+ - Active antibiotic dysbiosis
+ - LOW butyrate → CD8+ T-cell concern
+ - Moderate-high CRS risk based on microbiome
+ - 33-day intervention window
+
+ PLAN:
+
+ 1. LEUKAPHERESIS: February 14, 2026
+
+ 2. MICROBIOME OPTIMIZATION (URGENT):
+    A. ID Consult: Discontinue levofloxacin if safe
+    B. Probiotics: Visbiome 900B CFU BID (start now)
+    C. Butyrate: C. butyricum + resistant starch
+    D. Diet: High-fiber (nutritionist Feb 10)
+    E. Repeat microbiome: March 1 (pre-lymphodepletion)
+
+ 3. LYMPHODEPLETION: March 7-9 (Flu/Cy)
+
+ 4. CAR-T INFUSION: March 10 (Day 0)
+
+ 5. POST-CAR-T MONITORING (high CRS/neurotoxicity risk):
+    - ICU-level monitoring × 48hr
+    - Daily neuro assessments (ICE score)
+    - Tocilizumab/dexamethasone available bedside
+    - CBC/CMP daily × 14 days
+
+ 6. RESPONSE ASSESSMENT:
+    - Day +30: PET-CT, MRD
+    - Microbiome correlation at Days +30, +100
+
+ PROGNOSIS: Guardedly optimistic. CAR-T shows 50-60% durable CR in R/R DLBCL.
+ Microbiome optimization may improve outcomes and reduce toxicity.
+
+ FOLLOW-UP:
+ - Leukapheresis: Feb 14
+ - Nutritionist: Feb 10
+ - ID consult: Feb 9 (re: levofloxacin)
+ - Repeat microbiome: Mar 1
+ - Admission: Mar 7
+
+ ________________________________________________________________________________
+ Dr. Rachel Kim, MD | Cellular Therapy & Hematologic Malignancies
+ Co-signed: Dr. David Martinez, MD, PhD (CAR-T Program Director)
+ February 8, 2026
+ ================================================================================
data/sample_input/demo_ehr_ici_nsclc.txt ADDED
@@ -0,0 +1,178 @@
+ ================================================================================
+ COMPREHENSIVE CANCER CENTER - THORACIC ONCOLOGY
+ ================================================================================
+ Patient: Sarah Chen | MRN: CCC-782934 | DOB: 03/22/1959 (Age: 66)
+ Gender: Female | Visit Date: February 12, 2026
+
+ CHIEF COMPLAINT:
+ First-line immunotherapy consultation for metastatic NSCLC.
+
+ HISTORY:
+ 66-year-old never-smoker with stage IV lung adenocarcinoma diagnosed December 2025.
+ Presented with cough and dyspnea. Imaging showed 4.2cm right upper lobe mass with
+ mediastinal nodes and malignant pleural effusion.
+
+ CANCER DIAGNOSIS:
+ - Primary: Right upper lobe adenocarcinoma (TTF-1+, Napsin-A+)
+ - Stage: IVA (T3N2M1a) - Malignant pleural effusion
+ - Diagnosis Date: December 8, 2025
+ - Molecular: KRAS G12C mutation; EGFR/ALK/ROS1/BRAF negative
+
+ BIOMARKERS (Immunotherapy-Relevant):
+ - PD-L1: 65% TPS (22C3 assay) - HIGH, favorable for pembrolizumab monotherapy
+ - TMB: 18 mutations/megabase - HIGH
+ - MSI: Stable (MSS)
+
+ TREATMENT PLAN:
+ Pembrolizumab 200mg IV Q3W starting February 19, 2026 (first-line monotherapy).
+
+ CURRENT MEDICATIONS:
+ 1. Metoprolol 50mg daily (atrial fibrillation)
+ 2. Apixaban 5mg BID (anticoagulation)
+ 3. Levothyroxine 75mcg daily (hypothyroidism)
+ 4. Vitamin D3 1000 IU daily
+ 5. Calcium carbonate 500mg BID
+
+ ANTIBIOTIC EXPOSURE:
+ - Recent: Levofloxacin 750mg daily × 5 days (Dec 28, 2025 - Jan 1, 2026)
+ - Indication: Community-acquired pneumonia
+ - Days before immunotherapy: 49 days (FULLY RECOVERED per microbiome)
+ - No other antibiotics in past 6 months
+
+ PPI USE: None (no GERD symptoms)
+
+ PAST MEDICAL HISTORY:
+ Atrial fibrillation, hypothyroidism, osteopenia, hypertension. No autoimmune disease.
+
+ SOCIAL HISTORY:
+ Never-smoker, rare alcohol use. Retired librarian. Diet: Predominantly plant-based,
+ high in vegetables and whole grains (relevant to favorable microbiome).
+
+ PERFORMANCE STATUS: ECOG 1 | Karnofsky 80%
+
+ VITALS: BP 118/76 | HR 68 (irregular) | SpO2 94% RA | Temp 98.2°F
+ Weight 128 lbs (58 kg) | Height 5'4" (163 cm) | BMI 22.0
+
+ KEY LABS (February 10, 2026):
+ - WBC 6.8, Hgb 12.4, Plt 278, ANC 4.2
+ - Creatinine 0.8, eGFR >90
+ - AST/ALT 24/28, normal LFTs
+ - TSH 2.8 (on levothyroxine - stable)
+ - LDH 198 (normal), CRP 12.5, CEA 8.4
+
+ IMAGING (February 2026):
+ - CT: 4.2cm RUL mass, mediastinal nodes, moderate pleural effusion
+ - MRI Brain: No metastases
+ - PET: Primary SUVmax 14.2, nodes SUVmax 8.7
+
+ ================================================================================
+ MICROBIOME ANALYSIS - HIGHLY FAVORABLE PROFILE
+ ================================================================================
+ Sample Date: February 7, 2026
+ Method: Shotgun metagenomic sequencing (Illumina NovaSeq, 15M reads)
+ Lab: Precision Microbiome Diagnostics
+
+ DIVERSITY (HIGH - FAVORABLE):
+ - Shannon Index: 3.6
+ - Simpson Index: 0.91
+ - Observed Species: 247
+ - Interpretation: High diversity consistently associated with better ICI response
+
+ COMPOSITION:
+ Firmicutes 52.8% | Bacteroidetes 34.6% | Actinobacteria 8.2% | Proteobacteria 2.8%
+ F/B Ratio: 1.53 (balanced)
+
+ KEY TAXA (% relative abundance):
+
+ FAVORABLE BACTERIA (HIGH - excellent for anti-PD-1):
+ - Akkermansia muciniphila: 4.8% [VERY HIGH - strongest NSCLC predictor]
+ - Faecalibacterium prausnitzii: 8.9% [HIGH - SCFA producer]
+ - Ruminococcaceae family: 14.2% [HIGH - favorable]
+ - Bifidobacterium longum: 3.2% + B. adolescentis: 1.8% [FAVORABLE]
+ - Lachnospiraceae family: 12.8% [HIGH]
+ - Roseburia intestinalis: 3.8% [butyrate producer]
+ - Alistipes putredinis: 2.6% [favorable in NSCLC]
+ - Prevotella copri: 4.2%
+
+ LESS FAVORABLE (LOW - good):
+ - Bacteroides spp.: 11.3% total (moderate)
+ - E. coli: 1.2%, Enterococcus: 0.4%, Fusobacterium: <0.1% (all low - favorable)
+
+ METABOLITES (Measured):
+
+ SCFAs (GC-MS):
+ - Butyrate: 32.8 μM [HIGH - excellent for CD8+ T cells]
+ - Propionate: 18.6 μM
+ - Acetate: 58.4 μM
+ - Total SCFA: 118.0 μM [HIGH]
+
+ Bile Acids (LC-MS/MS):
+ - Secondary bile acids: 42.8 μM (high conversion - favorable)
+ - Secondary/Primary ratio: 2.33
+
+ Tryptophan Metabolites (LC-MS/MS):
+ - Indole: 5.2 μM [AhR ligand]
+ - Indole-3-aldehyde: 2.8 μM
+ - Kynurenine/Tryptophan: 0.34 (moderate IDO activity)
+
+ Other:
+ - Inosine: 2.4 μM [T-cell activation]
+
+ FUNCTIONAL PATHWAYS:
+ - SCFA biosynthesis: HIGH
+ - Butyrate production: HIGH
+ - Vitamin B synthesis: HIGH
+ - Bile salt hydrolase: Moderate-High
+
+ CLINICAL INTERPRETATION:
+
+ OVERALL: HIGHLY FAVORABLE PROFILE FOR PEMBROLIZUMAB
+
+ Strengths:
+ 1. Very high diversity (Shannon 3.6) - predicts ICI response
+ 2. VERY HIGH Akkermansia (4.8%) - strongest predictor in NSCLC (Routy/Derosa)
+ 3. High Faecalibacterium (8.9%) - responder taxon
+ 4. High Ruminococcaceae (14.2%) - favorable in multiple ICI studies
+ 5. Robust SCFA production - supports CD8+ T-cell function
+ 6. Plant-based diet correlation with favorable microbiome
+ 7. No PPI use - microbiome not disrupted
+ 8. Full antibiotic recovery (49 days post-levofloxacin)
+
+ Literature Alignment:
+ - Routy 2018: Akkermansia >1% favorable → Patient: 4.8%
+ - Gopalakrishnan 2018: High diversity favorable → Patient: High
+ - Derosa 2022: Akkermansia predicts NSCLC response → Patient: Very high
+
+ Predicted Response: FAVORABLE microbiome signature for pembrolizumab in NSCLC.
+
+ RECOMMENDATIONS:
+ 1. Continue high-fiber, plant-based diet
+ 2. Avoid antibiotics during treatment if possible
+ 3. Maintain PPI-free status
+ 4. Repeat microbiome at Week 12 with response assessment
+
+ ================================================================================
+ ASSESSMENT & PLAN:
+
+ 66yo never-smoker with stage IVA lung adenocarcinoma (high PD-L1 65%, high TMB 18)
+ initiating first-line pembrolizumab monotherapy.
+
+ FAVORABLE FACTORS:
+ - High PD-L1 (65%), High TMB (18)
+ - HIGHLY FAVORABLE microbiome (high diversity, very high Akkermansia, high butyrate)
+ - Good performance status (ECOG 1)
+ - No PPI use, full antibiotic recovery
+
+ PLAN:
+ 1. Pembrolizumab 200mg IV Q3W starting Feb 19, 2026
+ 2. Monitor for irAEs (pneumonitis risk given lung disease)
+ 3. Continue plant-based high-fiber diet
+ 4. Repeat CT at Week 12 (May 2026)
+ 5. Repeat microbiome at Week 12 to correlate with response
+
+ PROGNOSIS: Guardedly optimistic given favorable biomarkers and microbiome profile.
+
+ ________________________________________________________________________________
+ Dr. Michael Torres, MD, PhD | Thoracic Oncology
+ February 12, 2026
+ ================================================================================
data/templates/patient_schema_template.json ADDED
@@ -0,0 +1,93 @@
+ {
+   "extraction_version": "1.0",
+   "extraction_date": "YYYY-MM-DD",
+
+   "patient": {
+     "id": "",
+     "age": 0,
+     "gender": ""
+   },
+
+   "cancer": {
+     "type": "",
+     "subtype": "",
+     "stage": "",
+     "primary_site": "",
+     "metastases": [],
+     "biomarkers": {
+       "pdl1_expression": "",
+       "tmb": "",
+       "msi_status": ""
+     },
+     "diagnosis_date": "YYYY-MM-DD"
+   },
+
+   "immunotherapy": {
+     "therapy_type": "",
+     "drug_name": "",
+     "drug_class": "",
+     "treatment_setting": "",
+     "line_of_therapy": "",
+     "planned_start_date": "YYYY-MM-DD",
+     "ici_details": null,
+     "act_details": null
+   },
+
+   "prior_treatments": {
+     "chemotherapy": {
+       "received": false,
+       "regimens": [],
+       "response": ""
+     },
+     "prior_immunotherapy": {
+       "received": false,
+       "drugs": [],
+       "response": ""
+     }
+   },
+
+   "medications": {
+     "ppi_use": {
+       "currently_on_ppi": false,
+       "ppi_name": "",
+       "duration_months": 0
+     },
+     "antibiotic_history": {
+       "recent_antibiotics": false,
+       "exposures": []
+     }
+   },
+
+   "comorbidities": [],
+
+   "microbiome": {
+     "sample_date": "YYYY-MM-DD",
+     "sequencing_method": "",
+     "diversity": {
+       "shannon_index": 0.0,
+       "simpson_index": 0.0,
+       "observed_species": 0
+     },
+     "key_bacteria": {},
+     "metabolites": {
+       "scfa": {
+         "butyrate_uM": null,
+         "propionate_uM": null,
+         "acetate_uM": null
+       },
+       "bile_acids_available": false,
+       "tryptophan_metabolites_available": false
+     },
+     "data_quality": {
+       "completeness": "",
+       "source": "",
+       "limitations": []
+     }
+   },
+
+   "clinical_context": {
+     "urgency": "",
+     "patient_goals": [],
+     "specific_concerns": []
+   }
+ }
generate_report.py ADDED
@@ -0,0 +1,121 @@
+ #!/usr/bin/env python3
+ """
+ Main CLI entry point
+
+ Accepts either:
+     - A pre-built patient JSON file (.json; see the template in data/templates/)
+     - A raw EHR text file
+
+ Examples:
+     python generate_report.py patient_example.json
+     python generate_report.py patient_ehr.txt
+     python generate_report.py patient_ehr.txt --save-ehr-json extracted_patient.json
+ """
+
+ import argparse
+ import logging
+ import sys
+ from pathlib import Path
+
+ from src.report_assembler import ReportAssembler
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+     handlers=[
+         logging.StreamHandler(sys.stdout)
+     ]
+ )
+
+ logger = logging.getLogger(__name__)
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description=(
+             "Generate microbiome-immunotherapy clinical decision support report. "
+             "Accepts either a pre-built patient JSON file or a raw EHR text file."
+         )
+     )
+
+     parser.add_argument(
+         "patient_input",
+         type=str,
+         help=(
+             "Path to patient data file. "
+             "Use a .json file for pre-extracted patient data, "
+             "or a .txt/.ehr file to extract from a raw EHR report first."
+         )
+     )
+
+     parser.add_argument(
+         "-o", "--output-dir",
+         type=str,
+         default=None,
+         help="Output directory for generated report (default: ./outputs)"
+     )
+
+     parser.add_argument(
+         "--save-ehr-json",
+         type=str,
+         default=None,
+         metavar="PATH",
+         help=(
+             "[EHR mode only] Save the MedGemma-extracted patient JSON to this path. "
+             "Useful for inspecting or reusing the extraction without re-running the model."
+         )
+     )
+
+     args = parser.parse_args()
+
+     # Validate input file
+     input_path = Path(args.patient_input)
+     if not input_path.exists():
+         logger.error(f"Input file not found: {input_path}")
+         sys.exit(1)
+
+     # Detect input mode by extension
+     is_json = input_path.suffix.lower() == ".json"
+
+     if not is_json and args.save_ehr_json is None:
+         logger.info(
+             "Tip: use --save-ehr-json <path> to save the extracted patient JSON "
+             "and skip re-extraction on future runs."
+         )
+
+     logger.info("=" * 80)
+     logger.info("Microbiome-ICI Clinical Report Generator v1.0")
+     logger.info("=" * 80)
+
+     if is_json:
+         logger.info(f"Input mode: pre-built patient JSON → {input_path}")
+     else:
+         logger.info(f"Input mode: raw EHR text (MedGemma extraction) → {input_path}")
+
+     try:
+         assembler = ReportAssembler()
+
+         if is_json:
+             output_path = assembler.generate_and_save(
+                 patient_json_path=str(input_path),
+                 output_dir=args.output_dir,
+             )
+         else:
+             output_path = assembler.generate_and_save_from_ehr(
+                 ehr_path=str(input_path),
+                 output_dir=args.output_dir,
+                 save_json_path=args.save_ehr_json,
+             )
+
+         logger.info("=" * 80)
+         logger.info(f"✓ Report generated successfully: {output_path}")
+         logger.info("=" * 80)
+
+     except Exception as e:
+         logger.error(f"Report generation failed: {e}", exc_info=True)
+         sys.exit(1)
+
+
+ if __name__ == "__main__":
+     main()
rag/README.md ADDED
@@ -0,0 +1,65 @@
+ # Medical RAG Pipeline for Research Papers
+
+ This directory contains the Retrieval-Augmented Generation (RAG) pipeline for extracting and processing medical research papers to support clinical decision-making in immunotherapy.
+
+ ## Pipeline Overview
+
+ The pipeline transforms raw PDF research papers into a searchable vector database (ChromaDB), optimized for medical context and evidence retrieval.
+
+ - **PDF Extraction**: Uses `docling` for accurate markdown extraction from complex medical PDFs.
+ - **Cleaning**: `rag_md_cleaner.py` removes unnecessary metadata, reference sections, and figures while preserving essential tables and text.
+ - **Chunking**: `SectionAwareChunker` implements section-aware splitting with specific handling for tables to ensure context is preserved.
+ - **Embedding**: Uses `pritamdeka/S-PubMedBert-MS-MARCO`, a domain-specific transformer model optimized for medical literature.
+ - **Storage**: Persists vectors and metadata in `ChromaDB`.
+
+ ## Components
+
+ - `pdf_to_chromadb_pipeline.py`: The main entry point for the ingestion pipeline.
+ - `rag_md_cleaner.py`: Utility for cleaning extracted markdown.
+ - `research_papers.json`: Metadata registry (filename stem mapping to paper titles/citations).
+
+ ## Data Requirements
+
+ To ensure accurate citations, provide a `research_papers.json` file in the same directory as the script. Each top-level key is the filename stem of the corresponding PDF (e.g. `"1"` for `1.pdf`). The format should be:
+
+ ```json
+ {
+     "1": {
+         "reference_id": "25",
+         "citation": "Takada et al., Int J Cancer 2021",
+         "title": "Clinical impact of probiotics on the efficacy of anti-PD-1 monotherapy in patients with NSCLC",
+         "year": 2021,
+         "tags": {
+             "treatment": ["PD-1/PD-L1 Blockade"],
+             "cancer": ["NSCLC"],
+             "biology": ["Gut microbiome composition", "Alpha diversity"],
+             "intervention": ["Probiotics"]
+         }
+     }
+ }
+ ```
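+
+ A quick sanity check that the registry parses and covers every PDF might look like this (illustrative snippet, assuming it is run from this directory):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ # Load the citation registry and flag PDFs without metadata
+ registry = json.loads(Path("research_papers.json").read_text())
+ for pdf in Path("./pdfs").glob("*.pdf"):
+     if pdf.stem not in registry:
+         print(f"Missing citation metadata for {pdf.name}")
+ ```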
+
+ ## Usage
+
+ To run the pipeline and index a folder of PDFs:
+
+ ```bash
+ python pdf_to_chromadb_pipeline.py --input-folder ./pdfs --db-path ./chroma_db
+ ```
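+
+ Once indexed, the database can be queried with the same PubMedBERT model used at ingestion. A minimal sketch (the collection name below is an assumption; check the pipeline output for the actual name):
+
+ ```python
+ import chromadb
+ from sentence_transformers import SentenceTransformer
+
+ # Embed the query with the same model used to index the papers
+ model = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")
+ client = chromadb.PersistentClient(path="./chroma_db")
+ collection = client.get_collection("research_papers")  # assumed name
+
+ query_vec = model.encode("Akkermansia muciniphila and anti-PD-1 response").tolist()
+ results = collection.query(query_embeddings=[query_vec], n_results=5)
+ for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
+     print(meta.get("citation", "?"), "->", doc[:120])
+ ```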
+
+ ## Dependencies
+
+ Requires `docling`, `transformers`, `sentence-transformers`, and `chromadb`. See `requirements_pipeline.txt` for details.
rag/pdf_to_chromadb_pipeline.py ADDED
@@ -0,0 +1,1044 @@
+ """
+ pdf_to_chromadb_pipeline.py
+
+ Complete pipeline: PDF -> Docling extraction -> Cleaning -> Chunking -> ChromaDB
+ Optimized for research papers with PubMedBERT embeddings.
+
+ Pipeline steps:
+     1. Extract markdown from PDF using docling
+     2. Clean markdown (remove refs, metadata, figures; keep tables)
+     3. Chunk with section-awareness and table splitting
+     4. Embed using PubMedBERT tokenizer
+     5. Store in ChromaDB with metadata
+
+ Usage:
+     python pdf_to_chromadb_pipeline.py --input-folder ./pdfs --db-path ./chroma_db
+ """
+
+ import os
+ import json
+ import argparse
+ from pathlib import Path
+ from typing import List, Dict, Optional
+ from datetime import datetime
+ import re
+ import unicodedata
+ from collections import defaultdict
+
+ # Import cleaning function
+ from rag_md_cleaner import clean_markdown_for_rag
+
+ # Docling imports
+ try:
+     from docling.document_converter import DocumentConverter
+     DOCLING_AVAILABLE = True
+ except ImportError:
+     DOCLING_AVAILABLE = False
+     print("Warning: docling not installed. Install with: pip install docling")
+
+ # Transformers for tokenizer
+ try:
+     from transformers import AutoTokenizer
+     TRANSFORMERS_AVAILABLE = True
+ except ImportError:
+     TRANSFORMERS_AVAILABLE = False
+     print("Warning: transformers not installed. Install with: pip install transformers torch")
+
+ # ChromaDB
+ try:
+     import chromadb
+     from chromadb.config import Settings
+     CHROMADB_AVAILABLE = True
+ except ImportError:
+     CHROMADB_AVAILABLE = False
+     print("Warning: chromadb not installed. Install with: pip install chromadb")
+
+ # Sentence transformers for embeddings
+ try:
+     from sentence_transformers import SentenceTransformer
+     SENTENCE_TRANSFORMERS_AVAILABLE = True
+ except ImportError:
+     SENTENCE_TRANSFORMERS_AVAILABLE = False
+     print("Warning: sentence-transformers not installed. Install with: pip install sentence-transformers")
+
+
+ # ========================================
+ # TABLE-AWARE MARKDOWN CHUNKER (embedded)
+ # ========================================
+
+ class TableAwareMarkdownSplitter:
+     """Split markdown by headers while keeping tables intact."""
+
+     def __init__(self, headers_to_split_on: List[tuple]):
+         self.headers_to_split_on = sorted(
+             headers_to_split_on,
+             key=lambda x: len(x[0]),
+             reverse=True
+         )
+
+     def split_text(self, text: str) -> List[Dict]:
+         """Split text by headers while preserving table structure."""
+         lines = text.split('\n')
+         documents = []
+         current_content = []
+         current_metadata = {}
+         in_table = False
+         table_buffer = []
+
+         for i, line in enumerate(lines):
+             is_table_row = self._is_table_row(line)
+             header_info = self._parse_header(line)
+
+             if header_info and not in_table:
+                 if current_content:
+                     documents.append({
+                         'content': '\n'.join(current_content),
+                         'metadata': current_metadata.copy(),
+                         'has_table': False
+                     })
+                     current_content = []
+
+                 level, header_text = header_info
+                 current_metadata[level] = header_text
+                 self._clear_lower_headers(current_metadata, level)
+                 current_content.append(line)
+
+             elif is_table_row:
+                 if not in_table:
+                     if current_content:
+                         caption = self._get_table_caption(current_content)
+                         if caption:
+                             current_content = current_content[:-1]
+                             if current_content:
+                                 documents.append({
+                                     'content': '\n'.join(current_content),
+                                     'metadata': current_metadata.copy(),
+                                     'has_table': False
+                                 })
+                             current_content = []
+                             table_buffer = [caption]
+                         else:
+                             documents.append({
+                                 'content': '\n'.join(current_content),
+                                 'metadata': current_metadata.copy(),
+                                 'has_table': False
+                             })
+                             current_content = []
+                             table_buffer = []
+                     in_table = True
+
+                 table_buffer.append(line)
+
+             elif in_table and not is_table_row:
+                 in_table = False
+                 if table_buffer:
+                     documents.append({
+                         'content': '\n'.join(table_buffer),
+                         'metadata': current_metadata.copy(),
+                         'has_table': True
+                     })
+                     table_buffer = []
+                 current_content.append(line)
+
+             else:
+                 current_content.append(line)
+
+         if table_buffer:
+             documents.append({
+                 'content': '\n'.join(table_buffer),
+                 'metadata': current_metadata.copy(),
+                 'has_table': True
+             })
+
+         if current_content:
+             documents.append({
+                 'content': '\n'.join(current_content),
+                 'metadata': current_metadata.copy(),
+                 'has_table': False
+             })
+
+         return documents
+
+     def _is_table_row(self, line: str) -> bool:
+         stripped = line.strip()
+         if not stripped:
+             return False
+         return stripped.startswith('|') or ('|' in stripped and stripped.count('|') >= 2)
+
+     def _get_table_caption(self, content_lines: List[str]) -> Optional[str]:
+         if not content_lines:
+             return None
+         last_line = content_lines[-1].strip()
+         if re.match(r'^Table \d+[\.:].+', last_line, re.IGNORECASE):
+             return last_line
+         return None
+
+     def _parse_header(self, line: str) -> Optional[tuple]:
+         line = line.strip()
+         for header_marker, level_name in self.headers_to_split_on:
+             if line.startswith(header_marker + ' '):
+                 header_text = line[len(header_marker):].strip()
+                 return level_name, header_text
+         return None
+
+     def _clear_lower_headers(self, metadata: Dict, current_level: str):
+         # headers_to_split_on is sorted deepest-first, so deeper levels
+         # precede current_level in levels_order and are the ones to clear.
+         levels_order = [h[1] for h in self.headers_to_split_on]
+         try:
+             current_idx = levels_order.index(current_level)
+             for level in levels_order[:current_idx]:
+                 metadata.pop(level, None)
+         except ValueError:
+             pass
+
+
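+ # Example of the splitter's contract (illustrative only; not called by the pipeline):
+ #
+ #     splitter = TableAwareMarkdownSplitter([("#", "h1"), ("##", "h2")])
+ #     docs = splitter.split_text("# Results\n| a | b |\n| - | - |\n| 1 | 2 |")
+ #     # -> a text chunk ({'has_table': False}) followed by a table chunk
+ #     #    ({'has_table': True}), both carrying metadata {'h1': 'Results'}
+
+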
194
+ class SectionAwareChunker:
195
+ """Chunk markdown with section awareness, table handling, and token limits."""
196
+
197
+ def __init__(
198
+ self,
199
+ model_name: str = "pritamdeka/S-PubMedBert-MS-MARCO",
200
+ max_tokens: int = 330,
201
+ chunk_overlap_tokens: int = 30,
202
+ split_tables: bool = True
203
+ ):
204
+ """
205
+ Initialize the chunker.
206
+
207
+ Args:
208
+ model_name: HuggingFace model name for tokenizer
209
+ max_tokens: Maximum tokens per chunk
210
+ chunk_overlap_tokens: Overlap between chunks in tokens
211
+ split_tables: If True, split large tables by rows
212
+ """
213
+ if not TRANSFORMERS_AVAILABLE:
214
+ raise ImportError("transformers library required. Install: pip install transformers torch")
215
+
216
+ print(f"Loading tokenizer from {model_name}...")
217
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
218
+ print("✓ Tokenizer loaded successfully")
219
+
220
+ self.max_tokens = max_tokens
221
+ self.chunk_overlap_tokens = chunk_overlap_tokens
222
+ self.split_tables = split_tables
223
+
224
+ self.headers_to_split_on = [
225
+ ("#", "h1"),
226
+ ("##", "h2"),
227
+ ("###", "h3"),
228
+ ("####", "h4"),
229
+ ]
230
+
231
+ self.header_splitter = TableAwareMarkdownSplitter(self.headers_to_split_on)
232
+
233
+ def count_tokens(self, text: str) -> int:
234
+ """Count tokens using the model's tokenizer."""
235
+ return len(self.tokenizer.encode(text, add_special_tokens=True))
236
+
237
+ def split_table_by_rows(self, table_text: str, max_tokens: int) -> List[str]:
238
+ """Split a table into smaller chunks by rows."""
239
+ lines = table_text.split('\n')
240
+
241
+ caption = None
242
+ header_row = None
243
+ separator_row = None
244
+ data_rows = []
245
+
246
+ for i, line in enumerate(lines):
247
+ line = line.strip()
248
+ if not line:
249
+ continue
250
+
251
+ if re.match(r'^Table \d+[\.:].+', line, re.IGNORECASE):
252
+ caption = line
253
+ elif '|' in line:
254
+ if header_row is None:
255
+ header_row = line
256
+ elif separator_row is None and re.match(r'^\|[\s\-:|]+\|', line):
257
+ separator_row = line
258
+ else:
259
+ data_rows.append(line)
260
+
261
+ if not data_rows:
262
+ return [table_text]
263
+
264
+ # Build header template
265
+ header_template = []
266
+ if caption:
267
+ header_template.append(caption)
268
+ if header_row:
269
+ header_template.append(header_row)
270
+ if separator_row:
271
+ header_template.append(separator_row)
272
+
273
+ header_tokens = self.count_tokens('\n'.join(header_template)) if header_template else 0
274
+
275
+ # Check if a single row exceeds the limit
276
+ max_row_tokens = max(self.count_tokens(row) for row in data_rows)
277
+
278
+ if header_tokens + max_row_tokens > max_tokens:
279
+ # Even a single row is too large - need to split columns
280
+ print(f"Table row exceeds limit ({max_row_tokens} tokens), splitting columns...")
281
+ return self._split_table_by_columns(table_text, max_tokens)
282
+
283
+ # Split by rows normally
284
+ chunks = []
285
+ current_chunk_rows = []
286
+
287
+ for row in data_rows:
288
+ row_tokens = self.count_tokens(row)
289
+ current_tokens = self.count_tokens('\n'.join(current_chunk_rows)) if current_chunk_rows else 0
290
+
291
+ if header_tokens + current_tokens + row_tokens <= max_tokens:
292
+ current_chunk_rows.append(row)
293
+ else:
294
+ if current_chunk_rows:
295
+ chunk = '\n'.join(header_template + current_chunk_rows)
296
+ chunks.append(chunk)
297
+ # Start new chunk with this row
298
+ current_chunk_rows = [row]
299
+
300
+ if current_chunk_rows:
301
+ chunk = '\n'.join(header_template + current_chunk_rows)
302
+ chunks.append(chunk)
303
+
304
+ return chunks if chunks else [table_text]
305
+
306
+ def _split_table_by_columns(self, table_text: str, max_tokens: int) -> List[str]:
307
+ """
308
+ Split a wide table by columns when rows are too long.
309
+ Creates multiple narrower tables, each with first column + subset of other columns.
310
+ """
311
+ lines = table_text.split('\n')
312
+
313
+ caption = None
314
+ header_row = None
315
+ separator_row = None
316
+ data_rows = []
317
+
318
+ for line in lines:
319
+ line = line.strip()
320
+ if not line:
321
+ continue
322
+
323
+ if re.match(r'^Table \d+[\.:].+', line, re.IGNORECASE):
324
+ caption = line
325
+ elif '|' in line:
326
+ if header_row is None:
327
+ header_row = line
328
+ elif separator_row is None and re.match(r'^\|[\s\-:|]+\|', line):
329
+ separator_row = line
330
+ else:
331
+ data_rows.append(line)
332
+
333
+ if not header_row or not data_rows:
334
+ # Can't split intelligently, just return as text chunks
335
+ return self.split_by_tokens(table_text, max_tokens, 0)
336
+
337
+ # Parse header columns
338
+ header_cells = [c.strip() for c in header_row.split('|')[1:-1]]
339
+ n_cols = len(header_cells)
340
+
341
+ if n_cols <= 2:
342
+ # Too few columns to split, fall back to text splitting
343
+ return self.split_by_tokens(table_text, max_tokens, 0)
344
+
345
+ # Parse all rows into cells
346
+ parsed_rows = []
347
+ for row in data_rows:
348
+ cells = [c.strip() for c in row.split('|')[1:-1]]
349
+ # Pad or trim to match header length
350
+ while len(cells) < n_cols:
351
+ cells.append('')
352
+ cells = cells[:n_cols]
353
+ parsed_rows.append(cells)
354
+
355
+ # Strategy: Keep first column (usually ID/key), split remaining columns into groups
356
+ chunks = []
357
+
358
+ # Try to fit columns into chunks
359
+ first_col_idx = 0
360
+ col_groups = []
361
+ current_group = [first_col_idx] # Always include first column
362
+
363
+ for col_idx in range(1, n_cols):
364
+ # Test if adding this column fits
365
+ test_group = current_group + [col_idx]
366
+ test_chunk = self._build_table_chunk(
367
+ caption, header_cells, parsed_rows, test_group
368
+ )
369
+ test_tokens = self.count_tokens(test_chunk)
370
+
371
+ if test_tokens <= max_tokens:
372
+ current_group.append(col_idx)
373
+ else:
374
+ # Current group is full, save it
375
+ if len(current_group) > 1: # Has more than just first column
376
+ col_groups.append(current_group)
377
+ current_group = [first_col_idx, col_idx]
378
+
379
+ # Add remaining group
380
+ if len(current_group) > 1:
381
+ col_groups.append(current_group)
382
+
383
+ # Build chunks from column groups
384
+ for group_idx, col_indices in enumerate(col_groups):
385
+ chunk = self._build_table_chunk(caption, header_cells, parsed_rows, col_indices)
386
+
387
+ # Add note about which columns
388
+ if len(col_groups) > 1:
389
+ col_names = [header_cells[i] for i in col_indices[1:]] # Skip first col (ID)
390
+ note = f"\n[Table part {group_idx + 1}/{len(col_groups)}: columns {', '.join(col_names)}]"
391
+ chunk = chunk + note
392
+
393
+ chunks.append(chunk)
394
+
395
+ return chunks if chunks else [table_text]
396
+
397
+ def _build_table_chunk(
398
+ self,
399
+ caption: Optional[str],
400
+ header_cells: List[str],
401
+ data_rows: List[List[str]],
402
+ col_indices: List[int]
403
+ ) -> str:
404
+ """Build a table chunk with selected columns."""
405
+ lines = []
406
+
407
+ if caption:
408
+ lines.append(caption)
409
+
410
+ # Header row with selected columns
411
+ selected_headers = [header_cells[i] for i in col_indices]
412
+ header_line = '| ' + ' | '.join(selected_headers) + ' |'
413
+ lines.append(header_line)
414
+
415
+ # Separator row
416
+ separator = '| ' + ' | '.join(['---'] * len(col_indices)) + ' |'
417
+ lines.append(separator)
418
+
419
+ # Data rows with selected columns
420
+ for row in data_rows:
421
+ selected_cells = [row[i] if i < len(row) else '' for i in col_indices]
422
+ row_line = '| ' + ' | '.join(selected_cells) + ' |'
423
+ lines.append(row_line)
424
+
425
+ return '\n'.join(lines)
426
+
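+ # Illustrative sketch (assumed numbers, not real pipeline output): a wide
+ # 5-column results table whose single rows exceed max_tokens is rebuilt as
+ # two narrower tables, e.g. columns [Study, OS, PFS] then [Study, ORR, N],
+ # each repeating the first key column and tagged "[Table part i/2: ...]".
+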
427
+ def split_by_tokens(self, text: str, max_tokens: int, overlap_tokens: int = 0) -> List[str]:
428
+ """Split text into chunks by token count."""
429
+ sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])|\n\n+', text)
430
+
431
+ chunks = []
432
+ current_chunk = []
433
+ current_tokens = 0
434
+
435
+ for sentence in sentences:
436
+ sentence = sentence.strip()
437
+ if not sentence:
438
+ continue
439
+
440
+ sentence_tokens = self.count_tokens(sentence)
441
+
442
+ if sentence_tokens > max_tokens:
443
+ if current_chunk:
444
+ chunks.append(' '.join(current_chunk))
445
+ current_chunk = []
446
+ current_tokens = 0
447
+
448
+ word_chunks = self._split_by_words(sentence, max_tokens)
449
+ if len(word_chunks) > 1:
450
+ chunks.extend(word_chunks[:-1])
451
+ current_chunk = [word_chunks[-1]]
452
+ current_tokens = self.count_tokens(word_chunks[-1])
453
+ else:
454
+ chunks.extend(word_chunks)
455
+ continue
456
+
457
+ potential_tokens = current_tokens + sentence_tokens
458
+
459
+ if potential_tokens <= max_tokens:
460
+ current_chunk.append(sentence)
461
+ current_tokens = potential_tokens
462
+ else:
463
+ if current_chunk:
464
+ chunks.append(' '.join(current_chunk))
465
+
466
+ if overlap_tokens > 0 and current_chunk:
467
+ overlap_chunk = []
468
+ overlap_count = 0
469
+ for sent in reversed(current_chunk):
470
+ sent_tokens = self.count_tokens(sent)
471
+ if overlap_count + sent_tokens <= overlap_tokens:
472
+ overlap_chunk.insert(0, sent)
473
+ overlap_count += sent_tokens
474
+ else:
475
+ break
476
+ current_chunk = overlap_chunk
477
+ current_tokens = overlap_count
478
+ else:
479
+ current_chunk = []
480
+ current_tokens = 0
481
+
482
+ current_chunk.append(sentence)
483
+ current_tokens = current_tokens + sentence_tokens
484
+
485
+ if current_chunk:
486
+ chunks.append(' '.join(current_chunk))
487
+
488
+ return chunks
489
+
490
+ def _split_by_words(self, text: str, max_tokens: int) -> List[str]:
491
+ """Split text by words when sentences are too long."""
492
+ words = text.split()
493
+ chunks = []
494
+ current_chunk = []
495
+ current_tokens = 0
496
+
497
+ for word in words:
498
+ word_tokens = self.count_tokens(word + ' ')
499
+
500
+ if current_tokens + word_tokens <= max_tokens:
501
+ current_chunk.append(word)
502
+ current_tokens += word_tokens
503
+ else:
504
+ if current_chunk:
505
+ chunks.append(' '.join(current_chunk))
506
+ current_chunk = [word]
507
+ current_tokens = word_tokens
508
+
509
+ if current_chunk:
510
+ chunks.append(' '.join(current_chunk))
511
+
512
+ return chunks
513
+
514
+ def chunk_markdown(self, markdown_text: str, source_file: str = "unknown") -> List[Dict]:
515
+ """Chunk markdown with section awareness and table handling."""
516
+ header_splits = self.header_splitter.split_text(markdown_text)
517
+
518
+ final_chunks = []
519
+
520
+ for doc in header_splits:
521
+ section_metadata = self._extract_section_info(doc['metadata'])
522
+ is_table = doc.get('has_table', False)
523
+ token_count = self.count_tokens(doc['content'])
524
+
525
+ if token_count <= self.max_tokens:
526
+ final_chunks.append({
527
+ "text": doc['content'],
528
+ "metadata": {
529
+ **section_metadata,
530
+ "token_count": token_count,
531
+ "is_table": is_table,
532
+ "chunk_index": 0,
533
+ "total_chunks_in_section": 1,
534
+ "source_file": source_file
535
+ }
536
+ })
537
+ elif is_table and self.split_tables:
538
+ table_chunks = self.split_table_by_rows(doc['content'], self.max_tokens)
539
+ for idx, chunk_text in enumerate(table_chunks):
540
+ final_chunks.append({
541
+ "text": chunk_text,
542
+ "metadata": {
543
+ **section_metadata,
544
+ "token_count": self.count_tokens(chunk_text),
545
+ "is_table": True,
546
+ "chunk_index": idx,
547
+ "total_chunks_in_section": len(table_chunks),
548
+ "source_file": source_file
549
+ }
550
+ })
551
+ elif is_table:
552
+ # Keep table intact even if exceeds limit
553
+ final_chunks.append({
554
+ "text": doc['content'],
555
+ "metadata": {
556
+ **section_metadata,
557
+ "token_count": token_count,
558
+ "is_table": True,
559
+ "chunk_index": 0,
560
+ "total_chunks_in_section": 1,
561
+ "source_file": source_file,
562
+ "exceeds_limit": True
563
+ }
564
+ })
565
+ else:
566
+ sub_chunks = self.split_by_tokens(
567
+ doc['content'],
568
+ self.max_tokens,
569
+ self.chunk_overlap_tokens
570
+ )
571
+
572
+ for idx, chunk_text in enumerate(sub_chunks):
573
+ final_chunks.append({
574
+ "text": chunk_text,
575
+ "metadata": {
576
+ **section_metadata,
577
+ "token_count": self.count_tokens(chunk_text),
578
+ "is_table": False,
579
+ "chunk_index": idx,
580
+ "total_chunks_in_section": len(sub_chunks),
581
+ "source_file": source_file
582
+ }
583
+ })
584
+
585
+ return final_chunks
586
+
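+ # A produced chunk looks like this (illustrative values, not real output):
+ # {"text": "...", "metadata": {"h2": "Results", "section_type": "results",
+ #   "token_count": 212, "is_table": False, "chunk_index": 0,
+ #   "total_chunks_in_section": 3, "source_file": "1"}}
+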
587
+ def _extract_section_info(self, metadata: Dict) -> Dict:
588
+ """Extract section information from metadata."""
589
+ section_info = {}
590
+
591
+ for level in ['h1', 'h2', 'h3', 'h4']:
592
+ if level in metadata:
593
+ section_info[level] = metadata[level]
594
+
595
+ section_type = self._identify_section_type(metadata)
596
+ if section_type:
597
+ section_info['section_type'] = section_type
598
+
599
+ return section_info
600
+
601
+ def _identify_section_type(self, metadata: Dict) -> str:
602
+ """Identify section type based on header text."""
603
+ all_headers = ' '.join([
604
+ metadata.get('h1', ''),
605
+ metadata.get('h2', ''),
606
+ metadata.get('h3', ''),
607
+ metadata.get('h4', '')
608
+ ]).lower()
609
+
610
+ section_patterns = {
611
+ 'abstract': r'\babstract\b',
612
+ 'introduction': r'\bintroduction\b',
613
+ 'background': r'\bbackground\b',
614
+ 'literature_review': r'\bliterature review\b|\brelated work\b',
615
+ 'methodology': r'\bmethodology\b|\bmethods\b|\bmaterials and methods\b',
616
+ 'results': r'\bresults\b',
617
+ 'discussion': r'\bdiscussion\b',
618
+ 'conclusion': r'\bconclusion\b|\bconcluding remarks\b',
619
+ 'references': r'\breferences\b|\bbibliography\b',
620
+ 'appendix': r'\bappendix\b',
621
+ 'acknowledgments': r'\backnowledgments\b|\backnowledgements\b',
622
+ 'abbreviations': r'\babbreviations\b',
623
+ 'data_availability': r'\bdata availability\b'
624
+ }
625
+
626
+ for section_type, pattern in section_patterns.items():
627
+ if re.search(pattern, all_headers):
628
+ return section_type
629
+
630
+ return 'other'
631
+
632
+
633
+ # ========================================
634
+ # PIPELINE CLASS
635
+ # ========================================
636
+
637
+ class PDFToChromaDBPipeline:
638
+ """Complete pipeline from PDF to ChromaDB."""
639
+
640
+ def __init__(
641
+ self,
642
+ db_path: str = "./chroma_db",
643
+ collection_name: str = "research_papers",
644
+ embedding_model: str = "pritamdeka/S-PubMedBert-MS-MARCO",
645
+ max_tokens: int = 330,
646
+ chunk_overlap: int = 30,
647
+ papers_json: Optional[str] = None,
648
+ ):
649
+ """
650
+ Initialize the pipeline.
651
+
652
+ Args:
653
+ db_path: Path to ChromaDB database
654
+ collection_name: Name of ChromaDB collection
655
+ embedding_model: HuggingFace model for embeddings
656
+ max_tokens: Maximum tokens per chunk
657
+ chunk_overlap: Overlap between chunks
658
+ papers_json: Path to filename-keyed JSON with paper metadata.
659
+ Keys are PDF stems (no extension), e.g. {"1": {...}, "2": {...}}
660
+ """
661
+ self.db_path = db_path
662
+ self.collection_name = collection_name
663
+ self.embedding_model_name = embedding_model
664
+ self.max_tokens = max_tokens
665
+ self.chunk_overlap = chunk_overlap
666
+
667
+ # Load paper registry (filename stem -> paper info dict)
668
+ self.paper_registry = self._load_paper_registry(papers_json)
669
+
670
+ # Initialize components
671
+ self._init_docling()
672
+ self._init_chunker()
673
+ self._init_embedder()
674
+ self._init_chromadb()
675
+
676
+ def _load_paper_registry(self, papers_json: Optional[str]) -> Dict:
677
+ """Load the filename-keyed paper metadata JSON.
678
+
679
+ Args:
680
+ papers_json: Path to JSON file. Expected format:
681
+ { "<pdf_stem>": { <paper fields> }, ... }
682
+ Returns:
683
+ Dict mapping pdf stem -> paper info, or empty dict if not provided.
684
+ """
685
+ if not papers_json:
686
+ print("ℹ No papers JSON provided — paper metadata will not be attached to chunks.")
687
+ return {}
688
+
689
+ try:
690
+ with open(papers_json, 'r', encoding='utf-8') as f:
691
+ registry = json.load(f)
692
+ print(f"✓ Loaded paper registry from {papers_json} ({len(registry)} entries)")
693
+ return registry
694
+ except FileNotFoundError:
695
+ print(f"⚠ Papers JSON not found: {papers_json} — continuing without paper metadata.")
696
+ return {}
697
+ except json.JSONDecodeError as e:
698
+ print(f"⚠ Failed to parse papers JSON: {e} — continuing without paper metadata.")
699
+ return {}
700
+
701
+ def _get_paper_info(self, pdf_stem: str) -> Dict:
702
+ """Look up paper metadata by PDF filename stem.
703
+
704
+ Args:
705
+ pdf_stem: PDF filename without extension, e.g. '1' for '1.pdf'
706
+ Returns:
707
+ Paper info dict, or empty dict if not found.
708
+ """
709
+ info = self.paper_registry.get(pdf_stem, {})
710
+ if not info:
711
+ print(f" ⚠ No paper metadata found for '{pdf_stem}' in registry.")
712
+ return info
713
+
714
+ def _init_docling(self):
715
+ """Initialize docling converter."""
716
+ if not DOCLING_AVAILABLE:
717
+ raise ImportError("docling required. Install: pip install docling")
718
+ self.converter = DocumentConverter()
719
+ print("✓ Docling converter initialized")
720
+
721
+ def _init_chunker(self):
722
+ """Initialize chunker with tokenizer."""
723
+ self.chunker = SectionAwareChunker(
724
+ model_name=self.embedding_model_name,
725
+ max_tokens=self.max_tokens,
726
+ chunk_overlap_tokens=self.chunk_overlap,
727
+ split_tables=True
728
+ )
729
+ print("✓ Chunker initialized")
730
+
731
+ def _init_embedder(self):
732
+ """Initialize embedding model."""
733
+ if not SENTENCE_TRANSFORMERS_AVAILABLE:
734
+ raise ImportError("sentence-transformers required. Install: pip install sentence-transformers")
735
+
736
+ print(f"Loading embedding model: {self.embedding_model_name}")
737
+ self.embedding_model = SentenceTransformer(self.embedding_model_name)
738
+ print("✓ Embedding model loaded")
739
+
740
+ def _init_chromadb(self):
741
+ """Initialize ChromaDB client and collection."""
742
+ if not CHROMADB_AVAILABLE:
743
+ raise ImportError("chromadb required. Install: pip install chromadb")
744
+
745
+ # Create persistent client
746
+ self.chroma_client = chromadb.PersistentClient(path=self.db_path)
747
+
748
+ # Get or create collection
749
+ self.collection = self.chroma_client.get_or_create_collection(
750
+ name=self.collection_name,
751
+ metadata={"hnsw:space": "cosine"}
752
+ )
753
+ print(f"✓ ChromaDB initialized at {self.db_path}")
754
+ print(f"✓ Collection '{self.collection_name}' ready (existing docs: {self.collection.count()})")
755
+
756
+ def extract_pdf(self, pdf_path: str) -> str:
757
+ """Extract markdown from PDF using docling."""
758
+ print(f" Extracting PDF: {pdf_path}")
759
+ result = self.converter.convert(pdf_path)
760
+ markdown_text = result.document.export_to_markdown()
761
+ print(f" ✓ Extracted {len(markdown_text)} characters")
762
+ return markdown_text
763
+
764
+ def clean_markdown(self, markdown_text: str) -> str:
765
+ """Clean markdown using rag_md_cleaner."""
766
+ print(f" Cleaning markdown...")
767
+ cleaned = clean_markdown_for_rag(
768
+ markdown_text,
769
+ remove_tables=False, # Keep tables
770
+ remove_figures=True,
771
+ remove_references=True,
772
+ reference_mode="conservative",
773
+ remove_metadata=True,
774
+ )
775
+ print(f" ✓ Cleaned to {len(cleaned)} characters")
776
+ return cleaned
777
+
778
+ def chunk_text(self, text: str, source_file: str) -> List[Dict]:
779
+ """Chunk text with section awareness, attaching paper metadata."""
780
+ print(f" Chunking text...")
781
+ chunks = self.chunker.chunk_markdown(text, source_file=source_file)
782
+
783
+ # Attach paper metadata to every chunk
784
+ paper_info = self._get_paper_info(source_file)
785
+ for chunk in chunks:
786
+ chunk['metadata']['paper'] = paper_info
787
+
788
+ # Validation
789
+ max_tokens = max(c['metadata']['token_count'] for c in chunks)
790
+ table_chunks = sum(1 for c in chunks if c['metadata'].get('is_table'))
791
+
792
+ print(f" ✓ Created {len(chunks)} chunks")
793
+ # print(f" - Text chunks: {len(chunks) - table_chunks}")
794
+ # print(f" - Table chunks: {table_chunks}")
795
+ print(f" - Max tokens: {max_tokens}")
796
+ if paper_info:
797
+ print(f" - Paper: {paper_info.get('title', paper_info.get('reference_id', '?'))}")
798
+
799
+ return chunks
800
+
801
+ def embed_chunks(self, chunks: List[Dict]) -> List[List[float]]:
802
+ """Create embeddings for chunks."""
803
+ print(f" Embedding {len(chunks)} chunks...")
804
+ texts = [chunk['text'] for chunk in chunks]
805
+ embeddings = self.embedding_model.encode(
806
+ texts,
807
+ show_progress_bar=True,
808
+ normalize_embeddings=True
809
+ )
810
+ print(f" ✓ Created embeddings")
811
+ return embeddings.tolist()
812
+
813
+ def store_in_chromadb(
814
+ self,
815
+ chunks: List[Dict],
816
+ embeddings: List[List[float]],
817
+ pdf_filename: str
818
+ ):
819
+ """Store chunks and embeddings in ChromaDB."""
820
+ print(f" Storing in ChromaDB...")
821
+
822
+ # Prepare data
823
+ ids = [f"{pdf_filename}_{i}" for i in range(len(chunks))]
824
+ documents = [chunk['text'] for chunk in chunks]
825
+ metadatas = []
826
+
827
+ for chunk in chunks:
828
+ # Flatten metadata for ChromaDB (ChromaDB only accepts scalar values)
829
+ metadata = {
830
+ 'source_file': chunk['metadata'].get('source_file', ''),
831
+ 'section_type': chunk['metadata'].get('section_type', 'other'),
832
+ 'is_table': str(chunk['metadata'].get('is_table', False)),
833
+ 'token_count': chunk['metadata']['token_count'],
834
+ 'chunk_index': chunk['metadata']['chunk_index'],
835
+ 'timestamp': datetime.now().isoformat(),
836
+ }
837
+
838
+ # Add header hierarchy
839
+ for level in ['h1', 'h2', 'h3', 'h4']:
840
+ if level in chunk['metadata']:
841
+ metadata[level] = chunk['metadata'][level]
842
+
843
+ # Attach paper metadata — serialised as JSON string so ChromaDB accepts it.
844
+ # To retrieve: json.loads(chunk_metadata['paper'])
845
+ paper_info = chunk['metadata'].get('paper', {})
846
+ metadata['paper'] = json.dumps(paper_info) if paper_info else '{}'
847
+
848
+ # Also write each tag array as a flat pipe-delimited scalar field, since
849
+ # ChromaDB metadata values must be scalars.
850
+ # e.g. paper_tag_cancer = "NSCLC|Renal Cell Carcinoma|Bladder Cancer"
851
+ # Note: ChromaDB metadata where-clauses support only equality/comparison
+ # operators ($eq, $in, ...), so substring tag matching is done after
+ # retrieval, e.g. "NSCLC" in metadata["paper_tag_cancer"].split("|")
852
+ tags = paper_info.get('tags', {}) if paper_info else {}
853
+ for tag_key, tag_values in tags.items():
854
+ if isinstance(tag_values, list) and tag_values:
855
+ metadata[f'paper_tag_{tag_key}'] = '|'.join(str(v) for v in tag_values)
856
+
857
+ metadatas.append(metadata)
858
+
859
+ # Add to collection
860
+ self.collection.add(
861
+ ids=ids,
862
+ embeddings=embeddings,
863
+ documents=documents,
864
+ metadatas=metadatas
865
+ )
866
+
867
+ print(f" ✓ Stored {len(chunks)} chunks in ChromaDB")
868
+ print(f" ✓ Total documents in collection: {self.collection.count()}")
869
+
870
+ def process_pdf(self, pdf_path: str) -> Dict:
871
+ """Process a single PDF through the entire pipeline."""
872
+ pdf_filename = Path(pdf_path).stem
873
+ print(f"\n{'='*80}")
874
+ print(f"Processing: {pdf_filename}")
875
+ print(f"{'='*80}")
876
+
877
+ try:
878
+ # Extract
879
+ markdown = self.extract_pdf(pdf_path)
880
+
881
+ # Clean
882
+ cleaned = self.clean_markdown(markdown)
883
+
884
+ # Chunk (paper metadata is attached inside chunk_text)
885
+ chunks = self.chunk_text(cleaned, source_file=pdf_filename)
886
+
887
+ # with open("/content/files/chunks/" + pdf_filename + ".json", "w", encoding="utf-8") as f:
888
+ # json.dump(chunks, f, indent=2, ensure_ascii=False)
889
+ # print(f"\n✓ Saved {len(chunks)} chunks")
890
+
891
+ # Embed
892
+ embeddings = self.embed_chunks(chunks)
893
+
894
+ # Store
895
+ self.store_in_chromadb(chunks, embeddings, pdf_filename)
896
+
897
+ result = {
898
+ 'status': 'success',
899
+ 'pdf_file': pdf_filename,
900
+ 'num_chunks': len(chunks),
901
+ 'max_tokens': max(c['metadata']['token_count'] for c in chunks),
902
+ }
903
+
904
+ print(f"✓ Successfully processed {pdf_filename}")
905
+ return result
906
+
907
+ except Exception as e:
908
+ print(f"✗ Error processing {pdf_filename}: {str(e)}")
909
+ return {
910
+ 'status': 'error',
911
+ 'pdf_file': pdf_filename,
912
+ 'error': str(e)
913
+ }
914
+
915
+ def process_folder(self, input_folder: str) -> List[Dict]:
916
+ """Process all PDFs in a folder."""
917
+ pdf_files = list(Path(input_folder).glob("*.pdf"))
918
+
919
+ if not pdf_files:
920
+ print(f"No PDF files found in {input_folder}")
921
+ return []
922
+
923
+ print(f"\nFound {len(pdf_files)} PDF files to process")
924
+
925
+ results = []
926
+ for pdf_path in pdf_files:
927
+ result = self.process_pdf(str(pdf_path))
928
+ results.append(result)
929
+
930
+ # Summary
931
+ successful = sum(1 for r in results if r['status'] == 'success')
932
+ failed = len(results) - successful
933
+
934
+ print(f"\n{'='*80}")
935
+ print("PIPELINE SUMMARY")
936
+ print(f"{'='*80}")
937
+ print(f"Total PDFs: {len(results)}")
938
+ print(f"Successful: {successful}")
939
+ print(f"Failed: {failed}")
940
+
941
+ if successful > 0:
942
+ total_chunks = sum(r.get('num_chunks', 0) for r in results if r['status'] == 'success')
943
+ print(f"Total chunks created: {total_chunks}")
944
+ print(f"ChromaDB collection size: {self.collection.count()}")
945
+
946
+ return results
947
+
948
+ def query(
949
+ self,
950
+ query_text: str,
951
+ n_results: int = 5,
952
+ filter_section: Optional[str] = None
953
+ ) -> Dict:
954
+ """Query the ChromaDB collection."""
955
+ # Create query embedding
956
+ query_embedding = self.embedding_model.encode(
957
+ [query_text],
958
+ normalize_embeddings=True
959
+ )[0].tolist()
960
+
961
+ # Build filter
962
+ where = {}
963
+ if filter_section:
964
+ where['section_type'] = filter_section
965
+
966
+ # Query
967
+ results = self.collection.query(
968
+ query_embeddings=[query_embedding],
969
+ n_results=n_results,
970
+ where=where if where else None
971
+ )
972
+
973
+ return results
974
+
975
+
976
+ def main():
977
+ parser = argparse.ArgumentParser(
978
+ description='PDF to ChromaDB Pipeline with PubMedBERT'
979
+ )
980
+ parser.add_argument(
981
+ '--input-folder',
982
+ required=True,
983
+ help='Folder containing PDF files'
984
+ )
985
+ parser.add_argument(
986
+ '--db-path',
987
+ default='./chroma_db',
988
+ help='Path to ChromaDB database (default: ./chroma_db)'
989
+ )
990
+ parser.add_argument(
991
+ '--collection-name',
992
+ default='research_papers',
993
+ help='ChromaDB collection name (default: research_papers)'
994
+ )
995
+ parser.add_argument(
996
+ '--max-tokens',
997
+ type=int,
998
+ default=330,
999
+ help='Maximum tokens per chunk (default: 330)'
1000
+ )
1001
+ parser.add_argument(
1002
+ '--overlap',
1003
+ type=int,
1004
+ default=30,
1005
+ help='Chunk overlap in tokens (default: 30)'
1006
+ )
1007
+ parser.add_argument(
1008
+ '--embedding-model',
1009
+ default='pritamdeka/S-PubMedBert-MS-MARCO',
1010
+ help='HuggingFace embedding model'
1011
+ )
1012
+ parser.add_argument(
1013
+ '--papers-json',
1014
+ default=None,
1015
+ help='Path to filename-keyed paper metadata JSON (e.g. research_papers.json)'
1016
+ )
1017
+
1018
+ args = parser.parse_args()
1019
+
1020
+ # Initialize pipeline
1021
+ print("\nInitializing PDF to ChromaDB Pipeline...")
1022
+ print(f"{'='*80}")
1023
+
1024
+ pipeline = PDFToChromaDBPipeline(
1025
+ db_path=args.db_path,
1026
+ collection_name=args.collection_name,
1027
+ embedding_model=args.embedding_model,
1028
+ max_tokens=args.max_tokens,
1029
+ chunk_overlap=args.overlap,
1030
+ papers_json=args.papers_json,
1031
+ )
1032
+
1033
+ # Process folder
1034
+ results = pipeline.process_folder(args.input_folder)
1035
+
1036
+ # Save results log
1037
+ log_file = f"pipeline_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
1038
+ with open(log_file, 'w') as f:
1039
+ json.dump(results, f, indent=2)
1040
+ print(f"\n✓ Results saved to {log_file}")
1041
+
1042
+
1043
+ if __name__ == "__main__":
1044
+ main()
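
For orientation, a minimal usage sketch of this pipeline. The module name `pdf_to_chromadb` and all paths here are assumptions; adjust the import to wherever this file lives:

```python
# Minimal sketch; module name and paths are assumptions, not part of the repo.
import json

from pdf_to_chromadb import PDFToChromaDBPipeline  # hypothetical module name

pipeline = PDFToChromaDBPipeline(
    db_path="./chroma_db",
    collection_name="research_papers",
    papers_json="rag/research_papers.json",  # keys are PDF stems: "1" -> 1.pdf
)
pipeline.process_folder("./papers")

# Retrieve evidence, optionally restricted to a section type
results = pipeline.query(
    "Akkermansia and PD-1 response", n_results=5, filter_section="results"
)

for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    paper = json.loads(meta["paper"])  # paper metadata was stored as a JSON string
    print(paper.get("citation", "?"), "->", doc[:80])
```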
rag/rag_md_cleaner.py ADDED
@@ -0,0 +1,278 @@
1
+ """
2
+ rag_md_cleaner.py
3
+
4
+ Markdown cleaner optimized for PDF -> Markdown extraction
5
+ (docling style) intended for RAG ingestion.
6
+
7
+ Primary concerns addressed:
8
+ - HTML comments like <!-- image -->
9
+ - Ligature /uniFB01 /uniFB02 /uniFB03 artifacts and unicode normalization
10
+ - Broken hyphen spacing "immune - related" => "immune-related"
11
+ - Standalone pipe lines " | "
12
+ - Tables (optional removal)
13
+ - Figure captions (optional removal)
14
+ - Reference removal (explicit headers, then conservative/aggressive heuristics)
15
+ - Removal of common metadata sections (Funding, Author contributions, Conflict of interest, Publisher's note)
16
+ """
17
+
18
+ import re
19
+ import unicodedata
21
+
22
+ def normalize_unicode(text: str) -> str:
23
+ text = unicodedata.normalize("NFKC", text)
24
+ # common PDF extraction broken ligature tokens and odd markers
25
+ ligature_map = {
26
+ "/uniFB01": "fi",
27
+ "/uniFB02": "fl",
28
+ "/uniFB03": "ffi",
29
+ "/uniFB04": "ffl",
30
+ "\ufb01": "fi",
31
+ "\ufb02": "fl",
32
+ }
33
+ for k, v in ligature_map.items():
34
+ text = text.replace(k, v)
35
+ return text
36
+
37
+ def is_reference_like_line(line: str) -> bool:
38
+ s = line.strip()
39
+ if not s:
40
+ return False
41
+
42
+ # common signals of a reference line
43
+ patterns = [
44
+ r"\bdoi\s*:\s*10\.",
45
+ r"\bdoi\.?/?10\.",
46
+ r"\(\s*\d{4}\s*\)",
47
+ r"^\s*\d+\.\s+",
48
+ r"\bet al\.",
49
+ r"\bPMID\b|\bPMC\b",
50
+ ]
51
+ for p in patterns:
52
+ if re.search(p, s, flags=re.IGNORECASE):
53
+ return True
54
+
55
+ comma_count = s.count(",")
56
+ if comma_count >= 3 and len(s) < 300:
57
+ # heuristic: author lines usually have at least one capitalized surname-like token
58
+ if re.search(r"[A-Z][a-z]{2,}\s+[A-Z]\b", s) or re.search(r"[A-Z][a-z]{2,},\s+[A-Z]", s):
59
+ return True
60
+
61
+ # journal-like end pattern: volume:pages or year;volume:pages
62
+ if re.search(r"\d{4}\).*?\d{1,4}[:](\d|–|-)", s) or re.search(r"\b\d{1,4}:\d{1,4}\b", s):
63
+ return True
64
+
65
+ return False
66
+
67
+
68
+
69
+ def remove_references_section(
70
+ text: str,
71
+ mode: str = "conservative",
72
+ consecutive_threshold_conservative: int = 5,
73
+ consecutive_threshold_aggressive: int = 3,
74
+ window_tail_fraction: float = 0.35,
75
+ ) -> str:
76
+ """
77
+ Remove references from text.
78
+
79
+ Parameters
80
+ ----------
81
+ text : str
82
+ Raw markdown text
83
+ mode : str
84
+ "conservative" (safer, fewer false positives) or "aggressive" (more likely to remove refs)
85
+ consecutive_threshold_conservative : int
86
+ number of consecutive ref-like lines required for conservative mode
87
+ consecutive_threshold_aggressive : int
88
+ threshold for aggressive mode
89
+ window_tail_fraction : float
90
+ fraction of document considered the "tail" for additional detection
91
+
92
+ Returns
93
+ -------
94
+ str
95
+ Text trimmed before detected references (or original if none detected)
96
+ """
97
+
98
+ # 1) Try explicit headers
99
+ header_regexes = [
100
+ r"\n##\s*REFERENCES\b", r"\n##\s*References\b", r"\nREFERENCES\b", r"\nReferences\b"
101
+ ]
102
+ for hdr in header_regexes:
103
+ m = re.search(hdr, text, flags=re.IGNORECASE)
104
+ if m:
105
+ return text[: m.start()]
106
+
107
+ # 2) Heuristic scan for consecutive reference-like lines
108
+ lines = text.splitlines()
109
+ threshold = consecutive_threshold_conservative if mode == "conservative" else consecutive_threshold_aggressive
110
+
111
+ consecutive = 0
112
+ for idx, line in enumerate(lines):
113
+ if is_reference_like_line(line):
114
+ consecutive += 1
115
+ else:
116
+ consecutive = 0
117
+
118
+ if consecutive >= threshold:
119
+ # find the first line index where this consecutive block started
120
+ start_idx = idx - consecutive + 1
121
+ # return up to before that block
122
+ return "\n".join(lines[:start_idx]).rstrip()
123
+
124
+ # 3) Tail-window detection: maybe refs are at the end but not consecutive enough earlier
125
+ n_lines = len(lines)
126
+ tail_start = int(n_lines * (1.0 - window_tail_fraction))
127
+ tail = lines[tail_start:]
128
+ # compute fraction of ref-like lines in tail
129
+ ref_like_count = sum(1 for L in tail if is_reference_like_line(L))
130
+ if len(tail) > 10 and (ref_like_count / len(tail)) > 0.25:
131
+ # find first ref-like line in tail and cut before it
132
+ for i, L in enumerate(tail):
133
+ if is_reference_like_line(L):
134
+ return "\n".join(lines[: tail_start + i]).rstrip()
135
+
136
+ # 4) No reliable reference block found -> return original
137
+ return text
138
+
139
+
140
+
141
+ def remove_metadata_sections(text: str) -> str:
142
+ """
143
+ Remove common metadata sections by header names.
144
+ Cuts text at the first occurrence of any of these headers.
145
+ """
146
+ meta_headers = [
147
+ "AUTHOR CONTRIBUTIONS",
148
+ "Author contributions",
149
+ "FUNDING",
150
+ "Funding",
151
+ "CONFLICT OF INTEREST",
152
+ "CONFLICT OF INTEREST STATEMENT",
153
+ "Conflict of interest",
154
+ "CONFLICT OF INTEREST STATEMENT",
155
+ "Publisher's note",
156
+ "Publisher ' s note",
157
+ "Publisher?s note",
158
+ "ACKNOWLEDGMENTS",
159
+ "Acknowledgments",
160
+ "DATA AVAILABILITY STATEMENT",
161
+ "Data availability statement",
162
+ "ORCID",
163
+ "DATA AVAILABILITY",
164
+ "Data Availability",
165
+ ]
166
+ pattern = r"\n(?:" + "|".join([re.escape(h) for h in meta_headers]) + r")\b"
167
+ m = re.search(pattern, text, flags=re.IGNORECASE)
168
+ if m:
169
+ return text[: m.start()].rstrip()
170
+ return text
171
+
172
+
173
+ # ---------------------------
174
+ # Main cleaning function
175
+ # ---------------------------
176
+ def clean_markdown_for_rag(
177
+ markdown_text: str,
178
+ remove_tables: bool = False,
179
+ remove_figures: bool = True,
180
+ remove_references: bool = True,
181
+ reference_mode: str = "conservative",
182
+ remove_metadata: bool = True,
183
+ collapse_multiblank: bool = True,
184
+ ) -> str:
185
+ """
186
+ Clean markdown text extracted from PDFs to produce RAG-friendly text.
187
+
188
+ Parameters
189
+ ----------
190
+ markdown_text : str
191
+ Raw markdown
192
+ remove_tables : bool
193
+ Remove markdown tables (default False).
194
+ remove_figures : bool
195
+ Remove figure captions / figure blocks (default True)
196
+ remove_references : bool
197
+ Attempt to remove references (default True)
198
+ reference_mode : str
199
+ 'conservative' or 'aggressive'
200
+ remove_metadata : bool
201
+ Remove author contributions/funding/conflict etc (default True)
202
+ collapse_multiblank : bool
203
+ Collapse >2 newlines to 2 newlines (default True)
204
+
205
+ Returns
206
+ -------
207
+ str
208
+ Cleaned markdown
209
+ """
210
+ text = markdown_text
211
+
212
+ # normalize unicode & ligatures
213
+ text = normalize_unicode(text)
214
+
215
+ # remove HTML comments like <!-- image -->
216
+ text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
217
+
218
+ # remove explicit image placeholders (variants)
219
+ text = re.sub(r"<\s*--\s*image\s*--\s*>", "", text, flags=re.IGNORECASE)
220
+ text = re.sub(r"\[image:\s*.*?\]", "", text, flags=re.IGNORECASE)
221
+
222
+ # remove standalone pipes lines
223
+ text = re.sub(r"^\s*\|\s*$", "", text, flags=re.MULTILINE)
224
+
225
+ # Optionally remove markdown tables (entire blocks). Conservative removal:
226
+ if remove_tables:
227
+ # Remove contiguous table-like blocks beginning and ending with pipes or table dividers
228
+ text = re.sub(
229
+ r"\n(?:\|[^\n]*\|\s*\n(?:\|[-:\s|]+\|\s*\n)?(?:\|[^\n]*\|\s*\n)+)",
230
+ "\n",
231
+ text,
232
+ flags=re.DOTALL,
233
+ )
234
+ # Also remove inline table fragments
235
+ text = re.sub(r"\n\|[-:\s|]+\|\n", "\n", text)
236
+
237
+ else:
238
+ # clean duplicate pipes
239
+ text = re.sub(r"\|{2,}", "|", text)
240
+
241
+ # Remove figure captions/blocks like FIGURE 1 ... Figure 1: ... or Fig. 1
242
+ if remove_figures:
243
+ text = re.sub(r"(?is)\bFIGURE\s*\d+[:.\s\S]*?(?=\n##|\n[A-Z]{2,}|$)", "", text)
244
+ text = re.sub(r"(?is)\bFigure\s*\d+[:.\s\S]*?(?=\n##|\n[A-Z]{2,}|$)", "", text)
245
+ text = re.sub(r"(?is)\bFig\.\s*\d+[:.\s\S]*?(?=\n##|\n[A-Z]{2,}|$)", "", text)
246
+
247
+ # Fix hyphen spacing artifacts: "immune - related" -> "immune-related"
248
+ text = re.sub(r"\s-\s+", "-", text)
249
+
250
+ # Fix spaced punctuation: "word ," -> "word,"
251
+ text = re.sub(r"\s+([,.;:])", r"\1", text)
252
+
253
+ # Remove common publisher footer / 'Downloaded from...' blocks by looking for typical phrases
254
+ text = re.sub(
255
+ r"Downloaded from .*?Terms and Conditions.*?(?:\n|$)",
256
+ "",
257
+ text,
258
+ flags=re.IGNORECASE | re.DOTALL,
259
+ )
260
+ # generic residual footer lines with DOI-like trailing
261
+ text = re.sub(r"\n\d{6,}x?,?\s*\d{4}.*$", "", text, flags=re.DOTALL)
262
+
263
+ # Remove references (robust)
264
+ if remove_references:
265
+ text = remove_references_section(text, mode=reference_mode)
266
+
267
+ # Remove metadata sections (Funding, Author contributions, Conflicts, ORCID, etc.)
268
+ if remove_metadata:
269
+ text = remove_metadata_sections(text)
270
+
271
+ # Collapse excessive blank lines
272
+ if collapse_multiblank:
273
+ text = re.sub(r"\n{3,}", "\n\n", text)
274
+
275
+ # Trim leading/trailing whitespace
276
+ text = text.strip()
277
+
278
+ return text
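
As a quick end-to-end illustration of the cleaner (a sketch; the sample string is invented to exercise the main fixes):

```python
# Sketch only; the input below is synthetic.
from rag_md_cleaner import clean_markdown_for_rag

raw = (
    "<!-- image -->\n"
    "The e/uniFB03cacy of immune - related biomarkers was signi\ufb01cant.\n\n"
    "FUNDING\nThis work was supported by grant XYZ."
)
print(clean_markdown_for_rag(raw))
# -> "The efficacy of immune-related biomarkers was significant."
```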
rag/requirements.txt ADDED
@@ -0,0 +1,17 @@
1
+ # PDF to ChromaDB Pipeline Requirements
2
+ # For research paper processing with PubMedBERT embeddings
3
+
4
+ # Core dependencies
5
+ docling>=1.0.0 # PDF extraction
6
+ transformers>=4.40.0 # PubMedBERT tokenizer (aligned with root)
7
+ torch>=2.0.0 # Required by transformers
8
+ sentence-transformers>=2.5.0 # Embedding generation (aligned with root)
9
+ chromadb>=0.4.0 # Vector database
10
+
11
+ # Optional but recommended
12
+ numpy>=1.24.0 # Array operations
13
+ tqdm>=4.65.0 # Progress bars (used by sentence-transformers)
14
+
15
+ # For development/testing
16
+ ipython>=8.0.0 # Interactive shell
17
+ jupyter>=1.0.0 # Notebook support
rag/research_papers.json ADDED
@@ -0,0 +1,555 @@
1
+ {
2
+ "1": {
3
+ "reference_id": "25",
4
+ "citation": "Takada et al., Int J Cancer 2021",
5
+ "title": "Clinical impact of probiotics on the efficacy of anti-PD-1 monotherapy in patients with NSCLC",
6
+ "year": 2021,
7
+ "tags": {
8
+ "treatment": [
9
+ "PD-1/PD-L1 Blockade"
10
+ ],
11
+ "cancer": [
12
+ "NSCLC"
13
+ ],
14
+ "biology": [
15
+ "Gut microbiome composition",
16
+ "Alpha diversity"
17
+ ],
18
+ "intervention": [
19
+ "Probiotics"
20
+ ]
21
+ }
22
+ },
23
+ "2": {
24
+ "reference_id": "2",
25
+ "citation": "Maynard et al., Nature 2012",
26
+ "title": "Reciprocal interactions of the intestinal microbiota and immune system",
27
+ "year": 2012,
28
+ "tags": {
29
+ "treatment": [],
30
+ "cancer": [],
31
+ "biology": [
32
+ "Innate immunity",
33
+ "Adaptive immunity",
34
+ "T cell-microbiota interactions",
35
+ "Immune homeostasis"
36
+ ],
37
+ "intervention": [
38
+ "Review article"
39
+ ]
40
+ }
41
+ },
42
+ "3": {
43
+ "reference_id": "3",
44
+ "citation": "Dzutsev et al., Annu Rev Immunol 2017",
45
+ "title": "Microbes and cancer",
46
+ "year": 2017,
47
+ "tags": {
48
+ "treatment": [],
49
+ "cancer": [
50
+ "General"
51
+ ],
52
+ "biology": [
53
+ "Tumorigenesis",
54
+ "Microbiome-cancer interactions",
55
+ "Inflammation"
56
+ ],
57
+ "intervention": [
58
+ "Review article"
59
+ ]
60
+ }
61
+ },
62
+ "4": {
63
+ "reference_id": "4",
64
+ "citation": "Zheng et al., Cell Res 2020",
65
+ "title": "Interaction between microbiota and immunity in health and disease",
66
+ "year": 2020,
67
+ "tags": {
68
+ "treatment": [],
69
+ "cancer": [],
70
+ "biology": [
71
+ "Immune modulation",
72
+ "Inflammation",
73
+ "Host-microbiota interactions"
74
+ ],
75
+ "intervention": [
76
+ "Review article"
77
+ ]
78
+ }
79
+ },
80
+ "5": {
81
+ "reference_id": "5",
82
+ "citation": "Cullin et al., Cancer Cell 2021",
83
+ "title": "Microbiome and cancer",
84
+ "year": 2021,
85
+ "tags": {
86
+ "treatment": [],
87
+ "cancer": [
88
+ "General"
89
+ ],
90
+ "biology": [
91
+ "Tumor microenvironment",
92
+ "Microbiome-cancer interactions",
93
+ "Immune regulation"
94
+ ],
95
+ "intervention": [
96
+ "Review article"
97
+ ]
98
+ }
99
+ },
100
+ "44": {
101
+ "reference_id": "44",
102
+ "citation": "Han et al., Nat Biomed Eng 2021",
103
+ "title": "Generation of systemic anti-tumour immunity via the in situ modulation of the gut microbiome by an orally administered inulin gel",
104
+ "year": 2021,
105
+ "tags": {
106
+ "treatment": [
107
+ "PD-1/PD-L1 Blockade"
108
+ ],
109
+ "cancer": [
110
+ "Preclinical tumor models"
111
+ ],
112
+ "biology": [
113
+ "Memory T cells",
114
+ "Microbiome modulation"
115
+ ],
116
+ "intervention": [
117
+ "Prebiotics",
118
+ "Inulin"
119
+ ]
120
+ }
121
+ },
122
+ "46": {
123
+ "reference_id": "46",
124
+ "citation": "Wastyk et al., Cell 2021",
125
+ "title": "Gut-microbiota-targeted diets modulate human immune status",
126
+ "year": 2021,
127
+ "tags": {
128
+ "treatment": [],
129
+ "cancer": [],
130
+ "biology": [
131
+ "Fiber",
132
+ "Fermented foods",
133
+ "Microbial diversity",
134
+ "Inflammatory markers"
135
+ ],
136
+ "intervention": [
137
+ "Diet",
138
+ "High-fiber diet",
139
+ "Fermented food diet"
140
+ ]
141
+ }
142
+ },
143
+ "47": {
144
+ "reference_id": "47",
145
+ "citation": "Huang et al., Gut 2022",
146
+ "title": "Ginseng polysaccharides alter the gut microbiota and kynurenine/tryptophan ratio, potentiating anti-PD-1/PD-L1 immunotherapy",
147
+ "year": 2022,
148
+ "tags": {
149
+ "treatment": [
150
+ "PD-1/PD-L1 Blockade"
151
+ ],
152
+ "cancer": [
153
+ "Preclinical tumor models"
154
+ ],
155
+ "biology": [
156
+ "Tryptophan metabolism",
157
+ "Kynurenine pathway",
158
+ "Microbiome modulation"
159
+ ],
160
+ "intervention": [
161
+ "Prebiotics",
162
+ "Polysaccharides"
163
+ ]
164
+ }
165
+ },
168
+ "6": {
169
+ "reference_id": "6",
170
+ "citation": "Routy et al., Science 2018",
171
+ "title": "Gut microbiome influences efficacy of PD-1-based immunotherapy against epithelial tumors",
172
+ "year": 2018,
173
+ "tags": {
174
+ "treatment": ["PD-1/PD-L1 Blockade"],
175
+ "cancer": ["NSCLC", "RCC"],
176
+ "biology": ["Akkermansia muciniphila", "Alpha diversity", "FMT validation"],
177
+ "intervention": ["Fecal microbiota transplantation"]
178
+ }
179
+ },
180
+ "8": {
181
+ "reference_id": "8",
182
+ "citation": "Matson et al., Science 2018",
183
+ "title": "The commensal microbiome is associated with anti-PD-1 efficacy in metastatic melanoma patients",
184
+ "year": 2018,
185
+ "tags": {
186
+ "treatment": ["PD-1/PD-L1 Blockade"],
187
+ "cancer": ["Melanoma"],
188
+ "biology": ["Bifidobacterium", "Faecalibacterium", "FMT validation"],
189
+ "intervention": ["Fecal microbiota transplantation"]
190
+ }
191
+ },
192
+ "9": {
193
+ "reference_id": "9",
194
+ "citation": "Jin et al., J Thorac Oncol 2019",
195
+ "title": "The diversity of gut microbiome is associated with favorable responses to anti-PD-1 immunotherapy in Chinese NSCLC patients",
196
+ "year": 2019,
197
+ "tags": {
198
+ "treatment": ["PD-1/PD-L1 Blockade"],
199
+ "cancer": ["NSCLC"],
200
+ "biology": ["Alpha diversity", "Alistipes", "Prevotella"],
201
+ "intervention": []
202
+ }
203
+ },
204
+ "10": {
205
+ "reference_id": "10",
206
+ "citation": "Lee et al., Nat Med 2022",
207
+ "title": "Cross-cohort gut microbiome associations with immune checkpoint inhibitor response in advanced melanoma",
208
+ "year": 2022,
209
+ "tags": {
210
+ "treatment": ["PD-1/PD-L1 Blockade"],
211
+ "cancer": ["Melanoma"],
212
+ "biology": ["Roseburia", "Akkermansia", "Bifidobacterium", "Meta-analysis"],
213
+ "intervention": []
214
+ }
215
+ },
216
+ "11": {
217
+ "reference_id": "11",
218
+ "citation": "Smith et al., Nat Med 2022",
219
+ "title": "Gut microbiome correlates of response and toxicity following anti-CD19 CAR T cell therapy",
220
+ "year": 2022,
221
+ "tags": {
222
+ "treatment": ["CAR-T"],
223
+ "cancer": ["B-cell lymphoma"],
224
+ "biology": ["Ruminococcus", "Bacteroides", "Faecalibacterium", "Cytokine release syndrome"],
225
+ "intervention": []
226
+ }
227
+ },
228
+ "12": {
229
+ "reference_id": "12",
230
+ "citation": "Stein-Thoeringer et al., Nat Med 2023",
231
+ "title": "A non-antibiotic-disrupted gut microbiome is associated with clinical responses to CD19-CAR-T cell cancer immunotherapy",
232
+ "year": 2023,
233
+ "tags": {
234
+ "treatment": ["CAR-T"],
235
+ "cancer": ["Lymphoma"],
236
+ "biology": ["Akkermansia", "Ruminococcus lactaris", "Alpha diversity"],
237
+ "intervention": ["Antibiotic exposure"]
238
+ }
239
+ },
240
+ "13": {
241
+ "reference_id": "13",
242
+ "citation": "Hu et al., Nat Commun 2022",
243
+ "title": "CAR-T cell therapy-related cytokine release syndrome and therapeutic response is modulated by the gut microbiome in hematologic malignancies",
244
+ "year": 2022,
245
+ "tags": {
246
+ "treatment": ["CAR-T"],
247
+ "cancer": ["Hematologic malignancies"],
248
+ "biology": ["Faecalibacterium", "Roseburia", "Cytokine release syndrome"],
249
+ "intervention": []
250
+ }
251
+ },
252
+ "15": {
253
+ "reference_id": "15",
254
+ "citation": "Luu et al., Nat Commun 2021",
255
+ "title": "Microbial short-chain fatty acids modulate CD8+ T cell responses and improve adoptive immunotherapy for cancer",
256
+ "year": 2021,
257
+ "tags": {
258
+ "treatment": ["ACT"],
259
+ "cancer": ["Preclinical tumor models"],
260
+ "biology": ["SCFAs", "Butyrate", "CD8+ T cells"],
261
+ "intervention": ["Short-chain fatty acids supplementation"]
262
+ }
263
+ },
264
+ "16": {
265
+ "reference_id": "16",
266
+ "citation": "He et al., Cell Metab 2021",
267
+ "title": "Gut microbial metabolites facilitate anticancer therapy efficacy by modulating cytotoxic CD8+ T cell immunity",
268
+ "year": 2021,
269
+ "tags": {
270
+ "treatment": ["Immunotherapy"],
271
+ "cancer": ["Preclinical tumor models"],
272
+ "biology": ["SCFAs", "Microbial metabolites", "CD8+ T cells"],
273
+ "intervention": []
274
+ }
275
+ },
276
+ "18": {
277
+ "reference_id": "18",
278
+ "citation": "Paik et al., Nature 2022",
279
+ "title": "Human gut bacteria produce TH17-modulating bile acid metabolites",
280
+ "year": 2022,
281
+ "tags": {
282
+ "treatment": [],
283
+ "cancer": [],
284
+ "biology": ["Bile acids", "Th17 cells", "Microbial metabolites"],
285
+ "intervention": []
286
+ }
287
+ },
288
+ "19": {
289
+ "reference_id": "19",
290
+ "citation": "Bender et al., Cell 2023",
291
+ "title": "Dietary tryptophan metabolite released by intratumoral Lactobacillus reuteri facilitates immune checkpoint inhibitor treatment",
292
+ "year": 2023,
293
+ "tags": {
294
+ "treatment": ["PD-1/PD-L1 Blockade"],
295
+ "cancer": ["Melanoma"],
296
+ "biology": ["Tryptophan metabolism", "Indole derivatives", "Lactobacillus reuteri"],
297
+ "intervention": ["Dietary tryptophan modulation"]
298
+ }
299
+ },
300
+ "20": {
301
+ "reference_id": "20",
302
+ "citation": "Hezaveh et al., Immunity 2022",
303
+ "title": "Tryptophan-derived microbial metabolites activate the aryl hydrocarbon receptor in tumor-associated macrophages to suppress anti-tumor immunity",
304
+ "year": 2022,
305
+ "tags": {
306
+ "treatment": [],
307
+ "cancer": ["Preclinical tumor models"],
308
+ "biology": ["Tryptophan metabolism", "Aryl hydrocarbon receptor", "Tumor-associated macrophages"],
309
+ "intervention": []
310
+ }
311
+ },
312
+ "23": {
313
+ "reference_id": "23",
314
+ "citation": "McCulloch et al., Nat Med 2022",
315
+ "title": "Intestinal microbiota signatures of clinical response and immune-related adverse events in melanoma patients treated with anti-PD-1",
316
+ "year": 2022,
317
+ "tags": {
318
+ "treatment": ["PD-1/PD-L1 Blockade"],
319
+ "cancer": ["Melanoma"],
320
+ "biology": ["Immune-related adverse events", "Bacteroides", "Microbiome composition"],
321
+ "intervention": []
322
+ }
323
+ },
324
+ "24": {
325
+ "reference_id": "24",
326
+ "citation": "Derosa et al., Nat Med 2022",
327
+ "title": "Intestinal Akkermansia muciniphila predicts clinical response to PD-1 blockade in patients with advanced NSCLC",
328
+ "year": 2022,
329
+ "tags": {
330
+ "treatment": ["PD-1/PD-L1 Blockade"],
331
+ "cancer": ["NSCLC"],
332
+ "biology": ["Akkermansia muciniphila", "Antibiotic exposure", "Microbiome composition"],
333
+ "intervention": ["Antibiotics"]
334
+ }
335
+ },
336
+ "26": {
337
+ "reference_id": "26",
338
+ "citation": "Zheng et al., J Immunother Cancer 2019",
339
+ "title": "Gut microbiome affects the response to anti-PD-1 immunotherapy in patients with HCC",
340
+ "year": 2019,
341
+ "tags": {
342
+ "treatment": ["PD-1/PD-L1 Blockade"],
343
+ "cancer": ["HCC"],
344
+ "biology": ["Akkermansia", "Ruminococcaceae", "Microbiome composition"],
345
+ "intervention": []
346
+ }
347
+ },
348
+ "27": {
349
+ "reference_id": "27",
350
+ "citation": "Lee et al., J Immunother Cancer 2022",
351
+ "title": "Gut microbiota and metabolites associate with outcomes of immune checkpoint inhibitor-treated unresectable HCC",
352
+ "year": 2022,
353
+ "tags": {
354
+ "treatment": ["PD-1/PD-L1 Blockade"],
355
+ "cancer": ["HCC"],
356
+ "biology": ["Microbial metabolites", "Lachnospiraceae", "Metabolomics"],
357
+ "intervention": []
358
+ }
359
+ },
360
+ "28": {
361
+ "reference_id": "28",
362
+ "citation": "Limeta et al., JCI Insight 2020",
363
+ "title": "Meta-analysis of the gut microbiota in predicting response to cancer immunotherapy in metastatic melanoma",
364
+ "year": 2020,
365
+ "tags": {
366
+ "treatment": ["PD-1/PD-L1 Blockade"],
367
+ "cancer": ["Melanoma"],
368
+ "biology": ["Meta-analysis", "Microbiome composition", "Biomarker discovery"],
369
+ "intervention": []
370
+ }
371
+ },
372
+ "29": {
373
+ "reference_id": "29",
374
+ "citation": "Spencer et al., Science 2021",
375
+ "title": "Dietary fiber and probiotics influence the gut microbiome and melanoma immunotherapy response",
376
+ "year": 2021,
377
+ "tags": {
378
+ "treatment": ["PD-1/PD-L1 Blockade"],
379
+ "cancer": ["Melanoma"],
380
+ "biology": ["Dietary fiber", "Microbiome diversity", "Immune modulation"],
381
+ "intervention": ["Diet", "Probiotics"]
382
+ }
383
+ },
384
+ "30": {
385
+ "reference_id": "30",
386
+ "citation": "Simpson et al., Nat Med 2022",
387
+ "title": "Diet-driven microbial ecology underpins associations between cancer immunotherapy outcomes and the gut microbiome",
388
+ "year": 2022,
389
+ "tags": {
390
+ "treatment": ["PD-1/PD-L1 Blockade"],
391
+ "cancer": ["Melanoma"],
392
+ "biology": ["Dietary fiber", "Microbial ecology", "Microbiome composition"],
393
+ "intervention": ["Diet"]
394
+ }
395
+ },
396
+ "31": {
397
+ "reference_id": "31",
398
+ "citation": "Paulos et al., J Clin Invest 2007",
399
+ "title": "Microbial translocation augments the function of adoptively transferred self/tumor-specific CD8+ T cells via TLR4 signaling",
400
+ "year": 2007,
401
+ "tags": {
402
+ "treatment": ["ACT"],
403
+ "cancer": ["Preclinical tumor models"],
404
+ "biology": ["TLR4 signaling", "LPS", "CD8+ T cells"],
405
+ "intervention": []
406
+ }
407
+ },
408
+ "32": {
409
+ "reference_id": "32",
410
+ "citation": "Uribe-Herranz et al., JCI Insight 2018",
411
+ "title": "Gut microbiota modulates adoptive cell therapy via CD8α dendritic cells and IL-12",
412
+ "year": 2018,
413
+ "tags": {
414
+ "treatment": ["ACT"],
415
+ "cancer": ["Preclinical tumor models"],
416
+ "biology": ["Dendritic cells", "IL-12", "Antibiotic exposure"],
417
+ "intervention": ["Vancomycin"]
418
+ }
419
+ },
420
+ "33": {
421
+ "reference_id": "33",
422
+ "citation": "Luu et al., Sci Rep 2018",
423
+ "title": "Regulation of the effector function of CD8+ T cells by gut microbiota-derived metabolite butyrate",
424
+ "year": 2018,
425
+ "tags": {
426
+ "treatment": ["ACT"],
427
+ "cancer": ["Preclinical tumor models"],
428
+ "biology": ["Butyrate", "HDAC inhibition", "CD8+ T cells"],
429
+ "intervention": ["Short-chain fatty acids supplementation"]
430
+ }
431
+ },
432
+ "34": {
433
+ "reference_id": "34",
434
+ "citation": "Yang et al., Oncoimmunology 2021",
435
+ "title": "Blood microbiota diversity determines response of advanced CRC to chemotherapy combined with adoptive T cell immunotherapy",
436
+ "year": 2021,
437
+ "tags": {
438
+ "treatment": ["ACT"],
439
+ "cancer": ["CRC"],
440
+ "biology": ["Blood microbiome", "Bifidobacterium", "Microbial diversity"],
441
+ "intervention": ["Chemotherapy"]
442
+ }
443
+ },
444
+ "37": {
445
+ "reference_id": "37",
446
+ "citation": "Derosa et al., Ann Oncol 2018",
447
+ "title": "Negative association of antibiotics on clinical activity of immune checkpoint inhibitors in patients with advanced RCC and NSCLC",
448
+ "year": 2018,
449
+ "tags": {
450
+ "treatment": ["PD-1/PD-L1 Blockade"],
451
+ "cancer": ["RCC", "NSCLC"],
452
+ "biology": ["Microbiome disruption"],
453
+ "intervention": ["Antibiotics"]
454
+ }
455
+ },
456
+ "38": {
457
+ "reference_id": "38",
458
+ "citation": "Wilson et al., Cancer Immunol Immunother 2020",
459
+ "title": "The effect of antibiotics on clinical outcomes in immune-checkpoint blockade: a systematic review and meta-analysis",
460
+ "year": 2020,
461
+ "tags": {
462
+ "treatment": ["ICI"],
463
+ "cancer": ["Multiple cancers"],
464
+ "biology": ["Meta-analysis", "Microbiome disruption"],
465
+ "intervention": ["Antibiotics"]
466
+ }
467
+ },
468
+ "39": {
469
+ "reference_id": "39",
470
+ "citation": "Peiffer et al., Neoplasia 2022",
471
+ "title": "Composition of gastrointestinal microbiota in association with treatment response in individuals with metastatic CRPC receiving pembrolizumab",
472
+ "year": 2022,
473
+ "tags": {
474
+ "treatment": ["PD-1/PD-L1 Blockade"],
475
+ "cancer": ["Prostate cancer"],
476
+ "biology": ["Microbiome composition", "Treatment response"],
477
+ "intervention": ["Antibiotics"]
478
+ }
479
+ },
480
+ "40": {
481
+ "reference_id": "40",
482
+ "citation": "Elkrief et al., Oncoimmunology 2019",
483
+ "title": "Antibiotics are associated with decreased PFS in advanced melanoma patients treated with ICI",
484
+ "year": 2019,
485
+ "tags": {
486
+ "treatment": ["ICI"],
487
+ "cancer": ["Melanoma"],
488
+ "biology": ["Microbiome disruption", "Progression-free survival"],
489
+ "intervention": ["Antibiotics"]
490
+ }
491
+ },
492
+ "41": {
493
+ "reference_id": "41",
494
+ "citation": "Chalabi et al., Ann Oncol 2020",
495
+ "title": "Efficacy of chemotherapy and atezolizumab in patients with NSCLC receiving antibiotics and PPIs",
496
+ "year": 2020,
497
+ "tags": {
498
+ "treatment": ["PD-1/PD-L1 Blockade"],
499
+ "cancer": ["NSCLC"],
500
+ "biology": ["Microbiome disruption"],
501
+ "intervention": ["Proton pump inhibitors", "Antibiotics"]
502
+ }
503
+ },
504
+ "42": {
505
+ "reference_id": "42",
506
+ "citation": "Tomita et al., Oncoimmunology 2022",
507
+ "title": "Clostridium butyricum therapy restores the decreased efficacy of ICI in lung cancer patients receiving PPIs",
508
+ "year": 2022,
509
+ "tags": {
510
+ "treatment": ["PD-1/PD-L1 Blockade"],
511
+ "cancer": ["NSCLC"],
512
+ "biology": ["Clostridium butyricum", "Microbiome modulation"],
513
+ "intervention": ["Probiotics", "Proton pump inhibitors"]
514
+ }
515
+ },
516
+ "43": {
517
+ "reference_id": "43",
518
+ "citation": "Terrisse et al., Cell Death Differ 2021",
519
+ "title": "Intestinal microbiota influences clinical outcome and side effects of early breast cancer treatment",
520
+ "year": 2021,
521
+ "tags": {
522
+ "treatment": ["Chemotherapy"],
523
+ "cancer": ["Breast cancer"],
524
+ "biology": ["Dysbiosis", "Treatment toxicity"],
525
+ "intervention": []
526
+ }
527
+ },
528
+ "45": {
529
+ "reference_id": "45",
530
+ "citation": "Zhang et al., Theranostics 2021",
531
+ "title": "Pectin supplement significantly enhanced the anti-PD-1 efficacy in tumor-bearing mice humanized with gut microbiota from CRC patients",
532
+ "year": 2021,
533
+ "tags": {
534
+ "treatment": ["PD-1/PD-L1 Blockade"],
535
+ "cancer": ["CRC", "Preclinical tumor models"],
536
+ "biology": ["Butyrate production", "Microbiome modulation"],
537
+ "intervention": ["Prebiotics", "Pectin"]
538
+ }
539
+ },
540
+ "48": {
541
+ "reference_id": "48",
542
+ "citation": "Dizman et al., Nat Med 2022",
543
+ "title": "Nivolumab plus ipilimumab with or without live bacterial supplementation (CBM588) in metastatic RCC: a randomized phase 1 trial",
544
+ "year": 2022,
545
+ "tags": {
546
+ "treatment": ["PD-1/PD-L1 Blockade", "CTLA-4 Blockade"],
547
+ "cancer": ["RCC"],
548
+ "biology": ["Clostridium butyricum", "Microbiome modulation"],
549
+ "intervention": ["Probiotics", "Live bacterial supplementation"]
550
+ }
551
+ }
552
+ }
553
+
554
+
555
+
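
A small sketch of how one registry entry flows into chunk metadata, mirroring `_load_paper_registry` and the tag flattening in `store_in_chromadb` above:

```python
import json

with open("rag/research_papers.json", encoding="utf-8") as f:
    registry = json.load(f)

paper = registry["1"]  # metadata for 1.pdf; keys are PDF filename stems

flat = {"paper": json.dumps(paper)}  # whole record stored as one JSON string
for tag_key, tag_values in paper.get("tags", {}).items():
    if tag_values:
        flat[f"paper_tag_{tag_key}"] = "|".join(tag_values)

print(flat["paper_tag_cancer"])  # -> "NSCLC"
```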
requirements.txt ADDED
@@ -0,0 +1,16 @@
1
+ # Core dependencies
2
+ torch>=2.0.0
3
+ transformers>=4.40.0
4
+ accelerate>=0.27.0
5
+ sentence-transformers>=2.5.0
6
+ bitsandbytes>=0.43.0
7
+
8
+ # RAG and vector database
9
+ chromadb>=0.4.0
10
+
11
+ # Utilities
12
+ tqdm>=4.65.0
13
+
14
+ # For gradio
15
+ huggingface-hub>=0.23.0
16
+ gradio
src/__init__.py ADDED
@@ -0,0 +1,10 @@
1
+ """
2
+ Microbiome-ICI Report Generator Package
3
+ """
4
+
5
+ __version__ = "1.0.0"
6
+
7
+ from .report_assembler import ReportAssembler
8
+
9
+ __all__ = ["ReportAssembler"]
10
+
src/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (352 Bytes).
src/__pycache__/config.cpython-311.pyc ADDED
Binary file (2.1 kB).
src/__pycache__/models.cpython-311.pyc ADDED
Binary file (7.64 kB).
src/__pycache__/report_assembler.cpython-311.pyc ADDED
Binary file (10.7 kB).
src/__pycache__/section_generators.cpython-311.pyc ADDED
Binary file (24.6 kB).
src/chroma_loader.py ADDED
@@ -0,0 +1,103 @@
"""
ChromaDB loader — downloads the vector database from a HuggingFace dataset
at startup if it is not already present locally.

Replace HF_REPO_ID with your actual dataset repo once it is uploaded.
"""

import logging
import os
from pathlib import Path

from . import config

logger = logging.getLogger(__name__)

# ---------------------------------------------------------------------------
# Configuration — update HF_REPO_ID before deployment
# ---------------------------------------------------------------------------

HF_REPO_ID = "your-username/your-chroma-db-dataset"  # <-- replace this
LOCAL_CHROMA_DIR = Path("./chroma_db")


def ensure_chroma_db() -> str:
    """
    Ensure the ChromaDB is available locally.

    If the local directory already exists and contains ChromaDB files,
    this is a no-op. Otherwise the dataset is downloaded from HuggingFace
    Hub into LOCAL_CHROMA_DIR.

    Returns:
        The absolute path to the local ChromaDB directory (str).

    Raises:
        RuntimeError: If the download fails for any reason.
    """
    chroma_path = LOCAL_CHROMA_DIR.resolve()

    # -----------------------------------------------------------------------
    # Check if a valid ChromaDB already exists locally
    # -----------------------------------------------------------------------
    if _chroma_db_exists(chroma_path):
        logger.info(f"ChromaDB already present at {chroma_path} — skipping download.")
        _update_config(str(chroma_path))
        return str(chroma_path)

    # -----------------------------------------------------------------------
    # Download from HuggingFace Hub
    # -----------------------------------------------------------------------
    logger.info(
        f"ChromaDB not found locally. Downloading from HuggingFace: {HF_REPO_ID}"
    )

    try:
        from huggingface_hub import snapshot_download
    except ImportError:
        raise RuntimeError(
            "huggingface_hub is not installed. "
            "Add it to requirements.txt: huggingface-hub>=0.23.0"
        )

    try:
        downloaded_path = snapshot_download(
            repo_id=HF_REPO_ID,
            repo_type="dataset",
            local_dir=str(chroma_path),
        )
        logger.info(f"ChromaDB downloaded successfully to: {downloaded_path}")

    except Exception as exc:
        raise RuntimeError(
            f"Failed to download ChromaDB from HuggingFace repo '{HF_REPO_ID}': {exc}"
        ) from exc

    if not _chroma_db_exists(chroma_path):
        raise RuntimeError(
            f"Download appeared to succeed but no ChromaDB files were found in "
            f"{chroma_path}. Check that the HuggingFace dataset contains a "
            f"ChromaDB at its root."
        )

    _update_config(str(chroma_path))
    return str(chroma_path)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _chroma_db_exists(path: Path) -> bool:
    """
    Return True if the path looks like a populated ChromaDB directory.
    ChromaDB always writes a 'chroma.sqlite3' file at the root.
    """
    return path.is_dir() and (path / "chroma.sqlite3").exists()


def _update_config(path: str) -> None:
    """Point config.CHROMADB_PERSIST_DIRECTORY at the resolved local path."""
    config.CHROMADB_PERSIST_DIRECTORY = path
    os.environ["CHROMA_DB_PATH"] = path
    logger.info(f"config.CHROMADB_PERSIST_DIRECTORY set to: {path}")
src/config.py ADDED
@@ -0,0 +1,158 @@
"""
Configuration for the Microbiome-ICI Report Generator
"""
import os

# =============================================================================
# Model Configuration
# =============================================================================

# MedGemma 1.5 4B model
MEDGEMMA_MODEL_ID = "google/medgemma-1.5-4b-it"
MEDGEMMA_DEVICE = "cuda"  # Change to "cpu" if no GPU available

# PubMedBERT embedding model
EMBEDDING_MODEL_ID = "pritamdeka/S-PubMedBert-MS-MARCO"
EMBEDDING_DEVICE = "cuda"

# =============================================================================
# Generation Parameters
# =============================================================================

# Greedy decoding for reproducible clinical output. Sampling knobs are pinned
# to neutral values so upstream defaults cannot re-enable sampling; beam-search
# options (e.g. early_stopping) are omitted because num_beams=1 means no beams.
GENERATION_CONFIG = {
    "temperature": 0.0,
    "do_sample": False,
    "top_p": 1.0,
    "top_k": 0,
    "repetition_penalty": 1.0,
    "num_beams": 1,
}


SECTION_MAX_NEW_TOKENS = {
    "section_1": 2200,
    "section_2": 2400,
    "section_3": 3000,
    "section_4": 2200,
    "section_5": 2400,
    "section_6": 1000,
}

# =============================================================================
# ChromaDB Configuration
# =============================================================================

CHROMADB_COLLECTION_NAME = "research_papers"
CHROMADB_PERSIST_DIRECTORY = os.getenv("CHROMA_DB_PATH", "./chroma_db")

# =============================================================================
# RAG Retrieval Configuration
# =============================================================================

# Number of chunks to retrieve per section (Section 6 uses no retrieval)
RAG_TOP_K = {
    "section_1": 5,   # Composition Profile
    "section_2": 8,   # Metabolite Landscape
    "section_3": 12,  # Drug-Microbiome Interaction (most evidence-dense)
    "section_4": 7,   # Confounding Factors
    "section_5": 7,   # Intervention Considerations
}

# Metadata filtering strategy
# Options: "semantic_only", "metadata_only", "hybrid"
RETRIEVAL_STRATEGY = "hybrid"

# =============================================================================
# Report Configuration
# =============================================================================

OUTPUT_DIR = "./outputs"
REPORT_FILENAME_TEMPLATE = "microbiome_immunotherapy_report_{patient_id}_{timestamp}.md"

# =============================================================================
# Clinical Context Windows (days before therapy start)
# =============================================================================

ANTIBIOTIC_WINDOW_DAYS = 42      # Critical window for antibiotic impact (ICI)
ANTIBIOTIC_WINDOW_DAYS_ACT = 28  # Critical window before CAR-T infusion
PPI_CONCERN_DURATION_MONTHS = 3  # Duration after which PPI use is flagged

# ACT-specific toxicity windows
CRS_ONSET_DAYS = 14            # CRS typically occurs within 2 weeks of CAR-T infusion
NEUROTOXICITY_ONSET_DAYS = 21  # Neurotoxicity can occur up to 3 weeks post-infusion

# =============================================================================
# Taxa of Interest (for targeted retrieval)
# =============================================================================

KEY_TAXA = [
    "Akkermansia muciniphila",
    "Bifidobacterium",
    "Faecalibacterium prausnitzii",
    "Ruminococcaceae",
    "Lachnospiraceae",
    "Bacteroides",
    "Collinsella aerofaciens",
    "Alistipes",
    "Clostridium butyricum",
]

# =============================================================================
# Therapy Type Detection
# =============================================================================

THERAPY_TYPE_MAP = {
    # ICI drugs
    "pembrolizumab": "ICI",
    "nivolumab": "ICI",
    "atezolizumab": "ICI",
    "durvalumab": "ICI",
    "avelumab": "ICI",
    "ipilimumab": "ICI",
    "tremelimumab": "ICI",
    "cemiplimab": "ICI",

    # ACT drugs
    "tisagenlecleucel": "ACT",
    "axicabtagene ciloleucel": "ACT",
    "brexucabtagene autoleucel": "ACT",
    "lisocabtagene maraleucel": "ACT",
    "idecabtagene vicleucel": "ACT",
    "ciltacabtagene autoleucel": "ACT",
}

# =============================================================================
# ICI Drug Classes (for metadata filtering)
# =============================================================================

ICI_DRUG_CLASS_MAP = {
    "pembrolizumab": "PD-1/PD-L1 Blockade",
    "nivolumab": "PD-1/PD-L1 Blockade",
    "atezolizumab": "PD-1/PD-L1 Blockade",
    "durvalumab": "PD-1/PD-L1 Blockade",
    "avelumab": "PD-1/PD-L1 Blockade",
    "ipilimumab": "CTLA-4 Blockade",
    "tremelimumab": "CTLA-4 Blockade",
    "cemiplimab": "PD-1/PD-L1 Blockade",
}

# =============================================================================
# ACT Drug Classes (for metadata filtering)
# =============================================================================

ACT_DRUG_CLASS_MAP = {
    "tisagenlecleucel": "CAR-T (CD19-targeted)",
    "axicabtagene ciloleucel": "CAR-T (CD19-targeted)",
    "brexucabtagene autoleucel": "CAR-T (CD19-targeted)",
    "lisocabtagene maraleucel": "CAR-T (CD19-targeted)",
    "idecabtagene vicleucel": "CAR-T (BCMA-targeted)",
    "ciltacabtagene autoleucel": "CAR-T (BCMA-targeted)",
}
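The three maps are designed to be chained: resolve the therapy type first, then look up the drug class in the matching map. A small sketch of that lookup (the helper name is illustrative, not part of this commit; fallbacks mirror the defaults used in `src/rag.py`):

```python
# Illustrative helper, not part of this commit.
def resolve_drug_class(drug_name: str) -> str:
    drug = drug_name.lower()
    therapy = THERAPY_TYPE_MAP.get(drug, "ICI")
    if therapy == "ACT":
        return ACT_DRUG_CLASS_MAP.get(drug, "Adoptive Cell Therapy")
    return ICI_DRUG_CLASS_MAP.get(drug, "Immune Checkpoint Inhibitor")

resolve_drug_class("Pembrolizumab")     # -> "PD-1/PD-L1 Blockade"
resolve_drug_class("tisagenlecleucel")  # -> "CAR-T (CD19-targeted)"
```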
src/ehr_extractor.py ADDED
@@ -0,0 +1,257 @@
import json
import logging
import re
from datetime import date
from pathlib import Path
from typing import Dict

from .models import get_medgemma

logger = logging.getLogger(__name__)


# =============================================================================
# JSON Schema Template
# =============================================================================

# Helper to locate the template relative to this file
_BASE_DIR = Path(__file__).parent.parent
_SCHEMA_TEMPLATE_PATH = _BASE_DIR / "data" / "templates" / "patient_schema_template.json"


def _load_json_template() -> str:
    """Load the JSON schema template from the external file."""
    if not _SCHEMA_TEMPLATE_PATH.exists():
        logger.warning(
            f"Schema template not found at {_SCHEMA_TEMPLATE_PATH}. "
            f"Extraction may fail or be inaccurate."
        )
        return "{}"

    with open(_SCHEMA_TEMPLATE_PATH, "r", encoding="utf-8") as f:
        return f.read()


def _build_prompt(ehr_text: str) -> str:
    """
    Build the combined system + user prompt to pass to MedGemmaGenerator.generate().
    """
    today = date.today().isoformat()
    json_template = _load_json_template()

    system_instruction = f"""You are a clinical data extraction specialist for cancer immunotherapy. Extract structured data from EHRs covering both immune checkpoint inhibitors (ICI) and adoptive cell therapy (ACT, including CAR-T).

=== OUTPUT FORMAT ===
- Return ONLY the filled JSON object. No explanation, no preamble, no markdown fences.
- Do NOT add fields not in the template. Do NOT remove template fields.
- Valid JSON: no trailing commas, no comments, no extra keys.
- Set "extraction_date" to today: {today}.

=== DATA RULES ===
- Extract only explicitly stated values. Do not infer beyond specified rules.
- Dates: ISO 8601 (YYYY-MM-DD). If only month/year, use 1st (e.g. "March 2024" → "2024-03-01").
- Numbers: numeric type, not strings. Percentages as plain floats (4.8 not "4.8%").
- Missing optional fields: null.
- Missing required strings: "".
- Missing required arrays: [].
- Missing required booleans: false.

=== PATIENT ===
- "id": MRN exactly as written.

=== CANCER ===
- "type": Full name (e.g. "Diffuse Large B-Cell Lymphoma", "NSCLC", "Melanoma").
- "subtype": Histological subtype (e.g. "Non-GCB (ABC type)", "Adenocarcinoma").
- "stage": Use stage label only (e.g. "Stage IV", "IVA", "IIIB") — not full TNM.
- "metastases": List ANATOMICAL SITES with optional details in parentheses.
  Examples: ["Lung", "Liver"], ["Bone marrow (15% involvement)", "Pleural effusion (malignant)"].
  If M0 or no metastases, use [].
- "biomarkers.pdl1_expression": Use format from report. If percentage with TPS, use "<value>% TPS".
  If just percentage, use "<value>%". If N/A for non-relevant cancer types, use "N/A".
- "biomarkers.tmb": "<value> mutations/megabase" or "N/A" if not applicable.
- "biomarkers.msi_status": Full label (e.g. "MSS", "Microsatellite stable (MSS)") or "N/A".

=== IMMUNOTHERAPY (CRITICAL SECTION) ===
- "therapy_type": "ICI" for checkpoint inhibitors (pembrolizumab, nivolumab, ipilimumab, atezolizumab, durvalumab).
  "ACT" for adoptive cell therapy (CAR-T, TIL, TCR-T, etc.).
- "drug_name": Full drug name (e.g. "Pembrolizumab", "Axicabtagene ciloleucel").
- "drug_class": For ICI, use checkpoint target (e.g. "PD-1/PD-L1 Blockade", "PD-1 inhibitor").
  For ACT, use "CAR-T", "TIL therapy", "TCR-T", etc.
- "treatment_setting": "First-line", "Relapsed/Refractory", "consolidation", "metastatic", "adjuvant", "neoadjuvant".
- "line_of_therapy": "First-line", "Second-line", "Third-line", "consolidation", etc.
- "planned_start_date": Date therapy begins (for CAR-T, this is infusion date, not leukapheresis).

IF therapy_type is "ICI":
- "ici_details": {{"ici_target": "PD-1", "PD-L1", "CTLA-4", or "PD-1, CTLA-4" for combinations}}
- "act_details": null

IF therapy_type is "ACT":
- "ici_details": null
- "act_details": {{
    "act_type": "CAR-T", "TIL therapy", "TCR-T", etc.
    "target_antigen": e.g. "CD19", "CD22", "BCMA"
    "cell_source": "autologous" or "allogeneic"
    "preconditioning_regimen": e.g. "Fludarabine + Cyclophosphamide"
    "t_cell_harvest_date": Date of leukapheresis (YYYY-MM-DD)
    "expected_crs_risk": "low", "moderate", "moderate-high", "high"
    "expected_neurotoxicity_risk": "low", "moderate", "moderate-high", "high"
  }}

=== PRIOR TREATMENTS ===
- "chemotherapy.received": TRUE if any chemo regimen described (even if completed before current therapy).
- "chemotherapy.regimens": List each as string (e.g. ["R-CHOP", "R-ICE", "Gemcitabine (bridging)"]).
- "chemotherapy.response": Describe response to each regimen if stated.
- "prior_immunotherapy.received": TRUE only if immunotherapy given BEFORE current planned regimen.

=== MEDICATIONS ===
- "ppi_use.currently_on_ppi": true if on any PPI.
- "ppi_use.ppi_name": name (e.g. "Omeprazole").
- "ppi_use.duration_months": numeric months if stated; 0 if unknown.
- "antibiotic_history.recent_antibiotics": TRUE if any antibiotic within 90 days of planned therapy start.
- "antibiotic_history.exposures": List EVERY antibiotic course mentioned. Never leave [] if antibiotics documented.
  Each exposure object:
  - "antibiotic_name": name + dose (e.g. "Levofloxacin 500mg").
  - "antibiotic_class": use mappings (levofloxacin→fluoroquinolone, azithromycin→macrolide,
    piperacillin-tazobactam→beta-lactam, amoxicillin-clavulanate→beta-lactam/penicillin combination).
  - "start_date", "end_date": YYYY-MM-DD. If ongoing at report date, use "ongoing" for end_date.
  - "days_before_ici": Days from antibiotic END (or report date if ongoing) to planned therapy start.
  - "note" (optional): Add if context needed.

=== COMORBIDITIES ===
- List all conditions from Past Medical History as plain strings.
- Never use [] if a PMH section exists — scan fully.
- Include diet-controlled or asymptomatic conditions if listed.
- Do NOT include surgical history, family history, or social history.

=== MICROBIOME ===
- "sequencing_method": Exact method from report.
- "diversity.observed_species": Use "Observed OTUs" or "Observed Species" value.
- "key_bacteria": DYNAMIC object. Extract ALL bacterial species mentioned with abundance percentages.
  - Keys: lowercase, underscores for spaces (e.g. "akkermansia_muciniphila").
  - Create SEPARATE keys for each Bifidobacterium species — do NOT sum into bifidobacterium_spp unless explicitly stated.
  - Values: plain floats (percentages).
- "metabolites.scfa": butyrate, propionate, acetate as floats in μM. null if not measured.
- "metabolites.bile_acids_available": true if ANY bile acid data reported.
- "metabolites.tryptophan_metabolites_available": true if ANY tryptophan metabolite reported.
- "data_quality.completeness": "high" if metabolites + diversity + species all present; "moderate" if some missing; "low" if sparse.
- "data_quality.source": Lab name if stated; "" if unknown.
- "data_quality.limitations": Extract any explicitly noted limitations as strings.

=== CLINICAL CONTEXT ===
- "urgency": Extract urgency statements (e.g. "High - 33-day intervention window", "Standard"). "" if not stated.
- "patient_goals": List goals patient explicitly expressed.
- "specific_concerns": List clinical concerns from assessment.

=== FINAL CHECK ===
Before outputting, verify:
- therapy_type determines which of ici_details/act_details is populated (the other must be null).
- All antibiotic exposures logged in array.
- key_bacteria contains species-level entries from report (not a fixed schema).
- JSON is valid (no trailing commas, proper null usage)."""

    user_prompt = f"""EHR REPORT:
{ehr_text}

JSON TEMPLATE TO FILL:
{json_template}

Return the completed JSON object now."""

    return f"{system_instruction}\n\n{user_prompt}"


def _parse_output(raw_output: str) -> Dict:
    """
    Extract and parse a JSON object from raw model output.
    """
    # Strip any residual markdown fences (```json ... ``` or ``` ... ```)
    fenced = re.search(r"```(?:json)?\s*([\s\S]+?)\s*```", raw_output)
    if fenced:
        raw_output = fenced.group(1)

    # Find the outermost JSON object
    start = raw_output.find("{")
    end = raw_output.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError(
            "No JSON object found in model output.\n"
            f"Raw output (first 500 chars):\n{raw_output[:500]}"
        )

    json_str = raw_output[start:end]
    return json.loads(json_str)


class EHRExtractor:
    """
    Extracts structured patient JSON from free-text EHR reports using MedGemma.

    Usage:
        extractor = EHRExtractor()
        patient_data = extractor.extract(ehr_text)
        # or
        patient_data = extractor.extract_from_file("path/to/report.txt")
    """

    def __init__(self):
        self._llm = get_medgemma()

    def extract(self, ehr_text: str) -> Dict:
        """
        Run EHR extraction and return the parsed patient data dictionary.

        Args:
            ehr_text: Raw EHR report as a string.

        Returns:
            Patient data dict matching the pipeline's expected JSON schema.

        Raises:
            ValueError: If no valid JSON could be found in the model output.
            json.JSONDecodeError: If the extracted JSON string is malformed.
        """
        logger.info("Starting EHR extraction via MedGemma")

        prompt = _build_prompt(ehr_text)
        logger.info(f"EHR prompt length: {len(prompt)} characters")

        raw_output = self._llm.generate(prompt, max_new_tokens=6000)

        logger.debug(f"Raw EHR extraction output:\n{raw_output[:1000]}...")

        try:
            patient_data = _parse_output(raw_output)
        except (ValueError, json.JSONDecodeError) as exc:
            logger.error(
                f"EHR extraction failed — could not parse JSON ({exc}).\n"
                f"Raw output:\n{raw_output}"
            )
            raise

        logger.info(
            f"EHR extraction complete. "
            f"Patient ID: {patient_data.get('patient', {}).get('id', 'unknown')}"
        )
        return patient_data

    def extract_from_file(self, ehr_path: str) -> Dict:
        """
        Load an EHR text file and extract patient data.

        Args:
            ehr_path: Path to the EHR text file.

        Returns:
            Patient data dict.
        """
        logger.info(f"Loading EHR from file: {ehr_path}")
        with open(ehr_path, "r", encoding="utf-8") as f:
            ehr_text = f.read()
        return self.extract(ehr_text)
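Typical use of the extractor, assuming a plain-text EHR export on disk (the example path is illustrative, not part of this commit):

```python
# Illustrative usage — the path is a placeholder.
extractor = EHRExtractor()  # loads MedGemma on first instantiation
patient_data = extractor.extract_from_file("data/examples/ehr_report.txt")
print(patient_data["immunotherapy"]["therapy_type"])  # "ICI" or "ACT"
```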
src/models.py ADDED
@@ -0,0 +1,185 @@
"""
Model loading and inference utilities
"""

import logging
import re
from typing import List, Optional

import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoProcessor, AutoModelForImageTextToText

from . import config

logger = logging.getLogger(__name__)


class MedGemmaGenerator:
    """Wrapper for MedGemma 1.5 4B model"""

    def __init__(self):
        logger.info(f"Loading MedGemma model: {config.MEDGEMMA_MODEL_ID}")

        # MedGemma 1.5 is multimodal: use AutoProcessor (not AutoTokenizer)
        # and AutoModelForImageTextToText (not AutoModelForCausalLM)
        self.processor = AutoProcessor.from_pretrained(config.MEDGEMMA_MODEL_ID)
        self.model = AutoModelForImageTextToText.from_pretrained(
            config.MEDGEMMA_MODEL_ID,
            torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
            device_map="auto" if config.MEDGEMMA_DEVICE == "cuda" else None,
        )

        if config.MEDGEMMA_DEVICE == "cpu":
            self.model = self.model.to("cpu")

        self.model.eval()
        logger.info("MedGemma model loaded successfully")

    def _strip_thinking_block(self, text: str) -> str:
        """
        Remove the thinking/reasoning block that Gemma 3-based models emit.
        MedGemma 1.5 uses <unused94>thought...<unused95> tokens.
        """
        # Closed thinking blocks
        text = re.sub(
            r"<unused94>thought[\s\S]*?<unused95>",
            "",
            text,
            flags=re.IGNORECASE,
        )
        text = re.sub(
            r"<think>[\s\S]*?</think>",
            "",
            text,
            flags=re.IGNORECASE,
        )

        # Unterminated thinking blocks running to the end of the output
        text = re.sub(
            r"<unused94>thought[\s\S]*$",
            "",
            text,
            flags=re.IGNORECASE,
        )
        text = re.sub(
            r"<think>[\s\S]*$",
            "",
            text,
            flags=re.IGNORECASE,
        )

        return text.strip()

    def generate(self, prompt: str, max_new_tokens: Optional[int] = None) -> str:
        """
        Generate text from prompt using MedGemma

        Args:
            prompt: Input prompt
            max_new_tokens: Override default max tokens if provided

        Returns:
            Generated text (with thinking block removed)
        """
        gen_config = config.GENERATION_CONFIG.copy()
        if max_new_tokens:
            gen_config["max_new_tokens"] = max_new_tokens

        # Use proper message format for MedGemma 1.5 4B-IT
        messages = [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}]
            }
        ]

        # Apply chat template properly; the returned BatchFeature's .to() only
        # casts floating-point tensors, so input_ids stay integer-typed
        inputs = self.processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        ).to(self.model.device, dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32)

        input_len = inputs["input_ids"].shape[-1]

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                **gen_config,
                pad_token_id=self.processor.tokenizer.pad_token_id
                if hasattr(self.processor, "tokenizer")
                else self.processor.pad_token_id,
            )

        # Extract only the generated portion (after the input)
        generated_tokens = outputs[0][input_len:]
        generated_text = self.processor.decode(generated_tokens, skip_special_tokens=True)

        # Strip the thinking block before returning
        return self._strip_thinking_block(generated_text)


class EmbeddingModel:
    """Wrapper for PubMedBERT embedding model"""

    def __init__(self):
        logger.info(f"Loading embedding model: {config.EMBEDDING_MODEL_ID}")

        self.model = SentenceTransformer(
            config.EMBEDDING_MODEL_ID,
            device=config.EMBEDDING_DEVICE
        )

        logger.info("Embedding model loaded successfully")

    def encode(self, texts: List[str]) -> List[List[float]]:
        """
        Encode texts to embeddings

        Args:
            texts: List of text strings to encode

        Returns:
            List of embedding vectors
        """
        embeddings = self.model.encode(
            texts,
            convert_to_tensor=False,
            show_progress_bar=False
        )
        return embeddings.tolist()

    def encode_single(self, text: str) -> List[float]:
        """
        Encode a single text to embedding

        Args:
            text: Text string to encode

        Returns:
            Embedding vector
        """
        return self.encode([text])[0]


# Global model instances (loaded once)
_medgemma_instance = None
_embedding_instance = None


def get_medgemma() -> MedGemmaGenerator:
    """Get or create MedGemma model instance"""
    global _medgemma_instance
    if _medgemma_instance is None:
        _medgemma_instance = MedGemmaGenerator()
    return _medgemma_instance


def get_embedding_model() -> EmbeddingModel:
    """Get or create embedding model instance"""
    global _embedding_instance
    if _embedding_instance is None:
        _embedding_instance = EmbeddingModel()
    return _embedding_instance
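Because both getters cache a module-level singleton, any module can request the models without paying the load cost twice:

```python
from src.models import get_medgemma, get_embedding_model

llm = get_medgemma()            # loaded once; later calls reuse the instance
embedder = get_embedding_model()
vec = embedder.encode_single("Akkermansia muciniphila and anti-PD-1 response")
len(vec)  # 768 for this BERT-base PubMedBERT checkpoint
```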
src/prompts.py ADDED
@@ -0,0 +1,323 @@
"""
Section-specific prompt templates for clinical report generation.
Optimized for instruction-following in smaller models (e.g. MedGemma 4B IT).

Key design principles applied:
- Single flat instruction block per section (no nested lists inside lists)
- Positive framing: tell the model what TO do, not what NOT to do
- Evidence anchor placed immediately before the generation task
- Citation format stated once, clearly, close to where citations are used
- Section headers kept inside the prompt so the model knows its structural role
- Global instruction kept minimal; section prompts are self-contained
"""

# =============================================================================
# Global Instruction (prepended to all section prompts)
# Kept short — section prompts carry the detailed guidance
# =============================================================================

GLOBAL_INSTRUCTION = """You are a clinical report writer assisting an oncologist.
Your output will become one section of a microbiome-immunotherapy report used to inform treatment decisions.

Two rules apply to every section you write:
- Every factual claim must come directly from the retrieved evidence provided. If the evidence does not address a topic, omit that topic.
- Every claim must be followed by an inline citation in this exact format: (Author et al., Journal Year). Cite only sources listed in the Retrieved evidence section below; each evidence entry supplies its citation in its "Citation:" field.

Write in formal clinical prose. Do not use bullet points unless explicitly instructed.
"""

# =============================================================================
# Section 1: Microbiome Diversity & Composition Profile
# =============================================================================

SECTION_1_PROMPT = """{global_instruction}

---
SECTION 1: Microbiome Diversity & Composition Profile
---

Patient context:
- Cancer type: {cancer_type} | Stage: {cancer_stage}
- Planned therapy: {drug_name} ({drug_class})

Patient microbiome data:
- Shannon Diversity Index: {shannon_index}
- Simpson Diversity Index: {simpson_index}
- Observed Species: {observed_species}
- Detected taxa (% relative abundance):
{detected_taxa}

Retrieved evidence:
{evidence}

Task:
Write this section in two parts.

Part 1 — Diversity characterization: Describe the patient's alpha diversity level. Use the retrieved evidence to characterize whether this diversity profile has been associated with favorable or unfavorable outcomes in this cancer and immunotherapy context. Cite the evidence.

Part 2 — Taxa characterization: For each detected taxon above that appears in the retrieved evidence, describe its observed relative abundance and what the evidence associates it with in this clinical context. Cover only taxa that have retrieved evidence. Cite each association.

Write in descriptive, factual prose. Do not predict this patient's individual outcome.

Begin writing Section 1 now:
"""

# =============================================================================
# Section 2: Metabolite Landscape
# =============================================================================

SECTION_2_PROMPT = """{global_instruction}

---
SECTION 2: Metabolite Landscape
---

Patient context:
- Cancer type: {cancer_type}
- Planned therapy: {drug_name} ({drug_class})

Patient metabolite data:
{metabolite_data}

Retrieved evidence:
{evidence}

Task:
Write a functional interpretation of the patient's metabolite profile. For each metabolite class present in the patient data (e.g. short-chain fatty acids, bile acids, tryptophan metabolites), do the following in sequence:
1. State the observed level from the patient data.
2. Describe what the retrieved evidence says about that metabolite class in the context of immune function and this therapy type. Cite the evidence.

If a metabolite class is present in the patient data but absent from the retrieved evidence, omit it entirely.
Frame this section as bridging microbiome composition to immune activity. Reserve response predictions for Section 3.

Begin writing Section 2 now:
"""

# =============================================================================
# Section 3: Drug–Microbiome Interaction Outlook (ICI version)
# =============================================================================

SECTION_3_ICI_PROMPT = """{global_instruction}

---
SECTION 3: Drug–Microbiome Interaction Outlook
---

Patient context:
- Cancer type: {cancer_type} | Stage: {cancer_stage}
- Planned therapy: {drug_name} ({drug_class}) | Line: {line_of_therapy}
- Tumor biomarkers: PD-L1 {pdl1} | TMB {tmb} | MSI {msi}

Patient microbiome summary:
- Shannon {shannon_index} | Simpson {simpson_index}
- Key taxa: {key_taxa_summary}
- Metabolite context: {metabolite_summary}

Retrieved evidence:
{evidence}

Task:
Write this section in three parts.

Part 1 — Overall microbiome-ICI context: Describe what the retrieved evidence says about how this patient's microbiome profile (diversity level and dominant taxa) compares to patterns observed in comparable cohorts treated with this ICI class. Use phrases such as "the evidence suggests" or "studies in comparable cohorts found". Cite all claims.

Part 2 — Individual taxa associations: For each taxon in the patient's key taxa list that appears in the retrieved evidence for this ICI class, describe the association the evidence reports (favorable, unfavorable, or bidirectional). If evidence reports both efficacy and immune-related adverse event (irAE) associations for the same taxon, state both explicitly. Cite each association.

Part 3 — Alpha diversity in this treatment setting: Describe what the retrieved evidence specifically says about alpha diversity and outcomes in this ICI and cancer type context. Cite the evidence.

Do not predict this individual patient's outcome. Attribute all findings to the evidence source.

Begin writing Section 3 now:
"""

# =============================================================================
# Section 3: Drug–Microbiome Interaction Outlook (ACT version)
# =============================================================================

SECTION_3_ACT_PROMPT = """{global_instruction}

---
SECTION 3: Microbiome–ACT Interaction Outlook
---

Patient context:
- Cancer type: {cancer_type} | Stage: {cancer_stage}
- Planned therapy: {drug_name} ({drug_class})
- ACT type: {act_type} | Target antigen: {target_antigen} | Cell source: {cell_source}
- Expected CRS risk: {crs_risk} | Expected neurotoxicity risk: {neurotoxicity_risk}
- Line of therapy: {line_of_therapy}

Patient microbiome summary:
- Shannon {shannon_index} | Simpson {simpson_index}
- Key taxa: {key_taxa_summary}
- Metabolite context: {metabolite_summary}

Retrieved evidence:
{evidence}

Task:
Write this section in four parts.

Part 1 — Overall microbiome-ACT context: Describe what the retrieved evidence says about how this patient's microbiome profile relates to outcomes observed in comparable ACT cohorts. Use phrases such as "the evidence suggests" or "studies in ACT cohorts found". Cite all claims.

Part 2 — Efficacy-related taxa: For each taxon in the patient's key taxa list where the retrieved evidence links it to CAR-T cell expansion, persistence, or anti-tumor cytotoxicity, describe that association. Cite each claim.

Part 3 — Toxicity-related taxa and metabolites: Describe what the retrieved evidence says about microbiota associations with CRS or ICANS risk. If the evidence links specific taxa or metabolites (particularly SCFAs) to T-cell function or inflammatory tone relevant to ACT toxicity, include those findings. Cite each claim.

Part 4 — Metabolite context for T-cell function: Describe what the retrieved evidence says about microbiota-derived metabolites, especially SCFAs, in modulating T-cell function in the ACT setting. Cite the evidence.

Do not predict this individual patient's outcome. Attribute all findings to the evidence source.

Begin writing Section 3 now:
"""

# =============================================================================
# Section 4: Confounding Factors
# =============================================================================

SECTION_4_PROMPT = """{global_instruction}

---
SECTION 4: Confounding Factors
---

Patient context:
- Cancer type: {cancer_type}
- Planned therapy: {drug_name} ({drug_class})

Patient confounding factor data:
{confounding_data}

Retrieved evidence:
{evidence}

Task:
For each confounding factor present in the patient data above, write one paragraph using only the retrieved evidence.

Antibiotic exposure: If present, describe what the retrieved evidence says about antibiotic timing relative to immunotherapy initiation and its documented interactions with microbiome-mediated treatment efficacy. If the evidence distinguishes by antibiotic class, include that distinction. Cite the evidence.

PPI use: If present, describe what the retrieved evidence says about proton pump inhibitor effects on the microbiome in the immunotherapy context. Cite the evidence.

Prior treatments: If prior chemotherapy or immunotherapy is recorded, describe any retrieved evidence connecting those treatments to microbiome changes relevant to subsequent immunotherapy response. Cite the evidence.

Comorbidities: Include only if the retrieved evidence directly links the recorded comorbidity to microbiome-immunotherapy interactions. Cite the evidence.

If no confounding factors are present in the patient data, or if no retrieved evidence addresses the recorded factors, output exactly this sentence:
"No significant confounding factors with established microbiome-immunotherapy interactions were identified in the available data."

Begin writing Section 4 now:
"""

# =============================================================================
# Section 5: Microbiota-Modulation Intervention Considerations
# =============================================================================

SECTION_5_PROMPT = """{global_instruction}

---
SECTION 5: Microbiota-Modulation Intervention Considerations
---

Patient context:
- Cancer type: {cancer_type}
- Planned therapy: {drug_name} ({drug_class})
- Microbiome context: {microbiome_summary}

Retrieved evidence by intervention type:
{evidence}

Task:
Generate a sub-section for each intervention type that has supporting evidence in the retrieved chunks above. Skip any intervention type with no retrieved evidence — do not note its absence.

For each sub-section that has evidence, use this structure:

Sub-section title (e.g., "Dietary and Prebiotic Approaches" or "Probiotic Supplementation")
Write 2–4 sentences describing what the retrieved evidence found about this intervention in this cancer and therapy context. State the finding, cite it, and note any caveat the evidence itself raises. Close with one sentence framing it as a consideration for clinical discussion rather than a recommendation.

Tone: exploratory and evidence-grounded. This section informs discussion; it does not prescribe.

Begin writing Section 5 now:
"""

# =============================================================================
# Section 6: Data Quality & Interpretive Limitations
# =============================================================================

SECTION_6_FIXED_CAVEATS = (
    "Microbiome composition is highly individual and dynamic; this report reflects a "
    "single time-point sample. Associations between microbiome features and immunotherapy response "
    "are derived from cohort-level studies and may not predict individual outcomes. "
    "The evidence base is evolving; findings should be interpreted in the context of "
    "current clinical judgment."
)

SECTION_6_PROMPT = """
---
SECTION 6: Data Quality & Interpretive Limitations
---

Patient sample data quality:
{data_quality}

Task:
Write 1–3 sentences addressing only what is specific to this patient's sample based on the data quality fields above.

Follow these rules in order:
- If completeness is "high" and no limitations are listed, write exactly: "No significant data quality limitations were identified for this sample."
- If completeness is below "high" or limitations are listed: name each affected data domain and state which section (Section 2 or Section 3) is consequently limited in its interpretation.
- If a metabolite class is listed as unavailable in the data quality fields, name it and describe the resulting interpretive gap. Only name classes that are explicitly listed as unavailable.

Do not add general statements about microbiome variability or evolving evidence — those appear in a separate fixed caveats section.

Begin writing Section 6 now:
"""


# =============================================================================
# Helper function to build full prompts
# =============================================================================

def build_prompt(section_name: str, patient_data: dict, evidence: str, **kwargs) -> str:
    """
    Build a complete prompt for a given section.

    Args:
        section_name: One of "section_1", "section_2", "section_3",
            "section_4", "section_5", "section_6"
        patient_data: Patient JSON data dictionary
        evidence: Formatted evidence string from RAG
        **kwargs: Additional template variables (e.g. detected_taxa,
            metabolite_data, confounding_data, data_quality, etc.)

    Returns:
        Complete formatted prompt string
    """
    therapy_type = patient_data["immunotherapy"].get("therapy_type", "ICI")

    # Select template
    if section_name == "section_3":
        template = SECTION_3_ACT_PROMPT if therapy_type == "ACT" else SECTION_3_ICI_PROMPT
    else:
        prompt_templates = {
            "section_1": SECTION_1_PROMPT,
            "section_2": SECTION_2_PROMPT,
            "section_4": SECTION_4_PROMPT,
            "section_5": SECTION_5_PROMPT,
            "section_6": SECTION_6_PROMPT,
        }
        template = prompt_templates.get(section_name)

    if not template:
        raise ValueError(f"Unknown section: {section_name}")

    # Inject global instruction (section_6 is standalone, doesn't use it)
    kwargs["global_instruction"] = GLOBAL_INSTRUCTION if section_name != "section_6" else ""
    kwargs["evidence"] = evidence

    # Inject shared patient context fields
    if section_name in ["section_1", "section_2", "section_3", "section_4", "section_5"]:
        kwargs["cancer_type"] = patient_data["cancer"]["type"]
        kwargs["drug_name"] = patient_data["immunotherapy"]["drug_name"]
        kwargs["drug_class"] = patient_data["immunotherapy"]["drug_class"]

    return template.format(**kwargs)
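A hedged example of how a section generator might call `build_prompt`. The keyword values are abbreviated placeholders, and in the real pipeline `evidence` comes from `RAGRetriever.format_chunks_for_llm`; note that `cancer_type`, `drug_name`, and `drug_class` are injected by `build_prompt` itself and should not be passed:

```python
# Field values are illustrative placeholders.
prompt = build_prompt(
    "section_1",
    patient_data,
    evidence=formatted_evidence,
    cancer_stage="Stage IV",
    shannon_index=3.2,
    simpson_index=0.91,
    observed_species=142,
    detected_taxa="- Akkermansia muciniphila: 4.8%\n- Faecalibacterium prausnitzii: 6.1%",
)
```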
src/rag.py ADDED
@@ -0,0 +1,471 @@
"""
RAG retrieval logic with ChromaDB
"""

import json
import logging
from typing import Dict, List, Optional, Set

import chromadb

from . import config
from .models import get_embedding_model

logger = logging.getLogger(__name__)


class RAGRetriever:
    """Handles retrieval from ChromaDB with metadata filtering"""

    def __init__(self):
        logger.info(f"Connecting to ChromaDB at {config.CHROMADB_PERSIST_DIRECTORY}")

        self.client = chromadb.PersistentClient(path=config.CHROMADB_PERSIST_DIRECTORY)
        self.collection = self.client.get_collection(name=config.CHROMADB_COLLECTION_NAME)
        self.embedding_model = get_embedding_model()

        logger.info(f"Connected to collection: {config.CHROMADB_COLLECTION_NAME}")

    def retrieve(
        self,
        query_text: str,
        top_k: int,
        metadata_filters: Optional[Dict] = None,
        exclude_filters: Optional[Dict] = None,
        strategy: str = config.RETRIEVAL_STRATEGY,  # "semantic_only", "metadata_only", or "hybrid"
    ) -> List[Dict]:
        """
        Retrieve chunks from ChromaDB according to strategy:
        - semantic_only: ignores metadata filters, purely vector search
        - metadata_only: filters metadata, no fallback
        - hybrid: filters metadata, then fills remaining with semantic-only search
        """
        # Encode query
        query_embedding = self.embedding_model.encode_single(query_text)

        # Helper function to query Chroma
        def _query_chroma(where_clause=None, n_results=top_k):
            query_params = {
                "query_embeddings": [query_embedding],
                "n_results": n_results,
            }
            if where_clause:
                query_params["where"] = where_clause
            results = self.collection.query(**query_params)
            chunks = []
            if results and results["documents"]:
                for i in range(len(results["documents"][0])):
                    chunk = {
                        "text": results["documents"][0][i],
                        "metadata": results["metadatas"][0][i] if results["metadatas"] else {},
                        "distance": results["distances"][0][i] if results["distances"] else None,
                    }
                    chunks.append(chunk)
            return chunks

        # Build metadata filter clauses if needed
        include_clause = self._build_where_clause(metadata_filters)
        exclude_clause = self._build_exclude_clause(exclude_filters)
        if include_clause and exclude_clause:
            where_clause = {"$and": [include_clause, exclude_clause]}
        elif include_clause:
            where_clause = include_clause
        elif exclude_clause:
            where_clause = exclude_clause
        else:
            where_clause = None

        # --- Handle strategies ---
        if strategy == "semantic_only":
            return _query_chroma(where_clause=None, n_results=top_k)

        elif strategy == "metadata_only":
            # Only filter-based search, no fallback
            return _query_chroma(where_clause=where_clause, n_results=top_k)

        elif strategy == "hybrid":
            # Step 1: filtered search
            filtered_chunks = _query_chroma(where_clause=where_clause, n_results=top_k)

            # Step 2: fallback to semantic-only if not enough
            if len(filtered_chunks) < top_k:
                remaining_k = top_k - len(filtered_chunks)
                semantic_chunks = _query_chroma(where_clause=None, n_results=remaining_k)

                # Remove duplicates based on text
                existing_texts = {c["text"] for c in filtered_chunks}
                semantic_chunks = [c for c in semantic_chunks if c["text"] not in existing_texts]

                filtered_chunks.extend(semantic_chunks)

            return filtered_chunks

        else:
            raise ValueError(f"Unknown strategy: {strategy}")

    def _build_where_clause(self, filters: Optional[Dict]) -> Optional[Dict]:
        """
        Build a ChromaDB WHERE clause from an include-filter dictionary.

        Filters map tag category names (matching keys in research_papers.json "tags")
        to one or more values. The pipeline stores tags as flat pipe-delimited string
        fields named paper_tag_{category}, e.g.:
            paper_tag_cancer = "NSCLC|Renal Cell Carcinoma|Bladder Cancer"
            paper_tag_treatment = "PD-1/PD-L1 Blockade"

        We use $contains for substring matching against the pipe-delimited string.

        Examples:
            {"cancer": "NSCLC"}
                -> {"paper_tag_cancer": {"$contains": "NSCLC"}}

            {"cancer": "NSCLC", "treatment": ["PD-1/PD-L1 Blockade"]}
                -> {"$and": [
                       {"paper_tag_cancer": {"$contains": "NSCLC"}},
                       {"paper_tag_treatment": {"$contains": "PD-1/PD-L1 Blockade"}},
                   ]}

        Note: For list values, only the FIRST element is used for $contains filtering.
        If you need to match any of several values, call retrieve() once per value and
        merge results, or fetch unfiltered and post-filter in Python.
        """
        if not filters:
            return None

        where_conditions = []

        for key, value in filters.items():
            field = f"paper_tag_{key}"
            # Use the first element if a list was provided; $contains substring-matches
            # against the pipe-delimited string stored in the metadata field.
            match_value = value[0] if isinstance(value, list) else value
            where_conditions.append({field: {"$contains": match_value}})

        if len(where_conditions) == 1:
            return where_conditions[0]
        elif len(where_conditions) > 1:
            return {"$and": where_conditions}

        return None

    def _build_exclude_clause(self, filters: Optional[Dict]) -> Optional[Dict]:
        """
        Build a ChromaDB WHERE clause that EXCLUDES documents matching the filters.

        Uses $not_contains so chunks whose tag field contains the given value are
        filtered out.

        Examples:
            {"section_type": "references"}
                -> {"section_type": {"$not_contains": "references"}}

            {"cancer": "NSCLC"}
                -> {"paper_tag_cancer": {"$not_contains": "NSCLC"}}
        """
        if not filters:
            return None

        where_conditions = []

        for key, value in filters.items():
            # Allow filtering on plain metadata fields (e.g. section_type)
            # as well as tag fields.
            if key.startswith("paper_tag_") or key in ("source_file", "section_type", "is_table"):
                field = key
            else:
                field = f"paper_tag_{key}"

            match_value = value[0] if isinstance(value, list) else value
            where_conditions.append({field: {"$not_contains": match_value}})

        if len(where_conditions) == 1:
            return where_conditions[0]
        elif len(where_conditions) > 1:
            return {"$and": where_conditions}

        return None

    def _parse_paper_meta(self, metadata: Dict) -> Dict:
        """
        Safely deserialise the paper metadata stored as a JSON string in ChromaDB.

        ChromaDB only stores scalar values, so the pipeline serialised the full
        paper dict as json.dumps(). This helper reverses that.
        """
        raw = metadata.get("paper", "{}")
        try:
            return json.loads(raw) if raw else {}
        except (json.JSONDecodeError, TypeError):
            logger.warning("Could not deserialise paper metadata: %s", raw)
            return {}
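For concreteness, this is the shape of clause the include builder produces (values assumed). One caveat worth verifying against the pinned chromadb version: core ChromaDB documents `$contains`/`$not_contains` as `where_document` operators, so their acceptance inside a metadata `where` clause, as used here, depends on the installed release.

```python
retriever = RAGRetriever()
clause = retriever._build_where_clause(
    {"cancer": "NSCLC", "treatment": ["PD-1/PD-L1 Blockade"]}
)
# clause ==
# {"$and": [
#     {"paper_tag_cancer": {"$contains": "NSCLC"}},
#     {"paper_tag_treatment": {"$contains": "PD-1/PD-L1 Blockade"}},
# ]}
```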
201
+ # ------------------------------------------------------------------
202
+ # Section-specific retrieval methods
203
+ # ------------------------------------------------------------------
204
+
205
+ def retrieve_for_section_1(self, patient_data: Dict) -> List[Dict]:
206
+ """
207
+ Retrieve chunks for Section 1: Microbiome Composition Profile
208
+
209
+ Focus on: diversity, detected taxa, cancer type, ICI class
210
+ """
211
+ cancer_type = patient_data["cancer"]["type"]
212
+ ici_class = self._get_ici_class(patient_data["immunotherapy"]["drug_name"])
213
+
214
+ detected_taxa = [
215
+ taxon for taxon, abundance in patient_data["microbiome"]["key_bacteria"].items()
216
+ if abundance is not None and abundance > 0
217
+ ]
218
+
219
+ query = f"""
220
+ Microbiome composition and diversity in {cancer_type} patients receiving {ici_class} therapy.
221
+ Taxa of interest: {', '.join(detected_taxa[:5])}.
222
+ Alpha diversity and response to immunotherapy.
223
+ """
224
+
225
+ filters = {
226
+ "cancer": cancer_type,
227
+ "treatment": ici_class,
228
+ }
229
+
230
+ return self.retrieve(
231
+ query_text=query,
232
+ top_k=config.RAG_TOP_K["section_1"],
233
+ metadata_filters=filters,
234
+ )
235
+
236
+ def retrieve_for_section_2(self, patient_data: Dict) -> List[Dict]:
237
+ """
238
+ Retrieve chunks for Section 2: Metabolite Landscape
239
+
240
+ Focus on: SCFAs, bile acids, tryptophan metabolites
241
+ """
242
+ cancer_type = patient_data["cancer"]["type"]
243
+ ici_class = self._get_ici_class(patient_data["immunotherapy"]["drug_name"])
244
+
245
+ metabolites = patient_data["microbiome"]["metabolites"]
246
+
247
+ metabolite_terms = []
248
+ if metabolites["scfa"]["butyrate_uM"] is not None:
249
+ metabolite_terms.append("short-chain fatty acids")
250
+ metabolite_terms.append("butyrate")
251
+ if metabolites["bile_acids_available"]:
252
+ metabolite_terms.append("bile acids")
253
+ if metabolites["tryptophan_metabolites_available"]:
254
+ metabolite_terms.append("tryptophan metabolism")
255
+
256
+ if not metabolite_terms:
257
+ return []
258
+
259
+ query = f"""
260
+ Microbial metabolites and immune function in {cancer_type}.
261
+ {', '.join(metabolite_terms)} and their role in immunotherapy response.
262
+ CD8+ T cell function, regulatory T cells, mucosal immunity.
263
+ """
264
+
265
+ # Metabolite section: semantic search only (biology tags are too broad for
266
+ # reliable filtering here).
267
+ return self.retrieve(
268
+ query_text=query,
269
+ top_k=config.RAG_TOP_K["section_2"],
270
+ metadata_filters=None,
271
+ strategy="semantic_only"
272
+ )
273
+
274
+ def retrieve_for_section_3(self, patient_data: Dict) -> List[Dict]:
275
+ """
276
+ Retrieve chunks for Section 3: Drug-Microbiome Interaction Outlook
277
+ """
278
+ cancer_type = patient_data["cancer"]["type"]
279
+ drug_name = patient_data["immunotherapy"]["drug_name"]
280
+ therapy_type = self._get_therapy_type(patient_data)
281
+
282
+ key_bacteria = patient_data["microbiome"]["key_bacteria"]
283
+ detected_taxa = sorted(
284
+ [(k, v) for k, v in key_bacteria.items() if v is not None and v > 0],
285
+ key=lambda x: x[1],
286
+ reverse=True
287
+ )[:5]
288
+ taxa_names = [taxon for taxon, _ in detected_taxa]
289
+
290
+ if therapy_type == "ICI":
291
+ ici_class = self._get_ici_class(drug_name)
292
+ query = f"""
293
+ {ici_class} response prediction in {cancer_type} based on gut microbiome composition.
294
+ Specific bacteria: {', '.join(taxa_names)}.
295
+ Clinical outcomes, progression-free survival, response rates.
296
+ Immune-related adverse events and microbiome associations.
297
+ """
298
+ filters = {"cancer": cancer_type, "treatment": ici_class}
299
+
300
+ elif therapy_type == "ACT":
301
+ act_details = patient_data["immunotherapy"].get("act_details", {})
302
+ act_type = act_details.get("act_type", "CAR-T")
303
+ target_antigen = act_details.get("target_antigen", "CD19")
304
+ query = f"""
305
+ {act_type} therapy efficacy in {cancer_type} and gut microbiome composition.
306
+ Target antigen: {target_antigen}.
307
+ Specific bacteria: {', '.join(taxa_names)}.
308
+ CAR-T cell expansion, persistence, and anti-tumor activity.
309
+ Cytokine release syndrome (CRS) and neurotoxicity associations with microbiome.
310
+ T-cell function and microbiota-derived metabolites.
311
+ """
312
+ filters = {"cancer": cancer_type, "treatment": "CAR-T"}
313
+
314
+ else:
315
+ query = f"""
316
+ Immunotherapy response in {cancer_type} and gut microbiome.
317
+ Bacteria: {', '.join(taxa_names)}.
318
+ """
319
+ filters = {"cancer": cancer_type}
320
+
321
+ return self.retrieve(
322
+ query_text=query,
323
+ top_k=config.RAG_TOP_K["section_3"],
324
+ metadata_filters=filters
325
+ )
326
+
327
+ def retrieve_for_section_4(self, patient_data: Dict) -> List[Dict]:
328
+ """
329
+ Retrieve chunks for Section 4: Confounding Factors
330
+ """
331
+ cancer_type = patient_data["cancer"]["type"]
332
+ therapy_type = self._get_therapy_type(patient_data)
333
+
334
+ query_terms = []
335
+
336
+ if patient_data["medications"]["antibiotic_history"]["recent_antibiotics"]:
337
+ if therapy_type == "ACT":
338
+ query_terms.append("antibiotic exposure before CAR-T therapy and outcomes")
339
+ query_terms.append("gut microbiota disruption and CAR-T efficacy")
340
+ else:
341
+ query_terms.append("antibiotic exposure and immunotherapy outcomes")
342
+
343
+ if patient_data["medications"]["ppi_use"]["currently_on_ppi"]:
344
+ query_terms.append("proton pump inhibitors and microbiome")
345
+
346
+ if patient_data["prior_treatments"]["chemotherapy"]["received"]:
347
+ if therapy_type == "ACT":
348
+ query_terms.append("prior chemotherapy effects on gut microbiota before CAR-T")
349
+ query_terms.append("lymphodepleting chemotherapy and microbiome")
350
+ else:
351
+ query_terms.append("prior chemotherapy effects on gut microbiota")
352
+
353
+ if not query_terms:
354
+ return []
355
+
356
+ query = f"""
357
+ {' '.join(query_terms)} in {cancer_type} patients.
358
+ Impact on immunotherapy efficacy and toxicity.
359
+ """
360
+
361
+ # Broader search for confounders — no cancer-type filter
362
+ return self.retrieve(
363
+ query_text=query,
364
+ top_k=config.RAG_TOP_K["section_4"],
365
+ metadata_filters=None,
366
+ strategy="semantic_only"
367
+ )
368
+
369
+ def retrieve_for_section_5(self, patient_data: Dict) -> Dict[str, List[Dict]]:
370
+ """
371
+ Retrieve chunks for Section 5: Intervention Considerations
372
+ """
373
+ cancer_type = patient_data["cancer"]["type"]
374
+ ici_class = self._get_ici_class(patient_data["immunotherapy"]["drug_name"])
375
+
376
+ intervention_chunks = {}
377
+
378
+ # Sub-section 5a: Dietary & Prebiotics
379
+ diet_query = f"""
380
+ Dietary interventions, prebiotics, fiber supplementation in {cancer_type}.
381
+ High-fiber diet, inulin, pectin, polyphenols and immunotherapy response.
382
+ """
383
+ intervention_chunks["diet"] = self.retrieve(
384
+ query_text=diet_query,
385
+ top_k=10,
386
+ metadata_filters=None,
387
+ strategy="semantic_only"
388
+ )
389
+
390
+ # Sub-section 5b: Probiotics
391
+ probiotics_query = f"""
392
+ Probiotic supplementation in {cancer_type} patients receiving {ici_class}.
393
+ Lactobacillus, Bifidobacterium, Akkermansia, Clostridium butyricum.
394
+ Clinical trials and efficacy data.
395
+ """
396
+ intervention_chunks["probiotics"] = self.retrieve(
397
+ query_text=probiotics_query,
398
+ top_k=10,
399
+ metadata_filters={"cancer": cancer_type}
400
+ )
401
+
402
+ return intervention_chunks
+
+     # ------------------------------------------------------------------
+     # Helpers
+     # ------------------------------------------------------------------
+
+     def _get_ici_class(self, drug_name: str) -> str:
+         """Map drug name to ICI class"""
+         return config.ICI_DRUG_CLASS_MAP.get(drug_name.lower(), "Immune Checkpoint Inhibitor")
+
+     def _get_act_class(self, drug_name: str) -> str:
+         """Map drug name to ACT class"""
+         return config.ACT_DRUG_CLASS_MAP.get(drug_name.lower(), "Adoptive Cell Therapy")
+
+     def _get_therapy_type(self, patient_data: Dict) -> str:
+         """Determine therapy type from patient data"""
+         if "therapy_type" in patient_data["immunotherapy"]:
+             return patient_data["immunotherapy"]["therapy_type"]
+         drug_name = patient_data["immunotherapy"]["drug_name"].lower()
+         return config.THERAPY_TYPE_MAP.get(drug_name, "ICI")
+
+     def format_chunks_for_llm(self, chunks: List[Dict]) -> str:
+         """
+         Format retrieved chunks into a structured string for LLM context.
+         Returns markdown-formatted evidence with citations.
+         """
+         if not chunks:
+             return "No relevant evidence retrieved."
+
+         formatted = "# Retrieved Evidence\n\n"
+
+         for i, chunk in enumerate(chunks, 1):
+             # paper is stored as a JSON string in ChromaDB — deserialise it first
+             paper_meta = self._parse_paper_meta(chunk["metadata"])
+
+             citation = paper_meta.get("citation", "Unknown source")
+             text = chunk["text"]
+
+             formatted += f"## Evidence {i}\n"
+             formatted += f"**Citation:** {citation}\n"
+             formatted += f"**Content:** {text}\n\n"
+
+         return formatted
+
+     def get_unique_citations(self, chunks: List[Dict]) -> Set[str]:
+         """Extract unique citations from chunks for a references section."""
+         citations = set()
+         for chunk in chunks:
+             # paper is stored as a JSON string in ChromaDB — deserialise it first
+             paper_meta = self._parse_paper_meta(chunk["metadata"])
+             citation = paper_meta.get("citation")
+             if citation:
+                 citations.add(citation)
+         return citations
+
+     def get_unique_citation_metadata(self, chunks: List[Dict]) -> Set[tuple]:
+         """
+         Extract unique (citation, title) tuples from chunks.
+         Used for the final References section to show paper titles.
+         """
+         meta = set()
+         for chunk in chunks:
+             # paper is stored as a JSON string in ChromaDB — deserialise it first
+             paper_meta = self._parse_paper_meta(chunk["metadata"])
+             citation = paper_meta.get("citation")
+             if citation:
+                 # Get title, falling back to citation if missing
+                 title = paper_meta.get("title", citation)
+                 meta.add((citation, title))
+         return meta
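
The retriever's public surface composes in three steps: fetch chunks for a section, render them as an evidence block for the prompt, and collect deduplicated citations for the References section. A minimal sketch of that flow (the patient JSON path is illustrative, not a file shipped in this commit):

```python
import json
from src.rag import RAGRetriever

# Assumed: a patient record matching the pipeline schema used in this diff
with open("examples/patient.json", encoding="utf-8") as f:
    patient_data = json.load(f)

retriever = RAGRetriever()
chunks = retriever.retrieve_for_section_4(patient_data)  # [] when no confounders apply

evidence = retriever.format_chunks_for_llm(chunks)           # markdown block for the LLM prompt
references = retriever.get_unique_citation_metadata(chunks)  # {(citation, title), ...}
```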
src/report_assembler.py ADDED
@@ -0,0 +1,333 @@
+ """
+ Report assembler - combines sections into final markdown report
+ """
+
+ import json
+ import logging
+ from datetime import datetime
+ from pathlib import Path
+ from typing import Dict
+
+ from . import config
+ from .section_generators import SectionGenerator
+
+ logger = logging.getLogger(__name__)
+
+
+ class ReportAssembler:
+     """Assembles complete clinical report from individual sections.
+
+     Supports two input modes:
+     - JSON: load_patient_data() / generate_and_save() (existing path)
+     - EHR: load_patient_data_from_ehr() / generate_and_save_from_ehr() (new path)
+     """
+
+     def __init__(self):
+         self.generator = SectionGenerator()
+
+     def load_patient_data(self, json_path: str) -> Dict:
+         """Load patient JSON data from file"""
+         logger.info(f"Loading patient data from {json_path}")
+
+         with open(json_path, 'r', encoding='utf-8') as f:
+             patient_data = json.load(f)
+
+         return patient_data
+
+     def load_patient_data_from_ehr(self, ehr_path: str) -> Dict:
+         """Extract patient JSON from a raw EHR text file using MedGemma.
+
+         Args:
+             ehr_path: Path to the plain-text EHR report.
+
+         Returns:
+             Patient data dictionary matching the pipeline schema.
+         """
+         # Imported here so the JSON-only code path has zero extra import cost
+         from .ehr_extractor import EHRExtractor
+
+         logger.info(f"Extracting patient data from EHR: {ehr_path}")
+         extractor = EHRExtractor()
+         return extractor.extract_from_file(ehr_path)
+
+     def generate_full_report(self, patient_data: Dict) -> str:
+         """
+         Generate complete clinical report
+
+         Args:
+             patient_data: Patient JSON dictionary
+
+         Returns:
+             Complete report as markdown string
+         """
+         logger.info("Starting full report generation")
+
+         report_sections = []
+
+         # Section 0: Preamble (always included, not LLM-generated)
+         logger.info("Generating preamble")
+         preamble = self.generator.generate_preamble(patient_data)
+         report_sections.append(preamble)
+
+         # Sections 1-5 are appended only when their generator returns content;
+         # a None return (missing data or no retrieved evidence) omits the section.
+
+         # Section 1: Microbiome Composition Profile
+         section_1 = self.generator.generate_section_1(patient_data)
+         if section_1:
+             report_sections.append(section_1)
+
+         # Section 2: Metabolite Landscape
+         section_2 = self.generator.generate_section_2(patient_data)
+         if section_2:
+             report_sections.append(section_2)
+
+         # Section 3: Drug-Microbiome Interaction Outlook
+         section_3 = self.generator.generate_section_3(patient_data)
+         if section_3:
+             report_sections.append(section_3)
+
+         # Section 4: Confounding Factors
+         section_4 = self.generator.generate_section_4(patient_data)
+         if section_4:
+             report_sections.append(section_4)
+
+         # Section 5: Intervention Considerations
+         section_5 = self.generator.generate_section_5(patient_data)
+         if section_5:
+             report_sections.append(section_5)
+
+         # Section 6: Data Quality & Limitations (always included)
+         section_6 = self.generator.generate_section_6(patient_data)
+         report_sections.append(section_6)
+
+         # References section
+         references = self._generate_references_section()
+         report_sections.append(references)
+
+         # Footer
+         footer = self._generate_footer()
+         report_sections.append(footer)
+
+         # Combine all sections
+         full_report = "\n".join(report_sections)
+
+         logger.info("Report generation complete")
+         return full_report
+
+     def generate_full_report_streaming(self, patient_data: Dict):
+         """
+         Generate the complete clinical report section by section, yielding the
+         cumulative markdown string after each section completes.
+
+         Designed for Gradio generator functions: each yield replaces the current
+         content of the output gr.Markdown component, so the clinician sees the
+         report grow in real time.
+
+         Args:
+             patient_data: Patient JSON dictionary.
+
+         Yields:
+             Tuple of (cumulative_report: str, status_message: str) after each
+             section is appended.
+         """
+         logger.info("Starting streaming report generation")
+         accumulated = ""
+
+         # ------------------------------------------------------------------
+         # Section 0: Preamble (no LLM — instant)
+         # ------------------------------------------------------------------
+         logger.info("Generating preamble")
+         preamble = self.generator.generate_preamble(patient_data)
+         accumulated += preamble + "\n"
+         yield accumulated, "⏳ Generating Section 1: Microbiome Composition Profile..."
+
+         # ------------------------------------------------------------------
+         # Section 1: Microbiome Composition Profile
+         # ------------------------------------------------------------------
+         logger.info("Generating section 1")
+         section_1 = self.generator.generate_section_1(patient_data)
+         if section_1:
+             accumulated += section_1 + "\n"
+         yield accumulated, "⏳ Generating Section 2: Metabolite Landscape..."
+
+         # ------------------------------------------------------------------
+         # Section 2: Metabolite Landscape
+         # ------------------------------------------------------------------
+         logger.info("Generating section 2")
+         section_2 = self.generator.generate_section_2(patient_data)
+         if section_2:
+             accumulated += section_2 + "\n"
+         yield accumulated, "⏳ Generating Section 3: Drug–Microbiome Interaction Outlook..."
+
+         # ------------------------------------------------------------------
+         # Section 3: Drug–Microbiome Interaction Outlook
+         # ------------------------------------------------------------------
+         logger.info("Generating section 3")
+         section_3 = self.generator.generate_section_3(patient_data)
+         if section_3:
+             accumulated += section_3 + "\n"
+         yield accumulated, "⏳ Generating Section 4: Confounding Factors..."
+
+         # ------------------------------------------------------------------
+         # Section 4: Confounding Factors
+         # ------------------------------------------------------------------
+         logger.info("Generating section 4")
+         section_4 = self.generator.generate_section_4(patient_data)
+         if section_4:
+             accumulated += section_4 + "\n"
+         yield accumulated, "⏳ Generating Section 5: Intervention Considerations..."
+
+         # ------------------------------------------------------------------
+         # Section 5: Intervention Considerations
+         # ------------------------------------------------------------------
+         logger.info("Generating section 5")
+         section_5 = self.generator.generate_section_5(patient_data)
+         if section_5:
+             accumulated += section_5 + "\n"
+         yield accumulated, "⏳ Generating Section 6: Data Quality & Limitations..."
+
+         # ------------------------------------------------------------------
+         # Section 6: Data Quality & Limitations (always included)
+         # ------------------------------------------------------------------
+         logger.info("Generating section 6")
+         section_6 = self.generator.generate_section_6(patient_data)
+         accumulated += section_6 + "\n"
+         yield accumulated, "⏳ Compiling references and finalising report..."
+
+         # ------------------------------------------------------------------
+         # References + Footer (no LLM — instant)
+         # ------------------------------------------------------------------
+         logger.info("Generating references and footer")
+         references = self._generate_references_section()
+         footer = self._generate_footer()
+         accumulated += references + footer
+
+         logger.info("Streaming report generation complete")
+         yield accumulated, "✅ Report complete"
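
Because every yield pairs the cumulative report with a status message, the generator drops straight into a Gradio Blocks app. A minimal wiring sketch, assuming a JSON textbox as input (component names are illustrative; the actual app.py in this commit may wire things differently):

```python
import json
import gradio as gr
from src.report_assembler import ReportAssembler

assembler = ReportAssembler()

def run_report(patient_json: str):
    patient_data = json.loads(patient_json)
    # Re-yield each (report, status) pair; Gradio replaces both outputs per yield
    yield from assembler.generate_full_report_streaming(patient_data)

with gr.Blocks() as demo:
    patient_box = gr.Textbox(label="Patient JSON", lines=10)
    status = gr.Markdown()
    report_md = gr.Markdown()
    gr.Button("Generate report").click(
        run_report, inputs=patient_box, outputs=[report_md, status]
    )

demo.launch()
```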
+
+     def _generate_references_section(self) -> str:
+         """Generate references section from all citations and titles used"""
+         # get_all_citations returns a sorted list of (citation, title) tuples
+         references_data = self.generator.get_all_citations()
+
+         if not references_data:
+             return ""
+
+         references = "## References\n\n"
+         references += "The following peer-reviewed publications were cited in this report:\n\n"
+
+         for i, (citation, title) in enumerate(references_data, 1):
+             if title and title != citation:
+                 references += f"{i}. {citation}: {title}\n"
+             else:
+                 references += f"{i}. {citation}\n"
+
+         references += "\n"
+         return references
+
+     def _generate_footer(self) -> str:
+         """Generate report footer with metadata"""
+         timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+
+         footer = f"""---
+
+ **Report Generated:** {timestamp}
+ **Model:** MedGemma 1.5 4B
+ **System:** Microbiome-Immunotherapy Clinical Decision Support v1.0
+
+ *This report is intended for use by qualified healthcare professionals as a clinical decision support tool. It does not constitute medical advice and should be interpreted in conjunction with comprehensive clinical evaluation.*
+ """
+         return footer
+
+     def save_report(self, report: str, patient_id: str, output_dir: str = None) -> str:
+         """
+         Save report to markdown file
+
+         Args:
+             report: Complete report markdown string
+             patient_id: Patient identifier for filename
+             output_dir: Output directory (uses config default if not provided)
+
+         Returns:
+             Path to saved report file
+         """
+         if output_dir is None:
+             output_dir = config.OUTPUT_DIR
+
+         # Create output directory if it doesn't exist
+         output_path = Path(output_dir)
+         output_path.mkdir(parents=True, exist_ok=True)
+
+         # Generate filename
+         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+         filename = f"microbiome_ici_report_{patient_id}_{timestamp}.md"
+         filepath = output_path / filename
+
+         # Save report
+         with open(filepath, 'w', encoding='utf-8') as f:
+             f.write(report)
+
+         logger.info(f"Report saved to: {filepath}")
+         return str(filepath)
+
+     def generate_and_save(self, patient_json_path: str, output_dir: str = None) -> str:
+         """
+         Complete workflow: load data, generate report, save to file
+
+         Args:
+             patient_json_path: Path to patient JSON file
+             output_dir: Optional output directory override
+
+         Returns:
+             Path to saved report file
+         """
+         # Load patient data
+         patient_data = self.load_patient_data(patient_json_path)
+         patient_id = patient_data["patient"]["id"]
+
+         # Generate report
+         report = self.generate_full_report(patient_data)
+
+         # Save report
+         output_path = self.save_report(report, patient_id, output_dir)
+
+         return output_path
+
+     def generate_and_save_from_ehr(
+         self,
+         ehr_path: str,
+         output_dir: str = None,
+         save_json_path: str = None,
+     ) -> str:
+         """
+         Complete EHR workflow: extract JSON from EHR, generate report, save to file.
+
+         Args:
+             ehr_path: Path to the plain-text EHR report.
+             output_dir: Optional output directory override.
+             save_json_path: If provided, save the extracted patient JSON to this path
+                 so it can be inspected or reused without re-running extraction.
+
+         Returns:
+             Path to the saved report markdown file.
+         """
+         # Step 1: Extract patient data from EHR
+         patient_data = self.load_patient_data_from_ehr(ehr_path)
+         patient_id = patient_data["patient"]["id"]
+
+         # Step 2: Optionally save the extracted JSON
+         # (json and Path are already imported at module level)
+         if save_json_path:
+             Path(save_json_path).parent.mkdir(parents=True, exist_ok=True)
+             with open(save_json_path, "w", encoding="utf-8") as f:
+                 json.dump(patient_data, f, indent=2)
+             logger.info(f"Extracted patient JSON saved to: {save_json_path}")
+
+         # Step 3: Generate report
+         report = self.generate_full_report(patient_data)
+
+         # Step 4: Save report
+         output_path = self.save_report(report, patient_id, output_dir)
+
+         return output_path
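
Both entry points wrap the same generate-and-save workflow; only the loading step differs. A short sketch of the two modes (file paths are illustrative):

```python
from src.report_assembler import ReportAssembler

assembler = ReportAssembler()

# Structured-JSON mode: the patient record already exists
report_path = assembler.generate_and_save("data/patient_001.json")

# Raw-EHR mode: MedGemma extracts the record first; the intermediate
# JSON is kept on disk so extraction can be inspected or reused
report_path = assembler.generate_and_save_from_ehr(
    "data/patient_001_ehr.txt",
    save_json_path="output/patient_001_extracted.json",
)
```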
src/section_generators.py ADDED
@@ -0,0 +1,489 @@
+ """
+ Section generation functions for each report section
+ """
+
+ import logging
+ from typing import Dict, Optional, List
+
+ from .models import get_medgemma
+ from .rag import RAGRetriever
+ from .prompts import build_prompt, SECTION_6_PROMPT, SECTION_6_FIXED_CAVEATS
+ from . import config
+
+ logger = logging.getLogger(__name__)
+
+
+ class SectionGenerator:
+     """Handles generation of individual report sections"""
+
+     def __init__(self):
+         self.llm = get_medgemma()
+         self.rag = RAGRetriever()
+         self.all_citations = {}  # Map citation -> title across all sections
+
+     def generate_preamble(self, patient_data: Dict) -> str:
+         """
+         Generate Section 0: Clinical Preamble (auto-populated, no LLM)
+         """
+         p = patient_data["patient"]
+         c = patient_data["cancer"]
+         i = patient_data["immunotherapy"]
+         m = patient_data["microbiome"]
+
+         # Format metastases
+         metastases_str = ", ".join(c["metastases"]) if c["metastases"] else "none"
+
+         # Determine therapy type
+         therapy_type = i.get("therapy_type", "ICI")
+
+         preamble = f"""# Microbiome-Immunotherapy Clinical Report
+
+ **Patient ID:** {p['id']}
+ **Age:** {p['age']} years
+ **Gender:** {p['gender']}
+
+ ## Clinical Context
+
+ **Cancer Diagnosis:** {c['stage']} {c['type']}"""
+
+         if c.get('subtype'):
+             preamble += f" ({c['subtype']})"
+
+         preamble += f"""
+ **Primary Site:** {c['primary_site']}
+ **Metastases:** {metastases_str}
+ **Diagnosis Date:** {c['diagnosis_date']}
+
+ **Tumor Biomarkers:**
+ - PD-L1 Expression: {c['biomarkers']['pdl1_expression']}
+ - Tumor Mutational Burden (TMB): {c['biomarkers']['tmb']}
+ - Microsatellite Instability (MSI): {c['biomarkers']['msi_status']}
+
+ ## Planned Immunotherapy
+
+ **Therapy Type:** {therapy_type}
+ **Drug:** {i['drug_name']} ({i['drug_class']})
+ **Treatment Setting:** {i['treatment_setting']}
+ **Line of Therapy:** {i['line_of_therapy']}
+ **Planned Start Date:** {i['planned_start_date']}
+ """
+
+         # Add ACT-specific details if present
+         if therapy_type == "ACT" and i.get("act_details"):
+             act = i["act_details"]
+             preamble += f"""
+ **ACT Details:**
+ - ACT Type: {act.get('act_type', 'N/A')}
+ - Target Antigen: {act.get('target_antigen', 'N/A')}
+ - Cell Source: {act.get('cell_source', 'N/A')}
+ - Preconditioning Regimen: {act.get('preconditioning_regimen', 'N/A')}
+ - T-Cell Harvest Date: {act.get('t_cell_harvest_date', 'N/A')}
+ - Expected CRS Risk: {act.get('expected_crs_risk', 'N/A')}
+ - Expected Neurotoxicity Risk: {act.get('expected_neurotoxicity_risk', 'N/A')}
+ """
+
+         preamble += f"""
+ ## Microbiome Profile Overview
+
+ **Sample Date:** {m['sample_date']}
+ **Sequencing Method:** {m['sequencing_method']}
+
+ This report summarizes gut microbiome findings relevant to anticipated immunotherapy response based on current evidence from peer-reviewed literature.
+
+ ---
+ """
+         return preamble
+
+     def generate_section_1(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 1: Microbiome Diversity & Composition Profile
+         """
+         logger.info("Generating Section 1: Microbiome Diversity & Composition Profile")
+
+         # Retrieve evidence
+         chunks = self.rag.retrieve_for_section_1(patient_data)
+
+         if not chunks:
+             logger.warning("No evidence retrieved for Section 1, omitting section")
+             return None
+
+         # Track citations
+         for citation, title in self.rag.get_unique_citation_metadata(chunks):
+             self.all_citations[citation] = title
+
+         # Format evidence
+         evidence = self.rag.format_chunks_for_llm(chunks)
+
+         # Prepare detected taxa string
+         key_bacteria = patient_data["microbiome"]["key_bacteria"]
+         detected_taxa_lines = []
+         for taxon, abundance in key_bacteria.items():
+             if abundance is not None and abundance > 0:
+                 taxon_display = taxon.replace("_", " ").title()
+                 detected_taxa_lines.append(f"- {taxon_display}: {abundance}%")
+
+         detected_taxa_str = "\n".join(detected_taxa_lines) if detected_taxa_lines else "None detected above threshold"
+
+         # Build prompt
+         diversity = patient_data["microbiome"]["diversity"]
+         prompt = build_prompt(
+             "section_1",
+             patient_data,
+             evidence,
+             cancer_stage=patient_data["cancer"]["stage"],
+             shannon_index=diversity["shannon_index"],
+             simpson_index=diversity["simpson_index"],
+             observed_species=diversity["observed_species"],
+             detected_taxa=detected_taxa_str,
+         )
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_1"])
+
+         return f"## 1. Microbiome Diversity & Composition Profile\n\n{content}\n\n"
+
+     def generate_section_2(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 2: Metabolite Landscape
+         """
+         logger.info("Generating Section 2: Metabolite Landscape")
+
+         # Check if metabolite data is available
+         metabolites = patient_data["microbiome"]["metabolites"]
+         has_scfa = any(v is not None for v in metabolites["scfa"].values())
+         has_metabolites = has_scfa or metabolites["bile_acids_available"] or metabolites["tryptophan_metabolites_available"]
+
+         if not has_metabolites:
+             logger.info("No metabolite data available, omitting Section 2")
+             return None
+
+         # Retrieve evidence
+         chunks = self.rag.retrieve_for_section_2(patient_data)
+
+         if not chunks:
+             logger.warning("No evidence retrieved for Section 2, omitting section")
+             return None
+
+         # Track citations
+         for citation, title in self.rag.get_unique_citation_metadata(chunks):
+             self.all_citations[citation] = title
+
+         # Format evidence
+         evidence = self.rag.format_chunks_for_llm(chunks)
+
+         # Prepare metabolite data string
+         metabolite_lines = []
+
+         if has_scfa:
+             metabolite_lines.append("**Short-Chain Fatty Acids:**")
+             scfa = metabolites["scfa"]
+             if scfa["butyrate_uM"] is not None:
+                 metabolite_lines.append(f"- Butyrate: {scfa['butyrate_uM']} μM")
+             if scfa["propionate_uM"] is not None:
+                 metabolite_lines.append(f"- Propionate: {scfa['propionate_uM']} μM")
+             if scfa["acetate_uM"] is not None:
+                 metabolite_lines.append(f"- Acetate: {scfa['acetate_uM']} μM")
+
+         if metabolites["bile_acids_available"]:
+             metabolite_lines.append("**Bile Acids:** Analysis available")
+
+         if metabolites["tryptophan_metabolites_available"]:
+             metabolite_lines.append("**Tryptophan Metabolites:** Analysis available")
+
+         metabolite_data_str = "\n".join(metabolite_lines)
+
+         # Build prompt
+         prompt = build_prompt(
+             "section_2",
+             patient_data,
+             evidence,
+             metabolite_data=metabolite_data_str,
+         )
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_2"])
+
+         return f"## 2. Metabolite Landscape\n\n{content}\n\n"
+
+     def generate_section_3(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 3: Drug–Microbiome Interaction Outlook
+         Supports both ICI and ACT therapies
+         """
+         logger.info("Generating Section 3: Drug–Microbiome Interaction Outlook")
+
+         # Retrieve evidence
+         chunks = self.rag.retrieve_for_section_3(patient_data)
+
+         if not chunks:
+             logger.warning("No evidence retrieved for Section 3, omitting section")
+             return None
+
+         # Track citations
+         for citation, title in self.rag.get_unique_citation_metadata(chunks):
+             self.all_citations[citation] = title
+
+         # Format evidence
+         evidence = self.rag.format_chunks_for_llm(chunks)
+
+         # Prepare summaries
+         diversity = patient_data["microbiome"]["diversity"]
+         key_bacteria = patient_data["microbiome"]["key_bacteria"]
+
+         # Key taxa summary: top five detected taxa by relative abundance
+         detected_taxa = [
+             (k.replace("_", " ").title(), v)
+             for k, v in key_bacteria.items()
+             if v is not None and v > 0
+         ]
+         detected_taxa.sort(key=lambda x: x[1], reverse=True)
+         key_taxa_summary = ", ".join([f"{t} ({a}%)" for t, a in detected_taxa[:5]])
+
+         # Metabolite summary
+         metabolites = patient_data["microbiome"]["metabolites"]
+         metabolite_flags = []
+         if any(v is not None for v in metabolites["scfa"].values()):
+             metabolite_flags.append("SCFAs measured")
+         if metabolites["bile_acids_available"]:
+             metabolite_flags.append("bile acids available")
+         if metabolites["tryptophan_metabolites_available"]:
+             metabolite_flags.append("tryptophan metabolites available")
+         metabolite_summary = ", ".join(metabolite_flags) if metabolite_flags else "limited metabolite data"
+
+         # Determine therapy type
+         therapy_type = patient_data["immunotherapy"].get("therapy_type", "ICI")
+
+         # Build prompt based on therapy type
+         if therapy_type == "ACT":
+             act_details = patient_data["immunotherapy"].get("act_details", {})
+             prompt = build_prompt(
+                 "section_3",
+                 patient_data,
+                 evidence,
+                 cancer_stage=patient_data["cancer"]["stage"],
+                 act_type=act_details.get("act_type", "CAR-T"),
+                 target_antigen=act_details.get("target_antigen", "CD19"),
+                 cell_source=act_details.get("cell_source", "autologous"),
+                 crs_risk=act_details.get("expected_crs_risk", "unknown"),
+                 neurotoxicity_risk=act_details.get("expected_neurotoxicity_risk", "unknown"),
+                 line_of_therapy=patient_data["immunotherapy"]["line_of_therapy"],
+                 shannon_index=diversity["shannon_index"],
+                 simpson_index=diversity["simpson_index"],
+                 key_taxa_summary=key_taxa_summary,
+                 metabolite_summary=metabolite_summary,
+             )
+         else:  # ICI
+             biomarkers = patient_data["cancer"]["biomarkers"]
+             prompt = build_prompt(
+                 "section_3",
+                 patient_data,
+                 evidence,
+                 cancer_stage=patient_data["cancer"]["stage"],
+                 pdl1=biomarkers["pdl1_expression"],
+                 tmb=biomarkers["tmb"],
+                 msi=biomarkers["msi_status"],
+                 line_of_therapy=patient_data["immunotherapy"]["line_of_therapy"],
+                 shannon_index=diversity["shannon_index"],
+                 simpson_index=diversity["simpson_index"],
+                 key_taxa_summary=key_taxa_summary,
+                 metabolite_summary=metabolite_summary,
+             )
+
+         # Generate (largest token budget: this is the core interpretive section)
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_3"])
+
+         # Section title varies by therapy type
+         if therapy_type == "ACT":
+             section_title = "3. Microbiome–ACT Interaction Outlook"
+         else:
+             section_title = "3. Drug–Microbiome Interaction Outlook"
+
+         return f"## {section_title}\n\n{content}\n\n"
+
+     def generate_section_4(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 4: Confounding Factors
+         """
+         logger.info("Generating Section 4: Confounding Factors")
+
+         # Check if any confounding factors are present
+         meds = patient_data["medications"]
+         prior = patient_data["prior_treatments"]
+
+         has_confounders = (
+             meds["antibiotic_history"]["recent_antibiotics"] or
+             meds["ppi_use"]["currently_on_ppi"] or
+             prior["chemotherapy"]["received"] or
+             prior["prior_immunotherapy"]["received"] or
+             len(patient_data.get("comorbidities", [])) > 0
+         )
+
+         if not has_confounders:
+             logger.info("No confounding factors present, omitting Section 4")
+             return None
+
+         # Retrieve evidence
+         chunks = self.rag.retrieve_for_section_4(patient_data)
+
+         if not chunks:
+             logger.warning("No evidence retrieved for Section 4, omitting section")
+             return None
+
+         # Track citations
+         for citation, title in self.rag.get_unique_citation_metadata(chunks):
+             self.all_citations[citation] = title
+
+         # Format evidence
+         evidence = self.rag.format_chunks_for_llm(chunks)
+
+         # Prepare confounding data string
+         confounding_lines = []
+
+         # Antibiotic history
+         if meds["antibiotic_history"]["recent_antibiotics"]:
+             confounding_lines.append("**Antibiotic Exposure:**")
+             for exp in meds["antibiotic_history"]["exposures"]:
+                 confounding_lines.append(
+                     f"- {exp['antibiotic_name']} ({exp['antibiotic_class']}): "
+                     f"{exp['start_date']} to {exp['end_date']} "
+                     f"({exp['days_before_ici']} days before ICI start)"
+                 )
+
+         # PPI use
+         if meds["ppi_use"]["currently_on_ppi"]:
+             ppi = meds["ppi_use"]
+             confounding_lines.append("**Proton Pump Inhibitor Use:**")
+             confounding_lines.append(f"- {ppi['ppi_name']}, duration: {ppi['duration_months']} months")
+
+         # Prior chemotherapy
+         if prior["chemotherapy"]["received"]:
+             chemo = prior["chemotherapy"]
+             regimens_str = ", ".join(chemo["regimens"])
+             confounding_lines.append("**Prior Chemotherapy:**")
+             confounding_lines.append(f"- Regimens: {regimens_str}")
+             confounding_lines.append(f"- Response: {chemo['response']}")
+
+         # Prior immunotherapy
+         if prior["prior_immunotherapy"]["received"]:
+             prior_ici = prior["prior_immunotherapy"]
+             drugs_str = ", ".join(prior_ici["drugs"])
+             confounding_lines.append("**Prior Immunotherapy:**")
+             confounding_lines.append(f"- Drugs: {drugs_str}")
+             confounding_lines.append(f"- Response: {prior_ici['response']}")
+
+         # Comorbidities
+         if patient_data.get("comorbidities"):
+             comorbidities_str = ", ".join(patient_data["comorbidities"])
+             confounding_lines.append(f"**Comorbidities:** {comorbidities_str}")
+
+         confounding_data_str = "\n".join(confounding_lines)
+
+         # Build prompt
+         prompt = build_prompt(
+             "section_4",
+             patient_data,
+             evidence,
+             confounding_data=confounding_data_str,
+         )
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_4"])
+
+         return f"## 4. Confounding Factors\n\n{content}\n\n"
+
+     def generate_section_5(self, patient_data: Dict) -> Optional[str]:
+         """
+         Generate Section 5: Microbiota-Modulation Intervention Considerations
+         """
+         logger.info("Generating Section 5: Microbiota-Modulation Intervention Considerations")
+
+         # Retrieve evidence for each intervention type
+         intervention_chunks = self.rag.retrieve_for_section_5(patient_data)
+
+         # Check if any intervention evidence was retrieved
+         total_chunks = sum(len(chunks) for chunks in intervention_chunks.values())
+         if total_chunks == 0:
+             logger.warning("No intervention evidence retrieved for Section 5, omitting section")
+             return None
+
+         # Track citations from all intervention types
+         for chunks in intervention_chunks.values():
+             for citation, title in self.rag.get_unique_citation_metadata(chunks):
+                 self.all_citations[citation] = title
+
+         # Format evidence for each intervention type
+         evidence_str = "## Diet and Prebiotics Evidence\n\n"
+         evidence_str += self.rag.format_chunks_for_llm(intervention_chunks.get("diet", []))
+         evidence_str += "\n\n## Probiotics Evidence\n\n"
+         evidence_str += self.rag.format_chunks_for_llm(intervention_chunks.get("probiotics", []))
+
+         # Prepare microbiome summary
+         key_bacteria = patient_data["microbiome"]["key_bacteria"]
+         detected_taxa = [
+             k.replace("_", " ").title()
+             for k, v in key_bacteria.items()
+             if v is not None and v > 0
+         ]
+         microbiome_summary = f"Detected taxa: {', '.join(detected_taxa[:5])}"
+
+         # Build prompt
+         prompt = build_prompt(
+             "section_5",
+             patient_data,
+             evidence_str,
+             microbiome_summary=microbiome_summary,
+         )
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_5"])
+
+         return f"## 5. Microbiota-Modulation Intervention Considerations\n\n{content}\n\n"
+
+     def generate_section_6(self, patient_data: Dict) -> str:
+         """
+         Generate Section 6: Data Quality & Interpretive Limitations
+         """
+         logger.info("Generating Section 6: Data Quality & Interpretive Limitations")
+
+         # This section doesn't use RAG, just data quality fields
+         data_quality = patient_data["microbiome"]["data_quality"]
+         metabolites = patient_data["microbiome"]["metabolites"]
+
+         # Prepare data quality string
+         dq_lines = [
+             f"**Sequencing Method:** {patient_data['microbiome']['sequencing_method']}",
+             f"**Data Completeness:** {data_quality['completeness']}",
+             f"**Data Source:** {data_quality['source']}",
+         ]
+
+         if data_quality.get("limitations"):
+             dq_lines.append(f"**Limitations:** {', '.join(data_quality['limitations'])}")
+
+         # Note missing metabolite data
+         missing_metabolites = []
+         if not any(v is not None for v in metabolites["scfa"].values()):
+             missing_metabolites.append("Short-chain fatty acids")
+         if not metabolites["bile_acids_available"]:
+             missing_metabolites.append("Bile acids")
+         if not metabolites["tryptophan_metabolites_available"]:
+             missing_metabolites.append("Tryptophan metabolites")
+
+         if missing_metabolites:
+             dq_lines.append(f"**Missing Metabolite Data:** {', '.join(missing_metabolites)}")
+
+         data_quality_str = "\n".join(dq_lines)
+
+         # Build prompt (no RAG evidence needed)
+         prompt = SECTION_6_PROMPT.format(data_quality=data_quality_str)
+
+         # Generate
+         content = self.llm.generate(prompt, max_new_tokens=config.SECTION_MAX_NEW_TOKENS["section_6"])
+
+         # Prepend the fixed caveats so they appear even if the model omits them
+         full_content = f"{SECTION_6_FIXED_CAVEATS}\n\n{content}"
+         return f"## 6. Data Quality & Interpretive Limitations\n\n{full_content}\n\n"
+
+     def get_all_citations(self) -> List[tuple]:
+         """Return sorted list of all unique (citation, title) tuples used in the report"""
+         return sorted(self.all_citations.items())
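
A standalone usage sketch of the generator (assumes `patient_data` is loaded as in the earlier examples); it shows the None-means-omit contract and the citation accumulator that feeds the assembler's References section:

```python
from src.section_generators import SectionGenerator

generator = SectionGenerator()

# Returns None when no evidence is retrieved, which is how
# the assembler decides to omit a section entirely
section_md = generator.generate_section_3(patient_data)
if section_md:
    print(section_md)

# Citations accumulate across every section generated by this instance
for citation, title in generator.get_all_citations():
    print(f"{citation}: {title}")
```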