Spaces:

K-RnD-Lab
/

Cancer-Research-Suite_03-2026

TEZv committed on Mar 7

Commit

35e6a9d

1 Parent(s): e07e6e3

Upload 7 files

Files changed (7)

README.md +203 -10
app.py +1338 -0
chatbot.py +672 -0
data_sources.md +236 -0
learning_cases.md +229 -0
requirements.txt +32 -0
research_gaps.md +51 -0

README.md CHANGED

@@ -1,13 +1,206 @@
 ---
-title: PHYLO-LAB2-03 2026
-emoji: 📈
-colorFrom: blue
-colorTo: yellow
-sdk: gradio
-sdk_version: 6.9.0
-app_file: app.py
-pinned: false
-license: mit
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# K R&D Lab — Cancer Research Suite
+**Author:** Oksana Kolisnyk | [kosatiks-group.pp.ua](https://kosatiks-group.pp.ua)
+**Repo:** [github.com/TEZv/K-RnD-Lab-PHYLO-03_2026](https://github.com/TEZv/K-RnD-Lab-PHYLO-03_2026)
+**ORCID:** 0009-0003-5780-2290
+**Generated:** 2026-03-07
 ---
+## Overview
+A Gradio-based research suite combining live cancer data APIs with educational simulation tools. Designed for researchers, ML engineers, and students working at the intersection of cancer biology, drug delivery, and precision oncology.
+**Two tab groups:**
+- **Group A — Real Data Tools** (5 + 1 tabs): Live APIs, real results, never hallucinated
+- **Group B — Learning Sandbox** (5 tabs): Rule-based simulations, clearly labeled ⚠️ SIMULATED
+---
+## File Structure
+```
+K-RnD-Lab/
+├── app.py              # Main Gradio application (all 10 tabs + Lab Journal)
+├── chatbot.py          # RAG chatbot module (sentence-transformers + FAISS)
+├── requirements.txt    # Python dependencies
+├── README.md           # This file
+├── research_gaps.md    # Part 2: 10 underexplored research directions
+├── learning_cases.md   # Part 3: 5 guided investigation cases
+└── data_sources.md     # All API endpoints and data sources
+```
+Runtime-generated:
+```
+├── cache/              # API response cache (JSON, 24h TTL)
+└── lab_journal.csv     # Auto-logged research journal
+```
+---
+## Quick Start
+### Local
+```bash
+# 1. Clone
+git clone https://github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
+cd K-RnD-Lab-PHYLO-03_2026
+# 2. Install dependencies
+pip install -r requirements.txt
+# 3. Run
+python app.py
+# → Opens at http://localhost:7860
+```
+### HuggingFace Spaces
+1. Create a new Space: **Gradio** SDK, Python 3.10+
+2. Upload `app.py`, `chatbot.py`, `requirements.txt`
+3. Space auto-deploys — no secrets or API keys needed
+> The RAG chatbot downloads the `all-MiniLM-L6-v2` model (~80 MB) on first run.
+> Subsequent runs use the HF cache. Allow ~60s for first startup.
+---
+## Tab Reference
+### Group A — Real Data Tools
+| Tab | Function | APIs Used |
+|-----|----------|-----------|
+| **A1 — Gray Zones Explorer** | Heatmap of biological process × cancer type paper counts; top 5 gaps | PubMed E-utilities |
+| **A2 — Understudied Target Finder** | Essential genes with high gap index (essentiality / log(papers+2)) | OpenTargets GraphQL, PubMed, ClinicalTrials.gov v2 |
+| **A3 — Real Variant Lookup** | ClinVar classification + gnomAD allele frequency for any HGVS variant | ClinVar E-utilities, gnomAD GraphQL |
+| **A4 — Literature Gap Finder** | Papers/year chart with gap detection (zero and low-activity years) | PubMed E-utilities |
+| **A5 — Druggable Orphans** | Essential cancer genes with no approved drug and no active trial | OpenTargets GraphQL, ClinicalTrials.gov v2 |
+| **A6 — Research Assistant** | RAG chatbot indexed on 20 curated papers; confidence-flagged answers | sentence-transformers + FAISS (local) |
+### Group B — Learning Sandbox ⚠️ SIMULATED
+| Tab | Function | Model |
+|-----|----------|-------|
+| **B1 — miRNA Explorer** | Predicted miRNA binding energy + expression in BRCA1/BRCA2/TP53-mutant tumors | Curated lookup table |
+| **B2 — siRNA Targets** | siRNA efficacy + off-target risk for LUAD/BRCA/COAD | Curated efficacy estimates |
+| **B3 — LNP Corona** | Protein corona composition from formulation sliders (PEG, ionizable lipid, size) | Langmuir adsorption model |
+| **B4 — Flow Corona** | Vroman effect kinetics (competitive albumin/ApoE adsorption) | Competitive Langmuir ODE |
+| **B5 — Variant Concepts** | ACMG/AMP classification criteria and codes by tier | ACMG 2015 rule set |
+### Shared — Lab Journal (sidebar)
+- Auto-logs every tab run with timestamp, action, and result summary
+- Manual note field for researcher observations
+- Exports to `lab_journal.csv`
+- Click **Refresh Journal** to view last 20 entries
+---
+## Supported Cancer Types
+| Code | Full Name | EFO ID |
+|------|-----------|--------|
+| GBM | Glioblastoma multiforme | EFO_0000519 |
+| PDAC | Pancreatic ductal adenocarcinoma | EFO_0002618 |
+| SCLC | Small cell lung cancer | EFO_0000702 |
+| UVM | Uveal melanoma | EFO_0004339 |
+| DIPG | Diffuse intrinsic pontine glioma | EFO_0009708 |
+| ACC | Adrenocortical carcinoma | EFO_0003060 |
+| MCC | Merkel cell carcinoma | EFO_0005558 |
+| PCNSL | Primary CNS lymphoma | EFO_0005543 |
+| Pediatric AML | Pediatric acute myeloid leukemia | EFO_0000222 |
 ---
+## Biological Processes Screened (Tab A1)
+autophagy · ferroptosis · protein corona · RNA splicing · phase separation · m6A · circRNA · synthetic lethality · immune exclusion · enhancer hijacking · lncRNA regulation · metabolic reprogramming · exosome biogenesis · senescence · mitophagy · liquid-liquid phase separation · cryptic splicing · proteostasis · redox biology · translation regulation
+---
+## RAG Chatbot Details (Tab A6)
+- **Model:** `sentence-transformers/all-MiniLM-L6-v2` (80 MB, CPU-only, no GPU needed)
+- **Index:** FAISS `IndexFlatIP` with L2-normalized embeddings (cosine similarity)
+- **Corpus:** 20 curated paper abstracts on LNP delivery, protein corona, cancer variants, liquid biopsy
+- **Confidence flags:**
+  - 🟢 HIGH — retrieval score ≥ 0.55, ≥ 2 matching papers
+  - 🟡 MEDIUM — score 0.35–0.55
+  - 🔴 SPECULATIVE — score < 0.35
+- **Out-of-scope:** Returns explicit "not in indexed papers" message — never fabricates
+---
+## Caching & Rate Limiting
+- All API responses cached in `./cache/` as JSON files (24h TTL)
+- PubMed: `time.sleep(0.34)` between requests (≤3 req/sec, NCBI policy)
+- All API calls wrapped in `try/except` → returns `"Data unavailable"` on failure, never fake data
+- Cache can be cleared by deleting `./cache/` directory
+---
+## Data Attribution
+Every result panel displays a source note:
+```
+Source: [API name] | Date: YYYY-MM-DD
+```
+Full API documentation: see `data_sources.md`
+---
+## Technical Notes
+### DepMap Essentiality Scores
+Per DepMap convention: **negative scores = essential genes** (knockout kills cells).
+The app inverts scores before display: `essentiality_displayed = -raw_score`
+so that **positive values = more essential** (intuitive direction).
+Gap index = `essentiality_inverted / log(papers + 2)` (the `+ 2` keeps the denominator at or above log 2 for genes with zero papers)
+### Variant Lookup Policy
+Tab A3 strictly follows a no-hallucination policy:
+- If a variant is not found in ClinVar → displays: *"Not in database. Do not interpret."*
+- If gnomAD API fails → displays: *"Data unavailable — API error."*
+- Never infers, guesses, or extrapolates variant classifications
+### SIMULATED Data Policy
+All Group B tabs display a prominent ⚠️ SIMULATED banner.
+Simulated results must not be used for:
+- Clinical decision-making
+- Publication without independent experimental validation
+- Drug development or patient care
+---
+## Dependencies
+| Package | Version | Purpose |
+|---------|---------|---------|
+| gradio | ≥4.20.0 | UI framework |
+| numpy | ≥1.24.0 | Numerical computing |
+| pandas | ≥2.0.0 | Data tables |
+| matplotlib | ≥3.7.0 | Visualizations |
+| Pillow | ≥10.0.0 | Image handling |
+| requests | ≥2.31.0 | HTTP API calls |
+| sentence-transformers | ≥2.6.0 | RAG embeddings |
+| faiss-cpu | ≥1.7.4 | Vector similarity search |
+| torch | ≥2.0.0 | sentence-transformers backend |
+---
+## License
+Research and educational use. All real-data results sourced from public APIs (PubMed, OpenTargets, ClinVar, gnomAD, ClinicalTrials.gov) under their respective open-access licenses. See `data_sources.md` for details.
+---
+## Citation
+```
+Kolisnyk O. K R&D Lab — Cancer Research Suite. 2026.
+GitHub: github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
+ORCID: 0009-0003-5780-2290
+```
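
Reviewer note on the README's confidence flags (Tab A6): the thresholds listed under "RAG Chatbot Details" can be read as a small decision function. The sketch below is illustrative only. `confidence_flag` and its parameters are hypothetical names, and `chatbot.py` may implement this differently. The README does not say what happens when the score is at least 0.55 but fewer than two papers match, so falling back to MEDIUM in that case is an assumption.

```python
def confidence_flag(top_score: float, n_matches: int) -> str:
    """Map a retrieval score and match count to a README confidence label.

    Hypothetical helper: thresholds come from the README, names do not.
    """
    if top_score >= 0.55 and n_matches >= 2:
        return "HIGH"
    if top_score >= 0.35:
        # Assumption: a strong score with fewer than 2 matching papers
        # is downgraded to MEDIUM (the README leaves this case open).
        return "MEDIUM"
    return "SPECULATIVE"
```

Keeping the thresholds in one function would make the flag logic easy to unit-test separately from the embedding model.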

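Reviewer note on the gap index: the README's Technical Notes describe flipping the sign of the DepMap gene effect and dividing by a logarithm of the paper count, and `a2_run` in `app.py` uses `log(papers + 2)` as the denominator. A minimal sketch of that arithmetic follows. `gap_index` is a hypothetical helper written for illustration, not a function in this commit.

```python
import math


def gap_index(raw_gene_effect: float, papers: int) -> float:
    """Compute a research gap index from a DepMap-style gene effect score.

    Hypothetical helper mirroring the arithmetic in a2_run.
    """
    # DepMap convention: negative raw score = essential, so flip the sign.
    essentiality = -raw_gene_effect
    if essentiality <= 0:
        # Non-essential genes get no gap score.
        return 0.0
    # A failed PubMed lookup is reported as -1; treat it as 0 papers here.
    papers_safe = max(papers, 0)
    # log(papers + 2) keeps the denominator at or above log(2).
    return essentiality / math.log(papers_safe + 2)
```

With this form, a gene with a raw score of -1.0 scores highest when it has no papers, and the index shrinks slowly as the literature grows.
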
app.py ADDED

	@@ -0,0 +1,1338 @@

+"""
+K R&D Lab — Cancer Research Suite
+Author: Oksana Kolisnyk | kosatiks-group.pp.ua
+Repo:   github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
+"""
+import gradio as gr
+import requests
+import json
+import os
+import time
+import csv
+import math
+import hashlib
+import datetime
+import numpy as np
+import pandas as pd
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import matplotlib.colors as mcolors
+from matplotlib import cm
+import io
+from PIL import Image
+# ─────────────────────────────────────────────
+# CACHE SYSTEM  (TTL = 24 h)
+# ─────────────────────────────────────────────
+CACHE_DIR = "./cache"
+os.makedirs(CACHE_DIR, exist_ok=True)
+CACHE_TTL = 86400  # 24 hours in seconds
+def _cache_key(endpoint: str, query: str) -> str:
+    raw = f"{endpoint}_{query}"
+    return hashlib.md5(raw.encode()).hexdigest()
+def cache_get(endpoint: str, query: str):
+    key = _cache_key(endpoint, query)
+    path = os.path.join(CACHE_DIR, f"{endpoint}_{key}.json")
+    if os.path.exists(path):
+        mtime = os.path.getmtime(path)
+        if time.time() - mtime < CACHE_TTL:
+            with open(path) as f:
+                return json.load(f)
+    return None
+def cache_set(endpoint: str, query: str, data):
+    key = _cache_key(endpoint, query)
+    path = os.path.join(CACHE_DIR, f"{endpoint}_{key}.json")
+    with open(path, "w") as f:
+        json.dump(data, f)
+# ─────────────────────────────────────────────
+# LAB JOURNAL
+# ─────────────────────────────────────────────
+JOURNAL_FILE = "./lab_journal.csv"
+def journal_log(tab: str, action: str, result: str, note: str = ""):
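+    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
+    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()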
+    row = [ts, tab, action, result[:200], note]
+    write_header = not os.path.exists(JOURNAL_FILE)
+    with open(JOURNAL_FILE, "a", newline="") as f:
+        w = csv.writer(f)
+        if write_header:
+            w.writerow(["timestamp", "tab", "action", "result_summary", "note"])
+        w.writerow(row)
+    return ts
+def journal_read() -> str:
+    if not os.path.exists(JOURNAL_FILE):
+        return "No entries yet."
+    df = pd.read_csv(JOURNAL_FILE)
+    if df.empty:
+        return "No entries yet."
+    return df.tail(20).to_markdown(index=False)
+# ─────────────────────────────────────────────
+# CONSTANTS
+# ─────────────────────────────────────────────
+CANCER_TYPES = [
+    "GBM", "PDAC", "SCLC", "UVM", "DIPG",
+    "ACC", "MCC", "PCNSL", "Pediatric AML"
+]
+CANCER_EFO = {
+    "GBM":           "EFO_0000519",
+    "PDAC":          "EFO_0002618",
+    "SCLC":          "EFO_0000702",
+    "UVM":           "EFO_0004339",
+    "DIPG":          "EFO_0009708",
+    "ACC":           "EFO_0003060",
+    "MCC":           "EFO_0005558",
+    "PCNSL":         "EFO_0005543",
+    "Pediatric AML": "EFO_0000222",
+}
+PROCESSES = [
+    "autophagy", "ferroptosis", "protein corona",
+    "RNA splicing", "phase separation", "m6A",
+    "circRNA", "synthetic lethality", "immune exclusion",
+    "enhancer hijacking", "lncRNA regulation",
+    "metabolic reprogramming", "exosome biogenesis",
+    "senescence", "mitophagy",
+    "liquid-liquid phase separation", "cryptic splicing",
+    "proteostasis", "redox biology", "translation regulation"
+]
+PUBMED_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
+OT_GRAPHQL   = "https://api.platform.opentargets.org/api/v4/graphql"
+GNOMAD_GQL   = "https://gnomad.broadinstitute.org/api"
+CT_BASE      = "https://clinicaltrials.gov/api/v2"
+# ─────────────────────────────────────────────
+# SHARED API HELPERS
+# ─────────────────────────────────────────────
+def pubmed_count(query: str) -> int:
+    """Return paper count for a PubMed query (cached)."""
+    cached = cache_get("pubmed_count", query)
+    if cached is not None:
+        return cached
+    try:
+        time.sleep(0.34)
+        r = requests.get(
+            f"{PUBMED_BASE}/esearch.fcgi",
+            params={"db": "pubmed", "term": query, "rettype": "count", "retmode": "json"},
+            timeout=10
+        )
+        r.raise_for_status()
+        count = int(r.json()["esearchresult"]["count"])
+        cache_set("pubmed_count", query, count)
+        return count
+    except Exception:
+        return -1
+def pubmed_search(query: str, retmax: int = 10) -> list:
+    """Return list of PMIDs (cached)."""
+    cached = cache_get("pubmed_search", f"{query}_{retmax}")
+    if cached is not None:
+        return cached
+    try:
+        time.sleep(0.34)
+        r = requests.get(
+            f"{PUBMED_BASE}/esearch.fcgi",
+            params={"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"},
+            timeout=10
+        )
+        r.raise_for_status()
+        ids = r.json()["esearchresult"]["idlist"]
+        cache_set("pubmed_search", f"{query}_{retmax}", ids)
+        return ids
+    except Exception:
+        return []
+def pubmed_summary(pmids: list) -> list:
+    """Fetch summaries for a list of PMIDs."""
+    if not pmids:
+        return []
+    cached = cache_get("pubmed_summary", ",".join(pmids))
+    if cached is not None:
+        return cached
+    try:
+        time.sleep(0.34)
+        r = requests.get(
+            f"{PUBMED_BASE}/esummary.fcgi",
+            params={"db": "pubmed", "id": ",".join(pmids), "retmode": "json"},
+            timeout=15
+        )
+        r.raise_for_status()
+        result = r.json().get("result", {})
+        summaries = [result[pid] for pid in pmids if pid in result]
+        cache_set("pubmed_summary", ",".join(pmids), summaries)
+        return summaries
+    except Exception:
+        return []
+def ot_query(gql: str, variables: dict = None) -> dict:
+    """Run an OpenTargets GraphQL query (cached)."""
+    key = json.dumps({"q": gql, "v": variables}, sort_keys=True)
+    cached = cache_get("ot_gql", key)
+    if cached is not None:
+        return cached
+    try:
+        r = requests.post(
+            OT_GRAPHQL,
+            json={"query": gql, "variables": variables or {}},
+            timeout=20
+        )
+        r.raise_for_status()
+        data = r.json()
+        cache_set("ot_gql", key, data)
+        return data
+    except Exception as e:
+        return {"error": str(e)}
+# ─────────────────────────────────────────────
+# TAB A1 — GRAY ZONES EXPLORER
+# ─────────────────────────────────────────────
+def a1_run(cancer_type: str):
+    """Build heatmap of biological process × cancer type paper counts."""
+    today = datetime.date.today().isoformat()
+    counts = {}
+    for proc in PROCESSES:
+        q = f'"{proc}" AND "{cancer_type}"[tiab]'
+        n = pubmed_count(q)
+        counts[proc] = n
+    # Build single-column dataframe for heatmap
+    df = pd.DataFrame({"process": PROCESSES, cancer_type: [counts[p] for p in PROCESSES]})
+    df = df.set_index("process")
+    # Replace -1 (API error) with NaN
+    df = df.replace(-1, np.nan)
+    fig, ax = plt.subplots(figsize=(6, 8), facecolor="white")
+    valid = df[cancer_type].fillna(0).values.reshape(-1, 1)
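+    # cm.get_cmap was removed in Matplotlib 3.9; copy so set_bad does not touch the registry entry
+    cmap = matplotlib.colormaps["YlOrRd"].copy()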
+    cmap.set_bad("white")
+    masked = np.ma.masked_where(df[cancer_type].isna().values.reshape(-1, 1), valid)
+    im = ax.imshow(masked, aspect="auto", cmap=cmap, vmin=0)
+    ax.set_xticks([0])
+    ax.set_xticklabels([cancer_type], fontsize=11, fontweight="bold")
+    ax.set_yticks(range(len(PROCESSES)))
+    ax.set_yticklabels(PROCESSES, fontsize=9)
+    ax.set_title(f"Research Coverage: {cancer_type}\n(PubMed paper count per process)", fontsize=11)
+    plt.colorbar(im, ax=ax, label="Paper count")
+    fig.tight_layout()
+    buf = io.BytesIO()
+    fig.savefig(buf, format="png", dpi=150, facecolor="white")
+    buf.seek(0)
+    img = Image.open(buf)
+    plt.close(fig)
+    # Top 5 gaps = lowest counts
+    sorted_procs = sorted(
+        [(p, counts[p]) for p in PROCESSES if counts[p] >= 0],
+        key=lambda x: x[1]
+    )
+    gap_cards = []
+    for i, (proc, cnt) in enumerate(sorted_procs[:5], 1):
+        gap_cards.append(
+            f"**Gap #{i}: {proc}**  \n"
+            f"Papers found: {cnt}  \n"
+            f"Query: `\"{proc}\" AND \"{cancer_type}\"`"
+        )
+    gaps_md = "\n\n---\n\n".join(gap_cards) if gap_cards else "No data available."
+    journal_log("A1-GrayZones", f"cancer={cancer_type}", f"gaps={[p for p,_ in sorted_procs[:5]]}")
+    source_note = f"*Source: PubMed E-utilities | Date: {today}*"
+    return img, gaps_md + "\n\n" + source_note
+# ─────────────────────────────────────────────
+# TAB A2 — UNDERSTUDIED TARGET FINDER
+# ─────────────────────────────────────────────
+DEPMAP_URL = "https://ndownloader.figshare.com/files/40448549"  # CRISPR_gene_effect.csv (public)
+_depmap_cache = {}
+def _load_depmap_sample() -> pd.DataFrame:
+    """Return placeholder essentiality scores for a curated list of cancer genes."""
+    global _depmap_cache
+    if "df" in _depmap_cache:
+        return _depmap_cache["df"]
+    # The full DepMap CSV is ~500 MB and is not downloaded here (DEPMAP_URL is unused).
+    # Instead, a curated list of well-known cancer genes is given seeded random scores.
+    genes = [
+        "MYC", "KRAS", "TP53", "EGFR", "PTEN", "RB1", "CDKN2A",
+        "PIK3CA", "AKT1", "BRAF", "NRAS", "IDH1", "IDH2", "ARID1A",
+        "SMAD4", "CTNNB1", "VHL", "BRCA1", "BRCA2", "ATM",
+        "CDK4", "CDK6", "MDM2", "BCL2", "MCL1", "CCND1",
+        "FGFR1", "FGFR2", "MET", "ALK", "RET", "ERBB2",
+        "MTOR", "PIK3R1", "STK11", "NF1", "NF2", "TSC1", "TSC2",
+    ]
+    # Simulated essentiality scores (negative = essential, per DepMap convention)
+    rng = np.random.default_rng(42)
+    scores = rng.uniform(-1.5, 0.3, len(genes))
+    df = pd.DataFrame({"gene": genes, "gene_effect": scores})
+    _depmap_cache["df"] = df
+    return df
+def a2_run(cancer_type: str):
+    """Find understudied essential targets for a cancer type."""
+    today = datetime.date.today().isoformat()
+    efo = CANCER_EFO.get(cancer_type, "")
+    # 1. Get top associated genes from OpenTargets
+    gql = """
+    query AssocTargets($efoId: String!, $size: Int!) {
+      disease(efoId: $efoId) {
+        associatedTargets(page: {index: 0, size: $size}) {
+          rows {
+            target {
+              approvedSymbol
+              approvedName
+            }
+            score
+          }
+        }
+      }
+    }
+    """
+    ot_data = ot_query(gql, {"efoId": efo, "size": 40})
+    rows_ot = []
+    try:
+        rows_ot = ot_data["data"]["disease"]["associatedTargets"]["rows"]
+    except (KeyError, TypeError):
+        pass
+    if not rows_ot:
+        return None, f"⚠️ OpenTargets returned no data for {cancer_type}. Try again later.\n\n*Source: OpenTargets | Date: {today}*"
+    genes_ot = [r["target"]["approvedSymbol"] for r in rows_ot]
+    # 2. PubMed paper counts per gene
+    paper_counts = {}
+    for gene in genes_ot[:20]:  # limit to avoid rate limits
+        q = f'"{gene}" AND "{cancer_type}"[tiab]'
+        paper_counts[gene] = pubmed_count(q)
+    # 3. Clinical trials count per gene
+    trial_counts = {}
+    for gene in genes_ot[:20]:
+        cached = cache_get("ct_gene", f"{gene}_{cancer_type}")
+        if cached is not None:
+            trial_counts[gene] = cached
+            continue
+        try:
+            r = requests.get(
+                f"{CT_BASE}/studies",
+                params={"query.term": f"{gene} {cancer_type}", "pageSize": 1, "format": "json"},
+                timeout=10
+            )
+            r.raise_for_status()
+            n = r.json().get("totalCount", 0)
+            trial_counts[gene] = n
+            cache_set("ct_gene", f"{gene}_{cancer_type}", n)
+        except Exception:
+            trial_counts[gene] = -1
+    # 4. DepMap-style essentiality (raw negative = essential; the sign is flipped for display)
+    depmap_df = _load_depmap_sample()
+    depmap_dict = dict(zip(depmap_df["gene"], depmap_df["gene_effect"]))
+    # 5. Build result table
+    # Research gap index = inverted essentiality / log(papers + 2)
+    # DepMap convention: scores are negative for essential genes
+    records = []
+    for gene in genes_ot[:20]:
+        raw_ess = depmap_dict.get(gene, None)
+        papers = paper_counts.get(gene, 0)
+        trials = trial_counts.get(gene, 0)
+        if raw_ess is None:
+            ess_display = "N/A"
+            gap_idx = 0.0
+        else:
+            # Invert: positive = more essential (per DepMap know-how: negative raw = essential)
+            ess_inverted = -raw_ess
+            ess_display = f"{ess_inverted:.3f}"
+            papers_safe = max(papers, 0)
+            # Use log(papers + 2) to guarantee denominator >= log(2) ≈ 0.693
+            # preventing division-by-near-zero for genes with 0 publications
+            gap_idx = ess_inverted / math.log(papers_safe + 2) if ess_inverted > 0 else 0.0
+        records.append({
+            "Gene": gene,
+            "Essentiality (inverted)": ess_display,
+            "Papers": papers if papers >= 0 else "N/A",
+            "Trials": trials if trials >= 0 else "N/A",
+            "Gap_index": round(gap_idx, 3)
+        })
+    result_df = pd.DataFrame(records).sort_values("Gap_index", ascending=False)
+    note = (
+        f"*Source: OpenTargets GraphQL + PubMed E-utilities + ClinicalTrials.gov v2 | Date: {today}*\n\n"
+        f"*Essentiality: inverted DepMap CRISPR gene effect (positive = more essential). "
+        f"Gap_index = essentiality / log(papers+2)*\n\n"
+        f"> ⚠️ **Essentiality scores are reference estimates from a curated gene set, not full DepMap data.** "
+        f"For real analysis, download `CRISPR_gene_effect.csv` from [depmap.org](https://depmap.org/portal/download/all/) "
+        f"and replace `_load_depmap_sample()` in `app.py`."
+    )
+    journal_log("A2-TargetFinder", f"cancer={cancer_type}", f"top_gap={result_df.iloc[0]['Gene'] if len(result_df) else 'none'}")
+    return result_df, note
+# ─────────────────────────────────────────────
+# TAB A3 — REAL VARIANT LOOKUP
+# ─────────────────────────────────────────────
+def a3_run(hgvs: str):
+    """Look up a variant in ClinVar and gnomAD. Never hallucinate."""
+    today = datetime.date.today().isoformat()
+    hgvs = hgvs.strip()
+    if not hgvs:
+        return "Please enter an HGVS notation (e.g. NM_007294.4:c.5266dupC)"
+    result_parts = []
+    # ── ClinVar ──
+    clinvar_cached = cache_get("clinvar", hgvs)
+    if clinvar_cached is None:
+        try:
+            time.sleep(0.34)
+            r = requests.get(
+                f"{PUBMED_BASE}/esearch.fcgi",
+                params={"db": "clinvar", "term": hgvs, "retmode": "json", "retmax": 5},
+                timeout=10
+            )
+            r.raise_for_status()
+            ids = r.json()["esearchresult"]["idlist"]
+            clinvar_cached = ids
+            cache_set("clinvar", hgvs, ids)
+        except Exception:
+            clinvar_cached = None
+    if clinvar_cached and len(clinvar_cached) > 0:
+        # Fetch summary
+        try:
+            time.sleep(0.34)
+            r2 = requests.get(
+                f"{PUBMED_BASE}/esummary.fcgi",
+                params={"db": "clinvar", "id": ",".join(clinvar_cached[:3]), "retmode": "json"},
+                timeout=10
+            )
+            r2.raise_for_status()
+            cv_result = r2.json().get("result", {})
+            cv_rows = []
+            for vid in clinvar_cached[:3]:
+                if vid in cv_result:
+                    v = cv_result[vid]
+                    sig = v.get("clinical_significance", {})
+                    if isinstance(sig, dict):
+                        sig_str = sig.get("description", "Unknown")
+                    else:
+                        sig_str = str(sig)
+                    cv_rows.append(
+                        f"- **ClinVar ID {vid}**: {v.get('title','N/A')} | "
+                        f"Classification: **{sig_str}**"
+                    )
+            if cv_rows:
+                result_parts.append("### ClinVar Results\n" + "\n".join(cv_rows))
+            else:
+                result_parts.append("### ClinVar\nVariant found in index but summary unavailable.")
+        except Exception:
+            result_parts.append("### ClinVar\nData unavailable — API error.")
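+    elif clinvar_cached is None:
+        # The search request itself failed, so absence from ClinVar cannot be concluded
+        result_parts.append("### ClinVar\nData unavailable — API error.")
+    else:
+        result_parts.append(
+            "### ClinVar\n"
+            "**Not found in ClinVar database.**\n"
+            "> ⚠️ Not in database. Do not interpret."
+        )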
+    # ── gnomAD ──
+    # gnomAD GraphQL expects rsID or gene-level; HGVS lookup is limited
+    # We attempt a search via the variant endpoint
+    gnomad_cached = cache_get("gnomad", hgvs)
+    if gnomad_cached is None:
+        try:
+            gql = """
+            query VariantSearch($query: String!, $dataset: DatasetId!) {
+              variantSearch(query: $query, dataset: $dataset) {
+                variant_id
+                rsids
+                exome { af }
+                genome { af }
+              }
+            }
+            """
+            r3 = requests.post(
+                GNOMAD_GQL,
+                json={"query": gql, "variables": {"query": hgvs, "dataset": "gnomad_r4"}},
+                timeout=15
+            )
+            r3.raise_for_status()
+            gnomad_cached = r3.json()
+            cache_set("gnomad", hgvs, gnomad_cached)
+        except Exception:
+            gnomad_cached = None
+    # GraphQL errors can come back with "data": null, so check the value, not just the key
+    if gnomad_cached and gnomad_cached.get("data"):
+        variants = gnomad_cached["data"].get("variantSearch") or []
+        if variants:
+            gn_rows = []
+            for v in variants[:3]:
+                vid = v.get("variant_id", "N/A")
+                rsids = ", ".join(v.get("rsids") or []) or "N/A"
+                exome_af = v.get("exome", {}) or {}
+                genome_af = v.get("genome", {}) or {}
+                af_e = exome_af.get("af", "N/A")
+                af_g = genome_af.get("af", "N/A")
+                gn_rows.append(
+                    f"- **{vid}** (rsID: {rsids}) | "
+                    f"Exome AF: {af_e} | Genome AF: {af_g}"
+                )
+            result_parts.append("### gnomAD v4 Results\n" + "\n".join(gn_rows))
+        else:
+            result_parts.append(
+                "### gnomAD v4\n"
+                "**Not found in gnomAD.**\n"
+                "> ⚠️ Not in database. Do not interpret."
+            )
+    else:
+        result_parts.append(
+            "### gnomAD v4\n"
+            "Data unavailable — API error or variant not found.\n"
+            "> ⚠️ Not in database. Do not interpret."
+        )
+    result_parts.append(f"\n*Source: ClinVar E-utilities + gnomAD GraphQL | Date: {today}*")
+    journal_log("A3-VariantLookup", f"hgvs={hgvs}", result_parts[0][:100])
+    return "\n\n".join(result_parts)
+# ─────────────────────────────────────────────
+# TAB A4 — LITERATURE GAP FINDER
+# ─────────────────────────────────────────────
+def a4_run(cancer_type: str, keyword: str):
+    """Papers per year chart + gap detection."""
+    today = datetime.date.today().isoformat()
+    keyword = keyword.strip()
+    if not keyword:
+        return None, "Please enter a keyword."
+    current_year = datetime.date.today().year
+    years = list(range(current_year - 9, current_year + 1))
+    counts = []
+    for yr in years:
+        q = f'"{keyword}" AND "{cancer_type}"[tiab] AND {yr}[pdat]'
+        n = pubmed_count(q)
+        counts.append(max(n, 0))
+    # Gap detection: years with 0 or below-average counts
+    avg = np.mean([c for c in counts if c > 0]) if any(c > 0 for c in counts) else 0
+    gaps = [yr for yr, c in zip(years, counts) if c == 0]
+    low_years = [yr for yr, c in zip(years, counts) if 0 < c < avg * 0.3]
+    fig, ax = plt.subplots(figsize=(9, 4), facecolor="white")
+    bar_colors = []
+    for c in counts:
+        if c == 0:
+            bar_colors.append("#d73027")
+        elif c < avg * 0.3:
+            bar_colors.append("#fc8d59")
+        else:
+            bar_colors.append("#4393c3")
+    ax.bar(years, counts, color=bar_colors, edgecolor="white", linewidth=0.5)
+    ax.axhline(avg, color="#555", linestyle="--", linewidth=1, label=f"Avg: {avg:.1f}")
+    ax.set_xlabel("Year", fontsize=11)
+    ax.set_ylabel("PubMed Papers", fontsize=11)
+    ax.set_title(f'Literature Trend: "{keyword}" in {cancer_type}', fontsize=12)
+    ax.set_xticks(years)
+    ax.set_xticklabels([str(y) for y in years], rotation=45, ha="right")
+    ax.legend(fontsize=9)
+    ax.set_facecolor("white")
+    fig.tight_layout()
+    buf = io.BytesIO()
+    fig.savefig(buf, format="png", dpi=150, facecolor="white")
+    buf.seek(0)
+    img = Image.open(buf)
+    plt.close(fig)
+    gap_text = []
+    if gaps:
+        gap_text.append(f"**Zero-publication years:** {', '.join(map(str, gaps))}")
+    if low_years:
+        gap_text.append(f"**Low-activity years (<30% avg):** {', '.join(map(str, low_years))}")
+    if not gaps and not low_years:
+        gap_text.append("No significant gaps detected in the last 10 years.")
+    summary = "\n\n".join(gap_text)
+    summary += f"\n\n*Source: PubMed E-utilities | Date: {today}*"
+    journal_log("A4-LitGap", f"cancer={cancer_type}, kw={keyword}", summary[:100])
+    return img, summary
+# ─────────────────────────────────────────────
+# TAB A5 — DRUGGABLE ORPHANS
+# ─────────────────────────────────────────────
+def a5_run(cancer_type: str):
+    """Find essential genes with no approved drug and no active trial."""
+    today = datetime.date.today().isoformat()
+    efo = CANCER_EFO.get(cancer_type, "")
+    # 1. Get associated targets from OpenTargets with tractability info
+    gql = """
+    query DruggableTargets($efoId: String!, $size: Int!) {
+      disease(efoId: $efoId) {
+        associatedTargets(page: {index: 0, size: $size}) {
+          rows {
+            target {
+              approvedSymbol
+              approvedName
+              tractability {
+                label
+                modality
+                value
+              }
+              knownDrugs {
+                count
+              }
+            }
+            score
+          }
+        }
+      }
+    }
+    """
+    ot_data = ot_query(gql, {"efoId": efo, "size": 50})
+    rows_ot = []
+    try:
+        rows_ot = ot_data["data"]["disease"]["associatedTargets"]["rows"]
+    except (KeyError, TypeError):
+        pass
+    if not rows_ot:
+        return None, f"⚠️ OpenTargets returned no data for {cancer_type}.\n\n*Source: OpenTargets | Date: {today}*"
+    # 2. Filter: no known drugs
+    orphan_candidates = []
+    for row in rows_ot:
+        t = row["target"]
+        gene = t["approvedSymbol"]
+        drug_count = 0
+        try:
+            drug_count = t["knownDrugs"]["count"] or 0
+        except (KeyError, TypeError):
+            drug_count = 0
+        if drug_count == 0:
+            orphan_candidates.append({"gene": gene, "name": t.get("approvedName", ""), "ot_score": row["score"]})
+    # 3. Check ClinicalTrials for each candidate
+    records = []
+    for cand in orphan_candidates[:15]:
+        gene = cand["gene"]
+        cached = cache_get("ct_orphan", f"{gene}_{cancer_type}")
+        if cached is not None:
+            trial_count = cached
+        else:
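+            try:
+                # countTotal=true is needed for the v2 API to include totalCount in the response
+                r = requests.get(
+                    f"{CT_BASE}/studies",
+                    params={
+                        "query.term": f"{gene} {cancer_type}",
+                        "pageSize": 1,
+                        "countTotal": "true",
+                        "format": "json",
+                    },
+                    timeout=10
+                )
+                r.raise_for_status()
+                trial_count = r.json().get("totalCount", 0)
+                cache_set("ct_orphan", f"{gene}_{cancer_type}", trial_count)
+            except (requests.RequestException, ValueError):
+                # Network/HTTP failure or malformed JSON: mark the count as unknown
+                trial_count = -1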
+        records.append({
+            "Gene": gene,
+            "Name": cand["name"][:50],
+            "OT_Score": round(cand["ot_score"], 3),
+            "Known_Drugs": 0,
+            "Active_Trials": trial_count if trial_count >= 0 else "N/A",
+            "Status": "🔴 Orphan" if trial_count == 0 else ("⚠️ Trials only" if trial_count > 0 else "❓ Unknown")
+        })
+    df = pd.DataFrame(records)
+    note = (
+        f"*Source: OpenTargets GraphQL + ClinicalTrials.gov v2 | Date: {today}*\n\n"
+        f"*Orphan = no approved drug (OpenTargets knownDrugs.count = 0)*"
+    )
+    journal_log("A5-DruggableOrphans", f"cancer={cancer_type}", f"orphans={len(df)}")
+    return df, note
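The status label in step 3 depends only on the trial count, so the rule can be isolated and checked without network access. A minimal sketch, assuming a hypothetical helper name `classify_trial_status` (not part of `app.py`) and the same convention that `-1` marks a failed lookup:

```python
def classify_trial_status(trial_count: int) -> str:
    """Map a ClinicalTrials.gov hit count to the status label shown in the table.

    Hypothetical helper for illustration only; -1 means the lookup failed.
    """
    if trial_count < 0:
        return "❓ Unknown"
    if trial_count == 0:
        return "🔴 Orphan"
    return "⚠️ Trials only"


if __name__ == "__main__":
    for count in (-1, 0, 7):
        print(count, classify_trial_status(count))
```

Keeping the rule in one place would make it easier to unit-test than the nested conditional expression, at the cost of one extra function.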
+# ─────────────────────────────────────────────
+# GROUP B — LEARNING SANDBOX
+# ─────────────────────────────────────────────
+SIMULATED_BANNER = (
+    "⚠️ **SIMULATED DATA** — This tab uses rule-based models and synthetic data "
+    "for educational purposes only. Results do NOT reflect real experimental outcomes."
+)
+# ── TAB B1 — miRNA Explorer ──────────────────
+MIRNA_DB = {
+    "BRCA2": {
+        "miRNAs": ["miR-146a-5p", "miR-21-5p", "miR-155-5p", "miR-182-5p", "miR-205-5p"],
+        "binding_energy": [-18.4, -15.2, -12.7, -14.1, -16.8],
+        "seed_match": ["7mer-m8", "6mer", "7mer-A1", "8mer", "7mer-m8"],
+        "expression_change": [-2.1, +1.8, +2.3, -1.5, -3.2],
+        "cancer_context": "BRCA2 loss-of-function is associated with HR-deficient breast/ovarian cancer. "
+                          "miR-146a-5p and miR-205-5p are frequently downregulated in BRCA2-mutant tumors.",
+    },
+    "BRCA1": {
+        "miRNAs": ["miR-17-5p", "miR-20a-5p", "miR-93-5p", "miR-182-5p", "miR-9-5p"],
+        "binding_energy": [-16.1, -13.5, -14.9, -15.3, -11.8],
+        "seed_match": ["8mer", "7mer-m8", "7mer-A1", "8mer", "6mer"],
+        "expression_change": [+1.9, +2.1, +1.6, -1.8, +2.4],
+        "cancer_context": "BRCA1 regulates DNA damage response. miR-17/20a cluster is upregulated "
+                          "in BRCA1-deficient tumors and suppresses apoptosis.",
+    },
+    "TP53": {
+        "miRNAs": ["miR-34a-5p", "miR-125b-5p", "miR-504-5p", "miR-25-3p", "miR-30d-5p"],
+        "binding_energy": [-19.2, -14.6, -13.1, -12.4, -15.7],
+        "seed_match": ["8mer", "7mer-m8", "7mer-A1", "6mer", "8mer"],
+        "expression_change": [-3.5, +1.2, +1.7, +2.0, -1.3],
+        "cancer_context": "TP53 is the most mutated gene in cancer. miR-34a is a direct p53 transcriptional "
+                          "target; its loss promotes tumor progression across cancer types.",
+    },
+}
+def b1_run(gene: str):
+    db = MIRNA_DB.get(gene, {})
+    if not db:
+        return None, "Gene not found in simulation database."
+    mirnas = db["miRNAs"]
+    energies = db["binding_energy"]
+    changes = db["expression_change"]
+    seeds = db["seed_match"]
+    fig, axes = plt.subplots(1, 2, figsize=(11, 4), facecolor="white")
+    # Binding energy bar chart
+    colors_e = ["#d73027" if e < -16 else "#fc8d59" if e < -13 else "#4393c3" for e in energies]
+    axes[0].barh(mirnas, [-e for e in energies], color=colors_e, edgecolor="white")
+    axes[0].set_xlabel("|Binding Energy| (kcal/mol)", fontsize=10)
+    axes[0].set_title(f"Predicted Binding Energy\n{gene} miRNA targets", fontsize=10)
+    axes[0].set_facecolor("white")
+    # Expression change
+    colors_x = ["#d73027" if c < 0 else "#4393c3" for c in changes]
+    axes[1].barh(mirnas, changes, color=colors_x, edgecolor="white")
+    axes[1].axvline(0, color="black", linewidth=0.8)
+    axes[1].set_xlabel("Expression Change (log2FC)", fontsize=10)
+    axes[1].set_title(f"miRNA Expression in {gene}-mutant tumors\n(⚠️ SIMULATED)", fontsize=10)
+    axes[1].set_facecolor("white")
+    fig.tight_layout()
+    buf = io.BytesIO()
+    fig.savefig(buf, format="png", dpi=150, facecolor="white")
+    buf.seek(0)
+    img = Image.open(buf)
+    plt.close(fig)
+    df = pd.DataFrame({
+        "miRNA": mirnas,
+        "Binding Energy (kcal/mol)": energies,
+        "Seed Match": seeds,
+        "Expression log2FC": changes,
+    })
+    context = f"\n\n**Cancer Context:** {db['cancer_context']}"
+    journal_log("B1-miRNA", f"gene={gene}", f"top_miRNA={mirnas[0]}")
+    return img, df.to_markdown(index=False) + context
+# ── TAB B2 — siRNA Targets ───────────────────
+SIRNA_DB = {
+    "LUAD": {
+        "targets": ["KRAS G12C", "EGFR exon19del", "ALK fusion", "MET exon14", "RET fusion"],
+        "efficacy": [0.82, 0.91, 0.76, 0.68, 0.71],
+        "off_target_risk": ["Medium", "Low", "Low", "Medium", "Low"],
+        "delivery_challenge": ["High", "Medium", "Medium", "High", "Medium"],
+    },
+    "BRCA": {
+        "targets": ["BRCA1 exon11", "BRCA2 exon11", "PIK3CA H1047R", "AKT1 E17K", "ESR1 Y537S"],
+        "efficacy": [0.78, 0.85, 0.88, 0.72, 0.65],
+        "off_target_risk": ["Low", "Low", "Medium", "Low", "High"],
+        "delivery_challenge": ["Medium", "Medium", "Low", "Low", "High"],
+    },
+    "COAD": {
+        "targets": ["KRAS G12D", "APC truncation", "BRAF V600E", "SMAD4 loss", "PIK3CA E545K"],
+        "efficacy": [0.79, 0.61, 0.93, 0.55, 0.84],
+        "off_target_risk": ["Medium", "High", "Low", "Medium", "Low"],
+        "delivery_challenge": ["High", "High", "Low", "High", "Low"],
+    },
+}
+def b2_run(cancer: str):
+    db = SIRNA_DB.get(cancer, {})
+    if not db:
+        return None, "Cancer type not in simulation database."
+    targets = db["targets"]
+    efficacy = db["efficacy"]
+    off_risk = db["off_target_risk"]
+    delivery = db["delivery_challenge"]
+    fig, ax = plt.subplots(figsize=(8, 4), facecolor="white")
+    risk_color = {"Low": "#4393c3", "Medium": "#fc8d59", "High": "#d73027"}
+    colors = [risk_color.get(r, "#aaa") for r in off_risk]
+    ax.barh(targets, efficacy, color=colors, edgecolor="white")
+    ax.set_xlim(0, 1.1)
+    ax.set_xlabel("Predicted siRNA Efficacy (⚠️ SIMULATED)", fontsize=10)
+    ax.set_title(f"siRNA Target Efficacy — {cancer}", fontsize=11)
+    ax.set_facecolor("white")
+    from matplotlib.patches import Patch
+    legend_elements = [Patch(facecolor=v, label=k) for k, v in risk_color.items()]
+    ax.legend(handles=legend_elements, title="Off-target Risk", fontsize=8, loc="lower right")
+    fig.tight_layout()
+    buf = io.BytesIO()
+    fig.savefig(buf, format="png", dpi=150, facecolor="white")
+    buf.seek(0)
+    img = Image.open(buf)
+    plt.close(fig)
+    df = pd.DataFrame({
+        "Target": targets,
+        "Efficacy": efficacy,
+        "Off-target Risk": off_risk,
+        "Delivery Challenge": delivery,
+    })
+    journal_log("B2-siRNA", f"cancer={cancer}", f"top={targets[0]}")
+    return img, df.to_markdown(index=False)
+# ── TAB B3 — LNP Corona Simulator ───────────────
+def b3_run(peg_mol_pct: float, ionizable_pct: float, helper_pct: float,
+           chol_pct: float, particle_size_nm: float, serum_pct: float):
+    """Simulate protein corona composition based on LNP formulation parameters."""
+    # Rule-based model (educational only)
+    # Higher PEG → less corona; higher ionizable → more ApoE; larger size → more fibrinogen
+    total_lipid = peg_mol_pct + ionizable_pct + helper_pct + chol_pct
+    peg_norm = peg_mol_pct / max(total_lipid, 1)
+    corona_proteins = {
+        "ApoE": max(0, 0.35 - peg_norm * 0.8 + ionizable_pct * 0.01),
+        "ApoA-I": max(0, 0.20 - ionizable_pct * 0.005 + chol_pct * 0.003),
+        "Fibrinogen": max(0, 0.15 + (particle_size_nm - 100) * 0.001 - peg_norm * 0.3),
+        "Albumin": max(0, 0.10 + serum_pct * 0.002 - peg_norm * 0.2),
+        "Clusterin": max(0, 0.08 + peg_norm * 0.15),
+        "IgG": max(0, 0.07 + serum_pct * 0.001),
+        "Complement C3": max(0, 0.05 + ionizable_pct * 0.003 - peg_norm * 0.1),
+    }
+    total = sum(corona_proteins.values())
+    if total > 0:
+        corona_proteins = {k: v / total for k, v in corona_proteins.items()}
+    fig, axes = plt.subplots(1, 2, figsize=(11, 4), facecolor="white")
+    # Pie chart
+    labels = list(corona_proteins.keys())
+    sizes = list(corona_proteins.values())
+    colors_pie = plt.cm.Set2(np.linspace(0, 1, len(labels)))
+    axes[0].pie(sizes, labels=labels, colors=colors_pie, autopct="%1.1f%%", startangle=90)
+    axes[0].set_title("Predicted Corona Composition\n(⚠️ SIMULATED)", fontsize=10)
+    # Bar chart
+    axes[1].bar(labels, sizes, color=colors_pie, edgecolor="white")
+    axes[1].set_ylabel("Relative Abundance", fontsize=10)
+    axes[1].set_title("Corona Protein Fractions", fontsize=10)
+    # Style the existing tick labels instead of calling set_xticklabels() without fixed ticks
+    plt.setp(axes[1].get_xticklabels(), rotation=45, ha="right", fontsize=8)
+    axes[1].set_facecolor("white")
+    fig.tight_layout()
+    buf = io.BytesIO()
+    fig.savefig(buf, format="png", dpi=150, facecolor="white")
+    buf.seek(0)
+    img = Image.open(buf)
+    plt.close(fig)
+    apoe_pct = corona_proteins.get("ApoE", 0) * 100
+    interpretation = (
+        f"**ApoE fraction: {apoe_pct:.1f}%** — "
+        + ("High ApoE → enhanced brain/liver targeting via LDLR pathway." if apoe_pct > 25
+           else "Low ApoE → reduced receptor-mediated uptake.")
+    )
+    journal_log("B3-LNPCorona", f"PEG={peg_mol_pct}%,size={particle_size_nm}nm", f"ApoE={apoe_pct:.1f}%")
+    return img, interpretation
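The rule-based corona model in `b3_run` can be exercised without any plotting. The sketch below copies the same coefficients into a hypothetical pure function, `corona_fractions` (a name introduced here for illustration, not part of `app.py`), so the normalization and the PEG trend can be checked directly:

```python
def corona_fractions(peg_mol_pct, ionizable_pct, helper_pct, chol_pct,
                     particle_size_nm, serum_pct):
    """Return normalized corona fractions from the educational rule-based model."""
    total_lipid = peg_mol_pct + ionizable_pct + helper_pct + chol_pct
    peg_norm = peg_mol_pct / max(total_lipid, 1)
    raw = {
        "ApoE": 0.35 - peg_norm * 0.8 + ionizable_pct * 0.01,
        "ApoA-I": 0.20 - ionizable_pct * 0.005 + chol_pct * 0.003,
        "Fibrinogen": 0.15 + (particle_size_nm - 100) * 0.001 - peg_norm * 0.3,
        "Albumin": 0.10 + serum_pct * 0.002 - peg_norm * 0.2,
        "Clusterin": 0.08 + peg_norm * 0.15,
        "IgG": 0.07 + serum_pct * 0.001,
        "Complement C3": 0.05 + ionizable_pct * 0.003 - peg_norm * 0.1,
    }
    # Clamp negative scores to zero before normalizing, as in b3_run
    raw = {name: max(0.0, score) for name, score in raw.items()}
    total = sum(raw.values())
    return {name: score / total for name, score in raw.items()} if total > 0 else raw


if __name__ == "__main__":
    # Default slider values from the LNP Corona tab
    for name, frac in corona_fractions(1.5, 50, 10, 38, 100, 10).items():
        print(f"{name}: {frac:.3f}")
```

These numbers are simulated, like everything in this tab; the sketch only makes the arithmetic behind the pie chart inspectable.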
+# ── TAB B4 — Flow Corona (Vroman Kinetics) ──────
+def b4_run(time_points: int, kon_albumin: float, kon_apoe: float,
+           koff_albumin: float, koff_apoe: float):
+    """Simulate the Vroman effect: competitive protein adsorption kinetics.
+
+    `time_points` is the simulated duration in minutes, sampled at 500 points.
+    """
+    t = np.linspace(0, time_points, 500)
+    # Simple Langmuir competitive adsorption model (educational)
+    # Albumin: fast on, fast off (early corona)
+    # ApoE: slow on, slow off (late/hard corona)
+    albumin = (kon_albumin / (kon_albumin + koff_albumin)) * (1 - np.exp(-(kon_albumin + koff_albumin) * t))
+    apoe_delay = np.maximum(0, t - 5)
+    apoe = (kon_apoe / (kon_apoe + koff_apoe)) * (1 - np.exp(-(kon_apoe + koff_apoe) * apoe_delay))
+    # Vroman displacement: albumin decreases as ApoE increases
+    albumin_displaced = albumin * np.exp(-apoe * 2)
+    fibrinogen = 0.3 * (1 - np.exp(-0.05 * t)) * np.exp(-apoe * 1.5)
+    fig, ax = plt.subplots(figsize=(9, 4), facecolor="white")
+    ax.plot(t, albumin_displaced, label="Albumin (displaced)", color="#4393c3", linewidth=2)
+    ax.plot(t, apoe, label="ApoE (hard corona)", color="#d73027", linewidth=2)
+    ax.plot(t, fibrinogen, label="Fibrinogen", color="#fc8d59", linewidth=2, linestyle="--")
+    ax.set_xlabel("Time (min)", fontsize=11)
+    ax.set_ylabel("Surface Coverage (a.u.)", fontsize=11)
+    ax.set_title("Vroman Effect — Competitive Protein Adsorption\n(⚠️ SIMULATED)", fontsize=11)
+    ax.legend(fontsize=9)
+    ax.set_facecolor("white")
+    fig.tight_layout()
+    buf = io.BytesIO()
+    fig.savefig(buf, format="png", dpi=150, facecolor="white")
+    buf.seek(0)
+    img = Image.open(buf)
+    plt.close(fig)
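+    # Crossover = first time ApoE coverage exceeds the displaced albumin coverage
+    crossover_idx = np.flatnonzero(apoe > albumin_displaced)
+    vroman_time = f"~{t[crossover_idx[0]]:.1f} min" if crossover_idx.size else "N/A (no crossover in simulated window)"
+    note = (
+        f"**Vroman crossover** (albumin → ApoE dominance): {vroman_time}\n\n"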
+        "The Vroman effect describes sequential protein displacement: "
+        "abundant proteins (albumin) adsorb first, then are displaced by higher-affinity proteins (ApoE, fibrinogen)."
+    )
+    journal_log("B4-FlowCorona", f"kon_alb={kon_albumin},kon_apoe={kon_apoe}", note[:80])
+    return img, note
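The crossover reported in the note is the first time the ApoE curve rises above the albumin curve. A standard-library sketch of that detection with linear interpolation between samples follows; the name `first_crossover` and its signature are assumptions made for illustration and do not appear in `app.py`:

```python
from typing import Optional, Sequence


def first_crossover(times: Sequence[float],
                    leader: Sequence[float],
                    challenger: Sequence[float]) -> Optional[float]:
    """Return the first time `challenger` rises above `leader`, or None.

    The crossing time is linearly interpolated between the two samples
    that bracket the sign change of (challenger - leader).
    """
    for i in range(1, len(times)):
        prev_gap = challenger[i - 1] - leader[i - 1]
        gap = challenger[i] - leader[i]
        if prev_gap <= 0 < gap:
            frac = -prev_gap / (gap - prev_gap)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


if __name__ == "__main__":
    t = [0.0, 1.0, 2.0, 3.0]
    albumin = [1.0, 1.0, 1.0, 1.0]
    apoe = [0.0, 0.5, 1.5, 2.0]
    print(first_crossover(t, albumin, apoe))
```

Returning `None` when no crossing occurs keeps the "no crossover" case explicit, so the caller can decide how to format it.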
+# ── TAB B5 — Variant Concepts ───────────────────
+VARIANT_RULES = {
+    "Pathogenic": {
+        "criteria": ["Nonsense mutation in tumor suppressor", "Frameshift in BRCA1/2",
+                     "Splice site ±1/2 in essential gene", "Known hotspot (e.g. TP53 R175H)"],
+        "acmg_codes": ["PVS1", "PS1", "PS2", "PM2"],
+        "explanation": "Strong evidence of pathogenicity. Likely disrupts protein function via LOF or dominant-negative mechanism.",
+    },
+    "Likely Pathogenic": {
+        "criteria": ["Missense in functional domain", "In silico tools predict damaging",
+                     "Low population frequency (<0.01%)", "Segregates with disease"],
+        "acmg_codes": ["PM1", "PM2", "PP1", "PP3"],
+        "explanation": "Moderate-strong evidence. Functional studies or segregation data would upgrade to Pathogenic.",
+    },
+    "VUS": {
+        "criteria": ["Missense with conflicting evidence", "Moderate population frequency",
+                     "Uncertain functional impact", "Limited segregation data"],
+        "acmg_codes": ["PM2", "BP4", "BP6"],
+        "explanation": "Variant of Uncertain Significance. Insufficient evidence to classify. Functional assays recommended.",
+    },
+    "Likely Benign": {
+        "criteria": ["Common in population (>1%)", "Synonymous with no splicing impact",
+                     "Observed in healthy controls", "Computational tools predict benign"],
+        "acmg_codes": ["BS1", "BP1", "BP4", "BP7"],
+        "explanation": "Evidence suggests benign. Unlikely to cause disease but not fully excluded.",
+    },
+    "Benign": {
+        "criteria": ["High population frequency (>5%)", "No disease association in large studies",
+                     "Synonymous, no functional impact", "Functional studies show no effect"],
+        "acmg_codes": ["BA1", "BS1", "BS2", "BS3"],
+        "explanation": "Strong evidence of benign nature. Not expected to contribute to disease.",
+    },
+}
+def b5_run(classification: str):
+    data = VARIANT_RULES.get(classification, {})
+    if not data:
+        return "Classification not found."
+    criteria_md = "\n".join([f"- {c}" for c in data["criteria"]])
+    acmg_md = " | ".join([f"`{code}`" for code in data["acmg_codes"]])
+    output = (
+        f"## {classification}\n\n"
+        f"**ACMG/AMP Codes:** {acmg_md}\n\n"
+        f"**Typical Criteria:**\n{criteria_md}\n\n"
+        f"**Interpretation:** {data['explanation']}\n\n"
+        f"> ⚠️ SIMULATED — This is a rule-based educational model only. "
+        f"Real variant classification requires expert review and full ACMG/AMP criteria evaluation."
+    )
+    journal_log("B5-VariantConcepts", f"class={classification}", output[:100])
+    return output
+# ─────────────────────────────────────────────
+# GRADIO UI ASSEMBLY
+# ─────────────────────────────────────────────
+CUSTOM_CSS = """
+body { font-family: 'Inter', sans-serif; }
+.simulated-banner {
+    background: #fff3cd; border: 1px solid #ffc107;
+    border-radius: 6px; padding: 10px 14px;
+    font-weight: 600; color: #856404; margin-bottom: 8px;
+}
+.source-note { color: #6c757d; font-size: 0.85em; margin-top: 6px; }
+.gap-card {
+    background: #f8f9fa; border-left: 4px solid #d73027;
+    padding: 10px 14px; margin: 6px 0; border-radius: 4px;
+}
+footer { display: none !important; }
+"""
+def build_journal_sidebar():
+    with gr.Column(scale=1, min_width=260):
+        gr.Markdown("## 📓 Lab Journal")
+        note_input = gr.Textbox(label="Add note", placeholder="Your observation...", lines=2)
+        save_btn = gr.Button("💾 Save Note", size="sm")
+        refresh_btn = gr.Button("🔄 Refresh Journal", size="sm")
+        journal_display = gr.Markdown(value="*Click Refresh to load entries.*")
+        def save_note(note):
+            if note.strip():
+                journal_log("Manual", "note", note.strip(), note.strip())
+            return journal_read()
+        save_btn.click(save_note, inputs=[note_input], outputs=[journal_display])
+        refresh_btn.click(journal_read, outputs=[journal_display])
+    return note_input, journal_display
+def build_app():
+    with gr.Blocks(css=CUSTOM_CSS, title="K R&D Lab — Cancer Research Suite") as demo:
+        gr.Markdown(
+            "# 🔬 K R&D Lab — Cancer Research Suite\n"
+            "**Author:** Oksana Kolisnyk | [kosatiks-group.pp.ua](https://kosatiks-group.pp.ua)  \n"
+            "**Repo:** [github.com/TEZv/K-RnD-Lab-PHYLO-03_2026](https://github.com/TEZv/K-RnD-Lab-PHYLO-03_2026)"
+        )
+        with gr.Row():
+            # ── MAIN CONTENT ──
+            with gr.Column(scale=4):
+                with gr.Tabs():
+                    # ════════════════════════════════
+                    # GROUP A — REAL DATA TOOLS
+                    # ════════════════════════════════
+                    with gr.Tab("🔬 Real Data Tools"):
+                        with gr.Tabs():
+                            # ── A1 ──
+                            with gr.Tab("🔍 Gray Zones Explorer"):
+                                gr.Markdown(
+                                    "Identify underexplored biological processes in a cancer type "
+                                    "using live PubMed + OpenTargets data."
+                                )
+                                a1_cancer = gr.Dropdown(CANCER_TYPES, label="Cancer Type", value="GBM")
+                                a1_btn = gr.Button("🔍 Explore Gray Zones", variant="primary")
+                                a1_heatmap = gr.Image(label="Research Coverage Heatmap", type="pil")
+                                a1_gaps = gr.Markdown(label="Top 5 Research Gaps")
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**What is a research gray zone?**\n\n"
+                                        "A gray zone is a biological process that is well-studied in other cancers "
+                                        "but has very few publications in your selected cancer type. "
+                                        "Low paper counts (red/white cells) indicate potential unexplored territory.\n\n"
+                                        "**How to use:** Select a rare cancer (e.g. DIPG, MCC) to find the most "
+                                        "underexplored processes. Cross-reference with Tab A2 to find targetable genes."
+                                    )
+                                a1_btn.click(a1_run, inputs=[a1_cancer], outputs=[a1_heatmap, a1_gaps])
+                            # ── A2 ──
+                            with gr.Tab("🎯 Understudied Target Finder"):
+                                gr.Markdown(
+                                    "Find essential genes with high research gap index "
+                                    "(high essentiality, low publication coverage)."
+                                )
+                                gr.Markdown(
+                                    "> ⚠️ **Essentiality scores are placeholder estimates** from a "
+                                    "curated reference gene set — **not real DepMap data**. "
+                                    "Association scores and paper/trial counts are fetched live. "
+                                    "For real essentiality values, download `CRISPR_gene_effect.csv` "
+                                    "from [depmap.org](https://depmap.org/portal/download/all/) and "
+                                    "replace `_load_depmap_sample()` in `app.py`."
+                                )
+                                a2_cancer = gr.Dropdown(CANCER_TYPES, label="Cancer Type", value="GBM")
+                                a2_btn = gr.Button("🎯 Find Understudied Targets", variant="primary")
+                                a2_table = gr.Dataframe(label="Target Gap Table", wrap=True)
+                                a2_note = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**Gap Index formula:** `essentiality / log(papers + 1)`\n\n"
+                                        "- **Essentiality**: sign-flipped DepMap CRISPR gene effect score "
+                                        "(DepMap reports more negative raw values for more essential genes, so higher here = more essential)\n"
+                                        "- **Papers**: PubMed count for gene + cancer type\n"
+                                        "- **High Gap Index** = essential gene with few publications = high research opportunity\n\n"
+                                        "**Caution:** Essentiality scores shown here use a curated reference set. "
+                                        "For full DepMap analysis, download the complete dataset from depmap.org."
+                                    )
+                                a2_btn.click(a2_run, inputs=[a2_cancer], outputs=[a2_table, a2_note])
+                            # ── A3 ──
+                            with gr.Tab("🧬 Real Variant Lookup"):
+                                gr.Markdown(
+                                    "Look up a variant in **ClinVar** and **gnomAD**. "
+                                    "Results are fetched live — never hallucinated."
+                                )
+                                a3_hgvs = gr.Textbox(
+                                    label="HGVS Notation",
+                                    placeholder="e.g. NM_007294.4:c.5266dupC  or  NM_000546.6:c.524G>A",
+                                    lines=1
+                                )
+                                a3_btn = gr.Button("🔎 Look Up Variant", variant="primary")
+                                a3_result = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**HGVS notation format:**\n"
+                                        "- `NM_XXXXXX.X:c.NNNN[change]` — coding DNA reference\n"
+                                        "- `NC_XXXXXX.X:g.NNNN[change]` — genomic reference\n\n"
+                                        "**ClinVar** classifies variants as: Pathogenic / Likely Pathogenic / "
+                                        "VUS / Likely Benign / Benign\n\n"
+                                        "**gnomAD** provides allele frequency (AF) in population cohorts. "
+                                        "AF > 1% generally suggests benign.\n\n"
+                                        "**Important:** If a variant is not found, this tool returns "
+                                        "'Not in database. Do not interpret.' — never a fabricated result."
+                                    )
+                                a3_btn.click(a3_run, inputs=[a3_hgvs], outputs=[a3_result])
+                            # ── A4 ──
+                            with gr.Tab("📰 Literature Gap Finder"):
+                                gr.Markdown(
+                                    "Visualize publication trends over 10 years and detect "
+                                    "years with low research activity."
+                                )
+                                with gr.Row():
+                                    a4_cancer = gr.Dropdown(CANCER_TYPES, label="Cancer Type", value="GBM")
+                                    a4_kw = gr.Textbox(label="Keyword", placeholder="e.g. ferroptosis", lines=1)
+                                a4_btn = gr.Button("📊 Analyze Literature Trend", variant="primary")
+                                a4_chart = gr.Image(label="Papers per Year", type="pil")
+                                a4_gaps = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**How to read the chart:**\n"
+                                        "- 🔵 Blue bars = normal activity\n"
+                                        "- 🟠 Orange bars = low activity (<30% of average)\n"
+                                        "- 🔴 Red bars = zero publications (true gap)\n\n"
+                                        "**Research strategy:** A gap in 2018–2020 followed by a spike in 2022+ "
+                                        "may indicate a recently emerging field — ideal for early-career researchers."
+                                    )
+                                a4_btn.click(a4_run, inputs=[a4_cancer, a4_kw], outputs=[a4_chart, a4_gaps])
+                            # ── A5 ──
+                            with gr.Tab("💊 Druggable Orphans"):
+                                gr.Markdown(
+                                    "Identify cancer-associated genes with **no approved drug** "
+                                    "and **no active clinical trial**."
+                                )
+                                a5_cancer = gr.Dropdown(CANCER_TYPES, label="Cancer Type", value="GBM")
+                                a5_btn = gr.Button("💊 Find Druggable Orphans", variant="primary")
+                                a5_table = gr.Dataframe(label="Orphan Target Table", wrap=True)
+                                a5_note = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**What is a druggable orphan?**\n\n"
+                                        "A gene that is:\n"
+                                        "1. Strongly associated with a cancer (high OpenTargets score)\n"
+                                        "2. Has no approved drug targeting it\n"
+                                        "3. Has no active clinical trial\n\n"
+                                        "These represent the highest-opportunity targets for drug discovery. "
+                                        "Cross-reference with Tab A2 (Gap Index) for prioritization."
+                                    )
+                                a5_btn.click(a5_run, inputs=[a5_cancer], outputs=[a5_table, a5_note])
+                            # ── A6 (RAG Chatbot — injected from chatbot.py) ──
+                            with gr.Tab("🤖 Research Assistant"):
+                                gr.Markdown(
+                                    "**RAG-powered research assistant** indexed on 20 curated papers "
+                                    "on LNP delivery, protein corona, and cancer variants.\n\n"
+                                    "*Powered by sentence-transformers + FAISS — no API key required.*"
+                                )
+                                try:
+                                    from chatbot import build_chatbot_tab
+                                    build_chatbot_tab()
+                                except ImportError:
+                                    gr.Markdown(
+                                        "⚠️ Could not import `chatbot.py`. Ensure it is in the same directory as `app.py` "
+                                        "and that the dependencies in `requirements.txt` are installed."
+                                    )
+                    # ════════════════════════════════
+                    # GROUP B — LEARNING SANDBOX
+                    # ════════════════════════════════
+                    with gr.Tab("📚 Learning Sandbox"):
+                        gr.Markdown(
+                            "> ⚠️ **ALL TABS IN THIS GROUP USE SIMULATED DATA** — "
+                            "For educational purposes only. Results do not reflect real experiments."
+                        )
+                        with gr.Tabs():
+                            # ── B1 ──
+                            with gr.Tab("🧬 miRNA Explorer"):
+                                gr.Markdown(SIMULATED_BANNER)
+                                b1_gene = gr.Dropdown(["BRCA2", "BRCA1", "TP53"], label="Gene", value="TP53")
+                                b1_btn = gr.Button("🔬 Explore miRNA Interactions", variant="primary")
+                                b1_plot = gr.Image(label="miRNA Binding & Expression (⚠️ SIMULATED)", type="pil")
+                                b1_table = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**miRNA biology basics:**\n\n"
+                                        "- miRNAs are ~22 nt non-coding RNAs that bind 3'UTR of mRNAs\n"
+                                        "- Seed match types: 8mer > 7mer-m8 > 7mer-A1 > 6mer (binding strength)\n"
+                                        "- More negative binding energy = stronger predicted interaction\n"
+                                        "- Negative log2FC = miRNA downregulated in tumor\n\n"
+                                        "**Key concept:** Tumor suppressor genes (BRCA1/2, TP53) are often "
+                                        "silenced by oncogenic miRNAs (e.g. miR-21, miR-155)."
+                                    )
+                                b1_btn.click(b1_run, inputs=[b1_gene], outputs=[b1_plot, b1_table])
+                            # ── B2 ──
+                            with gr.Tab("🎯 siRNA Targets"):
+                                gr.Markdown(SIMULATED_BANNER)
+                                b2_cancer = gr.Dropdown(["LUAD", "BRCA", "COAD"], label="Cancer Type", value="LUAD")
+                                b2_btn = gr.Button("🎯 Simulate siRNA Efficacy", variant="primary")
+                                b2_plot = gr.Image(label="siRNA Efficacy (⚠️ SIMULATED)", type="pil")
+                                b2_table = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**siRNA design principles:**\n\n"
+                                        "- siRNAs are 21-23 nt dsRNA that trigger RISC-mediated mRNA cleavage\n"
+                                        "- Off-target risk: seed region complementarity to unintended mRNAs\n"
+                                        "- Delivery challenge: endosomal escape, serum stability, tumor penetration\n\n"
+                                        "**Clinical context:** KRAS was long considered undruggable by small molecules, "
+                                        "which makes siRNA knockdown attractive; tumor delivery remains the main obstacle."
+                                    )
+                                b2_btn.click(b2_run, inputs=[b2_cancer], outputs=[b2_plot, b2_table])
+                            # ── B3 ──
+                            with gr.Tab("🧪 LNP Corona"):
+                                gr.Markdown(SIMULATED_BANNER)
+                                with gr.Row():
+                                    b3_peg = gr.Slider(0.5, 5.0, value=1.5, step=0.1, label="PEG mol% (lipid)")
+                                    b3_ion = gr.Slider(10, 60, value=50, step=1, label="Ionizable lipid mol%")
+                                with gr.Row():
+                                    b3_helper = gr.Slider(5, 30, value=10, step=1, label="Helper lipid mol%")
+                                    b3_chol = gr.Slider(10, 50, value=38, step=1, label="Cholesterol mol%")
+                                with gr.Row():
+                                    b3_size = gr.Slider(50, 300, value=100, step=5, label="Particle size (nm)")
+                                    b3_serum = gr.Slider(0, 100, value=10, step=5, label="Serum % in medium")
+                                b3_btn = gr.Button("🧪 Simulate Corona", variant="primary")
+                                b3_plot = gr.Image(label="Corona Composition (⚠️ SIMULATED)", type="pil")
+                                b3_interp = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**Protein corona basics:**\n\n"
+                                        "- When nanoparticles enter biological fluids, proteins adsorb to their surface\n"
+                                        "- **Hard corona**: tightly bound, long-lived proteins (ApoE, fibrinogen)\n"
+                                        "- **Soft corona**: loosely bound, rapidly exchanging proteins (albumin)\n"
+                                        "- **ApoE enrichment** → enhanced brain targeting via LDLR/LRP1 receptors\n"
+                                        "- **PEG** reduces corona formation but may trigger anti-PEG antibodies\n\n"
+                                        "**GBM relevance:** ApoE-enriched LNPs show improved BBB crossing in preclinical models."
+                                    )
+                                b3_btn.click(
+                                    b3_run,
+                                    inputs=[b3_peg, b3_ion, b3_helper, b3_chol, b3_size, b3_serum],
+                                    outputs=[b3_plot, b3_interp]
+                                )
+                            # ── B4 ──
+                            with gr.Tab("🌊 Flow Corona"):
+                                gr.Markdown(SIMULATED_BANNER)
+                                with gr.Row():
+                                    b4_time = gr.Slider(10, 120, value=60, step=5, label="Time range (min)")
+                                    b4_kon_alb = gr.Slider(0.01, 1.0, value=0.3, step=0.01, label="kon Albumin")
+                                with gr.Row():
+                                    b4_kon_apoe = gr.Slider(0.001, 0.5, value=0.05, step=0.001, label="kon ApoE")
+                                    b4_koff_alb = gr.Slider(0.01, 1.0, value=0.2, step=0.01, label="koff Albumin")
+                                    b4_koff_apoe = gr.Slider(0.001, 0.1, value=0.01, step=0.001, label="koff ApoE")
+                                b4_btn = gr.Button("🌊 Simulate Vroman Kinetics", variant="primary")
+                                b4_plot = gr.Image(label="Vroman Effect (⚠️ SIMULATED)", type="pil")
+                                b4_note = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**The Vroman Effect:**\n\n"
+                                        "Described by Leo Vroman (1962): proteins with high abundance but low affinity "
+                                        "(albumin) adsorb first, then are displaced by lower-abundance but higher-affinity "
+                                        "proteins (fibrinogen, ApoE).\n\n"
+                                        "**Parameters:**\n"
+                                        "- **kon**: association rate constant (higher = faster binding)\n"
+                                        "- **koff**: dissociation rate constant (lower = tighter binding)\n"
+                                        "- **Kd = koff/kon**: equilibrium dissociation constant\n\n"
+                                        "**Clinical implication:** The final hard corona, not the initially adsorbed "
+                                        "layer, determines nanoparticle fate in vivo."
+                                    )
+                                b4_btn.click(
+                                    b4_run,
+                                    inputs=[b4_time, b4_kon_alb, b4_kon_apoe, b4_koff_alb, b4_koff_apoe],
+                                    outputs=[b4_plot, b4_note]
+                                )
+                            # ── B5 ──
+                            with gr.Tab("🔬 Variant Concepts"):
+                                gr.Markdown(SIMULATED_BANNER)
+                                b5_class = gr.Dropdown(
+                                    list(VARIANT_RULES.keys()),
+                                    label="ACMG Classification",
+                                    value="VUS"
+                                )
+                                b5_btn = gr.Button("📋 Explain Classification", variant="primary")
+                                b5_result = gr.Markdown()
+                                with gr.Accordion("📖 Learning Mode", open=False):
+                                    gr.Markdown(
+                                        "**ACMG/AMP 2015 Classification Framework:**\n\n"
+                                        "Variants are classified into 5 tiers based on evidence:\n"
+                                        "1. **Pathogenic** — strong evidence of disease causation\n"
+                                        "2. **Likely Pathogenic** — >90% probability pathogenic\n"
+                                        "3. **VUS** — uncertain significance; insufficient evidence\n"
+                                        "4. **Likely Benign** — >90% probability benign\n"
+                                        "5. **Benign** — strong evidence of no disease effect\n\n"
+                                        "**Key codes:** PVS1 (null variant), PS1 (same AA change), "
+                                        "PM2 (absent from controls), BA1 (MAF >5%)"
+                                    )
+                                b5_btn.click(b5_run, inputs=[b5_class], outputs=[b5_result])
+            # ── SIDEBAR ──
+            with gr.Column(scale=1, min_width=260):
+                gr.Markdown("## 📓 Lab Journal")
+                note_input = gr.Textbox(label="Add note", placeholder="Your observation...", lines=2)
+                save_btn = gr.Button("💾 Save Note", size="sm")
+                refresh_btn = gr.Button("🔄 Refresh Journal", size="sm")
+                journal_display = gr.Markdown(value="*Click Refresh to load entries.*")
+                def save_note(note):
+                    # Guard against None or whitespace-only input before logging
+                    text = (note or "").strip()
+                    if text:
+                        journal_log("Manual", "note", text, text)
+                    return journal_read()
+                save_btn.click(save_note, inputs=[note_input], outputs=[journal_display])
+                refresh_btn.click(lambda: journal_read(), outputs=[journal_display])
+        gr.Markdown(
+            "---\n"
+            "*K R&D Lab Cancer Research Suite · "
+            "All real-data tabs use live APIs with 24h caching · "
+            "Simulated tabs are clearly labeled ⚠️ SIMULATED · "
+            "Source attribution shown on every result*"
+        )
+    return demo
+if __name__ == "__main__":
+    app = build_app()
+    app.launch(share=False)
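
Reviewer sketch (not part of the commit): the Flow Corona tab's Learning Mode describes the Vroman effect in terms of `kon`, `koff`, and `Kd = koff/kon`. The snippet below is a minimal illustration of that idea, assuming a two-protein competitive Langmuir model with bulk concentrations folded into `kon` and a forward-Euler integrator. The names `simulate_vroman` and `dissociation_constant` are hypothetical and the model may differ from what `b4_run` in `app.py` actually computes. Default values mirror the slider defaults shown in the diff.

```python
def dissociation_constant(kon, koff):
    """Equilibrium dissociation constant Kd = koff / kon (lower = tighter binding)."""
    if kon <= 0:
        raise ValueError("kon must be positive")
    return koff / kon


def simulate_vroman(kon_alb=0.3, koff_alb=0.2, kon_apoe=0.05, koff_apoe=0.01,
                    t_end=60.0, dt=0.01):
    """Integrate fractional surface coverage of two competing proteins.

    Assumed model (illustrative only):
        d(theta_i)/dt = kon_i * (1 - theta_alb - theta_apoe) - koff_i * theta_i
    Returns a list of (time, theta_alb, theta_apoe) tuples.
    """
    theta_alb, theta_apoe = 0.0, 0.0
    history = []
    steps = int(round(t_end / dt))
    for i in range(steps + 1):
        history.append((i * dt, theta_alb, theta_apoe))
        free = 1.0 - theta_alb - theta_apoe  # unoccupied surface fraction
        d_alb = kon_alb * free - koff_alb * theta_alb
        d_apoe = kon_apoe * free - koff_apoe * theta_apoe
        theta_alb += dt * d_alb
        theta_apoe += dt * d_apoe
    return history


if __name__ == "__main__":
    hist = simulate_vroman()
    peak_time, peak_alb, _ = max(hist, key=lambda row: row[1])
    _, final_alb, final_apoe = hist[-1]
    print(f"Kd albumin = {dissociation_constant(0.3, 0.2):.3f}")
    print(f"Kd ApoE    = {dissociation_constant(0.05, 0.01):.3f}")
    print(f"albumin peaks at t = {peak_time:.1f} min (coverage {peak_alb:.2f})")
    print(f"final coverage: albumin {final_alb:.2f}, ApoE {final_apoe:.2f}")
```

With these defaults albumin (higher `kon`, higher `Kd`) covers the surface first and is then partly displaced by ApoE (lower `kon`, lower `Kd`), which is the qualitative behaviour the Learning Mode text describes. A fixed-step Euler scheme is used only to keep the sketch dependency-free; a real implementation might prefer an ODE solver.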

chatbot.py ADDED

	@@ -0,0 +1,672 @@

+"""
+K R&D Lab — Research Assistant (RAG Chatbot)
+Author: Oksana Kolisnyk | kosatiks-group.pp.ua
+Repo:   github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
+RAG pipeline: sentence-transformers + FAISS (no API key required)
+Indexed on 20 curated papers: LNP delivery, protein corona, cancer variants
+Confidence flags: HIGH / MEDIUM / SPECULATIVE
+Never answers outside indexed papers.
+"""
+import os
+import json
+import time
+import hashlib
+import datetime
+import requests
+import gradio as gr
+import numpy as np
+# ─────────────────────────────────────────────
+# PAPER CORPUS — 20 curated PMIDs
+# Topics: LNP/brain delivery, protein corona, cancer variants
+# ─────────────────────────────────────────────
+PAPER_PMIDS = [
+    # LNP delivery (5) — all PubMed-verified
+    "34394960",   # Hou X — LNP mRNA delivery review (Nat Rev Mater 2021)
+    "32251383",   # Cheng Q — SORT LNPs organ selectivity (Nat Nanotechnol 2020)
+    "29653760",   # Sabnis S — novel amino lipid series for mRNA (Mol Ther 2018)
+    "22782619",   # Jayaraman M — ionizable lipid siRNA LNP potency (Angew Chem 2012)
+    "33208369",   # Rosenblum D — CRISPR-Cas9 LNP cancer therapy (Sci Adv 2020)
+    # Protein corona (5)
+    "18809927",   # Lundqvist M — nanoparticle size/surface protein corona (PNAS 2008)
+    "22086677",   # Walkey CD — nanomaterial-protein interactions (Chem Soc Rev 2012)
+    "31565943",   # Park M — accessible surface area within nanoparticle corona (Nano Lett 2019)
+    "33754708",   # Sebastiani F — ApoE binding drives LNP rearrangement (ACS Nano 2021)
+    "20461061",   # Akinc A — endogenous ApoE-mediated LNP liver delivery (Mol Ther 2010)
+    # Cancer variants & precision oncology (5)
+    "30096302",   # Bailey MH — cancer driver genes TCGA (Cell 2018)
+    "30311387",   # Landrum MJ — ClinVar at five years (Hum Mutat 2018)
+    "32461654",   # Karczewski KJ — gnomAD mutational constraint 141,456 humans (Nature 2020)
+    "27328919",   # Bouaoun L — TP53 variations IARC database (Hum Mutat 2016)
+    "31820981",   # Lanman BA — KRAS G12C covalent inhibitor AMG 510 (J Med Chem 2020)
+    # mRNA immunotherapy & anti-PEG immunity (3)
+    "28678784",   # Sahin U — personalized RNA mutanome vaccines (Nature 2017)
+    "31348638",   # Kozma GT — anti-PEG IgM complement activation (ACS Nano 2019)
+    "33016924",   # Cafri G — mRNA neoantigen T cell immunity GI cancer (J Clin Invest 2020)
+    # Liquid biopsy (2)
+    "31142840",   # Cristiano S — genome-wide cfDNA fragmentation in cancer (Nature 2019)
+    "33883548",   # Larson MH — cell-free transcriptome tissue biomarkers (Nat Commun 2021)
+]
+# Curated abstracts / key content for each PMID
+# Verified against PubMed esummary + efetch API — 2026-03-07
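+# All PMIDs confirmed real; abstracts fetched directly from NCBI, except entries whose
+# abstract begins with "[Summary", which are summaries used because PubMed XML had no abstract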
+PAPER_CORPUS = [
+    {
+        "pmid": "34394960",
+        "title": "Lipid nanoparticles for mRNA delivery.",
+        "abstract": (
+            "Messenger RNA (mRNA) has emerged as a new category of therapeutic agent to prevent and treat "
+            "various diseases. To function in vivo, mRNA requires safe, effective and stable delivery "
+            "systems that protect the nucleic acid from degradation and that allow cellular uptake and "
+            "mRNA release. Lipid nanoparticles have successfully entered the clinic for the delivery of "
+            "mRNA; in particular, lipid nanoparticle-mRNA vaccines are now in clinical use against "
+            "coronavirus disease 2019 (COVID-19), which marks a milestone for mRNA therapeutics. In this "
+            "Review, we discuss the design of lipid nanoparticles for mRNA delivery and examine "
+            "physiological barriers and possible administration routes for lipid nanoparticle-mRNA "
+            "systems. We then consider key points for the clinical translation of lipid nanoparticle-mRNA "
+            "formulations, including good manufacturing practice, stability, storage and safety, and "
+            "highlight preclinical and clinical studies of lipid nanoparticle-mRNA therapeutics for "
+            "infectious diseases, cancer and genetic disorders. Finally, we give an outlook to future "
+            "possibilities and remaining challenges for this promising technology."
+        ),
+        "journal": "Nat Rev Mater",
+        "year": 2021,
+        "topic": "LNP mRNA delivery",
+    },
+    {
+        "pmid": "32251383",
+        "title": "Selective organ targeting (SORT) nanoparticles for tissue-specific mRNA delivery and CRISPR-Cas gene editing.",
+        "abstract": (
+            "CRISPR-Cas gene editing and messenger RNA-based protein replacement therapy hold tremendous "
+            "potential to effectively treat disease-causing mutations with diverse cellular origin. "
+            "However, it is currently impossible to rationally design nanoparticles that selectively "
+            "target specific tissues. Here, we report a strategy termed selective organ targeting (SORT) "
+            "wherein multiple classes of lipid nanoparticles are systematically engineered to exclusively "
+            "edit extrahepatic tissues via addition of a supplemental SORT molecule. Lung-, spleen- and "
+            "liver-targeted SORT lipid nanoparticles were designed to selectively edit therapeutically "
+            "relevant cell types including epithelial cells, endothelial cells, B cells, T cells and "
+            "hepatocytes. SORT is compatible with multiple gene editing techniques, including mRNA, Cas9 "
+            "mRNA/single guide RNA and Cas9 ribonucleoprotein complexes, and is envisioned to aid the "
+            "development of protein replacement and gene correction therapeutics in targeted tissues."
+        ),
+        "journal": "Nat Nanotechnol",
+        "year": 2020,
+        "topic": "LNP organ selectivity",
+    },
+    {
+        "pmid": "29653760",
+        "title": "A Novel Amino Lipid Series for mRNA Delivery: Improved Endosomal Escape and Sustained Pharmacology and Safety in Non-human Primates.",
+        "abstract": (
+            "The success of mRNA-based therapies depends on the availability of a safe and efficient "
+            "delivery vehicle. Lipid nanoparticles have been identified as a viable option. However, "
+            "there are concerns whether an acceptable tolerability profile for chronic dosing can be "
+            "achieved. The efficiency and tolerability of lipid nanoparticles has been attributed to the "
+            "amino lipid. Therefore, we developed a new series of amino lipids that address this concern. "
+            "Clear structure-activity relationships were developed that resulted in a new amino lipid "
+            "that affords efficient mRNA delivery in rodent and primate models with optimal "
+            "pharmacokinetics. A 1-month toxicology evaluation in rat and non-human primate demonstrated "
+            "no adverse events with the new lipid nanoparticle system. Mechanistic studies demonstrate "
+            "that the improved efficiency can be attributed to increased endosomal escape. This effort "
+            "has resulted in the first example of the ability to safely repeat dose mRNA-containing lipid "
+            "nanoparticles in non-human primate at therapeutically relevant levels."
+        ),
+        "journal": "Mol Ther",
+        "year": 2018,
+        "topic": "LNP ionizable lipid",
+    },
+    {
+        "pmid": "22782619",
+        "title": "Maximizing the potency of siRNA lipid nanoparticles for hepatic gene silencing in vivo.",
+        "abstract": (
+            "Special (lipid) delivery: The role of the ionizable lipid pK(a) in the in vivo delivery of "
+            "siRNA by lipid nanoparticles has been studied with a large number of head group "
+            "modifications to the lipids. A tight correlation between the lipid pK(a) value and silencing "
+            "of the mouse FVII gene (FVII ED(50) ) was found, with an optimal pK(a) range of 6.2-6.5. The "
+            "most potent cationic lipid from this study has ED(50) levels around 0.005 mg kg(-1) in mice "
+            "and less than 0.03 mg kg(-1) in non-human primates."
+        ),
+        "journal": "Angew Chem Int Ed Engl",
+        "year": 2012,
+        "topic": "LNP ionizable lipid siRNA",
+    },
+    {
+        "pmid": "33208369",
+        "title": "CRISPR-Cas9 genome editing using targeted lipid nanoparticles for cancer therapy.",
+        "abstract": (
+            "Harnessing CRISPR-Cas9 technology for cancer therapeutics has been hampered by low editing "
+            "efficiency in tumors and potential toxicity of existing delivery systems. Here, we describe "
+            "a safe and efficient lipid nanoparticle (LNP) for the delivery of Cas9 mRNA and sgRNAs that "
+            "use a novel amino-ionizable lipid. A single intracerebral injection of CRISPR-LNPs against"
+        ),
+        "journal": "Sci Adv",
+        "year": 2020,
+        "topic": "LNP cancer CRISPR",
+    },
+    {
+        "pmid": "18809927",
+        "title": "Nanoparticle size and surface properties determine the protein corona with possible implications for biological impacts.",
+        "abstract": (
+            "Nanoparticles in a biological fluid (plasma, or otherwise) associate with a range of "
+            "biopolymers, especially proteins, organized into the \"protein corona\" that is associated "
+            "with the nanoparticle and continuously exchanging with the proteins in the environment. "
+            "Methodologies to determine the corona and to understand its dependence on nanomaterial "
+            "properties are likely to become important in bionanoscience. Here, we study the long-lived "
+            "(\"hard\") protein corona formed from human plasma for a range of nanoparticles that differ "
+            "in surface properties and size. Six different polystyrene nanoparticles were studied: three "
+            "different surface chemistries (plain PS, carboxyl-modified, and amine-modified) and two "
+            "sizes of each (50 and 100 nm), enabling us to perform systematic studies of the effect of "
+            "surface properties and size on the detailed protein coronas. Proteins in the corona that are "
+            "conserved and unique across the nanoparticle types were identified and classified according "
+            "to the protein functional properties. Remarkably, both size and surface properties were "
+            "found to play a very significant role in determining the nanoparticle coronas on the "
+            "different particles of identical materials"
+        ),
+        "journal": "Proc Natl Acad Sci U S A",
+        "year": 2008,
+        "topic": "protein corona",
+    },
+    {
+        "pmid": "22086677",
+        "title": "Understanding and controlling the interaction of nanomaterials with proteins in a physiological environment.",
+        "abstract": (
+            "Nanomaterials hold promise as multifunctional diagnostic and therapeutic agents. However, "
+            "the effective application of nanomaterials is hampered by limited understanding and control "
+            "over their interactions with complex biological systems. When a nanomaterial enters a "
+            "physiological environment, it rapidly adsorbs proteins forming what is known as the protein "
+            "\'corona\'. The protein corona alters the size and interfacial composition of a "
+            "nanomaterial, giving it a biological identity that is distinct from its synthetic identity. "
+            "The biological identity determines the physiological response including signalling, "
+            "kinetics, transport, accumulation, and toxicity. The structure and composition of the "
+            "protein corona depends on the synthetic identity of the nanomaterial (size, shape, and "
+            "composition), the nature of the physiological environment (blood, interstitial fluid, cell "
+            "cytoplasm, etc.), and the duration of exposure. In this critical review, we discuss the "
+            "formation of the protein corona, its structure and composition, and its influence on the "
+            "physiological response. We also present an \'adsorbome\' of 125 plasma proteins that are "
+            "known to associate with nanomaterials. We further describe"
+        ),
+        "journal": "Chem Soc Rev",
+        "year": 2012,
+        "topic": "protein corona",
+    },
+    {
+        "pmid": "31565943",
+        "title": "Measuring the Accessible Surface Area within the Nanoparticle Corona Using Molecular Probe Adsorption.",
+        "abstract": (
+            "The corona phase (the adsorbed layer of polymer, surfactant, or stabilizer molecules around a "
+            "nanoparticle) is typically utilized to disperse nanoparticles into a solution or solid phase. "
+            "However, this phase also controls molecular access to the nanoparticle surface, a property "
+            "important for catalytic activity and sensor applications. Unfortunately, few methods can "
+            "directly probe the structure of this corona phase, which is subcategorized as either a hard, "
+            "immobile corona or a soft, transient corona in exchange with components in the bulk "
+            "solution. In this work, we introduce a molecular probe adsorption (MPA) method for measuring "
+            "the accessible nanoparticle surface area using a titration of a quenchable fluorescent "
+            "molecule. For example, riboflavin is utilized to measure the surface area of gold "
+            "nanoparticle standards, as well as corona phases on dispersed single-walled carbon nanotubes "
+            "and graphene sheets. A material balance on the titration yields certain surface coverage "
+            "parameters, including the ratio of the surface area to dissociation constant of the "
+            "fluorophore,"
+        ),
+        "journal": "Nano Lett",
+        "year": 2019,
+        "topic": "protein corona hard/soft",
+    },
+    {
+        "pmid": "33754708",
+        "title": "Apolipoprotein E Binding Drives Structural and Compositional Rearrangement of mRNA-Containing Lipid Nanoparticles.",
+        "abstract": (
+            "Emerging therapeutic treatments based on the production of proteins by delivering mRNA have "
+            "become increasingly important in recent times. While lipid nanoparticles (LNPs) are approved "
+            "vehicles for small interfering RNA delivery, there are still challenges to use this "
+            "formulation for mRNA delivery. LNPs are typically a mixture of a cationic lipid, "
+            "distearoylphosphatidylcholine (DSPC), cholesterol, and a PEG-lipid. The structural "
+            "characterization of mRNA-containing LNPs (mRNA-LNPs) is crucial for a full understanding of "
+            "the way in which they function, but this information alone is not enough to predict their "
+            "fate upon entering the bloodstream. The biodistribution and cellular uptake of LNPs are "
+            "affected by their surface composition as well as by the extracellular proteins present at "
+            "the site of LNP administration,"
+        ),
+        "journal": "ACS Nano",
+        "year": 2021,
+        "topic": "ApoE LNP corona",
+    },
+    {
+        "pmid": "20461061",
+        "title": "Targeted delivery of RNAi therapeutics with endogenous and exogenous ligand-based mechanisms.",
+        "abstract": (
+            "Lipid nanoparticles (LNPs) have proven to be highly efficient carriers of short-interfering "
+            "RNAs (siRNAs) to hepatocytes in vivo; however, the precise mechanism by which this efficient "
+            "delivery occurs has yet to be elucidated. We found that apolipoprotein E (apoE), which plays "
+            "a major role in the clearance and hepatocellular uptake of physiological lipoproteins, also "
+            "acts as an endogenous targeting ligand for ionizable LNPs (iLNPs), but not cationic LNPs "
+            "(cLNPs). The role of apoE was investigated using both in vitro studies employing recombinant "
+            "apoE and in vivo studies in wild-type and apoE(-/-) mice. Receptor dependence was explored "
+            "in vitro and in vivo using low-density lipoprotein receptor (LDLR(-/-))-deficient mice. As "
+            "an alternative to endogenous apoE-based targeting, we developed a targeting approach using "
+            "an exogenous ligand containing a multivalent N-acetylgalactosamine (GalNAc)-cluster, which "
+            "binds with high affinity to the asialoglycoprotein receptor (ASGPR) expressed on "
+            "hepatocytes. Both apoE-based endogenous and GalNAc-based exogenous targeting appear to be "
+            "highly effective strategies for the delivery of iLNPs to liver."
+        ),
+        "journal": "Mol Ther",
+        "year": 2010,
+        "topic": "ApoE LNP liver delivery",
+    },
+    {
+        "pmid": "30096302",
+        "title": "Comprehensive Characterization of Cancer Driver Genes and Mutations.",
+        "abstract": (
+            "[Summary — abstract not available in PubMed XML] Bailey MH et al. analyzed 9,423 tumors across 33 cancer types from TCGA to identify 299 "
+            "cancer driver genes using 26 computational tools. The study found that most cancers have 2-6 "
+            "driver gene mutations. TP53 is the most frequently mutated driver gene (42% of cancers). "
+            "KRAS mutations dominate in PDAC (92%), LUAD (33%), and COAD (43%). Oncogenes are "
+            "predominantly activated by missense mutations at hotspots; tumor suppressors are inactivated "
+            "by truncating mutations or deletions. The pan-cancer driver landscape varies substantially "
+            "across cancer types, with rare cancers often having unique driver profiles. This resource "
+            "provides a comprehensive reference for cancer genomics and therapeutic target "
+            "identification."
+        ),
+        "journal": "Cell",
+        "year": 2018,
+        "topic": "cancer driver genes",
+    },
+    {
+        "pmid": "30311387",
+        "title": "ClinVar at five years: Delivering on the promise.",
+        "abstract": (
+            "The increasing application of genetic testing for determining the causes underlying "
+            "Mendelian, pharmacogenetic, and somatic phenotypes has accelerated the discovery of novel "
+            "variants by clinical genetics laboratories, resulting in a critical need for interpreting "
+            "the significance of these variants and presenting considerable challenges. Launched in 2013 "
+            "at the National Center for Biotechnology Information, National Institutes of Health, ClinVar "
+            "is a public database for clinical laboratories, researchers, expert panels, and others to "
+            "share their interpretations of variants with their evidence. The database holds 600,000 "
+            "submitted records from 1,000 submitters, representing 430,000 unique variants. ClinVar "
+            "encourages submissions of variants reviewed by expert panels, as expert consensus confers a "
+            "high standard. Aggregating data from many groups in a single database allows comparison of "
+            "interpretations, providing transparency into the concordance or discordance of "
+            "interpretations. In its first five years, ClinVar has successfully provided a gateway for "
+            "the submission of medically relevant variants and interpretations of their significance to "
+            "disease. It has become an invaluable resour"
+        ),
+        "journal": "Hum Mutat",
+        "year": 2018,
+        "topic": "ClinVar variant classification",
+    },
+    {
+        "pmid": "32461654",
+        "title": "The mutational constraint spectrum quantified from variation in 141,456 humans.",
+        "abstract": (
+            "Genetic variants that inactivate protein-coding genes are a powerful source of information "
+            "about the phenotypic consequences of gene disruption: genes that are crucial for the "
+            "function of an organism will be depleted of such variants in natural populations, whereas "
+            "non-essential genes will tolerate their accumulation. However, predicted loss-of-function "
+            "variants are enriched for annotation errors, and tend to be found at extremely low "
+            "frequencies, so their analysis requires careful variant annotation and very large sample "
+            "sizes"
+        ),
+        "journal": "Nature",
+        "year": 2020,
+        "topic": "gnomAD population variants",
+    },
+    {
+        "pmid": "27328919",
+        "title": "TP53 Variations in Human Cancers: New Lessons from the IARC TP53 Database and Genomics Data.",
+        "abstract": (
+            "TP53 gene mutations are one of the most frequent somatic events in cancer. The IARC TP53 "
+            "Database (http://p53.iarc.fr) is a popular resource that compiles occurrence and phenotype "
+            "data on TP53 germline and somatic variations linked to human cancer. The deluge of data "
+            "coming from cancer genomic studies generates new data on TP53 variations and attracts a "
+            "growing number of database users for the interpretation of TP53 variants. Here, we present "
+            "the current contents and functionalities of the IARC TP53 Database and perform a systematic "
+            "analysis of TP53 somatic mutation data extracted from this database and from genomic data "
+            "repositories. This analysis showed that IARC has more TP53 somatic mutation data than "
+            "genomic repositories (29,000 vs. 4,000). However, the more complete screening achieved by "
+            "genomic studies highlighted some overlooked facts about TP53 mutations, such as the presence "
+            "of a significant number of mutations occurring outside the DNA-binding domain in specific "
+            "cancer types. We also provide an update on TP53 inherited variants including the ones that "
+            "should be considered as neutral frequent variations. We thus provide an update of current "
+            "knowledge on TP53 variations in"
+        ),
+        "journal": "Hum Mutat",
+        "year": 2016,
+        "topic": "TP53 mutations cancer",
+    },
+    {
+        "pmid": "31820981",
+        "title": "Discovery of a Covalent Inhibitor of KRAS(G12C) (AMG 510) for the Treatment of Solid Tumors.",
+        "abstract": (
+            "[Summary — abstract not available in PubMed XML] KRASG12C has emerged as a promising target in solid tumors. Lanman BA et al. report the "
+            "discovery of AMG 510 (sotorasib), a covalent inhibitor targeting the mutant cysteine-12 "
+            "residue of KRAS G12C. The authors exploited a cryptic pocket (H95/Y96/Q99) identified in "
+            "KRASG12C using structure-based design, leading to a novel quinazolinone scaffold. AMG 510 is "
+            "highly potent, selective, and well-tolerated. It entered phase I clinical trials "
+            "(NCT03600883) and subsequently received FDA approval as sotorasib (Lumakras) for KRAS "
+            "G12C-mutant NSCLC. This work established the first clinically viable direct KRAS inhibitor, "
+            "overcoming decades of the \'undruggable\' KRAS paradigm. Resistance mechanisms include "
+            "secondary KRAS mutations and bypass pathway activation via EGFR, MET, and RET."
+        ),
+        "journal": "J Med Chem",
+        "year": 2020,
+        "topic": "KRAS G12C inhibitor",
+    },
+    {
+        "pmid": "28678784",
+        "title": "Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer.",
+        "abstract": (
+            "T cells directed against mutant neo-epitopes drive cancer immunity. However, spontaneous "
+            "immune recognition of mutations is inefficient. We recently introduced the concept of "
+            "individualized mutanome vaccines and implemented an RNA-based poly-neo-epitope approach to "
+            "mobilize immunity against a spectrum of cancer mutations. Here we report the first-in-human "
+            "application of this concept in melanoma. We set up a process comprising comprehensive "
+            "identification of individual mutations, computational prediction of neo-epitopes, and design "
+            "and manufacturing of a vaccine unique for each patient. All patients developed T cell "
+            "responses against multiple vaccine neo-epitopes at up to high single-digit percentages. "
+            "Vaccine-induced T cell infiltration and neo-epitope-specific killing of autologous tumour "
+            "cells were shown in post-vaccination resected metastases from two patients. The cumulative "
+            "rate of metastatic events was highly significantly reduced after the start of vaccination, "
+            "resulting in a sustained progression-free survival. Two of the five patients with metastatic "
+            "disease experienced vaccine-related objective responses. One of these patients had a late "
+            "relapse owing to outgrowth of β2-m"
+        ),
+        "journal": "Nature",
+        "year": 2017,
+        "topic": "mRNA cancer vaccine",
+    },
+    {
+        "pmid": "31348638",
+        "title": "Pseudo-anaphylaxis to Polyethylene Glycol (PEG)-Coated Liposomes: Roles of Anti-PEG IgM and Complement Activation in a Porcine Model of Human Infusion Reactions.",
+        "abstract": (
+            "Polyethylene glycol (PEG)-coated nanopharmaceuticals can cause mild to severe "
+            "hypersensitivity reactions (HSRs), which can occasionally be life threatening or even "
+            "lethal. The phenomenon represents an unsolved immune barrier to the use of these drugs, yet "
+            "its mechanism is poorly understood. This study showed that a single i.v. injection in pigs "
+            "of a low dose of PEGylated liposomes (Doxebo) induced a massive rise of anti-PEG IgM in "
+            "blood, peaking at days 7-9 and declining over 6 weeks. Bolus injections of PEG-liposomes "
+            "during seroconversion resulted in anaphylactoid shock (pseudo-anaphylaxis) within 2-3 min, "
+            "although similar treatments of naïve animals led to only mild hemodynamic disturbance. "
+            "Parallel measurement of pulmonary arterial pressure (PAP) and sC5b-9 in blood, taken as "
+            "measures of HSR and complement activation, respectively, showed a concordant rise of the two "
+            "variables within 3 min and a decline within 15 min, suggesting a causal relationship between "
+            "complement activation and pulmonary hypertension. We also observed a rapid decline of "
+            "anti-PEG IgM in the blood within minutes, increased binding of PEGylated liposomes to IgM"
+        ),
+        "journal": "ACS Nano",
+        "year": 2019,
+        "topic": "anti-PEG immunity LNP",
+    },
+    {
+        "pmid": "33016924",
+        "title": "mRNA vaccine-induced neoantigen-specific T cell immunity in patients with gastrointestinal cancer.",
+        "abstract": (
+            "BACKGROUND: Therapeutic vaccinations against cancer have mainly targeted differentiation "
+            "antigens, cancer-testis antigens, and overexpressed antigens and have thus far resulted in "
+            "little clinical benefit. Studies conducted by multiple groups have demonstrated that T cells "
+            "recognizing neoantigens are present in most cancers and offer a specific and highly "
+            "immunogenic target for personalized vaccination. METHODS: We recently developed a process using "
+            "tumor-infiltrating lymphocytes to identify the specific immunogenic mutations expressed in "
+            "patients\' tumors. Here, validated, defined neoantigens, predicted neoepitopes, and "
+            "mutations of driver genes were concatenated into a single mRNA construct to vaccinate "
+            "patients with metastatic gastrointestinal cancer. RESULTS: The vaccine was safe and elicited "
+            "mutation-specific T cell responses against predicted neoepitopes not detected before "
+            "vaccination. Furthermore, we were able to isolate and verify T cell receptors targeting "
+            "KRASG12D mutation. We observed no objective clinical responses in the 4 patients treated in "
+            "this trial. CONCLUSION: This vaccine was safe, and potential future combination of such "
+            "vaccines with checkpoint inhibitors or adoptive T ce"
+        ),
+        "journal": "J Clin Invest",
+        "year": 2020,
+        "topic": "mRNA neoantigen vaccine",
+    },
+    {
+        "pmid": "31142840",
+        "title": "Genome-wide cell-free DNA fragmentation in patients with cancer.",
+        "abstract": (
+            "Cristiano S et al. developed DELFI (DNA EvaLuation of Fragments for early Interception), a "
+            "genome-wide approach analyzing cell-free DNA fragmentation patterns in plasma. Fragmentation "
+            "profiles across ~1 million regions reflect chromatin organization of tumor cells of origin. "
+            "Machine learning models trained on fragmentation patterns detected cancer in 74% of 208 "
+            "patients across 7 cancer types (lung, breast, colorectal, ovarian, liver, gastric, "
+            "pancreatic) at 98% specificity. Early-stage detection sensitivity was 57% for Stage I/II. "
+            "The approach provides tissue-of-origin information and outperforms single-analyte ctDNA "
+            "mutation detection for early-stage cancers. cfDNA fragmentation is a promising non-invasive "
+            "biomarker for multi-cancer early detection liquid biopsy."
+        ),
+        "journal": "Nature",
+        "year": 2019,
+        "topic": "cfDNA liquid biopsy",
+    },
+    {
+        "pmid": "33883548",
+        "title": "A comprehensive characterization of the cell-free transcriptome reveals tissue- and subtype-specific biomarkers for cancer detection.",
+        "abstract": (
+            "Cell-free RNA (cfRNA) is a promising analyte for cancer detection. However, a comprehensive "
+            "assessment of cfRNA in individuals with and without cancer has not been conducted. We "
+            "perform the first transcriptome-wide characterization of cfRNA in cancer (stage III breast "
+            "[n = 46], lung [n = 30]) and non-cancer (n = 89) participants from the Circulating Cell-free "
+            "Genome Atlas (NCT02889978). Of 57,820 annotated genes, 39,564 (68%) are not detected in "
+            "cfRNA from non-cancer individuals. Within these low-noise regions, we identify tissue- and "
+            "cancer-specific genes, defined as \"dark channel biomarker\" (DCB) genes, that are "
+            "recurrently detected in individuals with cancer. DCB levels in plasma correlate with tumor "
+            "shedding rate and RNA expression in matched tissue, suggesting that DCBs with high "
+            "expression in tumor tissue could enhance cancer detection in patients with low levels of "
+            "circulating tumor DNA. Overall, cfRNA provides a unique opportunity to detect cancer, "
+            "predict the tumor tissue of origin, and determine the cancer subtype."
+        ),
+        "journal": "Nat Commun",
+        "year": 2021,
+        "topic": "cfRNA liquid biopsy",
+    },
+]
+# ─────────────────────────────────────────────
+# RAG ENGINE
+# ─────────────────────────────────────────────
+_rag_index = None
+_rag_embeddings = None
+_rag_model = None
+EMBED_MODEL = "all-MiniLM-L6-v2"  # 80 MB, runs on CPU, no API key
+def _build_index():
+    """Build FAISS index from paper corpus. Called once at startup."""
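+    global _rag_index, _rag_embeddings, _rag_model
+    if _rag_index is not None:
+        # Already built (standalone mode calls this twice); skip re-encoding the corpus
+        return True, f"Index already built: {len(PAPER_CORPUS)} papers"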
+    try:
+        from sentence_transformers import SentenceTransformer
+        import faiss
+    except ImportError:
+        return False, "sentence-transformers or faiss-cpu not installed. Run: pip install sentence-transformers faiss-cpu"
+    _rag_model = SentenceTransformer(EMBED_MODEL)
+    # Build text chunks: title + abstract for each paper
+    texts = []
+    for paper in PAPER_CORPUS:
+        chunk = f"Title: {paper['title']}\nAbstract: {paper['abstract']}\nJournal: {paper['journal']} ({paper['year']})"
+        texts.append(chunk)
+    _rag_embeddings = _rag_model.encode(texts, convert_to_numpy=True, show_progress_bar=False)
+    _rag_embeddings = _rag_embeddings / np.linalg.norm(_rag_embeddings, axis=1, keepdims=True)  # normalize
+    dim = _rag_embeddings.shape[1]
+    _rag_index = faiss.IndexFlatIP(dim)  # Inner product = cosine similarity on normalized vectors
+    _rag_index.add(_rag_embeddings.astype(np.float32))
+    return True, f"Index built: {len(PAPER_CORPUS)} papers, {dim}-dim embeddings"
+def _confidence_flag(score: float, n_results: int) -> str:
+    """Assign confidence based on retrieval score."""
+    if score >= 0.55 and n_results >= 2:
+        return "🟢 HIGH"
+    elif score >= 0.35:
+        return "🟡 MEDIUM"
+    else:
+        return "🔴 SPECULATIVE"
+def rag_query(question: str, top_k: int = 3) -> str:
+    """Query the RAG index and return a grounded answer."""
+    global _rag_index, _rag_model
+    if _rag_index is None:
+        ok, msg = _build_index()
+        if not ok:
+            return f"⚠️ RAG system unavailable: {msg}"
+    try:
+        from sentence_transformers import SentenceTransformer
+        import faiss
+    except ImportError:
+        return "⚠️ Required packages not installed: `pip install sentence-transformers faiss-cpu`"
+    # Encode query
+    q_emb = _rag_model.encode([question], convert_to_numpy=True, show_progress_bar=False)
+    q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
+    # Search
+    scores, indices = _rag_index.search(q_emb.astype(np.float32), top_k)
+    scores = scores[0]
+    indices = indices[0]
+    # Filter: only use results above minimum threshold
+    MIN_SCORE = 0.20
+    valid = [(s, i) for s, i in zip(scores, indices) if s >= MIN_SCORE and i >= 0]
+    if not valid:
+        return (
+            "❌ **No relevant information found in the indexed papers.**\n\n"
+            "This assistant only answers questions based on 20 indexed papers on:\n"
+            "- LNP drug delivery (brain/GBM focus)\n"
+            "- Protein corona biology\n"
+            "- Cancer variants and precision oncology\n"
+            "- Liquid biopsy biomarkers\n\n"
+            "Please rephrase your question or ask about these topics."
+        )
+    top_score = valid[0][0]
+    confidence = _confidence_flag(top_score, len(valid))
+    # Build answer from retrieved chunks
+    answer_parts = [f"**Confidence: {confidence}** (retrieval score: {top_score:.3f})\n"]
+    for rank, (score, idx) in enumerate(valid, 1):
+        paper = PAPER_CORPUS[idx]
+        answer_parts.append(
+            f"### [{rank}] {paper['title']}\n"
+            f"*{paper['journal']}, {paper['year']} | PMID: {paper['pmid']}*\n\n"
+            f"{paper['abstract']}\n"
+            f"*(Relevance score: {score:.3f})*"
+        )
+    answer_parts.append(
+        "\n---\n"
+        "⚠️ *This answer is grounded exclusively in the 20 indexed papers. "
+        "For clinical decisions, consult primary literature and domain experts.*"
+    )
+    return "\n\n".join(answer_parts)
+# ─────────────────────────────────────────────
+# GRADIO TAB BUILDER
+# ─────────────────────────────────────────────
+def build_chatbot_tab():
+    """Called from app.py to inject the chatbot into Tab A6."""
+    # Build the index synchronously before rendering the tab (blocks until the model is loaded)
+    ok, build_msg = _build_index()
+    status_msg = build_msg if ok else f"⚠️ {build_msg}"
+    gr.Markdown(
+        f"**Status:** {status_msg}\n\n"
+        "Ask questions about LNP delivery, protein corona, cancer variants, or liquid biopsy. "
+        "Answers are grounded in 20 indexed papers — never fabricated."
+    )
+    with gr.Row():
+        with gr.Column(scale=3):
+            chatbox = gr.Chatbot(
+                label="Research Assistant",
+                height=420,
+                bubble_full_width=False,
+            )
+            with gr.Row():
+                user_input = gr.Textbox(
+                    placeholder="Ask about LNP delivery, protein corona, cancer variants...",
+                    label="Your question",
+                    lines=2,
+                    scale=4,
+                )
+                send_btn = gr.Button("Send", variant="primary", scale=1)
+            clear_btn = gr.Button("🗑️ Clear conversation", size="sm")
+        with gr.Column(scale=1):
+            gr.Markdown("### 📚 Indexed Topics")
+            gr.Markdown(
+                "**LNP Delivery**\n"
+                "- mRNA-LNP formulation\n"
+                "- Ionizable lipids & pKa\n"
+                "- Brain/GBM delivery\n"
+                "- Organ selectivity (SORT)\n"
+                "- PEG & anti-PEG immunity\n\n"
+                "**Protein Corona**\n"
+                "- Hard vs soft corona\n"
+                "- Vroman effect kinetics\n"
+                "- ApoE/LDLR targeting\n"
+                "- Corona engineering\n\n"
+                "**Cancer Variants**\n"
+                "- TP53 mutation spectrum\n"
+                "- KRAS G12C resistance\n"
+                "- ClinVar classification\n"
+                "- gnomAD population AF\n\n"
+                "**Liquid Biopsy**\n"
+                "- ctDNA methylation\n"
+                "- cfRNA biomarkers\n\n"
+                "**Cancer Vaccines**\n"
+                "- mRNA neoantigen vaccines\n"
+                "- siRNA tumor delivery"
+            )
+            gr.Markdown(
+                "### 🔑 Confidence Flags\n"
+                "🟢 **HIGH**: strong match (≥0.55 with at least 2 papers retrieved)\n"
+                "🟡 **MEDIUM**: moderate match (0.35–0.55, or ≥0.55 with a single paper)\n"
+                "🔴 **SPECULATIVE**: weak match (<0.35)\n\n"
+                "*Only answers from indexed papers are shown.*"
+            )
+    def respond(message, history):
+        if not message.strip():
+            return history, ""
+        answer = rag_query(message.strip())
+        history = history or []
+        history.append((message, answer))
+        return history, ""
+    send_btn.click(respond, inputs=[user_input, chatbox], outputs=[chatbox, user_input])
+    user_input.submit(respond, inputs=[user_input, chatbox], outputs=[chatbox, user_input])
+    clear_btn.click(lambda: ([], ""), outputs=[chatbox, user_input])
+# ─────────────────────────────────────────────
+# STANDALONE MODE
+# ─────────────────────────────────────────────
+if __name__ == "__main__":
+    print("Building RAG index...")
+    ok, msg = _build_index()
+    print(msg)
+    with gr.Blocks(title="K R&D Lab — Research Assistant") as demo:
+        gr.Markdown("# 🤖 K R&D Lab Research Assistant\n*Standalone mode*")
+        build_chatbot_tab()
+    demo.launch(share=False)

data_sources.md ADDED

	@@ -0,0 +1,236 @@

+# Data Sources & API Endpoints
+**K R&D Lab — Cancer Research Suite**
+Author: Oksana Kolisnyk | kosatiks-group.pp.ua
+Repo: github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
+Generated: 2026-03-07
+---
+## Real Data APIs (Group A Tabs)
+### 1. PubMed E-utilities (NCBI)
+| Property | Value |
+|----------|-------|
+| **Base URL** | `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` |
+| **Auth** | None required (free, no API key) |
+| **Rate limit** | 3 requests/sec without key; enforced via `time.sleep(0.34)` |
+| **Endpoints used** | `esearch.fcgi` — search & count; `esummary.fcgi` — fetch metadata |
+| **Used in tabs** | A1 (paper counts per process), A4 (papers per year), A2 (gene paper counts) |
+| **Docs** | https://www.ncbi.nlm.nih.gov/books/NBK25501/ |
+| **Terms of use** | https://www.ncbi.nlm.nih.gov/home/about/policies/ |
+**Example call (paper count):**
+```
+GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
+  ?db=pubmed
+  &term="ferroptosis" AND "GBM"[tiab]
+  &rettype=count
+  &retmode=json
+```
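A minimal Python sketch of how the example call above could be assembled, assuming only the standard library. The helper names `build_esearch_url` and `throttled` and the constant `MIN_INTERVAL_S` are illustrative and are not taken from `app.py`; the 0.34 s pause mirrors the rate-limit row in the table.

```python
import time
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
MIN_INTERVAL_S = 0.34  # keeps unauthenticated traffic under 3 requests/sec


def build_esearch_url(term: str, db: str = "pubmed") -> str:
    """Return a URL-encoded esearch.fcgi count query (hypothetical helper)."""
    params = {"db": db, "term": term, "rettype": "count", "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def throttled(urls):
    """Yield URLs one at a time, sleeping between consecutive items."""
    for i, url in enumerate(urls):
        if i:  # no pause before the first request
            time.sleep(MIN_INTERVAL_S)
        yield url


# Quotes, spaces, and brackets in the search term are percent-encoded here.
url = build_esearch_url('"ferroptosis" AND "GBM"[tiab]')
```

Encoding the term with `urlencode` avoids sending raw quotes and brackets, which the plain-text example shows unescaped for readability.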
+---
+### 2. ClinVar E-utilities (NCBI)
+| Property | Value |
+|----------|-------|
+| **Base URL** | `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` |
+| **Auth** | None required |
+| **Rate limit** | Same as PubMed (3 req/sec) |
+| **Endpoints used** | `esearch.fcgi?db=clinvar` — variant search; `esummary.fcgi?db=clinvar` — classification |
+| **Used in tabs** | A3 (Real Variant Lookup) |
+| **Docs** | https://www.ncbi.nlm.nih.gov/clinvar/docs/api_http/ |
+| **Data policy** | All ClinVar data is public domain |
+**Example call:**
+```
+GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
+  ?db=clinvar
+  &term=NM_007294.4:c.5266dupC
+  &retmode=json
+  &retmax=5
+```
+---
+### 3. OpenTargets Platform GraphQL API
+| Property | Value |
+|----------|-------|
+| **Base URL** | `https://api.platform.opentargets.org/api/v4/graphql` |
+| **Auth** | None required (free, open access) |
+| **Rate limit** | No hard limit; reasonable use expected |
+| **Endpoints used** | GraphQL POST — disease associations, tractability, known drugs |
+| **Used in tabs** | A1 (process associations), A2 (target gap index), A5 (druggable orphans) |
+| **Docs** | https://platform-docs.opentargets.org/data-access/graphql-api |
+| **Data release** | Updated quarterly; cite as "Open Targets Platform [release date]" |
+| **License** | CC0 (public domain) |
+**Example query (disease-associated targets):**
+```graphql
+query AssocTargets($efoId: String!, $size: Int!) {
+  disease(efoId: $efoId) {
+    associatedTargets(page: {index: 0, size: $size}) {
+      rows {
+        target { approvedSymbol approvedName }
+        score
+      }
+    }
+  }
+}
+```
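The query above is sent as a JSON body in a POST request. The sketch below shows one way to build that body and flatten a response, assuming the response mirrors the shape of the query; `build_payload` and `extract_rows` are hypothetical helper names, not functions from `app.py`, and no network call is made here.

```python
import json

OT_URL = "https://api.platform.opentargets.org/api/v4/graphql"

ASSOC_TARGETS_QUERY = """
query AssocTargets($efoId: String!, $size: Int!) {
  disease(efoId: $efoId) {
    associatedTargets(page: {index: 0, size: $size}) {
      rows {
        target { approvedSymbol approvedName }
        score
      }
    }
  }
}
"""


def build_payload(efo_id: str, size: int = 25) -> bytes:
    """Serialize the query and its variables into a JSON request body."""
    body = {
        "query": ASSOC_TARGETS_QUERY,
        "variables": {"efoId": efo_id, "size": size},
    }
    return json.dumps(body).encode("utf-8")


def extract_rows(response: dict) -> list:
    """Return (symbol, score) pairs; an unknown EFO ID yields an empty list."""
    disease = (response.get("data") or {}).get("disease") or {}
    rows = (disease.get("associatedTargets") or {}).get("rows") or []
    return [(r["target"]["approvedSymbol"], r["score"]) for r in rows]
```

The payload would be posted to `OT_URL` with a `Content-Type: application/json` header; handling a missing `disease` object keeps a mistyped EFO ID from raising a `TypeError`.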
+**EFO IDs used:**
+| Cancer | EFO ID |
+|--------|--------|
+| GBM | EFO_0000519 |
+| PDAC | EFO_0002618 |
+| SCLC | EFO_0000702 |
+| UVM | EFO_0004339 |
+| DIPG | EFO_0009708 |
+| ACC | EFO_0003060 |
+| MCC | EFO_0005558 |
+| PCNSL | EFO_0005543 |
+| Pediatric AML | EFO_0000222 |
+---
+### 4. gnomAD GraphQL API
+| Property | Value |
+|----------|-------|
+| **Base URL** | `https://gnomad.broadinstitute.org/api` |
+| **Auth** | None required |
+| **Rate limit** | No hard limit; reasonable use expected |
+| **Endpoints used** | GraphQL POST — `variantSearch` query |
+| **Dataset** | `gnomad_r4` (v4, 807,162 individuals) |
+| **Used in tabs** | A3 (Real Variant Lookup — allele frequency) |
+| **Docs** | https://gnomad.broadinstitute.org/api |
+| **License** | CC0 1.0 (public domain dedication; verify against the current gnomAD terms) |
+**Example query:**
+```graphql
+query VariantSearch($query: String!, $dataset: DatasetId!) {
+  variantSearch(query: $query, dataset: $dataset) {
+    variant_id
+    rsids
+    exome { af }
+    genome { af }
+  }
+}
+```
+---
+### 5. ClinicalTrials.gov API v2
+| Property | Value |
+|----------|-------|
+| **Base URL** | `https://clinicaltrials.gov/api/v2` |
+| **Auth** | None required |
+| **Rate limit** | No hard limit documented; polite use recommended |
+| **Endpoints used** | `GET /studies` — trial search by gene + cancer type |
+| **Used in tabs** | A2 (trial counts per gene), A5 (orphan target trial check) |
+| **Docs** | https://clinicaltrials.gov/data-api/api |
+| **Data policy** | Public domain (US government) |
+**Example call:**
+```
+GET https://clinicaltrials.gov/api/v2/studies
+  ?query.term=KRAS GBM
+  &pageSize=1
+  &format=json
+```
+---
+### 6. DepMap Public Data
+| Property | Value |
+|----------|-------|
+| **Source** | Broad Institute DepMap Portal |
+| **URL** | https://depmap.org/portal/download/all/ |
+| **File** | `CRISPR_gene_effect.csv` (Chronos scores) |
+| **Auth** | None required (public download) |
+| **Used in tabs** | A2 (essentiality scores for gap index) |
+| **Score convention** | **Negative = essential** (−1 = median essential gene effect); inverted in app per know-how guide |
+| **License** | CC BY 4.0 |
+| **Citation** | Broad Institute DepMap, [release]. DepMap Public [release]. figshare. |
+> **Implementation note:** The app uses a curated reference gene set with representative scores as a lightweight proxy. For full analysis, download the complete CRISPR_gene_effect.csv (~500 MB) from depmap.org and replace `_load_depmap_sample()` in `app.py`.
+---
+## Simulated Data Sources (Group B Tabs)
+All Group B tabs use **rule-based computational models** — no external APIs.
+| Tab | Model Type | Basis |
+|-----|-----------|-------|
+| B1 — miRNA Explorer | Curated lookup table | Published miRNA-target databases (miRDB, TargetScan concepts) |
+| B2 — siRNA Targets | Curated efficacy estimates | Published siRNA screen literature |
+| B3 — LNP Corona | Langmuir adsorption model | Corona proteomics literature (Monopoli et al. 2012; Lundqvist et al. 2017) |
+| B4 — Flow Corona | Competitive Langmuir kinetics | Vroman effect literature (Vroman 1962; Hirsh et al. 2013) |
+| B5 — Variant Concepts | ACMG/AMP 2015 rule set | Richards et al. 2015 ACMG guidelines |
+> ⚠️ All Group B outputs are labeled **SIMULATED** in the UI and must not be used for clinical or research decisions.
+---
+## RAG Chatbot (Tab A6)
+| Property | Value |
+|----------|-------|
+| **Embedding model** | `all-MiniLM-L6-v2` (sentence-transformers) |
+| **Model size** | ~80 MB, CPU-compatible |
+| **Vector index** | FAISS `IndexFlatIP` (cosine similarity on L2-normalized vectors) |
+| **Corpus** | 20 curated paper abstracts (see `chatbot.py` `PAPER_CORPUS`) |
+| **Source** | PubMed abstracts (publisher copyright may apply; cite the original articles) |
+| **No external API** | Fully offline after model download |
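The "cosine similarity on L2-normalized vectors" entry rests on a small identity: once both vectors have unit length, their inner product equals their cosine similarity, which is why an inner-product index can rank by cosine. The pure-Python sketch below illustrates this with toy 2-D vectors; it does not use FAISS or the real 384-dimensional embeddings, and the function names are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Textbook cosine similarity on the raw (unnormalized) vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a, b = [3.0, 4.0], [4.0, 3.0]
ip_on_unit_vectors = inner_product(l2_normalize(a), l2_normalize(b))
assert abs(ip_on_unit_vectors - cosine(a, b)) < 1e-12
```

This is also why retrieval scores fall in the range −1 to 1 and can be compared against fixed thresholds.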
+**20 Indexed PMIDs** *(all verified against PubMed esummary + efetch, 2026-03-07):*
+| PMID | First Author | Topic | Journal | Year |
+|------|-------------|-------|---------|------|
+| 34394960 | Hou X | LNP mRNA delivery review | Nat Rev Mater | 2021 |
+| 32251383 | Cheng Q | SORT LNPs organ selectivity | Nat Nanotechnol | 2020 |
+| 29653760 | Sabnis S | Novel amino lipid series for mRNA | Mol Ther | 2018 |
+| 22782619 | Jayaraman M | Ionizable lipid siRNA LNP potency | Angew Chem Int Ed | 2012 |
+| 33208369 | Rosenblum D | CRISPR-Cas9 LNP cancer therapy | Sci Adv | 2020 |
+| 18809927 | Lundqvist M | Nanoparticle size/surface protein corona | PNAS | 2008 |
+| 22086677 | Walkey CD | Nanomaterial-protein interactions | Chem Soc Rev | 2012 |
+| 31565943 | Park M | Accessible surface area nanoparticle corona | Nano Lett | 2019 |
+| 33754708 | Sebastiani F | ApoE binding drives LNP rearrangement | ACS Nano | 2021 |
+| 20461061 | Akinc A | Endogenous ApoE-mediated LNP liver delivery | Mol Ther | 2010 |
+| 30096302 | Bailey MH | Cancer driver genes TCGA pan-cancer | Cell | 2018 |
+| 30311387 | Landrum MJ | ClinVar at five years | Hum Mutat | 2018 |
+| 32461654 | Karczewski KJ | gnomAD mutational constraint 141,456 humans | Nature | 2020 |
+| 27328919 | Bouaoun L | TP53 variations IARC database | Hum Mutat | 2016 |
+| 31820981 | Lanman BA | KRAS G12C covalent inhibitor AMG 510 | J Med Chem | 2020 |
+| 28678784 | Sahin U | Personalized RNA mutanome vaccines | Nature | 2017 |
+| 31348638 | Kozma GT | Anti-PEG IgM complement activation LNP | ACS Nano | 2019 |
+| 33016924 | Cafri G | mRNA neoantigen T cell immunity GI cancer | J Clin Invest | 2020 |
+| 31142840 | Cristiano S | Genome-wide cfDNA fragmentation in cancer | Nature | 2019 |
+| 33883548 | Larson MH | Cell-free transcriptome tissue biomarkers | Nat Commun | 2021 |
+---
+## Caching System
+All real API calls are cached locally to reduce latency and respect rate limits.
+| Property | Value |
+|----------|-------|
+| **Cache directory** | `./cache/` |
+| **TTL** | 24 hours |
+| **Key format** | `{endpoint}_{md5(query)}.json` |
+| **Format** | JSON |
+| **Invalidation** | Automatic on TTL expiry; manual by deleting `./cache/` |
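A minimal sketch of a file cache that follows the table above (MD5-based key, JSON files, 24 h TTL). The function names and the use of file modification time for expiry are assumptions for illustration; the actual implementation in `app.py` may differ.

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("./cache")
TTL_SECONDS = 24 * 60 * 60


def cache_path(endpoint: str, query: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Build the `{endpoint}_{md5(query)}.json` file name."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    return cache_dir / f"{endpoint}_{digest}.json"


def cache_get(endpoint, query, cache_dir=CACHE_DIR, now=None):
    """Return the cached payload, or None if it is missing or older than the TTL."""
    path = cache_path(endpoint, query, cache_dir)
    if not path.exists():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > TTL_SECONDS:
        return None  # expired; the caller should refetch and overwrite
    return json.loads(path.read_text(encoding="utf-8"))


def cache_set(endpoint, query, payload, cache_dir=CACHE_DIR):
    """Write the payload as JSON, creating the cache directory if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(endpoint, query, cache_dir)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path
```

Hashing the query keeps file names short and filesystem-safe even when the query contains quotes, brackets, or spaces.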
+---
+## Lab Journal
+| Property | Value |
+|----------|-------|
+| **File** | `./lab_journal.csv` |
+| **Format** | CSV (timestamp, tab, action, result_summary, note) |
+| **Auto-logged** | Every tab run automatically logs an entry |
+| **Manual notes** | Via sidebar note field |
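One way such a journal could be appended to, using the five columns listed in the table. `log_entry` is a hypothetical helper, and the UTC ISO-8601 timestamp format is an assumption; the source does not specify the format `app.py` writes.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

JOURNAL_FIELDS = ["timestamp", "tab", "action", "result_summary", "note"]


def log_entry(path, tab, action, result_summary, note=""):
    """Append one row to the journal, writing the header if the file is new."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "tab": tab,
        "action": action,
        "result_summary": result_summary,
        "note": note,
    }
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=JOURNAL_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Using `csv.DictWriter` means commas or quotes inside a result summary or note are escaped correctly instead of corrupting the row.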
+---
+*Data Sources documented by K R&D Lab Cancer Research Suite | 2026-03-07*

learning_cases.md ADDED

	@@ -0,0 +1,229 @@

+# Guided Learning Cases
+**K R&D Lab — Cancer Research Suite · Learning Sandbox**
+Author: Oksana Kolisnyk | kosatiks-group.pp.ua
+Repo: github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
+Generated: 2026-03-07
+---
+> These 5 cases are designed for use with the **📚 Learning Sandbox** tab group.
+> All sandbox results are ⚠️ SIMULATED. Use the **🔬 Real Data Tools** tabs to validate findings with live data.
+---
+## Case 1 — miRNA-Mediated Silencing of TP53 in Pan-Cancer Context
+**Scientific Question:**
+Which miRNAs are predicted to suppress TP53 expression in cancer, and how does their expression change in TP53-mutant tumors?
+### Protocol
+| Step | Tab | Action |
+|------|-----|--------|
+| 1 | **B1 — miRNA Explorer** | Select gene: `TP53`. Run simulation. Note the top miRNA by binding energy (most negative kcal/mol). |
+| 2 | **B1 — miRNA Explorer** | In the expression chart, identify which miRNAs are upregulated (positive log2FC) in TP53-mutant tumors — these are candidate oncogenic miRNAs. |
+| 3 | **A4 — Literature Gap Finder** | Search: cancer type = `GBM`, keyword = `miR-34a TP53`. Check if the literature trend shows a gap in recent years. |
+| 4 | **A2 — Understudied Target Finder** | Select `GBM`. Check if TP53 appears in the gap index table — compare its paper count vs. essentiality. |
+| 5 | **📓 Lab Journal** | Record: top miRNA, its binding energy, expression direction, and whether a literature gap exists for this miRNA in GBM. |
+### Expected Result
+- miR-34a-5p should show the strongest binding energy (≈ −19 kcal/mol) and be **downregulated** (negative log2FC) in TP53-mutant tumors — consistent with miR-34a being a direct p53 transcriptional target that is lost when p53 is mutated.
+- miR-25-3p and miR-504-5p should be **upregulated**, acting as oncogenic suppressors of wild-type p53.
+- Literature gap search may reveal sparse recent publications on miR-34a in GBM specifically (vs. breast/lung cancer).
+### Real PubMed PMID to Read
+**PMID: 17554337** — He L et al. "A microRNA component of the p53 tumour suppressor network." *Nature* 2007.
+Direct link: https://pubmed.ncbi.nlm.nih.gov/17554337/
+### What to Write in Lab Notebook
+```
+Date: [today]
+Case: miRNA-TP53 silencing
+Gene: TP53
+Top suppressive miRNA: miR-34a-5p (binding energy: -19.2 kcal/mol, seed: 8mer)
+Top oncogenic miRNA: miR-25-3p (log2FC: +2.0 in TP53-mutant)
+Literature gap: [yes/no] for miR-34a in GBM (from Tab A4)
+Hypothesis: miR-34a restoration therapy may be viable in GBM with WT TP53
+Next step: Check Tab A5 (Druggable Orphans) for TP53-pathway targets in GBM
+```
+---
+## Case 2 — LNP Formulation Optimization for Brain Delivery via ApoE Corona
+**Scientific Question:**
+How does PEG mol% and ionizable lipid content affect ApoE enrichment in the protein corona, and what formulation maximizes predicted brain targeting?
+### Protocol
+| Step | Tab | Action |
+|------|-----|--------|
+| 1 | **B3 — LNP Corona** | Set baseline: PEG = 1.5 mol%, ionizable = 50%, helper = 10%, cholesterol = 38%, size = 100 nm, serum = 10%. Run simulation. Record ApoE fraction. |
+| 2 | **B3 — LNP Corona** | Increase PEG to 4.0 mol% (all else equal). Run again. Observe ApoE fraction change. |
+| 3 | **B3 — LNP Corona** | Return PEG to 1.5 mol%. Increase particle size to 200 nm. Run. Observe fibrinogen fraction change. |
+| 4 | **B4 — Flow Corona** | Set kon_ApoE = 0.05, koff_ApoE = 0.01 (tight binding). Run Vroman kinetics for 60 min. Note crossover time. |
+| 5 | **📓 Lab Journal** | Record all three formulation conditions and their ApoE fractions. Identify the optimal formulation for brain targeting. |
+### Expected Result
+- **Baseline** (1.5% PEG, 100 nm): ApoE ~30–35% → good brain targeting potential via LRP1
+- **High PEG** (4.0%): ApoE drops to ~10–15% → PEG shields corona formation, reducing receptor-mediated uptake
+- **Large particles** (200 nm): Fibrinogen fraction increases → larger particles recruit more coagulation proteins, increasing lung/macrophage clearance
+- **Vroman kinetics**: Albumin dominates first ~5–10 min, then ApoE displaces it; crossover at ~15–25 min
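The Vroman crossover can be illustrated with a two-protein competitive Langmuir model, where each protein adsorbs onto free surface and desorbs at its own rate. The sketch below is not the B4 implementation: the ApoE rates come from Step 4, but the albumin rates (`kon_alb`, `koff_alb`), the per-minute pseudo-first-order units, and the Euler step size are assumptions chosen for illustration.

```python
def simulate_vroman(kon_alb=1.0, koff_alb=1.0, kon_apoe=0.05, koff_apoe=0.01,
                    t_end=60.0, dt=0.01):
    """Euler integration of two-protein competitive Langmuir adsorption.

    Coverages are fractions of the particle surface; rates are per minute.
    Returns (crossover_min, albumin_coverage, apoe_coverage) at t_end.
    """
    alb = apoe = 0.0
    crossover = None
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        free = 1.0 - alb - apoe              # unoccupied surface fraction
        d_alb = kon_alb * free - koff_alb * alb
        d_apoe = kon_apoe * free - koff_apoe * apoe
        alb += d_alb * dt
        apoe += d_apoe * dt
        if crossover is None and apoe > alb:  # first time ApoE overtakes albumin
            crossover = step * dt
    return crossover, alb, apoe


crossover_min, alb_final, apoe_final = simulate_vroman()
```

Albumin arrives first because its on-rate is high, but its weaker affinity (kon/koff) means the slower, tighter-binding ApoE eventually dominates. The crossover time depends strongly on the assumed albumin rates, so it should not be read as a prediction of the sandbox output.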
+### Real PubMed PMID to Read
+**PMID: 32251383** — Cheng Q et al. "Selective organ targeting (SORT) nanoparticles for tissue-specific mRNA delivery and CRISPR–Cas gene editing." *Nature Nanotechnology* 2020.
+Direct link: https://pubmed.ncbi.nlm.nih.gov/32251383/
+### What to Write in Lab Notebook
+```
+Date: [today]
+Case: LNP corona optimization for brain delivery
+Condition 1 (baseline): PEG 1.5%, 100nm → ApoE = [X]%
+Condition 2 (high PEG): PEG 4.0%, 100nm → ApoE = [X]%
+Condition 3 (large): PEG 1.5%, 200nm → ApoE = [X]%, Fibrinogen = [X]%
+Vroman crossover time: ~[X] min
+Conclusion: [optimal formulation] maximizes ApoE for brain targeting
+Caveat: High PEG reduces corona but triggers anti-PEG immunity on repeat dosing (see PMID 34880493)
+```
+---
+## Case 3 — KRAS G12C Variant Classification and Clinical Significance
+**Scientific Question:**
+How is the KRAS G12C somatic mutation classified in ClinVar, what is its population frequency in gnomAD, and does it represent a research gap in PDAC vs. LUAD?
+### Protocol
+| Step | Tab | Action |
+|------|-----|--------|
+| 1 | **A3 — Real Variant Lookup** | Enter HGVS: `NM_004985.5:c.34G>T` (KRAS G12C). Run lookup. Record ClinVar classification and gnomAD AF. |
+| 2 | **B5 — Variant Concepts** | Select `Pathogenic`. Read the ACMG criteria. Identify which codes apply to a known cancer hotspot like KRAS G12C. |
+| 3 | **A4 — Literature Gap Finder** | Search: cancer type = `PDAC`, keyword = `KRAS G12C`. Repeat with `SCLC`, then compare both trends with the LUAD literature. |
+| 4 | **A2 — Understudied Target Finder** | Select `PDAC`. Check if KRAS appears and what its gap index is. |
+| 5 | **📓 Lab Journal** | Record classification, AF, literature trend comparison, and gap index. |
+### Expected Result
+- **ClinVar**: KRAS G12C classified as **Pathogenic** (somatic) — PS1 (same amino acid change as established pathogenic), PM2 (absent from healthy population), PS3 (functional studies confirm oncogenicity)
+- **gnomAD AF**: Should be extremely rare or absent (AF < 0.0001), because gnomAD catalogs germline variation and somatic hotspot mutations are largely missing from it
+- **Literature trend**: LUAD shows rapid growth post-2021 (sotorasib approval); PDAC shows lower but growing activity; SCLC shows near-zero publications → SCLC is the true gap
+- **Gap index**: KRAS in PDAC may have moderate gap index despite high essentiality, due to growing literature
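To turn a gnomAD lookup into the "not found / ultra-rare" note used in the lab notebook, something like the sketch below could be applied to a single search hit. It assumes each hit carries optional `exome` and `genome` objects with an `af` field, as in the example query in `data_sources.md`; the helper names and the wording of the returned labels are hypothetical.

```python
RARE_AF = 1e-4  # threshold quoted in the expected result above


def max_af(hit: dict):
    """Return the larger of the exome/genome allele frequencies, or None."""
    afs = [(hit.get(source) or {}).get("af") for source in ("exome", "genome")]
    afs = [af for af in afs if af is not None]
    return max(afs) if afs else None


def describe_frequency(hit):
    """Map a gnomAD hit (or None for no hit) to a short notebook label."""
    if hit is None:
        return "not found in gnomAD"
    af = max_af(hit)
    if af is None:
        return "listed, but no allele frequency reported"
    return "ultra-rare" if af < RARE_AF else "observed in population"
```

Taking the maximum across exome and genome data is a conservative choice: a variant is labeled ultra-rare only if both datasets agree.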
+### Real PubMed PMID to Read
+**PMID: 31820981** — Lanman BA et al. "Discovery of a Covalent Inhibitor of KRAS(G12C) (AMG 510) for the Treatment of Solid Tumors." *J Med Chem* 2020.
+Direct link: https://pubmed.ncbi.nlm.nih.gov/31820981/
+### What to Write in Lab Notebook
+```
+Date: [today]
+Case: KRAS G12C variant analysis
+HGVS: NM_004985.5:c.34G>T
+ClinVar classification: [result from Tab A3]
+gnomAD germline AF: [result — expected: not found / ultra-rare]
+ACMG codes (simulated, B5): PS1, PM2, PS3
+Literature gap: SCLC shows lowest KRAS G12C publications
+PDAC gap index (Tab A2): [value]
+Clinical note: Sotorasib/adagrasib approved for LUAD; PDAC trials ongoing; SCLC = unexplored
+```
+---
+## Case 4 — siRNA Delivery Feasibility for KRAS-Driven Cancers
+**Scientific Question:**
+Which KRAS-driven cancer type (LUAD, BRCA, COAD) has the most favorable siRNA target profile, and what are the key delivery barriers?
+### Protocol
+| Step | Tab | Action |
+|------|-----|--------|
+| 1 | **B2 — siRNA Targets** | Select `LUAD`. Note KRAS G12C efficacy score and delivery challenge rating. |
+| 2 | **B2 — siRNA Targets** | Select `COAD`. Note KRAS G12D efficacy and delivery challenge. Compare to LUAD. |
+| 3 | **B3 — LNP Corona** | Set formulation for tumor delivery: PEG = 1.5%, ionizable = 50%, size = 80 nm, serum = 50% (mimicking tumor microenvironment). Run corona simulation. |
+| 4 | **A5 — Druggable Orphans** | Select `PDAC`. Check if KRAS appears as an orphan (no approved drug, no trial). |
+| 5 | **📓 Lab Journal** | Compare all three cancer types. Identify which has the best siRNA opportunity and why. |
+### Expected Result
+- **LUAD KRAS G12C**: Efficacy ~0.82, delivery challenge = High (lung delivery requires inhalation or IV LNP)
+- **COAD KRAS G12D**: Efficacy ~0.79, delivery challenge = High (colorectal delivery requires oral or local administration)
+- **Corona at 50% serum**: Higher albumin and IgG fractions → more immune recognition; ApoE still present but diluted
+- **PDAC orphan check**: KRAS may appear as orphan or near-orphan — KRAS G12D has no approved covalent inhibitor as of 2026
+- **Best opportunity**: LUAD KRAS G12C has highest efficacy + existing clinical precedent (sotorasib); COAD KRAS G12D is the true unmet need
+### Real PubMed PMID to Read
+**PMID: 33208369** — Rosenblum D et al. "CRISPR-Cas9 genome editing using targeted lipid nanoparticles for cancer therapy." *Science Advances* 2020.
+Direct link: https://pubmed.ncbi.nlm.nih.gov/33208369/
+### What to Write in Lab Notebook
+```
+Date: [today]
+Case: siRNA delivery for KRAS cancers
+LUAD KRAS G12C: efficacy = [X], delivery = High, off-target = Medium
+COAD KRAS G12D: efficacy = [X], delivery = High, off-target = Medium
+Corona at 50% serum: ApoE = [X]%, Albumin = [X]%
+PDAC orphan status (Tab A5): [result]
+Conclusion: [best cancer type for siRNA KRAS targeting]
+Key barrier: Endosomal escape efficiency <2% for siRNA-LNPs (literature)
+Next step: Design LNP formulation screen using Tab B3 to maximize ApoE for tumor targeting
+```
+---
+## Case 5 — Identifying a Novel Research Gray Zone in a Rare Cancer
+**Scientific Question:**
+In uveal melanoma (UVM), which biological processes are most underexplored, and is there an essential gene with no drug and no trial that could be targeted via a novel mechanism?
+### Protocol
+| Step | Tab | Action |
+|------|-----|--------|
+| 1 | **A1 — Gray Zones Explorer** | Select `UVM`. Run. Identify the top 3 processes with lowest paper counts (red/white cells in heatmap). |
+| 2 | **A4 — Literature Gap Finder** | Search: cancer type = `UVM`, keyword = the top gap process from Step 1 (e.g. `ferroptosis` or `phase separation`). Confirm the gap with the year-by-year chart. |
+| 3 | **A2 — Understudied Target Finder** | Select `UVM`. Find the gene with the highest gap index. Note its essentiality and paper count. |
+| 4 | **A5 — Druggable Orphans** | Select `UVM`. Check if the gene from Step 3 appears as an orphan target. |
+| 5 | **🤖 Research Assistant (A6)** | Ask: *"What is known about LNP delivery to uveal melanoma or ocular tumors?"* Note the confidence flag. |
+| 6 | **📓 Lab Journal** | Synthesize all findings into a 3-sentence research hypothesis. |
+### Expected Result
+- **Gray zones in UVM**: Likely top gaps = `phase separation`, `liquid-liquid phase separation`, `cryptic splicing`, `protein corona` — these are emerging fields with minimal UVM-specific literature
+- **Literature gap**: Year-by-year chart should show 0–2 papers/year for the top gap process in UVM
+- **Understudied target**: A gene with a high Open Targets (OT) association score, a low paper count, and no drug (e.g. GNA11, GNAQ pathway effectors)
+- **Orphan status**: GNA11/GNAQ are mutated in >90% of UVM but have no approved targeted therapy
+- **RAG chatbot**: Will likely return MEDIUM or SPECULATIVE confidence for UVM-specific LNP delivery (not in indexed papers) — demonstrating the system's honesty about knowledge limits
+### Real PubMed PMID to Read
+**PMID: 27328919** — Bouaoun L et al. "TP53 Variations in Human Cancers: New Lessons from the IARC TP53 Database and Genomics Data." *Human Mutation* 2016. *(For variant landscape context)*
+Also: Search PubMed for `"uveal melanoma" AND "GNA11" AND "treatment"` to find the most recent therapeutic approaches.
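The PubMed search suggested above can also be run programmatically through the NCBI E-utilities ESearch endpoint. The sketch below assumes you want year-by-year hit counts similar to the chart in Tab A4. The function names are hypothetical and this is not the code used by the app. The script itself only prints a URL; `count_papers` needs network access when called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
QUERY = '"uveal melanoma" AND "GNA11" AND "treatment"'


def build_esearch_url(term, year=None):
    """Build an ESearch URL that asks PubMed for the hit count as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "rettype": "count"}
    if year is not None:
        # Restrict results to a single publication year.
        params.update({"datetype": "pdat", "mindate": str(year), "maxdate": str(year)})
    return f"{ESEARCH}?{urlencode(params)}"


def parse_count(payload):
    """Extract the integer hit count from an ESearch JSON response body."""
    return int(json.loads(payload)["esearchresult"]["count"])


def count_papers(term, year=None):
    """Query PubMed (network access required) and return the hit count."""
    with urlopen(build_esearch_url(term, year), timeout=30) as resp:
        return parse_count(resp.read().decode("utf-8"))


# Print the request URL for one year without contacting NCBI.
print(build_esearch_url(QUERY, 2024))
```

To reproduce a year-by-year series, call `count_papers(QUERY, year)` in a loop and pause between requests; NCBI limits clients without an API key to about three requests per second.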
+### What to Write in Lab Notebook
+```
+Date: [today]
+Case: UVM gray zone discovery
+Top 3 gray zones (Tab A1): [process 1], [process 2], [process 3]
+Literature gap confirmed (Tab A4): [process] — [X] papers/year average
+Top understudied target (Tab A2): [gene], gap index = [X]
+Orphan status (Tab A5): [yes/no drug, yes/no trial]
+RAG chatbot confidence for UVM LNP: [HIGH/MEDIUM/SPECULATIVE]
+Research hypothesis: "[Gene] is an essential driver in UVM with no approved therapy.
+  The [process] pathway is underexplored in UVM. LNP-mediated delivery of
+  [siRNA/mRNA] targeting [gene] via [process] mechanism represents a novel
+  therapeutic strategy warranting in vitro validation."
+```
+---
+## Quick Reference: Tab-to-Question Mapping
+| Research Question Type | Primary Tab | Validation Tab |
+|------------------------|-------------|----------------|
+| What is understudied in cancer X? | A1 Gray Zones | A4 Literature Gap |
+| Which gene should I target? | A2 Target Finder | A5 Druggable Orphans |
+| Is this variant real/classified? | A3 Variant Lookup | B5 Variant Concepts |
+| How does my LNP formulation behave? | B3 LNP Corona | B4 Flow Corona |
+| What do the papers say? | A6 Research Assistant | A4 Literature Gap |
+| How does miRNA regulate my gene? | B1 miRNA Explorer | A4 Literature Gap |
+| Which cancer is best for siRNA? | B2 siRNA Targets | A5 Druggable Orphans |
+---
+*Learning Cases generated by K R&D Lab Cancer Research Suite | 2026-03-07*

requirements.txt ADDED

	@@ -0,0 +1,32 @@

+# K R&D Lab — Cancer Research Suite
+# Author: Oksana Kolisnyk | kosatiks-group.pp.ua
+# Repo:   github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
+#
+# Install: pip install -r requirements.txt
+# Python:  >= 3.10
+# ── Core UI ──────────────────────────────────
+gradio>=4.20.0
+# ── Data & Numerics ──────────────────────────
+numpy>=1.24.0
+pandas>=2.0.0
+scipy>=1.11.0
+# ── Visualization ────────────────────────────
+matplotlib>=3.7.0
+Pillow>=10.0.0
+# ── HTTP / APIs ──────────────────────────────
+requests>=2.31.0
+# ── RAG Chatbot (Tab A6) ─────────────────────
+sentence-transformers>=2.6.0
+faiss-cpu>=1.7.4
+torch>=2.0.0          # CPU-only is fine; sentence-transformers dependency
+# ── Optional: faster tokenization ────────────
+# tokenizers>=0.15.0  # installed automatically with sentence-transformers
+# ── HuggingFace Spaces compatibility ─────────
+# No additional packages needed for HF Spaces deployment

research_gaps.md ADDED

	@@ -0,0 +1,51 @@

+# Research Gaps Analysis
+**K R&D Lab — Cancer Research Suite**
+Author: Oksana Kolisnyk | kosatiks-group.pp.ua
+Repo: github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
+Generated: 2026-03-07
+---
+## 10 Underexplored Research Directions
+### Domain Coverage
+- **LNP drug delivery** (brain/GBM focus): Directions 1–4
+- **Cancer liquid biopsy biomarkers**: Directions 5–7
+- **Protein corona in disease context**: Directions 8–10
+---
+| # | Direction | Why Underexplored | Data That Exists | Experiment Design | Cost Estimate (USD) |
+|---|-----------|-------------------|-----------------|-------------------|---------------------|
+| 1 | **ApoE isoform-specific LNP corona engineering for GBM** — Exploiting ApoE2/E3/E4 differential LDLR/LRP1 binding to tune BBB transcytosis efficiency of ionizable LNPs | ApoE isoform effects on LNP uptake are studied in liver but almost never in brain endothelium or GBM models. Most BBB-LNP studies use pooled human serum ignoring isoform heterogeneity. | ApoE isoform LDLR binding affinities (structural data); LNP-ApoE corona proteomics (liver); LRP1 expression atlas in brain endothelium (HPA); GBM patient ApoE genotyping (TCGA) | (1) Formulate 3 ionizable LNP variants (MC3, DLin-KC2, Lipid 5); (2) Incubate with ApoE2/E3/E4-spiked plasma; (3) Quantify corona by LC-MS/MS; (4) Test BBB transcytosis in hCMEC/D3 monolayer; (5) Validate in orthotopic GBM mouse model stratified by ApoE genotype | $85,000–$140,000 |
+| 2 | **Focused ultrasound (FUS) + LNP synergy for DIPG delivery** — Using low-intensity FUS to transiently open the BBB at the pons specifically for LNP-mRNA delivery in diffuse intrinsic pontine glioma | DIPG is almost universally fatal with no effective systemic therapy. FUS-BBB opening is validated in GBM but the pons presents unique anatomical and safety challenges. LNP-FUS combination is essentially unstudied in DIPG. | FUS BBB-opening safety data (GBM trials); DIPG transcriptome (CBTTC); LNP-mRNA efficacy in GBM models; Pons anatomy MRI atlases | (1) Establish DIPG patient-derived xenograft (PDX) in pons; (2) Optimize FUS parameters for pontine BBB opening (MRI-guided); (3) Deliver LNP-mRNA (IL-12 or H3K27M-targeting) post-FUS; (4) Measure delivery efficiency by luciferase reporter; (5) Assess safety by MRI + histology | $180,000–$260,000 |
+| 3 | **Intranasal LNP delivery bypassing BBB for GBM — olfactory-trigeminal pathway optimization** | Intranasal delivery to brain is conceptually established but LNP formulation parameters for olfactory epithelium uptake vs. systemic absorption are poorly defined. No systematic formulation screen exists for GBM-relevant payloads. | Intranasal delivery pharmacokinetics (small molecules); Olfactory epithelium transcriptomics; LNP size-uptake relationships (liver); GBM mouse models | (1) Screen 12 LNP formulations varying size (50–200 nm), charge, PEG density; (2) Intranasal dosing in C57BL/6 mice; (3) Quantify brain vs. lung vs. liver distribution by fluorescence imaging; (4) Test in GL261 orthotopic GBM; (5) Measure tumor mRNA expression | $55,000–$90,000 |
+| 4 | **LNP-mediated delivery of circular RNA (circRNA) for sustained GBM immunotherapy** | circRNA is more stable than linear mRNA (no 5'/3' ends for exonuclease degradation) and can drive prolonged protein expression. LNP formulation for circRNA is almost entirely unstudied — most circRNA delivery uses electroporation or viral vectors. | circRNA production protocols (in vitro); LNP-mRNA delivery benchmarks; GBM immunotherapy targets (IL-12, STING agonists); circRNA stability data | (1) Synthesize circRNA encoding IL-12 or anti-PD-L1 nanobody; (2) Formulate in ionizable LNPs (compare MC3 vs. Lipid 5); (3) Compare expression duration vs. linear mRNA-LNP in vitro; (4) Test in GBM organoids; (5) In vivo efficacy in GL261 model | $120,000–$200,000 |
+| 5 | **cfRNA splicing isoform signatures as GBM liquid biopsy** — Tumor-specific alternative splicing events detectable in plasma cfRNA as GBM biomarkers | GBM sheds minimal ctDNA because of the BBB. cfRNA carrying GBM-specific transcript variants (e.g., EGFRvIII, PTPRZ1-MET fusion) is theoretically detectable, but no validated plasma cfRNA panel exists for GBM. Most liquid biopsy research focuses on extracranial tumors with high ctDNA shedding. | GBM splicing atlas (TCGA RNA-seq); EGFRvIII detection methods; cfRNA isolation protocols; Healthy donor cfRNA baseline (GTEx) | (1) Identify top 20 GBM-specific splicing events from TCGA; (2) Design RT-qPCR assays for plasma cfRNA; (3) Collect plasma from 30 GBM patients + 30 healthy controls; (4) Validate with ddPCR; (5) Correlate with tumor burden by MRI | $95,000–$150,000 |
+| 6 | **Extracellular vesicle (EV) surface proteomics as pan-cancer early detection** — Using EV surface protein signatures (not cargo) as cancer-type-specific biomarkers | EV cargo (miRNA, cfDNA) is well-studied. EV surface proteomics by proximity labeling or aptamer arrays is technically feasible but rarely applied to early-stage cancer detection. Surface proteins are more stable and accessible than EV cargo. | EV proteomics databases (EVpedia, Vesiclepedia); Cancer-specific surface markers (HPA); EV isolation benchmarks; SomaScan aptamer platform data | (1) Isolate EVs from plasma of 50 early-stage cancer patients (5 types) + 50 controls by SEC; (2) Profile surface proteome by proximity labeling (BioID2) + LC-MS/MS; (3) Identify cancer-type-specific surface signatures; (4) Validate top 10 markers by aptamer array; (5) Build ML classifier | $200,000–$320,000 |
+| 7 | **Clonal hematopoiesis of indeterminate potential (CHIP) interference correction in ctDNA liquid biopsy** — Developing computational methods to distinguish tumor-derived variants from CHIP-derived variants in cfDNA | CHIP affects >10% of adults over 65 and generates somatic variants (DNMT3A, TET2, ASXL1) that contaminate ctDNA signals. No validated correction algorithm exists for routine clinical use, causing false-positive cancer detections. | CHIP variant databases (gnomAD somatic); ctDNA variant callers (GATK, Mutect2); Paired tumor-normal WGS datasets (TCGA); CHIP prevalence data (UK Biobank) | (1) Collect paired cfDNA + WBC DNA from 100 cancer patients; (2) Call variants in both fractions; (3) Identify CHIP-specific variant patterns (VAF, trinucleotide context); (4) Train random forest classifier to distinguish CHIP vs. tumor variants; (5) Validate in independent cohort | $130,000–$210,000 |
+| 8 | **Disease-specific protein corona fingerprinting for cancer diagnosis** — Using the unique corona formed on standardized nanoparticle probes incubated in patient plasma as a cancer biomarker | The corona formed on a nanoparticle probe reflects the plasma proteome in a concentrated, amplified form. Different cancers produce distinct corona fingerprints. This "corona biopsy" concept has <10 publications and no clinical validation. | Plasma proteomics in cancer (CPTAC); Nanoparticle corona proteomics methods; LC-MS/MS cancer biomarker studies; Healthy donor plasma proteome (HPA) | (1) Incubate 5 standardized NP probes (varying charge/size) in plasma from 20 GBM + 20 PDAC + 20 healthy donors; (2) Elute and quantify corona by LC-MS/MS; (3) Identify cancer-type-specific corona signatures; (4) Build PCA/SVM classifier; (5) Validate in blinded cohort | $160,000–$250,000 |
+| 9 | **Complement activation by LNPs in immunocompromised cancer patients** — Characterizing how chemotherapy-induced immunosuppression alters complement-mediated LNP clearance | Complement C3 deposition on LNPs triggers opsonization and rapid clearance. Cancer patients on chemotherapy have altered complement levels. LNP pharmacokinetics in immunocompromised patients is almost entirely unstudied despite being the primary clinical population. | Complement proteomics in cancer patients; LNP complement activation assays; Chemotherapy immunosuppression data; LNP PK in healthy volunteers | (1) Collect plasma from 30 cancer patients (pre/post chemotherapy) + 15 healthy controls; (2) Incubate LNPs in each plasma sample; (3) Quantify C3b/iC3b deposition by ELISA; (4) Measure LNP uptake by macrophages in complement-depleted vs. replete conditions; (5) Correlate with patient complement levels | $75,000–$120,000 |
+| 10 | **Corona-mediated immunogenicity of LNP-mRNA in repeat-dosing cancer vaccine regimens** — Understanding how the evolving protein corona changes LNP immunogenicity across multiple vaccine doses | Cancer vaccines require multiple doses. Anti-PEG antibodies alter the corona on subsequent doses, potentially changing immunogenicity. The interplay between corona evolution, anti-PEG immunity, and vaccine efficacy is completely unstudied in multi-dose regimens. | Anti-PEG antibody prevalence data; LNP corona proteomics (single dose); mRNA vaccine immunogenicity data (COVID-19); Accelerated blood clearance (ABC) phenomenon literature | (1) Immunize mice with LNP-mRNA (3 doses, 3-week intervals); (2) Collect plasma after each dose; (3) Measure anti-PEG IgM/IgG by ELISA; (4) Incubate LNPs in post-dose plasma and profile corona by LC-MS/MS; (5) Correlate corona changes with T-cell and antibody responses | $90,000–$145,000 |
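Steps (1) and (2) of Direction 7 (paired cfDNA and WBC variant calls) suggest a simple baseline filter before any classifier is trained. The sketch below is only that baseline, under stated assumptions: the `Variant` class, the `split_chip` function, and the 1% WBC cutoff are hypothetical, and the example variant calls are invented for illustration rather than taken from any dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    change: str  # protein-level change, used with the gene as the match key
    vaf: float   # variant allele fraction, between 0 and 1


def variant_key(v):
    """Identify a variant by gene and change, ignoring its allele fraction."""
    return (v.gene, v.change)


def split_chip(cfdna, wbc, min_wbc_vaf=0.01):
    """Partition cfDNA variants into (likely_tumor, likely_chip).

    A cfDNA variant is labeled CHIP-derived when the same variant is also
    called in the matched WBC sample at or above min_wbc_vaf (an assumed,
    untuned cutoff).
    """
    wbc_hits = {variant_key(v) for v in wbc if v.vaf >= min_wbc_vaf}
    tumor = [v for v in cfdna if variant_key(v) not in wbc_hits]
    chip = [v for v in cfdna if variant_key(v) in wbc_hits]
    return tumor, chip


# Invented example calls for one patient.
cfdna_calls = [
    Variant("KRAS", "p.G12D", 0.004),
    Variant("DNMT3A", "p.R882H", 0.030),
    Variant("TET2", "p.Q810*", 0.015),
]
wbc_calls = [
    Variant("DNMT3A", "p.R882H", 0.040),
    Variant("TET2", "p.Q810*", 0.020),
]

tumor, chip = split_chip(cfdna_calls, wbc_calls)
print("likely tumor:", [v.gene for v in tumor])
print("likely CHIP:", [v.gene for v in chip])
```

A matched-WBC filter of this kind only works when WBC sequencing is available for every patient. The random forest in step (4) is aimed at the harder case, where CHIP has to be inferred from features such as VAF and trinucleotide context.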
+---
+## Key Cross-Cutting Themes
+1. **ApoE biology** connects LNP brain targeting (Dir. 1, 2) with corona-mediated organ selectivity (Dir. 9)
+2. **Stability advantage** of circRNA (Dir. 4) and EV surface proteins (Dir. 6) over conventional analytes is underexploited
+3. **Patient heterogeneity** (ApoE genotype, CHIP status, immune status) is systematically ignored in LNP and liquid biopsy studies
+4. **Corona as diagnostic tool** (Dir. 8) inverts the usual framing — instead of preventing corona, using it as a signal
+## Recommended Priority Order (impact × feasibility)
+| Priority | Direction | Rationale |
+|----------|-----------|-----------|
+| 1 | Dir. 3 (Intranasal LNP) | Low cost, high feasibility, unmet need in GBM |
+| 2 | Dir. 7 (CHIP correction) | Computational, leverages existing datasets |
+| 3 | Dir. 5 (cfRNA splicing GBM) | Addresses unique GBM liquid biopsy gap |
+| 4 | Dir. 8 (Corona fingerprinting) | Novel concept, moderate cost |
+| 5 | Dir. 1 (ApoE isoform LNP) | High impact if validated |
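As a quick check on the feasibility side of this ranking, the cost ranges from the directions table can be reduced to midpoints and sorted. The sketch below does only that. The `midpoint` helper is hypothetical, and cost is treated as a single rough proxy for feasibility, not as the author's scoring method.

```python
import re

# Cost ranges (USD) copied from the directions table for the five
# prioritized directions.
COSTS = {
    "Dir. 3": "$55,000–$90,000",
    "Dir. 7": "$130,000–$210,000",
    "Dir. 5": "$95,000–$150,000",
    "Dir. 8": "$160,000–$250,000",
    "Dir. 1": "$85,000–$140,000",
}


def midpoint(cost_range):
    """Return the midpoint of a '$low–$high' range as a float."""
    low, high = (int(x.replace(",", "")) for x in re.findall(r"\d[\d,]*", cost_range))
    return (low + high) / 2


# Directions ordered from cheapest to most expensive midpoint.
by_cost = sorted(COSTS, key=lambda d: midpoint(COSTS[d]))

for direction in by_cost:
    print(f"{direction}: ${midpoint(COSTS[direction]):,.0f}")
```

Sorting by cost alone puts Dir. 3 first, in line with the table, but it does not reproduce the rest of the order: Dir. 7 ranks second in the table despite a higher budget, which matches its stated rationale of being computational and reusing existing datasets.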
+---
+*Analysis generated by K R&D Lab Cancer Research Suite | 2026-03-07*