Spaces: theodabos/varientlens
Codex Claude Opus 4.7 committed 24 days ago

Commit: 10b0e7d
Parent: 6ea8c5e

RAG: expand PubMed query with protein forms; full-text on by default

The PubMed query exact-phrase matched on the coding HGVS form only
(e.g. c.413G>A), which excluded every paper that cites the variant by
protein change. For older Mendelian disease genes such as NPHS2,
COL4A5, and GLA, and for the podocinopathy and collagenopathy
literature in general, papers virtually never use the canonical HGVS
coding form. They use R138Q, Arg138Gln, or p.R138Q. Result: RAG
returned no hits even for well-characterized variants.

build_query now emits five alternate spellings per variant, each
quoted so PubMed treats them as exact phrases:
- coding HGVS (existing)
- p.Arg138Gln, Arg138Gln (three-letter, with/without prefix)
- R138Q, p.R138Q (one-letter, with/without prefix)
Nonsense variants emit R###* in addition to Arg###Ter. The protein
HGVS is parsed once at the build_query layer; the previous "lone p
token" regression is guarded by an explicit unit test.
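
As a rough illustration of the expansion described above, the following minimal sketch derives the five spellings for a simple substitution and joins them into a quoted OR clause. It is not the committed `build_query`: the helper names `expand_substitution` and `pubmed_query` and the abbreviated amino-acid table are assumptions made for this example, and only plain three-letter substitutions (including `Ter`) are handled.

```python
import re

# Abbreviated three-letter -> one-letter table (illustrative subset only).
AA3_TO_1 = {"Arg": "R", "Gln": "Q", "Gly": "G", "Val": "V", "Ter": "*"}


def expand_substitution(coding: str, protein: str) -> list[str]:
    """Return the coding form followed by four protein spellings."""
    terms = [coding]
    # Accepts p.Arg138Gln and the parenthesised p.(Arg138Gln).
    m = re.search(r"p\.\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)?", protein)
    if m:
        ref3, pos, alt3 = m.groups()
        terms += [f"p.{ref3}{pos}{alt3}", f"{ref3}{pos}{alt3}"]
        ref1, alt1 = AA3_TO_1.get(ref3), AA3_TO_1.get(alt3)
        if ref1 and alt1:
            terms += [f"{ref1}{pos}{alt1}", f"p.{ref1}{pos}{alt1}"]
    return terms


def pubmed_query(gene: str, terms: list[str]) -> str:
    """Quote every term so each one is sent as an exact phrase."""
    quoted = " OR ".join(f'"{t}"' for t in terms)
    return f'("{gene}") AND ({quoted})'


terms = expand_substitution("c.413G>A", "p.(Arg138Gln)")
query = pubmed_query("NPHS2", terms)
print(query)
```

Because every term is wrapped in double quotes before joining, no single-letter fragment such as `p` can end up as a standalone token in the OR clause.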

Also flips rag_fetch_fulltext to default True. Live demo now runs the
full FullTextFetcher strategy chain (EuropePMC → PMC → bioRxiv →
Unpaywall → Semantic Scholar → OpenAlex) for every paper, giving
Claude body text to cite from rather than abstracts only. Adds
~30-60s per query but materially improves PS3/PP1/PM3 citation
specificity.
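
The strategy chain can be pictured as an ordered list of sources tried in turn until one returns body text. The sketch below is a simplified, synchronous illustration under that assumption. `fetch_fulltext`, the `Strategy` alias, the stub strategies, and the placeholder PMID are all hypothetical and do not reflect the real `FullTextFetcher` interface.

```python
from typing import Callable, Optional

# A strategy takes a PubMed ID and returns body text, or None on a miss.
Strategy = Callable[[str], Optional[str]]


def fetch_fulltext(
    pmid: str, chain: list[tuple[str, Strategy]]
) -> tuple[Optional[str], Optional[str]]:
    """Try each strategy in order; return (source_name, text) for the first hit."""
    for name, strategy in chain:
        try:
            text = strategy(pmid)
        except Exception:
            # A failing source should not abort the chain; try the next one.
            continue
        if text:
            return name, text
    # Nothing produced body text; the caller would fall back to the abstract.
    return None, None


# Stub strategies standing in for real network calls.
def europe_pmc(pmid: str) -> Optional[str]:
    return None  # simulated miss


def pmc(pmid: str) -> Optional[str]:
    raise TimeoutError("simulated timeout")


def biorxiv(pmid: str) -> Optional[str]:
    return "Full body text of the paper..."


chain = [("EuropePMC", europe_pmc), ("PMC", pmc), ("bioRxiv", biorxiv)]
source, text = fetch_fulltext("00000000", chain)  # placeholder PMID
print(source)
```

Under this reading, the per-query latency cost depends on how far down the chain each paper has to go before a source succeeds.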

7 new tests cover query expansion across coding/protein/nonsense
forms plus the regression guard.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (3)

backend/app/config.py +6 -4
backend/app/services/rag/fetcher.py +93 -18
backend/tests/test_rag_query.py +103 -0

backend/app/config.py CHANGED

@@ -61,10 +61,12 @@ class Settings(BaseSettings):
     jwt_algorithm: str = "HS256"
     jwt_expire_minutes: int = 480
-    # Default off for the live demo — abstracts give Claude enough to cite
-    # without paying the ~30-60s full-text fetch + PDF parse tax per query.
-    # Re-enable (RAG_FETCH_FULLTEXT=true) for offline batch validation.
-    rag_fetch_fulltext: bool = False
     # Toggle pypdf-based text extraction from Unpaywall / Wiley / Semantic
     # Scholar / OpenAlex PDFs. Slim deployments that don't ship pypdf hit
     # the soft-import guard in `_extract_pdf_text` and degrade gracefully

     jwt_algorithm: str = "HS256"
     jwt_expire_minutes: int = 480
+    # Full-text fetch is on by default — every paper that comes back from
+    # PubMed gets the FullTextFetcher strategy chain (EuropePMC → PMC →
+    # bioRxiv → Unpaywall → Semantic Scholar → OpenAlex) so Claude has
+    # body text to cite from, not just an abstract. Set
+    # RAG_FETCH_FULLTEXT=false to revert to abstracts-only.
+    rag_fetch_fulltext: bool = True
     # Toggle pypdf-based text extraction from Unpaywall / Wiley / Semantic
     # Scholar / OpenAlex PDFs. Slim deployments that don't ship pypdf hit
     # the soft-import guard in `_extract_pdf_text` and degrade gracefully
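
For readers unfamiliar with environment-driven settings, here is a standard-library approximation of how a boolean flag with a default of `True` could be resolved from `RAG_FETCH_FULLTEXT`. The project itself relies on `BaseSettings`, whose parsing rules may differ in detail. `env_bool` and the accepted spellings are assumptions of this sketch.

```python
import os

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean from the environment, falling back to `default` if unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"{name} must be a boolean-like value, got {raw!r}")


# With the variable unset, the new default (full text on) applies.
rag_fetch_fulltext = env_bool("RAG_FETCH_FULLTEXT", default=True)
print(rag_fetch_fulltext)
```

Setting `RAG_FETCH_FULLTEXT=false` before start-up would make the sketch return `False`, matching the abstracts-only behaviour described in the comment.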

backend/app/services/rag/fetcher.py CHANGED

@@ -90,33 +90,108 @@ class LiteratureFetcher:
     def build_query(
         self, gene: str, hgvs: str, protein: str | None, raw_hgvs: str | None = None
     ) -> str:
-        # Only emit coding (c./g./n.) HGVS terms. Protein forms cause
-        # silent corruption: PubMed's eutils does not parse phrases like
-        # "p.(Gln1756ProfsTer74)" or "p.Gln1756ProfsTer74" as a single
-        # term, drops them to `quotedphrasesnotfound`, AND leaves a bare
-        # `p` token in the OR clause. The lone `p` matches every paper
-        # mentioning protein → 7000+ irrelevant hits dominated by random
-        # BRCA1 papers. Coding HGVS doesn't have this problem.
-        #
-        # Also include both Mutalyzer-normalized (e.g. c.5266dup) and the
-        # user's raw input (e.g. c.5266dupC) — papers always use the
-        # original form with the trailing nucleotide.
-        hgvs_terms: list[str] = []
         for h in (hgvs, raw_hgvs):
             if not h:
                 continue
             short = self._strip_transcript_prefix(h)
             if not short or short.startswith("p."):
                 continue
-            if short not in hgvs_terms:
-                hgvs_terms.append(short)
-        if not hgvs_terms:
-            # Edge case — no coding form available. Fall back to gene only;
-            # caller will get a broad result set but at least nothing junk.
             return f'"{gene}"'
-        quoted = [f'"{t}"' for t in hgvs_terms]
         return f'("{gene}") AND ({" OR ".join(quoted)})'
     @retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10), reraise=True)
     async def search_pubmed(self, query: str, retmax: int | None = None) -> list[str]:
         cap = retmax if retmax is not None else self.max_results

     def build_query(
         self, gene: str, hgvs: str, protein: str | None, raw_hgvs: str | None = None
     ) -> str:
+        """Build a PubMed eutils query that matches the variant whether the
+        paper uses HGVS coding, HGVS protein, three-letter codon notation,
+        or one-letter codon notation. Each variant identifier is a quoted
+        phrase to avoid the previous "bare p token" bug — PubMed will only
+        match the literal phrase, never a substring.
+        Coverage strategy:
+        - HGVS coding: `c.413G>A`, `c.5266dupC` (both Mutalyzer-normalized
+          and user-raw, because papers often retain the trailing nucleotide).
+        - HGVS protein: `p.(Arg138Gln)` input, emitted as `p.Arg138Gln`
+          (paren-stripped).
+        - Three-letter short: `Arg138Gln`.
+        - One-letter short: `R138Q` — by far the most common form in older
+          literature, especially channelopathy / collagenopathy / Fabry /
+          podocinopathy papers that predate ClinVar's HGVS-first convention.
+        The risk addressed in the previous version (bare `p` matching
+        everything) only happens when a phrase isn't fully quoted. Here
+        every term is wrapped in double-quotes and joined with OR.
+        """
+        terms: list[str] = []
+        # --- Coding HGVS terms (Mutalyzer + raw) ---
         for h in (hgvs, raw_hgvs):
             if not h:
                 continue
             short = self._strip_transcript_prefix(h)
             if not short or short.startswith("p."):
                 continue
+            if short not in terms:
+                terms.append(short)
+        # --- Protein HGVS terms ---
+        protein_forms = self._expand_protein_forms(protein, hgvs, raw_hgvs)
+        for pf in protein_forms:
+            if pf not in terms:
+                terms.append(pf)
+        if not terms:
             return f'"{gene}"'
+        quoted = [f'"{t}"' for t in terms]
         return f'("{gene}") AND ({" OR ".join(quoted)})'
+    # Standard amino-acid three-letter → one-letter table. Stop codons
+    # are represented as `*` / `Ter` / `X` in different journals; emit
+    # both common variants when we hit one.
+    _AA3_TO_1: dict[str, str] = {
+        "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
+        "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
+        "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
+        "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
+        "Ter": "*", "Sec": "U", "Pyl": "O",
+    }
+    @classmethod
+    def _expand_protein_forms(
+        cls,
+        protein: str | None,
+        hgvs: str | None,
+        raw_hgvs: str | None,
+    ) -> list[str]:
+        """Return every alternate protein-change spelling we can derive
+        from the canonical protein HGVS. All entries are quoted-phrase
+        safe (no bare single-letter tokens) since the caller wraps each
+        in double quotes."""
+        import re
+        out: list[str] = []
+        sources = [protein] + [
+            cls._strip_transcript_prefix(h) for h in (hgvs, raw_hgvs) if h
+        ]
+        for src in sources:
+            if not src:
+                continue
+            # Accept p.(Arg138Gln) or p.Arg138Gln, with any leading accession prefix.
+            m = re.search(r"p\.?\(?([A-Za-z]{3})(\d+)([A-Za-z]{3}|=|\*|Ter|fs\w*)\)?", src)
+            if not m:
+                continue
+            ref3, pos, alt3 = m.group(1), m.group(2), m.group(3)
+            ref3_t = ref3.title()
+            alt3_t = alt3.title() if alt3.isalpha() else alt3
+            # Three-letter (HGVS canonical and stripped)
+            three = f"{ref3_t}{pos}{alt3_t}"
+            out.append(f"p.{three}")
+            out.append(three)
+            # One-letter (common in older literature)
+            r1 = cls._AA3_TO_1.get(ref3_t)
+            a1 = cls._AA3_TO_1.get(alt3_t) if alt3_t in cls._AA3_TO_1 else (
+                "*" if alt3_t in ("Ter", "*") else None
+            )
+            if r1 and a1:
+                out.append(f"{r1}{pos}{a1}")
+                out.append(f"p.{r1}{pos}{a1}")
+            elif r1 and alt3_t.startswith("fs"):
+                out.append(f"{r1}{pos}fs")
+        # De-dupe preserving order
+        seen: set[str] = set()
+        unique = []
+        for t in out:
+            if t not in seen:
+                seen.add(t)
+                unique.append(t)
+        return unique
     @retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10), reraise=True)
     async def search_pubmed(self, query: str, retmax: int | None = None) -> list[str]:
         cap = retmax if retmax is not None else self.max_results
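
The comment on the amino-acid table notes that journals write stop codons as `*`, `Ter`, or `X`. The sketch below shows one way all three could be enumerated for a nonsense variant. It is a hypothetical helper (`nonsense_spellings`) with an abbreviated table. The committed tests assert only the `Ter` and `*` forms, so the `X` spellings here are this sketch's own extension.

```python
# Abbreviated three-letter -> one-letter table (illustrative subset only).
AA3_TO_1 = {"Arg": "R", "Gln": "Q", "Trp": "W", "Tyr": "Y"}


def nonsense_spellings(ref3: str, pos: int) -> list[str]:
    """List common spellings of a stop-gained change at `pos`."""
    # Three-letter forms, with and without the p. prefix.
    out = [f"p.{ref3}{pos}Ter", f"{ref3}{pos}Ter", f"p.{ref3}{pos}*"]
    ref1 = AA3_TO_1.get(ref3)
    if ref1:
        # One-letter forms with both stop symbols seen in the literature.
        for stop in ("*", "X"):
            out += [f"{ref1}{pos}{stop}", f"p.{ref1}{pos}{stop}"]
    return out


forms = nonsense_spellings("Arg", 306)
print(forms)
```

Each spelling is at least two characters long and would be quoted by the caller, so the "bare `p`" failure mode does not apply to any of them.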

backend/tests/test_rag_query.py ADDED

@@ -0,0 +1,103 @@
+"""Tests for the PubMed query builder.
+The fetcher only sees papers PubMed indexes under the exact phrases in
+its `term` parameter. The previous coding-only query missed every paper
+that cites the variant by protein change (R138Q, Arg138Gln) — the
+expansion below adds those forms.
+"""
+from backend.app.services.rag.fetcher import LiteratureFetcher
+def _terms_in_query(q: str) -> set[str]:
+    """Pull all quoted phrases out of `(GENE) AND ("a" OR "b" OR ...)`."""
+    import re
+    return set(re.findall(r'"([^"]+)"', q))
+def test_query_includes_coding_hgvs() -> None:
+    f = LiteratureFetcher()
+    q = f.build_query(
+        "NPHS2",
+        "NM_014625.4:c.413G>A",
+        "NM_014625.4(NP_055440.1):p.(Arg138Gln)",
+    )
+    terms = _terms_in_query(q)
+    assert "NPHS2" in terms
+    assert "c.413G>A" in terms
+def test_query_includes_three_letter_and_one_letter_protein() -> None:
+    """Recovers papers that cite the variant by protein change rather than
+    HGVS coding — the major failure mode on older Mendelian disease genes."""
+    f = LiteratureFetcher()
+    q = f.build_query(
+        "NPHS2",
+        "NM_014625.4:c.413G>A",
+        "NM_014625.4(NP_055440.1):p.(Arg138Gln)",
+    )
+    terms = _terms_in_query(q)
+    assert "p.Arg138Gln" in terms
+    assert "Arg138Gln" in terms
+    assert "p.R138Q" in terms
+    assert "R138Q" in terms
+def test_query_handles_collagenopathy_glycine_substitution() -> None:
+    """COL4A5 Alport — glycine substitutions are usually cited as G953V
+    in the channelopathy / collagenopathy literature."""
+    f = LiteratureFetcher()
+    q = f.build_query(
+        "COL4A5",
+        "NM_033380.3:c.2858G>T",
+        "NP_203699.1:p.(Gly953Val)",
+    )
+    terms = _terms_in_query(q)
+    assert "Gly953Val" in terms
+    assert "G953V" in terms
+def test_query_no_bare_p_token() -> None:
+    """The regression guard for the original bug — a bare `p` token in the
+    OR clause matched every paper mentioning protein. Every term we emit
+    must be fully quoted and at least 2 characters."""
+    f = LiteratureFetcher()
+    q = f.build_query(
+        "BRCA1",
+        "NM_007294.4:c.5266dup",
+        "NM_007294.4(NP_009225.1):p.(Gln1756ProfsTer74)",
+    )
+    terms = _terms_in_query(q)
+    assert "p" not in terms
+    assert all(len(t) >= 2 for t in terms)
+def test_query_falls_back_to_gene_when_no_hgvs() -> None:
+    f = LiteratureFetcher()
+    q = f.build_query("BRCA1", "", None)
+    assert q == '"BRCA1"'
+def test_query_handles_nonsense_variants() -> None:
+    """Stop-gained variants get cited as R306X, p.R306*, p.Arg306Ter
+    interchangeably."""
+    f = LiteratureFetcher()
+    q = f.build_query(
+        "PKD2",
+        "NM_000297.4:c.916C>T",
+        "NP_000288.1:p.(Arg306Ter)",
+    )
+    terms = _terms_in_query(q)
+    assert "Arg306Ter" in terms
+    assert "R306*" in terms
+def test_expand_protein_forms_dedup() -> None:
+    """The expander should not emit duplicate spellings when given the
+    same protein change through multiple input fields."""
+    forms = LiteratureFetcher._expand_protein_forms(
+        protein="NP_055440.1:p.(Arg138Gln)",
+        hgvs="NM_014625.4:c.413G>A",
+        raw_hgvs="NM_014625.4(NP_055440.1):p.(Arg138Gln)",
+    )
+    assert len(forms) == len(set(forms))
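
To make the test helper concrete, the sketch below applies the same regular expression as `_terms_in_query` to a query string of the shape `build_query` is described as returning. The query string is written out by hand for illustration rather than captured from a run, and the function is renamed `terms_in_query` to mark it as a standalone copy.

```python
import re


def terms_in_query(q: str) -> set[str]:
    """Collect every double-quoted phrase in a PubMed query string."""
    return set(re.findall(r'"([^"]+)"', q))


# Hand-written example query (gene AND quoted OR clause).
query = (
    '("NPHS2") AND ("c.413G>A" OR "p.Arg138Gln" OR "Arg138Gln" '
    'OR "R138Q" OR "p.R138Q")'
)
terms = terms_in_query(query)

# The same style of checks the tests above perform.
assert "R138Q" in terms
assert "p" not in terms
assert all(len(t) >= 2 for t in terms)
print(sorted(terms))
```

Note that the helper only sees quoted phrases, so these checks describe the quoted terms rather than any unquoted text in the query.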