# P0 Actionable Fixes - What to Do

**Date:** November 27, 2025
**Status:** ACTIONABLE

---

## Summary: What's Broken and What's Fixable

| Tool | Problem | Fixable? | How |
|------|---------|----------|-----|
| BioRxiv | API has NO search endpoint | **NO** | Replace with Europe PMC |
| PubMed | No query preprocessing | **YES** | Add query cleaner |
| ClinicalTrials | No filters applied | **YES** | Add filter params |
| Magentic Framework | Nothing wrong | N/A | Already working |

---

## FIX 1: Replace BioRxiv with Europe PMC (30 min)

### Why BioRxiv Can't Be Fixed

The bioRxiv API exposes papers only through a date-range endpoint:
```
https://api.biorxiv.org/details/{server}/{date-range}/{cursor}/json
```

This returns papers **by date**, not by keyword. There is NO search endpoint.

**Proof:** I queried `medrxiv/2024-01-01/2024-01-02` and got:
- "Global risk of Plasmodium falciparum" (malaria)
- "Multiple Endocrine Neoplasia in India"
- "Acupuncture for Acute Musculoskeletal Pain"

**None of these are about Long COVID** because the API doesn't search.
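
To make the limitation concrete, here is a minimal sketch of the only thing a client can do with this endpoint: fetch a date window and filter titles locally. The helper names `build_details_url` and `filter_by_keyword` are hypothetical, and the `title` key is an assumption about the response records that should be checked against a live response.

```python
"""Sketch: the details endpoint lists by date, so keyword matching is client-side."""

BASE = "https://api.biorxiv.org/details"


def build_details_url(server: str, start: str, end: str, cursor: int = 0) -> str:
    """Build the date-window URL; note there is no slot for a search term."""
    return f"{BASE}/{server}/{start}/{end}/{cursor}/json"


def filter_by_keyword(papers: list[dict], keyword: str) -> list[dict]:
    """Keep papers whose title contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [p for p in papers if needle in p.get("title", "").lower()]


# Titles reported above for medrxiv/2024-01-01/2024-01-02
sample = [
    {"title": "Global risk of Plasmodium falciparum"},
    {"title": "Multiple Endocrine Neoplasia in India"},
    {"title": "Acupuncture for Acute Musculoskeletal Pain"},
]

print(build_details_url("medrxiv", "2024-01-01", "2024-01-02"))
print(filter_by_keyword(sample, "long covid"))  # nothing in this window matches
```

Paging through every date window and filtering locally would be slow and would still miss most relevant papers, which is why replacing the tool is the practical option.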

### Europe PMC Has Search + Preprints

```bash
curl "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=long+covid+treatment&resultType=core&pageSize=3&format=json"
```

Returns 283,058 results including:
- "Long COVID Treatment No Silver Bullets, Only a Few Bronze BBs" ✅

### The Fix

Replace `src/tools/biorxiv.py` with `src/tools/europepmc.py`:

```python
"""Europe PMC preprint and paper search tool."""

import httpx
from src.utils.models import Citation, Evidence

class EuropePMCTool:
    """Search Europe PMC for papers and preprints."""

    BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

    @property
    def name(self) -> str:
        return "europepmc"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """Search Europe PMC (includes preprints from bioRxiv/medRxiv)."""
        params = {
            "query": query,
            "resultType": "core",
            "pageSize": max_results,
            "format": "json",
        }

        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.get(self.BASE_URL, params=params)
            response.raise_for_status()

            data = response.json()
            results = data.get("resultList", {}).get("result", [])

            return [self._to_evidence(r) for r in results]

    def _to_evidence(self, result: dict) -> Evidence:
        """Convert Europe PMC result to Evidence."""
        title = result.get("title", "Untitled")
        abstract = result.get("abstractText", "No abstract")
        doi = result.get("doi", "")
        pub_year = result.get("pubYear", "Unknown")
        source = result.get("source", "")

        # Mark preprints: Europe PMC files them under source "PPR";
        # the pubType check is kept as a fallback
        pub_type = result.get("pubTypeList", {}).get("pubType", [])
        is_preprint = source == "PPR" or "Preprint" in pub_type

        content = f"{'[PREPRINT] ' if is_preprint else ''}{abstract[:1800]}"

        return Evidence(
            content=content,
            citation=Citation(
                source="europepmc" if not is_preprint else "preprint",
                title=title[:500],
                url=f"https://doi.org/{doi}" if doi else "",
                date=str(pub_year),
            ),
            relevance=0.75 if is_preprint else 0.9,
        )
```
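
If the preprint tool should return only preprints rather than every Europe PMC record, one option is to narrow the query itself. The sketch below assumes Europe PMC's `SRC:PPR` source filter, which is how its search syntax labels preprint records. `build_preprint_params` is a hypothetical helper, not part of the tool above.

```python
from __future__ import annotations


def build_preprint_params(query: str, max_results: int = 10) -> dict[str, str | int]:
    """Build search params that restrict results to preprints.

    Assumption: Europe PMC indexes preprints under the source code PPR,
    so appending `SRC:PPR` limits the search to them.
    """
    return {
        "query": f"({query}) AND SRC:PPR",
        "resultType": "core",
        "pageSize": max_results,
        "format": "json",
    }


params = build_preprint_params("long covid treatment", max_results=3)
print(params["query"])  # (long covid treatment) AND SRC:PPR
```

The returned dict has the same shape as the `params` built in `search`, so it could be passed to `client.get` unchanged.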

---

## FIX 2: Add PubMed Query Preprocessing (1 hour)

### Current Problem

User enters: `What medications show promise for Long COVID?`
PubMed receives: `What medications show promise for Long COVID?`

The question words pollute the search.

### The Fix

Add `src/tools/query_utils.py`:

```python
"""Query preprocessing utilities."""

import re

# Question words and filler verbs to remove
QUESTION_WORDS = {
    "what", "which", "how", "why", "when", "where", "who",
    "is", "are", "can", "could", "would", "should", "do", "does",
    "show", "promise", "help", "treat", "cure",
}

# Medical synonyms to expand
SYNONYMS = {
    "long covid": ["long COVID", "PASC", "post-COVID syndrome", "post-acute sequelae"],
    "alzheimer": ["Alzheimer's disease", "AD", "Alzheimer dementia"],
    "cancer": ["neoplasm", "tumor", "malignancy", "carcinoma"],
}

def preprocess_pubmed_query(raw_query: str) -> str:
    """Convert natural language to cleaner PubMed query."""
    # Lowercase
    query = raw_query.lower()

    # Remove question marks
    query = query.replace("?", "")

    # Remove question words
    words = query.split()
    words = [w for w in words if w not in QUESTION_WORDS]
    query = " ".join(words)

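    # Expand synonyms, matching whole words only so "cancer" does not hit "cancerous"
    for term, expansions in SYNONYMS.items():
        # Build the OR clause and substitute it for each whole-word match
        expansion = " OR ".join(f'"{e}"' for e in expansions)
        query = re.sub(rf"\b{re.escape(term)}\b", lambda _: f"({expansion})", query)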

    return query.strip()
```

Then update `src/tools/pubmed.py`:

```python
from src.tools.query_utils import preprocess_pubmed_query

async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
    # Preprocess query
    clean_query = preprocess_pubmed_query(query)

    search_params = self._build_params(
        db="pubmed",
        term=clean_query,  # Use cleaned query
        retmax=max_results,
        sort="relevance",
    )
    # ... rest unchanged
```
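
To see what the preprocessor does to the example query from this section, the three steps can be traced by hand. This is a standalone sketch: the two tables are trimmed copies of the ones in `query_utils.py`, holding just enough entries for this one query.

```python
"""Sketch: trace the example query through the three preprocessing steps."""

# Trimmed copies of the tables in query_utils.py, enough for this example
QUESTION_WORDS = {"what", "show", "promise"}
SYNONYMS = {
    "long covid": ["long COVID", "PASC", "post-COVID syndrome", "post-acute sequelae"],
}

raw = "What medications show promise for Long COVID?"

# Step 1: lowercase and drop the question mark
step1 = raw.lower().replace("?", "")

# Step 2: drop question and filler words
step2 = " ".join(w for w in step1.split() if w not in QUESTION_WORDS)

# Step 3: expand known terms into an OR group
step3 = step2
for term, expansions in SYNONYMS.items():
    if term in step3:
        group = " OR ".join(f'"{e}"' for e in expansions)
        step3 = step3.replace(term, f"({group})")

print(step1)  # what medications show promise for long covid
print(step2)  # medications for long covid
print(step3)
```

The last line prints `medications for ("long COVID" OR "PASC" OR "post-COVID syndrome" OR "post-acute sequelae")`. The word `for` survives because it is not in `QUESTION_WORDS`; adding common prepositions to the set may be worth considering.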

---

## FIX 3: Add ClinicalTrials.gov Filters (30 min)

### Current Problem

The tool returns ALL trials, including withdrawn, terminated, and observational studies.

### The Fix

The API supports `filter.overallStatus` and other filters. Update `src/tools/clinicaltrials.py`:

```python
async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
    params: dict[str, str | int] = {
        "query.term": query,
        "pageSize": min(max_results, 100),
        "fields": "|".join(self.FIELDS),
        # ADD THESE FILTERS:
        "filter.overallStatus": "COMPLETED|RECRUITING|ACTIVE_NOT_RECRUITING",
        # Only interventional studies (not observational)
        "aggFilters": "studyType:int",
    }
    # ... rest unchanged
```

**Note:** I tested the API. It supports filtering, but the exact parameter syntax differs slightly from the snippet above, so confirm parameter names and values against the [API docs](https://clinicaltrials.gov/data-api/api) before merging.
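
Until the server-side filter syntax is confirmed, the same rule can be applied client-side as a fallback. The sketch below is an illustration under assumptions: `keep_study` and `make_study` are hypothetical helpers, and the nested `protocolSection.statusModule.overallStatus` and `protocolSection.designModule.studyType` layout is assumed from the v2 response format and should be verified against a real response.

```python
"""Sketch: client-side fallback that keeps only the wanted trials."""

WANTED_STATUSES = {"COMPLETED", "RECRUITING", "ACTIVE_NOT_RECRUITING"}


def keep_study(study: dict) -> bool:
    """Return True for interventional studies with a wanted status.

    Assumption: each study carries its status and type under
    protocolSection.statusModule and protocolSection.designModule.
    """
    protocol = study.get("protocolSection", {})
    status = protocol.get("statusModule", {}).get("overallStatus", "")
    study_type = protocol.get("designModule", {}).get("studyType", "")
    return status in WANTED_STATUSES and study_type == "INTERVENTIONAL"


def make_study(status: str, study_type: str) -> dict:
    """Build a minimal study record shaped like the assumed API response."""
    return {
        "protocolSection": {
            "statusModule": {"overallStatus": status},
            "designModule": {"studyType": study_type},
        }
    }


studies = [
    make_study("COMPLETED", "INTERVENTIONAL"),
    make_study("WITHDRAWN", "INTERVENTIONAL"),
    make_study("RECRUITING", "OBSERVATIONAL"),
]
kept = [s for s in studies if keep_study(s)]
print(len(kept))  # 1
```

This fallback only works if the requested `fields` include the status and study type, and it filters after the page is fetched, so a page may yield fewer than `max_results` trials.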

---

## What NOT to Change

### Microsoft Agent Framework - WORKING

I verified:
```python
from agent_framework import MagenticBuilder, ChatAgent
from agent_framework.openai import OpenAIChatClient
# All imports OK

orchestrator = MagenticOrchestrator(max_rounds=2)
workflow = orchestrator._build_workflow()
# Workflow built successfully
```

The Magentic agents are correctly wired:
- SearchAgent → GPT-5.1 ✅
- JudgeAgent → GPT-5.1 ✅
- HypothesisAgent → GPT-5.1 ✅
- ReportAgent → GPT-5.1 ✅

**The framework is fine. The tools it calls are broken.**

---

## Priority Order

1. **Replace BioRxiv** → Immediate, fundamental
2. **Add PubMed preprocessing** → High impact, easy
3. **Add ClinicalTrials filters** → Medium impact, easy

---

## Test After Fixes

```bash
# Test Europe PMC
uv run python -c "
import asyncio
from src.tools.europepmc import EuropePMCTool
tool = EuropePMCTool()
results = asyncio.run(tool.search('long covid treatment', 3))
for r in results:
    print(r.citation.title)
"

# Test PubMed with preprocessing
uv run python -c "
from src.tools.query_utils import preprocess_pubmed_query
q = 'What medications show promise for Long COVID?'
print(preprocess_pubmed_query(q))
# Should output: medications for (\"long COVID\" OR \"PASC\" OR \"post-COVID syndrome\" OR \"post-acute sequelae\")
"
```

---

## After These Fixes

The Magentic workflow will then run as follows:
1. SearchAgent calls `search_pubmed("long COVID treatment")` → Gets RELEVANT papers
2. SearchAgent calls `search_preprints("long COVID treatment")` → Gets RELEVANT preprints via Europe PMC
3. SearchAgent calls `search_clinical_trials("long COVID")` → Gets INTERVENTIONAL trials only
4. JudgeAgent evaluates GOOD evidence
5. HypothesisAgent generates hypotheses from GOOD evidence
6. ReportAgent synthesizes a GOOD report

**The framework will work once we feed it good data.**