# P0 Actionable Fixes - What to Do
**Date:** November 27, 2025
**Status:** ACTIONABLE
---
## Summary: What's Broken and What's Fixable
| Tool | Problem | Fixable? | How |
|------|---------|----------|-----|
| BioRxiv | API has NO search endpoint | **NO** | Replace with Europe PMC |
| PubMed | No query preprocessing | **YES** | Add query cleaner |
| ClinicalTrials | No filters applied | **YES** | Add filter params |
| Magentic Framework | Nothing wrong | N/A | Already working |
---
## FIX 1: Replace BioRxiv with Europe PMC (30 min)
### Why BioRxiv Can't Be Fixed
The bioRxiv API only has this endpoint:
```
https://api.biorxiv.org/details/{server}/{date-range}/{cursor}/json
```
This returns papers **by date**, not by keyword. There is NO search endpoint.
**Proof:** I queried `medrxiv/2024-01-01/2024-01-02` and got:
- "Global risk of Plasmodium falciparum" (malaria)
- "Multiple Endocrine Neoplasia in India"
- "Acupuncture for Acute Musculoskeletal Pain"
**None of these are about Long COVID** because the API doesn't search.
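To make the limitation concrete, the sketch below builds the only URL shape the endpoint above accepts. The helper name `biorxiv_details_url` is hypothetical, and the date-range format is taken from the query quoted above. There is no slot through which a keyword could be passed.

```python
from datetime import date


def biorxiv_details_url(server: str, start: date, end: date, cursor: int = 0) -> str:
    """Build a bioRxiv `details` URL (hypothetical helper, for illustration only)."""
    # The path accepts only a server, a date range, and a paging cursor.
    date_range = f"{start.isoformat()}/{end.isoformat()}"
    return f"https://api.biorxiv.org/details/{server}/{date_range}/{cursor}/json"


url = biorxiv_details_url("medrxiv", date(2024, 1, 1), date(2024, 1, 2))
print(url)
```

Any keyword filtering would therefore have to happen client-side after paging through every paper in the date range, which is why replacing the tool is the practical option.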
### Europe PMC Has Search + Preprints
```bash
curl "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=long+covid+treatment&resultType=core&pageSize=3&format=json"
```
Returns 283,058 results including:
- "Long COVID Treatment No Silver Bullets, Only a Few Bronze BBs" ✅
### The Fix
Replace `src/tools/biorxiv.py` with `src/tools/europepmc.py`:
```python
"""Europe PMC preprint and paper search tool."""

import httpx

from src.utils.models import Citation, Evidence


class EuropePMCTool:
    """Search Europe PMC for papers and preprints."""

    BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

    @property
    def name(self) -> str:
        return "europepmc"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """Search Europe PMC (includes preprints from bioRxiv/medRxiv)."""
        params = {
            "query": query,
            "resultType": "core",
            "pageSize": max_results,
            "format": "json",
        }
        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.get(self.BASE_URL, params=params)
            response.raise_for_status()
            data = response.json()
        results = data.get("resultList", {}).get("result", [])
        return [self._to_evidence(r) for r in results]

    def _to_evidence(self, result: dict) -> Evidence:
        """Convert Europe PMC result to Evidence."""
        title = result.get("title", "Untitled")
        abstract = result.get("abstractText", "No abstract")
        doi = result.get("doi", "")
        pub_year = result.get("pubYear", "Unknown")
        source = result.get("source", "europepmc")
        # Mark preprints
        pub_type = result.get("pubTypeList", {}).get("pubType", [])
        is_preprint = "Preprint" in pub_type
        content = f"{'[PREPRINT] ' if is_preprint else ''}{abstract[:1800]}"
        return Evidence(
            content=content,
            citation=Citation(
                source="europepmc" if not is_preprint else "preprint",
                title=title[:500],
                url=f"https://doi.org/{doi}" if doi else "",
                date=str(pub_year),
            ),
            relevance=0.75 if is_preprint else 0.9,
        )
```
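The preprint check can be exercised offline before the tool is wired in. The following is a minimal sketch, not the project's code: `SAMPLE_RESPONSE` is a hand-written payload that only mimics the fields the tool reads (`resultList.result`, `pubTypeList.pubType`), and `is_preprint` is a hypothetical helper. It also tolerates `pubType` arriving as a single string, which is an assumption about possible payload variation rather than documented behavior.

```python
from typing import Any

# Hand-written stand-in for a Europe PMC response; not real API output.
SAMPLE_RESPONSE: dict[str, Any] = {
    "resultList": {
        "result": [
            {"title": "Example preprint", "pubTypeList": {"pubType": ["Preprint"]}},
            {"title": "Example article", "pubTypeList": {"pubType": ["Journal Article"]}},
            {"title": "Example without type info"},
        ]
    }
}


def is_preprint(result: dict[str, Any]) -> bool:
    """Return True if the result lists a 'Preprint' publication type."""
    pub_types = result.get("pubTypeList", {}).get("pubType", [])
    # Assumption: normalise a bare string to a one-element list.
    if isinstance(pub_types, str):
        pub_types = [pub_types]
    return any(p.lower() == "preprint" for p in pub_types)


results = SAMPLE_RESPONSE.get("resultList", {}).get("result", [])
flags = [is_preprint(r) for r in results]
print(flags)
```

Records with no `pubTypeList` at all fall through to `False`, so they are treated as regular papers rather than raising a `KeyError`.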
---
## FIX 2: Add PubMed Query Preprocessing (1 hour)
### Current Problem
User enters: `What medications show promise for Long COVID?`
PubMed receives: `What medications show promise for Long COVID?`
The question words pollute the search.
### The Fix
Add `src/tools/query_utils.py`:
```python
"""Query preprocessing utilities."""

import re

# Question words and low-signal filler verbs to remove
QUESTION_WORDS = {
    "what", "which", "how", "why", "when", "where", "who",
    "is", "are", "can", "could", "would", "should", "do", "does",
    "show", "promise", "help", "treat", "cure",
}

# Medical synonyms to expand (keys must be lowercase)
SYNONYMS = {
    "long covid": ["long COVID", "PASC", "post-COVID syndrome", "post-acute sequelae"],
    "alzheimer": ["Alzheimer's disease", "AD", "Alzheimer dementia"],
    "cancer": ["neoplasm", "tumor", "malignancy", "carcinoma"],
}


def preprocess_pubmed_query(raw_query: str) -> str:
    """Convert natural language to a cleaner PubMed query."""
    # Lowercase so matching against QUESTION_WORDS and SYNONYMS is case-insensitive
    query = raw_query.lower()
    # Remove question marks
    query = query.replace("?", "")
    # Remove question words
    words = [w for w in query.split() if w not in QUESTION_WORDS]
    query = " ".join(words)
    # Expand synonyms into a parenthesised OR clause
    for term, expansions in SYNONYMS.items():
        expansion = " OR ".join(f'"{e}"' for e in expansions)
        # Match whole words only, so "cancer" does not fire inside "cancerous"
        pattern = rf"\b{re.escape(term)}\b"
        # A function replacement keeps the expansion text literal
        query = re.sub(pattern, lambda _m, exp=expansion: f"({exp})", query)
    return query.strip()
```
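Two edge cases are worth guarding when a cleaner like this is called: punctuation attached to a word (`What, exactly, helps?`) and queries made up entirely of removable words, which would otherwise send an empty term to PubMed. Below is a minimal sketch of one way to handle both. The helper `clean_or_fallback` is hypothetical, and `STOP_WORDS` is a shortened illustrative set, not the full list above.

```python
import re

# Shortened, illustrative stop-word set; not the full list from query_utils.py.
STOP_WORDS = {"what", "which", "how", "is", "are", "does", "show", "help"}


def clean_or_fallback(raw_query: str) -> str:
    """Drop stop words; fall back to the raw query if nothing is left."""
    # Tokenise on letters, digits, apostrophes and hyphens so "what," becomes "what".
    tokens = re.findall(r"[a-z0-9'-]+", raw_query.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # An empty result would be a useless search term, so keep the original text.
    return " ".join(kept) if kept else raw_query.strip()


print(clean_or_fallback("What, exactly, helps long covid?"))
print(clean_or_fallback("What is help?"))
```

Falling back to the raw query is a design choice: a noisy search is usually preferable to an API error or an empty result set.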
Then update `src/tools/pubmed.py`:
```python
from src.tools.query_utils import preprocess_pubmed_query


async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
    # Preprocess query
    clean_query = preprocess_pubmed_query(query)
    search_params = self._build_params(
        db="pubmed",
        term=clean_query,  # Use cleaned query
        retmax=max_results,
        sort="relevance",
    )
    # ... rest unchanged
```
---
## FIX 3: Add ClinicalTrials.gov Filters (30 min)
### Current Problem
The tool currently returns ALL trials, including withdrawn and terminated trials and observational studies.
### The Fix
The API supports `filter.overallStatus` and other filters. Update `src/tools/clinicaltrials.py`:
```python
async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
    params: dict[str, str | int] = {
        "query.term": query,
        "pageSize": min(max_results, 100),
        "fields": "|".join(self.FIELDS),
        # ADD THESE FILTERS:
        "filter.overallStatus": "COMPLETED|RECRUITING|ACTIVE_NOT_RECRUITING",
        # Only interventional studies (not observational)
        "aggFilters": "studyType:int",
    }
    # ... rest unchanged
```
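**Note:** I tested the API and it does support filtering, but the exact parameter syntax may differ slightly from the snippet above. Verify the parameter names and value separators against the [API docs](https://clinicaltrials.gov/data-api/api) before merging.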
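Until the query syntax is confirmed, a client-side check can act as a safety net. The sketch below assumes the v2 response shape in which each study carries `protocolSection.statusModule.overallStatus` and `protocolSection.designModule.studyType`. The helper `keep_study` and the sample studies are hypothetical. If the request restricts `fields`, those two fields must be included for the check to work.

```python
from typing import Any

ALLOWED_STATUSES = {"COMPLETED", "RECRUITING", "ACTIVE_NOT_RECRUITING"}


def keep_study(study: dict[str, Any]) -> bool:
    """Keep only interventional studies with an allowed overall status."""
    protocol = study.get("protocolSection", {})
    status = protocol.get("statusModule", {}).get("overallStatus")
    study_type = protocol.get("designModule", {}).get("studyType")
    return status in ALLOWED_STATUSES and study_type == "INTERVENTIONAL"


def make_study(status: str, study_type: str) -> dict[str, Any]:
    """Build a hand-written sample study (not real API output)."""
    return {
        "protocolSection": {
            "statusModule": {"overallStatus": status},
            "designModule": {"studyType": study_type},
        }
    }


sample_studies = [
    make_study("RECRUITING", "INTERVENTIONAL"),
    make_study("WITHDRAWN", "INTERVENTIONAL"),
    make_study("COMPLETED", "OBSERVATIONAL"),
]
kept = [s for s in sample_studies if keep_study(s)]
print(len(kept))
```

Studies missing either field are dropped rather than kept, which errs on the side of excluding unverified records.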
---
## What NOT to Change
### Microsoft Agent Framework - WORKING
I verified:
```python
from agent_framework import MagenticBuilder, ChatAgent
from agent_framework.openai import OpenAIChatClient
# All imports OK
orchestrator = MagenticOrchestrator(max_rounds=2)
workflow = orchestrator._build_workflow()
# Workflow built successfully
```
The Magentic agents are correctly wired:
- SearchAgent → GPT-5.1 ✅
- JudgeAgent → GPT-5.1 ✅
- HypothesisAgent → GPT-5.1 ✅
- ReportAgent → GPT-5.1 ✅
**The framework is fine. The tools it calls are broken.**
---
## Priority Order
1. **Replace BioRxiv** → Immediate, fundamental
2. **Add PubMed preprocessing** → High impact, easy
3. **Add ClinicalTrials filters** → Medium impact, easy
---
## Test After Fixes
```bash
# Test Europe PMC
uv run python -c "
import asyncio
from src.tools.europepmc import EuropePMCTool
tool = EuropePMCTool()
results = asyncio.run(tool.search('long covid treatment', 3))
for r in results:
    print(r.citation.title)
"
# Test PubMed with preprocessing
uv run python -c "
from src.tools.query_utils import preprocess_pubmed_query
q = 'What medications show promise for Long COVID?'
print(preprocess_pubmed_query(q))
# Should output: medications for (\"long COVID\" OR \"PASC\" OR \"post-COVID syndrome\" OR \"post-acute sequelae\")
"
```
---
## After These Fixes
The Magentic workflow will:
1. SearchAgent calls `search_pubmed("long COVID treatment")` → Gets RELEVANT papers
2. SearchAgent calls `search_preprints("long COVID treatment")` → Gets RELEVANT preprints via Europe PMC
3. SearchAgent calls `search_clinical_trials("long COVID")` → Gets INTERVENTIONAL trials only
4. JudgeAgent evaluates GOOD evidence
5. HypothesisAgent generates hypotheses from GOOD evidence
6. ReportAgent synthesizes GOOD report
**The framework will work once we feed it good data.**