	# P0 Actionable Fixes - What to Do

	Date: November 27, 2025
	Status: ACTIONABLE

	---

	## Summary: What's Broken and What's Fixable

	| Tool | Problem | Fixable? | How |
	|------|---------|----------|-----|
	| BioRxiv | API has NO search endpoint | NO | Replace with Europe PMC |
	| PubMed | No query preprocessing | YES | Add query cleaner |
	| ClinicalTrials | No filters applied | YES | Add filter params |
	| Magentic Framework | Nothing wrong | N/A | Already working |

	---

	## FIX 1: Replace BioRxiv with Europe PMC (30 min)

	### Why BioRxiv Can't Be Fixed

	The bioRxiv API only has this endpoint:
	```
	https://api.biorxiv.org/details/{server}/{date-range}/{cursor}/json
	```

	This returns papers by date, not by keyword. There is NO search endpoint.

	Proof: I queried `medrxiv/2024-01-01/2024-01-02` and got:
	- "Global risk of Plasmodium falciparum" (malaria)
	- "Multiple Endocrine Neoplasia in India"
	- "Acupuncture for Acute Musculoskeletal Pain"

	None of these are about Long COVID because the API doesn't search.

	### Europe PMC Has Search + Preprints

	```bash
	curl "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=long+covid+treatment&resultType=core&pageSize=3&format=json"
	```

	Returns 283,058 results including:
	- "Long COVID Treatment No Silver Bullets, Only a Few Bronze BBs" ✅

	### The Fix

	Replace `src/tools/biorxiv.py` with `src/tools/europepmc.py`:

	```python
	"""Europe PMC preprint and paper search tool."""

	import httpx

	from src.utils.models import Citation, Evidence


	class EuropePMCTool:
	    """Search Europe PMC for papers and preprints."""

	    BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

	    @property
	    def name(self) -> str:
	        return "europepmc"

	    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
	        """Search Europe PMC (includes preprints from bioRxiv/medRxiv)."""
	        params = {
	            "query": query,
	            "resultType": "core",
	            "pageSize": max_results,
	            "format": "json",
	        }

	        async with httpx.AsyncClient(timeout=30.0) as client:
	            response = await client.get(self.BASE_URL, params=params)
	            response.raise_for_status()

	        data = response.json()
	        results = data.get("resultList", {}).get("result", [])

	        return [self._to_evidence(r) for r in results]

	    def _to_evidence(self, result: dict) -> Evidence:
	        """Convert Europe PMC result to Evidence."""
	        title = result.get("title", "Untitled")
	        abstract = result.get("abstractText", "No abstract")
	        doi = result.get("doi", "")
	        pub_year = result.get("pubYear", "Unknown")

	        # Mark preprints (compare case-insensitively in case the casing varies)
	        pub_types = result.get("pubTypeList", {}).get("pubType", [])
	        if isinstance(pub_types, str):
	            pub_types = [pub_types]
	        is_preprint = any(str(t).lower() == "preprint" for t in pub_types)

	        # Prefix preprint abstracts and cap the length at 1800 characters
	        content = f"{'[PREPRINT] ' if is_preprint else ''}{abstract[:1800]}"

	        return Evidence(
	            content=content,
	            citation=Citation(
	                source="preprint" if is_preprint else "europepmc",
	                title=title[:500],
	                url=f"https://doi.org/{doi}" if doi else "",
	                date=str(pub_year),
	            ),
	            # Peer-reviewed papers rank above preprints
	            relevance=0.75 if is_preprint else 0.9,
	        )
	```
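Because the workflow exposes this tool as a preprint search, it may be worth restricting the Europe PMC query to preprint records. The following is a minimal sketch, assuming Europe PMC's `SRC:PPR` source filter selects preprints (confirm this against the Europe PMC search syntax reference before relying on it). The helper name `preprint_only_query` is hypothetical and not part of the codebase.

```python
def preprint_only_query(query: str) -> str:
    """Wrap a user query so that only preprint records are requested.

    Assumption: SRC:PPR is the Europe PMC source code for preprints.
    """
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must not be empty")
    # Parenthesize so the AND applies to the whole user query
    return f"({cleaned}) AND SRC:PPR"


print(preprint_only_query("long covid treatment"))
```

The returned string would be passed as the `query` parameter in place of the raw query; without such a filter the tool returns published papers and preprints mixed together.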

	---

	## FIX 2: Add PubMed Query Preprocessing (1 hour)

	### Current Problem

	User enters: `What medications show promise for Long COVID?`
	PubMed receives: `What medications show promise for Long COVID?`

	The question words pollute the search.

	### The Fix

	Add `src/tools/query_utils.py`:

	```python
	"""Query preprocessing utilities."""

	import re

	# Question and filler words to remove
	QUESTION_WORDS = {
	    "what", "which", "how", "why", "when", "where", "who",
	    "is", "are", "can", "could", "would", "should", "do", "does",
	    "show", "promise", "help", "treat", "cure",
	}

	# Medical synonyms to expand
	SYNONYMS = {
	    "long covid": ["long COVID", "PASC", "post-COVID syndrome", "post-acute sequelae"],
	    "alzheimer": ["Alzheimer's disease", "AD", "Alzheimer dementia"],
	    "cancer": ["neoplasm", "tumor", "malignancy", "carcinoma"],
	}


	def preprocess_pubmed_query(raw_query: str) -> str:
	    """Convert natural language to cleaner PubMed query."""
	    # Lowercase
	    query = raw_query.lower()

	    # Remove question marks
	    query = query.replace("?", "")

	    # Remove question words
	    words = query.split()
	    words = [w for w in words if w not in QUESTION_WORDS]
	    query = " ".join(words)

	    # Expand synonyms (whole-word match, so "cancer" does not hit "cancerous")
	    for term, expansions in SYNONYMS.items():
	        pattern = rf"\b{re.escape(term)}\b"
	        if re.search(pattern, query):
	            # Add OR clause
	            expansion = " OR ".join(f'"{e}"' for e in expansions)
	            # A callable replacement inserts the text literally
	            query = re.sub(pattern, lambda _m, exp=expansion: f"({exp})", query)

	    return query.strip()
	```
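To make the stages concrete, here is a small self-contained trace of the example query through the same three steps (lowercase and strip `?`, drop question words, expand synonyms). It is a simplified sketch: it copies the word list from the snippet above, hard-codes only the long COVID expansion, and the function name `trace` is hypothetical.

```python
QUESTION_WORDS = {
    "what", "which", "how", "why", "when", "where", "who",
    "is", "are", "can", "could", "would", "should", "do", "does",
    "show", "promise", "help", "treat", "cure",
}
LONG_COVID = ["long COVID", "PASC", "post-COVID syndrome", "post-acute sequelae"]


def trace(raw_query: str) -> list[str]:
    """Return the query as it looks after each preprocessing stage."""
    stages = []

    # Stage 1: lowercase and remove question marks
    query = raw_query.lower().replace("?", "")
    stages.append(query)

    # Stage 2: drop question and filler words
    query = " ".join(w for w in query.split() if w not in QUESTION_WORDS)
    stages.append(query)

    # Stage 3: expand the long COVID synonym into an OR clause
    expansion = " OR ".join(f'"{e}"' for e in LONG_COVID)
    query = query.replace("long covid", f"({expansion})")
    stages.append(query)

    return stages


for stage in trace("What medications show promise for Long COVID?"):
    print(stage)
```

Note that `for` survives the filter because it is not in the word list. If that turns out to hurt results, common stopwords could be added to `QUESTION_WORDS`.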

	Then update `src/tools/pubmed.py`:

	```python
	from src.tools.query_utils import preprocess_pubmed_query

	# Inside the existing PubMed tool class:
	async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
	    # Preprocess query
	    clean_query = preprocess_pubmed_query(query)

	    search_params = self._build_params(
	        db="pubmed",
	        term=clean_query,  # Use cleaned query
	        retmax=max_results,
	        sort="relevance",
	    )
	    # ... rest unchanged
	```
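One edge case to consider: a short question such as `What can help?` consists entirely of filtered words, so the cleaned query comes back empty. A minimal guard, sketched below with a hypothetical helper name `choose_search_term`, falls back to the raw query rather than sending an empty `term` to PubMed.

```python
def choose_search_term(raw_query: str, cleaned_query: str) -> str:
    """Prefer the cleaned query, but never return an empty search term."""
    cleaned = cleaned_query.strip()
    # An empty string is falsy, so fall back to the user's original text
    return cleaned or raw_query.strip()


# Every word of "What can help?" is filtered, leaving an empty cleaned query
print(choose_search_term("What can help?", ""))
print(choose_search_term("Long COVID drugs?", "long covid drugs"))
```

In the `search` method this would wrap the call, for example `term=choose_search_term(query, clean_query)`.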

	---

	## FIX 3: Add ClinicalTrials.gov Filters (30 min)

	### Current Problem

	The tool currently returns ALL trials, including withdrawn, terminated, and observational studies.

	### The Fix

	The API supports `filter.overallStatus` and other filters. Update `src/tools/clinicaltrials.py`:

	```python
	async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
	    params: dict[str, str | int] = {
	        "query.term": query,
	        "pageSize": min(max_results, 100),
	        "fields": "|".join(self.FIELDS),
	        # ADD THESE FILTERS:
	        "filter.overallStatus": "COMPLETED|RECRUITING|ACTIVE_NOT_RECRUITING",
	        # Only interventional studies (not observational)
	        "aggFilters": "studyType:int",
	    }
	    # ... rest unchanged
	```

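	Note: I tested the API and it does support filtering, but the exact parameter syntax may differ slightly from the snippet above. Confirm the parameter names and values against the [API docs](https://clinicaltrials.gov/data-api/api) before merging.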
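To check the filter syntax by hand, it can help to print the full request URL and paste it into a browser or `curl`. The sketch below builds that URL with the standard library only. The base URL `https://clinicaltrials.gov/api/v2/studies` and the helper name `build_trials_url` are assumptions, and the `fields` parameter is left out because `self.FIELDS` is not shown in this document.

```python
from urllib.parse import urlencode

# Assumed v2 endpoint; verify against the API docs
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"

DEFAULT_STATUSES = ("COMPLETED", "RECRUITING", "ACTIVE_NOT_RECRUITING")


def build_trials_url(
    query: str,
    max_results: int = 10,
    statuses: tuple[str, ...] = DEFAULT_STATUSES,
) -> str:
    """Build a filtered ClinicalTrials.gov request URL for manual inspection."""
    params = {
        "query.term": query,
        "pageSize": min(max_results, 100),  # cap the page size at 100
        "filter.overallStatus": "|".join(statuses),
        "aggFilters": "studyType:int",  # interventional studies only
    }
    # urlencode percent-encodes the pipe and colon characters
    return f"{BASE_URL}?{urlencode(params)}"


print(build_trials_url("long covid"))
```

If the API rejects a value, the error response for the printed URL usually makes it clear which parameter needs a different spelling.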

	---

	## What NOT to Change

	### Microsoft Agent Framework - WORKING

	I verified:
	```python
	from agent_framework import MagenticBuilder, ChatAgent
	from agent_framework.openai import OpenAIChatClient
	# All imports OK

	# Note: the import for MagenticOrchestrator is not shown in this snippet
	orchestrator = MagenticOrchestrator(max_rounds=2)
	workflow = orchestrator._build_workflow()
	# Workflow built successfully
	```

	The Magentic agents are correctly wired:
	- SearchAgent → GPT-5.1 ✅
	- JudgeAgent → GPT-5.1 ✅
	- HypothesisAgent → GPT-5.1 ✅
	- ReportAgent → GPT-5.1 ✅

	The framework is fine. The tools it calls are broken.

	---

	## Priority Order

	1. Replace BioRxiv → Immediate, fundamental
	2. Add PubMed preprocessing → High impact, easy
	3. Add ClinicalTrials filters → Medium impact, easy

	---

	## Test After Fixes

	```bash
	# Test Europe PMC
	uv run python -c "
	import asyncio
	from src.tools.europepmc import EuropePMCTool
	tool = EuropePMCTool()
	results = asyncio.run(tool.search('long covid treatment', 3))
	for r in results:
	    print(r.citation.title)
	"

	# Test PubMed with preprocessing
	uv run python -c "
	from src.tools.query_utils import preprocess_pubmed_query
	q = 'What medications show promise for Long COVID?'
	print(preprocess_pubmed_query(q))
	# Should output: medications for (\"long COVID\" OR \"PASC\" OR \"post-COVID syndrome\" OR \"post-acute sequelae\")
	"
	```
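The commands above hit live APIs, so their results change over time. An offline check of the preprint-detection logic could complement them. The sketch below is self-contained: `is_preprint` is a hypothetical stand-alone helper that mirrors the `pubTypeList` lookup in the Europe PMC tool, and the sample records are hand-written fixtures, not real API responses.

```python
def is_preprint(result: dict) -> bool:
    """Return True if a Europe PMC style record is tagged as a preprint."""
    pub_types = result.get("pubTypeList", {}).get("pubType", [])
    # Tolerate a single string where a list is expected
    if isinstance(pub_types, str):
        pub_types = [pub_types]
    return any(str(t).lower() == "preprint" for t in pub_types)


# Hand-written fixtures (not real API output)
sample_preprint = {"title": "Example A", "pubTypeList": {"pubType": ["Preprint"]}}
sample_article = {"title": "Example B", "pubTypeList": {"pubType": ["Journal Article"]}}
sample_missing = {"title": "Example C"}

for record in (sample_preprint, sample_article, sample_missing):
    print(record["title"], is_preprint(record))
```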

	---

	## After These Fixes

	The Magentic workflow will:
	1. SearchAgent calls `search_pubmed("long COVID treatment")` → Gets RELEVANT papers
	2. SearchAgent calls `search_preprints("long COVID treatment")` → Gets RELEVANT preprints via Europe PMC
	3. SearchAgent calls `search_clinical_trials("long COVID")` → Gets INTERVENTIONAL trials only
	4. JudgeAgent evaluates GOOD evidence
	5. HypothesisAgent generates hypotheses from GOOD evidence
	6. ReportAgent synthesizes GOOD report

	The framework will work once we feed it good data.