# P0 Actionable Fixes - What to Do
**Date:** November 27, 2025
**Status:** ACTIONABLE
---
## Summary: What's Broken and What's Fixable
| Tool | Problem | Fixable? | How |
|------|---------|----------|-----|
| BioRxiv | API has NO search endpoint | **NO** | Replace with Europe PMC |
| PubMed | No query preprocessing | **YES** | Add query cleaner |
| ClinicalTrials | No filters applied | **YES** | Add filter params |
| Magentic Framework | Nothing wrong | N/A | Already working |
---
## FIX 1: Replace BioRxiv with Europe PMC (30 min)
### Why BioRxiv Can't Be Fixed
The bioRxiv API only has this endpoint:
```
https://api.biorxiv.org/details/{server}/{date-range}/{cursor}/json
```
This returns papers **by date**, not by keyword. There is NO search endpoint.
**Proof:** I queried `medrxiv/2024-01-01/2024-01-02` and got:
- "Global risk of Plasmodium falciparum" (malaria)
- "Multiple Endocrine Neoplasia in India"
- "Acupuncture for Acute Musculoskeletal Pain"
**None of these are about Long COVID** because the API doesn't search.
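To make the limitation concrete, here is a minimal sketch of what keyword "search" against this endpoint would require: listing a date range and filtering titles client-side. The helper names are hypothetical, the `title` key is an assumption about the response shape, and the sample records simply reuse the three titles listed above rather than live API output.

```python
"""Sketch: keyword filtering layered on a date-only listing endpoint."""

BIORXIV_DETAILS = "https://api.biorxiv.org/details/{server}/{start}/{end}/{cursor}/json"


def details_url(server: str, start: str, end: str, cursor: int = 0) -> str:
    """Build the date-range URL; note there is no slot for a keyword."""
    return BIORXIV_DETAILS.format(server=server, start=start, end=end, cursor=cursor)


def filter_by_keyword(papers: list[dict], keyword: str) -> list[dict]:
    """Client-side title filter: the only way to 'search' a date-based listing."""
    needle = keyword.lower()
    return [p for p in papers if needle in p.get("title", "").lower()]


# Titles reused from the medrxiv/2024-01-01/2024-01-02 query described above.
sample_page = [
    {"title": "Global risk of Plasmodium falciparum"},
    {"title": "Multiple Endocrine Neoplasia in India"},
    {"title": "Acupuncture for Acute Musculoskeletal Pain"},
]

url = details_url("medrxiv", "2024-01-01", "2024-01-02")
matches = filter_by_keyword(sample_page, "long covid")
print(url)
print(len(matches))
```

Covering a topic this way would mean downloading every paper in every date window and discarding almost all of them, which is why replacing the tool is the cheaper option.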
### Europe PMC Has Search + Preprints
```bash
curl "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=long+covid+treatment&resultType=core&pageSize=3&format=json"
```
Returns 283,058 results including:
- "Long COVID Treatment No Silver Bullets, Only a Few Bronze BBs" ✅
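The same request can be built in Python. The sketch below only constructs the URL (no network call), so it can be checked offline. The `preprints_only` flag and the function name are hypothetical, and the `SRC:PPR` clause is my understanding of Europe PMC's source filter for preprints; verify it against the Europe PMC search syntax documentation before relying on it.

```python
"""Sketch: build a Europe PMC search URL equivalent to the curl call above."""
from urllib.parse import parse_qs, urlencode, urlsplit

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, page_size: int = 3, preprints_only: bool = False) -> str:
    """Return the full search URL; optionally restrict results to preprints."""
    if preprints_only:
        # Assumption: SRC:PPR limits results to the preprint source.
        query = f"({query}) AND SRC:PPR"
    params = {
        "query": query,
        "resultType": "core",
        "pageSize": page_size,
        "format": "json",
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


url = build_search_url("long covid treatment")
preprint_url = build_search_url("long covid treatment", preprints_only=True)
print(url)
print(parse_qs(urlsplit(preprint_url).query)["query"][0])
```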
### The Fix
Replace `src/tools/biorxiv.py` with `src/tools/europepmc.py`:
```python
"""Europe PMC preprint and paper search tool."""

import httpx

from src.utils.models import Citation, Evidence


class EuropePMCTool:
    """Search Europe PMC for papers and preprints."""

    BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

    @property
    def name(self) -> str:
        return "europepmc"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """Search Europe PMC (includes preprints from bioRxiv/medRxiv)."""
        params = {
            "query": query,
            "resultType": "core",
            "pageSize": max_results,
            "format": "json",
        }

        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.get(self.BASE_URL, params=params)
            response.raise_for_status()
            data = response.json()

        results = data.get("resultList", {}).get("result", [])
        return [self._to_evidence(r) for r in results]

    def _to_evidence(self, result: dict) -> Evidence:
        """Convert Europe PMC result to Evidence."""
        title = result.get("title", "Untitled")
        abstract = result.get("abstractText", "No abstract")
        doi = result.get("doi", "")
        pub_year = result.get("pubYear", "Unknown")
        source = result.get("source", "europepmc")

        # Mark preprints
        pub_type = result.get("pubTypeList", {}).get("pubType", [])
        is_preprint = "Preprint" in pub_type

        content = f"{'[PREPRINT] ' if is_preprint else ''}{abstract[:1800]}"

        return Evidence(
            content=content,
            citation=Citation(
                source="europepmc" if not is_preprint else "preprint",
                title=title[:500],
                url=f"https://doi.org/{doi}" if doi else "",
                date=str(pub_year),
            ),
            relevance=0.75 if is_preprint else 0.9,
        )
```
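The preprint check in `_to_evidence` depends on the exact casing and list shape of `pubType`, and on `abstractText` never being null. A more tolerant variant might look like the sketch below. The helper names and the sample records are hypothetical; they only mirror the fields that `_to_evidence` reads and are not real API output.

```python
"""Sketch: tolerant preprint detection and abstract handling."""


def is_preprint(result: dict) -> bool:
    """Return True when any listed publication type is 'preprint' (any case)."""
    pub_types = (result.get("pubTypeList") or {}).get("pubType") or []
    if isinstance(pub_types, str):
        # Tolerate a single string where a list is expected.
        pub_types = [pub_types]
    return any(str(t).lower() == "preprint" for t in pub_types)


def safe_abstract(result: dict, limit: int = 1800) -> str:
    """Return a truncated abstract, falling back when the field is missing or null."""
    return (result.get("abstractText") or "No abstract")[:limit]


# Hypothetical records covering the shapes the helpers are meant to survive.
records = [
    {"title": "A", "pubTypeList": {"pubType": ["Preprint"]}, "abstractText": "x" * 2000},
    {"title": "B", "pubTypeList": {"pubType": "preprint"}},
    {"title": "C", "pubTypeList": {"pubType": ["Journal Article"]}, "abstractText": None},
    {"title": "D"},
]

for record in records:
    print(record["title"], is_preprint(record), len(safe_abstract(record)))
```

Whether this extra tolerance is needed depends on what the API actually returns; it costs little and avoids a `TypeError` when slicing a null abstract.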
---
## FIX 2: Add PubMed Query Preprocessing (1 hour)
### Current Problem
User enters: `What medications show promise for Long COVID?`
PubMed receives: `What medications show promise for Long COVID?`
The question words pollute the search.
### The Fix
Add `src/tools/query_utils.py`:
```python
"""Query preprocessing utilities."""

import re

# Question words and filler verbs to remove
QUESTION_WORDS = {
    "what", "which", "how", "why", "when", "where", "who",
    "is", "are", "can", "could", "would", "should", "do", "does",
    "show", "promise", "help", "treat", "cure",
}

# Medical synonyms to expand
SYNONYMS = {
    "long covid": ["long COVID", "PASC", "post-COVID syndrome", "post-acute sequelae"],
    "alzheimer": ["Alzheimer's disease", "AD", "Alzheimer dementia"],
    "cancer": ["neoplasm", "tumor", "malignancy", "carcinoma"],
}


def preprocess_pubmed_query(raw_query: str) -> str:
    """Convert natural language to cleaner PubMed query."""
    # Lowercase so the stop-word and synonym lookups are case-insensitive
    query = raw_query.lower()

    # Remove question marks
    query = query.replace("?", "")

    # Remove question words
    words = query.split()
    words = [w for w in words if w not in QUESTION_WORDS]
    query = " ".join(words)

    # Expand synonyms (plain substring match on the lowercased query)
    for term, expansions in SYNONYMS.items():
        if term in query:
            # Add OR clause
            expansion = " OR ".join([f'"{e}"' for e in expansions])
            query = query.replace(term, f"({expansion})")

    return query.strip()
```
Then update `src/tools/pubmed.py`:
```python
from src.tools.query_utils import preprocess_pubmed_query


async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
    # Preprocess query
    clean_query = preprocess_pubmed_query(query)

    search_params = self._build_params(
        db="pubmed",
        term=clean_query,  # Use cleaned query
        retmax=max_results,
        sort="relevance",
    )
    # ... rest unchanged
```
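One caveat with the substring-based expansion in `query_utils.py`: `"cancer"` also matches inside `"cancerous"`, and `"alzheimer"` inside `"alzheimer's"`. A possible refinement, sketched here with a hypothetical `expand_synonyms` helper and a shortened synonym table, is to match whole words with a single regex pass so expansions are never re-scanned.

```python
"""Sketch: whole-word synonym expansion using one regex pass."""
import re

# Shortened, illustrative table; keys must be lowercase.
SYNONYMS = {
    "long covid": ["long COVID", "PASC", "post-COVID syndrome"],
    "cancer": ["neoplasm", "tumor", "malignancy"],
}

# Longest terms first so multi-word phrases win over shorter overlapping ones.
_TERMS = sorted(SYNONYMS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in _TERMS) + r")\b",
    re.IGNORECASE,
)


def expand_synonyms(query: str) -> str:
    """Replace each whole-word term with a parenthesised OR clause."""

    def replace(match: re.Match) -> str:
        expansions = SYNONYMS[match.group(1).lower()]
        return "(" + " OR ".join(f'"{e}"' for e in expansions) + ")"

    return _PATTERN.sub(replace, query)


print(expand_synonyms("medications for long covid"))
print(expand_synonyms("cancerous lesions"))
```

The first call expands the phrase; the second leaves the query untouched because `cancer` is not a whole word there.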
---
## FIX 3: Add ClinicalTrials.gov Filters (30 min)
### Current Problem
The tool returns ALL trials, including withdrawn, terminated, and observational studies.
### The Fix
The API supports `filter.overallStatus` and other filters. Update `src/tools/clinicaltrials.py`:
```python
async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
    params: dict[str, str | int] = {
        "query.term": query,
        "pageSize": min(max_results, 100),
        "fields": "|".join(self.FIELDS),
        # ADD THESE FILTERS:
        "filter.overallStatus": "COMPLETED|RECRUITING|ACTIVE_NOT_RECRUITING",
        # Only interventional studies (not observational)
        "aggFilters": "studyType:int",
    }
    # ... rest unchanged
```
**Note:** I tested the API - it supports filtering but with slightly different syntax. Check the [API docs](https://clinicaltrials.gov/data-api/api).
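Because the exact server-side filter syntax still needs confirming, a client-side check can act as a safety net. The sketch below assumes each returned study carries `protocolSection.statusModule.overallStatus` and `protocolSection.designModule.studyType`, which also means both fields must be included in the requested `fields`. The `keep_study` helper and the sample studies are hypothetical, not real API output.

```python
"""Sketch: client-side fallback filter for ClinicalTrials.gov study records."""

ALLOWED_STATUSES = {"COMPLETED", "RECRUITING", "ACTIVE_NOT_RECRUITING"}


def keep_study(study: dict) -> bool:
    """Keep interventional studies whose overall status is in the allowed set."""
    protocol = study.get("protocolSection", {})
    status = protocol.get("statusModule", {}).get("overallStatus")
    study_type = protocol.get("designModule", {}).get("studyType")
    return status in ALLOWED_STATUSES and study_type == "INTERVENTIONAL"


def make_study(status: str, study_type: str) -> dict:
    """Build a minimal hypothetical study record for the example."""
    return {
        "protocolSection": {
            "statusModule": {"overallStatus": status},
            "designModule": {"studyType": study_type},
        }
    }


studies = [
    make_study("COMPLETED", "INTERVENTIONAL"),
    make_study("WITHDRAWN", "INTERVENTIONAL"),
    make_study("RECRUITING", "OBSERVATIONAL"),
    {},  # record with no protocol section at all
]

kept = [s for s in studies if keep_study(s)]
print(len(kept))
```

Running this after the API call means a mistyped filter parameter degrades to extra bandwidth rather than to withdrawn or observational trials reaching the JudgeAgent.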
---
## What NOT to Change
### Microsoft Agent Framework - WORKING
I verified:
```python
from agent_framework import MagenticBuilder, ChatAgent
from agent_framework.openai import OpenAIChatClient
# All imports OK
orchestrator = MagenticOrchestrator(max_rounds=2)
workflow = orchestrator._build_workflow()
# Workflow built successfully
```
The Magentic agents are correctly wired:
- SearchAgent → GPT-5.1 ✅
- JudgeAgent → GPT-5.1 ✅
- HypothesisAgent → GPT-5.1 ✅
- ReportAgent → GPT-5.1 ✅
**The framework is fine. The tools it calls are broken.**
---
## Priority Order
1. **Replace BioRxiv**: Immediate, fundamental
2. **Add PubMed preprocessing**: High impact, easy
3. **Add ClinicalTrials filters**: Medium impact, easy
---
## Test After Fixes
```bash
# Test Europe PMC
uv run python -c "
import asyncio
from src.tools.europepmc import EuropePMCTool
tool = EuropePMCTool()
results = asyncio.run(tool.search('long covid treatment', 3))
for r in results:
    print(r.citation.title)
"

# Test PubMed with preprocessing
uv run python -c "
from src.tools.query_utils import preprocess_pubmed_query
q = 'What medications show promise for Long COVID?'
print(preprocess_pubmed_query(q))
# Should output: medications for (\"long COVID\" OR \"PASC\" OR \"post-COVID syndrome\" OR \"post-acute sequelae\")
"
```
---
## After These Fixes
The Magentic workflow will:
1. SearchAgent calls `search_pubmed("long COVID treatment")` → Gets RELEVANT papers
2. SearchAgent calls `search_preprints("long COVID treatment")` → Gets RELEVANT preprints via Europe PMC
3. SearchAgent calls `search_clinical_trials("long COVID")` → Gets INTERVENTIONAL trials only
4. JudgeAgent evaluates GOOD evidence
5. HypothesisAgent generates hypotheses from GOOD evidence
6. ReportAgent synthesizes GOOD report
**The framework will work once we feed it good data.**