PubMed Tool: Current State & Future Improvements
Status: Currently Implemented
Priority: High (Core Data Source)
Current Implementation
What We Have (src/tools/pubmed.py)
- Basic E-utilities search via esearch.fcgi and efetch.fcgi
- Query preprocessing (strips question words, expands synonyms)
- Returns: title, abstract, authors, journal, PMID
- Rate limiting: None implemented (relying on NCBI defaults)
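As a rough illustration of the two E-utilities calls listed above, the sketch below only builds the request URLs. The helper names `build_esearch_url` and `build_efetch_url` are hypothetical and are not taken from src/tools/pubmed.py; the query parameters (`db`, `term`, `retmax`, `retmode`, `id`, `api_key`) are standard E-utilities parameters.

```python
from typing import List, Optional
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, retmax: int = 10, api_key: Optional[str] = None) -> str:
    """Build an esearch.fcgi URL that returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    if api_key:
        # An API key raises the allowed request rate from 3/sec to 10/sec.
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(pmids: List[str], api_key: Optional[str] = None) -> str:
    """Build an efetch.fcgi URL that returns full records (XML) for the given PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"
```

The PMIDs returned by the first URL would be passed to the second; the HTTP client and response parsing are left out on purpose.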
Current Limitations
- No Full-Text Access: Only retrieves abstracts, not full paper text
- No Rate Limiting: Risk of being blocked by NCBI
- No BioC Format: Missing structured full-text extraction
- No Figure Retrieval: No supplementary materials access
- No PMC Integration: Missing open-access full-text via PMC
Reference Implementation (DeepBoner Reference Repo)
The reference repo at reference_repos/DeepBoner/DeepResearch/src/tools/bioinformatics_tools.py has a more sophisticated implementation:
Features We're Missing
# Rate limiting (lines 47-50)
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter
storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)
rate_limit = parse("3/second") # NCBI allows 3/sec without API key, 10/sec with
# Full-text via BioC format (lines 108-120)
def _get_fulltext(pmid: int) -> dict[str, Any] | None:
    pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Returns structured JSON with full text for open-access papers
# Figure retrieval via Europe PMC (lines 123-149)
def _get_figures(pmcid: str) -> dict[str, str]:
    suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
    # Returns base64-encoded images from supplementary materials
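To make the figure-retrieval idea more concrete, here is a minimal sketch of the encoding step only. It assumes the supplementaryFiles endpoint responds with a ZIP archive and that figures can be recognised by file extension; both are assumptions, and the helper name `images_from_zip` is hypothetical rather than copied from the reference repo.

```python
import base64
import io
import zipfile
from typing import Dict

# Assumption: these extensions are enough to spot figure files in the archive.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".tif", ".tiff")


def images_from_zip(archive: bytes) -> Dict[str, str]:
    """Map each image file name in a ZIP archive to its base64-encoded contents."""
    figures: Dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as bundle:
        for name in bundle.namelist():
            if name.lower().endswith(IMAGE_SUFFIXES):
                figures[name] = base64.b64encode(bundle.read(name)).decode("ascii")
    return figures
```

Downloading the archive and handling papers without supplementary files would sit around this function.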
Recommended Improvements
Phase 1: Rate Limiting (Critical)
# Add to src/tools/pubmed.py
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter
storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)
# With NCBI_API_KEY: 10/sec, without: 3/sec
def get_rate_limit():
    if settings.ncbi_api_key:
        return parse("10/second")
    return parse("3/second")
Dependencies: pip install limits
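For readers who want to see what a moving-window limit does before adding the dependency, the following is a standard-library stand-in, not the `limits` API. The class `SlidingWindowLimiter` and the helper `limiter_for` are hypothetical names; only the 3/sec and 10/sec figures come from this document.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class SlidingWindowLimiter:
    """Allow at most `max_calls` hits within any `period`-second window."""

    def __init__(self, max_calls: int, period: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Deque[float] = deque()

    def hit(self) -> bool:
        """Record a request if the window has room; return False if it is full."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._hits and now - self._hits[0] >= self.period:
            self._hits.popleft()
        if len(self._hits) < self.max_calls:
            self._hits.append(now)
            return True
        return False


def limiter_for(api_key: Optional[str]) -> SlidingWindowLimiter:
    """10 requests/sec with an NCBI API key, 3 requests/sec without."""
    return SlidingWindowLimiter(10 if api_key else 3)
```

A caller would check `hit()` before each E-utilities request and wait briefly when it returns False.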
Phase 2: Full-Text Retrieval
async def get_fulltext(pmid: str) -> str | None:
    """Get full text for open-access papers via BioC API."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Only works for PMC papers (open access)
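Once the BioC JSON has been fetched, it still has to be flattened into plain text. The sketch below shows one way that could look, assuming the payload is a BioC collection (or a list wrapping collections) with `documents`, `passages`, and `text` keys. That shape is an assumption to verify against a real response, and `bioc_to_text` is a hypothetical helper name.

```python
from typing import Any, List, Optional


def bioc_to_text(payload: Any) -> Optional[str]:
    """Join the text of every passage in a BioC JSON payload; None if there is none."""
    # Assumption: the payload is one collection dict or a list of them.
    collections = payload if isinstance(payload, list) else [payload]
    chunks: List[str] = []
    for collection in collections:
        if not isinstance(collection, dict):
            continue
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                text = (passage.get("text") or "").strip()
                if text:  # skip empty or whitespace-only passages
                    chunks.append(text)
    return "\n\n".join(chunks) or None
```

Returning None for an empty result matches the `str | None` signature of `get_fulltext` above, so papers without open-access text can fall back to the abstract.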
Phase 3: PMC ID Resolution
async def get_pmc_id(pmid: str) -> str | None:
    """Convert PMID to PMCID for full-text access."""
    url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json"
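The lookup step after the request could be sketched as follows. It assumes the ID converter's JSON response carries a `records` list whose entries have `pmid` and, when the paper is in PMC, `pmcid`; treat that shape as an assumption to confirm against a live response. `extract_pmcid` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def extract_pmcid(payload: Dict[str, Any], pmid: str) -> Optional[str]:
    """Return the PMCID recorded for `pmid`, or None if the paper has no PMC entry."""
    for record in payload.get("records", []):
        # Compare as strings in case the service returns numeric IDs.
        if str(record.get("pmid")) == str(pmid) and record.get("pmcid"):
            return record["pmcid"]
    return None
```

A None result means the paper is probably not in PMC, so the full-text step can be skipped for that PMID.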
Python Libraries to Consider
| Library | Purpose | Notes |
|---|---|---|
| Biopython | Bio.Entrez module | Official, well-maintained |
| PyMed | PubMed wrapper | Simpler API, less control |
| metapub | Full-featured | Tested on 1/3 of PubMed |
| limits | Rate limiting | Used by reference repo |
API Endpoints Reference
| Endpoint | Purpose | Rate Limit |
|---|---|---|
| esearch.fcgi | Search for PMIDs | 3/sec (10 with key) |
| efetch.fcgi | Fetch metadata | 3/sec (10 with key) |
| esummary.fcgi | Quick metadata | 3/sec (10 with key) |
| pmcoa.cgi/BioC_json | Full text (PMC only) | Unknown |
| idconv/v1.0 | PMID → PMCID | Unknown |