PubMed Tool: Current State & Future Improvements

Status: Currently Implemented
Priority: High (Core Data Source)


Current Implementation

What We Have (src/tools/pubmed.py)

  • Basic E-utilities search via esearch.fcgi and efetch.fcgi
  • Query preprocessing (strips question words, expands synonyms)
  • Returns: title, abstract, authors, journal, PMID
  • Rate limiting: None implemented (relying on NCBI defaults)

Current Limitations

  1. No Full-Text Access: Only retrieves abstracts, not full paper text
  2. No Rate Limiting: Risk of being blocked by NCBI
  3. No BioC Format: Missing structured full-text extraction
  4. No Figure Retrieval: No supplementary materials access
  5. No PMC Integration: Missing open-access full-text via PMC

Reference Implementation (DeepBoner Reference Repo)

The reference repo at reference_repos/DeepBoner/DeepResearch/src/tools/bioinformatics_tools.py has a more sophisticated implementation:

Features We're Missing

# Rate limiting (lines 47-50)
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)
rate_limit = parse("3/second")  # NCBI allows 3/sec without API key, 10/sec with

# Full-text via BioC format (lines 108-120)
from typing import Any  # needed for the return annotation below

def _get_fulltext(pmid: int) -> dict[str, Any] | None:
    pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Returns structured JSON with full text for open-access papers
    # (rest of the function omitted in this excerpt)

# Figure retrieval via Europe PMC (lines 123-149)
def _get_figures(pmcid: str) -> dict[str, str]:
    suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
    # Returns base64-encoded images from supplementary materials
    # (rest of the function omitted in this excerpt)

Recommended Improvements

Phase 1: Rate Limiting (Critical)

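# Add to src/tools/pubmed.py
from limits import RateLimitItem, parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

# In-memory storage is per-process; counters are not shared across workers.
storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)

# `settings` is assumed to be the project's configuration object, imported elsewhere.
def get_rate_limit() -> RateLimitItem:
    """Return NCBI's limit: 10 requests/sec with NCBI_API_KEY, 3/sec without."""
    if settings.ncbi_api_key:
        return parse("10/second")
    return parse("3/second")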

Dependencies: pip install limits
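The snippet above creates the limiter but does not show how a request would wait for capacity. One possible shape is sketched below. The helper name `wait_for_slot`, the `key` label, and the polling interval are hypothetical. The only thing it relies on from `limits` is that a strategy's `hit(item, *identifiers)` returns `True` when the request fits in the window; the limiter is passed in as a parameter, so the sketch itself imports nothing outside the standard library.

```python
import asyncio
from typing import Any


async def wait_for_slot(
    limiter: Any,
    rate_limit: Any,
    key: str = "ncbi",
    poll_interval: float = 0.05,
) -> int:
    """Sleep until `limiter.hit(rate_limit, key)` succeeds.

    Returns how many times the call had to sleep, which can be logged.
    """
    waits = 0
    # hit() counts the request and returns True only while the window has capacity.
    while not limiter.hit(rate_limit, key):
        waits += 1
        await asyncio.sleep(poll_interval)
    return waits
```

With the objects defined above, a caller would run `await wait_for_slot(limiter, get_rate_limit())` immediately before each esearch or efetch request. Polling is the simplest option; a production version might compute the exact wait from the limiter's window statistics instead.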

Phase 2: Full-Text Retrieval

async def get_fulltext(pmid: str) -> str | None:
    """Get full text for open-access papers via BioC API."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Only works for PMC papers (open access)

Phase 3: PMC ID Resolution

async def get_pmc_id(pmid: str) -> str | None:
    """Convert PMID to PMCID for full-text access."""
    url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json"

Python Libraries to Consider

| Library | Purpose | Notes |
|---------|---------|-------|
| Biopython | Bio.Entrez module | Official, well-maintained |
| PyMed | PubMed wrapper | Simpler API, less control |
| metapub | Full-featured | Tested on 1/3 of PubMed |
| limits | Rate limiting | Used by reference repo |

API Endpoints Reference

| Endpoint | Purpose | Rate Limit |
|----------|---------|------------|
| esearch.fcgi | Search for PMIDs | 3/sec (10 with key) |
| efetch.fcgi | Fetch metadata | 3/sec (10 with key) |
| esummary.fcgi | Quick metadata | 3/sec (10 with key) |
| pmcoa.cgi/BioC_json | Full text (PMC only) | Unknown |
| idconv/v1.0 | PMID ↔ PMCID | Unknown |

Sources