DeepBoner / docs /brainstorming /03_EUROPEPMC_IMPROVEMENTS.md
VibecoderMcSwaggins's picture
rebrand: DeepCritical β†’ DeepBoner (sexual health research agent)
5d12635

A newer version of the Gradio SDK is available: 6.1.0

Upgrade

Europe PMC Tool: Current State & Future Improvements

Status: Currently Implemented (Replaced bioRxiv) Priority: High (Preprint + Open Access Source)


Why Europe PMC Over bioRxiv?

bioRxiv API Limitations (Why We Abandoned It)

  1. No Search API: Only returns papers by date range or DOI
  2. No Query Capability: Cannot search for "metformin cancer"
  3. Workaround Required: Would need to download ALL preprints and build local search
  4. Known Issue: Gradio Issue #8861 documents the limitation

Europe PMC Advantages

  1. Full Search API: Boolean queries, filters, facets
  2. Aggregates bioRxiv: Includes bioRxiv, medRxiv content anyway
  3. Includes PubMed: Also has MEDLINE content
  4. 34 Preprint Servers: Not just bioRxiv
  5. Open Access Focus: Full-text when available

Current Implementation

What We Have (src/tools/europepmc.py)

  • REST API search via europepmc.org/webservices/rest/search
  • Preprint flagging via firstPublicationDate heuristics
  • Returns: title, abstract, authors, DOI, source
  • Marks preprints for transparency

Current Limitations

  1. No Full-Text Retrieval: Only metadata/abstracts
  2. No Citation Network: Missing references/citations
  3. No Supplementary Files: Not fetching figures/data
  4. Basic Preprint Detection: Heuristic, not explicit flag

Europe PMC API Capabilities

Endpoints We Could Use

Endpoint Purpose Currently Using
/search Query papers Yes
/fulltext/{ID} Full text (XML/JSON) No
/{PMCID}/supplementaryFiles Figures, data No
/citations/{ID} Who cited this No
/references/{ID} What this cites No
/annotations Text-mined entities No

Rich Query Syntax

# Current simple query
query = "metformin cancer"

# Could use advanced syntax
query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
query += " AND (SRC:PPR)"  # Only preprints
query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])"  # Date range
query += " AND (OPEN_ACCESS:y)"  # Only open access

Source Filters

# Filter by source
"SRC:MED"     # MEDLINE
"SRC:PMC"     # PubMed Central
"SRC:PPR"     # Preprints (bioRxiv, medRxiv, etc.)
"SRC:AGR"     # Agricola
"SRC:CBA"     # Chinese Biological Abstracts

Recommended Improvements

Phase 1: Rich Metadata

# Add to search results
additional_fields = [
    "citedByCount",           # Impact indicator
    "source",                 # Explicit source (MED, PMC, PPR)
    "isOpenAccess",           # Boolean flag
    "fullTextUrlList",        # URLs for full text
    "authorAffiliations",     # Institution info
    "grantsList",             # Funding info
]

Phase 2: Full-Text Retrieval

async def get_fulltext(pmcid: str) -> str | None:
    """Get full text for open access papers."""
    # XML format
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
    # Or JSON
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextJSON"

Phase 3: Citation Network

async def get_citations(pmcid: str) -> list[str]:
    """Get papers that cite this one."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/citations"

async def get_references(pmcid: str) -> list[str]:
    """Get papers this one cites."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/references"

Phase 4: Text-Mined Annotations

Europe PMC extracts entities automatically:

async def get_annotations(pmcid: str) -> dict:
    """Get text-mined entities (genes, diseases, drugs)."""
    url = f"https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
    params = {
        "articleIds": f"PMC:{pmcid}",
        "type": "Gene_Proteins,Diseases,Chemicals",
        "format": "JSON",
    }
    # Returns structured entity mentions with positions

Supplementary File Retrieval

From reference repo (bioinformatics_tools.py lines 123-149):

def get_figures(pmcid: str) -> dict[str, str]:
    """Download figures and supplementary files."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
    # Returns ZIP with images, returns base64-encoded

Preprint-Specific Features

Identify Preprint Servers

PREPRINT_SOURCES = {
    "PPR": "General preprints",
    "bioRxiv": "Biology preprints",
    "medRxiv": "Medical preprints",
    "chemRxiv": "Chemistry preprints",
    "Research Square": "Multi-disciplinary",
    "Preprints.org": "MDPI preprints",
}

# Check if published version exists
async def check_published_version(preprint_doi: str) -> str | None:
    """Check if preprint has been peer-reviewed and published."""
    # Europe PMC links preprints to final versions

Rate Limiting

Europe PMC is more generous than NCBI:

# No documented hard limit, but be respectful
# Recommend: 10-20 requests/second max
# Use email in User-Agent for polite pool
headers = {
    "User-Agent": "DeepBoner/1.0 (mailto:your@email.com)"
}

vs. The Lens & OpenAlex

Feature Europe PMC The Lens OpenAlex
Biomedical Focus Yes Partial Partial
Preprints Yes (34 servers) Yes Yes
Full Text PMC papers Links No
Citations Yes Yes Yes
Annotations Yes (text-mined) No No
Rate Limits Generous Moderate Very generous
API Key Optional Required Optional

Sources