# PubMed Tool: Current State & Future Improvements
**Status**: Currently Implemented
**Priority**: High (Core Data Source)
---
## Current Implementation
### What We Have (`src/tools/pubmed.py`)
- Basic E-utilities search via `esearch.fcgi` and `efetch.fcgi`
- Query preprocessing (strips question words, expands synonyms)
- Returns: title, abstract, authors, journal, PMID
- Rate limiting: None implemented (no client-side throttling; we depend on NCBI's server-side enforcement)
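The two-step search flow above (look up PMIDs with `esearch.fcgi`, then fetch records with `efetch.fcgi`) can be sketched as plain URL construction. This is a minimal illustration, not the contents of `src/tools/pubmed.py`; the helper names `build_esearch_url` and `build_efetch_url` are hypothetical, and the parameter choices (`retmax`, `retmode`) are assumptions about what the tool would want.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Public base URL for NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, retmax: int = 10, api_key: str | None = None) -> str:
    """Build an esearch.fcgi URL that returns matching PMIDs as JSON."""
    params: dict[str, str | int] = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    if api_key:
        # An API key raises the allowed request rate (see the table below in this doc).
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(pmids: list[str], api_key: str | None = None) -> str:
    """Build an efetch.fcgi URL that returns full records (XML) for the given PMIDs."""
    params: dict[str, str] = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "retmode": "xml",
    }
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"
```

Keeping URL construction separate from the HTTP call makes the query logic testable without touching the network.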
### Current Limitations
1. **No Full-Text Access**: Only retrieves abstracts, not full paper text
2. **No Rate Limiting**: Risk of being blocked by NCBI
3. **No BioC Format**: Missing structured full-text extraction
4. **No Figure Retrieval**: No supplementary materials access
5. **No PMC Integration**: Missing open-access full-text via PMC
---
## Reference Implementation (DeepBoner Reference Repo)
The reference repo at `reference_repos/DeepBoner/DeepResearch/src/tools/bioinformatics_tools.py` has a more sophisticated implementation:
### Features We're Missing
```python
from typing import Any

# Rate limiting (lines 47-50)
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)
rate_limit = parse("3/second")  # NCBI allows 3/sec without API key, 10/sec with

# Full-text via BioC format (lines 108-120)
def _get_fulltext(pmid: int) -> dict[str, Any] | None:
    pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Returns structured JSON with full text for open-access papers
    ...  # rest of the body is omitted in this excerpt

# Figure retrieval via Europe PMC (lines 123-149)
def _get_figures(pmcid: str) -> dict[str, str]:
    suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
    # Returns base64-encoded images from supplementary materials
    ...  # rest of the body is omitted in this excerpt
```
---
## Recommended Improvements
### Phase 1: Rate Limiting (Critical)
```python
# Add to src/tools/pubmed.py
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)

# With NCBI_API_KEY: 10/sec, without: 3/sec
# `settings` is the project's config object; it must expose `ncbi_api_key`.
def get_rate_limit():
    if settings.ncbi_api_key:
        return parse("10/second")
    return parse("3/second")
```
**Dependencies**: `pip install limits`
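For comparison, the same moving-window idea can be sketched with the standard library alone. This is a dependency-free stand-in for illustration, not the `limits` API and not the reference repo's code; `SlidingWindowLimiter`, `try_acquire`, and `acquire` are hypothetical names, and the injectable `clock` exists only to make the behavior easy to test.

```python
from __future__ import annotations

import asyncio
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_calls` acquisitions in any `period`-second window."""

    def __init__(
        self,
        max_calls: int,
        period: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._stamps: deque[float] = deque()  # timestamps of recent calls

    def try_acquire(self) -> bool:
        """Record a call and return True if a slot is free, else return False."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False

    async def acquire(self, poll_interval: float = 0.05) -> None:
        """Wait, without blocking the event loop, until a slot is free."""
        while not self.try_acquire():
            await asyncio.sleep(poll_interval)
```

A caller would `await limiter.acquire()` immediately before each E-utilities request, constructing the limiter with 3 or 10 calls per second depending on whether an API key is configured.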
### Phase 2: Full-Text Retrieval
```python
async def get_fulltext(pmid: str) -> str | None:
    """Get full text for open-access papers via BioC API."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Only works for PMC papers (open access)
```
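Once the BioC JSON has been fetched, the passages still need to be flattened into text. The sketch below assumes the usual BioC JSON layout (a collection with `documents`, each holding `passages` with a `text` field) and tolerates the collection being wrapped in a list; that layout is an assumption to verify against a real response, and `extract_fulltext` is a hypothetical helper name.

```python
from __future__ import annotations

from typing import Any


def extract_fulltext(payload: Any) -> str | None:
    """Join all non-empty passage texts from a BioC JSON payload.

    Accepts either a single collection dict or a list of collections.
    Returns None when no passage text is found.
    """
    collections = payload if isinstance(payload, list) else [payload]
    texts: list[str] = []
    for collection in collections:
        if not isinstance(collection, dict):
            continue  # skip anything that is not a BioC collection
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                text = (passage.get("text") or "").strip()
                if text:
                    texts.append(text)
    return "\n\n".join(texts) if texts else None
```

Returning `None` for an empty result matches the `str | None` signature proposed for `get_fulltext`, so callers can fall back to the abstract.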
### Phase 3: PMC ID Resolution
```python
async def get_pmc_id(pmid: str) -> str | None:
    """Convert PMID to PMCID for full-text access."""
    url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json"
```
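The response from the ID converter then has to be reduced to a single PMCID. A minimal sketch, assuming the JSON response carries a `records` list whose entries include a `pmcid` field when a mapping exists; `parse_pmcid` is a hypothetical helper, and the identifiers in the usage example are made up.

```python
from __future__ import annotations

from typing import Any


def parse_pmcid(payload: dict[str, Any]) -> str | None:
    """Return the first PMCID found in an ID-converter JSON payload, or None."""
    for record in payload.get("records", []):
        pmcid = record.get("pmcid")
        if pmcid:
            return str(pmcid)
    # No record carried a PMCID: the paper is probably not in PMC.
    return None


# Usage with invented sample payloads (not real identifiers):
found = parse_pmcid({"records": [{"pmid": "1", "pmcid": "PMC0000001"}]})
missing = parse_pmcid({"records": [{"pmid": "2", "status": "error"}]})
```

A `None` result is the signal to skip the full-text step and keep only the abstract.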
---
## Python Libraries to Consider
| Library | Purpose | Notes |
|---------|---------|-------|
| [Biopython](https://biopython.org/) | `Bio.Entrez` module | Community-maintained, widely used (not an NCBI product) |
| [PyMed](https://pypi.org/project/pymed/) | PubMed wrapper | Simpler API, less control |
| [metapub](https://pypi.org/project/metapub/) | Full-featured | Tested on 1/3 of PubMed |
| [limits](https://pypi.org/project/limits/) | Rate limiting | Used by reference repo |
---
## API Endpoints Reference
| Endpoint | Purpose | Rate Limit |
|----------|---------|------------|
| `esearch.fcgi` | Search for PMIDs | 3/sec (10 with key) |
| `efetch.fcgi` | Fetch metadata | 3/sec (10 with key) |
| `esummary.fcgi` | Quick metadata | 3/sec (10 with key) |
| `pmcoa.cgi/BioC_json` | Full text (PMC only) | Unknown |
| `idconv/v1.0` | PMID ↔ PMCID | Unknown |
---
## Sources
- [PubMed E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
- [NCBI BioC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/)
- [Searching PubMed with Python](https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/)
- [PyMed on PyPI](https://pypi.org/project/pymed/)