| # DeepBoner Data Sources: Roadmap Summary |
|
|
| **Created**: 2024-11-27 |
| **Purpose**: Future maintainability and hackathon continuation |
|
|
| --- |
|
|
| ## Current State |
|
|
| ### Working Tools |
|
|
| | Tool | Status | Data Quality | |
| |------|--------|--------------| |
| | PubMed | β
Works | Good (abstracts only) | |
| | ClinicalTrials.gov | β
Works | Good (filtered for interventional) | |
| | Europe PMC | β
Works | Good (includes preprints) | |
|
|
| ### Removed Tools |
|
|
| | Tool | Status | Reason | |
| |------|--------|--------| |
| | bioRxiv | β Removed | No search API - only date/DOI lookup | |
|
|
| --- |
|
|
| ## Priority Improvements |
|
|
| ### P0: Critical (Do First) |
|
|
| 1. **Add Rate Limiting to PubMed** |
| - NCBI will block us without it |
| - Use `limits` library (see reference repo) |
| - 3/sec without key, 10/sec with key |
|
|
| ### P1: High Value, Medium Effort |
|
|
| 2. **Add OpenAlex as 4th Source** |
| - Citation network (huge for drug repurposing) |
| - Concept tagging (semantic discovery) |
| - Already implemented in reference repo |
| - Free, no API key |
|
|
| 3. **PubMed Full-Text via BioC** |
| - Get full paper text for PMC papers |
| - Already in reference repo |
|
|
| ### P2: Nice to Have |
|
|
| 4. **ClinicalTrials.gov Results** |
| - Get efficacy data from completed trials |
| - Requires more complex API calls |
|
|
| 5. **Europe PMC Annotations** |
| - Text-mined entities (genes, drugs, diseases) |
| - Automatic entity extraction |
|
|
| --- |
|
|
| ## Effort Estimates |
|
|
| | Improvement | Effort | Impact | Priority | |
| |-------------|--------|--------|----------| |
| | PubMed rate limiting | 1 hour | Stability | P0 | |
| | OpenAlex basic search | 2 hours | High | P1 | |
| | OpenAlex citations | 2 hours | Very High | P1 | |
| | PubMed full-text | 3 hours | Medium | P1 | |
| | CT.gov results | 4 hours | Medium | P2 | |
| | Europe PMC annotations | 3 hours | Medium | P2 | |
|
|
| --- |
|
|
| ## Architecture Decision |
|
|
| ### Option A: Keep Current + Add OpenAlex |
|
|
| ``` |
| User Query |
| β |
| βββββββββββββββββββββΌββββββββββββββββββββ |
| β β β |
| PubMed ClinicalTrials Europe PMC |
| (abstracts) (trials only) (preprints) |
| β β β |
| βββββββββββββββββββββΌββββββββββββββββββββ |
| β |
| OpenAlex β NEW |
| (citations, concepts) |
| β |
| Orchestrator |
| β |
| Report |
| ``` |
|
|
| **Pros**: Low risk, additive |
| **Cons**: More complexity, some overlap |
|
|
| ### Option B: OpenAlex as Primary |
|
|
| ``` |
| User Query |
| β |
| βββββββββββββββββββββΌββββββββββββββββββββ |
| β β β |
| OpenAlex ClinicalTrials Europe PMC |
| (primary (trials only) (full-text |
| search) fallback) |
| β β β |
| βββββββββββββββββββββΌββββββββββββββββββββ |
| β |
| Orchestrator |
| β |
| Report |
| ``` |
|
|
| **Pros**: Simpler, citation network built-in |
| **Cons**: Lose some PubMed-specific features |
|
|
| ### Recommendation: Option A |
|
|
| Keep current architecture working, add OpenAlex incrementally. |
|
|
| --- |
|
|
| ## Quick Wins (Can Do Today) |
|
|
| 1. **Add `limits` to `pyproject.toml`** |
| ```toml |
| dependencies = [ |
| "limits>=3.0", |
| ] |
| ``` |
|
|
| 2. **Copy OpenAlex tool from reference repo** |
| - File: `reference_repos/DeepBoner/DeepResearch/src/tools/openalex_tools.py` |
| - Adapt to our `SearchTool` base class |
|
|
| 3. **Enable NCBI API Key** |
| - Add to `.env`: `NCBI_API_KEY=your_key` |
| - 10x rate limit improvement |
|
|
| --- |
|
|
| ## External Resources Worth Exploring |
|
|
| ### Python Libraries |
|
|
| | Library | For | Notes | |
| |---------|-----|-------| |
| | `limits` | Rate limiting | Used by reference repo | |
| | `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) | |
| | `metapub` | PubMed | Full-featured | |
| | `sentence-transformers` | Semantic search | For embeddings | |
|
|
| ### APIs Not Yet Used |
|
|
| | API | Provides | Effort | |
| |-----|----------|--------| |
| | RxNorm | Drug name normalization | Low | |
| | DrugBank | Drug targets/mechanisms | Medium (license) | |
| | UniProt | Protein data | Medium | |
| | ChEMBL | Bioactivity data | Medium | |
|
|
| ### RAG Tools (Future) |
|
|
| | Tool | Purpose | |
| |------|---------| |
| | [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers | |
| | [txtai](https://github.com/neuml/txtai) | Embeddings + search | |
| | [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings | |
|
|
| --- |
|
|
| ## Files in This Directory |
|
|
| | File | Contents | |
| |------|----------| |
| | `00_ROADMAP_SUMMARY.md` | This file | |
| | `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details | |
| | `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details | |
| | `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details | |
| | `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan | |
|
|
| --- |
|
|
| ## For Future Maintainers |
|
|
| If you're picking this up after the hackathon: |
|
|
| 1. **Start with OpenAlex** - biggest bang for buck |
| 2. **Add rate limiting** - prevents API blocks |
| 3. **Don't bother with bioRxiv** - use Europe PMC instead |
| 4. **Reference repo is gold** - `reference_repos/DeepBoner/` has working implementations |
|
|
| Good luck! π |
|
|