DeepBoner Data Sources: Roadmap Summary
Created: 2024-11-27
Purpose: Future maintainability and hackathon continuation
Current State
Working Tools
| Tool | Status | Data Quality |
|---|---|---|
| PubMed | Works | Good (abstracts only) |
| ClinicalTrials.gov | Works | Good (filtered for interventional) |
| Europe PMC | Works | Good (includes preprints) |
Removed Tools
| Tool | Status | Reason |
|---|---|---|
| bioRxiv | Removed | No search API, only date/DOI lookup |
Priority Improvements
P0: Critical (Do First)
- Add Rate Limiting to PubMed
  - NCBI will block us without it
  - Use the `limits` library (see reference repo)
  - Limits: 3 requests/sec without an API key, 10 requests/sec with one
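As a rough illustration of what the limiter has to do, here is a minimal stdlib-only sketch of a sliding-window limiter. It is a stand-in for the idea, not the `limits` library API and not the reference repo's implementation; `SlidingWindowLimiter` and `pubmed_limiter` are hypothetical names, and only the 3/sec and 10/sec rates come from this roadmap.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._stamps = deque()   # timestamps of calls still inside the window

    def acquire(self):
        """Block until a call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                self._stamps.append(now)
                return
            # Window is full: wait until the oldest call expires.
            self._sleep(self.period - (now - self._stamps[0]))


def pubmed_limiter(api_key=None, **kwargs):
    """Hypothetical helper: 3 requests/sec without an NCBI key, 10 with one."""
    return SlidingWindowLimiter(10 if api_key else 3, 1.0, **kwargs)
```

Calling `limiter.acquire()` immediately before each PubMed request would keep a single process under the limit; several workers sharing one key would need a shared limiter.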
P1: High Value, Medium Effort
Add OpenAlex as 4th Source
- Citation network (huge for drug repurposing)
- Concept tagging (semantic discovery)
- Already implemented in reference repo
- Free, no API key
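To make the citation and concept data concrete, the sketch below builds a search URL for the public OpenAlex works endpoint and flattens a decoded response. It is not the reference repo's tool. The helper names are hypothetical, and the response fields used (`results`, `display_name`, `cited_by_count`, `referenced_works`, `concepts`) reflect my understanding of the OpenAlex schema and should be checked against the current API documentation.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_search_url(query, per_page=10):
    """Build a search URL for the OpenAlex works endpoint."""
    return f"{OPENALEX_WORKS_URL}?{urlencode({'search': query, 'per-page': per_page})}"


def parse_works(payload, max_concepts=3):
    """Flatten a decoded works response into small dicts, most-cited first."""
    works = []
    for item in payload.get("results", []):
        # Keep only the highest-scoring concepts for each work.
        concepts = sorted(
            item.get("concepts") or [],
            key=lambda c: c.get("score", 0),
            reverse=True,
        )
        works.append({
            "id": item.get("id"),
            "title": item.get("display_name"),
            "cited_by_count": item.get("cited_by_count", 0),
            # IDs of works this one cites: the raw material for a citation network.
            "references": item.get("referenced_works") or [],
            "concepts": [c.get("display_name") for c in concepts[:max_concepts]],
        })
    works.sort(key=lambda w: w["cited_by_count"], reverse=True)
    return works
```

The URL can be fetched with any HTTP client and the JSON body passed to `parse_works`. Keeping the parsing separate from the network call makes it easy to unit-test against saved responses.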
PubMed Full-Text via BioC
- Get full paper text for PMC papers
- Already in reference repo
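For orientation, a hedged sketch of the BioC side: it builds a request URL and groups passage text by section from a decoded BioC JSON payload. The URL pattern and the `documents` / `passages` / `infons.section_type` layout are assumptions based on my recollection of the NCBI BioC API for PMC open-access articles, so verify both before relying on them. The function names are hypothetical and this is not the reference repo's code.

```python
# Assumed URL pattern for the NCBI BioC PMC open-access service.
BIOC_URL = (
    "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/"
    "pmcoa.cgi/BioC_json/{article_id}/unicode"
)


def bioc_url(article_id):
    """Return the BioC JSON URL for a PMC ID or PMID (assumed pattern)."""
    return BIOC_URL.format(article_id=article_id)


def passages_by_section(bioc):
    """Group non-empty passage text by section type.

    Accepts either a single collection dict or a list of collections,
    since the top-level shape is an assumption here.
    """
    collections = bioc if isinstance(bioc, list) else [bioc]
    sections = {}
    for collection in collections:
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                text = (passage.get("text") or "").strip()
                if not text:
                    continue
                section = passage.get("infons", {}).get("section_type", "UNKNOWN")
                sections.setdefault(section, []).append(text)
    return sections
```

Grouping by section would let the orchestrator pass only the methods or results of a paper downstream instead of the whole text.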
P2: Nice to Have
ClinicalTrials.gov Results
- Get efficacy data from completed trials
- Requires more complex API calls
Europe PMC Annotations
- Text-mined entities (genes, drugs, diseases)
- Automatic entity extraction
Effort Estimates
| Improvement | Effort | Impact | Priority |
|---|---|---|---|
| PubMed rate limiting | 1 hour | Stability | P0 |
| OpenAlex basic search | 2 hours | High | P1 |
| OpenAlex citations | 2 hours | Very High | P1 |
| PubMed full-text | 3 hours | Medium | P1 |
| CT.gov results | 4 hours | Medium | P2 |
| Europe PMC annotations | 3 hours | Medium | P2 |
Architecture Decision
Option A: Keep Current + Add OpenAlex

                        User Query
                             ↓
         ┌───────────────────┼───────────────────┐
         ↓                   ↓                   ↓
      PubMed          ClinicalTrials        Europe PMC
    (abstracts)        (trials only)        (preprints)
         ↓                   ↓                   ↓
         └───────────────────┼───────────────────┘
                             ↓
                         OpenAlex ← NEW
                   (citations, concepts)
                             ↓
                       Orchestrator
                             ↓
                          Report

Pros: Low risk, additive
Cons: More complexity, some overlap
Option B: OpenAlex as Primary

                        User Query
                             ↓
         ┌───────────────────┼───────────────────┐
         ↓                   ↓                   ↓
     OpenAlex         ClinicalTrials        Europe PMC
     (primary          (trials only)        (full-text
      search)                                fallback)
         ↓                   ↓                   ↓
         └───────────────────┼───────────────────┘
                             ↓
                       Orchestrator
                             ↓
                          Report

Pros: Simpler, citation network built-in
Cons: Lose some PubMed-specific features
Recommendation: Option A
Keep current architecture working, add OpenAlex incrementally.
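The Option A flow can be sketched as a concurrent fan-out followed by a single enrichment step. This is only an illustration under assumed interfaces, not the project's orchestrator. `run_option_a`, the shape of `search_tools` (source name mapped to an async callable returning dicts with a `title` and optional `doi`) and the `enrich` callable standing in for the OpenAlex step are all hypothetical.

```python
import asyncio


async def run_option_a(query, search_tools, enrich):
    """Fan out to every search tool, merge and de-duplicate, then enrich."""
    names = list(search_tools)
    # Query all sources concurrently; collect exceptions instead of raising.
    results = await asyncio.gather(
        *(search_tools[name](query) for name in names),
        return_exceptions=True,
    )
    merged, seen = [], set()
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            continue  # one failing source should not sink the whole query
        for record in result:
            # De-duplicate on DOI when present, else on normalized title,
            # since PubMed and Europe PMC overlap.
            key = record.get("doi") or record["title"].strip().lower()
            if key in seen:
                continue
            seen.add(key)
            merged.append({**record, "source": name})
    # Enrichment (e.g. citations and concepts) runs once on the merged set.
    return await enrich(merged)
```

Because enrichment is a separate stage, it can be added or switched off without touching the three existing tools, which is what keeps this option additive.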
Quick Wins (Can Do Today)
- Add `limits` to `pyproject.toml`: `dependencies = [ "limits>=3.0", ]`
- Copy OpenAlex tool from reference repo
  - File: `reference_repos/DeepBoner/DeepResearch/src/tools/openalex_tools.py`
  - Adapt to our `SearchTool` base class
- Enable NCBI API Key
  - Add to `.env`: `NCBI_API_KEY=your_key`
  - Raises the PubMed rate limit from 3 to 10 requests/sec (roughly 3x)
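A small sketch of how the key might be picked up at request time. It assumes `NCBI_API_KEY` is already in the process environment (however the project loads `.env`) and appends it as the `api_key` parameter of an E-utilities `esearch` call. `esearch_url` is a hypothetical helper, not existing project code.

```python
import os
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=20, env=os.environ):
    """Build a PubMed esearch URL, adding the API key only when one is set."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    api_key = env.get("NCBI_API_KEY")
    if api_key:
        # With a key, NCBI allows the higher request rate.
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

Passing `env` explicitly keeps the function testable and makes the keyless fallback obvious: without the variable the request still works, just at the lower rate.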
External Resources Worth Exploring
Python Libraries
| Library | For | Notes |
|---|---|---|
| `limits` | Rate limiting | Used by reference repo |
| `pyalex` | OpenAlex wrapper | GitHub |
| `metapub` | PubMed | Full-featured |
| `sentence-transformers` | Semantic search | For embeddings |
APIs Not Yet Used
| API | Provides | Effort |
|---|---|---|
| RxNorm | Drug name normalization | Low |
| DrugBank | Drug targets/mechanisms | Medium (license) |
| UniProt | Protein data | Medium |
| ChEMBL | Bioactivity data | Medium |
RAG Tools (Future)
| Tool | Purpose |
|---|---|
| PaperQA | RAG for scientific papers |
| txtai | Embeddings + search |
| PubMedBERT | Biomedical embeddings |
Files in This Directory
| File | Contents |
|---|---|
| `00_ROADMAP_SUMMARY.md` | This file |
| `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
| `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
| `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
| `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |
For Future Maintainers
If you're picking this up after the hackathon:
- Start with OpenAlex - biggest bang for buck
- Add rate limiting - prevents API blocks
- Don't bother with bioRxiv - use Europe PMC instead
- Reference repo is gold: `reference_repos/DeepBoner/` has working implementations

Good luck!