DeepBoner / docs /brainstorming /00_ROADMAP_SUMMARY.md
VibecoderMcSwaggins's picture
rebrand: DeepCritical β†’ DeepBoner (sexual health research agent)
5d12635
|
raw
history blame
5.61 kB
# DeepBoner Data Sources: Roadmap Summary
**Created**: 2024-11-27
**Purpose**: Future maintainability and hackathon continuation
---
## Current State
### Working Tools
| Tool | Status | Data Quality |
|------|--------|--------------|
| PubMed | βœ… Works | Good (abstracts only) |
| ClinicalTrials.gov | βœ… Works | Good (filtered for interventional) |
| Europe PMC | βœ… Works | Good (includes preprints) |
### Removed Tools
| Tool | Status | Reason |
|------|--------|--------|
| bioRxiv | ❌ Removed | No search API - only date/DOI lookup |
---
## Priority Improvements
### P0: Critical (Do First)
1. **Add Rate Limiting to PubMed**
- NCBI will block us without it
- Use `limits` library (see reference repo)
- 3/sec without key, 10/sec with key
### P1: High Value, Medium Effort
2. **Add OpenAlex as 4th Source**
- Citation network (huge for drug repurposing)
- Concept tagging (semantic discovery)
- Already implemented in reference repo
- Free, no API key
3. **PubMed Full-Text via BioC**
- Get full paper text for PMC papers
- Already in reference repo
### P2: Nice to Have
4. **ClinicalTrials.gov Results**
- Get efficacy data from completed trials
- Requires more complex API calls
5. **Europe PMC Annotations**
- Text-mined entities (genes, drugs, diseases)
- Automatic entity extraction
---
## Effort Estimates
| Improvement | Effort | Impact | Priority |
|-------------|--------|--------|----------|
| PubMed rate limiting | 1 hour | Stability | P0 |
| OpenAlex basic search | 2 hours | High | P1 |
| OpenAlex citations | 2 hours | Very High | P1 |
| PubMed full-text | 3 hours | Medium | P1 |
| CT.gov results | 4 hours | Medium | P2 |
| Europe PMC annotations | 3 hours | Medium | P2 |
---
## Architecture Decision
### Option A: Keep Current + Add OpenAlex
```
User Query
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
↓ ↓ ↓
PubMed ClinicalTrials Europe PMC
(abstracts) (trials only) (preprints)
↓ ↓ ↓
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
OpenAlex ← NEW
(citations, concepts)
↓
Orchestrator
↓
Report
```
**Pros**: Low risk, additive
**Cons**: More complexity, some overlap
### Option B: OpenAlex as Primary
```
User Query
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
↓ ↓ ↓
OpenAlex ClinicalTrials Europe PMC
(primary (trials only) (full-text
search) fallback)
↓ ↓ ↓
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
Orchestrator
↓
Report
```
**Pros**: Simpler, citation network built-in
**Cons**: Lose some PubMed-specific features
### Recommendation: Option A
Keep current architecture working, add OpenAlex incrementally.
---
## Quick Wins (Can Do Today)
1. **Add `limits` to `pyproject.toml`**
```toml
dependencies = [
"limits>=3.0",
]
```
2. **Copy OpenAlex tool from reference repo**
- File: `reference_repos/DeepBoner/DeepResearch/src/tools/openalex_tools.py`
- Adapt to our `SearchTool` base class
3. **Enable NCBI API Key**
- Add to `.env`: `NCBI_API_KEY=your_key`
- 10x rate limit improvement
---
## External Resources Worth Exploring
### Python Libraries
| Library | For | Notes |
|---------|-----|-------|
| `limits` | Rate limiting | Used by reference repo |
| `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
| `metapub` | PubMed | Full-featured |
| `sentence-transformers` | Semantic search | For embeddings |
### APIs Not Yet Used
| API | Provides | Effort |
|-----|----------|--------|
| RxNorm | Drug name normalization | Low |
| DrugBank | Drug targets/mechanisms | Medium (license) |
| UniProt | Protein data | Medium |
| ChEMBL | Bioactivity data | Medium |
### RAG Tools (Future)
| Tool | Purpose |
|------|---------|
| [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
| [txtai](https://github.com/neuml/txtai) | Embeddings + search |
| [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |
---
## Files in This Directory
| File | Contents |
|------|----------|
| `00_ROADMAP_SUMMARY.md` | This file |
| `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
| `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
| `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
| `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |
---
## For Future Maintainers
If you're picking this up after the hackathon:
1. **Start with OpenAlex** - biggest bang for buck
2. **Add rate limiting** - prevents API blocks
3. **Don't bother with bioRxiv** - use Europe PMC instead
4. **Reference repo is gold** - `reference_repos/DeepBoner/` has working implementations
Good luck! πŸš€