# DeepBoner Data Sources: Roadmap Summary
**Created**: 2024-11-27
**Purpose**: Future maintainability and hackathon continuation
---
## Current State
### Working Tools
| Tool | Status | Data Quality |
|------|--------|--------------|
| PubMed | ✅ Works | Good (abstracts only) |
| ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) |
| Europe PMC | ✅ Works | Good (includes preprints) |
### Removed Tools
| Tool | Status | Reason |
|------|--------|--------|
| bioRxiv | ❌ Removed | No search API - only date/DOI lookup |
---
## Priority Improvements
### P0: Critical (Do First)
1. **Add Rate Limiting to PubMed**
- NCBI will block us without it
- Use `limits` library (see reference repo)
- 3/sec without key, 10/sec with key
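The rate limits above can be sketched with a small sliding-window limiter. This is a stdlib-only illustration, not the `limits`-based code from the reference repo; the class and function names (`SlidingWindowLimiter`, `pubmed_limiter`) are hypothetical, and only the 3/sec and 10/sec figures come from this roadmap.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls: int, period: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits in the window, else False."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next call would be allowed (0.0 if allowed now)."""
        now = self._clock()
        if len(self._calls) < self.max_calls:
            return 0.0
        return max(0.0, self.period - (now - self._calls[0]))


def pubmed_limiter(has_api_key: bool) -> SlidingWindowLimiter:
    # Rates taken from this roadmap: 3/sec without a key, 10/sec with one.
    return SlidingWindowLimiter(10 if has_api_key else 3, period=1.0)
```

A PubMed tool would call `try_acquire()` before each request and sleep for `wait_time()` when it returns `False`. Swapping this for the `limits` library should only change the limiter object, not the call sites.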
### P1: High Value, Medium Effort
2. **Add OpenAlex as 4th Source**
- Citation network (huge for drug repurposing)
- Concept tagging (semantic discovery)
- Already implemented in reference repo
- Free, no API key
3. **PubMed Full-Text via BioC**
- Get full paper text for PMC papers
- Already in reference repo
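For the OpenAlex item, a minimal sketch of a basic search might look like the following. It assumes the public `https://api.openalex.org/works` endpoint with its `search`, `per-page`, and `mailto` query parameters, and the `results`, `cited_by_count`, and `referenced_works` response fields. The helper names are hypothetical and this is not the reference repo's implementation.

```python
from typing import Any, Dict, List, Optional
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_openalex_url(query: str, per_page: int = 10,
                       mailto: Optional[str] = None) -> str:
    """Build a works-search URL; `mailto` is optional contact info for OpenAlex."""
    params: Dict[str, Any] = {"search": query, "per-page": per_page}
    if mailto:
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def summarize_works(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep the fields useful for citation analysis, most-cited first."""
    works = [
        {
            "id": work.get("id"),
            "title": work.get("title"),
            "cited_by_count": work.get("cited_by_count", 0),
            # IDs of works this paper cites: the raw material for a citation graph.
            "referenced_works": work.get("referenced_works", []),
        }
        for work in payload.get("results", [])
    ]
    works.sort(key=lambda w: w["cited_by_count"], reverse=True)
    return works
```

The HTTP call is left out on purpose: fetch the URL with whatever client the existing tools use and pass the decoded JSON to `summarize_works`.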
### P2: Nice to Have
4. **ClinicalTrials.gov Results**
- Get efficacy data from completed trials
- Requires more complex API calls
5. **Europe PMC Annotations**
- Text-mined entities (genes, drugs, diseases)
- Automatic entity extraction
---
## Effort Estimates
| Improvement | Effort | Impact | Priority |
|-------------|--------|--------|----------|
| PubMed rate limiting | 1 hour | Stability | P0 |
| OpenAlex basic search | 2 hours | High | P1 |
| OpenAlex citations | 2 hours | Very High | P1 |
| PubMed full-text | 3 hours | Medium | P1 |
| CT.gov results | 4 hours | Medium | P2 |
| Europe PMC annotations | 3 hours | Medium | P2 |
---
## Architecture Decision
### Option A: Keep Current + Add OpenAlex
```
                    User Query
                         ↓
     ┌───────────────────┼───────────────────┐
     ↓                   ↓                   ↓
  PubMed          ClinicalTrials        Europe PMC
(abstracts)        (trials only)        (preprints)
     ↓                   ↓                   ↓
     └───────────────────┼───────────────────┘
                         ↓
                     OpenAlex ← NEW
               (citations, concepts)
                         ↓
                   Orchestrator
                         ↓
                      Report
```
**Pros**: Low risk, additive
**Cons**: More complexity, some overlap
### Option B: OpenAlex as Primary
```
                    User Query
                         ↓
     ┌───────────────────┼───────────────────┐
     ↓                   ↓                   ↓
 OpenAlex         ClinicalTrials        Europe PMC
 (primary          (trials only)        (full-text
  search)                                fallback)
     ↓                   ↓                   ↓
     └───────────────────┼───────────────────┘
                         ↓
                   Orchestrator
                         ↓
                      Report
```
**Pros**: Simpler, citation network built-in
**Cons**: Lose some PubMed-specific features
### Recommendation: Option A
Keep current architecture working, add OpenAlex incrementally.
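As a rough illustration of why Option A is additive, the sketch below fans a query out to a map of search tools and merges the results. The `Evidence` type, `search_all`, and the title-based de-duplication are assumptions made for this example, not the project's actual orchestrator, and OpenAlex is modeled here as one more entry in the tool map rather than as the separate enrichment step shown in the diagram.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List, Set


@dataclass(frozen=True)
class Evidence:
    source: str
    title: str
    url: str


# Hypothetical tool signature: an async function from query to results.
SearchFn = Callable[[str], Awaitable[List[Evidence]]]


async def search_all(query: str, tools: Dict[str, SearchFn]) -> List[Evidence]:
    """Run every tool concurrently; a failing tool contributes no results."""
    results = await asyncio.gather(
        *(fn(query) for fn in tools.values()), return_exceptions=True
    )
    merged: List[Evidence] = []
    seen: Set[str] = set()
    for result in results:
        if isinstance(result, BaseException):
            continue  # a real orchestrator would log the failure
        for item in result:
            # Crude overlap handling: the first source to report a title wins.
            key = item.title.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged
```

Under this shape, adding OpenAlex means adding one key to `tools`. The existing three sources keep working if the new one fails, which is the "low risk" half of the trade-off, while the de-duplication step is where the "some overlap" cost shows up.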
---
## Quick Wins (Can Do Today)
1. **Add `limits` to `pyproject.toml`**
```toml
dependencies = [
"limits>=3.0",
]
```
2. **Copy OpenAlex tool from reference repo**
- File: `reference_repos/DeepBoner/DeepResearch/src/tools/openalex_tools.py`
- Adapt to our `SearchTool` base class
3. **Enable NCBI API Key**
- Add to `.env`: `NCBI_API_KEY=your_key`
- Raises the rate limit from 3 to 10 requests/sec (roughly 3x)
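One way the API key could be wired in is sketched below. It assumes `.env` has already been loaded into the process environment by whatever mechanism the project uses, and that requests go to the E-utilities `esearch.fcgi` endpoint, which accepts an `api_key` parameter. The helper names are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def _api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured NCBI key, or "" when none is set."""
    env = os.environ if env is None else env
    return env.get("NCBI_API_KEY", "").strip()


def esearch_params(term: str, retmax: int = 10,
                   env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Query parameters for a PubMed search; the key is added only when present."""
    params = {"db": "pubmed", "term": term, "retmax": str(retmax), "retmode": "json"}
    key = _api_key(env)
    if key:
        params["api_key"] = key
    return params


def requests_per_second(env: Optional[Mapping[str, str]] = None) -> int:
    # Rates from this roadmap: 3/sec without a key, 10/sec with one.
    return 10 if _api_key(env) else 3


def esearch_url(term: str, env: Optional[Mapping[str, str]] = None) -> str:
    return f"{ESEARCH_URL}?{urlencode(esearch_params(term, env=env))}"
```

Passing `env` explicitly keeps the helpers testable without touching real environment variables. In normal use the argument is omitted and `os.environ` is read.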
---
## External Resources Worth Exploring
### Python Libraries
| Library | For | Notes |
|---------|-----|-------|
| `limits` | Rate limiting | Used by reference repo |
| `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
| `metapub` | PubMed | Full-featured |
| `sentence-transformers` | Semantic search | For embeddings |
### APIs Not Yet Used
| API | Provides | Effort |
|-----|----------|--------|
| RxNorm | Drug name normalization | Low |
| DrugBank | Drug targets/mechanisms | Medium (license) |
| UniProt | Protein data | Medium |
| ChEMBL | Bioactivity data | Medium |
### RAG Tools (Future)
| Tool | Purpose |
|------|---------|
| [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
| [txtai](https://github.com/neuml/txtai) | Embeddings + search |
| [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |
---
## Files in This Directory
| File | Contents |
|------|----------|
| `00_ROADMAP_SUMMARY.md` | This file |
| `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
| `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
| `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
| `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |
---
## For Future Maintainers
If you're picking this up after the hackathon:
1. **Start with OpenAlex** - biggest bang for buck
2. **Add rate limiting** - prevents API blocks
3. **Don't bother with bioRxiv** - use Europe PMC instead
4. **Reference repo is gold** - `reference_repos/DeepBoner/` has working implementations
Good luck!