# OpenAlex Integration: The Missing Piece?

**Status**: NOT Implemented (Candidate for Addition)

**Priority**: HIGH - Could Replace Multiple Tools

**Reference**: Already implemented in `reference_repos/DeepBoner`

---
## What is OpenAlex?
OpenAlex is a **fully open** index of the global research system:
- **209M+ works** (papers, books, datasets)
- **213M+ author records** (disambiguated)
- **124K+ venues** (journals, repositories)
- **109K+ institutions**
- **65K+ concepts** (hierarchical, linked to Wikidata)
**Free. Open. No API key required.**

---
## Why OpenAlex for DeepBoner?
### Current Architecture
```
User Query
    ↓
┌──────────────────────────────────────┐
│ PubMed   ClinicalTrials   Europe PMC │  ← 3 separate APIs
└──────────────────────────────────────┘
    ↓
Orchestrator (deduplicate, judge, synthesize)
```
### With OpenAlex
```
User Query
    ↓
┌──────────────────────────────────────┐
│               OpenAlex               │  ← Single API
│   (includes PubMed + preprints +     │
│   citations + concepts + authors)    │
└──────────────────────────────────────┘
    ↓
Orchestrator (enrich with CT.gov for trials)
```
**OpenAlex already aggregates**:
- PubMed/MEDLINE
- Crossref
- ORCID
- Unpaywall (open access links)
- Microsoft Academic Graph (legacy)
- Preprint servers
---
## Reference Implementation
From `reference_repos/DeepBoner/DeepResearch/src/tools/openalex_tools.py`:
```python
class OpenAlexFetchTool(ToolRunner):
    def __init__(self):
        super().__init__(
            ToolSpec(
                name="openalex_fetch",
                description="Fetch OpenAlex work or author",
                inputs={"entity": "TEXT", "identifier": "TEXT"},
                outputs={"result": "JSON"},
            )
        )

    def run(self, params: dict[str, Any]) -> ExecutionResult:
        entity = params["entity"]  # "works", "authors", "venues"
        identifier = params["identifier"]
        base = "https://api.openalex.org"
        url = f"{base}/{entity}/{identifier}"
        resp = requests.get(url, timeout=30)
        return ExecutionResult(success=True, data={"result": resp.json()})
```
---
## OpenAlex API Features
### Search Works (Papers)
```python
# Search for metformin + cancer papers
url = "https://api.openalex.org/works"
params = {
    "search": "metformin cancer drug repurposing",
    "filter": "publication_year:>2020,type:article",
    "sort": "cited_by_count:desc",
    "per-page": 50,  # the documented parameter name is hyphenated
}
```
### Rich Filtering
```python
# Filter examples
"publication_year:2023"
"type:article" # vs preprint, book, etc.
"is_oa:true" # Open access only
"concepts.id:C71924100" # Papers about "Medicine"
"authorships.institutions.id:I27837315" # From Harvard
"cited_by_count:>100" # Highly cited
"has_fulltext:true" # Full text indexed by OpenAlex for search (not returned by the API)
```
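Several filters can be sent in one request: comma-separated pairs are AND-ed, and values for a single attribute can be OR-ed with `|`. The sketch below shows one way to assemble that string from a dictionary. `build_filter` and `_render` are hypothetical helper names, not part of OpenAlex or any client library.

```python
from urllib.parse import urlencode


def _render(value: object) -> str:
    """Render one filter value; the API expects lowercase true/false."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_filter(conditions: dict[str, object]) -> str:
    """Join attribute/value pairs into an OpenAlex `filter` string.

    Pairs are AND-ed with commas; a list or tuple value is OR-ed with `|`.
    """
    parts = []
    for key, value in conditions.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        parts.append(f"{key}:{'|'.join(_render(v) for v in values)}")
    return ",".join(parts)


filter_string = build_filter(
    {
        "publication_year": ">2020",
        "type": ["article", "preprint"],  # article OR preprint
        "is_oa": True,
    }
)
# urlencode percent-escapes the string so it is safe to append to the URL.
query_string = urlencode({"search": "metformin cancer", "filter": filter_string})
print(filter_string)
```

Keeping the filter as a dictionary until the last moment makes it easier for the orchestrator to add or drop conditions per query.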
### What You Get Back
```json
{
  "id": "W2741809807",
  "title": "Metformin: A candidate drug for...",
  "publication_year": 2023,
  "type": "article",
  "cited_by_count": 45,
  "open_access": {"is_oa": true},
  "primary_location": {
    "source": {"display_name": "Nature Medicine"},
    "pdf_url": "https://...",
    "landing_page_url": "https://..."
  },
  "concepts": [
    {"id": "C71924100", "display_name": "Medicine", "score": 0.95},
    {"id": "C54355233", "display_name": "Pharmacology", "score": 0.88}
  ],
  "authorships": [
    {
      "author": {"id": "A123", "display_name": "John Smith"},
      "institutions": [{"display_name": "Harvard Medical School"}]
    }
  ],
  "referenced_works": ["W123", "W456"],
  "related_works": ["W789", "W012"]
}
```
This record is abridged and illustrative. The API returns IDs as full URLs (for example `https://openalex.org/W2741809807`) and nests the open-access flag under `open_access`. `referenced_works` lists the works this paper cites; `related_works` lists similar papers.
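A record like the one above has several optional parts, so code that reads it should tolerate missing or null fields. The following is a minimal sketch of flattening a work into a small summary. `WorkSummary` and `summarize_work` are hypothetical names introduced for illustration, and the sample record is made up.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class WorkSummary:
    openalex_id: str
    title: str
    year: Optional[int]
    venue: Optional[str]
    top_concepts: list[str]
    reference_count: int


def summarize_work(work: dict[str, Any], max_concepts: int = 3) -> WorkSummary:
    """Flatten an OpenAlex work record, tolerating null or missing fields."""
    # `primary_location` and its `source` can both be null.
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    # Rank concepts by score, highest first.
    concepts = sorted(
        work.get("concepts") or [],
        key=lambda c: c.get("score", 0.0),
        reverse=True,
    )
    return WorkSummary(
        openalex_id=work["id"],
        title=work.get("title") or "",
        year=work.get("publication_year"),
        venue=source.get("display_name"),
        top_concepts=[c["display_name"] for c in concepts[:max_concepts]],
        reference_count=len(work.get("referenced_works") or []),
    )


# Made-up sample with a null location, to exercise the fallbacks.
sample = {
    "id": "https://openalex.org/W2741809807",
    "title": "Metformin: A candidate drug for...",
    "publication_year": 2023,
    "primary_location": None,
    "concepts": [
        {"display_name": "Pharmacology", "score": 0.88},
        {"display_name": "Medicine", "score": 0.95},
    ],
    "referenced_works": ["W123", "W456"],
}
summary = summarize_work(sample)
print(summary)
```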
---
## Key Advantages Over Current Tools
### 1. Citation Network (We Don't Have This!)
```python
# Get papers that cite a work
url = f"https://api.openalex.org/works?filter=cites:{work_id}"
# Get papers cited by a work
# Already in `referenced_works` field
```
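A highly cited paper can have more citing works than fit in one response, so walking the citation network needs paging. The sketch below uses OpenAlex cursor paging (`cursor=*` on the first request, then `meta.next_cursor` from each response). `iter_citing_works` and the injectable `fetch` parameter are hypothetical names chosen for illustration; `fetch` exists so the loop can be tested without network access.

```python
import json
from typing import Any, Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

WORKS_URL = "https://api.openalex.org/works"


def http_get_json(url: str) -> dict[str, Any]:
    """Default fetcher: GET the URL and decode the JSON body."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_citing_works(
    work_id: str,
    fetch: Callable[[str], dict[str, Any]] = http_get_json,
    per_page: int = 200,
) -> Iterator[dict[str, Any]]:
    """Yield every work that cites `work_id`, following cursor paging."""
    cursor = "*"  # "*" asks the API to start a new cursor
    while cursor:
        query = urlencode(
            {"filter": f"cites:{work_id}", "per-page": per_page, "cursor": cursor}
        )
        page = fetch(f"{WORKS_URL}?{query}")
        results = page.get("results") or []
        if not results:
            break  # defensive stop in case an empty page still has a cursor
        yield from results
        # next_cursor is null once the last page has been returned
        cursor = (page.get("meta") or {}).get("next_cursor")
```

A caller would typically cap the number of works it consumes (for example with `itertools.islice`) rather than exhaust the iterator for very highly cited papers.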
### 2. Concept Tagging (We Don't Have This!)
OpenAlex auto-tags papers with hierarchical concepts:
- "Medicine" → "Pharmacology" → "Drug Repurposing"
- Can search by concept, not just keywords
### 3. Author Disambiguation (We Don't Have This!)
```python
# Find all works by an author
url = f"https://api.openalex.org/works?filter=authorships.author.id:{author_id}"
```
### 4. Institution Tracking
```python
# Find drug repurposing papers from top institutions
url = "https://api.openalex.org/works"
params = {
    "search": "drug repurposing",
    "filter": "authorships.institutions.id:I27837315",  # Harvard
}
```
### 5. Related Works
Each paper comes with `related_works`: semantically similar papers discovered by OpenAlex's ML.

---
## Proposed Implementation
### New Tool: `src/tools/openalex.py`
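```python
"""OpenAlex search tool for comprehensive scholarly data."""

from typing import Optional

import httpx

from src.tools.base import SearchTool
from src.utils.models import Evidence


def _abstract_text(inverted_index: Optional[dict[str, list[int]]]) -> str:
    """Rebuild the abstract; OpenAlex returns it as `abstract_inverted_index`."""
    if not inverted_index:
        return ""
    positions = [(pos, word) for word, idxs in inverted_index.items() for pos in idxs]
    return " ".join(word for _, word in sorted(positions))


class OpenAlexTool(SearchTool):
    """Search OpenAlex for scholarly works with rich metadata."""

    name = "openalex"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        async with httpx.AsyncClient(timeout=30.0) as client:
            resp = await client.get(
                "https://api.openalex.org/works",
                params={
                    "search": query,
                    "filter": "type:article,is_oa:true",
                    "sort": "cited_by_count:desc",
                    "per-page": max_results,
                    "mailto": "deepboner@example.com",  # Polite pool
                },
            )
            resp.raise_for_status()  # surface HTTP errors instead of parsing them
            data = resp.json()

        evidence = []
        for work in data["results"]:
            # primary_location can be null, so fall back to an empty dict
            location = work.get("primary_location") or {}
            concepts = work.get("concepts") or []
            evidence.append(
                Evidence(
                    source="openalex",
                    title=work.get("title") or "",
                    abstract=_abstract_text(work.get("abstract_inverted_index")),
                    # the work ID is itself a resolvable URL
                    url=location.get("landing_page_url") or work["id"],
                    metadata={
                        "cited_by_count": work.get("cited_by_count", 0),
                        "concepts": [c["display_name"] for c in concepts[:5]],
                        "is_open_access": (work.get("open_access") or {}).get("is_oa"),
                        "pdf_url": location.get("pdf_url"),
                    },
                )
            )
        return evidence
```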
---
## Rate Limits
OpenAlex is **generous**, but it does document limits:
- Limits at the time of writing: 100,000 requests/day and 10 requests/second (check the current docs before relying on these figures)
- **Polite pool**: Add `mailto=your@email.com` param for faster responses
- No API key required (optional for priority support)
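To stay within a per-second ceiling, the client can space its own requests. This is a minimal sketch of a throttle, assuming a ceiling of 10 requests per second (a figure to verify against the current OpenAlex docs). `Throttle` is a hypothetical helper, and the injectable `clock` and `sleep` arguments exist only so it can be tested without real waiting.

```python
import time
from typing import Callable


class Throttle:
    """Space calls so that at most `max_per_second` happen per second."""

    def __init__(
        self,
        max_per_second: float = 10.0,  # assumed ceiling, not confirmed here
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot one interval after this call's slot.
        self._next_allowed = max(now, self._next_allowed) + self._min_interval
        return delay
```

Calling `throttle.wait()` immediately before each HTTP request keeps a burst of queries under the ceiling. This version is synchronous; an async tool would need the same logic built on `asyncio.sleep`.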
---
## Should We Add OpenAlex?
### Arguments FOR
1. **Already in reference repo** - proven pattern
2. **Richer data** - citations, concepts, authors
3. **Single source** - reduces API complexity
4. **Free & open** - no API key required, generous rate limits
5. **Institution adoption** - Leiden, Sorbonne switched to it
### Arguments AGAINST
1. **Adds complexity** - another data source
2. **Overlap** - duplicates some PubMed data
3. **Not biomedical-focused** - covers all disciplines
4. **No full text** - still need PMC/Europe PMC for that
### Recommendation
**Add OpenAlex as a 4th source**, don't replace existing tools.
Use it for:
- Citation network analysis
- Concept-based discovery
- High-impact paper finding
- Author/institution tracking
Keep PubMed, ClinicalTrials, Europe PMC for:
- Authoritative biomedical search
- Clinical trial data
- Full-text access
- Preprint tracking
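Running OpenAlex alongside PubMed and Europe PMC means the same paper will often arrive more than once, so the orchestrator's deduplication step needs a shared key. The sketch below merges records by normalized DOI, relying on the fact that OpenAlex reports DOIs as full `https://doi.org/...` URLs. Records are plain dictionaries here rather than the project's `Evidence` model, and `normalize_doi` and `merge_by_doi` are hypothetical names.

```python
from typing import Any, Iterable, Optional


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip common URL/scheme prefixes."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def merge_by_doi(records: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge records sharing a DOI; the first record seen wins on conflicts."""
    by_doi: dict[str, dict[str, Any]] = {}
    merged: list[dict[str, Any]] = []
    for record in records:
        key = normalize_doi(record.get("doi"))
        if key is not None and key in by_doi:
            existing = by_doi[key]
            existing["sources"] = sorted(set(existing["sources"]) | {record["source"]})
            # Only fill fields the first record lacked (e.g. citation counts).
            for field, value in record.items():
                existing.setdefault(field, value)
            continue
        entry = dict(record, sources=[record["source"]])
        if key is not None:
            by_doi[key] = entry
        merged.append(entry)  # records without a DOI are kept unmerged
    return merged


# Made-up records: PubMed first, so its fields take precedence.
records = [
    {"source": "pubmed", "doi": "10.1000/XYZ", "title": "A"},
    {"source": "openalex", "doi": "https://doi.org/10.1000/xyz", "title": "A", "cited_by_count": 45},
    {"source": "openalex", "doi": None, "title": "B"},
]
result = merge_by_doi(records)
print(result)
```

Passing the biomedical sources first keeps their fields authoritative, while OpenAlex contributes the extras such as `cited_by_count`.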
---
## Implementation Priority
| Task | Effort | Value |
|------|--------|-------|
| Basic search | Low | High |
| Citation network | Medium | Very High |
| Concept filtering | Low | High |
| Related works | Low | High |
| Author tracking | Medium | Medium |
---
## Sources
- [OpenAlex Documentation](https://docs.openalex.org)
- [OpenAlex API Overview](https://docs.openalex.org/api)
- [OpenAlex Wikipedia](https://en.wikipedia.org/wiki/OpenAlex)
- [Leiden University Announcement](https://www.leidenranking.com/information/openalex)
- [OpenAlex: A fully-open index (Paper)](https://arxiv.org/abs/2205.01833)