Spaces:
Running
Running
| # OpenAlex Integration: The Missing Piece? | |
| **Status**: NOT Implemented (Candidate for Addition) | |
| **Priority**: HIGH - Could Replace Multiple Tools | |
| **Reference**: Already implemented in `reference_repos/DeepBoner` | |
| --- | |
| ## What is OpenAlex? | |
| OpenAlex is a **fully open** index of the global research system: | |
| - **209M+ works** (papers, books, datasets) | |
| - **2B+ author records** (disambiguated) | |
| - **124K+ venues** (journals, repositories) | |
| - **109K+ institutions** | |
| - **65K+ concepts** (hierarchical, linked to Wikidata) | |
| **Free. Open. No API key required.** | |
| --- | |
| ## Why OpenAlex for DeepBoner? | |
| ### Current Architecture | |
| ``` | |
| User Query | |
| β | |
| ββββββββββββββββββββββββββββββββββββββββ | |
| β PubMed ClinicalTrials Europe PMC β β 3 separate APIs | |
| ββββββββββββββββββββββββββββββββββββββββ | |
| β | |
| Orchestrator (deduplicate, judge, synthesize) | |
| ``` | |
| ### With OpenAlex | |
| ``` | |
| User Query | |
| β | |
| ββββββββββββββββββββββββββββββββββββββββ | |
| β OpenAlex β β Single API | |
| β (includes PubMed + preprints + β | |
| β citations + concepts + authors) β | |
| ββββββββββββββββββββββββββββββββββββββββ | |
| β | |
| Orchestrator (enrich with CT.gov for trials) | |
| ``` | |
| **OpenAlex already aggregates**: | |
| - PubMed/MEDLINE | |
| - Crossref | |
| - ORCID | |
| - Unpaywall (open access links) | |
| - Microsoft Academic Graph (legacy) | |
| - Preprint servers | |
| --- | |
| ## Reference Implementation | |
| From `reference_repos/DeepBoner/DeepResearch/src/tools/openalex_tools.py`: | |
| ```python | |
| class OpenAlexFetchTool(ToolRunner): | |
| def __init__(self): | |
| super().__init__( | |
| ToolSpec( | |
| name="openalex_fetch", | |
| description="Fetch OpenAlex work or author", | |
| inputs={"entity": "TEXT", "identifier": "TEXT"}, | |
| outputs={"result": "JSON"}, | |
| ) | |
| ) | |
| def run(self, params: dict[str, Any]) -> ExecutionResult: | |
| entity = params["entity"] # "works", "authors", "venues" | |
| identifier = params["identifier"] | |
| base = "https://api.openalex.org" | |
| url = f"{base}/{entity}/{identifier}" | |
| resp = requests.get(url, timeout=30) | |
| return ExecutionResult(success=True, data={"result": resp.json()}) | |
| ``` | |
| --- | |
| ## OpenAlex API Features | |
| ### Search Works (Papers) | |
| ```python | |
| # Search for metformin + cancer papers | |
| url = "https://api.openalex.org/works" | |
| params = { | |
| "search": "metformin cancer drug repurposing", | |
| "filter": "publication_year:>2020,type:article", | |
| "sort": "cited_by_count:desc", | |
| "per_page": 50, | |
| } | |
| ``` | |
| ### Rich Filtering | |
| ```python | |
| # Filter examples | |
| "publication_year:2023" | |
| "type:article" # vs preprint, book, etc. | |
| "is_oa:true" # Open access only | |
| "concepts.id:C71924100" # Papers about "Medicine" | |
| "authorships.institutions.id:I27837315" # From Harvard | |
| "cited_by_count:>100" # Highly cited | |
| "has_fulltext:true" # Full text available | |
| ``` | |
| ### What You Get Back | |
| ```json | |
| { | |
| "id": "W2741809807", | |
| "title": "Metformin: A candidate drug for...", | |
| "publication_year": 2023, | |
| "type": "article", | |
| "cited_by_count": 45, | |
| "is_oa": true, | |
| "primary_location": { | |
| "source": {"display_name": "Nature Medicine"}, | |
| "pdf_url": "https://...", | |
| "landing_page_url": "https://..." | |
| }, | |
| "concepts": [ | |
| {"id": "C71924100", "display_name": "Medicine", "score": 0.95}, | |
| {"id": "C54355233", "display_name": "Pharmacology", "score": 0.88} | |
| ], | |
| "authorships": [ | |
| { | |
| "author": {"id": "A123", "display_name": "John Smith"}, | |
| "institutions": [{"display_name": "Harvard Medical School"}] | |
| } | |
| ], | |
| "referenced_works": ["W123", "W456"], # Citations | |
| "related_works": ["W789", "W012"] # Similar papers | |
| } | |
| ``` | |
| --- | |
| ## Key Advantages Over Current Tools | |
| ### 1. Citation Network (We Don't Have This!) | |
| ```python | |
| # Get papers that cite a work | |
| url = f"https://api.openalex.org/works?filter=cites:{work_id}" | |
| # Get papers cited by a work | |
| # Already in `referenced_works` field | |
| ``` | |
| ### 2. Concept Tagging (We Don't Have This!) | |
| OpenAlex auto-tags papers with hierarchical concepts: | |
| - "Medicine" β "Pharmacology" β "Drug Repurposing" | |
| - Can search by concept, not just keywords | |
| ### 3. Author Disambiguation (We Don't Have This!) | |
| ```python | |
| # Find all works by an author | |
| url = f"https://api.openalex.org/works?filter=authorships.author.id:{author_id}" | |
| ``` | |
| ### 4. Institution Tracking | |
| ```python | |
| # Find drug repurposing papers from top institutions | |
| url = "https://api.openalex.org/works" | |
| params = { | |
| "search": "drug repurposing", | |
| "filter": "authorships.institutions.id:I27837315", # Harvard | |
| } | |
| ``` | |
| ### 5. Related Works | |
| Each paper comes with `related_works` - semantically similar papers discovered by OpenAlex's ML. | |
| --- | |
| ## Proposed Implementation | |
| ### New Tool: `src/tools/openalex.py` | |
| ```python | |
| """OpenAlex search tool for comprehensive scholarly data.""" | |
| import httpx | |
| from src.tools.base import SearchTool | |
| from src.utils.models import Evidence | |
| class OpenAlexTool(SearchTool): | |
| """Search OpenAlex for scholarly works with rich metadata.""" | |
| name = "openalex" | |
| async def search(self, query: str, max_results: int = 10) -> list[Evidence]: | |
| async with httpx.AsyncClient() as client: | |
| resp = await client.get( | |
| "https://api.openalex.org/works", | |
| params={ | |
| "search": query, | |
| "filter": "type:article,is_oa:true", | |
| "sort": "cited_by_count:desc", | |
| "per_page": max_results, | |
| "mailto": "deepboner@example.com", # Polite pool | |
| }, | |
| ) | |
| data = resp.json() | |
| return [ | |
| Evidence( | |
| source="openalex", | |
| title=work["title"], | |
| abstract=work.get("abstract", ""), | |
| url=work["primary_location"]["landing_page_url"], | |
| metadata={ | |
| "cited_by_count": work["cited_by_count"], | |
| "concepts": [c["display_name"] for c in work["concepts"][:5]], | |
| "is_open_access": work["is_oa"], | |
| "pdf_url": work["primary_location"].get("pdf_url"), | |
| }, | |
| ) | |
| for work in data["results"] | |
| ] | |
| ``` | |
| --- | |
| ## Rate Limits | |
| OpenAlex is **extremely generous**: | |
| - No hard rate limit documented | |
| - Recommended: <100,000 requests/day | |
| - **Polite pool**: Add `mailto=your@email.com` param for faster responses | |
| - No API key required (optional for priority support) | |
| --- | |
| ## Should We Add OpenAlex? | |
| ### Arguments FOR | |
| 1. **Already in reference repo** - proven pattern | |
| 2. **Richer data** - citations, concepts, authors | |
| 3. **Single source** - reduces API complexity | |
| 4. **Free & open** - no keys, no limits | |
| 5. **Institution adoption** - Leiden, Sorbonne switched to it | |
| ### Arguments AGAINST | |
| 1. **Adds complexity** - another data source | |
| 2. **Overlap** - duplicates some PubMed data | |
| 3. **Not biomedical-focused** - covers all disciplines | |
| 4. **No full text** - still need PMC/Europe PMC for that | |
| ### Recommendation | |
| **Add OpenAlex as a 4th source**, don't replace existing tools. | |
| Use it for: | |
| - Citation network analysis | |
| - Concept-based discovery | |
| - High-impact paper finding | |
| - Author/institution tracking | |
| Keep PubMed, ClinicalTrials, Europe PMC for: | |
| - Authoritative biomedical search | |
| - Clinical trial data | |
| - Full-text access | |
| - Preprint tracking | |
| --- | |
| ## Implementation Priority | |
| | Task | Effort | Value | | |
| |------|--------|-------| | |
| | Basic search | Low | High | | |
| | Citation network | Medium | Very High | | |
| | Concept filtering | Low | High | | |
| | Related works | Low | High | | |
| | Author tracking | Medium | Medium | | |
| --- | |
| ## Sources | |
| - [OpenAlex Documentation](https://docs.openalex.org) | |
| - [OpenAlex API Overview](https://docs.openalex.org/api) | |
| - [OpenAlex Wikipedia](https://en.wikipedia.org/wiki/OpenAlex) | |
| - [Leiden University Announcement](https://www.leidenranking.com/information/openalex) | |
| - [OpenAlex: A fully-open index (Paper)](https://arxiv.org/abs/2205.01833) | |