# Phase 15: OpenAlex Integration

**Priority**: HIGH - biggest bang for the buck
**Effort**: ~2-3 hours
**Dependencies**: None (existing codebase patterns sufficient)

---

## Prerequisites (COMPLETED)

The following model changes have been implemented to support this integration:

1. **`SourceName` Literal Updated** (`src/utils/models.py:9`)

   ```python
   SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
   ```

   - Without this, `source="openalex"` would fail Pydantic validation

2. **`Evidence.metadata` Field Added** (`src/utils/models.py:39-42`)

   ```python
   metadata: dict[str, Any] = Field(
       default_factory=dict,
       description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
   )
   ```

   - Required for storing `cited_by_count`, `concepts`, etc.
   - Model is still frozen - metadata must be passed at construction time

3. **`__init__.py` Exports Updated** (`src/tools/__init__.py`)
   - All tools are now exported: `ClinicalTrialsTool`, `EuropePMCTool`, `PubMedTool`
   - `OpenAlexTool` should be added here after implementation
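Because `Evidence` is frozen, `metadata` has to be supplied when the object is built. A minimal stand-in (a plain frozen dataclass, not the real Pydantic model in `src/utils/models.py`) illustrates the constraint:

```python
from dataclasses import dataclass, field

# Stand-in sketch only - the real Evidence is a frozen Pydantic model.
@dataclass(frozen=True)
class FrozenEvidence:
    content: str
    metadata: dict = field(default_factory=dict)

ev = FrozenEvidence(content="abstract text", metadata={"cited_by_count": 45})
assert ev.metadata["cited_by_count"] == 45

# Assigning after construction fails on a frozen model.
try:
    ev.metadata = {}
except AttributeError:  # dataclasses.FrozenInstanceError subclasses AttributeError
    print("frozen: metadata must be passed at construction time")
```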

---

## Overview

Add OpenAlex as a 4th data source for comprehensive scholarly data including:
- Citation networks (who cites whom)
- Concept tagging (hierarchical topic classification)
- Author disambiguation
- 209M+ works indexed

**Why OpenAlex?**
- Free, no API key required
- Already implemented in reference repo
- Provides citation data we don't have
- Aggregates PubMed + preprints + more

---
## TDD Implementation Plan

### Step 1: Write the Tests First

**File**: `tests/unit/tools/test_openalex.py`

```python
"""Tests for OpenAlex search tool."""

import pytest
import respx
from httpx import Response

from src.tools.openalex import OpenAlexTool
from src.utils.models import Evidence


class TestOpenAlexTool:
    """Test suite for OpenAlex search functionality."""

    @pytest.fixture
    def tool(self) -> OpenAlexTool:
        return OpenAlexTool()

    def test_name_property(self, tool: OpenAlexTool) -> None:
        """Tool should identify itself as 'openalex'."""
        assert tool.name == "openalex"

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_returns_evidence(self, tool: OpenAlexTool) -> None:
        """Search should return list of Evidence objects."""
        mock_response = {
            "results": [
                {
                    "id": "W2741809807",
                    "title": "Metformin and cancer: A systematic review",
                    "publication_year": 2023,
                    "cited_by_count": 45,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "Nature Medicine"},
                        "landing_page_url": "https://doi.org/10.1038/example",
                        "pdf_url": None,
                    },
                    "abstract_inverted_index": {
                        "Metformin": [0],
                        "shows": [1],
                        "anticancer": [2],
                        "effects": [3],
                    },
                    "concepts": [
                        {"display_name": "Medicine", "score": 0.95},
                        {"display_name": "Oncology", "score": 0.88},
                    ],
                    "authorships": [
                        {
                            "author": {"display_name": "John Smith"},
                            "institutions": [{"display_name": "Harvard"}],
                        }
                    ],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("metformin cancer", max_results=10)

        assert len(results) == 1
        assert isinstance(results[0], Evidence)
        assert "Metformin and cancer" in results[0].citation.title
        assert results[0].citation.source == "openalex"

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_empty_results(self, tool: OpenAlexTool) -> None:
        """Search with no results should return empty list."""
        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json={"results": []})
        )

        results = await tool.search("xyznonexistentquery123")
        assert results == []

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_handles_missing_abstract(self, tool: OpenAlexTool) -> None:
        """Tool should handle papers without abstracts."""
        mock_response = {
            "results": [
                {
                    "id": "W123",
                    "title": "Paper without abstract",
                    "publication_year": 2023,
                    "cited_by_count": 10,
                    "type": "article",
                    "is_oa": False,
                    "primary_location": {
                        "source": {"display_name": "Journal"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": None,
                    "concepts": [],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("test query")
        assert len(results) == 1
        assert results[0].content == ""  # No abstract

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_extracts_citation_count(self, tool: OpenAlexTool) -> None:
        """Citation count should be in metadata."""
        mock_response = {
            "results": [
                {
                    "id": "W456",
                    "title": "Highly cited paper",
                    "publication_year": 2020,
                    "cited_by_count": 500,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "Science"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": {"Test": [0]},
                    "concepts": [],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("highly cited")
        assert results[0].metadata["cited_by_count"] == 500

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_extracts_concepts(self, tool: OpenAlexTool) -> None:
        """Concepts should be extracted for semantic discovery."""
        mock_response = {
            "results": [
                {
                    "id": "W789",
                    "title": "Drug repurposing study",
                    "publication_year": 2023,
                    "cited_by_count": 25,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "PLOS ONE"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": {"Drug": [0], "repurposing": [1]},
                    "concepts": [
                        {"display_name": "Pharmacology", "score": 0.92},
                        {"display_name": "Drug Discovery", "score": 0.85},
                        {"display_name": "Medicine", "score": 0.80},
                    ],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("drug repurposing")
        assert "Pharmacology" in results[0].metadata["concepts"]
        assert "Drug Discovery" in results[0].metadata["concepts"]

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_api_error_raises_search_error(
        self, tool: OpenAlexTool
    ) -> None:
        """API errors should raise SearchError."""
        from src.utils.exceptions import SearchError

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(500, text="Internal Server Error")
        )

        with pytest.raises(SearchError):
            await tool.search("test query")

    def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None:
        """Test abstract reconstruction from inverted index."""
        inverted_index = {
            "Metformin": [0, 5],
            "is": [1],
            "a": [2],
            "diabetes": [3],
            "drug": [4],
            "effective": [6],
        }
        abstract = tool._reconstruct_abstract(inverted_index)
        assert abstract == "Metformin is a diabetes drug Metformin effective"
```

---

### Step 2: Create the Implementation

**File**: `src/tools/openalex.py`

```python
"""OpenAlex search tool for comprehensive scholarly data."""

from typing import Any

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class OpenAlexTool:
    """
    Search OpenAlex for scholarly works with rich metadata.

    OpenAlex provides:
    - 209M+ scholarly works
    - Citation counts and networks
    - Concept tagging (hierarchical)
    - Author disambiguation
    - Open access links

    API Docs: https://docs.openalex.org/
    """

    BASE_URL = "https://api.openalex.org/works"

    def __init__(self, email: str | None = None) -> None:
        """
        Initialize OpenAlex tool.

        Args:
            email: Optional email for polite pool (faster responses)
        """
        self.email = email or "deepcritical@example.com"

    @property
    def name(self) -> str:
        return "openalex"

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,
    )
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search OpenAlex for scholarly works.

        Args:
            query: Search terms
            max_results: Maximum results to return (max 200 per request)

        Returns:
            List of Evidence objects with citation metadata

        Raises:
            SearchError: If API request fails
        """
        params = {
            "search": query,
            "filter": "type:article",  # Only peer-reviewed articles
            "sort": "cited_by_count:desc",  # Most cited first
            "per-page": min(max_results, 200),  # Hyphenated param name per API docs
            "mailto": self.email,  # Polite pool for faster responses
        }

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(self.BASE_URL, params=params)
                response.raise_for_status()

                data = response.json()
                results = data.get("results", [])

                return [self._to_evidence(work) for work in results[:max_results]]

            except httpx.HTTPStatusError as e:
                raise SearchError(f"OpenAlex API error: {e}") from e
            except httpx.RequestError as e:
                raise SearchError(f"OpenAlex connection failed: {e}") from e

    def _to_evidence(self, work: dict[str, Any]) -> Evidence:
        """Convert OpenAlex work to Evidence object."""
        title = work.get("title", "Untitled")
        pub_year = work.get("publication_year", "Unknown")
        cited_by = work.get("cited_by_count", 0)
        # The live API nests the OA flag under "open_access"; fall back to a
        # top-level "is_oa" for payloads that carry it there.
        open_access = work.get("open_access") or {}
        is_oa = open_access.get("is_oa", work.get("is_oa", False))

        # Reconstruct abstract from inverted index
        abstract_index = work.get("abstract_inverted_index")
        abstract = self._reconstruct_abstract(abstract_index) if abstract_index else ""

        # Extract concepts (top 5)
        concepts = [
            c.get("display_name", "")
            for c in work.get("concepts", [])[:5]
            if c.get("display_name")
        ]

        # Extract authors (top 5)
        authorships = work.get("authorships", [])
        authors = [
            a.get("author", {}).get("display_name", "")
            for a in authorships[:5]
            if a.get("author", {}).get("display_name")
        ]

        # Get URL
        primary_loc = work.get("primary_location") or {}
        url = primary_loc.get("landing_page_url", "")
        if not url:
            # Fallback to OpenAlex page
            work_id = work.get("id", "").replace("https://openalex.org/", "")
            url = f"https://openalex.org/{work_id}"

        return Evidence(
            content=abstract[:2000],
            citation=Citation(
                source="openalex",
                title=title[:500],
                url=url,
                date=str(pub_year),
                authors=authors,
            ),
            relevance=min(0.9, 0.5 + (cited_by / 1000)),  # Boost by citations
            metadata={
                "cited_by_count": cited_by,
                "is_open_access": is_oa,
                "concepts": concepts,
                "pdf_url": primary_loc.get("pdf_url"),
            },
        )

    def _reconstruct_abstract(
        self, inverted_index: dict[str, list[int]]
    ) -> str:
        """
        Reconstruct abstract from OpenAlex inverted index format.

        OpenAlex stores abstracts as {"word": [position1, position2, ...]}.
        This rebuilds the original text.
        """
        if not inverted_index:
            return ""

        # Build position -> word mapping
        position_word: dict[int, str] = {}
        for word, positions in inverted_index.items():
            for pos in positions:
                position_word[pos] = word

        # Reconstruct in order
        if not position_word:
            return ""

        max_pos = max(position_word.keys())
        words = [position_word.get(i, "") for i in range(max_pos + 1)]
        return " ".join(w for w in words if w)
```
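The relevance heuristic in `_to_evidence` (a 0.5 baseline plus a citation boost, capped at 0.9) can be checked in isolation:

```python
def relevance(cited_by: int) -> float:
    # Mirrors min(0.9, 0.5 + cited_by / 1000) from _to_evidence.
    return min(0.9, 0.5 + cited_by / 1000)

assert relevance(0) == 0.5        # uncited work keeps the baseline
assert relevance(100) == 0.6      # +0.1 per 100 citations
assert relevance(500) == 0.9      # capped: the raw score would be 1.0
assert relevance(10_000) == 0.9   # highly cited papers cannot exceed the ceiling
```

The cap keeps citation count from completely dominating relevance ranking across sources.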

---

### Step 3: Register in Search Handler

**File**: `src/tools/search_handler.py` (add to imports and tool list)

```python
# Add import
from src.tools.openalex import OpenAlexTool

# Add to _create_tools method
def _create_tools(self) -> list[SearchTool]:
    return [
        PubMedTool(),
        ClinicalTrialsTool(),
        EuropePMCTool(),
        OpenAlexTool(),  # NEW
    ]
```
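For context, aggregation across the registered tools can be sketched as a concurrent fan-out. This is a hypothetical helper, not the real `SearchHandler` internals, but it shows the property worth preserving: one failing source should not sink the others.

```python
import asyncio
from typing import Any

async def gather_all(tools: list[Any], query: str, max_results: int = 10) -> list[Any]:
    # Run every tool's search concurrently; collect exceptions instead of raising.
    outcomes = await asyncio.gather(
        *(tool.search(query, max_results) for tool in tools),
        return_exceptions=True,
    )
    # Flatten successful result lists, skipping sources that errored.
    return [ev for out in outcomes if not isinstance(out, BaseException) for ev in out]
```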

---

### Step 4: Update `__init__.py`

**File**: `src/tools/__init__.py`

```python
from src.tools.openalex import OpenAlexTool

__all__ = [
    "PubMedTool",
    "ClinicalTrialsTool",
    "EuropePMCTool",
    "OpenAlexTool",  # NEW
    # ...
]
```

---

## Demo Script

**File**: `examples/openalex_demo.py`

```python
#!/usr/bin/env python3
"""Demo script to verify OpenAlex integration."""

import asyncio

from src.tools.openalex import OpenAlexTool


async def main() -> None:
    """Run OpenAlex search demo."""
    tool = OpenAlexTool()

    print("=" * 60)
    print("OpenAlex Integration Demo")
    print("=" * 60)

    # Test 1: Basic drug repurposing search
    print("\n[Test 1] Searching for 'metformin cancer drug repurposing'...")
    results = await tool.search("metformin cancer drug repurposing", max_results=5)

    for i, evidence in enumerate(results, 1):
        print(f"\n--- Result {i} ---")
        print(f"Title: {evidence.citation.title}")
        print(f"Year: {evidence.citation.date}")
        print(f"Citations: {evidence.metadata.get('cited_by_count', 'N/A')}")
        print(f"Concepts: {', '.join(evidence.metadata.get('concepts', []))}")
        print(f"Open Access: {evidence.metadata.get('is_open_access', False)}")
        print(f"URL: {evidence.citation.url}")
        if evidence.content:
            print(f"Abstract: {evidence.content[:200]}...")

    # Test 2: High-impact papers
    print("\n" + "=" * 60)
    print("[Test 2] Finding highly-cited papers on 'long COVID treatment'...")
    results = await tool.search("long COVID treatment", max_results=3)

    for evidence in results:
        print(f"\n- {evidence.citation.title}")
        print(f"  Citations: {evidence.metadata.get('cited_by_count', 0)}")

    print("\n" + "=" * 60)
    print("Demo complete!")


if __name__ == "__main__":
    asyncio.run(main())
```

---

## Verification Checklist

### Unit Tests

```bash
# Run just the OpenAlex tests
uv run pytest tests/unit/tools/test_openalex.py -v

# Expected: All tests pass
```

### Integration Test (Manual)

```bash
# Run demo script with real API
uv run python examples/openalex_demo.py

# Expected: Real results from OpenAlex API
```

### Full Test Suite

```bash
# Ensure nothing broke
make check

# Expected: All 110+ tests pass, mypy clean
```

---

## Success Criteria

1. **Unit tests pass**: All mocked tests in `test_openalex.py` pass
2. **Integration works**: Demo script returns real results
3. **No regressions**: `make check` passes completely
4. **SearchHandler integration**: OpenAlex appears in search results alongside other sources
5. **Citation metadata**: Results include `cited_by_count`, `concepts`, `is_open_access`

---

## Future Enhancements (P2)

Once basic integration works:

1. **Citation Network Queries**

   ```python
   # Get papers citing a specific work
   async def get_citing_works(self, work_id: str) -> list[Evidence]:
       params = {"filter": f"cites:{work_id}"}
       ...
   ```

2. **Concept-Based Search**

   ```python
   # Search by OpenAlex concept ID
   async def search_by_concept(self, concept_id: str) -> list[Evidence]:
       params = {"filter": f"concepts.id:{concept_id}"}
       ...
   ```

3. **Author Tracking**

   ```python
   # Find all works by an author
   async def search_by_author(self, author_id: str) -> list[Evidence]:
       params = {"filter": f"authorships.author.id:{author_id}"}
       ...
   ```

---

## Notes

- OpenAlex is **very generous** with rate limits (the docs cite roughly 100,000 calls/day and 10 requests/second)
- Adding the `mailto` parameter gives priority access (polite pool)
- Abstracts are stored as an inverted index and must be reconstructed
- Citation count is a good proxy for paper quality/impact
- Consider caching responses for repeated queries
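
The caching note can be sketched as a small in-memory wrapper around any tool with a `search` coroutine. This `CachedSearch` class is a hypothetical helper (not part of the plan above), keyed on the query and result count, with no TTL or eviction:

```python
import asyncio
from typing import Any

class CachedSearch:
    """In-memory cache around an async search tool (sketch; no TTL/eviction)."""

    def __init__(self, tool: Any) -> None:
        self._tool = tool
        self._cache: dict[tuple[str, int], list[Any]] = {}

    async def search(self, query: str, max_results: int = 10) -> list[Any]:
        key = (query, max_results)
        if key not in self._cache:  # only hit the API on a cache miss
            self._cache[key] = await self._tool.search(query, max_results)
        return self._cache[key]
```

A production version would likely want an expiry policy, since citation counts drift over time.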