# SPEC 03: OpenAlex Integration ## Priority: P1 (Feature Enhancement) ## Problem Statement We currently search 3 sources (PubMed, Europe PMC, ClinicalTrials.gov) but lack **citation metrics**. We cannot distinguish a highly-cited landmark paper from an obscure one. OpenAlex provides: 1. **Citation counts** - Prioritize authoritative papers 2. **Citation networks** - "Who cites whom" 3. **Concept tagging** - Hierarchical categorization 4. **Open access links** - Direct PDF URLs **FREE API. No key required. 209M+ works indexed.** > **Note:** This spec supersedes `docs/future-roadmap/phases/15_PHASE_OPENALEX.md`. ## Groundwork Already Done ```python # src/utils/models.py:9 SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex", "web"] # src/utils/models.py:39-42 metadata: dict[str, Any] = Field( default_factory=dict, description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)", ) ``` The infrastructure is ready. We just need to build the tool. ## OpenAlex API Reference ### Endpoint ``` GET https://api.openalex.org/works ``` ### Key Parameters | Parameter | Description | |-----------|-------------| | `search` | Full-text search across title, abstract, fulltext | | `filter` | Constrain results (e.g., `type:article`, `has_abstract:true`) | | `sort` | Order results (e.g., `cited_by_count:desc`) | | `per_page` | Results per page (max 200) | | `mailto` | Email for polite pool (higher rate limits) | ### Example Request ```bash GET https://api.openalex.org/works?search=metformin%20cancer&filter=type:article,has_abstract:true&sort=cited_by_count:desc&per_page=10&mailto=deepboner-research@proton.me ``` ### Response Structure ```json { "results": [ { "id": "https://openalex.org/W2741809807", "doi": "https://doi.org/10.1234/example", "display_name": "Paper Title", "publication_year": 2024, "cited_by_count": 150, "abstract_inverted_index": { "word1": [0], "word2": [1, 5] }, "concepts": [ {"display_name": "Metformin", "score": 0.95, "level": 2} ], "authorships": [ {"author": {"display_name": "John Smith"}} ], "open_access": { "is_oa": true, "oa_url": "https://example.com/pdf" }, "best_oa_location": { "pdf_url": "https://example.com/paper.pdf" } } ] } ``` ## Architecture ### Class Diagram ``` ┌─────────────────────────────────────┐ │ SearchTool (Protocol) │ │ ───────────────────────────────── │ │ + name: str │ │ + search(query, max_results) → list[Evidence] │ └──────────────────┬──────────────────┘ │ implements ┌──────────────────▼──────────────────┐ │ OpenAlexTool │ │ ───────────────────────────────── │ │ - BASE_URL: str │ │ - POLITE_EMAIL: str │ │ ───────────────────────────────── │ │ + name → "openalex" │ │ + search(query, max_results) → list[Evidence] │ │ - _reconstruct_abstract(inverted_index) → str │ │ - _to_evidence(work) → Evidence │ │ - _extract_authors(authorships) → list[str] │ │ - _extract_concepts(concepts) → list[str] │ └─────────────────────────────────────┘ ``` ## TDD Implementation Plan ### Red Phase: Write Failing Tests First **File: `tests/unit/tools/test_openalex.py`** ```python """Unit tests for OpenAlex tool - TDD RED phase.""" from unittest.mock import AsyncMock, MagicMock import pytest from src.tools.openalex import OpenAlexTool from src.utils.models import Evidence # Sample OpenAlex response SAMPLE_OPENALEX_RESPONSE = { "results": [ { "id": "https://openalex.org/W12345", "doi": "https://doi.org/10.1234/test", "display_name": "Metformin in Cancer Treatment", "publication_year": 2024, "cited_by_count": 150, "abstract_inverted_index": { "Metformin": [0], "shows": [1], "promise": [2], "in": [3], "cancer": [4], "treatment": [5], }, "concepts": [ {"display_name": "Metformin", "score": 0.95, "level": 2}, {"display_name": "Cancer", "score": 0.88, "level": 1}, ], "authorships": [ {"author": {"display_name": "John Smith"}}, {"author": {"display_name": "Jane Doe"}}, ], "open_access": {"is_oa": True, "oa_url": "https://example.com/oa"}, "best_oa_location": {"pdf_url": "https://example.com/paper.pdf"}, } ] } @pytest.mark.unit class TestOpenAlexTool: """Tests for OpenAlexTool.""" @pytest.fixture def tool(self) -> OpenAlexTool: return OpenAlexTool() @pytest.fixture def mock_client(self, mocker): """Create a standardized mock client with context manager support.""" client = AsyncMock() client.__aenter__.return_value = client client.__aexit__.return_value = None # Standard response mock resp = MagicMock() resp.json.return_value = SAMPLE_OPENALEX_RESPONSE resp.raise_for_status.return_value = None client.get.return_value = resp mocker.patch("httpx.AsyncClient", return_value=client) return client def test_tool_name(self, tool: OpenAlexTool) -> None: """Tool name should be 'openalex'.""" assert tool.name == "openalex" @pytest.mark.asyncio async def test_search_returns_evidence(self, tool: OpenAlexTool, mock_client) -> None: """Search should return Evidence objects.""" results = await tool.search("metformin cancer", max_results=5) assert len(results) == 1 assert isinstance(results[0], Evidence) assert results[0].citation.source == "openalex" @pytest.mark.asyncio async def test_search_includes_citation_count(self, tool: OpenAlexTool, mock_client) -> None: """Evidence metadata should include cited_by_count.""" results = await tool.search("metformin cancer", max_results=5) assert results[0].metadata["cited_by_count"] == 150 @pytest.mark.asyncio async def test_search_calculates_relevance(self, tool: OpenAlexTool, mock_client) -> None: """Evidence relevance should be based on citations (capped at 1.0).""" results = await tool.search("metformin cancer", max_results=5) # 150 citations / 100 = 1.5 -> capped at 1.0 assert results[0].relevance == 1.0 @pytest.mark.asyncio async def test_search_includes_concepts(self, tool: OpenAlexTool, mock_client) -> None: """Evidence metadata should include concepts.""" results = await tool.search("metformin cancer", max_results=5) assert "Metformin" in results[0].metadata["concepts"] assert "Cancer" in results[0].metadata["concepts"] @pytest.mark.asyncio async def test_search_includes_open_access_info(self, tool: OpenAlexTool, mock_client) -> None: """Evidence metadata should include open access info.""" results = await tool.search("metformin cancer", max_results=5) assert results[0].metadata["is_open_access"] is True assert results[0].metadata["pdf_url"] == "https://example.com/paper.pdf" def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None: """Abstract reconstruction from inverted index.""" inverted_index = { "Hello": [0], "world": [1], "this": [2], "is": [3], "a": [4], "test": [5], } result = tool._reconstruct_abstract(inverted_index) assert result == "Hello world this is a test" def test_reconstruct_abstract_empty(self, tool: OpenAlexTool) -> None: """Handle None or empty inverted index.""" assert tool._reconstruct_abstract(None) == "" assert tool._reconstruct_abstract({}) == "" @pytest.mark.asyncio async def test_search_empty_results(self, tool: OpenAlexTool, mock_client) -> None: """Handle empty results gracefully.""" mock_client.get.return_value.json.return_value = {"results": []} results = await tool.search("xyznonexistent123", max_results=5) assert results == [] @pytest.mark.asyncio async def test_search_params(self, tool: OpenAlexTool, mock_client) -> None: """Verify API call requests citation-sorted results and uses polite pool.""" mock_client.get.return_value.json.return_value = {"results": []} await tool.search("test query", max_results=5) # Verify call params call_args = mock_client.get.call_args params = call_args[1]["params"] assert params["sort"] == "cited_by_count:desc" assert params["mailto"] == tool.POLITE_EMAIL assert "type:article" in params["filter"] assert "has_abstract:true" in params["filter"] ``` ### Green Phase: Implement to Pass Tests **File: `src/tools/openalex.py`** ```python """OpenAlex search tool - citation-aware scholarly search.""" from typing import Any import httpx from tenacity import retry, stop_after_attempt, wait_exponential from src.utils.exceptions import SearchError from src.utils.models import Citation, Evidence class OpenAlexTool: """ Search OpenAlex for scholarly works with citation metrics. OpenAlex indexes 209M+ works and provides: - Citation counts (prioritize influential papers) - Concept tagging (hierarchical classification) - Open access links (direct PDF URLs) - Related works (ML-powered similarity) API Docs: https://docs.openalex.org Rate Limits: Polite pool with mailto = 100k/day """ BASE_URL = "https://api.openalex.org/works" POLITE_EMAIL = "deepboner-research@proton.me" @property def name(self) -> str: return "openalex" @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10), reraise=True, ) async def search(self, query: str, max_results: int = 10) -> list[Evidence]: """ Search OpenAlex, sorted by citation count. Args: query: Search terms max_results: Maximum results to return Returns: List of Evidence objects with citation metadata """ params: dict[str, str | int] = { "search": query, "filter": "type:article,has_abstract:true", # Only articles with abstracts "sort": "cited_by_count:desc", # Most cited first "per_page": min(max_results, 100), "mailto": self.POLITE_EMAIL, } async with httpx.AsyncClient(timeout=30.0) as client: try: response = await client.get(self.BASE_URL, params=params) response.raise_for_status() data = response.json() works = data.get("results", []) return [self._to_evidence(work) for work in works[:max_results]] except httpx.HTTPStatusError as e: raise SearchError(f"OpenAlex API error: {e}") from e except httpx.RequestError as e: raise SearchError(f"OpenAlex connection failed: {e}") from e def _to_evidence(self, work: dict[str, Any]) -> Evidence: """Convert OpenAlex work to Evidence with rich metadata.""" # Extract basic fields title = work.get("display_name", "Untitled") doi = work.get("doi", "") year = work.get("publication_year", "Unknown") cited_by_count = work.get("cited_by_count", 0) # Reconstruct abstract from inverted index abstract = self._reconstruct_abstract(work.get("abstract_inverted_index")) if not abstract: # Should be caught by filter=has_abstract:true, but defensive coding abstract = f"[No abstract available. Cited by {cited_by_count} works.]" # Extract authors (limit to 5) authors = self._extract_authors(work.get("authorships", [])) # Extract concepts (top 5 by score) concepts = self._extract_concepts(work.get("concepts", [])) # Open access info oa_info = work.get("open_access", {}) is_oa = oa_info.get("is_oa", False) # Get PDF URL (prefer best_oa_location) best_oa = work.get("best_oa_location", {}) pdf_url = best_oa.get("pdf_url") if best_oa else None # Build URL if doi: url = doi if doi.startswith("http") else f"https://doi.org/{doi}" else: openalex_id = work.get("id", "") url = openalex_id if openalex_id else "https://openalex.org" # Prepend citation badge to content citation_badge = f"[Cited by {cited_by_count}] " if cited_by_count > 0 else "" content = f"{citation_badge}{abstract[:1900]}" # Calculate relevance: normalized citation count (capped at 1.0 for 100 citations) # 100 citations is a very strong signal in most fields. relevance = min(1.0, cited_by_count / 100.0) return Evidence( content=content[:2000], citation=Citation( source="openalex", title=title[:500], url=url, date=str(year), authors=authors, ), relevance=relevance, metadata={ "cited_by_count": cited_by_count, "concepts": concepts, "is_open_access": is_oa, "pdf_url": pdf_url, }, ) def _reconstruct_abstract(self, inverted_index: dict[str, list[int]] | None) -> str: """Rebuild abstract from {"word": [positions]} format.""" if not inverted_index: return "" position_word: dict[int, str] = {} for word, positions in inverted_index.items(): for pos in positions: position_word[pos] = word if not position_word: return "" max_pos = max(position_word.keys()) return " ".join(position_word.get(i, "") for i in range(max_pos + 1)) def _extract_authors(self, authorships: list[dict[str, Any]]) -> list[str]: """Extract author names from authorships array.""" authors = [] for authorship in authorships[:5]: author = authorship.get("author", {}) name = author.get("display_name") if name: authors.append(name) return authors def _extract_concepts(self, concepts: list[dict[str, Any]]) -> list[str]: """Extract concept names, sorted by score.""" sorted_concepts = sorted(concepts, key=lambda c: c.get("score", 0), reverse=True) return [c.get("display_name", "") for c in sorted_concepts[:5] if c.get("display_name")] ``` ### Refactor Phase: Clean Integration **Update: `src/tools/__init__.py`** ```python """Search tools package.""" from src.tools.base import SearchTool from src.tools.clinicaltrials import ClinicalTrialsTool from src.tools.europepmc import EuropePMCTool from src.tools.openalex import OpenAlexTool from src.tools.pubmed import PubMedTool from src.tools.search_handler import SearchHandler __all__ = [ "ClinicalTrialsTool", "EuropePMCTool", "OpenAlexTool", "PubMedTool", "SearchHandler", "SearchTool", ] ``` ## Test Matrix | Test | What It Validates | Priority | |------|------------------|----------| | `test_tool_name` | Returns "openalex" | P0 | | `test_search_returns_evidence` | Returns `list[Evidence]` | P0 | | `test_search_includes_citation_count` | `metadata["cited_by_count"]` populated | P0 | | `test_search_calculates_relevance` | `relevance` derived from citations | P1 | | `test_search_includes_concepts` | `metadata["concepts"]` populated | P0 | | `test_search_includes_open_access_info` | `metadata["is_open_access"]` and `pdf_url` | P1 | | `test_reconstruct_abstract` | Inverted index → text | P0 | | `test_reconstruct_abstract_empty` | Handle None/empty inputs | P1 | | `test_search_empty_results` | Return `[]` for no matches | P0 | | `test_search_params` | API params (`sort`, `filter`, `mailto`) | P1 | ## Integration Test ```python @pytest.mark.integration class TestOpenAlexIntegration: """Integration tests with real OpenAlex API.""" @pytest.mark.asyncio async def test_real_api_returns_results(self) -> None: """Test actual API returns relevant results.""" tool = OpenAlexTool() results = await tool.search("metformin cancer treatment", max_results=3) assert len(results) > 0 # Should have citation counts assert results[0].metadata["cited_by_count"] >= 0 # Should have abstract text assert len(results[0].content) > 50 # Should have concepts assert len(results[0].metadata["concepts"]) > 0 ``` ## Acceptance Criteria - [x] `OpenAlexTool` implements `SearchTool` Protocol - [x] Tool returns `list[Evidence]` with citation metadata - [x] Abstract reconstructed from inverted index format - [x] Relevance calculated from citation count (capped at 1.0) - [x] Exported from `src/tools/__init__.py` - [x] Integrated into `src/app.py` SearchHandler - [x] UI description updated to mention OpenAlex - [x] All unit tests pass (11 tests) - [x] Integration test passes with real API **Status: IMPLEMENTED** (commits fd28242, cb46aac) ## Files Modified 1. `src/tools/openalex.py` - NEW: OpenAlex tool implementation 2. `tests/unit/tools/test_openalex.py` - NEW: Unit and integration tests 3. `src/tools/__init__.py` - Export OpenAlexTool 4. `src/app.py` - Wire OpenAlexTool into SearchHandler