# SPEC 03: OpenAlex Integration
## Priority: P1 (Feature Enhancement)
## Problem Statement
We currently search 3 sources (PubMed, Europe PMC, ClinicalTrials.gov) but lack **citation metrics**. We cannot distinguish a highly-cited landmark paper from an obscure one. OpenAlex provides:
1. **Citation counts** - Prioritize authoritative papers
2. **Citation networks** - "Who cites whom"
3. **Concept tagging** - Hierarchical categorization
4. **Open access links** - Direct PDF URLs
**FREE API. No key required. 209M+ works indexed.**
> **Note:** This spec supersedes `docs/future-roadmap/phases/15_PHASE_OPENALEX.md`.
## Groundwork Already Done
```python
# src/utils/models.py:9
SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex", "web"]

# src/utils/models.py:39-42
metadata: dict[str, Any] = Field(
    default_factory=dict,
    description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
)
```
The infrastructure is ready. We just need to build the tool.
## OpenAlex API Reference
### Endpoint
```
GET https://api.openalex.org/works
```
### Key Parameters
| Parameter | Description |
|-----------|-------------|
| `search` | Full-text search across titles, abstracts, and full text |
| `filter` | Constrain results (e.g., `type:article`, `has_abstract:true`) |
| `sort` | Order results (e.g., `cited_by_count:desc`) |
| `per_page` | Results per page (max 200) |
| `mailto` | Email for polite pool (higher rate limits) |
### Example Request
```bash
curl "https://api.openalex.org/works?search=metformin%20cancer&filter=type:article,has_abstract:true&sort=cited_by_count:desc&per_page=10&mailto=deepboner-research@proton.me"
```
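The same request can be assembled in Python. The following is a minimal sketch using only the standard library; the helper name `build_openalex_url` is illustrative and does not exist in the codebase.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"


def build_openalex_url(query: str, per_page: int = 10, mailto: str = "") -> str:
    """Build a citation-sorted /works search URL (hypothetical helper)."""
    params = {
        "search": query,
        # Multiple filters are comma-separated and combined with AND.
        "filter": "type:article,has_abstract:true",
        "sort": "cited_by_count:desc",
        "per_page": per_page,
    }
    if mailto:
        params["mailto"] = mailto  # opt in to the polite pool
    return f"{BASE_URL}?{urlencode(params)}"


url = build_openalex_url("metformin cancer", mailto="deepboner-research@proton.me")
print(url)
```

Note that `urlencode` percent-encodes `:` and `,` in the `filter` and `sort` values; they decode back to the same strings on the server side.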
### Response Structure
```json
{
  "results": [
    {
      "id": "https://openalex.org/W2741809807",
      "doi": "https://doi.org/10.1234/example",
      "display_name": "Paper Title",
      "publication_year": 2024,
      "cited_by_count": 150,
      "abstract_inverted_index": {
        "word1": [0],
        "word2": [1, 5]
      },
      "concepts": [
        {"display_name": "Metformin", "score": 0.95, "level": 2}
      ],
      "authorships": [
        {"author": {"display_name": "John Smith"}}
      ],
      "open_access": {
        "is_oa": true,
        "oa_url": "https://example.com/pdf"
      },
      "best_oa_location": {
        "pdf_url": "https://example.com/paper.pdf"
      }
    }
  ]
}
```
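`abstract_inverted_index` maps each word to every position at which it occurs, so a repeated word (like `word2` above) carries several positions. The sketch below inverts that mapping back into text. It is a standalone illustration with a hypothetical function name; unlike the `_reconstruct_abstract` method later in this spec, it sorts the positions it finds instead of iterating over a range.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Turn {"word": [positions]} back into running text (illustrative helper)."""
    # Flatten to (position, word) pairs, then order them by position.
    pairs = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
    return " ".join(word for _, word in sorted(pairs))


index = {"the": [0, 3], "drug": [1], "inhibits": [2], "tumor": [4]}
print(rebuild_abstract(index))  # the drug inhibits the tumor
```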
## Architecture
### Class Diagram
```
┌─────────────────────────────────────────────────┐
│ SearchTool (Protocol)                           │
│ ─────────────────────────────────               │
│ + name: str                                     │
│ + search(query, max_results) → list[Evidence]   │
└────────────────────────┬────────────────────────┘
                         │ implements
┌────────────────────────▼────────────────────────┐
│ OpenAlexTool                                    │
│ ─────────────────────────────────               │
│ - BASE_URL: str                                 │
│ - POLITE_EMAIL: str                             │
│ ─────────────────────────────────               │
│ + name → "openalex"                             │
│ + search(query, max_results) → list[Evidence]   │
│ - _reconstruct_abstract(inverted_index) → str   │
│ - _to_evidence(work) → Evidence                 │
│ - _extract_authors(authorships) → list[str]     │
│ - _extract_concepts(concepts) → list[str]       │
└─────────────────────────────────────────────────┘
```
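Because `SearchTool` is a Protocol, `OpenAlexTool` conforms structurally and does not need to inherit from it. The real protocol lives in `src/tools/base.py` and is not reproduced in this spec, so the sketch below is an assumed shape based on the diagram; the simplified `Evidence` dataclass, the `FakeTool` class, and the use of `runtime_checkable` are all illustrative assumptions.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Evidence:
    """Stand-in for src.utils.models.Evidence (simplified for this sketch)."""

    content: str


@runtime_checkable
class SearchTool(Protocol):
    """Assumed shape of the protocol; the real one lives in src/tools/base.py."""

    @property
    def name(self) -> str: ...

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]: ...


class FakeTool:
    """Satisfies SearchTool structurally, without inheriting from it."""

    @property
    def name(self) -> str:
        return "fake"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        return [Evidence(content=f"result for {query}")][:max_results]


tool = FakeTool()
print(isinstance(tool, SearchTool))  # True: both members are present
print(asyncio.run(tool.search("metformin")))
```

The runtime `isinstance` check only confirms that the members exist; signature conformance is left to a static type checker.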
## TDD Implementation Plan
### Red Phase: Write Failing Tests First
**File: `tests/unit/tools/test_openalex.py`**
```python
"""Unit tests for OpenAlex tool - TDD RED phase."""

from unittest.mock import AsyncMock, MagicMock

import pytest

from src.tools.openalex import OpenAlexTool
from src.utils.models import Evidence

# Sample OpenAlex response
SAMPLE_OPENALEX_RESPONSE = {
    "results": [
        {
            "id": "https://openalex.org/W12345",
            "doi": "https://doi.org/10.1234/test",
            "display_name": "Metformin in Cancer Treatment",
            "publication_year": 2024,
            "cited_by_count": 150,
            "abstract_inverted_index": {
                "Metformin": [0],
                "shows": [1],
                "promise": [2],
                "in": [3],
                "cancer": [4],
                "treatment": [5],
            },
            "concepts": [
                {"display_name": "Metformin", "score": 0.95, "level": 2},
                {"display_name": "Cancer", "score": 0.88, "level": 1},
            ],
            "authorships": [
                {"author": {"display_name": "John Smith"}},
                {"author": {"display_name": "Jane Doe"}},
            ],
            "open_access": {"is_oa": True, "oa_url": "https://example.com/oa"},
            "best_oa_location": {"pdf_url": "https://example.com/paper.pdf"},
        }
    ]
}


@pytest.mark.unit
class TestOpenAlexTool:
    """Tests for OpenAlexTool."""

    @pytest.fixture
    def tool(self) -> OpenAlexTool:
        return OpenAlexTool()

    @pytest.fixture
    def mock_client(self, mocker):
        """Create a standardized mock client with context manager support."""
        client = AsyncMock()
        client.__aenter__.return_value = client
        client.__aexit__.return_value = None

        # Standard response mock
        resp = MagicMock()
        resp.json.return_value = SAMPLE_OPENALEX_RESPONSE
        resp.raise_for_status.return_value = None
        client.get.return_value = resp

        mocker.patch("httpx.AsyncClient", return_value=client)
        return client

    def test_tool_name(self, tool: OpenAlexTool) -> None:
        """Tool name should be 'openalex'."""
        assert tool.name == "openalex"

    @pytest.mark.asyncio
    async def test_search_returns_evidence(self, tool: OpenAlexTool, mock_client) -> None:
        """Search should return Evidence objects."""
        results = await tool.search("metformin cancer", max_results=5)
        assert len(results) == 1
        assert isinstance(results[0], Evidence)
        assert results[0].citation.source == "openalex"

    @pytest.mark.asyncio
    async def test_search_includes_citation_count(self, tool: OpenAlexTool, mock_client) -> None:
        """Evidence metadata should include cited_by_count."""
        results = await tool.search("metformin cancer", max_results=5)
        assert results[0].metadata["cited_by_count"] == 150

    @pytest.mark.asyncio
    async def test_search_calculates_relevance(self, tool: OpenAlexTool, mock_client) -> None:
        """Evidence relevance should be based on citations (capped at 1.0)."""
        results = await tool.search("metformin cancer", max_results=5)
        # 150 citations / 100 = 1.5 -> capped at 1.0
        assert results[0].relevance == 1.0

    @pytest.mark.asyncio
    async def test_search_includes_concepts(self, tool: OpenAlexTool, mock_client) -> None:
        """Evidence metadata should include concepts."""
        results = await tool.search("metformin cancer", max_results=5)
        assert "Metformin" in results[0].metadata["concepts"]
        assert "Cancer" in results[0].metadata["concepts"]

    @pytest.mark.asyncio
    async def test_search_includes_open_access_info(self, tool: OpenAlexTool, mock_client) -> None:
        """Evidence metadata should include open access info."""
        results = await tool.search("metformin cancer", max_results=5)
        assert results[0].metadata["is_open_access"] is True
        assert results[0].metadata["pdf_url"] == "https://example.com/paper.pdf"

    def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None:
        """Abstract reconstruction from inverted index."""
        inverted_index = {
            "Hello": [0],
            "world": [1],
            "this": [2],
            "is": [3],
            "a": [4],
            "test": [5],
        }
        result = tool._reconstruct_abstract(inverted_index)
        assert result == "Hello world this is a test"

    def test_reconstruct_abstract_empty(self, tool: OpenAlexTool) -> None:
        """Handle None or empty inverted index."""
        assert tool._reconstruct_abstract(None) == ""
        assert tool._reconstruct_abstract({}) == ""

    @pytest.mark.asyncio
    async def test_search_empty_results(self, tool: OpenAlexTool, mock_client) -> None:
        """Handle empty results gracefully."""
        mock_client.get.return_value.json.return_value = {"results": []}
        results = await tool.search("xyznonexistent123", max_results=5)
        assert results == []

    @pytest.mark.asyncio
    async def test_search_params(self, tool: OpenAlexTool, mock_client) -> None:
        """Verify API call requests citation-sorted results and uses polite pool."""
        mock_client.get.return_value.json.return_value = {"results": []}
        await tool.search("test query", max_results=5)

        # Verify call params
        call_args = mock_client.get.call_args
        params = call_args[1]["params"]
        assert params["sort"] == "cited_by_count:desc"
        assert params["mailto"] == tool.POLITE_EMAIL
        assert "type:article" in params["filter"]
        assert "has_abstract:true" in params["filter"]
```
### Green Phase: Implement to Pass Tests
**File: `src/tools/openalex.py`**
```python
"""OpenAlex search tool - citation-aware scholarly search."""

from typing import Any

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class OpenAlexTool:
    """
    Search OpenAlex for scholarly works with citation metrics.

    OpenAlex indexes 209M+ works and provides:
    - Citation counts (prioritize influential papers)
    - Concept tagging (hierarchical classification)
    - Open access links (direct PDF URLs)
    - Related works (ML-powered similarity)

    API Docs: https://docs.openalex.org
    Rate Limits: Polite pool with mailto = 100k/day
    """

    BASE_URL = "https://api.openalex.org/works"
    POLITE_EMAIL = "deepboner-research@proton.me"

    @property
    def name(self) -> str:
        return "openalex"

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,
    )
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search OpenAlex, sorted by citation count.

        Args:
            query: Search terms
            max_results: Maximum results to return

        Returns:
            List of Evidence objects with citation metadata
        """
        params: dict[str, str | int] = {
            "search": query,
            "filter": "type:article,has_abstract:true",  # Only articles with abstracts
            "sort": "cited_by_count:desc",  # Most cited first
            "per_page": min(max_results, 100),
            "mailto": self.POLITE_EMAIL,
        }

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(self.BASE_URL, params=params)
                response.raise_for_status()

                data = response.json()
                works = data.get("results", [])

                return [self._to_evidence(work) for work in works[:max_results]]

            except httpx.HTTPStatusError as e:
                raise SearchError(f"OpenAlex API error: {e}") from e
            except httpx.RequestError as e:
                raise SearchError(f"OpenAlex connection failed: {e}") from e

    def _to_evidence(self, work: dict[str, Any]) -> Evidence:
        """Convert OpenAlex work to Evidence with rich metadata."""
        # Extract basic fields. `or` also covers keys that are present but
        # null, which a .get() default alone does not.
        title = work.get("display_name") or "Untitled"
        doi = work.get("doi") or ""
        year = work.get("publication_year") or "Unknown"
        cited_by_count = work.get("cited_by_count") or 0

        # Reconstruct abstract from inverted index
        abstract = self._reconstruct_abstract(work.get("abstract_inverted_index"))
        if not abstract:
            # Should be caught by filter=has_abstract:true, but defensive coding
            abstract = f"[No abstract available. Cited by {cited_by_count} works.]"

        # Extract authors (limit to 5)
        authors = self._extract_authors(work.get("authorships") or [])

        # Extract concepts (top 5 by score)
        concepts = self._extract_concepts(work.get("concepts") or [])

        # Open access info
        oa_info = work.get("open_access") or {}
        is_oa = oa_info.get("is_oa", False)

        # Get PDF URL (prefer best_oa_location)
        best_oa = work.get("best_oa_location") or {}
        pdf_url = best_oa.get("pdf_url")

        # Build URL
        if doi:
            url = doi if doi.startswith("http") else f"https://doi.org/{doi}"
        else:
            openalex_id = work.get("id", "")
            url = openalex_id if openalex_id else "https://openalex.org"

        # Prepend citation badge to content
        citation_badge = f"[Cited by {cited_by_count}] " if cited_by_count > 0 else ""
        content = f"{citation_badge}{abstract[:1900]}"

        # Calculate relevance: normalized citation count (capped at 1.0 for 100 citations)
        # 100 citations is a very strong signal in most fields.
        relevance = min(1.0, cited_by_count / 100.0)

        return Evidence(
            content=content[:2000],
            citation=Citation(
                source="openalex",
                title=title[:500],
                url=url,
                date=str(year),
                authors=authors,
            ),
            relevance=relevance,
            metadata={
                "cited_by_count": cited_by_count,
                "concepts": concepts,
                "is_open_access": is_oa,
                "pdf_url": pdf_url,
            },
        )

    def _reconstruct_abstract(self, inverted_index: dict[str, list[int]] | None) -> str:
        """Rebuild abstract from {"word": [positions]} format."""
        if not inverted_index:
            return ""

        position_word: dict[int, str] = {}
        for word, positions in inverted_index.items():
            for pos in positions:
                position_word[pos] = word

        if not position_word:
            return ""

        max_pos = max(position_word.keys())
        return " ".join(position_word.get(i, "") for i in range(max_pos + 1))

    def _extract_authors(self, authorships: list[dict[str, Any]]) -> list[str]:
        """Extract author names from authorships array."""
        authors = []
        for authorship in authorships[:5]:
            author = authorship.get("author", {})
            name = author.get("display_name")
            if name:
                authors.append(name)
        return authors

    def _extract_concepts(self, concepts: list[dict[str, Any]]) -> list[str]:
        """Extract concept names, sorted by score."""
        sorted_concepts = sorted(concepts, key=lambda c: c.get("score", 0), reverse=True)
        return [c.get("display_name", "") for c in sorted_concepts[:5] if c.get("display_name")]
```
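The relevance score in `_to_evidence` is a linear ramp that saturates at 100 citations. The sketch below isolates that formula so its behavior is easy to inspect; the function name `citation_relevance` and the `saturation` parameter are illustrative and not part of the implementation above.

```python
def citation_relevance(cited_by_count: int, saturation: int = 100) -> float:
    """Linear relevance in [0.0, 1.0]; saturates at `saturation` citations."""
    return min(1.0, cited_by_count / saturation)


for count in (0, 25, 100, 150):
    print(count, citation_relevance(count))
```

One consequence of the cap: every work with 100 or more citations scores exactly 1.0, so ordering among highly cited papers depends on the API's `cited_by_count:desc` sort and not on `relevance`.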
### Refactor Phase: Clean Integration
**Update: `src/tools/__init__.py`**
```python
"""Search tools package."""

from src.tools.base import SearchTool
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.europepmc import EuropePMCTool
from src.tools.openalex import OpenAlexTool
from src.tools.pubmed import PubMedTool
from src.tools.search_handler import SearchHandler

__all__ = [
    "ClinicalTrialsTool",
    "EuropePMCTool",
    "OpenAlexTool",
    "PubMedTool",
    "SearchHandler",
    "SearchTool",
]
```
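Once exported, the tool is wired into the `SearchHandler` in `src/app.py`. The handler's interface is not shown in this spec, so the following is only a hypothetical sketch of how several tools might be queried concurrently while tolerating a failing source. `StubTool` and `search_all` are invented names for illustration, and plain strings stand in for `Evidence` objects.

```python
import asyncio


class StubTool:
    """Hypothetical stand-in for a real tool such as OpenAlexTool."""

    def __init__(self, name: str, results: list[str], fail: bool = False) -> None:
        self.name = name
        self._results = results
        self._fail = fail

    async def search(self, query: str, max_results: int = 10) -> list[str]:
        if self._fail:
            raise RuntimeError(f"{self.name} unavailable")
        return self._results[:max_results]


async def search_all(
    tools: list[StubTool], query: str, max_results: int = 10
) -> tuple[list[str], list[str]]:
    """Query every tool concurrently; collect results and per-tool errors."""
    outcomes = await asyncio.gather(
        *(tool.search(query, max_results) for tool in tools),
        return_exceptions=True,  # one failing source should not sink the rest
    )
    results: list[str] = []
    errors: list[str] = []
    for tool, outcome in zip(tools, outcomes):
        if isinstance(outcome, BaseException):
            errors.append(f"{tool.name}: {outcome}")
        else:
            results.extend(outcome)
    return results, errors


tools = [
    StubTool("pubmed", ["pubmed hit"]),
    StubTool("openalex", ["openalex hit 1", "openalex hit 2"]),
    StubTool("europepmc", [], fail=True),
]
results, errors = asyncio.run(search_all(tools, "metformin cancer"))
print(results)
print(errors)
```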
## Test Matrix
| Test | What It Validates | Priority |
|------|------------------|----------|
| `test_tool_name` | Returns "openalex" | P0 |
| `test_search_returns_evidence` | Returns `list[Evidence]` | P0 |
| `test_search_includes_citation_count` | `metadata["cited_by_count"]` populated | P0 |
| `test_search_calculates_relevance` | `relevance` derived from citations | P1 |
| `test_search_includes_concepts` | `metadata["concepts"]` populated | P0 |
| `test_search_includes_open_access_info` | `metadata["is_open_access"]` and `pdf_url` | P1 |
| `test_reconstruct_abstract` | Inverted index → text | P0 |
| `test_reconstruct_abstract_empty` | Handle None/empty inputs | P1 |
| `test_search_empty_results` | Return `[]` for no matches | P0 |
| `test_search_params` | API params (`sort`, `filter`, `mailto`) | P1 |
## Integration Test
```python
import pytest

from src.tools.openalex import OpenAlexTool


@pytest.mark.integration
class TestOpenAlexIntegration:
    """Integration tests with real OpenAlex API."""

    @pytest.mark.asyncio
    async def test_real_api_returns_results(self) -> None:
        """Test actual API returns relevant results."""
        tool = OpenAlexTool()
        results = await tool.search("metformin cancer treatment", max_results=3)

        assert len(results) > 0
        # Should have citation counts
        assert results[0].metadata["cited_by_count"] >= 0
        # Should have abstract text
        assert len(results[0].content) > 50
        # Should have concepts
        assert len(results[0].metadata["concepts"]) > 0
```
## Acceptance Criteria
- [x] `OpenAlexTool` implements `SearchTool` Protocol
- [x] Tool returns `list[Evidence]` with citation metadata
- [x] Abstract reconstructed from inverted index format
- [x] Relevance calculated from citation count (capped at 1.0)
- [x] Exported from `src/tools/__init__.py`
- [x] Integrated into `src/app.py` SearchHandler
- [x] UI description updated to mention OpenAlex
- [x] All unit tests pass (11 tests)
- [x] Integration test passes with real API
**Status: IMPLEMENTED** (commits fd28242, cb46aac)
## Files Modified
1. `src/tools/openalex.py` - NEW: OpenAlex tool implementation
2. `tests/unit/tools/test_openalex.py` - NEW: Unit and integration tests
3. `src/tools/__init__.py` - Export OpenAlexTool
4. `src/app.py` - Wire OpenAlexTool into SearchHandler