# Phase 15: OpenAlex Integration

**Priority**: HIGH - biggest bang for the buck
**Effort**: ~2-3 hours
**Dependencies**: None (existing codebase patterns sufficient)

---

## Prerequisites (COMPLETED)

The following model changes have been implemented to support this integration:

1. **`SourceName` Literal Updated** (`src/utils/models.py:9`)

   ```python
   SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
   ```

   - Without this, `source="openalex"` would fail Pydantic validation

2. **`Evidence.metadata` Field Added** (`src/utils/models.py:39-42`)

   ```python
   metadata: dict[str, Any] = Field(
       default_factory=dict,
       description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
   )
   ```

   - Required for storing `cited_by_count`, `concepts`, etc.
   - Model is still frozen - metadata must be passed at construction time

3. **`__init__.py` Exports Updated** (`src/tools/__init__.py`)
   - All tools are now exported: `ClinicalTrialsTool`, `EuropePMCTool`, `PubMedTool`
   - `OpenAlexTool` should be added here after implementation
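Because `Evidence` is frozen, `metadata` has to be supplied when the object is built. A minimal stand-in (a plain frozen dataclass, not the real Pydantic model in `src/utils/models.py`) illustrates the constraint:

```python
from dataclasses import dataclass, field

# Stand-in sketch only - the real Evidence is a frozen Pydantic model.
@dataclass(frozen=True)
class FrozenEvidence:
    content: str
    metadata: dict = field(default_factory=dict)

ev = FrozenEvidence(content="abstract text", metadata={"cited_by_count": 45})
assert ev.metadata["cited_by_count"] == 45

# Assigning after construction fails on a frozen model.
try:
    ev.metadata = {}
except AttributeError:  # dataclasses.FrozenInstanceError subclasses AttributeError
    print("frozen: metadata must be passed at construction time")
```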

---

## Overview

Add OpenAlex as a 4th data source for comprehensive scholarly data including:
- Citation networks (who cites whom)
- Concept tagging (hierarchical topic classification)
- Author disambiguation
- 209M+ works indexed

**Why OpenAlex?**
- Free, no API key required
- Already implemented in reference repo
- Provides citation data we don't have
- Aggregates PubMed + preprints + more

---
## TDD Implementation Plan

### Step 1: Write the Tests First

**File**: `tests/unit/tools/test_openalex.py`

```python
"""Tests for OpenAlex search tool."""

import pytest
import respx
from httpx import Response

from src.tools.openalex import OpenAlexTool
from src.utils.models import Evidence


class TestOpenAlexTool:
    """Test suite for OpenAlex search functionality."""

    @pytest.fixture
    def tool(self) -> OpenAlexTool:
        return OpenAlexTool()

    def test_name_property(self, tool: OpenAlexTool) -> None:
        """Tool should identify itself as 'openalex'."""
        assert tool.name == "openalex"

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_returns_evidence(self, tool: OpenAlexTool) -> None:
        """Search should return list of Evidence objects."""
        mock_response = {
            "results": [
                {
                    "id": "W2741809807",
                    "title": "Metformin and cancer: A systematic review",
                    "publication_year": 2023,
                    "cited_by_count": 45,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "Nature Medicine"},
                        "landing_page_url": "https://doi.org/10.1038/example",
                        "pdf_url": None,
                    },
                    "abstract_inverted_index": {
                        "Metformin": [0],
                        "shows": [1],
                        "anticancer": [2],
                        "effects": [3],
                    },
                    "concepts": [
                        {"display_name": "Medicine", "score": 0.95},
                        {"display_name": "Oncology", "score": 0.88},
                    ],
                    "authorships": [
                        {
                            "author": {"display_name": "John Smith"},
                            "institutions": [{"display_name": "Harvard"}],
                        }
                    ],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("metformin cancer", max_results=10)

        assert len(results) == 1
        assert isinstance(results[0], Evidence)
        assert "Metformin and cancer" in results[0].citation.title
        assert results[0].citation.source == "openalex"

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_empty_results(self, tool: OpenAlexTool) -> None:
        """Search with no results should return empty list."""
        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json={"results": []})
        )

        results = await tool.search("xyznonexistentquery123")
        assert results == []

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_handles_missing_abstract(self, tool: OpenAlexTool) -> None:
        """Tool should handle papers without abstracts."""
        mock_response = {
            "results": [
                {
                    "id": "W123",
                    "title": "Paper without abstract",
                    "publication_year": 2023,
                    "cited_by_count": 10,
                    "type": "article",
                    "is_oa": False,
                    "primary_location": {
                        "source": {"display_name": "Journal"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": None,
                    "concepts": [],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("test query")
        assert len(results) == 1
        assert results[0].content == ""  # No abstract

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_extracts_citation_count(self, tool: OpenAlexTool) -> None:
        """Citation count should be in metadata."""
        mock_response = {
            "results": [
                {
                    "id": "W456",
                    "title": "Highly cited paper",
                    "publication_year": 2020,
                    "cited_by_count": 500,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "Science"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": {"Test": [0]},
                    "concepts": [],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("highly cited")
        assert results[0].metadata["cited_by_count"] == 500

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_extracts_concepts(self, tool: OpenAlexTool) -> None:
        """Concepts should be extracted for semantic discovery."""
        mock_response = {
            "results": [
                {
                    "id": "W789",
                    "title": "Drug repurposing study",
                    "publication_year": 2023,
                    "cited_by_count": 25,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "PLOS ONE"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": {"Drug": [0], "repurposing": [1]},
                    "concepts": [
                        {"display_name": "Pharmacology", "score": 0.92},
                        {"display_name": "Drug Discovery", "score": 0.85},
                        {"display_name": "Medicine", "score": 0.80},
                    ],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("drug repurposing")
        assert "Pharmacology" in results[0].metadata["concepts"]
        assert "Drug Discovery" in results[0].metadata["concepts"]

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_api_error_raises_search_error(
        self, tool: OpenAlexTool
    ) -> None:
        """API errors should raise SearchError."""
        from src.utils.exceptions import SearchError

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(500, text="Internal Server Error")
        )

        with pytest.raises(SearchError):
            await tool.search("test query")

    def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None:
        """Test abstract reconstruction from inverted index."""
        inverted_index = {
            "Metformin": [0, 5],
            "is": [1],
            "a": [2],
            "diabetes": [3],
            "drug": [4],
            "effective": [6],
        }
        abstract = tool._reconstruct_abstract(inverted_index)
        assert abstract == "Metformin is a diabetes drug Metformin effective"
```

---

### Step 2: Create the Implementation

**File**: `src/tools/openalex.py`

```python
"""OpenAlex search tool for comprehensive scholarly data."""

from typing import Any

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class OpenAlexTool:
    """
    Search OpenAlex for scholarly works with rich metadata.

    OpenAlex provides:
    - 209M+ scholarly works
    - Citation counts and networks
    - Concept tagging (hierarchical)
    - Author disambiguation
    - Open access links

    API Docs: https://docs.openalex.org/
    """

    BASE_URL = "https://api.openalex.org/works"

    def __init__(self, email: str | None = None) -> None:
        """
        Initialize OpenAlex tool.

        Args:
            email: Optional email for polite pool (faster responses)
        """
        self.email = email or "deepcritical@example.com"

    @property
    def name(self) -> str:
        return "openalex"

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,
    )
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search OpenAlex for scholarly works.

        Args:
            query: Search terms
            max_results: Maximum results to return (max 200 per request)

        Returns:
            List of Evidence objects with citation metadata

        Raises:
            SearchError: If API request fails
        """
        params = {
            "search": query,
            "filter": "type:article",  # Only peer-reviewed articles
            "sort": "cited_by_count:desc",  # Most cited first
            "per-page": min(max_results, 200),  # Hyphenated param name per API docs
            "mailto": self.email,  # Polite pool for faster responses
        }

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(self.BASE_URL, params=params)
                response.raise_for_status()

                data = response.json()
                results = data.get("results", [])

                return [self._to_evidence(work) for work in results[:max_results]]

            except httpx.HTTPStatusError as e:
                raise SearchError(f"OpenAlex API error: {e}") from e
            except httpx.RequestError as e:
                raise SearchError(f"OpenAlex connection failed: {e}") from e

    def _to_evidence(self, work: dict[str, Any]) -> Evidence:
        """Convert OpenAlex work to Evidence object."""
        title = work.get("title", "Untitled")
        pub_year = work.get("publication_year", "Unknown")
        cited_by = work.get("cited_by_count", 0)
        # The live API nests the OA flag under "open_access"; fall back to a
        # top-level "is_oa" for payloads that carry it there.
        open_access = work.get("open_access") or {}
        is_oa = open_access.get("is_oa", work.get("is_oa", False))

        # Reconstruct abstract from inverted index
        abstract_index = work.get("abstract_inverted_index")
        abstract = self._reconstruct_abstract(abstract_index) if abstract_index else ""

        # Extract concepts (top 5)
        concepts = [
            c.get("display_name", "")
            for c in work.get("concepts", [])[:5]
            if c.get("display_name")
        ]

        # Extract authors (top 5)
        authorships = work.get("authorships", [])
        authors = [
            a.get("author", {}).get("display_name", "")
            for a in authorships[:5]
            if a.get("author", {}).get("display_name")
        ]

        # Get URL
        primary_loc = work.get("primary_location") or {}
        url = primary_loc.get("landing_page_url", "")
        if not url:
            # Fallback to OpenAlex page
            work_id = work.get("id", "").replace("https://openalex.org/", "")
            url = f"https://openalex.org/{work_id}"

        return Evidence(
            content=abstract[:2000],
            citation=Citation(
                source="openalex",
                title=title[:500],
                url=url,
                date=str(pub_year),
                authors=authors,
            ),
            relevance=min(0.9, 0.5 + (cited_by / 1000)),  # Boost by citations
            metadata={
                "cited_by_count": cited_by,
                "is_open_access": is_oa,
                "concepts": concepts,
                "pdf_url": primary_loc.get("pdf_url"),
            },
        )

    def _reconstruct_abstract(
        self, inverted_index: dict[str, list[int]]
    ) -> str:
        """
        Reconstruct abstract from OpenAlex inverted index format.

        OpenAlex stores abstracts as {"word": [position1, position2, ...]}.
        This rebuilds the original text.
        """
        if not inverted_index:
            return ""

        # Build position -> word mapping
        position_word: dict[int, str] = {}
        for word, positions in inverted_index.items():
            for pos in positions:
                position_word[pos] = word

        # Reconstruct in order
        if not position_word:
            return ""

        max_pos = max(position_word.keys())
        words = [position_word.get(i, "") for i in range(max_pos + 1)]
        return " ".join(w for w in words if w)
```
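The relevance heuristic in `_to_evidence` (a 0.5 baseline plus a citation boost, capped at 0.9) can be checked in isolation:

```python
def relevance(cited_by: int) -> float:
    # Mirrors min(0.9, 0.5 + cited_by / 1000) from _to_evidence.
    return min(0.9, 0.5 + cited_by / 1000)

assert relevance(0) == 0.5        # uncited work keeps the baseline
assert relevance(100) == 0.6      # +0.1 per 100 citations
assert relevance(500) == 0.9      # capped: the raw score would be 1.0
assert relevance(10_000) == 0.9   # highly cited papers cannot exceed the ceiling
```

The cap keeps citation count from completely dominating relevance ranking across sources.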

---

### Step 3: Register in Search Handler

**File**: `src/tools/search_handler.py` (add to imports and tool list)

```python
# Add import
from src.tools.openalex import OpenAlexTool

# Add to _create_tools method
def _create_tools(self) -> list[SearchTool]:
    return [
        PubMedTool(),
        ClinicalTrialsTool(),
        EuropePMCTool(),
        OpenAlexTool(),  # NEW
    ]
```
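For context, aggregation across the registered tools can be sketched as a concurrent fan-out. This is a hypothetical helper, not the real `SearchHandler` internals, but it shows the property worth preserving: one failing source should not sink the others.

```python
import asyncio
from typing import Any

async def gather_all(tools: list[Any], query: str, max_results: int = 10) -> list[Any]:
    # Run every tool's search concurrently; collect exceptions instead of raising.
    outcomes = await asyncio.gather(
        *(tool.search(query, max_results) for tool in tools),
        return_exceptions=True,
    )
    # Flatten successful result lists, skipping sources that errored.
    return [ev for out in outcomes if not isinstance(out, BaseException) for ev in out]
```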

---

### Step 4: Update `__init__.py`

**File**: `src/tools/__init__.py`

```python
from src.tools.openalex import OpenAlexTool

__all__ = [
    "PubMedTool",
    "ClinicalTrialsTool",
    "EuropePMCTool",
    "OpenAlexTool",  # NEW
    # ...
]
```

---

## Demo Script

**File**: `examples/openalex_demo.py`

```python
#!/usr/bin/env python3
"""Demo script to verify OpenAlex integration."""

import asyncio

from src.tools.openalex import OpenAlexTool


async def main() -> None:
    """Run OpenAlex search demo."""
    tool = OpenAlexTool()

    print("=" * 60)
    print("OpenAlex Integration Demo")
    print("=" * 60)

    # Test 1: Basic drug repurposing search
    print("\n[Test 1] Searching for 'metformin cancer drug repurposing'...")
    results = await tool.search("metformin cancer drug repurposing", max_results=5)

    for i, evidence in enumerate(results, 1):
        print(f"\n--- Result {i} ---")
        print(f"Title: {evidence.citation.title}")
        print(f"Year: {evidence.citation.date}")
        print(f"Citations: {evidence.metadata.get('cited_by_count', 'N/A')}")
        print(f"Concepts: {', '.join(evidence.metadata.get('concepts', []))}")
        print(f"Open Access: {evidence.metadata.get('is_open_access', False)}")
        print(f"URL: {evidence.citation.url}")
        if evidence.content:
            print(f"Abstract: {evidence.content[:200]}...")

    # Test 2: High-impact papers
    print("\n" + "=" * 60)
    print("[Test 2] Finding highly-cited papers on 'long COVID treatment'...")
    results = await tool.search("long COVID treatment", max_results=3)

    for evidence in results:
        print(f"\n- {evidence.citation.title}")
        print(f"  Citations: {evidence.metadata.get('cited_by_count', 0)}")

    print("\n" + "=" * 60)
    print("Demo complete!")


if __name__ == "__main__":
    asyncio.run(main())
```

---

## Verification Checklist

### Unit Tests

```bash
# Run just the OpenAlex tests
uv run pytest tests/unit/tools/test_openalex.py -v

# Expected: All tests pass
```

### Integration Test (Manual)

```bash
# Run demo script with real API
uv run python examples/openalex_demo.py

# Expected: Real results from OpenAlex API
```

### Full Test Suite

```bash
# Ensure nothing broke
make check

# Expected: All 110+ tests pass, mypy clean
```

---

## Success Criteria

1. **Unit tests pass**: All mocked tests in `test_openalex.py` pass
2. **Integration works**: Demo script returns real results
3. **No regressions**: `make check` passes completely
4. **SearchHandler integration**: OpenAlex appears in search results alongside other sources
5. **Citation metadata**: Results include `cited_by_count`, `concepts`, `is_open_access`

---

## Future Enhancements (P2)

Once basic integration works:

1. **Citation Network Queries**

   ```python
   # Get papers citing a specific work
   async def get_citing_works(self, work_id: str) -> list[Evidence]:
       params = {"filter": f"cites:{work_id}"}
       ...
   ```

2. **Concept-Based Search**

   ```python
   # Search by OpenAlex concept ID
   async def search_by_concept(self, concept_id: str) -> list[Evidence]:
       params = {"filter": f"concepts.id:{concept_id}"}
       ...
   ```

3. **Author Tracking**

   ```python
   # Find all works by an author
   async def search_by_author(self, author_id: str) -> list[Evidence]:
       params = {"filter": f"authorships.author.id:{author_id}"}
       ...
   ```

---

## Notes

- OpenAlex is **very generous** with rate limits (the docs cite roughly 100,000 calls/day and 10 requests/second)
- Adding the `mailto` parameter gives priority access (polite pool)
- Abstracts are stored as an inverted index and must be reconstructed
- Citation count is a good proxy for paper quality/impact
- Consider caching responses for repeated queries
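
The caching note can be sketched as a small in-memory wrapper around any tool with a `search` coroutine. This `CachedSearch` class is a hypothetical helper (not part of the plan above), keyed on the query and result count, with no TTL or eviction:

```python
import asyncio
from typing import Any

class CachedSearch:
    """In-memory cache around an async search tool (sketch; no TTL/eviction)."""

    def __init__(self, tool: Any) -> None:
        self._tool = tool
        self._cache: dict[tuple[str, int], list[Any]] = {}

    async def search(self, query: str, max_results: int = 10) -> list[Any]:
        key = (query, max_results)
        if key not in self._cache:  # only hit the API on a cache miss
            self._cache[key] = await self._tool.search(query, max_results)
        return self._cache[key]
```

A production version would likely want an expiry policy, since citation counts drift over time.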