# RECON -- Temporally-Aware Scientific Literature Retrieval

> A multi-agent RAG system that asks not just "what is relevant?" but "what should I trust, right now?"

**Live demo:** https://huggingface.co/spaces/MukulRay/recon  
**Status:** Active development -- v2 (edge reliability) deployed

---

## The Problem

Standard RAG retrieves the most semantically similar papers. It has no mechanism to detect when those papers have been superseded by newer work.

A 2019 paper with 800 citations scores high on cosine similarity and high on authority. If a 2023 paper explicitly refutes its central claims, retrieving the 2019 paper produces a confident but stale answer. RECON detects this -- and explains why.

---

## What's New in v2

v2 replaces the age-based staleness threshold with a three-signal **edge reliability formula**:

```
edge_reliability = (citation_centrality x 0.4)
                 + (recency_signal       x 0.3)
                 + (content_coherence    x 0.3)
```

A 2003 paper with 10,000 citations scores **FOUNDATIONAL** -- high centrality overrides age.  
A 2020 paper with 5 citations, superseded by newer work, scores **SUPERSEDED**.  
Pure age-based detection cannot make this distinction.

---

## Architecture

```
session_loader -> planner -> retriever -> critic -> synthesizer -> END
                                             |
                                      retry_retriever (max 2)
```
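
The control flow in the diagram can be sketched in plain Python. This is a minimal illustration, not the LangGraph wiring in `src/graph.py`: the four callables and their signatures are hypothetical stand-ins for the agents, `session_loader` is omitted, and treating `FORCED_PASS` as the verdict issued once the retry budget is spent is an assumption.

```python
from typing import Callable

MAX_RETRIES = 2  # from the diagram: retry_retriever (max 2)


def run_pipeline(
    query: str,
    planner: Callable[[str], list[str]],
    retriever: Callable[[list[str]], list[dict]],
    critic: Callable[[list[dict]], tuple[str, list[str]]],
    synthesizer: Callable[[list[dict], str], str],
) -> str:
    """Run planner -> retriever -> critic, retrying retrieval up to MAX_RETRIES times."""
    sub_questions = planner(query)
    papers = retriever(sub_questions)
    verdict, rewritten = critic(papers)

    retries = 0
    while verdict != "PASS" and retries < MAX_RETRIES:
        # On a non-PASS verdict the critic hands back rewritten sub-questions.
        papers = retriever(rewritten)
        verdict, rewritten = critic(papers)
        retries += 1

    if verdict != "PASS":
        # Assumption: FORCED_PASS marks an exhausted retry budget.
        verdict = "FORCED_PASS"
    return synthesizer(papers, verdict)
```

With a critic that never passes, the retriever runs three times in total (one initial fetch plus two retries) before the synthesizer is called.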

| Agent | Role |
|---|---|
| **Planner** | Decomposes query into 2-3 temporally-typed sub-questions (foundational / recent / open) |
| **Retriever** | Fetches papers from Semantic Scholar + OpenAlex, deduplicated by DOI. Hybrid scoring: semantic x 0.5 + recency x 0.3 + authority x 0.2 |
| **Critic** | Computes edge reliability scores, then issues a verdict: PASS / STALE / CONTRADICTED / INSUFFICIENT / FORCED_PASS. On non-PASS: rewrites sub-questions with a failure-specific strategy |
| **Synthesizer** | Four-section brief (Overview / Key Findings / Active Debates / Outlook) with per-claim citations and per-paper trust summary |
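
As a minimal sketch of the Retriever's hybrid scoring and DOI deduplication, assuming each component score is already normalized to [0, 1] and that papers are plain dicts with an optional `doi` key. Both are assumptions, and the actual helpers in `src/retriever_utils.py` may have different signatures.

```python
def hybrid_score(semantic: float, recency: float, authority: float) -> float:
    """Weighted sum from the Retriever row: semantic 0.5, recency 0.3, authority 0.2."""
    return 0.5 * semantic + 0.3 * recency + 0.2 * authority


def dedupe_by_doi(papers: list[dict]) -> list[dict]:
    """Keep the first paper seen per DOI; papers without a DOI are always kept."""
    seen: set[str] = set()
    unique = []
    for paper in papers:
        # Assumption: DOIs are compared case-insensitively.
        doi = (paper.get("doi") or "").strip().lower()
        if doi and doi in seen:
            continue
        if doi:
            seen.add(doi)
        unique.append(paper)
    return unique
```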

---

## Edge Reliability Scoring (`src/reliability.py`)

Each retrieved paper receives a `ReliabilityScore` with:

- `score` -- composite [0, 1]
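- `centrality` -- `min(1.0, log1p(cited_by_count) / log1p(10000))`, using `cited_by_count` from OpenAlex; saturates at 1.0 from 10,000 citations upward
- `recency` -- `max(0, 1 - age/20)`, a linear decay that reaches 0 once `age` hits 20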
- `coherence` -- LLM batch check: does this paper's abstract still represent current consensus?
- `dominant_signal` -- `FOUNDATIONAL` / `CURRENT` / `DECLINING` / `SUPERSEDED`
- `reason` -- one-line explanation

The synthesizer appends a trust summary to every response so domain experts can verify verdict reasoning.
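
The numeric signals above combine as follows. This is a simplified sketch, not the code in `src/reliability.py`: it assumes `age` is measured in years, takes `coherence` as an already-computed number in [0, 1] instead of calling an LLM, and omits `dominant_signal` and `reason` because their thresholds are not documented here.

```python
import math
from dataclasses import dataclass


@dataclass
class ReliabilityScore:
    """Simplified stand-in: numeric fields only."""
    score: float
    centrality: float
    recency: float
    coherence: float


def centrality(cited_by_count: int) -> float:
    """Log-scaled citation count, capped at 1.0."""
    return min(1.0, math.log1p(cited_by_count) / math.log1p(10000))


def recency(age_years: float) -> float:
    """Linear decay over 20 years (age in years is an assumption)."""
    return max(0.0, 1.0 - age_years / 20)


def edge_reliability(cited_by_count: int, age_years: float, coherence: float) -> ReliabilityScore:
    """Composite score with the 0.4 / 0.3 / 0.3 weights from the v2 formula."""
    c = centrality(cited_by_count)
    r = recency(age_years)
    score = 0.4 * c + 0.3 * r + 0.3 * coherence
    return ReliabilityScore(score=score, centrality=c, recency=r, coherence=coherence)
```

For the two examples in "What's New in v2", `centrality(10_000)` is exactly 1.0, so the 2003 paper keeps 0.4 of its score from citations alone even once its recency has decayed to 0, whereas `centrality(5)` is only about 0.19.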

---

## Evaluation

The benchmark has 130 questions across three categories: consensus claims (Cat A), superseded claims (Cat B), and contested claims (Cat C). Ground truth is sourced from real ML survey paper supersession chains.

| Architecture | Staleness Catch Rate | Position Accuracy | False Positives |
|---|---|---|---|
| Single-pass RAG (baseline) | 0% | 32.3% | -- |
| Naive multi-agent | 0% | 44.6% | -- |
| RECON v1 (age-based STALE) | 52% | 43.9% | 8% |
| **RECON v2 (edge reliability)** | **44%** | **44.6%** | **2%** |

v2 trades some staleness recall (catch rate falls from 52% to 44%) for a substantially lower false-positive rate (8% down to 2%). The reliability formula correctly preserves foundational papers that v1 would incorrectly flag as stale.
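
The exact metric definitions are not spelled out here, so the following is only one plausible reading of the "Staleness Catch Rate" and "False Positives" columns, with hypothetical function names and toy inputs. `eval/run_eval.py` may compute them differently.

```python
def catch_rate(verdicts: list[str], expected: list[str], target: str = "STALE") -> float:
    """Share of questions expected to be `target` that the system actually flagged as `target`."""
    relevant = [v for v, e in zip(verdicts, expected) if e == target]
    if not relevant:
        return 0.0
    return sum(v == target for v in relevant) / len(relevant)


def false_positive_rate(verdicts: list[str], expected: list[str], target: str = "STALE") -> float:
    """Share of questions NOT expected to be `target` that were flagged as `target` anyway."""
    negatives = [v for v, e in zip(verdicts, expected) if e != target]
    if not negatives:
        return 0.0
    return sum(v == target for v in negatives) / len(negatives)


# Toy data for illustration only, not benchmark results.
toy_verdicts = ["STALE", "PASS", "STALE", "PASS"]
toy_expected = ["STALE", "STALE", "PASS", "PASS"]
```

Under this reading, a system can raise its catch rate simply by flagging more papers as stale, which is why the two columns are reported side by side.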

**Known limitation:** Contradiction catch rate is 0% because the retriever returns topically adjacent papers rather than opposing-camp papers. This is a retrieval problem, not a critic problem, and is to be addressed in future work.

---

## Repository Structure

```
src/
  agents/
    planner.py          -- query decomposition, temporally-typed sub-questions
    retriever.py        -- S2 + OpenAlex fetch, hybrid scoring, DOI dedup
    critic.py           -- edge reliability scoring, verdict logic, retry
    synthesizer.py      -- synthesis, trust summary, claim extraction
  openalex_utils.py     -- OpenAlex API (search, DOI lookup, citation centrality)
  reliability.py        -- three-signal edge reliability scorer
  retriever_utils.py    -- hybrid_score, recency_score, authority_score, S2 API
  state.py              -- ResearchState TypedDict, Paper/Claim dataclasses
  memory.py             -- SQLite session persistence
  graph.py              -- LangGraph state machine, node wiring
app.py                  -- Gradio UI
eval/
  run_eval.py           -- 5-architecture evaluation harness, LLM-as-judge
  questions.json        -- 130-question benchmark
  ground_truth.json     -- ground truth for Cat A/B
  results/              -- eval CSVs
  archived/             -- patch_contradiction.py (archived, not used in reported metrics)
```

---

## Setup

```bash
git clone https://github.com/MukulRay1603/project-recon
cd project-recon
pip install -r requirements.txt
```

Create a `.env` file:

```
GROQ_API_KEY=...
OPENALEX_API_KEY=...   # free at openalex.org/settings/api
S2_API_KEY=...         # optional but recommended
TAVILY_API_KEY=...     # optional fallback web search
```
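
A small check like the following can confirm the file is complete before launching. It is a sketch under assumptions: `parse_env` and `missing_keys` are hypothetical helpers, not part of the repository; the parser handles only simple `KEY=value  # comment` lines; and treating the two keys not marked optional as required is inferred from the comments above.

```python
REQUIRED_KEYS = ("GROQ_API_KEY", "OPENALEX_API_KEY")  # assumption: unmarked keys are required
OPTIONAL_KEYS = ("S2_API_KEY", "TAVILY_API_KEY")      # marked optional in the .env template


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, dropping inline '#' comments and blank lines."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling `missing_keys(parse_env(open(".env").read()))` returns an empty list when both required keys are set. Values that themselves contain `#` would be truncated by this simple parser.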

Then launch the Gradio UI:

```bash
python app.py
```

---

## Tech Stack

| Component | Choice |
|---|---|
| Orchestration | LangGraph |
| LLM | Llama 3.3 70B via Groq |
| Embeddings | all-MiniLM-L6-v2 (sentence-transformers) |
| Paper APIs | Semantic Scholar + OpenAlex |
| Web search | DuckDuckGo (Tavily fallback) |
| Session memory | SQLite |
| UI | Gradio |
| Deployment | Hugging Face Spaces |

---

## Author

Mukul Ray -- MS Applied ML, University of Maryland College Park  
GitHub: [@MukulRay1603](https://github.com/MukulRay1603)