# RECON -- Temporally-Aware Scientific Literature Retrieval

> A multi-agent RAG system that asks not just "what is relevant?" but "what should I trust, right now?"

**Live demo:** https://huggingface.co/spaces/MukulRay/recon

**Status:** Active development -- v2 (edge reliability) deployed

---

## The Problem

Standard RAG retrieves the most semantically similar papers. It has no mechanism to detect when those papers have been superseded by newer work.

A 2019 paper with 800 citations scores high on cosine similarity and high on authority. If a 2023 paper explicitly refutes its central claims, retrieving the 2019 paper produces a confident but stale answer. RECON detects this -- and explains why.

---

## What's New in v2

v2 replaces the age-based staleness threshold with a three-signal **edge reliability formula**:

```
edge_reliability = (citation_centrality x 0.4)
                 + (recency_signal x 0.3)
                 + (content_coherence x 0.3)
```

A 2003 paper with 10,000 citations scores **FOUNDATIONAL** -- high centrality overrides age. A 2020 paper with 5 citations, superseded by newer work, scores **SUPERSEDED**. Pure age-based detection cannot make this distinction.

---

## Architecture

```
session_loader -> planner -> retriever -> critic -> synthesizer -> END
                                            |
                                  retry_retriever (max 2)
```

| Agent | Role |
|---|---|
| **Planner** | Decomposes query into 2-3 temporally-typed sub-questions (foundational / recent / open) |
| **Retriever** | Fetches papers from Semantic Scholar + OpenAlex, deduplicated by DOI. Hybrid scoring: semantic x 0.5 + recency x 0.3 + authority x 0.2 |
| **Critic** | Computes edge reliability scores, then issues a verdict: PASS / STALE / CONTRADICTED / INSUFFICIENT / FORCED_PASS. On a non-PASS verdict, rewrites sub-questions with a failure-specific strategy |
| **Synthesizer** | Four-section brief (Overview / Key Findings / Active Debates / Outlook) with per-claim citations and per-paper trust summary |

---

## Edge Reliability Scoring (`src/reliability.py`)

Each retrieved paper receives a `ReliabilityScore` with:

- `score` -- composite [0, 1]
- `centrality` -- `min(1.0, log1p(cited_by_count) / log1p(10000))` from OpenAlex
- `recency` -- `max(0, 1 - age/20)` linear decay
- `coherence` -- LLM batch check: does this paper's abstract still represent current consensus?
- `dominant_signal` -- `FOUNDATIONAL` / `CURRENT` / `DECLINING` / `SUPERSEDED`
- `reason` -- one-line explanation

The synthesizer appends a trust summary to every response so domain experts can verify verdict reasoning.

---

## Evaluation

130-question benchmark across three categories: consensus claims (Cat A), superseded claims (Cat B), contested claims (Cat C). Ground truth sourced from real ML survey paper supersession chains.

| Architecture | Staleness Catch Rate | Position Accuracy | False Positives |
|---|---|---|---|
| Single-pass RAG (baseline) | 0% | 32.3% | -- |
| Naive multi-agent | 0% | 44.6% | -- |
| RECON v1 (age-based STALE) | 52% | 43.9% | 8% |
| **RECON v2 (edge reliability)** | **44%** | **44.6%** | **2%** |

v2 trades some staleness recall for a substantially lower false-positive rate. The reliability formula correctly preserves foundational papers that v1 would incorrectly flag as stale.

**Known limitation:** Contradiction catch rate is 0% -- the retriever returns topically adjacent papers rather than opposing-camp papers. This is a retrieval problem, not a critic problem. It is planned for future work.

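
As a rough illustration of why v2 keeps a highly cited older paper that an age-only rule would flag, the three signals from the Edge Reliability Scoring section can be combined as below. This is a minimal sketch, not the code in `src/reliability.py`: the function names are hypothetical, `coherence` is assumed to arrive already normalized to [0, 1] from the LLM check, and the coherence values in the two example calls are made up for illustration.

```python
import math

# Weights taken from the edge reliability formula stated in this README.
W_CENTRALITY, W_RECENCY, W_COHERENCE = 0.4, 0.3, 0.3


def centrality(cited_by_count: int) -> float:
    """Log-scaled citation centrality, saturating at 10,000 citations."""
    return min(1.0, math.log1p(cited_by_count) / math.log1p(10000))


def recency(age_years: float) -> float:
    """Linear decay that reaches 0 once a paper is 20 years old."""
    return max(0.0, 1.0 - age_years / 20)


def edge_reliability(cited_by_count: int, age_years: float, coherence: float) -> float:
    """Weighted sum of the three signals (hypothetical helper).

    `coherence` is assumed to be a value in [0, 1] produced elsewhere.
    """
    return (
        centrality(cited_by_count) * W_CENTRALITY
        + recency(age_years) * W_RECENCY
        + coherence * W_COHERENCE
    )


# Illustrative inputs only: the coherence values are invented for this sketch.
old_classic = edge_reliability(cited_by_count=10000, age_years=22, coherence=0.9)
recent_low_cited = edge_reliability(cited_by_count=5, age_years=5, coherence=0.1)
print(f"old classic: {old_classic:.3f}, recent low-cited: {recent_low_cited:.3f}")
```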

---

## Repository Structure

```
src/
  agents/
    planner.py         -- query decomposition, temporally-typed sub-questions
    retriever.py       -- S2 + OpenAlex fetch, hybrid scoring, DOI dedup
    critic.py          -- edge reliability scoring, verdict logic, retry
    synthesizer.py     -- synthesis, trust summary, claim extraction
  openalex_utils.py    -- OpenAlex API (search, DOI lookup, citation centrality)
  reliability.py       -- three-signal edge reliability scorer
  retriever_utils.py   -- hybrid_score, recency_score, authority_score, S2 API
  state.py             -- ResearchState TypedDict, Paper/Claim dataclasses
  memory.py            -- SQLite session persistence
  graph.py             -- LangGraph state machine, node wiring
app.py                 -- Gradio UI
eval/
  run_eval.py          -- 5-architecture evaluation harness, LLM-as-judge
  questions.json       -- 130-question benchmark
  ground_truth.json    -- ground truth for Cat A/B
  results/             -- eval CSVs
  archived/            -- patch_contradiction.py (archived, not used in reported metrics)
```

---

## Setup

```bash
git clone https://github.com/MukulRay1603/project-recon
cd project-recon
pip install -r requirements.txt
```

Create a `.env` file:

```
GROQ_API_KEY=...
OPENALEX_API_KEY=...   # free at openalex.org/settings/api
S2_API_KEY=...         # optional but recommended
TAVILY_API_KEY=...     # optional fallback web search
```

```bash
python app.py
```

---

## Tech Stack

| Component | Choice |
|---|---|
| Orchestration | LangGraph |
| LLM | Llama 3.3 70B via Groq |
| Embeddings | all-MiniLM-L6-v2 (sentence-transformers) |
| Paper APIs | Semantic Scholar + OpenAlex |
| Web search | DuckDuckGo (Tavily fallback) |
| Session memory | SQLite |
| UI | Gradio |
| Deployment | Hugging Face Spaces |

---

## Author

Mukul Ray -- MS Applied ML, University of Maryland College Park

GitHub: [@MukulRay1603](https://github.com/MukulRay1603)