RECON -- Temporally-Aware Scientific Literature Retrieval

A multi-agent RAG system that asks not just "what is relevant?" but "what should I trust, right now?"

Live demo: https://huggingface.co/spaces/MukulRay/recon
Status: Active development -- v2 (edge reliability) deployed


The Problem

Standard RAG retrieves the most semantically similar papers. It has no mechanism to detect when those papers have been superseded by newer work.

A 2019 paper with 800 citations scores high on cosine similarity and high on authority. If a 2023 paper explicitly refutes its central claims, retrieving the 2019 paper produces a confident but stale answer. RECON detects this -- and explains why.


What's New in v2

v2 replaces the age-based staleness threshold with a three-signal edge reliability formula:

edge_reliability = (citation_centrality x 0.4)
                 + (recency_signal       x 0.3)
                 + (content_coherence    x 0.3)

A 2003 paper with 10,000 citations scores FOUNDATIONAL -- high centrality overrides age.
A 2020 paper with 5 citations, superseded by newer work, scores SUPERSEDED.
Pure age-based detection cannot make this distinction.
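
As a minimal sketch, the weighted sum above can be written as a small Python function. The function name, argument names, and the example signal values are illustrative assumptions, not taken from src/reliability.py.

```python
def edge_reliability(citation_centrality: float,
                     recency_signal: float,
                     content_coherence: float) -> float:
    """Combine three signals in [0, 1] using the v2 weights 0.4 / 0.3 / 0.3."""
    return (citation_centrality * 0.4
            + recency_signal * 0.3
            + content_coherence * 0.3)


# Hypothetical signal values, chosen only to illustrate the two cases above.
old_but_central = edge_reliability(1.0, 0.0, 0.9)  # heavily cited, old, still coherent
new_but_weak = edge_reliability(0.2, 0.8, 0.1)     # lightly cited, recent, out of step

print(f"{old_but_central:.2f} {new_but_weak:.2f}")
```

With these made-up inputs the older paper scores 0.67 and the newer one 0.35, even though the newer paper has the higher recency signal.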


Architecture

session_loader -> planner -> retriever -> critic -> synthesizer -> END
                                             |
                                      retry_retriever (max 2)
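
The retry edge in the diagram can be read as a bounded loop. The sketch below is plain Python, not the LangGraph wiring in src/graph.py. The function names and callback signatures are hypothetical, and treating FORCED_PASS as "retry budget exhausted" is an assumption.

```python
from typing import Callable, List

MAX_RETRIES = 2  # matches "retry_retriever (max 2)" in the diagram


def run_pipeline(sub_questions: List[str],
                 retrieve: Callable[[List[str]], List[dict]],
                 critique: Callable[[List[dict]], str],
                 rewrite: Callable[[List[str], str], List[str]]) -> tuple:
    """Retrieve, critique, and retry until PASS or the retry budget is spent."""
    papers = retrieve(sub_questions)
    verdict = critique(papers)
    retries = 0
    while verdict != "PASS" and retries < MAX_RETRIES:
        # Rewrite the sub-questions based on the failure verdict, then retry.
        sub_questions = rewrite(sub_questions, verdict)
        papers = retrieve(sub_questions)
        verdict = critique(papers)
        retries += 1
    if verdict != "PASS":
        verdict = "FORCED_PASS"  # assumption: budget exhausted, proceed anyway
    return papers, verdict, retries
```

The papers and final verdict would then be handed to the synthesizer.
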

| Agent | Role |
|---|---|
| Planner | Decomposes query into 2-3 temporally-typed sub-questions (foundational / recent / open) |
| Retriever | Fetches papers from Semantic Scholar + OpenAlex, deduplicated by DOI. Hybrid scoring: semantic x 0.5 + recency x 0.3 + authority x 0.2 |
| Critic | Computes edge reliability scores, then issues verdict: PASS / STALE / CONTRADICTED / INSUFFICIENT / FORCED_PASS. On non-PASS: rewrites sub-questions with failure-specific strategy |
| Synthesizer | Four-section brief (Overview / Key Findings / Active Debates / Outlook) with per-claim citations and per-paper trust summary |
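
The Retriever's hybrid scoring and DOI deduplication might look roughly like the sketch below. The weights come from the description above. The function signatures and the paper dictionary shape are assumptions and may differ from src/retriever_utils.py.

```python
from typing import Dict, Iterable, List


def hybrid_score(semantic: float, recency: float, authority: float) -> float:
    """Weighted sum with weights 0.5 / 0.3 / 0.2; inputs assumed to be in [0, 1]."""
    return semantic * 0.5 + recency * 0.3 + authority * 0.2


def dedupe_by_doi(papers: Iterable[Dict]) -> List[Dict]:
    """Keep the first paper seen for each DOI; papers without a DOI are kept."""
    seen = set()
    unique = []
    for paper in papers:
        # DOIs are case-insensitive, so normalise before comparing.
        doi = (paper.get("doi") or "").strip().lower()
        if doi and doi in seen:
            continue
        if doi:
            seen.add(doi)
        unique.append(paper)
    return unique
```

Deduplicating before scoring keeps a paper returned by both Semantic Scholar and OpenAlex from appearing twice in the ranked list.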

Edge Reliability Scoring (src/reliability.py)

Each retrieved paper receives a ReliabilityScore with:

  • score -- composite [0, 1]
  • centrality -- min(1.0, log1p(cited_by_count) / log1p(10000)) from OpenAlex
  • recency -- max(0, 1 - age/20), a linear decay that reaches 0 at age 20
  • coherence -- LLM batch check: does this paper's abstract still represent current consensus?
  • dominant_signal -- FOUNDATIONAL / CURRENT / DECLINING / SUPERSEDED
  • reason -- one-line explanation

The synthesizer appends a trust summary to every response so domain experts can verify verdict reasoning.
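
The two closed-form signals in the list above can be sketched directly from their formulas. This is an illustration, not the code in src/reliability.py: the function names are hypothetical, and age is assumed to be measured in years.

```python
import math


def centrality(cited_by_count: int) -> float:
    """min(1.0, log1p(cited_by_count) / log1p(10000)): saturates at 10,000 citations."""
    return min(1.0, math.log1p(cited_by_count) / math.log1p(10000))


def recency(age_years: float) -> float:
    """max(0, 1 - age/20): linear decay (age assumed to be in years)."""
    return max(0.0, 1.0 - age_years / 20)
```

For example, centrality(10000) is 1.0 while centrality(5) is roughly 0.19, and recency(5) is 0.75 while any paper aged 20 or more gets 0.0. The log scaling means the first few hundred citations move the centrality signal far more than the last few thousand.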


Evaluation

130-question benchmark across three categories: consensus claims (Cat A), superseded claims (Cat B), contested claims (Cat C). Ground truth sourced from real ML survey paper supersession chains.

| Architecture | Staleness Catch Rate | Position Accuracy | False Positives |
|---|---|---|---|
| Single-pass RAG (baseline) | 0% | 32.3% | -- |
| Naive multi-agent | 0% | 44.6% | -- |
| RECON v1 (age-based STALE) | 52% | 43.9% | 8% |
| RECON v2 (edge reliability) | 44% | 44.6% | 2% |

v2 trades some staleness recall (catch rate drops from 52% to 44%) for a substantially lower false-positive rate (8% to 2%). The reliability formula correctly preserves foundational papers that v1 would incorrectly flag as stale.

Known limitation: Contradiction catch rate is 0% -- the retriever returns topically adjacent papers rather than opposing-camp papers. This is a retrieval problem, not a critic problem, and will be addressed in future work.
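
The README does not spell out how the staleness metrics are computed. One plausible reading, sketched below purely as an assumption, is that the catch rate is measured over superseded-claim questions (Cat B) and false positives over consensus questions (Cat A). The record fields `category` and `flagged_stale` are hypothetical and are not the schema used by eval/run_eval.py.

```python
from typing import Dict, List


def rate(flags: List[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def staleness_metrics(results: List[Dict]) -> Dict[str, float]:
    # Assumption: each result carries a benchmark "category" ("A", "B", or "C")
    # and a boolean "flagged_stale" set when the system flagged staleness.
    cat_b = [r["flagged_stale"] for r in results if r["category"] == "B"]
    cat_a = [r["flagged_stale"] for r in results if r["category"] == "A"]
    return {
        "staleness_catch_rate": rate(cat_b),  # superseded claims that were flagged
        "false_positive_rate": rate(cat_a),   # consensus claims wrongly flagged
    }
```

Under this reading, a system that never flags anything scores 0% on both metrics, which is why a catch rate only means something alongside its false-positive rate.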


Repository Structure

src/
  agents/
    planner.py          -- query decomposition, temporally-typed sub-questions
    retriever.py        -- S2 + OpenAlex fetch, hybrid scoring, DOI dedup
    critic.py           -- edge reliability scoring, verdict logic, retry
    synthesizer.py      -- synthesis, trust summary, claim extraction
  openalex_utils.py     -- OpenAlex API (search, DOI lookup, citation centrality)
  reliability.py        -- three-signal edge reliability scorer
  retriever_utils.py    -- hybrid_score, recency_score, authority_score, S2 API
  state.py              -- ResearchState TypedDict, Paper/Claim dataclasses
  memory.py             -- SQLite session persistence
  graph.py              -- LangGraph state machine, node wiring
app.py                  -- Gradio UI
eval/
  run_eval.py           -- 5-architecture evaluation harness, LLM-as-judge
  questions.json        -- 130-question benchmark
  ground_truth.json     -- ground truth for Cat A/B
  results/              -- eval CSVs
  archived/             -- patch_contradiction.py (archived, not used in reported metrics)

Setup

git clone https://github.com/MukulRay1603/project-recon
cd project-recon
pip install -r requirements.txt

Create a .env file:

GROQ_API_KEY=...
OPENALEX_API_KEY=...   # free at openalex.org/settings/api
S2_API_KEY=...         # optional but recommended
TAVILY_API_KEY=...     # optional fallback web search

Then start the app:

python app.py
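
Before launching, it can help to confirm which keys are actually set. The helper below is a hypothetical sketch, not part of the repository. It assumes that keys not marked "optional" above are required, and that the .env values have already been loaded into the process environment.

```python
import os
from typing import Dict, List, Mapping

# Assumption: keys not marked "optional" in the .env listing are required.
REQUIRED_KEYS = ["GROQ_API_KEY", "OPENALEX_API_KEY"]
OPTIONAL_KEYS = ["S2_API_KEY", "TAVILY_API_KEY"]


def check_env(env: Mapping[str, str] = os.environ) -> Dict[str, List[str]]:
    """Report which required and optional keys are unset or blank."""
    def missing(keys: List[str]) -> List[str]:
        return [k for k in keys if not env.get(k, "").strip()]

    return {"required": missing(REQUIRED_KEYS), "optional": missing(OPTIONAL_KEYS)}
```

Calling check_env() with no arguments inspects the current environment; a non-empty "required" list means the app is likely to fail on its first API call.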

Tech Stack

| Component | Choice |
|---|---|
| Orchestration | LangGraph |
| LLM | Llama 3.3 70B via Groq |
| Embeddings | all-MiniLM-L6-v2 (sentence-transformers) |
| Paper APIs | Semantic Scholar + OpenAlex |
| Web search | DuckDuckGo (Tavily fallback) |
| Session memory | SQLite |
| UI | Gradio |
| Deployment | Hugging Face Spaces |

Author

Mukul Ray -- MS Applied ML, University of Maryland College Park
GitHub: @MukulRay1603