A newer version of the Gradio SDK is available: 6.16.0
RECON -- Temporally-Aware Scientific Literature Retrieval
A multi-agent RAG system that asks not just "what is relevant?" but "what should I trust, right now?"
Live demo: https://huggingface.co/spaces/MukulRay/recon
Status: Active development -- v2 (edge reliability) deployed
The Problem
Standard RAG retrieves the most semantically similar papers. It has no mechanism to detect when those papers have been superseded by newer work.
A 2019 paper with 800 citations scores high on cosine similarity and high on authority. If a 2023 paper explicitly refutes its central claims, retrieving the 2019 paper produces a confident but stale answer. RECON detects this -- and explains why.
What's New in v2
v2 replaces the age-based staleness threshold with a three-signal edge reliability formula:
edge_reliability = (citation_centrality x 0.4)
+ (recency_signal x 0.3)
+ (content_coherence x 0.3)
A 2003 paper with 10,000 citations scores FOUNDATIONAL -- high centrality overrides age.
A 2020 paper with 5 citations, superseded by newer work, scores SUPERSEDED.
Pure age-based detection cannot make this distinction.
Architecture
session_loader -> planner -> retriever -> critic -> synthesizer -> END
|
retry_retriever (max 2)
| Agent | Role |
|---|---|
| Planner | Decomposes query into 2-3 temporally-typed sub-questions (foundational / recent / open) |
| Retriever | Fetches papers from Semantic Scholar + OpenAlex, deduplicated by DOI. Hybrid scoring: semantic x 0.5 + recency x 0.3 + authority x 0.2 |
| Critic | Computes edge reliability scores, then issues verdict: PASS / STALE / CONTRADICTED / INSUFFICIENT / FORCED_PASS. On non-PASS: rewrites sub-questions with failure-specific strategy |
| Synthesizer | Four-section brief (Overview / Key Findings / Active Debates / Outlook) with per-claim citations and per-paper trust summary |
Edge Reliability Scoring (src/reliability.py)
Each retrieved paper receives a ReliabilityScore with:
score-- composite [0, 1]centrality--min(1.0, log1p(cited_by_count) / log1p(10000))from OpenAlexrecency--max(0, 1 - age/20)linear decaycoherence-- LLM batch check: does this paper's abstract still represent current consensus?dominant_signal--FOUNDATIONAL/CURRENT/DECLINING/SUPERSEDEDreason-- one-line explanation
The synthesizer appends a trust summary to every response so domain experts can verify verdict reasoning.
Evaluation
130-question benchmark across three categories: consensus claims (Cat A), superseded claims (Cat B), contested claims (Cat C). Ground truth sourced from real ML survey paper supersession chains.
| Architecture | Staleness Catch Rate | Position Accuracy | False Positives |
|---|---|---|---|
| Single-pass RAG (baseline) | 0% | 32.3% | -- |
| Naive multi-agent | 0% | 44.6% | -- |
| RECON v1 (age-based STALE) | 52% | 43.9% | 8% |
| RECON v2 (edge reliability) | 44% | 44.6% | 2% |
v2 trades some staleness recall for substantially lower false-positive rate. The reliability formula correctly preserves foundational papers that v1 would incorrectly flag as stale.
Known limitation: Contradiction catch rate is 0% -- the retriever returns topically adjacent papers rather than opposing-camp papers. This is a retrieval problem, not a critic problem. Addressed in future work.
Repository Structure
src/
agents/
planner.py -- query decomposition, temporally-typed sub-questions
retriever.py -- S2 + OpenAlex fetch, hybrid scoring, DOI dedup
critic.py -- edge reliability scoring, verdict logic, retry
synthesizer.py -- synthesis, trust summary, claim extraction
openalex_utils.py -- OpenAlex API (search, DOI lookup, citation centrality)
reliability.py -- three-signal edge reliability scorer
retriever_utils.py -- hybrid_score, recency_score, authority_score, S2 API
state.py -- ResearchState TypedDict, Paper/Claim dataclasses
memory.py -- SQLite session persistence
graph.py -- LangGraph state machine, node wiring
app.py -- Gradio UI
eval/
run_eval.py -- 5-architecture evaluation harness, LLM-as-judge
questions.json -- 130-question benchmark
ground_truth.json -- ground truth for Cat A/B
results/ -- eval CSVs
archived/ -- patch_contradiction.py (archived, not used in reported metrics)
Setup
git clone https://github.com/MukulRay1603/project-recon
cd project-recon
pip install -r requirements.txt
Create a .env file:
GROQ_API_KEY=...
OPENALEX_API_KEY=... # free at openalex.org/settings/api
S2_API_KEY=... # optional but recommended
TAVILY_API_KEY=... # optional fallback web search
python app.py
Tech Stack
| Component | Choice |
|---|---|
| Orchestration | LangGraph |
| LLM | Llama 3.3 70B via Groq |
| Embeddings | all-MiniLM-L6-v2 (sentence-transformers) |
| Paper APIs | Semantic Scholar + OpenAlex |
| Web search | DuckDuckGo (Tavily fallback) |
| Session memory | SQLite |
| UI | Gradio |
| Deployment | Hugging Face Spaces |
Author
Mukul Ray -- MS Applied ML, University of Maryland College Park
GitHub: @MukulRay1603