Spaces:

MukulRay
/

recon

Sleeping

App Files Files Community

recon / GITHUB_README.md

MukulRay

fix: trust summary resilience, Unicode in reason strings, full GitHub README

4f24399 about 2 months ago

preview code

raw

history blame contribute delete

5.55 kB

	# RECON -- Temporally-Aware Scientific Literature Retrieval

	> A multi-agent RAG system that asks not just "what is relevant?" but "what should I trust, right now?"

	Live demo: https://huggingface.co/spaces/MukulRay/recon
	Status: Active development -- v2 (edge reliability) deployed

	---

	## The Problem

	Standard RAG retrieves the most semantically similar papers. It has no mechanism to detect when those papers have been superseded by newer work.

	A 2019 paper with 800 citations scores high on cosine similarity and high on authority. If a 2023 paper explicitly refutes its central claims, retrieving the 2019 paper produces a confident but stale answer. RECON detects this -- and explains why.

	---

	## What's New in v2

	v2 replaces the age-based staleness threshold with a three-signal edge reliability formula:

	```
	edge_reliability = (citation_centrality x 0.4)
	+ (recency_signal x 0.3)
	+ (content_coherence x 0.3)
	```

	A 2003 paper with 10,000 citations scores FOUNDATIONAL -- high centrality overrides age.
	A 2020 paper with 5 citations, superseded by newer work, scores SUPERSEDED.
	Pure age-based detection cannot make this distinction.

	---

	## Architecture

	```
	session_loader -> planner -> retriever -> critic -> synthesizer -> END
	\|
	retry_retriever (max 2)
	```

	\| Agent \| Role \|
	\|---\|---\|
	\| Planner \| Decomposes query into 2-3 temporally-typed sub-questions (foundational / recent / open) \|
	\| Retriever \| Fetches papers from Semantic Scholar + OpenAlex, deduplicated by DOI. Hybrid scoring: semantic x 0.5 + recency x 0.3 + authority x 0.2 \|
	\| Critic \| Computes edge reliability scores, then issues verdict: PASS / STALE / CONTRADICTED / INSUFFICIENT / FORCED_PASS. On non-PASS: rewrites sub-questions with failure-specific strategy \|
	\| Synthesizer \| Four-section brief (Overview / Key Findings / Active Debates / Outlook) with per-claim citations and per-paper trust summary \|

	---

	## Edge Reliability Scoring (`src/reliability.py`)

	Each retrieved paper receives a `ReliabilityScore` with:

	- `score` -- composite [0, 1]
	- `centrality` -- `min(1.0, log1p(cited_by_count) / log1p(10000))` from OpenAlex
	- `recency` -- `max(0, 1 - age/20)` linear decay
	- `coherence` -- LLM batch check: does this paper's abstract still represent current consensus?
	- `dominant_signal` -- `FOUNDATIONAL` / `CURRENT` / `DECLINING` / `SUPERSEDED`
	- `reason` -- one-line explanation

	The synthesizer appends a trust summary to every response so domain experts can verify verdict reasoning.

	---

	## Evaluation

	130-question benchmark across three categories: consensus claims (Cat A), superseded claims (Cat B), contested claims (Cat C). Ground truth sourced from real ML survey paper supersession chains.

	\| Architecture \| Staleness Catch Rate \| Position Accuracy \| False Positives \|
	\|---\|---\|---\|---\|
	\| Single-pass RAG (baseline) \| 0% \| 32.3% \| -- \|
	\| Naive multi-agent \| 0% \| 44.6% \| -- \|
	\| RECON v1 (age-based STALE) \| 52% \| 43.9% \| 8% \|
	\| RECON v2 (edge reliability) \| 44% \| 44.6% \| 2% \|

	v2 trades some staleness recall for substantially lower false-positive rate. The reliability formula correctly preserves foundational papers that v1 would incorrectly flag as stale.

	Known limitation: Contradiction catch rate is 0% -- the retriever returns topically adjacent papers rather than opposing-camp papers. This is a retrieval problem, not a critic problem. Addressed in future work.

	---

	## Repository Structure

	```
	src/
	agents/
	planner.py -- query decomposition, temporally-typed sub-questions
	retriever.py -- S2 + OpenAlex fetch, hybrid scoring, DOI dedup
	critic.py -- edge reliability scoring, verdict logic, retry
	synthesizer.py -- synthesis, trust summary, claim extraction
	openalex_utils.py -- OpenAlex API (search, DOI lookup, citation centrality)
	reliability.py -- three-signal edge reliability scorer
	retriever_utils.py -- hybrid_score, recency_score, authority_score, S2 API
	state.py -- ResearchState TypedDict, Paper/Claim dataclasses
	memory.py -- SQLite session persistence
	graph.py -- LangGraph state machine, node wiring
	app.py -- Gradio UI
	eval/
	run_eval.py -- 5-architecture evaluation harness, LLM-as-judge
	questions.json -- 130-question benchmark
	ground_truth.json -- ground truth for Cat A/B
	results/ -- eval CSVs
	archived/ -- patch_contradiction.py (archived, not used in reported metrics)
	```

	---

	## Setup

	```bash
	git clone https://github.com/MukulRay1603/project-recon
	cd project-recon
	pip install -r requirements.txt
	```

	Create a `.env` file:

	```
	GROQ_API_KEY=...
	OPENALEX_API_KEY=... # free at openalex.org/settings/api
	S2_API_KEY=... # optional but recommended
	TAVILY_API_KEY=... # optional fallback web search
	```

	```bash
	python app.py
	```

	---

	## Tech Stack

	\| Component \| Choice \|
	\|---\|---\|
	\| Orchestration \| LangGraph \|
	\| LLM \| Llama 3.3 70B via Groq \|
	\| Embeddings \| all-MiniLM-L6-v2 (sentence-transformers) \|
	\| Paper APIs \| Semantic Scholar + OpenAlex \|
	\| Web search \| DuckDuckGo (Tavily fallback) \|
	\| Session memory \| SQLite \|
	\| UI \| Gradio \|
	\| Deployment \| Hugging Face Spaces \|

	---

	## Author

	Mukul Ray -- MS Applied ML, University of Maryland College Park
	GitHub: [@MukulRay1603](https://github.com/MukulRay1603)