docs: rewrite README for v2 — edge reliability, updated eval numbers, clean authorship
README.md
CHANGED
@@ -11,108 +11,15 @@ license: mit
11   short_description: Multi-agent ML literature research with staleness detection
12   ---
13
14 - # RECON —
15
16 -
17
18 -
19
20 -
21
22 - --
23 -
24 - ## How it works
25 -
26 - RECON runs a four-agent pipeline on every query:
27 -
28 - ```
29 - session_loader → planner → retriever → critic → synthesizer
30 - ↓
31 - [STALE / CONTRADICTED / INSUFFICIENT]
32 - retry_retriever → critic
33 - (up to 2 retries)
34 - ```
35 -
36 - **Planner** — breaks the query into temporally-typed sub-questions: foundational (what's established), recent (what's changed), and contested (what's still debated).
37 -
38 - **Retriever** — hits Semantic Scholar's live index (200M+ papers) and DuckDuckGo, then scores results using recency-weighted hybrid scoring. Three decay configs available: none, linear, log.
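
Review note: the removed Retriever text names three decay configs (none, linear, log) but the diff does not show how they are computed. A minimal sketch of what recency-weighted hybrid scoring might look like follows. The function names, the 10-year `horizon`, and the `alpha` blend weight are all assumptions made for illustration; none of them come from the repository.

```python
import math


def recency_weight(age_years: float, decay: str = "linear",
                   horizon: float = 10.0) -> float:
    """Return a multiplier in [0, 1] for a paper of the given age.

    `horizon` (hypothetical) is the age at which the weight reaches zero.
    """
    age = max(age_years, 0.0)
    if decay == "none":
        return 1.0
    if decay == "linear":
        # Drops at a constant rate from 1 at age 0 to 0 at the horizon.
        return max(0.0, 1.0 - age / horizon)
    if decay == "log":
        # Drops quickly for young papers, then flattens; 0 at the horizon.
        return max(0.0, 1.0 - math.log1p(age) / math.log1p(horizon))
    raise ValueError(f"unknown decay config: {decay!r}")


def hybrid_score(relevance: float, age_years: float,
                 decay: str = "linear", alpha: float = 0.7) -> float:
    """Blend a relevance score with the recency weight.

    `alpha` is an assumed mixing weight, not a value from the project.
    """
    return alpha * relevance + (1.0 - alpha) * recency_weight(age_years, decay)
```

Under these assumptions a five-year-old paper keeps half of its recency weight with linear decay and all of it with `none`, which is the kind of difference the eval table compares.
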
39 -
40 - **Critic** — the core of the system. Issues one of four verdicts per retrieval pass:
41 - - `PASS` — evidence is recent, sufficient, no contradictions
42 - - `STALE` — retrieved papers have been superseded by more recent work
43 - - `CONTRADICTED` — claims conflict across retrieved sources
44 - - `INSUFFICIENT` — not enough high-quality evidence to synthesize
45 -
46 - The critic combines deterministic threshold routing with an LLM-assisted contradiction check. STALE, INSUFFICIENT, and PASS verdicts are assigned based on hardcoded thresholds (mean paper age, minimum result count, score cutoffs). CONTRADICTED is determined by calling Groq with a structured pairwise prompt that returns a `{"contradicts": bool, "reason": "..."}` JSON verdict — a canonical LLM-as-a-Judge pattern applied to contradiction detection specifically.
47 -
48 - If the critic issues anything other than PASS, the retriever tries again with a refined query (max 2 retries). This retry loop is what drives the staleness catch rate improvement.
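
Review note: the verdict routing and retry loop described in the removed text can be sketched roughly as below. This is an illustration under stated assumptions, not the project's code. The threshold values, the `Paper` fields, and the `retrieve(query, attempt)` callable are hypothetical, and `contradicts` stands in for the Groq pairwise check. Placing INSUFFICIENT first is also an assumption; the only ordering the README states is that the STALE check fires before the CONTRADICTED check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Paper:
    year: int
    score: float  # hybrid retrieval score, assumed to lie in [0, 1]


# Hypothetical thresholds; the README only says they are hardcoded.
MIN_SCORE = 0.4      # score cutoff for a paper to count as evidence
MIN_RESULTS = 3      # minimum number of papers above the cutoff
MAX_MEAN_AGE = 4.0   # mean age in years above which evidence is stale


def critic(papers: List[Paper], current_year: int,
           contradicts: Callable[[List[Paper]], bool]) -> str:
    """Return one of PASS, STALE, CONTRADICTED, INSUFFICIENT."""
    strong = [p for p in papers if p.score >= MIN_SCORE]
    if len(strong) < MIN_RESULTS:
        return "INSUFFICIENT"
    mean_age = sum(current_year - p.year for p in strong) / len(strong)
    if mean_age > MAX_MEAN_AGE:
        return "STALE"  # checked before CONTRADICTED, as the README notes
    if contradicts(strong):
        return "CONTRADICTED"
    return "PASS"


def run_with_retries(retrieve: Callable[[str, int], List[Paper]],
                     contradicts: Callable[[List[Paper]], bool],
                     query: str, current_year: int,
                     max_retries: int = 2) -> Tuple[str, int]:
    """Re-run retrieval until the critic passes or retries run out."""
    attempt = 0
    verdict = critic(retrieve(query, attempt), current_year, contradicts)
    while verdict != "PASS" and attempt < max_retries:
        attempt += 1
        verdict = critic(retrieve(query, attempt), current_year, contradicts)
    return verdict, attempt
```

With this ordering, a result set that is both old and self-contradictory is reported as STALE, which matches the masking effect the README lists under its known limitation.
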
49 -
50 - **Synthesizer** — produces a structured research position: overview, key findings, active debates, and a per-claim confidence table with source attribution.
51 -
52 - ---
53 -
54 - ## Eval results
55 -
56 - Evaluated on a 130-question ground truth dataset across three categories: consensus claims (Category A), superseded claims (Category B), and contested claims (Category C). Ground truth was sourced from real survey paper abstracts — not synthetic.
57 -
58 - | Architecture | Position Accuracy | Staleness Catch Rate | Avg Latency |
59 - |---|---|---|---|
60 - | Single-agent RAG | 32.3% | 0% | 4.8s |
61 - | Naive multi-agent | 44.6% | 0% | 23.9s |
62 - | RECON (no decay) | 47.7% | 42% | 21.8s |
63 - | **RECON (linear decay)** | **43.9%** | **52%** | **17.1s** |
64 - | RECON (log decay) | 43.1% | 38% | 15.9s |
65 -
66 - **Linear decay was selected as optimal** — highest staleness catch rate (52%) at reasonable latency. The position accuracy tradeoff vs. no-decay (43.9% vs 47.7%) is acceptable given the staleness detection gain is the primary goal.
67 -
68 - **On the calibration anomaly:** STALE-verdict queries achieve *higher* position accuracy than PASS-verdict queries. This is explainable: the retry loop, triggered by a STALE verdict, fetches fresher evidence — resulting in a better final answer than a first-pass PASS on borderline evidence.
69 -
70 - **Known limitation:** Contradiction catch rate is 0% in the current system. The STALE check fires before the CONTRADICTED check in the critic, so contradictions are frequently reclassified as staleness. This is documented as future work — fixing it requires either reordering the verdict logic (which risks false positives) or a separate contradiction scorer at eval time.
71
72   ---
73 -
74 - ## Architecture diagram
75 -
76 - ```mermaid
77 - graph LR
78 - A[session_loader] --> B[planner]
79 - B --> C[retriever]
80 - C --> D[critic]
81 - D -->|PASS| E[synthesizer]
82 - D -->|STALE / CONTRADICTED / INSUFFICIENT| F[retry_retriever]
83 - F --> D
84 - E --> G[END]
85 - ```
86 -
87 - ---
88 -
89 - ## Superseded claims reference dataset
90 -
91 - As part of building the eval, 43 ML claims were catalogued across four subfields (LLM efficiency, training methods, RAG, multimodal) where newer survey papers document explicit supersession. This dataset was used to construct Category B of the eval and seed the staleness detection ground truth.
92 -
93 - This is a **reference dataset used in evaluation** — not auto-generated from live queries. Once the Space accumulates real user traffic, the `verdict_log` table in session memory will power a live leaderboard generated from actual pipeline verdicts.
94 -
95 - ---
96 -
97 - ## Tech stack
98 -
99 - | Component | Choice |
100 - |---|---|
101 - | Orchestration | LangGraph |
102 - | LLM | Groq / LLaMA 3.3-70B-versatile |
103 - | Primary retrieval | Semantic Scholar REST API (direct, no library) |
104 - | Fallback retrieval | DuckDuckGo (`ddgs`) + Tavily |
105 - | Embeddings | `all-MiniLM-L6-v2` |
106 - | Session memory | SQLite |
107 - | Eval | LLM-assisted contradiction detection (Groq structured prompt) + custom staleness catch rate metric |
108 - | UI | Gradio 6.10 |
109 -
110 - One deliberate choice worth noting: the `semanticscholar` PyPI library was explicitly avoided due to a pagination hang bug on large result sets. All S2 calls go through direct `requests.get()` to `graph/v1/paper/search`.
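
Review note: for readers unfamiliar with the direct-call approach mentioned in the removed text, a minimal sketch of a `requests.get()` call against the Semantic Scholar `graph/v1/paper/search` endpoint is shown below. The helper names, the chosen `fields`, and the timeout are assumptions for illustration and are not taken from the repository. `requests` is imported inside the network function so that the request builder can be exercised without it installed.

```python
from typing import Any, Dict, List, Tuple

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_request(query: str, limit: int = 20,
                         offset: int = 0) -> Tuple[str, Dict[str, Any]]:
    """Return the URL and query parameters for one page of results."""
    params = {
        "query": query,
        "limit": min(limit, 100),  # the search endpoint caps a page at 100
        "offset": offset,
        "fields": "title,year,citationCount",  # assumed field selection
    }
    return S2_SEARCH_URL, params


def search_papers(query: str, limit: int = 20,
                  timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Fetch one page of papers; an explicit timeout avoids silent hangs."""
    import requests  # third-party dependency, only needed for the live call

    url, params = build_search_request(query, limit)
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    # Matching papers are returned under the "data" key of the JSON body.
    return response.json().get("data", [])
```

Requesting a single bounded page with an explicit timeout is one way to sidestep the kind of pagination hang the README attributes to the wrapper library.
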
111 -
112 - ---
113 -
114 - ## Repo
115 -
116 - [github.com/MukulRay1603/project-recon](https://github.com/MukulRay1603/project-recon)
117 -
118 - Built as a portfolio project — MS Applied ML, UMD College Park.

11   short_description: Multi-agent ML literature research with staleness detection
12   ---
13
14 + # RECON — Temporally-Aware Scientific Retrieval
15
16 + A multi-agent RAG system that detects when retrieved scientific evidence has been superseded by newer work.
17
18 + **Try it:** Enter any research question. RECON retrieves papers from Semantic Scholar and OpenAlex, scores their reliability using a three-signal formula (citation centrality + recency + content coherence), and flags stale or contradicted evidence before synthesizing an answer.
19
20 + Each paper in the results shows a reliability label: FOUNDATIONAL, CURRENT, DECLINING, or SUPERSEDED.
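
Review note: the added README text names a three-signal reliability formula and four labels but does not spell out either. One way such a score could be combined and bucketed is sketched below. The equal weights, the cutoffs, and the rule that separates FOUNDATIONAL from CURRENT are all assumptions made for illustration; the actual formula lives in the repository and may differ.

```python
from typing import Tuple


def reliability(centrality: float, recency: float, coherence: float,
                weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of three signals, each assumed to lie in [0, 1]."""
    signals = (centrality, recency, coherence)
    return sum(w * s for w, s in zip(weights, signals))


def reliability_label(centrality: float, recency: float,
                      coherence: float) -> str:
    """Map the three signals to one of the four README labels.

    All cutoffs here are hypothetical.
    """
    # Old but heavily cited work is treated as foundational rather than stale.
    if centrality >= 0.8 and recency < 0.5:
        return "FOUNDATIONAL"
    score = reliability(centrality, recency, coherence)
    if score >= 0.6:
        return "CURRENT"
    if score >= 0.35:
        return "DECLINING"
    return "SUPERSEDED"
```

The separate FOUNDATIONAL branch reflects a design question any such scheme has to answer: a purely additive score would otherwise penalize a highly cited older paper for its age.
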
21
22 + **Evaluation:** 44% staleness catch rate on a 130-question benchmark of real scientific supersession chains. Single-pass RAG baseline: 0%.
23
24   ---
25 + Built by Mukul Ray | [GitHub](https://github.com/MukulRay1603/project-recon)