Commit 0bf564b by MukulRay · 1 Parent(s): bcb9161

Phase 14: README

  1. README.md +81 -24
README.md CHANGED
@@ -11,49 +11,106 @@ license: mit
  short_description: Multi-agent ML literature research with staleness detection
  ---
 
- # RECON - Multi-Agent Research Navigator
 
- **Temporally-aware ML literature research. Live Semantic Scholar. Staleness detection.**
 
- RECON is a four-agent LangGraph system that retrieves live ML papers, evaluates evidence quality using a four-verdict critic, and synthesizes research positions with per-claim confidence scoring.
 
- ## What makes it different from standard RAG
 
- Standard RAG retrieves the most semantically similar chunk with no mechanism to detect whether that chunk has been superseded. A 2019 paper cited 600 times and never contradicted is strong evidence. A 2019 paper that a 2023 paper explicitly refutes is weak evidence, regardless of its cosine similarity score. RECON's critic reasons about this distinction.
 
- ## Architecture
 
  ```
  session_loader → planner → retriever → critic → synthesizer
- ↓ STALE/CONTRADICTED/INSUFFICIENT
- retry_retriever → critic (max 2x)
  ```
 
- **Four agents:**
- - **Planner** - decomposes query into temporally-typed sub-questions (foundational / recent / contested)
- - **Retriever** - searches Semantic Scholar (200M+ papers) + DuckDuckGo with hybrid scoring
- - **Critic** - four-verdict taxonomy: PASS / STALE / CONTRADICTED / INSUFFICIENT
- - **Synthesizer** - structured position with inline citations and per-claim confidence
 
- ## Eval results (130-question ground truth dataset)
 
- | Architecture | Position Acc | Staleness Catch | Latency |
  |---|---|---|---|
  | Single-agent RAG | 32.3% | 0% | 4.8s |
  | Naive multi-agent | 44.6% | 0% | 23.9s |
  | **RECON (linear decay)** | **43.9%** | **52%** | **17.1s** |
 
- RECON catches 52% of superseded claims vs 0% for single-pass RAG on Category B questions sourced from real survey paper supersession chains.
 
  ## Tech stack
 
- - **Orchestration:** LangGraph
- - **LLM:** Groq / LLaMA 3.3-70B
- - **Retrieval:** Semantic Scholar REST API + DuckDuckGo
- - **Embeddings:** all-MiniLM-L6-v2
- - **Session memory:** SQLite
- - **Eval:** Ragas + LLM-as-judge
 
- ## GitHub
 
- [github.com/MukulRay1603/project-recon](https://github.com/MukulRay1603/project-recon)
 
  short_description: Multi-agent ML literature research with staleness detection
  ---
 
+ # RECON - Research Navigator
 
+ **Multi-agent ML literature research with temporal reasoning.**
 
+ Most RAG systems treat retrieval as a similarity problem. RECON treats it as an evidence quality problem. There's a difference: a 2019 paper with 800 citations is strong evidence for what the field *believed* in 2019. If a 2023 paper explicitly supersedes it, that same chunk is now weak evidence, regardless of its cosine score. Standard RAG has no way to detect this. RECON does.
 
+ > 🔬 **Live demo above** - try a query like *"What is the current state of KV cache compression in LLMs?"*
 
+ ---
+
+ ## How it works
 
+ RECON runs a four-agent pipeline on every query:
 
  ```
  session_loader → planner → retriever → critic → synthesizer
+ ↓
+ [STALE / CONTRADICTED / INSUFFICIENT]
+ retry_retriever → critic
+ (up to 2 retries)
  ```
 
+ **Planner** - breaks the query into temporally-typed sub-questions: foundational (what's established), recent (what's changed), and contested (what's still debated).
+
+ **Retriever** - queries Semantic Scholar's live index (200M+ papers) and DuckDuckGo, then ranks results with recency-weighted hybrid scoring. Three decay configurations are available: none, linear, and log.
+
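The README names three decay configurations but does not give their formulas. The sketch below is one plausible reading, not the project's actual scoring code: the function names, the 10-year `horizon`, the log shape, and the `alpha` blend weight are all assumptions made for illustration.

```python
import math


def recency_weight(age_years: float, decay: str = "linear", horizon: float = 10.0) -> float:
    """Map a paper's age in years to a weight in [0, 1].

    The decay shapes and the 10-year horizon are illustrative assumptions.
    """
    age = max(age_years, 0.0)
    if decay == "none":
        return 1.0  # age is ignored entirely
    if decay == "linear":
        return max(0.0, 1.0 - age / horizon)  # reaches zero at the horizon
    if decay == "log":
        return 1.0 / (1.0 + math.log1p(age))  # fast early drop, long tail
    raise ValueError(f"unknown decay config: {decay!r}")


def hybrid_score(similarity: float, age_years: float,
                 decay: str = "linear", alpha: float = 0.7) -> float:
    """Blend semantic similarity with recency; alpha is a hypothetical weight."""
    return alpha * similarity + (1.0 - alpha) * recency_weight(age_years, decay)
```

Under these assumptions, `decay="none"` reduces the score to similarity plus a constant, while linear decay removes the recency bonus entirely for papers older than the horizon.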
+ **Critic** - the core of the system. Issues one of four verdicts per retrieval pass:
+ - `PASS`: evidence is recent, sufficient, no contradictions
+ - `STALE`: retrieved papers have been superseded by more recent work
+ - `CONTRADICTED`: claims conflict across retrieved sources
+ - `INSUFFICIENT`: not enough high-quality evidence to synthesize
+
+ If the critic flags anything other than PASS, the retriever tries again with a refined query (max 2 retries). This retry loop is what drives the staleness catch rate improvement.
+
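The critic-driven retry loop described above can be sketched in plain Python. This is a minimal illustration rather than the project's LangGraph wiring: `run_with_retries` and the `retrieve`, `critique`, and `refine` callables are hypothetical stand-ins for the retriever, critic, and query-refinement steps.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    PASS = "PASS"
    STALE = "STALE"
    CONTRADICTED = "CONTRADICTED"
    INSUFFICIENT = "INSUFFICIENT"


MAX_RETRIES = 2  # matches the "max 2 retries" limit stated in the README


def run_with_retries(
    query: str,
    retrieve: Callable[[str], List[str]],
    critique: Callable[[List[str]], Verdict],
    refine: Callable[[str, Verdict], str],
) -> Tuple[Verdict, List[str], int]:
    """Retrieve, critique, and retry with a refined query until PASS or the limit."""
    papers = retrieve(query)
    verdict = critique(papers)
    retries = 0
    while verdict is not Verdict.PASS and retries < MAX_RETRIES:
        query = refine(query, verdict)  # e.g. bias the query toward newer work
        papers = retrieve(query)
        verdict = critique(papers)
        retries += 1
    # The final verdict may still be non-PASS if every retry failed.
    return verdict, papers, retries
```

In this sketch the synthesizer would receive whatever evidence survives the last pass, along with the final verdict, so a persistent non-PASS result stays visible downstream.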
+ **Synthesizer** - produces a structured research position: overview, key findings, active debates, and a per-claim confidence table with source attribution.
+
+ ---
+
+ ## Eval results
 
+ Evaluated on a 130-question ground truth dataset across three categories: consensus claims (Category A), superseded claims (Category B), and contested claims (Category C). Ground truth was sourced from real survey paper abstracts, not synthetic data.
 
+ | Architecture | Position Accuracy | Staleness Catch Rate | Avg Latency |
  |---|---|---|---|
  | Single-agent RAG | 32.3% | 0% | 4.8s |
  | Naive multi-agent | 44.6% | 0% | 23.9s |
+ | RECON (no decay) | 47.7% | 42% | 21.8s |
  | **RECON (linear decay)** | **43.9%** | **52%** | **17.1s** |
+ | RECON (log decay) | 43.1% | 38% | 15.9s |
+
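The README does not define how the staleness catch rate column is computed. One plausible definition, shown here purely as an assumption, is the share of Category B (superseded-claim) questions on which the critic issued `STALE` at least once. The record layout (`category`, `verdicts`) is hypothetical.

```python
from typing import Iterable, List, Mapping


def staleness_catch_rate(results: Iterable[Mapping[str, object]]) -> float:
    """Fraction of Category B questions with at least one STALE verdict.

    Assumed record shape: {"category": "A" | "B" | "C", "verdicts": [str, ...]}.
    """
    category_b: List[Mapping[str, object]] = [
        r for r in results if r["category"] == "B"
    ]
    if not category_b:
        return 0.0  # avoid dividing by zero on an empty slice
    caught = sum(1 for r in category_b if "STALE" in r["verdicts"])
    return caught / len(category_b)
```

Under this definition a single-pass system without a critic scores 0% by construction, which is consistent with the two baseline rows in the table.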
+ **Linear decay was selected as optimal**: it has the highest staleness catch rate (52%) at reasonable latency. The position accuracy tradeoff vs. no decay (43.9% vs. 47.7%) is acceptable because staleness detection is the primary goal.
+
+ **On the calibration anomaly:** STALE-verdict queries achieve *higher* position accuracy than PASS-verdict queries. This is explainable: the retry loop, triggered by a STALE verdict, fetches fresher evidence, which yields a better final answer than a first-pass PASS on borderline evidence.
+
+ **Known limitation:** Contradiction catch rate is 0% in the current system. The STALE check fires before the CONTRADICTED check in the critic, so contradictions are frequently reclassified as staleness. This is documented as future work: fixing it requires either reordering the verdict logic (which risks false positives) or a separate contradiction scorer at eval time.
+
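The masking effect in the known limitation follows from first-match verdict logic. The sketch below is a simplified model, not the critic's real code: `first_match_verdict` is a hypothetical helper, the boolean `signals` dict stands in for whatever checks the critic runs, and the position of `INSUFFICIENT` in the default order is an assumption (the README only states that STALE is checked before CONTRADICTED).

```python
from typing import Mapping, Sequence


def first_match_verdict(
    signals: Mapping[str, bool],
    order: Sequence[str] = ("STALE", "CONTRADICTED", "INSUFFICIENT"),
) -> str:
    """Return the first verdict in `order` whose signal fired, else PASS."""
    for name in order:
        if signals.get(name, False):
            return name  # later checks are never reached
    return "PASS"


# Evidence that is both superseded and internally conflicting:
both = {"STALE": True, "CONTRADICTED": True}

default_result = first_match_verdict(both)  # STALE masks the contradiction
reordered_result = first_match_verdict(
    both, order=("CONTRADICTED", "STALE", "INSUFFICIENT")
)  # the contradiction now wins, and staleness is masked instead
```

Reordering only moves the masking from one verdict to the other, which is one reason a separate contradiction scorer may be the cleaner of the two fixes mentioned above.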
+ ---
 
+ ## Architecture diagram
+
+ ```mermaid
+ graph LR
+ A[session_loader] --> B[planner]
+ B --> C[retriever]
+ C --> D[critic]
+ D -->|PASS| E[synthesizer]
+ D -->|STALE / CONTRADICTED / INSUFFICIENT| F[retry_retriever]
+ F --> D
+ E --> G[END]
+ ```
+
+ ---
+
+ ## Superseded claims reference dataset
+
+ As part of building the eval, 43 ML claims were catalogued across four subfields (LLM efficiency, training methods, RAG, multimodal) where newer survey papers document explicit supersession. This dataset was used to construct Category B of the eval and seed the staleness detection ground truth.
+
+ This is a **reference dataset used in evaluation**, not auto-generated from live queries. Once the Space accumulates real user traffic, the `verdict_log` table in session memory will power a live leaderboard generated from actual pipeline verdicts.
+
+ ---
 
  ## Tech stack
 
+ | Component | Choice |
+ |---|---|
+ | Orchestration | LangGraph |
+ | LLM | Groq / LLaMA 3.3-70B-versatile |
+ | Primary retrieval | Semantic Scholar REST API (direct, no library) |
+ | Fallback retrieval | DuckDuckGo (`ddgs`) + Tavily |
+ | Embeddings | `all-MiniLM-L6-v2` |
+ | Session memory | SQLite |
+ | Eval | LLM-as-judge + custom staleness catch rate metric |
+ | UI | Gradio 6.10 |
+
+ One deliberate choice worth noting: the `semanticscholar` PyPI library was explicitly avoided due to a pagination hang bug on large result sets. All S2 calls go through direct `requests.get()` to `graph/v1/paper/search`.
+
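A direct call to the Semantic Scholar search endpoint might look like the sketch below. The endpoint path comes from the README. The helper names, the chosen `fields`, the timeout, and the cap of 100 results per request are assumptions for illustration, not the project's retriever.

```python
from typing import Any, Dict, List

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_params(query: str, limit: int = 20, offset: int = 0) -> Dict[str, Any]:
    """Build query parameters for one page of search results."""
    return {
        "query": query,
        "limit": min(max(limit, 1), 100),  # keep each page small and bounded
        "offset": max(offset, 0),
        "fields": "title,year,citationCount,abstract",
    }


def search_papers(query: str, limit: int = 20, timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Fetch a single page of results with an explicit timeout."""
    # Third-party dependency, imported here so build_search_params stays stdlib-only.
    import requests

    response = requests.get(
        S2_SEARCH_URL, params=build_search_params(query, limit), timeout=timeout
    )
    response.raise_for_status()
    return response.json().get("data", [])
```

Requesting one bounded page with an explicit `timeout` gives the caller direct control over pagination, so a slow or very large result set raises an error instead of blocking the pipeline.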
+ ---
+
+ ## Repo
 
+ [github.com/MukulRay1603/project-recon](https://github.com/MukulRay1603/project-recon)
 
+ Built as a portfolio project (MS Applied ML, UMD College Park).