VibecoderMcSwaggins committed on
Commit
3ce1e8b
·
1 Parent(s): 8d97867

docs: Remove outdated documentation files


- Delete the `index.md` and `STATUS_LLAMAINDEX_INTEGRATION.md` files as they are no longer relevant to the current project structure and documentation needs.
- Add a new `workflow-diagrams.md` file to provide a comprehensive overview of the updated Magentic architecture and workflow, enhancing clarity for users and developers.

docs/STATUS_LLAMAINDEX_INTEGRATION.md DELETED
@@ -1,228 +0,0 @@
# After This PR: What's Working, What's Missing, What's Next

**TL;DR:** DeepBoner is a **fully working** biomedical research agent. The LlamaIndex integration we just completed is wired in correctly. The system can search PubMed, ClinicalTrials.gov, and Europe PMC, deduplicate evidence semantically, and generate research reports. **It's ready for hackathon submission.**

---

## What Does LlamaIndex Actually Do Here?

**Short answer:** LlamaIndex provides **better embeddings + persistence** when you have an OpenAI API key.

```
User has OPENAI_API_KEY → LlamaIndex (OpenAI embeddings, disk persistence)
User has NO API key     → Local embeddings (sentence-transformers, in-memory)
```

### What it does:
1. **Embeds evidence** - Converts paper abstracts to vectors for semantic search
2. **Stores to disk** - Evidence survives app restart (ChromaDB PersistentClient)
3. **Deduplicates** - Prevents storing near-identical papers (0.9 similarity threshold)
4. **Retrieves context** - Judge gets top-30 semantically relevant papers, not random ones

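Step 3 above is just a cosine-similarity gate. A toy sketch of the idea (hypothetical function names, plain-Python vectors rather than the real ChromaDB call):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def is_duplicate(new_vec, stored_vecs, threshold: float = 0.9) -> bool:
    """True if the new embedding is at least `threshold`-similar to any
    stored one; such papers are skipped instead of stored again."""
    return any(cosine_similarity(new_vec, v) >= threshold for v in stored_vecs)
```

In practice the comparison runs against the vector store's nearest-neighbor query rather than a linear scan, but the accept/reject rule is the same.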

### What it does NOT do:
- **Primary search** - PubMed/ClinicalTrials return results; LlamaIndex stores them
- **Ranking** - No reranking of search results (they come pre-ranked from the APIs)
- **Query routing** - Doesn't decide which database to search

---

## Is This a "Real" RAG System?

**Yes, but simpler than you might expect.**

```
Traditional RAG:  Query → Retrieve from vector DB → Generate with context
DeepBoner's RAG:  Query → Search APIs → Store in vector DB → Judge with context
```

We're doing **"Search-and-Store RAG"**, not "Retrieve-and-Generate RAG":
- Evidence comes from **real biomedical APIs** (PubMed, etc.), not a pre-built knowledge base
- The vector DB is for **deduplication + context windowing**, not primary retrieval
- The "retrieval" happens from external APIs, not from embeddings

**This is the RIGHT architecture** for a research agent - you want fresh, authoritative sources (PubMed), not a static knowledge base.

---

## Do We Need Neo4j / FAISS / More Complex RAG?

**No.** Here's why:

| You might think you need... | But actually... |
|----------------------------|-----------------|
| Neo4j for knowledge graphs | Evidence relationships are implicit in citations/abstracts |
| FAISS for fast search | ChromaDB handles our scale (hundreds of papers, not millions) |
| Complex ingestion pipeline | Our pipeline IS working: Search → Dedupe → Store → Retrieve |
| Reranking models | PubMed already ranks by relevance; the judge handles scoring |

**The bottleneck is NOT the vector store.** It's:
1. API rate limits (PubMed: 3 req/sec without a key, 10 with a key)
2. LLM context windows (the judge can only see ~30 papers effectively)
3. Search query quality (garbage in, garbage out)
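Bottleneck 1 is typically handled with a client-side throttle. A minimal sketch (illustrative only, not the project's actual rate-limit code) that enforces a minimum gap between requests:

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between requests, e.g. 3/sec for PubMed
    E-utilities without an API key, 10/sec with one. Sketch only."""

    def __init__(self, max_per_second: float) -> None:
        self.min_interval = 1.0 / max_per_second
        self._last = float("-inf")  # first call never sleeps

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = time.monotonic()
        sleep_for = self.min_interval - (now - self._last)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

Calling `wait()` before each API request keeps the client under the cap without tracking any per-endpoint state.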

---

## What's Actually Working (End-to-End)

### Core Research Loop
```
User Query: "What drugs improve female libido post-menopause?"
    ↓
[1] SearchHandler queries 3 databases in parallel
    ├─ PubMed: 10 results
    ├─ ClinicalTrials.gov: 5 results
    └─ Europe PMC: 10 results
    ↓
[2] ResearchMemory deduplicates (25 → 18 unique)
    ↓
[3] Evidence stored in ChromaDB/LlamaIndex
    ↓
[4] Judge gets top-30 by semantic similarity
    ↓
[5] Judge scores: mechanism=7/10, clinical=6/10
    ↓
[6] Judge says: "Need more on flibanserin mechanism"
    ↓
[7] Loop with new queries (up to 10 iterations)
    ↓
[8] Generate report with drug candidates + findings
```

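The diagram above reduces to a simple control loop. A sketch with stubbed collaborators (the function names are hypothetical, not the actual `SimpleOrchestrator` API):

```python
def research_loop(question, search, judge, synthesize, max_iterations=10):
    """Sketch of the search → judge → synthesize loop.

    search(q)           -> list of evidence items
    judge(evidence)     -> dict with 'sufficient' and 'next_queries'
    synthesize(evidence)-> final report
    """
    evidence: list = []
    queries = [question]
    for _ in range(max_iterations):
        for q in queries:
            evidence.extend(search(q))
        verdict = judge(evidence)
        if verdict["sufficient"]:
            break  # code-enforced stop: the LLM only advises
        queries = verdict["next_queries"]
    return synthesize(evidence)
```

Note that the loop, the iteration cap, and the stop decision live in plain code; the judge only supplies the signal, which keeps the agent's behavior bounded.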
### What Each Component Does

| Component | Status | What It Does |
|-----------|--------|--------------|
| `SearchHandler` | Working | Parallel search across 3 databases |
| `ResearchMemory` | Working | Stores evidence, tracks hypotheses |
| `EmbeddingService` | Working | Free tier: local sentence-transformers |
| `LlamaIndexRAGService` | Working | Premium tier: OpenAI embeddings + persistence |
| `JudgeHandler` | Working | LLM scores evidence, suggests next queries |
| `SimpleOrchestrator` | Working | Main research loop (search → judge → synthesize) |
| `AdvancedOrchestrator` | Working | Multi-agent mode (requires agent-framework) |
| Gradio UI | Working | Chat interface with streaming events |

---

## What's Missing (But Not Blocking)

### 1. **Active Knowledge Base Querying** (P2)
Currently: Judge guesses what to search next
Should: Judge checks "what do we already have?" before suggesting new queries

**Impact:** Could reduce redundant searches
**Effort:** Medium (modify the judge prompt to include a memory summary)

### 2. **Evidence Diversity Selection** (P2)
Currently: Judge sees top-30 by relevance (which might be redundant)
Should: Use MMR (Maximal Marginal Relevance) for diversity

**Impact:** Better coverage of different perspectives
**Effort:** Low (we have `select_diverse_evidence()` but it's not used everywhere)
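MMR greedily picks the next item by relevance minus redundancy with what's already selected. A generic sketch of the technique (the project's `select_diverse_evidence()` may be implemented differently):

```python
def mmr_select(candidates, relevance, similarity, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    Each round, take the candidate maximizing
        lam * relevance(c) - (1 - lam) * max_similarity(c, selected)
    so early picks favor relevance and later picks favor novelty.
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance(c) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lam` near 1 this degenerates to plain top-k by relevance; lowering it trades relevance for coverage of different papers.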

### 3. **Singleton Pattern for LlamaIndex** (P3)
Currently: Each call creates a new LlamaIndexRAGService instance
Should: Cache it like `_shared_model` in EmbeddingService

**Impact:** Minor performance improvement
**Effort:** Low
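The `_shared_model` pattern referenced above is just a module-level cache. A minimal sketch with hypothetical names:

```python
_shared_service = None  # module-level cache, built on first use


def get_rag_service(factory):
    """Return one shared instance, constructing it only on first call.

    Mirrors the `_shared_model` cache in EmbeddingService; `factory`
    stands in for the real (possibly expensive) service constructor.
    """
    global _shared_service
    if _shared_service is None:
        _shared_service = factory()
    return _shared_service
```

Repeat calls then reuse the already-loaded embeddings and index instead of paying the construction cost each time.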

### 4. **Evidence Quality Scoring** (P3)
Currently: Judge gives overall scores (mechanism + clinical)
Should: Score each paper (study design, sample size, etc.)

**Impact:** Better synthesis quality
**Effort:** High (significant prompt engineering)

---

## What's Definitely NOT Needed

| Over-engineering | Why it's unnecessary |
|------------------|---------------------|
| GraphRAG / Neo4j | Our scale is hundreds of papers, not knowledge graphs |
| FAISS / Pinecone | ChromaDB handles our volume fine |
| Custom embedding models | OpenAI/sentence-transformers work well for biomedical text |
| Complex chunking strategies | We're storing abstracts (already short) |
| Hybrid search (BM25 + vector) | The APIs already do keyword matching |

---

## Hackathon Submission Checklist

- [x] Core research loop working
- [x] 3 biomedical databases integrated (PubMed, ClinicalTrials, Europe PMC)
- [x] Semantic deduplication working
- [x] Judge assessment working
- [x] Report generation working
- [x] Gradio UI working
- [x] 202 tests passing
- [x] Tiered embedding service (free vs premium)
- [x] LlamaIndex integration complete

**You're ready to submit.**

---

## Post-Hackathon Roadmap

### Phase 1: Polish (1-2 days)
- [ ] Add singleton pattern for the LlamaIndex service
- [ ] Integration test with real API keys
- [ ] Verify persistence works on HuggingFace Spaces

### Phase 2: Intelligence (1 week)
- [ ] Judge queries memory before suggesting searches
- [ ] MMR diversity selection for evidence context
- [ ] Hypothesis-driven search refinement

### Phase 3: Scale (2+ weeks)
- [ ] Rate limit handling improvements
- [ ] Batch embedding for large evidence sets
- [ ] Multi-query parallelization
- [ ] Export to structured formats (JSON, BibTeX)

### Phase 4: Production (future)
- [ ] User authentication
- [ ] Persistent user sessions
- [ ] Evidence caching across users
- [ ] Usage analytics

---

## Quick Reference: Where Things Are

```
src/
├── orchestrators/
│   ├── simple.py             # Main research loop (START HERE)
│   └── advanced.py           # Multi-agent mode
├── services/
│   ├── embeddings.py         # Free tier (sentence-transformers)
│   ├── llamaindex_rag.py     # Premium tier (OpenAI + persistence)
│   ├── embedding_protocol.py # Interface both implement
│   └── research_memory.py    # Evidence storage + retrieval
├── tools/
│   ├── pubmed.py             # PubMed E-utilities
│   ├── clinicaltrials.py     # ClinicalTrials.gov API
│   └── europepmc.py          # Europe PMC API
├── agent_factory/
│   └── judges.py             # LLM judge (assess evidence sufficiency)
└── utils/
    ├── config.py             # Environment variables
    ├── service_loader.py     # Tiered service selection
    └── models.py             # Evidence, Citation, etc.
```

---

## The Bottom Line

**DeepBoner is not missing anything critical.** The LlamaIndex integration you just completed was the last major infrastructure piece. What remains is optimization and polish, not core functionality.

The system works like this:
1. **Search real databases** (not a vector store)
2. **Store + deduplicate** (this is where LlamaIndex helps)
3. **Judge with context** (top-30 semantically relevant papers)
4. **Loop or synthesize** (code-enforced decision)

This is a sensible architecture for a research agent. You don't need more complexity - you need to ship it.
docs/{workflow-diagrams.md β†’ architecture/workflow-diagrams.md} RENAMED
File without changes
docs/index.md DELETED
@@ -1,107 +0,0 @@
# DeepBoner Documentation

## Sexual Health Research Agent

AI-powered deep research system for sexual wellness, reproductive health, and hormone therapy research.

---

## Quick Links

### Architecture
- **[Overview](architecture/overview.md)** - Project overview, use case, architecture
- **[Design Patterns](architecture/design-patterns.md)** - Technical patterns, data models
- **[Workflow Diagrams](workflow-diagrams.md)** - Visual architecture (Magentic v2.0)

### Implementation (Phases 1-14 ✅ COMPLETE)
- **[Roadmap](implementation/roadmap.md)** - Phased execution plan with TDD
- **[Phase 1: Foundation](implementation/01_phase_foundation.md)** ✅ - Tooling, config, first tests
- **[Phase 2: Search](implementation/02_phase_search.md)** ✅ - PubMed search
- **[Phase 3: Judge](implementation/03_phase_judge.md)** ✅ - LLM evidence assessment
- **[Phase 4: UI](implementation/04_phase_ui.md)** ✅ - Orchestrator + Gradio
- **[Phase 5: Magentic](implementation/05_phase_magentic.md)** ✅ - Multi-agent orchestration
- **[Phase 6: Embeddings](implementation/06_phase_embeddings.md)** ✅ - Semantic search + dedup
- **[Phase 7: Hypothesis](implementation/07_phase_hypothesis.md)** ✅ - Mechanistic reasoning
- **[Phase 8: Report](implementation/08_phase_report.md)** ✅ - Structured scientific reports
- **[Phase 9: Source Cleanup](implementation/09_phase_source_cleanup.md)** ✅ - Remove DuckDuckGo
- **[Phase 10: ClinicalTrials](implementation/10_phase_clinicaltrials.md)** ✅ - Clinical trials API
- **[Phase 11: Europe PMC](implementation/11_phase_europepmc.md)** ✅ - Preprint search
- **[Phase 12: MCP Server](implementation/12_phase_mcp_server.md)** ✅ - Claude Desktop integration
- **[Phase 13: Modal Integration](implementation/13_phase_modal_integration.md)** ✅ - Secure code execution
- **[Phase 14: Demo Submission](implementation/14_phase_demo_submission.md)** ✅ - Hackathon submission

### Future Roadmap
- **[Overview](future-roadmap/phases/README.md)** - Planned phases 15-17
- **[Phase 15: OpenAlex](future-roadmap/phases/15_PHASE_OPENALEX.md)** - Citation network integration
- **[Phase 16: PubMed Full-text](future-roadmap/phases/16_PHASE_PUBMED_FULLTEXT.md)** - BioC API
- **[Phase 17: Rate Limiting](future-roadmap/phases/17_PHASE_RATE_LIMITING.md)** - Production hardening
- **[Deep Research Mode](future-roadmap/DEEP_RESEARCH_ROADMAP.md)** - GPT-Researcher style enhancements

### Bugs & Issues
- **[Active Bugs](bugs/ACTIVE_BUGS.md)** - Current issues and workarounds

### Decisions
- **[PR #55 Evaluation](decisions/2025-11-27-pr55-evaluation.md)** - Architecture decision record
- **[Magentic + PydanticAI](decisions/architecture-2025-11/)** - Framework architecture decisions

### Guides
- **[Deployment Guide](guides/deployment.md)** - Gradio, MCP, and Modal launch steps

### Development
- **[Testing Strategy](development/testing.md)** - Unit, integration, and E2E testing patterns

### Brainstorming (Source Improvements)
- **[Roadmap Summary](brainstorming/00_ROADMAP_SUMMARY.md)** - Data source enhancement ideas
- **[PubMed Improvements](brainstorming/01_PUBMED_IMPROVEMENTS.md)**
- **[ClinicalTrials Improvements](brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md)**
- **[Europe PMC Improvements](brainstorming/03_EUROPEPMC_IMPROVEMENTS.md)**

---

## What We're Building

**One-liner**: AI agent that searches medical literature to find evidence for sexual health research questions.

**Example Queries**:
> "What drugs improve female libido post-menopause?"
> "Evidence for testosterone therapy in women with HSDD?"
> "Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?"

**Output**: Research report with drug candidates, mechanisms, evidence quality, and citations.

---

## Architecture Summary

```
User Question → Research Agent (Orchestrator)
    ↓
Search Loop:
  → Tools (PubMed, ClinicalTrials, Europe PMC)
  → Judge (Quality + Budget)
  → Repeat or Synthesize
    ↓
Research Report with Citations
```

---

## Features

| Feature | Status | Description |
|---------|--------|-------------|
| **Gradio UI** | ✅ Complete | Streaming chat interface |
| **MCP Server** | ✅ Complete | Tools accessible from Claude Desktop |
| **Modal Sandbox** | ✅ Complete | Secure statistical analysis |
| **Multi-Source Search** | ✅ Complete | PubMed, ClinicalTrials, Europe PMC |

---

## Status

| Phase | Status |
|-------|--------|
| Phases 1-14 | ✅ COMPLETE |

**Tests**: 318 passing, 0 warnings
**Known Issues**: See [Active Bugs](bugs/ACTIVE_BUGS.md)