VibecoderMcSwaggins committed on
Commit
3ce1e8b
·
1 Parent(s): 8d97867

docs: Remove outdated documentation files


- Delete the `index.md` and `STATUS_LLAMAINDEX_INTEGRATION.md` files as they are no longer relevant to the current project structure and documentation needs.
- Add a new `workflow-diagrams.md` file to provide a comprehensive overview of the updated Magentic architecture and workflow, enhancing clarity for users and developers.

docs/STATUS_LLAMAINDEX_INTEGRATION.md DELETED
@@ -1,228 +0,0 @@
# After This PR: What's Working, What's Missing, What's Next

**TL;DR:** DeepBoner is a **fully working** biomedical research agent. The LlamaIndex integration we just completed is wired in correctly. The system can search PubMed, ClinicalTrials.gov, and Europe PMC, deduplicate evidence semantically, and generate research reports. **It's ready for hackathon submission.**

---

## What Does LlamaIndex Actually Do Here?

**Short answer:** LlamaIndex provides **better embeddings + persistence** when you have an OpenAI API key.

```
User has OPENAI_API_KEY → LlamaIndex (OpenAI embeddings, disk persistence)
User has NO API key     → Local embeddings (sentence-transformers, in-memory)
```

### What it does:
1. **Embeds evidence** - Converts paper abstracts to vectors for semantic search
2. **Stores to disk** - Evidence survives app restart (ChromaDB PersistentClient)
3. **Deduplicates** - Prevents storing near-identical papers (0.9 similarity threshold)
4. **Retrieves context** - Judge gets top-30 semantically relevant papers, not random ones

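Step 3 above is just a cosine-similarity gate. A toy sketch of the idea (hypothetical function names, plain-Python vectors rather than the real ChromaDB call):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def is_duplicate(new_vec, stored_vecs, threshold: float = 0.9) -> bool:
    """True if the new embedding is at least `threshold`-similar to any
    stored one; such papers are skipped instead of stored again."""
    return any(cosine_similarity(new_vec, v) >= threshold for v in stored_vecs)
```

In practice the comparison runs against the vector store's nearest-neighbor query rather than a linear scan, but the accept/reject rule is the same.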

### What it does NOT do:
- **Primary search** - PubMed/ClinicalTrials return results; LlamaIndex stores them
- **Ranking** - No reranking of search results (they come pre-ranked from the APIs)
- **Query routing** - Doesn't decide which database to search

---

## Is This a "Real" RAG System?

**Yes, but simpler than you might expect.**

```
Traditional RAG:  Query → Retrieve from vector DB → Generate with context
DeepBoner's RAG:  Query → Search APIs → Store in vector DB → Judge with context
```

We're doing **"Search-and-Store RAG"**, not "Retrieve-and-Generate RAG":
- Evidence comes from **real biomedical APIs** (PubMed, etc.), not a pre-built knowledge base
- The vector DB is for **deduplication + context windowing**, not primary retrieval
- The "retrieval" happens from external APIs, not from embeddings

**This is the RIGHT architecture** for a research agent - you want fresh, authoritative sources (PubMed), not a static knowledge base.

---

## Do We Need Neo4j / FAISS / More Complex RAG?

**No.** Here's why:

| You might think you need... | But actually... |
|----------------------------|-----------------|
| Neo4j for knowledge graphs | Evidence relationships are implicit in citations/abstracts |
| FAISS for fast search | ChromaDB handles our scale (hundreds of papers, not millions) |
| Complex ingestion pipeline | Our pipeline IS working: Search → Dedupe → Store → Retrieve |
| Reranking models | PubMed already ranks by relevance; the judge handles scoring |

**The bottleneck is NOT the vector store.** It's:
1. API rate limits (PubMed: 3 req/sec without a key, 10 with a key)
2. LLM context windows (the judge can only see ~30 papers effectively)
3. Search query quality (garbage in, garbage out)
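Bottleneck 1 is typically handled with a client-side throttle. A minimal sketch (illustrative only, not the project's actual rate-limit code) that enforces a minimum gap between requests:

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between requests, e.g. 3/sec for PubMed
    E-utilities without an API key, 10/sec with one. Sketch only."""

    def __init__(self, max_per_second: float) -> None:
        self.min_interval = 1.0 / max_per_second
        self._last = float("-inf")  # first call never sleeps

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = time.monotonic()
        sleep_for = self.min_interval - (now - self._last)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

Calling `wait()` before each API request keeps the client under the cap without tracking any per-endpoint state.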

---

## What's Actually Working (End-to-End)

### Core Research Loop
```
User Query: "What drugs improve female libido post-menopause?"
    ↓
[1] SearchHandler queries 3 databases in parallel
    ├─ PubMed: 10 results
    ├─ ClinicalTrials.gov: 5 results
    └─ Europe PMC: 10 results
    ↓
[2] ResearchMemory deduplicates (25 → 18 unique)
    ↓
[3] Evidence stored in ChromaDB/LlamaIndex
    ↓
[4] Judge gets top-30 by semantic similarity
    ↓
[5] Judge scores: mechanism=7/10, clinical=6/10
    ↓
[6] Judge says: "Need more on flibanserin mechanism"
    ↓
[7] Loop with new queries (up to 10 iterations)
    ↓
[8] Generate report with drug candidates + findings
```

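The diagram above reduces to a simple control loop. A sketch with stubbed collaborators (the function names are hypothetical, not the actual `SimpleOrchestrator` API):

```python
def research_loop(question, search, judge, synthesize, max_iterations=10):
    """Sketch of the search → judge → synthesize loop.

    search(q)           -> list of evidence items
    judge(evidence)     -> dict with 'sufficient' and 'next_queries'
    synthesize(evidence)-> final report
    """
    evidence: list = []
    queries = [question]
    for _ in range(max_iterations):
        for q in queries:
            evidence.extend(search(q))
        verdict = judge(evidence)
        if verdict["sufficient"]:
            break  # code-enforced stop: the LLM only advises
        queries = verdict["next_queries"]
    return synthesize(evidence)
```

Note that the loop, the iteration cap, and the stop decision live in plain code; the judge only supplies the signal, which keeps the agent's behavior bounded.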
### What Each Component Does

| Component | Status | What It Does |
|-----------|--------|--------------|
| `SearchHandler` | Working | Parallel search across 3 databases |
| `ResearchMemory` | Working | Stores evidence, tracks hypotheses |
| `EmbeddingService` | Working | Free tier: local sentence-transformers |
| `LlamaIndexRAGService` | Working | Premium tier: OpenAI embeddings + persistence |
| `JudgeHandler` | Working | LLM scores evidence, suggests next queries |
| `SimpleOrchestrator` | Working | Main research loop (search → judge → synthesize) |
| `AdvancedOrchestrator` | Working | Multi-agent mode (requires agent-framework) |
| Gradio UI | Working | Chat interface with streaming events |

---

## What's Missing (But Not Blocking)

### 1. **Active Knowledge Base Querying** (P2)
Currently: Judge guesses what to search next
Should: Judge checks "what do we already have?" before suggesting new queries

**Impact:** Could reduce redundant searches
**Effort:** Medium (modify the judge prompt to include a memory summary)

### 2. **Evidence Diversity Selection** (P2)
Currently: Judge sees top-30 by relevance (which might be redundant)
Should: Use MMR (Maximal Marginal Relevance) for diversity

**Impact:** Better coverage of different perspectives
**Effort:** Low (we have `select_diverse_evidence()` but it's not used everywhere)
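MMR greedily picks the next item by relevance minus redundancy with what's already selected. A generic sketch of the technique (the project's `select_diverse_evidence()` may be implemented differently):

```python
def mmr_select(candidates, relevance, similarity, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    Each round, take the candidate maximizing
        lam * relevance(c) - (1 - lam) * max_similarity(c, selected)
    so early picks favor relevance and later picks favor novelty.
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance(c) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lam` near 1 this degenerates to plain top-k by relevance; lowering it trades relevance for coverage of different papers.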

### 3. **Singleton Pattern for LlamaIndex** (P3)
Currently: Each call creates a new LlamaIndexRAGService instance
Should: Cache it like `_shared_model` in EmbeddingService

**Impact:** Minor performance improvement
**Effort:** Low
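The `_shared_model` pattern referenced above is just a module-level cache. A minimal sketch with hypothetical names:

```python
_shared_service = None  # module-level cache, built on first use


def get_rag_service(factory):
    """Return one shared instance, constructing it only on first call.

    Mirrors the `_shared_model` cache in EmbeddingService; `factory`
    stands in for the real (possibly expensive) service constructor.
    """
    global _shared_service
    if _shared_service is None:
        _shared_service = factory()
    return _shared_service
```

Repeat calls then reuse the already-loaded embeddings and index instead of paying the construction cost each time.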

### 4. **Evidence Quality Scoring** (P3)
Currently: Judge gives overall scores (mechanism + clinical)
Should: Score each paper (study design, sample size, etc.)

**Impact:** Better synthesis quality
**Effort:** High (significant prompt engineering)

---

## What's Definitely NOT Needed

| Over-engineering | Why it's unnecessary |
|------------------|---------------------|
| GraphRAG / Neo4j | Our scale is hundreds of papers, not knowledge graphs |
| FAISS / Pinecone | ChromaDB handles our volume fine |
| Custom embedding models | OpenAI/sentence-transformers work well for biomedical text |
| Complex chunking strategies | We're storing abstracts (already short) |
| Hybrid search (BM25 + vector) | The APIs already do keyword matching |

---

## Hackathon Submission Checklist

- [x] Core research loop working
- [x] 3 biomedical databases integrated (PubMed, ClinicalTrials, Europe PMC)
- [x] Semantic deduplication working
- [x] Judge assessment working
- [x] Report generation working
- [x] Gradio UI working
- [x] 202 tests passing
- [x] Tiered embedding service (free vs premium)
- [x] LlamaIndex integration complete

**You're ready to submit.**

---

## Post-Hackathon Roadmap

### Phase 1: Polish (1-2 days)
- [ ] Add singleton pattern for the LlamaIndex service
- [ ] Integration test with real API keys
- [ ] Verify persistence works on HuggingFace Spaces

### Phase 2: Intelligence (1 week)
- [ ] Judge queries memory before suggesting searches
- [ ] MMR diversity selection for evidence context
- [ ] Hypothesis-driven search refinement

### Phase 3: Scale (2+ weeks)
- [ ] Rate limit handling improvements
- [ ] Batch embedding for large evidence sets
- [ ] Multi-query parallelization
- [ ] Export to structured formats (JSON, BibTeX)

### Phase 4: Production (future)
- [ ] User authentication
- [ ] Persistent user sessions
- [ ] Evidence caching across users
- [ ] Usage analytics

---

## Quick Reference: Where Things Are

```
src/
├── orchestrators/
│   ├── simple.py             # Main research loop (START HERE)
│   └── advanced.py           # Multi-agent mode
├── services/
│   ├── embeddings.py         # Free tier (sentence-transformers)
│   ├── llamaindex_rag.py     # Premium tier (OpenAI + persistence)
│   ├── embedding_protocol.py # Interface both implement
│   └── research_memory.py    # Evidence storage + retrieval
├── tools/
│   ├── pubmed.py             # PubMed E-utilities
│   ├── clinicaltrials.py     # ClinicalTrials.gov API
│   └── europepmc.py          # Europe PMC API
├── agent_factory/
│   └── judges.py             # LLM judge (assess evidence sufficiency)
└── utils/
    ├── config.py             # Environment variables
    ├── service_loader.py     # Tiered service selection
    └── models.py             # Evidence, Citation, etc.
```

---

## The Bottom Line

**DeepBoner is not missing anything critical.** The LlamaIndex integration you just completed was the last major infrastructure piece. What remains is optimization and polish, not core functionality.

The system works like this:
1. **Search real databases** (not a vector store)
2. **Store + deduplicate** (this is where LlamaIndex helps)
3. **Judge with context** (top-30 semantically relevant papers)
4. **Loop or synthesize** (code-enforced decision)

This is a sensible architecture for a research agent. You don't need more complexity - you need to ship it.
docs/{workflow-diagrams.md β†’ architecture/workflow-diagrams.md} RENAMED
File without changes
docs/index.md DELETED
@@ -1,107 +0,0 @@
# DeepBoner Documentation

## Sexual Health Research Agent

AI-powered deep research system for sexual wellness, reproductive health, and hormone therapy research.

---

## Quick Links

### Architecture
- **[Overview](architecture/overview.md)** - Project overview, use case, architecture
- **[Design Patterns](architecture/design-patterns.md)** - Technical patterns, data models
- **[Workflow Diagrams](workflow-diagrams.md)** - Visual architecture (Magentic v2.0)

### Implementation (Phases 1-14 ✅ COMPLETE)
- **[Roadmap](implementation/roadmap.md)** - Phased execution plan with TDD
- **[Phase 1: Foundation](implementation/01_phase_foundation.md)** ✅ - Tooling, config, first tests
- **[Phase 2: Search](implementation/02_phase_search.md)** ✅ - PubMed search
- **[Phase 3: Judge](implementation/03_phase_judge.md)** ✅ - LLM evidence assessment
- **[Phase 4: UI](implementation/04_phase_ui.md)** ✅ - Orchestrator + Gradio
- **[Phase 5: Magentic](implementation/05_phase_magentic.md)** ✅ - Multi-agent orchestration
- **[Phase 6: Embeddings](implementation/06_phase_embeddings.md)** ✅ - Semantic search + dedup
- **[Phase 7: Hypothesis](implementation/07_phase_hypothesis.md)** ✅ - Mechanistic reasoning
- **[Phase 8: Report](implementation/08_phase_report.md)** ✅ - Structured scientific reports
- **[Phase 9: Source Cleanup](implementation/09_phase_source_cleanup.md)** ✅ - Remove DuckDuckGo
- **[Phase 10: ClinicalTrials](implementation/10_phase_clinicaltrials.md)** ✅ - Clinical trials API
- **[Phase 11: Europe PMC](implementation/11_phase_europepmc.md)** ✅ - Preprint search
- **[Phase 12: MCP Server](implementation/12_phase_mcp_server.md)** ✅ - Claude Desktop integration
- **[Phase 13: Modal Integration](implementation/13_phase_modal_integration.md)** ✅ - Secure code execution
- **[Phase 14: Demo Submission](implementation/14_phase_demo_submission.md)** ✅ - Hackathon submission

### Future Roadmap
- **[Overview](future-roadmap/phases/README.md)** - Planned phases 15-17
- **[Phase 15: OpenAlex](future-roadmap/phases/15_PHASE_OPENALEX.md)** - Citation network integration
- **[Phase 16: PubMed Full-text](future-roadmap/phases/16_PHASE_PUBMED_FULLTEXT.md)** - BioC API
- **[Phase 17: Rate Limiting](future-roadmap/phases/17_PHASE_RATE_LIMITING.md)** - Production hardening
- **[Deep Research Mode](future-roadmap/DEEP_RESEARCH_ROADMAP.md)** - GPT-Researcher style enhancements

### Bugs & Issues
- **[Active Bugs](bugs/ACTIVE_BUGS.md)** - Current issues and workarounds

### Decisions
- **[PR #55 Evaluation](decisions/2025-11-27-pr55-evaluation.md)** - Architecture decision record
- **[Magentic + PydanticAI](decisions/architecture-2025-11/)** - Framework architecture decisions

### Guides
- **[Deployment Guide](guides/deployment.md)** - Gradio, MCP, and Modal launch steps

### Development
- **[Testing Strategy](development/testing.md)** - Unit, integration, and E2E testing patterns

### Brainstorming (Source Improvements)
- **[Roadmap Summary](brainstorming/00_ROADMAP_SUMMARY.md)** - Data source enhancement ideas
- **[PubMed Improvements](brainstorming/01_PUBMED_IMPROVEMENTS.md)**
- **[ClinicalTrials Improvements](brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md)**
- **[Europe PMC Improvements](brainstorming/03_EUROPEPMC_IMPROVEMENTS.md)**

---

## What We're Building

**One-liner**: AI agent that searches medical literature to find evidence for sexual health research questions.

**Example Queries**:
> "What drugs improve female libido post-menopause?"
> "Evidence for testosterone therapy in women with HSDD?"
> "Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?"

**Output**: Research report with drug candidates, mechanisms, evidence quality, and citations.

---

## Architecture Summary

```
User Question → Research Agent (Orchestrator)
    ↓
Search Loop:
  → Tools (PubMed, ClinicalTrials, Europe PMC)
  → Judge (Quality + Budget)
  → Repeat or Synthesize
    ↓
Research Report with Citations
```

---

## Features

| Feature | Status | Description |
|---------|--------|-------------|
| **Gradio UI** | ✅ Complete | Streaming chat interface |
| **MCP Server** | ✅ Complete | Tools accessible from Claude Desktop |
| **Modal Sandbox** | ✅ Complete | Secure statistical analysis |
| **Multi-Source Search** | ✅ Complete | PubMed, ClinicalTrials, Europe PMC |

---

## Status

| Phase | Status |
|-------|--------|
| Phases 1-14 | ✅ COMPLETE |

**Tests**: 318 passing, 0 warnings
**Known Issues**: See [Active Bugs](bugs/ACTIVE_BUGS.md)