| --- |
| title: Rabbook Agentic RAG |
| emoji: π |
| colorFrom: indigo |
| colorTo: purple |
| sdk: docker |
| app_port: 6001 |
| pinned: false |
| --- |
| |
| # Rabbook: Agentic RAG System |
|
|
| > A production-quality Retrieval-Augmented Generation application built from scratch, featuring a real tool-use agent loop, hybrid retrieval, and a self-expanding knowledge base. |
|
|
|
|
| --- |
|
|
| ## What Makes This Different |
|
|
| Most RAG projects embed documents and call an LLM. Rabbook is built the way production systems are built: |
|
|
| | What | Why It Matters | |
| |------|----------------| |
| | **Real tool-use agent loop** | The LLM decides which tool to call on each turn instead of following a hardcoded pipeline. Mirrors how Claude, Codex, and Gemini work. | |
| | **7-stage retrieval pipeline** | Dense + sparse fusion → RRF → cross-encoder reranking → context expansion → grounding gate. Each stage is measurable and independently testable. | |
| | **Self-expanding knowledge base** | When the agent fetches a web page, it auto-embeds it. Future queries over that content go through the full RAG pipeline, not raw text. | |
| | **Multi-provider LLM support** | Groq (Llama), Google Gemini, and local Ollama models (including thinking-mode toggle). Swap providers with a single env var. | |
| | **57 unit tests, zero LLM calls** | Full mock coverage across retrieval, agent loop, research graph, and structured output. | |
|
|
| --- |
|
|
| ## Results & Impact |
|
|
| I treated this as a real engineering project: build it, **measure it on a hard public benchmark, find the bottlenecks, and prove the fix**, all on a **free local 4.6B model** (Ollama `gemma`) at **$0 inference cost**. |
|
|
| **Benchmark:** 100 cases (80 multi-hop **HotpotQA** in the distractor setting + 20 unanswerable **SQuAD v2**), scored by an LLM-as-judge **calibrated to 95% agreement with human labels** before use. |
|
|
| | Metric | Before | After | Lever | |
| |--------|:------:|:-----:|-------| |
| | **Answer accuracy** (multi-hop QA) | 64% | **71%** | Evidence-based prompt rework | |
| | **Hallucination** (unanswerable Qs) | ~20% | **~10%** | Grounding-discipline prompt rules | |
| | **Retrieval: both gold chunks found** | 54% | **89%** | Widened hybrid candidate pool | |
| | **Retrieval: Hit@k** | 0.99 | **1.00** | (same) | |
| | **Tool escalation** (snippet → full page) | 2 / 100 | **8 / 100** | Resolved "escalate vs. refuse" prompt conflict | |
|
|
| Each gain was **diagnosed before it was fixed**. For example, the retrieval jump came from proving the second multi-hop chunk was missing from the candidate *pool* (not just mis-ranked), then widening it. Full methodology, per-case verdicts, and an **honest regression analysis** live in the white paper: **[`docs/EVALUATION.md`](docs/EVALUATION.md)**. |
|
|
| > **Why this matters:** diagnosing *why* a RAG system fails and proving the improvement with numbers is the skill that separates real AI engineering from a "ChatGPT wrapper." |
|
|
| --- |
|
|
| ## Two Agent Modes |
|
|
| ### Mode 1: Tool-Use Agent Loop (`agents/tool_agent.py`) |
| |
| A real agentic loop where the LLM autonomously picks tools until it has enough information to answer. No hardcoded routing. |
| |
| ``` |
| User query |
|   └─▶ LLM decides tool call |
|         ├─▶ query_documents → hybrid RAG search over local library |
|         ├─▶ web_search      → DuckDuckGo, returns URLs + snippets |
|         └─▶ fetch_url       → crawls page, auto-embeds into Chroma, |
|                               returns "indexed - use query_documents" |
|               └─▶ LLM calls query_documents again → hits newly embedded content |
|   └─▶ LLM produces final grounded answer |
| ``` |
| |
| The `fetch_url → embed → query_documents` pattern means every fetched page permanently enriches the local library for future queries. |
|
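To make the loop concrete, here is a minimal, dependency-free sketch of the pattern described above. It is not the code in `agents/tool_agent.py`: the `scripted_llm` policy, the in-memory `library` dict (standing in for Chroma), and the example URL are all hypothetical stand-ins, and only `query_documents` and `fetch_url` are modeled.

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]  # (role or tool name, text)

# Hypothetical in-memory stand-in for the vector store.
library: Dict[str, str] = {}


def query_documents(query: str) -> str:
    """Return stored documents that share at least one word with the query."""
    words = set(query.lower().split())
    hits = [text for text in library.values() if words & set(text.lower().split())]
    return "\n".join(hits) if hits else "no results"


def fetch_url(url: str) -> str:
    """Pretend to crawl a page, store it, and tell the model to re-query."""
    library[url] = f"Content of {url}: rabbits eat hay"
    return "indexed - use query_documents"


TOOLS: Dict[str, Callable[[str], str]] = {
    "query_documents": query_documents,
    "fetch_url": fetch_url,
}


def scripted_llm(transcript: Transcript) -> Tuple[str, str]:
    """Toy policy standing in for a real model's tool choice."""
    last_role, last_text = transcript[-1]
    if last_role == "user":
        return "query_documents", last_text
    if last_role == "query_documents" and last_text == "no results":
        return "fetch_url", "https://example.com/rabbits"
    if last_role == "fetch_url":
        return "query_documents", transcript[0][1]  # retry the original question
    return "final", f"Based on the library: {last_text}"


def run_agent(
    llm: Callable[[Transcript], Tuple[str, str]], query: str, max_turns: int = 6
) -> Tuple[str, Transcript]:
    """Call tools chosen by the model until it returns a final answer."""
    transcript: Transcript = [("user", query)]
    for _ in range(max_turns):
        action, argument = llm(transcript)
        if action == "final":
            return argument, transcript
        transcript.append((action, TOOLS[action](argument)))
    return "Stopped: turn limit reached.", transcript
```

The key design point is that routing lives in the model's decision, not in `run_agent`: the loop only executes whatever tool is requested and feeds the result back. The `max_turns` cap is an assumption added here so a misbehaving model cannot loop forever.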
|
| ### Mode 2: LangGraph RAG Graph (`agents/rag_graph.py`) |
| |
| A deterministic graph for structured, auditable retrieval with explicit grounding gates. |
| |
| ```mermaid |
| flowchart TD |
| Start((User Query)) --> Prepare[Prepare Input & Metadata Filters] |
| Prepare --> Retrieve[Hybrid Retrieval: Dense + Sparse] |
| Retrieve --> Expand[Context Window Expansion] |
| Expand --> Ground[Grounding Gate] |
| |
| Ground --> Decide{Decide Next Action} |
| |
| Decide -- "Grounded" --> Generate[Generate Answer with Citations] |
| Decide -- "Weak Evidence" --> Refine[Refine Query] |
| Decide -- "No Evidence" --> Research[Web Research Agent] |
| |
| Refine -->|retry| Retrieve |
| Research -->|ingest & retry| Retrieve |
| |
| Generate --> Finalize[Finalize & Save History] |
| Finalize --> End((Response)) |
| |
| style Research fill:#f96,stroke:#333,stroke-width:2px |
| style Refine fill:#bbf,stroke:#333,stroke-width:2px |
| style Ground fill:#dfd,stroke:#333,stroke-width:2px |
| ``` |
| |
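The three-way branch after the grounding gate can be sketched as a small pure function. This is an illustration only: the `ScoredChunk` type, the threshold values, and the function name are assumptions, not the logic in `agents/rag_graph.py`, and real cross-encoder scores may live on a different scale.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredChunk:
    text: str
    rerank_score: float  # cross-encoder relevance score, higher is better


def decide_next_action(
    chunks: List[ScoredChunk],
    strong_score: float = 0.5,  # hypothetical threshold
    weak_score: float = 0.1,    # hypothetical threshold
    min_chunks: int = 2,        # hypothetical minimum of strong chunks
) -> str:
    """Map retrieved evidence to one of: generate, refine, research."""
    if not chunks:
        return "research"  # "No Evidence" edge
    strong = [c for c in chunks if c.rerank_score >= strong_score]
    if len(strong) >= min_chunks:
        return "generate"  # "Grounded" edge
    if any(c.rerank_score >= weak_score for c in chunks):
        return "refine"    # "Weak Evidence" edge
    return "research"      # nothing relevant enough to refine on
```

Keeping the decision as a pure function of scores and counts is what makes the gate auditable: the same inputs always take the same edge, so it can be unit-tested without an LLM.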
| Switch modes with `RABBOOK_ENABLE_TOOL_AGENT=true/false`. |
|
|
| --- |
|
|
| ## Retrieval Pipeline |
|
|
| Seven stages run in sequence on every query: |
|
|
| ``` |
| 1. Query Transform           LLM generates 2–4 sub-queries for broader coverage |
| 2. Candidate Collection      Dense (Chroma) + BM25 results per sub-query, deduplicated |
| 3. RRF Fusion                Reciprocal Rank Fusion merges the ranked lists |
| 4. Cross-Encoder Reranking   ms-marco-MiniLM re-scores against the original query |
| 5. Context Window Expansion  Neighboring chunks added for full document context |
| 6. Grounding Gate            Rerank score + chunk count gate; blocks hallucination-prone answers |
| 7. Answer Generation         Structured output with citation repair fallback |
| ``` |
|
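Stage 3 is the easiest stage to show in isolation. The sketch below implements standard Reciprocal Rank Fusion, where each document scores the sum of `1 / (k + rank)` over every list it appears in. The function name and the constant `k = 60` (a commonly used default) are assumptions here; the value used in `rag/retrieve.py` may differ.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one ranked list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # e.g. ids ranked by the vector store
sparse_hits = ["c", "a", "d"]  # e.g. ids ranked by BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, which is why it suits hybrid search. Documents found by both retrievers (`a` and `c`) rise above those found by only one.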
|
| --- |
|
|
| ## Tech Stack |
|
|
| | Layer | Technology | |
| |-------|-----------| |
| | Backend | FastAPI, Python 3.13 | |
| | Agent Orchestration | LangGraph, LangChain tool-use (`bind_tools`) | |
| | Vector Store | ChromaDB | |
| | Embeddings | `all-MiniLM-L6-v2` (HuggingFace, local) | |
| | Sparse Retrieval | Rank-BM25 | |
| | Reranking | `ms-marco-MiniLM-L-6-v2` Cross-Encoder | |
| | LLM Providers | Groq (Llama 3.x), Google Gemini, Ollama (local, thinking-mode aware) | |
| | Web Crawling | crawl4ai + DuckDuckGo (`ddgs`) | |
| | Frontend | Jinja2, Vanilla CSS | |
| | Testing | `unittest` + mocks, 57 tests, no real LLM calls | |
|
|
| --- |
|
|
| ## Project Structure |
|
|
| ``` |
| agents/ |
| tool_agent.py β real tool-use agent loop (the main path) |
| rag_graph.py β LangGraph deterministic graph |
| research_graph.py β standalone web research agent |
| services.py β public API: answer_query(), AnswerResult |
| rag/ |
| retrieve.py β full 7-stage retrieval pipeline |
| chunking.py β semantic chunking (embedding-based split points) |
| ingest.py β document loading β Chroma + chunk registry |
| web_ingest.py β URL fetch, crawl, save, web_search |
| registry.py β chunk registry (O(1) neighbor lookup for context expansion) |
| app/ |
| web.py β FastAPI routes, LLM instantiation, provider switching |
| runtime.py β lazy-load & cache: vectorstore, BM25, registry |
| core/ |
| config.py β all env vars with defaults |
| evaluation/ β the 3-layer evaluation suite (see "Evaluation" below) |
| evaluate_retrieval_metrics.py β Layer 1: Hit@k / Recall@k / MRR |
| evaluate_agent.py β Layer 3: routing & refusal checks |
| evaluate_ragas.py β RAGAS faithfulness / answer relevancy |
| time_agent.py β full agent run harness (timing, tools, answers) |
| build_eval_corpus.py β builds the 100-case HotpotQA + SQuAD v2 benchmark |
| data/ β golden dataset + per-case results & verdicts |
| docs/ |
| EVALUATION.md β evaluation white paper (methodology, results, analysis) |
| ``` |
|
|
| > **Evaluation lives in [`evaluation/`](evaluation/) with a full write-up in |
| > [`docs/EVALUATION.md`](docs/EVALUATION.md).** See the [Results & Impact](#results--impact) |
| > and [Evaluation](#evaluation) sections. |
|
|
| --- |
|
|
| ## Setup |
|
|
| ### Docker (recommended) |
|
|
| ```bash |
| cp .env.example .env |
| # Add GROQ_API_KEY or GEMINI_KEY |
| |
| docker compose up --build |
| # → http://localhost:6001 |
| ``` |
|
|
| The image pre-downloads the embedding and reranking models at build time, so the first query does not wait on model downloads. The `data/` directory is mounted as a volume, so documents and chat history persist across restarts. |
|
|
| ### Local |
|
|
| ```bash |
| git clone <repo> |
| cd rabbook |
| |
| python -m venv venv |
| source venv/bin/activate |
| pip install -r requirements.txt |
| |
| cp .env.example .env |
| # Add GROQ_API_KEY or GEMINI_KEY |
| ``` |
|
|
| Key `.env` options: |
|
|
| ```bash |
| RABBOOK_LLM_PROVIDER=groq # groq | gemini | ollama |
| RABBOOK_LLM_MODEL=llama-3.1-8b-instant |
| RABBOOK_ENABLE_TOOL_AGENT=true # real agent loop (recommended) |
| RABBOOK_ENABLE_LANGGRAPH_AGENT=true |
| RABBOOK_ENABLE_RESEARCH_FALLBACK=false |
| RABBOOK_OLLAMA_THINKING=false # suppress <think> blocks for gemma/deepseek |
| ``` |
|
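As an illustration of how options like these are typically consumed, here is a small sketch that reads them into a typed settings object. It is not the contents of `core/config.py`: the `Settings` class, `load_settings`, the boolean parsing rule, and the fallback defaults (taken from the example values above) are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping

SUPPORTED_PROVIDERS = {"groq", "gemini", "ollama"}


def _as_bool(value: str) -> bool:
    """Treat common truthy spellings as True, everything else as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    provider: str
    model: str
    enable_tool_agent: bool
    ollama_thinking: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, with hypothetical defaults."""
    provider = env.get("RABBOOK_LLM_PROVIDER", "groq").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported RABBOOK_LLM_PROVIDER: {provider!r}")
    return Settings(
        provider=provider,
        model=env.get("RABBOOK_LLM_MODEL", "llama-3.1-8b-instant"),
        enable_tool_agent=_as_bool(env.get("RABBOOK_ENABLE_TOOL_AGENT", "true")),
        ollama_thinking=_as_bool(env.get("RABBOOK_OLLAMA_THINKING", "false")),
    )
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function body, keeps the loader testable without touching the real process environment.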
|
| ```bash |
| python main.py # → http://127.0.0.1:6001 |
| python ingest_docs.py # embed files from data/uploads/ |
| ``` |
|
|
| --- |
|
|
| ## Running Tests |
|
|
| ```bash |
| python -m pytest tests/ -q |
| # 57 passed |
| ``` |
|
|
| All tests use mocks: no API keys, no network, no vectorstore required. |
|
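The mock-only approach looks roughly like the sketch below, which uses `unittest.mock.Mock` in place of a real LLM client. The helper `answer_with_context` and the `invoke` method name are hypothetical and exist only for this example; they are not taken from the project's test suite.

```python
import unittest
from typing import List
from unittest.mock import Mock


def answer_with_context(llm, question: str, chunks: List[str]) -> str:
    """Hypothetical helper: refuse without evidence, otherwise ask the LLM."""
    if not chunks:
        return "I don't have enough information to answer."
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}"
    return llm.invoke(prompt)


class AnswerWithContextTest(unittest.TestCase):
    def test_refuses_without_calling_llm(self):
        llm = Mock()
        result = answer_with_context(llm, "Who wrote the notes?", [])
        llm.invoke.assert_not_called()  # no evidence means no model call
        self.assertIn("enough information", result)

    def test_passes_chunks_to_llm(self):
        llm = Mock()
        llm.invoke.return_value = "Ada Lovelace"
        chunks = ["Ada Lovelace wrote the notes."]
        result = answer_with_context(llm, "Who wrote the notes?", chunks)
        self.assertEqual(result, "Ada Lovelace")
        prompt = llm.invoke.call_args.args[0]
        self.assertIn(chunks[0], prompt)  # retrieved text reached the prompt
```

Because the mock records every call, a test can assert both on the returned answer and on what was (or was not) sent to the model, which is how refusal behaviour can be checked deterministically.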
|
| --- |
|
|
| ## Evaluation |
|
|
| Rabbook ships with a **three-layer evaluation suite**, because retrieval can look perfect while generation fabricates, and generation can look fine while agent routing is broken. Each layer isolates one failure mode: |
|
|
| | Layer | Measures | Judge | |
| |-------|----------|-------| |
| | **Retrieval** | Does the retriever fetch the gold chunks? (Hit@k, Recall@k, MRR) | Deterministic IR metrics | |
| | **Answer quality** | Is the final answer correct / non-fabricated? | LLM-as-judge (95% human-calibrated) | |
| | **Agent behaviour** | Does it route locally first and refuse unanswerable questions? | Heuristic | |
|
|
| **Benchmark:** 100 cases from two public datasets: **80 multi-hop HotpotQA** (distractor setting: 2 gold + 8 distractor paragraphs per question) and **20 unanswerable SQuAD v2** questions (to test refusal vs. hallucination). |
|
|
| | Layer | Headline (tuned, gemma4:e2b) | |
| |-------|------------------------------| |
| | Retrieval | Hit@k **1.00** · Recall@k **0.83** · MRR **0.95** | |
| | Answer quality | **57 / 80 ≈ 71%** correct on multi-hop QA | |
| | Hallucination | No-fabrication rate **90%** · ~10% fabricated on unanswerable Qs | |
|
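For readers unfamiliar with the Layer 1 metrics, the sketch below computes them for a single query using their standard definitions. The function names are illustrative and are not taken from `evaluation/evaluate_retrieval_metrics.py`; MRR is the mean of the reciprocal rank over all queries.

```python
from typing import List, Set


def hit_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """1.0 if at least one gold chunk appears in the top k, else 0.0."""
    return 1.0 if any(chunk_id in gold for chunk_id in retrieved[:k]) else 0.0


def recall_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """Fraction of gold chunks that appear in the top k."""
    if not gold:
        raise ValueError("gold must contain at least one chunk id")
    return len(set(retrieved[:k]) & gold) / len(gold)


def reciprocal_rank(retrieved: List[str], gold: Set[str]) -> float:
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


# One multi-hop question with two gold chunks; only one lands in the top 3.
retrieved = ["d1", "g1", "d2", "g2"]
gold = {"g1", "g2"}
print(hit_at_k(retrieved, gold, k=3))     # 1.0
print(recall_at_k(retrieved, gold, k=3))  # 0.5
print(reciprocal_rank(retrieved, gold))   # 0.5
```

The example shows why Hit@k alone can be misleading for multi-hop questions: it is already satisfied by one gold chunk, while answering needs both. That is the gap between the near-perfect Hit@k and the lower "both gold chunks found" rate reported in Results & Impact.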
|
| ```bash |
| # Layer 1: retrieval IR metrics (fast, no API cost) |
| python -m evaluation.evaluate_retrieval_metrics |
| # Layer 3: agent behaviour checks |
| python -m evaluation.evaluate_agent |
| ``` |
|
|
| **Full write-up:** [`docs/EVALUATION.md`](docs/EVALUATION.md), a white paper covering the methodology, the recall diagnostic, the prompt-failure taxonomy, the before/after results, a per-case verdict table, and limitations. |
|
|
| > RAGAS metrics (Faithfulness, Answer Relevancy) are also wired in via `evaluation/evaluate_ragas.py` as an industry-standard cross-check. |
| |
| --- |
| |
| ## Notes |
| |
| - Port defaults to `6001` (browsers commonly block `6000`) |
| - Uploaded files: `data/uploads/` |
| - URL imports: `data/uploads/urls/` (persisted, re-ingested on restart) |
| - Chunk registry: `data/chunk_registry.json` (flat index for O(1) neighbor lookup) |
|
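The O(1) neighbor lookup mentioned for the chunk registry can be illustrated with a flat dictionary keyed by document and chunk position. The key format `"<doc_id>#<index>"`, the function names, and the `window` parameter are assumptions for this sketch; the actual layout of `data/chunk_registry.json` may differ.

```python
from typing import Dict, List


def registry_key(doc_id: str, index: int) -> str:
    """Hypothetical flat key: document id plus chunk position."""
    return f"{doc_id}#{index}"


def build_registry(doc_id: str, chunks: List[str]) -> Dict[str, str]:
    """Index every chunk of one document by its position."""
    return {registry_key(doc_id, i): text for i, text in enumerate(chunks)}


def expand_with_neighbors(
    registry: Dict[str, str], doc_id: str, index: int, window: int = 1
) -> List[str]:
    """Return the hit chunk plus up to `window` neighbors on each side."""
    expanded = []
    for i in range(index - window, index + window + 1):
        text = registry.get(registry_key(doc_id, i))  # one dict lookup each
        if text is not None:  # skip positions past either end of the document
            expanded.append(text)
    return expanded


registry = build_registry("guide.md", ["intro", "setup", "usage", "faq"])
print(expand_with_neighbors(registry, "guide.md", 2))  # ['setup', 'usage', 'faq']
print(expand_with_neighbors(registry, "guide.md", 0))  # ['intro', 'setup']
```

Each neighbor costs a single dictionary lookup, so context expansion stays cheap regardless of library size and needs no extra vector search.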
|