---
title: Rabbook Agentic RAG
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 6001
pinned: false
---

# Rabbook: Agentic RAG System

> A production-quality Retrieval-Augmented Generation application built from scratch, featuring a real tool-use agent loop, hybrid retrieval, and a self-expanding knowledge base.

![Python](https://img.shields.io/badge/Python-3.13-blue?logo=python)
![FastAPI](https://img.shields.io/badge/FastAPI-0.115-green?logo=fastapi)
![LangGraph](https://img.shields.io/badge/LangGraph-Agentic-orange)
![Tests](https://img.shields.io/badge/Tests-57%20passing-brightgreen)
![Docker](https://img.shields.io/badge/Docker-ready-2496ED?logo=docker)
![Accuracy](https://img.shields.io/badge/Answer%20Accuracy-71%25-success)
![Benchmark](https://img.shields.io/badge/Benchmark-100%20cases-blue)
![License](https://img.shields.io/badge/License-MIT-lightgrey)

---

## What Makes This Different

Most RAG projects embed documents and call an LLM. Rabbook is built the way production systems are built:

| What | Why It Matters |
|------|----------------|
| **Real tool-use agent loop** | The LLM decides which tool to call each turn, rather than following a hardcoded pipeline. Mirrors how Claude, Codex, and Gemini work. |
| **7-stage retrieval pipeline** | Dense + sparse fusion → RRF → cross-encoder reranking → context expansion → grounding gate. Each stage is measurable and independently testable. |
| **Self-expanding knowledge base** | When the agent fetches a web page, it auto-embeds it. Future queries over that content go through the full RAG pipeline, not raw text. |
| **Multi-provider LLM support** | Groq (Llama), Google Gemini, and local Ollama models (including thinking-mode toggle). Swap providers with a single env var. |
| **57 unit tests, zero LLM calls** | Full mock coverage across retrieval, agent loop, research graph, and structured output. |

---

## Results & Impact

I treated this as a real engineering project: build it, **measure it on a hard public benchmark, find the bottlenecks, and prove the fix**, all on a **free local 4.6B model** (Ollama `gemma`) at **$0 inference cost**.

**Benchmark:** 100 cases (80 multi-hop **HotpotQA** in the distractor setting + 20 unanswerable **SQuAD v2**), scored by an LLM-as-judge **calibrated to 95% agreement with human labels** before use.

| Metric | Before | After | Lever |
|--------|:------:|:-----:|-------|
| **Answer accuracy** (multi-hop QA) | 64% | **71%** | Evidence-based prompt rework |
| **Hallucination** (unanswerable Qs) | ~20% | **~10%** | Grounding-discipline prompt rules |
| **Retrieval: both gold chunks found** | 54% | **89%** | Widened hybrid candidate pool |
| **Retrieval: Hit@k** | 0.99 | **1.00** | (same) |
| **Tool escalation** (snippet → full page) | 2 / 100 | **8 / 100** | Resolved "escalate vs. refuse" prompt conflict |

Each gain was **diagnosed before it was fixed**. For example, the retrieval jump came from proving the second multi-hop chunk was missing from the candidate *pool* (not just mis-ranked), then widening it. Full methodology, per-case verdicts, and an **honest regression analysis** live in the white paper: **[`docs/EVALUATION.md`](docs/EVALUATION.md)**.

> **Why this matters:** diagnosing *why* a RAG system fails and proving the improvement with numbers is the skill that separates real AI engineering from a "ChatGPT wrapper."

---

## Two Agent Modes

### Mode 1: Tool-Use Agent Loop (`agents/tool_agent.py`)

A real agentic loop where the LLM autonomously picks tools until it has enough information to answer. No hardcoded routing.

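As a rough illustration of the loop described above (not the code in `agents/tool_agent.py`), the sketch below swaps the LLM for a scripted stand-in so it runs offline. `ScriptedModel`, `run_tool_loop`, `ToolCall`, `ModelTurn`, and the stub tool bodies are hypothetical names introduced for this example; only the tool names `query_documents`, `web_search`, and `fetch_url` come from this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    name: str
    argument: str


@dataclass
class ModelTurn:
    # Exactly one of these is set: a tool request or a final answer.
    tool_call: Optional[ToolCall] = None
    answer: Optional[str] = None


class ScriptedModel:
    """Hypothetical stand-in for an LLM: replays a fixed list of turns."""

    def __init__(self, turns):
        self._turns = iter(turns)

    def next_turn(self, transcript):
        # A real model would read the transcript; the script ignores it.
        return next(self._turns)


def run_tool_loop(model, tools, query, max_steps=6):
    """Let the model pick tools until it answers or the step budget runs out."""
    transcript = [("user", query)]
    for _ in range(max_steps):
        turn = model.next_turn(transcript)
        if turn.answer is not None:
            return turn.answer, transcript
        call = turn.tool_call
        result = tools[call.name](call.argument)
        transcript.append((call.name, result))
    return "No answer within the step budget.", transcript


# Stub tools (assumptions): a tiny in-memory "library" replaces Chroma.
library = ["Rabbook uses Chroma as its vector store."]


def query_documents(q):
    words = q.lower().split()
    hits = [doc for doc in library if any(w in doc.lower() for w in words)]
    return " | ".join(hits) or "no local evidence"


def web_search(q):
    return "https://example.com/rrf"


def fetch_url(url):
    # Fetching also indexes the page, so later queries can find it.
    library.append("Reciprocal Rank Fusion merges ranked lists.")
    return "indexed; use query_documents"


tools = {
    "query_documents": query_documents,
    "web_search": web_search,
    "fetch_url": fetch_url,
}

model = ScriptedModel([
    ModelTurn(tool_call=ToolCall("query_documents", "rank fusion")),
    ModelTurn(tool_call=ToolCall("web_search", "rank fusion")),
    ModelTurn(tool_call=ToolCall("fetch_url", "https://example.com/rrf")),
    ModelTurn(tool_call=ToolCall("query_documents", "rank fusion")),
    ModelTurn(answer="RRF merges ranked lists."),
])

answer, transcript = run_tool_loop(model, tools, "What is rank fusion?")
```

The first `query_documents` call finds nothing, while the second one hits the page that `fetch_url` indexed. A real agent would make the same choices from the transcript instead of a script.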
```
User query
  └─▶ LLM decides tool call
        ├─▶ query_documents  → hybrid RAG search over local library
        ├─▶ web_search       → DuckDuckGo, returns URLs + snippets
        └─▶ fetch_url        → crawls page, auto-embeds into Chroma,
                               returns "indexed; use query_documents"
              └─▶ LLM calls query_documents again → hits newly embedded content
  └─▶ LLM produces final grounded answer
```

The `fetch_url → embed → query_documents` pattern means every fetched page permanently enriches the local library for future queries.

### Mode 2: LangGraph RAG Graph (`agents/rag_graph.py`)

A deterministic graph for structured, auditable retrieval with explicit grounding gates.

```mermaid
flowchart TD
    Start((User Query)) --> Prepare[Prepare Input & Metadata Filters]
    Prepare --> Retrieve[Hybrid Retrieval: Dense + Sparse]
    Retrieve --> Expand[Context Window Expansion]
    Expand --> Ground[Grounding Gate]
    Ground --> Decide{Decide Next Action}

    Decide -- "Grounded" --> Generate[Generate Answer with Citations]
    Decide -- "Weak Evidence" --> Refine[Refine Query]
    Decide -- "No Evidence" --> Research[Web Research Agent]

    Refine -->|retry| Retrieve
    Research -->|ingest & retry| Retrieve

    Generate --> Finalize[Finalize & Save History]
    Finalize --> End((Response))

    style Research fill:#f96,stroke:#333,stroke-width:2px
    style Refine fill:#bbf,stroke:#333,stroke-width:2px
    style Ground fill:#dfd,stroke:#333,stroke-width:2px
```

Switch modes with `RABBOOK_ENABLE_TOOL_AGENT=true/false`.

---

## Retrieval Pipeline

Seven stages run in sequence on every query:

```
1. Query Transform           LLM generates 2–4 sub-queries for broader coverage
2. Candidate Collection      Dense (Chroma) + BM25 results per sub-query, deduplicated
3. RRF Fusion                Reciprocal Rank Fusion merges the ranked lists
4. Cross-Encoder Reranking   ms-marco-MiniLM re-scores against the original query
5. Context Window Expansion  Neighboring chunks added for full document context
6. Grounding Gate            Rerank score + chunk count gate; blocks hallucination-prone answers
7. Answer Generation         Structured output with citation repair fallback
```

---

## Tech Stack

| Layer | Technology |
|-------|-----------|
| Backend | FastAPI, Python 3.13 |
| Agent Orchestration | LangGraph, LangChain tool-use (`bind_tools`) |
| Vector Store | ChromaDB |
| Embeddings | `all-MiniLM-L6-v2` (HuggingFace, local) |
| Sparse Retrieval | Rank-BM25 |
| Reranking | `ms-marco-MiniLM-L-6-v2` Cross-Encoder |
| LLM Providers | Groq (Llama 3.x), Google Gemini, Ollama (local, thinking-mode aware) |
| Web Crawling | crawl4ai + DuckDuckGo (`ddgs`) |
| Frontend | Jinja2, Vanilla CSS |
| Testing | `unittest` + mocks, 57 tests, no real LLM calls |

---

## Project Structure

```
agents/
  tool_agent.py      - real tool-use agent loop (the main path)
  rag_graph.py       - LangGraph deterministic graph
  research_graph.py  - standalone web research agent
  services.py        - public API: answer_query(), AnswerResult

rag/
  retrieve.py        - full 7-stage retrieval pipeline
  chunking.py        - semantic chunking (embedding-based split points)
  ingest.py          - document loading → Chroma + chunk registry
  web_ingest.py      - URL fetch, crawl, save, web_search
  registry.py        - chunk registry (O(1) neighbor lookup for context expansion)

app/
  web.py             - FastAPI routes, LLM instantiation, provider switching
  runtime.py         - lazy-load & cache: vectorstore, BM25, registry

core/
  config.py          - all env vars with defaults

evaluation/          - the 3-layer evaluation suite (see "Evaluation" below)
  evaluate_retrieval_metrics.py  - Layer 1: Hit@k / Recall@k / MRR
  evaluate_agent.py              - Layer 3: routing & refusal checks
  evaluate_ragas.py              - RAGAS faithfulness / answer relevancy
  time_agent.py                  - full agent run harness (timing, tools, answers)
  build_eval_corpus.py           - builds the 100-case HotpotQA + SQuAD v2 benchmark
  data/                          - golden dataset + per-case results & verdicts

docs/
  EVALUATION.md      - evaluation white paper (methodology, results, analysis)
```

> 📊 **Evaluation lives in [`evaluation/`](evaluation/) with a full write-up in
> [`docs/EVALUATION.md`](docs/EVALUATION.md).** See the [Results & Impact](#results--impact)
> and [Evaluation](#evaluation) sections.

---

## Setup

### Docker (recommended)

```bash
cp .env.example .env        # Add GROQ_API_KEY or GEMINI_KEY
docker compose up --build
# → http://localhost:6001
```

The image pre-downloads embedding and reranking models at build time so the first query is instant. The `data/` directory is mounted as a volume, so documents and chat history persist across restarts.

### Local

```bash
git clone
cd rabbook
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env        # Add GROQ_API_KEY or GEMINI_KEY
```

Key `.env` options:

```bash
RABBOOK_LLM_PROVIDER=groq               # groq | gemini | ollama
RABBOOK_LLM_MODEL=llama-3.1-8b-instant
RABBOOK_ENABLE_TOOL_AGENT=true          # real agent loop (recommended)
RABBOOK_ENABLE_LANGGRAPH_AGENT=true
RABBOOK_ENABLE_RESEARCH_FALLBACK=false
RABBOOK_OLLAMA_THINKING=false           # suppress thinking blocks for gemma/deepseek
```

```bash
python main.py          # → http://127.0.0.1:6001
python ingest_docs.py   # embed files from data/uploads/
```

---

## Running Tests

```bash
python -m pytest tests/ -q
# 57 passed
```

All tests use mocks: no API keys, no network, no vectorstore required.

---

## Evaluation

Rabbook ships with a **three-layer evaluation suite**, because retrieval can look perfect while generation fabricates, and generation can look fine while agent routing is broken. Each layer isolates one failure mode:

| Layer | Measures | Judge |
|-------|----------|-------|
| **Retrieval** | Does the retriever fetch the gold chunks? (Hit@k, Recall@k, MRR) | Deterministic IR metrics |
| **Answer quality** | Is the final answer correct / non-fabricated? | LLM-as-judge (95% human-calibrated) |
| **Agent behaviour** | Does it route locally first and refuse unanswerable questions? | Heuristic |

**Benchmark:** 100 cases from two public datasets: **80 multi-hop HotpotQA** (distractor setting: 2 gold + 8 distractor paragraphs per question) and **20 unanswerable SQuAD v2** questions (to test refusal vs. hallucination).

| Layer | Headline (tuned, gemma4:e2b) |
|-------|------------------------------|
| Retrieval | Hit@k **1.00** · Recall@k **0.83** · MRR **0.95** |
| Answer quality | **57 / 80 ≈ 71%** correct on multi-hop QA |
| Hallucination | No-fabrication rate **90%** · ~10% fabricated on unanswerable Qs |

```bash
# Layer 1: retrieval IR metrics (fast, no API cost)
python -m evaluation.evaluate_retrieval_metrics

# Layer 3: agent behaviour checks
python -m evaluation.evaluate_agent
```

📄 **Full write-up:** [`docs/EVALUATION.md`](docs/EVALUATION.md), a white paper covering the methodology, the recall diagnostic, the prompt-failure taxonomy, the before/after results, a per-case verdict table, and limitations.

> RAGAS metrics (Faithfulness, Answer Relevancy) are also wired in via `evaluation/evaluate_ragas.py` as an industry-standard cross-check.

---

## Notes

- Port defaults to `6001` (browsers commonly block `6000`)
- Uploaded files: `data/uploads/`
- URL imports: `data/uploads/urls/` (persisted, re-ingested on restart)
- Chunk registry: `data/chunk_registry.json` (flat index for O(1) neighbor lookup)
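
The flat chunk registry in the last note can be pictured as a dictionary keyed by chunk ID, which is what makes neighbor lookup constant-time during context window expansion. The sketch below only illustrates that idea and is not the actual schema of `data/chunk_registry.json`: the `source::index` key format and the helpers `chunk_id`, `build_registry`, and `expand_with_neighbors` are hypothetical.

```python
def chunk_id(source, index):
    """Hypothetical key format: one flat key per chunk."""
    return f"{source}::{index}"


def build_registry(chunks):
    """Build a flat dict from (source, index, text) tuples."""
    return {
        chunk_id(source, index): {"source": source, "index": index, "text": text}
        for source, index, text in chunks
    }


def expand_with_neighbors(registry, hit_id, window=1):
    """Return the hit plus up to `window` chunks on each side, in document order."""
    hit = registry[hit_id]
    expanded = []
    for offset in range(-window, window + 1):
        # Each neighbor is a single dict lookup, so the cost does not grow
        # with the size of the library.
        neighbor = registry.get(chunk_id(hit["source"], hit["index"] + offset))
        if neighbor is not None:
            expanded.append(neighbor["text"])
    return expanded


registry = build_registry([
    ("guide.md", 0, "Intro."),
    ("guide.md", 1, "RRF merges ranked lists."),
    ("guide.md", 2, "Reranking follows fusion."),
    ("notes.md", 0, "Unrelated note."),
])

context = expand_with_neighbors(registry, "guide.md::1")
```

Chunks at the start or end of a document simply get fewer neighbors, and chunks from other sources are never pulled in because the source is part of the key.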