title: Rabbook Agentic RAG
emoji: π
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 6001
pinned: false
Rabbook: Agentic RAG System
A production-quality Retrieval-Augmented Generation application built from scratch, featuring a real tool-use agent loop, hybrid retrieval, and a self-expanding knowledge base.
What Makes This Different
Most RAG projects embed documents and call an LLM. Rabbook is built the way production systems are built:
| What | Why It Matters |
|---|---|
| Real tool-use agent loop | The LLM decides which tool to call each turn; there is no hardcoded pipeline. Mirrors how Claude, Codex, and Gemini work. |
| 7-stage retrieval pipeline | Dense + sparse fusion → RRF → cross-encoder reranking → context expansion → grounding gate. Each stage is measurable and independently testable. |
| Self-expanding knowledge base | When the agent fetches a web page, it auto-embeds it. Future queries over that content go through the full RAG pipeline, not raw text. |
| Multi-provider LLM support | Groq (Llama), Google Gemini, and local Ollama models (including thinking-mode toggle). Swap providers with a single env var. |
| 57 unit tests, zero LLM calls | Full mock coverage across retrieval, agent loop, research graph, and structured output. |
Results & Impact
I treated this as a real engineering project: build it, measure it on a hard public benchmark, find the bottlenecks, and prove the fix, all on a free local 4.6B model (Ollama gemma) at $0 inference cost.
Benchmark: 100 cases (80 multi-hop HotpotQA in the distractor setting, plus 20 unanswerable SQuAD v2), scored by an LLM-as-judge calibrated to 95% agreement with human labels before use.
| Metric | Before | After | Lever |
|---|---|---|---|
| Answer accuracy (multi-hop QA) | 64% | 71% | Evidence-based prompt rework |
| Hallucination (unanswerable Qs) | ~20% | ~10% | Grounding-discipline prompt rules |
| Retrieval: both gold chunks found | 54% | 89% | Widened hybrid candidate pool |
| Retrieval: Hit@k | 0.99 | 1.00 | (same) |
| Tool escalation (snippet → full page) | 2 / 100 | 8 / 100 | Resolved "escalate vs. refuse" prompt conflict |
Each gain was diagnosed before it was fixed. For example, the retrieval jump came from proving the second multi-hop chunk was missing from the candidate pool (not just mis-ranked), then widening the pool. Full methodology, per-case verdicts, and an honest regression analysis live in the white paper: docs/EVALUATION.md.
Why this matters: diagnosing why a RAG system fails and proving the improvement with numbers is the skill that separates real AI engineering from a "ChatGPT wrapper."
Two Agent Modes
Mode 1: Tool-Use Agent Loop (agents/tool_agent.py)
A real agentic loop where the LLM autonomously picks tools until it has enough information to answer. No hardcoded routing.
User query
  └─▶ LLM decides tool call
        ├─▶ query_documents → hybrid RAG search over local library
        ├─▶ web_search      → DuckDuckGo, returns URLs + snippets
        └─▶ fetch_url       → crawls page, auto-embeds into Chroma,
                              returns "indexed, use query_documents"
              └─▶ LLM calls query_documents again → hits newly embedded content
                    └─▶ LLM produces final grounded answer
The fetch_url → embed → query_documents pattern means every fetched page permanently enriches the local library for future queries.
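The loop above can be sketched without any framework. This is a minimal illustration, not the code in agents/tool_agent.py: `run_agent_loop`, `LLMTurn`, `ToolCall`, the scripted `scripted_decide` policy (standing in for a real LLM), and the in-memory `library` are all hypothetical names introduced here. Only the three tool names come from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    argument: str


@dataclass
class LLMTurn:
    """One model turn: either a tool call or a final answer."""
    tool_call: Optional[ToolCall] = None
    answer: Optional[str] = None


def run_agent_loop(
    query: str,
    decide: Callable[[list], LLMTurn],
    tools: dict,
    max_turns: int = 6,
) -> str:
    """Let the model pick tools until it answers or the turn budget runs out."""
    transcript = [f"user: {query}"]
    for _ in range(max_turns):
        turn = decide(transcript)
        if turn.answer is not None:
            return turn.answer
        call = turn.tool_call
        if call is None or call.name not in tools:
            transcript.append("error: unknown or missing tool call")
            continue
        result = tools[call.name](call.argument)
        transcript.append(f"tool {call.name}({call.argument!r}) -> {result}")
    return "I could not find enough evidence to answer."


# Hypothetical in-memory stand-ins for the real tools.
library: list = []


def query_documents(q: str) -> str:
    hits = [doc for doc in library if q.lower() in doc.lower()]
    return hits[0] if hits else "no results"


def web_search(q: str) -> str:
    return "https://example.com/rrf"


def fetch_url(url: str) -> str:
    # Fetching also indexes the page, so the next query_documents call sees it.
    library.append("RRF merges ranked lists by summing reciprocal ranks.")
    return "indexed, use query_documents"


TOOLS = {"query_documents": query_documents, "web_search": web_search, "fetch_url": fetch_url}


def scripted_decide(transcript: list) -> LLMTurn:
    """Scripted policy standing in for the LLM's tool choice."""
    last = transcript[-1]
    if last.startswith("user:"):
        return LLMTurn(tool_call=ToolCall("query_documents", "rrf"))
    if "no results" in last:
        return LLMTurn(tool_call=ToolCall("web_search", "rrf"))
    if last.startswith("tool web_search"):
        return LLMTurn(tool_call=ToolCall("fetch_url", "https://example.com/rrf"))
    if last.startswith("tool fetch_url"):
        return LLMTurn(tool_call=ToolCall("query_documents", "rrf"))
    return LLMTurn(answer=last.split("-> ", 1)[1])


print(run_agent_loop("What is RRF?", scripted_decide, TOOLS))
```

On a second run the first `query_documents` call already hits the indexed page, which is the "self-expanding" behaviour in miniature. The `max_turns` cap is a design choice of this sketch: it keeps a confused model from looping indefinitely.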
Mode 2: LangGraph RAG Graph (agents/rag_graph.py)
A deterministic graph for structured, auditable retrieval with explicit grounding gates.
flowchart TD
Start((User Query)) --> Prepare[Prepare Input & Metadata Filters]
Prepare --> Retrieve[Hybrid Retrieval: Dense + Sparse]
Retrieve --> Expand[Context Window Expansion]
Expand --> Ground[Grounding Gate]
Ground --> Decide{Decide Next Action}
Decide -- "Grounded" --> Generate[Generate Answer with Citations]
Decide -- "Weak Evidence" --> Refine[Refine Query]
Decide -- "No Evidence" --> Research[Web Research Agent]
Refine -->|retry| Retrieve
Research -->|ingest & retry| Retrieve
Generate --> Finalize[Finalize & Save History]
Finalize --> End((Response))
style Research fill:#f96,stroke:#333,stroke-width:2px
style Refine fill:#bbf,stroke:#333,stroke-width:2px
style Ground fill:#dfd,stroke:#333,stroke-width:2px
Switch modes with RABBOOK_ENABLE_TOOL_AGENT=true/false.
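The "Decide Next Action" branch in the Mode 2 graph could look roughly like the sketch below. It assumes the gate sees one rerank score per retrieved chunk; `grounding_gate`, `Decision`, and both thresholds are hypothetical placeholders, not values taken from agents/rag_graph.py.

```python
from enum import Enum


class Decision(str, Enum):
    GROUNDED = "grounded"        # generate an answer with citations
    WEAK = "weak_evidence"       # refine the query and retry retrieval
    NONE = "no_evidence"         # hand off to the web research agent


def grounding_gate(rerank_scores, min_score=0.3, min_chunks=2):
    """Route on how many chunks clear the rerank-score threshold.

    min_score and min_chunks are illustrative defaults only.
    """
    strong = [score for score in rerank_scores if score >= min_score]
    if len(strong) >= min_chunks:
        return Decision.GROUNDED
    if strong:
        return Decision.WEAK
    return Decision.NONE
```

For instance, `grounding_gate([0.9, 0.1])` yields `Decision.WEAK` under these placeholder thresholds, because only one chunk clears the score bar.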
Retrieval Pipeline
Seven stages run in sequence on every query:
1. Query Transform: LLM generates 2 to 4 sub-queries for broader coverage
2. Candidate Collection: Dense (Chroma) + BM25 results per sub-query, deduplicated
3. RRF Fusion: Reciprocal Rank Fusion merges the ranked lists
4. Cross-Encoder Reranking: ms-marco-MiniLM re-scores against the original query
5. Context Window Expansion: Neighboring chunks added for full document context
6. Grounding Gate: Rerank score + chunk count gate; blocks hallucination-prone answers
7. Answer Generation: Structured output with citation repair fallback
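Stage 3 is the easiest stage to show in isolation. The sketch below is a generic Reciprocal Rank Fusion, assuming each retriever returns an ordered list of chunk ids. The function name and the smoothing constant `k=60` (a commonly used default) are assumptions of this example; the value used in rag/retrieve.py may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


dense = ["c1", "c2", "c3"]    # e.g. vector-store order
sparse = ["c3", "c1", "c4"]   # e.g. BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks seen by only one retriever, which is the point of fusing before the more expensive cross-encoder pass.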
Tech Stack
| Layer | Technology |
|---|---|
| Backend | FastAPI, Python 3.13 |
| Agent Orchestration | LangGraph, LangChain tool-use (bind_tools) |
| Vector Store | ChromaDB |
| Embeddings | all-MiniLM-L6-v2 (HuggingFace, local) |
| Sparse Retrieval | Rank-BM25 |
| Reranking | ms-marco-MiniLM-L-6-v2 Cross-Encoder |
| LLM Providers | Groq (Llama 3.x), Google Gemini, Ollama (local, thinking-mode aware) |
| Web Crawling | crawl4ai + DuckDuckGo (ddgs) |
| Frontend | Jinja2, Vanilla CSS |
| Testing | unittest + mocks, 57 tests, no real LLM calls |
Project Structure
agents/
  tool_agent.py       # real tool-use agent loop (the main path)
  rag_graph.py        # LangGraph deterministic graph
  research_graph.py   # standalone web research agent
  services.py         # public API: answer_query(), AnswerResult
rag/
  retrieve.py         # full 7-stage retrieval pipeline
  chunking.py         # semantic chunking (embedding-based split points)
  ingest.py           # document loading → Chroma + chunk registry
  web_ingest.py       # URL fetch, crawl, save, web_search
  registry.py         # chunk registry (O(1) neighbor lookup for context expansion)
app/
  web.py              # FastAPI routes, LLM instantiation, provider switching
  runtime.py          # lazy-load & cache: vectorstore, BM25, registry
core/
  config.py           # all env vars with defaults
evaluation/           # the 3-layer evaluation suite (see "Evaluation" below)
  evaluate_retrieval_metrics.py  # Layer 1: Hit@k / Recall@k / MRR
  evaluate_agent.py              # Layer 3: routing & refusal checks
  evaluate_ragas.py              # RAGAS faithfulness / answer relevancy
  time_agent.py                  # full agent run harness (timing, tools, answers)
  build_eval_corpus.py           # builds the 100-case HotpotQA + SQuAD v2 benchmark
  data/                          # golden dataset + per-case results & verdicts
docs/
  EVALUATION.md       # evaluation white paper (methodology, results, analysis)
Evaluation lives in evaluation/ with a full write-up in docs/EVALUATION.md. See the Results & Impact and Evaluation sections.
Setup
Docker (recommended)
cp .env.example .env
# Add GROQ_API_KEY or GEMINI_KEY
docker compose up --build
# → http://localhost:6001
The image pre-downloads the embedding and reranking models at build time, so the first query does not wait on a model download. The data/ directory is mounted as a volume, so documents and chat history persist across restarts.
Local
git clone <repo>
cd rabbook
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Add GROQ_API_KEY or GEMINI_KEY
Key .env options:
RABBOOK_LLM_PROVIDER=groq # groq | gemini | ollama
RABBOOK_LLM_MODEL=llama-3.1-8b-instant
RABBOOK_ENABLE_TOOL_AGENT=true # real agent loop (recommended)
RABBOOK_ENABLE_LANGGRAPH_AGENT=true
RABBOOK_ENABLE_RESEARCH_FALLBACK=false
RABBOOK_OLLAMA_THINKING=false # suppress <think> blocks for gemma/deepseek
python main.py # → http://127.0.0.1:6001
python ingest_docs.py # embed files from data/uploads/
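For orientation, the .env options listed above could be read into a typed settings object along these lines. This is a sketch only: `Settings`, `load_settings`, and the fallback defaults (copied from the sample values above) are assumptions, and core/config.py may use different names or defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUE = {"1", "true", "yes", "on"}
_PROVIDERS = {"groq", "gemini", "ollama"}


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean env var, falling back to the default when unset."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in _TRUE


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    llm_model: str
    enable_tool_agent: bool
    enable_langgraph_agent: bool
    enable_research_fallback: bool
    ollama_thinking: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    provider = env.get("RABBOOK_LLM_PROVIDER", "groq").strip().lower()
    if provider not in _PROVIDERS:
        raise ValueError(f"unsupported RABBOOK_LLM_PROVIDER: {provider!r}")
    return Settings(
        llm_provider=provider,
        llm_model=env.get("RABBOOK_LLM_MODEL", "llama-3.1-8b-instant"),
        enable_tool_agent=_flag(env, "RABBOOK_ENABLE_TOOL_AGENT", True),
        enable_langgraph_agent=_flag(env, "RABBOOK_ENABLE_LANGGRAPH_AGENT", True),
        enable_research_fallback=_flag(env, "RABBOOK_ENABLE_RESEARCH_FALLBACK", False),
        ollama_thinking=_flag(env, "RABBOOK_OLLAMA_THINKING", False),
    )
```

Passing an explicit mapping, as in `load_settings({"RABBOOK_LLM_PROVIDER": "ollama"})`, keeps the loader testable without touching the real process environment.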
Running Tests
python -m pytest tests/ -q
# 57 passed
All tests use mocks: no API keys, no network, no vectorstore required.
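A mocked test in this style might look like the following. It is an illustration rather than a test from the repository: `answer_with_context` is a hypothetical helper defined inline, and the LLM is a `unittest.mock.Mock` whose `invoke` method is assumed to take a prompt string.

```python
import unittest
from unittest.mock import Mock


def answer_with_context(llm, question, chunks):
    """Hypothetical helper: refuse without evidence, otherwise prompt the LLM."""
    if not chunks:
        return "I don't know."
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")


class AnswerWithContextTests(unittest.TestCase):
    def test_refuses_without_chunks_and_never_calls_llm(self):
        llm = Mock()
        self.assertEqual(answer_with_context(llm, "Who?", []), "I don't know.")
        llm.invoke.assert_not_called()

    def test_prompt_contains_numbered_context(self):
        llm = Mock()
        llm.invoke.return_value = "Paris [1]"
        answer = answer_with_context(llm, "Capital of France?", ["Paris is the capital."])
        self.assertEqual(answer, "Paris [1]")
        prompt = llm.invoke.call_args.args[0]
        self.assertIn("[1] Paris is the capital.", prompt)
```

Because the mock records its calls, the first test can assert that the refusal path costs zero LLM calls, which is the property the suite as a whole relies on.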
Evaluation
Rabbook ships with a three-layer evaluation suite, because retrieval can look perfect while generation fabricates, and generation can look fine while agent routing is broken. Each layer isolates one failure mode:
| Layer | Measures | Judge |
|---|---|---|
| Retrieval | Does the retriever fetch the gold chunks? (Hit@k, Recall@k, MRR) | Deterministic IR metrics |
| Answer quality | Is the final answer correct / non-fabricated? | LLM-as-judge (95% human-calibrated) |
| Agent behaviour | Does it route locally first and refuse unanswerable questions? | Heuristic |
Benchmark: 100 cases from two public datasets: 80 multi-hop HotpotQA (distractor setting: 2 gold + 8 distractor paragraphs per question) and 20 unanswerable SQuAD v2 questions (to test refusal vs. hallucination).
| Layer | Headline (tuned, gemma4:e2b) |
|---|---|
| Retrieval | Hit@k 1.00 · Recall@k 0.83 · MRR 0.95 |
| Answer quality | 57 / 80 ≈ 71% correct on multi-hop QA |
| Hallucination | No-fabrication rate 90% · ~10% fabricated on unanswerable Qs |
# Layer 1: retrieval IR metrics (fast, no API cost)
python -m evaluation.evaluate_retrieval_metrics
# Layer 3: agent behaviour checks
python -m evaluation.evaluate_agent
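The Layer 1 metrics are standard IR definitions and can be computed as sketched below. The function names and the toy ids are assumptions of this example, not the API of evaluation/evaluate_retrieval_metrics.py.

```python
def hit_at_k(retrieved, gold, k):
    """1.0 if at least one gold chunk appears in the top k, else 0.0."""
    return 1.0 if any(chunk in gold for chunk in retrieved[:k]) else 0.0


def recall_at_k(retrieved, gold, k):
    """Fraction of gold chunks that appear in the top k."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)


def reciprocal_rank(retrieved, gold):
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, gold) pairs."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


# A multi-hop case: two gold chunks, only one of them inside the top 3.
retrieved = ["d3", "g1", "d7", "g2"]
gold = {"g1", "g2"}
print(hit_at_k(retrieved, gold, 3), recall_at_k(retrieved, gold, 3))  # 1.0 0.5
```

The toy case shows why Hit@k alone can mislead on multi-hop questions: it reports a hit while half of the required evidence is still missing, which is what Recall@k and the "both gold chunks found" row capture.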
Full write-up: docs/EVALUATION.md, a white paper covering the methodology, the recall diagnostic, the prompt-failure taxonomy, the before/after results, a per-case verdict table, and limitations.
RAGAS metrics (Faithfulness, Answer Relevancy) are also wired in via evaluation/evaluate_ragas.py as an industry-standard cross-check.
Notes
- Port defaults to 6001 (browsers commonly block 6000)
- Uploaded files: data/uploads/
- URL imports: data/uploads/urls/ (persisted, re-ingested on restart)
- Chunk registry: data/chunk_registry.json (flat index for O(1) neighbor lookup)
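To illustrate what a flat index with O(1) neighbor lookup can mean in practice, here is a minimal sketch. It assumes chunks are keyed by a `"doc_id:index"` string (JSON-friendly); the key format, field names, and both functions are hypothetical and may not match the layout of data/chunk_registry.json.

```python
def build_registry(chunks):
    """Index chunks by 'doc_id:index' so any neighbor is a single dict lookup."""
    return {f"{c['doc_id']}:{c['index']}": c["text"] for c in chunks}


def expand_with_neighbors(registry, doc_id, index, window=1):
    """Return the hit chunk plus up to `window` neighbors on each side, in order."""
    expanded = []
    for i in range(index - window, index + window + 1):
        text = registry.get(f"{doc_id}:{i}")  # missing neighbors are skipped
        if text is not None:
            expanded.append(text)
    return expanded


chunks = [
    {"doc_id": "a", "index": 0, "text": "Intro."},
    {"doc_id": "a", "index": 1, "text": "Method."},
    {"doc_id": "a", "index": 2, "text": "Results."},
]
registry = build_registry(chunks)
print(expand_with_neighbors(registry, "a", 1))  # ['Intro.', 'Method.', 'Results.']
```

Each neighbor costs one hash lookup regardless of corpus size, so context expansion adds no search over the vector store.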