title: Rabbook Agentic RAG
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 6001
pinned: false

Rabbook: Agentic RAG System

A production-quality Retrieval-Augmented Generation application built from scratch, featuring a real tool-use agent loop, hybrid retrieval, and a self-expanding knowledge base.



What Makes This Different

Most RAG projects embed documents and call an LLM. Rabbook is built the way production systems are built:

| What | Why It Matters |
| --- | --- |
| Real tool-use agent loop | The LLM decides which tool to call each turn, rather than following a hardcoded pipeline. Mirrors how Claude, Codex, and Gemini work. |
| 7-stage retrieval pipeline | Dense + sparse fusion → RRF → cross-encoder reranking → context expansion → grounding gate. Each stage is measurable and independently testable. |
| Self-expanding knowledge base | When the agent fetches a web page, it auto-embeds it. Future queries over that content go through the full RAG pipeline, not raw text. |
| Multi-provider LLM support | Groq (Llama), Google Gemini, and local Ollama models (including thinking-mode toggle). Swap providers with a single env var. |
| 57 unit tests, zero LLM calls | Full mock coverage across retrieval, agent loop, research graph, and structured output. |

Results & Impact

I treated this as a real engineering project: build it, measure it on a hard public benchmark, find the bottlenecks, and prove the fix, all on a free local 4.6B model (Ollama gemma) at $0 inference cost.

Benchmark: 100 cases (80 multi-hop HotpotQA in the distractor setting, plus 20 unanswerable SQuAD v2), scored by an LLM-as-judge calibrated to 95% agreement with human labels before use.

| Metric | Before | After | Lever |
| --- | --- | --- | --- |
| Answer accuracy (multi-hop QA) | 64% | 71% | Evidence-based prompt rework |
| Hallucination (unanswerable Qs) | ~20% | ~10% | Grounding-discipline prompt rules |
| Retrieval: both gold chunks found | 54% | 89% | Widened hybrid candidate pool |
| Retrieval: Hit@k | 0.99 | 1.00 | (same) |
| Tool escalation (snippet → full page) | 2 / 100 | 8 / 100 | Resolved "escalate vs. refuse" prompt conflict |

Each gain was diagnosed before it was fixed. For example, the retrieval jump came from proving that the second multi-hop chunk was missing from the candidate pool (not just mis-ranked), then widening the pool. Full methodology, per-case verdicts, and an honest regression analysis live in the white paper: docs/EVALUATION.md.
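The "missing from the pool vs. mis-ranked" distinction can be illustrated with a small diagnostic. This is a minimal sketch, not the project's evaluation code: the function name `diagnose_miss`, the labels, and the toy chunk ids are all hypothetical.

```python
from typing import Iterable, Sequence


def diagnose_miss(gold_ids: Iterable[str],
                  candidate_pool: Sequence[str],
                  top_k: int) -> dict[str, str]:
    """Label each gold chunk as 'retrieved', 'mis-ranked', or 'not in pool'.

    `candidate_pool` is assumed to be ordered best-first.
    """
    labels: dict[str, str] = {}
    for gold in gold_ids:
        if gold not in candidate_pool:
            labels[gold] = "not in pool"   # widening the pool is the lever
        elif candidate_pool.index(gold) < top_k:
            labels[gold] = "retrieved"
        else:
            labels[gold] = "mis-ranked"    # a ranking problem instead
    return labels


# Toy multi-hop case: two gold chunks (g1, g2) among distractors.
pool = ["c7", "g1", "c3", "c9"]
print(diagnose_miss(["g1", "g2"], pool, top_k=2))
```

A chunk labelled "not in pool" cannot be rescued by a better reranker, which is why the two cases call for different fixes.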

Why this matters: diagnosing why a RAG system fails and proving the improvement with numbers is the skill that separates real AI engineering from a "ChatGPT wrapper."


Two Agent Modes

Mode 1: Tool-Use Agent Loop (agents/tool_agent.py)

A real agentic loop where the LLM autonomously picks tools until it has enough information to answer. No hardcoded routing.

User query
    └─▶ LLM decides tool call
            ├─▶ query_documents  →  hybrid RAG search over local library
            ├─▶ web_search       →  DuckDuckGo, returns URLs + snippets
            └─▶ fetch_url        →  crawls page, auto-embeds into Chroma,
                                    returns "indexed - use query_documents"
    └─▶ LLM calls query_documents again → hits newly embedded content
    └─▶ LLM produces final grounded answer

The fetch_url → embed → query_documents pattern means every fetched page permanently enriches the local library for future queries.
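The shape of that loop can be sketched without any LLM or network access. Everything below is a hypothetical stand-in: the three tool functions are toy versions, `scripted_llm` is a fake model that follows a fixed policy, and the in-memory `library` list replaces Chroma. The real loop in agents/tool_agent.py may be structured differently.

```python
from typing import Callable

Step = tuple[str, str, str]          # (tool name, argument, result)
library: list[str] = []              # toy stand-in for the vector store


def query_documents(q: str) -> str:
    hits = [doc for doc in library if q.lower() in doc.lower()]
    return hits[0] if hits else "no results"


def web_search(q: str) -> str:
    return "https://example.com/page"            # pretend search hit


def fetch_url(url: str) -> str:
    library.append("Rabbook uses hybrid retrieval.")  # pretend crawl + embed
    return "indexed - use query_documents"


TOOLS: dict[str, Callable[[str], str]] = {
    "query_documents": query_documents,
    "web_search": web_search,
    "fetch_url": fetch_url,
}


def scripted_llm(question: str, history: list[Step]) -> tuple:
    """Fake LLM: returns ('tool', name, arg) or ('final', answer)."""
    if not history:
        return ("tool", "query_documents", "hybrid")
    name, _arg, result = history[-1]
    if name == "query_documents" and result == "no results":
        return ("tool", "web_search", question)
    if name == "web_search":
        return ("tool", "fetch_url", result)
    if name == "fetch_url":
        return ("tool", "query_documents", "hybrid")
    return ("final", result)


def run_agent(question: str, llm, max_turns: int = 6) -> tuple[str, list[Step]]:
    """Let the model pick tools until it answers or the turn budget runs out."""
    history: list[Step] = []
    for _ in range(max_turns):
        action = llm(question, history)
        if action[0] == "final":
            return action[1], history
        _, name, arg = action
        history.append((name, arg, TOOLS[name](arg)))
    return "I could not find an answer.", history


answer, steps = run_agent("What retrieval does Rabbook use?", scripted_llm)
print(answer, [s[0] for s in steps])
```

The `max_turns` cap is an assumption added here so that a model that never produces a final answer cannot loop forever.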

Mode 2: LangGraph RAG Graph (agents/rag_graph.py)

A deterministic graph for structured, auditable retrieval with explicit grounding gates.

flowchart TD
    Start((User Query)) --> Prepare[Prepare Input & Metadata Filters]
    Prepare --> Retrieve[Hybrid Retrieval: Dense + Sparse]
    Retrieve --> Expand[Context Window Expansion]
    Expand --> Ground[Grounding Gate]

    Ground --> Decide{Decide Next Action}

    Decide -- "Grounded" --> Generate[Generate Answer with Citations]
    Decide -- "Weak Evidence" --> Refine[Refine Query]
    Decide -- "No Evidence" --> Research[Web Research Agent]

    Refine -->|retry| Retrieve
    Research -->|ingest & retry| Retrieve

    Generate --> Finalize[Finalize & Save History]
    Finalize --> End((Response))

    style Research fill:#f96,stroke:#333,stroke-width:2px
    style Refine fill:#bbf,stroke:#333,stroke-width:2px
    style Ground fill:#dfd,stroke:#333,stroke-width:2px

Switch modes with RABBOOK_ENABLE_TOOL_AGENT=true/false.
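The "Decide Next Action" node in the flowchart above routes on the output of the grounding gate. A minimal sketch of such a routing function follows, written as plain Python rather than LangGraph code. The thresholds `MIN_SCORE` and `MIN_CHUNKS`, the `Chunk` type, and the exact definition of "weak evidence" are assumptions for illustration; the README only states that the gate uses rerank score and chunk count.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    rerank_score: float


# Assumed values, for illustration only.
MIN_SCORE = 0.3    # a chunk counts as strong evidence at or above this score
MIN_CHUNKS = 2     # how many strong chunks are needed to be "grounded"


def decide_next_action(chunks: list[Chunk]) -> str:
    """Return the next node: 'generate', 'refine', or 'research'."""
    strong = [c for c in chunks if c.rerank_score >= MIN_SCORE]
    if len(strong) >= MIN_CHUNKS:
        return "generate"     # grounded: answer with citations
    if chunks:
        return "refine"       # weak evidence: rewrite the query and retry
    return "research"         # no evidence: fall back to web research


print(decide_next_action([Chunk("a", 0.8), Chunk("b", 0.6)]))
print(decide_next_action([Chunk("a", 0.1)]))
print(decide_next_action([]))
```

In a graph with retry edges like the one above, a retry counter in the state is a common way to keep the refine loop bounded; this sketch leaves that out.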


Retrieval Pipeline

Seven stages run in sequence on every query:

1. Query Transform            LLM generates 2–4 sub-queries for broader coverage
2. Candidate Collection       Dense (Chroma) + BM25 results per sub-query, deduplicated
3. RRF Fusion                 Reciprocal Rank Fusion merges the ranked lists
4. Cross-Encoder Reranking    ms-marco-MiniLM re-scores against the original query
5. Context Window Expansion   Neighboring chunks added for full document context
6. Grounding Gate             Rerank score + chunk count gate; blocks hallucination-prone answers
7. Answer Generation          Structured output with citation repair fallback
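Stage 3 can be illustrated in a few lines. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below is a generic implementation, not the code in rag/retrieve.py; the constant k = 60 is the value commonly used for RRF and is an assumption here, as are the function name and the toy document ids.

```python
from collections import defaultdict
from typing import Sequence


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Merge best-first rankings into one list using Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]     # e.g. vector-search order
sparse = ["b", "d", "a"]    # e.g. BM25 order
print(rrf_fuse([dense, sparse]))   # → ['b', 'a', 'd', 'c']
```

Because only ranks are used, the dense and sparse scores never need to be normalised onto a common scale.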

Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | FastAPI, Python 3.13 |
| Agent Orchestration | LangGraph, LangChain tool-use (bind_tools) |
| Vector Store | ChromaDB |
| Embeddings | all-MiniLM-L6-v2 (HuggingFace, local) |
| Sparse Retrieval | Rank-BM25 |
| Reranking | ms-marco-MiniLM-L-6-v2 Cross-Encoder |
| LLM Providers | Groq (Llama 3.x), Google Gemini, Ollama (local, thinking-mode aware) |
| Web Crawling | crawl4ai + DuckDuckGo (ddgs) |
| Frontend | Jinja2, Vanilla CSS |
| Testing | unittest + mocks, 57 tests, no real LLM calls |

Project Structure

agents/
  tool_agent.py       - real tool-use agent loop (the main path)
  rag_graph.py        - LangGraph deterministic graph
  research_graph.py   - standalone web research agent
  services.py         - public API: answer_query(), AnswerResult
rag/
  retrieve.py         - full 7-stage retrieval pipeline
  chunking.py         - semantic chunking (embedding-based split points)
  ingest.py           - document loading → Chroma + chunk registry
  web_ingest.py       - URL fetch, crawl, save, web_search
  registry.py         - chunk registry (O(1) neighbor lookup for context expansion)
app/
  web.py              - FastAPI routes, LLM instantiation, provider switching
  runtime.py          - lazy-load & cache: vectorstore, BM25, registry
core/
  config.py           - all env vars with defaults
evaluation/           - the 3-layer evaluation suite (see "Evaluation" below)
  evaluate_retrieval_metrics.py  - Layer 1: Hit@k / Recall@k / MRR
  evaluate_agent.py              - Layer 3: routing & refusal checks
  evaluate_ragas.py              - RAGAS faithfulness / answer relevancy
  time_agent.py                  - full agent run harness (timing, tools, answers)
  build_eval_corpus.py           - builds the 100-case HotpotQA + SQuAD v2 benchmark
  data/                          - golden dataset + per-case results & verdicts
docs/
  EVALUATION.md       - evaluation white paper (methodology, results, analysis)
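The "embedding-based split points" noted for rag/chunking.py can be sketched as follows: start a new chunk wherever the cosine similarity between neighbouring sentence embeddings drops below a threshold. This is one common way to do semantic chunking and is only an assumption about the approach; the function names, the threshold of 0.5, and the two-dimensional toy vectors (standing in for real sentence embeddings) are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: Sequence[str],
                    embeddings: Sequence[Sequence[float]],
                    threshold: float = 0.5) -> list[str]:
    """Group consecutive sentences, splitting where similarity drops."""
    if not sentences:
        return []
    chunks: list[list[str]] = [[sentences[0]]]
    pairs = zip(embeddings, embeddings[1:], sentences[1:])
    for prev_vec, cur_vec, sentence in pairs:
        if cosine(prev_vec, cur_vec) < threshold:
            chunks.append([])              # topic shift: start a new chunk
        chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


sentences = ["Cats purr.", "Cats nap a lot.", "BM25 is a ranking function."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # toy embeddings
print(semantic_chunks(sentences, vectors))
```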

📊 Evaluation lives in evaluation/ with a full write-up in docs/EVALUATION.md. See the Results & Impact and Evaluation sections.


Setup

Docker (recommended)

cp .env.example .env
# Add GROQ_API_KEY or GEMINI_KEY

docker compose up --build
# → http://localhost:6001

The image pre-downloads embedding and reranking models at build time so the first query is instant. The data/ directory is mounted as a volume, so documents and chat history persist across restarts.

Local

git clone <repo>
cd rabbook

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Add GROQ_API_KEY or GEMINI_KEY

Key .env options:

RABBOOK_LLM_PROVIDER=groq          # groq | gemini | ollama
RABBOOK_LLM_MODEL=llama-3.1-8b-instant
RABBOOK_ENABLE_TOOL_AGENT=true     # real agent loop (recommended)
RABBOOK_ENABLE_LANGGRAPH_AGENT=true
RABBOOK_ENABLE_RESEARCH_FALLBACK=false
RABBOOK_OLLAMA_THINKING=false      # suppress <think> blocks for gemma/deepseek

python main.py          # → http://127.0.0.1:6001
python ingest_docs.py   # embed files from data/uploads/
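For readers wondering how options like these are typically consumed, here is a minimal sketch of reading them with defaults. It is not the contents of core/config.py: the helper names, the set of accepted truthy strings, and the default values (copied from the example values above) are assumptions.

```python
import os
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}
PROVIDERS = {"groq", "gemini", "ollama"}


def env_flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean env var; fall back to `default` when it is unset."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in TRUTHY


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read Rabbook-style options from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    provider = env.get("RABBOOK_LLM_PROVIDER", "groq").strip().lower()
    if provider not in PROVIDERS:
        raise ValueError(
            f"unknown provider {provider!r}; expected one of {sorted(PROVIDERS)}"
        )
    return {
        "provider": provider,
        "model": env.get("RABBOOK_LLM_MODEL", "llama-3.1-8b-instant"),
        "tool_agent": env_flag(env, "RABBOOK_ENABLE_TOOL_AGENT", True),
        "ollama_thinking": env_flag(env, "RABBOOK_OLLAMA_THINKING", False),
    }


print(load_settings({"RABBOOK_LLM_PROVIDER": "ollama",
                     "RABBOOK_ENABLE_TOOL_AGENT": "false"}))
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.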

Running Tests

python -m pytest tests/ -q
# 57 passed

All tests use mocks: no API keys, no network, no vectorstore required.
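As an illustration of the mock-only style, the sketch below tests a small answer function with `unittest.mock.MagicMock` in place of an LLM. The function `answer_with_context`, its refusal string, and the `invoke` method on the fake model are hypothetical; the real tests in tests/ will look different.

```python
import unittest
from unittest.mock import MagicMock


def answer_with_context(llm, question: str, chunks: list[str]) -> str:
    """Hypothetical helper: refuse without evidence, otherwise call the LLM."""
    if not chunks:
        return "I don't know."
    context = "\n".join(chunks)
    return llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")


class AnswerTests(unittest.TestCase):
    def test_refuses_without_calling_llm(self):
        llm = MagicMock()
        self.assertEqual(answer_with_context(llm, "q", []), "I don't know.")
        llm.invoke.assert_not_called()      # no evidence, so no LLM call

    def test_passes_context_to_llm(self):
        llm = MagicMock()
        llm.invoke.return_value = "Paris"
        result = answer_with_context(
            llm, "Capital of France?", ["Paris is the capital."]
        )
        self.assertEqual(result, "Paris")
        prompt = llm.invoke.call_args.args[0]
        self.assertIn("Paris is the capital.", prompt)
```

Saved as a file under tests/, a class like this is collected by the same `python -m pytest tests/ -q` command, since pytest runs `unittest.TestCase` subclasses.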


Evaluation

Rabbook ships with a three-layer evaluation suite, because retrieval can look perfect while generation fabricates, and generation can look fine while agent routing is broken. Each layer isolates one failure mode:

| Layer | Measures | Judge |
| --- | --- | --- |
| Retrieval | Does the retriever fetch the gold chunks? (Hit@k, Recall@k, MRR) | Deterministic IR metrics |
| Answer quality | Is the final answer correct / non-fabricated? | LLM-as-judge (95% human-calibrated) |
| Agent behaviour | Does it route locally first and refuse unanswerable questions? | Heuristic |

Benchmark: 100 cases from two public datasets: 80 multi-hop HotpotQA (distractor setting: 2 gold + 8 distractor paragraphs per question) and 20 unanswerable SQuAD v2 questions (to test refusal vs. hallucination).

| Layer | Headline (tuned, gemma4:e2b) |
| --- | --- |
| Retrieval | Hit@k 1.00 · Recall@k 0.83 · MRR 0.95 |
| Answer quality | 57 / 80 ≈ 71% correct on multi-hop QA |
| Hallucination | No-fabrication rate 90% · ~10% fabricated on unanswerable Qs |

# Layer 1 - retrieval IR metrics (fast, no API cost)
python -m evaluation.evaluate_retrieval_metrics
# Layer 3 - agent behaviour checks
python -m evaluation.evaluate_agent
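The three Layer 1 metrics have standard definitions, sketched below for a single query. These are generic formulas with hypothetical function names, not the code in evaluation/evaluate_retrieval_metrics.py; MRR is the mean of `reciprocal_rank` over all queries.

```python
from typing import Sequence


def hit_at_k(ranked: Sequence[str], gold: set[str], k: int) -> float:
    """1.0 if at least one gold chunk appears in the top k, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], gold: set[str], k: int) -> float:
    """Fraction of gold chunks that appear in the top k."""
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked: Sequence[str], gold: set[str]) -> float:
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


# Toy multi-hop query: two gold chunks, one of which falls outside the top 3.
ranked = ["c1", "g1", "c2", "g2"]
gold = {"g1", "g2"}
print(hit_at_k(ranked, gold, k=3))      # → 1.0
print(recall_at_k(ranked, gold, k=3))   # → 0.5
print(reciprocal_rank(ranked, gold))    # → 0.5
```

Hit@k needs only one gold chunk in the top k while Recall@k counts all of them, so on multi-hop questions a run can report a perfect Hit@k alongside a lower Recall@k.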

📄 Full write-up: docs/EVALUATION.md, a white paper covering the methodology, the recall diagnostic, the prompt-failure taxonomy, the before/after results, a per-case verdict table, and limitations.

RAGAS metrics (Faithfulness, Answer Relevancy) are also wired in via evaluation/evaluate_ragas.py as an industry-standard cross-check.


Notes

  • Port defaults to 6001 (browsers commonly block 6000)
  • Uploaded files: data/uploads/
  • URL imports: data/uploads/urls/ (persisted, re-ingested on restart)
  • Chunk registry: data/chunk_registry.json (flat index for O(1) neighbor lookup)
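The O(1) neighbor lookup mentioned in the last note can be sketched with a flat dictionary keyed by document id and chunk position. The key format `"doc_id:index"`, the function names, and the window size are assumptions; the actual layout of data/chunk_registry.json may differ.

```python
def chunk_key(doc_id: str, index: int) -> str:
    """Flat string key, so the registry can be stored as a JSON object."""
    return f"{doc_id}:{index}"


def expand_context(registry: dict[str, str], doc_id: str,
                   index: int, window: int = 1) -> list[str]:
    """Return the chunk at `index` plus up to `window` neighbors per side."""
    texts = []
    for i in range(index - window, index + window + 1):
        text = registry.get(chunk_key(doc_id, i))   # one dict lookup each
        if text is not None:                        # skip out-of-range slots
            texts.append(text)
    return texts


registry = {
    chunk_key("guide", 0): "Intro.",
    chunk_key("guide", 1): "Setup steps.",
    chunk_key("guide", 2): "Troubleshooting.",
}
print(expand_context(registry, "guide", 1))
print(expand_context(registry, "guide", 0))
```

Each neighbor costs a single dictionary lookup, so expansion time depends on the window size rather than on the number of chunks in the library.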