Spaces:

Matcry
/

Rabbook

Running

App Files Files Community

Rabbook / README.md

Matcry

Deploy snapshot

c76423f 24 days ago

preview code

Raw

History Blame Contribute Delete

11.4 kB

metadata

title: Rabbook Agentic RAG
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 6001
pinned: false

Rabbook — Agentic RAG System

A production-quality Retrieval-Augmented Generation application built from scratch, featuring a real tool-use agent loop, hybrid retrieval, and a self-expanding knowledge base.

What Makes This Different

Most RAG projects embed documents and call an LLM. Rabbook is built the way production systems are built:

What	Why It Matters
Real tool-use agent loop	The LLM decides which tool to call each turn — not a hardcoded pipeline. Mirrors how Claude, Codex, and Gemini work.
7-stage retrieval pipeline	Dense + sparse fusion → RRF → cross-encoder reranking → context expansion → grounding gate. Each stage is measurable and independently testable.
Self-expanding knowledge base	When the agent fetches a web page, it auto-embeds it. Future queries over that content go through the full RAG pipeline — not raw text.
Multi-provider LLM support	Groq (Llama), Google Gemini, and local Ollama models (including thinking-mode toggle). Swap providers with a single env var.
57 unit tests, zero LLM calls	Full mock coverage across retrieval, agent loop, research graph, and structured output.

Results & Impact

I treated this as a real engineering project: build it, measure it on a hard public benchmark, find the bottlenecks, and prove the fix — all on a free local 4.6B model (Ollama gemma) at $0 inference cost.

Benchmark: 100 cases — 80 multi-hop HotpotQA (distractor setting) + 20 unanswerable SQuAD v2 — scored by an LLM-as-judge calibrated to 95% agreement with human labels before use.

Metric	Before	After	Lever
Answer accuracy (multi-hop QA)	64%	71%	Evidence-based prompt rework
Hallucination (unanswerable Qs)	~20%	~10%	Grounding-discipline prompt rules
Retrieval — both gold chunks found	54%	89%	Widened hybrid candidate pool
Retrieval — Hit@k	0.99	1.00	(same)
Tool escalation (snippet → full page)	2 / 100	8 / 100	Resolved "escalate vs. refuse" prompt conflict

Each gain was diagnosed before it was fixed — e.g. the retrieval jump came from proving the second multi-hop chunk was missing from the candidate pool (not just mis-ranked), then widening it. Full methodology, per-case verdicts, and an honest regression analysis live in the white paper: docs/EVALUATION.md.

Why this matters: diagnosing why a RAG system fails and proving the improvement with numbers is the skill that separates real AI engineering from a "ChatGPT wrapper."

Two Agent Modes

Mode 1 — Tool-Use Agent Loop (`agents/tool_agent.py`)

A real agentic loop where the LLM autonomously picks tools until it has enough information to answer. No hardcoded routing.

User query
    └─▶ LLM decides tool call
            ├─▶ query_documents  →  hybrid RAG search over local library
            ├─▶ web_search       →  DuckDuckGo, returns URLs + snippets
            └─▶ fetch_url        →  crawls page, auto-embeds into Chroma,
                                    returns "indexed — use query_documents"
    └─▶ LLM calls query_documents again → hits newly embedded content
    └─▶ LLM produces final grounded answer

The fetch_url → embed → query_documents pattern means every fetched page permanently enriches the local library for future queries.

Mode 2 — LangGraph RAG Graph (`agents/rag_graph.py`)

A deterministic graph for structured, auditable retrieval with explicit grounding gates.

flowchart TD
    Start((User Query)) --> Prepare[Prepare Input & Metadata Filters]
    Prepare --> Retrieve[Hybrid Retrieval: Dense + Sparse]
    Retrieve --> Expand[Context Window Expansion]
    Expand --> Ground[Grounding Gate]

    Ground --> Decide{Decide Next Action}

    Decide -- "Grounded" --> Generate[Generate Answer with Citations]
    Decide -- "Weak Evidence" --> Refine[Refine Query]
    Decide -- "No Evidence" --> Research[Web Research Agent]

    Refine -->|retry| Retrieve
    Research -->|ingest & retry| Retrieve

    Generate --> Finalize[Finalize & Save History]
    Finalize --> End((Response))

    style Research fill:#f96,stroke:#333,stroke-width:2px
    style Refine fill:#bbf,stroke:#333,stroke-width:2px
    style Ground fill:#dfd,stroke:#333,stroke-width:2px

Switch modes with RABBOOK_ENABLE_TOOL_AGENT=true/false.

Retrieval Pipeline

Seven stages run in sequence on every query:

1. Query Transform     LLM generates 2–4 sub-queries for broader coverage
2. Candidate Collection Dense (Chroma) + BM25 results per sub-query, deduplicated
3. RRF Fusion          Reciprocal Rank Fusion merges the ranked lists
4. Cross-Encoder Reranking  ms-marco-MiniLM re-scores against the original query
5. Context Window Expansion  Neighboring chunks added for full document context
6. Grounding Gate      Rerank score + chunk count gate; blocks hallucination-prone answers
7. Answer Generation   Structured output with citation repair fallback

Tech Stack

Layer	Technology
Backend	FastAPI, Python 3.13
Agent Orchestration	LangGraph, LangChain tool-use (`bind_tools`)
Vector Store	ChromaDB
Embeddings	`all-MiniLM-L6-v2` (HuggingFace, local)
Sparse Retrieval	Rank-BM25
Reranking	`ms-marco-MiniLM-L-6-v2` Cross-Encoder
LLM Providers	Groq (Llama 3.x), Google Gemini, Ollama (local, thinking-mode aware)
Web Crawling	crawl4ai + DuckDuckGo (`ddgs`)
Frontend	Jinja2, Vanilla CSS
Testing	`unittest` + mocks, 57 tests, no real LLM calls

Project Structure

agents/
  tool_agent.py       — real tool-use agent loop (the main path)
  rag_graph.py        — LangGraph deterministic graph
  research_graph.py   — standalone web research agent
  services.py         — public API: answer_query(), AnswerResult
rag/
  retrieve.py         — full 7-stage retrieval pipeline
  chunking.py         — semantic chunking (embedding-based split points)
  ingest.py           — document loading → Chroma + chunk registry
  web_ingest.py       — URL fetch, crawl, save, web_search
  registry.py         — chunk registry (O(1) neighbor lookup for context expansion)
app/
  web.py              — FastAPI routes, LLM instantiation, provider switching
  runtime.py          — lazy-load & cache: vectorstore, BM25, registry
core/
  config.py           — all env vars with defaults
evaluation/           — the 3-layer evaluation suite (see "Evaluation" below)
  evaluate_retrieval_metrics.py  — Layer 1: Hit@k / Recall@k / MRR
  evaluate_agent.py              — Layer 3: routing & refusal checks
  evaluate_ragas.py             — RAGAS faithfulness / answer relevancy
  time_agent.py                 — full agent run harness (timing, tools, answers)
  build_eval_corpus.py          — builds the 100-case HotpotQA + SQuAD v2 benchmark
  data/                         — golden dataset + per-case results & verdicts
docs/
  EVALUATION.md       — evaluation white paper (methodology, results, analysis)

📊 Evaluation lives in evaluation/ with a full write-up in docs/EVALUATION.md. See the Results & Impact and Evaluation sections.

Setup

Docker (recommended)

cp .env.example .env
# Add GROQ_API_KEY or GEMINI_KEY

docker compose up --build
# → http://localhost:6001

The image pre-downloads embedding and reranking models at build time so the first query is instant. The data/ directory is mounted as a volume — documents and chat history persist across restarts.

Local

git clone <repo>
cd rabbook

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Add GROQ_API_KEY or GEMINI_KEY

Key .env options:

RABBOOK_LLM_PROVIDER=groq          # groq | gemini | ollama
RABBOOK_LLM_MODEL=llama-3.1-8b-instant
RABBOOK_ENABLE_TOOL_AGENT=true     # real agent loop (recommended)
RABBOOK_ENABLE_LANGGRAPH_AGENT=true
RABBOOK_ENABLE_RESEARCH_FALLBACK=false
RABBOOK_OLLAMA_THINKING=false      # suppress <think> blocks for gemma/deepseek

python main.py          # → http://127.0.0.1:6001
python ingest_docs.py   # embed files from data/uploads/

Running Tests

python -m pytest tests/ -q
# 57 passed

All tests use mocks — no API keys, no network, no vectorstore required.

Evaluation

Rabbook ships with a three-layer evaluation suite — because retrieval can look perfect while generation fabricates, and generation can look fine while agent routing is broken. Each layer isolates one failure mode:

Layer	Measures	Judge
Retrieval	Does the retriever fetch the gold chunks? (Hit@k, Recall@k, MRR)	Deterministic IR metrics
Answer quality	Is the final answer correct / non-fabricated?	LLM-as-judge (95% human-calibrated)
Agent behaviour	Does it route locally first and refuse unanswerable questions?	Heuristic

Benchmark: 100 cases from two public datasets — 80 multi-hop HotpotQA (distractor setting: 2 gold + 8 distractor paragraphs per question) and 20 unanswerable SQuAD v2 questions (to test refusal vs. hallucination).

Layer	Headline (tuned, gemma4:e2b)
Retrieval	Hit@k 1.00 · Recall@k 0.83 · MRR 0.95
Answer quality	57 / 80 ≈ 71% correct on multi-hop QA
Hallucination	No-fabrication rate 90% · ~10% fabricated on unanswerable Qs

# Layer 1 — retrieval IR metrics (fast, no API cost)
python -m evaluation.evaluate_retrieval_metrics
# Layer 3 — agent behaviour checks
python -m evaluation.evaluate_agent

📄 Full write-up: docs/EVALUATION.md — a white paper covering the methodology, the recall diagnostic, the prompt-failure taxonomy, the before/after results, a per-case verdict table, and limitations.

RAGAS metrics (Faithfulness, Answer Relevancy) are also wired in via evaluation/evaluate_ragas.py as an industry-standard cross-check.

Notes

Port defaults to 6001 (browsers commonly block 6000)
Uploaded files: data/uploads/
URL imports: data/uploads/urls/ (persisted, re-ingested on restart)
Chunk registry: data/chunk_registry.json (flat index for O(1) neighbor lookup)