---
title: Rabbook Agentic RAG
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 6001
pinned: false
---
# Rabbook: Agentic RAG System
> A production-quality Retrieval-Augmented Generation application built from scratch, featuring a real tool-use agent loop, hybrid retrieval, and a self-expanding knowledge base.
![Python](https://img.shields.io/badge/Python-3.13-blue?logo=python)
![FastAPI](https://img.shields.io/badge/FastAPI-0.115-green?logo=fastapi)
![LangGraph](https://img.shields.io/badge/LangGraph-Agentic-orange)
![Tests](https://img.shields.io/badge/Tests-57%20passing-brightgreen)
![Docker](https://img.shields.io/badge/Docker-ready-2496ED?logo=docker)
![Accuracy](https://img.shields.io/badge/Answer%20Accuracy-71%25-success)
![Benchmark](https://img.shields.io/badge/Benchmark-100%20cases-blue)
![License](https://img.shields.io/badge/License-MIT-lightgrey)
---
## What Makes This Different
Most RAG projects embed documents and call an LLM. Rabbook is built the way production systems are built:
| What | Why It Matters |
|------|----------------|
| **Real tool-use agent loop** | The LLM decides which tool to call each turn; there is no hardcoded pipeline. Mirrors how Claude, Codex, and Gemini work. |
| **7-stage retrieval pipeline** | Dense + sparse fusion → RRF → cross-encoder reranking → context expansion → grounding gate. Each stage is measurable and independently testable. |
| **Self-expanding knowledge base** | When the agent fetches a web page, it auto-embeds it. Future queries over that content go through the full RAG pipeline, not raw text. |
| **Multi-provider LLM support** | Groq (Llama), Google Gemini, and local Ollama models (including thinking-mode toggle). Swap providers with a single env var. |
| **57 unit tests, zero LLM calls** | Full mock coverage across retrieval, agent loop, research graph, and structured output. |
---
## Results & Impact
I treated this as a real engineering project: build it, **measure it on a hard public benchmark, find the bottlenecks, and prove the fix**, all on a **free local 4.6B model** (Ollama `gemma`) at **$0 inference cost**.
**Benchmark:** 100 cases (80 multi-hop **HotpotQA** in the distractor setting + 20 unanswerable **SQuAD v2**), scored by an LLM-as-judge **calibrated to 95% agreement with human labels** before use.
| Metric | Before | After | Lever |
|--------|:------:|:-----:|-------|
| **Answer accuracy** (multi-hop QA) | 64% | **71%** | Evidence-based prompt rework |
| **Hallucination** (unanswerable Qs) | ~20% | **~10%** | Grounding-discipline prompt rules |
| **Retrieval: both gold chunks found** | 54% | **89%** | Widened hybrid candidate pool |
| **Retrieval: Hit@k** | 0.99 | **1.00** | (same) |
| **Tool escalation** (snippet → full page) | 2 / 100 | **8 / 100** | Resolved "escalate vs. refuse" prompt conflict |
Each gain was **diagnosed before it was fixed**. For example, the retrieval jump came from proving the second multi-hop chunk was missing from the candidate *pool* (not just mis-ranked), then widening it. Full methodology, per-case verdicts, and an **honest regression analysis** live in the white paper: **[`docs/EVALUATION.md`](docs/EVALUATION.md)**.
> **Why this matters:** diagnosing *why* a RAG system fails and proving the improvement with numbers is the skill that separates real AI engineering from a "ChatGPT wrapper."
---
## Two Agent Modes
### Mode 1: Tool-Use Agent Loop (`agents/tool_agent.py`)
A real agentic loop where the LLM autonomously picks tools until it has enough information to answer. No hardcoded routing.
```
User query
  └─▶ LLM decides tool call
        ├─▶ query_documents  → hybrid RAG search over local library
        ├─▶ web_search       → DuckDuckGo, returns URLs + snippets
        └─▶ fetch_url        → crawls page, auto-embeds into Chroma,
                               returns "indexed - use query_documents"
              └─▶ LLM calls query_documents again → hits newly embedded content
                    └─▶ LLM produces final grounded answer
```
The `fetch_url → embed → query_documents` pattern means every fetched page permanently enriches the local library for future queries.
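To make the control flow concrete, here is a minimal sketch of such a loop with a scripted stand-in for the LLM. Everything in it is hypothetical (`scripted_llm`, `run_agent`, the stubbed tools, and the in-memory `library` that stands in for Chroma); it is not the code in `agents/tool_agent.py`, which lets a real model choose tools.

```python
from typing import Callable

# Hypothetical in-memory "library" standing in for the Chroma collection.
library: list[str] = []

def query_documents(query: str) -> str:
    """Stub retrieval: substring match instead of hybrid RAG search."""
    hits = [doc for doc in library if query.lower() in doc.lower()]
    return hits[0] if hits else "no local results"

def web_search(query: str) -> str:
    """Stub search: a real tool would return URLs plus snippets."""
    return "https://example.com/rrf"

def fetch_url(url: str) -> str:
    """Stub crawl: pretend the page was fetched, then index its text."""
    library.append("Reciprocal rank fusion merges ranked lists.")
    return "indexed; use query_documents"

TOOLS: dict[str, Callable[[str], str]] = {
    "query_documents": query_documents,
    "web_search": web_search,
    "fetch_url": fetch_url,
}

def scripted_llm(history: list[tuple[str, str]]) -> tuple[str, str]:
    """Stand-in for the model: returns (tool_name, argument) or ("final", answer)."""
    if not history:
        return ("query_documents", "rank fusion")
    last_tool, last_result = history[-1]
    if last_tool == "query_documents" and last_result == "no local results":
        return ("web_search", "rank fusion")
    if last_tool == "web_search":
        return ("fetch_url", last_result)
    if last_tool == "fetch_url":
        return ("query_documents", "rank fusion")
    return ("final", last_result)

def run_agent(max_turns: int = 8) -> tuple[str, list[str]]:
    """Run tool calls until the 'model' answers or the turn budget runs out."""
    history: list[tuple[str, str]] = []
    for _ in range(max_turns):
        tool, arg = scripted_llm(history)
        if tool == "final":
            return arg, [name for name, _ in history]
        history.append((tool, TOOLS[tool](arg)))
    return "gave up", [name for name, _ in history]
```

Calling `run_agent()` on an empty library walks through local search, web search, fetch, and a second local search before answering. The `max_turns` cap is an assumption added here so the loop always terminates.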
### Mode 2: LangGraph RAG Graph (`agents/rag_graph.py`)
A deterministic graph for structured, auditable retrieval with explicit grounding gates.
```mermaid
flowchart TD
Start((User Query)) --> Prepare[Prepare Input & Metadata Filters]
Prepare --> Retrieve[Hybrid Retrieval: Dense + Sparse]
Retrieve --> Expand[Context Window Expansion]
Expand --> Ground[Grounding Gate]
Ground --> Decide{Decide Next Action}
Decide -- "Grounded" --> Generate[Generate Answer with Citations]
Decide -- "Weak Evidence" --> Refine[Refine Query]
Decide -- "No Evidence" --> Research[Web Research Agent]
Refine -->|retry| Retrieve
Research -->|ingest & retry| Retrieve
Generate --> Finalize[Finalize & Save History]
Finalize --> End((Response))
style Research fill:#f96,stroke:#333,stroke-width:2px
style Refine fill:#bbf,stroke:#333,stroke-width:2px
style Ground fill:#dfd,stroke:#333,stroke-width:2px
```
Switch modes with `RABBOOK_ENABLE_TOOL_AGENT=true/false`.
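The "Decide Next Action" node in the graph can be pictured as a pure routing function over the grounding signals. The sketch below is only an illustration: the `Evidence` fields, the thresholds, the retry cap, and the choice to fall back to research once retries are exhausted are all assumptions, not values taken from `agents/rag_graph.py`.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    top_rerank_score: float  # best cross-encoder score among retrieved chunks
    chunk_count: int         # number of chunks that survived retrieval

def decide_next_action(
    ev: Evidence,
    retries: int,
    min_score: float = 0.3,   # assumed threshold
    min_chunks: int = 2,      # assumed threshold
    max_retries: int = 2,     # assumed retry cap
) -> str:
    """Return the next node: 'generate', 'refine', or 'research'."""
    if ev.chunk_count >= min_chunks and ev.top_rerank_score >= min_score:
        return "generate"      # grounded
    if ev.chunk_count == 0:
        return "research"      # no evidence at all
    if retries < max_retries:
        return "refine"        # weak evidence: rewrite the query and retry
    return "research"          # weak evidence and retries exhausted (assumption)
```

Keeping the decision in one side-effect-free function makes it easy to unit test without an LLM or a vector store.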
---
## Retrieval Pipeline
Seven stages run in sequence on every query:
```
1. Query Transform            LLM generates 2–4 sub-queries for broader coverage
2. Candidate Collection       Dense (Chroma) + BM25 results per sub-query, deduplicated
3. RRF Fusion                 Reciprocal Rank Fusion merges the ranked lists
4. Cross-Encoder Reranking    ms-marco-MiniLM re-scores against the original query
5. Context Window Expansion   Neighboring chunks added for full document context
6. Grounding Gate             Rerank score + chunk count gate; blocks hallucination-prone answers
7. Answer Generation          Structured output with citation repair fallback
```
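Stage 3 is the easiest stage to show in isolation. The sketch below implements standard Reciprocal Rank Fusion, where each document scores the sum of `1 / (k + rank)` over the lists it appears in. The function name `rrf_fuse`, the chunk IDs, and the constant `k = 60` (a commonly used default) are assumptions; the value Rabbook uses in `rag/retrieve.py` may differ.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

dense = ["c3", "c1", "c7"]    # hypothetical dense (embedding) ranking
sparse = ["c1", "c9", "c3"]   # hypothetical BM25 ranking
fused = rrf_fuse([dense, sparse])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that only one retriever found, which is why fusion works without comparing raw dense and BM25 scores.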
---
## Tech Stack
| Layer | Technology |
|-------|-----------|
| Backend | FastAPI, Python 3.13 |
| Agent Orchestration | LangGraph, LangChain tool-use (`bind_tools`) |
| Vector Store | ChromaDB |
| Embeddings | `all-MiniLM-L6-v2` (HuggingFace, local) |
| Sparse Retrieval | Rank-BM25 |
| Reranking | `ms-marco-MiniLM-L-6-v2` Cross-Encoder |
| LLM Providers | Groq (Llama 3.x), Google Gemini, Ollama (local, thinking-mode aware) |
| Web Crawling | crawl4ai + DuckDuckGo (`ddgs`) |
| Frontend | Jinja2, Vanilla CSS |
| Testing | `unittest` + mocks, 57 tests, no real LLM calls |
---
## Project Structure
```
agents/
  tool_agent.py        - real tool-use agent loop (the main path)
  rag_graph.py         - LangGraph deterministic graph
  research_graph.py    - standalone web research agent
  services.py          - public API: answer_query(), AnswerResult
rag/
  retrieve.py          - full 7-stage retrieval pipeline
  chunking.py          - semantic chunking (embedding-based split points)
  ingest.py            - document loading → Chroma + chunk registry
  web_ingest.py        - URL fetch, crawl, save, web_search
  registry.py          - chunk registry (O(1) neighbor lookup for context expansion)
app/
  web.py               - FastAPI routes, LLM instantiation, provider switching
  runtime.py           - lazy-load & cache: vectorstore, BM25, registry
core/
  config.py            - all env vars with defaults
evaluation/            - the 3-layer evaluation suite (see "Evaluation" below)
  evaluate_retrieval_metrics.py  - Layer 1: Hit@k / Recall@k / MRR
  evaluate_agent.py              - Layer 3: routing & refusal checks
  evaluate_ragas.py              - RAGAS faithfulness / answer relevancy
  time_agent.py                  - full agent run harness (timing, tools, answers)
  build_eval_corpus.py           - builds the 100-case HotpotQA + SQuAD v2 benchmark
  data/                          - golden dataset + per-case results & verdicts
docs/
  EVALUATION.md        - evaluation white paper (methodology, results, analysis)
```
> 📊 **Evaluation lives in [`evaluation/`](evaluation/) with a full write-up in
> [`docs/EVALUATION.md`](docs/EVALUATION.md).** See the [Results & Impact](#results--impact)
> and [Evaluation](#evaluation) sections.
---
## Setup
### Docker (recommended)
```bash
cp .env.example .env
# Add GROQ_API_KEY or GEMINI_KEY
docker compose up --build
# → http://localhost:6001
```
The image pre-downloads the embedding and reranking models at build time, so the first query does not wait on a model download. The `data/` directory is mounted as a volume, so documents and chat history persist across restarts.
### Local
```bash
git clone <repo>
cd rabbook
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Add GROQ_API_KEY or GEMINI_KEY
```
Key `.env` options:
```bash
RABBOOK_LLM_PROVIDER=groq # groq | gemini | ollama
RABBOOK_LLM_MODEL=llama-3.1-8b-instant
RABBOOK_ENABLE_TOOL_AGENT=true # real agent loop (recommended)
RABBOOK_ENABLE_LANGGRAPH_AGENT=true
RABBOOK_ENABLE_RESEARCH_FALLBACK=false
RABBOOK_OLLAMA_THINKING=false # suppress <think> blocks for gemma/deepseek
```
```bash
python main.py          # → http://127.0.0.1:6001
python ingest_docs.py   # embed files from data/uploads/
```
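For readers wondering how options like these typically reach the application, here is a small sketch of env-var loading with defaults and validation. The `Settings` dataclass, `env_bool`, `load_settings`, and the default values are assumptions modelled on the example values above; the actual definitions live in `core/config.py` and may look different.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass

def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean flag such as 'true' / 'false' from the environment."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

@dataclass(frozen=True)
class Settings:
    llm_provider: str
    llm_model: str
    enable_tool_agent: bool
    ollama_thinking: bool

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read RABBOOK_* variables; defaults here are illustrative assumptions."""
    provider = env.get("RABBOOK_LLM_PROVIDER", "groq").strip().lower()
    if provider not in {"groq", "gemini", "ollama"}:
        raise ValueError(f"Unsupported RABBOOK_LLM_PROVIDER: {provider!r}")
    return Settings(
        llm_provider=provider,
        llm_model=env.get("RABBOOK_LLM_MODEL", "llama-3.1-8b-instant"),
        enable_tool_agent=env_bool(env, "RABBOOK_ENABLE_TOOL_AGENT", True),
        ollama_thinking=env_bool(env, "RABBOOK_OLLAMA_THINKING", False),
    )
```

Passing the mapping in as a parameter, rather than reading `os.environ` directly inside each function, keeps the loader testable without touching the real environment.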
---
## Running Tests
```bash
python -m pytest tests/ -q
# 57 passed
```
All tests use mocks: no API keys, no network, no vectorstore required.
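As an illustration of the mock-based style, the sketch below tests a hypothetical `answer_with_llm` helper with `unittest.mock.MagicMock` in place of a real model. The helper and the test names are invented for this example and are not taken from the `tests/` directory.

```python
import unittest
from unittest.mock import MagicMock

def answer_with_llm(llm, question: str, chunks: list[str]) -> str:
    """Hypothetical function under test: refuse without evidence, else ask the LLM."""
    if not chunks:
        return "I don't have enough evidence to answer."
    context = "\n".join(chunks)
    return llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")

class AnswerTests(unittest.TestCase):
    def test_refuses_without_chunks(self):
        llm = MagicMock()
        result = answer_with_llm(llm, "Who wrote the notes?", [])
        self.assertIn("enough evidence", result)
        llm.invoke.assert_not_called()  # no LLM call when nothing was retrieved

    def test_uses_llm_when_grounded(self):
        llm = MagicMock()
        llm.invoke.return_value = "Ada Lovelace"
        result = answer_with_llm(
            llm, "Who wrote the notes?", ["Ada Lovelace wrote the notes."]
        )
        self.assertEqual(result, "Ada Lovelace")
        llm.invoke.assert_called_once()
```

Because the mock records its calls, a test can assert both on the returned answer and on whether the model was consulted at all.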
---
## Evaluation
Rabbook ships with a **three-layer evaluation suite**, because retrieval can look perfect while generation fabricates, and generation can look fine while agent routing is broken. Each layer isolates one failure mode:
| Layer | Measures | Judge |
|-------|----------|-------|
| **Retrieval** | Does the retriever fetch the gold chunks? (Hit@k, Recall@k, MRR) | Deterministic IR metrics |
| **Answer quality** | Is the final answer correct / non-fabricated? | LLM-as-judge (95% human-calibrated) |
| **Agent behaviour** | Does it route locally first and refuse unanswerable questions? | Heuristic |
**Benchmark:** 100 cases from two public datasets: **80 multi-hop HotpotQA** (distractor setting: 2 gold + 8 distractor paragraphs per question) and **20 unanswerable SQuAD v2** questions (to test refusal vs. hallucination).
| Layer | Headline (tuned, gemma4:e2b) |
|-------|------------------------------|
| Retrieval | Hit@k **1.00** · Recall@k **0.83** · MRR **0.95** |
| Answer quality | **57 / 80 ≈ 71%** correct on multi-hop QA |
| Hallucination | No-fabrication rate **90%** · ~10% fabricated on unanswerable Qs |
```bash
# Layer 1: retrieval IR metrics (fast, no API cost)
python -m evaluation.evaluate_retrieval_metrics
# Layer 3: agent behaviour checks
python -m evaluation.evaluate_agent
```
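For reference, the three Layer 1 metrics can be computed as in the sketch below, which uses their standard definitions. The function names and the two toy cases are made up for illustration; `evaluation/evaluate_retrieval_metrics.py` may organise the computation differently.

```python
def hit_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """1.0 if at least one gold chunk appears in the top k, else 0.0."""
    return 1.0 if any(chunk in gold for chunk in retrieved[:k]) else 0.0

def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold chunks that appear in the top k."""
    return len(gold.intersection(retrieved[:k])) / len(gold)

def reciprocal_rank(retrieved: list[str], gold: set[str]) -> float:
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in gold:
            return 1.0 / rank
    return 0.0

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

# Two toy multi-hop cases, each with two gold chunks (IDs are hypothetical).
cases = [
    (["d4", "g1", "d2", "g2"], {"g1", "g2"}),  # second gold chunk falls outside k=3
    (["g3", "g4", "d1", "d5"], {"g3", "g4"}),
]
k = 3
hit = mean([hit_at_k(r, g, k) for r, g in cases])        # 1.0
recall = mean([recall_at_k(r, g, k) for r, g in cases])  # 0.75
mrr = mean([reciprocal_rank(r, g) for r, g in cases])    # 0.75
```

The first toy case shows why Hit@k can sit at 1.00 while a multi-hop question is still missing one of its two gold chunks: Hit@k only needs one of them, so recall is the more sensitive signal.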
📄 **Full write-up:** [`docs/EVALUATION.md`](docs/EVALUATION.md), a white paper covering the methodology, the recall diagnostic, the prompt-failure taxonomy, the before/after results, a per-case verdict table, and limitations.
> RAGAS metrics (Faithfulness, Answer Relevancy) are also wired in via `evaluation/evaluate_ragas.py` as an industry-standard cross-check.
---
## Notes
- Port defaults to `6001` (browsers commonly block `6000`)
- Uploaded files: `data/uploads/`
- URL imports: `data/uploads/urls/` (persisted, re-ingested on restart)
- Chunk registry: `data/chunk_registry.json` (flat index for O(1) neighbor lookup)
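
The O(1) neighbor lookup mentioned in the last note can be sketched as a flat dictionary keyed by document ID and chunk index. The key format, `build_registry`, and `expand_context` are assumptions for illustration only; the real layout of `data/chunk_registry.json` is not documented here.

```python
def chunk_key(doc_id: str, index: int) -> str:
    """Assumed flat key format: '<doc_id>:<chunk index>'."""
    return f"{doc_id}:{index}"

def build_registry(docs: dict[str, list[str]]) -> dict[str, str]:
    """Flatten per-document chunk lists into one dict for constant-time lookup."""
    return {
        chunk_key(doc_id, i): text
        for doc_id, chunks in docs.items()
        for i, text in enumerate(chunks)
    }

def expand_context(
    registry: dict[str, str], doc_id: str, index: int, window: int = 1
) -> list[str]:
    """Return the hit chunk plus up to `window` neighbors on each side, in order."""
    expanded = []
    for i in range(index - window, index + window + 1):
        text = registry.get(chunk_key(doc_id, i))  # one dict lookup per neighbor
        if text is not None:                        # skip past document edges
            expanded.append(text)
    return expanded

registry = build_registry({"a": ["a0", "a1", "a2"], "b": ["b0"]})
print(expand_context(registry, "a", 1))  # ['a0', 'a1', 'a2']
```

Each neighbor costs a single dictionary lookup, so context expansion stays cheap no matter how many chunks the library holds.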