---
title: RAG Evaluation System
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

RAG Evaluation System

A comprehensive system for evaluating Hierarchical RAG vs Standard RAG pipelines with support for multiple document types and metadata hierarchies.

Features

  • Dual RAG Pipelines: Compare Base-RAG vs Hierarchical RAG side-by-side
  • Multiple Hierarchies: Hospital, Banking, and Fluid Simulation domains
  • Comprehensive Evaluation: Quantitative metrics (Hit@k, MRR, latency, semantic similarity) and qualitative analysis
  • Gradio UI: User-friendly interface for all operations
  • MCP Server: Model Context Protocol server for programmatic access
  • API Export: All main functions exposed via Gradio Client API

Repository Layout

.
├── app.py                    # Spaces entry; defines UI and exposed functions (with api_name)
├── core/                     # Internal logic (NOT publicly exposed)
│   ├── ingest.py             # Loaders, hierarchical classification, chunking
│   ├── index.py              # Embeddings, vector DB, metadata filters
│   ├── retrieval.py          # Base-RAG / Hier-RAG pipelines
│   ├── eval.py               # Metrics: Hit@k, MRR, latency, similarity
│   └── utils.py              # Shared helpers (e.g., PII masking)
├── hierarchies/              # Hierarchy definitions (YAML)
│   ├── hospital.yaml
│   ├── bank.yaml
│   └── fluid_simulation.yaml
├── tests/                    # pytest cases
│   ├── test_ingest.py
│   ├── test_retrieval.py
│   ├── test_eval.py
│   └── test_index.py
├── reports/                  # Evaluation results (CSV/JSON)
├── requirements.txt          # Dependencies
└── README.md                 # This file

Setup

Prerequisites

  • Python 3.8+
  • pip or conda

Installation

  1. Clone the repository:
git clone <repository-url>
cd rag-evaluation-system
  2. Install dependencies:
pip install -r requirements.txt
  3. Create necessary directories:
mkdir -p reports chroma_data

Environment Variables

Create a .env file (see .env.example) to configure the app:

  • OPENAI_API_KEY (optional): enables OpenAI for embeddings and detection
  • OPENAI_MODEL (optional): chat model for detection (default gpt-4o-mini)
  • OPENAI_EMBED_MODEL (optional): embedding model (default text-embedding-3-small)
  • USE_OPENAI_EMBEDDINGS (optional): true|false to force provider selection
  • ST_EMBED_MODEL (optional): fallback SentenceTransformers model (default all-MiniLM-L6-v2)
  • CHROMA_PERSIST_DIR (optional): Chroma dir (default ./chroma_data)
  • DEFAULT_SEARCH_K (optional): default k in Search auto mode (default 5)
  • GRADIO_SERVER_PORT (optional): server port (default 7860)
  • LOG_LEVEL (optional): DEBUG|INFO|... (default INFO)
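A minimal .env might look like the following (illustrative values only; see .env.example for the canonical template):

```
# .env (illustrative sketch)
OPENAI_API_KEY=sk-...
USE_OPENAI_EMBEDDINGS=true
OPENAI_EMBED_MODEL=text-embedding-3-small
CHROMA_PERSIST_DIR=./chroma_data
DEFAULT_SEARCH_K=5
LOG_LEVEL=INFO
```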

Note: collections are namespaced by embeddings provider/dimension (e.g., documents__oai_1536, documents__st_384). Re‑upload after switching providers/models.

Usage

Using the Gradio API (gradio_client)

All main functions are exposed via the Gradio API with api_name:

  • build_rag: Build RAG index from uploaded files
  • search: Search documents using both pipelines
  • chat: Chat interface with RAG system
  • evaluate: Run quantitative evaluation

Example usage:

from gradio_client import Client

client = Client("http://your-server:7860/")

# Build RAG index
result = client.predict(
    files=["doc1.pdf", "doc2.pdf"],
    hierarchy="hospital",
    doc_type="Report",
    language="en",
    api_name="/build_rag"
)

# Search documents
results = client.predict(
    query="What are emergency procedures?",
    k=5,
    level1="Clinical",
    level2="Emergency",
    level3=None,
    doc_type="Report",
    api_name="/search"
)

MCP Server

The system can run as an MCP (Model Context Protocol) server for programmatic access:

python app.py --mcp

Connecting to MCP Server

Add to your MCP client configuration (e.g., for Claude Desktop):

{
  "mcpServers": {
    "rag-evaluation": {
      "command": "python",
      "args": ["/path/to/app.py", "--mcp"],
      "env": {}
    }
  }
}

Available MCP Tools

  1. search_documents: Search documents using RAG system

    • Parameters: query, k, pipeline, level1, level2, level3, doc_type
  2. evaluate_retrieval: Evaluate RAG performance with batch queries

    • Parameters: queries (array), output_file
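A search_documents tool invocation would carry arguments like the following (illustrative payload; the wire format is defined by the MCP specification, and the "hier" pipeline value is an assumption):

```json
{
  "name": "search_documents",
  "arguments": {
    "query": "What are emergency procedures?",
    "k": 5,
    "pipeline": "hier",
    "level1": "Clinical",
    "level2": "Emergency",
    "level3": null,
    "doc_type": "Report"
  }
}
```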

UI Tabs

1. Upload Documents

  • Upload multiple PDF/TXT files
  • Set Hierarchy/Doc Type/Language to Auto for per‑chunk detection (OpenAI preferred; heuristic fallback)
  • Paragraph‑first chunking with merging of consecutive similar paragraphs (same hierarchy + level1 + level2). Explicit labels (Domain/Section/Topic) “stick” across following paragraphs until overridden
  • After build:
    • Build Status (processed count, indexed chunks)
    • File Summary (Filename, Chunks, Language, Doc Type, Hierarchy)
    • Indexed Chunks (preview with Level1/2/3 and first 160 chars)
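The paragraph-merging rule above can be sketched as follows (a hypothetical helper; the actual logic lives in core/ingest.py):

```python
def merge_paragraphs(paras):
    """Merge consecutive paragraphs that share the same (hierarchy, level1, level2) key.

    `paras` is a list of (text, key) tuples; adjacent entries with equal keys
    are joined into a single chunk, preserving order.
    """
    merged = []
    for text, key in paras:
        if merged and merged[-1][1] == key:
            # Same hierarchy + level1 + level2 as the previous paragraph: merge.
            merged[-1] = (merged[-1][0] + "\n\n" + text, key)
        else:
            merged.append((text, key))
    return merged

# Two consecutive Clinical/Emergency paragraphs collapse into one chunk;
# the Administrative paragraph starts a new chunk.
paras = [
    ("Triage begins at intake.", ("hospital", "Clinical", "Emergency")),
    ("Patients are assessed by severity.", ("hospital", "Clinical", "Emergency")),
    ("Invoices are issued monthly.", ("hospital", "Administrative", "Billing")),
]
chunks = merge_paragraphs(paras)
```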

2. Search

Default (auto):

  • Enter your query and click Search
  • k uses DEFAULT_SEARCH_K (default 5)
  • Filters (level1/2/3, doc_type) inferred from query (OpenAI if enabled; else heuristics)

Manual (optional):

  • Check “Manual controls” to enable k and filters (they default to Auto)
  • Leave Auto to detect; set a value to force

3. Chat

Conversational interface for qualitative testing:

  • Choose pipeline (Base-RAG or Hier-RAG)
  • Adjust retrieval parameters
  • View retrieved sources

4. Evaluation

Run quantitative evaluation:

  • Input queries in JSON format with ground truth
  • Specify k values for evaluation
  • Apply optional filters
  • View metrics: Hit@k, MRR, semantic similarity, latency
  • Export results to CSV/JSON in reports/ directory

Evaluation

Quantitative Evaluation

The system compares Base‑RAG vs Hier‑RAG on Hit@k, MRR, semantic similarity, and latency. Provide JSON with ground_truth to see metrics and the Performance Comparison chart.
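As a minimal sketch of the rank-based metrics (not the code in core/eval.py; it uses exact string matching for relevance, whereas the real system also reports semantic similarity):

```python
def hit_at_k(retrieved, ground_truth, k):
    """1.0 if any ground-truth passage appears in the top-k results, else 0.0."""
    return float(any(doc in ground_truth for doc in retrieved[:k]))

def mrr(retrieved, ground_truth):
    """Reciprocal rank of the first relevant result (0.0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in ground_truth:
            return 1.0 / rank
    return 0.0
```

Averaging these per-query scores over the evaluation set, separately for each pipeline and each k, yields the numbers shown in the Performance Comparison chart.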

Evaluation Input Format

[
  {
    "query": "What are emergency procedures?",
    "ground_truth": ["Emergency protocols for triage", "Patient assessment guidelines"],
    "k_values": [1, 3, 5],
    "level1": "Clinical",
    "level2": "Emergency",
    "level3": "Triage",
    "doc_type": "Report"
  }
]

Evaluation Results

Results are saved to reports/ directory:

  • CSV file with detailed metrics per query
  • JSON file with full evaluation data
  • Summary statistics by pipeline and k value

Hierarchy Structure

Each hierarchy defines 3 levels:

  • Level1 (Domain): Top-level categorization (e.g., Clinical, Administrative)
  • Level2 (Section): Sub-domain within Level1 (e.g., Emergency, Inpatient)
  • Level3 (Topic): Specific topic within Level2 (e.g., Triage, Trauma)

Hierarchy files are located in hierarchies/ directory and follow YAML format.
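A hierarchy file might be shaped like this (a hypothetical sketch, not the shipped hospital.yaml; the files in hierarchies/ are authoritative):

```yaml
# hospital.yaml (illustrative sketch)
name: hospital
levels:
  - name: Clinical          # Level1 (Domain)
    sections:
      - name: Emergency     # Level2 (Section)
        topics: [Triage, Trauma]    # Level3 (Topic)
      - name: Inpatient
        topics: [Admission, Discharge]
  - name: Administrative
    sections: []
```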

Metadata Schema

Chunks are tagged with the following metadata:

{
  "doc_id": "uuid",
  "chunk_id": "uuid",
  "source_name": "filename.pdf",
  "lang": "ja|en",
  "level1": "domain",
  "level2": "section",
  "level3": "topic",
  "doc_type": "policy|manual|faq",
  "chunk_size": 1000,
  "token_count": 250
}

Testing

Run tests with pytest:

pytest tests/ -v

Test coverage includes:

  • Document loading and chunking
  • Hierarchy classification
  • Metadata filtering
  • Retrieval pipelines (Base-RAG and Hier-RAG)
  • Evaluation metrics calculation
  • Vector store operations
  • API behaviors

Architecture

Retrieval Pipelines

Base-RAG:

  1. Vector similarity search
  2. Return top-k results
  3. Format and return

Hier-RAG:

  1. Pre-filter by hierarchical tags (level1/2/3, doc_type)
  2. Vector search within filtered subset
  3. Return top-k results
  4. Format with hierarchy context
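The pre-filtering step can be sketched as a builder for a ChromaDB-style where clause (a hypothetical helper; core/retrieval.py defines the actual API):

```python
def build_where(level1=None, level2=None, level3=None, doc_type=None):
    """Combine the provided hierarchy tags into a ChromaDB `where` clause."""
    clauses = [
        {field: value}
        for field, value in [
            ("level1", level1), ("level2", level2),
            ("level3", level3), ("doc_type", doc_type),
        ]
        if value is not None
    ]
    if not clauses:
        return None          # no pre-filter: behaves like Base-RAG
    if len(clauses) == 1:
        return clauses[0]    # Chroma expects a bare clause for one condition
    return {"$and": clauses}

# Restrict retrieval to Clinical / Emergency reports before vector search:
where = build_where(level1="Clinical", level2="Emergency", doc_type="Report")
```

The resulting clause would be passed to the collection query (e.g., Chroma's `collection.query(..., where=where)`), so the vector search only considers the matching subset.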

Vector Database & Embeddings

  • ChromaDB with persistence
  • Embeddings provider:
    • OpenAI (if OPENAI_API_KEY present): OPENAI_EMBED_MODEL (default text-embedding-3-small)
    • SentenceTransformers fallback: ST_EMBED_MODEL (default all-MiniLM-L6-v2)
  • Collections namespaced by provider/dimension to avoid mismatch
  • Metadata filtering supported for level1/2/3/doc_type

Acknowledgments

Built for comparing hierarchical vs standard RAG retrieval approaches.