---
title: RAG Evaluation System
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# RAG Evaluation System

A comprehensive system for evaluating **Hierarchical RAG** vs **Standard RAG** pipelines, with support for multiple document types and metadata hierarchies.

## Features

- **Dual RAG Pipelines**: Compare Base-RAG vs Hierarchical RAG side-by-side
- **Multiple Hierarchies**: Hospital, Banking, and Fluid Simulation domains
- **Comprehensive Evaluation**: Quantitative metrics (Hit@k, MRR, latency, semantic similarity) and qualitative analysis
- **Gradio UI**: User-friendly interface for all operations
- **MCP Server**: Model Context Protocol server for programmatic access
- **API Export**: All main functions exposed via the Gradio Client API

## Repository Layout

```
.
├── app.py                   # Spaces entry; defines UI and exposed functions (with api_name)
├── core/                    # Internal logic (NOT publicly exposed)
│   ├── ingest.py            # Loaders, hierarchical classification, chunking
│   ├── index.py             # Embeddings, vector DB, metadata filters
│   ├── retrieval.py         # Base-RAG / Hier-RAG pipelines
│   ├── eval.py              # Metrics: Hit@k, MRR, latency, similarity
│   └── utils.py             # Shared helpers (e.g., PII masking)
├── hierarchies/             # Hierarchy definitions (YAML)
│   ├── hospital.yaml
│   ├── bank.yaml
│   └── fluid_simulation.yaml
├── tests/                   # pytest cases
│   ├── test_ingest.py
│   ├── test_retrieval.py
│   ├── test_eval.py
│   └── test_index.py
├── reports/                 # Evaluation results (CSV/JSON)
├── requirements.txt         # Dependencies
└── README.md                # This file
```

## Setup

### Prerequisites

- Python 3.8+
- pip or conda

### Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd rag-evaluation-system
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Create the necessary directories:

   ```bash
   mkdir -p reports chroma_data
   ```

### Environment Variables

Create a `.env` file (see `.env.example`) to configure the app:

- `OPENAI_API_KEY` (optional): enables OpenAI for embeddings and detection
- `OPENAI_MODEL` (optional): chat model for detection (default `gpt-4o-mini`)
- `OPENAI_EMBED_MODEL` (optional): embedding model (default `text-embedding-3-small`)
- `USE_OPENAI_EMBEDDINGS` (optional): `true|false` to force provider selection
- `ST_EMBED_MODEL` (optional): fallback SentenceTransformers model (default `all-MiniLM-L6-v2`)
- `CHROMA_PERSIST_DIR` (optional): Chroma persistence directory (default `./chroma_data`)
- `DEFAULT_SEARCH_K` (optional): default k in Search auto mode (default `5`)
- `GRADIO_SERVER_PORT` (optional): server port (default `7860`)
- `LOG_LEVEL` (optional): `DEBUG|INFO|...` (default `INFO`)

Note: collections are namespaced by embeddings provider/dimension (e.g., `documents__oai_1536`, `documents__st_384`). Re-upload documents after switching providers/models.
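For example, a minimal `.env` that forces the local SentenceTransformers provider (no OpenAI key set; the values shown are the documented defaults):

```bash
# Force the local embeddings provider instead of OpenAI
USE_OPENAI_EMBEDDINGS=false
ST_EMBED_MODEL=all-MiniLM-L6-v2
CHROMA_PERSIST_DIR=./chroma_data
DEFAULT_SEARCH_K=5
GRADIO_SERVER_PORT=7860
LOG_LEVEL=INFO
```

With this configuration the active collection namespace would be `documents__st_384`, since `all-MiniLM-L6-v2` produces 384-dimensional embeddings.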
## Usage

### Using the Gradio API (gradio_client)

All main functions are exposed via the Gradio API with `api_name`:

- `build_rag`: Build a RAG index from uploaded files
- `search`: Search documents using both pipelines
- `chat`: Chat interface with the RAG system
- `evaluate`: Run a quantitative evaluation

Example usage:

```python
from gradio_client import Client

client = Client("http://your-server:7860/")

# Build RAG index
result = client.predict(
    files=["doc1.pdf", "doc2.pdf"],
    hierarchy="hospital",
    doc_type="Report",
    language="en",
    api_name="/build_rag",
)

# Search documents
results = client.predict(
    query="What are emergency procedures?",
    k=5,
    level1="Clinical",
    level2="Emergency",
    level3=None,
    doc_type="Report",
    api_name="/search",
)
```

### MCP Server

The system can run as an MCP (Model Context Protocol) server for programmatic access:

```bash
python app.py --mcp
```

#### Connecting to the MCP Server

Add to your MCP client configuration (e.g., for Claude Desktop):

```json
{
  "mcpServers": {
    "rag-evaluation": {
      "command": "python",
      "args": ["/path/to/app.py", "--mcp"],
      "env": {}
    }
  }
}
```

#### Available MCP Tools

1. **search_documents**: Search documents using the RAG system
   - Parameters: `query`, `k`, `pipeline`, `level1`, `level2`, `level3`, `doc_type`
2. **evaluate_retrieval**: Evaluate RAG performance with batch queries
   - Parameters: `queries` (array), `output_file`

## UI Tabs

### 1. Upload Documents

- Upload multiple PDF/TXT files
- Set Hierarchy/Doc Type/Language to `Auto` for per-chunk detection (OpenAI preferred; heuristic fallback)
- Paragraph-first chunking with merging of consecutive similar paragraphs (same hierarchy + level1 + level2). Explicit labels (Domain/Section/Topic) "stick" across following paragraphs until overridden
- After the build:
  - Build Status (processed count, indexed chunks)
  - File Summary (Filename, Chunks, Language, Doc Type, Hierarchy)
  - Indexed Chunks (preview with Level1/2/3 and the first 160 chars)

### 2. Search

Default (auto):

- Enter your query and click Search
- k uses `DEFAULT_SEARCH_K` (default 5)
- Filters (level1/2/3, doc_type) are inferred from the query (OpenAI if enabled; otherwise heuristics)

Manual (optional):

- Check "Manual controls" to enable k and the filters (they default to `Auto`)
- Leave `Auto` to detect; set a value to force it

### 3. Chat

Conversational interface for qualitative testing:

- Choose a pipeline (Base-RAG or Hier-RAG)
- Adjust retrieval parameters
- View retrieved sources

### 4. Evaluation

Run a quantitative evaluation:

- Input queries in JSON format with ground truth
- Specify k values for the evaluation
- Apply optional filters
- View metrics: Hit@k, MRR, semantic similarity, latency
- Export results to CSV/JSON in the `reports/` directory

## Evaluation

### Quantitative Evaluation

The system compares Base-RAG vs Hier-RAG on Hit@k, MRR, semantic similarity, and latency. Provide JSON with `ground_truth` to see the metrics and the Performance Comparison chart.

### Evaluation Input Format

```json
[
  {
    "query": "What are emergency procedures?",
    "ground_truth": ["Emergency protocols for triage", "Patient assessment guidelines"],
    "k_values": [1, 3, 5],
    "level1": "Clinical",
    "level2": "Emergency",
    "level3": "Triage",
    "doc_type": "Report"
  }
]
```

### Evaluation Results

Results are saved to the `reports/` directory:

- A CSV file with detailed metrics per query
- A JSON file with the full evaluation data
- Summary statistics by pipeline and k value

## Hierarchy Structure

Each hierarchy defines 3 levels:

- **Level1 (Domain)**: Top-level categorization (e.g., Clinical, Administrative)
- **Level2 (Section)**: Sub-domain within Level1 (e.g., Emergency, Inpatient)
- **Level3 (Topic)**: Specific topic within Level2 (e.g., Triage, Trauma)

Hierarchy files are located in the `hierarchies/` directory and follow YAML format.
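The Hit@k and MRR metrics reported above can be sketched with a minimal reference implementation (the function names and list/set argument shapes here are illustrative, not the actual `core/eval.py` API):

```python
def hit_at_k(retrieved, relevant, k):
    """1.0 if any relevant item appears in the top-k retrieved results, else 0.0."""
    return 1.0 if any(item in relevant for item in retrieved[:k]) else 0.0


def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant item (0.0 if none is retrieved)."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["chunk_a", "chunk_b", "chunk_c"]  # ranked retrieval output
relevant = {"chunk_b"}                         # ground-truth set for the query

print(hit_at_k(retrieved, relevant, 1))  # 0.0 -- first hit is at rank 2
print(hit_at_k(retrieved, relevant, 3))  # 1.0
print(mrr(retrieved, relevant))          # 0.5
```

Per-query scores like these are then averaged across the query set to produce the summary statistics per pipeline and k value.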
## Metadata Schema

Chunks are tagged with the following metadata:

```json
{
  "doc_id": "uuid",
  "chunk_id": "uuid",
  "source_name": "filename.pdf",
  "lang": "ja|en",
  "level1": "domain",
  "level2": "section",
  "level3": "topic",
  "doc_type": "policy|manual|faq",
  "chunk_size": 1000,
  "token_count": 250
}
```

## Testing

Run the tests with pytest:

```bash
pytest tests/ -v
```

Test coverage includes:

- Document loading and chunking
- Hierarchy classification
- Metadata filtering
- Retrieval pipelines (Base-RAG and Hier-RAG)
- Evaluation metrics calculation
- Vector store operations
- API behaviors

## Architecture

### Retrieval Pipelines

**Base-RAG:**

1. Vector similarity search
2. Return top-k results
3. Format and return

**Hier-RAG:**

1. Pre-filter by hierarchical tags (level1/2/3, doc_type)
2. Vector search within the filtered subset
3. Return top-k results
4. Format with hierarchy context

## Vector Database & Embeddings

- ChromaDB with persistence
- Embeddings provider:
  - OpenAI (if `OPENAI_API_KEY` is present): `OPENAI_EMBED_MODEL` (default `text-embedding-3-small`)
  - SentenceTransformers fallback: `ST_EMBED_MODEL` (default `all-MiniLM-L6-v2`)
- Collections namespaced by provider/dimension to avoid mismatch
- Metadata filtering supported for level1/2/3/doc_type

## Acknowledgments

Built for comparing hierarchical vs standard RAG retrieval approaches.
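The key difference between the two pipelines, pre-filtering on hierarchical metadata before similarity ranking, can be sketched in plain Python. The chunk dict shape and function name below are illustrative only; the real pipeline lives in `core/retrieval.py` and pushes the filters down into ChromaDB:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def hier_rag_search(chunks, query_vec, k, **filters):
    """Hier-RAG steps 1-3: keep only chunks matching every non-None metadata
    filter, then rank the survivors by similarity and return the top k.
    With no filters, this degenerates to Base-RAG."""
    pool = [c for c in chunks
            if all(c["meta"].get(key) == val
                   for key, val in filters.items() if val is not None)]
    pool.sort(key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return pool[:k]


chunks = [
    {"vec": [1.0, 0.0], "meta": {"level1": "Clinical", "level2": "Emergency"}},
    {"vec": [0.9, 0.1], "meta": {"level1": "Administrative", "level2": "Billing"}},
    {"vec": [0.0, 1.0], "meta": {"level1": "Clinical", "level2": "Inpatient"}},
]

top = hier_rag_search(chunks, [1.0, 0.0], k=2, level1="Clinical")
print([c["meta"]["level2"] for c in top])  # ['Emergency', 'Inpatient']
```

Note how the `level1="Clinical"` filter removes the administrative chunk before ranking, even though it scores higher on raw vector similarity than the Inpatient chunk.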