---
title: RAG Evaluation System
emoji: πŸ“š
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# RAG Evaluation System

A comprehensive system for evaluating **Hierarchical RAG** vs **Standard RAG** pipelines with support for multiple document types and metadata hierarchies.

## Features

- **Dual RAG Pipelines**: Compare Base-RAG vs Hierarchical RAG side-by-side
- **Multiple Hierarchies**: Hospital, Banking, and Fluid Simulation domains
- **Comprehensive Evaluation**: Quantitative metrics (Hit@k, MRR, latency, semantic similarity) and qualitative analysis
- **Gradio UI**: User-friendly interface for all operations
- **MCP Server**: Model Context Protocol server for programmatic access
- **API Export**: All main functions exposed via Gradio Client API

## Repository Layout

```
.
β”œβ”€β”€ app.py                    # Spaces entry; defines UI and exposed functions (with api_name)
β”œβ”€β”€ core/                     # Internal logic (NOT publicly exposed)
β”‚   β”œβ”€β”€ ingest.py             # Loaders, hierarchical classification, chunking
β”‚   β”œβ”€β”€ index.py              # Embeddings, vector DB, metadata filters
β”‚   β”œβ”€β”€ retrieval.py          # Base-RAG / Hier-RAG pipelines
β”‚   β”œβ”€β”€ eval.py               # Metrics: Hit@k, MRR, latency, similarity
β”‚   └── utils.py              # Shared helpers (e.g., PII masking)
β”œβ”€β”€ hierarchies/              # Hierarchy definitions (YAML)
β”‚   β”œβ”€β”€ hospital.yaml
β”‚   β”œβ”€β”€ bank.yaml
β”‚   └── fluid_simulation.yaml
β”œβ”€β”€ tests/                    # pytest cases
β”‚   β”œβ”€β”€ test_ingest.py
β”‚   β”œβ”€β”€ test_retrieval.py
β”‚   β”œβ”€β”€ test_eval.py
β”‚   └── test_index.py
β”œβ”€β”€ reports/                  # Evaluation results (CSV/JSON)
β”œβ”€β”€ requirements.txt          # Dependencies
└── README.md                 # This file
```

## Setup

### Prerequisites

- Python 3.8+
- pip or conda

### Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd rag-evaluation-system
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Create necessary directories:
```bash
mkdir -p reports chroma_data
```

### Environment Variables

Create a `.env` file (see `.env.example`) to configure the app:

- `OPENAI_API_KEY` (optional): enables OpenAI for embeddings and detection
- `OPENAI_MODEL` (optional): chat model for detection (default `gpt-4o-mini`)
- `OPENAI_EMBED_MODEL` (optional): embedding model (default `text-embedding-3-small`)
- `USE_OPENAI_EMBEDDINGS` (optional): `true|false` to force provider selection
- `ST_EMBED_MODEL` (optional): fallback SentenceTransformers model (default `all-MiniLM-L6-v2`)
- `CHROMA_PERSIST_DIR` (optional): Chroma dir (default `./chroma_data`)
- `DEFAULT_SEARCH_K` (optional): default k in Search auto mode (default `5`)
- `GRADIO_SERVER_PORT` (optional): server port (default `7860`)
- `LOG_LEVEL` (optional): `DEBUG|INFO|...` (default `INFO`)

Note: collections are namespaced by embeddings provider/dimension (e.g., `documents__oai_1536`, `documents__st_384`). Re‑upload after switching providers/models.

## Usage

### Using the Gradio API (gradio_client)

All main functions are exposed via the Gradio API with `api_name`:

- `build_rag`: Build RAG index from uploaded files
- `search`: Search documents using both pipelines
- `chat`: Chat interface with RAG system
- `evaluate`: Run quantitative evaluation

Example usage:

```python
from gradio_client import Client, handle_file

client = Client("http://your-server:7860/")

# Build RAG index (handle_file wraps local paths for upload,
# as required by recent gradio_client versions)
result = client.predict(
    files=[handle_file("doc1.pdf"), handle_file("doc2.pdf")],
    hierarchy="hospital",
    doc_type="Report",
    language="en",
    api_name="/build_rag"
)

# Search documents
results = client.predict(
    query="What are emergency procedures?",
    k=5,
    level1="Clinical",
    level2="Emergency",
    level3=None,
    doc_type="Report",
    api_name="/search"
)
```

### MCP Server

The system can run as an MCP (Model Context Protocol) server for programmatic access:

```bash
python app.py --mcp
```

#### Connecting to MCP Server

Add to your MCP client configuration (e.g., for Claude Desktop):

```json
{
  "mcpServers": {
    "rag-evaluation": {
      "command": "python",
      "args": ["/path/to/app.py", "--mcp"],
      "env": {}
    }
  }
}
```

#### Available MCP Tools

1. **search_documents**: Search documents using RAG system
   - Parameters: `query`, `k`, `pipeline`, `level1`, `level2`, `level3`, `doc_type`

2. **evaluate_retrieval**: Evaluate RAG performance with batch queries
   - Parameters: `queries` (array), `output_file`

## UI Tabs

### 1. Upload Documents

- Upload multiple PDF/TXT files
- Set Hierarchy/Doc Type/Language to `Auto` for per‑chunk detection (OpenAI preferred; heuristic fallback)
- Paragraph‑first chunking merges consecutive similar paragraphs (same hierarchy + level1 + level2); explicit labels (Domain/Section/Topic) β€œstick” across following paragraphs until overridden
- After build:
  - Build Status (processed count, indexed chunks)
  - File Summary (Filename, Chunks, Language, Doc Type, Hierarchy)
  - Indexed Chunks (preview with Level1/2/3 and first 160 chars)
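
The merge rule above can be sketched roughly as follows. This is a simplified stand-in for the logic in `core/ingest.py`; the `Para` class and its field names are illustrative, not the actual implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Para:
    text: str
    hierarchy: str   # e.g. "hospital"
    level1: str      # Domain
    level2: str      # Section

def merge_paragraphs(paras: List[Para]) -> List[Para]:
    """Merge consecutive paragraphs that share hierarchy + level1 + level2."""
    merged: List[Para] = []
    for p in paras:
        if merged and (merged[-1].hierarchy, merged[-1].level1, merged[-1].level2) == (
            p.hierarchy, p.level1, p.level2
        ):
            merged[-1].text += "\n\n" + p.text  # extend the current chunk
        else:
            merged.append(Para(p.text, p.hierarchy, p.level1, p.level2))
    return merged
```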

### 2. Search

Default (auto):
- Enter your query and click Search
- k uses `DEFAULT_SEARCH_K` (default 5)
- Filters (level1/2/3, doc_type) inferred from query (OpenAI if enabled; else heuristics)

Manual (optional):
- Check β€œManual controls” to enable k and filters (they default to `Auto`)
- Leave `Auto` to detect; set a value to force

### 3. Chat

Conversational interface for qualitative testing:
- Choose pipeline (Base-RAG or Hier-RAG)
- Adjust retrieval parameters
- View retrieved sources

### 4. Evaluation

Run quantitative evaluation:
- Input queries in JSON format with ground truth
- Specify k values for evaluation
- Apply optional filters
- View metrics: Hit@k, MRR, semantic similarity, latency
- Export results to CSV/JSON in `reports/` directory

## Evaluation

### Quantitative Evaluation

The system compares Base‑RAG vs Hier‑RAG on Hit@k, MRR, semantic similarity, and latency. Provide JSON with `ground_truth` to see metrics and the Performance Comparison chart.

### Evaluation Input Format

```json
[
  {
    "query": "What are emergency procedures?",
    "ground_truth": ["Emergency protocols for triage", "Patient assessment guidelines"],
    "k_values": [1, 3, 5],
    "level1": "Clinical",
    "level2": "Emergency",
    "level3": "Triage",
    "doc_type": "Report"
  }
]
```
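
For reference, Hit@k and MRR over a ranked list of retrieved chunks can be computed as in the following sketch. This is a generic implementation: relevance here is judged by exact membership in `ground_truth`, whereas `core/eval.py` may match semantically:

```python
from typing import List

def hit_at_k(retrieved: List[str], ground_truth: List[str], k: int) -> float:
    """1.0 if any of the top-k retrieved items is relevant, else 0.0."""
    return 1.0 if any(r in ground_truth for r in retrieved[:k]) else 0.0

def mrr(retrieved: List[str], ground_truth: List[str]) -> float:
    """Reciprocal rank of the first relevant item (0.0 if none is found)."""
    for rank, r in enumerate(retrieved, start=1):
        if r in ground_truth:
            return 1.0 / rank
    return 0.0
```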

### Evaluation Results

Results are saved to the `reports/` directory:
- CSV file with detailed metrics per query
- JSON file with full evaluation data
- Summary statistics by pipeline and k value

## Hierarchy Structure

Each hierarchy defines 3 levels:

- **Level1 (Domain)**: Top-level categorization (e.g., Clinical, Administrative)
- **Level2 (Section)**: Sub-domain within Level1 (e.g., Emergency, Inpatient)
- **Level3 (Topic)**: Specific topic within Level2 (e.g., Triage, Trauma)

Hierarchy files are located in the `hierarchies/` directory and follow YAML format.
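
As an illustration, `hierarchies/hospital.yaml` might take a shape like this; the exact schema is defined by the files in `hierarchies/`, and the field names here are illustrative:

```yaml
levels:
  - name: Clinical               # Level1 (Domain)
    sections:
      - name: Emergency          # Level2 (Section)
        topics: [Triage, Trauma] # Level3 (Topic)
      - name: Inpatient
        topics: [Admission, Discharge]
  - name: Administrative
    sections:
      - name: HR
        topics: [Payroll, Hiring]
```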

## Metadata Schema

Chunks are tagged with the following metadata:

```json
{
  "doc_id": "uuid",
  "chunk_id": "uuid",
  "source_name": "filename.pdf",
  "lang": "ja|en",
  "level1": "domain",
  "level2": "section",
  "level3": "topic",
  "doc_type": "policy|manual|faq",
  "chunk_size": 1000,
  "token_count": 250
}
```

## Testing

Run tests with pytest:

```bash
pytest tests/ -v
```

Test coverage includes:
- Document loading and chunking
- Hierarchy classification
- Metadata filtering
- Retrieval pipelines (Base-RAG and Hier-RAG)
- Evaluation metrics calculation
- Vector store operations
- API behaviors

## Architecture

### Retrieval Pipelines

**Base-RAG:**
1. Vector similarity search
2. Return top-k results
3. Format and return

**Hier-RAG:**
1. Pre-filter by hierarchical tags (level1/2/3, doc_type)
2. Vector search within filtered subset
3. Return top-k results
4. Format with hierarchy context
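
Step 1 of Hier-RAG amounts to building a metadata filter from whichever tags are set. A sketch using ChromaDB's `where` syntax (the helper name is hypothetical; the actual filter construction lives in `core/index.py`):

```python
from typing import Optional

def build_where(level1=None, level2=None, level3=None, doc_type=None) -> Optional[dict]:
    """Build a ChromaDB-style `where` filter from the hierarchy tags that are set."""
    clauses = [
        {field: {"$eq": value}}
        for field, value in [
            ("level1", level1), ("level2", level2),
            ("level3", level3), ("doc_type", doc_type),
        ]
        if value is not None
    ]
    if not clauses:
        return None        # no pre-filter: behaves like Base-RAG
    if len(clauses) == 1:
        return clauses[0]  # Chroma rejects $and with a single clause
    return {"$and": clauses}

# e.g. collection.query(query_texts=[q], n_results=k, where=build_where(level1="Clinical"))
```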

## Vector Database & Embeddings

- ChromaDB with persistence
- Embeddings provider:
  - OpenAI (if `OPENAI_API_KEY` present): `OPENAI_EMBED_MODEL` (default `text-embedding-3-small`)
  - SentenceTransformers fallback: `ST_EMBED_MODEL` (default `all-MiniLM-L6-v2`)
- Collections namespaced by provider/dimension to avoid mismatch
- Metadata filtering supported for level1/2/3/doc_type


## Acknowledgments

Built for comparing hierarchical vs standard RAG retrieval approaches.