---
title: RAG Evaluation System
emoji: πŸ“š
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# RAG Evaluation System
A comprehensive system for evaluating **Hierarchical RAG** vs **Standard RAG** pipelines with support for multiple document types and metadata hierarchies.
## Features
- **Dual RAG Pipelines**: Compare Base-RAG vs Hierarchical RAG side-by-side
- **Multiple Hierarchies**: Hospital, Banking, and Fluid Simulation domains
- **Comprehensive Evaluation**: Quantitative metrics (Hit@k, MRR, latency, semantic similarity) and qualitative analysis
- **Gradio UI**: User-friendly interface for all operations
- **MCP Server**: Model Context Protocol server for programmatic access
- **API Export**: All main functions exposed via Gradio Client API
## Repository Layout
```
.
β”œβ”€β”€ app.py               # Spaces entry; defines UI and exposed functions (with api_name)
β”œβ”€β”€ core/                # Internal logic (NOT publicly exposed)
β”‚   β”œβ”€β”€ ingest.py        # Loaders, hierarchical classification, chunking
β”‚   β”œβ”€β”€ index.py         # Embeddings, vector DB, metadata filters
β”‚   β”œβ”€β”€ retrieval.py     # Base-RAG / Hier-RAG pipelines
β”‚   β”œβ”€β”€ eval.py          # Metrics: Hit@k, MRR, latency, similarity
β”‚   └── utils.py         # Shared helpers (e.g., PII masking)
β”œβ”€β”€ hierarchies/         # Hierarchy definitions (YAML)
β”‚   β”œβ”€β”€ hospital.yaml
β”‚   β”œβ”€β”€ bank.yaml
β”‚   └── fluid_simulation.yaml
β”œβ”€β”€ tests/               # pytest cases
β”‚   β”œβ”€β”€ test_ingest.py
β”‚   β”œβ”€β”€ test_retrieval.py
β”‚   β”œβ”€β”€ test_eval.py
β”‚   └── test_index.py
β”œβ”€β”€ reports/             # Evaluation results (CSV/JSON)
β”œβ”€β”€ requirements.txt     # Dependencies
└── README.md            # This file
```
## Setup
### Prerequisites
- Python 3.8+
- pip or conda
### Installation
1. Clone the repository:
```bash
git clone <repository-url>
cd rag-evaluation-system
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Create necessary directories:
```bash
mkdir -p reports chroma_data
```
### Environment Variables
Create a `.env` file (see `.env.example`) to configure the app:
- `OPENAI_API_KEY` (optional): enables OpenAI for embeddings and detection
- `OPENAI_MODEL` (optional): chat model for detection (default `gpt-4o-mini`)
- `OPENAI_EMBED_MODEL` (optional): embedding model (default `text-embedding-3-small`)
- `USE_OPENAI_EMBEDDINGS` (optional): `true|false` to force provider selection
- `ST_EMBED_MODEL` (optional): fallback SentenceTransformers model (default `all-MiniLM-L6-v2`)
- `CHROMA_PERSIST_DIR` (optional): Chroma dir (default `./chroma_data`)
- `DEFAULT_SEARCH_K` (optional): default k in Search auto mode (default `5`)
- `GRADIO_SERVER_PORT` (optional): server port (default `7860`)
- `LOG_LEVEL` (optional): `DEBUG|INFO|...` (default `INFO`)
Note: collections are namespaced by embeddings provider/dimension (e.g., `documents__oai_1536`, `documents__st_384`). Re‑upload after switching providers/models.
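The namespacing scheme can be sketched as a small helper. The function name and suffix convention below are illustrative assumptions inferred from the example collection names, not the actual implementation in `core/index.py`:

```python
# Illustrative sketch of provider/dimension namespacing for Chroma collections.
# The helper name and suffix mapping are assumptions inferred from the example
# names above (documents__oai_1536, documents__st_384).

def collection_name(base: str, provider: str, dim: int) -> str:
    """Namespace a collection by embedding provider and vector dimension."""
    suffix = {"openai": "oai", "sentence-transformers": "st"}[provider]
    return f"{base}__{suffix}_{dim}"

print(collection_name("documents", "openai", 1536))                # documents__oai_1536
print(collection_name("documents", "sentence-transformers", 384))  # documents__st_384
```

Because the collection name encodes the vector dimension, vectors from one provider can never be queried against an index built with another.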
## Usage
### Using the Gradio API (gradio_client)
All main functions are exposed via the Gradio API with `api_name`:
- `build_rag`: Build RAG index from uploaded files
- `search`: Search documents using both pipelines
- `chat`: Chat interface with RAG system
- `evaluate`: Run quantitative evaluation
Example usage:
```python
from gradio_client import Client
client = Client("http://your-server:7860/")
# Build RAG index
result = client.predict(
    files=["doc1.pdf", "doc2.pdf"],
    hierarchy="hospital",
    doc_type="Report",
    language="en",
    api_name="/build_rag",
)

# Search documents
results = client.predict(
    query="What are emergency procedures?",
    k=5,
    level1="Clinical",
    level2="Emergency",
    level3=None,
    doc_type="Report",
    api_name="/search",
)
```
### MCP Server
The system can run as an MCP (Model Context Protocol) server for programmatic access:
```bash
python app.py --mcp
```
#### Connecting to MCP Server
Add to your MCP client configuration (e.g., for Claude Desktop):
```json
{
  "mcpServers": {
    "rag-evaluation": {
      "command": "python",
      "args": ["/path/to/app.py", "--mcp"],
      "env": {}
    }
  }
}
```
#### Available MCP Tools
1. **search_documents**: Search documents using RAG system
- Parameters: `query`, `k`, `pipeline`, `level1`, `level2`, `level3`, `doc_type`
2. **evaluate_retrieval**: Evaluate RAG performance with batch queries
- Parameters: `queries` (array), `output_file`
## UI Tabs
### 1. Upload Documents
- Upload multiple PDF/TXT files
- Set Hierarchy/Doc Type/Language to `Auto` for per‑chunk detection (OpenAI preferred; heuristic fallback)
- Paragraph‑first chunking merges consecutive similar paragraphs (same hierarchy + level1 + level2); explicit labels (Domain/Section/Topic) persist across subsequent paragraphs until overridden
- After build:
- Build Status (processed count, indexed chunks)
- File Summary (Filename, Chunks, Language, Doc Type, Hierarchy)
- Indexed Chunks (preview with Level1/2/3 and first 160 chars)
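The paragraph-first merge rule above can be sketched as follows. The `Paragraph` shape, field names, and `merge_paragraphs` helper are hypothetical illustrations of the rule, not the actual code in `core/ingest.py`:

```python
# Hypothetical sketch of paragraph-first chunking: consecutive paragraphs
# sharing (hierarchy, level1, level2) are merged into one chunk. The data
# shapes and names are illustrative only.

from dataclasses import dataclass

@dataclass
class Paragraph:
    text: str
    hierarchy: str
    level1: str
    level2: str

def merge_paragraphs(paragraphs: list[Paragraph]) -> list[dict]:
    chunks: list[dict] = []
    for p in paragraphs:
        key = (p.hierarchy, p.level1, p.level2)
        if chunks and chunks[-1]["key"] == key:
            chunks[-1]["text"] += "\n" + p.text  # same labels: merge into previous chunk
        else:
            chunks.append({"key": key, "text": p.text})  # labels changed: start new chunk
    return chunks

paras = [
    Paragraph("Triage steps...", "hospital", "Clinical", "Emergency"),
    Paragraph("More triage detail...", "hospital", "Clinical", "Emergency"),
    Paragraph("Billing codes...", "hospital", "Administrative", "Billing"),
]
print(len(merge_paragraphs(paras)))  # 2
```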
### 2. Search
Default (auto):
- Enter your query and click Search
- k uses `DEFAULT_SEARCH_K` (default 5)
- Filters (level1/2/3, doc_type) inferred from query (OpenAI if enabled; else heuristics)
Manual (optional):
- Check β€œManual controls” to enable k and filters (they default to `Auto`)
- Leave `Auto` to detect; set a value to force
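The heuristic fallback for filter inference can be pictured as a keyword lookup. The keyword map and function below are a made-up example for the hospital hierarchy, standing in for the real detection logic (which prefers OpenAI when enabled):

```python
# Illustrative keyword heuristic for inferring search filters from a query.
# The keyword-to-filter map is a fabricated example; the actual system uses
# OpenAI-backed detection when available.

KEYWORD_FILTERS = {
    "emergency": {"level1": "Clinical", "level2": "Emergency"},
    "triage": {"level1": "Clinical", "level2": "Emergency", "level3": "Triage"},
    "billing": {"level1": "Administrative"},
}

def infer_filters(query: str) -> dict:
    """Collect filters for every keyword that appears in the query."""
    filters: dict = {}
    for keyword, mapping in KEYWORD_FILTERS.items():
        if keyword in query.lower():
            filters.update(mapping)
    return filters

print(infer_filters("What are emergency triage procedures?"))
# {'level1': 'Clinical', 'level2': 'Emergency', 'level3': 'Triage'}
```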
### 3. Chat
Conversational interface for qualitative testing:
- Choose pipeline (Base-RAG or Hier-RAG)
- Adjust retrieval parameters
- View retrieved sources
### 4. Evaluation
Run quantitative evaluation:
- Input queries in JSON format with ground truth
- Specify k values for evaluation
- Apply optional filters
- View metrics: Hit@k, MRR, semantic similarity, latency
- Export results to CSV/JSON in `reports/` directory
## Evaluation
### Quantitative Evaluation
The system compares Base‑RAG vs Hier‑RAG on Hit@k, MRR, semantic similarity, and latency. Provide JSON with `ground_truth` to see metrics and the Performance Comparison chart.
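Hit@k and MRR can be computed as in this minimal sketch, which judges relevance by exact membership in the ground-truth list (the actual system also uses semantic similarity):

```python
# Minimal reference implementations of Hit@k and MRR, assuming binary
# relevance via membership in the ground-truth list.

def hit_at_k(retrieved: list[str], ground_truth: list[str], k: int) -> float:
    """1.0 if any of the top-k retrieved items is relevant, else 0.0."""
    return 1.0 if any(r in ground_truth for r in retrieved[:k]) else 0.0

def mrr(retrieved: list[str], ground_truth: list[str]) -> float:
    """Reciprocal rank of the first relevant retrieved item (0.0 if none)."""
    for rank, r in enumerate(retrieved, start=1):
        if r in ground_truth:
            return 1.0 / rank
    return 0.0

retrieved = ["chunk-a", "chunk-b", "relevant-chunk", "chunk-c"]
truth = ["relevant-chunk"]
print(hit_at_k(retrieved, truth, 1))  # 0.0
print(hit_at_k(retrieved, truth, 3))  # 1.0
print(mrr(retrieved, truth))          # 0.3333333333333333
```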
### Evaluation Input Format
```json
[
  {
    "query": "What are emergency procedures?",
    "ground_truth": ["Emergency protocols for triage", "Patient assessment guidelines"],
    "k_values": [1, 3, 5],
    "level1": "Clinical",
    "level2": "Emergency",
    "level3": "Triage",
    "doc_type": "Report"
  }
]
```
### Evaluation Results
Results are saved to `reports/` directory:
- CSV file with detailed metrics per query
- JSON file with full evaluation data
- Summary statistics by pipeline and k value
## Hierarchy Structure
Each hierarchy defines 3 levels:
- **Level1 (Domain)**: Top-level categorization (e.g., Clinical, Administrative)
- **Level2 (Section)**: Sub-domain within Level1 (e.g., Emergency, Inpatient)
- **Level3 (Topic)**: Specific topic within Level2 (e.g., Triage, Trauma)
Hierarchy files are located in `hierarchies/` directory and follow YAML format.
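A hierarchy file might look like the following excerpt. The exact key layout of the YAML files is an assumption based on the three levels described above; consult `hierarchies/hospital.yaml` for the real schema:

```yaml
# Hypothetical excerpt of hierarchies/hospital.yaml; key layout is assumed.
Clinical:            # Level1 (Domain)
  Emergency:         # Level2 (Section)
    - Triage         # Level3 (Topics)
    - Trauma
  Inpatient:
    - Admissions
Administrative:
  Billing:
    - Claims
```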
## Metadata Schema
Chunks are tagged with the following metadata:
```json
{
  "doc_id": "uuid",
  "chunk_id": "uuid",
  "source_name": "filename.pdf",
  "lang": "ja|en",
  "level1": "domain",
  "level2": "section",
  "level3": "topic",
  "doc_type": "policy|manual|faq",
  "chunk_size": 1000,
  "token_count": 250
}
```
## Testing
Run tests with pytest:
```bash
pytest tests/ -v
```
Test coverage includes:
- Document loading and chunking
- Hierarchy classification
- Metadata filtering
- Retrieval pipelines (Base-RAG and Hier-RAG)
- Evaluation metrics calculation
- Vector store operations
- API behaviors
## Architecture
### Retrieval Pipelines
**Base-RAG:**
1. Vector similarity search
2. Return top-k results
3. Format and return
**Hier-RAG:**
1. Pre-filter by hierarchical tags (level1/2/3, doc_type)
2. Vector search within filtered subset
3. Return top-k results
4. Format with hierarchy context
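The Hier-RAG pre-filtering step amounts to assembling a metadata filter before the vector search. The sketch below builds a Chroma-style `where` clause from the hierarchy tags; the helper is illustrative, not the actual code in `core/retrieval.py`:

```python
# Sketch of the Hier-RAG pre-filter: combine hierarchy tags into a
# Chroma-style `where` clause. Metadata keys match the schema above;
# the helper itself is an illustrative assumption.

def build_where(level1=None, level2=None, level3=None, doc_type=None) -> dict:
    """Combine the provided hierarchy tags into a Chroma $and filter."""
    clauses = [
        {key: value}
        for key, value in {
            "level1": level1,
            "level2": level2,
            "level3": level3,
            "doc_type": doc_type,
        }.items()
        if value is not None
    ]
    if not clauses:
        return {}  # no filter: equivalent to Base-RAG over the full collection
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

print(build_where(level1="Clinical", level2="Emergency"))
# {'$and': [{'level1': 'Clinical'}, {'level2': 'Emergency'}]}
```

With no tags set, the filter is empty and Hier-RAG degenerates to the Base-RAG search path, which is what makes a side-by-side comparison over the same index meaningful.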
## Vector Database & Embeddings
- ChromaDB with persistence
- Embeddings provider:
- OpenAI (if `OPENAI_API_KEY` present): `OPENAI_EMBED_MODEL` (default `text-embedding-3-small`)
- SentenceTransformers fallback: `ST_EMBED_MODEL` (default `all-MiniLM-L6-v2`)
- Collections namespaced by provider/dimension to avoid mismatch
- Metadata filtering supported for level1/2/3/doc_type
## Acknowledgments
Built for comparing hierarchical vs standard RAG retrieval approaches.