---
title: RAG Evaluation System
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# RAG Evaluation System
A comprehensive system for evaluating Hierarchical RAG vs Standard RAG pipelines with support for multiple document types and metadata hierarchies.
## Features
- Dual RAG Pipelines: Compare Base-RAG vs Hierarchical RAG side-by-side
- Multiple Hierarchies: Hospital, Banking, and Fluid Simulation domains
- Comprehensive Evaluation: Quantitative metrics (Hit@k, MRR, latency, semantic similarity) and qualitative analysis
- Gradio UI: User-friendly interface for all operations
- MCP Server: Model Context Protocol server for programmatic access
- API Export: All main functions exposed via Gradio Client API
## Repository Layout
```
.
├── app.py                 # Spaces entry; defines UI and exposed functions (with api_name)
├── core/                  # Internal logic (NOT publicly exposed)
│   ├── ingest.py          # Loaders, hierarchical classification, chunking
│   ├── index.py           # Embeddings, vector DB, metadata filters
│   ├── retrieval.py       # Base-RAG / Hier-RAG pipelines
│   ├── eval.py            # Metrics: Hit@k, MRR, latency, similarity
│   └── utils.py           # Shared helpers (e.g., PII masking)
├── hierarchies/           # Hierarchy definitions (YAML)
│   ├── hospital.yaml
│   ├── bank.yaml
│   └── fluid_simulation.yaml
├── tests/                 # pytest cases
│   ├── test_ingest.py
│   ├── test_retrieval.py
│   ├── test_eval.py
│   └── test_index.py
├── reports/               # Evaluation results (CSV/JSON)
├── requirements.txt       # Dependencies
└── README.md              # This file
```
## Setup

### Prerequisites
- Python 3.8+
- pip or conda
Installation
- Clone the repository:
git clone <repository-url>
cd rag-evaluation-system
- Install dependencies:
pip install -r requirements.txt
- Create necessary directories:
mkdir -p reports chroma_data
### Environment Variables

Create a `.env` file (see `.env.example`) to configure the app:

- `OPENAI_API_KEY` (optional): enables OpenAI for embeddings and detection
- `OPENAI_MODEL` (optional): chat model for detection (default `gpt-4o-mini`)
- `OPENAI_EMBED_MODEL` (optional): embedding model (default `text-embedding-3-small`)
- `USE_OPENAI_EMBEDDINGS` (optional): `true|false` to force provider selection
- `ST_EMBED_MODEL` (optional): fallback SentenceTransformers model (default `all-MiniLM-L6-v2`)
- `CHROMA_PERSIST_DIR` (optional): Chroma persistence directory (default `./chroma_data`)
- `DEFAULT_SEARCH_K` (optional): default k in Search auto mode (default `5`)
- `GRADIO_SERVER_PORT` (optional): server port (default `7860`)
- `LOG_LEVEL` (optional): `DEBUG|INFO|...` (default `INFO`)
Note: collections are namespaced by embeddings provider/dimension (e.g., `documents__oai_1536`, `documents__st_384`). Re-upload documents after switching providers or models.
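The namespacing can be read off the example names above: base name, provider tag, embedding dimension. The helper below is a sketch that mirrors those examples; the actual naming code in `core/index.py` may differ.

```python
# Sketch of provider/dimension collection namespacing, mirroring the
# example names documents__oai_1536 and documents__st_384 above.

def collection_name(base: str, provider: str, dim: int) -> str:
    """Build a Chroma collection name namespaced by provider and dimension."""
    return f"{base}__{provider}_{dim}"

print(collection_name("documents", "oai", 1536))  # documents__oai_1536
print(collection_name("documents", "st", 384))    # documents__st_384
```

Because the dimension is baked into the name, vectors embedded with one model can never be queried against another model's collection, which is why a re-upload is needed after switching providers.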
## Usage

### Using the Gradio API (gradio_client)

All main functions are exposed via the Gradio API with `api_name`:

- `build_rag`: Build RAG index from uploaded files
- `search`: Search documents using both pipelines
- `chat`: Chat interface with RAG system
- `evaluate`: Run quantitative evaluation
Example usage:

```python
from gradio_client import Client

client = Client("http://your-server:7860/")

# Build RAG index
result = client.predict(
    files=["doc1.pdf", "doc2.pdf"],
    hierarchy="hospital",
    doc_type="Report",
    language="en",
    api_name="/build_rag",
)

# Search documents
results = client.predict(
    query="What are emergency procedures?",
    k=5,
    level1="Clinical",
    level2="Emergency",
    level3=None,
    doc_type="Report",
    api_name="/search",
)
```
## MCP Server

The system can run as an MCP (Model Context Protocol) server for programmatic access:

```bash
python app.py --mcp
```
### Connecting to the MCP Server

Add to your MCP client configuration (e.g., for Claude Desktop):

```json
{
  "mcpServers": {
    "rag-evaluation": {
      "command": "python",
      "args": ["/path/to/app.py", "--mcp"],
      "env": {}
    }
  }
}
```
### Available MCP Tools

- `search_documents`: Search documents using the RAG system
  - Parameters: `query`, `k`, `pipeline`, `level1`, `level2`, `level3`, `doc_type`
- `evaluate_retrieval`: Evaluate RAG performance with batch queries
  - Parameters: `queries` (array), `output_file`
## UI Tabs

### 1. Upload Documents

- Upload multiple PDF/TXT files
- Set Hierarchy/Doc Type/Language to `Auto` for per-chunk detection (OpenAI preferred; heuristic fallback)
- Paragraph-first chunking, merging consecutive similar paragraphs (same hierarchy + level1 + level2). Explicit labels (Domain/Section/Topic) "stick" across following paragraphs until overridden
- After build:
  - Build Status (processed count, indexed chunks)
  - File Summary (Filename, Chunks, Language, Doc Type, Hierarchy)
  - Indexed Chunks (preview with Level1/2/3 and first 160 chars)
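The paragraph-merging rule above can be sketched as follows. The chunk fields (`text`, `hierarchy`, `level1`, `level2`) are illustrative and may not match the actual names in `core/ingest.py`.

```python
# Illustrative sketch of paragraph-first chunking with merging of
# consecutive paragraphs that share hierarchy + level1 + level2.

def merge_paragraphs(paragraphs):
    """Merge runs of paragraphs with identical hierarchy/level1/level2 tags."""
    merged = []
    for p in paragraphs:
        key = (p["hierarchy"], p["level1"], p["level2"])
        if merged and key == (merged[-1]["hierarchy"], merged[-1]["level1"], merged[-1]["level2"]):
            merged[-1]["text"] += "\n\n" + p["text"]  # extend the previous chunk
        else:
            merged.append(dict(p))  # start a new chunk (copy, don't mutate input)
    return merged

paras = [
    {"text": "Triage steps.",  "hierarchy": "hospital", "level1": "Clinical",       "level2": "Emergency"},
    {"text": "More triage.",   "hierarchy": "hospital", "level1": "Clinical",       "level2": "Emergency"},
    {"text": "Billing codes.", "hierarchy": "hospital", "level1": "Administrative", "level2": "Billing"},
]
print(len(merge_paragraphs(paras)))  # 2
```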
### 2. Search

Default (auto):

- Enter your query and click Search
- k uses `DEFAULT_SEARCH_K` (default 5)
- Filters (level1/2/3, doc_type) are inferred from the query (OpenAI if enabled; otherwise heuristics)

Manual (optional):

- Check "Manual controls" to enable k and filters (they default to `Auto`)
- Leave `Auto` to detect; set a value to force
### 3. Chat
Conversational interface for qualitative testing:
- Choose pipeline (Base-RAG or Hier-RAG)
- Adjust retrieval parameters
- View retrieved sources
### 4. Evaluation

Run quantitative evaluation:

- Input queries in JSON format with ground truth
- Specify k values for evaluation
- Apply optional filters
- View metrics: Hit@k, MRR, semantic similarity, latency
- Export results to CSV/JSON in the `reports/` directory
## Evaluation

### Quantitative Evaluation

The system compares Base-RAG vs Hier-RAG on Hit@k, MRR, semantic similarity, and latency. Provide JSON with `ground_truth` to see metrics and the Performance Comparison chart.
### Evaluation Input Format

```json
[
  {
    "query": "What are emergency procedures?",
    "ground_truth": ["Emergency protocols for triage", "Patient assessment guidelines"],
    "k_values": [1, 3, 5],
    "level1": "Clinical",
    "level2": "Emergency",
    "level3": "Triage",
    "doc_type": "Report"
  }
]
```
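For programmatic runs, the same payload can be built in Python and handed to the exposed `/evaluate` endpoint. The endpoint parameter name used in the commented call is an assumption; check your Space's API page for the exact signature.

```python
import json

# Build the evaluation payload shown above programmatically.
queries = [
    {
        "query": "What are emergency procedures?",
        "ground_truth": ["Emergency protocols for triage", "Patient assessment guidelines"],
        "k_values": [1, 3, 5],
        "level1": "Clinical",
        "level2": "Emergency",
        "level3": "Triage",
        "doc_type": "Report",
    }
]
payload = json.dumps(queries, indent=2)

# Hypothetical call; requires a running server and may need different kwargs:
# from gradio_client import Client
# result = Client("http://your-server:7860/").predict(queries=payload, api_name="/evaluate")
```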
### Evaluation Results

Results are saved to the `reports/` directory:
- CSV file with detailed metrics per query
- JSON file with full evaluation data
- Summary statistics by pipeline and k value
## Hierarchy Structure
Each hierarchy defines 3 levels:
- Level1 (Domain): Top-level categorization (e.g., Clinical, Administrative)
- Level2 (Section): Sub-domain within Level1 (e.g., Emergency, Inpatient)
- Level3 (Topic): Specific topic within Level2 (e.g., Triage, Trauma)
Hierarchy files are located in the `hierarchies/` directory and follow YAML format.
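A hierarchy file might look like the sketch below. The actual schema of `hierarchies/hospital.yaml` is not shown in this README, so the key names here (`name`, `levels`) are illustrative only; the level values come from the examples above.

```yaml
# Hypothetical hierarchies/hospital.yaml sketch; key names are illustrative.
name: hospital
levels:
  Clinical:            # level1 (Domain)
    Emergency:         # level2 (Section)
      - Triage         # level3 (Topic)
      - Trauma
    Inpatient:
      - Admissions
  Administrative:
    Billing:
      - Insurance
```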
## Metadata Schema

Chunks are tagged with the following metadata:

```json
{
  "doc_id": "uuid",
  "chunk_id": "uuid",
  "source_name": "filename.pdf",
  "lang": "ja|en",
  "level1": "domain",
  "level2": "section",
  "level3": "topic",
  "doc_type": "policy|manual|faq",
  "chunk_size": 1000,
  "token_count": 250
}
```
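Hier-RAG's metadata pre-filtering works over these fields. A Chroma-style `where` clause can be assembled from the optional filters like the sketch below; this is illustrative, not the actual code in `core/index.py`.

```python
# Sketch: build a Chroma-style `where` filter from optional hierarchy fields.
# Chroma combines multiple equality conditions with an explicit "$and".

def build_where(level1=None, level2=None, level3=None, doc_type=None):
    conditions = [
        {field: value}
        for field, value in [
            ("level1", level1), ("level2", level2),
            ("level3", level3), ("doc_type", doc_type),
        ]
        if value is not None
    ]
    if not conditions:
        return None           # no filter: plain vector search (Base-RAG case)
    if len(conditions) == 1:
        return conditions[0]  # single condition needs no $and wrapper
    return {"$and": conditions}

print(build_where(level1="Clinical", level2="Emergency"))
# {'$and': [{'level1': 'Clinical'}, {'level2': 'Emergency'}]}
```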
## Testing

Run tests with pytest:

```bash
pytest tests/ -v
```
Test coverage includes:
- Document loading and chunking
- Hierarchy classification
- Metadata filtering
- Retrieval pipelines (Base-RAG and Hier-RAG)
- Evaluation metrics calculation
- Vector store operations
- API behaviors
## Architecture

### Retrieval Pipelines

Base-RAG:

- Vector similarity search
- Return top-k results
- Format and return

Hier-RAG:

- Pre-filter by hierarchical tags (level1/2/3, doc_type)
- Vector search within the filtered subset
- Return top-k results
- Format with hierarchy context
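The difference between the two pipelines can be sketched with an in-memory toy store; the real system uses ChromaDB, and the vectors, chunks, and field names below are purely illustrative.

```python
# Toy sketch of Base-RAG vs Hier-RAG over an in-memory store.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

STORE = [
    {"text": "Triage protocol",  "level1": "Clinical",       "vec": [1.0, 0.1]},
    {"text": "Billing policy",   "level1": "Administrative", "vec": [0.1, 1.0]},
    {"text": "Trauma checklist", "level1": "Clinical",       "vec": [0.9, 0.2]},
]

def base_rag(query_vec, k):
    # Rank every chunk by similarity, no metadata involved.
    ranked = sorted(STORE, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

def hier_rag(query_vec, k, level1=None):
    # Pre-filter by metadata, then rank only the surviving subset.
    subset = [c for c in STORE if level1 is None or c["level1"] == level1]
    ranked = sorted(subset, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

print(base_rag([1.0, 0.0], k=2))                    # top-2 over the whole store
print(hier_rag([1.0, 0.0], k=2, level1="Clinical")) # top-2 within Clinical only
```

The pre-filter shrinks the candidate set before similarity ranking, which is what lets Hier-RAG exclude semantically close but hierarchically irrelevant chunks.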
### Vector Database & Embeddings

- ChromaDB with persistence
- Embeddings provider:
  - OpenAI (if `OPENAI_API_KEY` is present): `OPENAI_EMBED_MODEL` (default `text-embedding-3-small`)
  - SentenceTransformers fallback: `ST_EMBED_MODEL` (default `all-MiniLM-L6-v2`)
- Collections namespaced by provider/dimension to avoid mismatches
- Metadata filtering supported for level1/2/3/doc_type
## Acknowledgments
Built for comparing hierarchical vs standard RAG retrieval approaches.