---
title: RAG Evaluation System
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# RAG Evaluation System

A comprehensive system for evaluating **Hierarchical RAG** vs **Standard RAG** pipelines, with support for multiple document types and metadata hierarchies.

## Features

- **Dual RAG Pipelines**: Compare Base-RAG vs Hierarchical RAG side-by-side
- **Multiple Hierarchies**: Hospital, Banking, and Fluid Simulation domains
- **Comprehensive Evaluation**: Quantitative metrics (Hit@k, MRR, latency, semantic similarity) and qualitative analysis
- **Gradio UI**: User-friendly interface for all operations
- **MCP Server**: Model Context Protocol server for programmatic access
- **API Export**: All main functions exposed via the Gradio Client API

## Repository Layout

```
.
├── app.py                   # Spaces entry; defines UI and exposed functions (with api_name)
├── core/                    # Internal logic (NOT publicly exposed)
│   ├── ingest.py            # Loaders, hierarchical classification, chunking
│   ├── index.py             # Embeddings, vector DB, metadata filters
│   ├── retrieval.py         # Base-RAG / Hier-RAG pipelines
│   ├── eval.py              # Metrics: Hit@k, MRR, latency, similarity
│   └── utils.py             # Shared helpers (e.g., PII masking)
├── hierarchies/             # Hierarchy definitions (YAML)
│   ├── hospital.yaml
│   ├── bank.yaml
│   └── fluid_simulation.yaml
├── tests/                   # pytest cases
│   ├── test_ingest.py
│   ├── test_retrieval.py
│   ├── test_eval.py
│   └── test_index.py
├── reports/                 # Evaluation results (CSV/JSON)
├── requirements.txt         # Dependencies
└── README.md                # This file
```

## Setup

### Prerequisites

- Python 3.8+
- pip or conda

### Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd rag-evaluation-system
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Create the necessary directories:

   ```bash
   mkdir -p reports chroma_data
   ```

### Environment Variables

Create a `.env` file (see `.env.example`) to configure the app:

- `OPENAI_API_KEY` (optional): enables OpenAI for embeddings and detection
- `OPENAI_MODEL` (optional): chat model for detection (default `gpt-4o-mini`)
- `OPENAI_EMBED_MODEL` (optional): embedding model (default `text-embedding-3-small`)
- `USE_OPENAI_EMBEDDINGS` (optional): `true|false` to force provider selection
- `ST_EMBED_MODEL` (optional): fallback SentenceTransformers model (default `all-MiniLM-L6-v2`)
- `CHROMA_PERSIST_DIR` (optional): Chroma persistence directory (default `./chroma_data`)
- `DEFAULT_SEARCH_K` (optional): default k in Search auto mode (default `5`)
- `GRADIO_SERVER_PORT` (optional): server port (default `7860`)
- `LOG_LEVEL` (optional): `DEBUG|INFO|...` (default `INFO`)

Note: collections are namespaced by embeddings provider/dimension (e.g., `documents__oai_1536`, `documents__st_384`). Re-upload documents after switching providers/models.
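For example, a minimal `.env` that forces the local SentenceTransformers provider (no OpenAI key set; the values shown are the documented defaults):

```bash
# Force the local embeddings provider instead of OpenAI
USE_OPENAI_EMBEDDINGS=false
ST_EMBED_MODEL=all-MiniLM-L6-v2
CHROMA_PERSIST_DIR=./chroma_data
DEFAULT_SEARCH_K=5
GRADIO_SERVER_PORT=7860
LOG_LEVEL=INFO
```

With this configuration the active collection namespace would be `documents__st_384`, since `all-MiniLM-L6-v2` produces 384-dimensional embeddings.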
## Usage

### Using the Gradio API (gradio_client)

All main functions are exposed via the Gradio API with `api_name`:

- `build_rag`: Build a RAG index from uploaded files
- `search`: Search documents using both pipelines
- `chat`: Chat interface with the RAG system
- `evaluate`: Run a quantitative evaluation

Example usage:

```python
from gradio_client import Client

client = Client("http://your-server:7860/")

# Build RAG index
result = client.predict(
    files=["doc1.pdf", "doc2.pdf"],
    hierarchy="hospital",
    doc_type="Report",
    language="en",
    api_name="/build_rag",
)

# Search documents
results = client.predict(
    query="What are emergency procedures?",
    k=5,
    level1="Clinical",
    level2="Emergency",
    level3=None,
    doc_type="Report",
    api_name="/search",
)
```

### MCP Server

The system can run as an MCP (Model Context Protocol) server for programmatic access:

```bash
python app.py --mcp
```

#### Connecting to the MCP Server

Add to your MCP client configuration (e.g., for Claude Desktop):

```json
{
  "mcpServers": {
    "rag-evaluation": {
      "command": "python",
      "args": ["/path/to/app.py", "--mcp"],
      "env": {}
    }
  }
}
```

#### Available MCP Tools

1. **search_documents**: Search documents using the RAG system
   - Parameters: `query`, `k`, `pipeline`, `level1`, `level2`, `level3`, `doc_type`
2. **evaluate_retrieval**: Evaluate RAG performance with batch queries
   - Parameters: `queries` (array), `output_file`

## UI Tabs

### 1. Upload Documents

- Upload multiple PDF/TXT files
- Set Hierarchy/Doc Type/Language to `Auto` for per-chunk detection (OpenAI preferred; heuristic fallback)
- Paragraph-first chunking with merging of consecutive similar paragraphs (same hierarchy + level1 + level2). Explicit labels (Domain/Section/Topic) "stick" across following paragraphs until overridden
- After the build:
  - Build Status (processed count, indexed chunks)
  - File Summary (Filename, Chunks, Language, Doc Type, Hierarchy)
  - Indexed Chunks (preview with Level1/2/3 and the first 160 chars)

### 2. Search

Default (auto):

- Enter your query and click Search
- k uses `DEFAULT_SEARCH_K` (default 5)
- Filters (level1/2/3, doc_type) are inferred from the query (OpenAI if enabled; otherwise heuristics)

Manual (optional):

- Check "Manual controls" to enable k and the filters (they default to `Auto`)
- Leave `Auto` to detect; set a value to force it

### 3. Chat

Conversational interface for qualitative testing:

- Choose a pipeline (Base-RAG or Hier-RAG)
- Adjust retrieval parameters
- View retrieved sources

### 4. Evaluation

Run a quantitative evaluation:

- Input queries in JSON format with ground truth
- Specify k values for the evaluation
- Apply optional filters
- View metrics: Hit@k, MRR, semantic similarity, latency
- Export results to CSV/JSON in the `reports/` directory

## Evaluation

### Quantitative Evaluation

The system compares Base-RAG vs Hier-RAG on Hit@k, MRR, semantic similarity, and latency. Provide JSON with `ground_truth` to see the metrics and the Performance Comparison chart.

### Evaluation Input Format

```json
[
  {
    "query": "What are emergency procedures?",
    "ground_truth": ["Emergency protocols for triage", "Patient assessment guidelines"],
    "k_values": [1, 3, 5],
    "level1": "Clinical",
    "level2": "Emergency",
    "level3": "Triage",
    "doc_type": "Report"
  }
]
```

### Evaluation Results

Results are saved to the `reports/` directory:

- A CSV file with detailed metrics per query
- A JSON file with the full evaluation data
- Summary statistics by pipeline and k value

## Hierarchy Structure

Each hierarchy defines 3 levels:

- **Level1 (Domain)**: Top-level categorization (e.g., Clinical, Administrative)
- **Level2 (Section)**: Sub-domain within Level1 (e.g., Emergency, Inpatient)
- **Level3 (Topic)**: Specific topic within Level2 (e.g., Triage, Trauma)

Hierarchy files are located in the `hierarchies/` directory and follow YAML format.
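The Hit@k and MRR metrics reported above can be sketched with a minimal reference implementation (the function names and list/set argument shapes here are illustrative, not the actual `core/eval.py` API):

```python
def hit_at_k(retrieved, relevant, k):
    """1.0 if any relevant item appears in the top-k retrieved results, else 0.0."""
    return 1.0 if any(item in relevant for item in retrieved[:k]) else 0.0


def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant item (0.0 if none is retrieved)."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["chunk_a", "chunk_b", "chunk_c"]  # ranked retrieval output
relevant = {"chunk_b"}                         # ground-truth set for the query

print(hit_at_k(retrieved, relevant, 1))  # 0.0 -- first hit is at rank 2
print(hit_at_k(retrieved, relevant, 3))  # 1.0
print(mrr(retrieved, relevant))          # 0.5
```

Per-query scores like these are then averaged across the query set to produce the summary statistics per pipeline and k value.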
## Metadata Schema

Chunks are tagged with the following metadata:

```json
{
  "doc_id": "uuid",
  "chunk_id": "uuid",
  "source_name": "filename.pdf",
  "lang": "ja|en",
  "level1": "domain",
  "level2": "section",
  "level3": "topic",
  "doc_type": "policy|manual|faq",
  "chunk_size": 1000,
  "token_count": 250
}
```

## Testing

Run the tests with pytest:

```bash
pytest tests/ -v
```

Test coverage includes:

- Document loading and chunking
- Hierarchy classification
- Metadata filtering
- Retrieval pipelines (Base-RAG and Hier-RAG)
- Evaluation metrics calculation
- Vector store operations
- API behaviors

## Architecture

### Retrieval Pipelines

**Base-RAG:**

1. Vector similarity search
2. Return top-k results
3. Format and return

**Hier-RAG:**

1. Pre-filter by hierarchical tags (level1/2/3, doc_type)
2. Vector search within the filtered subset
3. Return top-k results
4. Format with hierarchy context

## Vector Database & Embeddings

- ChromaDB with persistence
- Embeddings provider:
  - OpenAI (if `OPENAI_API_KEY` is present): `OPENAI_EMBED_MODEL` (default `text-embedding-3-small`)
  - SentenceTransformers fallback: `ST_EMBED_MODEL` (default `all-MiniLM-L6-v2`)
- Collections namespaced by provider/dimension to avoid mismatch
- Metadata filtering supported for level1/2/3/doc_type

## Acknowledgments

Built for comparing hierarchical vs standard RAG retrieval approaches.
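The key difference between the two pipelines, pre-filtering on hierarchical metadata before similarity ranking, can be sketched in plain Python. The chunk dict shape and function name below are illustrative only; the real pipeline lives in `core/retrieval.py` and pushes the filters down into ChromaDB:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def hier_rag_search(chunks, query_vec, k, **filters):
    """Hier-RAG steps 1-3: keep only chunks matching every non-None metadata
    filter, then rank the survivors by similarity and return the top k.
    With no filters, this degenerates to Base-RAG."""
    pool = [c for c in chunks
            if all(c["meta"].get(key) == val
                   for key, val in filters.items() if val is not None)]
    pool.sort(key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return pool[:k]


chunks = [
    {"vec": [1.0, 0.0], "meta": {"level1": "Clinical", "level2": "Emergency"}},
    {"vec": [0.9, 0.1], "meta": {"level1": "Administrative", "level2": "Billing"}},
    {"vec": [0.0, 1.0], "meta": {"level1": "Clinical", "level2": "Inpatient"}},
]

top = hier_rag_search(chunks, [1.0, 0.0], k=2, level1="Clinical")
print([c["meta"]["level2"] for c in top])  # ['Emergency', 'Inpatient']
```

Note how the `level1="Clinical"` filter removes the administrative chunk before ranking, even though it scores higher on raw vector similarity than the Inpatient chunk.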