---
title: RAG Evaluation System
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# RAG Evaluation System

A comprehensive system for evaluating **Hierarchical RAG** vs **Standard RAG** pipelines, with support for multiple document types and metadata hierarchies.

## Features

- **Dual RAG Pipelines**: Compare Base-RAG vs Hierarchical RAG side-by-side
- **Multiple Hierarchies**: Hospital, Banking, and Fluid Simulation domains
- **Comprehensive Evaluation**: Quantitative metrics (Hit@k, MRR, latency, semantic similarity) and qualitative analysis
- **Gradio UI**: User-friendly interface for all operations
- **MCP Server**: Model Context Protocol server for programmatic access
- **API Export**: All main functions exposed via the Gradio Client API
## Repository Layout

```
.
├── app.py                # Spaces entry; defines UI and exposed functions (with api_name)
├── core/                 # Internal logic (NOT publicly exposed)
│   ├── ingest.py         # Loaders, hierarchical classification, chunking
│   ├── index.py          # Embeddings, vector DB, metadata filters
│   ├── retrieval.py      # Base-RAG / Hier-RAG pipelines
│   ├── eval.py           # Metrics: Hit@k, MRR, latency, similarity
│   └── utils.py          # Shared helpers (e.g., PII masking)
├── hierarchies/          # Hierarchy definitions (YAML)
│   ├── hospital.yaml
│   ├── bank.yaml
│   └── fluid_simulation.yaml
├── tests/                # pytest cases
│   ├── test_ingest.py
│   ├── test_retrieval.py
│   ├── test_eval.py
│   └── test_index.py
├── reports/              # Evaluation results (CSV/JSON)
├── requirements.txt      # Dependencies
└── README.md             # This file
```
## Setup

### Prerequisites

- Python 3.8+
- pip or conda

### Installation

1. Clone the repository:

```bash
git clone <repository-url>
cd rag-evaluation-system
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Create the necessary directories:

```bash
mkdir -p reports chroma_data
```
### Environment Variables

Create a `.env` file (see `.env.example`) to configure the app:

- `OPENAI_API_KEY` (optional): enables OpenAI for embeddings and detection
- `OPENAI_MODEL` (optional): chat model for detection (default `gpt-4o-mini`)
- `OPENAI_EMBED_MODEL` (optional): embedding model (default `text-embedding-3-small`)
- `USE_OPENAI_EMBEDDINGS` (optional): `true|false` to force provider selection
- `ST_EMBED_MODEL` (optional): fallback SentenceTransformers model (default `all-MiniLM-L6-v2`)
- `CHROMA_PERSIST_DIR` (optional): Chroma directory (default `./chroma_data`)
- `DEFAULT_SEARCH_K` (optional): default k in Search auto mode (default `5`)
- `GRADIO_SERVER_PORT` (optional): server port (default `7860`)
- `LOG_LEVEL` (optional): `DEBUG|INFO|...` (default `INFO`)

Note: collections are namespaced by embeddings provider/dimension (e.g., `documents__oai_1536`, `documents__st_384`). Re-upload after switching providers/models.
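The namespacing scheme can be sketched as a small helper. The function below is illustrative (the real logic lives in `core/index.py`); only the resulting names match the documented pattern.

```python
# Hypothetical sketch of the provider/dimension namespacing described above.
# Only the output format (e.g., documents__oai_1536) is taken from this README.
def collection_name(base: str, provider: str, dim: int) -> str:
    """Build a Chroma collection name namespaced by embeddings provider/dimension."""
    tag = {"openai": "oai", "sentence-transformers": "st"}[provider]
    return f"{base}__{tag}_{dim}"

print(collection_name("documents", "openai", 1536))                # documents__oai_1536
print(collection_name("documents", "sentence-transformers", 384))  # documents__st_384
```

Because the dimension is baked into the name, switching providers never queries an index built with incompatible vectors.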
## Usage

### Using the Gradio API (gradio_client)

All main functions are exposed via the Gradio API with `api_name`:

- `build_rag`: Build a RAG index from uploaded files
- `search`: Search documents using both pipelines
- `chat`: Chat interface for the RAG system
- `evaluate`: Run quantitative evaluation

Example usage:

```python
from gradio_client import Client

client = Client("http://your-server:7860/")

# Build RAG index
result = client.predict(
    files=["doc1.pdf", "doc2.pdf"],
    hierarchy="hospital",
    doc_type="Report",
    language="en",
    api_name="/build_rag"
)

# Search documents
results = client.predict(
    query="What are emergency procedures?",
    k=5,
    level1="Clinical",
    level2="Emergency",
    level3=None,
    doc_type="Report",
    api_name="/search"
)
```
### MCP Server

The system can run as an MCP (Model Context Protocol) server for programmatic access:

```bash
python app.py --mcp
```

#### Connecting to the MCP Server

Add the server to your MCP client configuration (e.g., for Claude Desktop):

```json
{
  "mcpServers": {
    "rag-evaluation": {
      "command": "python",
      "args": ["/path/to/app.py", "--mcp"],
      "env": {}
    }
  }
}
```
#### Available MCP Tools

1. **search_documents**: Search documents using the RAG system
   - Parameters: `query`, `k`, `pipeline`, `level1`, `level2`, `level3`, `doc_type`
2. **evaluate_retrieval**: Evaluate RAG performance with batch queries
   - Parameters: `queries` (array), `output_file`
## UI Tabs

### 1. Upload Documents

- Upload multiple PDF/TXT files
- Set Hierarchy/Doc Type/Language to `Auto` for per-chunk detection (OpenAI preferred; heuristic fallback)
- Paragraph-first chunking with merging of consecutive similar paragraphs (same hierarchy + level1 + level2). Explicit labels (Domain/Section/Topic) "stick" across following paragraphs until overridden
- After the build:
  - Build Status (processed count, indexed chunks)
  - File Summary (Filename, Chunks, Language, Doc Type, Hierarchy)
  - Indexed Chunks (preview with Level1/2/3 and the first 160 chars)

### 2. Search

Default (auto):

- Enter your query and click Search
- k uses `DEFAULT_SEARCH_K` (default 5)
- Filters (level1/2/3, doc_type) are inferred from the query (OpenAI if enabled; else heuristics)

Manual (optional):

- Check "Manual controls" to enable k and filters (they default to `Auto`)
- Leave `Auto` to detect; set a value to force
### 3. Chat

Conversational interface for qualitative testing:

- Choose a pipeline (Base-RAG or Hier-RAG)
- Adjust retrieval parameters
- View retrieved sources

### 4. Evaluation

Run quantitative evaluation:

- Input queries in JSON format with ground truth
- Specify k values for evaluation
- Apply optional filters
- View metrics: Hit@k, MRR, semantic similarity, latency
- Export results to CSV/JSON in the `reports/` directory
## Evaluation

### Quantitative Evaluation

The system compares Base-RAG vs Hier-RAG on Hit@k, MRR, semantic similarity, and latency. Provide JSON with `ground_truth` to see metrics and the Performance Comparison chart.
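The two ranking metrics can be illustrated with a toy example. The function names below are illustrative, not the project's API (its implementation lives in `core/eval.py`):

```python
# Toy illustration of Hit@k and MRR over a ranked retrieval result list.
def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top-k results, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d7", "d1", "d9", "d2"]  # retrieval order
relevant = {"d1", "d5"}                  # ground truth
print(hit_at_k(ranked, relevant, 1))  # 0.0 -> no relevant doc in the top-1
print(hit_at_k(ranked, relevant, 3))  # 1.0 -> d1 appears at rank 3
print(mrr(ranked, relevant))          # 1/3 -> first hit at rank 3
```

Hit@k rewards any relevant document in the top k; MRR additionally rewards ranking it early, which is why both are reported per k value.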
### Evaluation Input Format

```json
[
  {
    "query": "What are emergency procedures?",
    "ground_truth": ["Emergency protocols for triage", "Patient assessment guidelines"],
    "k_values": [1, 3, 5],
    "level1": "Clinical",
    "level2": "Emergency",
    "level3": "Triage",
    "doc_type": "Report"
  }
]
```
### Evaluation Results

Results are saved to the `reports/` directory:

- CSV file with detailed metrics per query
- JSON file with full evaluation data
- Summary statistics by pipeline and k value

## Hierarchy Structure

Each hierarchy defines three levels:

- **Level1 (Domain)**: Top-level categorization (e.g., Clinical, Administrative)
- **Level2 (Section)**: Sub-domain within Level1 (e.g., Emergency, Inpatient)
- **Level3 (Topic)**: Specific topic within Level2 (e.g., Triage, Trauma)

Hierarchy files live in the `hierarchies/` directory and use YAML format.
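A hierarchy file might look like the sketch below. This is a hypothetical shape using the level names and examples above; the actual schema in `hierarchies/hospital.yaml` may differ:

```yaml
# Hypothetical sketch of a hierarchy definition; not the actual file contents.
name: hospital
levels:
  - name: Clinical            # Level1 (Domain)
    sections:
      - name: Emergency       # Level2 (Section)
        topics: [Triage, Trauma]          # Level3 (Topic)
      - name: Inpatient
        topics: [Admission, Discharge]
  - name: Administrative
    sections:
      - name: Billing
        topics: [Insurance, Claims]
```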
## Metadata Schema

Chunks are tagged with the following metadata:

```json
{
  "doc_id": "uuid",
  "chunk_id": "uuid",
  "source_name": "filename.pdf",
  "lang": "ja|en",
  "level1": "domain",
  "level2": "section",
  "level3": "topic",
  "doc_type": "policy|manual|faq",
  "chunk_size": 1000,
  "token_count": 250
}
```
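In Python terms, a record matching this schema could be built as follows. The helper is hypothetical (the real tagging happens during ingestion in `core/ingest.py`); only the field names come from the schema above:

```python
import uuid

# Hypothetical builder for a chunk-metadata record matching the schema above.
def make_chunk_metadata(doc_id, source_name, lang, level1, level2, level3,
                        doc_type, chunk_size, token_count):
    return {
        "doc_id": doc_id,
        "chunk_id": str(uuid.uuid4()),  # each chunk gets its own UUID
        "source_name": source_name,
        "lang": lang,
        "level1": level1,
        "level2": level2,
        "level3": level3,
        "doc_type": doc_type,
        "chunk_size": chunk_size,
        "token_count": token_count,
    }

meta = make_chunk_metadata(str(uuid.uuid4()), "guidelines.pdf", "en",
                           "Clinical", "Emergency", "Triage", "manual", 1000, 250)
```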
## Testing

Run the tests with pytest:

```bash
pytest tests/ -v
```

Test coverage includes:

- Document loading and chunking
- Hierarchy classification
- Metadata filtering
- Retrieval pipelines (Base-RAG and Hier-RAG)
- Evaluation metrics calculation
- Vector store operations
- API behaviors
## Architecture

### Retrieval Pipelines

**Base-RAG:**

1. Vector similarity search
2. Return top-k results
3. Format and return

**Hier-RAG:**

1. Pre-filter by hierarchical tags (level1/2/3, doc_type)
2. Vector search within the filtered subset
3. Return top-k results
4. Format with hierarchy context
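The difference between the two pipelines can be sketched over toy data. This is a minimal illustration with hand-rolled cosine similarity; the real pipelines in `core/retrieval.py` run against ChromaDB with its metadata filters:

```python
import math

# Minimal sketch: Base-RAG ranks all chunks, Hier-RAG pre-filters by
# hierarchy metadata first, then ranks within the filtered subset.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def base_rag(chunks, query_vec, k):
    ranked = sorted(chunks, key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return ranked[:k]

def hier_rag(chunks, query_vec, k, filters):
    subset = [c for c in chunks
              if all(c.get(key) == val for key, val in filters.items())]
    return base_rag(subset, query_vec, k)

chunks = [
    {"id": "c1", "level1": "Clinical",       "vec": [1.0, 0.1]},
    {"id": "c2", "level1": "Administrative", "vec": [0.9, 0.2]},
    {"id": "c3", "level1": "Clinical",       "vec": [0.2, 1.0]},
]
query = [1.0, 0.0]
print([c["id"] for c in base_rag(chunks, query, 2)])                         # ['c1', 'c2']
print([c["id"] for c in hier_rag(chunks, query, 2, {"level1": "Clinical"})])  # ['c1', 'c3']
```

The pre-filter changes which chunks can appear at all: the administrative chunk `c2` is close to the query but excluded by the `level1` filter, which is exactly the behavior the evaluation metrics are meant to measure.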
## Vector Database & Embeddings

- ChromaDB with persistence
- Embeddings provider:
  - OpenAI (if `OPENAI_API_KEY` is present): `OPENAI_EMBED_MODEL` (default `text-embedding-3-small`)
  - SentenceTransformers fallback: `ST_EMBED_MODEL` (default `all-MiniLM-L6-v2`)
- Collections namespaced by provider/dimension to avoid dimension mismatches
- Metadata filtering supported for level1/2/3/doc_type

## Acknowledgments

Built for comparing hierarchical vs standard RAG retrieval approaches.