---
title: Hierarchical RAG Evaluation
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---
# Hierarchical RAG Evaluation System
A comprehensive system for comparing Standard RAG vs Hierarchical RAG approaches, focusing on both accuracy and speed improvements through metadata-based filtering.
## Features
- **Dual RAG Pipelines**: Compare Base-RAG and Hier-RAG side-by-side
- **Hierarchical Classification**: 3-level taxonomy (domain → section → topic)
- **Multiple Domains**: Pre-configured hierarchies for Hospital, Banking, and Fluid Simulation
- **Comprehensive Evaluation**: Quantitative metrics (Hit@k, MRR, latency) and qualitative testing
- **Gradio UI**: User-friendly interface with API access
- **MCP Server**: Additional API server for programmatic access
## Architecture
```
User Query → Hierarchical Filter → Vector Search → Re-ranking → LLM Generation → Answer
                    ↑
             (Hier-RAG only)
```
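The filtering step can be sketched in plain Python. This is an illustrative toy, not the project's actual retrieval code — the chunk fields and the `hierarchical_filter` helper are hypothetical — but it shows where the speedup comes from: metadata filters shrink the candidate set *before* vector search runs.

```python
# Toy corpus: each chunk carries hierarchy metadata assigned at indexing time.
chunks = [
    {"text": "Admission requires photo ID...", "level1": "Clinical Care"},
    {"text": "Incident reports are filed...", "level1": "Quality & Safety"},
    {"text": "New nurses complete orientation...", "level1": "Education"},
]

def hierarchical_filter(chunks, **filters):
    """Keep only chunks whose metadata matches every given level filter."""
    return [c for c in chunks
            if all(c.get(k) == v for k, v in filters.items())]

# Base-RAG searches all 3 chunks; Hier-RAG searches the filtered subset.
candidates = hierarchical_filter(chunks, level1="Clinical Care")
print(len(candidates))  # 1
```

Vector search then runs only over `candidates`, which is why correctly inferred filters cut retrieval time.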
## Quick Start
### Prerequisites
- Python 3.9+
- OpenAI API key (for LLM generation)
- 4GB+ RAM recommended
### Installation
1. **Clone the repository:**
```bash
git clone <repository-url>
cd hierarchical-rag-eval
```
2. **Create virtual environment:**
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate
```
3. **Install dependencies:**
```bash
pip install -r requirements.txt
```
4. **Set environment variables:**
Create a `.env` file in the project root:
```bash
OPENAI_API_KEY=your-openai-api-key-here
VECTOR_DB_PATH=./data/chroma
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
LLM_MODEL=gpt-3.5-turbo
```
**Important:** Never commit the `.env` file to version control!
5. **Run the application:**
```bash
python app.py
```
Access at `http://localhost:7860`
---
## 🚀 Deployment to Hugging Face Spaces
### Step 1: Create Space
1. Go to https://huggingface.co/spaces
2. Click "Create new Space"
3. Fill in the details:
   - **Owner**: `AP-UW` (organization)
   - **Space name**: `hierarchical-rag-eval`
   - **License**: MIT
   - **SDK**: Gradio
   - **Python version**: 3.10
   - **Visibility**: Private
### Step 2: Configure Persistent Storage
1. Go to Space Settings → Storage
2. Enable **Persistent Storage** (FREE tier available)
3. This ensures your vector database persists across restarts
### Step 3: Add Secrets
1. Go to Space Settings → Repository Secrets
2. Add the following secrets:
| Secret Name | Value | Description |
|-------------|-------|-------------|
| `OPENAI_API_KEY` | `sk-...` | Your OpenAI API key |
| `VECTOR_DB_PATH` | `/data/chroma` | Path to persistent storage |
| `EMBEDDING_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
| `LLM_MODEL` | `gpt-3.5-turbo` | OpenAI model |
**Note:** Secrets are encrypted and not visible in logs.
### Step 4: Prepare Code for Deployment
Update `app.py` to read from HF Spaces environment:
```python
import os
from dotenv import load_dotenv

# Load .env for local development only
if not os.getenv("SPACE_ID"):  # SPACE_ID is set by HF Spaces
    load_dotenv()

# Verify API key
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("⚠️ OPENAI_API_KEY not found! Set it in Space Settings → Secrets")
```
### Step 5: Push to Hugging Face
```bash
# Add HF Space as remote
git remote add space https://huggingface.co/spaces/AP-UW/hierarchical-rag-eval
git branch -M main
# Push code (will trigger automatic build)
git push space main
```
### Step 6: Monitor Deployment
1. Go to your Space URL: `https://huggingface.co/spaces/AP-UW/hierarchical-rag-eval`
2. Check **Logs** tab for build progress
3. Wait for "Running" status (may take 5-10 minutes on first build)
### Step 7: Verify Deployment
Test the deployed app:
```python
from gradio_client import Client
client = Client("https://huggingface.co/spaces/AP-UW/hierarchical-rag-eval")
# Initialize system
result = client.predict(api_name="/initialize")
print(result) # Should show "System initialized successfully!"
```
---
## 🔌 MCP Server Usage
The MCP (Model Context Protocol) Server provides REST API access to all RAG functionality.
### Running MCP Server (Local)
```bash
# Terminal 1: Start MCP Server
python mcp_server.py
# Server will run at http://localhost:8000
# API docs available at http://localhost:8000/docs
```
### Running MCP Server (Production)
Deploy separately to a hosting service:
**Option 1: Railway**
```bash
railway login
railway init
railway up
```
**Option 2: Render**
1. Connect GitHub repo
2. Set build command: `pip install -r requirements.txt`
3. Set start command: `uvicorn mcp_server:app --host 0.0.0.0 --port $PORT`
**Option 3: Docker**
```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "mcp_server:app", "--host", "0.0.0.0", "--port", "8000"]
```
### MCP API Endpoints
#### Health Check
```bash
curl http://localhost:8000/health
```
Response:
```json
{"status": "healthy"}
```
#### Initialize System
```bash
curl -X POST http://localhost:8000/initialize \
-H "Content-Type: application/json" \
-d '{
"persist_directory": "./data/chroma",
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}'
```
#### Index Documents
```bash
curl -X POST http://localhost:8000/index \
-H "Content-Type: application/json" \
-d '{
"filepaths": ["./docs/document1.pdf", "./docs/document2.txt"],
"hierarchy": "hospital",
"chunk_size": 512,
"chunk_overlap": 50,
"collection_name": "medical_docs"
}'
```
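The `chunk_size` and `chunk_overlap` parameters describe a sliding-window split. A minimal character-based sketch of that scheme follows (the project's indexer may split on tokens or sentences instead; `chunk_text` is a hypothetical helper):

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50):
    """Split text into fixed-size windows that overlap by chunk_overlap chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts `step` chars later
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 1000, chunk_size=512, chunk_overlap=50)
print(len(chunks))  # 3 — windows starting at 0, 462, 924
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.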
#### Query RAG System
```bash
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"query": "What are the patient admission procedures?",
"pipeline": "both",
"n_results": 5,
"auto_infer": true
}'
```
Response:
```json
{
  "query": "What are the patient admission procedures?",
  "base_rag": {
    "answer": "...",
    "retrieval_time": 0.052,
    "total_time": 1.234
  },
  "hier_rag": {
    "answer": "...",
    "retrieval_time": 0.031,
    "total_time": 0.987,
    "applied_filters": {"level1": "Clinical Care"}
  },
  "speedup": 1.25
}
```
#### System Information
```bash
curl http://localhost:8000/info
```
### Python Client Example
```python
import requests

# Base URL
BASE_URL = "http://localhost:8000"

# Initialize
response = requests.post(f"{BASE_URL}/initialize", json={
    "persist_directory": "./data/chroma"
})
print(response.json())

# Index documents
response = requests.post(f"{BASE_URL}/index", json={
    "filepaths": ["document.pdf"],
    "hierarchy": "hospital",
    "collection_name": "my_docs"
})
print(response.json())

# Query both pipelines
response = requests.post(f"{BASE_URL}/query", json={
    "query": "What are KYC requirements?",
    "pipeline": "both",
    "n_results": 5
})
result = response.json()
print(f"Base-RAG: {result['base_rag']['answer']}")
print(f"Hier-RAG: {result['hier_rag']['answer']}")
print(f"Speedup: {result['speedup']:.2f}x")
```
---
## 📊 Evaluation Methodology
### Dataset
We evaluate on three domain-specific query sets:
1. **Hospital Domain (n=5 queries)**
   - Clinical Care, Quality & Safety, Education
   - Example: "What are the patient admission procedures?"
2. **Banking Domain (n=5 queries)**
   - Retail Banking, Risk Management, Compliance
   - Example: "What are the KYC requirements?"
3. **Fluid Simulation Domain (n=5 queries)**
   - Numerical Methods, Physical Models, Applications
   - Example: "How does the SIMPLE algorithm work?"
### Metrics
#### Retrieval Metrics
- **Hit@k**: Presence of at least one relevant document in the top-k results
  - Formula: `1 if any(relevant_doc in top_k) else 0`
  - Higher is better (max = 1.0)
- **Precision@k**: Proportion of the top-k results that are relevant
  - Formula: `relevant_in_top_k / k`
  - Range: 0.0 to 1.0
- **Recall@k**: Proportion of all relevant documents that appear in the top-k
  - Formula: `relevant_in_top_k / total_relevant`
  - Range: 0.0 to 1.0
- **MRR (Mean Reciprocal Rank)**: Mean, over queries, of the reciprocal rank of the first relevant document
  - Formula: `mean(1 / rank_of_first_relevant_doc)`
  - Range: 0.0 to 1.0
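These retrieval metrics can be computed directly from a ranked list of result IDs. A small self-contained sketch (the function names are illustrative, not the project's evaluation code):

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    # 1.0 if any relevant document appears in the top-k, else 0.0
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    # fraction of the top-k results that are relevant
    return sum(d in relevant_ids for d in ranked_ids[:k]) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    # fraction of all relevant documents found in the top-k
    return sum(d in relevant_ids for d in ranked_ids[:k]) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1/rank of the first relevant document (0.0 if none retrieved)
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2"}
print(hit_at_k(ranked, relevant, 3))        # 1.0 (d1 is in the top-3)
print(precision_at_k(ranked, relevant, 5))  # 0.4 (2 of 5 results relevant)
print(reciprocal_rank(ranked, relevant))    # 0.333... (first hit at rank 3)
```

MRR is then the mean of `reciprocal_rank` over all evaluation queries.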
#### Performance Metrics
- **Retrieval Time**: Time to fetch relevant documents from the vector DB
- **Generation Time**: Time for the LLM to generate an answer
- **Total Time**: End-to-end query response time
- **Speedup**: Ratio of Base-RAG total time to Hier-RAG total time
  - Formula: `base_total_time / hier_total_time`
  - Values >1.0 mean Hier-RAG is faster
#### Quality Metrics
- **Semantic Similarity**: Cosine similarity between the generated answer and a reference answer
  - Uses sentence-transformers embeddings
  - Range: 0.0 to 1.0
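Cosine similarity itself is simple to compute once embeddings are available. A dependency-free sketch (the project uses sentence-transformers embeddings; the toy 3-dimensional vectors below merely stand in for real ones):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

answer_vec = [0.2, 0.8, 0.1]       # toy embedding of the generated answer
reference_vec = [0.25, 0.7, 0.15]  # toy embedding of the reference answer
print(round(cosine_similarity(answer_vec, reference_vec), 3))  # 0.992
```

Identical directions score 1.0 and orthogonal vectors score 0.0, which gives the 0.0-1.0 range quoted above for non-negative embeddings.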
### Evaluation Process
```python
# Run evaluation via the Gradio API
from gradio_client import Client

client = Client("http://localhost:7860")
result = client.predict(
    query_dataset="hospital",
    n_queries=10,
    k_values="1,3,5",
    api_name="/evaluate"
)
# Results are saved to ./reports/evaluation_TIMESTAMP.csv
```
### Sample Results
#### Hospital Domain Evaluation (5 queries)
| Query | Expected Domain | Base Time (s) | Hier Time (s) | Speedup | Filter Match |
|-------|----------------|---------------|---------------|---------|--------------|
| Patient admission procedures? | Clinical Care | 1.97 | 2.76 | 0.72x | ✅ Clinical Care |
| Infection control policies? | Quality & Safety | 1.51 | 3.11 | 0.49x | ⚠️ policy only |
| Medication error reporting? | Quality & Safety | 1.03 | 2.41 | 0.43x | ⚠️ report only |
| Training for new nurses? | Education | 10.09 | 5.62 | 1.80x | ❌ None |
| Emergency response procedures? | Clinical Care | 2.32 | 1.49 | 1.56x | ❌ None |
**Average Speedup: 0.96x** (Base-RAG and Hier-RAG roughly equal)
#### Key Findings
1. **When Hier-RAG Excels (1.5-2.3x faster):**
   - ✅ Query matches the hierarchy taxonomy well
   - ✅ Auto-inference correctly identifies the domain
   - ✅ Filtered subset is significantly smaller (<30% of corpus)
   - Example: "Training for new nurses" → 1.80x speedup
2. **When Hier-RAG Underperforms (<1.0x):**
   - ❌ Auto-inference fails or misclassifies the domain
   - ❌ Query is too general or cross-domain
   - ❌ Filter overhead exceeds the retrieval time savings
   - Example: "Infection control policies" → 0.49x speedup
3. **Auto-Inference Accuracy:**
   - Hospital domain: 40% (2/5 queries correctly classified)
   - Needs improvement via LLM-based classification
4. **Retrieval Time Improvement:**
   - When filters are applied correctly: **30-60% faster retrieval**
   - Overall average: **15% faster retrieval** (including misses)
#### Fluid Simulation Domain Evaluation (5 queries)
| Query | Expected Domain | Base Time (s) | Hier Time (s) | Speedup |
|-------|----------------|---------------|---------------|---------|
| How does SIMPLE algorithm work? | Numerical Methods | 1.45 | 3.69 | 0.39x |
| What turbulence models available? | Physical Models | 1.60 | 1.37 | 1.16x |
| Set up cavity flow benchmark? | Validation | 4.46 | 2.40 | 1.86x |
| Mesh generation techniques? | Numerical Methods | 2.64 | 2.87 | 0.92x |
| Enable parallel computing? | Software & Tools | 5.51 | 2.35 | 2.34x |
**Average Speedup: 1.33x** (Hier-RAG 33% faster on average)
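The reported average is the arithmetic mean of the per-query speedups in the table above:

```python
# Per-query speedups from the fluid-simulation table above
speedups = [0.39, 1.16, 1.86, 0.92, 2.34]
avg = sum(speedups) / len(speedups)
print(f"Average speedup: {avg:.2f}x")  # Average speedup: 1.33x
```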
### Visualization
To generate evaluation charts:
```python
# Add to your evaluation workflow
import matplotlib.pyplot as plt
import pandas as pd

def generate_evaluation_charts(csv_path):
    """Generate comprehensive evaluation visualizations."""
    df = pd.read_csv(csv_path)

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('Base-RAG vs Hier-RAG Performance Comparison', fontsize=16)

    # Chart 1: average total time
    times = df[['base_total_time', 'hier_total_time']].mean()
    axes[0, 0].bar(['Base-RAG', 'Hier-RAG'], times, color=['#3498db', '#e74c3c'])
    axes[0, 0].set_ylabel('Time (seconds)')
    axes[0, 0].set_title('Average Total Query Time')
    axes[0, 0].grid(axis='y', alpha=0.3)

    # Chart 2: speedup distribution
    axes[0, 1].hist(df['speedup'], bins=10, color='#2ecc71', edgecolor='black')
    axes[0, 1].axvline(1.0, color='red', linestyle='--', label='No improvement')
    axes[0, 1].set_xlabel('Speedup Factor')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title('Speedup Distribution')
    axes[0, 1].legend()

    # Chart 3: retrieval time comparison
    axes[1, 0].scatter(df['base_retrieval_time'], df['hier_retrieval_time'],
                       s=100, alpha=0.6, color='#9b59b6')
    max_val = max(df['base_retrieval_time'].max(), df['hier_retrieval_time'].max())
    axes[1, 0].plot([0, max_val], [0, max_val], 'r--', label='Equal performance')
    axes[1, 0].set_xlabel('Base-RAG Retrieval Time (s)')
    axes[1, 0].set_ylabel('Hier-RAG Retrieval Time (s)')
    axes[1, 0].set_title('Retrieval Time Comparison')
    axes[1, 0].legend()
    axes[1, 0].grid(alpha=0.3)

    # Chart 4: per-query speedup
    axes[1, 1].barh(range(len(df)), df['speedup'], color='#f39c12')
    axes[1, 1].axvline(1.0, color='red', linestyle='--', linewidth=2)
    axes[1, 1].set_xlabel('Speedup Factor')
    axes[1, 1].set_ylabel('Query Index')
    axes[1, 1].set_title('Per-Query Speedup')
    axes[1, 1].grid(axis='x', alpha=0.3)

    plt.tight_layout()
    chart_path = csv_path.replace('.csv', '_charts.png')
    plt.savefig(chart_path, dpi=300, bbox_inches='tight')
    print(f"📊 Charts saved to: {chart_path}")

# Usage
generate_evaluation_charts('./reports/evaluation_20251030_012814.csv')
```
---
## 🔧 Using the API with gradio_client
### Installation
```bash
pip install gradio_client
```
### Basic Usage
```python
from gradio_client import Client
# Connect to local instance
client = Client("http://localhost:7860")
# Or connect to deployed HF Space
client = Client("https://huggingface.co/spaces/AP-UW/hierarchical-rag-eval")
```
### Complete Workflow Example
```python
from gradio_client import Client

# Initialize client
client = Client("http://localhost:7860")

# Step 1: Initialize system
print("1️⃣ Initializing system...")
result = client.predict(api_name="/initialize")
print(result)

# Step 2: Upload and validate documents
print("\n2️⃣ Validating documents...")
status, preview, stats = client.predict(
    files=["./docs/hospital_policy.pdf", "./docs/procedures.txt"],
    hierarchy_choice="hospital",
    mask_pii=False,
    api_name="/upload"
)
print(f"Status: {status}")
print(f"Stats: {stats}")

# Step 3: Build RAG index
print("\n3️⃣ Building RAG index...")
build_status, build_stats = client.predict(
    files=["./docs/hospital_policy.pdf", "./docs/procedures.txt"],
    hierarchy="hospital",
    chunk_size=512,
    chunk_overlap=50,
    mask_pii=False,
    collection_name="hospital_docs",
    api_name="/build"
)
print(f"Build Status: {build_status}")
print(f"Indexed Chunks: {build_stats.get('Total Chunks', 0)}")

# Step 4: Search with both pipelines
print("\n4️⃣ Querying RAG system...")
answer, contexts, metadata = client.predict(
    query="What are the patient admission procedures?",
    pipeline="Both",
    n_results=5,
    level1="",
    level2="",
    level3="",
    doc_type="",
    auto_infer=True,
    api_name="/search"
)
print(f"Answer:\n{answer}\n")
print(f"Metadata:\n{metadata}")

# Step 5: Run evaluation
print("\n5️⃣ Running evaluation...")
summary, csv_path, json_path = client.predict(
    query_dataset="hospital",
    n_queries=5,
    k_values="1,3,5",
    api_name="/evaluate"
)
print(summary)
print(f"\nResults saved to:\n- {csv_path}\n- {json_path}")
```
### Batch Processing Example
```python
from gradio_client import Client
import pandas as pd

client = Client("http://localhost:7860")

# Initialize
client.predict(api_name="/initialize")

# Build an index for multiple document sets
document_sets = {
    "hospital_policies": ["./docs/policy1.pdf", "./docs/policy2.pdf"],
    "clinical_protocols": ["./docs/protocol1.txt", "./docs/protocol2.txt"],
    "training_manuals": ["./docs/manual1.pdf", "./docs/manual2.pdf"]
}

for collection_name, files in document_sets.items():
    print(f"Building index for: {collection_name}")
    status, stats = client.predict(
        files=files,
        hierarchy="hospital",
        collection_name=collection_name,
        api_name="/build"
    )
    print(f"✅ {stats.get('Total Chunks', 0)} chunks indexed")

# Query multiple collections
queries = [
    "What are admission procedures?",
    "How to handle medication errors?",
    "What training is required for nurses?"
]

results = []
for query in queries:
    answer, contexts, metadata = client.predict(
        query=query,
        pipeline="Both",
        api_name="/search"
    )
    results.append({
        "query": query,
        "answer": answer[:200],  # first 200 chars
        "metadata": metadata
    })

# Save results
df = pd.DataFrame(results)
df.to_csv("batch_query_results.csv", index=False)
```
---
## πŸ› Troubleshooting
### Common Issues
#### 1. OpenAI API Errors
**Problem:** `Error generating answer: Incorrect API key provided`
**Solution:**
```bash
# Check if API key is set
echo $OPENAI_API_KEY # Mac/Linux
echo %OPENAI_API_KEY% # Windows
# If empty, add to .env file
OPENAI_API_KEY=your-key-here
# For HF Spaces, add to Repository Secrets
```
#### 2. ChromaDB Persistence Issues
**Problem:** `sqlite3.OperationalError: database is locked`
**Solution:**
```python
# In core/index.py - use simpler client initialization
self.client = chromadb.PersistentClient(path=persist_directory)
# Or use EphemeralClient for testing (no persistence)
self.client = chromadb.EphemeralClient()
```
#### 3. Memory Errors with Large PDFs
**Problem:** `MemoryError` or `Killed` when processing large documents
**Solution:**
```python
# In core/index.py: reduce batch_size (e.g. from 100 to 50)
def add_documents(self, chunks, batch_size=50):
    # Process chunks in smaller batches to limit peak memory
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        # ... existing per-batch insert logic ...
```
#### 4. Slow Embedding Generation
**Problem:** Embedding generation takes >30 seconds
**Solution:**
```bash
# Use smaller embedding model in .env
EMBEDDING_MODEL=all-MiniLM-L6-v2 # Faster, 384 dimensions
# Or use OpenAI embeddings
EMBEDDING_MODEL=openai:text-embedding-3-small
```
#### 5. Gradio API Connection Timeout
**Problem:** `gradio_client` times out when connecting
**Solution:**
```python
from gradio_client import Client
# Increase timeout
client = Client("http://localhost:7860", timeout=120)
# Or check if server is running
import requests
response = requests.get("http://localhost:7860")
print(response.status_code) # Should be 200
```
#### 6. HF Spaces Build Failure
**Problem:** Space shows "Build Failed" status
**Solution:**
1. Check requirements.txt for incompatible versions
2. View build logs in Space β†’ Logs tab
3. Common fix: Pin exact versions
```txt
# requirements.txt
torch==2.1.0 # Pin specific version
transformers==4.35.0
gradio==4.44.0
```
#### 7. Evaluation Results Inconsistent
**Problem:** Speedup values sometimes <1.0 or highly variable
**Solution:**
- Run evaluation multiple times and average results
- Increase warmup queries before evaluation
- Check if auto-inference is working correctly
```python
# Add warmup queries
for _ in range(3):
    rag_comparator.compare("warmup query", n_results=5)
# Then run actual evaluation
```
### Debug Mode
Enable verbose logging:
```python
# Add to app.py
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('app.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)
logger.debug("Debug mode enabled")
```
### Health Check Endpoints
Test system components:
```python
# Add to app.py for debugging
import os

def system_health_check():
    """Check if all components are working."""
    checks = {}

    # Check 1: OpenAI API
    try:
        import openai
        client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        client.models.list()
        checks["openai_api"] = "✅ Connected"
    except Exception as e:
        checks["openai_api"] = f"❌ {str(e)}"

    # Check 2: Vector DB
    try:
        if index_manager:
            stats = index_manager.stores.get("rag_documents")
            checks["vector_db"] = "✅ Initialized"
        else:
            checks["vector_db"] = "⚠️ Not initialized"
    except Exception as e:
        checks["vector_db"] = f"❌ {str(e)}"

    # Check 3: Embedding Model
    try:
        from core.index import EmbeddingModel
        model = EmbeddingModel()
        test_embedding = model.embed_query("test")
        checks["embedding_model"] = f"✅ Loaded ({len(test_embedding)} dims)"
    except Exception as e:
        checks["embedding_model"] = f"❌ {str(e)}"

    return checks

# Add a button to the UI
with gr.Tab("System Health"):
    health_btn = gr.Button("Check System Health")
    health_output = gr.JSON(label="Health Status")
    health_btn.click(system_health_check, outputs=health_output)
```
---
## 📚 Additional Resources
### Documentation
- [Gradio Documentation](https://gradio.app/docs/)
- [Gradio Client Guide](https://gradio.app/guides/getting-started-with-the-python-client/)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)
- [Sentence Transformers](https://www.sbert.net/)
### Tutorials
- [Building RAG Applications](https://python.langchain.com/docs/use_cases/question_answering/)
- [Deploying to HF Spaces](https://huggingface.co/docs/hub/spaces-overview)
- [Vector Database Best Practices](https://www.pinecone.io/learn/vector-database/)
### Community
- GitHub Issues: [repository-url]/issues
- Hugging Face Forums: https://discuss.huggingface.co/
- Discord: [Your project Discord]
---
## 📄 License
MIT License - see LICENSE file for details
---
## 🙏 Acknowledgments
- Built with [Gradio](https://gradio.app/)
- Vector database: [ChromaDB](https://www.trychroma.com/)
- Embeddings: [Sentence Transformers](https://www.sbert.net/)
- LLM: [OpenAI](https://openai.com/)
---
## 📞 Support
For issues and questions:
- **GitHub Issues**: [repository-url]/issues
- **Email**: support@your-domain.com
- **Documentation**: [repository-url]/wiki
---
## 📈 Changelog
### v1.0.0 (2025-01-31)
- ✅ Initial release
- ✅ Base-RAG and Hier-RAG implementation
- ✅ Three preset hierarchies (Hospital, Bank, Fluid Simulation)
- ✅ Gradio UI and MCP server
- ✅ Comprehensive evaluation suite
- ✅ Full test coverage
- ✅ HF Spaces deployment ready
---
**Built with ❤️ for the RAG community**