Upload 9 files
- .dockerignore +78 -0
- .gitattributes +19 -0
- .gitignore +0 -0
- Dockerfile +49 -0
- README.md +639 -0
- docker-compose.yml +94 -0
- init_project.py +43 -0
- requirements.txt +46 -0
- results.xlsx +3 -0
.dockerignore
ADDED
@@ -0,0 +1,78 @@
# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
.Python

# Virtual environments
env/
venv/
ENV/
env.bak/
venv.bak/
.venv/

# IDE
.vscode/
.idea/
*.swp
*.swo
.DS_Store

# Logs
*.log
logs/
*.log.*

# Git
.git/
.gitignore
.gitattributes

# Documentation
README.md
docs/
*.pdf

# Tests
tests/
test_*.py
*_test.py

# Temporary files
tmp/
temp/
*.tmp
.cache/

# Jupyter notebooks
*.ipynb
.ipynb_checkpoints/

# Database temporary files (keep main DB, ignore temp files)
*.db-shm
*.db-wal

# Environment files (will be injected via docker-compose)
.env
.env.*

# Large corpus files (will handle separately)
# Uncomment if not including in Docker image
# chroma_db/
# data/wikipedia_corpus.parquet

# Build artifacts
dist/
build/
*.egg-info/

# Docker files (no need to copy into image)
Dockerfile
docker-compose.yml
.dockerignore

# GitHub/CI files
.github/
.gitmodules
.gitattributes
ADDED
@@ -0,0 +1,19 @@
# Git LFS Configuration for Large Files
# ChromaDB vector stores (entire directory)
chroma_db/** filter=lfs diff=lfs merge=lfs -text
data/vector_stores/** filter=lfs diff=lfs merge=lfs -text
# SQLite databases (all variants)
*.db filter=lfs diff=lfs merge=lfs -text
*.sqlite filter=lfs diff=lfs merge=lfs -text
*.sqlite3 filter=lfs diff=lfs merge=lfs -text
# Binary index files
*.bin filter=lfs diff=lfs merge=lfs -text
# Parquet data files
*.parquet filter=lfs diff=lfs merge=lfs -text
# JSONL corpus files
*.jsonl filter=lfs diff=lfs merge=lfs -text
# Model files (if any)
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.gguf filter=lfs diff=lfs merge=lfs -text
data/vector_stores/**/chroma.sqlite3 filter=lfs diff=lfs merge=lfs -text
results.xlsx filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
Binary file (999 Bytes)
Dockerfile
ADDED
@@ -0,0 +1,49 @@
# Phase 5: Production Dockerfile for RAG Pipeline Optimizer
FROM python:3.12-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    git-lfs \
    && rm -rf /var/lib/apt/lists/*

# Initialize Git LFS
RUN git lfs install

# Copy requirements first (for Docker layer caching)
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY config/ ./config/
COPY core/ ./core/
COPY utils/ ./utils/
COPY scripts/ ./scripts/
COPY app/ ./app/
COPY data/ ./data/
COPY chroma_db/ ./chroma_db/

# Create necessary directories
RUN mkdir -p logs

# Expose Streamlit port
EXPOSE 8501

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD curl --fail http://localhost:8501/_stcore/health || exit 1

# Run Streamlit dashboard
CMD ["streamlit", "run", "app/dashboard.py", \
     "--server.port=8501", \
     "--server.address=0.0.0.0", \
     "--server.headless=true", \
     "--browser.gatherUsageStats=false"]
README.md
ADDED
@@ -0,0 +1,639 @@
# RAG Pipeline Optimizer - Phase 1 Complete ✅

An MLOps platform for evaluating and optimizing RAG (Retrieval-Augmented Generation) pipelines across multiple models and configurations.

---

## 🎯 Project Overview

**The Problem**: Every company has a RAG system, but almost no one knows whether their RAG is any good. Is chunk_size=512 better than 1024? Is Cohere a better embedder than OpenAI for their data? They're just guessing.

**The Solution**: A full-stack RAG evaluation platform that runs multiple pipeline configurations in parallel, scores them using AI evaluation, and shows you which configuration works best for YOUR data.

---

## ✅ Phase 1: Complete

### What's Built

- ✅ **Project structure** with clean separation of concerns
- ✅ **6 diverse RAG pipelines**, each leveraging a different strategy:
  - Pipeline A: Speed-Optimized (Azure GPT-5)
  - Pipeline B: Accuracy-Optimized (Azure GPT-5 + Reranking)
  - Pipeline C: Balanced (Azure Cohere)
  - Pipeline D: Reasoning (Anthropic Claude)
  - Pipeline E: Cost-Optimized (Azure DeepSeek)
  - Pipeline F: Experimental (xAI Grok)
- ✅ **Configuration management** with environment variables (see the sketch below)
- ✅ **Cost estimation** for each pipeline
- ✅ **Comprehensive tests** to validate configurations
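For orientation, here is a minimal sketch of what one entry in `config/pipeline_configs.py` could look like. Only `ALL_PIPELINES`, `chunk_size`, and `top_k` are known from the tests; every other field name below is illustrative, not the project's exact schema.

```python
# Hypothetical sketch of a pipeline configuration entry; field names other than
# chunk_size / top_k are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    pipeline_id: str           # "A" .. "F"
    name: str                  # human-readable label
    llm_provider: str          # e.g. "azure-openai", "anthropic"
    llm_model: str             # deployment or model name
    embedding_model: str       # e.g. "text-embedding-3-small"
    chunk_size: int = 512      # tokens per chunk
    chunk_overlap: int = 50
    top_k: int = 5             # chunks retrieved per query
    use_reranking: bool = False

# Illustrative entries; the real registry defines all six pipelines A-F.
ALL_PIPELINES = {
    "A": PipelineConfig("A", "Speed-Optimized", "azure-openai", "gpt-5-chat",
                        "text-embedding-3-small", chunk_size=512, top_k=3),
    "B": PipelineConfig("B", "Accuracy-Optimized", "azure-openai", "gpt-5-chat",
                        "text-embedding-3-large", chunk_size=512, top_k=10,
                        use_reranking=True),
}
```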
### Technology Stack

| Component | Technology | Purpose |
|-----------|-----------|---------|
| **LLM Providers** | Azure OpenAI, Cohere, DeepSeek, Anthropic, xAI | Diverse model comparison |
| **Embeddings** | OpenAI, Sentence-Transformers | Vector representations |
| **Vector DB** | ChromaDB | Local vector storage |
| **Framework** | LangChain | RAG orchestration |
| **Storage** | SQLite | Results & metadata |
| **Backend** (Phase 2) | FastAPI | REST API |
| **Frontend** (Phase 3) | Streamlit | User interface |
| **Deployment** (Phase 4) | Hugging Face Spaces | Cloud hosting |

---
+
## 📁 Project Structure
|
| 47 |
+
rag_optimizer/
|
| 48 |
+
├── config/
|
| 49 |
+
│ ├── init.py
|
| 50 |
+
│ └── pipeline_configs.py # 6 pipeline configurations
|
| 51 |
+
├── core/ # [Phase 2] Document processing
|
| 52 |
+
│ ├── init.py
|
| 53 |
+
│ ├── document_loader.py # [Coming next]
|
| 54 |
+
│ ├── chunker.py # [Coming next]
|
| 55 |
+
│ ├── embedder.py # [Coming next]
|
| 56 |
+
│ ├── vector_store.py # [Coming next]
|
| 57 |
+
│ ├── retriever.py # [Coming next]
|
| 58 |
+
│ ├── generator.py # [Coming next]
|
| 59 |
+
│ └── pipeline.py # [Coming next]
|
| 60 |
+
├── data/
|
| 61 |
+
│ ├── uploads/ # User-uploaded documents
|
| 62 |
+
│ ├── vector_stores/ # ChromaDB storage
|
| 63 |
+
│ └── results.db # SQLite evaluation results
|
| 64 |
+
├── utils/
|
| 65 |
+
│ ├── init.py
|
| 66 |
+
│ └── database.py # [Phase 3]
|
| 67 |
+
├── tests/
|
| 68 |
+
│ ├── init.py
|
| 69 |
+
│ └── test_pipeline_config.py # ✅ Tests pass
|
| 70 |
+
├── .env # Your API keys (DO NOT COMMIT)
|
| 71 |
+
├── .env.example # Template for .env
|
| 72 |
+
├── requirements.txt # Python dependencies
|
| 73 |
+
└── README.md # This file
|
------------------------------------------------------------------------------------------------------------------------------
🚀 Quick Start
------------------------------------------------------------------------------------------------------------------------------
Open the "rag_optimizer" directory in VS Code.

1. Installation

# Navigate to the project and create the skeleton
python init_project.py

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
.\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure keys (this project uses Azure AI Foundry for OpenAI, Cohere, and DeepSeek)
a) Set ENDPOINT, API_KEY, and DEPLOYMENT_NAME for each model served via Azure; for the rest (Anthropic, Grok), only an API_KEY is needed
b) cp .env.example .env and fill in your keys
2. Verify Setup

# View pipeline comparison
python config/pipeline_configs.py

# Run tests
python tests/test_pipeline_config.py

Last Updated: January 14, 2026
Project: RAG Pipeline Optimizer
Phase: 1 of 5

# RAG Pipeline Optimizer - Phase 2 Complete ✅
Phase 2: Core RAG Components

Successfully implemented and tested all core components for document processing, embedding generation, and vector storage using the LangChain framework.

🎯 Phase 2 Deliverables
✅ Document Loader - Multi-format document parsing (PDF, DOCX, TXT, MD, PPTX, XLSX)
✅ Text Chunker - LangChain-based chunking with multiple strategies
✅ Embedder - Local + Azure OpenAI embeddings
✅ Vector Store - ChromaDB with LangChain integration

📁 Files Created
rag_optimizer/
├── core/
│   ├── __init__.py
│   ├── document_loader.py     ✅ Multi-format document loading
│   ├── chunker.py             ✅ LangChain text splitting
│   ├── embedder.py            ✅ Embedding generation
│   └── vector_store.py        ✅ ChromaDB vector storage
│
├── data/
│   ├── uploads/               📂 User uploaded documents
│   └── vector_stores/         📂 Persisted vector databases
│
└── requirements.txt           ✅ Updated with LangChain packages

🔧 Components Overview
1. Document Loader (core/document_loader.py)
Purpose: Load and parse documents in multiple formats

Supported Formats:

PDF (.pdf) - Extracts text with page numbers
Word (.docx) - Paragraphs and formatting
Text (.txt) - Plain text files
Markdown (.md) - Converts to plain text
PowerPoint (.pptx) - Slide content
Excel (.xlsx) - Sheet data

Key Features:
Automatic format detection
Metadata extraction (file size, page count)
Error handling for corrupted files
Batch document loading
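As a rough illustration of the loading step, here is a PDF-only sketch built directly on pypdf (already in requirements.txt); the real core/document_loader.py adds format detection and the other formats listed above.

```python
# Minimal PDF-loading sketch using pypdf directly; core/document_loader.py
# wraps this behind its own multi-format interface.
from pathlib import Path
from pypdf import PdfReader

def load_pdf(path: str) -> list[dict]:
    """Return one record per page with text plus basic file/page metadata."""
    reader = PdfReader(path)
    file_size = Path(path).stat().st_size
    return [
        {
            "text": page.extract_text() or "",
            "metadata": {
                "source": path,
                "page": i + 1,
                "page_count": len(reader.pages),
                "file_size": file_size,
            },
        }
        for i, page in enumerate(reader.pages)
    ]
```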
2. Text Chunker (core/chunker.py)
Purpose: Split documents into semantic chunks for embedding

Framework: LangChain Text Splitters

Chunking Strategies:
| Strategy | Description | Use Case | Quality |
| ----------- | -------------------------------- | ----------------------------- | ------- |
| recursive ✅ | Tries \n\n → \n → . → " " | RECOMMENDED for all pipelines | A+ |
| character | Simple character-based splitting | Basic documents | B |
| token | Token-aware splitting | Token-limited models | B |
| sentence | Sentence boundary detection | Short documents | C |

Key Features:

Configurable chunk size (tokens)
Overlap for context preservation
Clean semantic boundaries
No fragment generation
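A minimal sketch of the recommended recursive strategy using LangChain's splitter; the chunk size, overlap, and separator list here are illustrative values, not the project's exact settings.

```python
# Recursive chunking sketch with LangChain; values are illustrative defaults.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # measured in characters by default (token-aware via .from_tiktoken_encoder)
    chunk_overlap=50,    # overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)
text = open("data/uploads/example.txt", encoding="utf-8").read()
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks")
```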
3. Embedder (core/embedder.py)
Purpose: Generate vector embeddings for text chunks

Framework: LangChain Embeddings

Supported Providers:
| Provider | Model | Dimension | Cost | Speed | Use Case |
| --------------------- | ---------------------- | --------- | --------------- | ------ | ------------------- |
| sentence-transformers | all-MiniLM-L6-v2 | 384D | FREE ✅ | Fast | Development/Testing |
| sentence-transformers | all-mpnet-base-v2 | 768D | FREE ✅ | Medium | Better quality |
| azure-openai | text-embedding-3-small | 1536D | $0.02/1M tokens | Fast | Production |
| azure-openai | text-embedding-3-large | 3072D | $0.13/1M tokens | Medium | Highest accuracy |

Key Features:

Automatic batching for efficiency
Cosine similarity calculation
Normalized embeddings
Local caching (future)
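A minimal sketch of the free local option follows; for production the same interface would point at Azure OpenAI embeddings instead. The model name matches the table above.

```python
# Local embedding sketch via langchain-huggingface (free, no API key needed).
from langchain_huggingface import HuggingFaceEmbeddings

embedder = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    encode_kwargs={"normalize_embeddings": True},  # unit-length vectors: cosine == dot product
)
vectors = embedder.embed_documents(["What is retrieval-augmented generation?"])
print(len(vectors[0]))  # 384 dimensions
```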
4. Vector Store (core/vector_store.py)
Purpose: Store and retrieve document chunks using vector similarity

Framework: LangChain + ChromaDB

Key Features:
Local persistent storage (no external DB needed)
Fast similarity search (cosine distance)
Metadata filtering
LangChain retriever integration
Collection management

Storage Structure:
data/vector_stores/
└── {collection_name}/
    ├── chroma.sqlite3     # Metadata
    └── {uuid}/            # Vector data
        └── data_level0.bin
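A small sketch of persistent storage plus similarity search through langchain-chroma; the collection name and paths are illustrative.

```python
# Persistent ChromaDB store + similarity search sketch; names/paths are illustrative.
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

store = Chroma(
    collection_name="pipeline_a",
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    ),
    persist_directory="data/vector_stores/pipeline_a",
)
store.add_texts(["Paris is the capital of France."], metadatas=[{"source": "demo"}])
docs = store.similarity_search("What is the capital of France?", k=3)
retriever = store.as_retriever(search_kwargs={"k": 3})  # LangChain retriever integration
```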
------------------------------------------------------------------------------------------------------------------------------
🚀 Quick Start
------------------------------------------------------------------------------------------------------------------------------
1) .\venv\Scripts\activate
2) pip install -r requirements.txt
3) python core/document_loader.py
4) python core/chunker.py
5) python core/embedder.py
6) python core/vector_store.py

Last Updated: January 14, 2026, 8:38 PM EST
Project: RAG Pipeline Optimizer
Phase: 2 of 5

# 📘 Phase 3: Pipeline Orchestration & Parallel Evaluation

Phase 3 Roadmap (Step-by-Step)
Step 1: Generator Module ⬅️ START HERE
Build the LLM interface for all 6 pipelines (Azure OpenAI, Cohere, DeepSeek, Claude, Grok)
Step 2: Retriever Module
Combine VectorStore + optional reranking (Pipeline B uses Cohere rerank)
Step 3: Pipeline Orchestrator
Connect all components: Document → Chunks → Embeddings → Retrieval → Generation
Step 4: Dataset Integration
Download wiki_dpr + Natural Questions, load into vector stores
Step 5: Parallel Execution
Run all 6 pipelines on the same query simultaneously
Step 6: Evaluation & Results Storage
SQLite database to store query results, costs, metrics

🎯 Phase 3 Overview
Phase 3 integrated all core RAG components into a fully functional multi-pipeline evaluation system capable of running 6 different RAG configurations in parallel, comparing their performance, and storing results for analysis.

What We Built
✅ LLM Generator (core/generator.py) - Multi-provider response generation
✅ Smart Retriever (core/retriever.py) - Context retrieval with optional reranking
✅ Pipeline Orchestrator (core/pipeline.py) - End-to-end RAG workflow
✅ Parallel Evaluator (scripts/run_parallel_evaluation.py) - Simultaneous pipeline execution
✅ Analysis Dashboard (scripts/analyze_results.py) - Performance comparison tools
✅ Database Schema (data/evaluation_results.db) - SQLite storage for metrics
✅ Dataset Integration (scripts/dataset_loader.py) - NQ-Open evaluation dataset
✅ Corpus Ingestion (scripts/ingest_corpus.py) - Wikipedia knowledge base
🏗️ Architecture Overview

USER QUERY INPUT
      │
      ▼
PARALLEL PIPELINE EXECUTION (6 Pipelines)
  Pipeline A (Speed) · B (Accuracy) · C (Balanced) · D (Reasoning) · E (Cost) · F (Experimental)
      │
      ▼
VECTOR STORE (ChromaDB)
  Retrieves top-k relevant chunks for each pipeline
      │
      ▼
RETRIEVER (with optional reranking)
  • Pipeline B: Cohere reranking (accuracy boost)
  • Others: Direct similarity search
      │
      ▼
GENERATOR
  • Pipeline A: Azure GPT-5 (fast)
  • Pipeline B: Azure GPT-5 (high quality)
  • Pipeline C: Azure Cohere Command
  • Pipeline D: Anthropic Claude (reasoning)
  • Pipeline E: DeepSeek V3.2 (cost-optimized)
  • Pipeline F: Groq Llama (experimental)
      │
      ▼
EVALUATION & METRICS COLLECTION
  • Answer correctness (exact match + fuzzy)
  • Latency tracking (retrieval + generation)
  • Cost calculation (per query)
  • Token usage monitoring
      │
      ▼
SQLite DATABASE (evaluation_results.db)
  Stores: queries, answers, metrics, timestamps
      │
      ▼
ANALYSIS DASHBOARD (analyze_results.py)
  • Pipeline comparison
  • Cost efficiency analysis
  • Question difficulty breakdown
  • Excel export for deeper analysis
📦 Components Built in Phase 3

1. Generator (core/generator.py)
Purpose: Interface to all LLM providers with unified response handling.

Features:

✅ Multi-provider support (Azure OpenAI, Cohere, Claude, DeepSeek, Groq)
✅ Prompt template management
✅ Automatic cost calculation
✅ Token usage tracking
✅ Error handling & retries
✅ Response parsing with strict format validation

Supported Models:
AZURE_GPT5       = "gpt-5-chat"          # Fast, high quality
AZURE_COHERE     = "cohere-command-a"    # Balanced performance
AZURE_DEEPSEEK   = "DeepSeek-V3.2"       # Ultra cost-efficient
ANTHROPIC_CLAUDE = "claude-3-5-sonnet"   # Advanced reasoning
GROQ_LLAMA       = "llama-3.3-70b"       # Experimental, fast inference
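To make the unified-response idea concrete, here is a hedged sketch of a single-provider generation call with cost accounting; core/generator.py wraps all five providers behind one interface, and the per-token prices below are placeholders, not the project's actual rate card.

```python
# Single-provider generation sketch (Azure OpenAI SDK); prices are placeholders.
import os
from openai import AzureOpenAI

PRICE_PER_1K = {"prompt": 0.002, "completion": 0.008}  # placeholder USD rates

def generate(question: str, context: str, deployment: str = "gpt-5-chat") -> dict:
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )
    resp = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    usage = resp.usage
    cost = (usage.prompt_tokens * PRICE_PER_1K["prompt"]
            + usage.completion_tokens * PRICE_PER_1K["completion"]) / 1000
    return {
        "answer": resp.choices[0].message.content,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "generation_cost_usd": cost,
    }
```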
2. Retriever (core/retriever.py)
Purpose: Fetch relevant context chunks with optional reranking.

Features:

✅ Semantic similarity search (ChromaDB)
✅ Cohere reranking for Pipeline B (accuracy boost)
✅ Configurable top-k retrieval
✅ Score normalization
✅ Metadata filtering
✅ Performance timing

Retrieval Strategies:
| Pipeline | Strategy | Chunks | Reranking | Use Case |
| -------- | -------- | ------ | --------- | --------------- |
| A | Fast | 3 | ❌ | Speed-critical |
| B | Accuracy | 10 | ✅ Cohere | Maximum quality |
| C-F | Standard | 5-10 | ❌ | General use |
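The retrieve-then-rerank flow can be sketched as follows. The project calls a Cohere rerank model hosted on Azure; the plain cohere.Client call below is an illustrative stand-in, not the exact client used in core/retriever.py.

```python
# Retrieve-then-rerank sketch; the Cohere call stands in for the Azure-hosted rerank endpoint.
import cohere

def retrieve(store, query: str, top_k: int = 10, rerank: bool = False, keep: int = 5):
    # Plain semantic similarity search against the pipeline's ChromaDB collection
    docs = [d.page_content for d in store.similarity_search(query, k=top_k)]
    if not rerank:
        return docs[:keep]
    co = cohere.Client()  # reads CO_API_KEY from the environment
    ranked = co.rerank(model="rerank-english-v3.0", query=query,
                       documents=docs, top_n=keep)
    return [docs[r.index] for r in ranked.results]
```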
3. Pipeline Orchestrator (core/pipeline.py)
Purpose: End-to-end RAG workflow coordinator.

Features:

✅ Component integration (Embedder → VectorStore → Retriever → Generator)
✅ Stage-wise timing (retrieval_time_ms, generation_time_ms, total_time_ms)
✅ Cost accumulation
✅ Metadata tracking
✅ Error recovery

Pipeline Flow:
User Query → Embedding → Vector Search → Rerank (optional) → LLM Generation → Response
(each stage is timed individually; the stage timings sum into total_time_ms)

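The stage-wise timing above boils down to wrapping each stage with a timer. A minimal sketch, with retrieve_fn and generate_fn standing in for the project's retriever and generator:

```python
# Stage-wise timing sketch; retrieve_fn/generate_fn are stand-ins for the real components.
import time

def run_pipeline(query: str, retrieve_fn, generate_fn) -> dict:
    t0 = time.perf_counter()
    chunks = retrieve_fn(query)
    t1 = time.perf_counter()
    answer = generate_fn(query, "\n\n".join(chunks))
    t2 = time.perf_counter()
    return {
        "answer": answer,
        "retrieval_time_ms": (t1 - t0) * 1000,
        "generation_time_ms": (t2 - t1) * 1000,
        "total_time_ms": (t2 - t0) * 1000,
    }
```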
4. Parallel Evaluator (scripts/run_parallel_evaluation.py)
Purpose: Run all 6 pipelines simultaneously on the evaluation dataset.

Features:

✅ Concurrent execution (ThreadPoolExecutor)
✅ Progress tracking (tqdm)
✅ Automatic database insertion
✅ Error isolation (one pipeline failure doesn't stop the others)
✅ Answer validation (exact match + fuzzy matching)
✅ Run ID tracking for experiment management

Performance Metrics Tracked:

✅ Accuracy (answer_found: 0 or 1)
✅ Latency (retrieval_time_ms, generation_time_ms, total_time_ms)
✅ Cost (generation_cost_usd, total_cost_usd)
✅ Token usage (prompt_tokens, completion_tokens, total_tokens)
✅ Retrieval quality (num_chunks_retrieved, retrieval_scores)
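Concurrent execution with error isolation can be sketched with ThreadPoolExecutor; the per-pipeline callables below stand in for the project's pipeline objects.

```python
# Error-isolated parallel execution sketch across pipelines.
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_question(question: str, pipelines: dict) -> dict:
    """pipelines maps a pipeline id ("A".."F") to a callable taking the question."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
        futures = {pool.submit(fn, question): pid for pid, fn in pipelines.items()}
        for fut in as_completed(futures):
            pid = futures[fut]
            try:
                results[pid] = fut.result()
            except Exception as exc:  # one failing pipeline doesn't stop the others
                results[pid] = {"error": str(exc)}
    return results
```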
5. Analysis Dashboard (scripts/analyze_results.py)
Purpose: Comprehensive evaluation results analysis.

Features:

✅ Pipeline performance summary (accuracy, cost, speed)
✅ Cost efficiency analysis (cost per correct answer)
✅ Time breakdown (retrieval vs generation)
✅ Token usage statistics
✅ Retrieval quality metrics
✅ Difficult questions identification (0% accuracy)
✅ Easy questions identification (>66% accuracy)
✅ Question-by-question comparison
✅ Excel export with 8 detailed sheets

Usage:
# View dashboard in terminal
python scripts/analyze_results.py

# Export to Excel
python scripts/analyze_results.py --export results.xlsx

# List all runs
python scripts/analyze_results.py --list-runs
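The cost-efficiency numbers can also be reproduced straight from the results table; an illustrative query over the schema documented below (analyze_results.py may aggregate differently):

```python
# Per-pipeline accuracy and cost per correct answer, computed directly from SQLite.
import sqlite3

with sqlite3.connect("data/evaluation_results.db") as conn:
    rows = conn.execute("""
        SELECT pipeline_name,
               AVG(answer_found)                                  AS accuracy,
               SUM(total_cost_usd)                                AS total_cost_usd,
               SUM(total_cost_usd) / NULLIF(SUM(answer_found), 0) AS cost_per_correct_answer,
               AVG(total_time_ms)                                 AS avg_latency_ms
        FROM evaluation_results
        GROUP BY pipeline_name
        ORDER BY accuracy DESC
    """).fetchall()

for row in rows:
    print(row)
```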
6. Database Schema (data/evaluation_results.db)
Table: evaluation_results
| Column | Type | Description |
| -------------------- | ------------------- | --------------------------------------------------- |
| id | INTEGER PRIMARY KEY | Auto-increment ID |
| run_id | TEXT | Evaluation run identifier (e.g., "20260117_182253") |
| pipeline_id | TEXT | Pipeline identifier |
| pipeline_name | TEXT | Human-readable pipeline name |
| question_id | TEXT | Question identifier from dataset |
| query | TEXT | Input question |
| ground_truth_answers | TEXT | JSON array of correct answers |
| retrieved_chunks | TEXT | JSON array of context chunks |
| retrieval_scores | TEXT | JSON array of similarity scores |
| num_chunks_retrieved | INTEGER | Number of chunks retrieved |
| retrieval_time_ms | REAL | Time spent on retrieval |
| reranking_time_ms | REAL | Time spent on reranking (if applicable) |
| reranked | INTEGER | Whether reranking was used (0 or 1) |
| generated_answer | TEXT | Model's generated answer |
| generation_time_ms | REAL | Time spent on generation |
| prompt_tokens | INTEGER | Input tokens used |
| completion_tokens | INTEGER | Output tokens generated |
| total_tokens | INTEGER | Total tokens (prompt + completion) |
| generation_cost_usd | REAL | Cost of generation |
| total_cost_usd | REAL | Total query cost |
| total_time_ms | REAL | End-to-end latency |
| has_answer | INTEGER | Whether an answer is present (1 or 0) |
| answer_found | INTEGER | Whether the answer is correct (1 or 0) |
| timestamp | TEXT | ISO 8601 timestamp |
------------------------------------------------------------------------------------------------------------------------------
🚀 Quick Start
------------------------------------------------------------------------------------------------------------------------------
Prerequisites:
# Ensure Phase 1 & 2 are complete
✅ 6 pipeline configurations defined
✅ All API keys in .env
✅ ChromaDB vector store populated
✅ Wikipedia corpus ingested

1) core/generator.py                              # LLM response generation
2) core/retriever.py                              # Context retrieval + reranking
3) core/pipeline.py                               # End-to-end orchestration
4) utils/dataset_loader.py                        # Load Natural Questions + Wikipedia dataset
5) scripts/ingest_corpus_selective_pipeline.py (see below)   # Ingest the Wikipedia corpus into all 6 pipelines
6) python scripts/run_generic_evaluation.py --num-questions 60 --pipelines A,B,C,D,E,F   # Parallel RAG pipeline evaluation
7) scripts/analyze_results.py                     # Results dashboard - different run modes generate different output

For a large-scale dataset:
5a) python scripts/ingest_corpus_selective_pipeline.py --pipelines A,C,D,E,F --passages 500000 --batch-size 5000
5b) python scripts/ingest_corpus_selective_pipeline.py --pipelines B --passages 500000 --batch-size 1000

Last Updated: January 17, 2026, 7:38 PM EST
Project: RAG Pipeline Optimizer
Phase: 3 of 5

# 📊 Phase 4: Advanced Evaluation & Interactive Dashboard
🎯 Overview
Phase 4 delivers a two-part system for advanced RAG pipeline evaluation:

Phase 4A: LLM-as-a-Judge evaluation system using GPT-4o to score answer quality across 6 dimensions
Phase 4B: Full-stack interactive Streamlit dashboard for visualizing and comparing results

Together, these provide objective quality scoring and interactive exploration of pipeline performance beyond basic metrics like speed and cost.

📦 Phase 4 Components
Phase 4A: LLM Judge Evaluation System
Automated answer quality scoring using GPT-4o as an AI judge

Phase 4B: Interactive Dashboard
9-page Streamlit application for data exploration and real-time testing

🔬 Phase 4A: LLM Judge Evaluation
Overview
Phase 4A adds multi-dimensional quality scoring to existing evaluation results using GPT-4o as an objective judge. Each answer is scored across 6 quality dimensions, providing insights beyond operational metrics.

✨ Features
6-Dimensional Quality Scoring
Correctness (0-10) - Factual accuracy compared to ground truth
Relevance (0-10) - How well the answer addresses the question
Completeness (0-10) - Coverage of important information
Clarity (0-10) - Clear, understandable language
Conciseness (0-10) - Brevity without sacrificing information
Overall (0-10) - Weighted average of all dimensions

Automated Evaluation
Evaluates existing Phase 3 results retroactively
No need to re-run pipelines
Batch processing with progress tracking
Results stored in a separate database table

Cost-Efficient
Only evaluates answers, not entire pipeline re-runs
Uses GPT-4o-mini for cost efficiency
Batches requests to minimize API calls
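A hedged sketch of the judging step: ask the judge model for JSON scores on the six dimensions above. The real prompt, judge deployment name, and dimension weighting live in core/evaluator.py.

```python
# LLM-as-a-Judge sketch; the deployment name and unweighted average are assumptions.
import json
import os
from openai import AzureOpenAI

DIMENSIONS = ["correctness", "relevance", "completeness", "clarity", "conciseness"]

def judge_answer(question: str, answer: str, ground_truth: list[str]) -> dict:
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )
    prompt = (
        "Score the answer on each dimension from 0 to 10 and return JSON "
        f"with keys {DIMENSIONS} plus 'reasoning'.\n\n"
        f"Question: {question}\nGround truth: {ground_truth}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # judge deployment name is an assumption
        response_format={"type": "json_object"},  # force valid JSON output
        messages=[{"role": "user", "content": prompt}],
    )
    scores = json.loads(resp.choices[0].message.content)
    # Simple unweighted average here; the project documents a weighted overall score.
    scores["overall"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores
```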
🏗️ Architecture
rag_optimizer/
├── core/
│   └── evaluator.py               # LLM Judge implementation
├── utils/
│   └── database.py                # Database utilities for score storage
├── scripts/
│   └── evaluate_with_judge.py     # CLI tool for running evaluations
└── data/
    └── evaluation_results.db      # SQLite (updated schema)

🗄️ Database Schema (Phase 4A Extension)
New Table: evaluation_scores
Stores LLM judge quality scores for each evaluation result.
| Column | Type | Description |
| -------------------- | ------- | ----------------------------------- |
| id | INTEGER | Primary key |
| evaluation_result_id | INTEGER | Foreign key → evaluation_results.id |
| correctness_score | REAL | Factual accuracy (0-10) |
| relevance_score | REAL | Question relevance (0-10) |
| completeness_score | REAL | Information coverage (0-10) |
| clarity_score | REAL | Language clarity (0-10) |
| conciseness_score | REAL | Brevity (0-10) |
| overall_score | REAL | Weighted average (0-10) |
| judge_reasoning | TEXT | LLM's explanation for scores |
| timestamp | TEXT | ISO timestamp |

Indexes:

idx_eval_result on evaluation_result_id
idx_overall_score on overall_score

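For reference, the table and indexes above map to roughly the following SQLite DDL; utils/database.py is the authoritative implementation.

```python
# Sketch of the Phase 4A schema extension as plain sqlite3 DDL, mirroring the documented table.
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS evaluation_scores (
    id                   INTEGER PRIMARY KEY AUTOINCREMENT,
    evaluation_result_id INTEGER REFERENCES evaluation_results(id),
    correctness_score    REAL,
    relevance_score      REAL,
    completeness_score   REAL,
    clarity_score        REAL,
    conciseness_score    REAL,
    overall_score        REAL,
    judge_reasoning      TEXT,
    timestamp            TEXT
);
CREATE INDEX IF NOT EXISTS idx_eval_result   ON evaluation_scores(evaluation_result_id);
CREATE INDEX IF NOT EXISTS idx_overall_score ON evaluation_scores(overall_score);
"""

with sqlite3.connect("data/evaluation_results.db") as conn:
    conn.executescript(DDL)
```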
------------------------------------------------------------------------------------------------------------------------------
🚀 Quick Start
------------------------------------------------------------------------------------------------------------------------------
core/evaluator.py                                    # LLM judge implementation
utils/database.py                                    # score storage utilities
python scripts/evaluate_with_judge.py --latest --limit 5

🖥️ Phase 4B: Interactive Dashboard
Overview
Full-stack Streamlit dashboard with 9 pages for exploring evaluation results and testing pipelines in real time.

✨ Features
🏠 Home Page
Project overview and capabilities
Quick stats (6 pipelines, 5 LLM providers, 500K+ corpus)
Pipeline configuration cards
Modern dark theme UI

📊 Pipeline Comparison
Side-by-side performance metrics
Quality scores from the LLM judge (correctness, relevance, completeness, clarity, conciseness)
Interactive comparison tables
Filter by evaluation run
Sort by accuracy, speed, cost, or quality score
Multi-dimensional scoring

🔍 Question Explorer
Browse all evaluated questions
See how each pipeline answered
View quality scores per answer
Compare answers across pipelines
View retrieved context chunks
Ground truth validation

💰 Cost Analysis
Token usage breakdown
Cost per query analysis
Cost efficiency rankings
Cost per quality point (cost divided by overall score)

⚡ Performance Metrics
Latency analysis (retrieval vs generation)
Time breakdown by pipeline stage
Speed comparisons
Quality-adjusted speed (speed vs quality trade-offs)

🔬 Performance Insights
Analyze pipeline performance across question types, categories, and difficulty
Performance by question type
Performance by pipeline

🧪 Live Testing
Real-time pipeline testing
Category-based question suggestions
Multi-pipeline comparison
Live progress tracking
Answer quality comparison
Instant quality scoring (optional)

📦 Batch Evaluation
Run comprehensive evaluations (5-100 questions)
Multi-pipeline testing
Parallel execution (1-6 workers)
Real-time progress monitoring
Option to run the LLM judge automatically

🏆 Leaderboard
Overall pipeline rankings
Quality-weighted rankings
Multiple sorting options (accuracy, speed, cost, quality)
Performance badges
## Architecture
app/
├── dashboard.py            # Main entry point
└── pages/
    ├── __init__.py
    ├── home.py
    ├── comparison.py
    ├── explorer.py
    ├── cost.py
    ├── performance.py
    ├── testing.py
    ├── leaderboard.py
    ├── batch_evaluation.py
    └── insights.py

app/.streamlit/config.toml

## Run application
streamlit run app/dashboard.py

Last Updated: January 24, 2026, 7:00 PM EST
Project: RAG Pipeline Optimizer
Phase: 4 of 5
docker-compose.yml
ADDED
@@ -0,0 +1,94 @@
version: '3.8'

services:
  rag-optimizer:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: rag-optimizer-dashboard
    ports:
      - "8501:8501"
    volumes:
      # Mount data directories for persistence
      - ./data:/app/data
      - ./chroma_db:/app/chroma_db
      - ./logs:/app/logs
    environment:
      # =====================
      # AZURE AI FOUNDRY (Main OpenAI)
      # =====================
      - AZURE_OPENAI_ENDPOINT=${AZURE_OPENAI_ENDPOINT}
      - AZURE_OPENAI_API_KEY=${AZURE_OPENAI_API_KEY}
      - AZURE_OPENAI_DEPLOYMENT_NAME=${AZURE_OPENAI_DEPLOYMENT_NAME}

      # =====================
      # Azure OpenAI Embeddings
      # =====================
      - AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=${AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME}
      - AZURE_OPENAI_EMBEDDING_MODEL_NAME=${AZURE_OPENAI_EMBEDDING_MODEL_NAME}
      - AZURE_OPENAI_EMBEDDING_ENDPOINT=${AZURE_OPENAI_EMBEDDING_ENDPOINT}
      - AZURE_OPENAI_EMBEDDING_API_KEY=${AZURE_OPENAI_EMBEDDING_API_KEY}

      # =====================
      # Cohere via Azure AI Foundry
      # =====================
      - AZURE_COHERE_ENDPOINT=${AZURE_COHERE_ENDPOINT}
      - AZURE_COHERE_API_KEY=${AZURE_COHERE_API_KEY}
      - AZURE_COHERE_DEPLOYMENT_NAME=${AZURE_COHERE_DEPLOYMENT_NAME}

      # =====================
      # Azure Cohere Rerank (for retrieval)
      # =====================
      - AZURE_COHERE_RERANK_MODEL_NAME=${AZURE_COHERE_RERANK_MODEL_NAME}
      - AZURE_COHERE_RERANK_ENDPOINT=${AZURE_COHERE_RERANK_ENDPOINT}
      - AZURE_COHERE_RERANK_KEY=${AZURE_COHERE_RERANK_KEY}

      # =====================
      # DeepSeek via Azure AI Foundry
      # =====================
      - AZURE_DEEPSEEK_ENDPOINT=${AZURE_DEEPSEEK_ENDPOINT}
      - AZURE_DEEPSEEK_API_KEY=${AZURE_DEEPSEEK_API_KEY}
      - AZURE_DEEPSEEK_DEPLOYMENT_NAME=${AZURE_DEEPSEEK_DEPLOYMENT_NAME}

      # =====================
      # Anthropic (Direct API)
      # =====================
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}

      # =====================
      # Grok (Direct API)
      # =====================
      - GROK_API_KEY=${GROK_API_KEY}

      # =====================
      # Database Configuration
      # =====================
      - DATABASE_URL=${DATABASE_URL:-sqlite:///./data/results.db}

      # =====================
      # ChromaDB Configuration
      # =====================
      - CHROMA_PERSIST_DIR=${CHROMA_PERSIST_DIR:-./data/vector_stores}

      # =====================
      # Streamlit Configuration
      # =====================
      - STREAMLIT_SERVER_PORT=8501
      - STREAMLIT_SERVER_ADDRESS=0.0.0.0
      - STREAMLIT_BROWSER_GATHER_USAGE_STATS=false

    restart: unless-stopped

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8501/_stcore/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

    networks:
      - rag-network

networks:
  rag-network:
    driver: bridge
init_project.py
ADDED
@@ -0,0 +1,43 @@
import os

PROJECT_ROOT = "."

DIRS = [
    f"{PROJECT_ROOT}/core",
    f"{PROJECT_ROOT}/config",
    f"{PROJECT_ROOT}/data/uploads",
    f"{PROJECT_ROOT}/data/vector_stores",
    f"{PROJECT_ROOT}/utils",
    f"{PROJECT_ROOT}/tests",
]

TEST_PIPELINE = """\
import pytest
from config.pipeline_configs import ALL_PIPELINES

def test_pipeline_registry():
    assert len(ALL_PIPELINES) >= 4
    for key, cfg in ALL_PIPELINES.items():
        assert cfg.chunk_size > 0
        assert cfg.top_k > 0
"""

if __name__ == "__main__":
    for d in DIRS:
        os.makedirs(d, exist_ok=True)
        # Create __init__.py so non-data directories are importable packages
        if "data" not in d:
            init_path = os.path.join(d, "__init__.py")
            if not os.path.exists(init_path):
                open(init_path, "w").close()

    # Write a starter test for the pipeline registry
    with open(f"{PROJECT_ROOT}/tests/test_pipeline.py", "w") as f:
        f.write(TEST_PIPELINE)

    print("✅ Phase 1 project skeleton created.")
    print("Next steps:")
    print("1. cd rag_optimizer")
    print("2. python -m venv venv")
    print("3. source venv/bin/activate  # or venv\\Scripts\\activate on Windows")
    print("4. pip install -r requirements.txt")
    print("5. cp .env.example .env && fill your keys")
requirements.txt
ADDED
@@ -0,0 +1,46 @@
# Core Framework
langchain
langchain-huggingface
langchain-openai
langchain-cohere
langchain-text-splitters
langchain-chroma

datasets
hf_xet

# Vector Database
chromadb

# Embeddings
sentence-transformers
openai
cohere

# LLM
anthropic

# Document Loading
pypdf
python-docx
python-pptx
openpyxl
markdown
beautifulsoup4

# Utils
python-dotenv
pydantic
tqdm
pandas

# Storage
sqlalchemy

# API (Phase 2)
fastapi
uvicorn
python-multipart

# Frontend (Phase 2)
streamlit
plotly
results.xlsx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4342adb5aa2740a5e25d2d03abb96d1ab7becdd9c74698c7276f1f7ce8dcd3fd
size 101826