# RAG Pipeline Optimizer - Phase 1 Complete ✅

An MLOps platform for evaluating and optimizing RAG (Retrieval-Augmented Generation) pipelines across multiple models and configurations.


## 🎯 Project Overview

The Problem: Every company has a RAG system, but almost no one knows whether theirs is any good. Is chunk_size=512 better than 1024? Does Cohere embed their data better than OpenAI does? They're guessing.

The Solution: A full-stack RAG evaluation platform that runs multiple pipeline configurations in parallel, scores them with AI evaluation, and shows which configuration works best for YOUR data.


## ✅ Phase 1: Complete

### What's Built

- ✅ Project structure with clean separation of concerns
- ✅ 6 diverse RAG pipelines leveraging different strategies:
  - Pipeline A: Speed-Optimized (Azure GPT-5)
  - Pipeline B: Accuracy-Optimized (Azure GPT-5 + Reranking)
  - Pipeline C: Balanced (Azure Cohere)
  - Pipeline D: Reasoning (Anthropic Claude)
  - Pipeline E: Cost-Optimized (Azure DeepSeek)
  - Pipeline F: Experimental (xAI Grok)
- ✅ Configuration management with environment variables
- ✅ Cost estimation for each pipeline
- ✅ Comprehensive tests to validate configurations

### Technology Stack

| Component | Technology | Purpose |
|---|---|---|
| LLM Providers | Azure OpenAI, Cohere, DeepSeek, Anthropic, xAI | Diverse model comparison |
| Embeddings | OpenAI, Sentence-Transformers | Vector representations |
| Vector DB | ChromaDB | Local vector storage |
| Framework | LangChain | RAG orchestration |
| Storage | SQLite | Results & metadata |
| Backend (Phase 2) | FastAPI | REST API |
| Frontend (Phase 3) | Streamlit | User interface |
| Deployment (Phase 4) | Hugging Face Spaces | Cloud hosting |

πŸ“ Project Structure

rag_optimizer/ β”œβ”€β”€ config/ β”‚ β”œβ”€β”€ init.py β”‚ └── pipeline_configs.py # 6 pipeline configurations β”œβ”€β”€ core/ # [Phase 2] Document processing β”‚ β”œβ”€β”€ init.py β”‚ β”œβ”€β”€ document_loader.py # [Coming next] β”‚ β”œβ”€β”€ chunker.py # [Coming next] β”‚ β”œβ”€β”€ embedder.py # [Coming next] β”‚ β”œβ”€β”€ vector_store.py # [Coming next] β”‚ β”œβ”€β”€ retriever.py # [Coming next] β”‚ β”œβ”€β”€ generator.py # [Coming next] β”‚ └── pipeline.py # [Coming next] β”œβ”€β”€ data/ β”‚ β”œβ”€β”€ uploads/ # User-uploaded documents β”‚ β”œβ”€β”€ vector_stores/ # ChromaDB storage β”‚ └── results.db # SQLite evaluation results β”œβ”€β”€ utils/ β”‚ β”œβ”€β”€ init.py β”‚ └── database.py # [Phase 3] β”œβ”€β”€ tests/ β”‚ β”œβ”€β”€ init.py β”‚ └── test_pipeline_config.py # βœ… Tests pass β”œβ”€β”€ .env # Your API keys (DO NOT COMMIT) β”œβ”€β”€ .env.example # Template for .env β”œβ”€β”€ requirements.txt # Python dependencies └── README.md # This file

## 🚀 Quick Start

Open the `rag_optimizer` directory in VS Code.

### 1. Installation

```bash
# Navigate to the project and scaffold it
python init_project.py

# Create a virtual environment
python -m venv venv

# Activate it (on Windows)
.\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Configure API keys

Fill in `.env.example` (this project uses Azure AI Foundry for OpenAI, Cohere, and DeepSeek):

- For models accessed via Azure, configure `ENDPOINT`, `API_KEY`, and `DEPLOYMENT_NAME`.
- For the rest (Anthropic, Groq), use the provider `API_KEY` directly.

Then copy the template:

```bash
cp .env.example .env
```

### 3. Verify Setup

```bash
# View pipeline comparison
python config/pipeline_configs.py

# Run tests
python tests/test_pipeline_config.py
```

Last Updated: January 14, 2026 | Project: RAG Pipeline Optimizer | Phase: 1 of 5

# RAG Pipeline Optimizer - Phase 2 Complete ✅

## Phase 2: Core RAG Components

Successfully implemented and tested all core components for document processing, embedding generation, and vector storage using the LangChain framework.

## 🎯 Phase 2 Deliverables

- ✅ Document Loader - multi-format document parsing (PDF, DOCX, TXT, MD, PPTX, XLSX)
- ✅ Text Chunker - LangChain-based chunking with multiple strategies
- ✅ Embedder - local + Azure OpenAI embeddings
- ✅ Vector Store - ChromaDB with LangChain integration

πŸ“ Files Created rag_optimizer/ β”œβ”€β”€ core/ β”‚ β”œβ”€β”€ init.py β”‚ β”œβ”€β”€ document_loader.py βœ… Multi-format document loading β”‚ β”œβ”€β”€ chunker.py βœ… LangChain text splitting β”‚ β”œβ”€β”€ embedder.py βœ… Embedding generation β”‚ └── vector_store.py βœ… ChromaDB vector storage β”‚ β”œβ”€β”€ data/ β”‚ β”œβ”€β”€ uploads/ πŸ“‚ User uploaded documents β”‚ └── vector_stores/ πŸ“‚ Persisted vector databases β”‚ └── requirements.txt βœ… Updated with LangChain packages

## 🔧 Components Overview

### 1. Document Loader (core/document_loader.py)

Purpose: Load and parse documents in multiple formats.

Supported Formats:

- PDF (.pdf) - extracts text with page numbers
- Word (.docx) - paragraphs and formatting
- Text (.txt) - plain text files
- Markdown (.md) - converts to plain text
- PowerPoint (.pptx) - slide content
- Excel (.xlsx) - sheet data

Key Features:

- Automatic format detection
- Metadata extraction (file size, page count)
- Error handling for corrupted files
- Batch document loading
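The format-detection step boils down to an extension-to-parser dispatch. The sketch below is illustrative only; the real `core/document_loader.py` may organize this differently:

```python
from pathlib import Path

# Formats listed above; each would map to a dedicated parser in the loader.
SUPPORTED_FORMATS = {".pdf", ".docx", ".txt", ".md", ".pptx", ".xlsx"}

def detect_format(path: str) -> str:
    """Return the lowercase extension, or raise for unsupported files."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {ext}")
    return ext

print(detect_format("report.PDF"))  # → .pdf
```

Unsupported extensions raise early, which is how corrupted or unexpected files can be rejected before parsing.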

### 2. Text Chunker (core/chunker.py)

Purpose: Split documents into semantic chunks for embedding.

Framework: LangChain Text Splitters

Chunking Strategies:

| Strategy | Description | Use Case | Quality |
|---|---|---|---|
| recursive ✅ | Tries \n\n → \n → ". " → characters | RECOMMENDED for all pipelines | A+ |
| character | Simple character-based splitting | Basic documents | B |
| token | Token-aware splitting | Token-limited models | B |
| sentence | Sentence boundary detection | Short documents | C |

Key Features:

- Configurable chunk size (tokens)
- Overlap for context preservation
- Clean semantic boundaries
- No fragment generation
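For intuition, here is a minimal pure-Python sketch of the recursive strategy. The project itself uses LangChain's `RecursiveCharacterTextSplitter`; this simplification measures characters rather than tokens and omits overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; re-split oversized pieces
    with finer separators, then merge small neighbors back together."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            pieces.append(piece)
        else:
            pieces.extend(recursive_split(piece, chunk_size, rest))
    # Merge adjacent small pieces so chunks approach chunk_size
    chunks, cur = [], ""
    for p in pieces:
        cand = f"{cur}{sep}{p}" if cur else p
        if len(cand) <= chunk_size:
            cur = cand
        else:
            if cur.strip():
                chunks.append(cur)
            cur = p
    if cur.strip():
        chunks.append(cur)
    return chunks

doc = "First paragraph.\n\nSecond paragraph that is a bit longer than the first one."
chunks = recursive_split(doc, 40)
print(chunks)
```

Because coarse separators are tried first, paragraph and sentence boundaries survive whenever possible; the hard character cut only triggers when everything else fails.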

### 3. Embedder (core/embedder.py)

Purpose: Generate vector embeddings for text chunks.

Framework: LangChain Embeddings

Supported Providers:

| Provider | Model | Dimension | Cost | Speed | Use Case |
|---|---|---|---|---|---|
| sentence-transformers | all-MiniLM-L6-v2 | 384D | FREE ✅ | Fast | Development/Testing |
| sentence-transformers | all-mpnet-base-v2 | 768D | FREE ✅ | Medium | Better quality |
| azure-openai | text-embedding-3-small | 1536D | $0.02/1M | Fast | Production |
| azure-openai | text-embedding-3-large | 3072D | $0.13/1M | Medium | Highest accuracy |

Key Features:

- Automatic batching for efficiency
- Cosine similarity calculation
- Normalized embeddings
- Local caching (future)
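The "normalized embeddings" point matters because, once vectors are L2-normalized, cosine similarity reduces to a plain dot product. A self-contained sketch of that calculation:

```python
import math

def normalize(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity(a, b):
    """Cosine similarity = dot product of the normalized vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # → 0.707
```

Real embeddings are 384-3072 dimensions rather than 2, but the arithmetic is identical.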

### 4. Vector Store (core/vector_store.py)

Purpose: Store and retrieve document chunks using vector similarity.

Framework: LangChain + ChromaDB

Key Features:

- Local persistent storage (no external DB needed)
- Fast similarity search (cosine distance)
- Metadata filtering
- LangChain retriever integration
- Collection management

Storage Structure:

```
data/vector_stores/
└── {collection_name}/
    ├── chroma.sqlite3       # Metadata
    └── {uuid}/              # Vector data
        └── data_level0.bin
```
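To make the similarity-search idea concrete, here is a toy in-memory stand-in using brute-force cosine search. The real `vector_store.py` delegates storage and indexing to ChromaDB via LangChain; class and method names below are illustrative:

```python
import math

class MiniVectorStore:
    """Toy store: keeps (embedding, text, metadata) triples in a list."""

    def __init__(self):
        self.docs = []

    def add(self, embedding, text, metadata=None):
        self.docs.append((embedding, text, metadata or {}))

    def search(self, query_emb, k=3):
        """Return the top-k (score, text) pairs by cosine similarity."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        scored = [(cos(query_emb, e), t) for e, t, _ in self.docs]
        return sorted(scored, reverse=True)[:k]

store = MiniVectorStore()
store.add([1.0, 0.0], "chunk about cats")
store.add([0.0, 1.0], "chunk about finance")
print(store.search([0.9, 0.1], k=1))  # highest-similarity chunk first
```

ChromaDB replaces the linear scan with an approximate nearest-neighbor index, which is what makes search fast at 500K+ passages.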


## 🚀 Quick Start

```bash
.\venv\Scripts\activate
pip install -r requirements.txt
python core/document_loader.py
python core/chunker.py
python core/embedder.py
python core/vector_store.py
```

Last Updated: January 14, 2026, 8:38 PM EST | Project: RAG Pipeline Optimizer | Phase: 2 of 5

# 📘 Phase 3 README: Pipeline Orchestration & Parallel Evaluation

## Phase 3 Roadmap (Step-by-Step)

1. **Generator Module** ⬅️ START HERE - build the LLM interface for all 6 models (Azure OpenAI, Cohere, DeepSeek, Claude, Grok)
2. **Retriever Module** - combine VectorStore + optional reranking (Pipeline B uses Cohere rerank)
3. **Pipeline Orchestrator** - connect all components: Document → Chunks → Embeddings → Retrieval → Generation
4. **Dataset Integration** - download wiki_dpr + Natural Questions, load into vector stores
5. **Parallel Execution** - run all 6 pipelines on the same query simultaneously
6. **Evaluation & Results Storage** - SQLite database to store query results, costs, metrics

## 🎯 Phase 3 Overview

Phase 3 integrated all core RAG components into a fully functional multi-pipeline evaluation system capable of running 6 different RAG configurations in parallel, comparing their performance, and storing results for analysis.

### What We Built

- ✅ LLM Generator (core/generator.py) - multi-provider response generation
- ✅ Smart Retriever (core/retriever.py) - context retrieval with optional reranking
- ✅ Pipeline Orchestrator (core/pipeline.py) - end-to-end RAG workflow
- ✅ Parallel Evaluator (scripts/run_parallel_evaluation.py) - simultaneous pipeline execution
- ✅ Analysis Dashboard (scripts/analyze_results.py) - performance comparison tools
- ✅ Database Schema (data/evaluation_results.db) - SQLite storage for metrics
- ✅ Dataset Integration (scripts/dataset_loader.py) - NQ-Open evaluation dataset
- ✅ Corpus Ingestion (scripts/ingest_corpus.py) - Wikipedia knowledge base

πŸ—οΈ Architecture Overview β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ USER QUERY INPUT β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ PARALLEL PIPELINE EXECUTION (6 Pipelines) β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚Pipeline Aβ”‚Pipeline Bβ”‚Pipeline Cβ”‚Pipeline Dβ”‚Pipeline Eβ”‚Pipe β”‚ β”‚ β”‚ β”‚ (Speed) β”‚(Accuracy)β”‚(Balanced)β”‚(Reasoningβ”‚ (Cost) β”‚ F β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”¬β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β–Ό β–Ό β–Ό β–Ό β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ VECTOR STORE (ChromaDB) β”‚ β”‚ Retrieves top-k relevant chunks for each pipeline β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ RETRIEVER (with optional reranking) β”‚ β”‚ β€’ Pipeline B: Cohere reranking (accuracy boost) β”‚ β”‚ β€’ Others: Direct similarity search β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ GENERATOR β”‚ β”‚ β€’ Pipeline A: Azure GPT-5 (fast) β”‚ β”‚ β€’ Pipeline B: Azure GPT-5 (high quality) β”‚ β”‚ β€’ Pipeline C: Azure Cohere Command β”‚ β”‚ β€’ Pipeline D: Anthropic Claude (reasoning) β”‚ β”‚ β€’ Pipeline E: DeepSeek V3.2 (cost-optimized) β”‚ β”‚ β€’ Pipeline F: Groq Llama (experimental) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ EVALUATION & METRICS COLLECTION β”‚ β”‚ β€’ Answer correctness (exact match + fuzzy) β”‚ β”‚ β€’ Latency tracking (retrieval + generation) β”‚ β”‚ β€’ Cost calculation (per query) β”‚ β”‚ β€’ Token usage monitoring β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ SQLite DATABASE (evaluation_results.db) β”‚ β”‚ Stores: Queries, Answers, Metrics, Timestamps β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ ANALYSIS DASHBOARD (analyze_results.py) β”‚ β”‚ β€’ Pipeline comparison β”‚ β”‚ β€’ Cost efficiency analysis β”‚ β”‚ β€’ Question difficulty breakdown β”‚ β”‚ β€’ Excel export for deeper analysis β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ πŸ“¦ Components Built in Phase 3

### 1. Generator (core/generator.py)

Purpose: Interface to all LLM providers with unified response handling.

Features:

- ✅ Multi-provider support (Azure OpenAI, Cohere, Claude, DeepSeek, Groq)
- ✅ Prompt template management
- ✅ Automatic cost calculation
- ✅ Token usage tracking
- ✅ Error handling & retries
- ✅ Response parsing with strict format validation

Supported Models:

```python
AZURE_GPT5       = "gpt-5-chat"         # Fast, high quality
AZURE_COHERE     = "cohere-command-a"   # Balanced performance
AZURE_DEEPSEEK   = "DeepSeek-V3.2"      # Ultra cost-efficient
ANTHROPIC_CLAUDE = "claude-3-5-sonnet"  # Advanced reasoning
GROQ_LLAMA       = "llama-3.3-70b"      # Experimental, fast inference
```
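The "automatic cost calculation" feature reduces to multiplying token counts by per-model rates. A sketch of that accounting; the prices below are illustrative placeholders, not real provider rates:

```python
def query_cost(model, prompt_tokens, completion_tokens, pricing):
    """Cost in USD: tokens times the model's (input, output) rate per 1M tokens."""
    inp, out = pricing[model]
    return (prompt_tokens * inp + completion_tokens * out) / 1_000_000

# Illustrative placeholder rates (USD per 1M tokens) - NOT real prices
pricing = {"gpt-5-chat": (1.0, 4.0), "DeepSeek-V3.2": (0.3, 0.6)}
print(f"{query_cost('gpt-5-chat', 1200, 300, pricing):.6f}")  # → 0.002400
```

The generator accumulates these per-query costs into the `generation_cost_usd` and `total_cost_usd` metrics stored later.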

### 2. Retriever (core/retriever.py)

Purpose: Fetch relevant context chunks with optional reranking.

Features:

- ✅ Semantic similarity search (ChromaDB)
- ✅ Cohere reranking for Pipeline B (accuracy boost)
- ✅ Configurable top-k retrieval
- ✅ Score normalization
- ✅ Metadata filtering
- ✅ Performance timing

Retrieval Strategies:

| Pipeline | Strategy | Chunks | Reranking | Use Case |
|---|---|---|---|---|
| A | Fast | 3 | ❌ | Speed-critical |
| B | Accuracy | 10 | ✅ Cohere | Maximum quality |
| C-F | Standard | 5-10 | ❌ | General use |
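Score normalization lets similarity scores from different embedders be compared on one scale. Min-max scaling to [0, 1] is one common approach (whether `core/retriever.py` uses exactly this is an assumption):

```python
def normalize_scores(scores):
    """Min-max normalize a list of similarity scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: treat every chunk as equally relevant
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(normalize_scores([0.2, 0.5, 0.8]))
```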
### 3. Pipeline Orchestrator (core/pipeline.py)

Purpose: End-to-end RAG workflow coordinator.

Features:

- ✅ Component integration (Embedder → VectorStore → Retriever → Generator)
- ✅ Stage-wise timing (retrieval_time_ms, generation_time_ms, total_time_ms)
- ✅ Cost accumulation
- ✅ Metadata tracking
- ✅ Error recovery

Pipeline Flow (each stage is timed individually, plus the total):

```
User Query → Embedding → Vector Search → Rerank (optional) → LLM Generation → Response
```
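The stage-wise timing pattern can be sketched as a small wrapper around each stage. The stage functions here are stand-ins, not the real components:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and also return its elapsed time in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

def run_pipeline(query, retrieve, generate):
    """Orchestrate retrieval then generation, accumulating per-stage timings."""
    chunks, retrieval_time_ms = timed(retrieve, query)
    answer, generation_time_ms = timed(generate, query, chunks)
    return {
        "answer": answer,
        "retrieval_time_ms": retrieval_time_ms,
        "generation_time_ms": generation_time_ms,
        "total_time_ms": retrieval_time_ms + generation_time_ms,
    }

result = run_pipeline(
    "who wrote Hamlet?",
    retrieve=lambda q: ["chunk: Hamlet is a play by William Shakespeare"],
    generate=lambda q, chunks: "William Shakespeare",
)
print(result["answer"], round(result["total_time_ms"], 2), "ms")
```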

### 4. Parallel Evaluator (scripts/run_parallel_evaluation.py)

Purpose: Run all 6 pipelines simultaneously on the evaluation dataset.

Features:

- ✅ Concurrent execution (ThreadPoolExecutor)
- ✅ Progress tracking (tqdm)
- ✅ Automatic database insertion
- ✅ Error isolation (one pipeline failure doesn't stop others)
- ✅ Answer validation (exact match + fuzzy matching)
- ✅ Run ID tracking for experiment management
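The ThreadPoolExecutor-with-error-isolation pattern looks roughly like this (a minimal sketch; pipeline callables are stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(pipelines, query):
    """Run every pipeline in its own thread; one failure never stops the rest."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in pipelines.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = {"ok": True, "answer": fut.result()}
            except Exception as e:
                # Isolate the failure: record it and keep going
                results[name] = {"ok": False, "error": str(e)}
    return results

def failing_pipeline(q):
    raise RuntimeError("provider timeout")

demo = {"A": lambda q: f"A answers: {q}", "B": failing_pipeline}
out = run_all(demo, "test")
print(out["A"]["ok"], out["B"]["ok"])  # → True False
```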

Performance Metrics Tracked:

- ✅ Accuracy (answer_found: 0 or 1)
- ✅ Latency (retrieval_time_ms, generation_time_ms, total_time_ms)
- ✅ Cost (generation_cost_usd, total_cost_usd)
- ✅ Token usage (prompt_tokens, completion_tokens, total_tokens)
- ✅ Retrieval quality (num_chunks_retrieved, retrieval_scores)
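The exact-plus-fuzzy answer validation can be sketched with stdlib `difflib`. The real evaluator may normalize text differently, and the 0.8 threshold is an illustrative assumption:

```python
from difflib import SequenceMatcher

def answer_found(generated, ground_truths, fuzzy_threshold=0.8):
    """Return 1 if any ground-truth answer matches exactly (as a substring)
    or fuzzily (similarity ratio >= threshold), else 0."""
    gen = generated.strip().lower()
    for truth in ground_truths:
        t = truth.strip().lower()
        if t in gen:  # exact substring match
            return 1
        if SequenceMatcher(None, gen, t).ratio() >= fuzzy_threshold:
            return 1  # fuzzy match
    return 0

print(answer_found("The author was William Shakespeare.", ["william shakespeare"]))  # → 1
print(answer_found("I don't know.", ["william shakespeare"]))  # → 0
```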

### 5. Analysis Dashboard (scripts/analyze_results.py)

Purpose: Comprehensive evaluation results analysis.

Features:

- ✅ Pipeline performance summary (accuracy, cost, speed)
- ✅ Cost efficiency analysis (cost per correct answer)
- ✅ Time breakdown (retrieval vs generation)
- ✅ Token usage statistics
- ✅ Retrieval quality metrics
- ✅ Difficult questions identification (0% accuracy)
- ✅ Easy questions identification (>66% accuracy)
- ✅ Question-by-question comparison
- ✅ Excel export with 8 detailed sheets

Usage:

```bash
# View dashboard in terminal
python scripts/analyze_results.py

# Export to Excel
python scripts/analyze_results.py --export results.xlsx

# List all runs
python scripts/analyze_results.py --list-runs
```

### 6. Database Schema (data/evaluation_results.db)

Table: evaluation_results

| Column | Type | Description |
|---|---|---|
| id | INTEGER PRIMARY KEY | Auto-increment ID |
| run_id | TEXT | Evaluation run identifier (e.g., "20260117_182253") |
| pipeline_id | TEXT | Pipeline identifier |
| pipeline_name | TEXT | Human-readable pipeline name |
| question_id | TEXT | Question identifier from dataset |
| query | TEXT | Input question |
| ground_truth_answers | TEXT | JSON array of correct answers |
| retrieved_chunks | TEXT | JSON array of context chunks |
| retrieval_scores | TEXT | JSON array of similarity scores |
| num_chunks_retrieved | INTEGER | Number of chunks retrieved |
| retrieval_time_ms | REAL | Time spent on retrieval |
| reranking_time_ms | REAL | Time spent on reranking (if applicable) |
| reranked | INTEGER | Whether reranking was used (0 or 1) |
| generated_answer | TEXT | Model's generated answer |
| generation_time_ms | REAL | Time spent on generation |
| prompt_tokens | INTEGER | Input tokens used |
| completion_tokens | INTEGER | Output tokens generated |
| total_tokens | INTEGER | Total tokens (prompt + completion) |
| generation_cost_usd | REAL | Cost of generation |
| total_cost_usd | REAL | Total query cost |
| total_time_ms | REAL | End-to-end latency |
| has_answer | INTEGER | Whether an answer is present (1 or 0) |
| answer_found | INTEGER | Whether the answer is correct (1 or 0) |
| timestamp | TEXT | ISO 8601 timestamp |
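The storage pattern is plain `sqlite3`. A minimal sketch using a subset of the columns above (the real table has all of them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the project uses data/evaluation_results.db
conn.execute("""
    CREATE TABLE evaluation_results (
        id INTEGER PRIMARY KEY,
        run_id TEXT,
        pipeline_id TEXT,
        query TEXT,
        generated_answer TEXT,
        total_cost_usd REAL,
        total_time_ms REAL,
        answer_found INTEGER
    )
""")
conn.execute(
    "INSERT INTO evaluation_results "
    "(run_id, pipeline_id, query, generated_answer, total_cost_usd, total_time_ms, answer_found) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("20260117_182253", "A", "who wrote Hamlet?", "William Shakespeare", 0.0002, 850.0, 1),
)
row = conn.execute(
    "SELECT pipeline_id, answer_found FROM evaluation_results"
).fetchone()
print(row)  # → ('A', 1)
```

Keying every row by `run_id` is what lets the analysis dashboard compare separate evaluation runs later.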

## 🚀 Quick Start

Prerequisites (ensure Phases 1 & 2 are complete):

- ✅ 6 pipeline configurations defined
- ✅ All API keys in `.env`
- ✅ ChromaDB vector store populated
- ✅ Wikipedia corpus ingested

Run the Phase 3 components in order:

```bash
python core/generator.py                              # 1) LLM response generation
python core/retriever.py                              # 2) Context retrieval + reranking
python core/pipeline.py                               # 3) End-to-end orchestration
python utils/dataset_loader.py                        # 4) Load Natural Questions + Wikipedia dataset
python scripts/ingest_corpus_selective_pipeline.py    # 5) Ingest Wikipedia corpus into all 6 pipelines (see below)
python scripts/run_generic_evaluation.py --num-questions 60 --pipelines A,B,C,D,E,F   # 6) Parallel RAG pipeline evaluation
python scripts/analyze_results.py                     # 7) Results dashboard (different run types generate different outputs)
```

For a large-scale dataset, split step 5 by pipeline:

```bash
python scripts/ingest_corpus_selective_pipeline.py --pipelines A,C,D,E,F --passages 500000 --batch-size 5000
python scripts/ingest_corpus_selective_pipeline.py --pipelines B --passages 500000 --batch-size 1000
```

Last Updated: January 17, 2026, 7:38 PM EST | Project: RAG Pipeline Optimizer | Phase: 3 of 5

# 📊 Phase 4: Advanced Evaluation & Interactive Dashboard

## 🎯 Overview

Phase 4 delivers a two-part system for advanced RAG pipeline evaluation:

- **Phase 4A:** LLM-as-a-Judge evaluation system using GPT-4o to score answer quality across 6 dimensions
- **Phase 4B:** Full-stack interactive Streamlit dashboard for visualizing and comparing results

Together, these provide objective quality scoring and interactive exploration of pipeline performance beyond basic metrics like speed and cost.

## 📦 Phase 4 Components

- **Phase 4A: LLM Judge Evaluation System** - automated answer quality scoring using GPT-4o as an AI judge
- **Phase 4B: Interactive Dashboard** - 9-page Streamlit application for data exploration and real-time testing

## 🔬 Phase 4A: LLM Judge Evaluation

### Overview

Phase 4A adds multi-dimensional quality scoring to existing evaluation results using GPT-4o as an objective judge. Each answer is scored across 6 quality dimensions, providing insights beyond operational metrics.

### ✨ Features

6-Dimensional Quality Scoring:

- Correctness (0-10) - factual accuracy compared to ground truth
- Relevance (0-10) - how well the answer addresses the question
- Completeness (0-10) - coverage of important information
- Clarity (0-10) - clear, understandable language
- Conciseness (0-10) - brevity without sacrificing information
- Overall (0-10) - weighted average of all dimensions

Automated Evaluation:

- Evaluates existing Phase 3 results retroactively - no need to re-run pipelines
- Batch processing with progress tracking
- Results stored in a separate database table

Cost-Efficient:

- Only evaluates answers, not entire pipeline re-runs
- Uses GPT-4o-mini for cost efficiency
- Batches requests to minimize API calls
πŸ—οΈ Architecture rag_optimizer/ β”œβ”€β”€ core/ β”‚ └── evaluator.py # LLM Judge implementation β”œβ”€β”€ utils/ β”‚ └── database.py # Database utilities for score storage β”œβ”€β”€ scripts/ β”‚ └── evaluate_with_judge.py # CLI tool for running evaluations └── data/ └── evaluation_results.db # SQLite (updated schema)

πŸ—„οΈ Database Schema (Phase 4A Extension) New Table: evaluation_scores Stores LLM judge quality scores for each evaluation result.

Column Type Description
id INTEGER Primary key
evaluation_result_id INTEGER Foreign key β†’ evaluation_results.id
correctness_score REAL Factual accuracy (0-10)
relevance_score REAL Question relevance (0-10)
completeness_score REAL Information coverage (0-10)
clarity_score REAL Language clarity (0-10)
conciseness_score REAL Brevity (0-10)
overall_score REAL Weighted average (0-10)
judge_reasoning TEXT LLM's explanation for scores
timestamp TEXT ISO timestamp

Indexes:

idx_eval_result on evaluation_result_id idx_overall_score on overall_score


## 🚀 Quick Start

```bash
python core/evaluator.py                                  # LLM Judge implementation
python utils/database.py                                  # Database utilities
python scripts/evaluate_with_judge.py --latest --limit 5  # Judge the latest run (first 5 results)
```

## 🖥️ Phase 4B: Interactive Dashboard

### Overview

Full-stack Streamlit dashboard with 9 pages for exploring evaluation results and testing pipelines in real time.

### ✨ Features

🏠 Home Page

- Project overview and capabilities
- Quick stats (6 pipelines, 5 LLM providers, 500K+ corpus)
- Pipeline configuration cards
- Modern dark theme UI

📊 Pipeline Comparison

- Side-by-side performance metrics
- Quality scores from the LLM judge (correctness, relevance, completeness, clarity, conciseness)
- Interactive comparison tables
- Filter by evaluation run
- Sort by accuracy, speed, cost, or quality score

🔍 Question Explorer

- Browse all evaluated questions and see how each pipeline answered
- View quality scores per answer
- Compare answers across pipelines
- View retrieved context chunks
- Ground truth validation

💰 Cost Analysis

- Token usage breakdown
- Cost per query analysis
- Cost efficiency rankings
- Cost per quality point (cost divided by overall score)

⚡ Performance Metrics

- Latency analysis (retrieval vs generation)
- Time breakdown by pipeline stage
- Speed comparisons
- Quality-adjusted speed (speed vs quality trade-offs)

🔬 Performance Insights

- Analyze pipeline performance across question types, categories, and difficulty
- Performance by question type
- Performance by pipeline

🧪 Live Testing

- Real-time pipeline testing with category-based question suggestions
- Multi-pipeline comparison with live progress tracking
- Answer quality comparison
- Instant quality scoring (optional)

📦 Batch Evaluation

- Run comprehensive evaluations (5-100 questions)
- Multi-pipeline testing with parallel execution (1-6 workers)
- Real-time progress monitoring
- Option to run the LLM judge automatically

🏆 Leaderboard

- Overall and quality-weighted pipeline rankings
- Multiple sorting options (accuracy, speed, cost, quality)
- Performance badges

### Architecture

```
app/
├── dashboard.py              # Main entry point
├── .streamlit/
│   └── config.toml
└── pages/
    ├── __init__.py
    ├── home.py
    ├── comparison.py
    ├── explorer.py
    ├── cost.py
    ├── performance.py
    ├── insights.py
    ├── testing.py
    ├── batch_evaluation.py
    └── leaderboard.py
```

Run the application:

```bash
streamlit run app/dashboard.py
```

Last Updated: January 24, 2026, 7:00 PM EST | Project: RAG Pipeline Optimizer | Phase: 4 of 5