# RAG Pipeline Optimizer - Phase 1 Complete ✅

An MLOps platform for evaluating and optimizing RAG (Retrieval-Augmented Generation) pipelines across multiple models and configurations.


## 🎯 Project Overview

The Problem: Every company has a RAG system, but almost no one knows whether theirs is any good. Is chunk_size=512 better than 1024? Does Cohere embed their data better than OpenAI does? They're guessing.

The Solution: A full-stack RAG evaluation platform that runs multiple pipeline configurations in parallel, scores them with AI evaluation, and shows which configuration works best for YOUR data.


## ✅ Phase 1: Complete

### What's Built

- ✅ Project structure with clean separation of concerns
- ✅ 6 diverse RAG pipelines leveraging different strategies:
  - Pipeline A: Speed-Optimized (Azure GPT-5)
  - Pipeline B: Accuracy-Optimized (Azure GPT-5 + Reranking)
  - Pipeline C: Balanced (Azure Cohere)
  - Pipeline D: Reasoning (Anthropic Claude)
  - Pipeline E: Cost-Optimized (Azure DeepSeek)
  - Pipeline F: Experimental (xAI Grok)
- ✅ Configuration management with environment variables
- ✅ Cost estimation for each pipeline
- ✅ Comprehensive tests to validate configurations

### Technology Stack

| Component | Technology | Purpose |
|---|---|---|
| LLM Providers | Azure OpenAI, Cohere, DeepSeek, Anthropic, xAI | Diverse model comparison |
| Embeddings | OpenAI, Sentence-Transformers | Vector representations |
| Vector DB | ChromaDB | Local vector storage |
| Framework | LangChain | RAG orchestration |
| Storage | SQLite | Results & metadata |
| Backend (Phase 2) | FastAPI | REST API |
| Frontend (Phase 3) | Streamlit | User interface |
| Deployment (Phase 4) | Hugging Face Spaces | Cloud hosting |

πŸ“ Project Structure

rag_optimizer/ β”œβ”€β”€ config/ β”‚ β”œβ”€β”€ init.py β”‚ └── pipeline_configs.py # 6 pipeline configurations β”œβ”€β”€ core/ # [Phase 2] Document processing β”‚ β”œβ”€β”€ init.py β”‚ β”œβ”€β”€ document_loader.py # [Coming next] β”‚ β”œβ”€β”€ chunker.py # [Coming next] β”‚ β”œβ”€β”€ embedder.py # [Coming next] β”‚ β”œβ”€β”€ vector_store.py # [Coming next] β”‚ β”œβ”€β”€ retriever.py # [Coming next] β”‚ β”œβ”€β”€ generator.py # [Coming next] β”‚ └── pipeline.py # [Coming next] β”œβ”€β”€ data/ β”‚ β”œβ”€β”€ uploads/ # User-uploaded documents β”‚ β”œβ”€β”€ vector_stores/ # ChromaDB storage β”‚ └── results.db # SQLite evaluation results β”œβ”€β”€ utils/ β”‚ β”œβ”€β”€ init.py β”‚ └── database.py # [Phase 3] β”œβ”€β”€ tests/ β”‚ β”œβ”€β”€ init.py β”‚ └── test_pipeline_config.py # βœ… Tests pass β”œβ”€β”€ .env # Your API keys (DO NOT COMMIT) β”œβ”€β”€ .env.example # Template for .env β”œβ”€β”€ requirements.txt # Python dependencies └── README.md # This file

## 🚀 Quick Start

Open the `rag_optimizer` directory in VS Code.

### 1. Installation

```bash
# Navigate to the project and scaffold it
python init_project.py

# Create a virtual environment
python -m venv venv

# Activate it (on Windows)
.\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Configure API keys

Fill in `.env.example` (this project uses Azure AI Foundry for OpenAI, Cohere, and DeepSeek):

- For models accessed via Azure, configure `ENDPOINT`, `API_KEY`, and `DEPLOYMENT_NAME`.
- For the rest (Anthropic, Groq), use the provider `API_KEY` directly.

Then copy the template:

```bash
cp .env.example .env
```

### 3. Verify Setup

```bash
# View pipeline comparison
python config/pipeline_configs.py

# Run tests
python tests/test_pipeline_config.py
```

Last Updated: January 14, 2026 | Project: RAG Pipeline Optimizer | Phase: 1 of 5

# RAG Pipeline Optimizer - Phase 2 Complete ✅

## Phase 2: Core RAG Components

Successfully implemented and tested all core components for document processing, embedding generation, and vector storage using the LangChain framework.

## 🎯 Phase 2 Deliverables

- ✅ Document Loader - multi-format document parsing (PDF, DOCX, TXT, MD, PPTX, XLSX)
- ✅ Text Chunker - LangChain-based chunking with multiple strategies
- ✅ Embedder - local + Azure OpenAI embeddings
- ✅ Vector Store - ChromaDB with LangChain integration

πŸ“ Files Created rag_optimizer/ β”œβ”€β”€ core/ β”‚ β”œβ”€β”€ init.py β”‚ β”œβ”€β”€ document_loader.py βœ… Multi-format document loading β”‚ β”œβ”€β”€ chunker.py βœ… LangChain text splitting β”‚ β”œβ”€β”€ embedder.py βœ… Embedding generation β”‚ └── vector_store.py βœ… ChromaDB vector storage β”‚ β”œβ”€β”€ data/ β”‚ β”œβ”€β”€ uploads/ πŸ“‚ User uploaded documents β”‚ └── vector_stores/ πŸ“‚ Persisted vector databases β”‚ └── requirements.txt βœ… Updated with LangChain packages

## 🔧 Components Overview

### 1. Document Loader (core/document_loader.py)

Purpose: Load and parse documents in multiple formats.

Supported Formats:

- PDF (.pdf) - extracts text with page numbers
- Word (.docx) - paragraphs and formatting
- Text (.txt) - plain text files
- Markdown (.md) - converts to plain text
- PowerPoint (.pptx) - slide content
- Excel (.xlsx) - sheet data

Key Features:

- Automatic format detection
- Metadata extraction (file size, page count)
- Error handling for corrupted files
- Batch document loading
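The format-detection step boils down to an extension-to-parser dispatch. The sketch below is illustrative only; the real `core/document_loader.py` may organize this differently:

```python
from pathlib import Path

# Formats listed above; each would map to a dedicated parser in the loader.
SUPPORTED_FORMATS = {".pdf", ".docx", ".txt", ".md", ".pptx", ".xlsx"}

def detect_format(path: str) -> str:
    """Return the lowercase extension, or raise for unsupported files."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {ext}")
    return ext

print(detect_format("report.PDF"))  # → .pdf
```

Unsupported extensions raise early, which is how corrupted or unexpected files can be rejected before parsing.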

### 2. Text Chunker (core/chunker.py)

Purpose: Split documents into semantic chunks for embedding.

Framework: LangChain Text Splitters

Chunking Strategies:

| Strategy | Description | Use Case | Quality |
|---|---|---|---|
| recursive ✅ | Tries \n\n → \n → ". " → characters | RECOMMENDED for all pipelines | A+ |
| character | Simple character-based splitting | Basic documents | B |
| token | Token-aware splitting | Token-limited models | B |
| sentence | Sentence boundary detection | Short documents | C |

Key Features:

- Configurable chunk size (tokens)
- Overlap for context preservation
- Clean semantic boundaries
- No fragment generation
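For intuition, here is a minimal pure-Python sketch of the recursive strategy. The project itself uses LangChain's `RecursiveCharacterTextSplitter`; this simplification measures characters rather than tokens and omits overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; re-split oversized pieces
    with finer separators, then merge small neighbors back together."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            pieces.append(piece)
        else:
            pieces.extend(recursive_split(piece, chunk_size, rest))
    # Merge adjacent small pieces so chunks approach chunk_size
    chunks, cur = [], ""
    for p in pieces:
        cand = f"{cur}{sep}{p}" if cur else p
        if len(cand) <= chunk_size:
            cur = cand
        else:
            if cur.strip():
                chunks.append(cur)
            cur = p
    if cur.strip():
        chunks.append(cur)
    return chunks

doc = "First paragraph.\n\nSecond paragraph that is a bit longer than the first one."
chunks = recursive_split(doc, 40)
print(chunks)
```

Because coarse separators are tried first, paragraph and sentence boundaries survive whenever possible; the hard character cut only triggers when everything else fails.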

### 3. Embedder (core/embedder.py)

Purpose: Generate vector embeddings for text chunks.

Framework: LangChain Embeddings

Supported Providers:

| Provider | Model | Dimension | Cost | Speed | Use Case |
|---|---|---|---|---|---|
| sentence-transformers | all-MiniLM-L6-v2 | 384D | FREE ✅ | Fast | Development/Testing |
| sentence-transformers | all-mpnet-base-v2 | 768D | FREE ✅ | Medium | Better quality |
| azure-openai | text-embedding-3-small | 1536D | $0.02/1M | Fast | Production |
| azure-openai | text-embedding-3-large | 3072D | $0.13/1M | Medium | Highest accuracy |

Key Features:

- Automatic batching for efficiency
- Cosine similarity calculation
- Normalized embeddings
- Local caching (future)
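The "normalized embeddings" point matters because, once vectors are L2-normalized, cosine similarity reduces to a plain dot product. A self-contained sketch of that calculation:

```python
import math

def normalize(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity(a, b):
    """Cosine similarity = dot product of the normalized vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # → 0.707
```

Real embeddings are 384-3072 dimensions rather than 2, but the arithmetic is identical.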

### 4. Vector Store (core/vector_store.py)

Purpose: Store and retrieve document chunks using vector similarity.

Framework: LangChain + ChromaDB

Key Features:

- Local persistent storage (no external DB needed)
- Fast similarity search (cosine distance)
- Metadata filtering
- LangChain retriever integration
- Collection management

Storage Structure:

```
data/vector_stores/
└── {collection_name}/
    ├── chroma.sqlite3       # Metadata
    └── {uuid}/              # Vector data
        └── data_level0.bin
```
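To make the similarity-search idea concrete, here is a toy in-memory stand-in using brute-force cosine search. The real `vector_store.py` delegates storage and indexing to ChromaDB via LangChain; class and method names below are illustrative:

```python
import math

class MiniVectorStore:
    """Toy store: keeps (embedding, text, metadata) triples in a list."""

    def __init__(self):
        self.docs = []

    def add(self, embedding, text, metadata=None):
        self.docs.append((embedding, text, metadata or {}))

    def search(self, query_emb, k=3):
        """Return the top-k (score, text) pairs by cosine similarity."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        scored = [(cos(query_emb, e), t) for e, t, _ in self.docs]
        return sorted(scored, reverse=True)[:k]

store = MiniVectorStore()
store.add([1.0, 0.0], "chunk about cats")
store.add([0.0, 1.0], "chunk about finance")
print(store.search([0.9, 0.1], k=1))  # highest-similarity chunk first
```

ChromaDB replaces the linear scan with an approximate nearest-neighbor index, which is what makes search fast at 500K+ passages.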


## 🚀 Quick Start

```bash
.\venv\Scripts\activate
pip install -r requirements.txt
python core/document_loader.py
python core/chunker.py
python core/embedder.py
python core/vector_store.py
```

Last Updated: January 14, 2026, 8:38 PM EST | Project: RAG Pipeline Optimizer | Phase: 2 of 5

# 📘 Phase 3 README: Pipeline Orchestration & Parallel Evaluation

## Phase 3 Roadmap (Step-by-Step)

1. **Generator Module** ⬅️ START HERE - build the LLM interface for all 6 models (Azure OpenAI, Cohere, DeepSeek, Claude, Grok)
2. **Retriever Module** - combine VectorStore + optional reranking (Pipeline B uses Cohere rerank)
3. **Pipeline Orchestrator** - connect all components: Document → Chunks → Embeddings → Retrieval → Generation
4. **Dataset Integration** - download wiki_dpr + Natural Questions, load into vector stores
5. **Parallel Execution** - run all 6 pipelines on the same query simultaneously
6. **Evaluation & Results Storage** - SQLite database to store query results, costs, metrics

## 🎯 Phase 3 Overview

Phase 3 integrated all core RAG components into a fully functional multi-pipeline evaluation system capable of running 6 different RAG configurations in parallel, comparing their performance, and storing results for analysis.

### What We Built

- ✅ LLM Generator (core/generator.py) - multi-provider response generation
- ✅ Smart Retriever (core/retriever.py) - context retrieval with optional reranking
- ✅ Pipeline Orchestrator (core/pipeline.py) - end-to-end RAG workflow
- ✅ Parallel Evaluator (scripts/run_parallel_evaluation.py) - simultaneous pipeline execution
- ✅ Analysis Dashboard (scripts/analyze_results.py) - performance comparison tools
- ✅ Database Schema (data/evaluation_results.db) - SQLite storage for metrics
- ✅ Dataset Integration (scripts/dataset_loader.py) - NQ-Open evaluation dataset
- ✅ Corpus Ingestion (scripts/ingest_corpus.py) - Wikipedia knowledge base

πŸ—οΈ Architecture Overview β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ USER QUERY INPUT β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ PARALLEL PIPELINE EXECUTION (6 Pipelines) β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚Pipeline Aβ”‚Pipeline Bβ”‚Pipeline Cβ”‚Pipeline Dβ”‚Pipeline Eβ”‚Pipe β”‚ β”‚ β”‚ β”‚ (Speed) β”‚(Accuracy)β”‚(Balanced)β”‚(Reasoningβ”‚ (Cost) β”‚ F β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”¬β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β–Ό β–Ό β–Ό β–Ό β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ VECTOR STORE (ChromaDB) β”‚ β”‚ Retrieves top-k relevant chunks for each pipeline β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ RETRIEVER (with optional reranking) β”‚ β”‚ β€’ Pipeline B: Cohere reranking (accuracy boost) β”‚ β”‚ β€’ Others: Direct similarity search β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ GENERATOR β”‚ β”‚ β€’ Pipeline A: Azure GPT-5 (fast) β”‚ β”‚ β€’ Pipeline B: Azure GPT-5 (high quality) β”‚ β”‚ β€’ Pipeline C: Azure Cohere Command β”‚ β”‚ β€’ Pipeline D: Anthropic Claude (reasoning) β”‚ β”‚ β€’ Pipeline E: DeepSeek V3.2 (cost-optimized) β”‚ β”‚ β€’ Pipeline F: Groq Llama (experimental) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ EVALUATION & METRICS COLLECTION β”‚ β”‚ β€’ Answer correctness (exact match + fuzzy) β”‚ β”‚ β€’ Latency tracking (retrieval + generation) β”‚ β”‚ β€’ Cost calculation (per query) β”‚ β”‚ β€’ Token usage monitoring β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ SQLite DATABASE (evaluation_results.db) β”‚ β”‚ Stores: Queries, Answers, Metrics, Timestamps β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ ANALYSIS DASHBOARD (analyze_results.py) β”‚ β”‚ β€’ Pipeline comparison β”‚ β”‚ β€’ Cost efficiency analysis β”‚ β”‚ β€’ Question difficulty breakdown β”‚ β”‚ β€’ Excel export for deeper analysis β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ πŸ“¦ Components Built in Phase 3

### 1. Generator (core/generator.py)

Purpose: Interface to all LLM providers with unified response handling.

Features:

- ✅ Multi-provider support (Azure OpenAI, Cohere, Claude, DeepSeek, Groq)
- ✅ Prompt template management
- ✅ Automatic cost calculation
- ✅ Token usage tracking
- ✅ Error handling & retries
- ✅ Response parsing with strict format validation

Supported Models:

```python
AZURE_GPT5       = "gpt-5-chat"         # Fast, high quality
AZURE_COHERE     = "cohere-command-a"   # Balanced performance
AZURE_DEEPSEEK   = "DeepSeek-V3.2"      # Ultra cost-efficient
ANTHROPIC_CLAUDE = "claude-3-5-sonnet"  # Advanced reasoning
GROQ_LLAMA       = "llama-3.3-70b"      # Experimental, fast inference
```
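The "automatic cost calculation" feature reduces to multiplying token counts by per-model rates. A sketch of that accounting; the prices below are illustrative placeholders, not real provider rates:

```python
def query_cost(model, prompt_tokens, completion_tokens, pricing):
    """Cost in USD: tokens times the model's (input, output) rate per 1M tokens."""
    inp, out = pricing[model]
    return (prompt_tokens * inp + completion_tokens * out) / 1_000_000

# Illustrative placeholder rates (USD per 1M tokens) - NOT real prices
pricing = {"gpt-5-chat": (1.0, 4.0), "DeepSeek-V3.2": (0.3, 0.6)}
print(f"{query_cost('gpt-5-chat', 1200, 300, pricing):.6f}")  # → 0.002400
```

The generator accumulates these per-query costs into the `generation_cost_usd` and `total_cost_usd` metrics stored later.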

### 2. Retriever (core/retriever.py)

Purpose: Fetch relevant context chunks with optional reranking.

Features:

- ✅ Semantic similarity search (ChromaDB)
- ✅ Cohere reranking for Pipeline B (accuracy boost)
- ✅ Configurable top-k retrieval
- ✅ Score normalization
- ✅ Metadata filtering
- ✅ Performance timing

Retrieval Strategies:

| Pipeline | Strategy | Chunks | Reranking | Use Case |
|---|---|---|---|---|
| A | Fast | 3 | ❌ | Speed-critical |
| B | Accuracy | 10 | ✅ Cohere | Maximum quality |
| C-F | Standard | 5-10 | ❌ | General use |
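Score normalization lets similarity scores from different embedders be compared on one scale. Min-max scaling to [0, 1] is one common approach (whether `core/retriever.py` uses exactly this is an assumption):

```python
def normalize_scores(scores):
    """Min-max normalize a list of similarity scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: treat every chunk as equally relevant
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(normalize_scores([0.2, 0.5, 0.8]))
```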
### 3. Pipeline Orchestrator (core/pipeline.py)

Purpose: End-to-end RAG workflow coordinator.

Features:

- ✅ Component integration (Embedder → VectorStore → Retriever → Generator)
- ✅ Stage-wise timing (retrieval_time_ms, generation_time_ms, total_time_ms)
- ✅ Cost accumulation
- ✅ Metadata tracking
- ✅ Error recovery

Pipeline Flow (each stage is timed individually, plus the total):

```
User Query → Embedding → Vector Search → Rerank (optional) → LLM Generation → Response
```
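The stage-wise timing pattern can be sketched as a small wrapper around each stage. The stage functions here are stand-ins, not the real components:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and also return its elapsed time in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

def run_pipeline(query, retrieve, generate):
    """Orchestrate retrieval then generation, accumulating per-stage timings."""
    chunks, retrieval_time_ms = timed(retrieve, query)
    answer, generation_time_ms = timed(generate, query, chunks)
    return {
        "answer": answer,
        "retrieval_time_ms": retrieval_time_ms,
        "generation_time_ms": generation_time_ms,
        "total_time_ms": retrieval_time_ms + generation_time_ms,
    }

result = run_pipeline(
    "who wrote Hamlet?",
    retrieve=lambda q: ["chunk: Hamlet is a play by William Shakespeare"],
    generate=lambda q, chunks: "William Shakespeare",
)
print(result["answer"], round(result["total_time_ms"], 2), "ms")
```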

### 4. Parallel Evaluator (scripts/run_parallel_evaluation.py)

Purpose: Run all 6 pipelines simultaneously on the evaluation dataset.

Features:

- ✅ Concurrent execution (ThreadPoolExecutor)
- ✅ Progress tracking (tqdm)
- ✅ Automatic database insertion
- ✅ Error isolation (one pipeline failure doesn't stop others)
- ✅ Answer validation (exact match + fuzzy matching)
- ✅ Run ID tracking for experiment management
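The ThreadPoolExecutor-with-error-isolation pattern looks roughly like this (a minimal sketch; pipeline callables are stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(pipelines, query):
    """Run every pipeline in its own thread; one failure never stops the rest."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in pipelines.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = {"ok": True, "answer": fut.result()}
            except Exception as e:
                # Isolate the failure: record it and keep going
                results[name] = {"ok": False, "error": str(e)}
    return results

def failing_pipeline(q):
    raise RuntimeError("provider timeout")

demo = {"A": lambda q: f"A answers: {q}", "B": failing_pipeline}
out = run_all(demo, "test")
print(out["A"]["ok"], out["B"]["ok"])  # → True False
```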

Performance Metrics Tracked:

- ✅ Accuracy (answer_found: 0 or 1)
- ✅ Latency (retrieval_time_ms, generation_time_ms, total_time_ms)
- ✅ Cost (generation_cost_usd, total_cost_usd)
- ✅ Token usage (prompt_tokens, completion_tokens, total_tokens)
- ✅ Retrieval quality (num_chunks_retrieved, retrieval_scores)
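The exact-plus-fuzzy answer validation can be sketched with stdlib `difflib`. The real evaluator may normalize text differently, and the 0.8 threshold is an illustrative assumption:

```python
from difflib import SequenceMatcher

def answer_found(generated, ground_truths, fuzzy_threshold=0.8):
    """Return 1 if any ground-truth answer matches exactly (as a substring)
    or fuzzily (similarity ratio >= threshold), else 0."""
    gen = generated.strip().lower()
    for truth in ground_truths:
        t = truth.strip().lower()
        if t in gen:  # exact substring match
            return 1
        if SequenceMatcher(None, gen, t).ratio() >= fuzzy_threshold:
            return 1  # fuzzy match
    return 0

print(answer_found("The author was William Shakespeare.", ["william shakespeare"]))  # → 1
print(answer_found("I don't know.", ["william shakespeare"]))  # → 0
```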

### 5. Analysis Dashboard (scripts/analyze_results.py)

Purpose: Comprehensive evaluation results analysis.

Features:

- ✅ Pipeline performance summary (accuracy, cost, speed)
- ✅ Cost efficiency analysis (cost per correct answer)
- ✅ Time breakdown (retrieval vs generation)
- ✅ Token usage statistics
- ✅ Retrieval quality metrics
- ✅ Difficult questions identification (0% accuracy)
- ✅ Easy questions identification (>66% accuracy)
- ✅ Question-by-question comparison
- ✅ Excel export with 8 detailed sheets

Usage:

```bash
# View dashboard in terminal
python scripts/analyze_results.py

# Export to Excel
python scripts/analyze_results.py --export results.xlsx

# List all runs
python scripts/analyze_results.py --list-runs
```

### 6. Database Schema (data/evaluation_results.db)

Table: evaluation_results

| Column | Type | Description |
|---|---|---|
| id | INTEGER PRIMARY KEY | Auto-increment ID |
| run_id | TEXT | Evaluation run identifier (e.g., "20260117_182253") |
| pipeline_id | TEXT | Pipeline identifier |
| pipeline_name | TEXT | Human-readable pipeline name |
| question_id | TEXT | Question identifier from dataset |
| query | TEXT | Input question |
| ground_truth_answers | TEXT | JSON array of correct answers |
| retrieved_chunks | TEXT | JSON array of context chunks |
| retrieval_scores | TEXT | JSON array of similarity scores |
| num_chunks_retrieved | INTEGER | Number of chunks retrieved |
| retrieval_time_ms | REAL | Time spent on retrieval |
| reranking_time_ms | REAL | Time spent on reranking (if applicable) |
| reranked | INTEGER | Whether reranking was used (0 or 1) |
| generated_answer | TEXT | Model's generated answer |
| generation_time_ms | REAL | Time spent on generation |
| prompt_tokens | INTEGER | Input tokens used |
| completion_tokens | INTEGER | Output tokens generated |
| total_tokens | INTEGER | Total tokens (prompt + completion) |
| generation_cost_usd | REAL | Cost of generation |
| total_cost_usd | REAL | Total query cost |
| total_time_ms | REAL | End-to-end latency |
| has_answer | INTEGER | Whether an answer is present (1 or 0) |
| answer_found | INTEGER | Whether the answer is correct (1 or 0) |
| timestamp | TEXT | ISO 8601 timestamp |
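The storage pattern is plain `sqlite3`. A minimal sketch using a subset of the columns above (the real table has all of them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the project uses data/evaluation_results.db
conn.execute("""
    CREATE TABLE evaluation_results (
        id INTEGER PRIMARY KEY,
        run_id TEXT,
        pipeline_id TEXT,
        query TEXT,
        generated_answer TEXT,
        total_cost_usd REAL,
        total_time_ms REAL,
        answer_found INTEGER
    )
""")
conn.execute(
    "INSERT INTO evaluation_results "
    "(run_id, pipeline_id, query, generated_answer, total_cost_usd, total_time_ms, answer_found) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("20260117_182253", "A", "who wrote Hamlet?", "William Shakespeare", 0.0002, 850.0, 1),
)
row = conn.execute(
    "SELECT pipeline_id, answer_found FROM evaluation_results"
).fetchone()
print(row)  # → ('A', 1)
```

Keying every row by `run_id` is what lets the analysis dashboard compare separate evaluation runs later.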

## 🚀 Quick Start

Prerequisites (ensure Phases 1 & 2 are complete):

- ✅ 6 pipeline configurations defined
- ✅ All API keys in `.env`
- ✅ ChromaDB vector store populated
- ✅ Wikipedia corpus ingested

Run the Phase 3 components in order:

```bash
python core/generator.py                              # 1) LLM response generation
python core/retriever.py                              # 2) Context retrieval + reranking
python core/pipeline.py                               # 3) End-to-end orchestration
python utils/dataset_loader.py                        # 4) Load Natural Questions + Wikipedia dataset
python scripts/ingest_corpus_selective_pipeline.py    # 5) Ingest Wikipedia corpus into all 6 pipelines (see below)
python scripts/run_generic_evaluation.py --num-questions 60 --pipelines A,B,C,D,E,F   # 6) Parallel RAG pipeline evaluation
python scripts/analyze_results.py                     # 7) Results dashboard (different run types generate different outputs)
```

For a large-scale dataset, split step 5 by pipeline:

```bash
python scripts/ingest_corpus_selective_pipeline.py --pipelines A,C,D,E,F --passages 500000 --batch-size 5000
python scripts/ingest_corpus_selective_pipeline.py --pipelines B --passages 500000 --batch-size 1000
```

Last Updated: January 17, 2026, 7:38 PM EST | Project: RAG Pipeline Optimizer | Phase: 3 of 5

# 📊 Phase 4: Advanced Evaluation & Interactive Dashboard

## 🎯 Overview

Phase 4 delivers a two-part system for advanced RAG pipeline evaluation:

- **Phase 4A:** LLM-as-a-Judge evaluation system using GPT-4o to score answer quality across 6 dimensions
- **Phase 4B:** Full-stack interactive Streamlit dashboard for visualizing and comparing results

Together, these provide objective quality scoring and interactive exploration of pipeline performance beyond basic metrics like speed and cost.

## 📦 Phase 4 Components

- **Phase 4A: LLM Judge Evaluation System** - automated answer quality scoring using GPT-4o as an AI judge
- **Phase 4B: Interactive Dashboard** - 9-page Streamlit application for data exploration and real-time testing

## 🔬 Phase 4A: LLM Judge Evaluation

### Overview

Phase 4A adds multi-dimensional quality scoring to existing evaluation results using GPT-4o as an objective judge. Each answer is scored across 6 quality dimensions, providing insights beyond operational metrics.

### ✨ Features

6-Dimensional Quality Scoring:

- Correctness (0-10) - factual accuracy compared to ground truth
- Relevance (0-10) - how well the answer addresses the question
- Completeness (0-10) - coverage of important information
- Clarity (0-10) - clear, understandable language
- Conciseness (0-10) - brevity without sacrificing information
- Overall (0-10) - weighted average of all dimensions

Automated Evaluation:

- Evaluates existing Phase 3 results retroactively - no need to re-run pipelines
- Batch processing with progress tracking
- Results stored in a separate database table

Cost-Efficient:

- Only evaluates answers, not entire pipeline re-runs
- Uses GPT-4o-mini for cost efficiency
- Batches requests to minimize API calls
πŸ—οΈ Architecture rag_optimizer/ β”œβ”€β”€ core/ β”‚ └── evaluator.py # LLM Judge implementation β”œβ”€β”€ utils/ β”‚ └── database.py # Database utilities for score storage β”œβ”€β”€ scripts/ β”‚ └── evaluate_with_judge.py # CLI tool for running evaluations └── data/ └── evaluation_results.db # SQLite (updated schema)

πŸ—„οΈ Database Schema (Phase 4A Extension) New Table: evaluation_scores Stores LLM judge quality scores for each evaluation result.

Column Type Description
id INTEGER Primary key
evaluation_result_id INTEGER Foreign key β†’ evaluation_results.id
correctness_score REAL Factual accuracy (0-10)
relevance_score REAL Question relevance (0-10)
completeness_score REAL Information coverage (0-10)
clarity_score REAL Language clarity (0-10)
conciseness_score REAL Brevity (0-10)
overall_score REAL Weighted average (0-10)
judge_reasoning TEXT LLM's explanation for scores
timestamp TEXT ISO timestamp

Indexes:

idx_eval_result on evaluation_result_id idx_overall_score on overall_score


## 🚀 Quick Start

```bash
python core/evaluator.py                                  # LLM Judge implementation
python utils/database.py                                  # Database utilities
python scripts/evaluate_with_judge.py --latest --limit 5  # Judge the latest run (first 5 results)
```

## 🖥️ Phase 4B: Interactive Dashboard

### Overview

Full-stack Streamlit dashboard with 9 pages for exploring evaluation results and testing pipelines in real time.

### ✨ Features

🏠 Home Page

- Project overview and capabilities
- Quick stats (6 pipelines, 5 LLM providers, 500K+ corpus)
- Pipeline configuration cards
- Modern dark theme UI

📊 Pipeline Comparison

- Side-by-side performance metrics
- Quality scores from the LLM judge (correctness, relevance, completeness, clarity, conciseness)
- Interactive comparison tables
- Filter by evaluation run
- Sort by accuracy, speed, cost, or quality score

🔍 Question Explorer

- Browse all evaluated questions and see how each pipeline answered
- View quality scores per answer
- Compare answers across pipelines
- View retrieved context chunks
- Ground truth validation

💰 Cost Analysis

- Token usage breakdown
- Cost per query analysis
- Cost efficiency rankings
- Cost per quality point (cost divided by overall score)

⚡ Performance Metrics

- Latency analysis (retrieval vs generation)
- Time breakdown by pipeline stage
- Speed comparisons
- Quality-adjusted speed (speed vs quality trade-offs)

🔬 Performance Insights

- Analyze pipeline performance across question types, categories, and difficulty
- Performance by question type
- Performance by pipeline

🧪 Live Testing

- Real-time pipeline testing with category-based question suggestions
- Multi-pipeline comparison with live progress tracking
- Answer quality comparison
- Instant quality scoring (optional)

📦 Batch Evaluation

- Run comprehensive evaluations (5-100 questions)
- Multi-pipeline testing with parallel execution (1-6 workers)
- Real-time progress monitoring
- Option to run the LLM judge automatically

🏆 Leaderboard

- Overall and quality-weighted pipeline rankings
- Multiple sorting options (accuracy, speed, cost, quality)
- Performance badges

### Architecture

```
app/
├── dashboard.py              # Main entry point
├── .streamlit/
│   └── config.toml
└── pages/
    ├── __init__.py
    ├── home.py
    ├── comparison.py
    ├── explorer.py
    ├── cost.py
    ├── performance.py
    ├── insights.py
    ├── testing.py
    ├── batch_evaluation.py
    └── leaderboard.py
```

Run the application:

```bash
streamlit run app/dashboard.py
```

Last Updated: January 24, 2026, 7:00 PM EST | Project: RAG Pipeline Optimizer | Phase: 4 of 5