# RAG Pipeline Optimizer - Phase 1 Complete ✅
An MLOps platform for evaluating and optimizing RAG (Retrieval-Augmented Generation) pipelines across multiple models and configurations.
## 🎯 Project Overview
**The Problem:** Every company has a RAG system, but almost no one knows whether their RAG is actually good. Is `chunk_size=512` better than 1024? Is Cohere a better embedder than OpenAI for their data? Most teams are just guessing.

**The Solution:** A full-stack RAG evaluation platform that runs multiple pipeline configurations in parallel, scores them using AI evaluation, and shows you which configuration works best for YOUR data.
## ✅ Phase 1: Complete
### What's Built
- ✅ Project structure with clean separation of concerns
- ✅ 6 diverse RAG pipelines leveraging different strategies:
  - Pipeline A: Speed-Optimized (Azure GPT-5)
  - Pipeline B: Accuracy-Optimized (Azure GPT-5 + Reranking)
  - Pipeline C: Balanced (Azure Cohere)
  - Pipeline D: Reasoning (Anthropic Claude)
  - Pipeline E: Cost-Optimized (Azure DeepSeek)
  - Pipeline F: Experimental (xAI Grok)
- ✅ Configuration management with environment variables
- ✅ Cost estimation for each pipeline
- ✅ Comprehensive tests to validate configurations
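For illustration, a pipeline configuration with a per-query cost estimate can be modeled as a small dataclass. This is only a sketch: the field names and the example rate are hypothetical, not the actual contents of `config/pipeline_configs.py`.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """Illustrative shape of one RAG pipeline configuration."""
    pipeline_id: str
    name: str
    llm_provider: str
    llm_model: str
    chunk_size: int
    chunk_overlap: int
    top_k: int
    use_reranking: bool
    cost_per_1k_tokens_usd: float  # hypothetical blended rate

    def estimate_query_cost(self, avg_tokens: int) -> float:
        # Rough per-query estimate: tokens consumed x provider rate
        return avg_tokens / 1000 * self.cost_per_1k_tokens_usd

# Hypothetical example in the spirit of Pipeline B (Accuracy-Optimized)
PIPELINE_B = PipelineConfig(
    pipeline_id="B", name="Accuracy-Optimized",
    llm_provider="azure-openai", llm_model="gpt-5-chat",
    chunk_size=512, chunk_overlap=64, top_k=10,
    use_reranking=True, cost_per_1k_tokens_usd=0.01,
)
```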
### Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| LLM Providers | Azure OpenAI, Cohere, DeepSeek, Anthropic, xAI | Diverse model comparison |
| Embeddings | OpenAI, Sentence-Transformers | Vector representations |
| Vector DB | ChromaDB | Local vector storage |
| Framework | LangChain | RAG orchestration |
| Storage | SQLite | Results & metadata |
| Backend (Phase 2) | FastAPI | REST API |
| Frontend (Phase 3) | Streamlit | User interface |
| Deployment (Phase 4) | Hugging Face Spaces | Cloud hosting |
## 📁 Project Structure
```
rag_optimizer/
├── config/
│   ├── __init__.py
│   └── pipeline_configs.py      # 6 pipeline configurations
├── core/                        # [Phase 2] Document processing
│   ├── __init__.py
│   ├── document_loader.py       # [Coming next]
│   ├── chunker.py               # [Coming next]
│   ├── embedder.py              # [Coming next]
│   ├── vector_store.py          # [Coming next]
│   ├── retriever.py             # [Coming next]
│   ├── generator.py             # [Coming next]
│   └── pipeline.py              # [Coming next]
├── data/
│   ├── uploads/                 # User-uploaded documents
│   ├── vector_stores/           # ChromaDB storage
│   └── results.db               # SQLite evaluation results
├── utils/
│   ├── __init__.py
│   └── database.py              # [Phase 3]
├── tests/
│   ├── __init__.py
│   └── test_pipeline_config.py  # ✅ Tests pass
├── .env                         # Your API keys (DO NOT COMMIT)
├── .env.example                 # Template for .env
├── requirements.txt             # Python dependencies
└── README.md                    # This file
```
## 🚀 Quick Start
Open the `rag_optimizer` directory in VS Code.

### Installation

Navigate to the project and initialize it:

```bash
python init_project.py
```

Create and activate a virtual environment:

```bash
python -m venv venv
# On Windows:
.\venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Configure API keys (this project uses Azure AI Foundry for OpenAI, Cohere, and DeepSeek):

1. Copy the template: `cp .env.example .env`
2. For the models served via Azure, configure `ENDPOINT`, `API_KEY`, and `DEPLOYMENT_NAME` for each model.
3. For the remaining providers (Anthropic, Groq), set the provider's `API_KEY` directly.
### Verify Setup

View the pipeline comparison:

```bash
python config/pipeline_configs.py
```

Run tests:

```bash
python tests/test_pipeline_config.py
```
**Last Updated:** January 14, 2026 | **Project:** RAG Pipeline Optimizer | **Phase:** 1 of 5
# RAG Pipeline Optimizer - Phase 2 Complete ✅
## Phase 2: Core RAG Components

Successfully implemented and tested all core components for document processing, embedding generation, and vector storage using the LangChain framework.
## 🎯 Phase 2 Deliverables

- ✅ **Document Loader** - Multi-format document parsing (PDF, DOCX, TXT, MD, PPTX, XLSX)
- ✅ **Text Chunker** - LangChain-based chunking with multiple strategies
- ✅ **Embedder** - Local + Azure OpenAI embeddings
- ✅ **Vector Store** - ChromaDB with LangChain integration
## 📁 Files Created

```
rag_optimizer/
├── core/
│   ├── __init__.py
│   ├── document_loader.py   ✅ Multi-format document loading
│   ├── chunker.py           ✅ LangChain text splitting
│   ├── embedder.py          ✅ Embedding generation
│   └── vector_store.py      ✅ ChromaDB vector storage
├── data/
│   ├── uploads/             📁 User-uploaded documents
│   └── vector_stores/       📁 Persisted vector databases
└── requirements.txt         ✅ Updated with LangChain packages
```
## 🔧 Components Overview
### 1. Document Loader (`core/document_loader.py`)

**Purpose:** Load and parse documents in multiple formats.

**Supported Formats:**

- PDF (`.pdf`) - Extracts text with page numbers
- Word (`.docx`) - Paragraphs and formatting
- Text (`.txt`) - Plain text files
- Markdown (`.md`) - Converts to plain text
- PowerPoint (`.pptx`) - Slide content
- Excel (`.xlsx`) - Sheet data

**Key Features:**

- Automatic format detection
- Metadata extraction (file size, page count)
- Error handling for corrupted files
- Batch document loading
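Automatic format detection typically amounts to an extension-to-parser dispatch table. A minimal sketch (the parser names here are placeholders; the real loader would wire in libraries such as pypdf or python-docx):

```python
from pathlib import Path

# Supported extensions mapped to the parser each one dispatches to.
# Parser names are hypothetical stand-ins for the real parsing functions.
SUPPORTED_FORMATS = {
    ".pdf": "pdf_parser", ".docx": "docx_parser", ".txt": "text_parser",
    ".md": "markdown_parser", ".pptx": "pptx_parser", ".xlsx": "xlsx_parser",
}

def detect_format(file_path: str) -> str:
    """Return the parser for a file, based on its (case-insensitive) extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {suffix}")
    return SUPPORTED_FORMATS[suffix]
```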
### 2. Text Chunker (`core/chunker.py`)

**Purpose:** Split documents into semantic chunks for embedding.

**Framework:** LangChain Text Splitters

**Chunking Strategies:**
| Strategy | Description | Use Case | Quality |
|---|---|---|---|
| recursive ✅ | Tries `\n\n` → `\n` → `.` in order | RECOMMENDED for all pipelines | A+ |
| character | Simple character-based splitting | Basic documents | B |
| token | Token-aware splitting | Token-limited models | B |
| sentence | Sentence boundary detection | Short documents | C |
**Key Features:**

- Configurable chunk size (tokens)
- Overlap for context preservation
- Clean semantic boundaries
- No fragment generation
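The recursive strategy can be sketched in plain Python. This is a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter`, not the project's actual chunker: it keeps text whole when it fits, otherwise splits on the coarsest separator whose pieces fit (overlap handling is omitted for brevity).

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", ". ")):
    """Minimal sketch of 'recursive' chunking: prefer coarse semantic
    boundaries, fall back to finer ones, hard-cut only as a last resort."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for piece in text.split(sep):
                candidate = current + sep + piece if current else piece
                if len(candidate) <= chunk_size:
                    current = candidate          # piece still fits: accumulate
                elif len(piece) <= chunk_size:
                    if current:
                        chunks.append(current)
                    current = piece              # start a new chunk
                else:
                    if current:
                        chunks.append(current)
                    current = ""
                    # Oversized piece: recurse with finer separators
                    chunks.extend(recursive_split(piece, chunk_size, separators))
            if current:
                chunks.append(current)
            return chunks
    # No separator present at all: hard character cut
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```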
### 3. Embedder (`core/embedder.py`)

**Purpose:** Generate vector embeddings for text chunks.

**Framework:** LangChain Embeddings

**Supported Providers:**
| Provider | Model | Dimension | Cost | Speed | Use Case |
|---|---|---|---|---|---|
| sentence-transformers | all-MiniLM-L6-v2 | 384D | FREE ✅ | Fast | Development/Testing |
| sentence-transformers | all-mpnet-base-v2 | 768D | FREE ✅ | Medium | Better quality |
| azure-openai | text-embedding-3-small | 1536D | $0.02/1M | Fast | Production |
| azure-openai | text-embedding-3-large | 3072D | $0.13/1M | Medium | Highest accuracy |
**Key Features:**

- Automatic batching for efficiency
- Cosine similarity calculation
- Normalized embeddings
- Local caching (future)
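The two math utilities above (cosine similarity and normalization) are simple enough to show directly. A stdlib-only sketch, independent of any embedding provider:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize(v):
    """Unit-length embedding, so cosine similarity reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```

Normalizing at embedding time is why vector stores can use a plain dot product for cosine-distance search.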
### 4. Vector Store (`core/vector_store.py`)

**Purpose:** Store and retrieve document chunks using vector similarity.

**Framework:** LangChain + ChromaDB

**Key Features:**

- Local persistent storage (no external DB needed)
- Fast similarity search (cosine distance)
- Metadata filtering
- LangChain retriever integration
- Collection management

**Storage Structure:**

```
data/vector_stores/
└── {collection_name}/
    ├── chroma.sqlite3    # Metadata
    └── {uuid}/           # Vector data
        └── data_level0.bin
```
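Conceptually, top-k similarity search is just "score every chunk against the query and keep the best k." A brute-force stand-in for ChromaDB's index (assuming pre-normalized embeddings, so dot product equals cosine similarity):

```python
def top_k_search(query_vec, store, k=3):
    """Brute-force similarity search over (chunk_text, embedding) pairs.
    A conceptual stand-in for ChromaDB's cosine-distance index."""
    def dot(a, b):  # embeddings assumed pre-normalized
        return sum(x * y for x, y in zip(a, b))
    scored = [(dot(query_vec, emb), text) for text, emb in store]
    scored.sort(reverse=True)   # highest similarity first
    return scored[:k]
```

A real index (HNSW, as ChromaDB uses) avoids scoring every chunk, but returns the same kind of ranked result.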
## 🚀 Quick Start

```bash
# 1. Activate the virtual environment (Windows)
.\venv\Scripts\activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Smoke-test each component
python core/document_loader.py
python core/chunker.py
python core/embedder.py
python core/vector_store.py
```
**Last Updated:** January 14, 2026, 8:38 PM EST | **Project:** RAG Pipeline Optimizer | **Phase:** 2 of 5
# 🚀 Phase 3: Pipeline Orchestration & Parallel Evaluation

## Phase 3 Roadmap (Step-by-Step)

1. **Generator Module** ⬅️ START HERE - Build the LLM interface for all 6 models (Azure OpenAI, Cohere, DeepSeek, Claude, Grok)
2. **Retriever Module** - Combine VectorStore + optional reranking (Pipeline B uses Cohere rerank)
3. **Pipeline Orchestrator** - Connect all components: Document → Chunks → Embeddings → Retrieval → Generation
4. **Dataset Integration** - Download wiki_dpr + Natural Questions, load into vector stores
5. **Parallel Execution** - Run all 6 pipelines on the same query simultaneously
6. **Evaluation & Results Storage** - SQLite database to store query results, costs, metrics
## 🎯 Phase 3 Overview

Phase 3 integrated all core RAG components into a fully functional multi-pipeline evaluation system capable of running 6 different RAG configurations in parallel, comparing their performance, and storing results for analysis.
### What We Built

- ✅ **LLM Generator** (`core/generator.py`) - Multi-provider response generation
- ✅ **Smart Retriever** (`core/retriever.py`) - Context retrieval with optional reranking
- ✅ **Pipeline Orchestrator** (`core/pipeline.py`) - End-to-end RAG workflow
- ✅ **Parallel Evaluator** (`scripts/run_parallel_evaluation.py`) - Simultaneous pipeline execution
- ✅ **Analysis Dashboard** (`scripts/analyze_results.py`) - Performance comparison tools
- ✅ **Database Schema** (`data/evaluation_results.db`) - SQLite storage for metrics
- ✅ **Dataset Integration** (`scripts/dataset_loader.py`) - NQ-Open evaluation dataset
- ✅ **Corpus Ingestion** (`scripts/ingest_corpus.py`) - Wikipedia knowledge base
## 🏗️ Architecture Overview

```
                      USER QUERY INPUT
                             │
                             ▼
          PARALLEL PIPELINE EXECUTION (6 pipelines)
 A (Speed) │ B (Accuracy) │ C (Balanced) │ D (Reasoning) │ E (Cost) │ F
                             │
                             ▼
                  VECTOR STORE (ChromaDB)
        Retrieves top-k relevant chunks for each pipeline
                             │
                             ▼
             RETRIEVER (with optional reranking)
        • Pipeline B: Cohere reranking (accuracy boost)
        • Others: direct similarity search
                             │
                             ▼
                         GENERATOR
        • Pipeline A: Azure GPT-5 (fast)
        • Pipeline B: Azure GPT-5 (high quality)
        • Pipeline C: Azure Cohere Command
        • Pipeline D: Anthropic Claude (reasoning)
        • Pipeline E: DeepSeek V3.2 (cost-optimized)
        • Pipeline F: Groq Llama (experimental)
                             │
                             ▼
             EVALUATION & METRICS COLLECTION
        • Answer correctness (exact match + fuzzy)
        • Latency tracking (retrieval + generation)
        • Cost calculation (per query)
        • Token usage monitoring
                             │
                             ▼
            SQLite DATABASE (evaluation_results.db)
        Stores: queries, answers, metrics, timestamps
                             │
                             ▼
          ANALYSIS DASHBOARD (analyze_results.py)
        • Pipeline comparison
        • Cost efficiency analysis
        • Question difficulty breakdown
        • Excel export for deeper analysis
```

## 📦 Components Built in Phase 3
### 1. Generator (`core/generator.py`)

**Purpose:** Interface to all LLM providers with unified response handling.
**Features:**

- ✅ Multi-provider support (Azure OpenAI, Cohere, Claude, DeepSeek, Groq)
- ✅ Prompt template management
- ✅ Automatic cost calculation
- ✅ Token usage tracking
- ✅ Error handling & retries
- ✅ Response parsing with strict format validation
**Supported Models:**

```python
AZURE_GPT5       = "gpt-5-chat"         # Fast, high quality
AZURE_COHERE     = "cohere-command-a"   # Balanced performance
AZURE_DEEPSEEK   = "DeepSeek-V3.2"      # Ultra cost-efficient
ANTHROPIC_CLAUDE = "claude-3-5-sonnet"  # Advanced reasoning
GROQ_LLAMA       = "llama-3.3-70b"      # Experimental, fast inference
```
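Automatic cost calculation reduces to multiplying token counts by per-provider rates. A sketch of the idea (the rates below are made-up placeholders, not real pricing; the actual rates live in the project's configuration):

```python
from dataclasses import dataclass

# HYPOTHETICAL per-1M-token rates, for illustration only.
PRICING_USD_PER_1M = {
    "gpt-5-chat":    {"prompt": 1.25, "completion": 10.0},
    "DeepSeek-V3.2": {"prompt": 0.27, "completion": 1.10},
}

@dataclass
class GenerationResult:
    """Unified response envelope returned by every provider."""
    answer: str
    model: str
    prompt_tokens: int
    completion_tokens: int

    @property
    def cost_usd(self) -> float:
        # Cost = (prompt tokens x prompt rate + completion tokens x completion rate) / 1M
        rates = PRICING_USD_PER_1M[self.model]
        return (self.prompt_tokens * rates["prompt"]
                + self.completion_tokens * rates["completion"]) / 1_000_000
```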
### 2. Retriever (`core/retriever.py`)

**Purpose:** Fetch relevant context chunks with optional reranking.
**Features:**

- ✅ Semantic similarity search (ChromaDB)
- ✅ Cohere reranking for Pipeline B (accuracy boost)
- ✅ Configurable top-k retrieval
- ✅ Score normalization
- ✅ Metadata filtering
- ✅ Performance timing

**Retrieval Strategies:**
Retrieval Strategies:
| Pipeline | Strategy | Chunks | Reranking | Use Case |
|---|---|---|---|---|
| A | Fast | 3 | ❌ | Speed-critical |
| B | Accuracy | 10 | ✅ Cohere | Maximum quality |
| C-F | Standard | 5-10 | ❌ | General use |
### 3. Pipeline Orchestrator (`core/pipeline.py`)

**Purpose:** End-to-end RAG workflow coordinator.
**Features:**

- ✅ Component integration (Embedder → VectorStore → Retriever → Generator)
- ✅ Stage-wise timing (`retrieval_time_ms`, `generation_time_ms`, `total_time_ms`)
- ✅ Cost accumulation
- ✅ Metadata tracking
- ✅ Error recovery
**Pipeline Flow:**

```
User Query → Embedding → Vector Search → Rerank (optional) → LLM Generation → Response
              timing        timing           timing              timing        total
```
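Stage-wise timing is just wrapping each step in a stopwatch and summing the pieces. A minimal helper sketch (not the project's actual API):

```python
import time

def timed_stage(fn, *args, **kwargs):
    """Run one pipeline stage; return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Usage sketch: accumulate per-stage timings into the result record
# chunks, retrieval_time_ms  = timed_stage(retriever.retrieve, query)
# answer, generation_time_ms = timed_stage(generator.generate, query, chunks)
# total_time_ms = retrieval_time_ms + generation_time_ms
```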
### 4. Parallel Evaluator (`scripts/run_parallel_evaluation.py`)

**Purpose:** Run all 6 pipelines simultaneously on the evaluation dataset.

**Features:**

- ✅ Concurrent execution (ThreadPoolExecutor)
- ✅ Progress tracking (tqdm)
- ✅ Automatic database insertion
- ✅ Error isolation (one pipeline failure doesn't stop others)
- ✅ Answer validation (exact match + fuzzy matching)
- ✅ Run ID tracking for experiment management
**Performance Metrics Tracked:**

- ✅ Accuracy (`answer_found`: 0 or 1)
- ✅ Latency (`retrieval_time_ms`, `generation_time_ms`, `total_time_ms`)
- ✅ Cost (`generation_cost_usd`, `total_cost_usd`)
- ✅ Token usage (`prompt_tokens`, `completion_tokens`, `total_tokens`)
- ✅ Retrieval quality (`num_chunks_retrieved`, `retrieval_scores`)
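The concurrent-execution-with-error-isolation pattern can be sketched with the stdlib's `ThreadPoolExecutor`: one failing pipeline is recorded as an error instead of aborting the batch. The function names here are illustrative, not the script's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all_pipelines(query, pipelines, max_workers=6):
    """Run every pipeline on the same query concurrently.

    `pipelines` maps pipeline_id -> callable(query) -> answer.
    A raised exception in one pipeline is captured per-pipeline."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, query): pid for pid, fn in pipelines.items()}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = {"ok": True, "answer": future.result()}
            except Exception as exc:  # error isolation: record, don't abort
                results[pid] = {"ok": False, "error": str(exc)}
    return results
```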
### 5. Analysis Dashboard (`scripts/analyze_results.py`)

**Purpose:** Comprehensive evaluation results analysis.

**Features:**

- ✅ Pipeline performance summary (accuracy, cost, speed)
- ✅ Cost efficiency analysis (cost per correct answer)
- ✅ Time breakdown (retrieval vs generation)
- ✅ Token usage statistics
- ✅ Retrieval quality metrics
- ✅ Difficult question identification (0% accuracy)
- ✅ Easy question identification (>66% accuracy)
- ✅ Question-by-question comparison
- ✅ Excel export with 8 detailed sheets

**Usage:**
```bash
# View dashboard in terminal
python scripts/analyze_results.py

# Export to Excel
python scripts/analyze_results.py --export results.xlsx

# List all runs
python scripts/analyze_results.py --list-runs
```
### 6. Database Schema (`data/evaluation_results.db`)

**Table:** `evaluation_results`
| Column | Type | Description |
|---|---|---|
| id | INTEGER PRIMARY KEY | Auto-increment ID |
| run_id | TEXT | Evaluation run identifier (e.g., "20260117_182253") |
| pipeline_id | TEXT | Pipeline identifier |
| pipeline_name | TEXT | Human-readable pipeline name |
| question_id | TEXT | Question identifier from dataset |
| query | TEXT | Input question |
| ground_truth_answers | TEXT | JSON array of correct answers |
| retrieved_chunks | TEXT | JSON array of context chunks |
| retrieval_scores | TEXT | JSON array of similarity scores |
| num_chunks_retrieved | INTEGER | Number of chunks retrieved |
| retrieval_time_ms | REAL | Time spent on retrieval |
| reranking_time_ms | REAL | Time spent on reranking (if applicable) |
| reranked | INTEGER | Whether reranking was used (0 or 1) |
| generated_answer | TEXT | Model's generated answer |
| generation_time_ms | REAL | Time spent on generation |
| prompt_tokens | INTEGER | Input tokens used |
| completion_tokens | INTEGER | Output tokens generated |
| total_tokens | INTEGER | Total tokens (prompt + completion) |
| generation_cost_usd | REAL | Cost of generation |
| total_cost_usd | REAL | Total query cost |
| total_time_ms | REAL | End-to-end latency |
| has_answer | INTEGER | Whether answer is present (1 or 0) |
| answer_found | INTEGER | Whether answer is correct (1 or 0) |
| timestamp | TEXT | ISO 8601 timestamp |
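To make the schema concrete, here is how a small subset of it behaves through Python's built-in `sqlite3` (an in-memory sketch with only a few of the columns, not the full production schema):

```python
import sqlite3

# Minimal subset of the evaluation_results schema, for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE evaluation_results (
        id INTEGER PRIMARY KEY,
        run_id TEXT, pipeline_id TEXT, query TEXT,
        generated_answer TEXT, answer_found INTEGER,
        total_cost_usd REAL, total_time_ms REAL
    )
""")
conn.execute(
    "INSERT INTO evaluation_results "
    "(run_id, pipeline_id, query, generated_answer, answer_found, "
    " total_cost_usd, total_time_ms) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("20260117_182253", "B", "Who wrote Hamlet?",
     "William Shakespeare", 1, 0.0012, 842.5),
)
# Per-pipeline accuracy falls out of AVG over the 0/1 answer_found column
accuracy, = conn.execute(
    "SELECT AVG(answer_found) FROM evaluation_results WHERE pipeline_id = 'B'"
).fetchone()
```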
## 🚀 Quick Start
**Prerequisites** (Phases 1 & 2 complete):

- ✅ 6 pipeline configurations defined
- ✅ All API keys in `.env`
- ✅ ChromaDB vector store populated
- ✅ Wikipedia corpus ingested
**Steps:**

1. `core/generator.py` - LLM response generation
2. `core/retriever.py` - Context retrieval + reranking
3. `core/pipeline.py` - End-to-end orchestration
4. `utils/dataset_loader.py` - Load the Natural Questions + Wikipedia dataset
5. `scripts/ingest_corpus_selective_pipeline.py` - Ingest the Wikipedia corpus into all 6 pipelines (see below)
6. `python scripts/run_generic_evaluation.py --num-questions 60 --pipelines A,B,C,D,E,F` - Parallel RAG pipeline evaluation
7. `scripts/analyze_results.py` - Results dashboard (different run types generate different output)

For a large-scale dataset, run the ingestion (step 5) in two passes:

```bash
# 5a) All pipelines except B, with larger batches
python scripts/ingest_corpus_selective_pipeline.py --pipelines A,C,D,E,F --passages 500000 --batch-size 5000
# 5b) Pipeline B, with smaller batches
python scripts/ingest_corpus_selective_pipeline.py --pipelines B --passages 500000 --batch-size 1000
```
**Last Updated:** January 17, 2026, 7:38 PM EST | **Project:** RAG Pipeline Optimizer | **Phase:** 3 of 5
# 🚀 Phase 4: Advanced Evaluation & Interactive Dashboard

## 🎯 Overview

Phase 4 delivers a two-part system for advanced RAG pipeline evaluation:

- **Phase 4A:** LLM-as-a-Judge evaluation system using GPT-4o to score answer quality across 6 dimensions
- **Phase 4B:** Full-stack interactive Streamlit dashboard for visualizing and comparing results

Together, these provide objective quality scoring and interactive exploration of pipeline performance beyond basic metrics like speed and cost.
## 📦 Phase 4 Components

**Phase 4A: LLM Judge Evaluation System** - Automated answer quality scoring using GPT-4o as an AI judge

**Phase 4B: Interactive Dashboard** - 9-page Streamlit application for data exploration and real-time testing
## 🔬 Phase 4A: LLM Judge Evaluation

### Overview

Phase 4A adds multi-dimensional quality scoring to existing evaluation results, using GPT-4o as an objective judge. Each answer is scored across 6 quality dimensions, providing insights beyond operational metrics.
### ✨ Features

**6-Dimensional Quality Scoring:**

- **Correctness** (0-10) - Factual accuracy compared to ground truth
- **Relevance** (0-10) - How well the answer addresses the question
- **Completeness** (0-10) - Coverage of important information
- **Clarity** (0-10) - Clear, understandable language
- **Conciseness** (0-10) - Brevity without sacrificing information
- **Overall** (0-10) - Weighted average of all dimensions
**Automated Evaluation:**

- Evaluates existing Phase 3 results retroactively
- No need to re-run pipelines
- Batch processing with progress tracking
- Results stored in a separate database table

**Cost-Efficient:**

- Only evaluates answers, not entire pipeline re-runs
- Uses GPT-4o-mini for cost efficiency
- Batches requests to minimize API calls
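The "Overall" dimension is a weighted average of the five per-dimension scores. A sketch of the aggregation step (the weights here are hypothetical; the real judge may weight dimensions differently):

```python
# HYPOTHETICAL weights, for illustration; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.35, "relevance": 0.20, "completeness": 0.20,
    "clarity": 0.15, "conciseness": 0.10,
}

def overall_score(scores: dict) -> float:
    """Weighted average of the five per-dimension scores (each 0-10)."""
    return round(sum(scores[dim] * w for dim, w in WEIGHTS.items()), 2)
```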
## 🏗️ Architecture

```
rag_optimizer/
├── core/
│   └── evaluator.py              # LLM Judge implementation
├── utils/
│   └── database.py               # Database utilities for score storage
├── scripts/
│   └── evaluate_with_judge.py    # CLI tool for running evaluations
└── data/
    └── evaluation_results.db     # SQLite (updated schema)
```
## 🗃️ Database Schema (Phase 4A Extension)

**New Table: `evaluation_scores`** - Stores LLM judge quality scores for each evaluation result.
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| evaluation_result_id | INTEGER | Foreign key β evaluation_results.id |
| correctness_score | REAL | Factual accuracy (0-10) |
| relevance_score | REAL | Question relevance (0-10) |
| completeness_score | REAL | Information coverage (0-10) |
| clarity_score | REAL | Language clarity (0-10) |
| conciseness_score | REAL | Brevity (0-10) |
| overall_score | REAL | Weighted average (0-10) |
| judge_reasoning | TEXT | LLM's explanation for scores |
| timestamp | TEXT | ISO timestamp |
**Indexes:**

- `idx_eval_result` on `evaluation_result_id`
- `idx_overall_score` on `overall_score`
## 🚀 Quick Start

Key files: `core/evaluator.py`, `utils/database.py`

```bash
python scripts/evaluate_with_judge.py --latest --limit 5
```
## 🖥️ Phase 4B: Interactive Dashboard

### Overview

A full-stack Streamlit dashboard with 9 pages for exploring evaluation results and testing pipelines in real time.
### ✨ Features

**🏠 Home Page**
- Project overview and capabilities
- Quick stats (6 pipelines, 5 LLM providers, 500K+ corpus)
- Pipeline configuration cards
- Modern dark theme UI

**📊 Pipeline Comparison**
- Side-by-side performance metrics
- Quality scores from the LLM judge (correctness, relevance, completeness, clarity, conciseness)
- Interactive comparison tables
- Filter by evaluation run
- Sort by accuracy, speed, cost, or quality score
- Multi-dimensional scoring

**🔍 Question Explorer**
- Browse all evaluated questions
- See how each pipeline answered
- View quality scores per answer
- Compare answers across pipelines
- View retrieved context chunks
- Ground truth validation

**💰 Cost Analysis**
- Token usage breakdown
- Cost per query analysis
- Cost efficiency rankings
- Cost per quality point (cost divided by overall score)

**⚡ Performance Metrics**
- Latency analysis (retrieval vs generation)
- Time breakdown by pipeline stage
- Speed comparisons
- Quality-adjusted speed (speed vs quality trade-offs)

**🔬 Performance Insights**
- Analyze pipeline performance across question types, categories, and difficulty
- Performance by question type
- Performance by pipeline

**🧪 Live Testing**
- Real-time pipeline testing
- Category-based question suggestions
- Multi-pipeline comparison
- Live progress tracking
- Answer quality comparison
- Instant quality scoring (optional)

**📦 Batch Evaluation**
- Run comprehensive evaluations (5-100 questions)
- Multi-pipeline testing
- Parallel execution (1-6 workers)
- Real-time progress monitoring
- Option to run the LLM judge automatically

**🏆 Leaderboard**
- Overall pipeline rankings
- Quality-weighted rankings
- Multiple sorting options (accuracy, speed, cost, quality)
- Performance badges
### Architecture

```
app/
├── dashboard.py                  # Main entry point
├── .streamlit/
│   └── config.toml
└── pages/
    ├── __init__.py
    ├── home.py
    ├── comparison.py
    ├── explorer.py
    ├── cost.py
    ├── performance.py
    ├── testing.py
    ├── leaderboard.py
    ├── batch_evaluation.py
    └── insights.py
```

**Run the application:**

```bash
streamlit run app/dashboard.py
```
**Last Updated:** January 24, 2026, 7:00 PM EST | **Project:** RAG Pipeline Optimizer | **Phase:** 4 of 5