Vivek Kadamati committed on
Commit
ee444c0
·
0 Parent(s):

Initial commit

Browse files
Files changed (24) hide show
  1. .env.example +15 -0
  2. .gitignore +26 -0
  3. Dockerfile +28 -0
  4. ENHANCEMENTS.md +120 -0
  5. GIT_PUSH_GUIDE.md +156 -0
  6. Procfile +1 -0
  7. README.md +294 -0
  8. SETUP.md +69 -0
  9. UPDATE_REMOTE.md +178 -0
  10. __init__.py +15 -0
  11. api.py +374 -0
  12. chunking_strategies.py +207 -0
  13. cleanup_chroma.py +93 -0
  14. config.py +64 -0
  15. dataset_loader.py +178 -0
  16. docker-compose.yml +26 -0
  17. embedding_models.py +325 -0
  18. example.py +118 -0
  19. llm_client.py +351 -0
  20. requirements.txt +40 -0
  21. run.py +99 -0
  22. streamlit_app.py +721 -0
  23. trace_evaluator.py +352 -0
  24. vector_store.py +412 -0
.env.example ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Groq API Configuration
2
+ GROQ_API_KEY=your_groq_api_key_here
3
+
4
+ # Google Gemini API Configuration (for gemini-embedding-001)
5
+ GEMINI_API_KEY=your_gemini_api_key_here
6
+
7
+ # ChromaDB Configuration
8
+ CHROMA_PERSIST_DIRECTORY=./chroma_db
9
+
10
+ # Rate Limiting
11
+ GROQ_RPM_LIMIT=30
12
+ RATE_LIMIT_DELAY=2.0
13
+
14
+ # Application Configuration
15
+ LOG_LEVEL=INFO
.gitignore ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ __pycache__/
2
+ *.pyc
3
+ *.pyo
4
+ *.pyd
5
+ .Python
6
+ *.so
7
+ *.egg
8
+ *.egg-info/
9
+ dist/
10
+ build/
11
+ .env
12
+ .venv
13
+ venv/
14
+ env/
15
+ data_cache/
16
+ *.log
17
+ .DS_Store
18
+ .vscode/
19
+ .idea/
20
+ *.swp
21
+ *.swo
22
+ *~
23
+ .pytest_cache/
24
+ .coverage
25
+ htmlcov/
26
+ chroma_db/
Dockerfile ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.9-slim
2
+
3
+ WORKDIR /app
4
+
5
+ # Install system dependencies
6
+ RUN apt-get update && apt-get install -y \
7
+ build-essential \
8
+ curl \
9
+ && rm -rf /var/lib/apt/lists/*
10
+
11
+ # Copy requirements and install Python dependencies
12
+ COPY requirements.txt .
13
+ RUN pip install --no-cache-dir -r requirements.txt
14
+
15
+ # Copy application files
16
+ COPY . .
17
+
18
+ # Create directories for data
19
+ RUN mkdir -p chroma_db data_cache
20
+
21
+ # Expose ports
22
+ EXPOSE 8501 8000
23
+
24
+ # Set environment variables
25
+ ENV PYTHONUNBUFFERED=1
26
+
27
+ # Run Streamlit by default
28
+ CMD ["streamlit", "run", "streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
ENHANCEMENTS.md ADDED
@@ -0,0 +1,120 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # RAG Application Enhancements
2
+
3
+ ## Overview
4
+ The application has been enhanced with collection management, LLM selection, and improved user experience.
5
+
6
+ ## Key Enhancements
7
+
8
+ ### 1. **Existing Collections Management** 🗂️
9
+ - **Auto-detection**: On application startup, the system automatically detects all existing collections in ChromaDB
10
+ - **Load Existing Collection**: Users can now choose from existing collections and load them directly without recreating
11
+ - **Collection Selection**: Dropdown menu shows all available collections for quick access
12
+ - **Seamless Loading**: Click "📖 Load Existing Collection" to use a previously created collection
13
+
14
+ ### 2. **Smart Collection Recreation** 🔄
15
+ - **Selective Deletion**: When creating a new collection, only that specific collection is deleted and recreated
16
+ - **Other Collections Preserved**: All other existing collections remain untouched and unaffected
17
+ - **Conflict Resolution**: If a collection with the same name exists, it's deleted before creating the new one
18
+ - **User Feedback**: Clear warnings and progress messages when deleting and recreating collections
19
+
20
+ ### 3. **LLM Selection Options** 🤖
21
+
22
+ #### Chat Interface
23
+ - **Dynamic LLM Selector**: Switch between different LLM models while chatting
24
+ - **Real-time Switching**: Change LLM without reloading the collection
25
+ - **Automatic Pipeline Update**: The RAG pipeline automatically updates when a new LLM is selected
26
+ - **Persistent Selection**: The selected LLM is remembered in the session state
27
+
28
+ #### Evaluation Interface
29
+ - **Evaluation-specific LLM**: Choose a different LLM for running TRACE evaluation
30
+ - **Independent Selection**: Evaluation LLM can be different from the chat LLM
31
+ - **Automatic Restoration**: After evaluation, the system restores the original LLM
32
+ - **Flexible Testing**: Test different LLM models on the same dataset and collection
33
+
34
+ ### 4. **User Interface Improvements** 🎨
35
+ - **Two-step Process**:
36
+ 1. Load existing collection OR
37
+ 2. Create new collection (with all configuration options)
38
+ - **Clear Sections**: Separated sidebar sections for existing vs. new collections
39
+ - **Visual Indicators**: Icons and colors to distinguish different actions
40
+ - **Better Organization**: Configuration options logically grouped and hierarchical
41
+
42
+ ## Technical Implementation
43
+
44
+ ### New Functions
45
+ - `get_available_collections()`: Fetches list of collections from ChromaDB
46
+ - `load_existing_collection()`: Loads a pre-existing collection with LLM selection
47
+ - Updated `load_and_create_collection()`: Handles selective collection deletion
48
+
49
+ ### Session State Variables
50
+ - `current_llm`: Tracks the currently selected LLM
51
+ - `selected_collection`: Tracks which collection is loaded
52
+ - `available_collections`: Stores list of available collections
53
+
54
+ ### Collection Naming Convention
55
+ Collections are named as: `{dataset}_{chunking_strategy}_{embedding_model_short_name}`
56
+
57
+ Example: `covidqa_dense_all_mpnet`
58
+
59
+ ## User Workflow
60
+
61
+ ### Scenario 1: Using Existing Collection
62
+ 1. Application starts and detects existing collections
63
+ 2. User selects a collection from the dropdown
64
+ 3. User clicks "📖 Load Existing Collection"
65
+ 4. User selects an LLM for chatting
66
+ 5. User can start chatting immediately
67
+
68
+ ### Scenario 2: Creating New Collection
69
+ 1. User selects dataset from sidebar
70
+ 2. User clicks "🔍 Check Dataset Size" (optional)
71
+ 3. User configures chunking strategy and chunk parameters
72
+ 4. User selects embedding model
73
+ 5. User selects LLM
74
+ 6. User clicks "🚀 Load Data & Create Collection"
75
+ 7. System deletes any existing collection with same name
76
+ 8. System creates new collection with fresh data
77
+
78
+ ### Scenario 3: Switching LLMs During Chat
79
+ 1. Chat interface shows current collection and LLM selector
80
+ 2. User selects different LLM from "Select LLM for chat"
81
+ 3. RAG pipeline automatically updates with new LLM
82
+ 4. Continue chatting with new LLM
83
+
84
+ ### Scenario 4: Running Evaluation with Different LLM
85
+ 1. In Evaluation tab, user can select a different LLM
86
+ 2. Click "🔬 Run Evaluation"
87
+ 3. System uses selected LLM for evaluation
88
+ 4. Results are displayed with metrics
89
+ 5. Original chat LLM is restored after evaluation
90
+
91
+ ## Benefits
92
+
93
+ ✅ **Efficiency**: No need to recreate collections when testing different configurations
94
+ ✅ **Flexibility**: Easily compare different LLM models on the same data
95
+ ✅ **Safety**: Other collections remain untouched when managing new ones
96
+ ✅ **User Experience**: Clearer navigation and configuration options
97
+ ✅ **Time Saving**: Reuse existing collections instead of recreating them
98
+ ✅ **Testing**: Run evaluations with different LLMs for comprehensive analysis
99
+
100
+ ## API Key Management
101
+
102
+ - API Key input is required at startup
103
+ - Store in sidebar for use across all operations
104
+ - Used for both chat and evaluation with different LLMs
105
+
106
+ ## Error Handling
107
+
108
+ - Collection not found errors show helpful messages
109
+ - LLM loading failures fall back to default model
110
+ - Graceful error messages for all operations
111
+ - Automatic reconnection to ChromaDB if connection is lost
112
+
113
+ ## Future Enhancement Ideas
114
+
115
+ - 💾 Save evaluation results with metadata
116
+ - 📊 Compare multiple LLM evaluation results
117
+ - 🔄 Batch collection operations (delete multiple)
118
+ - 📈 Analytics dashboard for collection usage
119
+ - 🏷️ Collection tagging/categorization system
120
+ - 💬 Multi-turn evaluation with conversation history
GIT_PUSH_GUIDE.md ADDED
@@ -0,0 +1,156 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Push Code to GitHub - Steps
2
+
3
+ ## Option 1: Push to a New GitHub Repository
4
+
5
+ If you want to push this project to a repository on GitHub, follow these steps:
6
+
7
+ ### 1. On GitHub, create a new repository:
8
+ - Go to https://github.com/new
9
+ - Fill in repository name: `RAG-Capstone-Project` (or your preferred name)
10
+ - Add description: "Retrieval-Augmented Generation (RAG) system with TRACE evaluation metrics"
11
+ - Choose: Public or Private
12
+ - **DO NOT** initialize with README, .gitignore, or license (we already have these)
13
+ - Click "Create repository"
14
+
15
+ ### 2. In PowerShell, add remote and push:
16
+
17
+ ```powershell
18
+ cd "D:\CapStoneProject\RAG Capstone Project"
19
+
20
+ # Add remote (replace YOUR_USERNAME and REPO_NAME)
21
+ git remote add origin https://github.com/YOUR_USERNAME/RAG-Capstone-Project.git
22
+
23
+ # Rename branch to main (optional but recommended)
24
+ git branch -M main
25
+
26
+ # Push to GitHub
27
+ git push -u origin main
28
+ ```
29
+
30
+ ### 3. When prompted, enter your GitHub credentials:
31
+ - Username: Your GitHub username
32
+ - Password: Your GitHub personal access token (not your password)
33
+ - Generate token: https://github.com/settings/tokens
34
+
35
+ ---
36
+
37
+ ## Option 2: If You Don't Have GitHub Account Yet
38
+
39
+ 1. Go to https://github.com/join
40
+ 2. Create a free account
41
+ 3. Follow Option 1 steps above
42
+
43
+ ---
44
+
45
+ ## Troubleshooting
46
+
47
+ ### If you get "fatal: remote origin already exists":
48
+ ```powershell
49
+ git remote remove origin
50
+ git remote add origin https://github.com/YOUR_USERNAME/RAG-Capstone-Project.git
51
+ git push -u origin main
52
+ ```
53
+
54
+ ### If you get authentication error:
55
+ 1. Generate Personal Access Token:
56
+ - Go to https://github.com/settings/tokens
57
+ - Click "Generate new token (classic)"
58
+ - Select: repo, write:packages
59
+ - Copy the token
60
+ 2. Use token instead of password when prompted
61
+
62
+ ### If you have SSH key setup:
63
+ ```powershell
64
+ git remote add origin git@github.com:YOUR_USERNAME/RAG-Capstone-Project.git
65
+ git push -u origin main
66
+ ```
67
+
68
+ ---
69
+
70
+ ## What Will Be Pushed
71
+
72
+ ✅ All Python source files:
73
+ - `streamlit_app.py` - Main Streamlit UI
74
+ - `llm_client.py` - Groq LLM integration
75
+ - `vector_store.py` - ChromaDB management
76
+ - `embedding_models.py` - 8 embedding models
77
+ - `trace_evaluator.py` - TRACE metrics
78
+ - `dataset_loader.py` - RAGBench dataset loading
79
+ - `chunking_strategies.py` - 4 chunking strategies
80
+ - And more...
81
+
82
+ ✅ Documentation:
83
+ - `README.md` - Project overview
84
+ - `SETUP.md` - Installation guide
85
+ - `ENHANCEMENTS.md` - Recent enhancements
86
+ - `.env.example` - Environment template
87
+
88
+ ✅ Configuration:
89
+ - `requirements.txt` - All dependencies
90
+ - `Dockerfile` - Docker containerization
91
+ - `docker-compose.yml` - Multi-container setup
92
+ - `Procfile` - Heroku deployment
93
+
94
+ ❌ Excluded (in .gitignore):
95
+ - `venv/` - Virtual environment
96
+ - `chroma_db/` - ChromaDB data
97
+ - `.env` - API keys (keep local!)
98
+ - `__pycache__/` - Python cache
99
+ - `.streamlit/` - Streamlit config
100
+
101
+ ---
102
+
103
+ ## Next Steps After Pushing
104
+
105
+ 1. **Add a GitHub Actions workflow** (optional):
106
+ - Automated testing
107
+ - Code quality checks
108
+ - Deployment automation
109
+
110
+ 2. **Set up branch protection**:
111
+ - Require pull request reviews
112
+ - Enforce status checks
113
+
114
+ 3. **Add GitHub Pages documentation**:
115
+ - Host project documentation
116
+ - API documentation
117
+ - Evaluation results
118
+
119
+ 4. **Setup CI/CD**:
120
+ - Test on every push
121
+ - Deploy to Heroku/Cloud Run
122
+
123
+ ---
124
+
125
+ ## Commands Summary
126
+
127
+ ```powershell
128
+ # Navigate to project
129
+ cd "D:\CapStoneProject\RAG Capstone Project"
130
+
131
+ # Configure git
132
+ git config user.email "your_email@example.com"
133
+ git config user.name "Your Name"
134
+
135
+ # Add remote (replace placeholders)
136
+ git remote add origin https://github.com/YOUR_USERNAME/RAG-Capstone-Project.git
137
+
138
+ # Rename branch to main
139
+ git branch -M main
140
+
141
+ # Push to GitHub
142
+ git push -u origin main
143
+
144
+ # Verify
145
+ git remote -v
146
+ git log --oneline
147
+ ```
148
+
149
+ ---
150
+
151
+ **Share the GitHub URL with your team:**
152
+ ```
153
+ https://github.com/YOUR_USERNAME/RAG-Capstone-Project
154
+ ```
155
+
156
+ Let me know when you have your GitHub username ready, and I can help you complete the push!
Procfile ADDED
@@ -0,0 +1 @@
 
 
1
+ web: streamlit run streamlit_app.py --server.port=$PORT --server.address=0.0.0.0
README.md ADDED
@@ -0,0 +1,294 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # RAG Capstone Project
2
+
3
+ A comprehensive Retrieval-Augmented Generation (RAG) system with TRACE evaluation metrics for medical/clinical domains.
4
+
5
+ ## Features
6
+
7
+ - 🔍 **Multiple RAG Bench Datasets**: HotpotQA, 2WikiMultihopQA, MuSiQue, Natural Questions, TriviaQA
8
+ - 🧩 **Chunking Strategies**: Dense, Sparse, Hybrid, Re-ranking
9
+ - 🤖 **Medical Embedding Models**:
10
+ - sentence-transformers/embeddinggemma-300m-medical
11
+ - emilyalsentzer/Bio_ClinicalBERT
12
+ - Simonlee711/Clinical_ModernBERT
13
+ - 💾 **ChromaDB Vector Storage**: Persistent vector storage with efficient retrieval
14
+ - 🦙 **Groq LLM Integration**: With rate limiting (30 RPM)
15
+ - meta-llama/llama-4-maverick-17b-128e-instruct
16
+ - llama-3.1-8b-instant
17
+ - openai/gpt-oss-120b
18
+ - 📊 **TRACE Evaluation Metrics**:
19
+ - **U**tilization: How well the system uses retrieved documents
20
+ - **R**elevance: Relevance of retrieved documents to the query
21
+ - **A**dherence: How well the response adheres to the retrieved context
22
+ - **C**ompleteness: How complete the response is
23
+ - 💬 **Chat Interface**: Streamlit-based interactive chat with history
24
+ - 🔌 **REST API**: FastAPI backend for integration
25
+
26
+ ## Installation
27
+
28
+ ### Prerequisites
29
+
30
+ - Python 3.8+
31
+ - pip
32
+ - Groq API key
33
+
34
+ ### Setup
35
+
36
+ 1. Clone the repository:
37
+ ```bash
38
+ git clone <repository-url>
39
+ cd "RAG Capstone Project"
40
+ ```
41
+
42
+ 2. Create a virtual environment:
43
+ ```bash
44
+ python -m venv venv
45
+ ```
46
+
47
+ 3. Activate the virtual environment:
48
+
49
+ **Windows:**
50
+ ```bash
51
+ .\venv\Scripts\activate
52
+ ```
53
+
54
+ **Linux/Mac:**
55
+ ```bash
56
+ source venv/bin/activate
57
+ ```
58
+
59
+ 4. Install dependencies:
60
+ ```bash
61
+ pip install -r requirements.txt
62
+ ```
63
+
64
+ 5. Create a `.env` file from the example:
65
+ ```bash
66
+ copy .env.example .env
67
+ ```
68
+
69
+ 6. Edit `.env` and add your Groq API key:
70
+ ```
71
+ GROQ_API_KEY=your_groq_api_key_here
72
+ ```
73
+
74
+ ## Usage
75
+
76
+ ### Streamlit Application
77
+
78
+ Run the interactive Streamlit interface:
79
+
80
+ ```bash
81
+ streamlit run streamlit_app.py
82
+ ```
83
+
84
+ Then open your browser to `http://localhost:8501`
85
+
86
+ **Workflow:**
87
+ 1. Enter your Groq API key in the sidebar
88
+ 2. Select a dataset from RAG Bench
89
+ 3. Choose chunking strategy
90
+ 4. Select embedding model
91
+ 5. Choose LLM model
92
+ 6. Click "Load Data & Create Collection"
93
+ 7. Start chatting!
94
+ 8. View retrieved documents
95
+ 9. Run TRACE evaluation
96
+ 10. Export chat history
97
+
98
+ ### FastAPI Backend
99
+
100
+ Run the REST API server:
101
+
102
+ ```bash
103
+ python api.py
104
+ ```
105
+
106
+ Or with uvicorn:
107
+ ```bash
108
+ uvicorn api:app --reload --host 0.0.0.0 --port 8000
109
+ ```
110
+
111
+ API documentation available at: `http://localhost:8000/docs`
112
+
113
+ #### API Endpoints
114
+
115
+ - `GET /` - Root endpoint
116
+ - `GET /health` - Health check
117
+ - `GET /datasets` - List available datasets
118
+ - `GET /models/embedding` - List embedding models
119
+ - `GET /models/llm` - List LLM models
120
+ - `GET /chunking-strategies` - List chunking strategies
121
+ - `GET /collections` - List all collections
122
+ - `GET /collections/{name}` - Get collection info
123
+ - `POST /load-dataset` - Load dataset and create collection
124
+ - `POST /query` - Query the RAG system
125
+ - `GET /chat-history` - Get chat history
126
+ - `DELETE /chat-history` - Clear chat history
127
+ - `POST /evaluate` - Run TRACE evaluation
128
+ - `DELETE /collections/{name}` - Delete collection
129
+
130
+ ### Python API
131
+
132
+ Use the components programmatically:
133
+
134
+ ```python
135
+ from config import settings
136
+ from dataset_loader import RAGBenchLoader
137
+ from vector_store import ChromaDBManager
138
+ from llm_client import GroqLLMClient, RAGPipeline
139
+ from trace_evaluator import TRACEEvaluator
140
+
141
+ # Load dataset
142
+ loader = RAGBenchLoader()
143
+ dataset = loader.load_dataset("hotpotqa", max_samples=100)
144
+
145
+ # Create vector store
146
+ vector_store = ChromaDBManager()
147
+ vector_store.load_dataset_into_collection(
148
+ collection_name="my_collection",
149
+ embedding_model_name="emilyalsentzer/Bio_ClinicalBERT",
150
+ chunking_strategy="hybrid",
151
+ dataset_data=dataset
152
+ )
153
+
154
+ # Initialize LLM
155
+ llm = GroqLLMClient(
156
+ api_key="your_api_key",
157
+ model_name="llama-3.1-8b-instant"
158
+ )
159
+
160
+ # Create RAG pipeline
161
+ rag = RAGPipeline(llm, vector_store)
162
+
163
+ # Query
164
+ result = rag.query("What is the capital of France?")
165
+ print(result["response"])
166
+
167
+ # Evaluate
168
+ evaluator = TRACEEvaluator()
169
+ test_cases = [...] # Your test cases
170
+ results = evaluator.evaluate_batch(test_cases)
171
+ print(results)
172
+ ```
173
+
174
+ ## Project Structure
175
+
176
+ ```
177
+ RAG Capstone Project/
178
+ ├── __init__.py # Package initialization
179
+ ├── config.py # Configuration management
180
+ ├── dataset_loader.py # RAG Bench dataset loader
181
+ ├── chunking_strategies.py # Document chunking strategies
182
+ ├── embedding_models.py # Embedding model implementations
183
+ ├── vector_store.py # ChromaDB integration
184
+ ├── llm_client.py # Groq LLM client with rate limiting
185
+ ├── trace_evaluator.py # TRACE evaluation metrics
186
+ ├── streamlit_app.py # Streamlit chat interface
187
+ ├── api.py # FastAPI REST API
188
+ ├── requirements.txt # Python dependencies
189
+ ├── .env.example # Environment variables template
190
+ ├── .gitignore # Git ignore file
191
+ └── README.md # This file
192
+ ```
193
+
194
+ ## TRACE Metrics Explained
195
+
196
+ ### Utilization (U)
197
+ Measures how well the system uses the retrieved documents in generating the response. Higher scores indicate that the system effectively incorporates information from multiple retrieved documents.
198
+
199
+ ### Relevance (R)
200
+ Evaluates the relevance of retrieved documents to the user's query. Uses lexical overlap and keyword matching to determine if the right documents were retrieved.
201
+
202
+ ### Adherence (A)
203
+ Assesses how well the generated response adheres to the retrieved context. Ensures the response is grounded in the provided documents rather than hallucinated.
204
+
205
+ ### Completeness (C)
206
+ Evaluates how complete the response is in answering the query. Considers response length, question type, and comparison with ground truth if available.
207
+
208
+ ## Deployment Options
209
+
210
+ ### Heroku
211
+
212
+ 1. Create `Procfile`:
213
+ ```
214
+ web: streamlit run streamlit_app.py --server.port=$PORT --server.address=0.0.0.0
215
+ api: uvicorn api:app --host=0.0.0.0 --port=$PORT
216
+ ```
217
+
218
+ 2. Deploy:
219
+ ```bash
220
+ heroku create your-app-name
221
+ git push heroku main
222
+ ```
223
+
224
+ ### Docker
225
+
226
+ Create `Dockerfile`:
227
+ ```dockerfile
228
+ FROM python:3.9-slim
229
+
230
+ WORKDIR /app
231
+ COPY requirements.txt .
232
+ RUN pip install -r requirements.txt
233
+
234
+ COPY . .
235
+
236
+ EXPOSE 8501 8000
237
+
238
+ CMD ["streamlit", "run", "streamlit_app.py"]
239
+ ```
240
+
241
+ Build and run:
242
+ ```bash
243
+ docker build -t rag-capstone .
244
+ docker run -p 8501:8501 -p 8000:8000 rag-capstone
245
+ ```
246
+
247
+ ### Cloud Run / AWS / Azure
248
+
249
+ The application can be deployed to any cloud platform that supports Python applications. See the respective platform documentation for deployment instructions.
250
+
251
+ ## Configuration
252
+
253
+ Edit `config.py` or set environment variables in `.env`:
254
+
255
+ ```env
256
+ GROQ_API_KEY=your_api_key
257
+ CHROMA_PERSIST_DIRECTORY=./chroma_db
258
+ GROQ_RPM_LIMIT=30
259
+ RATE_LIMIT_DELAY=2.0
260
+ LOG_LEVEL=INFO
261
+ ```
262
+
263
+ ## Rate Limiting
264
+
265
+ The application implements rate limiting for Groq API calls:
266
+ - Maximum 30 requests per minute (configurable)
267
+ - Automatic delay of 2 seconds between requests
268
+ - Smart waiting when rate limit is reached
269
+
270
+ ## Troubleshooting
271
+
272
+ ### ChromaDB Issues
273
+ If you encounter ChromaDB errors, try deleting the `chroma_db` directory and recreating collections.
274
+
275
+ ### Embedding Model Loading
276
+ Medical embedding models may require significant memory. If you encounter out-of-memory errors, try:
277
+ - Using a smaller model
278
+ - Reducing batch size
279
+ - Using CPU instead of GPU
280
+
281
+ ### API Key Errors
282
+ Ensure your Groq API key is correctly set in the `.env` file or passed to the application.
283
+
284
+ ## License
285
+
286
+ MIT License
287
+
288
+ ## Contributors
289
+
290
+ RAG Capstone Team
291
+
292
+ ## Support
293
+
294
+ For issues and questions, please open an issue on the GitHub repository.
SETUP.md ADDED
@@ -0,0 +1,69 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Quick Setup Guide (Windows)
2
+
3
+ ## Requirements
4
+ - Python 3.10+
5
+ - Groq API Key
6
+
7
+ ## Installation Steps
8
+
9
+ ### 1. Create Virtual Environment
10
+ ```powershell
11
+ python -m venv venv
12
+ ```
13
+
14
+ ### 2. Activate Virtual Environment
15
+ ```powershell
16
+ .\venv\Scripts\Activate.ps1
17
+ ```
18
+
19
+ **If you get execution policy error:**
20
+ ```powershell
21
+ Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
22
+ ```
23
+
24
+ ### 3. Upgrade pip
25
+ ```powershell
26
+ python -m pip install --upgrade pip
27
+ ```
28
+
29
+ ### 4. Install Dependencies
30
+ ```powershell
31
+ pip install -r requirements.txt
32
+ ```
33
+
34
+ ### 5. Configure API Key
35
+ ```powershell
36
+ copy .env.example .env
37
+ notepad .env
38
+ ```
39
+
40
+ Add your Groq API key:
41
+ ```
42
+ GROQ_API_KEY=your_groq_api_key_here
43
+ ```
44
+
45
+ ### 6. Run Application
46
+ ```powershell
47
+ streamlit run streamlit_app.py
48
+ ```
49
+
50
+ Open browser to: **http://localhost:8501**
51
+
52
+ ---
53
+
54
+ ## Common Issues
55
+
56
+ **Execution Policy Error:**
57
+ ```powershell
58
+ Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
59
+ ```
60
+
61
+ **Reset ChromaDB:**
62
+ ```powershell
63
+ Remove-Item -Recurse -Force .\chroma_db
64
+ ```
65
+
66
+ **Deactivate venv:**
67
+ ```powershell
68
+ deactivate
69
+ ```
UPDATE_REMOTE.md ADDED
@@ -0,0 +1,178 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # How to Update Remote Origin
2
+
3
+ ## Method 1: Change Existing Remote URL (Recommended)
4
+
5
+ If you already have a remote set and want to change it:
6
+
7
+ ```powershell
8
+ # View current remote
9
+ git remote -v
10
+
11
+ # Option A: Update with HTTPS URL
12
+ git remote set-url origin https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
13
+
14
+ # Option B: Update with SSH URL
15
+ git remote set-url origin git@github.com:YOUR_USERNAME/YOUR_REPO_NAME.git
16
+
17
+ # Verify the change
18
+ git remote -v
19
+ ```
20
+
21
+ ---
22
+
23
+ ## Method 2: Remove and Re-add Remote
24
+
25
+ If you want to completely remove and add a new remote:
26
+
27
+ ```powershell
28
+ # Remove existing remote
29
+ git remote remove origin
30
+
31
+ # Add new remote
32
+ git remote add origin https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
33
+
34
+ # Verify
35
+ git remote -v
36
+ ```
37
+
38
+ ---
39
+
40
+ ## Method 3: Using Your GitHub Repository
41
+
42
+ ### Step 1: Create Repository on GitHub
43
+ 1. Go to https://github.com/new
44
+ 2. Repository name: `RAG-Capstone-Project`
45
+ 3. Description: `Retrieval-Augmented Generation system with TRACE evaluation`
46
+ 4. Select Public or Private
47
+ 5. **IMPORTANT**: Don't initialize with README, .gitignore, or license
48
+ 6. Click "Create repository"
49
+
50
+ ### Step 2: Copy Your Repository URL
51
+ After creating, GitHub will show you the URL. Copy it (should look like):
52
+ ```
53
+ https://github.com/YOUR_USERNAME/RAG-Capstone-Project.git
54
+ ```
55
+
56
+ ### Step 3: Update Remote in Your Local Repository
57
+ ```powershell
58
+ cd "D:\CapStoneProject\RAG Capstone Project"
59
+
60
+ git remote set-url origin https://github.com/YOUR_USERNAME/RAG-Capstone-Project.git
61
+
62
+ # Verify
63
+ git remote -v
64
+ ```
65
+
66
+ ---
67
+
68
+ ## Complete Step-by-Step Example
69
+
70
+ Let's say your GitHub username is `john-doe`:
71
+
72
+ ```powershell
73
+ # 1. Navigate to project
74
+ cd "D:\CapStoneProject\RAG Capstone Project"
75
+
76
+ # 2. Update the remote URL
77
+ git remote set-url origin https://github.com/john-doe/RAG-Capstone-Project.git
78
+
79
+ # 3. Verify it was updated
80
+ git remote -v
81
+
82
+ # 4. Push to GitHub (first time)
83
+ git push -u origin main
84
+
85
+ # 5. For future pushes (just use)
86
+ git push
87
+ ```
88
+
89
+ ---
90
+
91
+ ## Using SSH Instead of HTTPS
92
+
93
+ ### Step 1: Generate SSH Key (if you don't have one)
94
+ ```powershell
95
+ ssh-keygen -t ed25519 -C "your_email@example.com"
96
+ ```
97
+
98
+ ### Step 2: Add SSH Key to GitHub
99
+ 1. Copy the public key: `cat ~/.ssh/id_ed25519.pub`
100
+ 2. Go to https://github.com/settings/keys
101
+ 3. Click "New SSH key"
102
+ 4. Paste your public key
103
+ 5. Save
104
+
105
+ ### Step 3: Update Remote to SSH
106
+ ```powershell
107
+ git remote set-url origin git@github.com:YOUR_USERNAME/RAG-Capstone-Project.git
108
+
109
+ # Verify
110
+ git remote -v
111
+
112
+ # Test connection
113
+ ssh -T git@github.com
114
+ ```
115
+
116
+ ---
117
+
118
+ ## Troubleshooting
119
+
120
+ ### Problem: "fatal: remote origin already exists"
121
+ ```powershell
122
+ # Remove the old remote first
123
+ git remote remove origin
124
+
125
+ # Then add the new one
126
+ git remote add origin https://github.com/YOUR_USERNAME/RAG-Capstone-Project.git
127
+ ```
128
+
129
+ ### Problem: "Permission denied (publickey)"
130
+ This means SSH authentication failed. Use HTTPS instead:
131
+ ```powershell
132
+ git remote set-url origin https://github.com/YOUR_USERNAME/RAG-Capstone-Project.git
133
+ ```
134
+
135
+ ### Problem: "fatal: Authentication failed"
136
+ This means your GitHub credentials are incorrect. Use a Personal Access Token:
137
+ 1. Generate token: https://github.com/settings/tokens
138
+ 2. When pushing, use the token as password
139
+
140
+ ### Check Current Configuration
141
+ ```powershell
142
+ # View remote URLs
143
+ git remote -v
144
+
145
+ # View detailed remote info
146
+ git remote show origin
147
+
148
+ # View git config
149
+ git config --local -l
150
+ ```
151
+
152
+ ---
153
+
154
+ ## Quick Reference
155
+
156
+ | Task | Command |
157
+ |------|---------|
158
+ | View remotes | `git remote -v` |
159
+ | Update URL (HTTPS) | `git remote set-url origin https://github.com/USER/REPO.git` |
160
+ | Update URL (SSH) | `git remote set-url origin git@github.com:USER/REPO.git` |
161
+ | Remove remote | `git remote remove origin` |
162
+ | Add remote | `git remote add origin <URL>` |
163
+ | Push to remote | `git push -u origin main` |
164
+ | Check remote details | `git remote show origin` |
165
+
166
+ ---
167
+
168
+ ## What You Need to Push
169
+
170
+ Before pushing, make sure you have:
171
+ - ✅ GitHub account created
172
+ - ✅ Repository created on GitHub
173
+ - ✅ Remote URL updated correctly
174
+ - ✅ Local commits ready (already done ✓)
175
+
176
+ ---
177
+
178
+ **What's your GitHub username?** I can help you with the exact commands once you provide it!
__init__.py ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ RAG Capstone Project - Retrieval-Augmented Generation with TRACE Evaluation
3
+
4
+ This application provides a complete RAG system with:
5
+ - Multiple embedding models (Bio-medical BERT models)
6
+ - Various chunking strategies (dense, sparse, hybrid, re-ranking)
7
+ - ChromaDB vector storage
8
+ - Groq LLM integration with rate limiting
9
+ - TRACE evaluation metrics
10
+ - Streamlit chat interface
11
+ - FastAPI REST API
12
+ """
13
+
14
+ __version__ = "1.0.0"
15
+ __author__ = "RAG Capstone Team"
api.py ADDED
@@ -0,0 +1,374 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """FastAPI backend service for RAG application."""
2
+ from fastapi import FastAPI, HTTPException, BackgroundTasks
3
+ from fastapi.middleware.cors import CORSMiddleware
4
+ from pydantic import BaseModel, Field
5
+ from typing import List, Optional, Dict
6
+ import uvicorn
7
+ from datetime import datetime
8
+ import os
9
+
10
+ from config import settings
11
+ from dataset_loader import RAGBenchLoader
12
+ from vector_store import ChromaDBManager
13
+ from llm_client import GroqLLMClient, RAGPipeline
14
+ from trace_evaluator import TRACEEvaluator
15
+
16
+ # Initialize FastAPI app
17
+ app = FastAPI(
18
+ title="RAG Capstone API",
19
+ description="API for RAG system with TRACE evaluation",
20
+ version="1.0.0"
21
+ )
22
+
23
+ # Add CORS middleware
24
+ app.add_middleware(
25
+ CORSMiddleware,
26
+ allow_origins=["*"],
27
+ allow_credentials=True,
28
+ allow_methods=["*"],
29
+ allow_headers=["*"],
30
+ )
31
+
32
+ # Global state
33
+ rag_pipeline: Optional[RAGPipeline] = None
34
+ vector_store: Optional[ChromaDBManager] = None
35
+ current_collection: Optional[str] = None
36
+
37
+
38
+ # Request/Response models
39
class DatasetLoadRequest(BaseModel):
    """Request model for loading dataset.

    Bundles dataset selection, chunking configuration and model choices for
    the ``POST /load-dataset`` endpoint.
    """
    dataset_name: str = Field(..., description="Name of the dataset")
    num_samples: int = Field(50, description="Number of samples to load")
    chunking_strategy: str = Field("hybrid", description="Chunking strategy")
    chunk_size: int = Field(512, description="Size of chunks")
    overlap: int = Field(50, description="Overlap between chunks")
    embedding_model: str = Field(..., description="Embedding model name")
    llm_model: str = Field("llama-3.1-8b-instant", description="LLM model name")
    # NOTE(review): the API key travels in the request body; consider moving
    # it to a header or server-side environment variable.
    groq_api_key: str = Field(..., description="Groq API key")
49
+
50
+
51
class QueryRequest(BaseModel):
    """Request model for querying the RAG pipeline via ``POST /query``."""
    query: str = Field(..., description="User query")
    n_results: int = Field(5, description="Number of documents to retrieve")
    max_tokens: int = Field(1024, description="Maximum tokens to generate")
    temperature: float = Field(0.7, description="Sampling temperature")
57
+
58
+
59
class QueryResponse(BaseModel):
    """Response model for query: generated answer plus retrieval evidence."""
    query: str
    response: str
    retrieved_documents: List[Dict]
    timestamp: str
65
+
66
+
67
class EvaluationRequest(BaseModel):
    """Request model for TRACE evaluation (``POST /evaluate``)."""
    num_samples: int = Field(10, description="Number of test samples")
70
+
71
+
72
class CollectionInfo(BaseModel):
    """Collection information model (name, document count, metadata)."""
    name: str
    count: int
    metadata: Dict
77
+
78
+
79
+ # API endpoints
80
@app.get("/")
async def root():
    """Return basic service metadata and a pointer to the interactive docs."""
    payload = {
        "message": "RAG Capstone API",
        "version": "1.0.0",
        "docs": "/docs",
    }
    return payload
88
+
89
+
90
@app.get("/health")
async def health_check():
    """Liveness probe: report service status with the current timestamp."""
    now = datetime.now().isoformat()
    return {"status": "healthy", "timestamp": now}
97
+
98
+
99
@app.get("/datasets")
async def list_datasets():
    """Expose the RAGBench dataset names configured in settings."""
    return {"datasets": settings.ragbench_datasets}
105
+
106
+
107
@app.get("/models/embedding")
async def list_embedding_models():
    """Expose the embedding model identifiers configured in settings."""
    return {"embedding_models": settings.embedding_models}
113
+
114
+
115
@app.get("/models/llm")
async def list_llm_models():
    """Expose the LLM model identifiers configured in settings."""
    return {"llm_models": settings.llm_models}
121
+
122
+
123
@app.get("/chunking-strategies")
async def list_chunking_strategies():
    """Expose the chunking strategy names configured in settings."""
    return {"chunking_strategies": settings.chunking_strategies}
129
+
130
+
131
@app.get("/collections")
async def list_collections():
    """List every ChromaDB collection, lazily creating the store handle."""
    global vector_store

    if not vector_store:
        vector_store = ChromaDBManager(settings.chroma_persist_directory)

    names = vector_store.list_collections()
    return {"collections": names, "count": len(names)}
145
+
146
+
147
@app.get("/collections/{collection_name}")
async def get_collection_info(collection_name: str):
    """Return stats for one collection, or 404 when it cannot be found."""
    global vector_store

    if not vector_store:
        vector_store = ChromaDBManager(settings.chroma_persist_directory)

    try:
        return vector_store.get_collection_stats(collection_name)
    except Exception as e:
        raise HTTPException(status_code=404, detail=f"Collection not found: {str(e)}")
160
+
161
+
162
@app.post("/load-dataset")
async def load_dataset(request: DatasetLoadRequest, background_tasks: BackgroundTasks):
    """Load dataset and create vector collection.

    Loads up to ``request.num_samples`` rows of the chosen RAGBench dataset,
    chunks and embeds them into a new ChromaDB collection, then wires up the
    Groq LLM client and RAG pipeline as module-level state for the other
    endpoints.

    NOTE(review): ``background_tasks`` is accepted but unused — loading runs
    synchronously in this handler; confirm whether async loading was intended.
    """
    global rag_pipeline, vector_store, current_collection

    try:
        # Initialize dataset loader
        loader = RAGBenchLoader()

        # Load dataset
        dataset = loader.load_dataset(
            request.dataset_name,
            split="train",
            max_samples=request.num_samples
        )

        if not dataset:
            raise HTTPException(status_code=400, detail="Failed to load dataset")

        # Initialize vector store
        vector_store = ChromaDBManager(settings.chroma_persist_directory)

        # Create collection name as "<dataset>_<strategy>_<model-tail>".
        # "-" and "." are replaced with "_" to stay within ChromaDB's
        # collection-name character restrictions.
        collection_name = f"{request.dataset_name}_{request.chunking_strategy}_{request.embedding_model.split('/')[-1]}"
        collection_name = collection_name.replace("-", "_").replace(".", "_")

        # Load data into collection
        vector_store.load_dataset_into_collection(
            collection_name=collection_name,
            embedding_model_name=request.embedding_model,
            chunking_strategy=request.chunking_strategy,
            dataset_data=dataset,
            chunk_size=request.chunk_size,
            overlap=request.overlap
        )

        # Initialize LLM client
        llm_client = GroqLLMClient(
            api_key=request.groq_api_key,
            model_name=request.llm_model,
            max_rpm=settings.groq_rpm_limit,
            rate_limit_delay=settings.rate_limit_delay
        )

        # Create RAG pipeline
        rag_pipeline = RAGPipeline(llm_client, vector_store)
        current_collection = collection_name

        return {
            "status": "success",
            "collection_name": collection_name,
            "num_documents": len(dataset),
            "message": f"Collection '{collection_name}' created successfully"
        }

    except HTTPException:
        # Bug fix: HTTPException subclasses Exception, so the 400 raised above
        # was previously caught below and re-wrapped as a 500. Re-raise it
        # unchanged so clients see the intended status code.
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error loading dataset: {str(e)}")
219
+
220
+
221
@app.post("/query", response_model=QueryResponse)
async def query_rag(request: QueryRequest):
    """Answer a user query with the active RAG pipeline (retrieve + generate)."""
    global rag_pipeline

    if not rag_pipeline:
        raise HTTPException(
            status_code=400,
            detail="RAG pipeline not initialized. Load a dataset first."
        )

    try:
        answer = rag_pipeline.query(
            query=request.query,
            n_results=request.n_results,
            max_tokens=request.max_tokens,
            temperature=request.temperature,
        )
        # Stamp the response so clients can order and inspect results.
        answer["timestamp"] = datetime.now().isoformat()
        return answer
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error processing query: {str(e)}")
246
+
247
+
248
@app.get("/chat-history")
async def get_chat_history():
    """Return the conversation turns accumulated by the pipeline."""
    global rag_pipeline

    if not rag_pipeline:
        raise HTTPException(
            status_code=400,
            detail="RAG pipeline not initialized. Load a dataset first."
        )

    return {"history": rag_pipeline.get_chat_history()}
262
+
263
+
264
@app.delete("/chat-history")
async def clear_chat_history():
    """Drop all stored conversation turns from the pipeline."""
    global rag_pipeline

    if not rag_pipeline:
        raise HTTPException(
            status_code=400,
            detail="RAG pipeline not initialized. Load a dataset first."
        )

    rag_pipeline.clear_history()
    return {"status": "success", "message": "Chat history cleared"}
281
+
282
+
283
@app.post("/evaluate")
async def run_evaluation(request: EvaluationRequest):
    """Run TRACE evaluation.

    Pulls held-out test samples for the dataset behind the current
    collection, answers each question through the RAG pipeline, and scores
    the (query, response, retrieved documents, ground truth) tuples with the
    TRACE evaluator.
    """
    global rag_pipeline, current_collection

    if not rag_pipeline:
        raise HTTPException(
            status_code=400,
            detail="RAG pipeline not initialized. Load a dataset first."
        )

    try:
        # Bug fix: the previous dead assignment
        # `collection_metadata = vector_store.current_collection.metadata`
        # was never used and could raise AttributeError; removed.
        # Recover the dataset name from the "<dataset>_<strategy>_<model>"
        # collection naming convention, falling back to a default.
        dataset_name = current_collection.split("_")[0] if current_collection else "hotpotqa"

        # Get test data
        loader = RAGBenchLoader()
        test_data = loader.get_test_data(dataset_name, request.num_samples)

        # Answer each test question through the pipeline and collect the
        # tuples the evaluator expects.
        test_cases = []
        for sample in test_data:
            result = rag_pipeline.query(sample["question"], n_results=5)

            test_cases.append({
                "query": sample["question"],
                "response": result["response"],
                "retrieved_documents": [doc["document"] for doc in result["retrieved_documents"]],
                "ground_truth": sample.get("answer", "")
            })

        # Run evaluation
        evaluator = TRACEEvaluator()
        results = evaluator.evaluate_batch(test_cases)

        return {
            "status": "success",
            "results": results
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error during evaluation: {str(e)}")
327
+
328
+
329
@app.delete("/collections/{collection_name}")
async def delete_collection(collection_name: str):
    """Delete a named collection from the vector store."""
    global vector_store

    if not vector_store:
        vector_store = ChromaDBManager(settings.chroma_persist_directory)

    try:
        vector_store.delete_collection(collection_name)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error deleting collection: {str(e)}")

    return {
        "status": "success",
        "message": f"Collection '{collection_name}' deleted"
    }
345
+
346
+
347
@app.get("/current-collection")
async def get_current_collection():
    """Report which collection (if any) the pipeline is currently using."""
    global current_collection, vector_store

    if not current_collection:
        return {
            "collection": None,
            "message": "No collection loaded"
        }

    try:
        stats = vector_store.get_collection_stats(current_collection)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error getting collection info: {str(e)}")

    return {
        "collection": current_collection,
        "stats": stats
    }
366
+
367
+
368
if __name__ == "__main__":
    # Dev entrypoint: serve the API locally with auto-reload on code changes.
    uvicorn.run("api:app", host="0.0.0.0", port=8000, reload=True)
chunking_strategies.py ADDED
@@ -0,0 +1,207 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Chunking strategies for document processing."""
2
+ from typing import List, Dict, Tuple
3
+ from abc import ABC, abstractmethod
4
+ import re
5
+ from rank_bm25 import BM25Okapi
6
+ import numpy as np
7
+
8
+
9
class ChunkingStrategy(ABC):
    """Abstract base class for chunking strategies.

    Concrete strategies turn one document string into a list of chunk
    strings; sizes here are measured in characters, not tokens.
    """

    @abstractmethod
    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        """Chunk text into smaller pieces.

        Args:
            text: Input text to chunk
            chunk_size: Maximum size of each chunk (characters)
            overlap: Number of characters to overlap between chunks

        Returns:
            List of text chunks
        """
        pass
26
+
27
+
28
class DenseChunking(ChunkingStrategy):
    """Dense chunking strategy - fixed-size chunks with overlap."""

    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        """Create dense chunks with fixed size and overlap.

        Chunks are cut every ``chunk_size`` characters, preferring to end at
        the last '.' or '\\n' when one occurs in the second half of the
        window. Consecutive chunks share up to ``overlap`` characters.

        Args:
            text: Input text to chunk
            chunk_size: Maximum size of each chunk
            overlap: Number of characters to overlap between chunks

        Returns:
            List of non-empty text chunks
        """
        if not text:
            return []

        chunks = []
        start = 0
        text_length = len(text)

        while start < text_length:
            end = start + chunk_size
            chunk = text[start:end]

            # Try to break at a sentence boundary, but only if that keeps at
            # least half of the window (avoids degenerate tiny chunks).
            if end < text_length:
                break_point = max(chunk.rfind('.'), chunk.rfind('\n'))
                if break_point > chunk_size * 0.5:
                    chunk = chunk[:break_point + 1]
                    end = start + break_point + 1

            chunks.append(chunk.strip())
            # Bug fix: guarantee forward progress. With overlap >= chunk_size,
            # or after a boundary trim with overlap > chunk_size / 2,
            # `end - overlap` could stall or move backwards and loop forever.
            start = max(end - overlap, start + 1)

        return [c for c in chunks if c]  # Remove empty chunks
62
+
63
+
64
class SparseChunking(ChunkingStrategy):
    """Sparse chunking strategy - semantic-based chunks (paragraphs/sections)."""

    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        """Split on blank lines and pack paragraphs into size-bounded chunks."""
        if not text:
            return []

        pieces = []
        buffer = ""

        # Blank-line-separated paragraphs are the semantic units.
        for paragraph in (p.strip() for p in re.split(r'\n\s*\n', text)):
            if not paragraph:
                continue

            fits = len(buffer) + len(paragraph) <= chunk_size
            if buffer and not fits:
                pieces.append(buffer.strip())
                if overlap > 0:
                    # Seed the next chunk with the tail words of the previous
                    # one to preserve context across the boundary.
                    words = buffer.split()
                    keep = min(overlap // 5, len(words))
                    buffer = " ".join(words[-keep:]) + " " + paragraph
                else:
                    buffer = paragraph
            else:
                buffer += ("\n\n" if buffer else "") + paragraph

        # Flush whatever is left in the buffer.
        if buffer:
            pieces.append(buffer.strip())

        return pieces
101
+
102
+
103
class HybridChunking(ChunkingStrategy):
    """Hybrid chunking strategy - combines dense and sparse approaches."""

    def __init__(self):
        self.dense_chunker = DenseChunking()
        self.sparse_chunker = SparseChunking()

    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        """Semantic-first split, then size-bounded re-chunking of long pieces."""
        if not text:
            return []

        # Pass 1: paragraph-level split at twice the target size, no overlap.
        coarse = self.sparse_chunker.chunk_text(text, chunk_size * 2, 0)

        # Pass 2: anything still over the limit gets dense re-chunking.
        refined = []
        for piece in coarse:
            if len(piece) <= chunk_size:
                refined.append(piece)
            else:
                refined.extend(
                    self.dense_chunker.chunk_text(piece, chunk_size, overlap)
                )

        return refined
131
+
132
+
133
class ReRankingChunking(ChunkingStrategy):
    """Re-ranking chunking strategy - creates chunks and provides relevance scoring."""

    def __init__(self):
        # Delegates chunk creation to the hybrid strategy; the BM25 index and
        # chunk list are rebuilt on every chunk_text() call.
        self.base_chunker = HybridChunking()
        self.bm25 = None
        self.chunks = []

    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        """Create chunks suitable for re-ranking."""
        self.chunks = self.base_chunker.chunk_text(text, chunk_size, overlap)

        # Initialize BM25 for re-ranking capability.
        # NOTE(review): lowercase whitespace split is a crude tokenizer —
        # punctuation stays attached to tokens; confirm this is acceptable.
        # NOTE(review): BM25Okapi raises on an empty corpus — confirm callers
        # never pass empty text here.
        tokenized_chunks = [chunk.lower().split() for chunk in self.chunks]
        self.bm25 = BM25Okapi(tokenized_chunks)

        return self.chunks

    def rerank_chunks(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
        """Re-rank chunks based on query relevance.

        Must be called after chunk_text(); scores refer to the most recently
        chunked document.

        Args:
            query: Query string
            top_k: Number of top chunks to return

        Returns:
            List of (chunk, score) tuples, highest BM25 score first
        """
        if not self.bm25 or not self.chunks:
            return []

        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)

        # Get top-k chunks with scores (descending by BM25 score)
        top_indices = np.argsort(scores)[::-1][:top_k]
        ranked_chunks = [
            (self.chunks[idx], float(scores[idx]))
            for idx in top_indices
        ]

        return ranked_chunks
176
+
177
+
178
class ChunkingFactory:
    """Factory for creating chunking strategy instances."""

    # Registry mapping public strategy names to their implementing classes.
    STRATEGIES = {
        "dense": DenseChunking,
        "sparse": SparseChunking,
        "hybrid": HybridChunking,
        "re-ranking": ReRankingChunking
    }

    @classmethod
    def create_chunker(cls, strategy: str) -> ChunkingStrategy:
        """Instantiate the chunker registered under ``strategy``.

        Args:
            strategy: Name of the chunking strategy

        Returns:
            ChunkingStrategy instance

        Raises:
            ValueError: If ``strategy`` is not a registered name.
        """
        try:
            chunker_cls = cls.STRATEGIES[strategy]
        except KeyError:
            raise ValueError(f"Unknown chunking strategy: {strategy}. "
                             f"Available: {list(cls.STRATEGIES.keys())}") from None
        return chunker_cls()

    @classmethod
    def get_available_strategies(cls) -> List[str]:
        """Names of all registered chunking strategies."""
        return [*cls.STRATEGIES]
cleanup_chroma.py ADDED
@@ -0,0 +1,93 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Script to clean up ChromaDB collections and cache."""
3
+
4
+ import shutil
5
+ import os
6
+ from pathlib import Path
7
+
8
def cleanup_chroma_db():
    """Clean up ChromaDB collections and cache.

    Interactive maintenance helper: force-deletes the on-disk ChromaDB
    directories, optionally (with y/n prompts) clears the HuggingFace
    dataset cache and the user-level ChromaDB cache, then attempts to drop
    any remaining collections through ChromaDBManager. All errors are
    reported but never raised — the script is deliberately best-effort.
    """

    print("=" * 60)
    print("ChromaDB Cleanup Utility")
    print("=" * 60)

    # First, forcefully delete the chroma_db directory
    chroma_path = Path("./chroma_db")
    if chroma_path.exists():
        print(f"\n🗑️ Removing chroma_db directory: {chroma_path}")
        try:
            shutil.rmtree(chroma_path)
            print(f"✅ Deleted directory: {chroma_path}")
        except Exception as e:
            print(f"❌ Error deleting directory: {e}")
    else:
        print(f"\n✅ chroma_db directory not found: {chroma_path}")

    # Also check for ChromaDB in .chroma directory (alternative location)
    chroma_alt_path = Path("./.chroma")
    if chroma_alt_path.exists():
        print(f"\n🗑️ Removing .chroma directory: {chroma_alt_path}")
        try:
            shutil.rmtree(chroma_alt_path)
            print(f"✅ Deleted directory: {chroma_alt_path}")
        except Exception as e:
            print(f"❌ Error deleting directory: {e}")

    # Clear HuggingFace dataset cache (optional; removes ALL cached HF
    # datasets for this user, not only RAGBench — hence the prompt)
    response = input("\n🗑️ Clear HuggingFace dataset cache? (y/n): ").lower()
    if response == 'y':
        cache_path = Path.home() / ".cache" / "huggingface" / "datasets"
        if cache_path.exists():
            print(f"🗑️ Removing HF cache: {cache_path}")
            try:
                shutil.rmtree(cache_path)
                print(f"✅ Deleted HF cache: {cache_path}")
            except Exception as e:
                print(f"❌ Error deleting HF cache: {e}")
        else:
            print("ℹ️ HuggingFace cache not found")

    # Clear ChromaDB chroma cache directory (user-level, in $HOME)
    response = input("\n🗑️ Clear ChromaDB chroma cache? (y/n): ").lower()
    if response == 'y':
        chroma_cache = Path.home() / ".chroma"
        if chroma_cache.exists():
            print(f"🗑️ Removing ChromaDB cache: {chroma_cache}")
            try:
                shutil.rmtree(chroma_cache)
                print(f"✅ Deleted ChromaDB cache: {chroma_cache}")
            except Exception as e:
                print(f"❌ Error deleting ChromaDB cache: {e}")

    # Try to use ChromaDBManager if possible. Note this recreates ./chroma_db
    # (the client persists an empty store), which is expected after a wipe.
    print("\n📋 Attempting to connect to ChromaDB...")
    try:
        from vector_store import ChromaDBManager

        manager = ChromaDBManager(persist_directory="./chroma_db")

        # List existing collections
        collections = manager.list_collections()
        print(f"📊 Found {len(collections)} collection(s):")
        for col in collections:
            print(f" - {col}")

        # Clear all collections
        if collections:
            print("\n🗑️ Clearing all collections...")
            deleted = manager.clear_all_collections()
            print(f"✅ Deleted {deleted} collection(s)")
        else:
            print("\n✅ No collections to delete")
    except Exception as e:
        print(f"⚠️ Could not connect to ChromaDB via manager: {e}")
        print("ℹ️ This is OK - the directory has been deleted already.")

    print("\n" + "=" * 60)
    print("✅ Cleanup completed! You can now start fresh.")
    print("=" * 60)
90
+
91
+
92
+ if __name__ == "__main__":
93
+ cleanup_chroma_db()
config.py ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Configuration management for RAG Application."""
2
+ from pydantic_settings import BaseSettings
3
+ from typing import Optional
4
+ import os
5
+
6
+
7
class Settings(BaseSettings):
    """Application settings.

    Values are read from the environment (and a local ``.env`` file); the
    literals below act as defaults. pydantic copies mutable defaults per
    instance, so the list defaults here are safe.
    """

    # API Keys
    groq_api_key: str = ""

    # ChromaDB
    chroma_persist_directory: str = "./chroma_db"

    # Rate Limiting
    groq_rpm_limit: int = 30
    rate_limit_delay: float = 2.0

    # Embedding Models (HuggingFace Hub ids, plus one Gemini API model)
    embedding_models: list = [
        "sentence-transformers/all-mpnet-base-v2",  # Stable, high quality
        "emilyalsentzer/Bio_ClinicalBERT",  # Clinical domain
        "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",  # Medical domain
        "sentence-transformers/all-MiniLM-L6-v2",  # Fast, lightweight
        # NOTE(review): verify this id exists on the Hub — the commonly
        # published name is "paraphrase-multilingual-MiniLM-L12-v2".
        "sentence-transformers/multilingual-MiniLM-L12-v2",  # Multilingual
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",  # Paraphrase
        "allenai/specter",  # Academic papers
        "gemini-embedding-001"  # Gemini API
    ]

    # LLM Models (Groq model identifiers)
    llm_models: list = [
        "meta-llama/llama-4-maverick-17b-128e-instruct",
        "llama-3.1-8b-instant",
        "openai/gpt-oss-120b"
    ]

    # Chunking Strategies (must match ChunkingFactory.STRATEGIES keys)
    chunking_strategies: list = ["dense", "sparse", "hybrid", "re-ranking"]

    # RAG Bench Datasets (from rungalileo/ragbench)
    ragbench_datasets: list = [
        "covidqa",
        "cuad",
        "delucionqa",
        "emanual",
        "expertqa",
        "finqa",
        "hagrid",
        "hotpotqa",
        "msmarco",
        "pubmedqa",
        "tatqa",
        "techqa"
    ]

    class Config:
        # Load overrides from .env; env var names are case-insensitive and
        # unknown extra variables are tolerated rather than rejected.
        env_file = ".env"
        case_sensitive = False
        extra = "allow"
+
63
+
64
+ settings = Settings()
dataset_loader.py ADDED
@@ -0,0 +1,178 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Dataset loader for RAG Bench datasets."""
2
+ import os
3
+ from typing import List, Dict, Optional
4
+ from datasets import load_dataset
5
+ import pandas as pd
6
+ from tqdm import tqdm
7
+
8
+
9
class RAGBenchLoader:
    """Load and manage RAG Bench datasets.

    Wraps the HuggingFace ``rungalileo/ragbench`` dataset family and
    normalizes each record into a common dict shape (question, answer,
    documents, context, dataset). Falls back to synthetic sample data when
    the real dataset cannot be downloaded.
    """

    SUPPORTED_DATASETS = [
        'covidqa',
        'cuad',
        'delucionqa',
        'emanual',
        'expertqa',
        'finqa',
        'hagrid',
        'hotpotqa',
        'msmarco',
        'pubmedqa',
        'tatqa',
        'techqa'
    ]

    def __init__(self, cache_dir: str = "./data_cache"):
        """Initialize the dataset loader.

        Args:
            cache_dir: Directory to cache downloaded datasets
        """
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def load_dataset(self, dataset_name: str, split: str = "test",
                     max_samples: Optional[int] = None) -> List[Dict]:
        """Load a RAG Bench dataset from rungalileo/ragbench.

        Args:
            dataset_name: Name of the dataset to load
            split: Dataset split (train/validation/test)
            max_samples: Maximum number of samples to load

        Returns:
            List of dictionaries containing dataset samples

        Raises:
            ValueError: If ``dataset_name`` is not in SUPPORTED_DATASETS.
        """
        if dataset_name not in self.SUPPORTED_DATASETS:
            raise ValueError(f"Unsupported dataset: {dataset_name}. "
                             f"Supported: {self.SUPPORTED_DATASETS}")

        print(f"Loading {dataset_name} dataset ({split} split) from rungalileo/ragbench...")

        try:
            # Load from rungalileo/ragbench
            dataset = load_dataset("rungalileo/ragbench", dataset_name, split=split,
                                   cache_dir=self.cache_dir)

            processed_data = []
            samples = dataset if max_samples is None else dataset.select(range(min(max_samples, len(dataset))))

            # Process the dataset into the standardized record shape
            for item in tqdm(samples, desc=f"Processing {dataset_name}"):
                processed_data.append(self._process_ragbench_item(item, dataset_name))

            print(f"Loaded {len(processed_data)} samples from {dataset_name}")
            return processed_data

        except Exception as e:
            # Download/parse failures should not break callers; serve
            # deterministic synthetic data so the pipeline stays usable.
            print(f"Error loading {dataset_name}: {str(e)}")
            print("Falling back to sample data for testing...")
            return self._create_sample_data(dataset_name, max_samples or 10)

    @staticmethod
    def _coerce_documents(value) -> List[str]:
        """Normalize a raw documents/context field to a list of strings."""
        if isinstance(value, list):
            return [str(v) for v in value]
        return [str(value)]

    def _process_ragbench_item(self, item: Dict, dataset_name: str) -> Dict:
        """Process a single RAGBench dataset item into standardized format.

        Args:
            item: Raw dataset item
            dataset_name: Name of the dataset

        Returns:
            Processed item dictionary with keys: question, answer, context,
            documents, dataset (plus metadata when present).
        """
        processed = {
            "question": item.get("question", ""),
            "answer": item.get("answer", ""),
            "context": "",  # For embedding and retrieval
            "documents": [],  # Store original documents list
            "dataset": dataset_name
        }

        # Extract documents from the first field present, in priority order:
        # documents > retrieved_contexts > context. (Deduplicated from three
        # previously identical branches.)
        for field in ("documents", "retrieved_contexts", "context"):
            if field in item:
                processed["documents"] = self._coerce_documents(item[field])
                processed["context"] = " ".join(processed["documents"])
                break

        # Store additional metadata if available
        if "metadata" in item:
            processed["metadata"] = item["metadata"]

        return processed

    def load_all_datasets(self, split: str = "test", max_samples: Optional[int] = None) -> Dict[str, List[Dict]]:
        """Load all RAGBench datasets.

        Args:
            split: Dataset split to load
            max_samples: Maximum samples per dataset

        Returns:
            Dictionary mapping dataset names to their data (empty list for
            datasets that failed to load).
        """
        all_data = {}
        for dataset_name in self.SUPPORTED_DATASETS:
            print(f"\n{'='*50}")
            print(f"Loading {dataset_name}...")
            print(f"{'='*50}")
            try:
                all_data[dataset_name] = self.load_dataset(dataset_name, split, max_samples)
            except Exception as e:
                print(f"Failed to load {dataset_name}: {str(e)}")
                all_data[dataset_name] = []

        return all_data

    def _create_sample_data(self, dataset_name: str, num_samples: int) -> List[Dict]:
        """Create sample data for testing when actual dataset is unavailable."""
        sample_data = []
        for i in range(num_samples):
            # Create multiple sample documents per question
            sample_docs = [
                f"Document 1: This is the first sample document {i+1} for {dataset_name} dataset. "
                f"It contains relevant information to answer the question.",
                f"Document 2: This is the second sample document {i+1} providing additional context. "
                f"It includes more details about the topic.",
                f"Document 3: This is the third sample document {i+1} with supplementary information."
            ]

            sample_data.append({
                "question": f"Sample question {i+1} for {dataset_name}?",
                "answer": f"Sample answer {i+1}",
                "documents": sample_docs,
                "context": " ".join(sample_docs),  # Combined for backward compatibility
                "dataset": dataset_name
            })
        return sample_data

    def get_test_data(self, dataset_name: str, num_samples: int = 100) -> List[Dict]:
        """Get test data for TRACE evaluation.

        Args:
            dataset_name: Name of the dataset
            num_samples: Number of test samples

        Returns:
            List of test samples
        """
        return self.load_dataset(dataset_name, split="test", max_samples=num_samples)
docker-compose.yml ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ version: '3.8'
2
+
3
+ services:
4
+ streamlit:
5
+ build: .
6
+ ports:
7
+ - "8501:8501"
8
+ environment:
9
+ - GROQ_API_KEY=${GROQ_API_KEY}
10
+ - CHROMA_PERSIST_DIRECTORY=/app/chroma_db
11
+ volumes:
12
+ - ./chroma_db:/app/chroma_db
13
+ - ./data_cache:/app/data_cache
14
+ command: streamlit run streamlit_app.py --server.port=8501 --server.address=0.0.0.0
15
+
16
+ api:
17
+ build: .
18
+ ports:
19
+ - "8000:8000"
20
+ environment:
21
+ - GROQ_API_KEY=${GROQ_API_KEY}
22
+ - CHROMA_PERSIST_DIRECTORY=/app/chroma_db
23
+ volumes:
24
+ - ./chroma_db:/app/chroma_db
25
+ - ./data_cache:/app/data_cache
26
+ command: uvicorn api:app --host 0.0.0.0 --port 8000 --reload
embedding_models.py ADDED
@@ -0,0 +1,325 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Embedding models for document vectorization."""
2
+ from typing import List, Optional
3
+ import torch
4
+ from sentence_transformers import SentenceTransformer
5
+ from transformers import AutoTokenizer, AutoModel
6
+ import numpy as np
7
+ from tqdm import tqdm
8
+ import os
9
+
10
+
11
class EmbeddingModel:
    """Abstract base for all embedding backends.

    Subclasses provide ``load_model`` and ``embed_documents``; this base
    class holds shared state and the single-query convenience wrapper.
    """

    def __init__(self, model_name: str, device: Optional[str] = None):
        """Set up shared state for an embedding backend.

        Args:
            model_name: Name/path of the model.
            device: Torch device string; auto-detects CUDA when omitted.
        """
        self.model_name = model_name
        if device is not None:
            self.device = device
        else:
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = None      # populated lazily by load_model()
        self.tokenizer = None  # only used by tokenizer-based subclasses

    def load_model(self):
        """Load the embedding model; must be supplied by subclasses."""
        raise NotImplementedError

    def embed_documents(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """Embed a list of documents; must be supplied by subclasses.

        Args:
            texts: List of texts to embed.
            batch_size: Batch size for processing.

        Returns:
            Numpy array of embeddings, one row per text.
        """
        raise NotImplementedError

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single query.

        Args:
            query: Query text.

        Returns:
            Numpy array of the query's embedding (first and only row).
        """
        vectors = self.embed_documents([query])
        return vectors[0]
52
+
53
+
54
class SentenceTransformerEmbedding(EmbeddingModel):
    """Embedding backend built on the sentence-transformers library."""

    def load_model(self):
        """Instantiate the SentenceTransformer, falling back to MiniLM on failure."""
        print(f"Loading SentenceTransformer model: {self.model_name}")
        try:
            self.model = SentenceTransformer(self.model_name, device=self.device)
        except Exception as e:
            print(f"Error loading model {self.model_name}: {str(e)}")
            print("Falling back to default model...")
            self.model = SentenceTransformer('all-MiniLM-L6-v2', device=self.device)
        else:
            print(f"Model loaded successfully on {self.device}")

    def embed_documents(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """Encode *texts* in batches and stack the per-batch results.

        Args:
            texts: Texts to embed.
            batch_size: Batch size handed to the encoder.

        Returns:
            Array of shape (len(texts), dim); empty array when no input.
        """
        if self.model is None:
            self.load_model()

        chunks = []
        for start in tqdm(range(0, len(texts), batch_size), desc="Embedding documents"):
            piece = texts[start:start + batch_size]
            chunks.append(
                self.model.encode(
                    piece,
                    convert_to_numpy=True,
                    show_progress_bar=False,
                    batch_size=batch_size,
                )
            )

        if not chunks:
            return np.array([])
        return np.vstack(chunks)
85
+
86
+
87
class BioMedicalEmbedding(EmbeddingModel):
    """Embedding backend using a biomedical BERT checkpoint with mean pooling."""

    def load_model(self):
        """Load tokenizer + encoder; fall back to bert-base-uncased on failure."""
        print(f"Loading Bio-Medical model: {self.model_name}")
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            self.model = AutoModel.from_pretrained(self.model_name).to(self.device)
            self.model.eval()
            print(f"Model loaded successfully on {self.device}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {str(e)}")
            print("Falling back to default model...")
            self.tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
            self.model = AutoModel.from_pretrained('bert-base-uncased').to(self.device)
            self.model.eval()

    def mean_pooling(self, model_output, attention_mask):
        """Average token embeddings, weighting each token by the attention mask."""
        tokens = model_output[0]
        mask = attention_mask.unsqueeze(-1).expand(tokens.size()).float()
        summed = torch.sum(tokens * mask, 1)
        # Clamp avoids division by zero for fully-masked rows.
        counts = torch.clamp(mask.sum(1), min=1e-9)
        return summed / counts

    def embed_documents(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """Tokenize, encode, mean-pool and L2-normalize *texts* in batches.

        Args:
            texts: Texts to embed.
            batch_size: Number of texts per forward pass.

        Returns:
            Array of shape (len(texts), 768); empty array when no input.
        """
        if self.model is None:
            self.load_model()

        pieces = []
        with torch.no_grad():
            for start in tqdm(range(0, len(texts), batch_size), desc="Embedding documents"):
                batch = texts[start:start + batch_size]

                # Tokenize (truncated to the encoder's 512-token window).
                encoded = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=512,
                    return_tensors='pt'
                ).to(self.device)

                # Encode, pool, then L2-normalize each row.
                outputs = self.model(**encoded)
                pooled = self.mean_pooling(outputs, encoded['attention_mask'])
                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)

                pieces.append(pooled.cpu().numpy())

        return np.vstack(pieces) if pieces else np.array([])
148
+
149
+
150
class GeminiEmbedding(EmbeddingModel):
    """Embedding backend backed by the Google Gemini embeddings API.

    Requires the GEMINI_API_KEY environment variable. When the SDK or the
    key is unavailable, ``load_model`` falls back to a local
    SentenceTransformer ('all-MiniLM-L6-v2').
    """

    def load_model(self):
        """Configure the Gemini SDK, or fall back to a local model.

        On success ``self.model`` is the configured ``google.generativeai``
        module; on failure it is a local SentenceTransformer instance.
        """
        print(f"Initializing Gemini embedding model: {self.model_name}")
        try:
            import google.generativeai as genai
            api_key = os.getenv("GEMINI_API_KEY")
            if not api_key:
                raise ValueError("GEMINI_API_KEY environment variable not set")
            genai.configure(api_key=api_key)
            self.model = genai
            print(f"Gemini model initialized successfully")
        except Exception as e:
            print(f"Error loading Gemini model: {str(e)}")
            print("Falling back to default model...")
            self.model = SentenceTransformer('all-MiniLM-L6-v2', device=self.device)

    def embed_documents(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """Embed *texts* via the Gemini API (or the local fallback model).

        Args:
            texts: Texts to embed.
            batch_size: Progress-reporting granularity only; the Gemini API
                is called one text at a time.

        Returns:
            Array of shape (len(texts), dim). A text that fails to embed is
            replaced by a 768-dim zero vector so the array stays rectangular.
        """
        if self.model is None:
            self.load_model()

        # Decide once which backend we have. BUG FIX: the previous
        # implementation re-checked per text and instantiated a brand-new
        # SentenceTransformer for every single text on the fallback path
        # (one model load per document).
        use_gemini = hasattr(self.model, 'embed_content')
        local_encoder = None if use_gemini else self.model

        embeddings = []

        # Gemini API has rate limits; texts are sent one at a time.
        for i in tqdm(range(0, len(texts), batch_size), desc="Embedding documents"):
            batch = texts[i:i + batch_size]

            for text in batch:
                try:
                    if use_gemini:
                        result = self.model.embed_content(
                            model="models/embedding-001",
                            content=text,
                            task_type="retrieval_document"
                        )
                        embeddings.append(result['embedding'])
                    else:
                        # Reuse the fallback model loaded in load_model().
                        embeddings.append(local_encoder.encode([text])[0])
                except Exception as e:
                    print(f"Error embedding text: {str(e)}")
                    # Zero vector keeps the output shape consistent.
                    embeddings.append(np.zeros(768))

        return np.array(embeddings)
201
+
202
+
203
class EmbeddingFactory:
    """Factory for creating embedding model instances."""

    # Map model names to their backend implementation type.
    MODEL_TYPES = {
        "sentence-transformers/all-mpnet-base-v2": "sentence-transformer",  # Stable, well-supported
        "emilyalsentzer/Bio_ClinicalBERT": "biomedical",  # Clinical domain
        "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract": "biomedical",  # Medical domain
        "sentence-transformers/all-MiniLM-L6-v2": "sentence-transformer",  # Fast, lightweight
        "sentence-transformers/multilingual-MiniLM-L12-v2": "sentence-transformer",  # Multilingual
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": "sentence-transformer",  # Paraphrase
        "allenai/specter": "biomedical",  # Academic paper embeddings
        "gemini-embedding-001": "gemini"  # Gemini API
    }

    @classmethod
    def create_embedding_model(cls, model_name: str, device: Optional[str] = None) -> "EmbeddingModel":
        """Create an embedding model instance.

        Args:
            model_name: Name of the embedding model.
            device: Device to run model on.

        Returns:
            EmbeddingModel instance; unknown names default to the
            sentence-transformer backend.
        """
        model_type = cls.MODEL_TYPES.get(model_name, "sentence-transformer")

        if model_type == "gemini":
            return GeminiEmbedding(model_name, device)
        elif model_type == "biomedical":
            return BioMedicalEmbedding(model_name, device)
        else:
            return SentenceTransformerEmbedding(model_name, device)

    @classmethod
    def get_available_models(cls) -> List[str]:
        """Get list of available embedding models."""
        return list(cls.MODEL_TYPES.keys())

    @classmethod
    def get_model_info(cls, model_name: str) -> dict:
        """Get information about a specific model.

        Args:
            model_name: Name of the model.

        Returns:
            Dictionary with model information; unknown models get a generic
            entry with the default 768 dimension.
        """
        info = {
            "sentence-transformers/all-mpnet-base-v2": {
                # BUG FIX: description previously said "(384d)" although the
                # model's dimension (below) is 768.
                "description": "High-quality, general-purpose sentence embeddings (768d)",
                "dimension": 768,
                "type": "sentence-transformer",
                "note": "Recommended for general use"
            },
            "emilyalsentzer/Bio_ClinicalBERT": {
                "description": "Clinical BERT for biomedical and clinical text",
                "dimension": 768,
                "type": "biomedical"
            },
            "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract": {
                "description": "PubMedBERT for biomedical and medical text",
                "dimension": 768,
                "type": "biomedical"
            },
            "sentence-transformers/all-MiniLM-L6-v2": {
                "description": "Fast, lightweight sentence embeddings",
                "dimension": 384,
                "type": "sentence-transformer",
                "note": "Good for speed-sensitive applications"
            },
            "sentence-transformers/multilingual-MiniLM-L12-v2": {
                "description": "Fast multilingual sentence embeddings",
                "dimension": 384,
                "type": "sentence-transformer",
                "note": "Supports 50+ languages"
            },
            "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": {
                "description": "Multilingual paraphrase embeddings",
                "dimension": 384,
                "type": "sentence-transformer",
                "note": "Good for paraphrase detection"
            },
            "allenai/specter": {
                "description": "Embeddings for academic papers and citations",
                "dimension": 768,
                "type": "biomedical",
                "note": "Optimized for scientific literature"
            },
            "gemini-embedding-001": {
                "description": "Google Gemini embedding model via API",
                "dimension": 768,
                "type": "gemini",
                "url": "https://ai.google.dev/gemini-api/docs/embeddings",
                "note": "Requires GEMINI_API_KEY environment variable"
            }
        }
        return info.get(model_name, {"description": "Unknown model", "dimension": 768})

    @classmethod
    def get_embedding_dimension(cls, model_name: str) -> int:
        """Get embedding dimension for a model.

        Args:
            model_name: Name of the model.

        Returns:
            Embedding dimension (768 for unknown models).
        """
        # CONSISTENCY FIX: derive from get_model_info so each model's
        # dimension is defined in exactly one place (the old code kept a
        # second, duplicate dimension table that could drift).
        return int(cls.get_model_info(model_name).get("dimension", 768))
example.py ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Example script demonstrating how to use the RAG system programmatically.
3
+ """
4
+ import os
5
+ from config import settings
6
+ from dataset_loader import RAGBenchLoader
7
+ from vector_store import ChromaDBManager
8
+ from llm_client import GroqLLMClient, RAGPipeline
9
+ from trace_evaluator import TRACEEvaluator
10
+
11
+
12
RULE = "=" * 50  # horizontal rule used for section banners


def main():
    """Run the end-to-end RAG demo.

    Loads a small slice of hotpotqa, indexes it into ChromaDB, answers a
    few demo questions through the Groq-backed pipeline, scores a handful
    of held-out samples with TRACE, and prints the chat history. Requires
    a real GROQ_API_KEY in the environment (or .env).
    """

    # Resolve the API key; bail out early when it is still the placeholder.
    api_key = os.getenv("GROQ_API_KEY") or "your_api_key_here"
    if api_key == "your_api_key_here":
        print("Please set your GROQ_API_KEY in .env file or environment variable")
        return

    print(RULE)
    print("RAG System Example")
    print(RULE)

    # 1. Load dataset
    print("\n1. Loading dataset...")
    data_loader = RAGBenchLoader()
    train_samples = data_loader.load_dataset("hotpotqa", split="train", max_samples=20)
    print(f"Loaded {len(train_samples)} samples")

    # 2. Create vector store and collection
    print("\n2. Creating vector store...")
    store = ChromaDBManager()

    collection_name = "example_collection"
    embedding_model = "emilyalsentzer/Bio_ClinicalBERT"
    chunking_strategy = "hybrid"

    print(f"Loading data into collection with {chunking_strategy} chunking...")
    store.load_dataset_into_collection(
        collection_name=collection_name,
        embedding_model_name=embedding_model,
        chunking_strategy=chunking_strategy,
        dataset_data=train_samples,
        chunk_size=512,
        overlap=50
    )

    # 3. Initialize LLM client
    print("\n3. Initializing LLM client...")
    llm = GroqLLMClient(
        api_key=api_key,
        model_name="llama-3.1-8b-instant",
        max_rpm=30,
        rate_limit_delay=2.0
    )

    # 4. Create RAG pipeline
    print("\n4. Creating RAG pipeline...")
    rag = RAGPipeline(llm, store)

    # 5. Query the system
    print("\n5. Querying the system...")
    demo_questions = [
        "What is machine learning?",
        "How does neural network work?",
        "What is deep learning?"
    ]

    for number, question in enumerate(demo_questions, 1):
        print(f"\n--- Query {number}: {question} ---")
        outcome = rag.query(question, n_results=3)

        print(f"Response: {outcome['response']}")
        print(f"\nRetrieved {len(outcome['retrieved_documents'])} documents:")
        for rank, doc in enumerate(outcome['retrieved_documents'], 1):
            print(f"\nDocument {rank} (Distance: {doc.get('distance', 'N/A')}):")
            print(f"{doc['document'][:200]}...")

    # 6. Run TRACE evaluation over a few held-out samples
    print("\n6. Running TRACE evaluation...")
    evaluator = TRACEEvaluator(llm)

    eval_cases = []
    held_out = data_loader.get_test_data("hotpotqa", num_samples=5)
    for sample in held_out:
        outcome = rag.query(sample["question"], n_results=5)
        eval_cases.append({
            "query": sample["question"],
            "response": outcome["response"],
            "retrieved_documents": [doc["document"] for doc in outcome["retrieved_documents"]],
            "ground_truth": sample.get("answer", "")
        })

    scores = evaluator.evaluate_batch(eval_cases)

    print("\nTRACE Evaluation Results:")
    print(f"Utilization: {scores['utilization']:.3f}")
    print(f"Relevance: {scores['relevance']:.3f}")
    print(f"Adherence: {scores['adherence']:.3f}")
    print(f"Completeness: {scores['completeness']:.3f}")
    print(f"Average: {scores['average']:.3f}")

    # 7. View chat history
    print("\n7. Chat History:")
    conversations = rag.get_chat_history()
    print(f"Total conversations: {len(conversations)}")

    print("\n" + RULE)
    print("Example completed successfully!")
    print(RULE)


if __name__ == "__main__":
    main()
llm_client.py ADDED
@@ -0,0 +1,351 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Groq LLM integration with rate limiting."""
2
+ from typing import List, Dict, Optional, AsyncIterator
3
+ import time
4
+ from groq import Groq
5
+ import asyncio
6
+ from datetime import datetime, timedelta
7
+ from collections import deque
8
+ import os
9
+
10
+
11
+ class RateLimiter:
12
+ """Rate limiter for API calls."""
13
+
14
+ def __init__(self, max_requests_per_minute: int = 30):
15
+ """Initialize rate limiter.
16
+
17
+ Args:
18
+ max_requests_per_minute: Maximum requests allowed per minute
19
+ """
20
+ self.max_requests = max_requests_per_minute
21
+ self.request_times = deque()
22
+ self.lock = asyncio.Lock()
23
+
24
+ async def acquire(self):
25
+ """Acquire permission to make a request."""
26
+ async with self.lock:
27
+ now = datetime.now()
28
+
29
+ # Remove requests older than 1 minute
30
+ while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
31
+ self.request_times.popleft()
32
+
33
+ # If at limit, wait
34
+ if len(self.request_times) >= self.max_requests:
35
+ # Calculate how long to wait
36
+ oldest_request = self.request_times[0]
37
+ wait_time = 60 - (now - oldest_request).total_seconds()
38
+
39
+ if wait_time > 0:
40
+ print(f"Rate limit reached. Waiting {wait_time:.2f} seconds...")
41
+ await asyncio.sleep(wait_time)
42
+ # Recursive call after waiting
43
+ return await self.acquire()
44
+
45
+ # Record this request
46
+ self.request_times.append(now)
47
+
48
+ def acquire_sync(self):
49
+ """Synchronous version of acquire."""
50
+ now = datetime.now()
51
+
52
+ # Remove requests older than 1 minute
53
+ while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
54
+ self.request_times.popleft()
55
+
56
+ # If at limit, wait
57
+ if len(self.request_times) >= self.max_requests:
58
+ oldest_request = self.request_times[0]
59
+ wait_time = 60 - (now - oldest_request).total_seconds()
60
+
61
+ if wait_time > 0:
62
+ print(f"Rate limit reached. Waiting {wait_time:.2f} seconds...")
63
+ time.sleep(wait_time)
64
+ return self.acquire_sync()
65
+
66
+ # Record this request
67
+ self.request_times.append(now)
68
+
69
+
70
class GroqLLMClient:
    """Client for Groq LLM API with rate limiting."""

    def __init__(
        self,
        api_key: str,
        model_name: str = "llama-3.1-8b-instant",
        max_rpm: int = 30,
        rate_limit_delay: float = 2.0
    ):
        """Initialize Groq client.

        Args:
            api_key: Groq API key.
            model_name: Name of the LLM model.
            max_rpm: Maximum requests per minute.
            rate_limit_delay: Additional delay between requests (seconds).
        """
        self.client = Groq(api_key=api_key)
        self.model_name = model_name
        self.rate_limiter = RateLimiter(max_rpm)
        self.rate_limit_delay = rate_limit_delay

        # Models known to this client; others are allowed with a warning.
        self.available_models = [
            "meta-llama/llama-4-maverick-17b-128e-instruct",
            "llama-3.1-8b-instant",
            "openai/gpt-oss-120b"
        ]

    def set_model(self, model_name: str):
        """Switch the active LLM model (warns on unknown names).

        Args:
            model_name: Name of the model.
        """
        if model_name not in self.available_models:
            print(f"Warning: {model_name} not in available models. Using anyway...")
        self.model_name = model_name

    @staticmethod
    def _build_messages(prompt: str, system_prompt: Optional[str]) -> List[Dict]:
        """Assemble the chat message list for a single-turn request."""
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": prompt})
        return messages

    def generate(
        self,
        prompt: str,
        max_tokens: int = 1024,
        temperature: float = 0.7,
        system_prompt: Optional[str] = None
    ) -> str:
        """Generate text using the Groq LLM, honouring the rate limit.

        Args:
            prompt: Input prompt.
            max_tokens: Maximum tokens to generate.
            temperature: Sampling temperature.
            system_prompt: Optional system prompt.

        Returns:
            Generated text, or an "Error: ..." string on failure.
        """
        self.rate_limiter.acquire_sync()
        chat_messages = self._build_messages(prompt, system_prompt)

        try:
            completion = self.client.chat.completions.create(
                model=self.model_name,
                messages=chat_messages,
                max_tokens=max_tokens,
                temperature=temperature
            )
            # Extra spacing between consecutive calls.
            time.sleep(self.rate_limit_delay)
            return completion.choices[0].message.content
        except Exception as e:
            print(f"Error generating response: {str(e)}")
            return f"Error: {str(e)}"

    async def generate_async(
        self,
        prompt: str,
        max_tokens: int = 1024,
        temperature: float = 0.7,
        system_prompt: Optional[str] = None
    ) -> str:
        """Asynchronous version of generate.

        Note: the underlying Groq client call is synchronous and will block
        the event loop for the duration of the request.

        Args:
            prompt: Input prompt.
            max_tokens: Maximum tokens to generate.
            temperature: Sampling temperature.
            system_prompt: Optional system prompt.

        Returns:
            Generated text, or an "Error: ..." string on failure.
        """
        await self.rate_limiter.acquire()
        chat_messages = self._build_messages(prompt, system_prompt)

        try:
            completion = self.client.chat.completions.create(
                model=self.model_name,
                messages=chat_messages,
                max_tokens=max_tokens,
                temperature=temperature
            )
            # Extra spacing between consecutive calls.
            await asyncio.sleep(self.rate_limit_delay)
            return completion.choices[0].message.content
        except Exception as e:
            print(f"Error generating response: {str(e)}")
            return f"Error: {str(e)}"

    def generate_with_context(
        self,
        query: str,
        context_documents: List[str],
        max_tokens: int = 1024,
        temperature: float = 0.7
    ) -> str:
        """Answer *query* grounded in the retrieved *context_documents*.

        Args:
            query: User query.
            context_documents: List of retrieved documents.
            max_tokens: Maximum tokens to generate.
            temperature: Sampling temperature.

        Returns:
            Generated response.
        """
        # Number each document so the model can reference them.
        context = "\n\n".join(
            f"Document {position + 1}: {doc}"
            for position, doc in enumerate(context_documents)
        )

        prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""

        system_prompt = "You are a helpful AI assistant. Answer questions based on the provided context. If the answer is not in the context, say so."

        return self.generate(prompt, max_tokens, temperature, system_prompt)

    def batch_generate(
        self,
        prompts: List[str],
        max_tokens: int = 1024,
        temperature: float = 0.7,
        system_prompt: Optional[str] = None
    ) -> List[str]:
        """Generate responses for multiple prompts, sequentially.

        Args:
            prompts: List of prompts.
            max_tokens: Maximum tokens to generate.
            temperature: Sampling temperature.
            system_prompt: Optional system prompt shared by all prompts.

        Returns:
            List of generated responses, in input order.
        """
        outputs = []
        total = len(prompts)
        for index, single_prompt in enumerate(prompts):
            print(f"Processing prompt {index+1}/{total}")
            outputs.append(self.generate(single_prompt, max_tokens, temperature, system_prompt))
        return outputs
275
+
276
+
277
class RAGPipeline:
    """Complete RAG pipeline tying an LLM client to a vector store."""

    def __init__(
        self,
        llm_client: GroqLLMClient,
        vector_store_manager
    ):
        """Wire the pipeline together.

        Args:
            llm_client: Groq LLM client used for generation.
            vector_store_manager: ChromaDB manager used for retrieval.
        """
        self.llm = llm_client
        self.vector_store = vector_store_manager
        self.chat_history = []  # one entry per answered query

    def query(
        self,
        query: str,
        n_results: int = 5,
        max_tokens: int = 1024,
        temperature: float = 0.7
    ) -> Dict:
        """Retrieve context for *query*, generate an answer, and log it.

        Args:
            query: User query.
            n_results: Number of documents to retrieve.
            max_tokens: Maximum tokens to generate.
            temperature: Sampling temperature.

        Returns:
            Dict with the query, the response text, and the retrieved docs.
        """
        # Retrieve supporting documents from the vector store.
        hits = self.vector_store.get_retrieved_documents(query, n_results)
        passages = [hit["document"] for hit in hits]

        # Generate a grounded answer from the retrieved passages.
        answer = self.llm.generate_with_context(
            query,
            passages,
            max_tokens,
            temperature
        )

        # Log the full exchange for later inspection.
        self.chat_history.append({
            "query": query,
            "response": answer,
            "retrieved_docs": hits,
            "timestamp": datetime.now().isoformat()
        })

        return {
            "query": query,
            "response": answer,
            "retrieved_documents": hits
        }

    def get_chat_history(self) -> List[Dict]:
        """Return the accumulated chat history (oldest first).

        Returns:
            List of chat history entries.
        """
        return self.chat_history

    def clear_history(self):
        """Drop all chat history entries."""
        self.chat_history = []
requirements.txt ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Core Dependencies
2
+ fastapi==0.109.0
3
+ uvicorn[standard]==0.27.0
4
+ streamlit==1.31.0
5
+ python-dotenv==1.0.0
6
+
7
+ # LLM and AI
8
+ groq>=0.11.0
9
+ openai==1.12.0
10
+ google-generativeai>=0.3.0
11
+
12
+ # Embeddings and Vector Store
13
+ sentence-transformers==2.7.0
14
+ transformers==4.40.2
15
+ torch>=2.0.0
16
+ chromadb==0.5.23
17
+
18
+ # Data Processing
19
+ pandas==2.2.0
20
+ numpy==1.26.3
21
+ datasets==2.16.1
22
+
23
+ # RAG and Retrieval
24
+ langchain==0.1.6
25
+ langchain-community==0.0.19
26
+ langchain-groq==0.0.1
27
+
28
+ # Evaluation
29
+ ragas==0.1.4
30
+ rank-bm25==0.2.2
31
+
32
+ # Utilities
33
+ pydantic==2.6.0
34
+ pydantic-settings==2.1.0
35
+ tenacity==8.2.3
36
+ aiohttp==3.9.3
37
+ tqdm==4.66.1
38
+
39
+ # Deployment
40
+ gunicorn==21.2.0
run.py ADDED
@@ -0,0 +1,99 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Quick start script to run the RAG application.
3
+ """
4
+ import subprocess
5
+ import sys
6
+ import os
7
+
8
+
9
def check_dependencies():
    """Return True when the core runtime packages can be imported.

    Probes streamlit, fastapi and groq — the packages the launcher needs.
    """
    try:
        for package in ("streamlit", "fastapi", "groq"):
            __import__(package)
    except ImportError:
        return False
    return True
18
+
19
+
20
def install_dependencies():
    """Install everything listed in requirements.txt via the current Python."""
    print("Installing dependencies...")
    pip_command = [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]
    subprocess.check_call(pip_command)
    print("Dependencies installed successfully!")
25
+
26
+
27
def check_env_file():
    """Ensure a .env file exists, seeding it from .env.example when possible.

    Returns:
        True when .env exists (or was just created from the example);
        False when neither .env nor .env.example is present.
    """
    if os.path.exists(".env"):
        return True

    print("\n⚠️ Warning: .env file not found!")
    print("Creating .env from .env.example...")
    if not os.path.exists(".env.example"):
        print("❌ .env.example not found. Please create .env manually.")
        return False

    # Copy the example file's contents verbatim as the starting .env.
    with open(".env.example", "r") as src:
        template = src.read()
    with open(".env", "w") as dst:
        dst.write(template)
    print("✅ .env file created. Please edit it and add your Groq API key.")
    return True
41
+
42
+
43
def run_streamlit():
    """Launch the Streamlit chat UI in the foreground (blocks until exit)."""
    print("\n🚀 Starting Streamlit application...")
    print("📱 Open your browser to: http://localhost:8501")
    streamlit_cmd = [sys.executable, "-m", "streamlit", "run", "streamlit_app.py"]
    subprocess.run(streamlit_cmd)
48
+
49
+
50
def run_api():
    """Launch the FastAPI backend in the foreground (blocks until exit)."""
    print("\n🚀 Starting FastAPI server...")
    print("📱 API available at: http://localhost:8000")
    print("📚 API docs at: http://localhost:8000/docs")
    subprocess.run([sys.executable, "api.py"])
56
+
57
+
58
def main():
    """Interactive launcher: verify setup, then start the chosen service."""
    print("=" * 50)
    print("RAG Capstone Project - Quick Start")
    print("=" * 50)

    # Install missing core packages before anything else.
    if not check_dependencies():
        print("\n📦 Installing dependencies...")
        install_dependencies()

    # A usable .env is required to proceed.
    if not check_env_file():
        print("\n❌ Please configure your .env file before running the application.")
        return

    # Ask user what to run
    print("\nWhat would you like to run?")
    print("1. Streamlit Chat Interface (recommended)")
    print("2. FastAPI Backend")
    print("3. Both (requires two terminals)")

    choice = input("\nEnter your choice (1-3): ").strip()

    launchers = {"1": run_streamlit, "2": run_api}
    if choice in launchers:
        launchers[choice]()
    elif choice == "3":
        print("\n📌 To run both:")
        print("Terminal 1: python api.py")
        print("Terminal 2: streamlit run streamlit_app.py")
        print("\nStarting Streamlit in this terminal...")
        run_streamlit()
    else:
        print("Invalid choice. Exiting.")


if __name__ == "__main__":
    main()
streamlit_app.py ADDED
@@ -0,0 +1,721 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Streamlit chat interface for RAG application."""
2
+ import streamlit as st
3
+ import sys
4
+ import os
5
+ from datetime import datetime
6
+ import json
7
+ import pandas as pd
8
+ from typing import Optional
9
+ import warnings
10
+
11
+ # Suppress warnings
12
+ warnings.filterwarnings('ignore')
13
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
14
+
15
+ # Add parent directory to path
16
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
17
+
18
+ from config import settings
19
+ from dataset_loader import RAGBenchLoader
20
+ from vector_store import ChromaDBManager
21
+ from llm_client import GroqLLMClient, RAGPipeline
22
+ from trace_evaluator import TRACEEvaluator
23
+ from embedding_models import EmbeddingFactory
24
+ from chunking_strategies import ChunkingFactory
25
+
26
+
27
+ # Page configuration
28
+ st.set_page_config(
29
+ page_title="RAG Capstone Project",
30
+ page_icon="🤖",
31
+ layout="wide"
32
+ )
33
+
34
+ # Initialize session state
35
+ if "chat_history" not in st.session_state:
36
+ st.session_state.chat_history = []
37
+
38
+ if "rag_pipeline" not in st.session_state:
39
+ st.session_state.rag_pipeline = None
40
+
41
+ if "vector_store" not in st.session_state:
42
+ st.session_state.vector_store = None
43
+
44
+ if "collection_loaded" not in st.session_state:
45
+ st.session_state.collection_loaded = False
46
+
47
+ if "evaluation_results" not in st.session_state:
48
+ st.session_state.evaluation_results = None
49
+
50
+ if "dataset_size" not in st.session_state:
51
+ st.session_state.dataset_size = 10000
52
+
53
+ if "current_dataset" not in st.session_state:
54
+ st.session_state.current_dataset = None
55
+
56
+ if "current_llm" not in st.session_state:
57
+ st.session_state.current_llm = settings.llm_models[1]
58
+
59
+ if "selected_collection" not in st.session_state:
60
+ st.session_state.selected_collection = None
61
+
62
+ if "available_collections" not in st.session_state:
63
+ st.session_state.available_collections = []
64
+
65
+
66
def get_available_collections():
    """Return the names of all collections persisted in ChromaDB.

    Any failure (e.g. an unreadable persist directory) is logged to stdout
    and reported as an empty list so the UI can still render.
    """
    try:
        store = ChromaDBManager(settings.chroma_persist_directory)
        return store.list_collections()
    except Exception as e:
        print(f"Error getting collections: {e}")
        return []
75
+
76
+
77
def main():
    """Main Streamlit application.

    Renders the sidebar configuration panel (API key, existing-collection
    loader, and new-collection builder) and, once a collection is active,
    the Chat / Evaluation / History tabs.
    """
    st.title("🤖 RAG Capstone Project")
    st.markdown("### Retrieval-Augmented Generation with TRACE Evaluation")

    # Get available collections at startup (refreshed on every script rerun)
    available_collections = get_available_collections()
    st.session_state.available_collections = available_collections

    # Sidebar for configuration
    with st.sidebar:
        st.header("Configuration")

        # API Key input; pre-filled from settings if configured via environment
        groq_api_key = st.text_input(
            "Groq API Key",
            type="password",
            value=settings.groq_api_key or "",
            help="Enter your Groq API key"
        )

        st.divider()

        # Option 1: Use existing collection (only shown when any exist)
        if available_collections:
            st.subheader("📚 Existing Collections")
            st.write(f"Found {len(available_collections)} collection(s)")

            selected_collection = st.selectbox(
                "Or select existing collection:",
                available_collections,
                key="collection_selector"
            )

            if st.button("📖 Load Existing Collection", type="secondary"):
                if not groq_api_key:
                    st.error("Please enter your Groq API key")
                else:
                    load_existing_collection(groq_api_key, selected_collection)

            st.divider()

        # Option 2: Create new collection
        st.subheader("🆕 Create New Collection")

        # Dataset selection
        st.subheader("1. Dataset Selection")
        dataset_name = st.selectbox(
            "Choose Dataset",
            settings.ragbench_datasets,
            index=0
        )

        # Get dataset size dynamically so the sample slider has a real upper bound
        if st.button("🔍 Check Dataset Size", key="check_size"):
            with st.spinner("Checking dataset size..."):
                try:
                    from datetime import datetime  # noqa: F401 (kept local per original)
                    from datasets import load_dataset
                    import os

                    # Load dataset with download_mode to avoid cache issues
                    st.info(f"Fetching dataset info for '{dataset_name}'...")
                    ds = load_dataset(
                        "rungalileo/ragbench",
                        dataset_name,
                        split="train",
                        trust_remote_code=True,
                        download_mode="force_redownload"  # Force fresh download to avoid cache corruption
                    )
                    dataset_size = len(ds)

                    st.session_state.dataset_size = dataset_size
                    st.session_state.current_dataset = dataset_name
                    st.success(f"✅ Dataset '{dataset_name}' has {dataset_size:,} samples available")
                except Exception as e:
                    # Fall back to a conservative default so the UI stays usable
                    st.error(f"❌ Error: {str(e)}")
                    st.exception(e)
                    st.warning(f"Could not determine dataset size. Using default of 10,000.")
                    st.session_state.dataset_size = 10000
                    st.session_state.current_dataset = dataset_name

        # Use stored dataset size or default
        max_samples_available = st.session_state.get('dataset_size', 10000)

        st.caption(f"Max available samples: {max_samples_available:,}")

        num_samples = st.slider(
            "Number of samples",
            min_value=10,
            max_value=max_samples_available,
            value=min(100, max_samples_available),
            step=50 if max_samples_available > 1000 else 10,
            help="Adjust slider to select number of samples"
        )

        load_all_samples = st.checkbox(
            "Load all available samples",
            value=False,
            help="Override slider and load entire dataset"
        )

        st.divider()

        # Chunking strategy
        st.subheader("2. Chunking Strategy")
        chunking_strategy = st.selectbox(
            "Choose Chunking Strategy",
            settings.chunking_strategies,
            index=0
        )

        chunk_size = st.slider(
            "Chunk Size",
            min_value=256,
            max_value=1024,
            value=512,
            step=128
        )

        overlap = st.slider(
            "Overlap",
            min_value=0,
            max_value=200,
            value=50,
            step=10
        )

        st.divider()

        # Embedding model
        st.subheader("3. Embedding Model")
        embedding_model = st.selectbox(
            "Choose Embedding Model",
            settings.embedding_models,
            index=0
        )

        st.divider()

        # LLM model selection for new collection
        st.subheader("4. LLM Model")
        llm_model = st.selectbox(
            "Choose LLM",
            settings.llm_models,
            index=1
        )

        st.divider()

        # Load data button
        if st.button("🚀 Load Data & Create Collection", type="primary"):
            if not groq_api_key:
                st.error("Please enter your Groq API key")
            else:
                # Use None for num_samples if loading all data
                samples_to_load = None if load_all_samples else num_samples
                load_and_create_collection(
                    groq_api_key,
                    dataset_name,
                    samples_to_load,
                    chunking_strategy,
                    chunk_size,
                    overlap,
                    embedding_model,
                    llm_model
                )

    # Main content area: onboarding help until a collection is active
    if not st.session_state.collection_loaded:
        st.info("👈 Please configure and load a dataset from the sidebar to begin")

        # Show instructions
        with st.expander("📖 How to Use", expanded=True):
            st.markdown("""
            1. **Enter your Groq API Key** in the sidebar
            2. **Select a dataset** from RAG Bench
            3. **Choose a chunking strategy** (dense, sparse, hybrid, re-ranking)
            4. **Select an embedding model** for document vectorization
            5. **Choose an LLM model** for response generation
            6. **Click "Load Data & Create Collection"** to initialize
            7. **Start chatting** in the chat interface
            8. **View retrieved documents** and evaluation metrics
            9. **Run TRACE evaluation** on test data
            """)

        # Show available options
        col1, col2 = st.columns(2)

        with col1:
            st.subheader("📊 Available Datasets")
            for ds in settings.ragbench_datasets:
                st.markdown(f"- {ds}")

        with col2:
            st.subheader("🤖 Available Models")
            st.markdown("**Embedding Models:**")
            for em in settings.embedding_models:
                st.markdown(f"- {em}")

            st.markdown("**LLM Models:**")
            for lm in settings.llm_models:
                st.markdown(f"- {lm}")

    else:
        # Create tabs for different functionalities
        tab1, tab2, tab3 = st.tabs(["💬 Chat", "📊 Evaluation", "📜 History"])

        with tab1:
            chat_interface()

        with tab2:
            evaluation_interface()

        with tab3:
            history_interface()
292
+
293
+
294
def load_existing_collection(api_key: str, collection_name: str):
    """Load an existing collection from ChromaDB and wire up a RAG pipeline.

    Args:
        api_key: Groq API key used to build the LLM client.
        collection_name: Name of an existing ChromaDB collection.

    On success, populates session state (vector_store, rag_pipeline,
    collection_loaded, current_collection, groq_api_key) and reruns the app.
    """
    with st.spinner(f"Loading collection '{collection_name}'..."):
        try:
            # Initialize vector store and get collection
            vector_store = ChromaDBManager(settings.chroma_persist_directory)
            vector_store.get_collection(collection_name)

            # Prompt for LLM selection
            # NOTE(review): this selectbox is rendered inside a button-triggered
            # code path and the function ends with st.rerun(), so the user never
            # gets a chance to interact with it — the returned value is always
            # the widget's default (first model). Confirm whether the selector
            # should instead live in the sidebar before the load button.
            st.session_state.current_llm = st.selectbox(
                "Select LLM for this collection:",
                settings.llm_models,
                key=f"llm_selector_{collection_name}"
            )

            # Initialize LLM client
            st.info("Initializing LLM client...")
            llm_client = GroqLLMClient(
                api_key=api_key,
                model_name=st.session_state.current_llm,
                max_rpm=settings.groq_rpm_limit,
                rate_limit_delay=settings.rate_limit_delay
            )

            # Create RAG pipeline with correct parameter names
            st.info("Creating RAG pipeline...")
            rag_pipeline = RAGPipeline(
                llm_client=llm_client,
                vector_store_manager=vector_store
            )

            # Store in session state so the chat/eval tabs can use the pipeline
            st.session_state.vector_store = vector_store
            st.session_state.rag_pipeline = rag_pipeline
            st.session_state.collection_loaded = True
            st.session_state.current_collection = collection_name
            st.session_state.selected_collection = collection_name
            # Keep the API key around so the chat tab can rebuild the client
            # when the user switches LLMs mid-session.
            st.session_state.groq_api_key = api_key

            st.success(f"✅ Collection '{collection_name}' loaded successfully!")
            st.rerun()

        except Exception as e:
            st.error(f"Error loading collection: {str(e)}")
            st.exception(e)
339
+
340
+
341
def load_and_create_collection(
    api_key: str,
    dataset_name: str,
    num_samples: Optional[int],
    chunking_strategy: str,
    chunk_size: int,
    overlap: int,
    embedding_model: str,
    llm_model: str
):
    """Load a RAGBench dataset and build a new ChromaDB collection from it.

    Args:
        api_key: Groq API key used to build the LLM client.
        dataset_name: RAGBench subset name (e.g. "covidqa").
        num_samples: Number of samples to load, or None for the full split.
        chunking_strategy: Name of the chunking strategy to apply.
        chunk_size: Target chunk size in characters/tokens.
        overlap: Overlap between consecutive chunks.
        embedding_model: Embedding model identifier.
        llm_model: Groq LLM model name for the RAG pipeline.

    On success, populates session state and reruns the app so the tabs appear.
    """
    with st.spinner("Loading dataset and creating collection..."):
        try:
            # Initialize dataset loader
            loader = RAGBenchLoader()

            # BUG FIX: the dataset was previously loaded twice in a row
            # (two identical load_dataset calls), doubling download/parse time.
            # Load it exactly once.
            if num_samples is None:
                st.info(f"Loading {dataset_name} dataset (all available samples)...")
            else:
                st.info(f"Loading {dataset_name} dataset ({num_samples} samples)...")
            dataset = loader.load_dataset(dataset_name, split="train", max_samples=num_samples)

            if not dataset:
                st.error("Failed to load dataset")
                return

            # Initialize vector store
            st.info("Initializing vector store...")
            vector_store = ChromaDBManager(settings.chroma_persist_directory)

            # Derive a ChromaDB-safe collection name from the configuration
            collection_name = f"{dataset_name}_{chunking_strategy}_{embedding_model.split('/')[-1]}"
            collection_name = collection_name.replace("-", "_").replace(".", "_")

            # Delete existing collection with same name (if exists) so stale
            # chunks from a previous run don't mix with the new ones
            existing_collections = vector_store.list_collections()
            if collection_name in existing_collections:
                st.warning(f"Collection '{collection_name}' already exists. Deleting and recreating...")
                vector_store.delete_collection(collection_name)
                st.info("Old collection deleted. Creating new one...")

            # Chunk, embed, and load data into the collection
            st.info(f"Creating collection with {chunking_strategy} chunking...")
            vector_store.load_dataset_into_collection(
                collection_name=collection_name,
                embedding_model_name=embedding_model,
                chunking_strategy=chunking_strategy,
                dataset_data=dataset,
                chunk_size=chunk_size,
                overlap=overlap
            )

            # Initialize LLM client
            st.info("Initializing LLM client...")
            llm_client = GroqLLMClient(
                api_key=api_key,
                model_name=llm_model,
                max_rpm=settings.groq_rpm_limit,
                rate_limit_delay=settings.rate_limit_delay
            )

            # Create RAG pipeline with correct parameter names
            rag_pipeline = RAGPipeline(
                llm_client=llm_client,
                vector_store_manager=vector_store
            )

            # Store in session state
            st.session_state.vector_store = vector_store
            st.session_state.rag_pipeline = rag_pipeline
            st.session_state.collection_loaded = True
            st.session_state.current_collection = collection_name
            st.session_state.dataset_name = dataset_name
            st.session_state.dataset = dataset
            # CONSISTENCY FIX: mirror load_existing_collection() — the chat tab's
            # LLM switcher and run_evaluation() read these keys, which this path
            # previously never set (leaving the API key empty after a new-collection
            # load and the LLM selector out of sync).
            st.session_state.groq_api_key = api_key
            st.session_state.current_llm = llm_model

            st.success(f"✅ Collection '{collection_name}' created successfully!")
            st.rerun()

        except Exception as e:
            st.error(f"Error: {str(e)}")
            st.exception(e)
424
+
425
+
426
def chat_interface():
    """Chat interface tab.

    Renders prior conversation turns, a per-chat LLM switcher, and a chat
    input that runs the RAG pipeline and appends the result to history.
    """
    st.subheader("💬 Chat Interface")

    # Check if collection is loaded
    if not st.session_state.collection_loaded:
        st.warning("⚠️ No data loaded. Please use the configuration panel to load a dataset and create a collection.")
        st.info("""
        Steps:
        1. Select a dataset from the dropdown
        2. Click "Load Data & Create Collection" button
        3. Wait for the collection to be created
        4. Then you can start chatting
        """)
        return

    # Display collection info and LLM selector
    col1, col2, col3 = st.columns([2, 2, 1])
    with col1:
        st.info(f"📚 Collection: {st.session_state.current_collection}")

    with col2:
        # LLM selector for chat
        selected_llm = st.selectbox(
            "Select LLM for chat:",
            settings.llm_models,
            index=settings.llm_models.index(st.session_state.current_llm),
            key="chat_llm_selector"
        )

        if selected_llm != st.session_state.current_llm:
            st.session_state.current_llm = selected_llm
            # Recreate RAG pipeline's client with the newly selected LLM.
            # NOTE(review): groq_api_key is only set in session state by the
            # collection-loading helpers; if absent, an empty key is used and
            # subsequent queries will fail to authenticate — confirm.
            llm_client = GroqLLMClient(
                api_key=st.session_state.groq_api_key if "groq_api_key" in st.session_state else "",
                model_name=selected_llm,
                max_rpm=settings.groq_rpm_limit,
                rate_limit_delay=settings.rate_limit_delay
            )
            st.session_state.rag_pipeline.llm_client = llm_client

    with col3:
        if st.button("🗑️ Clear History"):
            st.session_state.chat_history = []
            st.session_state.rag_pipeline.clear_history()
            st.rerun()

    # Chat container
    chat_container = st.container()

    # Display chat history (one user + assistant pair per stored entry)
    with chat_container:
        for chat_idx, entry in enumerate(st.session_state.chat_history):
            # User message
            with st.chat_message("user"):
                st.write(entry["query"])

            # Assistant message
            with st.chat_message("assistant"):
                st.write(entry["response"])

                # Show retrieved documents in expander
                with st.expander("📄 Retrieved Documents"):
                    for doc_idx, doc in enumerate(entry["retrieved_documents"]):
                        # NOTE(review): applying :.4f to doc.get('distance', 'N/A')
                        # raises if 'distance' is missing (can't float-format 'N/A')
                        # — confirm retrieval results always include a distance.
                        st.markdown(f"**Document {doc_idx+1}** (Distance: {doc.get('distance', 'N/A'):.4f})")
                        st.text_area(
                            f"doc_{chat_idx}_{doc_idx}",
                            value=doc["document"],
                            height=100,
                            key=f"doc_area_{chat_idx}_{doc_idx}",
                            label_visibility="collapsed"
                        )
                        if doc.get("metadata"):
                            st.caption(f"Metadata: {doc['metadata']}")

    # Chat input
    query = st.chat_input("Ask a question...")

    if query:
        # Check if collection exists before querying.
        # NOTE(review): this reads rag_pipeline.vector_store, while the pipeline
        # is constructed with vector_store_manager= — verify the attribute name
        # matches RAGPipeline's implementation.
        if not st.session_state.rag_pipeline or not st.session_state.rag_pipeline.vector_store.current_collection:
            st.error("❌ No data loaded. Please load a dataset first using the configuration panel.")
            st.stop()

        # Add user message
        with chat_container:
            with st.chat_message("user"):
                st.write(query)

        # Generate response
        with st.spinner("Generating response..."):
            try:
                result = st.session_state.rag_pipeline.query(query)
            except Exception as e:
                st.error(f"❌ Error querying: {str(e)}")
                st.info("Please load a dataset and create a collection first.")
                st.stop()

        # Add assistant message
        with chat_container:
            with st.chat_message("assistant"):
                st.write(result["response"])

                # Show retrieved documents
                with st.expander("📄 Retrieved Documents"):
                    for doc_idx, doc in enumerate(result["retrieved_documents"]):
                        st.markdown(f"**Document {doc_idx+1}** (Distance: {doc.get('distance', 'N/A'):.4f})")
                        st.text_area(
                            f"doc_current_{doc_idx}",
                            value=doc["document"],
                            height=100,
                            key=f"doc_current_area_{doc_idx}",
                            label_visibility="collapsed"
                        )
                        if doc.get("metadata"):
                            st.caption(f"Metadata: {doc['metadata']}")

        # Store in history; st.rerun() re-renders the turn from chat_history
        st.session_state.chat_history.append(result)
        st.rerun()
546
+
547
+
548
def evaluation_interface():
    """Evaluation interface tab.

    Lets the user pick an LLM and a sample count, run TRACE evaluation, and
    view/download the aggregated and per-sample scores.
    """
    st.subheader("📊 TRACE Evaluation")

    # Check if collection is loaded
    if not st.session_state.collection_loaded:
        st.warning("⚠️ No data loaded. Please load a collection first.")
        return

    # LLM selector for evaluation (col2 is intentionally left empty spacing)
    col1, col2 = st.columns([3, 1])
    with col1:
        selected_llm = st.selectbox(
            "Select LLM for evaluation:",
            settings.llm_models,
            index=settings.llm_models.index(st.session_state.current_llm),
            key="eval_llm_selector"
        )

    st.markdown("""
    Run TRACE evaluation metrics on test data:
    - **Utilization**: How well the system uses retrieved documents
    - **Relevance**: Relevance of retrieved documents to the query
    - **Adherence**: How well the response adheres to the retrieved context
    - **Completeness**: How complete the response is in answering the query
    """)

    num_test_samples = st.slider(
        "Number of test samples",
        min_value=5,
        max_value=50,
        value=10,
        step=5
    )

    if st.button("🔬 Run Evaluation", type="primary"):
        # Use selected LLM for evaluation
        run_evaluation(num_test_samples, selected_llm)

    # Display results from the most recent evaluation run (persisted in session)
    if st.session_state.evaluation_results:
        results = st.session_state.evaluation_results

        st.success("✅ Evaluation Complete!")

        # Display aggregate scores
        col1, col2, col3, col4, col5 = st.columns(5)

        with col1:
            st.metric("📊 Utilization", f"{results['utilization']:.3f}")
        with col2:
            st.metric("🎯 Relevance", f"{results['relevance']:.3f}")
        with col3:
            st.metric("✅ Adherence", f"{results['adherence']:.3f}")
        with col4:
            st.metric("📝 Completeness", f"{results['completeness']:.3f}")
        with col5:
            st.metric("⭐ Average", f"{results['average']:.3f}")

        # Detailed per-sample results
        with st.expander("📋 Detailed Results"):
            df = pd.DataFrame(results["individual_scores"])
            st.dataframe(df, use_container_width=True)

        # Download results
        # NOTE(review): json.dumps requires all score values to be plain Python
        # floats — confirm the evaluator does not emit numpy scalars here.
        results_json = json.dumps(results, indent=2)
        st.download_button(
            label="💾 Download Results (JSON)",
            data=results_json,
            file_name=f"trace_evaluation_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json",
            mime="application/json"
        )
620
+
621
+
622
def run_evaluation(num_samples: int, selected_llm: str = None):
    """Run TRACE evaluation over test samples from the current dataset.

    Args:
        num_samples: Requested number of test samples to evaluate.
        selected_llm: Optional model name; when it differs from the session's
            current LLM, a temporary client is swapped in for the duration of
            the run and restored afterwards.

    Stores the aggregated results in st.session_state.evaluation_results.
    """
    with st.spinner(f"Running evaluation on {num_samples} samples..."):
        # Only set when we temporarily swap LLM clients; checked in `finally`.
        original_llm = None
        try:
            # Use selected LLM if provided
            if selected_llm and selected_llm != st.session_state.current_llm:
                st.info(f"Switching to {selected_llm} for evaluation...")
                groq_api_key = st.session_state.groq_api_key if "groq_api_key" in st.session_state else ""
                eval_llm_client = GroqLLMClient(
                    api_key=groq_api_key,
                    model_name=selected_llm,
                    max_rpm=settings.groq_rpm_limit,
                    rate_limit_delay=settings.rate_limit_delay
                )
                # Temporarily replace LLM client
                original_llm = st.session_state.rag_pipeline.llm_client
                st.session_state.rag_pipeline.llm_client = eval_llm_client

            # FIX: dataset_name is only set when a collection is created in this
            # session; guard instead of raising AttributeError/KeyError when the
            # user loaded a pre-existing collection.
            dataset_name = st.session_state.get("dataset_name")
            if not dataset_name:
                st.error("No dataset is associated with this session. Create a collection via the sidebar to enable evaluation.")
                return

            # Get test data
            loader = RAGBenchLoader()
            test_data = loader.get_test_data(dataset_name, num_samples)

            if not test_data:
                st.warning("No test data available for evaluation.")
                return

            # Prepare test cases by querying the RAG system for each sample
            test_cases = []
            progress_bar = st.progress(0)

            for i, sample in enumerate(test_data):
                # Query the RAG system
                result = st.session_state.rag_pipeline.query(
                    sample["question"],
                    n_results=5
                )

                # Prepare test case
                test_cases.append({
                    "query": sample["question"],
                    "response": result["response"],
                    "retrieved_documents": [doc["document"] for doc in result["retrieved_documents"]],
                    "ground_truth": sample.get("answer", "")
                })

                # BUG FIX: divide by the actual number of returned test samples,
                # not the requested count — the loader may return fewer, which
                # previously left the bar short of 100%.
                progress_bar.progress((i + 1) / len(test_data))

            # Run evaluation
            evaluator = TRACEEvaluator()
            results = evaluator.evaluate_batch(test_cases)

            st.session_state.evaluation_results = results

        except Exception as e:
            st.error(f"Error during evaluation: {str(e)}")
        finally:
            # BUG FIX: restore the original LLM client even when evaluation
            # fails mid-run; previously an exception left the swapped client
            # in place for subsequent chat queries.
            if original_llm is not None:
                st.session_state.rag_pipeline.llm_client = original_llm
682
+
683
+
684
def history_interface():
    """History interface tab.

    Shows each stored conversation in an expander and offers a JSON export
    of the full chat history.
    """
    st.subheader("📜 Chat History")

    history = st.session_state.chat_history
    if not history:
        st.info("No chat history yet. Start a conversation in the Chat tab!")
        return

    # Export button lives in the narrow right-hand column
    _, export_col = st.columns([3, 1])
    with export_col:
        st.download_button(
            label="💾 Export History",
            data=json.dumps(history, indent=2),
            file_name=f"chat_history_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json",
            mime="application/json"
        )

    # One collapsible panel per stored conversation turn
    for turn_idx, record in enumerate(history):
        with st.expander(f"💬 Conversation {turn_idx+1}: {record['query'][:50]}..."):
            st.markdown(f"**Query:** {record['query']}")
            st.markdown(f"**Response:** {record['response']}")
            st.markdown(f"**Timestamp:** {record.get('timestamp', 'N/A')}")

            st.markdown("**Retrieved Documents:**")
            for doc_pos, retrieved in enumerate(record["retrieved_documents"]):
                st.text_area(
                    f"Document {doc_pos+1}",
                    value=retrieved["document"],
                    height=100,
                    key=f"history_doc_{turn_idx}_{doc_pos}"
                )
718
+
719
+
720
+ if __name__ == "__main__":
721
+ main()
trace_evaluator.py ADDED
@@ -0,0 +1,352 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """TRACE evaluation metrics for RAG systems.
2
+
3
+ TRACE Metrics:
4
+ - uTilization: How well the system uses retrieved documents
5
+ - Relevance: Relevance of retrieved documents to the query
6
+ - Adherence: How well the response adheres to the retrieved context
7
+ - Completeness: How complete the response is in answering the query
8
+ """
9
+ from typing import List, Dict, Optional
10
+ import numpy as np
11
+ from dataclasses import dataclass
12
+ import re
13
+ from collections import Counter
14
+
15
+
16
@dataclass
class TRACEScores:
    """Container for TRACE evaluation scores.

    Attributes:
        utilization: How well the system uses retrieved documents (0-1).
        relevance: Relevance of retrieved documents to the query (0-1).
        adherence: How grounded the response is in the context (0-1).
        completeness: How fully the response answers the query (0-1).
    """
    utilization: float
    relevance: float
    adherence: float
    completeness: float

    def average(self) -> float:
        """Return the mean of the four metric scores."""
        parts = (self.utilization, self.relevance, self.adherence, self.completeness)
        return sum(parts) / 4

    def to_dict(self) -> Dict:
        """Serialize the scores (plus their mean) into a plain dictionary."""
        return {
            "utilization": self.utilization,
            "relevance": self.relevance,
            "adherence": self.adherence,
            "completeness": self.completeness,
            "average": self.average(),
        }
38
+
39
+
40
class TRACEEvaluator:
    """TRACE evaluation metrics for RAG systems.

    All metric methods are heuristic (lexical-overlap based) and return plain
    Python floats in [0, 1] so results stay JSON-serializable.
    """

    def __init__(self, llm_client=None):
        """Initialize TRACE evaluator.

        Args:
            llm_client: Optional LLM client for LLM-based evaluation
                (currently unused by the heuristic metrics).
        """
        self.llm_client = llm_client

    def evaluate(
        self,
        query: str,
        response: str,
        retrieved_documents: List[str],
        ground_truth: Optional[str] = None
    ) -> TRACEScores:
        """Evaluate a RAG response using TRACE metrics.

        Args:
            query: User query
            response: Generated response
            retrieved_documents: List of retrieved documents
            ground_truth: Optional ground truth answer

        Returns:
            TRACEScores object
        """
        utilization = self._compute_utilization(response, retrieved_documents)
        relevance = self._compute_relevance(query, retrieved_documents)
        adherence = self._compute_adherence(response, retrieved_documents)
        completeness = self._compute_completeness(query, response, ground_truth)

        return TRACEScores(
            utilization=utilization,
            relevance=relevance,
            adherence=adherence,
            completeness=completeness
        )

    def _compute_utilization(
        self,
        response: str,
        retrieved_documents: List[str]
    ) -> float:
        """Compute utilization score.

        Measures how well the system uses retrieved documents, combining the
        proportion of documents with significant word overlap (weight 0.6)
        and the average depth of that overlap (weight 0.4).

        Args:
            response: Generated response
            retrieved_documents: List of retrieved documents

        Returns:
            Utilization score (0-1)
        """
        if not retrieved_documents or not response:
            return 0.0

        response_words = set(self._tokenize(response.lower()))

        # Count how many documents contributed and how strongly
        docs_used = 0
        total_overlap = 0

        for doc in retrieved_documents:
            doc_words = set(self._tokenize(doc.lower()))

            overlap = len(response_words & doc_words)
            if overlap > 5:  # Threshold for significant contribution
                docs_used += 1
                total_overlap += overlap

        # Score based on proportion of documents used
        proportion_used = docs_used / len(retrieved_documents)

        # Also consider depth of utilization (normalized to cap at 1.0)
        avg_overlap = total_overlap / len(retrieved_documents)
        depth_score = min(avg_overlap / 20, 1.0)

        utilization_score = 0.6 * proportion_used + 0.4 * depth_score

        return min(utilization_score, 1.0)

    def _compute_relevance(
        self,
        query: str,
        retrieved_documents: List[str]
    ) -> float:
        """Compute relevance score.

        Measures relevance of retrieved documents to the query using a 50/50
        blend of lexical word overlap and keyword matching, averaged across
        documents.

        Args:
            query: User query
            retrieved_documents: List of retrieved documents

        Returns:
            Relevance score (0-1) as a plain float
        """
        if not retrieved_documents or not query:
            return 0.0

        query_lower = query.lower()
        query_words = set(self._tokenize(query_lower))
        query_keywords = self._extract_keywords(query_lower)

        relevance_scores = []

        for doc in retrieved_documents:
            doc_lower = doc.lower()
            doc_words = set(self._tokenize(doc_lower))

            # Lexical overlap
            overlap = len(query_words & doc_words)
            overlap_score = overlap / len(query_words) if query_words else 0

            # Keyword matching (substring containment)
            keyword_matches = sum(1 for kw in query_keywords if kw in doc_lower)
            keyword_score = keyword_matches / len(query_keywords) if query_keywords else 0

            relevance_scores.append(0.5 * overlap_score + 0.5 * keyword_score)

        # BUG FIX: cast to a plain float — np.mean returns numpy.float64,
        # which leaks into to_dict()/evaluate_batch() output and breaks
        # json.dumps on the results.
        return float(np.mean(relevance_scores))

    def _compute_adherence(
        self,
        response: str,
        retrieved_documents: List[str]
    ) -> float:
        """Compute adherence score.

        Measures how well the response adheres to the retrieved context:
        per response sentence, the fraction of its words that also occur in
        the combined documents, averaged across sentences.

        Args:
            response: Generated response
            retrieved_documents: List of retrieved documents

        Returns:
            Adherence score (0-1) as a plain float
        """
        if not retrieved_documents or not response:
            return 0.0

        # Combine all documents into one vocabulary
        combined_docs = " ".join(retrieved_documents).lower()
        doc_words = set(self._tokenize(combined_docs))

        response_sentences = self._split_sentences(response.lower())

        adherence_scores = []

        for sentence in response_sentences:
            sentence_words = set(self._tokenize(sentence))

            # Proportion of sentence words grounded in the documents
            if sentence_words:
                grounded_words = len(sentence_words & doc_words)
                adherence_scores.append(grounded_words / len(sentence_words))

        # BUG FIX: plain float (see _compute_relevance)
        return float(np.mean(adherence_scores)) if adherence_scores else 0.0

    def _compute_completeness(
        self,
        query: str,
        response: str,
        ground_truth: Optional[str] = None
    ) -> float:
        """Compute completeness score.

        Measures how complete the response is: a length factor, bonuses when
        the response type matches the question word (when/where/who), and —
        if available — word overlap with the ground-truth answer.

        Args:
            query: User query
            response: Generated response
            ground_truth: Optional ground truth answer

        Returns:
            Completeness score (0-1) as a plain float
        """
        if not response or not query:
            return 0.0

        query_lower = query.lower()

        # Question-type detection (is_what/is_why/is_how computed for parity
        # with the heuristics even though only when/where/who carry bonuses)
        is_what = any(w in query_lower for w in ["what", "which"])
        is_when = "when" in query_lower
        is_where = "where" in query_lower
        is_who = "who" in query_lower
        is_why = "why" in query_lower
        is_how = "how" in query_lower

        response_lower = response.lower()

        completeness_factors = []

        # Length check (not too short); saturates at min_length characters
        min_length = 50
        length_score = min(len(response) / min_length, 1.0)
        completeness_factors.append(length_score)

        # Check for appropriate response type
        if is_when and any(w in response_lower for w in ["year", "date", "time", "century"]):
            completeness_factors.append(1.0)
        elif is_where and any(w in response_lower for w in ["location", "place", "country", "city"]):
            completeness_factors.append(1.0)
        elif is_who and any(w in response_lower for w in ["person", "people", "name"]):
            completeness_factors.append(1.0)

        # If ground truth available, compare word overlap
        if ground_truth:
            gt_words = set(self._tokenize(ground_truth.lower()))
            response_words = set(self._tokenize(response_lower))

            overlap = len(gt_words & response_words)
            gt_score = overlap / len(gt_words) if gt_words else 0
            completeness_factors.append(gt_score)

        # BUG FIX: plain float (see _compute_relevance)
        return float(np.mean(completeness_factors)) if completeness_factors else 0.5

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize text into words, dropping punctuation, very short words,
        and a small stop-word list."""
        text = re.sub(r'[^\w\s]', ' ', text)
        words = text.split()
        stop_words = {"a", "an", "the", "is", "are", "was", "were", "in", "on", "at", "to", "for"}
        return [w for w in words if len(w) > 2 and w not in stop_words]

    def _extract_keywords(self, text: str) -> List[str]:
        """Extract keywords from text.

        Simple frequency-based extraction over the tokenized words; in
        production, use TF-IDF or similar.
        """
        word_freq = Counter(self._tokenize(text))
        return list(word_freq.keys())

    def _split_sentences(self, text: str) -> List[str]:
        """Split text into non-empty sentences on ./!/? boundaries."""
        sentences = re.split(r'[.!?]+', text)
        return [s.strip() for s in sentences if s.strip()]

    def evaluate_batch(
        self,
        test_data: List[Dict]
    ) -> Dict:
        """Evaluate multiple test cases.

        Args:
            test_data: List of test cases, each containing:
                - query: User query
                - response: Generated response
                - retrieved_documents: Retrieved documents
                - ground_truth: Ground truth answer (optional)

        Returns:
            Dictionary with aggregated scores (all plain floats, so the
            result is directly JSON-serializable).
        """
        # BUG FIX: guard the empty batch — np.mean([]) yields NaN with a
        # RuntimeWarning, which previously propagated into the results dict.
        if not test_data:
            return {
                "utilization": 0.0,
                "relevance": 0.0,
                "adherence": 0.0,
                "completeness": 0.0,
                "average": 0.0,
                "num_samples": 0,
                "individual_scores": []
            }

        all_scores = []

        for i, test_case in enumerate(test_data):
            print(f"Evaluating test case {i+1}/{len(test_data)}")

            scores = self.evaluate(
                query=test_case.get("query", ""),
                response=test_case.get("response", ""),
                retrieved_documents=test_case.get("retrieved_documents", []),
                ground_truth=test_case.get("ground_truth")
            )

            all_scores.append(scores)

        # Aggregate per-metric means across the batch
        avg_utilization = np.mean([s.utilization for s in all_scores])
        avg_relevance = np.mean([s.relevance for s in all_scores])
        avg_adherence = np.mean([s.adherence for s in all_scores])
        avg_completeness = np.mean([s.completeness for s in all_scores])

        return {
            "utilization": float(avg_utilization),
            "relevance": float(avg_relevance),
            "adherence": float(avg_adherence),
            "completeness": float(avg_completeness),
            "average": float((avg_utilization + avg_relevance +
                              avg_adherence + avg_completeness) / 4),
            "num_samples": len(test_data),
            "individual_scores": [s.to_dict() for s in all_scores]
        }
vector_store.py ADDED
@@ -0,0 +1,412 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """ChromaDB integration for vector storage and retrieval."""
2
+ from typing import List, Dict, Optional, Tuple
3
+ import chromadb
4
+ from chromadb.config import Settings
5
+ import uuid
6
+ import os
7
+ from embedding_models import EmbeddingFactory, EmbeddingModel
8
+ from chunking_strategies import ChunkingFactory
9
+ import json
10
+
11
+
12
+ class ChromaDBManager:
13
+ """Manager for ChromaDB operations."""
14
+
15
    def __init__(self, persist_directory: str = "./chroma_db"):
        """Initialize ChromaDB manager.

        Args:
            persist_directory: Directory to persist ChromaDB data
        """
        self.persist_directory = persist_directory
        # Ensure the storage directory exists before the client touches it.
        os.makedirs(persist_directory, exist_ok=True)

        # Initialize ChromaDB client with is_persistent=True to use persistent storage
        try:
            self.client = chromadb.PersistentClient(
                path=persist_directory,
                settings=Settings(
                    anonymized_telemetry=False,
                    allow_reset=True  # Allow reset if needed
                )
            )
        except Exception as e:
            # NOTE(review): with recent chromadb versions, chromadb.Client may
            # be in-memory only despite persist_directory — confirm the
            # fallback still persists data on the installed version.
            print(f"Warning: Could not create persistent client: {e}")
            print("Falling back to regular client...")
            self.client = chromadb.Client(Settings(
                persist_directory=persist_directory,
                anonymized_telemetry=False,
                allow_reset=True
            ))

        # Both are populated lazily by create_collection()/get_collection().
        self.embedding_model = None
        self.current_collection = None
44
+
45
+ def reconnect(self):
46
+ """Reconnect to ChromaDB in case of connection loss."""
47
+ try:
48
+ self.client = chromadb.PersistentClient(
49
+ path=self.persist_directory,
50
+ settings=Settings(
51
+ anonymized_telemetry=False,
52
+ allow_reset=True
53
+ )
54
+ )
55
+ print("✅ Reconnected to ChromaDB")
56
+ except Exception as e:
57
+ print(f"Error reconnecting: {e}")
58
+
59
+ def create_collection(
60
+ self,
61
+ collection_name: str,
62
+ embedding_model_name: str,
63
+ metadata: Optional[Dict] = None
64
+ ) -> chromadb.Collection:
65
+ """Create a new collection.
66
+
67
+ Args:
68
+ collection_name: Name of the collection
69
+ embedding_model_name: Name of the embedding model
70
+ metadata: Additional metadata for the collection
71
+
72
+ Returns:
73
+ ChromaDB collection
74
+ """
75
+ # Delete if exists
76
+ try:
77
+ self.client.delete_collection(collection_name)
78
+ except:
79
+ pass
80
+
81
+ # Create embedding model
82
+ self.embedding_model = EmbeddingFactory.create_embedding_model(embedding_model_name)
83
+ self.embedding_model.load_model()
84
+
85
+ # Create collection with metadata
86
+ collection_metadata = {
87
+ "embedding_model": embedding_model_name,
88
+ "hnsw:space": "cosine"
89
+ }
90
+ if metadata:
91
+ collection_metadata.update(metadata)
92
+
93
+ self.current_collection = self.client.create_collection(
94
+ name=collection_name,
95
+ metadata=collection_metadata
96
+ )
97
+
98
+ print(f"Created collection: {collection_name}")
99
+ return self.current_collection
100
+
101
+ def get_collection(self, collection_name: str) -> chromadb.Collection:
102
+ """Get an existing collection.
103
+
104
+ Args:
105
+ collection_name: Name of the collection
106
+
107
+ Returns:
108
+ ChromaDB collection
109
+ """
110
+ self.current_collection = self.client.get_collection(collection_name)
111
+
112
+ # Load embedding model from metadata
113
+ metadata = self.current_collection.metadata
114
+ if "embedding_model" in metadata:
115
+ self.embedding_model = EmbeddingFactory.create_embedding_model(
116
+ metadata["embedding_model"]
117
+ )
118
+ self.embedding_model.load_model()
119
+
120
+ return self.current_collection
121
+
122
+ def list_collections(self) -> List[str]:
123
+ """List all collections.
124
+
125
+ Returns:
126
+ List of collection names
127
+ """
128
+ collections = self.client.list_collections()
129
+ return [col.name for col in collections]
130
+
131
+ def clear_all_collections(self) -> int:
132
+ """Delete all collections from the database.
133
+
134
+ Returns:
135
+ Number of collections deleted
136
+ """
137
+ collections = self.list_collections()
138
+ count = 0
139
+
140
+ for collection_name in collections:
141
+ try:
142
+ self.client.delete_collection(collection_name)
143
+ print(f"Deleted collection: {collection_name}")
144
+ count += 1
145
+ except Exception as e:
146
+ print(f"Error deleting collection {collection_name}: {e}")
147
+
148
+ self.current_collection = None
149
+ self.embedding_model = None
150
+ print(f"✅ Cleared {count} collections")
151
+ return count
152
+
153
    def delete_collection(self, collection_name: str) -> bool:
        """Delete a specific collection.

        NOTE(review): a second ``delete_collection`` is defined later in this
        class; in Python the later definition wins, so this version is dead
        code. The two should be consolidated — this one has the richer
        contract (boolean result plus cache invalidation).

        Args:
            collection_name: Name of the collection to delete

        Returns:
            True if deleted successfully, False otherwise
        """
        try:
            self.client.delete_collection(collection_name)
            # Drop stale references if the deleted collection was active.
            if self.current_collection and self.current_collection.name == collection_name:
                self.current_collection = None
                self.embedding_model = None
            print(f"✅ Deleted collection: {collection_name}")
            return True
        except Exception as e:
            print(f"❌ Error deleting collection: {e}")
            return False
172
+
173
+ def add_documents(
174
+ self,
175
+ documents: List[str],
176
+ metadatas: Optional[List[Dict]] = None,
177
+ ids: Optional[List[str]] = None,
178
+ batch_size: int = 100
179
+ ):
180
+ """Add documents to the current collection.
181
+
182
+ Args:
183
+ documents: List of document texts
184
+ metadatas: List of metadata dictionaries
185
+ ids: List of document IDs
186
+ batch_size: Batch size for processing
187
+ """
188
+ if not self.current_collection:
189
+ raise ValueError("No collection selected. Create or get a collection first.")
190
+
191
+ if not self.embedding_model:
192
+ raise ValueError("No embedding model loaded.")
193
+
194
+ # Generate IDs if not provided
195
+ if ids is None:
196
+ ids = [str(uuid.uuid4()) for _ in documents]
197
+
198
+ # Generate default metadata if not provided
199
+ if metadatas is None:
200
+ metadatas = [{"index": i} for i in range(len(documents))]
201
+
202
+ # Process in batches
203
+ total_docs = len(documents)
204
+ print(f"Adding {total_docs} documents to collection...")
205
+
206
+ for i in range(0, total_docs, batch_size):
207
+ batch_docs = documents[i:i + batch_size]
208
+ batch_ids = ids[i:i + batch_size]
209
+ batch_metadatas = metadatas[i:i + batch_size]
210
+
211
+ # Generate embeddings
212
+ embeddings = self.embedding_model.embed_documents(batch_docs)
213
+
214
+ # Add to collection
215
+ self.current_collection.add(
216
+ documents=batch_docs,
217
+ embeddings=embeddings.tolist(),
218
+ metadatas=batch_metadatas,
219
+ ids=batch_ids
220
+ )
221
+
222
+ print(f"Added batch {i//batch_size + 1}/{(total_docs-1)//batch_size + 1}")
223
+
224
+ print(f"Successfully added {total_docs} documents")
225
+
226
+ def load_dataset_into_collection(
227
+ self,
228
+ collection_name: str,
229
+ embedding_model_name: str,
230
+ chunking_strategy: str,
231
+ dataset_data: List[Dict],
232
+ chunk_size: int = 512,
233
+ overlap: int = 50
234
+ ):
235
+ """Load a dataset into a new collection with chunking.
236
+
237
+ Args:
238
+ collection_name: Name for the new collection
239
+ embedding_model_name: Embedding model to use
240
+ chunking_strategy: Chunking strategy to use
241
+ dataset_data: List of dataset samples
242
+ chunk_size: Size of chunks
243
+ overlap: Overlap between chunks
244
+ """
245
+ # Create collection
246
+ self.create_collection(
247
+ collection_name,
248
+ embedding_model_name,
249
+ metadata={
250
+ "chunking_strategy": chunking_strategy,
251
+ "chunk_size": chunk_size,
252
+ "overlap": overlap
253
+ }
254
+ )
255
+
256
+ # Get chunker
257
+ chunker = ChunkingFactory.create_chunker(chunking_strategy)
258
+
259
+ # Process documents
260
+ all_chunks = []
261
+ all_metadatas = []
262
+
263
+ print(f"Processing {len(dataset_data)} documents with {chunking_strategy} chunking...")
264
+
265
+ for idx, sample in enumerate(dataset_data):
266
+ # Use 'documents' list if available, otherwise fall back to 'context'
267
+ documents = sample.get("documents", [])
268
+
269
+ # If documents is empty, use context as fallback
270
+ if not documents:
271
+ context = sample.get("context", "")
272
+ if context:
273
+ documents = [context]
274
+
275
+ if not documents:
276
+ continue
277
+
278
+ # Process each document separately for better granularity
279
+ for doc_idx, document in enumerate(documents):
280
+ if not document or not str(document).strip():
281
+ continue
282
+
283
+ # Chunk each document
284
+ chunks = chunker.chunk_text(str(document), chunk_size, overlap)
285
+
286
+ # Create metadata for each chunk
287
+ for chunk_idx, chunk in enumerate(chunks):
288
+ all_chunks.append(chunk)
289
+ all_metadatas.append({
290
+ "doc_id": idx,
291
+ "doc_idx": doc_idx, # Track which document within the sample
292
+ "chunk_id": chunk_idx,
293
+ "question": sample.get("question", ""),
294
+ "answer": sample.get("answer", ""),
295
+ "dataset": sample.get("dataset", ""),
296
+ "total_docs": len(documents)
297
+ })
298
+
299
+ # Add all chunks to collection
300
+ self.add_documents(all_chunks, all_metadatas)
301
+
302
+ print(f"Loaded {len(all_chunks)} chunks from {len(dataset_data)} samples")
303
+
304
+ def query(
305
+ self,
306
+ query_text: str,
307
+ n_results: int = 5,
308
+ filter_metadata: Optional[Dict] = None
309
+ ) -> Dict:
310
+ """Query the collection.
311
+
312
+ Args:
313
+ query_text: Query text
314
+ n_results: Number of results to return
315
+ filter_metadata: Metadata filter
316
+
317
+ Returns:
318
+ Query results
319
+ """
320
+ if not self.current_collection:
321
+ raise ValueError("No collection selected.")
322
+
323
+ if not self.embedding_model:
324
+ raise ValueError("No embedding model loaded.")
325
+
326
+ # Generate query embedding
327
+ query_embedding = self.embedding_model.embed_query(query_text)
328
+
329
+ # Query collection with retry logic
330
+ try:
331
+ results = self.current_collection.query(
332
+ query_embeddings=[query_embedding.tolist()],
333
+ n_results=n_results,
334
+ where=filter_metadata
335
+ )
336
+ except Exception as e:
337
+ if "default_tenant" in str(e):
338
+ print("Warning: Lost connection to ChromaDB, reconnecting...")
339
+ self.reconnect()
340
+ # Try again after reconnecting
341
+ results = self.current_collection.query(
342
+ query_embeddings=[query_embedding.tolist()],
343
+ n_results=n_results,
344
+ where=filter_metadata
345
+ )
346
+ else:
347
+ raise
348
+
349
+ return results
350
+
351
+ def get_retrieved_documents(
352
+ self,
353
+ query_text: str,
354
+ n_results: int = 5
355
+ ) -> List[Dict]:
356
+ """Get retrieved documents with metadata.
357
+
358
+ Args:
359
+ query_text: Query text
360
+ n_results: Number of results
361
+
362
+ Returns:
363
+ List of retrieved documents with metadata
364
+ """
365
+ results = self.query(query_text, n_results)
366
+
367
+ retrieved_docs = []
368
+ for i in range(len(results['documents'][0])):
369
+ retrieved_docs.append({
370
+ "document": results['documents'][0][i],
371
+ "metadata": results['metadatas'][0][i],
372
+ "distance": results['distances'][0][i] if 'distances' in results else None
373
+ })
374
+
375
+ return retrieved_docs
376
+
377
+ def delete_collection(self, collection_name: str):
378
+ """Delete a collection.
379
+
380
+ Args:
381
+ collection_name: Name of collection to delete
382
+ """
383
+ try:
384
+ self.client.delete_collection(collection_name)
385
+ print(f"Deleted collection: {collection_name}")
386
+ except Exception as e:
387
+ print(f"Error deleting collection: {str(e)}")
388
+
389
+ def get_collection_stats(self, collection_name: Optional[str] = None) -> Dict:
390
+ """Get statistics for a collection.
391
+
392
+ Args:
393
+ collection_name: Name of collection (uses current if None)
394
+
395
+ Returns:
396
+ Dictionary with collection statistics
397
+ """
398
+ if collection_name:
399
+ collection = self.client.get_collection(collection_name)
400
+ elif self.current_collection:
401
+ collection = self.current_collection
402
+ else:
403
+ raise ValueError("No collection specified or selected")
404
+
405
+ count = collection.count()
406
+ metadata = collection.metadata
407
+
408
+ return {
409
+ "name": collection.name,
410
+ "count": count,
411
+ "metadata": metadata
412
+ }