---
title: AB Testing RAG Agent
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: streamlit_app.py
pinned: false
---

# AB Testing RAG Agent

This repository contains a Retrieval-Augmented Generation (RAG) agent specialized in A/B testing that:

1. Answers questions about A/B testing using a collection of Ron Kohavi's work
2. Automatically searches ArXiv for academic papers when needed for better responses
3. Preserves privacy by pre-processing PDFs locally and deploying only the processed data

## Features

- Interactive chat interface built with Streamlit
- Vector search using Qdrant with OpenAI embeddings
- Two-tier approach:
  - Initial RAG search for efficiency
  - Advanced agent with tools for complex questions
- Smart source handling and deduplication
- ArXiv integration

## Quick Start

### Local Development

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/AB_Testing_RAG_Agent.git
   cd AB_Testing_RAG_Agent
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Create a `.env` file with your OpenAI API key:

   ```
   OPENAI_API_KEY=your_openai_api_key_here
   ```

4. Process your PDF files (only needed once):

   ```bash
   python scripts/preprocess_data.py
   ```

5. Run the Streamlit app:

   ```bash
   streamlit run streamlit_app.py
   ```

### Docker Deployment

1. Build the Docker image:

   ```bash
   docker build -t ab-testing-rag-agent .
   ```

2. Run the container:

   ```bash
   docker run -p 8000:8000 -e OPENAI_API_KEY=your_openai_api_key_here ab-testing-rag-agent
   ```

## Deployment to Hugging Face

1. Prepare for deployment (checks that all required files are ready):

   ```bash
   python scripts/prepare_for_deployment.py
   ```

2. Push to your Hugging Face Space:

   ```bash
   # Initialize the git repository if not already done
   git init
   git add .
   git commit -m "Initial commit"

   # Add the Hugging Face Space remote
   git remote add hf https://huggingface.co/spaces/yourusername/ab-testing-rag

   # Push to Hugging Face
   git push hf main
   ```

3. Set both required environment variables in the Hugging Face Space settings:
   - `OPENAI_API_KEY`: your OpenAI API key
   - `HF_TOKEN`: your Hugging Face token with read access to the dataset

### Setting Up the PDF Dataset on Hugging Face

The deployment uses PDFs stored in a separate Hugging Face dataset repository. To set up your own:

1. Create a dataset repository on Hugging Face named `yourusername/ab_testing_pdfs`
2. Upload your PDF files to this repository via the Hugging Face UI or git:

   ```bash
   git clone https://huggingface.co/datasets/yourusername/ab_testing_pdfs
   cd ab_testing_pdfs
   cp /path/to/your/pdfs/*.pdf .
   git add .
   git commit -m "Add AB Testing PDFs"
   git push
   ```

3. Update the dataset name in `download_pdfs.py` if you used a different repository name
4. Make sure your `HF_TOKEN` has read access to this dataset repository

## Architecture

- **Pre-processing pipeline**: PDF files are processed locally, converted to embeddings, and stored in a vector database
- **Retrieval system**: uses OpenAI's text-embedding-3-small model and Qdrant for vector search
- **Response generation**:
  - Initial attempt with gpt-4.1-mini for efficiency
  - Falls back to gpt-4.1 with tools for complex queries
- **ArXiv integration**: searches academic papers when necessary

## Adding Your Own PDFs

1. Add PDF files to the `data/` directory
2. Run the preprocessing script:

   ```bash
   python scripts/preprocess_data.py
   ```

## Implementation Notes

- Uses the text-embedding-3-small model for embeddings
- Uses gpt-4.1-mini for initial responses
- Uses gpt-4.1 for agent tools and quality evaluation
- Stores preprocessed data in the `processed_data/` directory
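
The "smart source handling and deduplication" feature can be sketched roughly as follows. This is an illustrative sketch, not the repository's actual code: `dedupe_sources` and the shape of the hit dictionaries are assumed names, standing in for whatever the app does with Qdrant search results before building its answer.

```python
def dedupe_sources(hits):
    """Collapse retrieved chunks so each source document appears only once,
    keeping the highest-scoring chunk per source.

    `hits` is assumed to be a list of dicts with "source", "score", and
    "text" keys (an illustrative shape, not the repo's real schema).
    """
    best = {}
    for hit in hits:
        source = hit["source"]
        # Keep only the strongest match for each source document.
        if source not in best or hit["score"] > best[source]["score"]:
            best[source] = hit
    # Return the surviving chunks ordered by descending relevance.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# Hypothetical retrieval results: two chunks from the same Kohavi PDF
# plus one from an ArXiv paper.
hits = [
    {"source": "kohavi_trustworthy.pdf", "score": 0.82, "text": "..."},
    {"source": "kohavi_trustworthy.pdf", "score": 0.74, "text": "..."},
    {"source": "arxiv_2106.pdf", "score": 0.79, "text": "..."},
]
print([h["source"] for h in dedupe_sources(hits)])
# ['kohavi_trustworthy.pdf', 'arxiv_2106.pdf']
```

Collapsing duplicates before prompting keeps the context window focused: the model sees each source once, at its most relevant passage, and the cited source list stays clean.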