---
title: AB Testing RAG Agent
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: streamlit_app.py
pinned: false
---

AB Testing RAG Agent

This repository contains a Retrieval-Augmented Generation (RAG) agent specialized in A/B Testing that:

  1. Answers questions about A/B Testing using a collection of Ron Kohavi's work
  2. Automatically searches ArXiv for academic papers when needed for better responses
  3. Preserves privacy by pre-processing PDFs locally and only deploying processed data

Features

  • Interactive chat interface built with Streamlit
  • Vector search using Qdrant with OpenAI embeddings
  • Two-tier approach:
    • Initial RAG search for efficiency
    • Advanced agent with tools for complex questions
  • Smart source handling and deduplication
  • ArXiv integration
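The "smart source handling and deduplication" feature can be sketched as follows. This is an illustrative toy, not the repository's code: the `Source` class and `dedupe_sources` helper are hypothetical names, but they show the idea of collapsing retrieved chunks that cite the same document and page before displaying sources.

```python
# Hypothetical sketch of source deduplication: retrieved chunks often
# repeat the same (title, page) pair, so keep only the first occurrence.
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    title: str
    page: int

def dedupe_sources(sources):
    """Keep the first occurrence of each (title, page) pair, preserving order."""
    seen = set()
    unique = []
    for s in sources:
        key = (s.title, s.page)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

hits = [Source("Trustworthy Online Controlled Experiments", 12),
        Source("Trustworthy Online Controlled Experiments", 12),
        Source("Seven Rules of Thumb", 3)]
print(len(dedupe_sources(hits)))  # → 2
```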

Quick Start

Local Development

  1. Clone this repository:
     git clone https://github.com/yourusername/AB_Testing_RAG_Agent.git
     cd AB_Testing_RAG_Agent
  2. Install dependencies:
     pip install -r requirements.txt
  3. Create a .env file with your OpenAI API key:
     OPENAI_API_KEY=your_openai_api_key_here
  4. Process your PDF files (only needed once):
     python scripts/preprocess_data.py
  5. Run the Streamlit app:
     streamlit run streamlit_app.py

Docker Deployment

  1. Build the Docker image:
     docker build -t ab-testing-rag-agent .
  2. Run the container:
     docker run -p 8000:8000 -e OPENAI_API_KEY=your_openai_api_key_here ab-testing-rag-agent

Deployment to Hugging Face

  1. Prepare for deployment (checks that all required files are ready):
     python scripts/prepare_for_deployment.py
  2. Push to your Hugging Face Space:
     # Initialize a git repository if not already done
     git init
     git add .
     git commit -m "Initial commit"

     # Add the Hugging Face Space remote
     git remote add hf https://huggingface.co/spaces/yourusername/ab-testing-rag

     # Push to Hugging Face
     git push hf main
  3. Set both required environment variables in the Hugging Face Space settings:
     • OPENAI_API_KEY: Your OpenAI API key
     • HF_TOKEN: Your Hugging Face token with access to the dataset

Setting Up The PDF Dataset on Hugging Face

The deployment uses PDFs stored in a separate Hugging Face dataset repo. To set up your own:

  1. Create a dataset repository on Hugging Face called yourusername/ab_testing_pdfs

  2. Upload all your PDF files to this repository via the Hugging Face UI or git:

    git clone https://huggingface.co/datasets/yourusername/ab_testing_pdfs
    cd ab_testing_pdfs
    cp /path/to/your/pdfs/*.pdf .
    git add .
    git commit -m "Add AB Testing PDFs"
    git push
    
  3. Update the dataset name in download_pdfs.py if you used a different repository name

  4. Make sure your HF_TOKEN has read access to this dataset repository

Architecture

  • Pre-processing Pipeline: PDF files are processed locally, converted to embeddings, and stored in a vector database
  • Retrieval System: Uses OpenAI's text-embedding-3-small model and Qdrant for vector search
  • Response Generation:
    • Initial attempt with gpt-4.1-mini for efficiency
    • Falls back to gpt-4.1 with tools for complex queries
  • ArXiv Integration: Searches academic papers when necessary
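The retrieval step above can be sketched in miniature. This is a hedged, offline toy: `embed` is a keyword-counting stub standing in for OpenAI's text-embedding-3-small, and a plain dict stands in for Qdrant; the real app calls the OpenAI and Qdrant clients instead.

```python
# Toy retrieval sketch: embed the query, rank documents by cosine
# similarity, return the top-k. All names here are illustrative.
import math

def embed(text: str) -> list[float]:
    # Stub: real code would call the OpenAI embeddings API.
    # This toy version counts a few keywords so the example runs offline.
    vocab = ["test", "metric", "sample", "power"]
    return [float(text.lower().count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, corpus[doc]), reverse=True)
    return ranked[:k]

corpus = {doc: embed(doc) for doc in [
    "sample size and statistical power",
    "choosing an overall evaluation metric",
    "novelty effects in experiments",
]}
print(search("how much power does my test need?", corpus, k=1))
```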

Adding Your Own PDFs

  1. Add PDF files to the data/ directory
  2. Run the preprocessing script:
python scripts/preprocess_data.py
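Inside the preprocessing script, extracted page text is typically split into overlapping chunks so each embedding covers a bounded span. The sketch below is an assumption about how that stage might look; the function name, chunk size, and overlap are illustrative, not the actual values in preprocess_data.py.

```python
# Illustrative chunking stage: fixed-size character windows with overlap
# so context is not lost at chunk boundaries.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` characters sharing `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

page = "A/B tests compare two variants... " * 20
chunks = chunk_text(page, chunk_size=200, overlap=50)
print(len(chunks), len(chunks[0]))  # → 5 200
```

Adjacent chunks share their last/first 50 characters, which keeps sentences that straddle a boundary retrievable from both sides.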

Implementation Notes

  • Uses the text-embedding-3-small model for embeddings
  • Uses gpt-4.1-mini for initial responses
  • Uses gpt-4.1 for agent tools and quality evaluation
  • Stores preprocessed data in processed_data/ directory
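The notes above describe a two-tier flow: answer with the cheaper model first, grade the answer, and escalate to the stronger model with tools only when the grade is low. A minimal sketch, with the LLM calls stubbed out and all names illustrative (the real app calls the OpenAI API with gpt-4.1-mini and gpt-4.1):

```python
# Hedged sketch of the two-tier response flow; stubs replace real API calls.
def call_model(model: str, question: str) -> str:
    # Stub for the chat-completion call.
    return f"[{model}] answer to: {question}"

def grade_answer(answer: str) -> float:
    # Stub for the quality-evaluation step (per the notes, gpt-4.1 grades it).
    # Toy heuristic: pretend short answers are low quality.
    return 1.0 if len(answer) > 40 else 0.2

def answer(question: str, threshold: float = 0.7) -> str:
    first = call_model("gpt-4.1-mini", question)
    if grade_answer(first) >= threshold:
        return first
    return call_model("gpt-4.1", question)  # escalate to the agent with tools

print(answer("What is an A/A test?"))
```

The design keeps most queries on the cheap path and pays for the tool-using agent only when the initial answer fails the quality check.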