---
title: AB Testing RAG Agent
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: streamlit_app.py
pinned: false
---

# AB Testing RAG Agent

This repository contains a Retrieval-Augmented Generation (RAG) agent specialized in A/B testing that:

1. Answers questions about A/B testing using a collection of Ron Kohavi's work
2. Automatically searches arXiv for academic papers when additional context is needed
3. Preserves privacy by pre-processing PDFs locally and deploying only the processed data

## Features

- Interactive chat interface built with Streamlit
- Vector search using Qdrant with OpenAI embeddings
- Two-tier approach:
  - Initial RAG search for efficiency
  - Advanced agent with tools for complex questions
- Smart source handling and deduplication
- arXiv integration for academic paper search
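
The source deduplication mentioned above can be sketched as follows. This is a hypothetical illustration, not the app's actual code: the function name and the `(title, page)` key are assumptions; the real implementation may key on document IDs or chunk hashes instead.

```python
def dedupe_sources(sources: list[dict]) -> list[dict]:
    """Drop repeat citations while preserving first-seen order.

    Hypothetical sketch: keys on (title, page); the real app may use
    document IDs or chunk hashes instead.
    """
    seen: set[tuple] = set()
    unique: list[dict] = []
    for src in sources:
        key = (src.get("title"), src.get("page"))
        if key not in seen:
            seen.add(key)
            unique.append(src)
    return unique
```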

## Quick Start

### Local Development

1. Clone this repository:
```bash
git clone https://github.com/yourusername/AB_Testing_RAG_Agent.git
cd AB_Testing_RAG_Agent
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Create a `.env` file with your OpenAI API key:
```
OPENAI_API_KEY=your_openai_api_key_here
```
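
The app presumably reads this key from the environment at startup. A minimal sketch of a fail-fast check, using only the standard library; `require_api_key` is a hypothetical helper, and any `.env` loading (e.g. via python-dotenv) is assumed to happen before it runs:

```python
import os

def require_api_key() -> str:
    """Fail fast with a clear message if OPENAI_API_KEY is missing.

    Hypothetical helper; the app itself may load the .env file
    (e.g. with python-dotenv) before this check runs.
    """
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it to your .env file or export it."
        )
    return key
```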

4. Process your PDF files (only needed once):
```bash
python scripts/preprocess_data.py
```

5. Run the Streamlit app:
```bash
streamlit run streamlit_app.py
```

### Docker Deployment

1. Build the Docker image:
```bash
docker build -t ab-testing-rag-agent .
```

2. Run the container:
```bash
docker run -p 8000:8000 -e OPENAI_API_KEY=your_openai_api_key_here ab-testing-rag-agent
```

## Deployment to Hugging Face

1. Prepare for deployment (checks that all required files are ready):
```bash
python scripts/prepare_for_deployment.py
```

2. Push to your Hugging Face Space:
```bash
# Initialize git repository if not already done
git init
git add .
git commit -m "Initial commit"

# Add Hugging Face Space remote
git remote add hf https://huggingface.co/spaces/yourusername/ab-testing-rag

# Push to Hugging Face
git push hf main
```

3. Set both required environment variables in the Hugging Face Space settings:
   - `OPENAI_API_KEY`: Your OpenAI API key
   - `HF_TOKEN`: Your Hugging Face token with access to the dataset

### Setting Up The PDF Dataset on Hugging Face

The deployment uses PDFs stored in a separate Hugging Face dataset repository. To set up your own:

1. Create a dataset repository on Hugging Face called `yourusername/ab_testing_pdfs`

2. Upload all your PDF files to this repository via the Hugging Face UI or git:
   ```bash
   git clone https://huggingface.co/datasets/yourusername/ab_testing_pdfs
   cd ab_testing_pdfs
   cp /path/to/your/pdfs/*.pdf .
   git add .
   git commit -m "Add AB Testing PDFs"
   git push
   ```

3. Update the dataset name in `download_pdfs.py` if you used a different repository name

4. Make sure your `HF_TOKEN` has read access to this dataset repository

## Architecture

- **Pre-processing Pipeline**: PDF files are processed locally, converted to embeddings, and stored in a vector database
- **Retrieval System**: Uses OpenAI's text-embedding-3-small model and Qdrant for vector search
- **Response Generation**:
  - Initial attempt with gpt-4.1-mini for efficiency
  - Falls back to gpt-4.1 with tools for complex queries
- **arXiv Integration**: Searches arXiv for academic papers when the local corpus is insufficient
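
The two-tier response flow can be sketched as below. All function names here are illustrative assumptions, not the app's real API; the point is the control flow: a cheap first pass, escalating to the tool-equipped agent only when the draft answer is judged insufficient.

```python
def answer(question: str, rag_search, agent, is_good_enough) -> str:
    """Try a plain RAG answer first; escalate to the agent if it falls short.

    Hypothetical sketch: rag_search stands in for the gpt-4.1-mini pass over
    retrieved chunks, agent for the gpt-4.1 agent with arXiv/search tools,
    and is_good_enough for the quality-evaluation step.
    """
    draft = rag_search(question)
    if is_good_enough(draft):
        return draft
    return agent(question)
```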

## Adding Your Own PDFs

1. Add PDF files to the `data/` directory
2. Run the preprocessing script:
```bash
python scripts/preprocess_data.py
```
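
What the preprocessing step does to each PDF can be sketched as overlapping text windows prepared for embedding. This is a hypothetical illustration of the idea, not the script's actual code; the real splitter may be token-based with different sizes:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split extracted PDF text into overlapping windows for embedding.

    Hypothetical sketch of the preprocessing step; the overlap keeps
    sentences that straddle a boundary retrievable from either chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]
```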

## Implementation Notes

- Uses the text-embedding-3-small model for embeddings
- Uses gpt-4.1-mini for initial responses
- Uses gpt-4.1 for agent tools and quality evaluation
- Stores preprocessed data in `processed_data/` directory