---
title: AB Testing RAG
emoji: 🦀
colorFrom: indigo
colorTo: indigo
sdk: docker
pinned: false
license: mit
short_description: AB Testing RAG app using Ron Kohavi's work
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

#### Note:

I asked Dr. Greg in his office hours if it was OK to change this Pythonic RAG app so that, instead of letting the user upload a file and ask questions about it, the app already ships with PDFs of Ronny Kohavi's work on A/B Testing (chunked and embedded in its vector database) and the user asks questions about A/B Testing. Dr. Greg approved this, so please do not mark off points because of it!

#### Question 1:

Why do we want to support streaming? What about streaming is important, or useful?

#### Answer:

This is all about the user's perception. Sometimes it can take the chat model a while (e.g. several minutes) to finish its entire response. If we showed the user nothing until then, the user might think the app is not working, get frustrated, and/or exit the app. By streaming, the user sees something happening almost immediately instead of waiting until the entire message is generated, which leads to higher user satisfaction. It's important to note that streaming has nothing to do with making tokens generate faster; it is purely about giving the user the perception that something is happening.

#### Question 2:

Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?

#### Answer:

We are using User Session here because if we instead used a global variable in Python, different users' data would get mixed up, and that would be bad.
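The difference can be illustrated with a small sketch. This is purely illustrative, not the app's actual code: `session_id`, `on_message`, and `session_state` are names I made up, standing in for what a framework like Chainlit provides via `cl.user_session`.

```python
# Global counter: shared by everyone. Two users chatting at once both
# bump the same number, so no per-session count survives.
global_count = 0

# Per-session state: one entry per session id, so counts stay separate.
# (Illustrative stand-in for what cl.user_session does in Chainlit.)
session_state: dict[str, int] = {}

def on_message(session_id: str) -> int:
    """Record one message for this session and return its running count."""
    global global_count
    global_count += 1  # mixes all users together
    session_state[session_id] = session_state.get(session_id, 0) + 1
    return session_state[session_id]

# Two users interleave messages:
on_message("alice")
on_message("bob")
on_message("alice")

# The global counter says 3, but alice's true count is 2 and bob's is 1:
assert global_count == 3
assert session_state == {"alice": 2, "bob": 1}
```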
For example, suppose we kept each session's message count in a global variable: if two users chat with the app at the same time, both increment the same global counter, so we never learn any single session's true message count. Since each user session is unique to one user and one chat session, using the user session lets us track each session's true message count. Along the same lines, we wouldn't want two (or more) users chatting with the app at the same time to interact with each other's files; that would be really bad. By using user sessions (instead of a global variable), their files stay completely separate and none of these issues arise.

#### Discussion Question 1:

Upload a PDF file of the recent DeepSeek-R1 paper and ask the following questions:

1. What is RL and how does it help reasoning?
2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
3. What is this paper about?

Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?

#### Note:

As described above, I got permission from Dr. Greg to change the app: instead of letting the user upload a file, the app already contains the A/B Testing files and lets the user ask A/B Testing questions. I have therefore adjusted Discussion Question 1 to fit the app I built. Please do not mark off points for this, as the adjustment was approved by Dr. Greg.

#### Updated Questions:

1. What is False Positive Risk and why is it important?
2. What is the difference between sequential A/B testing and standard A/B testing?
3. What is Ronny Kohavi's paper 'AB Testing Intuition Busters' about?

Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
#### Answer:

While the app's response to the first question passed my vibe check, its responses to the next two did not, so overall the app did not pass my vibe check.

For the second question, the app responded with 'I don't know the answer.' This exposes a big pitfall in the retrieval part of the RAG pipeline, because the sources do contain the answer. One major pitfall is the primitive splitting strategy: chunks are cut purely by character count. Another pitfall is that we have no fallback tooling: if the chat model initially responds with 'I don't know the answer', we could call a chat model again with a prompt that includes the user's original query, ask it to generate 5 different rewrites of that query, and then retrieve chunks based on those 5 new queries.

For the third question, none of the app's cited sources were the paper the question names, even though that paper's chunks are in the database! This further exposes the app's pitfalls. We should add filtering so that when the user asks about a specific paper in our database, the returned chunks come from that paper.

Finally, while this isn't a major issue yet given our small database, the app compares the query's embedding vector against the embedding vector of every chunk in the database. This brute-force scan will not scale at all if we ever want a much bigger database.