---
title: AB Testing RAG
emoji: 🦀
colorFrom: indigo
colorTo: indigo
sdk: docker
pinned: false
license: mit
short_description: AB Testing RAG app using Ron Kohavi's work
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

#### Note:

I asked Dr. Greg in his office hours if it was OK to change this Pythonic RAG app so that, instead of letting the user upload a file and ask questions about it, the app already ships with PDFs of Ronny Kohavi's work on A/B Testing (chunked and embedded in its vector database) and the user asks questions about A/B Testing. Dr. Greg approved this, so please do not mark off points because of it!

#### Question 1:

Why do we want to support streaming? What about streaming is important, or useful?

#### Answer:

This is all about the user's perception. Sometimes it can take the chat model a while (e.g. several minutes) to finish its entire response. If we showed the user nothing until then, the user might think the app is not working, get frustrated, and/or exit the app. By streaming, the user sees something happening almost immediately instead of waiting until the entire message is generated, which leads to higher user satisfaction. It's important to note that streaming has nothing to do with making tokens generate faster; it is purely about giving the user the perception that something is happening.

#### Question 2:

Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?

#### Answer:

We are using User Session here because if we instead used a global variable in Python, different users' data would get mixed up, and that would be bad.
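The difference can be illustrated with a small sketch. This is purely illustrative, not the app's actual code: `session_id`, `on_message`, and `session_state` are names I made up, standing in for what a framework like Chainlit provides via `cl.user_session`.

```python
# Global counter: shared by everyone. Two users chatting at once both
# bump the same number, so no per-session count survives.
global_count = 0

# Per-session state: one entry per session id, so counts stay separate.
# (Illustrative stand-in for what cl.user_session does in Chainlit.)
session_state: dict[str, int] = {}

def on_message(session_id: str) -> int:
    """Record one message for this session and return its running count."""
    global global_count
    global_count += 1  # mixes all users together
    session_state[session_id] = session_state.get(session_id, 0) + 1
    return session_state[session_id]

# Two users interleave messages:
on_message("alice")
on_message("bob")
on_message("alice")

# The global counter says 3, but alice's true count is 2 and bob's is 1:
assert global_count == 3
assert session_state == {"alice": 2, "bob": 1}
```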
For example, suppose we kept each session's message count in a global variable: if two users chat with the app at the same time, both increment the same global counter, so we never learn any single session's true message count. Since each user session is unique to one user and one chat session, using the user session lets us track each session's true message count. Along the same lines, we wouldn't want two (or more) users chatting with the app at the same time to interact with each other's files; that would be really bad. By using user sessions (instead of a global variable), their files stay completely separate and none of these issues arise.

#### Discussion Question 1:

Upload a PDF file of the recent DeepSeek-R1 paper and ask the following questions:

1. What is RL and how does it help reasoning?
2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
3. What is this paper about?

Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?

#### Note:

As described above, I got permission from Dr. Greg to change the app: instead of letting the user upload a file, the app already contains the A/B Testing files and lets the user ask A/B Testing questions. I have therefore adjusted Discussion Question 1 to fit the app I built. Please do not mark off points for this, as the adjustment was approved by Dr. Greg.

#### Updated Questions:

1. What is False Positive Risk and why is it important?
2. What is the difference between sequential A/B testing and standard A/B testing?
3. What is Ronny Kohavi's paper 'AB Testing Intuition Busters' about?

Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
#### Answer:

While the app's response to the first question passed my vibe check, its responses to the next two did not, so overall the app did not pass my vibe check.

For the second question, the app responded with 'I don't know the answer.' This exposes a big pitfall in the retrieval part of the RAG pipeline, because the sources do contain the answer. One major pitfall is the primitive splitting strategy: chunks are cut purely by character count. Another pitfall is that we have no fallback tooling: if the chat model initially responds with 'I don't know the answer', we could call a chat model again with a prompt that includes the user's original query, ask it to generate 5 different rewrites of that query, and then retrieve chunks based on those 5 new queries.

For the third question, none of the app's cited sources were the paper the question names, even though that paper's chunks are in the database! This further exposes the app's pitfalls. We should add filtering so that when the user asks about a specific paper in our database, the returned chunks come from that paper.

Finally, while this isn't a major issue yet given our small database, the app compares the query's embedding vector against the embedding vector of every chunk in the database. This brute-force scan will not scale at all if we ever want a much bigger database.