---
title: AB Testing RAG
emoji: 🦀
colorFrom: indigo
colorTo: indigo
sdk: docker
pinned: false
license: mit
short_description: AB Testing RAG app using Ron Kohavi's work
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
#### Note:
I asked Dr. Greg in his office hours whether it was okay to change this Pythonic RAG app so that, instead of allowing the user to upload a file and ask questions about it, the app already has PDFs of Ronny Kohavi's work on A/B testing (chunked and embedded in its vector database), and the user can ask questions about A/B testing. Dr. Greg approved this, so please do not mark off points because of it!
#### Question 1:
Why do we want to support streaming? What about streaming is important, or useful?
#### Answer:
This is all about the user's perception. Sometimes it might take the chat model a while (e.g., several minutes) to complete its entire response. If we showed nothing until the response was finished, the user might think the app is broken, get frustrated, and/or exit the app. By streaming, the user sees almost immediately that something is happening, rather than waiting for the entire message to be generated, and this leads to higher user satisfaction. It's important to note that streaming does nothing to make the tokens generate faster; it's really just about giving the user the perception of progress.
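The idea can be sketched without any model at all. This is a minimal, hypothetical illustration (the function names and the word-level "tokens" are assumptions, not the app's actual API): the consumer displays each token the moment it arrives instead of waiting for the whole string.

```python
import time

def generate_tokens(response: str):
    """Simulate a chat model emitting its answer one token at a time."""
    for token in response.split():
        time.sleep(0.01)  # stand-in for per-token model latency
        yield token

def stream_to_user(response: str) -> str:
    """Show each token as soon as it arrives instead of waiting for the full reply."""
    shown = []
    for token in generate_tokens(response):
        shown.append(token)  # in a real UI this would update the message in place
    return " ".join(shown)
```

The final text is identical either way; only the perceived latency differs, which is exactly the point made above.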
#### Question 2:
Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
#### Answer:
We are using a user session here because if we used a global variable in Python instead, different users' data would get mixed up, which would be bad. For example, if we tracked each session's message count with a global variable, then two users chatting with the app at the same time would both increment the same counter, so we wouldn't actually know either session's true message count. Since each user session is unique to one user and one chat session, storing the counter in the user session lets us track each session's true message count.
Along the same lines, we wouldn't want two (or more) users chatting with the app at the same time to interact with each other's files; that would be really bad. With user sessions (instead of a global variable), their files stay completely separate, so we avoid these kinds of issues.
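The counter example above can be made concrete. In this hypothetical sketch, the `sessions` dict stands in for the framework's per-session store (e.g., what `cl.user_session` provides in Chainlit); the global counter shows the exact mixing problem described above.

```python
# A global counter is shared by every user of the process.
message_count = 0

def handle_message_global() -> int:
    """Every user increments the same counter, mixing sessions together."""
    global message_count
    message_count += 1
    return message_count

def handle_message_session(sessions: dict, session_id: str) -> int:
    """Each session gets its own counter, keyed by its session id."""
    sessions[session_id] = sessions.get(session_id, 0) + 1
    return sessions[session_id]
```

With the global version, two concurrent users see counts 1 and 2 for what is each user's first message; with the per-session version, both correctly see 1.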
#### Discussion Question 1:
Upload a PDF file of the recent DeepSeek-R1 paper and ask the following questions:
1. What is RL and how does it help reasoning?
2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
3. What is this paper about?
Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
#### Note:
As described above, I got permission from Dr. Greg to change the app: instead of letting the user upload a file, the app already contains A/B testing files and lets the user ask A/B testing questions. Thus, I will adjust Discussion Question 1 appropriately for the app I built. Please do not mark off points for this, as the adjustment was approved by Dr. Greg.
#### Updated Questions:
1. What is False Positive Risk and why is it important?
2. What is the difference between sequential A/B testing and standard A/B testing?
3. What is Ronny Kohavi's paper 'AB Testing Intuition Busters' about?
Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
#### Answer:
While the app's response to the first question above passed my vibe check, its responses to the next two questions did not. Thus, overall, the app did not pass my vibe check.
Specifically, for the second question, the app responded with 'I don't know the answer.' This reveals a big pitfall in the retrieval part of the RAG pipeline, because the question should be answerable from the sources. One major pitfall is our primitive splitting strategy, which is based purely on character count. Another pitfall is that we have no fallback mechanism: if the chat model initially responds with 'I don't know the answer,' a tool could call another chat model with a prompt containing the user's original query, ask it to generate five rewrites of that query, and then retrieve chunks based on those five new queries.
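The fallback described above can be sketched as plain control flow. Everything here is hypothetical: `answer_fn`, `rewrite_fn`, and `retrieve_fn` are assumed hooks standing in for the chat model, a query-rewriting model, and the vector store, not real APIs in this app.

```python
def answer_with_fallback(query, answer_fn, rewrite_fn, retrieve_fn):
    """If the first answer is a refusal, retry retrieval with rewritten queries.

    answer_fn(query, chunks) -> str     # chat model over retrieved context
    rewrite_fn(query) -> list[str]      # e.g., five rewrites of the query
    retrieve_fn(query) -> list[str]     # chunks from the vector database
    """
    first = answer_fn(query, retrieve_fn(query))
    if "i don't know" not in first.lower():
        return first
    # Refusal detected: pool the chunks retrieved for each rewritten query,
    # then ask the model again with the richer context.
    pooled = []
    for rewritten in rewrite_fn(query):
        for chunk in retrieve_fn(rewritten):
            if chunk not in pooled:
                pooled.append(chunk)
    return answer_fn(query, pooled)
```

This is one of the simplest multi-query retrieval patterns; the rewrites cost extra model calls but only when the first attempt fails.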
Specifically, for the third question, none of the sources the app provided came from the paper the question references, even though that paper's chunks are in the database! This further shows the app's pitfalls. We should have filtering so that if the user asks about a specific paper in our database, the returned chunks come from that paper.
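A minimal sketch of that filtering, under an assumed chunk schema (each chunk is a dict with `text`, `source`, and a precomputed similarity `score`; real code would compute the score from embeddings instead of storing it):

```python
def retrieve(chunks, paper_filter=None, k=3):
    """Return the top-k chunks, optionally restricted to one source paper.

    `chunks` is a list of dicts like
    {"text": ..., "source": ..., "score": ...} (hypothetical schema);
    `score` stands in for query-chunk embedding similarity.
    """
    candidates = chunks
    if paper_filter is not None:
        # Metadata filter: only consider chunks from the requested paper.
        candidates = [c for c in chunks if c["source"] == paper_filter]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
```

With a filter applied, a chunk from an unrelated paper can never crowd out the paper the user actually asked about, even if it happens to score higher.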
While this isn't a major issue yet given our small database, the app compares the query's embedding vector against the embedding vector of every chunk in the database. That linear scan will not scale at all if we wanted a much bigger database.
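The scaling problem is visible in the shape of the code itself. This hypothetical sketch does exactly what the app does, one cosine-similarity computation per stored chunk, so each query costs O(N) in the number of chunks; a large corpus would need an approximate-nearest-neighbor index (e.g., HNSW) instead.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_search(query_vec, chunk_vecs, k=2):
    """Compare the query against every chunk vector: O(N) per query.

    Fine for a handful of papers; it is the part that must be replaced
    by an ANN index when the database grows.
    """
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```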