---
title: AB Testing RAG
emoji: 🦀
colorFrom: indigo
colorTo: indigo
sdk: docker
pinned: false
license: mit
short_description: AB Testing RAG app using Ron Kohavi's work
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
#### Note:
I asked Dr. Greg in his office hours whether it was okay to change this Pythonic RAG app so that, instead of allowing the user to upload a file and ask questions about it, the app already has PDFs of Ronny Kohavi's work on A/B testing (chunked and embedded in its vector database), and the user can ask questions about A/B testing. Dr. Greg approved this, so please do not mark off points because of it!
#### Question 1:
Why do we want to support streaming? What about streaming is important, or useful?
#### Answer:
This is all about the user's perception. Sometimes it might take the chat model a while (e.g., several minutes) to complete its entire response. If we showed nothing until the response was finished, the user might think the app is broken, get frustrated, and/or exit the app. By streaming, the user sees almost immediately that something is happening, rather than waiting for the entire message to be generated, and this leads to higher user satisfaction. It's important to note that streaming does nothing to make the tokens generate faster; it's really just about giving the user the perception of progress.
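The idea can be sketched without any model at all. This is a minimal, hypothetical illustration (the function names and the word-level "tokens" are assumptions, not the app's actual API): the consumer displays each token the moment it arrives instead of waiting for the whole string.

```python
import time

def generate_tokens(response: str):
    """Simulate a chat model emitting its answer one token at a time."""
    for token in response.split():
        time.sleep(0.01)  # stand-in for per-token model latency
        yield token

def stream_to_user(response: str) -> str:
    """Show each token as soon as it arrives instead of waiting for the full reply."""
    shown = []
    for token in generate_tokens(response):
        shown.append(token)  # in a real UI this would update the message in place
    return " ".join(shown)
```

The final text is identical either way; only the perceived latency differs, which is exactly the point made above.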
#### Question 2:
Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
#### Answer:
We are using a user session here because if we used a global variable in Python instead, different users' data would get mixed up, which would be bad. For example, if we tracked each session's message count with a global variable, then two users chatting with the app at the same time would both increment the same counter, so we wouldn't actually know either session's true message count. Since each user session is unique to one user and one chat session, storing the counter in the user session lets us track each session's true message count.
Along the same lines, we wouldn't want two (or more) users chatting with the app at the same time to interact with each other's files; that would be really bad. With user sessions (instead of a global variable), their files stay completely separate, so we avoid these kinds of issues.
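The counter example above can be made concrete. In this hypothetical sketch, the `sessions` dict stands in for the framework's per-session store (e.g., what `cl.user_session` provides in Chainlit); the global counter shows the exact mixing problem described above.

```python
# A global counter is shared by every user of the process.
message_count = 0

def handle_message_global() -> int:
    """Every user increments the same counter, mixing sessions together."""
    global message_count
    message_count += 1
    return message_count

def handle_message_session(sessions: dict, session_id: str) -> int:
    """Each session gets its own counter, keyed by its session id."""
    sessions[session_id] = sessions.get(session_id, 0) + 1
    return sessions[session_id]
```

With the global version, two concurrent users see counts 1 and 2 for what is each user's first message; with the per-session version, both correctly see 1.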
#### Discussion Question 1:
Upload a PDF file of the recent DeepSeek-R1 paper and ask the following questions:
1. What is RL and how does it help reasoning?
2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
3. What is this paper about?
Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
#### Note:
As described above, I got permission from Dr. Greg to change the app: instead of letting the user upload a file, the app already contains A/B testing files and lets the user ask A/B testing questions. Thus, I will adjust Discussion Question 1 appropriately for the app I built. Please do not mark off points for this, as the adjustment was approved by Dr. Greg.
#### Updated Questions:
1. What is False Positive Risk and why is it important?
2. What is the difference between sequential A/B testing and standard A/B testing?
3. What is Ronny Kohavi's paper 'AB Testing Intuition Busters' about?
Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
#### Answer:
While the app's response to the first question above passed my vibe check, its responses to the next two questions did not. Thus, overall, the app did not pass my vibe check.
Specifically, for the second question, the app responded with 'I don't know the answer.' This reveals a big pitfall in the retrieval part of the RAG pipeline, because the question should be answerable from the sources. One major pitfall is our primitive splitting strategy, which is based purely on character count. Another pitfall is that we have no fallback mechanism: if the chat model initially responds with 'I don't know the answer,' a tool could call another chat model with a prompt containing the user's original query, ask it to generate five rewrites of that query, and then retrieve chunks based on those five new queries.
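The fallback described above can be sketched as plain control flow. Everything here is hypothetical: `answer_fn`, `rewrite_fn`, and `retrieve_fn` are assumed hooks standing in for the chat model, a query-rewriting model, and the vector store, not real APIs in this app.

```python
def answer_with_fallback(query, answer_fn, rewrite_fn, retrieve_fn):
    """If the first answer is a refusal, retry retrieval with rewritten queries.

    answer_fn(query, chunks) -> str     # chat model over retrieved context
    rewrite_fn(query) -> list[str]      # e.g., five rewrites of the query
    retrieve_fn(query) -> list[str]     # chunks from the vector database
    """
    first = answer_fn(query, retrieve_fn(query))
    if "i don't know" not in first.lower():
        return first
    # Refusal detected: pool the chunks retrieved for each rewritten query,
    # then ask the model again with the richer context.
    pooled = []
    for rewritten in rewrite_fn(query):
        for chunk in retrieve_fn(rewritten):
            if chunk not in pooled:
                pooled.append(chunk)
    return answer_fn(query, pooled)
```

This is one of the simplest multi-query retrieval patterns; the rewrites cost extra model calls but only when the first attempt fails.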
Specifically, for the third question, none of the sources the app provided came from the paper the question references, even though that paper's chunks are in the database! This further shows the app's pitfalls. We should have filtering so that if the user asks about a specific paper in our database, the returned chunks come from that paper.
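A minimal sketch of that filtering, under an assumed chunk schema (each chunk is a dict with `text`, `source`, and a precomputed similarity `score`; real code would compute the score from embeddings instead of storing it):

```python
def retrieve(chunks, paper_filter=None, k=3):
    """Return the top-k chunks, optionally restricted to one source paper.

    `chunks` is a list of dicts like
    {"text": ..., "source": ..., "score": ...} (hypothetical schema);
    `score` stands in for query-chunk embedding similarity.
    """
    candidates = chunks
    if paper_filter is not None:
        # Metadata filter: only consider chunks from the requested paper.
        candidates = [c for c in chunks if c["source"] == paper_filter]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
```

With a filter applied, a chunk from an unrelated paper can never crowd out the paper the user actually asked about, even if it happens to score higher.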
While this isn't a major issue yet given our small database, the app compares the query's embedding vector against the embedding vector of every chunk in the database. That linear scan will not scale at all if we wanted a much bigger database.
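The scaling problem is visible in the shape of the code itself. This hypothetical sketch does exactly what the app does, one cosine-similarity computation per stored chunk, so each query costs O(N) in the number of chunks; a large corpus would need an approximate-nearest-neighbor index (e.g., HNSW) instead.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_search(query_vec, chunk_vecs, k=2):
    """Compare the query against every chunk vector: O(N) per query.

    Fine for a handful of papers; it is the part that must be replaced
    by an ANN index when the database grows.
    """
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```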