Evaluation: LLM Calling & Query Processing Flow
High-Level Overview
EVALUATION PROCESS:
│
├── Load Test Data from Dataset
│     └── Questions + Ground Truth Answers
│
├── FOR EACH TEST QUESTION:
│     │
│     ├── 1. RETRIEVE DOCUMENTS (Vector Search)
│     │     └── query_text → embed_query() → semantic search → get_retrieved_documents()
│     │
│     ├── 2. GENERATE RESPONSE (LLM Call)
│     │     └── query + documents → LLM → response
│     │
│     └── 3. STORE TEST CASE (For Evaluation)
│           └── {query, response, documents, ground_truth}
│
├── COMPUTE TRACe METRICS
│     └── utilization, relevance, adherence, completeness
│
└── DISPLAY RESULTS
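The loop above can be sketched end to end with stubbed retrieval and generation (both function bodies here are placeholders, not the project's implementation):

```python
# Stub retrieval: stand-in for ChromaDB vector search
def retrieve(question, n_results=5):
    corpus = ["ML is a field of AI...", "Machine learning uses algorithms..."]
    return corpus[:n_results]

# Stub generation: stand-in for the Groq LLM call
def generate(question, docs):
    return f"Answer to {question!r}, grounded in {len(docs)} documents."

test_data = [{"question": "What is machine learning?",
              "answer": "ML is a subset of AI..."}]

test_cases = []
for sample in test_data:
    docs = retrieve(sample["question"])            # 1. RETRIEVE DOCUMENTS
    response = generate(sample["question"], docs)  # 2. GENERATE RESPONSE
    test_cases.append({                            # 3. STORE TEST CASE
        "query": sample["question"],
        "response": response,
        "retrieved_documents": docs,
        "ground_truth": sample["answer"],
    })
```

Each test case then carries everything the TRACe metrics need: the query, the generated response, the retrieved documents, and the ground truth answer.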
Detailed Flow: Query Processing in Evaluation
Step 1: Test Sample Loop (streamlit_app.py, Line 723)
for i, sample in enumerate(test_data):
    # sample = {"question": "...", "answer": "...", ...}

    # Step 2: Call RAG pipeline with the question
    result = st.session_state.rag_pipeline.query(
        sample["question"],  # ← Query string
        n_results=5          # ← How many docs to retrieve
    )
Input:
- sample["question"] = user question from the RAGBench dataset, e.g. "What is machine learning?"
- n_results=5 = retrieve the top 5 most similar documents
Step 2: RAG Pipeline Query (llm_client.py, Line 295)
class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        # ───────────────────────────────────────────────
        # PHASE 1: RETRIEVAL (Vector Search)
        # ───────────────────────────────────────────────

        # STEP 1: Call vector store to retrieve documents
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,               # "What is machine learning?"
            n_results=n_results  # Top 5 documents
        )
        # Result: [
        #   {"document": "ML is...", "metadata": {...}, "distance": 0.12},
        #   {"document": "Machine learning uses...", "metadata": {...}, "distance": 0.15},
        #   ...
        # ]

        # Extract document texts
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = ["ML is...", "Machine learning uses...", ...]

        # ───────────────────────────────────────────────
        # PHASE 2: GENERATION (LLM Call)
        # ───────────────────────────────────────────────

        # STEP 2: Call LLM with query + retrieved documents
        response = self.llm.generate_with_context(
            query,      # "What is machine learning?"
            doc_texts,  # ["ML is...", "Machine learning uses...", ...]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a subset of artificial intelligence..."

        # STEP 3: Package results
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }
Step 3A: Document Retrieval (Vector Store) (vector_store.py, Line 321)
Query Processing:
USER QUESTION:
"What is machine learning?"
        │
        ▼
1. Embed the Query
   ────────────────
   embedding_model.embed_query(query)

   Model: sentence-transformers/all-mpnet-base-v2

   The query is tokenized, each token is embedded, and the token
   embeddings are mean-pooled into a single sentence vector.

   Output: Query Vector [768-dim]
           [0.15, 0.32, 0.51, ..., 0.89]
        │
        ▼
2. Semantic Search in ChromaDB
   ────────────────
   collection.query(
       query_embeddings=[query_vector],
       n_results=5,
       where=None
   )

   The query vector is compared against all document vectors in the
   collection using cosine similarity:

       similarity = dot_product / (norm_a * norm_b)

   Top 5 Results (sorted by similarity):
   • Doc 1: "ML is a field..."    (sim: 0.92)
   • Doc 2: "Deep learning..."    (sim: 0.89)
   • Doc 3: "Neural networks..."  (sim: 0.87)
   • Doc 4: "AI overview..."      (sim: 0.81)
   • Doc 5: "Training data..."    (sim: 0.78)
        │
        ▼
3. Format Retrieved Documents
   ────────────────
   retrieved_docs = [
       {
           "document": "ML is a field...",
           "metadata": {...},
           "distance": 0.08
       },
       {...},
       ...
   ]
        │
        ▼
RETURNED TO RAGPipeline
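The scoring formula in step 2 is ordinary cosine similarity; a minimal sketch with toy 3-dimensional vectors standing in for the real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    # similarity = dot_product / (norm_a * norm_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors in place of the real 768-dim embeddings
query_vec = [0.15, 0.32, 0.51]
doc_vec = [0.14, 0.30, 0.55]
sim = cosine_similarity(query_vec, doc_vec)  # close to 1.0 for similar vectors
```

Note that in cosine space ChromaDB reports distance = 1 − similarity, which is why a similarity of 0.92 in the search step shows up as a distance of 0.08 in the formatted results.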
Step 3B: LLM Response Generation (llm_client.py, Line 215)
Retrieved Documents:
    ├─ Doc 1: "ML is a field of AI that..."
    ├─ Doc 2: "Machine learning uses algorithms..."
    ├─ Doc 3: "Neural networks process data..."
    ├─ Doc 4: "Training data is essential..."
    └─ Doc 5: "Deep learning is a subset..."
        │
        ▼
1. BUILD PROMPT
   ────────────────
   context = """
   Document 1: ML is a field of AI that...
   Document 2: Machine learning uses algorithms...
   Document 3: Neural networks process data...
   Document 4: Training data is essential...
   Document 5: Deep learning is a subset...
   """

   prompt = """
   Answer the following question based on the provided context.

   Context:
   {context}

   Question: What is machine learning?

   Answer:
   """

   system_prompt = "You are a helpful AI assistant..."
        │
        ▼
2. LLM API CALL (Groq)
   ────────────────
   Client: Groq (groq.com)
   Model: llama-3.1-8b-instant (or the selected model)
   API Endpoint: https://api.groq.com/openai/v1/chat/completions

   Request:
   {
       "model": "llama-3.1-8b-instant",
       "messages": [
           {"role": "system", "content": "You are a helpful..."},
           {"role": "user", "content": "[full prompt above]"}
       ],
       "max_tokens": 1024,
       "temperature": 0.7
   }

   Where the LLM processing happens:
   → Groq's GPU servers (not local)
   → The model processes the entire prompt
   → Generates the response token by token
   → Returns the complete response
        │
        ▼
3. PARSE LLM RESPONSE
   ────────────────
   Response Text:
   "Machine learning is a field of artificial intelligence that
    enables computers to learn from data without being explicitly
    programmed..."

   Extract: response.choices[0].message.content
   Return: Final Answer String
        │
        ▼
RETURNED TO RAGPipeline
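The request shown above maps directly onto Groq's OpenAI-compatible REST endpoint. A standard-library sketch of the same call (the production code uses the `groq` client instead; `GROQ_API_KEY` is assumed to be set in the environment):

```python
import json
import os
import urllib.request

# Build the same request body shown in the diagram above.
payload = {
    "model": "llama-3.1-8b-instant",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant..."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "https://api.groq.com/openai/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ.get('GROQ_API_KEY', '')}",
        "Content-Type": "application/json",
    },
)
# Uncomment to send the request (requires a valid API key):
# answer = json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"]
```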
Complete Code Flow for One Evaluation Query
File: streamlit_app.py (Line 723-730)
# FOR EACH TEST QUESTION IN THE DATASET:
for i, sample in enumerate(test_data):
    # sample["question"] = "What is machine learning?"
    # sample["answer"]   = "ML is a subset of AI..."

    # ── STEP 1: CALL RAG PIPELINE ──
    result = st.session_state.rag_pipeline.query(
        sample["question"],  # Pass question
        n_results=5          # Get top 5 docs
    )
    # Returns:
    # {
    #   "query": "What is machine learning?",
    #   "response": "Machine learning is...",
    #   "retrieved_documents": [
    #     {"document": "...", "metadata": {...}, ...},
    #     ...
    #   ]
    # }

    # ── STEP 2: EXTRACT RESULTS ──
    test_cases.append({
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [
            doc["document"] for doc in result["retrieved_documents"]
        ],
        "ground_truth": sample.get("answer", "")
    })
File: llm_client.py (RAGPipeline class, Line 295-340)
class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        # ── STEP 2A: RETRIEVE DOCUMENTS ──
        # Where: vector_store.py → get_retrieved_documents()
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,               # "What is machine learning?"
            n_results=n_results
        )

        # ── STEP 2B: EXTRACT DOCUMENT TEXTS ──
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = [
        #   "Machine learning is a subset of AI...",
        #   "Deep learning uses neural networks...",
        #   ...
        # ]

        # ── STEP 2C: CALL LLM ──
        # Where: llm_client.py → generate_with_context()
        response = self.llm.generate_with_context(
            query,      # "What is machine learning?"
            doc_texts,  # [retrieved document texts]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a field of AI..."

        # ── STEP 2D: RETURN RESULTS ──
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }
File: vector_store.py (ChromaDBManager class, Line 370-400)
def get_retrieved_documents(self, query_text: str, n_results: int = 5):
    # ── STEP 3A-1: QUERY THE COLLECTION ──
    # Where: vector_store.py → query()
    results = self.query(query_text, n_results)
    # results = {
    #   'documents': [[doc1, doc2, doc3, doc4, doc5]],
    #   'metadatas': [[meta1, meta2, ...]],
    #   'distances': [[dist1, dist2, ...]],
    #   'ids': [[id1, id2, ...]]
    # }

    # ── STEP 3A-2: FORMAT RESULTS ──
    retrieved_docs = []
    for i in range(len(results['documents'][0])):
        retrieved_docs.append({
            "document": results['documents'][0][i],
            "metadata": results['metadatas'][0][i],
            "distance": results['distances'][0][i]
        })
    # retrieved_docs = [
    #   {"document": "ML is...", "metadata": {...}, "distance": 0.08},
    #   {"document": "Deep...", "metadata": {...}, "distance": 0.11},
    #   ...
    # ]
    return retrieved_docs
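The nesting in `results` exists because `collection.query()` accepts a batch of queries, so every field is a list of per-query lists; index `[0]` selects the first (and only) query's results. The same reshaping as the loop above can also be written with `zip`:

```python
# ChromaDB returns one inner list per query; with a single query,
# index [0] selects its results. zip() pairs the parallel lists.
results = {
    'documents': [["ML is...", "Deep..."]],
    'metadatas': [[{"src": "a"}, {"src": "b"}]],
    'distances': [[0.08, 0.11]],
}
retrieved_docs = [
    {"document": d, "metadata": m, "distance": dist}
    for d, m, dist in zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0],
    )
]
print(retrieved_docs[0]["distance"])  # 0.08
```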
File: llm_client.py (GroqLLMClient class, Line 215-250)
def generate_with_context(self, query: str, context_documents: List[str],
                          max_tokens: int = 1024, temperature: float = 0.7):
    # ── STEP 3B-1: BUILD CONTEXT STRING ──
    context = "\n\n".join([
        f"Document {i+1}: {doc}"
        for i, doc in enumerate(context_documents)
    ])
    # context = """
    # Document 1: ML is a field of AI that...
    # Document 2: Machine learning uses algorithms...
    # ...
    # """

    # ── STEP 3B-2: BUILD PROMPT ──
    prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""
    system_prompt = "You are a helpful AI assistant..."

    # ── STEP 3B-3: CALL LLM (GROQ API) ──
    # Where: llm_client.py → generate()
    return self.generate(prompt, max_tokens=max_tokens,
                         temperature=temperature, system_prompt=system_prompt)
File: llm_client.py (GroqLLMClient.generate(), Line 110-155)
def generate(self, prompt: str, max_tokens: int = 1024,
             temperature: float = 0.7, system_prompt: str = None):
    # ── STEP 3B-4: PREPARE GROQ API CALL ──
    # Apply rate limiting (max 30 requests per minute)
    self.rate_limiter.acquire_sync()

    # Build messages for the Groq API
    messages = []
    if system_prompt:
        messages.append({
            "role": "system",
            "content": system_prompt
        })
    messages.append({
        "role": "user",
        "content": prompt
    })

    # ── STEP 3B-5: MAKE GROQ API REQUEST ──
    try:
        response = self.client.chat.completions.create(
            model=self.model_name,   # e.g., "llama-3.1-8b-instant"
            messages=messages,
            max_tokens=max_tokens,   # 1024
            temperature=temperature  # 0.7
        )
        # ── STEP 3B-6: EXTRACT RESPONSE ──
        return response.choices[0].message.content
        # Returns: "Machine learning is a field of artificial intelligence..."
    except Exception as e:
        return f"Error: {str(e)}"
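The `rate_limiter` object is referenced above but not shown. A minimal sliding-window limiter that would enforce the 30-requests-per-minute cap might look like this (a hypothetical sketch, not the project's actual implementation):

```python
import threading
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most max_calls per period seconds.
    Hypothetical stand-in for the rate_limiter used in generate()."""

    def __init__(self, max_calls: int = 30, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()          # timestamps of recent calls
        self.lock = threading.Lock()

    def acquire_sync(self):
        with self.lock:
            now = time.monotonic()
            # Drop timestamps that have aged out of the window
            while self.calls and now - self.calls[0] > self.period:
                self.calls.popleft()
            if len(self.calls) >= self.max_calls:
                # Block until the oldest call leaves the window
                wait = self.period - (now - self.calls[0])
                if wait > 0:
                    time.sleep(wait)
                self.calls.popleft()
            self.calls.append(time.monotonic())

limiter = RateLimiter(max_calls=30, period=60.0)
limiter.acquire_sync()  # returns immediately while under the limit
```

A sliding window is a reasonable fit here because Groq enforces per-minute request quotas; a simpler fixed sleep between calls would also work but wastes time under light load.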
Summary of Query Processing in Evaluation
| Step | Component | Input | Process | Output |
|---|---|---|---|---|
| 1 | Streamlit UI | Test sample | Load from dataset | Question |
| 2 | RAGPipeline | Question | Orchestrate RAG | Response |
| 2A | ChromaDB | Question | Embed & search | 5 documents |
| 2B | Embedding Model | Question text | Convert to vector | 768-dim vector |
| 2C | Groq LLM | Q + 5 docs | API call | Generated answer |
| 3 | TRACEEvaluator | Q, response, docs | Compute metrics | TRACe scores |
Where LLM Gets Called
PRIMARY LLM CALL LOCATION: llm_client.py, function GroqLLMClient.generate() (Line 110)
TRIGGERED BY:
- Chat interface: Chat tab → query → generate()
- Evaluation: run_evaluation() → rag_pipeline.query() → generate_with_context() → generate()
DURING EVALUATION SPECIFICALLY:
- Called once per test question (e.g., 10 times for 10 test samples)
- Each call:
- Gets a unique question
- Retrieves 5 relevant documents
- Asks Groq LLM to answer using those documents
- Stores result for TRACe metric computation
LLM MODEL USED:
- Default: llama-3.1-8b-instant (can be switched in the UI)
- Also available: meta-llama/llama-4-maverick-17b-128e-instruct, openai/gpt-oss-120b
- Provider: Groq (cloud-based GPU inference)