# Evaluation: LLM Calling & Query Processing Flow
## High-Level Overview
```
EVALUATION PROCESS:
β”‚
β”œβ”€ Load Test Data from Dataset
β”‚ └─ Questions + Ground Truth Answers
β”‚
β”œβ”€ FOR EACH TEST QUESTION:
β”‚ β”‚
β”‚ β”œβ”€ 1. RETRIEVE DOCUMENTS (Vector Search)
β”‚ β”‚ β”‚
β”‚ β”‚ └─ query_text β†’ embed_query() β†’ semantic search β†’ get_retrieved_documents()
β”‚ β”‚
β”‚ β”œβ”€ 2. GENERATE RESPONSE (LLM Call)
β”‚ β”‚ β”‚
β”‚ β”‚ └─ query + documents β†’ LLM β†’ response
β”‚ β”‚
β”‚ └─ 3. STORE TEST CASE (For Evaluation)
β”‚ └─ {query, response, documents, ground_truth}
β”‚
β”œβ”€ COMPUTE TRACe METRICS
β”‚ └─ utilization, relevance, adherence, completeness
β”‚
└─ DISPLAY RESULTS
```
---
## Detailed Flow: Query Processing in Evaluation
### **Step 1: Test Sample Loop** (streamlit_app.py, Line 723)
```python
for i, sample in enumerate(test_data):
    # sample = {"question": "...", "answer": "...", ...}

    # Step 2: Call RAG pipeline with the question
    result = st.session_state.rag_pipeline.query(
        sample["question"],   # ← Query string
        n_results=5           # ← How many docs to retrieve
    )
```
**Input**:
- `sample["question"]` = User question from RAGBench dataset
- Example: "What is machine learning?"
- `n_results=5` = Retrieve top 5 most similar documents
---
### **Step 2: RAG Pipeline Query** (llm_client.py, Line 295)
```python
class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        # β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        # β”‚ PHASE 1: RETRIEVAL (Vector Search)                   β”‚
        # β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

        # STEP 1: Call vector store to retrieve documents
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,                # "What is machine learning?"
            n_results=n_results   # Top n_results documents (default 5)
        )
        # Result: [
        #   {"document": "ML is...", "metadata": {...}, "distance": 0.12},
        #   {"document": "Machine learning uses...", "metadata": {...}, "distance": 0.15},
        #   ...
        # ]

        # Extract document texts
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = ["ML is...", "Machine learning uses...", ...]

        # β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        # β”‚ PHASE 2: GENERATION (LLM Call)                       β”‚
        # β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

        # STEP 2: Call LLM with query + retrieved documents
        response = self.llm.generate_with_context(
            query,       # "What is machine learning?"
            doc_texts,   # ["ML is...", "Machine learning uses...", ...]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a subset of artificial intelligence..."

        # STEP 3: Package results
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }
```
---
### **Step 3A: Document Retrieval (Vector Store)** (vector_store.py, Line 321)
```
Query Processing:
USER QUESTION:
"What is machine learning?"
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. Embed the Query β”‚
β”‚ ────────────────────────────────── β”‚
β”‚ embedding_model.embed_query(query) β”‚
β”‚ β”‚
β”‚ Model: sentence-transformers/ β”‚
β”‚ all-mpnet-base-v2 β”‚
β”‚ β”‚
β”‚ Query String (tokens): β”‚
β”‚ "What" β†’ [0.1, 0.2, ...] β”‚
β”‚ "is" β†’ [0.3, 0.4, ...] β”‚
β”‚ "machine" β†’ [0.5, 0.6, ...] β”‚
β”‚ "learning" β†’ [0.7, 0.8, ...] β”‚
β”‚ β”‚
β”‚ Output: Query Vector [768-dim] β”‚
β”‚ ↓ β”‚
β”‚ [0.15, 0.32, 0.51, ..., 0.89] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 2. Semantic Search in ChromaDB β”‚
β”‚ ──────────────────────────────────────── β”‚
β”‚ β”‚
β”‚ collection.query( β”‚
β”‚ query_embeddings=[query_vector], β”‚
β”‚ n_results=5, β”‚
β”‚ where=None β”‚
β”‚ ) β”‚
β”‚ β”‚
β”‚ Compare query_vector against all doc β”‚
β”‚ vectors in the collection using β”‚
β”‚ cosine similarity β”‚
β”‚ β”‚
β”‚ Scoring: similarity = dot_product/ β”‚
β”‚ (norm_a * norm_b) β”‚
β”‚ β”‚
β”‚ Top 5 Results (sorted by similarity): β”‚
β”‚ β€’ Doc 1: "ML is a field..." (sim: 0.92) β”‚
β”‚ β€’ Doc 2: "Deep learning..." (sim: 0.89) β”‚
β”‚ β€’ Doc 3: "Neural networks..." (sim: 0.87) β”‚
β”‚ β€’ Doc 4: "AI overview..." (sim: 0.81) β”‚
β”‚ β€’ Doc 5: "Training data..." (sim: 0.78) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 3. Format Retrieved Documents β”‚
β”‚ ────────────────────────────────── β”‚
β”‚ retrieved_docs = [ β”‚
β”‚ { β”‚
β”‚ "document": "ML is a field...",β”‚
β”‚ "metadata": {...}, β”‚
β”‚ "distance": 0.08 β”‚
β”‚ }, β”‚
β”‚ {...}, β”‚
β”‚ ... β”‚
β”‚ ] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
RETURNED TO RAGPipeline
```
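The cosine-similarity scoring in step 2 can be sketched independently of ChromaDB. The toy 4-dimensional vectors below are illustrative stand-ins for the real 768-dimensional embeddings; this is a minimal sketch of the ranking math, not the project's actual search code:

```python
import math

def cosine(a, b):
    """similarity = dot_product / (norm_a * norm_b), as in the diagram above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, doc_vecs, k=5):
    """Rank document vectors by cosine similarity to the query vector."""
    sims = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    sims.sort(key=lambda t: -t[1])   # highest similarity first
    return sims[:k]

# Toy 4-dim vectors standing in for real 768-dim embeddings
query = [0.1, 0.9, 0.2, 0.0]
docs = [
    [0.1, 0.8, 0.3, 0.1],  # close to the query
    [0.9, 0.1, 0.0, 0.2],  # unrelated
    [0.2, 0.7, 0.1, 0.0],  # close to the query
]
print(top_k_by_cosine(query, docs, k=2))  # the unrelated vector is ranked out
```

ChromaDB reports *distance* rather than similarity, which is why the formatted results above show `distance: 0.08` for a document with similarity β‰ˆ 0.92.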
---
### **Step 3B: LLM Response Generation** (llm_client.py, Line 215)
```
Retrieved Documents:
β”‚
β”œβ”€ Doc1: "ML is a field of AI that..."
β”œβ”€ Doc2: "Machine learning uses algorithms..."
β”œβ”€ Doc3: "Neural networks process data..."
β”œβ”€ Doc4: "Training data is essential..."
└─ Doc5: "Deep learning is a subset..."
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. BUILD PROMPT β”‚
β”‚ ────────────────────────────────────────────────────── β”‚
β”‚ β”‚
β”‚ context = """ β”‚
β”‚ Document 1: ML is a field of AI that... β”‚
β”‚ Document 2: Machine learning uses algorithms... β”‚
β”‚ Document 3: Neural networks process data... β”‚
β”‚ Document 4: Training data is essential... β”‚
β”‚ Document 5: Deep learning is a subset... β”‚
β”‚ """ β”‚
β”‚ β”‚
β”‚ prompt = """ β”‚
β”‚ Answer the following question based on the provided β”‚
β”‚ context. β”‚
β”‚ β”‚
β”‚ Context: β”‚
β”‚ {context} β”‚
β”‚ β”‚
β”‚ Question: What is machine learning? β”‚
β”‚ β”‚
β”‚ Answer: β”‚
β”‚ """ β”‚
β”‚ β”‚
β”‚ system_prompt = "You are a helpful AI assistant..." β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 2. LLM API CALL (Groq) β”‚
β”‚ ────────────────────────────────────────────────────── β”‚
β”‚ β”‚
β”‚ Client: Groq (groq.com) β”‚
β”‚ Model: llama-3.1-8b-instant (or selected model) β”‚
β”‚  API Endpoint: https://api.groq.com/openai/v1/       β”‚
β”‚                chat/completions                      β”‚
β”‚ β”‚
β”‚ Request: β”‚
β”‚ { β”‚
β”‚ "model": "llama-3.1-8b-instant", β”‚
β”‚ "messages": [ β”‚
β”‚ { β”‚
β”‚ "role": "system", β”‚
β”‚ "content": "You are a helpful..." β”‚
β”‚ }, β”‚
β”‚ { β”‚
β”‚ "role": "user", β”‚
β”‚ "content": "[full prompt above]" β”‚
β”‚ } β”‚
β”‚ ], β”‚
β”‚ "max_tokens": 1024, β”‚
β”‚ "temperature": 0.7 β”‚
β”‚ } β”‚
β”‚ β”‚
β”‚  Where the LLM processing happens:                   β”‚
β”‚ β†’ Groq's GPU servers (not local) β”‚
β”‚ β†’ Model processes entire prompt β”‚
β”‚ β†’ Generates response token-by-token β”‚
β”‚ β†’ Returns complete response β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 3. PARSE LLM RESPONSE β”‚
β”‚ ────────────────────────────────────────────────────── β”‚
β”‚ β”‚
β”‚ Response Text: β”‚
β”‚ "Machine learning is a field of artificial β”‚
β”‚ intelligence that enables computers to learn from β”‚
β”‚ data without being explicitly programmed..." β”‚
β”‚ β”‚
β”‚ Extract: response.choices[0].message.content β”‚
β”‚ Return: Final Answer String β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
RETURNED TO RAGPipeline
```
---
## Complete Code Flow for One Evaluation Query
### **File: streamlit_app.py** (Lines 723-730)
```python
# FOR EACH TEST QUESTION IN THE DATASET:
for i, sample in enumerate(test_data):
    # sample["question"] = "What is machine learning?"
    # sample["answer"]   = "ML is a subset of AI..."

    # β˜… STEP 1: CALL RAG PIPELINE β˜…
    result = st.session_state.rag_pipeline.query(
        sample["question"],   # Pass question
        n_results=5           # Get top 5 docs
    )
    # Returns:
    # {
    #   "query": "What is machine learning?",
    #   "response": "Machine learning is...",
    #   "retrieved_documents": [
    #     {"document": "...", "metadata": {...}, ...},
    #     ...
    #   ]
    # }

    # β˜… STEP 2: EXTRACT RESULTS β˜…
    test_cases.append({
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [
            doc["document"] for doc in result["retrieved_documents"]
        ],
        "ground_truth": sample.get("answer", "")
    })
```
### **File: llm_client.py** (RAGPipeline class, Lines 295-340)
```python
class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        # β˜… STEP 2A: RETRIEVE DOCUMENTS β˜…
        # Where: vector_store.py β†’ get_retrieved_documents()
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,                # "What is machine learning?"
            n_results=n_results   # default 5
        )

        # β˜… STEP 2B: EXTRACT DOCUMENT TEXTS β˜…
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = [
        #   "Machine learning is a subset of AI...",
        #   "Deep learning uses neural networks...",
        #   ...
        # ]

        # β˜… STEP 2C: CALL LLM β˜…
        # Where: llm_client.py β†’ generate_with_context()
        response = self.llm.generate_with_context(
            query,       # "What is machine learning?"
            doc_texts,   # [retrieved document texts]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a field of AI..."

        # β˜… STEP 2D: RETURN RESULTS β˜…
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }
```
### **File: vector_store.py** (ChromaDBManager class, Lines 370-400)
```python
def get_retrieved_documents(self, query_text: str, n_results: int = 5):
    # β˜… STEP 3A-1: QUERY THE COLLECTION β˜…
    # Where: vector_store.py β†’ query()
    results = self.query(query_text, n_results)
    # results = {
    #   'documents': [[doc1, doc2, doc3, doc4, doc5]],
    #   'metadatas': [[meta1, meta2, ...]],
    #   'distances': [[dist1, dist2, ...]],
    #   'ids': [[id1, id2, ...]]
    # }

    # β˜… STEP 3A-2: FORMAT RESULTS β˜…
    retrieved_docs = []
    for i in range(len(results['documents'][0])):
        retrieved_docs.append({
            "document": results['documents'][0][i],
            "metadata": results['metadatas'][0][i],
            "distance": results['distances'][0][i]
        })
    # retrieved_docs = [
    #   {"document": "ML is...", "metadata": {...}, "distance": 0.08},
    #   {"document": "Deep...", "metadata": {...}, "distance": 0.11},
    #   ...
    # ]
    return retrieved_docs
```
### **File: llm_client.py** (GroqLLMClient class, Lines 215-250)
```python
def generate_with_context(self, query: str, context_documents: List[str],
                          max_tokens: int = 1024, temperature: float = 0.7) -> str:
    # β˜… STEP 3B-1: BUILD CONTEXT STRING β˜…
    context = "\n\n".join([
        f"Document {i+1}: {doc}"
        for i, doc in enumerate(context_documents)
    ])
    # context = """
    # Document 1: ML is a field of AI that...
    # Document 2: Machine learning uses algorithms...
    # ...
    # """

    # β˜… STEP 3B-2: BUILD PROMPT β˜…
    prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""

    system_prompt = "You are a helpful AI assistant..."

    # β˜… STEP 3B-3: CALL LLM (GROQ API) β˜…
    # Where: llm_client.py β†’ generate()
    return self.generate(prompt, max_tokens=max_tokens,
                         temperature=temperature, system_prompt=system_prompt)
```
### **File: llm_client.py** (GroqLLMClient.generate(), Lines 110-155)
```python
def generate(self, prompt: str, max_tokens: int, temperature: float, system_prompt: str):
    # β˜… STEP 3B-4: PREPARE GROQ API CALL β˜…

    # Apply rate limiting (max 30 requests per minute)
    self.rate_limiter.acquire_sync()

    # Build messages for Groq API
    messages = []
    if system_prompt:
        messages.append({
            "role": "system",
            "content": system_prompt
        })
    messages.append({
        "role": "user",
        "content": prompt
    })

    # β˜… STEP 3B-5: MAKE GROQ API REQUEST β˜…
    try:
        response = self.client.chat.completions.create(
            model=self.model_name,      # e.g., "llama-3.1-8b-instant"
            messages=messages,
            max_tokens=max_tokens,      # 1024
            temperature=temperature     # 0.7
        )
        # β˜… STEP 3B-6: EXTRACT RESPONSE β˜…
        return response.choices[0].message.content
        # Returns: "Machine learning is a field of artificial intelligence..."
    except Exception as e:
        return f"Error: {str(e)}"
```
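The `rate_limiter.acquire_sync()` call above enforces the 30-requests-per-minute cap before each Groq API request. The RateLimiter class itself is not shown in this document; here is a minimal sliding-window sketch of how such a limiter can work (the class name and internals below are illustrative, not the project's actual implementation):

```python
import time
from collections import deque

class SlidingWindowRateLimiter:
    """Allow at most `max_calls` calls per `window_s`-second sliding window (sketch)."""

    def __init__(self, max_calls: int = 30, window_s: float = 60.0):
        self.max_calls = max_calls
        self.window_s = window_s
        self._calls = deque()  # monotonic timestamps of recent calls

    def acquire_sync(self):
        """Block until a call slot is free, then claim it."""
        while True:
            now = time.monotonic()
            # Evict timestamps that have aged out of the window
            while self._calls and now - self._calls[0] >= self.window_s:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Window is full: sleep until the oldest call expires
            time.sleep(self.window_s - (now - self._calls[0]))

limiter = SlidingWindowRateLimiter(max_calls=30, window_s=60.0)
limiter.acquire_sync()  # first call returns immediately
```

With `max_calls=30` and `window_s=60.0`, the 31st call inside any 60-second span sleeps until the oldest of the previous 30 calls falls out of the window.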
---
## Summary of Query Processing in Evaluation
| Step | Component | Input | Process | Output |
|------|-----------|-------|---------|--------|
| 1 | Streamlit UI | Test sample | Load from dataset | Question |
| 2 | RAGPipeline | Question | Orchestrate RAG | Response |
| 2A | ChromaDB | Question | Embed & search | 5 documents |
| 2B | Embedding Model | Question text | Convert to vector | 768-dim vector |
| 2C | Groq LLM | Q + 5 docs | API call | Generated answer |
| 3 | TRACEEvaluator | Q, response, docs | Compute metrics | TRACe scores |
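The TRACEEvaluator in the last row is not detailed in this document. As an illustration of what the four TRACe dimensions measure, here is a deliberately simplified stand-in that scores them with plain token overlap; the real evaluator is more sophisticated, and this sketch is only meant to show the inputs and the meaning of each score:

```python
def _tokens(text: str) -> set:
    return set(text.lower().split())

def trace_sketch(query: str, response: str, documents: list, ground_truth: str) -> dict:
    """Token-overlap stand-ins for the four TRACe dimensions (illustration only)."""
    q, r, gt = _tokens(query), _tokens(response), _tokens(ground_truth)
    d = set().union(*(_tokens(doc) for doc in documents)) if documents else set()
    return {
        # utilization: fraction of retrieved-context tokens that appear in the response
        "utilization": len(d & r) / len(d) if d else 0.0,
        # relevance: fraction of query tokens covered by the retrieved context
        "relevance": len(q & d) / len(q) if q else 0.0,
        # adherence: fraction of response tokens grounded in the retrieved context
        "adherence": len(r & d) / len(r) if r else 0.0,
        # completeness: fraction of ground-truth tokens covered by the response
        "completeness": len(gt & r) / len(gt) if gt else 0.0,
    }

scores = trace_sketch(
    "what is machine learning",
    "machine learning is a field of ai",
    ["machine learning is a field of ai that learns from data"],
    "machine learning is a subset of ai",
)
```

Each score lands in [0, 1]; a response copied verbatim from the retrieved context would score adherence = 1.0 under this stand-in.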
---
## Where LLM Gets Called
**PRIMARY LLM CALL LOCATION**: `llm_client.py`, function `GroqLLMClient.generate()` (Line 110)
**TRIGGERED BY**:
1. Chat interface: `Chat tab β†’ query β†’ generate()`
2. Evaluation: `run_evaluation() β†’ rag_pipeline.query() β†’ generate_with_context() β†’ generate()`
**DURING EVALUATION SPECIFICALLY**:
- Called **once per test question** (e.g., 10 times for 10 test samples)
- Each call:
- Gets a unique question
- Retrieves 5 relevant documents
- Asks Groq LLM to answer using those documents
- Stores result for TRACe metric computation
**LLM MODEL USED**:
- Default: `llama-3.1-8b-instant` (can be switched in UI)
- Also available: `meta-llama/llama-4-maverick-17b-128e-instruct`, `openai/gpt-oss-120b`
- Provider: **Groq** (cloud-based GPU inference)