docs/EVALUATION_LLM_FLOW.md

Evaluation: LLM Calling & Query Processing Flow

High-Level Overview

EVALUATION PROCESS:
β”‚
β”œβ”€ Load Test Data from Dataset
β”‚  └─ Questions + Ground Truth Answers
β”‚
β”œβ”€ FOR EACH TEST QUESTION:
β”‚  β”‚
β”‚  β”œβ”€ 1. RETRIEVE DOCUMENTS (Vector Search)
β”‚  β”‚   β”‚
β”‚  β”‚   └─ query_text β†’ embed_query() β†’ semantic search β†’ get_retrieved_documents()
β”‚  β”‚
β”‚  β”œβ”€ 2. GENERATE RESPONSE (LLM Call)
β”‚  β”‚   β”‚
β”‚  β”‚   └─ query + documents β†’ LLM β†’ response
β”‚  β”‚
β”‚  └─ 3. STORE TEST CASE (For Evaluation)
β”‚      └─ {query, response, documents, ground_truth}
β”‚
β”œβ”€ COMPUTE TRACe METRICS
β”‚  └─ utilization, relevance, adherence, completeness
β”‚
└─ DISPLAY RESULTS
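The project's TRACEEvaluator implementation is not shown in this document. As a rough illustration only, the four TRACe metrics can be approximated with token-overlap heuristics; the real evaluator is likely more sophisticated (e.g. embedding- or LLM-based), so treat this as a sketch of what each metric measures, not the actual scoring code:

```python
def _tokens(text: str) -> set:
    return set(text.lower().split())

def trace_scores(query: str, response: str, documents: list, ground_truth: str) -> dict:
    # Token sets for the query, generated answer, and reference answer
    q, resp, truth = _tokens(query), _tokens(response), _tokens(ground_truth)
    # Union of tokens across all retrieved documents (the "context")
    ctx = set().union(*map(_tokens, documents)) if documents else set()

    def ratio(a: set, b: set) -> float:
        # |a ∩ b| / |b|, guarded against empty denominators
        return len(a & b) / len(b) if b else 0.0

    return {
        "utilization":  ratio(resp, ctx),    # how much retrieved context the answer uses
        "relevance":    ratio(ctx, q),       # how much of the query the context covers
        "adherence":    ratio(ctx, resp),    # fraction of the answer grounded in context
        "completeness": ratio(resp, truth),  # fraction of the ground truth recovered
    }
```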

Detailed Flow: Query Processing in Evaluation

Step 1: Test Sample Loop (streamlit_app.py, Line 723)

for i, sample in enumerate(test_data):
    # sample = {"question": "...", "answer": "...", ...}
    
    # Step 2: Call RAG pipeline with the question
    result = st.session_state.rag_pipeline.query(
        sample["question"],     # ← Query string
        n_results=5             # ← How many docs to retrieve
    )

Input:

  • sample["question"] = User question from RAGBench dataset
    • Example: "What is machine learning?"
  • n_results=5 = Retrieve top 5 most similar documents

Step 2: RAG Pipeline Query (llm_client.py, Line 295)

class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        
        # β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        # β”‚ PHASE 1: RETRIEVAL (Vector Search)                  β”‚
        # β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        
        # STEP 1: Call vector store to retrieve documents
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,               # "What is machine learning?"
            n_results=n_results  # Top-n documents (default 5)
        )
        # Result: [
        #   {"document": "ML is...", "metadata": {...}, "distance": 0.12},
        #   {"document": "Machine learning uses...", "metadata": {...}, "distance": 0.15},
        #   ...
        # ]
        
        # Extract document texts
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = ["ML is...", "Machine learning uses...", ...]
        
        # β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        # β”‚ PHASE 2: GENERATION (LLM Call)                      β”‚
        # β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        
        # STEP 2: Call LLM with query + retrieved documents
        response = self.llm.generate_with_context(
            query,          # "What is machine learning?"
            doc_texts,      # ["ML is...", "Machine learning uses...", ...]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a subset of artificial intelligence..."
        
        # STEP 3: Package results
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }

Step 3A: Document Retrieval (Vector Store) (vector_store.py, Line 321)

Query Processing:

USER QUESTION:
"What is machine learning?"
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. Embed the Query                  β”‚
β”‚ ──────────────────────────────────  β”‚
β”‚ embedding_model.embed_query(query)  β”‚
β”‚                                     β”‚
β”‚ Model: sentence-transformers/       β”‚
β”‚        all-mpnet-base-v2            β”‚
β”‚                                     β”‚
β”‚ Query tokens (embedded, then        β”‚
β”‚ pooled into one sentence vector):   β”‚
β”‚   "What" β†’ [0.1, 0.2, ...]        β”‚
β”‚   "is" β†’ [0.3, 0.4, ...]          β”‚
β”‚   "machine" β†’ [0.5, 0.6, ...]     β”‚
β”‚   "learning" β†’ [0.7, 0.8, ...]    β”‚
β”‚                                     β”‚
β”‚ Output: Query Vector [768-dim]      β”‚
β”‚         ↓                           β”‚
β”‚   [0.15, 0.32, 0.51, ..., 0.89]   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 2. Semantic Search in ChromaDB           β”‚
β”‚ ──────────────────────────────────────── β”‚
β”‚                                          β”‚
β”‚ collection.query(                        β”‚
β”‚     query_embeddings=[query_vector],     β”‚
β”‚     n_results=5,                         β”‚
β”‚     where=None                           β”‚
β”‚ )                                        β”‚
β”‚                                          β”‚
β”‚ Compare query_vector against all doc     β”‚
β”‚ vectors in the collection using          β”‚
β”‚ cosine similarity                        β”‚
β”‚                                          β”‚
β”‚ Scoring: similarity = dot_product/       β”‚
β”‚          (norm_a * norm_b)               β”‚
β”‚                                          β”‚
β”‚ Top 5 Results (sorted by similarity):    β”‚
β”‚  β€’ Doc 1: "ML is a field..." (sim: 0.92) β”‚
β”‚  β€’ Doc 2: "Deep learning..." (sim: 0.89) β”‚
β”‚  β€’ Doc 3: "Neural nets..." (sim: 0.87)   β”‚
β”‚  β€’ Doc 4: "AI overview..." (sim: 0.81)   β”‚
β”‚  β€’ Doc 5: "Training data..." (sim: 0.78) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 3. Format Retrieved Documents       β”‚
β”‚ ──────────────────────────────────  β”‚
β”‚ retrieved_docs = [                  β”‚
β”‚   {                                 β”‚
β”‚     "document": "ML is a field...", β”‚
β”‚     "metadata": {...},              β”‚
β”‚     "distance": 0.08                β”‚
β”‚   },                                β”‚
β”‚   {...},                            β”‚
β”‚   ...                               β”‚
β”‚ ]                                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
   RETURNED TO RAGPipeline
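The cosine-similarity scoring in the diagram above can be sketched with NumPy. The vectors here are 4-dimensional toys standing in for the real 768-dim embeddings; the formula is the same one the diagram shows:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # similarity = dot_product / (norm_a * norm_b), as in the diagram
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim vectors standing in for the real 768-dim embeddings
query_vec = np.array([0.15, 0.32, 0.51, 0.89])
doc_vecs = {
    "doc_a": np.array([0.14, 0.30, 0.50, 0.90]),  # almost parallel -> high similarity
    "doc_b": np.array([0.90, 0.10, 0.05, 0.01]),  # different direction -> low similarity
}

# Rank documents by similarity, highest first
ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
```

Note that for a cosine-configured collection, ChromaDB reports a *distance* of `1 - similarity`, which is why the formatted results later in this document show small distances (0.08) for high similarities (0.92).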

Step 3B: LLM Response Generation (llm_client.py, Line 215)

Retrieved Documents:
β”‚
β”œβ”€ Doc1: "ML is a field of AI that..."
β”œβ”€ Doc2: "Machine learning uses algorithms..."
β”œβ”€ Doc3: "Neural networks process data..."
β”œβ”€ Doc4: "Training data is essential..."
└─ Doc5: "Deep learning is a subset..."
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. BUILD PROMPT                                        β”‚
β”‚ ────────────────────────────────────────────────────── β”‚
β”‚                                                        β”‚
β”‚ context = """                                          β”‚
β”‚   Document 1: ML is a field of AI that...             β”‚
β”‚   Document 2: Machine learning uses algorithms...     β”‚
β”‚   Document 3: Neural networks process data...         β”‚
β”‚   Document 4: Training data is essential...           β”‚
β”‚   Document 5: Deep learning is a subset...            β”‚
β”‚ """                                                    β”‚
β”‚                                                        β”‚
β”‚ prompt = """                                           β”‚
β”‚ Answer the following question based on the provided   β”‚
β”‚ context.                                              β”‚
β”‚                                                        β”‚
β”‚ Context:                                              β”‚
β”‚ {context}                                             β”‚
β”‚                                                        β”‚
β”‚ Question: What is machine learning?                   β”‚
β”‚                                                        β”‚
β”‚ Answer:                                               β”‚
β”‚ """                                                    β”‚
β”‚                                                        β”‚
β”‚ system_prompt = "You are a helpful AI assistant..."  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 2. LLM API CALL (Groq)                                 β”‚
β”‚ ────────────────────────────────────────────────────── β”‚
β”‚                                                        β”‚
β”‚ Client: Groq (groq.com)                               β”‚
β”‚ Model: llama-3.1-8b-instant (or selected model)      β”‚
β”‚ API Endpoint: https://api.groq.com/openai/            β”‚
β”‚               v1/chat/completions                     β”‚
β”‚                                                        β”‚
β”‚ Request:                                              β”‚
β”‚ {                                                      β”‚
β”‚   "model": "llama-3.1-8b-instant",                    β”‚
β”‚   "messages": [                                        β”‚
β”‚     {                                                  β”‚
β”‚       "role": "system",                               β”‚
β”‚       "content": "You are a helpful..."               β”‚
β”‚     },                                                β”‚
β”‚     {                                                  β”‚
β”‚       "role": "user",                                 β”‚
β”‚       "content": "[full prompt above]"                β”‚
β”‚     }                                                  β”‚
β”‚   ],                                                   β”‚
β”‚   "max_tokens": 1024,                                 β”‚
β”‚   "temperature": 0.7                                  β”‚
β”‚ }                                                      β”‚
β”‚                                                        β”‚
β”‚ Where the LLM processing happens:                     β”‚
β”‚  β†’ Groq's GPU servers (not local)                     β”‚
β”‚  β†’ Model processes entire prompt                      β”‚
β”‚  β†’ Generates response token-by-token                  β”‚
β”‚  β†’ Returns complete response                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 3. PARSE LLM RESPONSE                                  β”‚
β”‚ ────────────────────────────────────────────────────── β”‚
β”‚                                                        β”‚
β”‚ Response Text:                                        β”‚
β”‚ "Machine learning is a field of artificial            β”‚
β”‚  intelligence that enables computers to learn from    β”‚
β”‚  data without being explicitly programmed..."         β”‚
β”‚                                                        β”‚
β”‚ Extract: response.choices[0].message.content          β”‚
β”‚ Return: Final Answer String                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
   RETURNED TO RAGPipeline

Complete Code Flow for One Evaluation Query

File: streamlit_app.py (Line 723-730)

# FOR EACH TEST QUESTION IN THE DATASET:
for i, sample in enumerate(test_data):
    # sample["question"] = "What is machine learning?"
    # sample["answer"] = "ML is a subset of AI..."
    
    # β˜… STEP 1: CALL RAG PIPELINE β˜…
    result = st.session_state.rag_pipeline.query(
        sample["question"],     # Pass question
        n_results=5             # Get top 5 docs
    )
    # Returns:
    # {
    #   "query": "What is machine learning?",
    #   "response": "Machine learning is...",
    #   "retrieved_documents": [
    #     {"document": "...", "metadata": {...}, ...},
    #     ...
    #   ]
    # }
    
    # β˜… STEP 2: EXTRACT RESULTS β˜…
    test_cases.append({
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [
            doc["document"] for doc in result["retrieved_documents"]
        ],
        "ground_truth": sample.get("answer", "")
    })

File: llm_client.py (RAGPipeline class, Line 295-340)

class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        
        # β˜… STEP 2A: RETRIEVE DOCUMENTS β˜…
        # Where: vector_store.py β†’ get_retrieved_documents()
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,               # "What is machine learning?"
            n_results=n_results  # default 5
        )
        
        # β˜… STEP 2B: EXTRACT DOCUMENT TEXTS β˜…
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = [
        #   "Machine learning is a subset of AI...",
        #   "Deep learning uses neural networks...",
        #   ...
        # ]
        
        # β˜… STEP 2C: CALL LLM β˜…
        # Where: llm_client.py β†’ generate_with_context()
        response = self.llm.generate_with_context(
            query,          # "What is machine learning?"
            doc_texts,      # [retrieved document texts]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a field of AI..."
        
        # β˜… STEP 2D: RETURN RESULTS β˜…
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }

File: vector_store.py (ChromaDBManager class, Line 370-400)

def get_retrieved_documents(self, query_text: str, n_results: int = 5):
    # β˜… STEP 3A-1: QUERY THE COLLECTION β˜…
    # Where: vector_store.py β†’ query()
    results = self.query(query_text, n_results)
    # results = {
    #   'documents': [[doc1, doc2, doc3, doc4, doc5]],
    #   'metadatas': [[meta1, meta2, ...]],
    #   'distances': [[dist1, dist2, ...]],
    #   'ids': [[id1, id2, ...]]
    # }
    
    # β˜… STEP 3A-2: FORMAT RESULTS β˜…
    retrieved_docs = []
    for i in range(len(results['documents'][0])):
        retrieved_docs.append({
            "document": results['documents'][0][i],
            "metadata": results['metadatas'][0][i],
            "distance": results['distances'][0][i]
        })
    # retrieved_docs = [
    #   {"document": "ML is...", "metadata": {...}, "distance": 0.08},
    #   {"document": "Deep...", "metadata": {...}, "distance": 0.11},
    #   ...
    # ]
    
    return retrieved_docs
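The per-index loop above can be written equivalently with `zip` over ChromaDB's parallel result lists (same output for this result shape, no behavior change):

```python
def format_results(results: dict) -> list:
    # ChromaDB returns parallel lists nested one level deep (one inner
    # list per query); zip walks them together for the single-query case.
    return [
        {"document": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

# Example with a mocked ChromaDB result shape
mock = {
    "documents": [["ML is...", "Deep..."]],
    "metadatas": [[{"src": "a"}, {"src": "b"}]],
    "distances": [[0.08, 0.11]],
}
docs = format_results(mock)
# docs[0] == {"document": "ML is...", "metadata": {"src": "a"}, "distance": 0.08}
```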

File: llm_client.py (GroqLLMClient class, Line 215-250)

def generate_with_context(self, query: str, context_documents: List[str],
                          max_tokens: int = 1024, temperature: float = 0.7) -> str:
    # β˜… STEP 3B-1: BUILD CONTEXT STRING β˜…
    context = "\n\n".join([
        f"Document {i+1}: {doc}"
        for i, doc in enumerate(context_documents)
    ])
    # context = """
    # Document 1: ML is a field of AI that...
    # Document 2: Machine learning uses algorithms...
    # ...
    # """
    
    # β˜… STEP 3B-2: BUILD PROMPT β˜…
    prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""
    
    system_prompt = "You are a helpful AI assistant..."
    
    # β˜… STEP 3B-3: CALL LLM (GROQ API) β˜…
    # Where: llm_client.py β†’ generate()
    return self.generate(prompt, max_tokens=max_tokens,
                         temperature=temperature, system_prompt=system_prompt)

File: llm_client.py (GroqLLMClient.generate(), Line 110-155)

def generate(self, prompt: str, max_tokens: int = 1024, temperature: float = 0.7,
             system_prompt: str = "") -> str:
    # β˜… STEP 3B-4: PREPARE GROQ API CALL β˜…
    
    # Apply rate limiting (max 30 requests per minute)
    self.rate_limiter.acquire_sync()
    
    # Build messages for Groq API
    messages = []
    if system_prompt:
        messages.append({
            "role": "system",
            "content": system_prompt
        })
    messages.append({
        "role": "user",
        "content": prompt
    })
    
    # β˜… STEP 3B-5: MAKE GROQ API REQUEST β˜…
    try:
        response = self.client.chat.completions.create(
            model=self.model_name,          # e.g., "llama-3.1-8b-instant"
            messages=messages,
            max_tokens=max_tokens,          # 1024
            temperature=temperature         # 0.7
        )
        
        # β˜… STEP 3B-6: EXTRACT RESPONSE β˜…
        return response.choices[0].message.content
        # Returns: "Machine learning is a field of artificial intelligence..."
        
    except Exception as e:
        return f"Error: {str(e)}"
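The `self.rate_limiter.acquire_sync()` call above is referenced but its implementation is not shown in this document. A minimal sliding-window limiter it *might* resemble (hypothetical sketch, assuming the 30-requests-per-minute budget mentioned in the comment):

```python
import time
from collections import deque

class RateLimiter:
    """Hypothetical sketch of the acquire_sync() behavior referenced above:
    block until fewer than max_calls requests fall in the trailing window."""

    def __init__(self, max_calls: int = 30, window_s: float = 60.0):
        self.max_calls = max_calls
        self.window_s = window_s
        self._stamps = deque()  # monotonic timestamps of recent calls

    def acquire_sync(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest call leaves the window
            # (a production limiter would re-check in a loop)
            time.sleep(self.window_s - (now - self._stamps[0]))
        self._stamps.append(time.monotonic())
```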

Summary of Query Processing in Evaluation

| Step | Component | Input | Process | Output |
|------|-----------|-------|---------|--------|
| 1 | Streamlit UI | Test sample | Load from dataset | Question |
| 2 | RAGPipeline | Question | Orchestrate RAG | Response |
| 2A | ChromaDB | Question | Embed & search | 5 documents |
| 2B | Embedding Model | Question text | Convert to vector | 768-dim vector |
| 2C | Groq LLM | Q + 5 docs | API call | Generated answer |
| 3 | TRACEEvaluator | Q, response, docs | Compute metrics | TRACe scores |

Where LLM Gets Called

PRIMARY LLM CALL LOCATION: llm_client.py, function GroqLLMClient.generate() (Line 110)

TRIGGERED BY:

  1. Chat interface: Chat tab β†’ query β†’ generate()
  2. Evaluation: run_evaluation() β†’ rag_pipeline.query() β†’ generate_with_context() β†’ generate()

DURING EVALUATION SPECIFICALLY:

  • Called once per test question (e.g., 10 times for 10 test samples)
  • Each call:
    • Gets a unique question
    • Retrieves 5 relevant documents
    • Asks Groq LLM to answer using those documents
    • Stores result for TRACe metric computation

LLM MODEL USED:

  • Default: llama-3.1-8b-instant (can be switched in UI)
  • Also available: meta-llama/llama-4-maverick-17b-128e-instruct, openai/gpt-oss-120b
  • Provider: Groq (cloud-based GPU inference)