# Evaluation: LLM Calling & Query Processing Flow

## High-Level Overview

```
EVALUATION PROCESS:
│
├─ Load Test Data from Dataset
│  └─ Questions + Ground Truth Answers
│
├─ FOR EACH TEST QUESTION:
│  │
│  ├─ 1. RETRIEVE DOCUMENTS (Vector Search)
│  │  │
│  │  └─ query_text → embed_query() → semantic search → get_retrieved_documents()
│  │
│  ├─ 2. GENERATE RESPONSE (LLM Call)
│  │  │
│  │  └─ query + documents → LLM → response
│  │
│  └─ 3. STORE TEST CASE (For Evaluation)
│     └─ {query, response, documents, ground_truth}
│
├─ COMPUTE TRACe METRICS
│  └─ utilization, relevance, adherence, completeness
│
└─ DISPLAY RESULTS
```

---

## Detailed Flow: Query Processing in Evaluation

### **Step 1: Test Sample Loop** (streamlit_app.py, Line 723)

```python
for i, sample in enumerate(test_data):
    # sample = {"question": "...", "answer": "...", ...}

    # Step 2: Call RAG pipeline with the question
    result = st.session_state.rag_pipeline.query(
        sample["question"],  # ← Query string
        n_results=5          # ← How many docs to retrieve
    )
```

**Input**:
- `sample["question"]` = User question from the RAGBench dataset
  - Example: "What is machine learning?"
- `n_results=5` = Retrieve the top 5 most similar documents

---

### **Step 2: RAG Pipeline Query** (llm_client.py, Line 295)

```python
class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        # ┌─────────────────────────────────────────────────────┐
        # │ PHASE 1: RETRIEVAL (Vector Search)                  │
        # └─────────────────────────────────────────────────────┘

        # STEP 1: Call vector store to retrieve documents
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,       # "What is machine learning?"
            n_results=5  # Top 5 documents
        )
        # Result: [
        #   {"document": "ML is...", "metadata": {...}, "distance": 0.12},
        #   {"document": "Machine learning uses...", "metadata": {...}, "distance": 0.15},
        #   ...
        # ]

        # Extract document texts
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = ["ML is...", "Machine learning uses...", ...]
        # ┌─────────────────────────────────────────────────────┐
        # │ PHASE 2: GENERATION (LLM Call)                      │
        # └─────────────────────────────────────────────────────┘

        # STEP 2: Call LLM with query + retrieved documents
        response = self.llm.generate_with_context(
            query,      # "What is machine learning?"
            doc_texts,  # ["ML is...", "Machine learning uses...", ...]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a subset of artificial intelligence..."

        # STEP 3: Package results
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }
```

---

### **Step 3A: Document Retrieval (Vector Store)** (vector_store.py, Line 321)

```
Query Processing:

USER QUESTION: "What is machine learning?"
   │
   ▼
┌─────────────────────────────────────┐
│ 1. Embed the Query                  │
│ ─────────────────────────────────── │
│ embedding_model.embed_query(query)  │
│                                     │
│ Model: sentence-transformers/       │
│        all-mpnet-base-v2            │
│                                     │
│ Query tokens:                       │
│   "What"     → [0.1, 0.2, ...]      │
│   "is"       → [0.3, 0.4, ...]      │
│   "machine"  → [0.5, 0.6, ...]      │
│   "learning" → [0.7, 0.8, ...]      │
│                                     │
│ Token vectors are pooled into a     │
│ single Query Vector [768-dim]:      │
│   ↓                                 │
│   [0.15, 0.32, 0.51, ..., 0.89]     │
└─────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────┐
│ 2. Semantic Search in ChromaDB             │
│ ────────────────────────────────────────── │
│                                            │
│ collection.query(                          │
│     query_embeddings=[query_vector],       │
│     n_results=5,                           │
│     where=None                             │
│ )                                          │
│                                            │
│ Compare query_vector against all doc       │
│ vectors in the collection using            │
│ cosine similarity                          │
│                                            │
│ Scoring: similarity = dot_product /        │
│          (norm_a * norm_b)                 │
│                                            │
│ Top 5 Results (sorted by similarity):      │
│ • Doc 1: "ML is a field..."   (sim: 0.92)  │
│ • Doc 2: "Deep learning..."   (sim: 0.89)  │
│ • Doc 3: "Neural networks..." (sim: 0.87)  │
│ • Doc 4: "AI overview..."     (sim: 0.81)  │
│ • Doc 5: "Training data..."   (sim: 0.78)  │
└────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────┐
│ 3. Format Retrieved Documents       │
│ ─────────────────────────────────── │
│ retrieved_docs = [                  │
│   {                                 │
│     "document": "ML is a field...", │
│     "metadata": {...},              │
│     "distance": 0.08                │
│   },                                │
│   {...},                            │
│   ...                               │
│ ]                                   │
└─────────────────────────────────────┘
   │
   ▼
RETURNED TO RAGPipeline
```

---

### **Step 3B: LLM Response Generation** (llm_client.py, Line 215)

```
Retrieved Documents:
│
├─ Doc1: "ML is a field of AI that..."
├─ Doc2: "Machine learning uses algorithms..."
├─ Doc3: "Neural networks process data..."
├─ Doc4: "Training data is essential..."
└─ Doc5: "Deep learning is a subset..."
   │
   ▼
┌────────────────────────────────────────────────────────┐
│ 1. BUILD PROMPT                                        │
│ ────────────────────────────────────────────────────── │
│                                                        │
│ context = """                                          │
│ Document 1: ML is a field of AI that...                │
│ Document 2: Machine learning uses algorithms...        │
│ Document 3: Neural networks process data...            │
│ Document 4: Training data is essential...              │
│ Document 5: Deep learning is a subset...               │
│ """                                                    │
│                                                        │
│ prompt = """                                           │
│ Answer the following question based on the provided   │
│ context.                                               │
│                                                        │
│ Context:                                               │
│ {context}                                              │
│                                                        │
│ Question: What is machine learning?                    │
│                                                        │
│ Answer:                                                │
│ """                                                    │
│                                                        │
│ system_prompt = "You are a helpful AI assistant..."    │
└────────────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────────────┐
│ 2. LLM API CALL (Groq)                                 │
│ ────────────────────────────────────────────────────── │
│                                                        │
│ Client: Groq (groq.com)                                │
│ Model: llama-3.1-8b-instant (or selected model)        │
│ API Endpoint:                                          │
│   https://api.groq.com/openai/v1/chat/completions      │
│                                                        │
│ Request:                                               │
│ {                                                      │
│   "model": "llama-3.1-8b-instant",                     │
│   "messages": [                                        │
│     {                                                  │
│       "role": "system",                                │
│       "content": "You are a helpful..."                │
│     },                                                 │
│     {                                                  │
│       "role": "user",                                  │
│       "content": "[full prompt above]"                 │
│     }                                                  │
│   ],                                                   │
│   "max_tokens": 1024,                                  │
│   "temperature": 0.7                                   │
│ }                                                      │
│                                                        │
│ Where the LLM processing happens:                      │
│ → Groq's GPU servers (not local)                       │
│ → Model processes the entire prompt                    │
│ → Generates the response token by token                │
│ → Returns the complete response                        │
└────────────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────────────┐
│ 3. PARSE LLM RESPONSE                                  │
│ ────────────────────────────────────────────────────── │
│                                                        │
│ Response Text:                                         │
│ "Machine learning is a field of artificial             │
│  intelligence that enables computers to learn from     │
│  data without being explicitly programmed..."          │
│                                                        │
│ Extract: response.choices[0].message.content           │
│ Return: Final Answer String                            │
└────────────────────────────────────────────────────────┘
   │
   ▼
RETURNED TO RAGPipeline
```

---

## Complete Code Flow for One Evaluation Query

### **File: streamlit_app.py** (Lines 723-730)

```python
# FOR EACH TEST QUESTION IN THE DATASET:
for i, sample in enumerate(test_data):
    # sample["question"] = "What is machine learning?"
    # sample["answer"]   = "ML is a subset of AI..."

    # ★ STEP 1: CALL RAG PIPELINE ★
    result = st.session_state.rag_pipeline.query(
        sample["question"],  # Pass question
        n_results=5          # Get top 5 docs
    )
    # Returns:
    # {
    #   "query": "What is machine learning?",
    #   "response": "Machine learning is...",
    #   "retrieved_documents": [
    #     {"document": "...", "metadata": {...}, ...},
    #     ...
    #   ]
    # }

    # ★ STEP 2: EXTRACT RESULTS ★
    test_cases.append({
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [
            doc["document"] for doc in result["retrieved_documents"]
        ],
        "ground_truth": sample.get("answer", "")
    })
```

### **File: llm_client.py** (RAGPipeline class, Lines 295-340)

```python
class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        # ★ STEP 2A: RETRIEVE DOCUMENTS ★
        # Where: vector_store.py → get_retrieved_documents()
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,       # "What is machine learning?"
            n_results=5
        )

        # ★ STEP 2B: EXTRACT DOCUMENT TEXTS ★
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = [
        #   "Machine learning is a subset of AI...",
        #   "Deep learning uses neural networks...",
        #   ...
        # ]

        # ★ STEP 2C: CALL LLM ★
        # Where: llm_client.py → generate_with_context()
        response = self.llm.generate_with_context(
            query,      # "What is machine learning?"
            doc_texts,  # [retrieved document texts]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a field of AI..."
        # ★ STEP 2D: RETURN RESULTS ★
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }
```

### **File: vector_store.py** (ChromaDBManager class, Lines 370-400)

```python
def get_retrieved_documents(self, query_text: str, n_results: int = 5):
    # ★ STEP 3A-1: QUERY THE COLLECTION ★
    # Where: vector_store.py → query()
    results = self.query(query_text, n_results)
    # results = {
    #     'documents': [[doc1, doc2, doc3, doc4, doc5]],
    #     'metadatas': [[meta1, meta2, ...]],
    #     'distances': [[dist1, dist2, ...]],
    #     'ids': [[id1, id2, ...]]
    # }

    # ★ STEP 3A-2: FORMAT RESULTS ★
    retrieved_docs = []
    for i in range(len(results['documents'][0])):
        retrieved_docs.append({
            "document": results['documents'][0][i],
            "metadata": results['metadatas'][0][i],
            "distance": results['distances'][0][i]
        })
    # retrieved_docs = [
    #   {"document": "ML is...", "metadata": {...}, "distance": 0.08},
    #   {"document": "Deep...", "metadata": {...}, "distance": 0.11},
    #   ...
    # ]

    return retrieved_docs
```

### **File: llm_client.py** (GroqLLMClient class, Lines 215-250)

```python
def generate_with_context(self, query: str, context_documents: List[str],
                          max_tokens: int = 1024, temperature: float = 0.7):
    # ★ STEP 3B-1: BUILD CONTEXT STRING ★
    context = "\n\n".join([
        f"Document {i+1}: {doc}"
        for i, doc in enumerate(context_documents)
    ])
    # context = """
    # Document 1: ML is a field of AI that...
    # Document 2: Machine learning uses algorithms...
    # ...
    # """

    # ★ STEP 3B-2: BUILD PROMPT ★
    prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""

    system_prompt = "You are a helpful AI assistant..."
    # ★ STEP 3B-3: CALL LLM (GROQ API) ★
    # Where: llm_client.py → generate()
    return self.generate(prompt, max_tokens=1024, temperature=0.7,
                         system_prompt=system_prompt)
```

### **File: llm_client.py** (GroqLLMClient.generate(), Lines 110-155)

```python
def generate(self, prompt: str, max_tokens: int, temperature: float,
             system_prompt: str):
    # ★ STEP 3B-4: PREPARE GROQ API CALL ★

    # Apply rate limiting (max 30 requests per minute)
    self.rate_limiter.acquire_sync()

    # Build messages for Groq API
    messages = []
    if system_prompt:
        messages.append({
            "role": "system",
            "content": system_prompt
        })
    messages.append({
        "role": "user",
        "content": prompt
    })

    # ★ STEP 3B-5: MAKE GROQ API REQUEST ★
    try:
        response = self.client.chat.completions.create(
            model=self.model_name,   # e.g., "llama-3.1-8b-instant"
            messages=messages,
            max_tokens=max_tokens,   # 1024
            temperature=temperature  # 0.7
        )

        # ★ STEP 3B-6: EXTRACT RESPONSE ★
        return response.choices[0].message.content
        # Returns: "Machine learning is a field of artificial intelligence..."

    except Exception as e:
        return f"Error: {str(e)}"
```

---

## Summary of Query Processing in Evaluation

| Step | Component | Input | Process | Output |
|------|-----------|-------|---------|--------|
| 1 | Streamlit UI | Test sample | Load from dataset | Question |
| 2 | RAGPipeline | Question | Orchestrate RAG | Response |
| 2A | ChromaDB | Question | Embed & search | 5 documents |
| 2B | Embedding Model | Question text | Convert to vector | 768-dim vector |
| 2C | Groq LLM | Q + 5 docs | API call | Generated answer |
| 3 | TRACEEvaluator | Q, response, docs | Compute metrics | TRACe scores |

---

## Where LLM Gets Called

**PRIMARY LLM CALL LOCATION**: `llm_client.py`, function `GroqLLMClient.generate()` (Line 110)

**TRIGGERED BY**:
1. Chat interface: `Chat tab → query → generate()`
2. Evaluation: `run_evaluation() → rag_pipeline.query() → generate_with_context() → generate()`

**DURING EVALUATION SPECIFICALLY**:
- Called **once per test question** (e.g., 10 times for 10 test samples)
- Each call:
  - Gets a unique question
  - Retrieves 5 relevant documents
  - Asks the Groq LLM to answer using those documents
  - Stores the result for TRACe metric computation

**LLM MODEL USED**:
- Default: `llama-3.1-8b-instant` (can be switched in the UI)
- Also available: `meta-llama/llama-4-maverick-17b-128e-instruct`, `openai/gpt-oss-120b`
- Provider: **Groq** (cloud-based GPU inference)
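The core of Steps 3A and 3B can be sketched in isolation. The toy snippet below is a minimal, self-contained illustration, not the project's actual code: the hand-written 3-dimensional "embeddings" stand in for all-mpnet-base-v2 vectors, `retrieve` stands in for ChromaDB's nearest-neighbor search, and the Groq call is omitted. The `corpus` contents and all function names here are made up for the example.

```python
import math

def cosine_similarity(a, b):
    # similarity = dot_product / (norm_a * norm_b), as in the Step 3A diagram
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, corpus, n_results=2):
    # Score every (vector, text) pair against the query and keep the top n,
    # mimicking get_retrieved_documents() (distance = 1 - similarity).
    scored = sorted(corpus,
                    key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return [{"document": text,
             "distance": round(1 - cosine_similarity(query_vec, vec), 2)}
            for vec, text in scored[:n_results]]

def build_prompt(query, docs):
    # Step 3B-1/3B-2: number the documents, then wrap them in the QA template.
    context = "\n\n".join(f"Document {i+1}: {d['document']}"
                          for i, d in enumerate(docs))
    return (f"Answer the following question based on the provided context.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")

# Toy corpus: (embedding, document text) pairs.
corpus = [
    ([0.9, 0.1, 0.0], "ML is a field of AI..."),
    ([0.8, 0.3, 0.1], "Machine learning uses algorithms..."),
    ([0.1, 0.9, 0.2], "Cooking pasta requires boiling water..."),
]

docs = retrieve([1.0, 0.2, 0.0], corpus)
print(docs[0]["document"])  # most similar document first
print(build_prompt("What is machine learning?", docs))
```

The off-topic pasta document is ranked last, so only the two ML documents end up in the prompt context, which is exactly the filtering effect the vector search is meant to provide before the LLM ever sees the question.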