# Evaluation: LLM Calling & Query Processing Flow

## High-Level Overview

```
EVALUATION PROCESS:
│
├─ Load Test Data from Dataset
│  └─ Questions + Ground Truth Answers
│
├─ FOR EACH TEST QUESTION:
│  │
│  ├─ 1. RETRIEVE DOCUMENTS (Vector Search)
│  │  │
│  │  └─ query_text → embed_query() → semantic search → get_retrieved_documents()
│  │
│  ├─ 2. GENERATE RESPONSE (LLM Call)
│  │  │
│  │  └─ query + documents → LLM → response
│  │
│  └─ 3. STORE TEST CASE (For Evaluation)
│     └─ {query, response, documents, ground_truth}
│
├─ COMPUTE TRACe METRICS
│  └─ utilization, relevance, adherence, completeness
│
└─ DISPLAY RESULTS
```

---

## Detailed Flow: Query Processing in Evaluation

### **Step 1: Test Sample Loop** (streamlit_app.py, Line 723)

```python
for i, sample in enumerate(test_data):
    # sample = {"question": "...", "answer": "...", ...}

    # Step 2: Call RAG pipeline with the question
    result = st.session_state.rag_pipeline.query(
        sample["question"],  # ← Query string
        n_results=5          # ← How many docs to retrieve
    )
```

**Input**:
- `sample["question"]` = User question from the RAGBench dataset
  - Example: "What is machine learning?"
- `n_results=5` = Retrieve the top 5 most similar documents

---

### **Step 2: RAG Pipeline Query** (llm_client.py, Line 295)

```python
class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        # ┌─────────────────────────────────────────────────────┐
        # │ PHASE 1: RETRIEVAL (Vector Search)                  │
        # └─────────────────────────────────────────────────────┘

        # STEP 1: Call vector store to retrieve documents
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,       # "What is machine learning?"
            n_results=5  # Top 5 documents
        )
        # Result: [
        #   {"document": "ML is...", "metadata": {...}, "distance": 0.12},
        #   {"document": "Machine learning uses...", "metadata": {...}, "distance": 0.15},
        #   ...
        # ]

        # Extract document texts
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = ["ML is...", "Machine learning uses...", ...]
        # ┌─────────────────────────────────────────────────────┐
        # │ PHASE 2: GENERATION (LLM Call)                      │
        # └─────────────────────────────────────────────────────┘

        # STEP 2: Call LLM with query + retrieved documents
        response = self.llm.generate_with_context(
            query,      # "What is machine learning?"
            doc_texts,  # ["ML is...", "Machine learning uses...", ...]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a subset of artificial intelligence..."

        # STEP 3: Package results
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }
```

---

### **Step 3A: Document Retrieval (Vector Store)** (vector_store.py, Line 321)

```
Query Processing:

USER QUESTION: "What is machine learning?"
   │
   ▼
┌─────────────────────────────────────┐
│ 1. Embed the Query                  │
│ ─────────────────────────────────── │
│ embedding_model.embed_query(query)  │
│                                     │
│ Model: sentence-transformers/       │
│        all-mpnet-base-v2            │
│                                     │
│ Query tokens:                       │
│   "What"     → [0.1, 0.2, ...]      │
│   "is"       → [0.3, 0.4, ...]      │
│   "machine"  → [0.5, 0.6, ...]      │
│   "learning" → [0.7, 0.8, ...]      │
│                                     │
│ Token vectors are pooled into a     │
│ single Query Vector [768-dim]:      │
│   ↓                                 │
│   [0.15, 0.32, 0.51, ..., 0.89]     │
└─────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────┐
│ 2. Semantic Search in ChromaDB             │
│ ────────────────────────────────────────── │
│                                            │
│ collection.query(                          │
│     query_embeddings=[query_vector],       │
│     n_results=5,                           │
│     where=None                             │
│ )                                          │
│                                            │
│ Compare query_vector against all doc       │
│ vectors in the collection using            │
│ cosine similarity                          │
│                                            │
│ Scoring: similarity = dot_product /        │
│          (norm_a * norm_b)                 │
│                                            │
│ Top 5 Results (sorted by similarity):      │
│ • Doc 1: "ML is a field..."   (sim: 0.92)  │
│ • Doc 2: "Deep learning..."   (sim: 0.89)  │
│ • Doc 3: "Neural networks..." (sim: 0.87)  │
│ • Doc 4: "AI overview..."     (sim: 0.81)  │
│ • Doc 5: "Training data..."   (sim: 0.78)  │
└────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────┐
│ 3. Format Retrieved Documents       │
│ ─────────────────────────────────── │
│ retrieved_docs = [                  │
│   {                                 │
│     "document": "ML is a field...", │
│     "metadata": {...},              │
│     "distance": 0.08                │
│   },                                │
│   {...},                            │
│   ...                               │
│ ]                                   │
└─────────────────────────────────────┘
   │
   ▼
RETURNED TO RAGPipeline
```

---

### **Step 3B: LLM Response Generation** (llm_client.py, Line 215)

```
Retrieved Documents:
│
├─ Doc1: "ML is a field of AI that..."
├─ Doc2: "Machine learning uses algorithms..."
├─ Doc3: "Neural networks process data..."
├─ Doc4: "Training data is essential..."
└─ Doc5: "Deep learning is a subset..."
   │
   ▼
┌────────────────────────────────────────────────────────┐
│ 1. BUILD PROMPT                                        │
│ ────────────────────────────────────────────────────── │
│                                                        │
│ context = """                                          │
│ Document 1: ML is a field of AI that...                │
│ Document 2: Machine learning uses algorithms...        │
│ Document 3: Neural networks process data...            │
│ Document 4: Training data is essential...              │
│ Document 5: Deep learning is a subset...               │
│ """                                                    │
│                                                        │
│ prompt = """                                           │
│ Answer the following question based on the provided   │
│ context.                                               │
│                                                        │
│ Context:                                               │
│ {context}                                              │
│                                                        │
│ Question: What is machine learning?                    │
│                                                        │
│ Answer:                                                │
│ """                                                    │
│                                                        │
│ system_prompt = "You are a helpful AI assistant..."    │
└────────────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────────────┐
│ 2. LLM API CALL (Groq)                                 │
│ ────────────────────────────────────────────────────── │
│                                                        │
│ Client: Groq (groq.com)                                │
│ Model: llama-3.1-8b-instant (or selected model)        │
│ API Endpoint:                                          │
│   https://api.groq.com/openai/v1/chat/completions      │
│                                                        │
│ Request:                                               │
│ {                                                      │
│   "model": "llama-3.1-8b-instant",                     │
│   "messages": [                                        │
│     {                                                  │
│       "role": "system",                                │
│       "content": "You are a helpful..."                │
│     },                                                 │
│     {                                                  │
│       "role": "user",                                  │
│       "content": "[full prompt above]"                 │
│     }                                                  │
│   ],                                                   │
│   "max_tokens": 1024,                                  │
│   "temperature": 0.7                                   │
│ }                                                      │
│                                                        │
│ Where the LLM processing happens:                      │
│ → Groq's GPU servers (not local)                       │
│ → Model processes the entire prompt                    │
│ → Generates the response token by token                │
│ → Returns the complete response                        │
└────────────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────────────┐
│ 3. PARSE LLM RESPONSE                                  │
│ ────────────────────────────────────────────────────── │
│                                                        │
│ Response Text:                                         │
│ "Machine learning is a field of artificial             │
│  intelligence that enables computers to learn from     │
│  data without being explicitly programmed..."          │
│                                                        │
│ Extract: response.choices[0].message.content           │
│ Return: Final Answer String                            │
└────────────────────────────────────────────────────────┘
   │
   ▼
RETURNED TO RAGPipeline
```

---

## Complete Code Flow for One Evaluation Query

### **File: streamlit_app.py** (Lines 723-730)

```python
# FOR EACH TEST QUESTION IN THE DATASET:
for i, sample in enumerate(test_data):
    # sample["question"] = "What is machine learning?"
    # sample["answer"]   = "ML is a subset of AI..."

    # ★ STEP 1: CALL RAG PIPELINE ★
    result = st.session_state.rag_pipeline.query(
        sample["question"],  # Pass question
        n_results=5          # Get top 5 docs
    )
    # Returns:
    # {
    #   "query": "What is machine learning?",
    #   "response": "Machine learning is...",
    #   "retrieved_documents": [
    #     {"document": "...", "metadata": {...}, ...},
    #     ...
    #   ]
    # }

    # ★ STEP 2: EXTRACT RESULTS ★
    test_cases.append({
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [
            doc["document"] for doc in result["retrieved_documents"]
        ],
        "ground_truth": sample.get("answer", "")
    })
```

### **File: llm_client.py** (RAGPipeline class, Lines 295-340)

```python
class RAGPipeline:
    def query(self, query: str, n_results: int = 5) -> Dict:
        # ★ STEP 2A: RETRIEVE DOCUMENTS ★
        # Where: vector_store.py → get_retrieved_documents()
        retrieved_docs = self.vector_store.get_retrieved_documents(
            query,       # "What is machine learning?"
            n_results=5
        )

        # ★ STEP 2B: EXTRACT DOCUMENT TEXTS ★
        doc_texts = [doc["document"] for doc in retrieved_docs]
        # doc_texts = [
        #   "Machine learning is a subset of AI...",
        #   "Deep learning uses neural networks...",
        #   ...
        # ]

        # ★ STEP 2C: CALL LLM ★
        # Where: llm_client.py → generate_with_context()
        response = self.llm.generate_with_context(
            query,      # "What is machine learning?"
            doc_texts,  # [retrieved document texts]
            max_tokens=1024,
            temperature=0.7
        )
        # response = "Machine learning is a field of AI..."
        # ★ STEP 2D: RETURN RESULTS ★
        return {
            "query": query,
            "response": response,
            "retrieved_documents": retrieved_docs
        }
```

### **File: vector_store.py** (ChromaDBManager class, Lines 370-400)

```python
def get_retrieved_documents(self, query_text: str, n_results: int = 5):
    # ★ STEP 3A-1: QUERY THE COLLECTION ★
    # Where: vector_store.py → query()
    results = self.query(query_text, n_results)
    # results = {
    #     'documents': [[doc1, doc2, doc3, doc4, doc5]],
    #     'metadatas': [[meta1, meta2, ...]],
    #     'distances': [[dist1, dist2, ...]],
    #     'ids': [[id1, id2, ...]]
    # }

    # ★ STEP 3A-2: FORMAT RESULTS ★
    retrieved_docs = []
    for i in range(len(results['documents'][0])):
        retrieved_docs.append({
            "document": results['documents'][0][i],
            "metadata": results['metadatas'][0][i],
            "distance": results['distances'][0][i]
        })
    # retrieved_docs = [
    #   {"document": "ML is...", "metadata": {...}, "distance": 0.08},
    #   {"document": "Deep...", "metadata": {...}, "distance": 0.11},
    #   ...
    # ]

    return retrieved_docs
```

### **File: llm_client.py** (GroqLLMClient class, Lines 215-250)

```python
def generate_with_context(self, query: str, context_documents: List[str],
                          max_tokens: int = 1024, temperature: float = 0.7):
    # ★ STEP 3B-1: BUILD CONTEXT STRING ★
    context = "\n\n".join([
        f"Document {i+1}: {doc}"
        for i, doc in enumerate(context_documents)
    ])
    # context = """
    # Document 1: ML is a field of AI that...
    # Document 2: Machine learning uses algorithms...
    # ...
    # """

    # ★ STEP 3B-2: BUILD PROMPT ★
    prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""

    system_prompt = "You are a helpful AI assistant..."
    # ★ STEP 3B-3: CALL LLM (GROQ API) ★
    # Where: llm_client.py → generate()
    return self.generate(prompt, max_tokens=1024, temperature=0.7,
                         system_prompt=system_prompt)
```

### **File: llm_client.py** (GroqLLMClient.generate(), Lines 110-155)

```python
def generate(self, prompt: str, max_tokens: int, temperature: float,
             system_prompt: str):
    # ★ STEP 3B-4: PREPARE GROQ API CALL ★

    # Apply rate limiting (max 30 requests per minute)
    self.rate_limiter.acquire_sync()

    # Build messages for Groq API
    messages = []
    if system_prompt:
        messages.append({
            "role": "system",
            "content": system_prompt
        })
    messages.append({
        "role": "user",
        "content": prompt
    })

    # ★ STEP 3B-5: MAKE GROQ API REQUEST ★
    try:
        response = self.client.chat.completions.create(
            model=self.model_name,   # e.g., "llama-3.1-8b-instant"
            messages=messages,
            max_tokens=max_tokens,   # 1024
            temperature=temperature  # 0.7
        )

        # ★ STEP 3B-6: EXTRACT RESPONSE ★
        return response.choices[0].message.content
        # Returns: "Machine learning is a field of artificial intelligence..."

    except Exception as e:
        return f"Error: {str(e)}"
```

---

## Summary of Query Processing in Evaluation

| Step | Component | Input | Process | Output |
|------|-----------|-------|---------|--------|
| 1 | Streamlit UI | Test sample | Load from dataset | Question |
| 2 | RAGPipeline | Question | Orchestrate RAG | Response |
| 2A | ChromaDB | Question | Embed & search | 5 documents |
| 2B | Embedding Model | Question text | Convert to vector | 768-dim vector |
| 2C | Groq LLM | Q + 5 docs | API call | Generated answer |
| 3 | TRACEEvaluator | Q, response, docs | Compute metrics | TRACe scores |

---

## Where LLM Gets Called

**PRIMARY LLM CALL LOCATION**: `llm_client.py`, function `GroqLLMClient.generate()` (Line 110)

**TRIGGERED BY**:
1. Chat interface: `Chat tab → query → generate()`
2. Evaluation: `run_evaluation() → rag_pipeline.query() → generate_with_context() → generate()`

**DURING EVALUATION SPECIFICALLY**:
- Called **once per test question** (e.g., 10 times for 10 test samples)
- Each call:
  - Gets a unique question
  - Retrieves 5 relevant documents
  - Asks the Groq LLM to answer using those documents
  - Stores the result for TRACe metric computation

**LLM MODEL USED**:
- Default: `llama-3.1-8b-instant` (can be switched in the UI)
- Also available: `meta-llama/llama-4-maverick-17b-128e-instruct`, `openai/gpt-oss-120b`
- Provider: **Groq** (cloud-based GPU inference)
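The core of Steps 3A and 3B can be sketched in isolation. The toy snippet below is a minimal, self-contained illustration, not the project's actual code: the hand-written 3-dimensional "embeddings" stand in for all-mpnet-base-v2 vectors, `retrieve` stands in for ChromaDB's nearest-neighbor search, and the Groq call is omitted. The `corpus` contents and all function names here are made up for the example.

```python
import math

def cosine_similarity(a, b):
    # similarity = dot_product / (norm_a * norm_b), as in the Step 3A diagram
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, corpus, n_results=2):
    # Score every (vector, text) pair against the query and keep the top n,
    # mimicking get_retrieved_documents() (distance = 1 - similarity).
    scored = sorted(corpus,
                    key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return [{"document": text,
             "distance": round(1 - cosine_similarity(query_vec, vec), 2)}
            for vec, text in scored[:n_results]]

def build_prompt(query, docs):
    # Step 3B-1/3B-2: number the documents, then wrap them in the QA template.
    context = "\n\n".join(f"Document {i+1}: {d['document']}"
                          for i, d in enumerate(docs))
    return (f"Answer the following question based on the provided context.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")

# Toy corpus: (embedding, document text) pairs.
corpus = [
    ([0.9, 0.1, 0.0], "ML is a field of AI..."),
    ([0.8, 0.3, 0.1], "Machine learning uses algorithms..."),
    ([0.1, 0.9, 0.2], "Cooking pasta requires boiling water..."),
]

docs = retrieve([1.0, 0.2, 0.0], corpus)
print(docs[0]["document"])  # most similar document first
print(build_prompt("What is machine learning?", docs))
```

The off-topic pasta document is ranked last, so only the two ML documents end up in the prompt context, which is exactly the filtering effect the vector search is meant to provide before the LLM ever sees the question.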