
# LLM Audit Trail Enhancement - Changes Summary

## Overview

Enhanced the JSON download functionality to include the complete LLM request and response information for each evaluated query, for debugging and auditing.

## Files Modified

### 1. `advanced_rag_evaluator.py`

#### Change 1: `_get_gpt_labels()` method (Lines 483-549)

**What Changed:**

- Modified to capture and return the complete LLM request/response information
- Changed the return type from `Optional[GPTLabelingOutput]` to `Optional[Dict]`
- The returned dictionary contains:
  - `"labels"`: the `GPTLabelingOutput` object (the original return value)
  - `"llm_request_info"`: the complete audit trail, containing:
    - `system_prompt`: the system instruction used
    - `query`: the user question
    - `context_documents`: the list of retrieved documents
    - `llm_response`: the original LLM response
    - `labeling_prompt`: the generated labeling prompt
    - `model`: the model name (e.g., `"groq-default"`)
    - `temperature`: the temperature parameter (0.0)
    - `max_tokens`: the max-tokens setting (2048)
    - `full_llm_response`: the complete raw response from the LLM

**Why:** Enables complete auditability of LLM interactions.
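A minimal sketch of the new return shape. The helpers `_build_labeling_prompt()`, `_parse_labels()`, and the `llm_client.complete()` call are hypothetical stand-ins for the real prompt-building, LLM call, and parsing logic summarized above:

```python
from typing import Dict, List, Optional

def _get_gpt_labels(self, query: str, llm_response: str,
                    documents: List[str]) -> Optional[Dict]:
    # Hypothetical helpers standing in for the real prompt construction,
    # LLM call, and JSON parsing described in this change.
    labeling_prompt = self._build_labeling_prompt(query, llm_response, documents)
    full_llm_response = self.llm_client.complete(labeling_prompt)
    labels = self._parse_labels(full_llm_response)  # -> GPTLabelingOutput or None
    if labels is None:
        return None
    return {
        "labels": labels,  # the original GPTLabelingOutput return value
        "llm_request_info": {
            "system_prompt": self.system_prompt,
            "query": query,
            "context_documents": documents,
            "llm_response": llm_response,
            "labeling_prompt": labeling_prompt,
            "model": "groq-default",
            "temperature": 0.0,
            "max_tokens": 2048,
            "full_llm_response": full_llm_response,
        },
    }
```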

#### Change 2: `evaluate()` method (Lines 414-473)

**What Changed:**

- Changed the return type from `AdvancedTRACEScores` to `Tuple[AdvancedTRACEScores, Optional[Dict]]`
- Now returns the tuple `(scores, llm_request_info)`
- Handles both the old `GPTLabelingOutput` return style and the new dict return from `_get_gpt_labels()`
- Passes `llm_request_info` through to the caller

**Why:** Allows scores and the LLM audit info to flow through the evaluation pipeline.
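A sketch of the new control flow, with the scoring logic elided behind a hypothetical `_compute_scores()` helper:

```python
from typing import Dict, Optional, Tuple

def evaluate(self, query, llm_response,
             documents) -> Tuple["AdvancedTRACEScores", Optional[Dict]]:
    result = self._get_gpt_labels(query, llm_response, documents)
    llm_request_info: Optional[Dict] = None
    if isinstance(result, dict):
        # New style: {"labels": GPTLabelingOutput, "llm_request_info": {...}}
        labels = result["labels"]
        llm_request_info = result.get("llm_request_info")
    else:
        # Old style: a bare GPTLabelingOutput (or None)
        labels = result
    # Hypothetical helper standing in for the actual score computation.
    scores = self._compute_scores(labels, query, llm_response, documents)
    return scores, llm_request_info
```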

#### Change 3: `evaluate_batch()` method (Lines 627-701)

**What Changed:**

- Updated to unpack the tuple from `evaluate()`: `scores, llm_request_info = self.evaluate(...)`
- Added an `"llm_request"` field to each `result_dict` in `detailed_results`
- Each detailed result now has the structure:

```json
{
  "query_id": 1,
  "question": "...",
  "llm_response": "...",
  "retrieved_documents": [...],
  "metrics": {...},
  "ground_truth_scores": {...},
  "llm_request": {
    "system_prompt": "...",
    "query": "...",
    "context_documents": [...],
    "llm_response": "...",
    "full_llm_response": "...",
    "model": "...",
    "temperature": 0.0,
    "max_tokens": 2048
  }
}
```

**Why:** Stores the complete LLM audit trail for each query during batch evaluation.
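A sketch of the updated batch loop. It assumes the batch input is an iterable of dicts with `question`, `llm_response`, and `documents` keys, and that scores serialize via a hypothetical `to_dict()`:

```python
detailed_results = []
for i, item in enumerate(batch_items, start=1):
    # Unpack the new tuple return from evaluate().
    scores, llm_request_info = self.evaluate(
        item["question"], item["llm_response"], item["documents"]
    )
    result_dict = {
        "query_id": i,
        "question": item["question"],
        "llm_response": item["llm_response"],
        "retrieved_documents": item["documents"],
        "metrics": scores.to_dict(),            # hypothetical serializer
        "ground_truth_scores": item.get("ground_truth_scores", {}),
        "llm_request": llm_request_info or {},  # empty dict if no LLM client
    }
    detailed_results.append(result_dict)
```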

### 2. `evaluation_pipeline.py`

#### Change: `evaluate()` method (Lines 80-127)

**What Changed:**

- Updated to handle the new tuple return from `advanced_evaluator.evaluate()`
- Added a type check: `if isinstance(result, tuple)`
- Backwards compatible with both the old and new return types
- Extracts `scores` and `llm_info` separately
- Adds `"llm_request_info"` to the returned evaluation result dictionary

**Why:** Maintains compatibility while propagating the LLM audit info through the pipeline.
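A sketch of the backwards-compatible unpacking; the surrounding method and variable names (`query`, `response`, `documents`, `evaluation_result`) are assumed:

```python
result = self.advanced_evaluator.evaluate(query, response, documents)
if isinstance(result, tuple):
    scores, llm_info = result        # new return type: (scores, llm_request_info)
else:
    scores, llm_info = result, None  # old return type: bare AdvancedTRACEScores
# Propagate the audit trail in the returned evaluation result dictionary.
evaluation_result["llm_request_info"] = llm_info or {}
```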

### 3. `streamlit_app.py`

**No changes needed:**

- Already uses `detailed_results` in the JSON download (line 818)
- The new `llm_request` field is automatically included in the JSON export
- The JSON download structure already supports arbitrarily nested fields
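A sketch of why no change is needed: `json.dumps` serializes whatever nested fields the results dict carries, so the new `llm_request` field rides along automatically. The variable name `results` stands in for the evaluation results dict assembled earlier in the app:

```python
import json
import streamlit as st

# results is assumed to be the evaluation output dict containing detailed_results.
st.download_button(
    label="Download complete results (JSON)",
    data=json.dumps(results, indent=2, default=str),
    file_name="evaluation_results.json",
    mime="application/json",
)
```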

## Data Flow

```
_get_gpt_labels()
  ├─ Captures the complete LLM interaction
  ├─ Returns: {"labels": GPTLabelingOutput, "llm_request_info": {...}}
  │
evaluate()
  ├─ Unpacks the dict from _get_gpt_labels()
  ├─ Returns: (AdvancedTRACEScores, llm_request_info)
  │
evaluate_batch()
  ├─ Unpacks scores and llm_request_info
  ├─ Stores llm_request in detailed_results[i]["llm_request"]
  ├─ Returns: dict with detailed_results containing llm_request
  │
JSON Download (streamlit_app.py)
  ├─ detailed_results automatically includes llm_request
  ├─ Downloads the complete audit trail for each query
```

## JSON Download Structure

After downloading, the JSON file contains:

```json
{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "rmse_metrics": {...},
  "auc_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "User's question",
      "llm_response": "LLM's answer",
      "retrieved_documents": ["doc1", "doc2", ...],
      "metrics": {...},
      "ground_truth_scores": {...},
      "llm_request": {
        "system_prompt": "System instruction for labeling",
        "query": "User question",
        "context_documents": [...],
        "llm_response": "Raw LLM response before JSON parsing",
        "labeling_prompt": "Generated prompt sent to LLM",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "Complete raw response from LLM"
      }
    },
    ...
  ]
}
```

## Benefits

1. **Complete audit trail**: full visibility into what was sent to the LLM and what was returned
2. **Debugging**: evaluations can be reproduced when needed
3. **Transparency**: a complete record of all LLM interactions
4. **Analysis**: LLM inputs can be correlated with output quality
5. **Compliance**: meets auditability requirements

## Testing Recommendations

1. Run an evaluation with the GPT labeling method
2. Download the complete results JSON
3. Verify that each entry in `detailed_results` contains an `llm_request` field
4. Verify that `llm_request` contains (a verification sketch follows this list):
   - `system_prompt`
   - `query` (the user question)
   - `context_documents` (list)
   - `labeling_prompt`
   - `full_llm_response`
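A quick check covering steps 3 and 4, assuming the downloaded file is named `evaluation_results.json`:

```python
import json

with open("evaluation_results.json") as f:
    results = json.load(f)

expected = {"system_prompt", "query", "context_documents",
            "labeling_prompt", "full_llm_response"}
for r in results["detailed_results"]:
    # Step 3: every detailed result must carry an llm_request field.
    assert "llm_request" in r, f"query_id {r['query_id']} missing llm_request"
    # Step 4: the audit trail must contain the expected keys.
    missing = expected - set(r["llm_request"])
    assert not missing, f"query_id {r['query_id']} missing {sorted(missing)}"
print("All detailed_results contain a complete llm_request audit trail.")
```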

## Backwards Compatibility

- The changes are backwards compatible
- `evaluation_pipeline.py` checks `isinstance(result, tuple)` to handle both the old and new return types
- If the LLM client is unavailable, an empty `llm_request_info` dict is returned gracefully
- The JSON download structure remains valid even if `llm_request` is empty

## Implementation Status

- ✅ `_get_gpt_labels()` captures the complete LLM interaction
- ✅ `evaluate()` returns scores plus LLM info
- ✅ `evaluate_batch()` stores LLM info in `detailed_results`
- ✅ JSON download automatically includes `llm_request`
- ✅ All files compile without errors
- ✅ Backwards compatible