# LLM Audit Trail Enhancement - Changes Summary

## Overview

Enhanced the JSON download functionality to include complete LLM request and response information for debugging and auditing purposes.
## Files Modified

### 1. advanced_rag_evaluator.py
#### Change 1: `_get_gpt_labels()` method (Lines 483-549)

**What Changed:**
- Modified to capture and return complete LLM request/response information
- Changed return type from `Optional[GPTLabelingOutput]` to `Optional[Dict]`
- The dictionary structure contains (see the sketch after this list):
  - `"labels"`: the `GPTLabelingOutput` object (the original return value)
  - `"llm_request_info"`: complete audit trail with:
    - `system_prompt`: the system instruction used
    - `query`: the user question
    - `context_documents`: list of retrieved documents
    - `llm_response`: the original LLM response
    - `labeling_prompt`: the generated labeling prompt
    - `model`: model name (e.g., `"groq-default"`)
    - `temperature`: temperature parameter (0.0)
    - `max_tokens`: max tokens setting (2048)
    - `full_llm_response`: complete raw response from the LLM

**Why:** Enables complete auditability of LLM interactions
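A minimal sketch of the new return shape. The labeling internals are elided, and the helper names (`llm_client`, `complete`, `_build_labeling_prompt`, `_parse_labels`, `system_prompt`) are illustrative assumptions, not the confirmed implementation:

```python
from typing import Dict, List, Optional

class AdvancedRAGEvaluator:
    # ... initialization and scoring logic elided ...

    def _get_gpt_labels(self, query: str, response: str,
                        documents: List[str]) -> Optional[Dict]:
        """Label one query/response pair and capture the full audit trail."""
        if self.llm_client is None:  # assumed attribute name
            return None

        labeling_prompt = self._build_labeling_prompt(query, response, documents)
        full_llm_response = self.llm_client.complete(  # assumed client API
            prompt=labeling_prompt, temperature=0.0, max_tokens=2048
        )
        labels = self._parse_labels(full_llm_response)  # -> GPTLabelingOutput

        return {
            "labels": labels,
            "llm_request_info": {
                "system_prompt": self.system_prompt,
                "query": query,
                "context_documents": documents,
                "llm_response": response,
                "labeling_prompt": labeling_prompt,
                "model": "groq-default",
                "temperature": 0.0,
                "max_tokens": 2048,
                "full_llm_response": full_llm_response,
            },
        }
```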
#### Change 2: `evaluate()` method (Lines 414-473)

**What Changed:**
- Changed return type from `AdvancedTRACEScores` to `Tuple[AdvancedTRACEScores, Optional[Dict]]`
- Now returns a tuple: `(scores, llm_request_info)` (sketched below)
- Handles both the old object return and the new dict return from `_get_gpt_labels()`
- Passes `llm_request_info` through to the caller

**Why:** Allows scores and LLM audit info to flow through the evaluation pipeline
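A sketch of the tuple-returning shape, with `_compute_scores` standing in for the elided scoring logic (an assumed helper name):

```python
from typing import Optional, Tuple

class AdvancedRAGEvaluator:
    # ... continued from the sketch above ...

    def evaluate(self, query: str, response: str,
                 documents: list) -> Tuple["AdvancedTRACEScores", Optional[dict]]:
        """Score one query/response pair and surface the LLM audit trail."""
        llm_request_info = None
        gpt_result = self._get_gpt_labels(query, response, documents)

        # New style: dict with "labels" and "llm_request_info";
        # old style: a bare GPTLabelingOutput (or None).
        if isinstance(gpt_result, dict):
            labels = gpt_result.get("labels")
            llm_request_info = gpt_result.get("llm_request_info")
        else:
            labels = gpt_result

        scores = self._compute_scores(query, response, documents, labels)
        return scores, llm_request_info
```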
#### Change 3: `evaluate_batch()` method (Lines 627-701)

**What Changed:**
- Updated to unpack the tuple from `evaluate()`: `scores, llm_request_info = self.evaluate(...)`
- Added an `"llm_request"` field to each `result_dict` in `detailed_results`
- Structure: each detailed result now includes:

```json
{
  "query_id": 1,
  "question": "...",
  "llm_response": "...",
  "retrieved_documents": [...],
  "metrics": {...},
  "ground_truth_scores": {...},
  "llm_request": {
    "system_prompt": "...",
    "query": "...",
    "context_documents": [...],
    "llm_response": "...",
    "full_llm_response": "...",
    "model": "...",
    "temperature": 0.0,
    "max_tokens": 2048
  }
}
```

**Why:** Stores the complete LLM audit trail for each query in batch evaluation (see the loop sketch below)
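A sketch of the batch loop. The input record keys mirror the structure above; `to_dict()` and the omitted aggregate fields are assumptions:

```python
class AdvancedRAGEvaluator:
    # ... continued; aggregation and metric logic elided ...

    def evaluate_batch(self, records: list) -> dict:
        """Evaluate a batch of queries, keeping a per-query audit trail."""
        detailed_results = []
        for i, rec in enumerate(records, start=1):
            scores, llm_request_info = self.evaluate(
                rec["question"], rec["llm_response"], rec["retrieved_documents"]
            )
            detailed_results.append({
                "query_id": i,
                "question": rec["question"],
                "llm_response": rec["llm_response"],
                "retrieved_documents": rec["retrieved_documents"],
                "metrics": scores.to_dict(),  # assumed serialization helper
                "llm_request": llm_request_info or {},
            })
        return {"detailed_results": detailed_results}
```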
### 2. evaluation_pipeline.py

#### Change: `evaluate()` method (Lines 80-127)

**What Changed:**
- Updated to handle the new tuple return from `advanced_evaluator.evaluate()`
- Added type checking: `if isinstance(result, tuple)` (see the sketch after this list)
- Backwards compatible with both old and new return types
- Extracts `scores` and `llm_info` separately
- Adds `"llm_request_info"` to the returned evaluation result dictionary

**Why:** Maintains compatibility while propagating LLM audit info through the pipeline
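The compatibility check, sketched in isolation (the surrounding pipeline code and the `evaluation_result` dict are illustrative):

```python
result = self.advanced_evaluator.evaluate(query, response, documents)

# New-style evaluators return (scores, llm_request_info);
# old-style evaluators return a bare scores object.
if isinstance(result, tuple):
    scores, llm_info = result
else:
    scores, llm_info = result, None

evaluation_result["scores"] = scores
evaluation_result["llm_request_info"] = llm_info or {}  # empty dict when unavailable
```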
### 3. streamlit_app.py

**No changes needed:**
- Already uses `detailed_results` in the JSON download (line 818)
- The new `llm_request` field is automatically included in the JSON export
- The JSON download structure already supports arbitrary nested fields (see the sketch below)
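Why no change is required: serializing the whole results dict carries nested fields along for free. A sketch of that download path (the session key, label, and filename are illustrative, not the app's actual strings):

```python
import json
import streamlit as st

results = st.session_state["evaluation_results"]  # assumed session key

# json.dumps serializes the entire results dict, so any nested field,
# including the new "llm_request", is exported without extra handling.
st.download_button(
    label="Download Complete Results (JSON)",
    data=json.dumps(results, indent=2, default=str),
    file_name="evaluation_results.json",
    mime="application/json",
)
```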
## Data Flow

```
_get_gpt_labels()
├─ Captures complete LLM interaction
└─ Returns: {"labels": GPTLabelingOutput, "llm_request_info": {...}}
        ↓
evaluate()
├─ Unpacks dict from _get_gpt_labels()
└─ Returns: (AdvancedTRACEScores, llm_request_info)
        ↓
evaluate_batch()
├─ Unpacks scores and llm_request_info
├─ Stores llm_request in detailed_results[i]["llm_request"]
└─ Returns: dict with detailed_results containing llm_request
        ↓
JSON Download (streamlit_app.py)
├─ detailed_results automatically includes llm_request
└─ Downloads complete audit trail for each query
```
## JSON Download Structure

After downloading, the JSON file contains:

```json
{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "rmse_metrics": {...},
  "auc_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "User's question",
      "llm_response": "LLM's answer",
      "retrieved_documents": ["doc1", "doc2", ...],
      "metrics": {...},
      "ground_truth_scores": {...},
      "llm_request": {
        "system_prompt": "System instruction for labeling",
        "query": "User question",
        "context_documents": [...],
        "llm_response": "Raw LLM response before JSON parsing",
        "labeling_prompt": "Generated prompt sent to the LLM",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "Complete raw response from the LLM"
      }
    },
    ...
  ]
}
```
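To spot-check a downloaded file, a small inspection snippet (`evaluation_results.json` is an assumed filename for the saved download):

```python
import json

with open("evaluation_results.json") as f:
    results = json.load(f)

# One-line audit summary per query.
for item in results["detailed_results"]:
    audit = item.get("llm_request", {})
    print(item["query_id"], audit.get("model"),
          "prompt chars:", len(audit.get("labeling_prompt", "")))
```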
## Benefits

- **Complete Audit Trail:** Full visibility into what was sent to the LLM and what was returned
- **Debugging:** Evaluations can be reproduced when needed
- **Transparency:** Complete record of LLM interactions
- **Analysis:** LLM inputs can be correlated with output quality
- **Compliance:** Meets auditability requirements
## Testing Recommendations

- Run an evaluation with the GPT Labeling method
- Download the complete results JSON
- Verify each detailed result contains the `llm_request` field
- Verify `llm_request` contains:
  - `system_prompt`
  - `query` (the user question)
  - `context_documents` list
  - `labeling_prompt`
  - `full_llm_response`

These checks can be automated, as sketched below.
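A minimal pytest-style sketch of the field checks, assuming the download was saved as `evaluation_results.json`:

```python
import json

REQUIRED_AUDIT_FIELDS = {
    "system_prompt", "query", "context_documents",
    "labeling_prompt", "full_llm_response",
}

def test_llm_request_fields():
    with open("evaluation_results.json") as f:  # assumed filename
        results = json.load(f)
    for item in results["detailed_results"]:
        audit = item.get("llm_request", {})
        missing = REQUIRED_AUDIT_FIELDS - set(audit)
        assert not missing, f"query {item['query_id']} missing fields: {missing}"
```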
## Backwards Compatibility

- Changes are backwards compatible
- evaluation_pipeline.py checks `isinstance(result, tuple)` to handle both old and new return types
- If the LLM client is unavailable, an empty `llm_request_info` dict is returned gracefully
- The JSON download structure remains valid even if `llm_request` is empty
## Implementation Status

- ✅ `_get_gpt_labels()` captures complete LLM interaction
- ✅ `evaluate()` returns scores + LLM info
- ✅ `evaluate_batch()` stores LLM info in `detailed_results`
- ✅ JSON download automatically includes `llm_request`
- ✅ All files compile without errors
- ✅ Backwards compatible