# TRACE Evaluation - Detailed Results Export

## Overview

Enhanced the TRACE evaluation to save comprehensive per-query data, including questions, retrieved documents, LLM responses, and metrics for each test case.

## Changes Made

### 1. **trace_evaluator.py** - Updated `evaluate_batch()` method

Added a `detailed_results` array to the evaluation output that captures:
```python
{
    "query_id": 1,
    "question": "What is machine learning?",
    "llm_response": "Machine learning is a subset of artificial intelligence...",
    "retrieved_documents": [
        "Document 1 text...",
        "Document 2 text...",
        ...
    ],
    "ground_truth": "Ground truth answer...",
    "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
    }
}
```
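For reference, here is a minimal sketch of how one such entry might be assembled. The helper name `build_detailed_result` and its arguments are illustrative placeholders, not the actual `evaluate_batch()` internals:

```python
def build_detailed_result(query_id, question, response, docs, truth, scores):
    """Assemble one detailed_results entry from already-computed pieces.

    `scores` is expected to hold the four TRACE metrics; the average
    is derived from them rather than stored separately.
    """
    metrics = {k: scores[k] for k in
               ("utilization", "relevance", "adherence", "completeness")}
    metrics["average"] = sum(metrics.values()) / 4
    return {
        "query_id": query_id,
        "question": question,
        "llm_response": response,
        "retrieved_documents": list(docs),
        "ground_truth": truth,
        "metrics": metrics,
    }
```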
### 2. **streamlit_app.py** - Enhanced Evaluation Display

#### A. Summary Metrics View

- Displays aggregate TRACE scores across all test cases
- Shows metrics in table format

#### B. Detailed Per-Query Analysis

- Expandable section for each query showing:
  - **Question**: Original user query
  - **LLM Response**: Generated answer
  - **Retrieved Documents**: Each document in its own expandable section
  - **Ground Truth**: Expected answer (if available)
  - **TRACE Metrics**: Utilization, Relevance, Adherence, Completeness

#### C. Enhanced JSON Download

- A single button downloads the complete evaluation with:
  - Aggregate scores
  - Individual query metrics
  - Full per-query details (question, response, documents, ground truth)
  - Evaluation configuration (chunking strategy, embedding model, chunk size, etc.)
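As a sketch of how the download payload could be put together: the `make_export_payload` helper below is hypothetical, while `st.download_button` is the standard Streamlit widget such a payload would feed.

```python
import json
from datetime import datetime

def make_export_payload(results: dict, config: dict):
    """Bundle evaluation results and configuration into one JSON string
    plus a timestamped filename, ready for a download button."""
    payload = {**results, "evaluation_config": config}
    filename = f"trace_evaluation_{datetime.now():%Y%m%d_%H%M%S}.json"
    return json.dumps(payload, indent=2), filename

# In streamlit_app.py this could back the single download button, e.g.:
# data, name = make_export_payload(results, config)
# st.download_button("Download Complete Results (JSON)", data,
#                    file_name=name, mime="application/json")
```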
## JSON Output Structure

```json
{
  "utilization": 0.85,
  "relevance": 0.92,
  "adherence": 0.88,
  "completeness": 0.79,
  "average": 0.86,
  "num_samples": 10,
  "individual_scores": [
    {
      "utilization": 0.85,
      "relevance": 0.92,
      "adherence": 0.88,
      "completeness": 0.79,
      "average": 0.86
    },
    ...
  ],
  "detailed_results": [
    {
      "query_id": 1,
      "question": "What is machine learning?",
      "llm_response": "Machine learning is...",
      "retrieved_documents": ["Doc 1", "Doc 2", ...],
      "ground_truth": "ML is a field of AI...",
      "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
      }
    },
    ...
  ],
  "evaluation_config": {
    "chunking_strategy": "dense",
    "embedding_model": "sentence-transformers/all-mpnet-base-v2",
    "chunk_size": 512,
    "chunk_overlap": 50
  }
}
```
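When consuming an export programmatically, a small sanity check against this structure can catch truncated or mismatched files early. This validator is a suggestion, not part of the shipped code:

```python
# Top-level keys the documented export structure is expected to carry.
EXPECTED_KEYS = {
    "utilization", "relevance", "adherence", "completeness", "average",
    "num_samples", "individual_scores", "detailed_results",
    "evaluation_config",
}

def validate_export(results: dict) -> None:
    """Raise ValueError if an export deviates from the documented shape."""
    missing = EXPECTED_KEYS - results.keys()
    if missing:
        raise ValueError(f"export missing keys: {sorted(missing)}")
    n = results["num_samples"]
    if len(results["individual_scores"]) != n:
        raise ValueError("individual_scores length != num_samples")
    if len(results["detailed_results"]) != n:
        raise ValueError("detailed_results length != num_samples")
```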
## How to Use

### In Streamlit UI:

1. **Run Evaluation**
   - Go to the "Evaluation" tab
   - Select an LLM model
   - Set the number of test samples
   - Click "Run Evaluation"
2. **View Results**
   - Aggregate metrics are displayed at the top
   - "Summary Metrics by Query" - table view of all scores
   - "Detailed Per-Query Analysis" - expandable details for each query
3. **Download Results**
   - Click "Download Complete Results (JSON)"
   - Saves a file named `trace_evaluation_YYYYMMDD_HHMMSS.json`
   - Contains all information needed for analysis and reproducibility
### Analyzing Downloaded JSON:

```python
import json

with open("trace_evaluation_20250119_153022.json") as f:
    results = json.load(f)

# Access aggregate scores
print(f"Average Score: {results['average']:.2%}")

# Access per-query details
for query in results['detailed_results']:
    print(f"\nQuery {query['query_id']}: {query['question']}")
    print(f"  Metrics: {query['metrics']}")
    print(f"  Response: {query['llm_response'][:100]}...")
    print(f"  Docs Retrieved: {len(query['retrieved_documents'])}")
```
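Building on the loading snippet above, a small helper (again, not part of the export itself) can surface the weakest cases for debugging:

```python
def lowest_scoring_queries(results: dict, n: int = 3) -> list:
    """Return the n detailed_results entries with the lowest average
    TRACE score, handy for spotting problematic queries."""
    return sorted(
        results["detailed_results"],
        key=lambda q: q["metrics"]["average"],
    )[:n]
```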
## Benefits

- **Reproducibility**: Full evaluation context is saved with the results
- **Transparency**: See exactly which questions were asked and which documents were retrieved
- **Analysis**: Easy to analyze correlations between queries, responses, and metrics
- **Audit Trail**: Complete record for reporting and review
- **Debugging**: Identify problematic cases and understand why metrics were low
- **Comparison**: Compare results across different configurations
## Files Modified

1. **trace_evaluator.py** (lines 359-436)
   - Enhanced `evaluate_batch()` method
   - Collects per-query details
   - Adds `detailed_results` field to output
2. **streamlit_app.py** (lines 640-682)
   - Added detailed per-query analysis section
   - Enhanced download with full context
   - Improved UI with expandable query details
## Example Workflow

1. Create a collection with a specific embedding model and chunking strategy
2. Run TRACE evaluation on test samples
3. Review metrics in the Streamlit UI
4. Click details to inspect specific queries
5. Download the JSON for:
   - Archival and reproducibility
   - Further analysis in notebooks
   - Sharing with stakeholders
   - Comparing different configurations
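For the comparison step, several downloaded exports can be lined up side by side. A possible sketch, where the `compare_runs` helper is illustrative rather than shipped code:

```python
import json
from pathlib import Path

def compare_runs(paths):
    """Load several exported evaluations and rank them by average
    TRACE score, labeled by their configuration."""
    rows = []
    for path in paths:
        results = json.loads(Path(path).read_text())
        cfg = results["evaluation_config"]
        label = f"{cfg['chunking_strategy']} / {cfg['embedding_model']}"
        rows.append((label, results["average"]))
    # Best-scoring configuration first
    return sorted(rows, key=lambda row: row[1], reverse=True)
```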
## Integration with RAG Pipeline

The evaluation captures:

- **Queries**: From the RAGBench test split
- **Retrieved Documents**: From vector store retrieval
- **LLM Response**: From the Groq API
- **Ground Truth**: From RAGBench labels
- **Metrics**: Computed by `TRACEEvaluator`
- **Configuration**: Embedding model, chunking strategy, parameters

All of this information is preserved in the JSON export for complete traceability.