# Fix: Citation Validation Issues - Context Manager Metadata Key Mismatch

## Problem Summary

The HuggingFace deployment was showing persistent invalid citation warnings:

```
WARNING:src.rag.rag_pipeline:Invalid citations detected: ['document_1.md', 'document_2.md', 'document_3.md']
WARNING:src.rag.rag_pipeline:Available sources were: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
```
## Root Cause Analysis

The issue was a **metadata key mismatch** between document processing and context formatting:

1. **HF Document Processing** (`scripts/hf_process_documents.py`):
   - Stores filenames in `metadata.source_file`
   - Example: `{"source_file": "pto_policy.md"}`
2. **Context Manager** (`src/llm/context_manager.py`):
   - Only checked `metadata.filename`
   - Defaulted to `f"document_{i}"` when that key was not found
   - Result: the LLM saw "Document: document_1.md" instead of real filenames
3. **LLM Behavior**:
   - Generated citations based on the context it was given: `[Source: document_1.md]`
   - Citation validation correctly flagged these as invalid
## Solution Implemented

### 1. **Fixed Context Manager** (`src/llm/context_manager.py`)

```python
# OLD CODE (causing the issue):
filename = metadata.get("filename", f"document_{i}")
# NEW CODE (fixed):
filename = metadata.get("source_file") or metadata.get("filename", f"document_{i}")
```

- Now checks both `source_file` (HF) and `filename` (legacy) keys
- Changed format from "Document:" to "SOURCE FILE:" for consistency
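The lookup order described above can be sketched as a small standalone helper. This is an illustration only, not the code in `src/llm/context_manager.py`: the names `resolve_source_name` and `format_context`, and the chunk layout with `content` and `metadata` keys, are assumptions made for the example.

```python
from typing import Any, Dict, List


def resolve_source_name(metadata: Dict[str, Any], index: int) -> str:
    """Return the best available filename for a retrieved chunk (hypothetical helper)."""
    # Prefer the HF key, then the legacy key, then a generic placeholder.
    return (
        metadata.get("source_file")
        or metadata.get("filename")
        or f"document_{index}"
    )


def format_context(chunks: List[Dict[str, Any]]) -> str:
    """Label each chunk with its source file (assumed chunk layout)."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        name = resolve_source_name(chunk.get("metadata", {}), i)
        parts.append(f"SOURCE FILE: {name}\n{chunk.get('content', '')}")
    return "\n\n".join(parts)
```

One deliberate difference from the one-line fix: chaining `or` across all three candidates also treats an empty `filename` string as missing, whereas `metadata.get("filename", default)` would return the empty string. Whether that matters depends on how the metadata is produced.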
### 2. **Enhanced System Prompt** (`src/llm/prompt_templates.py`)

- Added explicit warnings against generic document names
- Provided clear examples of correct vs incorrect citations
- Emphasized using filenames after "SOURCE FILE:" labels

### 3. **Improved Fallback Citations** (`src/llm/prompt_templates.py`)

- Updated `add_fallback_citations()` to check both metadata keys
- Ensures backup citations use real filenames
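A rough sketch of what a fallback of this kind might look like follows. The real signature and behavior of `add_fallback_citations()` are not shown in this document, so the function name below, its arguments, and the rule "append citations only when the response has none" are all assumptions.

```python
import re
from typing import Any, Dict, List

# Matches citations of the form [Source: some_file.md]
CITATION_RE = re.compile(r"\[Source:\s*([^\]]+)\]")


def add_fallback_citations_sketch(response: str, sources: List[Dict[str, Any]]) -> str:
    """Append citations built from real filenames if the response has none (hypothetical)."""
    if CITATION_RE.search(response):
        return response  # The model already cited something; leave it alone.

    names: List[str] = []
    for src in sources:
        meta = src.get("metadata", {})
        # Check the HF key first, then the legacy key.
        name = meta.get("source_file") or meta.get("filename")
        if name and name not in names:  # Skip missing names and duplicates.
            names.append(name)

    if not names:
        return response
    citations = " ".join(f"[Source: {n}]" for n in names)
    return f"{response.rstrip()} {citations}"
```

Deduplicating the names keeps a response backed by three chunks of `pto_policy.md` from ending with the same citation three times.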
### 4. **Enhanced Debugging** (`src/rag/rag_pipeline.py`)

- Added detailed logging for citation validation
- Shows available sources vs detected citations for troubleshooting
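As a minimal sketch of this kind of check, assuming citations take the `[Source: filename]` form shown earlier: the function name, logger name, and return shape below are hypothetical and not taken from `src/rag/rag_pipeline.py`. The log messages mirror the wording of the warnings in the problem summary.

```python
import logging
import re
from typing import List, Tuple

logger = logging.getLogger("citation_validation_sketch")  # hypothetical logger name

CITATION_RE = re.compile(r"\[Source:\s*([^\]]+)\]")


def validate_citations(response: str, available_sources: List[str]) -> Tuple[List[str], List[str]]:
    """Split cited filenames into (valid, invalid) against the retrieved sources."""
    cited = [match.strip() for match in CITATION_RE.findall(response)]
    allowed = set(available_sources)
    valid = [c for c in cited if c in allowed]
    invalid = [c for c in cited if c not in allowed]
    if invalid:
        # Log both sides so a key mismatch is visible at a glance.
        logger.warning("Invalid citations detected: %s", invalid)
        logger.warning("Available sources were: %s", available_sources)
    return valid, invalid
```

Logging the available sources next to the rejected citations is what makes a mismatch like `document_1.md` vs `pto_policy.md` easy to spot.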
## Testing

Added a test script (`test_citation_fix.py`) that validates:

- ✅ Correct HF citations with real filenames
- ✅ Detection of invalid generic citations
- ✅ Fallback citations using real filenames
- ✅ Backward compatibility with legacy metadata

**Test Results:** All validation tests passing ✅
## Expected Impact

**Before Fix:**

```
Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "Document: document_1.md"
Generated citation: [Source: document_1.md] ❌
```

**After Fix:**

```
Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "SOURCE FILE: pto_policy.md"
Generated citation: [Source: pto_policy.md] ✅
```
## Benefits

1. **Eliminates Invalid Citation Warnings** - Removes the generic `document_N` names that caused them
2. **Improves User Experience** - Proper source attribution in responses
3. **Maintains Backward Compatibility** - Still works with legacy `filename` metadata
4. **Better Debugging** - Enhanced logging for future troubleshooting
5. **Consistent Context Format** - Unified "SOURCE FILE:" format across the pipeline
## Deployment

- [x] Tested locally with comprehensive validation
- [x] Pre-commit hooks passing
- [x] Ready for HuggingFace Spaces deployment
- [x] CI/CD pipeline configured for automatic deployment
## Files Changed

- `src/llm/context_manager.py` - Core fix for metadata key handling
- `src/llm/prompt_templates.py` - Enhanced prompts and fallback citations
- `src/rag/rag_pipeline.py` - Improved debugging and validation
- `test_citation_fix.py` - Comprehensive validation tests

This fix resolves the metadata key mismatch that caused invalid citations in the HuggingFace deployment, so responses are attributed to the real source files.