Fix: Citation Validation Issues - Context Manager Metadata Key Mismatch
Problem Summary
The HuggingFace deployment was showing persistent invalid citation warnings:
WARNING:src.rag.rag_pipeline:Invalid citations detected: ['document_1.md', 'document_2.md', 'document_3.md']
WARNING:src.rag.rag_pipeline:Available sources were: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
Root Cause Analysis
The issue was a metadata key mismatch between document processing and context formatting:
HF Document Processing (scripts/hf_process_documents.py):
- Stores filenames in metadata.source_file
- Example: {"source_file": "pto_policy.md"}
Context Manager (src/llm/context_manager.py):
- Was only checking metadata.filename
- Defaulted to f"document_{i}" when not found
- Result: LLM saw "Document: document_1.md" instead of real filenames
LLM Behavior:
- Generated citations based on context: [Source: document_1.md]
- Citation validation correctly flagged these as invalid
Solution Implemented
1. Fixed Context Manager (src/llm/context_manager.py)
# OLD CODE (causing the issue):
filename = metadata.get("filename", f"document_{i}")
# NEW CODE (fixed):
filename = metadata.get("source_file") or metadata.get("filename", f"document_{i}")
- Now checks both source_file (HF) and filename (legacy) keys
- Changed format from "Document:" to "SOURCE FILE:" for consistency
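The lookup order can be sketched as follows. This is a minimal illustration, not the actual code in src/llm/context_manager.py: the names resolve_filename and format_context, the shape of the results list, and the "content" key are assumptions. Only the source_file/filename fallback and the "SOURCE FILE:" label come from the fix described here.

```python
from typing import Any, Dict, List


def resolve_filename(metadata: Dict[str, Any], i: int) -> str:
    """Prefer the HF key, then the legacy key, then a generic placeholder."""
    return metadata.get("source_file") or metadata.get("filename", f"document_{i}")


def format_context(results: List[Dict[str, Any]]) -> str:
    """Hypothetical formatter: label each chunk with its resolved filename."""
    blocks = []
    for i, result in enumerate(results, start=1):
        metadata = result.get("metadata", {})
        filename = resolve_filename(metadata, i)
        blocks.append(f"SOURCE FILE: {filename}\n{result.get('content', '')}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    sample = [
        {"content": "PTO accrues monthly.", "metadata": {"source_file": "pto_policy.md"}},
        {"content": "Legacy chunk.", "metadata": {"filename": "legacy_policy.md"}},
    ]
    print(format_context(sample))
```

Because the first lookup uses `or`, an empty or missing source_file falls through to the legacy key, and the generic placeholder is only used when neither key is present.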
2. Enhanced System Prompt (src/llm/prompt_templates.py)
- Added explicit warnings against generic document names
- Provided clear examples of correct vs incorrect citations
- Emphasized using filenames after "SOURCE FILE:" labels
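As an illustration of the kind of instruction described in these bullets, a prompt fragment might look like the sketch below. The wording, the CITATION_RULES constant, and the build_system_prompt helper are all hypothetical; the actual text in src/llm/prompt_templates.py is not shown in this document.

```python
# Hypothetical citation rules; the real prompt wording may differ.
CITATION_RULES = """\
Citation rules:
- Cite only filenames that appear after a "SOURCE FILE:" label in the context.
- Never invent generic names such as document_1.md.

Correct:   [Source: pto_policy.md]
Incorrect: [Source: document_1.md]
"""


def build_system_prompt(base_prompt: str) -> str:
    """Append the citation rules to a base system prompt (hypothetical helper)."""
    return f"{base_prompt.rstrip()}\n\n{CITATION_RULES}"


if __name__ == "__main__":
    print(build_system_prompt("You are a helpful assistant."))
```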
3. Improved Fallback Citations (src/llm/prompt_templates.py)
- Updated add_fallback_citations() to check both metadata keys
- Ensures backup citations use real filenames
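A fallback along these lines could look like the sketch below. Only the function name and the two metadata keys come from this document. The signature, the rule for when the fallback triggers, and the "Sources:" output format are assumptions made for illustration.

```python
from typing import Any, Dict, List


def add_fallback_citations(response: str, search_results: List[Dict[str, Any]]) -> str:
    """Append citations when the response has none (hypothetical signature)."""
    if "[Source:" in response:
        return response  # the model already cited something; leave it alone

    filenames: List[str] = []
    for result in search_results:
        metadata = result.get("metadata", {})
        # Check the HF key first, then the legacy key.
        name = metadata.get("source_file") or metadata.get("filename")
        if name and name not in filenames:  # skip missing names, de-duplicate
            filenames.append(name)

    if not filenames:
        return response
    citations = ", ".join(f"[Source: {name}]" for name in filenames)
    return f"{response}\n\nSources: {citations}"
```

Unlike the context formatter, this sketch has no generic document_{i} placeholder: a result with neither key is skipped, so a fallback citation never names a file that does not exist.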
4. Enhanced Debugging (src/rag/rag_pipeline.py)
- Added detailed logging for citation validation
- Shows available sources vs detected citations for troubleshooting
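The validation and logging step can be sketched roughly as follows. The function name find_invalid_citations and the regular expression are assumptions. The logger name and message wording mirror the warnings quoted in the Problem Summary, and the "[Source: ...]" pattern follows the citation format shown in this document.

```python
import logging
import re
from typing import List

# The logger name matches the one that appears in the deployment warnings.
logger = logging.getLogger("src.rag.rag_pipeline")

# Captures the filename inside "[Source: <filename>]".
CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")


def find_invalid_citations(response: str, available_sources: List[str]) -> List[str]:
    """Return cited filenames that are not among the retrieved sources."""
    cited = [match.strip() for match in CITATION_PATTERN.findall(response)]
    valid = set(available_sources)
    invalid = [name for name in cited if name not in valid]
    if invalid:
        logger.warning("Invalid citations detected: %s", invalid)
        logger.warning("Available sources were: %s", available_sources)
    return invalid


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    find_invalid_citations(
        "PTO accrues monthly [Source: document_1.md].",
        ["pto_policy.md", "pto_policy.md", "pto_policy.md"],
    )
```

Logging the available sources next to the rejected citations makes a key mismatch like this one visible at a glance: the sources list holds real filenames while the citations hold placeholders.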
Testing
Created a comprehensive test (test_citation_fix.py) that validates:
- ✅ Correct HF citations with real filenames
- ✅ Detection of invalid generic citations
- ✅ Fallback citations using real filenames
- ✅ Backward compatibility with legacy metadata
Test Results: All validation tests passing ✅
Expected Impact
Before Fix:
Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "Document: document_1.md"
Generated citation: [Source: document_1.md] ❌
After Fix:
Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "SOURCE FILE: pto_policy.md"
Generated citation: [Source: pto_policy.md] ✅
Benefits
- Eliminates Invalid Citation Warnings - Complete resolution of the core issue
- Improves User Experience - Proper source attribution in responses
- Maintains Backward Compatibility - Still works with legacy filename metadata
- Better Debugging - Enhanced logging for future troubleshooting
- Consistent Context Format - Unified "SOURCE FILE:" format across the pipeline
Deployment
- Tested locally with comprehensive validation
- Pre-commit hooks passing
- Ready for HuggingFace Spaces deployment
- CI/CD pipeline configured for automatic deployment
Files Changed
- src/llm/context_manager.py - Core fix for metadata key handling
- src/llm/prompt_templates.py - Enhanced prompts and fallback citations
- src/rag/rag_pipeline.py - Improved debugging and validation
- test_citation_fix.py - Comprehensive validation tests
This fix addresses the fundamental issue causing invalid citations in the HuggingFace deployment and ensures reliable source attribution going forward.