# Fix: Citation Validation Issues - Context Manager Metadata Key Mismatch
## Problem Summary
The HuggingFace deployment was logging persistent invalid-citation warnings:
```
WARNING:src.rag.rag_pipeline:Invalid citations detected: ['document_1.md', 'document_2.md', 'document_3.md']
WARNING:src.rag.rag_pipeline:Available sources were: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
```
## Root Cause Analysis
The issue was a **metadata key mismatch** between document processing and context formatting:
1. **HF Document Processing** (`scripts/hf_process_documents.py`):
- Stores filenames in `metadata.source_file`
- Example: `{"source_file": "pto_policy.md"}`
2. **Context Manager** (`src/llm/context_manager.py`):
- Was only checking `metadata.filename`
- Defaulted to `f"document_{i}"` when not found
- Result: LLM saw "Document: document_1.md" instead of real filenames
3. **LLM Behavior**:
- Generated citations based on context: `[Source: document_1.md]`
- Citation validation correctly flagged these as invalid
## Solution Implemented
### 1. **Fixed Context Manager** (`src/llm/context_manager.py`)
```python
# OLD CODE (causing the issue):
filename = metadata.get("filename", f"document_{i}")
# NEW CODE (fixed):
filename = metadata.get("source_file") or metadata.get("filename", f"document_{i}")
```
- Now checks both `source_file` (HF) and `filename` (legacy) keys
- Changed format from "Document:" to "SOURCE FILE:" for consistency
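The key fallback can be sketched in isolation as follows. The helper names `resolve_filename` and `format_context`, the `content` field, and the shape of the result dictionaries are assumptions made for illustration; they are not taken from `src/llm/context_manager.py`. This sketch also chains `or` across both keys, so an empty `filename` value falls through to the placeholder, which is slightly stricter than the one-line fix above.

```python
from typing import Any


def resolve_filename(metadata: dict[str, Any], index: int) -> str:
    """Prefer `source_file` (HF), then `filename` (legacy), then a placeholder."""
    return (
        metadata.get("source_file")
        or metadata.get("filename")
        or f"document_{index}"
    )


def format_context(results: list[dict[str, Any]]) -> str:
    """Label each retrieved chunk with its filename so the LLM can cite it."""
    blocks = []
    for i, result in enumerate(results, start=1):
        metadata = result.get("metadata", {})  # hypothetical result layout
        filename = resolve_filename(metadata, i)
        blocks.append(f"SOURCE FILE: {filename}\n{result.get('content', '')}")
    return "\n\n".join(blocks)
```

With HF-style metadata such as `{"source_file": "pto_policy.md"}`, the label resolves to the real filename; the generic `document_{i}` placeholder appears only when neither key is present.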
### 2. **Enhanced System Prompt** (`src/llm/prompt_templates.py`)
- Added explicit warnings against generic document names
- Provided clear examples of correct vs incorrect citations
- Emphasized using filenames after "SOURCE FILE:" labels
### 3. **Improved Fallback Citations** (`src/llm/prompt_templates.py`)
- Updated `add_fallback_citations()` to check both metadata keys
- Ensures backup citations use real filenames
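A minimal sketch of a key-aware fallback is shown here. The real `add_fallback_citations()` signature does not appear in this document, so the function name `add_fallback_citations_sketch`, its parameters, the citation regex, and the appended format are all assumptions.

```python
import re

# Assumed citation syntax, based on the "[Source: pto_policy.md]" examples.
CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")


def add_fallback_citations_sketch(response: str, sources: list[dict]) -> str:
    """Append citations built from real filenames when the response has none."""
    if CITATION_PATTERN.search(response):
        return response  # the model already cited something; leave it alone

    filenames: list[str] = []
    for source in sources:
        metadata = source.get("metadata", {})  # hypothetical source layout
        name = metadata.get("source_file") or metadata.get("filename")
        if name and name not in filenames:  # skip missing keys, de-duplicate
            filenames.append(name)

    if not filenames:
        return response
    citations = " ".join(f"[Source: {name}]" for name in filenames)
    return f"{response} {citations}"
```

Sources that carry neither key are skipped rather than given a generic name, so a fallback citation can never reintroduce `document_N`-style labels.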
### 4. **Enhanced Debugging** (`src/rag/rag_pipeline.py`)
- Added detailed logging for citation validation
- Shows available sources vs detected citations for troubleshooting
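The validation and logging step might look roughly like the sketch below. The function name `find_invalid_citations`, the logger name, and the regex are assumptions; only the two warning messages mirror the log output quoted in the Problem Summary.

```python
import logging
import re

logger = logging.getLogger("rag_pipeline_sketch")  # hypothetical logger name

# Assumed citation syntax, based on the "[Source: pto_policy.md]" examples.
CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")


def find_invalid_citations(response: str, available_sources: list[str]) -> list[str]:
    """Return cited filenames that do not match any retrieved source."""
    cited = [match.strip() for match in CITATION_PATTERN.findall(response)]
    valid = set(available_sources)
    invalid = [name for name in cited if name not in valid]
    if invalid:
        # Log both sides so a key mismatch is visible at a glance.
        logger.warning("Invalid citations detected: %s", invalid)
        logger.warning("Available sources were: %s", available_sources)
    return invalid
```

Logging the available sources next to the rejected citations is what made the original mismatch easy to spot: the two lists shared no filenames at all.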
## Testing
Created a comprehensive test (`test_citation_fix.py`) that validates:
- ✅ Correct HF citations with real filenames
- ✅ Detection of invalid generic citations
- ✅ Fallback citations using real filenames
- ✅ Backward compatibility with legacy metadata
**Test Results:** All validation tests passing ✅
## Expected Impact
**Before Fix:**
```
Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "Document: document_1.md"
Generated citation: [Source: document_1.md] ❌
```
**After Fix:**
```
Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "SOURCE FILE: pto_policy.md"
Generated citation: [Source: pto_policy.md] ✅
```
## Benefits
1. **Eliminates Invalid Citation Warnings** - Complete resolution of the core issue
2. **Improves User Experience** - Proper source attribution in responses
3. **Maintains Backward Compatibility** - Still works with legacy `filename` metadata
4. **Better Debugging** - Enhanced logging for future troubleshooting
5. **Consistent Context Format** - Unified "SOURCE FILE:" format across the pipeline
## Deployment
- [x] Tested locally with comprehensive validation
- [x] Pre-commit hooks passing
- [x] Ready for HuggingFace Spaces deployment
- [x] CI/CD pipeline configured for automatic deployment
## Files Changed
- `src/llm/context_manager.py` - Core fix for metadata key handling
- `src/llm/prompt_templates.py` - Enhanced prompts and fallback citations
- `src/rag/rag_pipeline.py` - Improved debugging and validation
- `test_citation_fix.py` - Comprehensive validation tests
This fix addresses the fundamental issue causing invalid citations in the HuggingFace deployment and ensures reliable source attribution going forward.