Fix: Citation Validation Issues - Context Manager Metadata Key Mismatch

🎯 Problem Summary

The HuggingFace deployment was logging persistent invalid-citation warnings:

WARNING:src.rag.rag_pipeline:Invalid citations detected: ['document_1.md', 'document_2.md', 'document_3.md']
WARNING:src.rag.rag_pipeline:Available sources were: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']

🔍 Root Cause Analysis

The issue was a metadata key mismatch between document processing and context formatting:

1. HF Document Processing (scripts/hf_process_documents.py):
   - Stores filenames in metadata.source_file
   - Example: {"source_file": "pto_policy.md"}
2. Context Manager (src/llm/context_manager.py):
   - Checked only metadata.filename
   - Fell back to f"document_{i}" when that key was missing
   - Result: the LLM saw "Document: document_1.md" instead of the real filename
3. LLM Behavior:
   - Generated citations from the context it was given: [Source: document_1.md]
   - Citation validation correctly flagged these as invalid

🛠️ Solution Implemented

1. Fixed Context Manager (`src/llm/context_manager.py`)

# OLD CODE (causing the issue):
filename = metadata.get("filename", f"document_{i}")

# NEW CODE (fixed):
filename = metadata.get("source_file") or metadata.get("filename", f"document_{i}")

- Now checks both the source_file (HF) and filename (legacy) keys
- Changed the context label from "Document:" to "SOURCE FILE:" for consistency
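
To make the lookup order concrete, here is a minimal sketch of how the context could be assembled after the fix. The function names `resolve_filename` and `format_context`, the shape of the `results` list, and the `legacy_policy.md` filename are illustrative assumptions, not the actual code in `src/llm/context_manager.py`. The sketch also treats an empty `filename` value as missing, which is slightly stricter than the one-line fix above.

```python
from typing import Any, Dict, List


def resolve_filename(metadata: Dict[str, Any], index: int) -> str:
    """Return the best available filename for a retrieved chunk.

    Prefers "source_file" (written by the HF processing script), then the
    legacy "filename" key, then a generic placeholder.
    """
    return (
        metadata.get("source_file")
        or metadata.get("filename")
        or f"document_{index}"
    )


def format_context(results: List[Dict[str, Any]]) -> str:
    """Label each chunk with "SOURCE FILE:" so the LLM can cite it."""
    parts = []
    for i, result in enumerate(results, start=1):
        metadata = result.get("metadata") or {}
        filename = resolve_filename(metadata, i)
        parts.append(f"SOURCE FILE: {filename}\n{result.get('content', '')}")
    return "\n\n".join(parts)


# Hypothetical retrieval results covering the three cases.
results = [
    {"content": "Employees accrue PTO monthly.", "metadata": {"source_file": "pto_policy.md"}},
    {"content": "Legacy chunk.", "metadata": {"filename": "legacy_policy.md"}},
    {"content": "Chunk without metadata.", "metadata": {}},
]
print(format_context(results))
```

Only the third chunk, which has neither key, still falls back to a generic placeholder, so a generic name in the context now signals missing metadata rather than a key mismatch.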

2. Enhanced System Prompt (`src/llm/prompt_templates.py`)

- Added explicit warnings against generic document names
- Provided clear examples of correct vs incorrect citations
- Emphasized using the filenames that follow the "SOURCE FILE:" labels
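
As an illustration only, the prompt additions might look something like the fragment below. The wording, the `CITATION_RULES` constant, and the `build_system_prompt` helper are hypothetical; the actual text in `src/llm/prompt_templates.py` may differ.

```python
# Hypothetical prompt fragment; not the actual wording used in the project.
CITATION_RULES = """\
Citation rules:
- Cite using the exact filename shown after each "SOURCE FILE:" label.
- Never invent generic names such as document_1.md.

Correct:   [Source: pto_policy.md]
Incorrect: [Source: document_1.md]
"""


def build_system_prompt(base_prompt: str) -> str:
    """Append the citation rules to an existing system prompt."""
    return f"{base_prompt.rstrip()}\n\n{CITATION_RULES}"


# The base prompt here is a placeholder for whatever the template already contains.
print(build_system_prompt("Answer questions using only the provided policy documents."))
```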

3. Improved Fallback Citations (`src/llm/prompt_templates.py`)

- Updated add_fallback_citations() to check both metadata keys
- Ensures backup citations use real filenames
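
The sketch below shows one way a fallback step could check both keys. Only the name `add_fallback_citations` comes from this PR; the signature, the `[Source: ...]` detection regex, and the decision to skip chunks without a real filename are assumptions made for illustration.

```python
import re
from typing import Any, Dict, List

# Matches an existing citation such as "[Source: pto_policy.md]".
CITATION_PATTERN = re.compile(r"\[Source:\s*[^\]]+\]")


def add_fallback_citations(response: str, results: List[Dict[str, Any]]) -> str:
    """Append citations when the response has none (assumed signature)."""
    if CITATION_PATTERN.search(response):
        return response  # The LLM already cited something; leave it alone.

    filenames: List[str] = []
    for result in results:
        metadata = result.get("metadata") or {}
        # Check the HF key first, then the legacy key.
        name = metadata.get("source_file") or metadata.get("filename")
        if name and name not in filenames:
            filenames.append(name)

    if not filenames:
        return response  # No real filename available; do not invent one.

    citations = " ".join(f"[Source: {name}]" for name in filenames)
    return f"{response.rstrip()} {citations}"
```

Deduplicating filenames keeps the output readable when several retrieved chunks come from the same file, as in the `pto_policy.md` example from the logs.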

4. Enhanced Debugging (`src/rag/rag_pipeline.py`)

- Added detailed logging for citation validation
- Logs the available sources alongside the detected citations for troubleshooting
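
A minimal sketch of this kind of validation logging follows. The logger name and message text mirror the warnings quoted in the Problem Summary; the function name `find_invalid_citations` and the regex are assumptions, not the code in `src/rag/rag_pipeline.py`.

```python
import logging
import re
from typing import List

logger = logging.getLogger("src.rag.rag_pipeline")

# Captures the filename inside "[Source: <filename>]".
CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+?)\s*\]")


def find_invalid_citations(response: str, available_sources: List[str]) -> List[str]:
    """Return cited filenames that are not among the retrieved sources."""
    cited = CITATION_PATTERN.findall(response)
    valid = set(available_sources)
    invalid = [name for name in cited if name not in valid]
    if invalid:
        # Log both lists so a key mismatch is visible at a glance.
        logger.warning("Invalid citations detected: %s", invalid)
        logger.warning("Available sources were: %s", available_sources)
    return invalid
```

Logging the available sources next to the rejected citations is what exposed the mismatch in this case: the sources were real filenames while the citations were generic placeholders.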

🧪 Testing

Created a test (test_citation_fix.py) that validates:

✅ Correct HF citations with real filenames
✅ Detection of invalid generic citations
✅ Fallback citations using real filenames
✅ Backward compatibility with legacy metadata

Test Results: All validation tests passing ✅

📈 Expected Impact

Before Fix:

Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "Document: document_1.md"
Generated citation: [Source: document_1.md] ❌

After Fix:

Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "SOURCE FILE: pto_policy.md"
Generated citation: [Source: pto_policy.md] ✅

🎉 Benefits

- Eliminates Invalid Citation Warnings - The LLM now sees real filenames in its context, so its citations should match the available sources
- Improves User Experience - Proper source attribution in responses
- Maintains Backward Compatibility - Still works with legacy filename metadata
- Better Debugging - Enhanced logging for future troubleshooting
- Consistent Context Format - Unified "SOURCE FILE:" format across the pipeline

🔄 Deployment

- Tested locally with the validation tests above
- Pre-commit hooks passing
- Ready for HuggingFace Spaces deployment
- CI/CD pipeline configured for automatic deployment

🏷️ Files Changed

- src/llm/context_manager.py - Core fix for metadata key handling
- src/llm/prompt_templates.py - Enhanced prompts and fallback citations
- src/rag/rag_pipeline.py - Improved debugging and validation
- test_citation_fix.py - Validation tests for the fix

This fix addresses the fundamental issue causing invalid citations in the HuggingFace deployment and ensures reliable source attribution going forward.