
Fix: Citation Validation Issues - Context Manager Metadata Key Mismatch

🎯 Problem Summary

The HuggingFace deployment was logging persistent invalid-citation warnings:

WARNING:src.rag.rag_pipeline:Invalid citations detected: ['document_1.md', 'document_2.md', 'document_3.md']
WARNING:src.rag.rag_pipeline:Available sources were: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']

🔍 Root Cause Analysis

The issue was a metadata key mismatch between document processing and context formatting:

  1. HF Document Processing (scripts/hf_process_documents.py):

    • Stores filenames in metadata.source_file
    • Example: {"source_file": "pto_policy.md"}
  2. Context Manager (src/llm/context_manager.py):

    • Was only checking metadata.filename
    • Defaulted to f"document_{i}" when not found
    • Result: LLM saw "Document: document_1.md" instead of real filenames
  3. LLM Behavior:

    • Generated citations based on context: [Source: document_1.md]
    • Citation validation correctly flagged these as invalid
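
The mismatch can be reproduced in a few lines. This is a minimal sketch rather than the code from context_manager.py: `old_label` is a hypothetical helper that mirrors the original lookup, and the metadata dict follows the example above.

```python
def old_label(metadata: dict, i: int) -> str:
    # Mirrors the original lookup: only the legacy "filename" key is read.
    return metadata.get("filename", f"document_{i}")


# Metadata shaped like the HF processing output described above.
hf_metadata = {"source_file": "pto_policy.md"}

# "filename" is absent, so the lookup falls back to the generic placeholder
# and the real name stored under "source_file" never reaches the LLM context.
label = old_label(hf_metadata, 1)
print(label)
```

Because the placeholder is the only name the LLM ever sees, the citations it produces cannot match the real source list.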

🛠️ Solution Implemented

1. Fixed Context Manager (src/llm/context_manager.py)

# OLD CODE (causing the issue):
filename = metadata.get("filename", f"document_{i}")

# NEW CODE (fixed):
filename = metadata.get("source_file") or metadata.get("filename", f"document_{i}")
  • Now checks both source_file (HF) and filename (legacy) keys
  • Changed format from "Document:" to "SOURCE FILE:" for consistency
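
The lookup order and the new label can be sketched together as follows. The one-line fallback is taken from the fix above; everything else is an assumption for illustration: `resolve_filename` and `format_context` are hypothetical names, and the chunk layout (`content` plus `metadata`) and sample text are made up.

```python
def resolve_filename(metadata: dict, i: int) -> str:
    """Prefer the HF key, then the legacy key, then a generic placeholder."""
    return metadata.get("source_file") or metadata.get("filename", f"document_{i}")


def format_context(chunks: list) -> str:
    """Render each chunk under a "SOURCE FILE:" label (hypothetical layout)."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        name = resolve_filename(chunk.get("metadata", {}), i)
        parts.append(f"SOURCE FILE: {name}\n{chunk['content']}")
    return "\n\n".join(parts)


chunks = [
    {"content": "Example HF chunk.", "metadata": {"source_file": "pto_policy.md"}},
    {"content": "Example legacy chunk.", "metadata": {"filename": "handbook.md"}},
    {"content": "Example chunk without metadata."},
]
context = format_context(chunks)
print(context)
```

The placeholder is still produced when neither key is present, so chunks without metadata can still trigger the warning; the fix only removes the false positives caused by the key mismatch.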

2. Enhanced System Prompt (src/llm/prompt_templates.py)

  • Added explicit warnings against generic document names
  • Provided clear examples of correct vs incorrect citations
  • Emphasized using filenames after "SOURCE FILE:" labels

3. Improved Fallback Citations (src/llm/prompt_templates.py)

  • Updated add_fallback_citations() to check both metadata keys
  • Ensures backup citations use real filenames
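
A fallback along these lines might look like the sketch below. It is not the actual `add_fallback_citations()` implementation, whose signature is not shown here: `append_fallback_citations`, `source_names`, the chunk layout, and the rule "append only when the response has no citation" are all assumptions.

```python
import re

# Matches citations of the form [Source: some_file.md].
CITATION_RE = re.compile(r"\[Source:\s*([^\]]+)\]")


def source_names(chunks: list) -> list:
    """Collect unique real filenames, checking both metadata keys in order."""
    names = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        name = meta.get("source_file") or meta.get("filename")
        if name and name not in names:
            names.append(name)
    return names


def append_fallback_citations(response: str, chunks: list) -> str:
    """Append real-filename citations only if the response cites nothing."""
    if CITATION_RE.search(response):
        return response
    names = source_names(chunks)
    if not names:
        return response
    return response + " " + " ".join(f"[Source: {n}]" for n in names)


hf_chunks = [{"metadata": {"source_file": "pto_policy.md"}}] * 3
print(append_fallback_citations("Example answer.", hf_chunks))
```

Chunks with neither key are skipped rather than given a generic name, so the fallback never reintroduces a placeholder citation.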

4. Enhanced Debugging (src/rag/rag_pipeline.py)

  • Added detailed logging for citation validation
  • Shows available sources vs detected citations for troubleshooting
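
The validation and logging step can be approximated as below. This is a sketch under stated assumptions, not the code in rag_pipeline.py: `find_invalid_citations` and the logger name are hypothetical, and the `[Source: ...]` pattern is inferred from the citation examples in this document. The two warning messages follow the log lines quoted in the problem summary.

```python
import logging
import re

logger = logging.getLogger("rag_pipeline_sketch")  # hypothetical logger name

# Citation format inferred from the examples, e.g. [Source: pto_policy.md].
CITATION_RE = re.compile(r"\[Source:\s*([^\]]+)\]")


def find_invalid_citations(response: str, available_sources: list) -> list:
    """Return cited names that do not appear among the available sources."""
    cited = [c.strip() for c in CITATION_RE.findall(response)]
    available = set(available_sources)
    invalid = [c for c in cited if c not in available]
    if invalid:
        # Log both sides so a key mismatch is visible at a glance.
        logger.warning("Invalid citations detected: %s", invalid)
        logger.warning("Available sources were: %s", available_sources)
    return invalid


sources = ["pto_policy.md", "pto_policy.md", "pto_policy.md"]
print(find_invalid_citations("See [Source: document_1.md].", sources))
print(find_invalid_citations("See [Source: pto_policy.md].", sources))
```

Logging the available sources next to the rejected citations is what made the original mismatch easy to spot: the two lists shared no names at all.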

🧪 Testing

Added a test (test_citation_fix.py) that validates:

  • ✅ Correct HF citations with real filenames
  • ✅ Detection of invalid generic citations
  • ✅ Fallback citations using real filenames
  • ✅ Backward compatibility with legacy metadata

Test Results: All validation tests passing ✅

📈 Expected Impact

Before Fix:

Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "Document: document_1.md"
Generated citation: [Source: document_1.md] ❌

After Fix:

Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "SOURCE FILE: pto_policy.md"
Generated citation: [Source: pto_policy.md] ✅

🎉 Benefits

  1. Eliminates Invalid Citation Warnings - Removes the warnings caused by the metadata key mismatch
  2. Improves User Experience - Proper source attribution in responses
  3. Maintains Backward Compatibility - Still works with legacy filename metadata
  4. Better Debugging - Enhanced logging for future troubleshooting
  5. Consistent Context Format - Unified "SOURCE FILE:" format across the pipeline

🔄 Deployment

  • Tested locally with comprehensive validation
  • Pre-commit hooks passing
  • Ready for HuggingFace Spaces deployment
  • CI/CD pipeline configured for automatic deployment

🏷️ Files Changed

  • src/llm/context_manager.py - Core fix for metadata key handling
  • src/llm/prompt_templates.py - Enhanced prompts and fallback citations
  • src/rag/rag_pipeline.py - Improved debugging and validation
  • test_citation_fix.py - Comprehensive validation tests

This fix addresses the metadata key mismatch behind the invalid citations in the HuggingFace deployment, so that responses cite real source filenames going forward.