# Fix: Citation Validation Issues - Context Manager Metadata Key Mismatch

## 🎯 Problem Summary

The HuggingFace deployment was logging persistent invalid-citation warnings:
```
WARNING:src.rag.rag_pipeline:Invalid citations detected: ['document_1.md', 'document_2.md', 'document_3.md']
WARNING:src.rag.rag_pipeline:Available sources were: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
```

## 🔍 Root Cause Analysis

The issue was a **metadata key mismatch** between document processing and context formatting:

1. **HF Document Processing** (`scripts/hf_process_documents.py`):
   - Stores filenames in `metadata.source_file`
   - Example: `{"source_file": "pto_policy.md"}`

2. **Context Manager** (`src/llm/context_manager.py`):
   - Checked only `metadata.filename`
   - Fell back to `f"document_{i}"` when that key was missing
   - Result: the LLM saw "Document: document_1.md" instead of the real filename

3. **LLM Behavior**:
   - Generated citations based on context: `[Source: document_1.md]`
   - Citation validation correctly flagged these as invalid
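
A mismatch like this can be confirmed by counting which metadata keys the stored chunks actually carry. The following is a minimal diagnostic sketch, not code from the repository: the helper name `metadata_key_counts`, the chunk layout, and the `chunk_index` field are assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Dict, List


def metadata_key_counts(chunks: List[Dict[str, Any]]) -> Counter:
    """Count how often each metadata key appears across chunks."""
    counts: Counter = Counter()
    for chunk in chunks:
        counts.update(chunk.get("metadata", {}).keys())
    return counts


# Hypothetical chunks shaped like the HF processing output described above.
chunks = [
    {"content": "...", "metadata": {"source_file": "pto_policy.md", "chunk_index": 0}},
    {"content": "...", "metadata": {"source_file": "pto_policy.md", "chunk_index": 1}},
]

counts = metadata_key_counts(chunks)
# A Counter returns 0 for keys it has never seen, so a missing
# `filename` key shows up as a zero count rather than a KeyError.
print("source_file:", counts["source_file"], "filename:", counts["filename"])
```

If `filename` never appears while `source_file` appears on every chunk, any lookup that reads only `filename` will always take its default.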

## 🛠️ Solution Implemented

### 1. **Fixed Context Manager** (`src/llm/context_manager.py`)
```python
# OLD CODE (causing the issue):
filename = metadata.get("filename", f"document_{i}")

# NEW CODE (fixed):
filename = metadata.get("source_file") or metadata.get("filename", f"document_{i}")
```

- Checks `source_file` (HF) first, then `filename` (legacy), and only then falls back to the generic name
- Changed the context label from "Document:" to "SOURCE FILE:" for consistency
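
To show how the lookup and the new label fit together, here is a minimal sketch of a context formatter. It assumes each search result is a dict with `content` and `metadata` keys; the function names `resolve_filename` and `format_context` are hypothetical and not taken from `context_manager.py`. Unlike the one-line fix above, this sketch chains `or` across all three options, so an empty `filename` value also falls through to the placeholder.

```python
from typing import Any, Dict, List


def resolve_filename(metadata: Dict[str, Any], index: int) -> str:
    """Prefer the HF key, then the legacy key, then a generic placeholder."""
    return (
        metadata.get("source_file")
        or metadata.get("filename")
        or f"document_{index}"
    )


def format_context(results: List[Dict[str, Any]]) -> str:
    """Build the context string, labelling each chunk with its source file."""
    parts = []
    for i, result in enumerate(results, start=1):
        filename = resolve_filename(result.get("metadata", {}), i)
        parts.append(f"SOURCE FILE: {filename}\n{result.get('content', '')}")
    return "\n\n".join(parts)


# Hypothetical results: one HF-style, one legacy, one with no filename at all.
results = [
    {"content": "PTO accrues monthly.", "metadata": {"source_file": "pto_policy.md"}},
    {"content": "Legacy chunk.", "metadata": {"filename": "legacy_policy.md"}},
    {"content": "Unlabelled chunk.", "metadata": {}},
]
print(format_context(results))
```

Only the third result, which carries neither key, ends up with the generic `document_3` label.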

### 2. **Enhanced System Prompt** (`src/llm/prompt_templates.py`)
- Added explicit warnings against generic document names
- Provided clear examples of correct vs incorrect citations
- Emphasized using filenames after "SOURCE FILE:" labels

### 3. **Improved Fallback Citations** (`src/llm/prompt_templates.py`)
- Updated `add_fallback_citations()` to check both metadata keys
- Ensures backup citations use real filenames
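
The fallback behaviour can be sketched as follows. The real signature of `add_fallback_citations()` is not shown in this summary, so the function below uses a different, hypothetical name (`append_fallback_citations`) and assumed arguments; the `[Source: ...]` format matches the citations quoted earlier.

```python
import re
from typing import Any, Dict, List

# Matches citations of the form [Source: some_file.md].
CITATION_RE = re.compile(r"\[Source:\s*([^\]]+)\]")


def append_fallback_citations(response: str, results: List[Dict[str, Any]]) -> str:
    """Append citations built from real filenames when the response has none."""
    if CITATION_RE.search(response):
        return response  # The model already cited something; leave it alone.

    filenames: List[str] = []
    for result in results:
        metadata = result.get("metadata", {})
        # Check the HF key first, then the legacy key (assumed order).
        name = metadata.get("source_file") or metadata.get("filename")
        if name and name not in filenames:
            filenames.append(name)  # De-duplicate while keeping order.

    if not filenames:
        return response
    return response + " " + " ".join(f"[Source: {name}]" for name in filenames)


# Hypothetical results: two chunks from one HF file plus one legacy chunk.
results = [
    {"metadata": {"source_file": "pto_policy.md"}},
    {"metadata": {"source_file": "pto_policy.md"}},
    {"metadata": {"filename": "handbook.md"}},
]
print(append_fallback_citations("PTO accrues monthly.", results))
```

Chunks with neither key are skipped rather than cited as `document_N`, so the fallback cannot reintroduce the generic names that validation rejects.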

### 4. **Enhanced Debugging** (`src/rag/rag_pipeline.py`)
- Added detailed logging for citation validation
- Shows available sources vs detected citations for troubleshooting
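
A validation step that produces warnings in the same shape as the log lines in the Problem Summary might look like the sketch below. The function name `find_invalid_citations`, the logger name, and the regular expression are assumptions for illustration, not the code in `rag_pipeline.py`.

```python
import logging
import re
from typing import List

logger = logging.getLogger("citation_validation_sketch")

# Matches citations of the form [Source: some_file.md].
CITATION_RE = re.compile(r"\[Source:\s*([^\]]+)\]")


def find_invalid_citations(response: str, available_sources: List[str]) -> List[str]:
    """Return cited filenames that are not among the available sources."""
    cited = [match.strip() for match in CITATION_RE.findall(response)]
    valid = set(available_sources)
    invalid = [name for name in cited if name not in valid]
    if invalid:
        # Log both sides so a key mismatch is visible at a glance.
        logger.warning("Invalid citations detected: %s", invalid)
        logger.warning("Available sources were: %s", available_sources)
    return invalid


sources = ["pto_policy.md", "pto_policy.md", "pto_policy.md"]
bad = find_invalid_citations("See [Source: document_1.md].", sources)
good = find_invalid_citations("See [Source: pto_policy.md].", sources)
```

Logging the available sources next to the rejected citations is what made the original mismatch easy to spot: the two lists shared no names at all.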

## 🧪 Testing

Created a test (`test_citation_fix.py`) that validates:
- ✅ Correct HF citations with real filenames
- ✅ Detection of invalid generic citations
- ✅ Fallback citations using real filenames
- ✅ Backward compatibility with legacy metadata

**Test Results:** All validation tests passing ✅

## 📈 Expected Impact

**Before Fix:**
```
Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "Document: document_1.md"
Generated citation: [Source: document_1.md] ❌
```

**After Fix:**
```
Available sources: ['pto_policy.md', 'pto_policy.md', 'pto_policy.md']
LLM sees context: "SOURCE FILE: pto_policy.md"
Generated citation: [Source: pto_policy.md] ✅
```

## 🎉 Benefits

1. **Eliminates Invalid Citation Warnings** - Removes the warnings caused by the `source_file`/`filename` key mismatch
2. **Improves User Experience** - Proper source attribution in responses
3. **Maintains Backward Compatibility** - Still works with legacy `filename` metadata
4. **Better Debugging** - Enhanced logging for future troubleshooting
5. **Consistent Context Format** - Unified "SOURCE FILE:" format across the pipeline

## 🔄 Deployment

- [x] Tested locally with comprehensive validation
- [x] Pre-commit hooks passing
- [x] Ready for HuggingFace Spaces deployment
- [x] CI/CD pipeline configured for automatic deployment

## 🏷️ Files Changed

- `src/llm/context_manager.py` - Core fix for metadata key handling
- `src/llm/prompt_templates.py` - Enhanced prompts and fallback citations
- `src/rag/rag_pipeline.py` - Improved debugging and validation
- `test_citation_fix.py` - Comprehensive validation tests

This fix resolves the metadata key mismatch that caused invalid citations in the HuggingFace deployment, so generated citations should now reference the real source filenames.