Spaces:
Sleeping
Sleeping
π§ Source Citation Fix - DEPLOYED β
π Issue Identified and Fixed
Problem: UNKNOWN Source Files in UI
When users asked questions and the model provided responses, the source citations showed "UNKNOWN" instead of the actual policy filename (e.g., remote_work_policy.md).
Root Cause: Metadata Key Mismatch
- HF Document Processing: Stored filename as
'source_file'key in metadata - RAG Pipeline: Was looking for
'filename'key in metadata - Result:
metadata.get("filename", "unknown")always returned "unknown"
β Fix Applied
1. Updated RAG Pipeline Source Formatting
# OLD (broken):
"document": metadata.get("filename", "unknown")
# NEW (fixed):
source_filename = metadata.get("source_file") or metadata.get("filename", "unknown")
"document": source_filename
2. Updated Citation Validation Logic
# OLD (broken):
available_sources = [result.get("metadata", {}).get("filename", "") for result in search_results]
# NEW (fixed):
available_sources = [
result.get("metadata", {}).get("source_file") or result.get("metadata", {}).get("filename", "")
for result in search_results
]
3. Backwards Compatibility
- Checks
'source_file'first (HF processing format) - Falls back to
'filename'(legacy format) - Finally defaults to "unknown" if neither exists
π Expected Results After Rebuild (2-3 minutes)
β Before (BROKEN):
{
"sources": [
{
"document": "UNKNOWN",
"relevance_score": 0.85,
"excerpt": "Employees may work remotely up to 3 days..."
}
]
}
β After (FIXED):
{
"sources": [
{
"document": "remote_work_policy.md",
"relevance_score": 0.85,
"excerpt": "Employees may work remotely up to 3 days..."
}
]
}
π― Example User Experience
User Question: "What is our remote work policy?"
Model Response:
"Based on our remote work policy, employees may work remotely up to 3 days per week with manager approval..."
Sources (NOW SHOWING CORRECTLY):
- π remote_work_policy.md (Relevance: 95%)
- π employee_handbook.md (Relevance: 78%)
- π workplace_safety_guidelines.md (Relevance: 65%)
π Metadata Flow Confirmed
1. Document Processing:
metadata = {
'source_file': policy_file.name, # e.g., "remote_work_policy.md"
'chunk_id': chunk['metadata'].get('chunk_id', ''),
'chunk_index': chunk['metadata'].get('chunk_index', 0),
'content_hash': hashlib.md5(chunk['content'].encode()).hexdigest()
}
2. Vector Storage: HF Dataset stores metadata with each embedding
3. Search Results: Vector search returns metadata with each result
4. RAG Response: Now correctly extracts 'source_file' from metadata
5. UI Display: Shows actual policy filenames instead of "UNKNOWN"
π STATUS: DEPLOYED AND FIXED
Commit: facda33 - "fix: Correct source file metadata lookup in RAG pipeline"
Expected: Proper source file names in UI citations
Result: Users will see actual policy filenames in source citations
π Your UI will now properly show which policy documents are being referenced!