
# P0 Bug Report: Orchestrator Dedup + Judge Failures

## Status

- **Date:** 2025-11-29
- **Priority:** P0 (Blocker - Simple mode broken on HF Spaces)
- **Component:** `src/orchestrator.py`, `src/agent_factory/judges.py`
- **Resolution:** FIXED in commits `5e761eb`, `2588375`

## Symptoms

When running Simple mode (free tier) on HuggingFace Spaces:

1. Judge always returns 0% confidence → loops forever with "continue"
2. Deduplication removes ALL evidence after iteration 1
3. Never synthesizes → user sees an infinite loop

### Example Output

```text
📚 SEARCH_COMPLETE: Found 20 new sources (19 total)       ← Iteration 1 OK
✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%)  ← FAIL: 0% = fallback

📚 SEARCH_COMPLETE: Found 12 new sources (11 total)       ← Iteration 2 BROKEN
...
📚 SEARCH_COMPLETE: Found 31 new sources (0 total)        ← 0 TOTAL = all removed!
✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%)  ← Still failing
```

## Root Cause Analysis

### Bug 1: Semantic Deduplication Removes Old Evidence

**File:** `src/orchestrator.py:213-219`

```python
# URL dedup (correct): skip new items whose URL is already collected
seen_urls = {e.citation.url for e in all_evidence}
unique_new = [e for e in new_evidence if e.citation.url not in seen_urls]
all_evidence.extend(unique_new)

# BUG: passes ALL evidence (including old items) to semantic dedup
all_evidence = await self._deduplicate_and_rank(all_evidence, query)
```

**Problem:** the `deduplicate()` function checks each item against the vector store. Items from iteration 1 are ALREADY in the store. When re-checked in iteration 2+, they find THEMSELVES (distance ≈ 0) and are removed as "duplicates".

**Result:** after iteration 1, the evidence count drops to 0.
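The failure mode, as a minimal sketch (the store API here is hypothetical, not the actual vector-store wrapper in `src/orchestrator.py`):

```python
# Hypothetical sketch: `store.nearest`/`store.add`, `embed`, and
# `item.content` are illustrative stand-ins, not the project's real API.

DUP_THRESHOLD = 0.05  # distances below this count as "duplicate"

def deduplicate(items, store, embed):
    kept = []
    for item in items:
        emb = embed(item.content)
        match = store.nearest(emb)  # closest stored item, or None if empty
        if match is not None and match.distance < DUP_THRESHOLD:
            continue  # dropped as a semantic duplicate
        store.add(item.id, emb)  # the item is now IN the store
        kept.append(item)
    return kept

# Iteration 1: store is empty, so every item is kept and inserted.
# Iteration 2: all_evidence (including iteration-1 items) is passed back in.
# Each old item matches its OWN stored embedding at distance ~0 and is
# dropped -- the evidence count collapses to 0.
```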

### Bug 2: HF Inference Judge Always Failing

**File:** `src/agent_factory/judges.py:186-254`

**Evidence:** the judge returns the same assessment every time:

- `confidence: 0.0`
- `recommendation: "continue"`
- Next queries are just the original query with suffixes

This is the `_create_fallback_assessment()` response (sketched below), meaning:

- The HF Inference API calls are failing
- All 3 fallback models (Llama, Mistral, Zephyr) are failing
- Likely due to rate limits, quota, or model availability
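For reference, the fallback presumably looks roughly like this (a sketch: field values are inferred from the logged output, while the exact `JudgeAssessment` fields and the suffix wording are assumptions):

```python
def _create_fallback_assessment(self, query: str) -> JudgeAssessment:
    # Returned whenever every HF Inference call has failed.
    return JudgeAssessment(
        sufficient=False,             # never stops the loop
        confidence=0.0,               # surfaces as "confidence: 0%" in logs
        recommendation="continue",
        next_queries=[                # original query plus canned suffixes
            f"{query} mechanism",     # suffix wording is a guess
            f"{query} evidence",
        ],
    )
```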

## The Fix

### Fix 1: Only Dedup NEW Evidence (Not `all_evidence`)

```python
# Before (broken): old evidence is re-checked and matches itself
all_evidence.extend(unique_new)
all_evidence = await self._deduplicate_and_rank(all_evidence, query)

# After (fixed): only dedup the NEW evidence against the store
if unique_new:
    unique_new = await self._deduplicate_new_evidence(unique_new, query)
all_evidence.extend(unique_new)
```
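A plausible shape for `_deduplicate_new_evidence` (a sketch under the same hypothetical store API as above; the real method in `src/orchestrator.py` may differ):

```python
async def _deduplicate_new_evidence(self, new_evidence, query):
    """Check ONLY new items against the store; never re-check old evidence."""
    kept = []
    for item in new_evidence:
        emb = await self._embed(item.content)  # hypothetical embedding helper
        match = self._store.nearest(emb)       # hypothetical store lookup
        if match is not None and match.distance < self._dup_threshold:
            continue  # genuinely duplicates something from a prior iteration
        self._store.add(item.citation.url, emb)  # insert only NEW items
        kept.append(item)
    return kept
```

Because already-accumulated evidence is never passed back through the store, nothing can match its own embedding.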

Or, simpler, disable semantic dedup until it is fixed properly:

```python
# Disable broken semantic dedup
# all_evidence = await self._deduplicate_and_rank(all_evidence, query)
```

### Fix 2: Handle HF Inference Failures Gracefully

Three options:

- **Option A:** after N failed judge calls, force synthesis with the available evidence (snippet below)
- **Option B:** increase the retry count or add a longer backoff
- **Option C:** fall back to `MockJudgeHandler` (which DOES work) after failures (sketch after the Option A snippet)

```python
# In _create_fallback_assessment, track consecutive failures
if self._consecutive_failures >= 3:
    # Force synthesis instead of an infinite loop
    return JudgeAssessment(
        sufficient=True,  # STOP
        confidence=0.1,
        recommendation="synthesize",
        # ... remaining fields as in the normal assessment path
    )
```
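Option C could look like the following sketch (the wrapper class is illustrative; only `MockJudgeHandler` is an existing name from the codebase, and its `assess` signature is assumed):

```python
class ResilientJudge:
    """Delegate to the HF judge, but switch to MockJudgeHandler after
    repeated failures so the loop can still reach synthesis."""

    def __init__(self, primary, max_failures: int = 3):
        self.primary = primary
        self.fallback = MockJudgeHandler()  # known-working judge
        self.max_failures = max_failures
        self.failures = 0

    async def assess(self, evidence, query):
        if self.failures >= self.max_failures:
            return await self.fallback.assess(evidence, query)
        try:
            result = await self.primary.assess(evidence, query)
            self.failures = 0  # reset on success
            return result
        except Exception:
            self.failures += 1
            return await self.fallback.assess(evidence, query)
```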

## Test Plan

- Disable semantic dedup OR fix it to only process new items
- Verify evidence accumulates across iterations (does not drop to 0); see the test sketch below
- Test HF Inference with a fresh `HF_TOKEN`
- If HF keeps failing, fall back to `MockJudgeHandler`
- Verify that "synthesize" is eventually reached
- Deploy and test on the HF Space
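A minimal regression test for the accumulation check might look like this (the `Evidence`/`Citation` stand-ins are simplified versions of the real models):

```python
from dataclasses import dataclass

@dataclass
class Citation:
    url: str

@dataclass
class Evidence:
    citation: Citation

def url_dedup(all_evidence, new_evidence):
    # Mirrors the URL-dedup step in the orchestrator loop
    seen = {e.citation.url for e in all_evidence}
    return [e for e in new_evidence if e.citation.url not in seen]

def test_evidence_accumulates_across_iterations():
    all_evidence = [Evidence(Citation("https://a")), Evidence(Citation("https://b"))]
    batch2 = [Evidence(Citation("https://b")), Evidence(Citation("https://c"))]
    all_evidence.extend(url_dedup(all_evidence, batch2))
    # The overlap is dropped, the new source is kept: 2 -> 3, never -> 0.
    assert [e.citation.url for e in all_evidence] == ["https://a", "https://b", "https://c"]
```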

## Priority Justification

P0 because:

- Simple mode (free tier) is the DEFAULT experience
- It currently produces an infinite loop with no output
- Users see "confidence: 0%" and think the tool is broken
- It blocks the hackathon demo for users without API keys

## Quick Workaround

Disable semantic dedup by setting `enable_embeddings=False` when creating the orchestrator:

```python
orchestrator = create_orchestrator(
    ...
    enable_embeddings=False,  # Disable broken dedup
)
```

Alternatively, users can enter an OpenAI/Anthropic API key to bypass the HF Inference issues.