
P0 CRITICAL BUGS - Why DeepCritical Produces Garbage Results

Date: November 27, 2025
Status: CRITICAL - App is functionally useless
Severity: P0 (Blocker)

TL;DR

The app produces garbage because:

  1. BioRxiv search doesn't work - returns random papers
  2. Free tier LLM is too dumb - can't identify drugs
  3. Query construction is naive - no optimization for PubMed/CT.gov syntax
  4. Loop terminates too early - 5 iterations isn't enough

P0-001: BioRxiv Search is Fundamentally Broken

File: src/tools/biorxiv.py:248-286

The Problem: The bioRxiv API DOES NOT SUPPORT KEYWORD SEARCH.

The code does this:

# Fetch recent papers (last 90 days, first 100 papers)
url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"
# Then filter client-side for keywords

What Actually Happens:

  1. Fetches the first 100 papers from medRxiv in the last 90 days (chronological order)
  2. Filters those 100 random papers for query keywords
  3. Returns whatever garbage matches

Result: For "Long COVID medications", you get random papers like:

  • "Calf muscle structure-function adaptations"
  • "Work-Life Balance of Ophthalmologists During COVID"

These papers contain "COVID" somewhere but have NOTHING to do with Long COVID treatments.

Root Cause: The /0/json pagination only returns 100 papers. You'd need to paginate through ALL papers (thousands) to do proper keyword filtering.

Fix Options:

  1. Remove BioRxiv entirely - It's unusable without proper search API
  2. Use a different preprint aggregator - Europe PMC has preprints WITH search
  3. Add pagination - Fetch all papers (slow, expensive)
  4. Use Semantic Scholar API - Has preprints and proper search
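Fix Option 2 can be sketched against the public Europe PMC REST API, which does server-side keyword matching and indexes preprints (the `SRC:PPR` filter). The `search_preprints` helper name is illustrative, not part of the codebase:

```python
"""Sketch of Fix Option 2: query Europe PMC instead of bioRxiv."""
import json
import urllib.parse
import urllib.request

EUROPE_PMC_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_preprint_query(keywords: str) -> dict:
    # SRC:PPR restricts results to preprints; keywords are matched
    # server-side, unlike bioRxiv's chronological date-window dump.
    return {
        "query": f"({keywords}) AND SRC:PPR",
        "format": "json",
        "pageSize": "25",
    }


def search_preprints(keywords: str) -> list[dict]:
    params = urllib.parse.urlencode(build_preprint_query(keywords))
    with urllib.request.urlopen(f"{EUROPE_PMC_URL}?{params}") as resp:
        payload = json.load(resp)
    return payload.get("resultList", {}).get("result", [])
```

Because relevance filtering happens on the server, "Long COVID medications" would hit the full preprint corpus rather than the first 100 papers of the last 90 days.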

P0-002: Free Tier LLM Cannot Perform Drug Identification

File: src/agent_factory/judges.py:153-211

The Problem: Without an API key, the app uses HFInferenceJudgeHandler with:

  • Llama 3.1 8B Instruct
  • Mistral 7B Instruct

These are 7-8 billion parameter models. They cannot:

  • Reliably parse complex biomedical abstracts
  • Identify drug candidates from scientific text
  • Generate structured JSON output consistently
  • Reason about mechanism of action

Evidence of Failure:

# From MockJudgeHandler - the honest fallback when LLM fails
drug_candidates=[
    "Drug identification requires AI analysis",
    "Enter API key above for full results",
]

The team KNEW the free tier can't identify drugs and added this message.

Root Cause: Drug repurposing requires understanding:

  • Drug mechanisms
  • Disease pathophysiology
  • Clinical trial phases
  • Statistical significance

This requires GPT-4 / Claude Sonnet class models (100B+ parameters).

Fix Options:

  1. Require API key - No free tier, be honest
  2. Use larger HF models - Llama 70B or Mixtral 8x7B (expensive on free tier)
  3. Hybrid approach - Use free tier for search, require paid for synthesis

P0-003: PubMed Query Not Optimized

File: src/tools/pubmed.py:54-71

The Problem: The query is passed directly to PubMed without optimization:

search_params = self._build_params(
    db="pubmed",
    term=query,  # Raw user query!
    retmax=max_results,
    sort="relevance",
)

What User Enters: "What medications show promise for Long COVID?"

What PubMed Receives: What medications show promise for Long COVID?

What PubMed Should Receive:

("long covid"[Title/Abstract] OR "post-COVID"[Title/Abstract] OR "PASC"[Title/Abstract])
AND (drug[Title/Abstract] OR treatment[Title/Abstract] OR medication[Title/Abstract] OR therapy[Title/Abstract])
AND (clinical trial[Publication Type] OR randomized[Title/Abstract])

Root Cause: No query preprocessing or medical term expansion.

Fix Options:

  1. Add query preprocessor - Extract medical entities, expand synonyms
  2. Use MeSH terms - PubMed's controlled vocabulary for better recall
  3. LLM query generation - Use LLM to generate optimized PubMed query
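A minimal version of Fix Option 1 is a rule-based query builder that produces the fielded syntax shown above. The synonym table and `build_pubmed_query` helper are illustrative; a real version would draw on MeSH/UMLS:

```python
DISEASE_SYNONYMS = {
    "long covid": ['"long covid"', '"post-COVID"', '"PASC"'],
}
INTERVENTION_TERMS = ["drug", "treatment", "medication", "therapy"]


def build_pubmed_query(disease: str) -> str:
    """Turn a disease name into a fielded PubMed boolean query."""
    synonyms = DISEASE_SYNONYMS.get(disease.lower(), [f'"{disease}"'])
    disease_clause = " OR ".join(f"{s}[Title/Abstract]" for s in synonyms)
    intervention_clause = " OR ".join(
        f"{t}[Title/Abstract]" for t in INTERVENTION_TERMS
    )
    return f"({disease_clause}) AND ({intervention_clause})"
```

The result of `build_pubmed_query("Long COVID")` would be passed as `term=` instead of the raw user question.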

P0-004: Loop Terminates Too Early

File: src/app.py:42-45 and src/utils/models.py

The Problem:

config = OrchestratorConfig(
    max_iterations=5,
    max_results_per_tool=10,
)

5 iterations is not enough to:

  1. Search multiple variations of the query
  2. Gather enough evidence for the Judge to synthesize
  3. Refine queries based on initial results

Evidence: The user's output shows "Max Iterations Reached" with only 6 sources.

Root Cause: Conservative defaults chosen to avoid API costs, but they make the app useless.

Fix Options:

  1. Increase default to 10-15 - More iterations = better results
  2. Dynamic termination - Stop when confidence > threshold, not iteration count
  3. Parallel query expansion - Run more queries per iteration
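Fix Option 2 (dynamic termination) can be sketched as a loop that stops on judge confidence rather than a fixed cap. `run_iteration` and the 0.8 threshold are illustrative placeholders:

```python
def research_loop(run_iteration, max_iterations=15, confidence_threshold=0.8):
    """Iterate until the judge is confident, with the cap as a backstop."""
    evidence = []
    for i in range(max_iterations):
        result = run_iteration(evidence)  # one search + judge round
        evidence.extend(result["new_evidence"])
        if result["confidence"] >= confidence_threshold:
            return {"status": "confident", "iterations": i + 1,
                    "evidence": evidence}
    return {"status": "max_iterations", "iterations": max_iterations,
            "evidence": evidence}
```

This makes the iteration cap a safety valve instead of the normal exit path, so "Max Iterations Reached" becomes an error signal rather than the expected outcome.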

P0-005: No Query Understanding Layer

Files: src/orchestrator.py, src/tools/search_handler.py

The Problem: There's no NLU (Natural Language Understanding) layer. The system:

  1. Takes raw user query
  2. Passes directly to search tools
  3. No entity extraction
  4. No intent classification
  5. No query expansion

For drug repurposing, you need to extract:

  • Disease: "Long COVID" → [Long COVID, PASC, Post-COVID syndrome, chronic COVID]
  • Drug intent: "medications" → [drugs, treatments, therapeutics, interventions]
  • Evidence type: "show promise" → [clinical trials, efficacy, RCT]

Root Cause: No preprocessing pipeline between user input and search execution.

Fix Options:

  1. Add entity extraction - Use BioBERT or PubMedBERT for medical NER
  2. Add query expansion - Use medical ontologies (UMLS, MeSH)
  3. LLM preprocessing - Use LLM to generate search strategy before searching
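Even without an LLM or NER model, a crude preprocessing step can spot the phrases above and expand them. The tables are illustrative stand-ins for MeSH/UMLS lookups:

```python
EXPANSIONS = {
    "long covid": ["Long COVID", "PASC", "post-COVID syndrome",
                   "chronic COVID"],
    "medications": ["drugs", "treatments", "therapeutics", "interventions"],
    "show promise": ["clinical trial", "efficacy", "RCT"],
}


def expand_query(user_query: str) -> dict[str, list[str]]:
    """Map each recognized phrase in the query to its expansion list."""
    q = user_query.lower()
    return {phrase: terms for phrase, terms in EXPANSIONS.items()
            if phrase in q}
```

The expanded terms then feed the per-tool query builders instead of the raw question.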

P0-006: ClinicalTrials.gov Results Not Filtered

File: src/tools/clinicaltrials.py

The Problem: ClinicalTrials.gov returns ALL matching trials including:

  • Withdrawn trials
  • Terminated trials
  • Not yet recruiting
  • Observational studies (not interventional)

For drug repurposing, you want:

  • Interventional studies
  • Phase 2+ (has safety/efficacy data)
  • Completed or with results

Root Cause: No filtering of trial metadata.
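The missing filter can be sketched as a predicate over the study records. The field paths follow the ClinicalTrials.gov API v2 JSON shape (`protocolSection.statusModule`, `designModule`), but treat them as assumptions to verify against the live API:

```python
GOOD_STATUSES = {"COMPLETED"}
GOOD_PHASES = {"PHASE2", "PHASE3", "PHASE4"}


def keep_trial(study: dict) -> bool:
    """Keep interventional, phase 2+ trials that finished or have results."""
    protocol = study.get("protocolSection", {})
    design = protocol.get("designModule", {})
    status = protocol.get("statusModule", {}).get("overallStatus", "")
    phases = set(design.get("phases", []))
    has_results = study.get("hasResults", False)
    return (
        design.get("studyType") == "INTERVENTIONAL"
        and (status in GOOD_STATUSES or has_results)
        and bool(phases & GOOD_PHASES)
    )
```

Applied as `[s for s in studies if keep_trial(s)]`, this drops withdrawn, terminated, not-yet-recruiting, and observational entries before the Judge ever sees them.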


Summary: Why This App Produces Garbage

User Query: "What medications show promise for Long COVID?"
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ NO QUERY PREPROCESSING                                       │
│ - No entity extraction                                       │
│ - No synonym expansion                                       │
│ - No medical term normalization                              │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ BROKEN SEARCH LAYER                                          │
│ - PubMed: Raw query, no MeSH, gets 1 result                  │
│ - BioRxiv: Returns random papers (API doesn't support search)│
│ - ClinicalTrials: Returns all trials, no filtering           │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ GARBAGE EVIDENCE                                             │
│ - 6 papers, most irrelevant                                  │
│ - "Calf muscle adaptations" (mentions COVID once)            │
│ - "Ophthalmologist work-life balance"                        │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ DUMB JUDGE (Free Tier)                                       │
│ - Llama 8B can't identify drugs from garbage                 │
│ - JSON parsing fails                                         │
│ - Falls back to "Drug identification requires AI analysis"   │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ LOOP HITS MAX (5 iterations)                                 │
│ - Never finds enough good evidence                           │
│ - Never synthesizes anything useful                          │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
    GARBAGE OUTPUT

What Would Make This Actually Work

Minimum Viable Fix (1-2 days)

  1. Remove BioRxiv - It doesn't work
  2. Require API key - Be honest that free tier is useless
  3. Add basic query preprocessing - Strip question words, expand COVID synonyms
  4. Increase iterations to 10

Proper Fix (1-2 weeks)

  1. Query Understanding Layer

    • Medical NER (BioBERT/SciBERT)
    • Query expansion with MeSH/UMLS
    • Intent classification (drug discovery vs mechanism vs safety)
  2. Optimized Search

    • PubMed: Proper query syntax with MeSH terms
    • ClinicalTrials: Filter by phase, status, intervention type
    • Replace BioRxiv with Europe PMC (has preprints + search)
  3. Evidence Ranking

    • Score by publication type (RCT > cohort > case report)
    • Score by journal impact factor
    • Score by recency
    • Score by citation count
  4. Proper LLM Pipeline

    • Use GPT-4 / Claude for synthesis
    • Structured extraction of: drug, mechanism, evidence level, effect size
    • Multi-step reasoning: identify → validate → rank → synthesize
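Item 3 above (Evidence Ranking) can be sketched as a simple weighted score; the weights, cutoffs, and field names are illustrative, not tuned values:

```python
TYPE_WEIGHTS = {"rct": 3.0, "cohort": 2.0, "case_report": 1.0}


def score_evidence(paper: dict, current_year: int = 2025) -> float:
    """Combine publication type, recency, and citations into one score."""
    type_score = TYPE_WEIGHTS.get(paper.get("pub_type", ""), 0.5)
    # Linear decay over 10 years, floored at 0.
    recency = max(0.0, 1.0 - (current_year - paper.get("year", current_year)) / 10)
    # Cap citation credit at 100 citations.
    citations = min(paper.get("citations", 0) / 100, 1.0)
    return type_score + recency + citations
```

Sorting the evidence pool by this score before synthesis would push "Calf muscle adaptations"-style noise to the bottom even when search recall is imperfect.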

The Hard Truth

Building a drug repurposing agent that works is HARD. The state of the art is:

  • Drug2Disease (IBM) - Uses knowledge graphs + ML
  • COVID-KG (Stanford) - Dedicated COVID knowledge graph
  • Literature Mining at scale (PubMed) - Millions of papers, not 10

This hackathon project is fundamentally a search wrapper with an LLM prompt. That's not enough.

To make it useful:

  1. Either scope it down (e.g., "find clinical trials for X disease")
  2. Or invest serious engineering in the NLU + search + ranking pipeline