P0 CRITICAL BUGS - Why DeepCritical Produces Garbage Results
Date: November 27, 2025
Status: CRITICAL - App is functionally useless
Severity: P0 (Blocker)
TL;DR
The app produces garbage because:
- BioRxiv search doesn't work - returns random papers
- Free tier LLM is too dumb - can't identify drugs
- Query construction is naive - no optimization for PubMed/CT.gov syntax
- Loop terminates too early - 5 iterations isn't enough
P0-001: BioRxiv Search is Fundamentally Broken
File: src/tools/biorxiv.py:248-286
The Problem: The bioRxiv API DOES NOT SUPPORT KEYWORD SEARCH.
The code does this:
# Fetch recent papers (last 90 days, first 100 papers)
url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"
# Then filter client-side for keywords
What Actually Happens:
- Fetches the first 100 papers from medRxiv in the last 90 days (chronological order)
- Filters those 100 random papers for query keywords
- Returns whatever garbage matches
Result: For "Long COVID medications", you get random papers like:
- "Calf muscle structure-function adaptations"
- "Work-Life Balance of Ophthalmologists During COVID"
These papers contain "COVID" somewhere but have NOTHING to do with Long COVID treatments.
Root Cause: The /0/json pagination only returns 100 papers. You'd need to paginate through ALL papers (thousands) to do proper keyword filtering.
Fix Options:
- Remove BioRxiv entirely - It's unusable without proper search API
- Use a different preprint aggregator - Europe PMC has preprints WITH search
- Add pagination - Fetch all papers (slow, expensive)
- Use Semantic Scholar API - Has preprints and proper search
P0-002: Free Tier LLM Cannot Perform Drug Identification
File: src/agent_factory/judges.py:153-211
The Problem:
Without an API key, the app uses HFInferenceJudgeHandler with:
- Llama 3.1 8B Instruct
- Mistral 7B Instruct
These are 7-8 billion parameter models. They cannot:
- Reliably parse complex biomedical abstracts
- Identify drug candidates from scientific text
- Generate structured JSON output consistently
- Reason about mechanism of action
Evidence of Failure:
# From MockJudgeHandler - the honest fallback when the LLM fails
drug_candidates=[
    "Drug identification requires AI analysis",
    "Enter API key above for full results",
]
The team KNEW the free tier can't identify drugs and added this message.
Root Cause: Drug repurposing requires understanding:
- Drug mechanisms
- Disease pathophysiology
- Clinical trial phases
- Statistical significance
This requires GPT-4 / Claude Sonnet class models (100B+ parameters).
Fix Options:
- Require API key - No free tier, be honest
- Use larger HF models - Llama 70B or Mixtral 8x7B (expensive on free tier)
- Hybrid approach - Use free tier for search, require paid for synthesis
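The hybrid option can be made explicit with a small routing layer: cheap free-tier models for query refinement, a hard requirement on a paid key for synthesis. Class, method, and model names below are illustrative, not the repo's actual API:

```python
from typing import Optional

class JudgeRouter:
    """Route judge tasks by capability tier (sketch; names are illustrative)."""

    def __init__(self, api_key: Optional[str]):
        self.api_key = api_key

    def handler_for(self, task: str) -> str:
        if task == "query_refinement":
            # A 7-8B model can rephrase queries even if it can't synthesize.
            return "hf-free:llama-3.1-8b"
        if task == "synthesis":
            if not self.api_key:
                raise RuntimeError("Synthesis requires an API key (paid model).")
            return "paid:claude-sonnet"
        raise ValueError(f"unknown task: {task}")
```

Failing loudly at routing time is more honest than the current behavior of silently returning the MockJudgeHandler placeholder text.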
P0-003: PubMed Query Not Optimized
File: src/tools/pubmed.py:54-71
The Problem: The query is passed directly to PubMed without optimization:
search_params = self._build_params(
    db="pubmed",
    term=query,  # Raw user query!
    retmax=max_results,
    sort="relevance",
)
What User Enters: "What medications show promise for Long COVID?"
What PubMed Receives: What medications show promise for Long COVID?
What PubMed Should Receive:
("long covid"[Title/Abstract] OR "post-COVID"[Title/Abstract] OR "PASC"[Title/Abstract])
AND (drug[Title/Abstract] OR treatment[Title/Abstract] OR medication[Title/Abstract] OR therapy[Title/Abstract])
AND (clinical trial[Publication Type] OR randomized[Title/Abstract])
Root Cause: No query preprocessing or medical term expansion.
Fix Options:
- Add query preprocessor - Extract medical entities, expand synonyms
- Use MeSH terms - PubMed's controlled vocabulary for better recall
- LLM query generation - Use LLM to generate optimized PubMed query
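A minimal, LLM-free version of the first option: a hand-rolled synonym table plus PubMed field-tag syntax. The synonym table here is an illustrative stub, not a real medical ontology:

```python
# Illustrative synonym table -- a real version would come from MeSH/UMLS.
DISEASE_SYNONYMS = {
    "long covid": ["long covid", "post-COVID", "PASC"],
}
INTERVENTION_TERMS = ["drug", "treatment", "medication", "therapy"]

def build_pubmed_query(disease: str) -> str:
    """Emit PubMed boolean syntax with [Title/Abstract] field tags."""
    diseases = DISEASE_SYNONYMS.get(disease.lower(), [disease])
    disease_clause = " OR ".join(f'"{d}"[Title/Abstract]' for d in diseases)
    drug_clause = " OR ".join(f"{t}[Title/Abstract]" for t in INTERVENTION_TERMS)
    evidence_clause = (
        "clinical trial[Publication Type] OR randomized[Title/Abstract]"
    )
    return f"({disease_clause}) AND ({drug_clause}) AND ({evidence_clause})"
```

Even this crude preprocessing produces the field-tagged query shown above instead of a raw natural-language question.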
P0-004: Loop Terminates Too Early
File: src/app.py:42-45 and src/utils/models.py
The Problem:
config = OrchestratorConfig(
    max_iterations=5,
    max_results_per_tool=10,
)
5 iterations is not enough to:
- Search multiple variations of the query
- Gather enough evidence for the Judge to synthesize
- Refine queries based on initial results
Evidence: The user's output shows "Max Iterations Reached" with only 6 sources.
Root Cause: Conservative defaults chosen to avoid API costs, but they make the app useless.
Fix Options:
- Increase default to 10-15 - More iterations = better results
- Dynamic termination - Stop when confidence > threshold, not iteration count
- Parallel query expansion - Run more queries per iteration
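Dynamic termination can be sketched as a loop that exits on a confidence threshold instead of only an iteration cap; `run_iteration` below is a stand-in for one search-and-judge cycle:

```python
def run_loop(run_iteration, max_iterations=15, confidence_threshold=0.8):
    """Stop when the judge is confident, with the iteration cap as a backstop."""
    evidence, confidence = [], 0.0
    for _ in range(max_iterations):
        new_evidence, confidence = run_iteration(evidence)
        evidence.extend(new_evidence)
        if confidence >= confidence_threshold:
            return evidence, confidence, "confident"   # early exit on success
    return evidence, confidence, "max_iterations"      # fell through to the cap
```

With this shape, raising `max_iterations` to 10-15 costs nothing on easy queries, since confident runs still stop early.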
P0-005: No Query Understanding Layer
Files: src/orchestrator.py, src/tools/search_handler.py
The Problem: There's no NLU (Natural Language Understanding) layer. The system:
- Takes raw user query
- Passes directly to search tools
- No entity extraction
- No intent classification
- No query expansion
For drug repurposing, you need to extract:
- Disease: "Long COVID" → [Long COVID, PASC, Post-COVID syndrome, chronic COVID]
- Drug intent: "medications" → [drugs, treatments, therapeutics, interventions]
- Evidence type: "show promise" → [clinical trials, efficacy, RCT]
Root Cause: No preprocessing pipeline between user input and search execution.
Fix Options:
- Add entity extraction - Use BioBERT or PubMedBERT for medical NER
- Add query expansion - Use medical ontologies (UMLS, MeSH)
- LLM preprocessing - Use LLM to generate search strategy before searching
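A toy version of that pipeline, using the expansions listed above; a real implementation would replace the lookup table with biomedical NER (BioBERT/PubMedBERT) and MeSH/UMLS lookups:

```python
# Illustrative expansion table -- real entries would come from an ontology.
EXPANSIONS = {
    "long covid": ["Long COVID", "PASC", "Post-COVID syndrome"],
    "medications": ["drugs", "treatments", "therapeutics", "interventions"],
    "show promise": ["clinical trials", "efficacy", "RCT"],
}

def build_search_plan(query: str) -> dict:
    """Extract known phrases from the query and attach their expansions."""
    plan = {"raw_query": query, "expansions": {}}
    lowered = query.lower()
    for phrase, synonyms in EXPANSIONS.items():
        if phrase in lowered:
            plan["expansions"][phrase] = synonyms
    return plan
```

The point is the shape, not the table: downstream search tools consume a structured plan rather than the raw question.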
P0-006: ClinicalTrials.gov Results Not Filtered
File: src/tools/clinicaltrials.py
The Problem: ClinicalTrials.gov returns ALL matching trials including:
- Withdrawn trials
- Terminated trials
- Not yet recruiting
- Observational studies (not interventional)
For drug repurposing, you want:
- Interventional studies
- Phase 2+ (has safety/efficacy data)
- Completed or with results
Root Cause: No filtering of trial metadata.
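A client-side filter along these lines could be applied to whatever the tool already fetches. The flat record shape and field values below are illustrative, loosely mirroring ClinicalTrials.gov status/phase concepts rather than the real response schema:

```python
# Trial statuses/phases worth keeping for drug repurposing (illustrative).
USEFUL_STATUSES = {"COMPLETED", "ACTIVE_NOT_RECRUITING", "RECRUITING"}
USEFUL_PHASES = {"PHASE2", "PHASE3", "PHASE4"}

def filter_trials(trials):
    """Keep interventional, Phase 2+ trials in a useful status."""
    return [
        t for t in trials
        if t.get("study_type") == "INTERVENTIONAL"
        and t.get("status") in USEFUL_STATUSES
        and USEFUL_PHASES.intersection(t.get("phases", []))
    ]
```

This drops withdrawn, terminated, not-yet-recruiting, and observational records before they ever reach the judge.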
Summary: Why This App Produces Garbage
User Query: "What medications show promise for Long COVID?"
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ NO QUERY PREPROCESSING                                        │
│ - No entity extraction                                        │
│ - No synonym expansion                                        │
│ - No medical term normalization                               │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ BROKEN SEARCH LAYER                                           │
│ - PubMed: raw query, no MeSH, gets 1 result                   │
│ - BioRxiv: returns random papers (API doesn't support search) │
│ - ClinicalTrials: returns all trials, no filtering            │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ GARBAGE EVIDENCE                                              │
│ - 6 papers, most irrelevant                                   │
│ - "Calf muscle adaptations" (mentions COVID once)             │
│ - "Ophthalmologist work-life balance"                         │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ DUMB JUDGE (Free Tier)                                        │
│ - Llama 8B can't identify drugs from garbage                  │
│ - JSON parsing fails                                          │
│ - Falls back to "Drug identification requires AI analysis"    │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ LOOP HITS MAX (5 iterations)                                  │
│ - Never finds enough good evidence                            │
│ - Never synthesizes anything useful                           │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
GARBAGE OUTPUT
What Would Make This Actually Work
Minimum Viable Fix (1-2 days)
- Remove BioRxiv - It doesn't work
- Require API key - Be honest that free tier is useless
- Add basic query preprocessing - Strip question words, expand COVID synonyms
- Increase iterations to 10
Proper Fix (1-2 weeks)
Query Understanding Layer
- Medical NER (BioBERT/SciBERT)
- Query expansion with MeSH/UMLS
- Intent classification (drug discovery vs mechanism vs safety)
Optimized Search
- PubMed: Proper query syntax with MeSH terms
- ClinicalTrials: Filter by phase, status, intervention type
- Replace BioRxiv with Europe PMC (has preprints + search)
Evidence Ranking
- Score by publication type (RCT > cohort > case report)
- Score by journal impact factor
- Score by recency
- Score by citation count
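These heuristics combine naturally into a weighted score. The weights and decay constants below are illustrative starting points, and journal impact factor is omitted because it requires an external lookup:

```python
# Publication-type weights: RCT > cohort > case report (illustrative values).
PUBLICATION_TYPE_SCORES = {"rct": 1.0, "cohort": 0.6, "case_report": 0.2}

def score_evidence(pub_type: str, year: int, citations: int,
                   current_year: int = 2025) -> float:
    """Weighted evidence score in [0, 1]."""
    type_score = PUBLICATION_TYPE_SCORES.get(pub_type, 0.1)
    recency = max(0.0, 1.0 - (current_year - year) / 10)  # decays over 10 years
    citation_score = min(citations / 100, 1.0)            # saturates at 100
    return 0.5 * type_score + 0.3 * recency + 0.2 * citation_score
```

Ranking evidence before synthesis means the judge sees RCTs first instead of whatever the search layer happened to return.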
Proper LLM Pipeline
- Use GPT-4 / Claude for synthesis
- Structured extraction of: drug, mechanism, evidence level, effect size
- Multi-step reasoning: identify β validate β rank β synthesize
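Structured extraction is easiest to enforce with an explicit target schema that the LLM must fill. A minimal sketch, where the field names and the sample values are placeholders rather than real extraction results:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DrugCandidate:
    """Target schema for the extraction step (sketch)."""
    drug: str
    mechanism: str
    evidence_level: str            # e.g. "RCT", "cohort", "case report"
    effect_size: Optional[float] = None  # None when the paper reports none

# Placeholder example of a filled-in record:
candidate = DrugCandidate(
    drug="metformin",
    mechanism="example placeholder, not an extracted claim",
    evidence_level="RCT",
    effect_size=0.41,
)
```

Parsing LLM output into this schema (and rejecting anything that doesn't validate) is what makes the identify → validate → rank → synthesize pipeline checkable at each step.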
The Hard Truth
Building a drug repurposing agent that works is HARD. The state of the art is:
- Drug2Disease (IBM) - Uses knowledge graphs + ML
- COVID-KG (Stanford) - Dedicated COVID knowledge graph
- Literature Mining at scale (PubMed) - Millions of papers, not 10
This hackathon project is fundamentally a search wrapper with an LLM prompt. That's not enough.
To make it useful:
- Either scope it down (e.g., "find clinical trials for X disease")
- Or invest serious engineering in the NLU + search + ranking pipeline