VibecoderMcSwaggins committed
Commit 622c8ba · unverified · 1 Parent(s): d1f91a4

fix: P0 bugs + model config stabilization (November 2025) (#59)

* fix(judges): fallback to heuristic extraction when HF quota exhausted

* fix(config): update default LLM models to stable Nov 2025 versions

* docs: add LLM model defaults (Nov 2025) to agent guidance files

* refactor(judges): extract shared title-extraction helper

Address CodeRabbit nitpicks:
- Add language tag to fenced code block in docs
- Extract _extract_titles_from_evidence() to reduce duplication
between HFInferenceJudgeHandler and MockJudgeHandler

All 136 tests pass.

AGENTS.md CHANGED
@@ -89,6 +89,21 @@ DeepBonerError (base)
  └── ConfigurationError
  ```

+ ## LLM Model Defaults (November 2025)
+
+ Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+
+ - **OpenAI:** `gpt-5`
+   - This is the stable flagship model released in August 2025.
+   - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
+ - **Anthropic:** `claude-sonnet-4-5-20250929`
+   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
+   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
+ - **HuggingFace (Free Tier):** `meta-llama/Llama-3.1-70B-Instruct`
+   - This remains the default for the free tier, subject to quota limits.
+
+ It is crucial to keep these defaults updated as the LLM landscape evolves.
+
  ## Testing

  - **TDD**: Write tests first in `tests/unit/`, implement in `src/`
CLAUDE.md CHANGED
@@ -96,6 +96,21 @@ DeepBonerError (base)
  - **Mocking**: `respx` for httpx, `pytest-mock` for general mocking
  - **Fixtures**: `tests/conftest.py` has `mock_httpx_client`, `mock_llm_response`

+ ## LLM Model Defaults (November 2025)
+
+ Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+
+ - **OpenAI:** `gpt-5`
+   - This is the stable flagship model released in August 2025.
+   - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
+ - **Anthropic:** `claude-sonnet-4-5-20250929`
+   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
+   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
+ - **HuggingFace (Free Tier):** `meta-llama/Llama-3.1-70B-Instruct`
+   - This remains the default for the free tier, subject to quota limits.
+
+ It is crucial to keep these defaults updated as the LLM landscape evolves.
+
  ## Git Workflow

  - `main`: Production-ready (GitHub)
GEMINI.md CHANGED
@@ -70,6 +70,21 @@ Settings via pydantic-settings from `.env`:
  - `MAX_ITERATIONS`: 1-50, default 10
  - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR

+ ## LLM Model Defaults (November 2025)
+
+ Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+
+ - **OpenAI:** `gpt-5`
+   - This is the stable flagship model released in August 2025.
+   - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
+ - **Anthropic:** `claude-sonnet-4-5-20250929`
+   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
+   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
+ - **HuggingFace (Free Tier):** `meta-llama/Llama-3.1-70B-Instruct`
+   - This remains the default for the free tier, subject to quota limits.
+
+ It is crucial to keep these defaults updated as the LLM landscape evolves.
+
  ## Development Conventions

  1. **Strict TDD:** Write failing tests in `tests/unit/` *before* implementing logic in `src/`.
docs/bugs/INVESTIGATION_INVALID_MODELS.md ADDED
@@ -0,0 +1,30 @@
+ # Bug Investigation: Invalid Default LLM Models
+
+ ## Status
+ - **Date:** 2025-11-29
+ - **Reporter:** CLI User
+ - **Component:** `src/utils/config.py`
+ - **Priority:** High (Magentic Mode Blocker)
+ - **Resolution:** FIXED
+
+ ## Issue Description
+ The user encountered a 403 error when running in Magentic mode:
+ `Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5.1', ... 'code': 'model_not_found'}}`
+
+ This indicates the application is trying to use `gpt-5.1`, which the user's API key did not have access to (likely a beta/gated model).
+
+ ## Root Cause Analysis
+ The default config used `gpt-5.1` (beta/preview) and `claude-sonnet-4-5-20250929`.
+ Initial remediation mistakenly downgraded these to 2024 models (`gpt-4o`).
+ Web search confirmed that in November 2025:
+ - `claude-sonnet-4-5-20250929` IS valid.
+ - `gpt-5.1` exists but access is restricted (leading to 403).
+ - `gpt-5` (August 2025) is the stable flagship.
+
+ ## Solution Implemented
+ Updated `src/utils/config.py` to use:
+ - `anthropic_model`: `claude-sonnet-4-5-20250929` (Restored correct Nov 2025 model)
+ - `openai_model`: `gpt-5` (Changed from 5.1 to 5 to ensure stability/access).
+
+ ## Verification
+ - `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
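Note on overriding the defaults: the investigation above changes a default value, and the docs in this commit say users with access to a gated model can still select it through `.env`. Here is a minimal sketch of that precedence, using only the standard library as a stand-in for the project's pydantic-settings `Settings` class. The `OPENAI_MODEL` and `ANTHROPIC_MODEL` variable names, the `ModelSettings` class, and the `load_models` helper are assumptions for illustration and are not taken from `src/utils/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults as documented in this commit (November 2025).
DEFAULT_OPENAI_MODEL = "gpt-5"
DEFAULT_ANTHROPIC_MODEL = "claude-sonnet-4-5-20250929"


@dataclass(frozen=True)
class ModelSettings:
    """Hypothetical, simplified stand-in for the project's Settings class."""

    openai_model: str
    anthropic_model: str


def load_models(env: Optional[Mapping[str, str]] = None) -> ModelSettings:
    """Hypothetical helper: an environment value wins, otherwise the default applies."""
    source = os.environ if env is None else env
    return ModelSettings(
        # Variable names are assumed; check the real Settings fields before relying on them.
        openai_model=source.get("OPENAI_MODEL", DEFAULT_OPENAI_MODEL),
        anthropic_model=source.get("ANTHROPIC_MODEL", DEFAULT_ANTHROPIC_MODEL),
    )
```

With no override the sketch yields `gpt-5`; a key with access to the gated model would set the variable explicitly, so the 403 only affects users who opt in.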
docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md ADDED
@@ -0,0 +1,49 @@
+ # Bug Investigation: HF Free Tier Quota Exhaustion
+
+ ## Status
+ - **Date:** 2025-11-29
+ - **Reporter:** CLI User
+ - **Component:** `HFInferenceJudgeHandler`
+ - **Priority:** High (UX Blocker for Free Tier)
+ - **Resolution:** FIXED
+
+ ## Issue Description
+ On a fresh run with a simple query ("What drugs improve female libido post-menopause?"), the system retrieved 20 valid sources but failed during the Judge/Analysis phase with:
+ `⚠️ Free Tier Quota Exceeded ⚠️`
+
+ This results in a "Synthesis" step that has 0 candidates and 0 findings, rendering the application useless for free users once the (very low) limit is hit, despite having valid search results.
+
+ ## Evidence
+ Output provided:
+ ```text
+ ### Citations (20 sources)
+ ...
+ ### Reasoning
+ ⚠️ **Free Tier Quota Exceeded** ⚠️
+ ```
+
+ ## Root Cause Analysis
+ 1. **Search Success:** `SearchAgent` correctly found 20 documents (PubMed/EuropePMC).
+ 2. **Judge Failure:** `HFInferenceJudgeHandler` called the HF Inference API.
+ 3. **Quota Trap:** The API returned a 402 (Payment Required) or Quota error.
+ 4. **Previous Handling:** The handler caught this error and returned a `JudgeAssessment` with `sufficient=True` (to stop the loop) and *empty* fields.
+ 5. **Data Loss:** The 20 valid search results were effectively discarded from the "Analysis" perspective.
+
+ ## The "Deep Blocker"
+ The system had a "hard failure" mode for quota exhaustion, assuming that if the LLM can't judge, we have *no* useful information. This "bricked" the UX for free users immediately upon hitting the limit.
+
+ ## Solution Implemented
+ Modified `HFInferenceJudgeHandler._create_quota_exhausted_assessment` to:
+ 1. Accept the `evidence` list as an argument.
+ 2. Perform basic heuristic extraction (borrowed from `MockJudgeHandler` logic):
+    - Use titles as "Key Findings" (first 5 sources).
+    - Add a clear message in "Drug Candidates" telling the user to upgrade.
+ 3. Return this "Partial" assessment instead of an empty one.
+
+ ## Verification
+ - Created `tests/unit/agent_factory/test_judges_hf_quota.py` to verify that:
+   - 402 errors are caught.
+   - `sufficient` is set to `True` (stops loop).
+   - `key_findings` are populated from search result titles.
+   - `reasoning` contains the warning message.
+ - Ran existing tests `tests/unit/agent_factory/test_judges_hf.py` - All passed.
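Note on the fallback flow: the steps in "Solution Implemented" can be sketched in isolation as below. This is an illustration under stated assumptions, not the project's handler. `PartialAssessment` is a simplified stand-in for `JudgeAssessment`. `is_quota_error` and `assess_with_fallback` are hypothetical names. The `"402"` and `"quota"` substring checks are assumed, since the diff only shows the `"payment required"` condition.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PartialAssessment:
    """Simplified stand-in for JudgeAssessment (illustration only)."""

    sufficient: bool
    key_findings: List[str]
    drug_candidates: List[str]
    reasoning: str


def is_quota_error(exc: Exception) -> bool:
    """Assumed check: look for 402 / quota / payment-required in the error text."""
    text = str(exc).lower()
    return "402" in text or "quota" in text or "payment required" in text


def assess_with_fallback(
    call_llm: Callable[[str, List[str]], PartialAssessment],
    question: str,
    titles: List[str],
) -> PartialAssessment:
    """Try the LLM judge; on quota exhaustion, keep the search results visible."""
    try:
        return call_llm(question, titles)
    except Exception as exc:
        if not is_quota_error(exc):
            raise  # unrelated failures are not masked by the fallback
        findings = titles[:5] or [
            "No findings available (Quota exceeded and no search results)."
        ]
        return PartialAssessment(
            sufficient=True,  # stop the loop instead of retrying an exhausted quota
            key_findings=findings,
            drug_candidates=["Upgrade to paid API for drug extraction."],
            reasoning="Free Tier Quota Exceeded",
        )
```

The design point is that `sufficient=True` still ends the search loop, but the retrieved titles survive as findings, so the user sees what was found even though nothing was analyzed.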
src/agent_factory/judges.py CHANGED
@@ -26,6 +26,31 @@ from src.utils.models import AssessmentDetails, Evidence, JudgeAssessment
  logger = structlog.get_logger()


+ def _extract_titles_from_evidence(
+     evidence: list[Evidence], max_items: int = 5, fallback_message: str | None = None
+ ) -> list[str]:
+     """Extract truncated titles from evidence for fallback display.
+
+     Args:
+         evidence: List of evidence items
+         max_items: Maximum number of titles to extract
+         fallback_message: Message to return if no evidence provided
+
+     Returns:
+         List of truncated titles (max 150 chars each)
+     """
+     findings = []
+     for e in evidence[:max_items]:
+         title = e.citation.title
+         if len(title) > 150:
+             title = title[:147] + "..."
+         findings.append(title)
+
+     if not findings and fallback_message:
+         return [fallback_message]
+     return findings
+
+
  def get_model() -> Any:
      """Get the LLM model based on configuration.

@@ -218,7 +243,7 @@ class HFInferenceJudgeHandler:
                      or "payment required" in error_str.lower()
                  ):
                      logger.error("HF Quota Exhausted", error=error_str)
-                     return self._create_quota_exhausted_assessment(question)
+                     return self._create_quota_exhausted_assessment(question, evidence)

                  logger.warning("Model failed", model=model, error=str(e))
                  last_error = e
@@ -342,16 +367,26 @@ IMPORTANT: Respond with ONLY valid JSON matching this schema:

          return None

-     def _create_quota_exhausted_assessment(self, question: str) -> JudgeAssessment:
+     def _create_quota_exhausted_assessment(
+         self, question: str, evidence: list[Evidence]
+     ) -> JudgeAssessment:
          """Create an assessment that stops the loop when quota is exhausted."""
+         findings = _extract_titles_from_evidence(
+             evidence,
+             max_items=5,
+             fallback_message="No findings available (Quota exceeded and no search results).",
+         )
+
          return JudgeAssessment(
              details=AssessmentDetails(
                  mechanism_score=0,
-                 mechanism_reasoning="Free tier quota exhausted.",
+                 mechanism_reasoning="Free tier quota exhausted. Unable to analyze mechanism.",
                  clinical_evidence_score=0,
-                 clinical_reasoning="Free tier quota exhausted.",
-                 drug_candidates=[],
-                 key_findings=[],
+                 clinical_reasoning=(
+                     "Free tier quota exhausted. Unable to analyze clinical evidence."
+                 ),
+                 drug_candidates=["Upgrade to paid API for drug extraction."],
+                 key_findings=findings,
              ),
              sufficient=True,  # STOP THE LOOP
              confidence=0.0,
@@ -360,6 +395,8 @@ IMPORTANT: Respond with ONLY valid JSON matching this schema:
              reasoning=(
                  "⚠️ **Free Tier Quota Exceeded** ⚠️\n\n"
                  "The HuggingFace Inference API free tier limit has been reached. "
+                 "The search results listed below were retrieved but could not be "
+                 "analyzed by the AI. "
                  "Please try again later, or add an OpenAI/Anthropic API key above "
                  "for unlimited access."
              ),
@@ -414,13 +451,11 @@ class MockJudgeHandler:

      def _extract_key_findings(self, evidence: list[Evidence], max_findings: int = 5) -> list[str]:
          """Extract key findings from evidence titles."""
-         findings = []
-         for e in evidence[:max_findings]:
-             # Use first 150 chars of title as a finding
-             title = e.citation.title
-             if len(title) > 150:
-                 title = title[:147] + "..."
-             findings.append(title)
+         findings = _extract_titles_from_evidence(
+             evidence,
+             max_items=max_findings,
+             fallback_message="No specific findings extracted (demo mode)",
+         )
          return findings if findings else ["No specific findings extracted (demo mode)"]

      def _extract_drug_candidates(self, question: str, evidence: list[Evidence]) -> list[str]:
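Note on the shared helper's edge cases: the truncation and fallback behavior of `_extract_titles_from_evidence` can be checked outside the project with the sketch below. `extract_titles` copies the helper's logic from the diff. `make_evidence` is a hypothetical stand-in that provides only the `citation.title` attribute the helper reads, not the real `Evidence` model.

```python
from types import SimpleNamespace


def extract_titles(evidence, max_items=5, fallback_message=None):
    """Same logic as the helper in the diff, copied here to experiment in isolation."""
    findings = []
    for e in evidence[:max_items]:
        title = e.citation.title
        if len(title) > 150:
            # Keep 147 characters plus "..." so the result is exactly 150 long.
            title = title[:147] + "..."
        findings.append(title)
    if not findings and fallback_message:
        return [fallback_message]
    return findings


def make_evidence(title):
    """Hypothetical stand-in exposing only the attribute the helper reads."""
    return SimpleNamespace(citation=SimpleNamespace(title=title))


exact = extract_titles([make_evidence("x" * 150)])    # at the limit: unchanged
longer = extract_titles([make_evidence("x" * 151)])   # over the limit: truncated
empty = extract_titles([], fallback_message="No findings")
bare = extract_titles([])                             # no fallback: empty list
```

Because the helper already returns the fallback message when given one, the trailing `findings if findings else [...]` in `MockJudgeHandler._extract_key_findings` appears to be redundant. It is harmless, and only the no-fallback call can return an empty list.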
src/utils/config.py CHANGED
@@ -26,7 +26,7 @@ class Settings(BaseSettings):
      llm_provider: Literal["openai", "anthropic", "huggingface"] = Field(
          default="openai", description="Which LLM provider to use"
      )
-     openai_model: str = Field(default="gpt-5.1", description="OpenAI model name")
+     openai_model: str = Field(default="gpt-5", description="OpenAI model name")
      anthropic_model: str = Field(
          default="claude-sonnet-4-5-20250929", description="Anthropic model"
      )
tests/unit/agent_factory/test_judges_factory.py CHANGED
@@ -25,11 +25,11 @@ def test_get_model_openai(mock_settings):
      """Test that OpenAI model is returned when provider is openai."""
      mock_settings.llm_provider = "openai"
      mock_settings.openai_api_key = "sk-test"
-     mock_settings.openai_model = "gpt-5.1"
+     mock_settings.openai_model = "gpt-5"

      model = get_model()
      assert isinstance(model, OpenAIChatModel)
-     assert model.model_name == "gpt-5.1"
+     assert model.model_name == "gpt-5"


  def test_get_model_anthropic(mock_settings):
@@ -58,7 +58,7 @@ def test_get_model_default_fallback(mock_settings):
      """Test fallback to OpenAI if provider is unknown."""
      mock_settings.llm_provider = "unknown_provider"
      mock_settings.openai_api_key = "sk-test"
-     mock_settings.openai_model = "gpt-5.1"
+     mock_settings.openai_model = "gpt-5"

      model = get_model()
      assert isinstance(model, OpenAIChatModel)
tests/unit/agent_factory/test_judges_hf_quota.py ADDED
@@ -0,0 +1,49 @@
+ """Unit tests for HFInferenceJudgeHandler Quota Logic."""
+
+ from unittest.mock import patch
+
+ import pytest
+
+ from src.agent_factory.judges import HFInferenceJudgeHandler
+ from src.utils.models import Citation, Evidence
+
+
+ @pytest.mark.unit
+ class TestHFInferenceJudgeHandlerQuota:
+     """Tests for HFInferenceJudgeHandler Quota handling."""
+
+     @pytest.mark.asyncio
+     async def test_assess_quota_exhausted(self):
+         """Test that quota exhaustion triggers fallback extraction."""
+         handler = HFInferenceJudgeHandler()
+
+         # Create some dummy evidence
+         evidence = [
+             Evidence(
+                 content="Content 1",
+                 citation=Citation(
+                     source="pubmed", title="Important Drug A Findings", url="u1", date="d1"
+                 ),
+             ),
+             Evidence(
+                 content="Content 2",
+                 citation=Citation(
+                     source="pubmed", title="Clinical Trial of Drug B", url="u2", date="d2"
+                 ),
+             ),
+         ]
+
+         # Mock _call_with_retry to raise a Quota error
+         with patch.object(
+             handler, "_call_with_retry", side_effect=Exception("402 Payment Required")
+         ):
+             result = await handler.assess("question", evidence)
+
+         # Check that it caught the error and stopped
+         assert result.sufficient is True
+         assert "Free Tier Quota Exceeded" in result.reasoning
+
+         # CRITICAL: Check that it extracted findings from titles
+         assert "Important Drug A Findings" in result.details.key_findings
+         assert "Clinical Trial of Drug B" in result.details.key_findings
+         assert result.details.drug_candidates == ["Upgrade to paid API for drug extraction."]