Spaces: VibecoderMcSwaggins / DeepBoner

VibecoderMcSwaggins committed on Nov 29, 2025

Commit 622c8ba (unverified) · 1 Parent(s): d1f91a4

fix: P0 bugs + model config stabilization (November 2025) (#59)


* fix(judges): fallback to heuristic extraction when HF quota exhausted

* fix(config): update default LLM models to stable Nov 2025 versions

* docs: add LLM model defaults (Nov 2025) to agent guidance files

* refactor(judges): extract shared title-extraction helper

Address CodeRabbit nitpicks:
- Add language tag to fenced code block in docs
- Extract _extract_titles_from_evidence() to reduce duplication
between HFInferenceJudgeHandler and MockJudgeHandler

All 136 tests pass.

Files changed (9)

AGENTS.md +15 -0
CLAUDE.md +15 -0
GEMINI.md +15 -0
docs/bugs/INVESTIGATION_INVALID_MODELS.md +30 -0
docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md +49 -0
src/agent_factory/judges.py +48 -13
src/utils/config.py +1 -1
tests/unit/agent_factory/test_judges_factory.py +3 -3
tests/unit/agent_factory/test_judges_hf_quota.py +49 -0

AGENTS.md CHANGED

@@ -89,6 +89,21 @@ DeepBonerError (base)
 └── ConfigurationError
 ```
+## LLM Model Defaults (November 2025)
+As of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+- **OpenAI:** `gpt-5`
+  - This is the stable flagship model released in August 2025.
+  - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
+- **Anthropic:** `claude-sonnet-4-5-20250929`
+  - This is the mid-range Claude 4.5 model, released on September 29, 2025.
+  - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
+- **HuggingFace (Free Tier):** `meta-llama/Llama-3.1-70B-Instruct`
+  - This remains the default for the free tier, subject to quota limits.
+It is crucial to keep these defaults updated as the LLM landscape evolves.
 ## Testing
 - **TDD**: Write tests first in `tests/unit/`, implement in `src/`

CLAUDE.md CHANGED

@@ -96,6 +96,21 @@ DeepBonerError (base)
 - **Mocking**: `respx` for httpx, `pytest-mock` for general mocking
 - **Fixtures**: `tests/conftest.py` has `mock_httpx_client`, `mock_llm_response`
+## LLM Model Defaults (November 2025)
+As of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+- **OpenAI:** `gpt-5`
+  - This is the stable flagship model released in August 2025.
+  - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
+- **Anthropic:** `claude-sonnet-4-5-20250929`
+  - This is the mid-range Claude 4.5 model, released on September 29, 2025.
+  - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
+- **HuggingFace (Free Tier):** `meta-llama/Llama-3.1-70B-Instruct`
+  - This remains the default for the free tier, subject to quota limits.
+It is crucial to keep these defaults updated as the LLM landscape evolves.
 ## Git Workflow
 - `main`: Production-ready (GitHub)

GEMINI.md CHANGED

@@ -70,6 +70,21 @@ Settings via pydantic-settings from `.env`:
 - `MAX_ITERATIONS`: 1-50, default 10
 - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
+## LLM Model Defaults (November 2025)
+As of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+- **OpenAI:** `gpt-5`
+  - This is the stable flagship model released in August 2025.
+  - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
+- **Anthropic:** `claude-sonnet-4-5-20250929`
+  - This is the mid-range Claude 4.5 model, released on September 29, 2025.
+  - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
+- **HuggingFace (Free Tier):** `meta-llama/Llama-3.1-70B-Instruct`
+  - This remains the default for the free tier, subject to quota limits.
+It is crucial to keep these defaults updated as the LLM landscape evolves.
 ## Development Conventions
 1. **Strict TDD:** Write failing tests in `tests/unit/` *before* implementing logic in `src/`.

docs/bugs/INVESTIGATION_INVALID_MODELS.md ADDED

	@@ -0,0 +1,30 @@

+# Bug Investigation: Invalid Default LLM Models
+## Status
+- **Date:** 2025-11-29
+- **Reporter:** CLI User
+- **Component:** `src/utils/config.py`
+- **Priority:** High (Magentic Mode Blocker)
+- **Resolution:** FIXED
+## Issue Description
+The user encountered a 403 error when running in Magentic mode:
+`Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5.1', ... 'code': 'model_not_found'}}`
+The application was requesting `gpt-5.1`, which the user's API key could not access (likely a gated or preview model).
+## Root Cause Analysis
+The default config used `gpt-5.1` (beta/preview) and `claude-sonnet-4-5-20250929`.
+Initial remediation mistakenly downgraded these to 2024 models (`gpt-4o`).
+Web search confirmed that in November 2025:
+- `claude-sonnet-4-5-20250929` IS valid.
+- `gpt-5.1` exists but access is restricted (leading to 403).
+- `gpt-5` (August 2025) is the stable flagship.
+## Solution Implemented
+Updated `src/utils/config.py` to use:
+- `anthropic_model`: `claude-sonnet-4-5-20250929` (Restored correct Nov 2025 model)
+- `openai_model`: `gpt-5` (Changed from 5.1 to 5 to ensure stability/access).
+## Verification
+- `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
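
The fix above changes only the default values; a deployment can still pin a different model through its `.env`. The following is a minimal sketch of that precedence, using a stdlib stand-in rather than the project's pydantic-settings `Settings` class. The environment variable names `OPENAI_MODEL` and `ANTHROPIC_MODEL` and the `resolve_models` helper are assumptions made for illustration and are not taken from the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Defaults named in the investigation above.
DEFAULT_OPENAI_MODEL = "gpt-5"
DEFAULT_ANTHROPIC_MODEL = "claude-sonnet-4-5-20250929"


@dataclass(frozen=True)
class ModelDefaults:
    openai_model: str
    anthropic_model: str


def resolve_models(env: Mapping[str, str] = os.environ) -> ModelDefaults:
    """Return configured model names, falling back to the stable defaults.

    Hypothetical helper: the variable names are assumed, not confirmed.
    An unset or empty variable falls back to the default.
    """
    return ModelDefaults(
        openai_model=env.get("OPENAI_MODEL") or DEFAULT_OPENAI_MODEL,
        anthropic_model=env.get("ANTHROPIC_MODEL") or DEFAULT_ANTHROPIC_MODEL,
    )
```

With this shape, users who do have access to a gated model opt in explicitly, while everyone else gets a model that a typical API key can reach.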

docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md ADDED

	@@ -0,0 +1,49 @@

+# Bug Investigation: HF Free Tier Quota Exhaustion
+## Status
+- **Date:** 2025-11-29
+- **Reporter:** CLI User
+- **Component:** `HFInferenceJudgeHandler`
+- **Priority:** High (UX Blocker for Free Tier)
+- **Resolution:** FIXED
+## Issue Description
+On a fresh run with a simple query ("What drugs improve female libido post-menopause?"), the system retrieved 20 valid sources but failed during the Judge/Analysis phase with:
+`⚠️ Free Tier Quota Exceeded ⚠️`
+This results in a "Synthesis" step that has 0 candidates and 0 findings, rendering the application useless for free users once the (very low) limit is hit, despite having valid search results.
+## Evidence
+Output provided:
+```text
+### Citations (20 sources)
+...
+### Reasoning
+⚠️ **Free Tier Quota Exceeded** ⚠️
+```
+## Root Cause Analysis
+1. **Search Success:** `SearchAgent` correctly found 20 documents (PubMed/EuropePMC).
+2. **Judge Failure:** `HFInferenceJudgeHandler` called the HF Inference API.
+3. **Quota Trap:** The API returned a 402 (Payment Required) or Quota error.
+4. **Previous Handling:** The handler caught this error and returned a `JudgeAssessment` with `sufficient=True` (to stop the loop) and *empty* fields.
+5. **Data Loss:** The 20 valid search results were effectively discarded from the "Analysis" perspective.
+## The "Deep Blocker"
+The system had a "hard failure" mode for quota exhaustion, assuming that if the LLM can't judge, we have *no* useful information. This "bricked" the UX for free users immediately upon hitting the limit.
+## Solution Implemented
+Modified `HFInferenceJudgeHandler._create_quota_exhausted_assessment` to:
+1. Accept the `evidence` list as an argument.
+2. Perform basic heuristic extraction (borrowed from `MockJudgeHandler` logic):
+   - Use titles as "Key Findings" (first 5 sources).
+   - Add a clear message in "Drug Candidates" telling the user to upgrade.
+3. Return this "Partial" assessment instead of an empty one.
+## Verification
+- Created `tests/unit/agent_factory/test_judges_hf_quota.py` to verify that:
+  - 402 errors are caught.
+  - `sufficient` is set to `True` (stops loop).
+  - `key_findings` are populated from search result titles.
+  - `reasoning` contains the warning message.
+- Ran existing tests `tests/unit/agent_factory/test_judges_hf.py` - All passed.
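
The control flow described in the root cause and solution sections can be sketched roughly as follows. This is a simplified, self-contained illustration and not the project's code: `PartialAssessment`, `is_quota_error`, and `assess_with_fallback` are hypothetical names, and the marker list is an assumption (the diff shows only the `"payment required"` check).

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumed markers; only "payment required" is visible in the diff.
QUOTA_MARKERS = ("402", "quota", "payment required")


@dataclass
class PartialAssessment:
    sufficient: bool
    confidence: float
    key_findings: list[str] = field(default_factory=list)
    drug_candidates: list[str] = field(default_factory=list)
    reasoning: str = ""


def is_quota_error(exc: Exception) -> bool:
    """Classify an exception as quota exhaustion by inspecting its message."""
    text = str(exc).lower()
    return any(marker in text for marker in QUOTA_MARKERS)


def assess_with_fallback(
    call_model: Callable[[], PartialAssessment], titles: list[str]
) -> PartialAssessment:
    """Call the judge; on quota errors, keep the search results as findings."""
    try:
        return call_model()
    except Exception as exc:
        if not is_quota_error(exc):
            raise  # unrelated failures are not masked
        findings = titles[:5] or [
            "No findings available (Quota exceeded and no search results)."
        ]
        return PartialAssessment(
            sufficient=True,  # stop the search loop
            confidence=0.0,
            key_findings=findings,
            drug_candidates=["Upgrade to paid API for drug extraction."],
            reasoning="Free Tier Quota Exceeded",
        )
```

The key design point is that `sufficient=True` still ends the loop, but the assessment now carries the retrieved titles instead of empty fields.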

src/agent_factory/judges.py CHANGED

@@ -26,6 +26,31 @@ from src.utils.models import AssessmentDetails, Evidence, JudgeAssessment
 logger = structlog.get_logger()
+def _extract_titles_from_evidence(
+    evidence: list[Evidence], max_items: int = 5, fallback_message: str | None = None
+) -> list[str]:
+    """Extract truncated titles from evidence for fallback display.
+    Args:
+        evidence: List of evidence items
+        max_items: Maximum number of titles to extract
+        fallback_message: Message to return if no evidence provided
+    Returns:
+        List of truncated titles (max 150 chars each)
+    """
+    findings = []
+    for e in evidence[:max_items]:
+        title = e.citation.title
+        if len(title) > 150:
+            title = title[:147] + "..."
+        findings.append(title)
+    if not findings and fallback_message:
+        return [fallback_message]
+    return findings
 def get_model() -> Any:
     """Get the LLM model based on configuration.
@@ -218,7 +243,7 @@ class HFInferenceJudgeHandler:
                     or "payment required" in error_str.lower()
                 ):
                     logger.error("HF Quota Exhausted", error=error_str)
-                    return self._create_quota_exhausted_assessment(question)
+                    return self._create_quota_exhausted_assessment(question, evidence)
                 logger.warning("Model failed", model=model, error=str(e))
                 last_error = e
@@ -342,16 +367,26 @@ IMPORTANT: Respond with ONLY valid JSON matching this schema:
         return None
-    def _create_quota_exhausted_assessment(self, question: str) -> JudgeAssessment:
+    def _create_quota_exhausted_assessment(
+        self, question: str, evidence: list[Evidence]
+    ) -> JudgeAssessment:
         """Create an assessment that stops the loop when quota is exhausted."""
+        findings = _extract_titles_from_evidence(
+            evidence,
+            max_items=5,
+            fallback_message="No findings available (Quota exceeded and no search results).",
+        )
         return JudgeAssessment(
             details=AssessmentDetails(
                 mechanism_score=0,
-                mechanism_reasoning="Free tier quota exhausted.",
+                mechanism_reasoning="Free tier quota exhausted. Unable to analyze mechanism.",
                 clinical_evidence_score=0,
-                clinical_reasoning="Free tier quota exhausted.",
-                drug_candidates=[],
-                key_findings=[],
+                clinical_reasoning=(
+                    "Free tier quota exhausted. Unable to analyze clinical evidence."
+                ),
+                drug_candidates=["Upgrade to paid API for drug extraction."],
+                key_findings=findings,
             ),
             sufficient=True,  # STOP THE LOOP
             confidence=0.0,
@@ -360,6 +395,8 @@ IMPORTANT: Respond with ONLY valid JSON matching this schema:
             reasoning=(
                 "⚠️ **Free Tier Quota Exceeded** ⚠️\n\n"
                 "The HuggingFace Inference API free tier limit has been reached. "
+                "The search results listed below were retrieved but could not be "
+                "analyzed by the AI. "
                 "Please try again later, or add an OpenAI/Anthropic API key above "
                 "for unlimited access."
             ),
@@ -414,13 +451,11 @@ class MockJudgeHandler:
     def _extract_key_findings(self, evidence: list[Evidence], max_findings: int = 5) -> list[str]:
         """Extract key findings from evidence titles."""
-        findings = []
-        for e in evidence[:max_findings]:
-            # Use first 150 chars of title as a finding
-            title = e.citation.title
-            if len(title) > 150:
-                title = title[:147] + "..."
-            findings.append(title)
+        findings = _extract_titles_from_evidence(
+            evidence,
+            max_items=max_findings,
+            fallback_message="No specific findings extracted (demo mode)",
+        )
         return findings if findings else ["No specific findings extracted (demo mode)"]
     def _extract_drug_candidates(self, question: str, evidence: list[Evidence]) -> list[str]:
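
To make the boundary behaviour of the shared title helper easier to see, here is a small sketch that mirrors its truncation and fallback rules on plain strings. `truncate_title` and `extract_titles` are illustrative stand-ins that take titles directly instead of `Evidence` objects; they are not functions from the repository.

```python
from typing import Optional


def truncate_title(title: str, limit: int = 150) -> str:
    """Cap a title at `limit` characters, ending with '...' when cut."""
    if len(title) > limit:
        return title[: limit - 3] + "..."
    return title


def extract_titles(
    titles: list[str], max_items: int = 5, fallback_message: Optional[str] = None
) -> list[str]:
    """Mirror of the helper's rules: take the first `max_items` titles,
    truncate each, and return the fallback only when nothing was found."""
    findings = [truncate_title(t) for t in titles[:max_items]]
    if not findings and fallback_message:
        return [fallback_message]
    return findings
```

A title of exactly 150 characters passes through unchanged, while a 151-character title becomes 147 characters plus `...`, so no finding exceeds 150 characters. Because the helper already returns the fallback message when one is supplied, the `if findings else ...` guard that remains in `MockJudgeHandler._extract_key_findings` appears to be redundant, though harmless.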

src/utils/config.py CHANGED

@@ -26,7 +26,7 @@ class Settings(BaseSettings):
     llm_provider: Literal["openai", "anthropic", "huggingface"] = Field(
         default="openai", description="Which LLM provider to use"
     )
-    openai_model: str = Field(default="gpt-5.1", description="OpenAI model name")
+    openai_model: str = Field(default="gpt-5", description="OpenAI model name")
     anthropic_model: str = Field(
         default="claude-sonnet-4-5-20250929", description="Anthropic model"
     )

tests/unit/agent_factory/test_judges_factory.py CHANGED

@@ -25,11 +25,11 @@ def test_get_model_openai(mock_settings):
     """Test that OpenAI model is returned when provider is openai."""
     mock_settings.llm_provider = "openai"
     mock_settings.openai_api_key = "sk-test"
-    mock_settings.openai_model = "gpt-5.1"
+    mock_settings.openai_model = "gpt-5"
     model = get_model()
     assert isinstance(model, OpenAIChatModel)
-    assert model.model_name == "gpt-5.1"
+    assert model.model_name == "gpt-5"
 def test_get_model_anthropic(mock_settings):
@@ -58,7 +58,7 @@ def test_get_model_default_fallback(mock_settings):
     """Test fallback to OpenAI if provider is unknown."""
     mock_settings.llm_provider = "unknown_provider"
     mock_settings.openai_api_key = "sk-test"
-    mock_settings.openai_model = "gpt-5.1"
+    mock_settings.openai_model = "gpt-5"
     model = get_model()
     assert isinstance(model, OpenAIChatModel)

tests/unit/agent_factory/test_judges_hf_quota.py ADDED

	@@ -0,0 +1,49 @@

+"""Unit tests for HFInferenceJudgeHandler Quota Logic."""
+from unittest.mock import patch
+import pytest
+from src.agent_factory.judges import HFInferenceJudgeHandler
+from src.utils.models import Citation, Evidence
+@pytest.mark.unit
+class TestHFInferenceJudgeHandlerQuota:
+    """Tests for HFInferenceJudgeHandler Quota handling."""
+    @pytest.mark.asyncio
+    async def test_assess_quota_exhausted(self):
+        """Test that quota exhaustion triggers fallback extraction."""
+        handler = HFInferenceJudgeHandler()
+        # Create some dummy evidence
+        evidence = [
+            Evidence(
+                content="Content 1",
+                citation=Citation(
+                    source="pubmed", title="Important Drug A Findings", url="u1", date="d1"
+                ),
+            ),
+            Evidence(
+                content="Content 2",
+                citation=Citation(
+                    source="pubmed", title="Clinical Trial of Drug B", url="u2", date="d2"
+                ),
+            ),
+        ]
+        # Mock _call_with_retry to raise a Quota error
+        with patch.object(
+            handler, "_call_with_retry", side_effect=Exception("402 Payment Required")
+        ):
+            result = await handler.assess("question", evidence)
+            # Check that it caught the error and stopped
+            assert result.sufficient is True
+            assert "Free Tier Quota Exceeded" in result.reasoning
+            # CRITICAL: Check that it extracted findings from titles
+            assert "Important Drug A Findings" in result.details.key_findings
+            assert "Clinical Trial of Drug B" in result.details.key_findings
+            assert result.details.drug_candidates == ["Upgrade to paid API for drug extraction."]
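
The test above relies on `patch.object` with `side_effect` to simulate the quota error. The sketch below shows that mechanism in isolation, assuming the patched method is a coroutine function (the diff does not show the definition of `_call_with_retry`). `TinyHandler` is a hypothetical stand-in, not the project's handler.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class TinyHandler:
    """Hypothetical stand-in for a judge handler with an async model call."""

    async def _call_with_retry(self, prompt: str) -> str:
        return "model answer"

    async def assess(self, prompt: str) -> str:
        try:
            return await self._call_with_retry(prompt)
        except Exception as exc:
            if "402" in str(exc):
                return "fallback"
            raise


handler = TinyHandler()
with patch.object(
    handler, "_call_with_retry", side_effect=Exception("402 Payment Required")
) as mocked:
    # patch.object detects the async method and substitutes an AsyncMock,
    # so the exception is raised when the call is awaited.
    assert isinstance(mocked, AsyncMock)
    result = asyncio.run(handler.assess("question"))

print(result)  # fallback
```

Because the replacement is an `AsyncMock`, no real network call is made and the handler's own error handling is what the assertions exercise.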