Add UI-based LLM provider selection for cloud testing
- Config-based LLM selection: Added LLM_PROVIDER and ENABLE_LLM_FALLBACK environment variables
- Smart routing: _call_with_fallback() function routes to selected provider with optional fallback
- UI dropdowns: Added LLM provider selection to both Test & Debug and Full Evaluation tabs
- Cloud-friendly: UI selection overrides env vars, enabling instant provider switching without rebuilds
- Unified behavior: Same UI selection works identically in local and cloud environments
Modified files:
- src/agent/llm_client.py: Config-based provider routing (~150 lines)
- app.py: UI dropdowns and function parameter updates (~30 lines)
- CHANGELOG.md: Documented two problems solved
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- CHANGELOG.md +68 -0
- app.py +49 -6
- src/agent/llm_client.py +119 -109
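The routing idea behind this commit can be sketched in a few lines: the primary provider and the fallback toggle come from environment variables, and a failed primary call either fails fast or walks the remaining providers. A minimal, self-contained sketch — the stub providers and their failure mode are invented for the demo, not the repo's real functions:

```python
import os

# Hypothetical stubs standing in for the real plan_question_gemini /
# plan_question_groq provider calls.
def plan_gemini(question: str) -> str:
    raise RuntimeError("quota exceeded")

def plan_groq(question: str) -> str:
    return f"groq-plan: {question}"

PROVIDERS = {"gemini": plan_gemini, "groq": plan_groq}

def call_with_fallback(question: str) -> str:
    # Primary provider and fallback toggle come from the environment,
    # mirroring LLM_PROVIDER / ENABLE_LLM_FALLBACK in the diff below.
    primary = os.getenv("LLM_PROVIDER", "gemini").lower()
    fallback_on = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
    try:
        return PROVIDERS[primary](question)
    except Exception as err:
        if not fallback_on:
            raise RuntimeError(f"{primary} failed, fallback disabled") from err
        for name, func in PROVIDERS.items():
            if name == primary:
                continue
            try:
                return func(question)
            except Exception:
                continue
        raise RuntimeError("all providers failed") from err

os.environ["LLM_PROVIDER"] = "gemini"
os.environ["ENABLE_LLM_FALLBACK"] = "true"
print(call_with_fallback("capital of France?"))  # groq-plan: capital of France?
```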
CHANGELOG.md:

@@ -99,6 +99,74 @@
 - ✅ No regressions introduced by Stage 5 changes
 - ✅ Test suite run time: ~2min 40sec
 
+### [PROBLEM: LLM Provider Debugging - Config-Based Selection]
+
+**Problem:** Hard to debug which LLM provider handles each step with the 4-tier fallback chain. Cannot isolate provider performance for improvement.
+
+**Modified Files:**
+
+- **.env** (~5 lines added)
+  - Added `LLM_PROVIDER=gemini` - Select single provider: "gemini", "huggingface", "groq", or "claude"
+  - Added `ENABLE_LLM_FALLBACK=false` - Toggle fallback behavior (true/false)
+  - Removed deprecated `DEFAULT_LLM_MODEL` config
+- **src/agent/llm_client.py** (~150 lines added/modified)
+  - Added `LLM_PROVIDER` config variable (line 49) - Reads from environment
+  - Added `ENABLE_LLM_FALLBACK` config variable (line 50) - Reads from environment
+  - Added `_get_provider_function()` helper (lines 114-158) - Maps function names to provider implementations
+  - Added `_call_with_fallback()` routing function (lines 161-212)
+    - Primary provider: Uses LLM_PROVIDER config
+    - Fallback behavior: Controlled by ENABLE_LLM_FALLBACK
+    - Logging: Clear info logs showing which provider is used
+    - Error handling: Specific error messages when fallback is disabled
+  - Updated `plan_question()` - Now uses `_call_with_fallback()` (simplified from ~40 lines to 1 line)
+  - Updated `select_tools_with_function_calling()` - Now uses `_call_with_fallback()` (simplified from ~40 lines to 1 line)
+  - Updated `synthesize_answer()` - Now uses `_call_with_fallback()` (simplified from ~40 lines to 1 line)
+
+**Benefits:**
+- ✅ Easy debugging: Change `LLM_PROVIDER=groq` in .env to test a specific provider
+- ✅ Clear logs: Know exactly which LLM handled each step
+- ✅ Isolated testing: Disable fallback to measure single-provider performance
+- ✅ Production safety: Enable fallback for deployment reliability
+
+**Verification:**
+- ✅ Config-based selection tested with the Groq provider
+- ✅ Logs show "Using primary provider: groq"
+- ✅ Fallback-disabled error handling works correctly
+
+### [PROBLEM: Cloud Testing UX - UI-Based LLM Selection]
+
+**Problem:** Testing different LLM providers in HF Spaces requires manually changing environment variables in the Space settings, then waiting for a rebuild. Slow iteration, poor UX.
+
+**Modified Files:**
+
+- **app.py** (~30 lines added/modified)
+  - Updated `test_single_question()` signature - Added `llm_provider` and `enable_fallback` parameters
+    - Sets `os.environ["LLM_PROVIDER"]` from the UI selection (overrides .env and HF Space env vars)
+    - Sets `os.environ["ENABLE_LLM_FALLBACK"]` from the UI checkbox
+    - Adds provider info to the diagnostics output
+  - Updated `run_and_submit_all()` signature - Added `llm_provider` and `enable_fallback` parameters
+    - Reordered params: UI inputs first, profile last (optional)
+    - Sets environment variables before agent initialization
+  - Added UI components in the "Test & Debug" tab:
+    - `llm_provider_dropdown` - Select from: Gemini, HuggingFace, Groq, Claude (default: Groq)
+    - `enable_fallback_checkbox` - Toggle fallback behavior (default: false for testing)
+  - Added UI components in the "Full Evaluation" tab:
+    - `eval_llm_provider_dropdown` - Select the LLM for all questions (default: Groq)
+    - `eval_enable_fallback_checkbox` - Toggle fallback (default: true for production)
+  - Updated button click handlers to pass the new UI inputs to the functions
+
+**Benefits:**
+- ✅ **Cloud testing:** Test all 4 providers directly from the HF Space UI
+- ✅ **Instant switching:** No environment variable changes, no rebuild wait
+- ✅ **Clear visibility:** The UI shows which provider is selected
+- ✅ **A/B testing:** Easy comparison between providers on the same questions
+- ✅ **Production safety:** Fallback enabled by default for full evaluation
+
+**Verification:**
+- ✅ No syntax errors in app.py
+- ✅ UI components properly connected to function parameters
+
 ### Created Files
 
 ### Deleted Files
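The `ENABLE_LLM_FALLBACK=false` setting described in the changelog only works because the config compares the raw string against `"true"`. A small demo of why that explicit comparison matters (the helper name is hypothetical):

```python
import os

def env_flag(name: str, default: str = "false") -> bool:
    # Compare against the literal string "true": bool("false") is True in
    # Python (any non-empty string is truthy), so naive casting would
    # silently enable the flag.
    return os.getenv(name, default).strip().lower() == "true"

os.environ["ENABLE_LLM_FALLBACK"] = "False"
print(env_flag("ENABLE_LLM_FALLBACK"))  # False
print(bool("False"))                    # True - the trap being avoided
```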
app.py:

@@ -148,12 +148,18 @@ def format_diagnostics(final_state: dict) -> str:
     return "\n".join(diagnostics)
 
 
-def test_single_question(question: str):
+def test_single_question(question: str, llm_provider: str, enable_fallback: bool):
     """Test agent with a single question and return diagnostics."""
     if not question or not question.strip():
         return "Please enter a question.", "", check_api_keys()
 
     try:
+        # Set LLM provider from UI selection (overrides .env)
+        os.environ["LLM_PROVIDER"] = llm_provider.lower()
+        os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
+
+        logger.info(f"UI Config: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}")
+
         # Initialize agent
         agent = GAIAAgent()
 
@@ -163,8 +169,9 @@ def test_single_question(question: str):
         # Get final state from agent
         final_state = agent.last_state or {}
 
-        # Format diagnostics
-        diagnostics = format_diagnostics(final_state)
+        # Format diagnostics with LLM provider info
+        provider_info = f"**LLM Provider:** {llm_provider} (Fallback: {'Enabled' if enable_fallback else 'Disabled'})\n\n"
+        diagnostics = provider_info + format_diagnostics(final_state)
         api_status = check_api_keys()
 
         return answer, diagnostics, api_status
@@ -183,7 +190,7 @@ def test_single_question(question: str):
 # Stage 5: Performance optimization
 
 
-def run_and_submit_all(profile: gr.OAuthProfile | None):
+def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAuthProfile | None = None):
     """
     Fetches all questions, runs the BasicAgent on them, submits all answers,
     and displays the results.
@@ -202,6 +209,11 @@ def run_and_submit_all(profile: gr.OAuthProfile | None):
     questions_url = f"{api_url}/questions"
     submit_url = f"{api_url}/submit"
 
+    # Set LLM provider from UI selection (overrides .env)
+    os.environ["LLM_PROVIDER"] = llm_provider.lower()
+    os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
+    logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}")
+
     # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
     try:
         logger.info("Initializing GAIAAgent...")
@@ -363,6 +375,20 @@ with gr.Blocks() as demo:
                 placeholder="e.g., What is the capital of France?",
                 lines=3
             )
+
+            with gr.Row():
+                llm_provider_dropdown = gr.Dropdown(
+                    label="LLM Provider",
+                    choices=["Gemini", "HuggingFace", "Groq", "Claude"],
+                    value="Groq",
+                    info="Select which LLM to use for this test"
+                )
+                enable_fallback_checkbox = gr.Checkbox(
+                    label="Enable Fallback",
+                    value=False,
+                    info="If enabled, falls back to other providers on failure"
+                )
+
             test_button = gr.Button("Run Test", variant="primary")
 
             with gr.Row():
@@ -386,7 +412,7 @@ with gr.Blocks() as demo:
 
     test_button.click(
         fn=test_single_question,
-        inputs=[test_question_input],
+        inputs=[test_question_input, llm_provider_dropdown, enable_fallback_checkbox],
        outputs=[test_answer_output, test_diagnostics_output, test_api_status]
     )
 
@@ -409,6 +435,19 @@ with gr.Blocks() as demo:
 
     gr.LoginButton()
 
+    with gr.Row():
+        eval_llm_provider_dropdown = gr.Dropdown(
+            label="LLM Provider for Evaluation",
+            choices=["Gemini", "HuggingFace", "Groq", "Claude"],
+            value="Groq",
+            info="Select which LLM to use for all questions"
+        )
+        eval_enable_fallback_checkbox = gr.Checkbox(
+            label="Enable Fallback",
+            value=True,
+            info="Recommended: Enable fallback for production evaluation"
+        )
+
     run_button = gr.Button("Run Evaluation & Submit All Answers")
 
     status_output = gr.Textbox(
@@ -422,7 +461,11 @@ with gr.Blocks() as demo:
         type="filepath"
     )
 
-    run_button.click(
+    run_button.click(
+        fn=run_and_submit_all,
+        inputs=[eval_llm_provider_dropdown, eval_enable_fallback_checkbox],
+        outputs=[status_output, results_table, export_output]
+    )
 
 if __name__ == "__main__":
     print("\n" + "-" * 30 + " App Starting " + "-" * 30)
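One subtlety worth double-checking in this pattern: writing `os.environ` from the UI only takes effect for consumers that re-read the variable after the write. A value captured into a module-level constant at import time will not see later overrides. A minimal illustration of the difference (names are hypothetical, not the repo's):

```python
import os

os.environ["LLM_PROVIDER"] = "gemini"

# Read once - the behavior of a module-level constant set at import time.
FROZEN_PROVIDER = os.getenv("LLM_PROVIDER", "gemini").lower()

def provider_now() -> str:
    # Re-reading inside the call picks up overrides made after import.
    return os.getenv("LLM_PROVIDER", "gemini").lower()

os.environ["LLM_PROVIDER"] = "groq"  # e.g. written from the UI dropdown
print(FROZEN_PROVIDER)  # gemini
print(provider_now())   # groq
```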
src/agent/llm_client.py:

@@ -45,6 +45,10 @@ GROQ_MODEL = "qwen/qwen3-32b"  # Free tier: 60 req/min, fast inference
 TEMPERATURE = 0  # Deterministic for factoid answers
 MAX_TOKENS = 4096
 
+# LLM Provider Selection
+LLM_PROVIDER = os.getenv("LLM_PROVIDER", "gemini").lower()  # "gemini", "huggingface", "groq", or "claude"
+ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
+
 # ============================================================================
 # Logging Setup
 # ============================================================================
@@ -102,6 +106,112 @@ def retry_with_backoff(func: Callable, max_retries: int = 3) -> Any:
         raise
 
 
+# ============================================================================
+# LLM Provider Routing
+# ============================================================================
+
+
+def _get_provider_function(function_name: str, provider: str) -> Callable:
+    """
+    Get the provider-specific function for a given operation.
+
+    Args:
+        function_name: Base function name ("plan_question", "select_tools", "synthesize_answer")
+        provider: Provider name ("gemini", "huggingface", "groq", "claude")
+
+    Returns:
+        Callable: Provider-specific function
+
+    Raises:
+        ValueError: If provider is invalid
+    """
+    # Map function names to provider-specific implementations
+    function_map = {
+        "plan_question": {
+            "gemini": plan_question_gemini,
+            "huggingface": plan_question_hf,
+            "groq": plan_question_groq,
+            "claude": plan_question_claude,
+        },
+        "select_tools": {
+            "gemini": select_tools_gemini,
+            "huggingface": select_tools_hf,
+            "groq": select_tools_groq,
+            "claude": select_tools_claude,
+        },
+        "synthesize_answer": {
+            "gemini": synthesize_answer_gemini,
+            "huggingface": synthesize_answer_hf,
+            "groq": synthesize_answer_groq,
+            "claude": synthesize_answer_claude,
+        },
+    }
+
+    if function_name not in function_map:
+        raise ValueError(f"Unknown function name: {function_name}")
+
+    if provider not in function_map[function_name]:
+        raise ValueError(
+            f"Unknown provider: {provider}. Valid options: gemini, huggingface, groq, claude"
+        )
+
+    return function_map[function_name][provider]
+
+
+def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
+    """
+    Call LLM function with configured provider and optional fallback.
+
+    Args:
+        function_name: Base function name ("plan_question", "select_tools", "synthesize_answer")
+        *args, **kwargs: Arguments to pass to the provider-specific function
+
+    Returns:
+        Result from LLM call
+
+    Raises:
+        Exception: If selected provider fails and fallback disabled, or all providers fail
+    """
+    primary_provider = LLM_PROVIDER
+
+    # Define fallback order (excluding primary provider)
+    all_providers = ["gemini", "huggingface", "groq", "claude"]
+    fallback_providers = [p for p in all_providers if p != primary_provider]
+
+    # Try primary provider first
+    try:
+        primary_func = _get_provider_function(function_name, primary_provider)
+        logger.info(f"[{function_name}] Using primary provider: {primary_provider}")
+        return retry_with_backoff(lambda: primary_func(*args, **kwargs))
+    except Exception as primary_error:
+        logger.warning(f"[{function_name}] Primary provider {primary_provider} failed: {primary_error}")
+
+        # If fallback disabled, raise immediately
+        if not ENABLE_LLM_FALLBACK:
+            logger.error(f"[{function_name}] Fallback disabled. Failing fast.")
+            raise Exception(
+                f"{function_name} failed with {primary_provider}: {primary_error}. "
+                f"Fallback disabled (ENABLE_LLM_FALLBACK=false)"
+            )
+
+        # Try fallback providers in order
+        errors = {primary_provider: primary_error}
+        for fallback_provider in fallback_providers:
+            try:
+                fallback_func = _get_provider_function(function_name, fallback_provider)
+                logger.info(f"[{function_name}] Trying fallback provider: {fallback_provider}")
+                return retry_with_backoff(lambda: fallback_func(*args, **kwargs))
+            except Exception as fallback_error:
+                errors[fallback_provider] = fallback_error
+                logger.warning(f"[{function_name}] Fallback provider {fallback_provider} failed: {fallback_error}")
+                continue
+
+        # All providers failed
+        error_summary = ", ".join([f"{k}: {v}" for k, v in errors.items()])
+        logger.error(f"[{function_name}] All providers failed. {error_summary}")
+        raise Exception(f"{function_name} failed with all providers. {error_summary}")
+
+
 # ============================================================================
 # Client Initialization
 # ============================================================================
@@ -423,8 +533,8 @@ def plan_question(
     """
     Analyze question and generate execution plan using LLM.
 
-
-
+    Uses LLM_PROVIDER config to select which provider to use.
+    If ENABLE_LLM_FALLBACK=true, falls back to other providers on failure.
     Each provider call wrapped with retry logic (3 attempts with exponential backoff).
 
     Args:
@@ -435,43 +545,7 @@ def plan_question(
     Returns:
         Execution plan as structured text
     """
-    try:
-        return retry_with_backoff(
-            lambda: plan_question_gemini(question, available_tools, file_paths)
-        )
-    except Exception as gemini_error:
-        logger.warning(
-            f"[plan_question] Gemini failed: {gemini_error}, trying HuggingFace fallback"
-        )
-        try:
-            return retry_with_backoff(
-                lambda: plan_question_hf(question, available_tools, file_paths)
-            )
-        except Exception as hf_error:
-            logger.warning(
-                f"[plan_question] HuggingFace failed: {hf_error}, trying Groq fallback"
-            )
-            try:
-                return retry_with_backoff(
-                    lambda: plan_question_groq(question, available_tools, file_paths)
-                )
-            except Exception as groq_error:
-                logger.warning(
-                    f"[plan_question] Groq failed: {groq_error}, trying Claude fallback"
-                )
-                try:
-                    return retry_with_backoff(
-                        lambda: plan_question_claude(
-                            question, available_tools, file_paths
-                        )
-                    )
-                except Exception as claude_error:
-                    logger.error(
-                        f"[plan_question] All LLMs failed. Gemini: {gemini_error}, HF: {hf_error}, Groq: {groq_error}, Claude: {claude_error}"
-                    )
-                    raise Exception(
-                        f"Planning failed with all LLMs. Gemini: {gemini_error}, HF: {hf_error}, Groq: {groq_error}, Claude: {claude_error}"
-                    )
+    return _call_with_fallback("plan_question", question, available_tools, file_paths)
 
 
 # ============================================================================
@@ -825,8 +899,8 @@ def select_tools_with_function_calling(
     """
     Use LLM function calling to dynamically select tools and extract parameters.
 
-
-
+    Uses LLM_PROVIDER config to select which provider to use.
+    If ENABLE_LLM_FALLBACK=true, falls back to other providers on failure.
     Each provider call wrapped with retry logic (3 attempts with exponential backoff).
 
     Args:
@@ -837,41 +911,7 @@ def select_tools_with_function_calling(
     Returns:
        List of tool calls with extracted parameters
     """
-    try:
-        return retry_with_backoff(
-            lambda: select_tools_gemini(question, plan, available_tools)
-        )
-    except Exception as gemini_error:
-        logger.warning(
-            f"[select_tools] Gemini failed: {gemini_error}, trying HuggingFace fallback"
-        )
-        try:
-            return retry_with_backoff(
-                lambda: select_tools_hf(question, plan, available_tools)
-            )
-        except Exception as hf_error:
-            logger.warning(
-                f"[select_tools] HuggingFace failed: {hf_error}, trying Groq fallback"
-            )
-            try:
-                return retry_with_backoff(
-                    lambda: select_tools_groq(question, plan, available_tools)
-                )
-            except Exception as groq_error:
-                logger.warning(
-                    f"[select_tools] Groq failed: {groq_error}, trying Claude fallback"
-                )
-                try:
-                    return retry_with_backoff(
-                        lambda: select_tools_claude(question, plan, available_tools)
-                    )
-                except Exception as claude_error:
-                    logger.error(
-                        f"[select_tools] All LLMs failed. Gemini: {gemini_error}, HF: {hf_error}, Groq: {groq_error}, Claude: {claude_error}"
-                    )
-                    raise Exception(
-                        f"Tool selection failed with all LLMs. Gemini: {gemini_error}, HF: {hf_error}, Groq: {groq_error}, Claude: {claude_error}"
-                    )
+    return _call_with_fallback("select_tools", question, plan, available_tools)
 
 
 # ============================================================================
@@ -1121,8 +1161,8 @@ def synthesize_answer(question: str, evidence: List[str]) -> str:
     """
     Synthesize factoid answer from collected evidence using LLM.
 
-
-
+    Uses LLM_PROVIDER config to select which provider to use.
+    If ENABLE_LLM_FALLBACK=true, falls back to other providers on failure.
     Each provider call wrapped with retry logic (3 attempts with exponential backoff).
 
     Args:
@@ -1132,37 +1172,7 @@ def synthesize_answer(question: str, evidence: List[str]) -> str:
     Returns:
        Factoid answer string
     """
-    try:
-        return retry_with_backoff(lambda: synthesize_answer_gemini(question, evidence))
-    except Exception as gemini_error:
-        logger.warning(
-            f"[synthesize_answer] Gemini failed: {gemini_error}, trying HuggingFace fallback"
-        )
-        try:
-            return retry_with_backoff(lambda: synthesize_answer_hf(question, evidence))
-        except Exception as hf_error:
-            logger.warning(
-                f"[synthesize_answer] HuggingFace failed: {hf_error}, trying Groq fallback"
-            )
-            try:
-                return retry_with_backoff(
-                    lambda: synthesize_answer_groq(question, evidence)
-                )
-            except Exception as groq_error:
-                logger.warning(
-                    f"[synthesize_answer] Groq failed: {groq_error}, trying Claude fallback"
-                )
-                try:
-                    return retry_with_backoff(
-                        lambda: synthesize_answer_claude(question, evidence)
-                    )
-                except Exception as claude_error:
-                    logger.error(
-                        f"[synthesize_answer] All LLMs failed. Gemini: {gemini_error}, HF: {hf_error}, Groq: {groq_error}, Claude: {claude_error}"
-                    )
-                    raise Exception(
-                        f"Answer synthesis failed with all LLMs. Gemini: {gemini_error}, HF: {hf_error}, Groq: {groq_error}, Claude: {claude_error}"
-                    )
+    return _call_with_fallback("synthesize_answer", question, evidence)
 
 
 # ============================================================================