
[dev_260103_16] HuggingFace LLM API Integration

Date: 2026-01-03
Type: Development
Status: Resolved
Related Dev: dev_260102_15_stage4_mvp_real_integration.md

Problem Description

Context: Stage 4 implementation was 7/10 tasks complete, with comprehensive diagnostics and error handling in place. However, testing revealed critical LLM availability issues:

  1. Gemini 2.0 Flash - Quota exceeded (1,500 requests/day free tier limit exhausted from testing)
  2. Claude Sonnet 4.5 - Credit balance too low (paid tier, user's balance depleted)

Root Cause: The agent relied on only 2 LLM tiers (free Gemini → paid Claude), with no middle fallback when the free tier was exhausted. This caused complete LLM failure, leaving only keyword-based tool selection (the Stage 4 fallback mechanism).

User Request: Add completely free LLM alternative that works in HuggingFace Spaces environment without requiring local GPU resources.

Requirements:

  • Must be completely free (no credits, reasonable rate limits)
  • Must support function calling (critical for tool selection)
  • Must work in HuggingFace Spaces (cloud-based, no local GPU)
  • Must integrate into existing 3-tier fallback architecture

Key Decisions

Decision 1: HuggingFace LLM API over Ollama (local LLMs)

Why chosen:

  • ✅ Works in HuggingFace Spaces (cloud-based API)
  • ✅ Free tier with rate limits (~60 req/min vs Gemini's 1,500 req/day)
  • ✅ Function calling support via OpenAI-compatible API
  • ✅ No GPU requirements (serverless inference)
  • ✅ Already deployed to HF Spaces - logical integration

Rejected alternative: Ollama + Llama 3.1 70B (local)

  • ❌ Requires local GPU or high-end CPU
  • ❌ Won't work in HuggingFace Free Spaces (CPU-only, 16GB RAM limit)
  • ❌ Would need GPU Spaces upgrade (not free)
  • ❌ Complex setup for user's deployment environment

Decision 2: Qwen 2.5 72B Instruct as HuggingFace Model

Why chosen:

  • ✅ Excellent function calling capabilities (OpenAI-compatible tools format)
  • ✅ Strong reasoning performance (competitive with GPT-4 on benchmarks)
  • ✅ Free on HuggingFace LLM API
  • ✅ 72B parameters - sufficient intelligence for GAIA tasks

Considered alternatives:

  • meta-llama/Llama-3.1-70B-Instruct - Good but slightly worse function calling
  • NousResearch/Hermes-3-Llama-3.1-70B - Excellent but less tested for tool use
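As a concrete reference, here is a minimal sketch of calling this model through huggingface_hub's InferenceClient. The function names and prompt text are illustrative, not the actual src/agent/llm_client.py code; the third-party import is deferred so the module loads without the dependency installed.

```python
import os

# Model name taken from the dev log; everything else below is a sketch.
HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"

def build_messages(question: str) -> list:
    """Pure helper: OpenAI-style chat messages for a planning-style call."""
    return [
        {"role": "system", "content": "You are a planning assistant for a GAIA agent."},
        {"role": "user", "content": question},
    ]

def ask_qwen(question: str, max_tokens: int = 64) -> str:
    """Call the serverless HF inference API; requires HF_TOKEN in the environment."""
    from huggingface_hub import InferenceClient  # deferred: third-party dependency
    client = InferenceClient(model=HF_MODEL, token=os.environ["HF_TOKEN"])
    resp = client.chat_completion(messages=build_messages(question), max_tokens=max_tokens)
    return resp.choices[0].message.content
```

The same `chat_completion` call accepts a `tools=` argument in the OpenAI-compatible format shown later in this document, which is what makes this tier usable for tool selection.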

Decision 3: 3-Tier Fallback Architecture

Final chain (three LLM tiers plus a deterministic last resort):

  1. Gemini 2.0 Flash (free, 1,500 req/day) - Primary
  2. HuggingFace Qwen 2.5 72B (free, rate limited) - NEW Middle Tier
  3. Claude Sonnet 4.5 (paid) - Expensive fallback
  4. Keyword matching (deterministic) - Last resort

Trade-offs:

  • Pro: 4 layers of resilience ensure agent always produces output
  • Pro: Maximizes free tier usage before burning paid credits
  • Con: Slightly higher latency on fallback chain traversal
  • Con: More API keys to manage (but HF_TOKEN already required for Space)

Decision 4: TOOLS Schema Bug Fix (Critical)

Problem discovered: src/tools/__init__.py had parameters as list ["query"] but LLM client expected dict {"query": {...}} with type/description.

Impact: Gemini function calling was completely broken - caused 'list' object has no attribute 'items' error.

Fix: Updated all tool definitions to proper schema:

"parameters": {
    "query": {
        "description": "Search query string",
        "type": "string"
    },
    "max_results": {
        "description": "Maximum number of search results to return",
        "type": "integer"
    }
},
"required_params": ["query"]

Result: Gemini function calling now working correctly (verified in tests).


Outcome

Successfully integrated HuggingFace LLM API as free LLM fallback tier, completing Stage 4 MVP with robust multi-tier resilience.

Deliverables:

  1. src/agent/llm_client.py - Added ~150 lines of HuggingFace integration

    • create_hf_client() - Initialize InferenceClient with HF_TOKEN
    • plan_question_hf() - Planning using Qwen 2.5 72B
    • select_tools_hf() - Function calling with OpenAI-compatible tools format
    • synthesize_answer_hf() - Answer synthesis from evidence
    • Updated unified functions: plan_question(), select_tools_with_function_calling(), synthesize_answer() to use 3-tier fallback
  2. src/agent/graph.py - Added HF_TOKEN validation

    • Updated validate_environment() to check HF_TOKEN at agent startup
    • Shows ⚠️ WARNING if HF_TOKEN missing
  3. app.py - Updated UI and added JSON export functionality

    • Added HF_TOKEN to check_api_keys() display in Test & Debug tab
    • Added export_results_to_json() - Exports evaluation results as clean JSON
      • Local: ~/Downloads/gaia_results_TIMESTAMP.json
      • HF Spaces: ./exports/gaia_results_TIMESTAMP.json (environment-aware)
      • Full error messages preserved (no truncation), easy code processing
    • Updated run_and_submit_all() - ALL return paths now export results
    • Added gr.File download button - Direct download instead of text display
  4. src/tools/__init__.py - Fixed TOOLS schema bug (earlier in session)

    • Changed parameters from list to dict format
    • Added type/description for each parameter
    • Fixed Gemini function calling compatibility
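A minimal sketch of what the environment-aware JSON export could look like. The SPACE_ID check and payload shape are assumptions for illustration, not the actual app.py implementation:

```python
import json
import os
import time
from pathlib import Path

def export_results_to_json(results_log, submission_status):
    """Write evaluation results to an environment-aware JSON file and return its path."""
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    if os.getenv("SPACE_ID"):  # assumption: SPACE_ID is set inside HF Spaces
        out_dir = Path("./exports")
    else:
        out_dir = Path.home() / "Downloads"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"gaia_results_{timestamp}.json"
    payload = {"submission_status": submission_status, "results": results_log}
    # indent=2 keeps the file human-readable; full error strings are preserved
    out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
    return str(out_path)
```

Returning the path (rather than the JSON text) is what lets a gr.File component offer the file as a direct download.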

Test Results:

uv run pytest test/ -q
99 passed, 11 warnings in 51.99s  ✅

All tests passing with new 3-tier fallback architecture.

Stage 4 Progress: 10/10 tasks completed ✅

  • ✅ Comprehensive debug logging
  • ✅ Improved error messages
  • ✅ API key validation (including HF_TOKEN)
  • ✅ Tool execution error handling
  • ✅ Fallback tool execution (keyword matching)
  • ✅ LLM exception handling (3-tier fallback)
  • ✅ Diagnostics display in Gradio UI
  • ✅ Documentation in dev log (this file)
  • ✅ Tool name consistency fix (web_search, calculator, vision)
  • ✅ Deploy to HF Space and run GAIA validation

GAIA Validation Results (Real Test):

  • Score: 10.0% (2/20 correct)
  • Improvement: 0/20 → 2/20 (MVP validated!)
  • Status: Agent is functional and operational

What worked:

  • ✅ Question 1: "How many studio albums were published by Mercedes Sosa between 2000 and 2009?" → Answer: "3" (CORRECT)
  • ✅ HuggingFace LLM (Qwen 2.5 72B) successfully used for planning and tool selection
  • ✅ Web search tool executed successfully
  • ✅ Evidence collection and answer synthesis working

What failed:

  • ❌ Question 2: YouTube video analysis (vision tool) - "Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed"
  • Issue: Vision tool requires multimodal LLM access (quota-limited or needs configuration)

Next Stage: Stage 5 - Performance Optimization (target: 5/20 questions)


Learnings and Insights

Pattern: Free-First Fallback Architecture

What worked well:

  • Prioritizing free tiers (Gemini → HuggingFace) before the paid tier (Claude) maximizes cost efficiency
  • Multiple free alternatives with different quota models (daily vs rate-limited) provide better resilience than single free tier
  • Keyword fallback ensures agent never completely fails even when all LLMs unavailable

Reusable pattern:

def unified_llm_function(...):
    """3-tier LLM fallback with comprehensive error capture."""
    errors = []

    try:
        return free_tier_1(...)  # Gemini - daily quota
    except Exception as e1:
        errors.append(f"Tier 1 (Gemini): {e1}")
        try:
            return free_tier_2(...)  # HuggingFace - rate limited
        except Exception as e2:
            errors.append(f"Tier 2 (HuggingFace): {e2}")
            try:
                return paid_tier(...)  # Claude - credits
            except Exception as e3:
                errors.append(f"Tier 3 (Claude): {e3}")
                # Deterministic last resort - surface the collected errors first
                logger.warning("All LLM tiers failed: %s", "; ".join(errors))
                return keyword_fallback(...)
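The keyword_fallback at the end of the chain can stay trivially simple. A hypothetical sketch, where the tool names (web_search, calculator, vision) match the dev log but the keyword table is invented for illustration:

```python
# Hypothetical keyword-to-tool map; real keywords would be tuned per tool.
KEYWORD_TOOL_MAP = {
    "web_search": ["search", "who", "when", "where", "album", "published"],
    "calculator": ["calculate", "sum", "how many", "total", "percent"],
    "vision": ["image", "video", "picture", "youtube"],
}

def keyword_fallback(question: str) -> list:
    """Deterministic last resort: pick tools whose keywords appear in the question."""
    q = question.lower()
    selected = [tool for tool, kws in KEYWORD_TOOL_MAP.items() if any(k in q for k in kws)]
    return selected or ["web_search"]  # default to web search when nothing matches
```

Because this path never raises, the agent always produces some tool selection, which is exactly the "agent never completely fails" property described above.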

Pattern: Function Calling Schema Compatibility

Critical insight: Different LLM providers require different function calling schemas:

  1. Gemini - genai.protos.Tool with function_declarations:

    Tool(function_declarations=[
        FunctionDeclaration(
            name="search_web",
            description="...",
            parameters={
                "type": "object",
                "properties": {"query": {"type": "string", "description": "..."}},
                "required": ["query"]
            }
        )
    ])
    
  2. HuggingFace - OpenAI-compatible tools array:

    tools = [{
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "...",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "..."}},
                "required": ["query"]
            }
        }
    }]
    
  3. Claude - Anthropic native format (simplified):

    tools = [{
        "name": "search_web",
        "description": "...",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "..."}},
            "required": ["query"]
        }
    }]
    

Best practice: Maintain single source of truth in src/tools/__init__.py with rich schema (dict format with type/description), then transform to provider-specific format in LLM client functions.
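A sketch of that transform step for the two schema shapes that are plain dicts (the sample TOOL entry is illustrative; Gemini's proto-based format is omitted here since it requires the SDK types):

```python
# Illustrative rich tool definition in the src/tools/__init__.py dict format.
TOOL = {
    "name": "web_search",
    "description": "Search the web",
    "parameters": {
        "query": {"type": "string", "description": "Search query string"},
        "max_results": {"type": "integer", "description": "Maximum number of results"},
    },
    "required_params": ["query"],
}

def to_json_schema(tool: dict) -> dict:
    """Shared JSON-Schema body used by both OpenAI-style and Anthropic formats."""
    return {
        "type": "object",
        "properties": tool["parameters"],
        "required": tool["required_params"],
    }

def to_openai_tool(tool: dict) -> dict:
    """HuggingFace / OpenAI-compatible 'tools' array entry."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": to_json_schema(tool),
        },
    }

def to_claude_tool(tool: dict) -> dict:
    """Anthropic native format: the same schema, under 'input_schema'."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": to_json_schema(tool),
    }
```

Since all three providers ultimately want the same JSON-Schema body, the rich dict format is the only part that needs to be maintained by hand.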

Pattern: Environment Validation at Startup

What worked well:

  • Validating all API keys at agent initialization (not at first use) provides immediate feedback
  • Clear warnings listing missing keys help users diagnose setup issues
  • Non-blocking warnings (continue anyway) allow testing with partial configuration

Implementation:

import os
import logging
from typing import List

logger = logging.getLogger(__name__)

def validate_environment() -> List[str]:
    """Check API keys at startup; return list of missing keys."""
    missing = []
    for key_name in ["GOOGLE_API_KEY", "HF_TOKEN", "ANTHROPIC_API_KEY", "TAVILY_API_KEY"]:
        if not os.getenv(key_name):
            missing.append(key_name)

    if missing:
        logger.warning(f"⚠️  Missing API keys: {', '.join(missing)}")
    else:
        logger.info("✓ All API keys configured")

    return missing

What to avoid:

Anti-pattern: List-based parameter schemas

# WRONG - breaks LLM function calling
"parameters": ["query", "max_results"]

# CORRECT - works with all providers
"parameters": {
    "query": {"type": "string", "description": "..."},
    "max_results": {"type": "integer", "description": "..."}
}

Why it breaks: LLM clients iterate over parameters.items() to extract type/description metadata. List has no .items() method.
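The failure mode is easy to reproduce in isolation:

```python
# Reproducing the bug: calling .items() on a list raises AttributeError.
list_params = ["query", "max_results"]  # WRONG schema shape
dict_params = {
    "query": {"type": "string", "description": "Search query string"},
}

try:
    for name, meta in list_params.items():  # what the LLM client does internally
        pass
except AttributeError as e:
    print(e)  # e.g. 'list' object has no attribute 'items'

# The dict form iterates cleanly and carries the metadata the client needs:
for name, meta in dict_params.items():
    assert "type" in meta and "description" in meta
```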


Changelog

Session Date: 2026-01-03

Modified Files

  1. src/agent/llm_client.py (~150 lines added)

    • Added create_hf_client() - Initialize HuggingFace InferenceClient with HF_TOKEN
    • Added plan_question_hf(question, available_tools, file_paths) - Planning with Qwen 2.5 72B
    • Added select_tools_hf(question, plan, available_tools) - Function calling with OpenAI-compatible tools format
    • Added synthesize_answer_hf(question, evidence) - Answer synthesis from evidence
    • Updated plan_question() - Added HuggingFace as middle fallback tier (Gemini → HF → Claude)
    • Updated select_tools_with_function_calling() - Added HuggingFace as middle fallback tier
    • Updated synthesize_answer() - Added HuggingFace as middle fallback tier
    • Added CONFIG constant: HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"
    • Added import: from huggingface_hub import InferenceClient
  2. src/agent/graph.py

    • Updated validate_environment() - Added HF_TOKEN to API key validation check
    • Updated startup logging - Shows ⚠️ WARNING if HF_TOKEN missing
  3. app.py

    • Updated check_api_keys() - Added HF_TOKEN status display in Test & Debug tab
    • UI now shows: "HF_TOKEN (HuggingFace): ✓ SET" or "✗ MISSING"
    • Added export_results_to_json(results_log, submission_status) - Export evaluation results as JSON
      • Local: ~/Downloads/gaia_results_TIMESTAMP.json
      • HF Spaces: ./exports/gaia_results_TIMESTAMP.json
      • Pretty formatted (indent=2), full error messages, easy code processing
    • Updated run_and_submit_all() - ALL return paths now export results
    • Added gr.File download button - Direct download of JSON file
    • Updated run_button click handler - Outputs 3 values (status, table, export_path)
  4. src/tools/__init__.py (Fixed earlier in session)

    • Fixed TOOLS schema bug - Changed parameters from list to dict format
    • Updated all tool definitions to include type/description for each parameter
    • Added "required_params" field to specify required parameters
    • Fixed Gemini function calling compatibility

Dependencies

No changes to requirements.txt - huggingface-hub>=0.26.0 already present from initial setup.

Test Results

All tests passing with new 3-tier fallback architecture:

uv run pytest test/ -q
======================== 99 passed, 11 warnings in 51.99s ========================