[dev_260103_16] HuggingFace LLM API Integration
Date: 2026-01-03 Type: Development Status: Resolved Related Dev: dev_260102_15_stage4_mvp_real_integration.md
Problem Description
Context: Stage 4 implementation was 7/10 complete with comprehensive diagnostics and error handling. However, testing revealed critical LLM availability issues:
- Gemini 2.0 Flash - Quota exceeded (1,500 requests/day free tier limit exhausted from testing)
- Claude Sonnet 4.5 - Credit balance too low (paid tier, user's balance depleted)
Root Cause: Agent relied on only 2 LLM tiers (free Gemini → paid Claude), with no middle fallback when the free tier was exhausted. This caused complete LLM failure, falling back to keyword-based tool selection (Stage 4 fallback mechanism).
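For context, the keyword-based last resort might look like the following minimal sketch. The tool names match the project's tools, but the trigger-word mapping and function name are illustrative assumptions, not the project's actual implementation:

```python
# Hypothetical keyword-to-tool mapping; trigger words are illustrative only.
KEYWORD_TOOL_MAP = {
    "web_search": ["search", "find", "who", "when", "where"],
    "calculator": ["calculate", "sum", "how many", "average"],
    "vision": ["image", "video", "picture", "youtube"],
}

def keyword_select_tools(question: str) -> list[str]:
    """Deterministic tool selection used when no LLM tier is available."""
    q = question.lower()
    selected = [
        tool for tool, triggers in KEYWORD_TOOL_MAP.items()
        if any(t in q for t in triggers)
    ]
    return selected or ["web_search"]  # sensible default when nothing matches
```

The point of this tier is determinism: it always returns something, so the agent never fails outright.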
User Request: Add completely free LLM alternative that works in HuggingFace Spaces environment without requiring local GPU resources.
Requirements:
- Must be completely free (no credits, reasonable rate limits)
- Must support function calling (critical for tool selection)
- Must work in HuggingFace Spaces (cloud-based, no local GPU)
- Must integrate into existing 3-tier fallback architecture
Key Decisions
Decision 1: HuggingFace LLM API over Ollama (local LLMs)
Why chosen:
- ✅ Works in HuggingFace Spaces (cloud-based API)
- ✅ Free tier with rate limits (~60 req/min vs Gemini's 1,500 req/day)
- ✅ Function calling support via OpenAI-compatible API
- ✅ No GPU requirements (serverless inference)
- ✅ Already deployed to HF Spaces - logical integration
Rejected alternative: Ollama + Llama 3.1 70B (local)
- ❌ Requires local GPU or high-end CPU
- ❌ Won't work in HuggingFace Free Spaces (CPU-only, 16GB RAM limit)
- ❌ Would need GPU Spaces upgrade (not free)
- ❌ Complex setup for user's deployment environment
Decision 2: Qwen 2.5 72B Instruct as HuggingFace Model
Why chosen:
- ✅ Excellent function calling capabilities (OpenAI-compatible tools format)
- ✅ Strong reasoning performance (competitive with GPT-4 on benchmarks)
- ✅ Free on HuggingFace LLM API
- ✅ 72B parameters - sufficient intelligence for GAIA tasks
Considered alternatives:
- meta-llama/Llama-3.1-70B-Instruct - Good but slightly worse function calling
- NousResearch/Hermes-3-Llama-3.1-70B - Excellent but less tested for tool use
Decision 3: 3-Tier Fallback Architecture
Final chain:
- Gemini 2.0 Flash (free, 1,500 req/day) - Primary
- HuggingFace Qwen 2.5 72B (free, rate limited) - NEW Middle Tier
- Claude Sonnet 4.5 (paid) - Expensive fallback
- Keyword matching (deterministic) - Last resort
Trade-offs:
- Pro: 4 layers of resilience ensure agent always produces output
- Pro: Maximizes free tier usage before burning paid credits
- Con: Slightly higher latency on fallback chain traversal
- Con: More API keys to manage (but HF_TOKEN already required for Space)
Decision 4: TOOLS Schema Bug Fix (Critical)
Problem discovered: src/tools/__init__.py had parameters as list ["query"] but LLM client expected dict {"query": {...}} with type/description.
Impact: Gemini function calling was completely broken - caused 'list' object has no attribute 'items' error.
Fix: Updated all tool definitions to proper schema:
"parameters": {
"query": {
"description": "Search query string",
"type": "string"
},
"max_results": {
"description": "Maximum number of search results to return",
"type": "integer"
}
},
"required_params": ["query"]
Result: Gemini function calling now working correctly (verified in tests).
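A small guard like the following could catch this class of schema bug before it reaches a provider. This is a sketch, not the project's code; the `validate_tool_schema` helper is hypothetical, but the tool dict mirrors the corrected schema above:

```python
# Hypothetical schema guard; iterating .items() mirrors what the LLM client does.
def validate_tool_schema(tool: dict) -> None:
    params = tool["parameters"]
    if not isinstance(params, dict):
        raise TypeError(f"parameters must be a dict, got {type(params).__name__}")
    for name, spec in params.items():  # the exact call that broke on lists
        assert "type" in spec and "description" in spec, name
    for req in tool.get("required_params", []):
        assert req in params, f"required param {req!r} not defined"

web_search_tool = {
    "parameters": {
        "query": {"description": "Search query string", "type": "string"},
        "max_results": {"description": "Maximum number of search results to return", "type": "integer"},
    },
    "required_params": ["query"],
}
validate_tool_schema(web_search_tool)  # passes silently
```

Running such a check at import time would have surfaced the list-based schema immediately instead of at the first Gemini call.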
Outcome
Successfully integrated HuggingFace LLM API as free LLM fallback tier, completing Stage 4 MVP with robust multi-tier resilience.
Deliverables:
- src/agent/llm_client.py - Added ~150 lines of HuggingFace integration
  - create_hf_client() - Initialize InferenceClient with HF_TOKEN
  - plan_question_hf() - Planning using Qwen 2.5 72B
  - select_tools_hf() - Function calling with OpenAI-compatible tools format
  - synthesize_answer_hf() - Answer synthesis from evidence
  - Updated unified functions: plan_question(), select_tools_with_function_calling(), synthesize_answer() to use 3-tier fallback
- src/agent/graph.py - Added HF_TOKEN validation
  - Updated validate_environment() to check HF_TOKEN at agent startup - shows ⚠️ WARNING if HF_TOKEN missing
- app.py - Updated UI and added JSON export functionality
  - Added HF_TOKEN to check_api_keys() display in Test & Debug tab
  - Added export_results_to_json() - Exports evaluation results as clean JSON
    - Local: ~/Downloads/gaia_results_TIMESTAMP.json
    - HF Spaces: ./exports/gaia_results_TIMESTAMP.json (environment-aware)
    - Full error messages preserved (no truncation), easy code processing
  - Updated run_and_submit_all() - ALL return paths now export results
  - Added gr.File download button - direct download instead of text display
- src/tools/__init__.py - Fixed TOOLS schema bug (earlier in session)
  - Changed parameters from list to dict format
  - Added type/description for each parameter
  - Fixed Gemini function calling compatibility
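The environment-aware export described above could be sketched as follows. This is an illustrative reconstruction, not the actual app.py code: the `SPACE_ID` check is an assumed way to detect HuggingFace Spaces, and the function signature is simplified:

```python
import json
import os
import time
from pathlib import Path

def export_results_to_json(results: list) -> str:
    """Sketch: write evaluation results to an environment-aware JSON path.
    Assumption: HF Spaces is detected via the SPACE_ID env var."""
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    if os.getenv("SPACE_ID"):               # running inside an HF Space
        out_dir = Path("./exports")
    else:                                   # local run
        out_dir = Path.home() / "Downloads"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"gaia_results_{timestamp}.json"
    # indent=2 keeps the file human-readable; no truncation of error messages
    path.write_text(json.dumps(results, indent=2, ensure_ascii=False))
    return str(path)
```

Returning the path makes it easy to feed straight into a gr.File component for direct download.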
Test Results:
uv run pytest test/ -q
99 passed, 11 warnings in 51.99s ✅
All tests passing with new 3-tier fallback architecture.
Stage 4 Progress: 10/10 tasks completed ✅
- ✅ Comprehensive debug logging
- ✅ Improved error messages
- ✅ API key validation (including HF_TOKEN)
- ✅ Tool execution error handling
- ✅ Fallback tool execution (keyword matching)
- ✅ LLM exception handling (3-tier fallback)
- ✅ Diagnostics display in Gradio UI
- ✅ Documentation in dev log (this file)
- ✅ Tool name consistency fix (web_search, calculator, vision)
- ✅ Deploy to HF Space and run GAIA validation
GAIA Validation Results (Real Test):
- Score: 10.0% (2/20 correct)
- Improvement: 0/20 → 2/20 (MVP validated!)
- Status: Agent is functional and operational
What worked:
- ✅ Question 1: "How many studio albums were published by Mercedes Sosa between 2000 and 2009?" → Answer: "3" (CORRECT)
- ✅ HuggingFace LLM (Qwen 2.5 72B) successfully used for planning and tool selection
- ✅ Web search tool executed successfully
- ✅ Evidence collection and answer synthesis working
What failed:
- ❌ Question 2: YouTube video analysis (vision tool) - "Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed"
- Issue: Vision tool requires multimodal LLM access (quota-limited or needs configuration)
Next Stage: Stage 5 - Performance Optimization (target: 5/20 questions)
Learnings and Insights
Pattern: Free-First Fallback Architecture
What worked well:
- Prioritizing free tiers (Gemini → HuggingFace) before paid tier (Claude) maximizes cost efficiency
- Multiple free alternatives with different quota models (daily vs rate-limited) provide better resilience than single free tier
- Keyword fallback ensures agent never completely fails even when all LLMs unavailable
Reusable pattern:
def unified_llm_function(...):
    """3-tier fallback with comprehensive error capture"""
    errors = []
    try:
        return free_tier_1(...)  # Gemini - daily quota
    except Exception as e1:
        errors.append(f"Tier 1: {e1}")
    try:
        return free_tier_2(...)  # HuggingFace - rate limited
    except Exception as e2:
        errors.append(f"Tier 2: {e2}")
    try:
        return paid_tier(...)  # Claude - credits
    except Exception as e3:
        errors.append(f"Tier 3: {e3}")
    # Deterministic fallback as last resort
    return keyword_fallback(...)
Pattern: Function Calling Schema Compatibility
Critical insight: Different LLM providers require different function calling schemas:
Gemini - genai.protos.Tool with function_declarations:

Tool(function_declarations=[
    FunctionDeclaration(
        name="search_web",
        description="...",
        parameters={
            "type": "object",
            "properties": {"query": {"type": "string", "description": "..."}},
            "required": ["query"]
        }
    )
])

HuggingFace - OpenAI-compatible tools array:

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "...",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "..."}},
            "required": ["query"]
        }
    }
}]

Claude - Anthropic native format (simplified):

tools = [{
    "name": "search_web",
    "description": "...",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "..."}},
        "required": ["query"]
    }
}]
Best practice: Maintain single source of truth in src/tools/__init__.py with rich schema (dict format with type/description), then transform to provider-specific format in LLM client functions.
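A minimal sketch of that transformation for the OpenAI-compatible (HuggingFace) format follows. The `to_openai_tool` helper name is hypothetical; the canonical tool dict shape follows the dict-based schema shown earlier:

```python
# Hypothetical transform: canonical dict-based schema -> OpenAI-compatible tool.
def to_openai_tool(name: str, description: str, tool_def: dict) -> dict:
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                # canonical parameters already carry type/description per param
                "properties": tool_def["parameters"],
                "required": tool_def.get("required_params", []),
            },
        },
    }

web_search = {
    "parameters": {"query": {"type": "string", "description": "Search query"}},
    "required_params": ["query"],
}
openai_tool = to_openai_tool("web_search", "Search the web", web_search)
```

Analogous one-function transforms for the Gemini and Claude formats keep all three providers in sync with the single canonical definition.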
Pattern: Environment Validation at Startup
What worked well:
- Validating all API keys at agent initialization (not at first use) provides immediate feedback
- Clear warnings listing missing keys help users diagnose setup issues
- Non-blocking warnings (continue anyway) allow testing with partial configuration
Implementation:
def validate_environment() -> List[str]:
    """Check API keys at startup, return list of missing keys"""
    missing = []
    for key_name in ["GOOGLE_API_KEY", "HF_TOKEN", "ANTHROPIC_API_KEY", "TAVILY_API_KEY"]:
        if not os.getenv(key_name):
            missing.append(key_name)
    if missing:
        logger.warning(f"⚠️ Missing API keys: {', '.join(missing)}")
    else:
        logger.info("✅ All API keys configured")
    return missing
What to avoid:
Anti-pattern: List-based parameter schemas
# WRONG - breaks LLM function calling
"parameters": ["query", "max_results"]

# CORRECT - works with all providers
"parameters": {
    "query": {"type": "string", "description": "..."},
    "max_results": {"type": "integer", "description": "..."}
}
Why it breaks: LLM clients iterate over parameters.items() to extract type/description metadata. List has no .items() method.
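The failure reproduces in two lines; the exception text matches the error reported above:

```python
# Reproduction of the bug: lists have no .items(), so the provider-format
# transformation failed before any LLM call was made.
bad_params = ["query", "max_results"]   # the broken list-based schema
good_params = {"query": {"type": "string", "description": "Search query"}}

try:
    bad_params.items()                  # raises AttributeError
except AttributeError as err:
    message = str(err)                  # "'list' object has no attribute 'items'"

for name, spec in good_params.items():  # dict form iterates cleanly
    assert spec["type"] == "string"
```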
Changelog
Session Date: 2026-01-03
Modified Files
src/agent/llm_client.py (~150 lines added)
- Added create_hf_client() - Initialize HuggingFace InferenceClient with HF_TOKEN
- Added plan_question_hf(question, available_tools, file_paths) - Planning with Qwen 2.5 72B
- Added select_tools_hf(question, plan, available_tools) - Function calling with OpenAI-compatible tools format
- Added synthesize_answer_hf(question, evidence) - Answer synthesis from evidence
- Updated plan_question() - Added HuggingFace as middle fallback tier (Gemini → HF → Claude)
- Updated select_tools_with_function_calling() - Added HuggingFace as middle fallback tier
- Updated synthesize_answer() - Added HuggingFace as middle fallback tier
- Added CONFIG constant: HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"
- Added import: from huggingface_hub import InferenceClient
src/agent/graph.py
- Updated validate_environment() - Added HF_TOKEN to API key validation check
- Updated startup logging - Shows ⚠️ WARNING if HF_TOKEN missing
app.py
- Updated check_api_keys() - Added HF_TOKEN status display in Test & Debug tab; UI now shows "HF_TOKEN (HuggingFace): ✅ SET" or "❌ MISSING"
- Added export_results_to_json(results_log, submission_status) - Export evaluation results as JSON
  - Local: ~/Downloads/gaia_results_TIMESTAMP.json
  - HF Spaces: ./exports/gaia_results_TIMESTAMP.json
  - Pretty formatted (indent=2), full error messages, easy code processing
- Updated run_and_submit_all() - ALL return paths now export results
- Added gr.File download button - Direct download of JSON file
- Updated run_button click handler - Outputs 3 values (status, table, export_path)
src/tools/__init__.py (Fixed earlier in session)
- Fixed TOOLS schema bug - Changed parameters from list to dict format
- Updated all tool definitions to include type/description for each parameter
- Added "required_params" field to specify required parameters
- Fixed Gemini function calling compatibility
Dependencies
No changes to requirements.txt - huggingface-hub>=0.26.0 already present from initial setup.
Test Results
All tests passing with new 3-tier fallback architecture:
uv run pytest test/ -q
======================== 99 passed, 11 warnings in 51.99s ========================