[dev_260103_16] HuggingFace LLM API Integration
Date: 2026-01-03 Type: Development Status: Resolved Related Dev: dev_260102_15_stage4_mvp_real_integration.md
Problem Description
Context: Stage 4 implementation was 7/10 complete with comprehensive diagnostics and error handling. However, testing revealed critical LLM availability issues:
- Gemini 2.0 Flash - Quota exceeded (1,500 requests/day free tier limit exhausted from testing)
- Claude Sonnet 4.5 - Credit balance too low (paid tier, user's balance depleted)
Root Cause: Agent relied on only 2 LLM tiers (free Gemini → paid Claude), with no middle fallback when the free tier was exhausted. This caused complete LLM failure, falling back to keyword-based tool selection (Stage 4 fallback mechanism).
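For context, the keyword-based last resort might look like the following minimal sketch. The tool names match the project's tools, but the trigger-word mapping and function name are illustrative assumptions, not the project's actual implementation:

```python
# Hypothetical keyword-to-tool mapping; trigger words are illustrative only.
KEYWORD_TOOL_MAP = {
    "web_search": ["search", "find", "who", "when", "where"],
    "calculator": ["calculate", "sum", "how many", "average"],
    "vision": ["image", "video", "picture", "youtube"],
}

def keyword_select_tools(question: str) -> list[str]:
    """Deterministic tool selection used when no LLM tier is available."""
    q = question.lower()
    selected = [
        tool for tool, triggers in KEYWORD_TOOL_MAP.items()
        if any(t in q for t in triggers)
    ]
    return selected or ["web_search"]  # sensible default when nothing matches
```

The point of this tier is determinism: it always returns something, so the agent never fails outright.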
User Request: Add completely free LLM alternative that works in HuggingFace Spaces environment without requiring local GPU resources.
Requirements:
- Must be completely free (no credits, reasonable rate limits)
- Must support function calling (critical for tool selection)
- Must work in HuggingFace Spaces (cloud-based, no local GPU)
- Must integrate into existing 3-tier fallback architecture
Key Decisions
Decision 1: HuggingFace LLM API over Ollama (local LLMs)
Why chosen:
- ✅ Works in HuggingFace Spaces (cloud-based API)
- ✅ Free tier with rate limits (~60 req/min vs Gemini's 1,500 req/day)
- ✅ Function calling support via OpenAI-compatible API
- ✅ No GPU requirements (serverless inference)
- ✅ Already deployed to HF Spaces - logical integration
Rejected alternative: Ollama + Llama 3.1 70B (local)
- ❌ Requires local GPU or high-end CPU
- ❌ Won't work in HuggingFace Free Spaces (CPU-only, 16GB RAM limit)
- ❌ Would need GPU Spaces upgrade (not free)
- ❌ Complex setup for user's deployment environment
Decision 2: Qwen 2.5 72B Instruct as HuggingFace Model
Why chosen:
- ✅ Excellent function calling capabilities (OpenAI-compatible tools format)
- ✅ Strong reasoning performance (competitive with GPT-4 on benchmarks)
- ✅ Free on HuggingFace LLM API
- ✅ 72B parameters - sufficient intelligence for GAIA tasks
Considered alternatives:
- meta-llama/Llama-3.1-70B-Instruct - Good but slightly worse function calling
- NousResearch/Hermes-3-Llama-3.1-70B - Excellent but less tested for tool use
Decision 3: 3-Tier Fallback Architecture
Final chain:
- Gemini 2.0 Flash (free, 1,500 req/day) - Primary
- HuggingFace Qwen 2.5 72B (free, rate limited) - NEW Middle Tier
- Claude Sonnet 4.5 (paid) - Expensive fallback
- Keyword matching (deterministic) - Last resort
Trade-offs:
- Pro: 4 layers of resilience ensure agent always produces output
- Pro: Maximizes free tier usage before burning paid credits
- Con: Slightly higher latency on fallback chain traversal
- Con: More API keys to manage (but HF_TOKEN already required for Space)
Decision 4: TOOLS Schema Bug Fix (Critical)
Problem discovered: src/tools/__init__.py had parameters as list ["query"] but LLM client expected dict {"query": {...}} with type/description.
Impact: Gemini function calling was completely broken - caused 'list' object has no attribute 'items' error.
Fix: Updated all tool definitions to proper schema:
"parameters": {
"query": {
"description": "Search query string",
"type": "string"
},
"max_results": {
"description": "Maximum number of search results to return",
"type": "integer"
}
},
"required_params": ["query"]
Result: Gemini function calling now working correctly (verified in tests).
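A small guard like the following could catch this class of schema bug before it reaches a provider. This is a sketch, not the project's code; the `validate_tool_schema` helper is hypothetical, but the tool dict mirrors the corrected schema above:

```python
# Hypothetical schema guard; iterating .items() mirrors what the LLM client does.
def validate_tool_schema(tool: dict) -> None:
    params = tool["parameters"]
    if not isinstance(params, dict):
        raise TypeError(f"parameters must be a dict, got {type(params).__name__}")
    for name, spec in params.items():  # the exact call that broke on lists
        assert "type" in spec and "description" in spec, name
    for req in tool.get("required_params", []):
        assert req in params, f"required param {req!r} not defined"

web_search_tool = {
    "parameters": {
        "query": {"description": "Search query string", "type": "string"},
        "max_results": {"description": "Maximum number of search results to return", "type": "integer"},
    },
    "required_params": ["query"],
}
validate_tool_schema(web_search_tool)  # passes silently
```

Running such a check at import time would have surfaced the list-based schema immediately instead of at the first Gemini call.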
Outcome
Successfully integrated HuggingFace LLM API as free LLM fallback tier, completing Stage 4 MVP with robust multi-tier resilience.
Deliverables:
- src/agent/llm_client.py - Added ~150 lines of HuggingFace integration
  - create_hf_client() - Initialize InferenceClient with HF_TOKEN
  - plan_question_hf() - Planning using Qwen 2.5 72B
  - select_tools_hf() - Function calling with OpenAI-compatible tools format
  - synthesize_answer_hf() - Answer synthesis from evidence
  - Updated unified functions: plan_question(), select_tools_with_function_calling(), synthesize_answer() to use 3-tier fallback
- src/agent/graph.py - Added HF_TOKEN validation
  - Updated validate_environment() to check HF_TOKEN at agent startup - shows ⚠️ WARNING if HF_TOKEN missing
- app.py - Updated UI and added JSON export functionality
  - Added HF_TOKEN to check_api_keys() display in Test & Debug tab
  - Added export_results_to_json() - Exports evaluation results as clean JSON
    - Local: ~/Downloads/gaia_results_TIMESTAMP.json
    - HF Spaces: ./exports/gaia_results_TIMESTAMP.json (environment-aware)
    - Full error messages preserved (no truncation), easy code processing
  - Updated run_and_submit_all() - ALL return paths now export results
  - Added gr.File download button - direct download instead of text display
- src/tools/__init__.py - Fixed TOOLS schema bug (earlier in session)
  - Changed parameters from list to dict format
  - Added type/description for each parameter
  - Fixed Gemini function calling compatibility
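The environment-aware export described above could be sketched as follows. This is an illustrative reconstruction, not the actual app.py code: the `SPACE_ID` check is an assumed way to detect HuggingFace Spaces, and the function signature is simplified:

```python
import json
import os
import time
from pathlib import Path

def export_results_to_json(results: list) -> str:
    """Sketch: write evaluation results to an environment-aware JSON path.
    Assumption: HF Spaces is detected via the SPACE_ID env var."""
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    if os.getenv("SPACE_ID"):               # running inside an HF Space
        out_dir = Path("./exports")
    else:                                   # local run
        out_dir = Path.home() / "Downloads"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"gaia_results_{timestamp}.json"
    # indent=2 keeps the file human-readable; no truncation of error messages
    path.write_text(json.dumps(results, indent=2, ensure_ascii=False))
    return str(path)
```

Returning the path makes it easy to feed straight into a gr.File component for direct download.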
Test Results:
uv run pytest test/ -q
99 passed, 11 warnings in 51.99s ✅
All tests passing with new 3-tier fallback architecture.
Stage 4 Progress: 10/10 tasks completed ✅
- ✅ Comprehensive debug logging
- ✅ Improved error messages
- ✅ API key validation (including HF_TOKEN)
- ✅ Tool execution error handling
- ✅ Fallback tool execution (keyword matching)
- ✅ LLM exception handling (3-tier fallback)
- ✅ Diagnostics display in Gradio UI
- ✅ Documentation in dev log (this file)
- ✅ Tool name consistency fix (web_search, calculator, vision)
- ✅ Deploy to HF Space and run GAIA validation
GAIA Validation Results (Real Test):
- Score: 10.0% (2/20 correct)
- Improvement: 0/20 → 2/20 (MVP validated!)
- Status: Agent is functional and operational
What worked:
- ✅ Question 1: "How many studio albums were published by Mercedes Sosa between 2000 and 2009?" → Answer: "3" (CORRECT)
- ✅ HuggingFace LLM (Qwen 2.5 72B) successfully used for planning and tool selection
- ✅ Web search tool executed successfully
- ✅ Evidence collection and answer synthesis working
What failed:
- ❌ Question 2: YouTube video analysis (vision tool) - "Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed"
- Issue: Vision tool requires multimodal LLM access (quota-limited or needs configuration)
Next Stage: Stage 5 - Performance Optimization (target: 5/20 questions)
Learnings and Insights
Pattern: Free-First Fallback Architecture
What worked well:
- Prioritizing free tiers (Gemini → HuggingFace) before paid tier (Claude) maximizes cost efficiency
- Multiple free alternatives with different quota models (daily vs rate-limited) provide better resilience than single free tier
- Keyword fallback ensures agent never completely fails even when all LLMs unavailable
Reusable pattern:
def unified_llm_function(...):
    """3-tier fallback with comprehensive error capture"""
    errors = []
    try:
        return free_tier_1(...)  # Gemini - daily quota
    except Exception as e1:
        errors.append(f"Tier 1: {e1}")
    try:
        return free_tier_2(...)  # HuggingFace - rate limited
    except Exception as e2:
        errors.append(f"Tier 2: {e2}")
    try:
        return paid_tier(...)  # Claude - credits
    except Exception as e3:
        errors.append(f"Tier 3: {e3}")
    # Deterministic fallback as last resort
    return keyword_fallback(...)
Pattern: Function Calling Schema Compatibility
Critical insight: Different LLM providers require different function calling schemas:
Gemini - genai.protos.Tool with function_declarations:

Tool(function_declarations=[
    FunctionDeclaration(
        name="search_web",
        description="...",
        parameters={
            "type": "object",
            "properties": {"query": {"type": "string", "description": "..."}},
            "required": ["query"]
        }
    )
])

HuggingFace - OpenAI-compatible tools array:

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "...",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "..."}},
            "required": ["query"]
        }
    }
}]

Claude - Anthropic native format (simplified):

tools = [{
    "name": "search_web",
    "description": "...",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "..."}},
        "required": ["query"]
    }
}]
Best practice: Maintain single source of truth in src/tools/__init__.py with rich schema (dict format with type/description), then transform to provider-specific format in LLM client functions.
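A minimal sketch of that transformation for the OpenAI-compatible (HuggingFace) format follows. The `to_openai_tool` helper name is hypothetical; the canonical tool dict shape follows the dict-based schema shown earlier:

```python
# Hypothetical transform: canonical dict-based schema -> OpenAI-compatible tool.
def to_openai_tool(name: str, description: str, tool_def: dict) -> dict:
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                # canonical parameters already carry type/description per param
                "properties": tool_def["parameters"],
                "required": tool_def.get("required_params", []),
            },
        },
    }

web_search = {
    "parameters": {"query": {"type": "string", "description": "Search query"}},
    "required_params": ["query"],
}
openai_tool = to_openai_tool("web_search", "Search the web", web_search)
```

Analogous one-function transforms for the Gemini and Claude formats keep all three providers in sync with the single canonical definition.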
Pattern: Environment Validation at Startup
What worked well:
- Validating all API keys at agent initialization (not at first use) provides immediate feedback
- Clear warnings listing missing keys help users diagnose setup issues
- Non-blocking warnings (continue anyway) allow testing with partial configuration
Implementation:
def validate_environment() -> List[str]:
    """Check API keys at startup, return list of missing keys"""
    missing = []
    for key_name in ["GOOGLE_API_KEY", "HF_TOKEN", "ANTHROPIC_API_KEY", "TAVILY_API_KEY"]:
        if not os.getenv(key_name):
            missing.append(key_name)
    if missing:
        logger.warning(f"⚠️ Missing API keys: {', '.join(missing)}")
    else:
        logger.info("✅ All API keys configured")
    return missing
What to avoid:
Anti-pattern: List-based parameter schemas
# WRONG - breaks LLM function calling
"parameters": ["query", "max_results"]

# CORRECT - works with all providers
"parameters": {
    "query": {"type": "string", "description": "..."},
    "max_results": {"type": "integer", "description": "..."}
}
Why it breaks: LLM clients iterate over parameters.items() to extract type/description metadata. List has no .items() method.
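The failure reproduces in two lines; the exception text matches the error reported above:

```python
# Reproduction of the bug: lists have no .items(), so the provider-format
# transformation failed before any LLM call was made.
bad_params = ["query", "max_results"]   # the broken list-based schema
good_params = {"query": {"type": "string", "description": "Search query"}}

try:
    bad_params.items()                  # raises AttributeError
except AttributeError as err:
    message = str(err)                  # "'list' object has no attribute 'items'"

for name, spec in good_params.items():  # dict form iterates cleanly
    assert spec["type"] == "string"
```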
Changelog
Session Date: 2026-01-03
Modified Files
src/agent/llm_client.py (~150 lines added)
- Added create_hf_client() - Initialize HuggingFace InferenceClient with HF_TOKEN
- Added plan_question_hf(question, available_tools, file_paths) - Planning with Qwen 2.5 72B
- Added select_tools_hf(question, plan, available_tools) - Function calling with OpenAI-compatible tools format
- Added synthesize_answer_hf(question, evidence) - Answer synthesis from evidence
- Updated plan_question() - Added HuggingFace as middle fallback tier (Gemini → HF → Claude)
- Updated select_tools_with_function_calling() - Added HuggingFace as middle fallback tier
- Updated synthesize_answer() - Added HuggingFace as middle fallback tier
- Added CONFIG constant: HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"
- Added import: from huggingface_hub import InferenceClient
src/agent/graph.py
- Updated validate_environment() - Added HF_TOKEN to API key validation check
- Updated startup logging - Shows ⚠️ WARNING if HF_TOKEN missing
app.py
- Updated check_api_keys() - Added HF_TOKEN status display in Test & Debug tab; UI now shows "HF_TOKEN (HuggingFace): ✅ SET" or "❌ MISSING"
- Added export_results_to_json(results_log, submission_status) - Export evaluation results as JSON
  - Local: ~/Downloads/gaia_results_TIMESTAMP.json
  - HF Spaces: ./exports/gaia_results_TIMESTAMP.json
  - Pretty formatted (indent=2), full error messages, easy code processing
- Updated run_and_submit_all() - ALL return paths now export results
- Added gr.File download button - Direct download of JSON file
- Updated run_button click handler - Outputs 3 values (status, table, export_path)
src/tools/__init__.py (Fixed earlier in session)
- Fixed TOOLS schema bug - Changed parameters from list to dict format
- Updated all tool definitions to include type/description for each parameter
- Added "required_params" field to specify required parameters
- Fixed Gemini function calling compatibility
Dependencies
No changes to requirements.txt - huggingface-hub>=0.26.0 already present from initial setup.
Test Results
All tests passing with new 3-tier fallback architecture:
uv run pytest test/ -q
======================== 99 passed, 11 warnings in 51.99s ========================