| # [dev_260101_06] Level 5 Component Selection Decisions | |
| **Date:** 2026-01-01 | |
| **Type:** Development | |
| **Status:** Resolved | |
| **Related Dev:** dev_260101_05 | |
| ## Problem Description | |
| Applied the Level 5 Component Selection parameters from the AI Agent System Design Framework to select the LLM, tool suite, memory architecture, and guardrails for the GAIA benchmark agent MVP. | |
| --- | |
| ## Key Decisions | |
| **Parameter 1: LLM Model → Claude Sonnet 4.5 (primary) + Free API baseline options** | |
| - **Primary choice:** Claude Sonnet 4.5 | |
| - **Reasoning:** Framework best practice - "Start with most capable model to baseline performance, then optimize downward for cost" | |
| - **Capability match:** Sonnet 4.5 provides strong reasoning + tool use capabilities required for GAIA | |
| - **Budget alignment:** Learning project allows premium model for baseline measurement | |
| - **Free API baseline alternatives:** | |
|   - **Google Gemini 2.0 Flash** (via AI Studio free tier) | |
|     - Function calling support, multi-modal, good reasoning | |
|     - Free quota: 1500 requests/day, suitable for GAIA evaluation | |
|   - **Qwen 2.5 72B** (via HuggingFace Inference API) | |
|     - Open source, function calling via HF API | |
|     - Free tier available, strong reasoning performance | |
|   - **Meta Llama 3.3 70B** (via HuggingFace Inference API) | |
|     - Open source, good tool use capability | |
|     - Free tier for experimentation | |
| - **Optimization path:** Run the free baseline (Gemini Flash) first, then compare against Claude Sonnet 4.5 if budget allows | |
| - **Implication:** Dual-track approach - free API for experimentation, premium model for performance ceiling | |
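The dual-track approach can be sketched as a small candidate list plus an ordering rule. This is only an illustration: `ModelOption`, `CANDIDATES`, and `evaluation_order` are hypothetical names, and the `name` strings are the labels used in this document, not provider API model IDs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelOption:
    name: str      # label from this document, not an API model ID
    provider: str
    tier: str      # "free" or "premium"


CANDIDATES = [
    ModelOption("Claude Sonnet 4.5", "Anthropic", "premium"),
    ModelOption("Gemini 2.0 Flash", "Google AI Studio", "free"),
    ModelOption("Qwen 2.5 72B", "HuggingFace Inference API", "free"),
    ModelOption("Llama 3.3 70B", "HuggingFace Inference API", "free"),
]


def evaluation_order(candidates: List[ModelOption], budget_available: bool) -> List[ModelOption]:
    """Free baselines run first; premium models are appended only if budget allows."""
    free = [m for m in candidates if m.tier == "free"]
    premium = [m for m in candidates if m.tier == "premium"]
    return free + premium if budget_available else free
```

With `budget_available=False` only the three free baselines are evaluated; with `True` the premium model runs last to establish the performance ceiling.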
| **Parameter 2: Tool Suite → Web browser / Python interpreter / File reader / Multi-modal processor** | |
| - **Evidence-based selection:** GAIA requirements breakdown (categories overlap because one question can need several tools, so the percentages sum to more than 100%): | |
|   - Web browsing: 76% of questions | |
|   - Code execution: 33% of questions | |
|   - File reading: 28% of questions (diverse formats) | |
|   - Multi-modal (vision): 30% of questions | |
| - **Specific tools:** | |
|   - Web search: Exa or Tavily API | |
|   - Code execution: Python interpreter (sandboxed) | |
|   - File reader: Multi-format parser (PDF, CSV, Excel, images) | |
|   - Vision: Multi-modal LLM capability for image analysis | |
| - **Coverage:** 4 core tools address primary GAIA capability requirements for MVP | |
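One way to wire the four tools together is a name-to-callable registry that the agent loop dispatches through. The sketch below is an assumption, not the project's implementation: `ToolRegistry` and `build_mvp_registry` are hypothetical names, and every tool body is a stub standing in for the real Exa/Tavily client, sandboxed interpreter, file parser, and vision call.

```python
from typing import Callable, Dict, List


class ToolRegistry:
    """Maps tool names to callables that take one string argument."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn

    def names(self) -> List[str]:
        return sorted(self._tools)

    def call(self, name: str, argument: str) -> str:
        # Reject tool names the model invents instead of failing silently.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](argument)


def build_mvp_registry() -> ToolRegistry:
    registry = ToolRegistry()
    # Stub bodies only; replace each lambda with the real integration.
    registry.register("web_search", lambda q: f"[stub] search results for: {q}")
    registry.register("python", lambda code: f"[stub] would execute: {code}")
    registry.register("file_reader", lambda path: f"[stub] parsed contents of: {path}")
    registry.register("vision", lambda path: f"[stub] image description for: {path}")
    return registry
```

Keeping dispatch in one place makes it straightforward to add specialized tools later without touching the agent loop.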
| **Parameter 3: Memory Architecture → Short-term context only** | |
| - **Reasoning:** GAIA questions are independent and stateless (Level 1 decision) | |
| - **Evidence:** Zero-shot evaluation requires each question answered in isolation | |
| - **Implication:** No vector stores/RAG/semantic memory/episodic memory needed | |
| - **Memory scope:** Only maintain context within single question execution | |
| - **Alignment:** Matches Level 1 stateless design, prevents cross-question contamination | |
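The short-term-only decision amounts to creating a fresh context object for every question and discarding it afterwards. A minimal sketch, assuming hypothetical names (`QuestionContext`, `run_question`, `run_benchmark`) and a caller-supplied `solve` function:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class QuestionContext:
    """Scratchpad that lives only for the duration of one question."""
    question: str
    steps: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, tool: str, output: str) -> None:
        self.steps.append((tool, output))


def run_question(question: str, solve: Callable[[QuestionContext], str]) -> str:
    ctx = QuestionContext(question)  # fresh context: nothing carried over
    return solve(ctx)


def run_benchmark(questions: List[str], solve: Callable[[QuestionContext], str]) -> List[str]:
    # Each question gets its own context, so no cross-question contamination.
    return [run_question(q, solve) for q in questions]
```

Because `steps` uses `default_factory`, no list is shared between contexts, which is the property the stateless design depends on.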
| **Parameter 4: Guardrails → Output validation + Tool restrictions** | |
| - **Output validation:** Enforce factoid answer format (numbers/few words/comma-separated lists) | |
| - **Tool restrictions:** Execution timeouts (prevent infinite loops), resource limits | |
| - **Minimal constraints:** No heavy content filtering for MVP (learning context) | |
| - **Safety focus:** Format compliance and execution safety, not content policy enforcement | |
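Both guardrails are small enough to sketch. The format check below treats an answer as valid when it is a single line whose comma-separated items are each at most a few words; the five-word limit is an assumption, since the source only says "few words". The timeout wrapper runs code in a child process via `subprocess.run`. Note that a timeout alone is not a sandbox; real isolation and resource limits would need additional measures.

```python
import subprocess
import sys

MAX_WORDS = 5  # assumed threshold for "few words"


def is_valid_factoid(answer: str, max_words: int = MAX_WORDS) -> bool:
    """Accept a number, a few words, or a comma-separated list of short items."""
    answer = answer.strip()
    if not answer or "\n" in answer:
        return False
    items = [item.strip() for item in answer.split(",")]
    return all(item and len(item.split()) <= max_words for item in items)


def run_python(code: str, timeout_s: float = 10.0) -> str:
    """Run code in a child interpreter, killing it if it exceeds the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "ERROR: execution timed out"
    if result.returncode != 0:
        return f"ERROR: {result.stderr.strip()}"
    return result.stdout
```

An answer that fails `is_valid_factoid` could be sent back to the LLM with a request to shorten it rather than being submitted as-is.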
| **Parameter 5: Answer Synthesis → LLM-generated (Stage 3 implementation)** | |
| - **Reasoning:** GAIA requires extracting factoid answers from multi-source evidence | |
| - **Evidence:** Answers must synthesize information from web searches, code outputs, file contents | |
| - **Implication:** LLM must reason about evidence and generate final answer (not template-based) | |
| - **Stage alignment:** Core logic implementation in Stage 3 (beyond MVP tool integration) | |
| - **Capability requirement:** LLM must distill complex evidence into concise factoid format | |
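Since synthesis is LLM-generated rather than template-based, the code side reduces to assembling the evidence into a prompt. The following is a hypothetical sketch (the `Evidence` type, `build_synthesis_prompt`, and the prompt wording are all assumptions) of how Stage 3 might present multi-source evidence and request a factoid answer:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    source: str   # e.g. "web_search", "python", "file_reader"
    content: str


def build_synthesis_prompt(question: str, evidence: List[Evidence]) -> str:
    """Number each evidence item so the LLM can reason over all sources at once."""
    lines = [
        "Answer the question using only the evidence below.",
        "Reply with a number, a few words, or a comma-separated list.",
        "",
        f"Question: {question}",
        "",
        "Evidence:",
    ]
    for i, item in enumerate(evidence, start=1):
        lines.append(f"[{i}] ({item.source}) {item.content}")
    lines.extend(["", "Final answer:"])
    return "\n".join(lines)
```

The returned string would be sent to whichever model is selected; the response can then be checked against the output-format guardrail.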
| **Parameter 6: Conflict Resolution → LLM-based reasoning (Stage 3 implementation)** | |
| - **Reasoning:** Multi-source evidence may contain conflicting information requiring judgment | |
| - **Example scenarios:** Conflicting search results, outdated vs current information, contradictory sources | |
| - **Implication:** LLM must evaluate source credibility and recency to resolve conflicts | |
| - **Stage alignment:** Decision logic in Stage 3 (not needed for Stage 2 tool integration) | |
| - **Alternatives rejected:** "Latest wins" and fixed source-priority rules are too simplistic for GAIA evidence evaluation | |
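As an illustration of leaving the judgment to the LLM, the code can limit itself to detecting disagreement and presenting each claim with its source and date. All names here (`Claim`, `has_conflict`, `format_for_judgment`) are hypothetical, and comparing values after trimming and lowercasing is a simplifying assumption:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Claim:
    source: str
    value: str
    published: Optional[date] = None


def has_conflict(claims: List[Claim]) -> bool:
    """True when the sources report more than one distinct value."""
    return len({c.value.strip().lower() for c in claims}) > 1


def format_for_judgment(claims: List[Claim]) -> str:
    """List claims newest first (undated last) with metadata for the LLM to weigh."""
    ordered = sorted(
        claims,
        key=lambda c: (
            c.published is None,
            -(c.published.toordinal() if c.published else 0),
        ),
    )
    return "\n".join(
        f"- {c.source} ({c.published or 'undated'}): {c.value}" for c in ordered
    )
```

Ordering by recency only arranges the evidence; unlike a "latest wins" rule, it does not pick the answer, so the LLM can still prefer an older but more credible source.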
| **Rejected alternatives:** | |
| - Vector stores/RAG: Unnecessary for stateless question-answering | |
| - Semantic/episodic memory: Violates GAIA zero-shot evaluation requirements | |
| - Heavy prompt constraints: Over-engineering for learning/benchmark context | |
| - Procedural caches: No repeated procedures to cache in stateless design | |
| **Future optimization:** | |
| - Model selection: A/B test free baselines (Gemini Flash, Qwen, Llama) vs premium (Claude, GPT-4) | |
| - Tool expansion: Add specialized tools based on failure analysis | |
| - Memory: Consider episodic memory for self-improvement experiments (non-benchmark mode) | |
| ## Outcome | |
| Selected component stack optimized for GAIA MVP: Claude Sonnet 4.5 for reasoning, 4 core tools (web/code/file/vision) for capability coverage, short-term context for stateless execution, minimal guardrails for format validation and safety. | |
| **Deliverables:** | |
| - `dev/dev_260101_06_level5_component_selection.md` - Level 5 component selection decisions | |
| **Component Specifications:** | |
| - **LLM:** Claude Sonnet 4.5 (primary) with free baseline alternatives (Gemini 2.0 Flash, Qwen 2.5 72B, Llama 3.3 70B) | |
| - **Tools:** Web (Exa/Tavily) + Python interpreter + File reader + Vision | |
| - **Memory:** Short-term context only (stateless) | |
| - **Guardrails:** Output format validation + execution timeouts | |
| ## Learnings and Insights | |
| **Pattern discovered:** Component selection driven by evidence-based requirements (GAIA capability analysis: 76% web, 33% code, 28% file, 30% multi-modal) rather than speculative "might need this" additions. | |
| **Best practice application:** "Start with most capable model to baseline performance" prevents premature optimization. Measure first, optimize second. | |
| **Memory architecture principle:** Stateless design enforced by benchmark requirements creates clean separation - no cross-question context leakage. | |
| **Critical connection:** Tool suite selection directly impacts Level 6 framework choice (framework must support function calling for tool integration). | |
| ## Changelog | |
| **What was changed:** | |
| - Created `dev/dev_260101_06_level5_component_selection.md` - Level 5 component selection decisions | |
| - Referenced Level 5 parameters from `AI Agent System Design Framework (2026-01-01).pdf` | |
| - Referenced capability requirements from `GAIA_TuyenPham_Analysis.pdf` (76% web, 33% code, 28% file, 30% multi-modal) | |
| - Established Claude Sonnet 4.5 as primary LLM with free baseline alternatives (Gemini 2.0 Flash, Qwen 2.5 72B, Llama 3.3 70B) | |
| - Added dual-track optimization path: free API for experimentation, premium model for performance ceiling | |