Session Changelog

[2026-01-22] [Enhancement] [COMPLETED] UI Instructions - User-Focused Quick Start Guide

Problem: Default template instructions were developer-focused ("clone this space, modify code") and not helpful for end users.

Solution: Rewrote instructions to be concise and user-oriented:

Before:

  • Generic numbered steps
  • Talked about cloning/modifying code (irrelevant for end users)
  • Long rambling disclaimer about sub-optimal setup

After:

  • Quick Start section with bolded key actions
  • What happens section explaining the workflow
  • Expectations section managing user expectations about time and downloads
  • Explicitly mentions JSON + HTML export formats

Modified Files:

  • app.py (lines 910-927)

[2026-01-22] [Refactor] [COMPLETED] Export Architecture - Canonical Data Model

Problem: HTML export called JSON export internally, wrote JSON to disk, read it back, then wrote HTML. This was:

  • Inefficient (redundant disk I/O)
  • Tightly coupled (HTML depended on JSON format)
  • Error-prone (data structure mismatch)

Solution: Refactored to use canonical data model:

  1. _build_export_data() - Single source of truth, builds canonical data structure
  2. export_results_to_json() - Calls canonical builder, writes JSON
  3. export_results_to_html() - Calls canonical builder, writes HTML

Benefits:

  • No redundant processing (no disk I/O between exports)
  • Loose coupling (exports are independent)
  • Consistent data (both use identical source)
  • Easier to extend (add CSV, PDF exports easily)

Modified Files:

  • app.py (~200 lines refactored)

[2026-01-21] [Bugfix] [COMPLETED] DataFrame Scroll Bug - Replaced with HTML Export

Problem: Gradio 6.2.0 DataFrame has critical scrolling bugs (virtualized scrolling from Gradio 3.43+):

  • Spring-back to top when scrolling
  • Random scroll positions
  • Locked scrolling after window resize

Attempted Solutions (all failed):

  • max_height parameter
  • row_count parameter
  • interactive=False
  • Custom CSS overrides
  • Downgrade to Gradio 3.x (numpy conflict)

Solution: Removed DataFrame entirely, replaced with:

  1. JSON Export - Full data download
  2. HTML Export - Interactive table with scrollable cells

UI Changes:

  • Removed: gr.DataFrame component
  • Added: gr.File components for JSON and HTML downloads
  • Updated: All return statements in run_and_submit_all()

Modified Files:

  • app.py (~50 lines modified)

[2026-01-21] [Debug] [FAILED] Gradio DataFrame Scroll Bug - Multiple Attempted Fixes

Problem: Gradio 6.2.0 DataFrame has critical scrolling bugs due to virtualized scrolling introduced in Gradio 3.43+:

  • Spring-back to top when scrolling
  • Random scroll positions on click
  • Locked scrolling after window resize

Attempted Solutions (all failed):

  1. max_height parameter - No effect, virtualized scrolling still active
  2. row_count parameter - No effect, display issues persisted
  3. interactive=False - No effect, scrolling still broken
  4. Custom CSS overrides - Attempted to override virtualized styles, no effect
  5. Downgrade to Gradio 3.x - Failed due to numpy 1.x vs 2.x dependency conflict

Root Cause Identified:

  • Virtualized scrolling in Gradio 3.43+ fundamentally breaks DataFrame display
  • No workarounds available in Gradio 6.2.0
  • Downgrade blocked by dependency constraints

Resolution: Abandoned DataFrame UI, replaced with export buttons (see next entry)

Status: FAILED - UI bug unfixable, switched to alternative solution

Modified Files:

  • app.py (multiple attempted fixes, all reverted)

[2026-01-21] [Documentation] [COMPLETED] ACHIEVEMENT.md - Project Success Report

Problem: Need professional marketing/stakeholder report showcasing GAIA agent engineering journey and achievements.

Solution: Created comprehensive achievement report focusing on strategic engineering decisions and architectural choices.

Report Structure:

  1. Executive Summary - Design-first approach (10 days planning + 4 days implementation), key achievements
  2. Strategic Engineering Decisions - 7 major decisions documented:
    • Decision 1: Design-First Approach (8-Level Framework)
    • Decision 2: Tech Stack Selection (LangGraph, Gradio, model selection criteria)
    • Decision 3: Free-Tier-First Cost Architecture (4-tier LLM fallback)
    • Decision 4: UI-Driven Runtime Configuration
    • Decision 5: Unified Fallback Pattern Architecture
    • Decision 6: Evidence-Based State Design
    • Decision 7: Dynamic Planning via LLM
  3. Implementation Journey - 6 stages with architectural decisions per stage
  4. Performance Progression Timeline - 10% → 25% → 30% accuracy progression
  5. Production Readiness Highlights - Deployment, cost optimization, resilience engineering
  6. Quantifiable Impact Summary - Metrics table with 10 key achievements
  7. Key Learnings & Takeaways - 6 strategic insights
  8. Conclusion - Final stats and repository link

Tech Stack Details Added:

  • LLM Chain: Gemini 2.0 Flash Exp → GPT-OSS 120B (HF) → GPT-OSS 120B (Groq) → Claude Sonnet 4.5
  • Vision: Gemma-3-27B (HF) → Gemini 2.0 Flash → Claude Sonnet 4.5
  • Search: Tavily → Exa
  • Audio: Whisper Small with ZeroGPU
  • Frameworks: LangGraph (not LangChain), Gradio (not Streamlit), uv (not pip/poetry)

Focus: Strategic WHY (engineering decisions) over technical WHAT (bug fixes), emphasizing architectural thinking and product design.

Modified Files:

  • ACHIEVEMENT.md (401 lines created) - Complete marketing report with executive summary, strategic decisions, implementation journey, metrics

Result: Professional achievement report ready for employers, recruiters, investors, and blog/social media sharing.


[2026-01-14] [Enhancement] [COMPLETED] Unified Log Format - Markdown Standard

Problem: Inconsistent log formats across different components, wasteful ==== separators.

Solution: Standardize all logs to Markdown format with clean structure.

Unified Log Standard:

```markdown
# Title

**Key:** value
**Key:** value

## Section

Content
```

Files Updated:

  1. LLM Session Logs (llm_session_*.md):

    • Header: # LLM Synthesis Session Log
    • Questions: ## Question [timestamp]
    • Sections: ### Evidence & Prompt, ### LLM Response
    • Code blocks: triple backticks
  2. YouTube Transcript Logs ({video_id}_transcript.md):

    • Header: # YouTube Transcript
    • Metadata: **Video ID:**, **Source:**, **Length:**
    • Content: ## Transcript

Note: No horizontal rules (---); they are already banned in the global CLAUDE.md and break collapsible sections.

Token Savings:

| Style | Tokens per separator | 20 questions |
|---|---|---|
| `====` x 80 chars | ~40 tokens | ~800 tokens |
| `##` heading | ~2 tokens | ~40 tokens |

Savings: ~760 tokens per session (95% reduction)

Benefits:

  • ✅ Collapsible headings in all Markdown editors
  • ✅ Consistent structure across all log files
  • ✅ Token-efficient for LLM processing
  • ✅ Readable in both rendered and plain text
  • ✅ .md extension for proper syntax highlighting

Modified Files:

  • src/agent/llm_client.py (LLM session logs)
  • src/tools/youtube.py (transcript logs)
  • CLAUDE.md (added unified log format standard)

[2026-01-14] [Cleanup] [COMPLETED] Session Log Optimization - Reduce Static Content Redundancy

Problem: System prompt (~30 lines) was written for every question (20x = 600 lines of redundant text).

Solution: Write system prompt once on first question, skip for subsequent questions.

Implementation:

  • Added _SYSTEM_PROMPT_WRITTEN flag to track if system prompt was logged
  • First question includes full SYSTEM PROMPT section
  • Subsequent questions only show dynamic content (question, evidence, response)
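A sketch of the write-once guard, assuming a module-level flag as described (the helper name and header text here are illustrative, not the actual llm_client.py code):

```python
_SYSTEM_PROMPT_WRITTEN = False

def log_system_prompt_once(log_file, system_prompt):
    """Append the static system prompt on the first question only.

    Returns True if it was written this call, False if already logged.
    """
    global _SYSTEM_PROMPT_WRITTEN
    if _SYSTEM_PROMPT_WRITTEN:
        return False
    with open(log_file, "a", encoding="utf-8") as f:
        f.write("## System Prompt (static - used for all questions)\n\n")
        f.write(system_prompt + "\n\n")
    _SYSTEM_PROMPT_WRITTEN = True
    return True
```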

Log format comparison:

Before (every question):

```
QUESTION START
SYSTEM PROMPT: [30 lines repeated]
USER PROMPT: [dynamic]
LLM RESPONSE: [dynamic]
```

After (first question):

```
SYSTEM PROMPT (static - used for all questions): [30 lines]
QUESTION [...]
EVIDENCE & PROMPT: [dynamic]
LLM RESPONSE: [dynamic]
```

After (subsequent questions):

```
QUESTION [...]
EVIDENCE & PROMPT: [dynamic]
LLM RESPONSE: [dynamic]
```

Result: ~570 lines less redundancy per 20-question evaluation.

Modified Files:

  • src/agent/llm_client.py (~30 lines modified - added flag, conditional logging)

[2026-01-14] [Bugfix] [COMPLETED] Session Log Synchronization - Atomic Per-Question Logging

Problem: When processing multiple questions, LLM responses were written out of order relative to their questions, causing mismatched prompts/responses in session logs.

Root Cause: synthesize_answer_hf() wrote QUESTION START immediately, but appended LLM RESPONSE only after the API call completed. With concurrent processing, responses finished in a different order than their question headers were written.

Solution: Buffer complete question block in memory, write atomically when response arrives:

```python
# Before (broken): response appended after the API call returns,
# so concurrent questions interleave their writes
write_question_start()   # immediate
api_response = call_llm()
write_llm_response()     # later, out of order

# After (fixed): buffer the whole block, write it atomically
question_header = buffer_question_start()
api_response = call_llm()
complete_block = question_header + api_response + end
write_atomic(complete_block)  # all at once
```

Result: Each question block is self-contained, no mismatched prompts/responses.
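A minimal sketch of the buffered, atomic append. The Markdown section names follow the unified log standard; the lock and helper name are assumptions, not the actual synthesize_answer_hf internals:

```python
from threading import Lock

_LOG_LOCK = Lock()

def log_question_block(log_file, question, prompt, response):
    """Buffer the whole question block in memory, then append it in one
    locked write so concurrent questions can never interleave."""
    block = (
        f"## Question\n\n{question}\n\n"
        f"### Evidence & Prompt\n\n{prompt}\n\n"
        f"### LLM Response\n\n{response}\n\n"
    )
    with _LOG_LOCK:
        with open(log_file, "a", encoding="utf-8") as f:
            f.write(block)
```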

Modified Files:

  • src/agent/llm_client.py (~40 lines modified - synthesize_answer_hf function)

[2026-01-13] [Cleanup] [COMPLETED] LLM Session Log Format - Removed Duplicate Evidence

Problem: Evidence appeared twice in session log - once in USER PROMPT section, again in EVIDENCE ITEMS section.

Solution: Removed standalone EVIDENCE ITEMS section, kept evidence in USER PROMPT only.

Rationale: USER PROMPT shows what's actually sent to the LLM (system + user messages together).

Modified Files:

  • src/agent/llm_client.py - Removed duplicate logging section (lines 1189-1194 deleted)

Result: Cleaner logs, no duplication

[2026-01-13] [Feature] [COMPLETED] YouTube Frame Processing Mode - Visual Video Analysis

Problem: Transcript mode captures audio but misses visual information (objects, scenes, actions).

Solution: Implemented frame extraction and vision-based video analysis mode.

Implementation:

1. Frame Extraction (src/tools/youtube.py):

  • download_video() - Downloads video using yt-dlp
  • extract_frames() - Extracts N frames at regular intervals using OpenCV
  • analyze_frames() - Analyzes frames with vision models
  • process_video_frames() - Complete frame processing pipeline
  • youtube_analyze() - Unified API with mode parameter
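The interval-based sampling can be sketched as follows, assuming OpenCV's VideoCapture API; `frame_indices` is a hypothetical helper added here for clarity, not a function from the actual tool:

```python
def frame_indices(total_frames, frame_count=6):
    """Evenly spaced frame indices across a video (first frame included)."""
    if total_frames <= 0 or frame_count <= 0:
        return []
    step = total_frames / frame_count
    return [int(i * step) for i in range(frame_count)]

def extract_frames(video_path, frame_count=6):
    """Grab frames at regular intervals (cv2 required only at call time)."""
    import cv2  # opencv-python>=4.8.0
    cap = cv2.VideoCapture(video_path)
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in frame_indices(total, frame_count):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        return frames
    finally:
        cap.release()
```

For a 120-frame clip and 6 samples this picks frames 0, 20, 40, 60, 80, 100, matching the evenly spread timestamps reported in the test below.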

2. CONFIG Settings:

  • FRAME_COUNT = 6 - Number of frames to extract
  • FRAME_QUALITY = "worst" - Download quality (faster)

3. UI Integration (app.py):

  • Added radio button: "YouTube Processing Mode"
  • Choices: "Transcript" (default) or "Frames"
  • Sets YOUTUBE_MODE environment variable

4. Updated Dependencies:

  • requirements.txt - Added opencv-python>=4.8.0
  • pyproject.toml - Added via uv add opencv-python

5. Tool Description Update (src/tools/__init__.py):

  • Updated youtube_transcript description to mention both modes

Architecture:

```
youtube_transcript() → reads YOUTUBE_MODE env
                        ├─ "transcript" → audio/subtitle extraction
                        └─ "frames" → video download → extract 6 frames → vision analysis
```
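The mode dispatch can be sketched as follows; the `handlers` parameter is an illustrative injection point for testing, not the tool's real signature:

```python
import os

def youtube_analyze(url, handlers, mode=None):
    """Pick a processing pipeline based on the UI-set YOUTUBE_MODE env var.

    handlers maps mode name -> callable, e.g. {"transcript": ..., "frames": ...}.
    """
    mode = (mode or os.environ.get("YOUTUBE_MODE", "transcript")).lower()
    if mode not in handlers:
        raise ValueError(f"Unknown YOUTUBE_MODE: {mode}")
    return handlers[mode](url)
```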

Test Result:

  • Successfully processed video with 6 frames analyzed
  • Each frame analyzed with vision model, combined output returned
  • Frame timestamps: 0s, 20s, 40s, 60s, 80s, 100s (spread evenly)

Known Limitation:

  • Frame sampling is content-blind (fixed regular intervals, not event-aware)
  • Low probability of capturing a transient event (~5.5% coverage for a 108s video)
  • Future: Hybrid mode using timestamps to guide frame extraction (documented in user_io/knowledge/hybrid_video_audio_analysis.md)

Status: Implemented and tested, ready for use

Modified Files:

  • src/tools/youtube.py (~200 lines added - frame extraction + analysis)
  • app.py (~5 lines modified - UI toggle)
  • requirements.txt (1 line added - opencv-python)
  • src/tools/__init__.py (1 line modified - tool description)

[2026-01-13] [Investigation] [OPEN] HF Spaces vs Local Performance Discrepancy

Problem: HF Space deployment shows significantly lower scores (5%) than local execution (20-30%).

Investigation:

| Environment | Score | System Errors | NoneType Errors |
|---|---|---|---|
| Local | 20-30% | 3 (15%) | 1 |
| HF ZeroGPU | 5% | 5 (25%) | 3 |
| HF CPU Basic | 5% | 5 (25%) | 3 |

Verified: Code is 100% identical (cloned HF Space repo, git history matches at commit 3dcf523).

Issue: HF Spaces infrastructure causes LLM to return empty/None responses during synthesis.

Known Limitations (Local 30% Run):

  • 3 system errors: reverse text (calculator), chess vision (NoneType), Python .py execution
  • 10 "Unable to answer": search evidence extraction issues
  • 1 wrong answer: Wikipedia dinosaur (Jimfbleak vs FunkMonk)

Resolution: Competition accepts local results. HF Spaces deployment not required.

Status: OPEN - Infrastructure Issue, Won't Fix (use local execution)

[2026-01-13] [Infrastructure] [COMPLETED] 3-Tier Folder Naming Convention

Problem: Previous rename used _ prefix for both runtime folders AND user-only folders, creating ambiguity.

Solution: Implemented 3-tier naming convention to clearly distinguish folder purposes.

3-Tier Convention:

  1. User-only (user_* prefix) - Manual use, not app runtime:

    • user_input/ - User testing files, not app input
    • user_output/ - User downloads, not app output
    • user_dev/ - Dev records (manual documentation)
    • user_archive/ - Archived code/reference materials
  2. Runtime/Internal (_ prefix) - App creates, temporary:

    • _cache/ - Runtime cache, served via app download
    • _log/ - Runtime logs, debugging
  3. Application (no prefix) - Permanent code:

    • src/, test/, docs/, ref/ - Application folders

Folders Renamed:

  • _input/ → user_input/ (user testing files)
  • _output/ → user_output/ (user downloads)
  • dev/ → user_dev/ (dev records)
  • archive/ → user_archive/ (archived materials)

Folders Unchanged (correct tier):

  • _cache/, _log/ - Runtime ✓
  • src/, test/, docs/, ref/ - Application ✓

Updated Files:

  • test/test_phase0_hf_vision_api.py - Path("_output") → Path("user_output")
  • .gitignore - Updated folder references and comments

Git Status:

  • Old folders removed from git tracking
  • New folders excluded by .gitignore
  • Existing files become untracked

Result: Clear 3-tier structure: user_* (user-only), _* (runtime), and no prefix (application)

[2026-01-13] [Infrastructure] [COMPLETED] Runtime Folder Naming Convention - Underscore Prefix

Problem: Folders log/, output/, and input/ didn't clearly indicate they were runtime-only storage, making it unclear which folders are internal vs permanent.

Solution: Renamed all runtime-only folders to use _ prefix, following Python convention for internal/private.

Folders Renamed:

  • log/ → _log/ (runtime logs, debugging)
  • output/ → _output/ (runtime results, user downloads)
  • input/ → _input/ (user testing files, not app input)

Rationale:

  • _ prefix signals "internal, temporary, not part of public API"
  • Consistent with Python convention (_private, __dunder__)
  • Distinguishes runtime storage from permanent project folders

Updated Files:

  • src/agent/llm_client.py - Path("log") → Path("_log")
  • src/tools/youtube.py - Path("log") → Path("_log")
  • test/test_phase0_hf_vision_api.py - Path("output") → Path("_output")
  • .gitignore - Updated folder references

Result: Runtime folders now clearly marked with _ prefix

[2026-01-13] [Documentation] [COMPLETED] Log Consolidation - Session-Level Logging

Problem: Each question created separate log file (llm_context_TIMESTAMP.txt), polluting the log/ folder with 20+ files per evaluation.

Solution: Implemented session-level log file where all questions append to single file.

Implementation:

  • Added get_session_log_file() function in src/agent/llm_client.py
  • Creates log/llm_session_YYYYMMDD_HHMMSS.txt on first use
  • All questions append to same file with question delimiters
  • Added reset_session_log() for testing/new runs
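A sketch of the lazy session-log creation, assuming a module-level cache as described (the internals are illustrative, only the two function names come from the entry above):

```python
from datetime import datetime
from pathlib import Path

_SESSION_LOG = None

def get_session_log_file(log_dir="log"):
    """Create log/llm_session_YYYYMMDD_HHMMSS.txt on first use,
    then return the same path for every later question."""
    global _SESSION_LOG
    if _SESSION_LOG is None:
        Path(log_dir).mkdir(exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        _SESSION_LOG = Path(log_dir) / f"llm_session_{stamp}.txt"
    return _SESSION_LOG

def reset_session_log():
    """Forget the cached path so the next call starts a fresh session log."""
    global _SESSION_LOG
    _SESSION_LOG = None
```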

Updated File:

  • src/agent/llm_client.py (~40 lines added)
    • Session log management (lines 62-99)
    • Updated synthesize_answer_hf to append to session log

Result: One log file per evaluation instead of 20+

[2026-01-13] [Infrastructure] [COMPLETED] Project Template Reference Move

Problem: Project template moved to new location, documentation references outdated.

Solution: Updated CHANGELOG.md references to new template location.

Changes:

  • Moved: project_template_original/ → ref/project_template_original/
  • Updated CHANGELOG.md (7 occurrences)
  • Added ref/ to .gitignore (static copies, not in git)

Result: Documentation reflects new template location

[2026-01-12] [Infrastructure] [COMPLETED] Git Ignore Fixes - PDF Commit Block

Problem: Git push rejected due to binary files in docs/ folder.

Solution:

  1. Reset commit: git reset --soft HEAD~1
  2. Added docs/*.pdf to .gitignore
  3. Removed PDF files from git: git rm --cached "docs/*.pdf"
  4. Recommitted without PDFs
  5. Push successful

User feedback: "can just gitignore all the docs also"

Final Fix: Changed docs/*.pdf to docs/ to ignore entire docs folder

Updated Files:

  • .gitignore - Added docs/ folder ignore

Result: Clean git history, no binary files committed

[2026-01-13] [Documentation] [COMPLETED] 30% Results Analysis - Phase 1 Success

Problem: Need to analyze results to understand what's working and what needs improvement.

Analysis of gaia_results_20260113_174815.json (30% score):

Results Breakdown:

  • 6 Correct (30%):

    • a1e91b78 (YouTube bird count) - Phase 1 fix working ✓
    • 9d191bce (YouTube Teal'c) - Phase 1 fix working ✓
    • 6f37996b (CSV table) - Calculator working ✓
    • 1f975693 (Calculus MP3) - Audio transcription working ✓
    • 99c9cc74 (Strawberry pie MP3) - Audio transcription working ✓
    • 7bd855d8 (Excel food sales) - File parsing working ✓
  • 3 System Errors (15%):

    • 2d83110e (Reverse text) - Calculator: SyntaxError
    • cca530fc (Chess position) - NoneType error (vision)
    • f918266a (Python code) - parse_file: ValueError
  • 10 "Unable to answer" (50%):

    • Search evidence extraction insufficient
    • Need better LLM prompts or search processing
  • 1 Wrong Answer (5%):

    • 4fc2f1ae (Wikipedia dinosaur) - Found "Jimfbleak" instead of "FunkMonk"

Phase 1 Impact (YouTube + Audio):

  • Fixed 4 questions that would have failed before
  • YouTube transcription with Whisper fallback working
  • Audio transcription working well

Next Steps:

  1. Fix 3 system errors (text manipulation, vision NoneType, Python execution)
  2. Improve search evidence extraction (10 questions)
  3. Investigate wrong answer (Wikipedia search precision)

[2026-01-13] [Feature] [COMPLETED] Phase 1: YouTube + Audio Transcription Support

Problem: Questions with YouTube videos and audio files couldn't be answered.

Solution: Implemented two-phase transcription system.

YouTube Transcription (src/tools/youtube.py):

  • Extracts transcript using youtube_transcript_api
  • Falls back to Whisper audio transcription if captions unavailable
  • Saves transcript to _log/{video_id}_transcript.txt

Audio Transcription (src/tools/audio.py):

  • Uses Groq's Whisper-large-v3 model (ZeroGPU compatible)
  • Supports MP3, WAV, M4A, OGG, FLAC, AAC formats
  • Saves transcript to _log/ for debugging

Impact:

  • 4 additional questions answered correctly (30% vs ~10% before)
  • 9d191bce (YouTube Teal'c) - "Extremely" ✓
  • a1e91b78 (YouTube birds) - "3" ✓
  • 1f975693 (Calculus MP3) - "132, 133, 134, 197, 245" ✓
  • 99c9cc74 (Strawberry pie MP3) - Full ingredient list ✓

Status: Phase 1 complete, hit 30% target score

[2026-01-12] [Infrastructure] [COMPLETED] Session Log Implementation

Problem: Need to track LLM synthesis context for debugging and analysis.

Solution: Created session-level logging system in src/agent/llm_client.py.

Implementation:

  • Session log: _log/llm_session_YYYYMMDD_HHMMSS.txt
  • Per-question log: _log/{video_id}_transcript.txt (YouTube only)
  • Captures: questions, evidence items, LLM prompts, answers
  • Structured format with timestamps and delimiters

Result: Full audit trail for debugging failed questions

[2026-01-13] [Infrastructure] [COMPLETED] Git Commit & HF Push

Problem: Need to deploy changes to HuggingFace Spaces.

Solution: Committed and pushed latest changes.

Commit: 3dcf523 - "refactor: update folder structure and adjust output paths"

Changes Deployed:

  • 3-tier folder naming convention
  • Session-level logging
  • Project template reference move
  • Git ignore fixes

Result: HF Space updated with latest code

[2026-01-13] [Testing] [COMPLETED] Phase 0 Vision API Validation

Problem: Need to validate vision API works before integrating into agent.

Solution: Created test suite test/test_phase0_hf_vision_api.py.

Test Results:

  • Tested 4 image sources
  • Validated multimodal LLM responses
  • Confirmed HF Inference API compatibility
  • Identified NoneType edge case (empty responses)

File: user_io/result_ServerApp/phase0_vision_validation_*.json

Result: Vision API validated, ready for integration

[2026-01-11] [Feature] [COMPLETED] Multi-Modal Vision Support

Problem: Agent couldn't process image-based questions (chess positions, charts, etc.).

Solution: Implemented vision tool using HuggingFace Inference API.

Implementation (src/tools/vision.py):

  • analyze_image() - Main vision analysis function
  • Supports JPEG, PNG, GIF, BMP, WebP formats
  • Returns detailed descriptions of visual content
  • Fallback to Gemini/Claude if HF fails

Status: Implemented, some NoneType errors remain

[2026-01-10] [Feature] [COMPLETED] File Parser Tool

Problem: Agent couldn't read uploaded files (PDF, Excel, Word, CSV, etc.).

Solution: Implemented unified file parser (src/tools/file_parser.py).

Supported Formats:

  • PDF (parse_pdf) - PyPDF2 extraction
  • Excel (parse_excel) - Calamine-based parsing
  • Word (parse_word) - python-docx extraction
  • Text/CSV (parse_text) - UTF-8 text reading
  • Unified parse_file() - Auto-detects format
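The extension-based dispatch can be sketched as follows; only the text branch is filled in here, and the mapping name is an assumption (the real tool calls into PyPDF2 / calamine / python-docx for the binary formats):

```python
from pathlib import Path

_PARSER_BY_SUFFIX = {
    ".pdf": "pdf", ".xlsx": "excel", ".xls": "excel",
    ".docx": "word", ".txt": "text", ".csv": "text",
}

def detect_format(path):
    """Map a file extension to a parser name; raise for unsupported types."""
    suffix = Path(path).suffix.lower()
    if suffix not in _PARSER_BY_SUFFIX:
        raise ValueError(f"Unsupported file type: {suffix}")
    return _PARSER_BY_SUFFIX[suffix]

def parse_file(path):
    """Unified entry point: auto-detect the format, then parse."""
    fmt = detect_format(path)
    if fmt == "text":
        return Path(path).read_text(encoding="utf-8")
    raise NotImplementedError(f"{fmt} parsing requires its format library")
```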

Result: Agent can now read file attachments

[2026-01-09] [Feature] [COMPLETED] Calculator Tool

Problem: Agent couldn't perform mathematical calculations.

Solution: Implemented safe expression evaluator (src/tools/calculator.py).

Features:

  • safe_eval() - Safe math expression evaluation
  • Supports: arithmetic, algebra, trigonometry, logarithms
  • Constants: pi, e
  • Functions: sqrt, sin, cos, log, abs, etc.
  • Error handling for invalid expressions
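One common way to build such a safe evaluator is to walk the parsed AST and whitelist operators, constants, and functions, never calling eval(). This sketch is illustrative and not necessarily how src/tools/calculator.py is implemented:

```python
import ast
import math
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
    ast.USub: operator.neg, ast.UAdd: operator.pos,
}
_NAMES = {"pi": math.pi, "e": math.e}
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
          "log": math.log, "abs": abs}

def safe_eval(expr):
    """Evaluate a math expression by walking its AST; anything outside the
    whitelists (attribute access, imports, names) raises ValueError."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in _NAMES:
            return _NAMES[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS):
            return _FUNCS[node.func.id](*[_eval(a) for a in node.args])
        raise ValueError(f"Unsupported expression: {ast.dump(node)}")
    return _eval(ast.parse(expr, mode="eval"))
```

Because only whitelisted node types are evaluated, expressions like `__import__('os')` fail with ValueError instead of executing.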

Result: CSV table question answered correctly (6f37996b)

[2026-01-08] [Feature] [COMPLETED] Web Search Tool

Problem: Agent couldn't access current information beyond training data.

Solution: Implemented web search using Tavily API (src/tools/web_search.py).

Features:

  • tavily_search() - Primary search via Tavily
  • exa_search() - Fallback via Exa (if available)
  • Unified search() - Auto-fallback chain
  • Returns structured results with titles, snippets, URLs
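The auto-fallback chain can be sketched as follows; taking `providers` as a parameter is an illustrative simplification (the real tool presumably wires tavily_search then exa_search internally):

```python
def search(query, providers):
    """Try each search provider in order; return the first non-empty
    result list, else raise with every provider's error collected."""
    errors = []
    for provider in providers:
        try:
            results = provider(query)
            if results:
                return results
        except Exception as exc:  # provider down, quota exceeded, bad key...
            errors.append(f"{getattr(provider, '__name__', 'provider')}: {exc}")
    raise RuntimeError("All search providers failed: " + "; ".join(errors))
```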

Configuration:

  • TAVILY_API_KEY required
  • EXA_API_KEY optional (fallback)

Result: Agent can now search web for current information

[2026-01-07] [Infrastructure] [COMPLETED] Project Initialization

Problem: New project setup required.

Solution: Initialized project structure with standard files.

Created:

  • README.md - Project documentation
  • CLAUDE.md - Project-specific AI instructions
  • CHANGELOG.md - Session tracking
  • .gitignore - Git exclusions
  • requirements.txt - Dependencies
  • pyproject.toml - UV package config

Result: Project scaffold ready for development

Date: YYYY-MM-DD Dev Record: [link to dev/dev_YYMMDD_##_concise_title.md]

What Was Changed

  • Change 1
  • Change 2