
Architecture

System Overview

graph LR
    User[User via Gradio] --> App[app.py]
    App -->|fetch questions| API[GAIA Scoring API]
    App -->|"run per question (+ file_name)"| Agent[GAIAAgent]
    Agent --> Supervisor["Supervisor (GPT-5-mini)"]
    Supervisor -->|delegate| WebAgent[Web Research Agent]
    Supervisor -->|delegate| CodeAgent[Code Execution Agent]
    Supervisor -->|delegate| FileAgent[File Processing Agent]
    Supervisor -->|delegate| MathAgent[Math Agent]
    Agent -->|"extract FINAL ANSWER"| App
    App -->|submit answers| API
    App -->|save| Log[submission_payload.log]

Supervisor Routing

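The supervisor receives each question and routes it by fixed rules, checked in order: a file marker sends it to the File Processing Agent, a YouTube URL sends it to the Web Research Agent, and anything else is classified by question type. The [IMPORTANT CONTEXT: ...] marker in the prompt is the only signal for file-based routing.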

flowchart TD
    Q[Incoming Question] --> Analyze[Supervisor Analyzes Question]
    Analyze --> HasMarker{"Has [IMPORTANT CONTEXT: file...] marker?"}
    HasMarker -->|Yes| FileAgent[File Processing Agent]
    HasMarker -->|No| HasYT{Contains YouTube URL?}
    HasYT -->|Yes| WebAgent[Web Research Agent]
    HasYT -->|No| Classify{Question Type?}
    Classify -->|Facts / Search| WebAgent
    Classify -->|Code / Algorithm| CodeAgent[Code Execution Agent]
    Classify -->|Math / Calculation| MathAgent[Math Agent]
    FileAgent --> NeedMore{Need further processing?}
    NeedMore -->|Yes| Classify
    NeedMore -->|No| Extract["Extract FINAL ANSWER"]
    WebAgent --> Extract
    CodeAgent --> Extract
    MathAgent --> Extract
    Extract --> Return[Return to App]
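
The rule order in the flowchart above can be sketched as a plain function. This is only an illustration: in the actual system the supervisor is an LLM that makes these decisions from its prompt, so the `route` function, the `question_type` parameter, the YouTube pattern, and the fallback to the web agent are all assumptions introduced here.

```python
import re

# The marker prefix is taken from the text above; its full wording is not shown there.
FILE_MARKER = "[IMPORTANT CONTEXT:"

# Assumption: a simple pattern covering the two most common YouTube URL shapes.
YOUTUBE_RE = re.compile(
    r"https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]+"
)

# Assumption: the supervisor's classification is reduced to three labels.
AGENT_BY_TYPE = {"facts": "web", "code": "code", "math": "math"}


def route(prompt: str, question_type: str = "facts") -> str:
    """Return the sub-agent that should handle the prompt first."""
    # Rule 1: the file marker takes precedence over everything else.
    if FILE_MARKER in prompt:
        return "file"
    # Rule 2: YouTube URLs go to the web research agent.
    if YOUTUBE_RE.search(prompt):
        return "web"
    # Rule 3: otherwise route by question type (hypothetical fallback: web).
    return AGENT_BY_TYPE.get(question_type, "web")
```

Checking the marker first mirrors the diagram: a prompt that carries both a file marker and a YouTube URL still goes to the file agent.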

Agent-Tool Mapping

Each sub-agent is built with create_agent and has access to specific tools.

graph TD
    subgraph web ["Web Research Agent (GPT-5-mini)"]
        Tavily[Tavily Search]
        Wiki[Wikipedia]
        Gemini["Gemini 2.5 Pro Video"]
    end

    subgraph code ["Code Execution Agent (GPT-5-mini)"]
        PythonREPL1[Python REPL]
    end

    subgraph file ["File Processing Agent (GPT-5-mini)"]
        Download["GAIA File Downloader (HF Dataset)"]
        Excel[Excel/CSV Reader]
        Audio[Whisper Transcription]
        Vision["GPT-5-mini Vision (Responses API)"]
        TextFile[Text File Reader]
        PDF[PDF Reader]
        PythonREPL2[Python REPL]
    end

    subgraph math ["Math Agent (GPT-5-mini)"]
        Calc[Calculator]
        PythonREPL3[Python REPL]
    end
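
As a compact view of the diagram above, the mapping can be written as a dictionary. The tool identifiers below are hypothetical labels chosen for this sketch (only `analyze_youtube_video` appears elsewhere in this document), and `agents_with_tool` is a helper added for illustration.

```python
# Hypothetical tool identifiers, grouped the same way as the diagram.
AGENT_TOOLS = {
    "web": ["tavily_search", "wikipedia", "analyze_youtube_video"],
    "code": ["python_repl"],
    "file": [
        "gaia_file_downloader",
        "excel_csv_reader",
        "whisper_transcription",
        "vision",
        "text_file_reader",
        "pdf_reader",
        "python_repl",
    ],
    "math": ["calculator", "python_repl"],
}


def agents_with_tool(tool: str) -> list[str]:
    """List the agents that have the given tool, in alphabetical order."""
    return sorted(agent for agent, tools in AGENT_TOOLS.items() if tool in tools)
```

For example, `agents_with_tool("python_repl")` shows that the Python REPL is the one tool shared by the code, file, and math agents.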

Answer Flow

All agents use the GAIA answer format prompt: reason through the problem, then output FINAL ANSWER: [answer]. The extraction layer strips the FINAL ANSWER: prefix so that only the answer itself is submitted.

flowchart LR
    Prompt["GAIA Answer Format Prompt"] --> Agent["Sub-Agent Reasons"]
    Agent --> FA["FINAL ANSWER: 42"]
    FA --> Extract["_extract_answer()"]
    Extract --> Clean["42"]
    Clean --> Submit["POST /submit"]
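
A minimal sketch of the extraction step, assuming a regex that takes the text after the last FINAL ANSWER: marker. The real `_extract_answer()` is not shown in this document, so the function name `extract_answer_sketch`, the case-insensitive match, and the fallback to the full response are assumptions.

```python
import re

# Assumption: match case-insensitively and capture the rest of the line.
_FINAL_ANSWER_RE = re.compile(r"FINAL ANSWER:\s*(.*)", re.IGNORECASE)


def extract_answer_sketch(text: str) -> str:
    """Return the text after the last 'FINAL ANSWER:' marker, stripped."""
    matches = _FINAL_ANSWER_RE.findall(text)
    if not matches:
        # Hypothetical fallback: return the whole response if no marker exists.
        return text.strip()
    # Use the last occurrence in case the agent restated its answer.
    return matches[-1].strip()
```

Taking the last match rather than the first is a design choice: an agent that revises itself mid-response usually puts the answer it stands by at the end.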

Data Flow — Single Question

sequenceDiagram
    participant App as app.py
    participant GA as GAIAAgent
    participant SV as Supervisor
    participant SA as Sub-Agent
    participant Tool as Tool

    App->>GA: question + task_id + file_name
    GA->>GA: Build prompt with file context if file_name present
    GA->>SV: Invoke graph with messages
    SV->>SV: Analyze question, pick agent
    SV->>SA: Delegate with full context
    SA->>Tool: Call tool (search, video, code, file, etc.)
    Tool-->>SA: Tool result
    SA->>SA: Reason and produce FINAL ANSWER
    SA-->>SV: Response with FINAL ANSWER
    SV-->>GA: Relay FINAL ANSWER
    GA->>GA: Extract answer via regex
    GA-->>App: Clean answer string

Submission Flow — Full Evaluation

sequenceDiagram
    participant User
    participant Gradio as Gradio UI
    participant App as app.py
    participant API as GAIA Scoring API
    participant Agent as GAIAAgent

    User->>Gradio: Click "Run Evaluation"
    Gradio->>App: run_and_submit_all(profile)
    App->>API: GET /questions
    API-->>App: 20 questions with file_name metadata

    loop For each question
        App->>Agent: agent(question, task_id, file_name)
        Agent-->>App: concise answer
    end

    App->>App: Save submission_payload.log
    App->>API: POST /submit (username, agent_code, answers)
    API-->>App: score, correct_count, total_attempted
    App-->>Gradio: Display results table + score
    Gradio-->>User: Show results
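
The evaluation loop and payload assembly can be sketched as follows. The top-level payload fields (username, agent_code, answers) come from the diagram above; the question dictionary keys, the `task_id`/`submitted_answer` keys inside each answer entry, and the JSON log format are assumptions made for this example.

```python
import json
from pathlib import Path


def run_all(questions, agent):
    """Call the agent once per question and collect answer entries."""
    answers = []
    for q in questions:
        # Assumption: each question is a dict with these keys.
        answer = agent(q["question"], q["task_id"], q.get("file_name", ""))
        answers.append({"task_id": q["task_id"], "submitted_answer": answer})
    return answers


def build_submission(username, agent_code, answers):
    """Assemble the payload with the three fields named in the diagram."""
    return {"username": username, "agent_code": agent_code, "answers": answers}


def save_payload(payload, path="submission_payload.log"):
    """Write the payload to disk before submitting (format assumed: JSON)."""
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Saving the payload before the POST means the answers survive even if the submission request fails.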

File Processing Pipeline

The file agent downloads the attachment from the Hugging Face GAIA dataset (with API fallback) and picks a reader for it based on the file extension:

flowchart LR
    Download["Download from HF Dataset"] --> Detect{File Extension?}
    Detect -->|.xlsx .xls .csv| Pandas[Pandas Reader]
    Detect -->|.mp3 .wav .m4a| Whisper[Whisper API]
    Detect -->|.png .jpg .gif .webp| GPT5V["GPT-5-mini Vision"]
    Detect -->|.pdf| PyPDF2[PyPDF2 Reader]
    Detect -->|.txt .py .json .md| TextReader[Text Reader]
    Pandas --> Analyze["Reason + FINAL ANSWER"]
    Whisper --> Analyze
    GPT5V --> Analyze
    PyPDF2 --> Analyze
    TextReader --> Analyze
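
The extension dispatch above can be expressed as a lookup table. The extensions are the ones listed in the diagram; the handler labels, the `pick_handler` name, and the `"unknown"` result for unlisted extensions are assumptions for this sketch.

```python
from pathlib import Path

# Extensions from the diagram, mapped to hypothetical handler labels.
HANDLER_BY_EXTENSION = {
    ".xlsx": "pandas", ".xls": "pandas", ".csv": "pandas",
    ".mp3": "whisper", ".wav": "whisper", ".m4a": "whisper",
    ".png": "vision", ".jpg": "vision", ".gif": "vision", ".webp": "vision",
    ".pdf": "pdf",
    ".txt": "text", ".py": "text", ".json": "text", ".md": "text",
}


def pick_handler(file_name: str) -> str:
    """Return the handler label for a file, matching the extension case-insensitively."""
    extension = Path(file_name).suffix.lower()
    # Assumption: unlisted extensions are reported rather than guessed at.
    return HANDLER_BY_EXTENSION.get(extension, "unknown")
```

Lowercasing the suffix keeps names such as `REPORT.XLSX` on the same path as `report.xlsx`.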

Video Analysis Pipeline

YouTube video questions are handled by the web research agent using Gemini's native video understanding. The URL is passed directly to the model, so no download is required:

flowchart LR
    Question["Question with YouTube URL"] --> WebAgent[Web Research Agent]
    WebAgent --> GeminiTool["analyze_youtube_video tool"]
    GeminiTool -->|"pass URL directly"| Gemini["Gemini 2.5 Pro"]
    Gemini -->|watches video| Response[Video Analysis Result]
    Response --> Answer["Reason + FINAL ANSWER"]