Upload README.md with huggingface_hub
README.md (added):
---
language:
- en
tags:
- gaia
- benchmark
- agent
- chromaflow
- gpt-5.2
datasets:
- gaia-benchmark/GAIA
metrics:
- accuracy
model-index:
- name: ChromaFlow Agent 1.0
  results:
  - task:
      type: question-answering
      name: GAIA Level 1 Validation
    dataset:
      type: gaia-benchmark/GAIA
      name: GAIA 2023 Level 1 Validation
      split: validation
      config: 2023_level1
    metrics:
    - type: accuracy
      value: 75.0
      name: Run 6 Accuracy (with fixes)
    - type: accuracy
      value: 55.0
      name: Run 5 Accuracy (baseline)
---

# ChromaFlow Agent 1.0 - GAIA Benchmark Results

## Summary

| Metric | Run 5 (Baseline) | Run 6 (+8 Fixes) | Delta |
|--------|:----------------:|:----------------:|:-----:|
| **Accuracy** | **55.0% (11/20)** | **75.0% (15/20)** | **+20pp** |
| Tasks Flipped | - | 4 (T08, T11, T13, T18) | +4 |
| Regressions | - | 0 | Clean |
| Avg Time/Task | ~540s | 180s | -67% |

## Leaderboard Context

Our **75.0% on Level 1 validation** positions ChromaFlow Agent competitively:

| Rank | Agent | L1 Score | Overall |
|:----:|-------|:--------:|:-------:|
| 1 | HAL Generalist (Claude Sonnet 4.5) | **82.07%** | 74.55% |
| 2 | HAL Generalist (Claude Sonnet 4.5 High) | **77.36%** | 70.91% |
| - | **ChromaFlow Agent 1.0 (GPT-5.2)** | **75.0%**\* | - |
| 3 | HAL Generalist (Claude Opus 4.1 High) | **71.70%** | 68.48% |
| 4 | HAL Generalist (Claude Opus 4 High) | **71.70%** | 64.85% |

*\*Validation split; the official leaderboard uses the test split.*

**ChromaFlow Agent's 75.0% Level 1 validation score would rank approximately 2nd-3rd** on the official leaderboard's Level 1 metric, between HAL's Claude Sonnet 4.5 High (77.36%) and Claude Opus 4.1 High (71.70%).

## Model Details

- **Base Model:** GPT-5.2 (400K context, 128K output, xhigh reasoning, temperature 0.0)
- **Agent Framework:** ChromaFlow Agent 1.0 (custom ToolCallAgent architecture; see the loop sketch below)
- **Tools:** Bash, PythonExecute, StrReplaceEditor, WebSearch, Crawl4AI, YouTubeTranscript, BrowserUse, Terminate
- **Verification:** Always-on answer verifier with specificity/fact-check/logic checks

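The ToolCallAgent internals are not published in this card, so the following is an illustration only: a minimal sketch of what a tool-call loop with an always-on verifier gate can look like. Every name here (`solve`, `heuristic_verify`, the scripted `policy`) is hypothetical and stands in for the real model, tools, and specificity/fact-check/logic verifier.

```python
# Minimal sketch of a ToolCallAgent-style loop with an always-on verifier.
# All names are hypothetical, not ChromaFlow's actual implementation.
from typing import Callable

Tool = Callable[[dict], str]

def heuristic_verify(answer: str) -> bool:
    """Toy stand-in for the specificity/fact-check/logic verifier."""
    return bool(answer.strip()) and answer.strip().lower() not in {"unknown", "n/a"}

def solve(
    policy: Callable[[list], tuple[str, dict]],  # model: history -> (tool, args)
    tools: dict[str, Tool],
    tool_call_budget: int = 120,
) -> str | None:
    history: list = []
    for _ in range(tool_call_budget):
        name, args = policy(history)
        if name == "Terminate":              # agent proposes a final answer
            answer = args.get("answer", "")
            if heuristic_verify(answer):     # verifier gates every answer
                return answer
            history.append(("verifier_reject", answer))
            continue                         # rejected: keep working
        observation = tools[name](args)      # dispatch to Bash / WebSearch / ...
        history.append((name, args, observation))
    return None                              # budget exhausted

# Toy usage: a scripted "policy" that searches once, then answers.
if __name__ == "__main__":
    script = iter([("WebSearch", {"q": "capital of France"}),
                   ("Terminate", {"answer": "Paris"})])
    tools = {"WebSearch": lambda a: f"results for {a['q']}"}
    print(solve(lambda h: next(script), tools))  # -> Paris
```

The design point the sketch captures is that a rejected answer is not returned but appended to the history, so the model sees the objection and keeps working within the tool-call budget.
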
## Per-Task Results (Run 6)

| Task | Result | Quality | Time | Answer |
|------|:------:|:-------:|-----:|--------|
| T01 | PASS | 0.96 | 110s | No |
| T02 | PASS | 0.99 | 2s | Guava |
| T03 | PASS | 0.93 | 59s | 100 |
| T04 | PASS | 0.93 | 114s | 0.1777 |
| T05 | FAIL | 0.74 | 225s | (OCR misread) |
| T06 | FAIL | 0.32 | 88s | witness (gold: inference) |
| T07 | FAIL | 0.62 | 467s | Qxc3 (gold: Rd5) |
| T08 | **PASS** | 0.86 | 528s | Braintree, Honolulu |
| T09 | PASS | 0.93 | 75s | Annie Levin |
| T10 | PASS | 0.74 | 95s | 2 |
| T11 | **PASS** | 0.78 | 270s | CUB |
| T12 | PASS | 0.92 | 244s | 4 |
| T13 | **PASS** | 0.86 | 135s | Rockhopper penguin |
| T14 | PASS | 0.90 | 49s | broccoli, celery, fresh basil, lettuce, sweet potatoes |
| T15 | PASS | 0.78 | 85s | BaseLabelPropagation |
| T16 | FAIL | 0.20 | 600s | (audio timeout) |
| T17 | FAIL | 0.74 | 314s | 14 (gold: 3) |
| T18 | **PASS** | 0.86 | 103s | Louvrier |
| T19 | PASS | 0.99 | 40s | Logic formula |
| T20 | PASS | 0.86 | 4s | Extremely |

**Bold PASS** = flipped from FAIL in Run 5.

## 8 Fixes Applied (Run 5 to Run 6)

| Fix | Priority | Impact | Description |
|-----|:--------:|:------:|-------------|
| Configurable Bash Timeout + SIGKILL | P0 | +2 tasks | 45s timeout, process-group kill, auto-restart (see the sketch after this table) |
| YouTube Transcript Tool | P0 | +1 task | youtube_transcript_api with yt-dlp fallback |
| System Prompt Overhaul | P1 | Indirect | 6 new protocols: timeout, specificity, web, verification, image, chess |
| Enhanced Image Pipeline | P1 | 0 | Dual analysis: visual + OCR + python-chess |
| Adjusted pass@k | P1 | Indirect | pass@k-all: 1 -> 2, pass@k-hard: 5 -> 4 |
| Stronger Verifier Prompt | P1 | +1 task | Specificity, fact-check, and logic boundary checks |
| Retry Strategy Hints | P2 | Indirect | Context-aware retry prompts based on failure mode |
| `_is_hard_result` Improvement | P2 | Indirect | Timeout detection from error text + elapsed time |

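The timeout fix in the first row relies on a standard POSIX technique: launch the command in its own session (and therefore its own process group), then on timeout SIGKILL the whole group so stray children die along with the shell. A minimal sketch using only the standard library follows; the function name and return convention are illustrative, not ChromaFlow's actual code.

```python
# Sketch of the configurable-timeout + SIGKILL technique (POSIX only).
# Helper name and return convention are illustrative.
import os
import signal
import subprocess

def run_bash(cmd: str, timeout: float = 45.0) -> tuple[int, str]:
    """Run cmd in its own process group; SIGKILL the whole group on timeout."""
    proc = subprocess.Popen(
        ["bash", "-c", cmd],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # new session = new process group, children included
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)  # kill the group, not just the shell
        proc.wait()  # reap; the caller can then auto-restart the session
        return -signal.SIGKILL, f"[timeout after {timeout}s; process group killed]"
```

Killing the process group rather than the lone shell is what prevents hung grandchildren (for example, a stuck audio decoder) from pinning the task until the 600s wall-time cap.
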
## Configuration

```json
{
  "model": "GPT-5.2 (xhigh reasoning, temp 0.0)",
  "benchmark": "GAIA 2023 Level 1 Validation",
  "tasks": 20,
  "seed": 42,
  "pass_k_all": 2,
  "pass_k_hard": 4,
  "bash_timeout": "45s",
  "task_timeout_cap": "600s",
  "wall_time_budget": "600s",
  "tool_call_budget": 120,
  "token_budget": 380000,
  "verifier_mode": "always",
  "optimus_mode": false
}
```

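The card does not spell out how `pass_k_all` and `pass_k_hard` interact, so the sketch below is one plausible reading, stated as an assumption: every task gets up to `pass_k_all` attempts, and a task whose failure looks hard (for example, `_is_hard_result` flags a timeout) is granted up to `pass_k_hard`. The `attempt` and `is_hard_result` callables are hypothetical stand-ins for the real agent.

```python
# Assumed semantics of the pass@k knobs: normal tasks get up to pass_k_all
# attempts; tasks whose failures look "hard" are upgraded to pass_k_hard.
from typing import Callable

CONFIG = {"pass_k_all": 2, "pass_k_hard": 4}  # subset of the run config above

def solve_with_retries(
    task: str,
    attempt: Callable[[str, int], tuple[str | None, str]],  # -> (answer, error_text)
    is_hard_result: Callable[[str], bool],                   # timeout/error heuristic
) -> str | None:
    budget = CONFIG["pass_k_all"]
    for k in range(CONFIG["pass_k_hard"]):
        if k >= budget:
            break                            # normal budget exhausted
        answer, error_text = attempt(task, k)
        if answer is not None:
            return answer
        if is_hard_result(error_text):       # e.g. timeout detected
            budget = CONFIG["pass_k_hard"]   # allow the extra attempts
    return None
```
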
## Benchmark Run Details

- **Date:** February 19, 2026
- **Total Attempts:** 80 (4.0 avg per task)
- **Total Wall Time:** 60.1 minutes summed across tasks (242 min total benchmark elapsed)
- **Fastest Task:** 2s (T02 - fruit identification)
- **Slowest Task:** 600s (T16 - audio processing timeout)
- **Grounded Answers:** 19/20 (95%)
- **Average Quality Score:** 0.795

## Citation

```bibtex
@misc{chromaflow2026gaia,
  title={ChromaFlow Agent 1.0: GAIA Benchmark Results},
  author={Tarun Mittal},
  year={2026},
  url={https://huggingface.co/ChromaFlow9897/chromaflow-gaia-benchmark}
}
```