# ChromaFlow Agent 1.0 - GAIA Benchmark Results

## Summary
| Metric | Run 5 (Baseline) | Run 6 (+8 Fixes) | Delta |
|---|---|---|---|
| Accuracy | 55.0% (11/20) | 75.0% (15/20) | +20pp |
| Tasks Flipped | - | 4 (T8, T11, T13, T18) | +4 |
| Regressions | - | 0 | Clean |
| Avg Time/Task | ~540s | 180s | -67% |
## Leaderboard Context
Our 75.0% on Level 1 validation positions ChromaFlow Agent competitively:
| Rank | Agent | L1 Score | Overall |
|---|---|---|---|
| 1 | HAL Generalist (Claude Sonnet 4.5) | 82.07% | 74.55% |
| 2 | HAL Generalist (Claude Sonnet 4.5 High) | 77.36% | 70.91% |
| 3 | HAL Generalist (Claude Opus 4.1 High) | 71.70% | 68.48% |
| - | ChromaFlow Agent 1.0 (GPT-5.2) | 75.0%* | - |
| 4 | HAL Generalist (Claude Opus 4 High) | 71.70% | 64.85% |
\*Validation split; the official leaderboard uses the test split.
ChromaFlow Agent's 75% Level 1 validation score would rank approximately 2nd-3rd on the official leaderboard's Level 1 metric, between HAL's Claude Sonnet 4.5 High (77.36%) and Claude Opus 4.1 High (71.70%).
## Model Details
- Base Model: GPT-5.2 (400K context, 128K output, xhigh reasoning, temperature 0.0)
- Agent Framework: ChromaFlow Agent 1.0 (custom ToolCallAgent architecture)
- Tools: Bash, PythonExecute, StrReplaceEditor, WebSearch, Crawl4AI, YouTubeTranscript, BrowserUse, Terminate
- Verification: Always-on answer verifier with specificity/fact-check/logic checks
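The always-on verifier's specificity/fact-check/logic checks could be approximated as lightweight answer filters. The sketch below is an illustrative assumption (function names, vague-word list, and format rules are invented, not ChromaFlow's actual implementation):

```python
import re

# Hypothetical verifier sketch: GAIA expects exact short answers, so a
# specificity check rejects hedged text and a format check rejects answers
# whose shape cannot match the question (e.g. non-numeric "numbers").
VAGUE_MARKERS = {"maybe", "possibly", "it depends", "unknown", "not sure"}

def check_specificity(answer: str) -> bool:
    """Reject empty or hedged answers."""
    low = answer.lower()
    return bool(answer.strip()) and not any(m in low for m in VAGUE_MARKERS)

def check_format(answer: str, expected_kind: str) -> bool:
    """Cheap logic-boundary check: numeric answers must actually parse."""
    if expected_kind == "number":
        return re.fullmatch(r"-?\d+(\.\d+)?", answer.strip()) is not None
    return True

def verify(answer: str, expected_kind: str = "text") -> bool:
    return check_specificity(answer) and check_format(answer, expected_kind)
```

A failed check would trigger a retry rather than submitting the answer, which is consistent with the verifier-driven flip credited under "Stronger Verifier Prompt" below.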
## Per-Task Results (Run 6)
| Task | Result | Quality | Time | Answer |
|---|---|---|---|---|
| T01 | PASS | 0.96 | 110s | No |
| T02 | PASS | 0.99 | 2s | Guava |
| T03 | PASS | 0.93 | 59s | 100 |
| T04 | PASS | 0.93 | 114s | 0.1777 |
| T05 | FAIL | 0.74 | 225s | (OCR misread) |
| T06 | FAIL | 0.32 | 88s | witness (gold: inference) |
| T07 | FAIL | 0.62 | 467s | Qxc3 (gold: Rd5) |
| T08 | **PASS** | 0.86 | 528s | Braintree, Honolulu |
| T09 | PASS | 0.93 | 75s | Annie Levin |
| T10 | PASS | 0.74 | 95s | 2 |
| T11 | **PASS** | 0.78 | 270s | CUB |
| T12 | PASS | 0.92 | 244s | 4 |
| T13 | **PASS** | 0.86 | 135s | Rockhopper penguin |
| T14 | PASS | 0.90 | 49s | broccoli, celery, fresh basil, lettuce, sweet potatoes |
| T15 | PASS | 0.78 | 85s | BaseLabelPropagation |
| T16 | FAIL | 0.20 | 600s | (audio timeout) |
| T17 | FAIL | 0.74 | 314s | 14 (gold: 3) |
| T18 | **PASS** | 0.86 | 103s | Louvrier |
| T19 | PASS | 0.99 | 40s | Logic formula |
| T20 | PASS | 0.86 | 4s | Extremely |
**PASS** in bold marks a task flipped from FAIL in Run 5.
## 8 Fixes Applied (Run 5 to Run 6)
| Fix | Priority | Impact | Description |
|---|---|---|---|
| Configurable Bash Timeout + SIGKILL | P0 | +2 tasks | 45s timeout, process group kill, auto-restart |
| YouTube Transcript Tool | P0 | +1 task | youtube_transcript_api with yt-dlp fallback |
| System Prompt Overhaul | P1 | Indirect | 6 new protocols: timeout, specificity, web, verification, image, chess |
| Enhanced Image Pipeline | P1 | 0 | Dual analysis: visual + OCR + python-chess |
| Tuned pass@k | P1 | Indirect | pass@k-all: 1 -> 2; pass@k-hard: 5 -> 4 |
| Stronger Verifier Prompt | P1 | +1 task | Specificity, fact-check, logic boundary checks |
| Retry Strategy Hints | P2 | Indirect | Context-aware retry prompts based on failure mode |
| _is_hard_result Improvement | P2 | Indirect | Timeout detection from error text + elapsed time |
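The P0 bash fix (45s timeout, process-group kill) can be sketched with the standard library. This is a minimal POSIX approximation under assumed semantics, not the agent's actual tool code; the auto-restart part is omitted:

```python
import os
import signal
import subprocess

def run_bash(cmd: str, timeout: float = 45.0):
    """Run a shell command in its own process group so a timeout can
    SIGKILL the whole tree (children included), not just the shell."""
    proc = subprocess.Popen(
        cmd, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        start_new_session=True,  # new process group -> killable as a unit
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out.decode(errors="replace")
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)  # kill the group
        proc.wait()  # reap so no zombie is left behind
        return None, f"[timeout after {timeout}s, process group killed]"
```

With this shape, `run_bash("sleep 600", timeout=1.0)` returns in about a second instead of hanging the task; a wrapper would then spawn a fresh shell (the "auto-restart" in the table).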
## Configuration

```json
{
  "model": "GPT-5.2 (xhigh reasoning, temp 0.0)",
  "benchmark": "GAIA 2023 Level 1 Validation",
  "tasks": 20,
  "seed": 42,
  "pass_k_all": 2,
  "pass_k_hard": 4,
  "bash_timeout": "45s",
  "task_timeout_cap": "600s",
  "wall_time_budget": "600s",
  "tool_call_budget": 120,
  "token_budget": 380000,
  "verifier_mode": "always",
  "optimus_mode": false
}
```
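Durations in the config are strings like "45s". A small loader could normalize them to numbers before use; this is a hypothetical helper, not part of the released code:

```python
import json
import re

def parse_duration(value: str) -> float:
    """Convert duration strings like '45s' or '600s' into seconds."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)s", value.strip())
    if not m:
        raise ValueError(f"unrecognized duration: {value!r}")
    return float(m.group(1))

def load_config(text: str) -> dict:
    """Parse the JSON config and normalize the duration-valued keys."""
    cfg = json.loads(text)
    for key in ("bash_timeout", "task_timeout_cap", "wall_time_budget"):
        cfg[key] = parse_duration(cfg[key])
    return cfg
```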
## Benchmark Run Details
- Date: February 19, 2026
- Total Attempts: 80 (4.0 avg per task)
- Total Wall Time: 60.1 minutes (242 min benchmark elapsed)
- Fastest Task: 2s (T02 - fruit identification)
- Slowest Task: 600s (T16 - audio processing timeout)
- Grounded Answers: 19/20 (95%)
- Average Quality Score: 0.795
## Citation

```bibtex
@misc{chromaflow2026gaia,
  title={ChromaFlow Agent 1.0: GAIA Benchmark Results},
  author={Tarun Mittal},
  year={2026},
  url={https://huggingface.co/ChromaFlow9897/chromaflow-gaia-benchmark}
}
```
## Evaluation Results

- Run 6 Accuracy (with fixes): 75.0% on GAIA 2023 Level 1, validation set (self-reported)
- Run 5 Accuracy (baseline): 55.0% on GAIA 2023 Level 1, validation set (self-reported)