ChromaFlow Agent 1.0 - GAIA Benchmark Results

Summary

| Metric | Run 5 (Baseline) | Run 6 (+8 Fixes) | Delta |
|---|---|---|---|
| Accuracy | 55.0% (11/20) | 75.0% (15/20) | +20pp |
| Tasks Flipped | - | 4 (T8, T11, T13, T18) | +4 |
| Regressions | - | 0 | Clean |
| Avg Time/Task | ~540s | 180s | -67% |

Leaderboard Context

Our 75.0% on Level 1 validation positions ChromaFlow Agent competitively:

| Rank | Agent | L1 Score | Overall |
|---|---|---|---|
| 1 | HAL Generalist (Claude Sonnet 4.5) | 82.07% | 74.55% |
| 2 | HAL Generalist (Claude Sonnet 4.5 High) | 77.36% | 70.91% |
| - | ChromaFlow Agent 1.0 (GPT-5.2) | 75.0%* | - |
| 3 | HAL Generalist (Claude Opus 4.1 High) | 71.70% | 68.48% |
| 4 | HAL Generalist (Claude Opus 4 High) | 71.70% | 64.85% |

*Validation split; the official leaderboard uses the test split.

ChromaFlow Agent's 75% Level 1 validation score would rank approximately 2nd-3rd on the official leaderboard's Level 1 metric, between HAL's Claude Sonnet 4.5 High (77.36%) and Claude Opus 4.1 High (71.70%).

Model Details

  • Base Model: GPT-5.2 (400K context, 128K output, xhigh reasoning, temperature 0.0)
  • Agent Framework: ChromaFlow Agent 1.0 (custom ToolCallAgent architecture)
  • Tools: Bash, PythonExecute, StrReplaceEditor, WebSearch, Crawl4AI, YouTubeTranscript, BrowserUse, Terminate
  • Verification: Always-on answer verifier with specificity/fact-check/logic checks

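The always-on verifier described above runs specificity, fact-check, and logic checks against every candidate answer. A minimal sketch of what such a gate might look like — all names and heuristics here are hypothetical illustrations, not ChromaFlow's actual (unpublished) implementation:

```python
# Hypothetical sketch of an always-on answer verifier gate.
# ChromaFlow's real checks are LLM-driven; these are toy stand-ins.
from dataclasses import dataclass

VAGUE_MARKERS = ("it depends", "unknown", "not sure", "various")

@dataclass
class VerifierReport:
    specific: bool    # answer is a concrete value, not a hedge
    grounded: bool    # answer appears in the gathered evidence
    consistent: bool  # answer fits GAIA's short-string format

    @property
    def passed(self) -> bool:
        return self.specific and self.grounded and self.consistent

def verify_answer(answer: str, evidence: str) -> VerifierReport:
    a = answer.strip().lower()
    specific = bool(a) and not any(m in a for m in VAGUE_MARKERS)
    grounded = a in evidence.lower()
    consistent = len(a) < 200  # GAIA gold answers are short strings/numbers
    return VerifierReport(specific, grounded, consistent)
```

A failing report would trigger a retry rather than submission, e.g. `verify_answer("not sure", ...)` fails the specificity check.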
Per-Task Results (Run 6)

| Task | Result | Quality | Time | Answer |
|---|---|---|---|---|
| T01 | PASS | 0.96 | 110s | No |
| T02 | PASS | 0.99 | 2s | Guava |
| T03 | PASS | 0.93 | 59s | 100 |
| T04 | PASS | 0.93 | 114s | 0.1777 |
| T05 | FAIL | 0.74 | 225s | (OCR misread) |
| T06 | FAIL | 0.32 | 88s | witness (gold: inference) |
| T07 | FAIL | 0.62 | 467s | Qxc3 (gold: Rd5) |
| T08 | **PASS** | 0.86 | 528s | Braintree, Honolulu |
| T09 | PASS | 0.93 | 75s | Annie Levin |
| T10 | PASS | 0.74 | 95s | 2 |
| T11 | **PASS** | 0.78 | 270s | CUB |
| T12 | PASS | 0.92 | 244s | 4 |
| T13 | **PASS** | 0.86 | 135s | Rockhopper penguin |
| T14 | PASS | 0.90 | 49s | broccoli, celery, fresh basil, lettuce, sweet potatoes |
| T15 | PASS | 0.78 | 85s | BaseLabelPropagation |
| T16 | FAIL | 0.20 | 600s | (audio timeout) |
| T17 | FAIL | 0.74 | 314s | 14 (gold: 3) |
| T18 | **PASS** | 0.86 | 103s | Louvrier |
| T19 | PASS | 0.99 | 40s | Logic formula |
| T20 | PASS | 0.86 | 4s | Extremely |

Bold PASS = flipped from FAIL in Run 5.
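The headline numbers can be re-derived from the table above; a quick consistency check with the per-task values transcribed into a list:

```python
# (task, passed, quality) transcribed from the Run 6 per-task table.
results = [
    ("T01", True, 0.96), ("T02", True, 0.99), ("T03", True, 0.93),
    ("T04", True, 0.93), ("T05", False, 0.74), ("T06", False, 0.32),
    ("T07", False, 0.62), ("T08", True, 0.86), ("T09", True, 0.93),
    ("T10", True, 0.74), ("T11", True, 0.78), ("T12", True, 0.92),
    ("T13", True, 0.86), ("T14", True, 0.90), ("T15", True, 0.78),
    ("T16", False, 0.20), ("T17", False, 0.74), ("T18", True, 0.86),
    ("T19", True, 0.99), ("T20", True, 0.86),
]
passes = sum(1 for _, ok, _ in results if ok)           # 15
accuracy = passes / len(results)                        # 0.75
avg_quality = sum(q for *_, q in results) / len(results)  # 0.7955
```

The mean quality works out to 0.7955, consistent with the reported 0.795 to three figures.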

8 Fixes Applied (Run 5 to Run 6)

| Fix | Priority | Impact | Description |
|---|---|---|---|
| Configurable Bash Timeout + SIGKILL | P0 | +2 tasks | 45s timeout, process group kill, auto-restart |
| YouTube Transcript Tool | P0 | +1 task | `youtube_transcript_api` with yt-dlp fallback |
| System Prompt Overhaul | P1 | Indirect | 6 new protocols: timeout, specificity, web, verification, image, chess |
| Enhanced Image Pipeline | P1 | 0 | Dual analysis: visual + OCR + python-chess |
| Adjusted pass@k | P1 | Indirect | pass@k-all: 1 -> 2, pass@k-hard: 5 -> 4 |
| Stronger Verifier Prompt | P1 | +1 task | Specificity, fact-check, logic boundary checks |
| Retry Strategy Hints | P2 | Indirect | Context-aware retry prompts based on failure mode |
| `_is_hard_result` Improvement | P2 | Indirect | Timeout detection from error text + elapsed time |
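The P0 bash fix pairs a hard timeout with a process-group kill so that children spawned by the command cannot outlive it. A minimal sketch of that pattern on POSIX, assumed rather than taken from ChromaFlow's actual code:

```python
import os
import signal
import subprocess

def run_bash(cmd: str, timeout: float = 45.0) -> tuple[int, str]:
    """Run a shell command with a hard timeout; on expiry, SIGKILL
    the whole process group so background children die too."""
    proc = subprocess.Popen(
        cmd, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
        start_new_session=True,  # child leads its own process group
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)  # kill the group
        proc.wait()
        return -signal.SIGKILL, "(timed out)"
```

`start_new_session=True` is what makes the group kill safe: without it, `killpg` would target the agent's own process group.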

Configuration

```json
{
  "model": "GPT-5.2 (xhigh reasoning, temp 0.0)",
  "benchmark": "GAIA 2023 Level 1 Validation",
  "tasks": 20,
  "seed": 42,
  "pass_k_all": 2,
  "pass_k_hard": 4,
  "bash_timeout": "45s",
  "task_timeout_cap": "600s",
  "wall_time_budget": "600s",
  "tool_call_budget": 120,
  "verifier_mode": "always",
  "token_budget": 380000,
  "optimus_mode": false
}
```
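The timeout fields above are strings with an `s` suffix while the other budgets are bare integers; a small helper for normalizing both into numbers (an assumed convention for illustration, not part of any published ChromaFlow tooling):

```python
import json

# Subset of the config above, reproduced for a self-contained example.
CONFIG = """{
  "bash_timeout": "45s",
  "task_timeout_cap": "600s",
  "wall_time_budget": "600s",
  "tool_call_budget": 120
}"""

def seconds(value) -> int:
    """Accept either a bare integer or a '<n>s' duration string."""
    if isinstance(value, int):
        return value
    return int(value.rstrip("s"))

cfg = json.loads(CONFIG)
budgets = {key: seconds(val) for key, val in cfg.items()}
```

With this, `budgets["bash_timeout"]` is `45` and `budgets["tool_call_budget"]` passes through as `120`.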

Benchmark Run Details

  • Date: February 19, 2026
  • Total Attempts: 80 (4.0 avg per task)
  • Total Wall Time: 60.1 minutes (242 min benchmark elapsed)
  • Fastest Task: 2s (T02 - fruit identification)
  • Slowest Task: 600s (T16 - audio processing timeout)
  • Grounded Answers: 19/20 (95%)
  • Average Quality Score: 0.795
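The `_is_hard_result` fix flags a result as hard when either the error text or the elapsed time indicates a timeout — the mechanism that routes T16-style 600s failures into the larger pass@k-hard retry budget. A hedged sketch of that heuristic (signature, phrases, and the 95% threshold are assumptions):

```python
TIMEOUT_PHRASES = ("timed out", "timeout", "deadline exceeded")

def is_hard_result(error_text: str, elapsed_s: float,
                   time_cap_s: float = 600.0) -> bool:
    """Flag results that hit (or nearly hit) the task timeout cap,
    detected either from the error message or from elapsed wall time."""
    text_hit = any(p in error_text.lower() for p in TIMEOUT_PHRASES)
    time_hit = elapsed_s >= 0.95 * time_cap_s
    return text_hit or time_hit
```

Checking elapsed time as well as error text matters because a killed tool may surface no error string at all, only a task that ran to the cap.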

Citation

```bibtex
@misc{chromaflow2026gaia,
  title={ChromaFlow Agent 1.0: GAIA Benchmark Results},
  author={Tarun Mittal},
  year={2026},
  url={https://huggingface.co/ChromaFlow9897/chromaflow-gaia-benchmark}
}
```