ChromaFlow Agent 1.0 - GAIA Benchmark Results

Summary

| Metric | Run 5 (Baseline) | Run 6 (+8 Fixes) | Delta |
|---|---|---|---|
| Accuracy | 55.0% (11/20) | 75.0% (15/20) | +20pp |
| Tasks Flipped | - | 4 (T8, T11, T13, T18) | +4 |
| Regressions | - | 0 | Clean |
| Avg Time/Task | ~540s | 180s | -67% |

Leaderboard Context

Our 75.0% on Level 1 validation positions ChromaFlow Agent competitively:

| Rank | Agent | L1 Score | Overall |
|---|---|---|---|
| 1 | HAL Generalist (Claude Sonnet 4.5) | 82.07% | 74.55% |
| 2 | HAL Generalist (Claude Sonnet 4.5 High) | 77.36% | 70.91% |
| - | ChromaFlow Agent 1.0 (GPT-5.2) | 75.0%* | - |
| 3 | HAL Generalist (Claude Opus 4.1 High) | 71.70% | 68.48% |
| 4 | HAL Generalist (Claude Opus 4 High) | 71.70% | 64.85% |

*Validation split; the official leaderboard uses the test split.

ChromaFlow Agent's 75% Level 1 validation score would rank approximately 2nd-3rd on the official leaderboard's Level 1 metric, between HAL's Claude Sonnet 4.5 High (77.36%) and Claude Opus 4.1 High (71.70%).

Model Details

  • Base Model: GPT-5.2 (400K context, 128K output, xhigh reasoning, temperature 0.0)
  • Agent Framework: ChromaFlow Agent 1.0 (custom ToolCallAgent architecture)
  • Tools: Bash, PythonExecute, StrReplaceEditor, WebSearch, Crawl4AI, YouTubeTranscript, BrowserUse, Terminate
  • Verification: Always-on answer verifier with specificity/fact-check/logic checks

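The always-on verifier described above runs specificity, fact-check, and logic checks against every candidate answer. A minimal sketch of what such a gate might look like — all names and heuristics here are hypothetical illustrations, not ChromaFlow's actual (unpublished) implementation:

```python
# Hypothetical sketch of an always-on answer verifier gate.
# ChromaFlow's real checks are LLM-driven; these are toy stand-ins.
from dataclasses import dataclass

VAGUE_MARKERS = ("it depends", "unknown", "not sure", "various")

@dataclass
class VerifierReport:
    specific: bool    # answer is a concrete value, not a hedge
    grounded: bool    # answer appears in the gathered evidence
    consistent: bool  # answer fits GAIA's short-string format

    @property
    def passed(self) -> bool:
        return self.specific and self.grounded and self.consistent

def verify_answer(answer: str, evidence: str) -> VerifierReport:
    a = answer.strip().lower()
    specific = bool(a) and not any(m in a for m in VAGUE_MARKERS)
    grounded = a in evidence.lower()
    consistent = len(a) < 200  # GAIA gold answers are short strings/numbers
    return VerifierReport(specific, grounded, consistent)
```

A failing report would trigger a retry rather than submission, e.g. `verify_answer("not sure", ...)` fails the specificity check.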
Per-Task Results (Run 6)

| Task | Result | Quality | Time | Answer |
|---|---|---|---|---|
| T01 | PASS | 0.96 | 110s | No |
| T02 | PASS | 0.99 | 2s | Guava |
| T03 | PASS | 0.93 | 59s | 100 |
| T04 | PASS | 0.93 | 114s | 0.1777 |
| T05 | FAIL | 0.74 | 225s | (OCR misread) |
| T06 | FAIL | 0.32 | 88s | witness (gold: inference) |
| T07 | FAIL | 0.62 | 467s | Qxc3 (gold: Rd5) |
| T08 | **PASS** | 0.86 | 528s | Braintree, Honolulu |
| T09 | PASS | 0.93 | 75s | Annie Levin |
| T10 | PASS | 0.74 | 95s | 2 |
| T11 | **PASS** | 0.78 | 270s | CUB |
| T12 | PASS | 0.92 | 244s | 4 |
| T13 | **PASS** | 0.86 | 135s | Rockhopper penguin |
| T14 | PASS | 0.90 | 49s | broccoli, celery, fresh basil, lettuce, sweet potatoes |
| T15 | PASS | 0.78 | 85s | BaseLabelPropagation |
| T16 | FAIL | 0.20 | 600s | (audio timeout) |
| T17 | FAIL | 0.74 | 314s | 14 (gold: 3) |
| T18 | **PASS** | 0.86 | 103s | Louvrier |
| T19 | PASS | 0.99 | 40s | Logic formula |
| T20 | PASS | 0.86 | 4s | Extremely |

Bold PASS = flipped from FAIL in Run 5.
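The headline numbers can be re-derived from the table above; a quick consistency check with the per-task values transcribed into a list:

```python
# (task, passed, quality) transcribed from the Run 6 per-task table.
results = [
    ("T01", True, 0.96), ("T02", True, 0.99), ("T03", True, 0.93),
    ("T04", True, 0.93), ("T05", False, 0.74), ("T06", False, 0.32),
    ("T07", False, 0.62), ("T08", True, 0.86), ("T09", True, 0.93),
    ("T10", True, 0.74), ("T11", True, 0.78), ("T12", True, 0.92),
    ("T13", True, 0.86), ("T14", True, 0.90), ("T15", True, 0.78),
    ("T16", False, 0.20), ("T17", False, 0.74), ("T18", True, 0.86),
    ("T19", True, 0.99), ("T20", True, 0.86),
]
passes = sum(1 for _, ok, _ in results if ok)           # 15
accuracy = passes / len(results)                        # 0.75
avg_quality = sum(q for *_, q in results) / len(results)  # 0.7955
```

The mean quality works out to 0.7955, consistent with the reported 0.795 to three figures.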

8 Fixes Applied (Run 5 to Run 6)

| Fix | Priority | Impact | Description |
|---|---|---|---|
| Configurable Bash Timeout + SIGKILL | P0 | +2 tasks | 45s timeout, process group kill, auto-restart |
| YouTube Transcript Tool | P0 | +1 task | `youtube_transcript_api` with yt-dlp fallback |
| System Prompt Overhaul | P1 | Indirect | 6 new protocols: timeout, specificity, web, verification, image, chess |
| Enhanced Image Pipeline | P1 | 0 | Dual analysis: visual + OCR + python-chess |
| Adjusted pass@k | P1 | Indirect | pass@k-all: 1 -> 2, pass@k-hard: 5 -> 4 |
| Stronger Verifier Prompt | P1 | +1 task | Specificity, fact-check, logic boundary checks |
| Retry Strategy Hints | P2 | Indirect | Context-aware retry prompts based on failure mode |
| `_is_hard_result` Improvement | P2 | Indirect | Timeout detection from error text + elapsed time |
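The P0 bash fix pairs a hard timeout with a process-group kill so that children spawned by the command cannot outlive it. A minimal sketch of that pattern on POSIX, assumed rather than taken from ChromaFlow's actual code:

```python
import os
import signal
import subprocess

def run_bash(cmd: str, timeout: float = 45.0) -> tuple[int, str]:
    """Run a shell command with a hard timeout; on expiry, SIGKILL
    the whole process group so background children die too."""
    proc = subprocess.Popen(
        cmd, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
        start_new_session=True,  # child leads its own process group
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)  # kill the group
        proc.wait()
        return -signal.SIGKILL, "(timed out)"
```

`start_new_session=True` is what makes the group kill safe: without it, `killpg` would target the agent's own process group.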

Configuration

```json
{
  "model": "GPT-5.2 (xhigh reasoning, temp 0.0)",
  "benchmark": "GAIA 2023 Level 1 Validation",
  "tasks": 20,
  "seed": 42,
  "pass_k_all": 2,
  "pass_k_hard": 4,
  "bash_timeout": "45s",
  "task_timeout_cap": "600s",
  "wall_time_budget": "600s",
  "tool_call_budget": 120,
  "verifier_mode": "always",
  "token_budget": 380000,
  "optimus_mode": false
}
```
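The timeout fields above are strings with an `s` suffix while the other budgets are bare integers; a small helper for normalizing both into numbers (an assumed convention for illustration, not part of any published ChromaFlow tooling):

```python
import json

# Subset of the config above, reproduced for a self-contained example.
CONFIG = """{
  "bash_timeout": "45s",
  "task_timeout_cap": "600s",
  "wall_time_budget": "600s",
  "tool_call_budget": 120
}"""

def seconds(value) -> int:
    """Accept either a bare integer or a '<n>s' duration string."""
    if isinstance(value, int):
        return value
    return int(value.rstrip("s"))

cfg = json.loads(CONFIG)
budgets = {key: seconds(val) for key, val in cfg.items()}
```

With this, `budgets["bash_timeout"]` is `45` and `budgets["tool_call_budget"]` passes through as `120`.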

Benchmark Run Details

  • Date: February 19, 2026
  • Total Attempts: 80 (4.0 avg per task)
  • Total Wall Time: 60.1 minutes (242 min benchmark elapsed)
  • Fastest Task: 2s (T02 - fruit identification)
  • Slowest Task: 600s (T16 - audio processing timeout)
  • Grounded Answers: 19/20 (95%)
  • Average Quality Score: 0.795
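The `_is_hard_result` fix flags a result as hard when either the error text or the elapsed time indicates a timeout — the mechanism that routes T16-style 600s failures into the larger pass@k-hard retry budget. A hedged sketch of that heuristic (signature, phrases, and the 95% threshold are assumptions):

```python
TIMEOUT_PHRASES = ("timed out", "timeout", "deadline exceeded")

def is_hard_result(error_text: str, elapsed_s: float,
                   time_cap_s: float = 600.0) -> bool:
    """Flag results that hit (or nearly hit) the task timeout cap,
    detected either from the error message or from elapsed wall time."""
    text_hit = any(p in error_text.lower() for p in TIMEOUT_PHRASES)
    time_hit = elapsed_s >= 0.95 * time_cap_s
    return text_hit or time_hit
```

Checking elapsed time as well as error text matters because a killed tool may surface no error string at all, only a task that ran to the cap.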

Citation

```bibtex
@misc{chromaflow2026gaia,
  title={ChromaFlow Agent 1.0: GAIA Benchmark Results},
  author={Tarun Mittal},
  year={2026},
  url={https://huggingface.co/ChromaFlow9897/chromaflow-gaia-benchmark}
}
```