Commit d2eebe2 by ChromaFlow9897 · verified · 1 parent: 61a91c8

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +147 -0
README.md ADDED
@@ -0,0 +1,147 @@
---
language:
- en
tags:
- gaia
- benchmark
- agent
- chromaflow
- gpt-5.2
datasets:
- gaia-benchmark/GAIA
metrics:
- accuracy
model-index:
- name: ChromaFlow Agent 1.0
  results:
  - task:
      type: question-answering
      name: GAIA Level 1 Validation
    dataset:
      type: gaia-benchmark/GAIA
      name: GAIA 2023 Level 1 Validation
      split: validation
      config: 2023_level1
    metrics:
    - type: accuracy
      value: 75.0
      name: Run 6 Accuracy (with fixes)
    - type: accuracy
      value: 55.0
      name: Run 5 Accuracy (baseline)
---

# ChromaFlow Agent 1.0 - GAIA Benchmark Results

## Summary

| Metric | Run 5 (Baseline) | Run 6 (+8 Fixes) | Delta |
|--------|:-:|:-:|:-:|
| **Accuracy** | **55.0% (11/20)** | **75.0% (15/20)** | **+20pp** |
| Tasks Flipped | - | 4 (T08, T11, T13, T18) | +4 |
| Regressions | - | 0 | Clean |
| Avg Time/Task | ~540s | 180s | -67% |

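The Delta column mixes two kinds of change: accuracy moves in percentage points, while average time moves by a relative percentage. The sketch below reproduces both figures from the numbers in the table; the function names are illustrative and are not part of ChromaFlow.

```python
def percentage_point_delta(before_pct: float, after_pct: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


# Accuracy from the pass counts reported in the Summary table.
run5_accuracy = 100 * 11 / 20
run6_accuracy = 100 * 15 / 20

accuracy_delta_pp = percentage_point_delta(run5_accuracy, run6_accuracy)

# Average time per task: roughly 540s in Run 5, 180s in Run 6.
time_change_pct = relative_change_pct(540.0, 180.0)

print(f"accuracy: {accuracy_delta_pp:+.0f}pp")
print(f"avg time/task: {time_change_pct:+.0f}%")
```

Keeping the two helpers separate avoids reading "+20pp" as a 20% relative improvement; relative to the 55.0% baseline, the same gain is about 36%.
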
## Leaderboard Context

Our **75.0% on Level 1 validation** places ChromaFlow Agent close to the top entries on the official leaderboard:

| Rank | Agent | L1 Score | Overall |
|:----:|-------|:--------:|:-------:|
| 1 | HAL Generalist (Claude Sonnet 4.5) | **82.07%** | 74.55% |
| 2 | HAL Generalist (Claude Sonnet 4.5 High) | **77.36%** | 70.91% |
| - | **ChromaFlow Agent 1.0 (GPT-5.2)** | **75.0%**\* | - |
| 3 | HAL Generalist (Claude Opus 4.1 High) | **71.70%** | 68.48% |
| 4 | HAL Generalist (Claude Opus 4 High) | **71.70%** | 64.85% |

*\*Validation split; official leaderboard uses test split*

**ChromaFlow Agent's 75.0% Level 1 validation score would rank approximately 3rd** on the official leaderboard's Level 1 metric, between HAL's Claude Sonnet 4.5 High (77.36%) and Claude Opus 4.1 High (71.70%).

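The placement can be checked by counting how many listed entries have a strictly higher Level 1 score. This is only a sketch over the four leaderboard rows quoted above; `hypothetical_rank` is an illustrative helper, and the result is indicative only because the ChromaFlow score comes from the validation split.

```python
# Level 1 scores copied from the leaderboard table (official test split).
leaderboard_l1 = [
    ("HAL Generalist (Claude Sonnet 4.5)", 82.07),
    ("HAL Generalist (Claude Sonnet 4.5 High)", 77.36),
    ("HAL Generalist (Claude Opus 4.1 High)", 71.70),
    ("HAL Generalist (Claude Opus 4 High)", 71.70),
]


def hypothetical_rank(score: float, entries: list) -> int:
    """1-based position `score` would take among `entries`, ordered by score."""
    strictly_higher = sum(1 for _, other in entries if other > score)
    return strictly_higher + 1


rank = hypothetical_rank(75.0, leaderboard_l1)
print(f"75.0% would sit at position {rank} among the listed entries")
```
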
## Model Details

- **Base Model:** GPT-5.2 (400K context, 128K output, xhigh reasoning, temperature 0.0)
- **Agent Framework:** ChromaFlow Agent 1.0 (custom ToolCallAgent architecture)
- **Tools:** Bash, PythonExecute, StrReplaceEditor, WebSearch, Crawl4AI, YouTubeTranscript, BrowserUse, Terminate
- **Verification:** Always-on answer verifier with specificity/fact-check/logic checks

## Per-Task Results (Run 6)

| Task | Result | Quality | Time | Answer |
|------|:------:|:-------:|-----:|--------|
| T01 | PASS | 0.96 | 110s | No |
| T02 | PASS | 0.99 | 2s | Guava |
| T03 | PASS | 0.93 | 59s | 100 |
| T04 | PASS | 0.93 | 114s | 0.1777 |
| T05 | FAIL | 0.74 | 225s | (OCR misread) |
| T06 | FAIL | 0.32 | 88s | witness (gold: inference) |
| T07 | FAIL | 0.62 | 467s | Qxc3 (gold: Rd5) |
| T08 | **PASS** | 0.86 | 528s | Braintree, Honolulu |
| T09 | PASS | 0.93 | 75s | Annie Levin |
| T10 | PASS | 0.74 | 95s | 2 |
| T11 | **PASS** | 0.78 | 270s | CUB |
| T12 | PASS | 0.92 | 244s | 4 |
| T13 | **PASS** | 0.86 | 135s | Rockhopper penguin |
| T14 | PASS | 0.90 | 49s | broccoli, celery, fresh basil, lettuce, sweet potatoes |
| T15 | PASS | 0.78 | 85s | BaseLabelPropagation |
| T16 | FAIL | 0.20 | 600s | (audio timeout) |
| T17 | FAIL | 0.74 | 314s | 14 (gold: 3) |
| T18 | **PASS** | 0.86 | 103s | Louvrier |
| T19 | PASS | 0.99 | 40s | Logic formula |
| T20 | PASS | 0.86 | 4s | Extremely |

**Bold PASS** = flipped from FAIL in Run 5.

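The headline figures can be recomputed from the per-task rows. The sketch below copies the Result, Quality, and Time columns from the table above into a list (the variable names are illustrative) and derives the pass count, accuracy, total and average time, and average quality score.

```python
# (task, passed, quality, seconds) copied from the Run 6 per-task table.
RUN6 = [
    ("T01", True, 0.96, 110), ("T02", True, 0.99, 2),
    ("T03", True, 0.93, 59), ("T04", True, 0.93, 114),
    ("T05", False, 0.74, 225), ("T06", False, 0.32, 88),
    ("T07", False, 0.62, 467), ("T08", True, 0.86, 528),
    ("T09", True, 0.93, 75), ("T10", True, 0.74, 95),
    ("T11", True, 0.78, 270), ("T12", True, 0.92, 244),
    ("T13", True, 0.86, 135), ("T14", True, 0.90, 49),
    ("T15", True, 0.78, 85), ("T16", False, 0.20, 600),
    ("T17", False, 0.74, 314), ("T18", True, 0.86, 103),
    ("T19", True, 0.99, 40), ("T20", True, 0.86, 4),
]

n_tasks = len(RUN6)
n_passed = sum(1 for _, passed, _, _ in RUN6 if passed)
accuracy_pct = 100 * n_passed / n_tasks

total_seconds = sum(seconds for _, _, _, seconds in RUN6)
avg_seconds = total_seconds / n_tasks
avg_quality = sum(quality for _, _, quality, _ in RUN6) / n_tasks

print(f"passed: {n_passed}/{n_tasks} ({accuracy_pct:.1f}%)")
print(f"total time: {total_seconds}s ({total_seconds / 60:.1f} min)")
print(f"avg time/task: {avg_seconds:.0f}s")
print(f"avg quality: {avg_quality:.4f}")
```

The per-task times sum to 3607s, which is where the 180s average and the roughly 60-minute total come from.
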
## 8 Fixes Applied (Run 5 to Run 6)

| Fix | Priority | Impact | Description |
|-----|:--------:|:------:|-------------|
| Configurable Bash Timeout + SIGKILL | P0 | +2 tasks | 45s timeout, process group kill, auto-restart |
| YouTube Transcript Tool | P0 | +1 task | youtube_transcript_api with yt-dlp fallback |
| System Prompt Overhaul | P1 | Indirect | 6 new protocols: timeout, specificity, web, verification, image, chess |
| Enhanced Image Pipeline | P1 | 0 | Dual analysis: visual + OCR + python-chess |
| Adjusted pass@k | P1 | Indirect | pass@k-all: 1 -> 2, pass@k-hard: 5 -> 4 |
| Stronger Verifier Prompt | P1 | +1 task | Specificity, fact-check, logic boundary checks |
| Retry Strategy Hints | P2 | Indirect | Context-aware retry prompts based on failure mode |
| `_is_hard_result` Improvement | P2 | Indirect | Timeout detection from error text + elapsed time |

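The first fix combines a timeout with a process group kill. The ChromaFlow implementation is not shown here, so the following is only a minimal sketch of the general technique on POSIX systems, using the standard `subprocess`, `os`, and `signal` modules. The function name `run_bash` and its return shape are assumptions, and the auto-restart step is left out.

```python
import os
import signal
import subprocess
from typing import Optional, Tuple


def run_bash(command: str, timeout_s: float = 45.0) -> Tuple[Optional[int], str, bool]:
    """Run `command` in bash; on timeout, SIGKILL its whole process group.

    Returns (exit_code, combined_output, timed_out). POSIX only.
    """
    proc = subprocess.Popen(
        ["bash", "-c", command],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # child leads a new session and process group
    )
    try:
        output, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, output, False
    except subprocess.TimeoutExpired:
        # Kill the shell and everything it spawned (pipelines, background jobs).
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        output, _ = proc.communicate()  # reap the child, keep partial output
        return proc.returncode, output, True


if __name__ == "__main__":
    print(run_bash("echo hello"))
    print(run_bash("sleep 5", timeout_s=0.5))
```

Killing the process group rather than only the shell matters because a plain `proc.kill()` can leave child processes running and holding the output pipe open, which would keep the caller blocked past the timeout.
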
## Configuration

```json
{
  "model": "GPT-5.2 (xhigh reasoning, temp 0.0)",
  "benchmark": "GAIA 2023 Level 1 Validation",
  "tasks": 20,
  "seed": 42,
  "pass_k_all": 2,
  "pass_k_hard": 4,
  "bash_timeout": "45s",
  "task_timeout_cap": "600s",
  "wall_time_budget": "600s",
  "tool_call_budget": 120,
  "token_budget": 380000,
  "verifier_mode": "always",
  "optimus_mode": false
}
```

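The durations in this configuration are strings such as `"45s"`, so a consumer has to parse them before comparing budgets. The sketch below loads the configuration as shown and runs a few sanity checks. The helper `parse_seconds` and the three invariants are assumptions made for illustration; the source does not state how ChromaFlow validates or applies these settings.

```python
import json
import re

CONFIG_JSON = """
{
  "model": "GPT-5.2 (xhigh reasoning, temp 0.0)",
  "benchmark": "GAIA 2023 Level 1 Validation",
  "tasks": 20,
  "seed": 42,
  "pass_k_all": 2,
  "pass_k_hard": 4,
  "bash_timeout": "45s",
  "task_timeout_cap": "600s",
  "wall_time_budget": "600s",
  "tool_call_budget": 120,
  "token_budget": 380000,
  "verifier_mode": "always",
  "optimus_mode": false
}
"""


def parse_seconds(value: str) -> int:
    """Parse a duration such as "45s" into whole seconds."""
    match = re.fullmatch(r"(\d+)s", value.strip())
    if match is None:
        raise ValueError(f"expected a duration like '45s', got {value!r}")
    return int(match.group(1))


config = json.loads(CONFIG_JSON)
bash_timeout = parse_seconds(config["bash_timeout"])
task_cap = parse_seconds(config["task_timeout_cap"])
wall_budget = parse_seconds(config["wall_time_budget"])

# Assumed invariants (not stated in the source): a single bash call should
# fit inside the task cap, the task cap inside the wall-time budget, and the
# pass@k for all tasks should not exceed the pass@k for hard tasks.
problems = []
if bash_timeout > task_cap:
    problems.append("bash_timeout exceeds task_timeout_cap")
if task_cap > wall_budget:
    problems.append("task_timeout_cap exceeds wall_time_budget")
if config["pass_k_all"] > config["pass_k_hard"]:
    problems.append("pass_k_all exceeds pass_k_hard")

print(problems or "config looks consistent")
```
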
## Benchmark Run Details

- **Date:** February 19, 2026
- **Total Attempts:** 80 (4.0 avg per task)
- **Total Wall Time:** 60.1 minutes summed over the per-task times (242 minutes elapsed for the full benchmark run)
- **Fastest Task:** 2s (T02 - fruit identification)
- **Slowest Task:** 600s (T16 - audio processing timeout)
- **Grounded Answers:** 19/20 (95%)
- **Average Quality Score:** 0.795

## Citation

```bibtex
@misc{chromaflow2026gaia,
  title={ChromaFlow Agent 1.0: GAIA Benchmark Results},
  author={Tarun Mittal},
  year={2026},
  url={https://huggingface.co/ChromaFlow9897/chromaflow-gaia-benchmark}
}
```