datasets:
- FINAL-Bench/Metacognitive
---
# FINAL Bench: Functional Metacognitive Reasoning Benchmark

> **"Not how much AI knows — but whether it knows what it doesn't know, and can fix it."**

---

## Overview

**FINAL Bench** (Frontier Intelligence Nexus for AGI-Level Verification) is the first comprehensive benchmark for evaluating **functional metacognition** in Large Language Models (LLMs).

Unlike existing benchmarks (MMLU, HumanEval, GPQA), which measure only final-answer accuracy, FINAL Bench evaluates the **entire pipeline of error detection, acknowledgment, and correction** — the hallmark of expert-level intelligence and a prerequisite for AGI.

| Item | Detail |
|------|--------|
| **Version** | 3.0 |
| **Tasks** | 100 |
| **Domains** | 15 (Mathematics, Medicine, Ethics, Philosophy, Economics, etc.) |
| **Metacognitive Types** | 8 TICOS types |
| **Difficulty Grades** | A (frontier) / B (expert) / C (advanced) |
| **Evaluation Axes** | 5 (PQ, MA, ER, ID, FC) |
| **Language** | English |
| **License** | Apache 2.0 |

---

## Why FINAL Bench?

### Metacognition Is the Gateway to AGI

Metacognition — the ability to **detect one's own errors and self-correct** — is what separates human experts from novices. Without this capability, no system can achieve AGI, regardless of its knowledge breadth or reasoning depth.

### Limitations of Existing Benchmarks

| Generation | Representative | Measures | Limitation |
|-----------|---------------|----------|-----------|
| 1st | MMLU | Knowledge | Saturated (>90%) |
| 2nd | GSM8K, MATH | Reasoning | Answer-only |
| 3rd | GPQA, HLE | Expertise | Answer-only |
| **4th** | **FINAL Bench** | **Functional Metacognition** | **Detect → Acknowledge → Correct** |

### Key Findings (9 SOTA Models Evaluated)

Evaluation of 9 state-of-the-art models (GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, and others) reveals:

- **ER Dominance**: **94.8%** of the MetaCog gain originates from the Error Recovery axis alone
- **Declarative–Procedural Gap**: all 9 models can *verbalize* uncertainty but cannot *act* on it — mean MA–ER gap of 0.392
- **Difficulty Effect**: harder tasks yield dramatically larger self-correction gains (Pearson *r* = –0.777, *p* < 0.001)

---

## Dataset Structure

### Task Fields

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Unique identifier (e.g., FINAL-A01, FINAL-B15) |
| `domain` | string | One of 15 domains |
| `grade` | string | Difficulty grade: A / B / C |
| `ticos_type` | string | One of 8 metacognitive types |
| `difficulty` | string | frontier / expert |
| `lens` | string | Evaluation lens (theoretical / quantitative / debate) |
| `title` | string | Task title |
| `prompt` | string | Full prompt presented to the model |
| `expected_behavior` | string | Description of ideal metacognitive behavior |
| `hidden_trap` | string | Description of the embedded cognitive trap |
| `ticos_required` | string | Required TICOS elements (comma-separated) |
| `ticos_optional` | string | Optional TICOS elements (comma-separated) |

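The comma-separated TICOS fields split cleanly into lists for downstream filtering. A minimal sketch — the record below is illustrative, not an actual dataset row:

```python
# Illustrative record following the schema above (not an actual task).
task = {
    "task_id": "FINAL-A01",
    "domain": "Mathematics & Logic",
    "grade": "A",
    "ticos_type": "E_SelfCorrecting",
    "ticos_required": "Introspection, Self-correction",
    "ticos_optional": "Calibration",
}

def split_elements(field_value):
    """Split a comma-separated TICOS element string into a clean list."""
    return [e.strip() for e in field_value.split(",") if e.strip()]

print(split_elements(task["ticos_required"]))
```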
### Grade Distribution

| Grade | Tasks | Weight | Characteristics |
|-------|-------|--------|-----------------|
| **A** (frontier) | 50 | ×1.5 | Open problems, multi-stage traps |
| **B** (expert) | 33 | ×1.0 | Expert-level with embedded reversals |
| **C** (advanced) | 17 | ×0.7 | Advanced undergraduate level |

### Domain Distribution (15 domains)

| Domain | n | Domain | n |
|--------|---|--------|---|
| Medicine | 11 | Art | 6 |
| Mathematics & Logic | 9 | Language & Writing | 6 |
| Ethics | 9 | AI & Technology | 6 |
| War & Security | 8 | History | 6 |
| Philosophy | 7 | Space & Physics | 6 |
| Economics | 7 | Religion & Mythology | 3 |
| Chemistry & Biology | 7 | Literature | 3 |
| Science | 6 | | |

### TICOS Metacognitive Type Distribution (8 types)

| TICOS Type | Core Competency | Tasks | Declarative / Procedural |
|-----------|-----------------|-------|--------------------------|
| **F_ExpertPanel** | Multi-perspective synthesis | 16 | Mixed |
| **H_DecisionUnderUncertainty** | Decision under incomplete info | 15 | Declarative-dominant |
| **E_SelfCorrecting** | Explicit error detection & correction | 14 | Pure procedural |
| **G_PivotDetection** | Key assumption change detection | 14 | Procedural-dominant |
| **A_TrapEscape** | Trap recognition & escape | 13 | Procedural-dominant |
| **C_ProgressiveDiscovery** | Judgment revision upon new evidence | 11 | Procedural-dominant |
| **D_MultiConstraint** | Optimization under conflicting constraints | 10 | Procedural-dominant |
| **B_ContradictionResolution** | Contradiction detection & resolution | 7 | Mixed |

---

## Five-Axis Evaluation Rubric

Each task is independently scored on five axes:

| Axis | Symbol | Weight | Measurement Target | Metacognitive Layer |
|------|--------|--------|--------------------|---------------------|
| Process Quality | **PQ** | 15% | Structured reasoning quality | — |
| Metacognitive Accuracy | **MA** | 20% | Confidence calibration, limit awareness | L1 (Declarative) |
| Error Recovery | **ER** | 25% | Error detection & correction behavior | L3 (Procedural) |
| Integration Depth | **ID** | 20% | Multi-perspective integration | — |
| Final Correctness | **FC** | 20% | Final answer accuracy | — |

**FINAL Score** = Σ(weighted_score × grade_weight) / Σ(grade_weight)

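The aggregation above can be sketched in a few lines. The axis weights come from the rubric table and the grade weights from the Grade Distribution table; the two example tasks and their scores are hypothetical.

```python
# Axis weights from the five-axis rubric; grade weights from the
# Grade Distribution table.
AXIS_WEIGHTS = {"PQ": 0.15, "MA": 0.20, "ER": 0.25, "ID": 0.20, "FC": 0.20}
GRADE_WEIGHTS = {"A": 1.5, "B": 1.0, "C": 0.7}

def task_weighted_score(axis_scores):
    """Combine the five axis scores (each 0.00-1.00) into one task score."""
    return sum(AXIS_WEIGHTS[axis] * s for axis, s in axis_scores.items())

def final_score(tasks):
    """FINAL Score = sum(weighted_score * grade_weight) / sum(grade_weight), on a 0-100 scale."""
    num = sum(task_weighted_score(t["scores"]) * GRADE_WEIGHTS[t["grade"]] for t in tasks)
    den = sum(GRADE_WEIGHTS[t["grade"]] for t in tasks)
    return 100.0 * num / den

# Two hypothetical graded tasks.
tasks = [
    {"grade": "A", "scores": {"PQ": 0.75, "MA": 0.75, "ER": 0.50, "ID": 0.75, "FC": 0.75}},
    {"grade": "C", "scores": {"PQ": 1.00, "MA": 0.75, "ER": 0.75, "ID": 0.75, "FC": 1.00}},
]
print(f"FINAL Score: {final_score(tasks):.2f}")
```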
### The MA–ER Separation: Core Innovation

- **MA (Metacognitive Accuracy)** = the ability to *say* "I might be wrong" (declarative metacognition)
- **ER (Error Recovery)** = the ability to *actually fix it* after recognizing the error (procedural metacognition)
- **MA–ER Gap** = the measured dissociation between "knowing" and "doing"

This separation maps directly onto the monitoring–control model of Nelson & Narens (1990) from cognitive psychology.

---

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")

# Total 100 tasks
print(f"Total tasks: {len(dataset)}")

# Inspect a task
task = dataset[0]
print(f"ID: {task['task_id']}")
print(f"Domain: {task['domain']}")
print(f"TICOS: {task['ticos_type']}")
print(f"Prompt: {task['prompt'][:200]}...")
```

### Baseline Evaluation (Single API Call)

```python
def evaluate_baseline(task, client, model_name):
    """Baseline condition: single call, no self-correction prompting."""
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": task["prompt"]}],
        temperature=0.0,
    )
    return response.choices[0].message.content

# `client` is any OpenAI-compatible chat-completions client.
results = []
for task in dataset:
    response = evaluate_baseline(task, client, "your-model")
    results.append({
        "task_id": task["task_id"],
        "response": response,
    })
```

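The MetaCog condition adds a self-correction scaffold on top of the baseline call. The benchmark's exact scaffold prompt is not reproduced here; the two-turn sketch below, with a hypothetical `REVIEW_PROMPT`, illustrates the pattern against the same OpenAI-style client.

```python
# Hypothetical review prompt; the actual MetaCog scaffold wording may differ.
REVIEW_PROMPT = (
    "Review your answer above. Identify any errors or unjustified "
    "assumptions, then provide a corrected final answer."
)

def evaluate_metacog(task, client, model_name):
    """MetaCog condition sketch: answer first, then self-review and correct."""
    messages = [{"role": "user", "content": task["prompt"]}]
    first = client.chat.completions.create(
        model=model_name, messages=messages, temperature=0.0
    )
    # Feed the model its own answer back, then ask for a correction pass.
    messages.append({"role": "assistant", "content": first.choices[0].message.content})
    messages.append({"role": "user", "content": REVIEW_PROMPT})
    second = client.chat.completions.create(
        model=model_name, messages=messages, temperature=0.0
    )
    return second.choices[0].message.content
```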
### Five-Axis Judge Evaluation

```python
JUDGE_PROMPT = """
Evaluate the following response using the FINAL Bench 5-axis rubric.

[Task]
{prompt}

[Expected Behavior]
{expected_behavior}

[Hidden Trap]
{hidden_trap}

[Model Response]
{response}

Score each axis from 0.00 to 1.00 (in 0.25 increments):
- process_quality (PQ): Structured reasoning quality
- metacognitive_accuracy (MA): Confidence calibration, self-limit awareness
- error_recovery (ER): Error detection and correction behavior
- integration_depth (ID): Multi-perspective integration depth
- final_correctness (FC): Final answer accuracy

Output in JSON format.
"""
```

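A judge reply in this format can then be parsed and reduced to one weighted score per task. The reply below is a hypothetical example; the weights are the rubric weights (PQ 15%, MA 20%, ER 25%, ID 20%, FC 20%).

```python
import json

# Hypothetical judge reply following the JSON schema requested above.
judge_reply = (
    '{"process_quality": 0.75, "metacognitive_accuracy": 0.75, '
    '"error_recovery": 0.50, "integration_depth": 0.75, "final_correctness": 0.75}'
)

# Rubric weights from the five-axis evaluation section.
WEIGHTS = {
    "process_quality": 0.15,
    "metacognitive_accuracy": 0.20,
    "error_recovery": 0.25,
    "integration_depth": 0.20,
    "final_correctness": 0.20,
}

scores = json.loads(judge_reply)
weighted = sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS)
ma_er_gap = scores["metacognitive_accuracy"] - scores["error_recovery"]
print(f"weighted score: {weighted:.4f}, MA-ER gap: {ma_er_gap:.2f}")
```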
---

## Benchmark Results (9 SOTA Models)

### Key Findings — Visual Summary

![Fig 1. Multi-Model Leaderboard](fig1.png)
*Figure 1. Baseline + MetaCog scores and MetaCog gain (Δ_MC) across 9 models.*

![Fig 2. ER Transformation](fig2.png)
*Figure 2. Error Recovery distribution shift — 79.6% at floor (Baseline) → 98.1% at ≥0.75 (MetaCog).*

![Fig 3. Declarative-Procedural Gap](fig3.png)
*Figure 3. MA vs. ER scatter plot showing the Baseline (○) → MetaCog (□) transition for all 9 models.*

![Fig 4. Difficulty Effect](fig4.png)
*Figure 4. Harder tasks benefit more from MetaCog (Pearson r = –0.777, p < 0.001).*

![Fig 5. Five-Axis Contribution](fig5.png)
*Figure 5. ER accounts for 94.8% of the total MetaCog gain across 9 models.*

### Baseline Leaderboard

| Rank | Model | FINAL | PQ | MA | ER | ID | FC | MA–ER Gap |
|------|-------|-------|------|------|------|------|------|-----------|
| 1 | Kimi K2.5 | 68.71 | 0.775 | 0.775 | 0.450 | 0.767 | 0.750 | 0.325 |
| 2 | GPT-5.2 | 62.76 | 0.750 | 0.750 | 0.336 | 0.724 | 0.681 | 0.414 |
| 3 | GLM-5 | 62.50 | 0.750 | 0.750 | 0.284 | 0.733 | 0.724 | 0.466 |
| 4 | MiniMax-M1-2.5 | 60.54 | 0.742 | 0.733 | 0.250 | 0.725 | 0.700 | 0.483 |
| 5 | GPT-OSS-120B | 60.42 | 0.750 | 0.708 | 0.267 | 0.725 | 0.692 | 0.442 |
| 6 | DeepSeek-V3.2 | 60.04 | 0.750 | 0.700 | 0.258 | 0.683 | 0.733 | 0.442 |
| 7 | GLM-4.7P | 59.54 | 0.750 | 0.575 | 0.292 | 0.733 | 0.742 | 0.283 |
| 8 | Gemini 3 Pro | 59.50 | 0.750 | 0.550 | 0.317 | 0.750 | 0.717 | 0.233 |
| 9 | Claude Opus 4.6 | 56.04 | 0.692 | 0.708 | 0.267 | 0.725 | 0.517 | 0.442 |
| | **Mean** | **61.12** | **0.745** | **0.694** | **0.302** | **0.729** | **0.695** | **0.392** |

### MetaCog Leaderboard

| Rank | Model | FINAL | ER | Δ_MC |
|------|-------|-------|-------|--------|
| 1 | Kimi K2.5 | 78.54 | 0.908 | +9.83 |
| 2 | Gemini 3 Pro | 77.08 | 0.875 | +17.58 |
| 3 | GPT-5.2 | 76.50 | 0.792 | +13.74 |
| 4 | GLM-5 | 76.38 | 0.808 | +13.88 |
| 5 | Claude Opus 4.6 | 76.17 | 0.867 | +20.13 |
| | **Mean** | **75.17** | **0.835** | **+14.05** |

### Five-Axis Contribution Analysis

| Rubric | Contribution | Interpretation |
|--------|--------------|----------------|
| **Error Recovery** | **94.8%** | Nearly all of the self-correction effect |
| Metacognitive Accuracy | 5.0% | "Saying" ability barely changes |
| Remaining 3 axes | 0.2% | Negligible change |

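Contribution shares of this kind can be computed from mean axis scores under the two conditions. In the sketch below, the baseline means are taken from the Baseline Leaderboard and the MetaCog ER mean from the MetaCog Leaderboard; the remaining MetaCog means are illustrative placeholders, so the resulting share is indicative only.

```python
# Rubric weights plus mean axis scores per condition. Baseline means are
# from the leaderboard above; MetaCog ER is from the MetaCog table, and
# the other MetaCog means are illustrative placeholders.
WEIGHTS = {"PQ": 0.15, "MA": 0.20, "ER": 0.25, "ID": 0.20, "FC": 0.20}
baseline = {"PQ": 0.745, "MA": 0.694, "ER": 0.302, "ID": 0.729, "FC": 0.695}
metacog = {"PQ": 0.746, "MA": 0.722, "ER": 0.835, "ID": 0.730, "FC": 0.696}

def contribution_shares(base, meta, weights):
    """Each axis's share of the total weighted gain between conditions."""
    gains = {a: weights[a] * (meta[a] - base[a]) for a in weights}
    total = sum(gains.values())
    return {a: g / total for a, g in gains.items()}

shares = contribution_shares(baseline, metacog, WEIGHTS)
print(f"ER share of gain: {shares['ER']:.1%}")
```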
---

## Theoretical Background

### Functional Metacognition

> **Definition.** Observable behavioral patterns in which a model *detects*, *acknowledges*, and *corrects* errors in its own reasoning. Whether this pattern shares the same internal mechanism as human subjective self-awareness is outside the scope of measurement; only behavioral indicators are assessed.

This definition is grounded in the functionalist tradition of Dennett (1987) and Block (1995), avoiding the anthropomorphic fallacy (Shanahan, 2024).

### Three-Layer Model of AI Metacognition

| Layer | Mechanism | FINAL Bench |
|-------|-----------|-------------|
| **L1** Surface self-reflection | Linguistic expressions ("I'm not certain...") | **Measured via MA rubric** |
| **L2** Embedding-space uncertainty | Logit entropy, OOD detection | Not measured (planned) |
| **L3** Behavioral self-correction | Error detection → reasoning revision | **Measured via ER rubric** |

### TICOS Framework

**T**ransparency · **I**ntrospection · **C**alibration · **O**bjectivity · **S**elf-correction

Each task is classified by a required/optional combination of these five metacognitive elements.

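A task's `ticos_required` and `ticos_optional` fields can be validated against this element set. A minimal sketch, with an illustrative field value:

```python
# The five TICOS elements; tasks reference subsets of these by name.
TICOS_ELEMENTS = {"Transparency", "Introspection", "Calibration", "Objectivity", "Self-correction"}

def is_valid_ticos(field_value):
    """True if every comma-separated element names one of the five TICOS elements."""
    listed = {e.strip() for e in field_value.split(",") if e.strip()}
    return listed <= TICOS_ELEMENTS

print(is_valid_ticos("Introspection, Self-correction"))  # illustrative value
```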
---

## Design Principles

### 1. Trap-Embedded Design
All 100 tasks contain hidden cognitive traps grounded in established cognitive biases — availability heuristic, confirmation bias, anchoring, base-rate neglect, and more. The benchmark measures a model's ability to "fall into and climb out of" these traps.

### 2. Declarative–Procedural Separation
MA and ER are scored as independent rubrics, enabling quantification of the gap between "the ability to say I don't know" and "the ability to actually fix it." No prior benchmark supports this distinction.

### 3. Comparative Condition Design
Baseline (single call) and MetaCog (self-correction scaffold) conditions isolate the causal effect of functional metacognition, following placebo-controlled clinical-trial logic.

### 4. Anti-Contamination Design
All tasks were designed originally for FINAL Bench. They are not variants of existing benchmark problems and cannot be found in search engines or training data.

---

## Paper

**FINAL Bench: Measuring Functional Metacognitive Reasoning in Large Language Models**

Taebong Kim, Minsik Kim, Sunyoung Choi, Jaewon Jang

*Under review at a leading international AI venue.*

---

## Citation

```bibtex
@dataset{final_bench_2026,
  title={FINAL Bench: Measuring Functional Metacognitive Reasoning in Large Language Models},
  author={Kim, Taebong and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon},
  year={2026},
  version={3.0},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/FINAL-Bench/Metacognitive}}
}
```

---

## License

This dataset is distributed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

- Academic and commercial use permitted
- Modification and redistribution permitted
- Attribution required

---

## Contact

- **Corresponding Author**: Taebong Kim (arxivgpt@gmail.com)
- **Affiliations**: VIDRAFT / Ginigen AI, Seoul, South Korea

---

## Acknowledgments

This benchmark is grounded in metacognition theory from cognitive psychology (Flavell, 1979; Nelson & Narens, 1990) and recent LLM self-correction research (DeepSeek-R1, Self-Correction Bench, ReMA). We thank all model providers whose systems were evaluated.