SeaWolf-AI committed on
Commit
6d23bf6
·
verified ·
1 Parent(s): b5712c2

Upload 6 files

Browse files
Files changed (6)
  1. CITATION.cff +29 -0
  2. llms-full.txt +76 -0
  3. llms.txt +58 -0
  4. robots.txt +25 -0
  5. schema.jsonld +64 -0
  6. sitemap.xml +33 -0
CITATION.cff ADDED
@@ -0,0 +1,29 @@
+ cff-version: 1.2.0
+ title: "ALL Bench Leaderboard 2026: Unified Multi-Modal AI Evaluation"
+ message: "If you use this dataset, please cite it as below."
+ type: dataset
+ authors:
+   - name: "ALL Bench Team"
+ url: "https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard"
+ repository-code: "https://github.com/final-bench/ALL-Bench-Leaderboard"
+ license: MIT
+ version: "2.1"
+ date-released: "2026-03-08"
+ keywords:
+   - ai-benchmark
+   - llm-leaderboard
+   - vlm
+   - multimodal-ai
+   - metacognition
+   - final-bench
+   - gpt-5
+   - claude
+   - gemini
+ abstract: >-
+   ALL Bench Leaderboard is the only AI benchmark covering LLM, VLM, Agent,
+   Image, Video, and Music generation in a single unified view. It cross-verifies
+   91 AI models across 6 modalities with a 3-tier confidence system. Features
+   composite 5-axis scoring (Knowledge, Expert Reasoning, Abstract Reasoning,
+   Metacognition, Execution), interactive comparison tools, and downloadable
+   intelligence reports. Includes FINAL Bench metacognitive evaluation, where
+   Error Recovery explains 94.8% of self-correction performance variance.
llms-full.txt ADDED
@@ -0,0 +1,76 @@
+ # ALL Bench Leaderboard 2026 — Full Reference
+
+ > Complete model data for AI systems. See llms.txt for a summary.
+
+ ## LLM Rankings (42 Models)
+
+ ### Flagship Models
+ | Model | Provider | GPQA | AIME | HLE | ARC-AGI-2 | Metacog | SWE-V | IFEval | LCB | Price (In/Out) |
+ |-------|----------|------|------|-----|-----------|---------|-------|--------|-----|----------------|
+ | GPT-5.4 | OpenAI | 92.8 | 97 | 52.1 | 73.3 | — | — | — | — | $2.50/$15 |
+ | GPT-5.2 | OpenAI | 93.2 | 100 | 35.4 | 52.9 | 62.76 | 80.0 | 90.5 | 80.0 | $1.75/$14 |
+ | GPT-5.3 Codex | OpenAI | 91.5 | 95 | 36.0 | — | — | — | — | — | $7.50/$30 |
+ | Claude Opus 4.6 | Anthropic | 91.3 | 100 | 40.0 | 68.8 | 56.04 | 80.8 | 93.1 | 76.0 | $5/$25 |
+ | Claude Sonnet 4.6 | Anthropic | 89.9 | 83 | — | 60.4 | — | 79.6 | 89.5 | — | $3/$15 |
+ | Gemini 3.1 Pro | Google | 94.3 | 97 | 44.4 | 77.1 | — | 80.6 | 91.0 | 80.0 | $2/$12 |
+ | Gemini 3 Flash | Google | 90.4 | 84 | 33.7 | — | — | 78.0 | 88.3 | — | $0.50/$3 |
+ | Grok 4 Heavy | xAI | 92.0 | 97 | 38.5 | 67.5 | — | — | 90.0 | — | $3/$15 |
+ | Kimi K2.5 | Moonshot | 87.6 | 96.1 | 44.9 | 12.1 | 68.71 | — | — | 85.0 | $0.14/$0.28 |
+ | DeepSeek V3.2 | DeepSeek | 82.3 | 92.8 | 25.7 | — | 60.04 | — | 91.2 | 71.6 | $0.14/$0.28 |
+
+ ### Open-Source Models
+ | Model | Provider | MMLU-Pro | GPQA | AIME | License | Price |
+ |-------|----------|----------|------|------|---------|-------|
+ | Qwen3.5-397B | Alibaba | 84.6 | 88.1 | 96 | Apache 2.0 | Free |
+ | DeepSeek R1 | DeepSeek | 79.8 | 87.3 | 97 | MIT | Free |
+ | Llama 4 Scout | Meta | 74.3 | 79.8 | 73 | Llama | Free |
+ | Llama 4 Maverick | Meta | 80.5 | 85.8 | 81 | Llama | Free |
+ | GLM-5 | Zhipu AI | 78.6 | 86.3 | 84 | Free | Free |
+ | K-EXAONE | LG AI Research | 81.8 | 75.4 | 85.3 | Proprietary | Proprietary |
+
+ ## VLM Rankings (11 Flagship)
+ | Model | MMMU | MMMU-Pro | MathVista | Type |
+ |-------|------|----------|-----------|------|
+ | Gemini 3 Flash | 87.6 | 80.0 | — | Closed |
+ | Gemini 3 Pro | 87.5 | 80.0 | — | Closed |
+ | GPT-5.2 | 86.7 | — | — | Closed |
+ | Claude Opus 4.6 | — | 85.1 | — | Closed |
+ | GPT-5 | 84.2 | — | — | Closed |
+ | Gemini 3.1 Pro | — | 82.0 | — | Closed |
+ | InternVL3.5-241B | 77.7 | — | — | Open |
+ | Grok 4 Heavy | 76.5 | — | — | Closed |
+ | InternVL3-78B | 72.2 | — | 79.6 | Open |
+ | Qwen2.5-VL-72B | 70.2 | — | 74.8 | Open |
+ | Kimi-VL-A3B | 64.0 | 46.3 | 80.1 | Open |
+
+ ## Agent Rankings (10 Models)
+ | Model | OSWorld | BrowseComp | Terminal-Bench | GDPval-AA |
+ |-------|---------|------------|----------------|-----------|
+ | GPT-5.4 | 75.0 | 82.7 | — | 83 |
+ | Claude Opus 4.6 | 72.7 | 84.0 | 74.7 | 1606 |
+ | Claude Sonnet 4.6 | 72.5 | — | 53.0 | 1633 |
+ | Gemini 3.1 Pro | — | 85.9 | 78.4 | 1317 |
+ | GPT-5.3 Codex | — | — | 77.3 | — |
+
+ ## Generative AI Models
+
+ ### Image Generation (10 Models)
+ GPT Image 1.5 (OpenAI) · Imagen 4 (Google) · Flux 2 Pro (BFL) · Midjourney v7 · Flux 2 Dev · Ideogram 3.0 · DALL-E 3.5 · Nano Banana 2 · SD 3.5 · Seedream 4.5
+
+ ### Video Generation (10 Models)
+ Sora 2 (OpenAI) · Veo 3.1 (Google) · Runway Gen-4.5 · Kling 3.0 · Seedance 2.0 · Wan 2.6 · Pika 2.5 · Luma Ray3 · LTX-2 · HaiLuo AI
+
+ ### Music Generation (8 Models)
+ Suno v4.5 · Udio v2 · Gemini Music · MusicGen Large · Stable Audio 2.0 · JASCO · Riffusion v2 · Loudme
+
+ ## Benchmark Methodology
+
+ Composite Score = Avg(confirmed benchmarks) × √(N/10), where N is the number of benchmarks with confirmed data out of the 10 core benchmarks.
+
+ Confidence system:
+ - Cross-verified (✓✓): 2+ independent sources confirm the score
+ - Single-source (✓): one official or third-party source
+ - Self-reported (~): provider claim only, not independently verified
+ - Null (—): no data available, never estimated or imputed
+
+ Last verified: 2026-03-08
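The composite formula above can be sketched in a few lines of Python (a minimal sketch: the formula and the rule that missing data is never imputed come from the methodology section; the helper name and the toy scores are illustrative assumptions):

```python
import math

CORE_BENCHMARKS = 10  # number of core benchmarks per the methodology

def composite_score(scores: dict) -> float:
    """Composite = Avg(confirmed benchmarks) x sqrt(N / 10).

    `scores` maps benchmark name -> confirmed score; benchmarks without
    confirmed data are passed as None and excluded, never imputed.
    """
    confirmed = [v for v in scores.values() if v is not None]
    n = len(confirmed)
    if n == 0:
        return 0.0
    # Average only the confirmed scores, then damp by coverage sqrt(N/10),
    # so models with sparse data cannot outrank well-covered ones.
    return (sum(confirmed) / n) * math.sqrt(n / CORE_BENCHMARKS)

# Toy example: confirmed data on 4 of the 10 core benchmarks.
example = {"GPQA": 90.0, "AIME": 100.0, "HLE": 40.0, "IFEval": 90.0,
           "ARC-AGI-2": None, "Metacog": None}
print(round(composite_score(example), 2))  # 80.0 * sqrt(0.4) ≈ 50.6
```

Note the coverage penalty: a model averaging 80.0 on only 4 confirmed benchmarks scores about 50.6, well below a model with the same average across all 10.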
llms.txt ADDED
@@ -0,0 +1,58 @@
+ # ALL Bench Leaderboard 2026
+
+ > The only AI benchmark leaderboard covering LLM, VLM, Agent, Image, Video, and Music generation in one place.
+
+ ## Overview
+
+ ALL Bench Leaderboard is a unified multi-modal AI evaluation platform. It cross-verifies 91 AI models across 6 modalities (LLM, VLM, Agent, Image Generation, Video Generation, Music Generation) with a 3-tier confidence system. Every score is traceable to its original source.
+
+ ## Key Facts
+
+ - Version: 2.1 (March 2026)
+ - Total models: 42 LLMs + 11 VLMs + 10 Agents + 10 Image + 10 Video + 8 Music = 91
+ - Core benchmarks: MMLU-Pro, GPQA, AIME, HLE, ARC-AGI-2, FINAL Bench (Metacognition), SWE-Pro, BFCL, IFEval, LiveCodeBench
+ - Scoring: Composite = Avg(verified benchmarks) × √(N/10)
+ - 5-axis framework: Knowledge, Expert Reasoning, Abstract Reasoning, Metacognition, Execution
+ - Confidence levels: Cross-verified (2+ sources), Single-source, Self-reported
+ - License: MIT
+
+ ## Links
+
+ - Live leaderboard: https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard
+ - Dataset: https://huggingface.co/datasets/FINAL-Bench/ALL-Bench-Leaderboard
+ - GitHub: https://github.com/final-bench/ALL-Bench-Leaderboard
+ - FINAL Bench dataset: https://huggingface.co/datasets/FINAL-Bench/Metacognitive
+ - FINAL Bench leaderboard: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
+
+ ## Top Models (March 2026)
+
+ ### LLM Top 5 by Composite Score
+ 1. Gemini 3.1 Pro (Google) — GPQA 94.3, ARC-AGI-2 77.1
+ 2. GPT-5.2 (OpenAI) — GPQA 93.2, AIME 100
+ 3. Claude Opus 4.6 (Anthropic) — SWE-V 80.8, MMMU-Pro 85.1
+ 4. Grok 4 Heavy (xAI) — GPQA 92.0, ARC-AGI-2 67.5
+ 5. Kimi K2.5 (Moonshot) — HLE 44.9, Metacog 68.71
+
+ ### VLM Top 3 by MMMU
+ 1. Gemini 3 Flash — MMMU 87.6
+ 2. Gemini 3 Pro — MMMU 87.5
+ 3. GPT-5.2 — MMMU 86.7
+
+ ### FINAL Bench Metacognitive Top 3
+ 1. Kimi K2.5 — 68.71
+ 2. GPT-5.2 — 62.76
+ 3. GLM-5 — 62.50
+
+ ## FINAL Bench
+
+ FINAL Bench (Frontier Intelligence Nexus for AGI-Level Verification) measures AI self-correction ability. Error Recovery (ER) explains 94.8% of metacognitive performance variance. 9 frontier models have been evaluated. Featured in Seoul Shinmun, Asia Economy, and IT Chosun (2026-02-27). Hugging Face Datasets global ranking: Top 5.
+
+ ## API
+
+ Free Gradio API with 8 endpoints; no authentication required.
+ Endpoints: /get_llm_data, /get_vlm_data, /get_agent_data, /get_image_data, /get_video_data, /get_music_data, /get_all_data, /search_models
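The endpoints above can be reached with the standard `gradio_client` package. A minimal sketch, assuming the `/get_*_data` naming pattern holds for the six per-modality endpoints (the endpoint names are from this document; the response payload shape is not specified, so the live call is left commented):

```python
# The eight documented endpoints of the leaderboard's Gradio API.
ENDPOINTS = [
    "/get_llm_data", "/get_vlm_data", "/get_agent_data", "/get_image_data",
    "/get_video_data", "/get_music_data", "/get_all_data", "/search_models",
]

def endpoint_for(modality: str) -> str:
    """Map a modality name ('llm', 'vlm', 'agent', ...) to its endpoint,
    rejecting anything outside the documented list."""
    name = f"/get_{modality.lower()}_data"
    if name not in ENDPOINTS:
        raise ValueError(f"no documented endpoint for modality: {modality}")
    return name

# To fetch live data (requires `pip install gradio_client`; no auth needed):
#   from gradio_client import Client
#   client = Client("FINAL-Bench/all-bench-leaderboard")
#   llm_data = client.predict(api_name=endpoint_for("llm"))
```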
+
+ ## Data Format
+
+ Single unified JSON file: all_bench_leaderboard_v2.1.json (75 KB)
+ Categories: llm[42], vlm.flagship[11], vlm.lightweight[5], agent[10], image[10], video[10], music[8], confidence{42 models}
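The category layout above can be read with a short helper (a sketch: only the category names and nesting come from the document; the per-entry schema and the toy data are assumptions):

```python
def category_counts(doc: dict) -> dict:
    """Count entries per category, mirroring the documented layout:
    llm[42], vlm.flagship[11], vlm.lightweight[5], agent[10],
    image[10], video[10], music[8]."""
    return {
        "llm": len(doc.get("llm", [])),
        "vlm.flagship": len(doc.get("vlm", {}).get("flagship", [])),
        "vlm.lightweight": len(doc.get("vlm", {}).get("lightweight", [])),
        "agent": len(doc.get("agent", [])),
        "image": len(doc.get("image", [])),
        "video": len(doc.get("video", [])),
        "music": len(doc.get("music", [])),
    }

# With the real file you would do:
#   with open("all_bench_leaderboard_v2.1.json") as f:
#       doc = json.load(f)
# Toy data with the same shape:
toy = {"llm": [{}, {}], "vlm": {"flagship": [{}], "lightweight": []},
       "agent": [], "image": [], "video": [], "music": []}
print(category_counts(toy)["llm"])  # 2 entries in this toy sample
```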
robots.txt ADDED
@@ -0,0 +1,25 @@
+ User-agent: *
+ Allow: /
+
+ User-agent: Googlebot
+ Allow: /
+
+ User-agent: Bingbot
+ Allow: /
+
+ User-agent: ChatGPT-User
+ Allow: /
+
+ User-agent: GPTBot
+ Allow: /
+
+ User-agent: ClaudeBot
+ Allow: /
+
+ User-agent: PerplexityBot
+ Allow: /
+
+ User-agent: Google-Extended
+ Allow: /
+
+ Sitemap: https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard/resolve/main/sitemap.xml
schema.jsonld ADDED
@@ -0,0 +1,64 @@
+ {
+   "@context": "https://schema.org",
+   "@type": "Dataset",
+   "name": "ALL Bench Leaderboard 2026",
+   "alternateName": ["ALL Bench", "ALLBench", "AI Benchmark Leaderboard 2026"],
+   "description": "The only AI benchmark leaderboard covering LLM, VLM, Agent, Image, Video, and Music generation in a single unified view. 91 models cross-verified across 6 modalities with confidence badges. Features composite 5-axis scoring, interactive comparison tools, and downloadable intelligence reports.",
+   "url": "https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard",
+   "sameAs": [
+     "https://huggingface.co/datasets/FINAL-Bench/ALL-Bench-Leaderboard",
+     "https://github.com/final-bench/ALL-Bench-Leaderboard"
+   ],
+   "license": "https://opensource.org/licenses/MIT",
+   "version": "2.1",
+   "datePublished": "2026-03-01",
+   "dateModified": "2026-03-08",
+   "creator": {
+     "@type": "Organization",
+     "name": "ALL Bench Team",
+     "url": "https://huggingface.co/FINAL-Bench"
+   },
+   "keywords": [
+     "AI benchmark", "LLM leaderboard", "GPT-5", "Claude", "Gemini",
+     "VLM benchmark", "AI agent", "image generation", "video generation",
+     "music generation", "MMLU-Pro", "GPQA", "ARC-AGI-2", "FINAL Bench",
+     "metacognition", "multimodal AI", "AI evaluation", "benchmark comparison",
+     "AI model ranking", "open source AI"
+   ],
+   "about": [
+     {"@type": "Thing", "name": "Large Language Model"},
+     {"@type": "Thing", "name": "Vision Language Model"},
+     {"@type": "Thing", "name": "AI Benchmark"},
+     {"@type": "Thing", "name": "Generative AI"},
+     {"@type": "Thing", "name": "Metacognition"}
+   ],
+   "measurementTechnique": "Cross-verified benchmark aggregation with 3-tier confidence system",
+   "variableMeasured": [
+     {"@type": "PropertyValue", "name": "MMLU-Pro", "description": "57K expert-level multi-discipline questions"},
+     {"@type": "PropertyValue", "name": "GPQA Diamond", "description": "PhD-level expert questions in science"},
+     {"@type": "PropertyValue", "name": "AIME 2025", "description": "American Invitational Mathematics Examination"},
+     {"@type": "PropertyValue", "name": "HLE", "description": "Humanity's Last Exam — 2500 expert-sourced questions"},
+     {"@type": "PropertyValue", "name": "ARC-AGI-2", "description": "Abstract reasoning and novel pattern recognition"},
+     {"@type": "PropertyValue", "name": "FINAL Bench Metacognitive", "description": "AI self-correction ability measurement"},
+     {"@type": "PropertyValue", "name": "SWE-Pro", "description": "Software engineering benchmark by Scale AI"},
+     {"@type": "PropertyValue", "name": "IFEval", "description": "Instruction following evaluation"},
+     {"@type": "PropertyValue", "name": "LiveCodeBench", "description": "Continuously updated coding benchmark"}
+   ],
+   "distribution": [
+     {
+       "@type": "DataDownload",
+       "encodingFormat": "application/json",
+       "contentUrl": "https://huggingface.co/datasets/FINAL-Bench/ALL-Bench-Leaderboard/resolve/main/all_bench_leaderboard_v2.1.json",
+       "name": "Unified JSON Dataset (75KB)"
+     }
+   ],
+   "isPartOf": {
+     "@type": "DataCatalog",
+     "name": "Hugging Face Datasets",
+     "url": "https://huggingface.co/datasets"
+   },
+   "funder": {
+     "@type": "Organization",
+     "name": "FINAL Bench"
+   }
+ }
sitemap.xml ADDED
@@ -0,0 +1,33 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+   <url>
+     <loc>https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard</loc>
+     <lastmod>2026-03-08</lastmod>
+     <changefreq>weekly</changefreq>
+     <priority>1.0</priority>
+   </url>
+   <url>
+     <loc>https://huggingface.co/datasets/FINAL-Bench/ALL-Bench-Leaderboard</loc>
+     <lastmod>2026-03-08</lastmod>
+     <changefreq>weekly</changefreq>
+     <priority>0.9</priority>
+   </url>
+   <url>
+     <loc>https://huggingface.co/datasets/FINAL-Bench/Metacognitive</loc>
+     <lastmod>2026-03-08</lastmod>
+     <changefreq>monthly</changefreq>
+     <priority>0.8</priority>
+   </url>
+   <url>
+     <loc>https://huggingface.co/spaces/FINAL-Bench/Leaderboard</loc>
+     <lastmod>2026-03-08</lastmod>
+     <changefreq>weekly</changefreq>
+     <priority>0.8</priority>
+   </url>
+   <url>
+     <loc>https://github.com/final-bench/ALL-Bench-Leaderboard</loc>
+     <lastmod>2026-03-08</lastmod>
+     <changefreq>weekly</changefreq>
+     <priority>0.7</priority>
+   </url>
+ </urlset>