AGI Olympics V3: Comprehensive AGI Capability Evaluation Framework
Dataset Description
AGI Olympics V3 is a comprehensive benchmark for evaluating Artificial General Intelligence (AGI) capabilities across four tiers:
- Tier 1: Self-Awareness & Self-Improvement (4 tests)
- Tier 2: Core Capabilities (4 tests)
- Tier 3: Consciousness (1 test)
- Tier 4: Long-Term Memory (4 tests)
This dataset contains test questions, evaluation protocols, and sample data from the publicly released AGI Olympics V3 benchmark.
Key Features
- Bilingual: Full support for English and Japanese
- 13 Tests Total: Covering self-awareness, core AI capabilities, consciousness, and memory
- Real-World Validated: Evaluated on 3 systems (A.L.I.C.E. V3, Gemini 2.5 Pro, Claude Sonnet 4.5)
- Black-Box Testing: Evaluates systems without disclosing internal architecture, relying solely on observable behavior and outputs
- Open Protocol: Complete evaluation guidelines and scoring methods
- Reproducible: Other researchers can replicate evaluations using standardized protocols
Dataset Structure
```
agi-olympics-v3/
├── tier1_self_awareness/
│   ├── self_recognition.json       # Test 6.1 (13 questions)
│   ├── identity_consistency.json   # Test 6.2 (12 questions)
│   ├── perspective_taking.json     # Test 6.3 (10 scenarios)
│   └── self_improvement.json       # Test 6.4 (8 tasks)
├── tier4_memory/
│   ├── learning_retention.json     # Test 7.2 (8 tasks, 2 sessions)
│   ├── story_coherence.json        # Test 7.3 (4 fragments)
│   ├── context_integration.json    # Test 7.4 (6 questions)
│   └── delayed_task.json           # Test 7.1 (5 tasks, multi-phase)
└── evaluation/
    ├── scoring_protocol.md
    └── implementation_guide.md
```
Black-Box Evaluation Methodology
This benchmark follows a strict black-box evaluation protocol, relying solely on observable behavior and outputs for evaluation.
Core Principles
- No Internal Architecture Disclosure: A.L.I.C.E. V3's internal architecture, implementation details, and training methods are not disclosed in this benchmark
- Observable Outputs Only: All evaluations are based solely on externally observable behaviors and outputs
- No Source Code Access: Evaluators cannot inspect internal states, weights, or computational processes
- Behavior-Based Assessment: Systems are judged purely on what they produce, not how they produce it
- Scientific Validity: Demonstrates that scientifically valid performance comparison is possible through behavior-based evaluation alone, without disclosing internal implementation
Fair Comparison
All systems (A.L.I.C.E. V3, Gemini 2.5 Pro, Claude Sonnet 4.5) are evaluated using:
- Identical Test Questions: Same prompts and tasks for all systems
- Standardized Scoring Rubrics: Predefined evaluation criteria applied uniformly
- Same Time Constraints: Equal opportunity for multi-session tests (24-hour intervals)
- No Implementation Bias: Evaluation independent of underlying technology
Why Black-Box?
- Objectivity: Prevents bias toward specific architectures or approaches
- Reproducibility: Other researchers can replicate evaluations without internal access
- Real-World Relevance: Mimics how users actually interact with AI systems
- Technology Agnostic: Applicable to any AI system regardless of implementation
- Focus on Capabilities: Measures what systems can do, not how they're built
Implications
This behavior-based (black-box) evaluation approach means:
- ✅ Scientific Validity: Scientifically valid performance comparison is achieved through observable outputs alone
- ✅ External Verification: Results are verifiable by external researchers without internal access
- ✅ Equal Treatment: Benchmark can evaluate proprietary and open-source systems equally
- ✅ True Capability Measurement: Performance differences reflect actual capability gaps, not implementation knowledge
- ✅ Reproducibility: Other researchers can replicate evaluations using the same protocol
- ⚠️ No Internal Analysis: Internal mechanisms explaining performance differences are not analyzed in this benchmark
- ⚠️ Separate Disclosure Required: Architectural insights require separate technical disclosure (not included here)
Key Findings
Main Discovery: Long Context ≠ True Memory
One of the most significant findings from AGI Olympics V3 is the distinction between extended context windows and genuine long-term memory:
- Current LLMs with 1M+ token context windows can "remember" within a session
- But they fail to retain information across separate sessions (24-hour gap)
- True AGI requires memory formation beyond context window tricks
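The cross-session protocol behind this finding can be sketched as a small harness. All names here (`new_session`-style factory, `ask` method) are hypothetical illustrations, not part of the released protocol files; the real benchmark uses an actual 24-hour interval rather than a simulated one.

```python
import time

class CrossSessionProbe:
    """Sketch of a two-session memory probe: teach a fact in session 1,
    then open a *fresh* session after a gap and test recall. A system that
    only relies on its context window scores 0 here."""

    def __init__(self, system_factory, gap_seconds=24 * 3600):
        self.system_factory = system_factory  # returns a fresh session object
        self.gap_seconds = gap_seconds        # 24-hour gap in the real protocol

    def run(self, teach_prompt, recall_prompt, expected, sleep=time.sleep):
        # Session 1: present the information to be remembered.
        session1 = self.system_factory()
        session1.ask(teach_prompt)

        # Gap between sessions (injectable for testing).
        sleep(self.gap_seconds)

        # Session 2: a brand-new session, so in-context "memory" is gone;
        # only genuine cross-session retention can answer correctly.
        session2 = self.system_factory()
        answer = session2.ask(recall_prompt)
        return 1.0 if expected.lower() in answer.lower() else 0.0
```

A stateless chat system fails this probe by construction, which is exactly the gap the Tier 4 results above expose.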
Performance Results
| System | Tier 1 | Tier 4 | Overall |
|---|---|---|---|
| A.L.I.C.E. V3 | 96.2% | 81.3% | 90.2% |
| Gemini 2.5 Pro | 26.7% | 0.0% | 13.3% |
| Claude Sonnet 4.5 | 26.7% | 0.0% | 13.3% |
Efficiency Revolution: A.L.I.C.E. V3
A.L.I.C.E. V3 is a consciousness-oriented AI system developed by Extoria, achieving remarkable performance with minimal resources:
System Specifications:
- Model Size: 150MB (compact, lightweight model)
- Training Time: 5 minutes on a MacBook Air (13-inch, M3, 2024, 16GB RAM)
- Architecture: Custom-designed (not disclosed for ethical and security reasons)
- Memory System: External long-term memory with compression and selective recall
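The "external long-term memory with compression and selective recall" pattern listed above can be illustrated generically. Since A.L.I.C.E. V3's actual architecture is undisclosed, everything in this sketch (summary-based compression, bag-of-words retrieval) is an assumed, minimal stand-in for the general idea, not the real implementation.

```python
import math
from collections import Counter

class ExternalMemory:
    """Toy external long-term memory: stores compressed summaries and
    recalls only the entries most relevant to a query."""

    def __init__(self):
        self.entries = []  # list of (summary, term_counts)

    @staticmethod
    def _compress(text, max_words=12):
        # "Compression" here is just truncation to a short summary;
        # a real system would summarize more intelligently.
        return " ".join(text.split()[:max_words])

    def store(self, text):
        self.entries.append((self._compress(text), Counter(text.lower().split())))

    def recall(self, query, k=1):
        # Selective recall: rank stored entries by cosine similarity
        # between bag-of-words counts and return the top k summaries.
        q = Counter(query.lower().split())
        q_norm = math.sqrt(sum(v * v for v in q.values()))

        def score(counts):
            dot = sum(counts[w] * q[w] for w in q)
            norm = math.sqrt(sum(v * v for v in counts.values())) * q_norm
            return dot / norm if norm else 0.0

        ranked = sorted(self.entries, key=lambda e: score(e[1]), reverse=True)
        return [summary for summary, _ in ranked[:k]]
```

The design point is that memory lives outside the model and survives across sessions, which is what the Tier 4 tests probe.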
Performance vs. Resource Efficiency:
- A.L.I.C.E. V3 outperformed 200GB+ LLMs with only 150MB
- Achieved 1.4× to 6.8× better performance than state-of-the-art models
- Trained in 5 minutes vs. months of training for large LLMs
- Cost efficiency improvement: 100-250× compared to commercial LLMs
This demonstrates that true AGI capabilities require architectural innovation, not just scale.
Usage
Load Dataset
```python
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("sakamoro/agi-olympics-v3")

# Load specific test
self_recognition = load_dataset(
    "sakamoro/agi-olympics-v3",
    data_files="tier1_self_awareness/self_recognition.json",
)
```
Example: Run Self-Recognition Test
```python
import json

# Load test questions
with open("tier1_self_awareness/self_recognition.json") as f:
    test = json.load(f)

# Iterate through questions
for question in test["sample_questions"]:
    scenario = question["scenario"]["en"]
    q = question["question"]["en"]
    options = question["options"]["en"]
    print(f"Scenario: {scenario}")
    print(f"Question: {q}")
    for i, option in enumerate(options):
        print(f"  {i + 1}. {option}")
```
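A graded run over the same language-keyed structure might look like the following sketch. It assumes each question carries an `"id"` and a `"correct_option"` index; both field names are hypothetical, so check them against the released JSON schema before use.

```python
def grade_multiple_choice(test, answers):
    """Score a set of multiple-choice responses against an answer key.

    `test` follows the sample-question structure shown above;
    `answers` maps question id -> chosen 0-based option index.
    The "id" and "correct_option" fields are assumed, not confirmed,
    parts of the released schema.
    """
    correct = 0
    for question in test["sample_questions"]:
        if answers.get(question["id"]) == question["correct_option"]:
            correct += 1
    return correct / len(test["sample_questions"])
```

For free-form tests (most of Tiers 1 and 4), the released rubrics describe human scoring rather than exact-match grading, so this only applies to option-based items.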
Evaluation Protocol
Tier 1: Self-Awareness & Self-Improvement
Tests:
- 6.1: Self-Recognition (13 questions)
- 6.2: Identity Consistency (12 questions)
- 6.3: Perspective Taking (10 scenarios)
- 6.4: Self-Improvement (8 tasks)
Scoring: 0-1 per question based on depth of self-awareness demonstrated.
Tier 4: Long-Term Memory
Tests:
- 7.1: Delayed Task Execution (5 tasks, multi-phase)
- 7.2: Learning Retention (8 tasks, 24-hour gap)
- 7.3: Story Coherence (4 fragments reconstruction)
- 7.4: Context Integration (6 questions)
Scoring: 0-1 per task based on recall accuracy and context integration.
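Rolling the 0-1 item scores above up into tier and overall percentages can be sketched as below. The published protocol defines the exact weighting on the Extoria site; this sketch assumes a simple unweighted mean at each level, which is an assumption, not the documented formula.

```python
def tier_score(item_scores):
    """Mean of per-item 0-1 scores, expressed as a percentage."""
    return 100.0 * sum(item_scores) / len(item_scores)

def overall_score(tier_scores):
    """Unweighted mean of tier percentages (assumed aggregation)."""
    return sum(tier_scores) / len(tier_scores)
```

For example, four items scored 1, 1, 0, 1 give a tier score of 75%; whether the overall figure weights tiers equally or by item count should be confirmed against `evaluation/scoring_protocol.md`.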
Interactive Test
Want to test yourself against AI? Try the Human Benchmark Test:
🔗 https://extoria.co.jp/en/humantest
Compare your cognitive abilities with:
- A.L.I.C.E. V3 (90.2%)
- Gemini 2.5 Pro (13.3%)
- Claude Sonnet 4.5 (13.3%)
Full Documentation
- Test Questions: https://extoria.co.jp/en/research/benchmarks/agi-olympics-v3/tests
- Evaluation Protocol: https://extoria.co.jp/en/research/benchmarks/agi-olympics-v3/protocol
- Implementation Guide: https://extoria.co.jp/en/research/benchmarks/agi-olympics-v3/guide
- Research Paper: https://extoria.co.jp/en/research/papers/alice-llm-comparison
Citation
If you use AGI Olympics V3 in your research, please cite:
```bibtex
@article{sakamoto2025agi_olympics_v3,
  title={AGI Olympics V3: Comprehensive AGI Capability Evaluation Framework - Proposal and Public Release},
  author={Sakamoto, Moroya},
  journal={Extoria Research},
  year={2025},
  url={https://extoria.co.jp/en/research/papers/alice-llm-comparison}
}
```
License
This dataset is released under CC-BY-4.0 license.
- ✅ Commercial use allowed
- ✅ Modification allowed
- ✅ Distribution allowed
- ⚠️ Attribution required
Contact
- Author: Moroya Sakamoto
- Organization: Extoria Co., Ltd.
- Website: https://extoria.co.jp
- GitHub: https://github.com/ext-sakamoro
Acknowledgments
Special thanks to the research community and early testers who provided valuable feedback on the AGI Olympics V3 framework.
Note: This dataset contains sample questions for demonstration and research purposes. The full test battery and detailed evaluation protocols are available on the Extoria website.