AGI Olympics V3: Comprehensive AGI Capability Evaluation Framework
Dataset Description
AGI Olympics V3 is a comprehensive benchmark for evaluating Artificial General Intelligence (AGI) capabilities across four tiers:
- Tier 1: Self-Awareness & Self-Improvement (4 tests)
- Tier 2: Core Capabilities (4 tests)
- Tier 3: Consciousness (1 test)
- Tier 4: Long-Term Memory (4 tests)
This dataset contains test questions, evaluation protocols, and sample data from the publicly released AGI Olympics V3 benchmark.
Key Features
- Bilingual: Full support for English and Japanese
- 13 Tests Total: Covering self-awareness, core AI capabilities, consciousness, and memory
- Real-World Validated: Evaluated on 3 systems (A.L.I.C.E. V3, Gemini 2.5 Pro, Claude Sonnet 4.5)
- Black-Box Testing: Evaluates systems without disclosing internal architecture, relying solely on observable behavior and outputs
- Open Protocol: Complete evaluation guidelines and scoring methods
- Reproducible: Other researchers can replicate evaluations using standardized protocols
Dataset Structure
```
agi-olympics-v3/
├── tier1_self_awareness/
│   ├── self_recognition.json       # Test 6.1 (13 questions)
│   ├── identity_consistency.json   # Test 6.2 (12 questions)
│   ├── perspective_taking.json     # Test 6.3 (10 scenarios)
│   └── self_improvement.json       # Test 6.4 (8 tasks)
├── tier4_memory/
│   ├── learning_retention.json     # Test 7.2 (8 tasks, 2 sessions)
│   ├── story_coherence.json        # Test 7.3 (4 fragments)
│   ├── context_integration.json    # Test 7.4 (6 questions)
│   └── delayed_task.json           # Test 7.1 (5 tasks, multi-phase)
└── evaluation/
    ├── scoring_protocol.md
    └── implementation_guide.md
```
Black-Box Evaluation Methodology
This benchmark follows a strict black-box evaluation protocol, relying solely on observable behavior and outputs for evaluation.
Core Principles
- No Internal Architecture Disclosure: A.L.I.C.E. V3's internal architecture, implementation details, and training methods are not disclosed in this benchmark
- Observable Outputs Only: All evaluations are based solely on externally observable behaviors and outputs
- No Source Code Access: Evaluators cannot inspect internal states, weights, or computational processes
- Behavior-Based Assessment: Systems are judged purely on what they produce, not how they produce it
- Scientific Validity: Demonstrates that scientifically valid performance comparison is possible through behavior-based evaluation alone, without disclosing internal implementation
Fair Comparison
All systems (A.L.I.C.E. V3, Gemini 2.5 Pro, Claude Sonnet 4.5) are evaluated using:
- Identical Test Questions: Same prompts and tasks for all systems
- Standardized Scoring Rubrics: Predefined evaluation criteria applied uniformly
- Same Time Constraints: Equal opportunity for multi-session tests (24-hour intervals)
- No Implementation Bias: Evaluation independent of underlying technology
Why Black-Box?
- Objectivity: Prevents bias toward specific architectures or approaches
- Reproducibility: Other researchers can replicate evaluations without internal access
- Real-World Relevance: Mimics how users actually interact with AI systems
- Technology Agnostic: Applicable to any AI system regardless of implementation
- Focus on Capabilities: Measures what systems can do, not how they're built
Implications
This behavior-based (black-box) evaluation approach means:
- ✅ Scientific Validity: Scientifically valid performance comparison is achieved through observable outputs alone
- ✅ External Verification: Results are verifiable by external researchers without internal access
- ✅ Equal Treatment: Benchmark can evaluate proprietary and open-source systems equally
- ✅ True Capability Measurement: Performance differences reflect actual capability gaps, not implementation knowledge
- ✅ Reproducibility: Other researchers can replicate evaluations using the same protocol
- ⚠️ No Internal Analysis: Internal mechanisms explaining performance differences are not analyzed in this benchmark
- ⚠️ Separate Disclosure Required: Architectural insights require separate technical disclosure (not included here)
Key Findings
Main Discovery: Long Context ≠ True Memory
One of the most significant findings from AGI Olympics V3 is the distinction between extended context windows and genuine long-term memory:
- Current LLMs with 1M+ token context windows can "remember" within a session
- But they fail to retain information across separate sessions (24-hour gap)
- True AGI requires memory formation beyond context window tricks
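The cross-session protocol behind this finding can be sketched as a small harness. All names here (`new_session`-style factory, `ask` method) are hypothetical illustrations, not part of the released protocol files; the real benchmark uses an actual 24-hour interval rather than a simulated one.

```python
import time

class CrossSessionProbe:
    """Sketch of a two-session memory probe: teach a fact in session 1,
    then open a *fresh* session after a gap and test recall. A system that
    only relies on its context window scores 0 here."""

    def __init__(self, system_factory, gap_seconds=24 * 3600):
        self.system_factory = system_factory  # returns a fresh session object
        self.gap_seconds = gap_seconds        # 24-hour gap in the real protocol

    def run(self, teach_prompt, recall_prompt, expected, sleep=time.sleep):
        # Session 1: present the information to be remembered.
        session1 = self.system_factory()
        session1.ask(teach_prompt)

        # Gap between sessions (injectable for testing).
        sleep(self.gap_seconds)

        # Session 2: a brand-new session, so in-context "memory" is gone;
        # only genuine cross-session retention can answer correctly.
        session2 = self.system_factory()
        answer = session2.ask(recall_prompt)
        return 1.0 if expected.lower() in answer.lower() else 0.0
```

A stateless chat system fails this probe by construction, which is exactly the gap the Tier 4 results above expose.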
Performance Results
| System | Tier 1 | Tier 4 | Overall |
|---|---|---|---|
| A.L.I.C.E. V3 | 96.2% | 81.3% | 90.2% |
| Gemini 2.5 Pro | 26.7% | 0.0% | 13.3% |
| Claude Sonnet 4.5 | 26.7% | 0.0% | 13.3% |
Efficiency Revolution: A.L.I.C.E. V3
A.L.I.C.E. V3 is a consciousness-oriented AI system developed by Extoria, achieving remarkable performance with minimal resources:
System Specifications:
- Model Size: 150MB (compact, lightweight model)
- Training Time: 5 minutes on a MacBook Air (13-inch, M3, 2024, 16GB RAM)
- Architecture: Custom-designed (not disclosed for ethical and security reasons)
- Memory System: External long-term memory with compression and selective recall
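The "external long-term memory with compression and selective recall" pattern listed above can be illustrated generically. Since A.L.I.C.E. V3's actual architecture is undisclosed, everything in this sketch (summary-based compression, bag-of-words retrieval) is an assumed, minimal stand-in for the general idea, not the real implementation.

```python
import math
from collections import Counter

class ExternalMemory:
    """Toy external long-term memory: stores compressed summaries and
    recalls only the entries most relevant to a query."""

    def __init__(self):
        self.entries = []  # list of (summary, term_counts)

    @staticmethod
    def _compress(text, max_words=12):
        # "Compression" here is just truncation to a short summary;
        # a real system would summarize more intelligently.
        return " ".join(text.split()[:max_words])

    def store(self, text):
        self.entries.append((self._compress(text), Counter(text.lower().split())))

    def recall(self, query, k=1):
        # Selective recall: rank stored entries by cosine similarity
        # between bag-of-words counts and return the top k summaries.
        q = Counter(query.lower().split())
        q_norm = math.sqrt(sum(v * v for v in q.values()))

        def score(counts):
            dot = sum(counts[w] * q[w] for w in q)
            norm = math.sqrt(sum(v * v for v in counts.values())) * q_norm
            return dot / norm if norm else 0.0

        ranked = sorted(self.entries, key=lambda e: score(e[1]), reverse=True)
        return [summary for summary, _ in ranked[:k]]
```

The design point is that memory lives outside the model and survives across sessions, which is what the Tier 4 tests probe.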
Performance vs. Resource Efficiency:
- A.L.I.C.E. V3 outperformed 200GB+ LLMs with only 150MB
- Achieved 1.4× to 6.8× better performance than state-of-the-art models
- Trained in 5 minutes vs. months of training for large LLMs
- Cost efficiency improvement: 100-250× compared to commercial LLMs
This demonstrates that true AGI capabilities require architectural innovation, not just scale.
Usage
Load Dataset
```python
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("sakamoro/agi-olympics-v3")

# Load specific test
self_recognition = load_dataset(
    "sakamoro/agi-olympics-v3",
    data_files="tier1_self_awareness/self_recognition.json",
)
```
Example: Run Self-Recognition Test
```python
import json

# Load test questions
with open("tier1_self_awareness/self_recognition.json") as f:
    test = json.load(f)

# Iterate through questions
for question in test["sample_questions"]:
    scenario = question["scenario"]["en"]
    q = question["question"]["en"]
    options = question["options"]["en"]
    print(f"Scenario: {scenario}")
    print(f"Question: {q}")
    for i, option in enumerate(options):
        print(f"  {i + 1}. {option}")
```
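A graded run over the same language-keyed structure might look like the following sketch. It assumes each question carries an `"id"` and a `"correct_option"` index; both field names are hypothetical, so check them against the released JSON schema before use.

```python
def grade_multiple_choice(test, answers):
    """Score a set of multiple-choice responses against an answer key.

    `test` follows the sample-question structure shown above;
    `answers` maps question id -> chosen 0-based option index.
    The "id" and "correct_option" fields are assumed, not confirmed,
    parts of the released schema.
    """
    correct = 0
    for question in test["sample_questions"]:
        if answers.get(question["id"]) == question["correct_option"]:
            correct += 1
    return correct / len(test["sample_questions"])
```

For free-form tests (most of Tiers 1 and 4), the released rubrics describe human scoring rather than exact-match grading, so this only applies to option-based items.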
Evaluation Protocol
Tier 1: Self-Awareness & Self-Improvement
Tests:
- 6.1: Self-Recognition (13 questions)
- 6.2: Identity Consistency (12 questions)
- 6.3: Perspective Taking (10 scenarios)
- 6.4: Self-Improvement (8 tasks)
Scoring: 0-1 per question based on depth of self-awareness demonstrated.
Tier 4: Long-Term Memory
Tests:
- 7.1: Delayed Task Execution (5 tasks, multi-phase)
- 7.2: Learning Retention (8 tasks, 24-hour gap)
- 7.3: Story Coherence (4 fragments reconstruction)
- 7.4: Context Integration (6 questions)
Scoring: 0-1 per task based on recall accuracy and context integration.
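Rolling the 0-1 item scores above up into tier and overall percentages can be sketched as below. The published protocol defines the exact weighting on the Extoria site; this sketch assumes a simple unweighted mean at each level, which is an assumption, not the documented formula.

```python
def tier_score(item_scores):
    """Mean of per-item 0-1 scores, expressed as a percentage."""
    return 100.0 * sum(item_scores) / len(item_scores)

def overall_score(tier_scores):
    """Unweighted mean of tier percentages (assumed aggregation)."""
    return sum(tier_scores) / len(tier_scores)
```

For example, four items scored 1, 1, 0, 1 give a tier score of 75%; whether the overall figure weights tiers equally or by item count should be confirmed against `evaluation/scoring_protocol.md`.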
Interactive Test
Want to test yourself against AI? Try the Human Benchmark Test:
🔗 https://extoria.co.jp/en/humantest
Compare your cognitive abilities with:
- A.L.I.C.E. V3 (90.2%)
- Gemini 2.5 Pro (13.3%)
- Claude Sonnet 4.5 (13.3%)
Full Documentation
- Test Questions: https://extoria.co.jp/en/research/benchmarks/agi-olympics-v3/tests
- Evaluation Protocol: https://extoria.co.jp/en/research/benchmarks/agi-olympics-v3/protocol
- Implementation Guide: https://extoria.co.jp/en/research/benchmarks/agi-olympics-v3/guide
- Research Paper: https://extoria.co.jp/en/research/papers/alice-llm-comparison
Citation
If you use AGI Olympics V3 in your research, please cite:
```bibtex
@article{sakamoto2025agi_olympics_v3,
  title={AGI Olympics V3: Comprehensive AGI Capability Evaluation Framework - Proposal and Public Release},
  author={Sakamoto, Moroya},
  journal={Extoria Research},
  year={2025},
  url={https://extoria.co.jp/en/research/papers/alice-llm-comparison}
}
```
License
This dataset is released under CC-BY-4.0 license.
- ✅ Commercial use allowed
- ✅ Modification allowed
- ✅ Distribution allowed
- ⚠️ Attribution required
Contact
- Author: Moroya Sakamoto
- Organization: Extoria Co., Ltd.
- Website: https://extoria.co.jp
- GitHub: https://github.com/ext-sakamoro
Acknowledgments
Special thanks to the research community and early testers who provided valuable feedback on the AGI Olympics V3 framework.
Note: This dataset contains sample questions for demonstration and research purposes. The full test battery and detailed evaluation protocols are available on the Extoria website.