---
license: mit
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- benchmark
- reasoning
- multi-step
- evaluation
- llm-evaluation
- goodhart
- execution-vs-understanding
- consensus
- multi-model
size_categories:
- 1K