Spaces:

opencompass
/

ATLAS

Sleeping

File size: 7,539 Bytes

from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard 
    sage_overall = Task("sage_overall", "accuracy", "ATLAS Overall")
    sage_math = Task("sage_math", "accuracy", "Mathematics")
    sage_physics = Task("sage_physics", "accuracy", "Physics")
    sage_chemistry = Task("sage_chemistry", "accuracy", "Chemistry")
    sage_biology = Task("sage_biology", "accuracy", "Biology")
    sage_earth_science = Task("sage_earth_science", "accuracy", "Earth Science")
    sage_astronomy = Task("sage_astronomy", "accuracy", "Astronomy")

NUM_FEWSHOT = 0 # Change with your few shot
# ---------------------------------------------------



# Your leaderboard name
# TITLE = """<h1 align="center" id="space-title">ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning</h1>"""
TITLE = """<h1 align="center" id="space-title">ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning</h1>
<div align="center" style="margin-top: 10px; white-space: nowrap;"><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/" style="display: inline-block; margin: 0 5px;"><img src="https://img.shields.io/badge/Dataset%20License-CC%20BY--NC--SA%204.0-blue.svg" alt="Dataset License: CC BY-NC-SA 4.0"></a> <a href="https://arxiv.org/abs/2511.14366" style="display: inline-block; margin: 0 5px;"><img src="https://img.shields.io/badge/Paper-arXiv-red.svg" alt="Paper"></a> <a href="https://huggingface.co/datasets/opencompass/ATLAS" style="display: inline-block; margin: 0 5px;"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-orange" alt="Hugging Face Dataset"></a></div>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
**ATLAS (AGI-Oriented Testbed for Logical Application in Science)** is a large-scale, high-difficulty, cross-disciplinary evaluation suite for assessing the frontier scientific reasoning capabilities of LLMs. Designed to address the challenges of benchmark saturation, narrow disciplinary focus, oversimplified answer formats, and data contamination in existing evaluations, ATLAS serves as a reliable **ruler** for measuring progress toward AGI in the **AI for Science** domain.

## 🚀 Benchmark Overview
**ATLAS** evaluates models across seven core scientific fields that are central to AI for Science, encompassing 57 corresponding sub-fields to ensure comprehensive coverage of scientific reasoning requirements:
- **Mathematics** - Abstract algebra, analysis, differential equations, and computational mathematics
- **Physics** - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- **Chemistry** - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- **Biology** - Genetics, immunology, molecular biology, biophysics, and ecology
- **Computer Science** - Computer architecture, artificial intelligence, and software fundamentals
- **Earth Science** - Geography, geodesy, atmospheric chemistry, marine science, and geology
- **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis

## 📊 Evaluation Metrics
- **Accuracy (%)**: Overall correctness of predictions across all domains, judged by LLM-as-Judge (OpenAI o4-mini / GPT-OSS-120B)
- **mG-Pass@2**: Multi-generation Pass rate for 2 predictions (measures consistency of model outputs)
- **mG-Pass@4**: Multi-generation Pass rate for 4 predictions (measures stability of reasoning capabilities)
The leaderboard displays model performance sorted by average accuracy, with domain-specific scores reflecting strengths in different scientific fields. All metrics are derived from the ATLAS validation/test set (≈800 expert-created original problems).

#### 📧 If you have any questions about submissions or leaderboards, please contact: opencompass@pjlab.org.cn
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How ATLAS Works

ATLAS evaluates language models across seven scientific domains through a comprehensive assessment of both content generation and reasoning capabilities.

### Evaluation Process:
1. **Multi-domain Assessment**: Models are tested on questions spanning Mathematics, Physics, Chemistry, Biology, Computer Science, Earth Science, and Materials Science
2. **Content + Reasoning**: Each submission requires both predicted answers and reasoning explanations  
3. **Accuracy Scoring**: Performance is measured using accuracy metrics across all domains
4. **Comprehensive Reporting**: Results are aggregated to provide both overall and domain-specific scores

### Submission Format:
Submissions should follow this JSON structure:
```json
{{
    "submission_org": "Your Organization",
    "submission_email": "contact@example.com",
    "predictions": [
        {{
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
        }}
    ]
}}
```

## Reproducibility
To reproduce our evaluation results:
1. Download the ATLAS dataset from our repository
2. Use the evaluation scripts provided in the benchmark toolkit
3. Follow the submission format specifications exactly
4. Submit your results through this leaderboard interface
"""

EVALUATION_QUEUE_TEXT = """
## Submit Your ATLAS Test Set Results

Results can be submitted as evaluation outputs in JSON format. Each submission should include predictions and reasoning content for all test questions.

### Required JSON Format:
```json
{   
    "model_name": "Your Model Name",
    "submission_org": "Your Organization",
    "submission_email": "contact@example.com",
    "predictions": [
        {
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
        }
    ]
}
```

### Submission Guidelines:
- Each prediction must include exactly 4 content items and 4 reasoning items
- Question IDs should match the official ATLAS test set
- Provide clear scientific reasoning for each prediction
- Ensure JSON format is valid and complete

Your submission will be automatically evaluated across all scientific domains and added to the leaderboard.
"""

CITATION_BUTTON_LABEL = "If you find our work useful or used our benchmark, please cite us!"
CITATION_BUTTON_TEXT = r"""@article{liu2025atlas,
  title={ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning},
  author={Liu, Hongwei and Liu, Junnan and Liu, Shudong and Duan, Haodong and Li, Yuqiang and Su, Mao and Liu, Xiaohong and Zhai, Guangtao and Fang, Xinyu and Ma, Qianhong and Zhang, Taolin and Ma, Zihan and Zhao, Yufeng and Zhou, Peiheng and Xiao, Linchen and Zhang, Wenlong and Zhou, Shijie and Ma, Xingjian and Sun, Siqi and Ge, Jiaye and Li, Meng and Liu, Yuhong and Dong, Jianxin and Li, Jiaying and Wu, Hui and Liang, Hanwen and Lin, Jintai and Wang, Yanting and Dong, Jie and Zhu, Tong and Fu, Tianfan and He, Conghui and Zhang, Qi and Zhang, Songyang and Bai, Lei and Chen, Kai},
  journal={arXiv preprint arXiv:2511.14366},
  year={2025}
}"""