---
title: FormationEval Leaderboard
emoji: 🪨
colorFrom: blue
colorTo: green
sdk: gradio
app_file: app.py
pinned: false
license: cc-by-4.0
short_description: Petroleum geoscience LLM benchmark
tags:
  - leaderboard
  - benchmark
  - geoscience
  - petroleum
  - mcq
  - evaluation
---
# FormationEval Leaderboard
Interactive leaderboard for the FormationEval benchmark — 72 language models evaluated on 505 petroleum geoscience multiple-choice questions (Christmas 2025).
FormationEval v0.1 was the first 505-question version of the benchmark and formed a small part of the work presented at EAGE Digital 2026 in Stavanger, in the session *New Frontiers In Geomodelling: Recent Digital Advances*, under the title *Multi-Agent Framework for Subsurface Workflows: Petrophysicist, Geologist and Reservoir Engineer GenAI Agents*. It was built to compare models on oil and gas geoscience and subsurface tasks and to provide a public leaderboard that is useful in practice. At the time, I did not see any public benchmark or leaderboard in this area that matched that need. DISKOS-QA and the SPE MCQ Dataset were added later as separate imported tracks in the same suite.
March 2026 update: FormationEval now also includes the imported DISKOS-QA and SPE MCQ tracks. This Space still displays results for the evaluated MCQ v0.1 track only. A full rerun on the expanded suite is pending: this is a self-funded, one-person project, and evaluating the expanded suite requires materially more token spend.
## Features
- Overall rankings with accuracy, pricing, and open-weight status
- Difficulty breakdown (Easy/Medium/Hard accuracy)
- Domain breakdown (7 geoscience domains)
- Bias analysis (position and length bias per model)
- Interactive charts (accuracy vs price, top 30, open-weight models, domain heatmap)
- Filters (search, company, open-weight toggle); a minimal wiring sketch follows this list
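As a rough illustration of how the filter controls could be wired up in Gradio, here is a hedged sketch, not the actual `app.py`: the `results.csv` path and the column names `model`, `open_weight`, and `accuracy` are assumptions, so check the Space repository for the real implementation.

```python
# Minimal sketch of a filterable leaderboard table in Gradio.
# Assumes results live in a local "results.csv" with columns
# model, open_weight (bool), accuracy — all hypothetical names.
import gradio as gr
import pandas as pd

df = pd.read_csv("results.csv")

def filter_rows(search: str, open_only: bool) -> pd.DataFrame:
    """Return rows matching the search box and open-weight toggle."""
    out = df
    if search:
        out = out[out["model"].str.contains(search, case=False)]
    if open_only:
        out = out[out["open_weight"]]
    return out.sort_values("accuracy", ascending=False)

with gr.Blocks() as demo:
    search = gr.Textbox(label="Search models")
    open_only = gr.Checkbox(label="Open-weight only")
    table = gr.Dataframe(value=df)
    # Re-filter the table whenever either control changes.
    search.change(filter_rows, [search, open_only], table)
    open_only.change(filter_rows, [search, open_only], table)

demo.launch()
```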
## Links
- Paper: arXiv:2601.02158
- Dataset: AlmazErmilov/FormationEval (a loading sketch follows this list)
- GitHub: FormationEval Repository
- Website: formationeval.no — take a quiz, compare with 72 models, browse questions, send feedback
- DISKOS-QA track: Browse on formationeval.no
- Unified question browser: formationeval.no/questions
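To pull the benchmark questions from the Hub, a short sketch using the 🤗 `datasets` library; the repo id comes from the links above, but the split name is an assumption, so consult the dataset card for the actual schema:

```python
# Hedged sketch: load the FormationEval questions from the Hub.
# The "train" split name is an assumption — check the dataset card.
from datasets import load_dataset

ds = load_dataset("AlmazErmilov/FormationEval", split="train")
print(ds[0])  # inspect one question record to see the real column names
```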
## Current public note
The leaderboard in its current form already meets the project's original goal: comparing models on the 505-question MCQ track. As noted above, the expanded-suite rerun is pending for funding reasons. If you want to collaborate, support reruns, or discuss related research and engineering work, contact almaz.ermilov@gmail.com.
## Top performers
| Rank | Model | Open | Accuracy |
|---|---|---|---|
| 1 | gemini-3-pro-preview | No | 99.8% |
| 2 | glm-4.7 | Yes | 98.6% |
| 3 | gemini-3-flash-preview | No | 98.2% |
| 4 | gemini-2.5-pro | No | 97.8% |
| 5 | grok-4.1-fast | No | 97.6% |
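For context on how percentages like these are typically produced: MCQ accuracy is usually exact match between the option letter a model outputs and the gold answer. Below is a minimal illustrative scorer, not the project's actual harness; the letter-extraction regex and the A–D option range are assumptions.

```python
# Illustrative MCQ scorer (assumed letter extraction, not the real harness).
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a model response."""
    m = re.search(r"\b([A-D])\b", response.strip())
    return m.group(1) if m else None

def accuracy(responses: list[str], answers: list[str]) -> float:
    """Fraction of responses whose extracted letter matches the gold answer."""
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)

# Example: both responses score as correct.
print(accuracy(["The answer is B.", "C"], ["B", "C"]))  # 1.0
```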
## Citation
```bibtex
@misc{ermilov2026formationeval,
  title={FormationEval, an open multiple-choice benchmark for petroleum geoscience},
  author={Almaz Ermilov},
  year={2026},
  eprint={2601.02158},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.02158},
  doi={10.48550/arXiv.2601.02158}
}
```
## License
This Space repository is licensed under CC BY 4.0. The Space displays results for the MCQ v0.1 track only and does not redefine the licensing of the imported DISKOS-QA or SPE MCQ tracks.
For imported track provenance and licensing notes, see the main repository notices, the dataset repository notices, the upstream DISKOS-QA benchmark, and the upstream SPE MCQ dataset.