---
title: FormationEval Leaderboard
emoji: 🪨
colorFrom: blue
colorTo: green
sdk: gradio
app_file: app.py
pinned: false
license: cc-by-4.0
short_description: Petroleum geoscience LLM benchmark
tags:
  - leaderboard
  - benchmark
  - geoscience
  - petroleum
  - mcq
  - evaluation
---

# FormationEval Leaderboard

Interactive leaderboard for the FormationEval benchmark — 72 language models evaluated on 505 petroleum geoscience multiple-choice questions (Christmas 2025).

FormationEval v0.1 was the first 505-question version of the benchmark and formed a small part of the work presented at EAGE Digital 2026 in Stavanger, in the session *New Frontiers In Geomodelling: Recent Digital Advances*, under the title *Multi-Agent Framework for Subsurface Workflows: Petrophysicist, Geologist and Reservoir Engineer GenAI Agents*. It was built to compare models on oil and gas geoscience and subsurface tasks and to provide a public leaderboard that is useful in practice; at the time, no public benchmark or leaderboard in that area matched that need. DISKOS-QA and the SPE MCQ Dataset were added later as separately imported tracks in the same suite.

**March 2026 update:** FormationEval now also includes the imported DISKOS-QA and SPE MCQ tracks. This Space still displays results for the evaluated MCQ v0.1 track only. A full rerun on the expanded suite is pending because this is a self-funded one-person project and evaluating the expanded suite requires materially more token spend.

## Features

- Overall rankings with accuracy, pricing, and open-weight status
- Difficulty breakdown (Easy/Medium/Hard accuracy)
- Domain breakdown (7 geoscience domains)
- Bias analysis (position and length bias per model)
- Interactive charts (accuracy vs price, top 30, open-weight models, domain heatmap)
- Filters (search, company, open-weight toggle)
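To make the bias analysis concrete, here is a minimal sketch of one way position bias can be measured for a model's MCQ answers. The record layout and field names are hypothetical illustrations, not the Space's actual schema or method:

```python
from collections import Counter

# Hypothetical per-question records: (model, chosen_option, correct_option).
# Position bias here = largest deviation of the model's answer-letter
# distribution from the uniform share expected when options are shuffled.
results = [
    ("model-x", "A", "A"), ("model-x", "A", "B"),
    ("model-x", "A", "C"), ("model-x", "D", "D"),
]

def position_bias(records, model, options="ABCD"):
    chosen = [c for m, c, _ in records if m == model]
    counts = Counter(chosen)
    n = len(chosen)
    expected = 1 / len(options)  # 25% per letter for four options
    # Max absolute deviation from the uniform share across options.
    return max(abs(counts.get(o, 0) / n - expected) for o in options)

print(position_bias(results, "model-x"))  # 0.5: "A" chosen in 3 of 4 answers
```

A model with no positional preference would score near 0 on this toy metric; a model that always picks "A" would score 0.75.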

## Links

## Current public note

The current leaderboard already answers the project's original model-comparison goal on the 505-question MCQ track. The rerun on the expanded suite remains pending for the funding reasons noted above. If you want to collaborate, support reruns, or discuss related research and engineering work, contact almaz.ermilov@gmail.com.

## Top performers

| Rank | Model | Open | Accuracy |
|------|-------|------|----------|
| 1 | gemini-3-pro-preview | No | 99.8% |
| 2 | glm-4.7 | Yes | 98.6% |
| 3 | gemini-3-flash-preview | No | 98.2% |
| 4 | gemini-2.5-pro | No | 97.8% |
| 5 | grok-4.1-fast | No | 97.6% |
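The ranking above is a straightforward sort by accuracy. A minimal sketch of that logic (the row structure and field names are illustrative, not the app's actual data model), using a subset of the table:

```python
# Rank models by accuracy, descending, as the leaderboard table does.
rows = [
    {"model": "glm-4.7", "open": True, "accuracy": 0.986},
    {"model": "gemini-3-pro-preview", "open": False, "accuracy": 0.998},
    {"model": "gemini-2.5-pro", "open": False, "accuracy": 0.978},
]

ranked = sorted(rows, key=lambda r: r["accuracy"], reverse=True)
for rank, r in enumerate(ranked, start=1):
    print(rank, r["model"], f"{r['accuracy']:.1%}")
```

The same sorted list also feeds filters naturally, e.g. `[r for r in ranked if r["open"]]` for the open-weight toggle.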

## Citation

```bibtex
@misc{ermilov2026formationeval,
  title={FormationEval, an open multiple-choice benchmark for petroleum geoscience},
  author={Almaz Ermilov},
  year={2026},
  eprint={2601.02158},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.02158},
  doi={10.48550/arXiv.2601.02158}
}
```

## License

This Space repository is licensed under CC BY 4.0. The Space displays results for the MCQ v0.1 track only and does not redefine the licensing of the imported DISKOS-QA or SPE MCQ tracks.

For imported track provenance and licensing notes, see the main repository notices, the dataset repository notices, the upstream DISKOS-QA benchmark, and the upstream SPE MCQ dataset.