---
title: FormationEval Leaderboard
emoji: 🪨
colorFrom: blue
colorTo: green
sdk: gradio
app_file: app.py
pinned: false
license: cc-by-4.0
short_description: Petroleum geoscience LLM benchmark
tags:
  - leaderboard
  - benchmark
  - geoscience
  - petroleum
  - mcq
  - evaluation
---

# FormationEval Leaderboard

Interactive leaderboard for the FormationEval benchmark — 72 language models evaluated on 505 petroleum geoscience multiple-choice questions (Christmas 2025).

FormationEval v0.1 was the first 505-question version of the benchmark and formed a small part of the work presented at EAGE Digital 2026 in Stavanger, in the session *New Frontiers In Geomodelling: Recent Digital Advances*, under the title *Multi-Agent Framework for Subsurface Workflows: Petrophysicist, Geologist and Reservoir Engineer GenAI Agents*. It was built to compare models on oil and gas geoscience and subsurface tasks and to provide a public leaderboard that is useful in practice; at the time, no public benchmark or leaderboard in that area matched that need. DISKOS-QA and the SPE MCQ Dataset were added later as separately imported tracks in the same suite.

**March 2026 update:** FormationEval now also includes the imported DISKOS-QA and SPE MCQ tracks. This Space still displays results for the evaluated MCQ v0.1 track only. A full rerun on the expanded suite is pending because this is a self-funded one-person project and evaluating the expanded suite requires materially more token spend.

## Features

- Overall rankings with accuracy, pricing, and open-weight status
- Difficulty breakdown (Easy/Medium/Hard accuracy)
- Domain breakdown (7 geoscience domains)
- Bias analysis (position and length bias per model)
- Interactive charts (accuracy vs price, top 30, open-weight models, domain heatmap)
- Filters (search, company, open-weight toggle)
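To make the bias analysis concrete, here is a minimal sketch of one way position bias can be measured for a model's MCQ answers. The record layout and field names are hypothetical illustrations, not the Space's actual schema or method:

```python
from collections import Counter

# Hypothetical per-question records: (model, chosen_option, correct_option).
# Position bias here = largest deviation of the model's answer-letter
# distribution from the uniform share expected when options are shuffled.
results = [
    ("model-x", "A", "A"), ("model-x", "A", "B"),
    ("model-x", "A", "C"), ("model-x", "D", "D"),
]

def position_bias(records, model, options="ABCD"):
    chosen = [c for m, c, _ in records if m == model]
    counts = Counter(chosen)
    n = len(chosen)
    expected = 1 / len(options)  # 25% per letter for four options
    # Max absolute deviation from the uniform share across options.
    return max(abs(counts.get(o, 0) / n - expected) for o in options)

print(position_bias(results, "model-x"))  # 0.5: "A" chosen in 3 of 4 answers
```

A model with no positional preference would score near 0 on this toy metric; a model that always picks "A" would score 0.75.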

## Links

## Current public note

The current leaderboard already answers the project's original model-comparison goal on the 505-question MCQ track. The rerun on the expanded suite remains pending for the funding reasons noted above. If you want to collaborate, support reruns, or discuss related research and engineering work, contact almaz.ermilov@gmail.com.

## Top performers

| Rank | Model | Open | Accuracy |
|------|-------|------|----------|
| 1 | gemini-3-pro-preview | No | 99.8% |
| 2 | glm-4.7 | Yes | 98.6% |
| 3 | gemini-3-flash-preview | No | 98.2% |
| 4 | gemini-2.5-pro | No | 97.8% |
| 5 | grok-4.1-fast | No | 97.6% |
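The ranking above is a straightforward sort by accuracy. A minimal sketch of that logic (the row structure and field names are illustrative, not the app's actual data model), using a subset of the table:

```python
# Rank models by accuracy, descending, as the leaderboard table does.
rows = [
    {"model": "glm-4.7", "open": True, "accuracy": 0.986},
    {"model": "gemini-3-pro-preview", "open": False, "accuracy": 0.998},
    {"model": "gemini-2.5-pro", "open": False, "accuracy": 0.978},
]

ranked = sorted(rows, key=lambda r: r["accuracy"], reverse=True)
for rank, r in enumerate(ranked, start=1):
    print(rank, r["model"], f"{r['accuracy']:.1%}")
```

The same sorted list also feeds filters naturally, e.g. `[r for r in ranked if r["open"]]` for the open-weight toggle.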

## Citation

```bibtex
@misc{ermilov2026formationeval,
  title={FormationEval, an open multiple-choice benchmark for petroleum geoscience},
  author={Almaz Ermilov},
  year={2026},
  eprint={2601.02158},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.02158},
  doi={10.48550/arXiv.2601.02158}
}
```

## License

This Space repository is licensed under CC BY 4.0. The Space displays results for the MCQ v0.1 track only and does not redefine the licensing of the imported DISKOS-QA or SPE MCQ tracks.

For imported track provenance and licensing notes, see the main repository notices, the dataset repository notices, the upstream DISKOS-QA benchmark, and the upstream SPE MCQ dataset.