Evaluating Mathematical Reasoning Beyond Accuracy
Paper • arXiv:2404.05692 • Published
# Load model directly
from transformers import AutoTokenizer
# ReasonEval_7B is a custom model class defined in the GAIR-NLP/ReasonEval
# repository (https://github.com/GAIR-NLP/ReasonEval), not part of transformers;
# the exact module path may differ.
from reasoneval import ReasonEval_7B

tokenizer = AutoTokenizer.from_pretrained("GAIR/ReasonEval-7B")
model = ReasonEval_7B.from_pretrained("GAIR/ReasonEval-7B")

ReasonEval-7B is a 7B-parameter decoder-only language model fine-tuned from WizardMath-7B-V1.1. Given a mathematical problem and a candidate solution, ReasonEval-7B assesses the problem-solving process in a step-by-step format, judging the validity and the redundancy of each reasoning step.
With ReasonEval, you can:
- quantify the quality of reasoning steps without relying on human annotators or closed-source models;
- identify potentially invalid or redundant steps in solutions, even when the final answer is correct;
- select high-quality training data for downstream tasks (e.g., fine-tuning).
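As a minimal sketch of the data-selection use case (the scoring values below are made up for illustration, and the filtering rule is an assumption, not ReasonEval's official API), training samples could be kept only when their validity score clears a threshold:

```python
# Hypothetical data-selection sketch: keep only solutions whose
# step-quality (validity) score clears a threshold.
def select_high_quality(samples, threshold=0.8):
    """samples: list of (solution_text, validity_score) pairs."""
    return [text for text, score in samples if score >= threshold]

# Illustrative scores, not real ReasonEval outputs.
samples = [
    ("solution A", 0.95),
    ("solution B", 0.40),
    ("solution C", 0.85),
]
selected = select_high_quality(samples)  # ["solution A", "solution C"]
```

The threshold would be tuned on a held-out set in practice.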
ReasonEval-7B's architecture is identical to WizardMath-7B-V1.1, except that the head for next-token prediction is replaced with a classification head that outputs the probability of each reasoning-step class.

For detailed instructions on how to use the ReasonEval-7B model, visit our GitHub repository at https://github.com/GAIR-NLP/ReasonEval.
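To illustrate how per-step class probabilities could be turned into solution-level scores, here is a hedged sketch; the class names (`positive`, `neutral`, `negative`) and the min/max aggregation rules are assumptions for illustration, not the model's documented output format:

```python
# Hypothetical aggregation of per-step class probabilities into
# solution-level validity and redundancy scores.
def solution_scores(step_probs):
    """step_probs: list of dicts with keys 'positive', 'neutral', 'negative'."""
    # A step is treated as valid if it is not judged negative (incorrect).
    validity = [p["positive"] + p["neutral"] for p in step_probs]
    # A step is treated as redundant if it is judged neutral
    # (correct but contributing nothing new).
    redundancy = [p["neutral"] for p in step_probs]
    # One invalid step invalidates the whole solution, so take the minimum;
    # one redundant step makes the solution redundant, so take the maximum.
    return min(validity), max(redundancy)

steps = [
    {"positive": 0.90, "neutral": 0.05, "negative": 0.05},
    {"positive": 0.20, "neutral": 0.70, "negative": 0.10},
]
validity, redundancy = solution_scores(steps)  # 0.90, 0.70
```

Here the second step drags the solution's validity down and its redundancy up, matching the intuition that step-level judgments should dominate solution-level quality.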
@article{xia2024evaluating,
title={Evaluating Mathematical Reasoning Beyond Accuracy},
author={Xia, Shijie and Li, Xuefeng and Liu, Yixin and Wu, Tongshuang and Liu, Pengfei},
journal={arXiv preprint arXiv:2404.05692},
year={2024},
}
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="GAIR/ReasonEval-7B")