pinned: false
---

Data and models accompanying the paper [When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning](https://arxiv.org/abs/2504.01005), containing:

- Finetuned generative verifiers (i.e., GenRM-FT) for math reasoning.
- Synthetic verification data generated by GPT-4o for math reasoning, to train your own generative verifiers.
- Solutions and verifications generated by various models for math and science reasoning.
# MATH Dataset
We use Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct to generate solutions for problems in the training split of the [MATH dataset](https://huggingface.co/datasets/hendrycks/competition_math).
Then, we use GPT-4o to verify these solutions. We filter out verifications whose verdict doesn't match the ground-truth correctness of the solution, and balance the dataset to contain an equal number of 'yes' and 'no' verifications.
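The filter-and-balance step can be sketched as follows. This is a minimal illustration, not the paper's code; the record fields (`verdict`, `is_correct`) are hypothetical names for the GPT-4o verdict and the ground-truth correctness of the solution.

```python
import random


def filter_and_balance(records):
    """Keep only verifications whose verdict matches ground truth,
    then downsample to an equal number of 'yes' and 'no' verdicts.
    Field names ('verdict', 'is_correct') are hypothetical."""
    # Drop verifications that disagree with ground-truth correctness.
    kept = [r for r in records if (r["verdict"] == "yes") == r["is_correct"]]
    yes = [r for r in kept if r["verdict"] == "yes"]
    no = [r for r in kept if r["verdict"] == "no"]
    # Balance by downsampling the majority class.
    n = min(len(yes), len(no))
    random.seed(0)  # reproducible downsampling
    balanced = random.sample(yes, n) + random.sample(no, n)
    random.shuffle(balanced)
    return balanced
```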
This results in these datasets:
## Training data for GenRM-FT
- Llama-3.1-8B-Instruct: https://huggingface.co/datasets/sc-genrm-scaling/genrm_gpt4o_verifs_llama_3p1_8b_solns_math_train
- Qwen-2.5-7B-Instruct: https://huggingface.co/datasets/sc-genrm-scaling/genrm_gpt4o_verifs_qwen_2p5_7b_solns_math_train
We fine-tune the two models on their respective datasets using LoRA, resulting in these fine-tuned GenRMs:
## Finetuned Verifiers
- Llama-3.1-8B-Instruct: https://huggingface.co/sc-genrm-scaling/llama_3.1_8b_genrm_ft
- Qwen-2.5-7B-Instruct: https://huggingface.co/sc-genrm-scaling/qwen_2.5_7b_genrm_ft
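As a rough sketch, a LoRA setup for such fine-tuning with the `peft` library looks like the following. The hyperparameters and target modules here are illustrative assumptions, not the settings used in the paper.

```python
from peft import LoraConfig

# Illustrative LoRA hyperparameters; the paper's exact settings may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The config is then applied to a base model with `peft.get_peft_model(model, lora_config)` before training, so only the low-rank adapter weights are updated.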
You can follow [this example](https://github.com/nishadsinghi/sc-genrm-scaling/blob/master/llmonk/verify/demo.ipynb) to run inference with these models.
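A minimal sketch of such an inference call, assuming the checkpoints load with the standard `transformers` API. The prompt template below is hypothetical; the exact format GenRM-FT was trained with is shown in the linked notebook.

```python
def build_verification_prompt(problem: str, solution: str) -> str:
    # Hypothetical prompt template; see the demo notebook for the
    # exact format the verifiers were trained with.
    return (
        f"Problem: {problem}\n\n"
        f"Solution: {solution}\n\n"
        "Is the solution correct? Answer 'Yes' or 'No'.\n"
    )


def verify(model_id: str, problem: str, solution: str) -> str:
    # Imported here so the prompt helper stays usable without
    # `transformers` installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(
        build_verification_prompt(problem, solution), return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens (the verification).
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

For example, `verify("sc-genrm-scaling/llama_3.1_8b_genrm_ft", problem, solution)` downloads the verifier from the Hub and returns its generated verification.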
We use these generative verifiers (without fine-tuning in the case of Llama-3.3-70B-Instruct) on solutions from the MATH test set to obtain this data, which we analyse in the paper:
## Solutions and Verifications for Test-set
- Llama-3.1-8B-Instruct:
  - Solutions: https://huggingface.co/datasets/sc-genrm-scaling/MATH128_Solutions_Llama-3.1-8B-Instruct