pinned: false
---

Data and models accompanying the paper [When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning](https://arxiv.org/abs/2504.01005), containing:

- Finetuned generative verifiers (i.e., GenRM-FT) for math reasoning.
- Synthetic verification data generated by GPT-4o for math reasoning, to train your own generative verifiers.
- Solutions and verifications generated by various models for math and science reasoning.
# MATH Dataset
We use Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct to generate solutions for problems in the training split of the [MATH dataset](https://huggingface.co/datasets/hendrycks/competition_math).
Then, we use GPT-4o to verify these solutions. We filter out verifications whose verdict doesn't match the ground-truth correctness of the solution, and balance the dataset to contain an equal number of 'yes' and 'no' verifications.
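The filter-and-balance step can be sketched as follows. This is a minimal illustration, not the paper's code; the record fields (`verdict`, `is_correct`) are hypothetical names for the GPT-4o verdict and the ground-truth correctness of the solution.

```python
import random


def filter_and_balance(records):
    """Keep only verifications whose verdict matches ground truth,
    then downsample to an equal number of 'yes' and 'no' verdicts.
    Field names ('verdict', 'is_correct') are hypothetical."""
    # Drop verifications that disagree with ground-truth correctness.
    kept = [r for r in records if (r["verdict"] == "yes") == r["is_correct"]]
    yes = [r for r in kept if r["verdict"] == "yes"]
    no = [r for r in kept if r["verdict"] == "no"]
    # Balance by downsampling the majority class.
    n = min(len(yes), len(no))
    random.seed(0)  # reproducible downsampling
    balanced = random.sample(yes, n) + random.sample(no, n)
    random.shuffle(balanced)
    return balanced
```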
This results in these datasets:
## Training data for GenRM-FT
- Llama-3.1-8B-Instruct: https://huggingface.co/datasets/sc-genrm-scaling/genrm_gpt4o_verifs_llama_3p1_8b_solns_math_train
- Qwen-2.5-7B-Instruct: https://huggingface.co/datasets/sc-genrm-scaling/genrm_gpt4o_verifs_qwen_2p5_7b_solns_math_train
We fine-tune the two models on their respective datasets using LoRA, resulting in these fine-tuned GenRMs:
## Finetuned Verifiers
- Llama-3.1-8B-Instruct: https://huggingface.co/sc-genrm-scaling/llama_3.1_8b_genrm_ft
- Qwen-2.5-7B-Instruct: https://huggingface.co/sc-genrm-scaling/qwen_2.5_7b_genrm_ft
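As a rough sketch, a LoRA setup for such fine-tuning with the `peft` library looks like the following. The hyperparameters and target modules here are illustrative assumptions, not the settings used in the paper.

```python
from peft import LoraConfig

# Illustrative LoRA hyperparameters; the paper's exact settings may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The config is then applied to a base model with `peft.get_peft_model(model, lora_config)` before training, so only the low-rank adapter weights are updated.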
You can follow [this example](https://github.com/nishadsinghi/sc-genrm-scaling/blob/master/llmonk/verify/demo.ipynb) to run inference with these models.
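A minimal sketch of such an inference call, assuming the checkpoints load with the standard `transformers` API. The prompt template below is hypothetical; the exact format GenRM-FT was trained with is shown in the linked notebook.

```python
def build_verification_prompt(problem: str, solution: str) -> str:
    # Hypothetical prompt template; see the demo notebook for the
    # exact format the verifiers were trained with.
    return (
        f"Problem: {problem}\n\n"
        f"Solution: {solution}\n\n"
        "Is the solution correct? Answer 'Yes' or 'No'.\n"
    )


def verify(model_id: str, problem: str, solution: str) -> str:
    # Imported here so the prompt helper stays usable without
    # `transformers` installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(
        build_verification_prompt(problem, solution), return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens (the verification).
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

For example, `verify("sc-genrm-scaling/llama_3.1_8b_genrm_ft", problem, solution)` downloads the verifier from the Hub and returns its generated verification.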
We use these generative verifiers (without fine-tuning in the case of Llama-3.3-70B-Instruct) on solutions from the MATH test set to obtain this data, which we analyse in the paper:
## Solutions and Verifications for Test-set
- Llama-3.1-8B-Instruct:
  - Solutions: https://huggingface.co/datasets/sc-genrm-scaling/MATH128_Solutions_Llama-3.1-8B-Instruct