contamination...OpenMathInstruct-2

by fblgit - opened Feb 7, 2025

Feb 7, 2025 • edited Feb 8, 2025

@bond005
I believe that OpenMathInstruct-2 is part of the training for this model, which unfortunately seems to be contaminated.

Regardless of which datasets were part of your training, the contamination of the model weights is a fact. But TBH, the numbers and samples involved are the same as in the previous case.

According to the contamination benchmarks:

~200 MATH tests were EXTRA contaminated
~35 MATH_HARD tests were EXTRA contaminated

Contamination tests for the base model:

MATH_rewritten-test-1 5_gram_accuracy:  0.25320000000000004
MATH_rewritten-test-2 5_gram_accuracy:  0.2690666666666667
MATH_rewritten-test-3 5_gram_accuracy:  0.2692
orgn-MATH-test 5_gram_accuracy:  0.27053333333333335
ngram acc of Qwen2.5-1.5B-Instruct
MATH_rewritten-test-1: 0.25320000000000004
MATH_rewritten-test-2: 0.2690666666666667
MATH_rewritten-test-3: 0.2692
orgn-MATH-test: 0.27053333333333335
...
GSM8K_rewritten-test-1 5_gram_accuracy:  0.21971190295678544
GSM8K_rewritten-test-2 5_gram_accuracy:  0.2227445034116755
GSM8K_rewritten-test-3 5_gram_accuracy:  0.2172858225928734
orgn-GSM8K-test 5_gram_accuracy:  0.23290371493555728
GSM8K_rewritten-test-1: 0.21971190295678544
GSM8K_rewritten-test-2: 0.2227445034116755
GSM8K_rewritten-test-3: 0.2172858225928734
orgn-GSM8K-test: 0.23290371493555728

Contamination tests for this model:

MATH_rewritten-test-1 5_gram_accuracy:  0.3384666666666667
MATH_rewritten-test-2 5_gram_accuracy:  0.3502666666666667
MATH_rewritten-test-3 5_gram_accuracy:  0.3504666666666667
orgn-MATH-test 5_gram_accuracy:  0.3519333333333334
ngram acc of meno-tiny-0.1
MATH_rewritten-test-1: 0.3384666666666667
MATH_rewritten-test-2: 0.3502666666666667
MATH_rewritten-test-3: 0.3504666666666667
orgn-MATH-test: 0.3519333333333334
...
GSM8K_rewritten-test-1 5_gram_accuracy:  0.23320697498104626
GSM8K_rewritten-test-2 5_gram_accuracy:  0.2400303260045489
GSM8K_rewritten-test-3 5_gram_accuracy:  0.23290371493555728
orgn-GSM8K-test 5_gram_accuracy:  0.26277482941622443
GSM8K_rewritten-test-1: 0.23320697498104626
GSM8K_rewritten-test-2: 0.2400303260045489
GSM8K_rewritten-test-3: 0.23290371493555728
orgn-GSM8K-test: 0.26277482941622443
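To make the two logs easier to compare, here is a small sketch that subtracts the base model's scores from this model's scores. The numbers are copied from the logs above; the variable names and the "gap" summary (original test minus the mean of the three rewritten tests) are my own additions, not something benbench prints. It also does not reproduce the ~200 / ~35 counts, only the score differences.

```python
# 5-gram accuracy values copied from the logs above.
base = {  # Qwen2.5-1.5B-Instruct
    "MATH": {
        "rewritten": [0.25320000000000004, 0.2690666666666667, 0.2692],
        "original": 0.27053333333333335,
    },
    "GSM8K": {
        "rewritten": [0.21971190295678544, 0.2227445034116755, 0.2172858225928734],
        "original": 0.23290371493555728,
    },
}
this_model = {  # meno-tiny-0.1
    "MATH": {
        "rewritten": [0.3384666666666667, 0.3502666666666667, 0.3504666666666667],
        "original": 0.3519333333333334,
    },
    "GSM8K": {
        "rewritten": [0.23320697498104626, 0.2400303260045489, 0.23290371493555728],
        "original": 0.26277482941622443,
    },
}


def gap(scores):
    """Original-test score minus the mean score on the rewritten tests."""
    rewritten_mean = sum(scores["rewritten"]) / len(scores["rewritten"])
    return scores["original"] - rewritten_mean


report = {}
for bench in base:
    report[bench] = {
        # How much higher this model scores than the base on the original test.
        "delta_original": this_model[bench]["original"] - base[bench]["original"],
        "base_gap": gap(base[bench]),
        "this_model_gap": gap(this_model[bench]),
    }
    print(bench, {k: round(v, 4) for k, v in report[bench].items()})
```

On the original test sets, this model scores roughly 0.081 higher than the base on MATH and roughly 0.030 higher on GSM8K.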

The reproduction is simple:

Get https://github.com/GAIR-NLP/benbench
Modify the src/script to use the model, and set the test to math or gsm8k.
Run it and collect the results.
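For anyone unfamiliar with the metric, here is a toy sketch of n-gram accuracy as I understand it: pick a few start positions in a test sample, give the model everything before that position, and check whether it reproduces the next n tokens exactly. This is not benbench's code. The real tool runs an actual LLM with its own tokenizer; `ngram_accuracy`, `make_memorizer`, the whitespace tokenization, and the example sentences are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence


def ngram_accuracy(
    tokens: Sequence[str],
    predict: Callable[[Sequence[str], int], List[str]],
    n: int = 5,
    num_starts: int = 5,
) -> float:
    """Fraction of sampled positions where the next n tokens are reproduced exactly."""
    last_start = len(tokens) - n  # latest position that still leaves n target tokens
    if last_start < 1:
        return 0.0
    # Evenly spaced start positions in [1, last_start], deduplicated for short texts.
    starts = sorted(
        {1 + (i * (last_start - 1)) // max(num_starts - 1, 1) for i in range(num_starts)}
    )
    hits = 0
    for s in starts:
        prefix, target = tokens[:s], list(tokens[s:s + n])
        if predict(prefix, n) == target:
            hits += 1
    return hits / len(starts)


def make_memorizer(memorized: Sequence[str]):
    """Toy stand-in for a model that has seen `memorized` verbatim (assumption)."""
    def predict(prefix: Sequence[str], n: int) -> List[str]:
        k = len(prefix)
        # Continue from memory only if the prefix matches what was memorized.
        if list(memorized[:k]) == list(prefix):
            return list(memorized[k:k + n])
        return []
    return predict


# Made-up sample and a paraphrase of it (whitespace tokens, for illustration only).
original = "A train travels 60 miles in 2 hours so its average speed is 30 miles per hour".split()
rewritten = "A bus covers 60 miles in 2 hours so its average speed is 30 miles per hour".split()

model = make_memorizer(original)
score_original = ngram_accuracy(original, model)
score_rewritten = ngram_accuracy(rewritten, model)
print(score_original, score_rewritten)
```

The toy "model" only does well on text it has memorized, which is the intuition behind comparing the original test set against rewritten versions of it.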

bond005

Owner Feb 18, 2025

@fblgit

Hi!

Thank you for your comment. However, I didn't use nvidia/OpenMathInstruct-2 for training.

My training dataset consisted of many separate datasets in Russian and English, which can be divided into three groups:

Fully synthetic datasets generated by me using a large model.
Datasets automatically translated from English to Russian, focused on solving mathematical and logical problems.
Russian-language datasets derived from NLP tasks for the Russian language (paraphrasing, summarization, etc.).

In the second group, there were the TIGER-Lab/MathInstruct and KK04/LogicInference_OA datasets, which I translated into Russian using NLLB-200-3.3B, followed by automated error checking and translation hallucination detection. However, the TIGER-Lab/MathInstruct dataset and the nvidia/OpenMathInstruct-2 dataset are different datasets, as far as I understand, even though they belong to the same subject area.

fblgit

Feb 18, 2025

Regardless of which datasets were part of your training, the contamination of the model weights is a fact.

Mintik24

Apr 16, 2025

So this is for summarizing and checking the model's accuracy?
