GPT2 Reproduce results with lm-evaluation-harness

#90

by david5819 - opened May 7, 2024

May 7, 2024

I'm trying to reproduce the Score Card results using the lm-evaluation-harness. Based on this comment, https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/60#648c567bb010e9fed5f92328, I ran the command below on commit 441e6ac1 of the lm-evaluation-harness repository. The evaluation results I get do not match the Score Card results for LAMBADA.

python main.py \
    --model=hf-causal-experimental \
    --model_args="pretrained=gpt2,use_accelerate=True" \
    --tasks=lambada_openai \
    --num_fewshot=0 \
    --batch_size=2 \
    --output_path=output

My results on commit 441e6ac1:

{
  "results": {
    "lambada_openai": {
      "ppl": 40.05542021199565,
      "ppl_stderr": 1.4880684857031479,
      "acc": 0.32563555210556955, (should be 45.99%)
      "acc_stderr": 0.00652867895783546
    }
  },
  "versions": {
    "lambada_openai": 0
  },
  "config": {
    "model": "hf-causal-experimental",
    "model_args": "pretrained=gpt2,use_accelerate=True",
    "num_fewshot": 0,
    "batch_size": "2",
    "device": null,
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}

My results on commit b281b09 (the commit listed in the About tab on https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
I also deleted lm_cache between runs to make sure these results were not cached from previous runs.

{
  "results": {
    "lambada_openai": {
      "ppl": 40.05542021199565,
      "ppl_stderr": 1.4880684857031479,
      "acc": 0.32563555210556955, (should be 45.99%)
      "acc_stderr": 0.00652867895783546
    }
  },
  "versions": {
    "lambada_openai": 0
  },
  "config": {
    "model": "hf-causal-experimental",
    "model_args": "pretrained=gpt2,use_accelerate=True",
    "num_fewshot": 0,
    "batch_size": "2",
    "batch_sizes": [],
    "device": null,
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}

But the Score Card (https://huggingface.co/openai-community/gpt2) reports 45.99% acc for LAMBADA.

I did manage to reproduce the ARC Challenge results for GPT2 (using the same commit as above):

Command:

python main.py \
     --model=hf-causal-experimental \
     --model_args="pretrained=gpt2,use_accelerate=True" \
     --tasks=arc_challenge \
     --num_fewshot=25 \
     --batch_size=2 \
     --output_path=output

My results

 "results": {
    "arc_challenge": {
      "acc": 0.20051194539249148,
      "acc_stderr": 0.011700318050499373,
      "acc_norm": 0.21928327645051193,
      "acc_norm_stderr": 0.012091245787615723
    }

Results from https://huggingface.co/datasets/open-llm-leaderboard/details_gpt2

"harness|arc:challenge|25": {
        "acc": 0.197098976109215,
        "acc_stderr": 0.011625047669880633,
        "acc_norm": 0.22013651877133106,
        "acc_norm_stderr": 0.01210812488346097
    },
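To put the two comparisons on the same footing, here is a rough sketch that expresses each gap in units of standard error, using only the numbers quoted above. The helper name `z_score` is made up for illustration. It treats the two runs as independent and, because the Score Card gives no standard error for LAMBADA, assumes zero for that side, so read it as a sanity check rather than a proper significance test.

```python
import math


def z_score(acc_a, acc_b, stderr_a, stderr_b=0.0):
    """Gap between two accuracies in units of their combined standard error.

    Hypothetical helper: assumes the two estimates are independent,
    which is only approximately true for runs on the same test set.
    """
    return (acc_a - acc_b) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)


# ARC Challenge (25-shot): my run vs. the leaderboard details dataset.
arc_z = z_score(
    0.20051194539249148, 0.197098976109215,
    0.011700318050499373, 0.011625047669880633,
)

# LAMBADA (0-shot): my run vs. the 45.99% on the Score Card.
# The Score Card reports no stderr, so 0.0 is assumed for that side.
lambada_z = z_score(0.32563555210556955, 0.4599, 0.00652867895783546)

print(f"ARC Challenge gap: {arc_z:+.2f} standard errors")
print(f"LAMBADA gap:       {lambada_z:+.2f} standard errors")
```

By this measure the ARC Challenge gap is well under one standard error, while the LAMBADA gap is more than twenty, so the LAMBADA mismatch looks like a difference in evaluation setup rather than run-to-run noise.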

Can anyone share the commands for reproducing the GPT2 score card results?

bedio

Oct 17, 2024

Have you figured out how to reproduce the results?
I want to reproduce the results too, including the WikiText results.
