license: apache-2.0
library_name: peft
tags:
  - nlp
  - code
  - instruct
  - llama
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
datasets:
  - Intel/orca_dpo_pairs
model-index:
  - name: Llama-3_1-8B-Instruct-orca-ORPO
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 22.73
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=monsterapi/Llama-3_1-8B-Instruct-orca-ORPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 1.34
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=monsterapi/Llama-3_1-8B-Instruct-orca-ORPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 0
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=monsterapi/Llama-3_1-8B-Instruct-orca-ORPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 0
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=monsterapi/Llama-3_1-8B-Instruct-orca-ORPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 3.06
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=monsterapi/Llama-3_1-8B-Instruct-orca-ORPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 1.86
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=monsterapi/Llama-3_1-8B-Instruct-orca-ORPO
          name: Open LLM Leaderboard

Finetuning Overview:

Model Used: meta-llama/Meta-Llama-3.1-8B-Instruct
Dataset: Intel/orca_dpo_pairs

Dataset Insights:

The Intel/orca_dpo_pairs dataset is a specialized subset of the OpenOrca dataset, which contains ~1M GPT-4 completions and ~3.2M GPT-3.5 completions, tabularized to align with the distributions in the Orca paper. It targets preference optimization: each prompt is paired with a chosen (good) and a rejected (bad) response, making it suitable for training and evaluating preference-tuned language models.
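A preference-pair record in this dataset carries a system prompt, a question, and a chosen/rejected response pair. The sketch below illustrates that shema; the field names follow the dataset's published columns (system, question, chosen, rejected), but the example text and the `is_valid_pair` helper are invented here for illustration.

```python
# Illustrative record mirroring the Intel/orca_dpo_pairs schema.
# The text values are made up; only the field names come from the dataset.
example_pair = {
    "system": "You are a helpful assistant.",
    "question": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",  # preferred response
    "rejected": "France's capital is Lyon.",      # dispreferred response
}

def is_valid_pair(record: dict) -> bool:
    """Check that a record has the fields a preference optimizer needs."""
    required = {"system", "question", "chosen", "rejected"}
    return required.issubset(record) and record["chosen"] != record["rejected"]

print(is_valid_pair(example_pair))
```

A preference trainer consumes exactly this chosen/rejected contrast; records where the two responses are identical carry no preference signal.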

Finetuning Details:

This finetuning run was performed using MonsterAPI's LLM finetuner with ORPO (Odds Ratio Preference Optimization), which folds preference optimization directly into the supervised finetuning objective.

  • Completed in 1 hour and 39 minutes for 1 epoch.
  • Total cost: $2.69.

Hyperparameters & Additional Details:

  • Epochs: 1
  • Cost Per Epoch: $2.69
  • Total Finetuning Cost: $2.69
  • Model Path: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Learning Rate: 0.001
  • Data Split: 90% train 10% validation
  • Gradient Accumulation Steps: 16
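ORPO adds an odds-ratio penalty to the usual supervised loss: with odds(p) = p / (1 - p), it maximizes log sigmoid of the log odds ratio between the chosen and rejected responses. The sketch below is a minimal scalar illustration of that term from sequence log-probabilities; the function names and the λ weighting default are assumptions for illustration, not MonsterAPI's implementation.

```python
import math

def log_odds(logp: float) -> float:
    """log odds(p) = log(p / (1 - p)), computed from a log-probability."""
    return logp - math.log1p(-math.exp(logp))

def orpo_preference_term(logp_chosen: float, logp_rejected: float) -> float:
    """-log sigmoid(log odds(chosen) - log odds(rejected)), computed stably."""
    log_or = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log sigmoid(x) == log(1 + exp(-x))
    return math.log1p(math.exp(-log_or))

def orpo_loss(nll_chosen: float, logp_chosen: float,
              logp_rejected: float, lam: float = 0.1) -> float:
    """Supervised NLL on the chosen response plus the weighted OR penalty."""
    return nll_chosen + lam * orpo_preference_term(logp_chosen, logp_rejected)
```

The penalty shrinks as the model assigns higher odds to the chosen response than to the rejected one, so minimizing it pushes the two apart while the NLL term keeps the model anchored on the chosen text.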

Open LLM Leaderboard Evaluation Results

Detailed results can be found on the Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=monsterapi/Llama-3_1-8B-Instruct-orca-ORPO

| Metric              | Value |
|---------------------|------:|
| Avg.                |  4.83 |
| IFEval (0-Shot)     | 22.73 |
| BBH (3-Shot)        |  1.34 |
| MATH Lvl 5 (4-Shot) |  0.00 |
| GPQA (0-shot)       |  0.00 |
| MuSR (0-shot)       |  3.06 |
| MMLU-PRO (5-shot)   |  1.86 |
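The reported average is the unweighted mean of the six benchmark scores above, which can be verified directly:

```python
# Scores from the table above; Avg. is their unweighted mean.
scores = {
    "IFEval (0-Shot)": 22.73,
    "BBH (3-Shot)": 1.34,
    "MATH Lvl 5 (4-Shot)": 0.00,
    "GPQA (0-shot)": 0.00,
    "MuSR (0-shot)": 3.06,
    "MMLU-PRO (5-shot)": 1.86,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # → 4.83
```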