language:
  - en
license: mit
library_name: adapter-transformers
datasets:
  - argilla/distilabel-intel-orca-dpo-pairs
  - jondurbin/truthy-dpo-v0.1
  - argilla/distilabel-math-preference-dpo
  - argilla/distilabel-capybara-dpo-7k-binarized
base_model: Technoculture/MT7Bi-sft
model-index:
  - name: MT7Bi-alpha-dpo-v0.2
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 54.69
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-alpha-dpo-v0.2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 75.89
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-alpha-dpo-v0.2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 52.82
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-alpha-dpo-v0.2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 45.48
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-alpha-dpo-v0.2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 71.59
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-alpha-dpo-v0.2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 25.93
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-alpha-dpo-v0.2
          name: Open LLM Leaderboard

Technoculture/MT7Bi-alpha-dpo-v0.2

Open LLM Leaderboard

Model Name	ARC	HellaSwag	MMLU	TruthfulQA	Winogrande	GSM8K
Orca-2-7b	78.4	76.1	53.7	52.4	74.2	47.2
LLAMA-2-7b	43.2	77.1	44.4	38.7	69.5	16
MT7Bi-sft	54.1	75.11	-	43.08	72.14	15.54
MT7Bi-alpha-dpo-v0.2	54.69	75.89	52.82	45.48	71.59	25.93
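
The effect of DPO on the base model can be read off the table as a per-benchmark difference. The sketch below computes it, using the MT7Bi-sft row from the table and the MT7Bi-alpha-dpo-v0.2 scores reported in the model-index metadata. MMLU is left out because no MT7Bi-sft score is listed for it. The variable names are illustrative only.

```python
# Scores for the SFT base model, copied from the comparison table.
# MMLU is omitted: the table lists no MMLU value for MT7Bi-sft.
sft = {
    "ARC": 54.1,
    "HellaSwag": 75.11,
    "TruthfulQA": 43.08,
    "Winogrande": 72.14,
    "GSM8K": 15.54,
}

# Scores for the DPO model, as reported in the model-index metadata.
dpo = {
    "ARC": 54.69,
    "HellaSwag": 75.89,
    "TruthfulQA": 45.48,
    "Winogrande": 71.59,
    "GSM8K": 25.93,
}

# Positive values mean the DPO model scored higher than the SFT model.
deltas = {name: round(dpo[name] - sft[name], 2) for name in sft}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

On these numbers the largest change is on GSM8K, while Winogrande is the only benchmark where the DPO model scores slightly lower.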

Training Details

GPU: Nvidia A100 Tensor Core GPU
Total Batches: 4266
Epochs: 3
Duration: 3 hours, 59 minutes, and 55 seconds
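
As a rough sanity check, the figures above give an average time per batch. This is a minimal sketch that assumes "Total Batches" counts every batch across all three epochs and that the duration covers the whole run. Neither assumption is stated in the card.

```python
from datetime import timedelta

# Values copied from the training details above.
total_batches = 4266
epochs = 3
duration = timedelta(hours=3, minutes=59, seconds=55)

# Assumption: total_batches spans all epochs, so each epoch has an equal share.
batches_per_epoch = total_batches / epochs

# Average wall-clock time per batch over the whole run.
seconds_per_batch = duration.total_seconds() / total_batches

print(f"batches per epoch: {batches_per_epoch:.0f}")
print(f"seconds per batch: {seconds_per_batch:.2f}")
```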

DPO Training Dataset Mixture

Dataset Name	Original Size (Rows)	Ratio	Size After Ratio (Rows)
argilla/distilabel-math-preference-dpo	2.4k	1.0	2.4k
argilla/distilabel-intel-orca-dpo-pairs	12.9k	0.5	6.45k
jondurbin/truthy-dpo-v0.1	1.04k	1.0	1.04k
argilla/distilabel-capybara-dpo-7k-binarized	7.5k	0.2	1.5k
Total Size: 11.38k
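
The "Size After Ratio" column is the original row count multiplied by the sampling ratio. The sketch below reproduces that arithmetic from the rounded figures in the table. Because the inputs are rounded, the sum comes out at about 11.39k rather than the stated 11.38k. The difference presumably reflects the exact row counts, which are not given here.

```python
# (dataset name, approximate original rows, sampling ratio), as listed in the table.
mixture = [
    ("argilla/distilabel-math-preference-dpo", 2400, 1.0),
    ("argilla/distilabel-intel-orca-dpo-pairs", 12900, 0.5),
    ("jondurbin/truthy-dpo-v0.1", 1040, 1.0),
    ("argilla/distilabel-capybara-dpo-7k-binarized", 7500, 0.2),
]

# Rows kept from each dataset after applying its ratio.
sampled = {name: round(rows * ratio) for name, rows, ratio in mixture}
total = sum(sampled.values())

for name, rows in sampled.items():
    print(f"{name}: {rows} rows ({rows / total:.1%} of the mixture)")
print(f"total: {total} rows")
```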

Training Loss Plot

Training Loss Smoothed Plot

For full details of this DPO training run, see our notebook.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	54.40
AI2 Reasoning Challenge (25-Shot)	54.69
HellaSwag (10-Shot)	75.89
MMLU (5-Shot)	52.82
TruthfulQA (0-shot)	45.48
Winogrande (5-shot)	71.59
GSM8k (5-shot)	25.93
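
The "Avg." row is the unweighted mean of the six benchmark scores. A short check, with the values copied from the table above:

```python
# Benchmark scores copied from the results table.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 54.69,
    "HellaSwag (10-Shot)": 75.89,
    "MMLU (5-Shot)": 52.82,
    "TruthfulQA (0-shot)": 45.48,
    "Winogrande (5-shot)": 71.59,
    "GSM8k (5-shot)": 25.93,
}

# Unweighted mean across all six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"Avg.: {average:.2f}")
```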