Adding Evaluation Results

55970a4 verified about 2 years ago

5.27 kB

language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - code
datasets:
  - teknium/OpenHermes-2.5
  - TokenBender/python_eval_instruct_51k
  - codefuse-ai/Evol-instruction-66k
pipeline_tag: text-generation
model-index:
  - name: SpeechlessCoder
    results:
      - task:
          type: text-generation
        dataset:
          name: HumanEval
          type: openai_humaneval
        metrics:
          - type: pass@1
            value: 0
            name: pass@1
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 59.39
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-mistral-hermes-code-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 78.55
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-mistral-hermes-code-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 59.88
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-mistral-hermes-code-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 51.26
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-mistral-hermes-code-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 77.27
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-mistral-hermes-code-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 40.64
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-mistral-hermes-code-7b
          name: Open LLM Leaderboard

speechless-mistral-hermes-code-7b

Code: https://github.com/uukuguy/speechless

Use the following dataset to fine-tune mistralai/Mistral-7B-v0.1 in order to improve the model's reasoning and planning abilities.

Total 986k samples.

teknium/OpenHermes-2.5
TokenBender/python_eval_instruct_51k
Spider
codefuse-ai/Evol-instruction-66k

How to Prompt the Model

This model accepts the Alpaca instruction format.

For example:

You are an intelligent programming assistant.

### Instruction:
Implement a linked list in C++

### Response:

HumanEval

Metric	Value
humaneval-python

lm-evaluation-harness

{'ARC (acc_norm)': ,
'HellaSwag (acc_norm)': ,
'MMLU (acc)': ,
'TruthfulQA (mc2)': ,
'Winoground (acc)': ,
'GSM8K (acc)': ,
'DROP (f1)': ,
'Open LLM Score': }

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.
ARC (25-shot)
HellaSwag (10-shot)
MMLU (5-shot)
TruthfulQA (0-shot)
Winogrande (5-shot)
GSM8K (5-shot)
DROP (3-shot)

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	61.16
AI2 Reasoning Challenge (25-Shot)	59.39
HellaSwag (10-Shot)	78.55
MMLU (5-Shot)	59.88
TruthfulQA (0-shot)	51.26
Winogrande (5-shot)	77.27
GSM8k (5-shot)	40.64