Update metrics table for new qiskit mistral
#8
by tidealwari - opened
README.md
CHANGED
@@ -116,10 +116,10 @@ We advise adding the `rope_scaling` configuration only when processing long cont
 
 | **Model** | **QiskitHumanEval-Hard** | **QiskitHumanEval** | **HumanEval** | **ASDiv** | **MathQA** | **SciQ** | **MBPP** | **IFEval** | **CrowsPairs (English)** | **TruthfulQA (MC1 acc)** |
 |-----------|---------------------------|----------------------|---------------|-----------|------------|----------|----------|------------|---------------------------|---------------------------|
-| **qwen2.5-coder-14b-qiskit** |
-| mistral-small-3.2-24b-qiskit |
+| **qwen2.5-coder-14b-qiskit** | 25.17 | **49.01** | **91.46** | **4.21** | **53.90** | 97.00 | **77.60** | 49.64 | 65.18 | 37.82 |
+| mistral-small-3.2-24b-qiskit | **32.45** | 47.02 | 77.49 | 3.77 | 49.68 | **97.50** | 64.00 | 48.44 | 67.08 | 39.41 |
 | granite-3.3-8b-qiskit | 14.57 | 27.15 | 62.80 | 0.48 | 38.66 | 93.30 | 52.40 | 59.71 | **59.75** | 39.05 |
-| granite-3.2-8b-qiskit | 9.93 | 24.50 | 57.32 | 0.09 | 41.41 | 96.30 | 51.80 | **60.79** | 66.79 | 40.51 |
+| granite-3.2-8b-qiskit | 9.93 | 24.50 | 57.32 | 0.09 | 41.41 | 96.30 | 51.80 | **60.79** | 66.79 | **40.51** |
 
 *Note: All models listed in the benchmark table were evaluated using their respective system prompt, defined in their Hugging Face model.*
 