Qiskit/Qwen2.5-Coder-14B-Qiskit

@@ -116,10 +116,10 @@ We advise adding the `rope_scaling` configuration only when processing long cont
 | **Model** | **QiskitHumanEval-Hard** | **QiskitHumanEval** | **HumanEval** | **ASDiv** | **MathQA** | **SciQ** | **MBPP** | **IFEval** | **CrowsPairs (English)** | **TruthfulQA (MC1 acc)** |
 |-----------|---------------------------|----------------------|---------------|-----------|------------|----------|----------|------------|---------------------------|---------------------------|
-| **qwen2.5-coder-14b-qiskit** | **25.17** | **49.01** | **91.46** | 4.21 | **53.90** | **97.00** | **77.60** | 49.64 | 65.18 | 37.82 |
-| mistral-small-3.2-24b-qiskit | 20.53 | 40.39 | 77.49 | **20.69** | 53.40 | 96.40 | 63.40 | 31.66 | 67.56 | **42.84** |
 | granite-3.3-8b-qiskit | 14.57 | 27.15 | 62.80 | 0.48 | 38.66 | 93.30 | 52.40 | 59.71 | **59.75** | 39.05 |
-| granite-3.2-8b-qiskit | 9.93 | 24.50 | 57.32 | 0.09 | 41.41 | 96.30 | 51.80 | **60.79** | 66.79 | 40.51 |
 *Note: All models listed in the benchmark table were evaluated using their respective system prompt, defined in their Hugging Face model.*

 | **Model** | **QiskitHumanEval-Hard** | **QiskitHumanEval** | **HumanEval** | **ASDiv** | **MathQA** | **SciQ** | **MBPP** | **IFEval** | **CrowsPairs (English)** | **TruthfulQA (MC1 acc)** |
 |-----------|---------------------------|----------------------|---------------|-----------|------------|----------|----------|------------|---------------------------|---------------------------|
+| **qwen2.5-coder-14b-qiskit** | 25.17 | **49.01** | **91.46** | **4.21** | **53.90** | 97.00 | **77.60** | 49.64 | 65.18 | 37.82 |
+| mistral-small-3.2-24b-qiskit | **32.45** | 47.02 | 77.49 | 3.77 | 49.68 | **97.50** | 64.00 | 48.44 | 67.08 | 39.41 |
 | granite-3.3-8b-qiskit | 14.57 | 27.15 | 62.80 | 0.48 | 38.66 | 93.30 | 52.40 | 59.71 | **59.75** | 39.05 |
+| granite-3.2-8b-qiskit | 9.93 | 24.50 | 57.32 | 0.09 | 41.41 | 96.30 | 51.80 | **60.79** | 66.79 | **40.51** |
 *Note: All models listed in the benchmark table were evaluated using their respective system prompt, defined in their Hugging Face model.*