Automated MNLP evaluation report (2026-05-03)
#1
by zechen-nlp - opened
- EVAL_REPORT.md +114 -0
EVAL_REPORT.md
ADDED
@@ -0,0 +1,114 @@
# Automated MNLP evaluation report

- **Model repo:** [`cs-552-2026-taadmin/general_knowledge_model`](https://huggingface.co/cs-552-2026-taadmin/general_knowledge_model)
- **Owner(s):** group **taadmin**
- **Generated at:** 2026-05-03T02:20:59+00:00 (UTC)
- **Pipeline:** [mnlp-project-ci](https://github.com/eric11eca/mnlp-project-ci)

_This PR is opened automatically by the course CI. It is **non-blocking**: you do not need to merge it. The next nightly run will refresh this file._

## Summary

| Benchmark | Metric | Accuracy | # problems | Status |
|---|---|---:|---:|---|
| Math | `pass@8` | n/a | n/a | not run |
| Knowledge | `pass@1` | 0.1500 | 20 | ok |
| Multilingual | `pass@1` | n/a | n/a | not run |
| Safety | `pass@1` | n/a | n/a | not run |

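The report does not say how the `pass@k` metrics in the summary table are computed. As a rough illustration only, the sketch below uses the commonly cited unbiased pass@k estimator, `1 - C(n-c, k) / C(n, k)`, averaged over problems. The function names and the per-problem counts are hypothetical. The 3-of-20 split is an assumption inferred from the reported 0.1500 accuracy over 20 problems with one completion each, not a figure taken from the pipeline.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions judged correct, k: budget.
    Assumption: this is the standard estimator, not necessarily the
    formula used by the course CI.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one correct completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int, k: int) -> float:
    """Mean pass@k over all problems of a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical outcomes: 20 problems, one completion each, 3 solved.
counts = [1] * 3 + [0] * 17
print(f"{benchmark_score(counts, n=1, k=1):.4f}")  # 0.1500
```

With one completion per problem, `pass@1` reduces to the plain fraction of problems answered correctly, which is why a 20-problem run can only produce accuracies in steps of 0.05.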
## Sample completions

### Knowledge

**Correct** (2 shown)

- **prompt** (truncated): `Suppose a spin-half particle is prepared in the state \(0.5|\uparrow\rangle+\frac{\sqrt{3}}{2}|\downarrow\rangle\). Given that \(|\uparrow\rangle\) and \(|\downarrow\rangle\) are eigenstates of \(\sigma_z\), determine the expectation value, rounded to one decimal place, of \(10\sigma_z+5\sigma_x\). Here, the symbols have their usual meanings.

Choices:
A. 1.65
B. -0.7
C. 0.85
D. -1.4`
- **reference**: `B`
- **extracted answers** (1/1 correct, ✓):
  - `B`
- **raw completions**:
  - completion 1:
    ```text
    <think>
    Okay, so I need to find the expectation value of the operator $10\sigma_z + 5\sigma_x$ for a spin-half particle in the state $0.5|\uparrow\rangle + \frac{\sqrt{3}}{2}|\downarrow\rangle$. Let me start by recalling some basic concepts about quantum mechanics and spin operators.

    First, the Pauli matrices $\sigma_z$ and $\sigma_x$ are the matrices for the spin operators in the z and x directions, respectively. For a spin-1/2 particle, the eigenstates of $\sigma_z$ are $|\uparrow\rangle$ and $|\downarrow\rangle$, which are the basis states. The expectation value of an operator is calculated by taking the inner product of the state with the operator multiplied by the state, right?

    So, th...
    ```

- **prompt** (truncated): `Which one of the following statements regarding the molecular biology of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS‑CoV‑2) is not correct?

Choices:
A. Programmed ribosomal frameshifting creates two polyproteins near to 5` end of the genome by moving back by 1 nucleotide with the help of slippery nucleotides, and pseudoknot. The SARS-CoV-2 programmed ribosomal frameshifting mostly has t...`
- **reference**: `C`
- **extracted answers** (1/1 correct, ✓):
  - `C`
- **raw completions**:
  - completion 1:
    ```text
    <think>
    Okay, let's tackle this question. So, the task is to find which statement about the molecular biology of SARS-CoV-2 is not correct. The options are A, B, C, D. Let me go through each one step by step.

    Starting with option A: It talks about programmed ribosomal frameshifting in SARS-CoV-2. The statement says that it creates two polyproteins near the 5' end by moving back by 1 nucleotide with slippery nucleotides and pseudoknots. Also, it mentions that the SARS-CoV-2 frameshifting is similar to SARS-CoV. I remember that both SARS-CoV and SARS-CoV-2 use similar mechanisms for frameshifting, like the -1 frameshift signals. So this seems correct. So A might be correct.

    Option B: It says...
    ```
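The reference answer `B` for the spin-half prompt above can be checked by hand. For a state with real amplitudes `a` (up) and `b` (down), the expectation of `sigma_z` is `a^2 - b^2` and the expectation of `sigma_x` is `2ab`, both divided by the norm `a^2 + b^2`. The sketch below is an independent check written for this report, not part of the evaluated model's completion or of the CI pipeline, and the function name is hypothetical.

```python
import math


def expectation_real_spinor(a: float, b: float, cz: float, cx: float) -> float:
    """<psi| cz*sigma_z + cx*sigma_x |psi> for |psi> = a|up> + b|down>.

    Assumes real amplitudes a and b; the state is normalised here so the
    caller does not have to do it.
    """
    norm = a * a + b * b
    sigma_z = (a * a - b * b) / norm  # +1 weight on up, -1 on down
    sigma_x = (2.0 * a * b) / norm    # off-diagonal coupling of up and down
    return cz * sigma_z + cx * sigma_x


value = expectation_real_spinor(0.5, math.sqrt(3) / 2, cz=10, cx=5)
print(round(value, 1))  # -0.7
```

Numerically this is `10 * (-0.5) + 5 * 0.866 = -0.67`, which rounds to choice B.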

**Incorrect** (2 shown)

- **prompt** (truncated): `An unknown organic compound has been synthesized, and its 1H NMR spectrum gave the following data:
1H NMR: chemical reference (ppm): 7.0 (1H, d, J = 16.0 Hz), 5.5 (1H, dq), 2.1 (3H, s), 1.6 (3H, d)
Determine which of the following compounds it is: Cis-propenyl acetate, Trans-propenyl acetate, Cis-butenyl acetate, or Trans-butenyl acetate.

Choices:
A. Cis-propenyl acetate
B. Cis-butenyl acetate
C....`
- **reference**: `D`
- **extracted answers** (0/1 correct, ✗):
  - `<no answer>`
- **raw completions**:
  - completion 1:
    ```text
    <think>
    Okay, let's try to figure out which compound this is. The NMR data given is for an unknown organic compound with the following peaks:

    - 7.0 (1H, d, J=16.0 Hz)
    - 5.5 (1H, dq)
    - 2.1 (3H, s)
    - 1.6 (3H, d)

    First, I need to recall what each of these NMR peaks represents. The 2.1 ppm peak is a singlet at 2.1, which is probably a methyl group (since it's a singlet and the rest of the molecule is probably not splitting it). The 1.6 ppm peak is a doublet, but wait, the third peak is 1.6 (3H, d). Wait, the 1.6 is a doublet? Wait, no. Wait, the data is:

    Wait, the user wrote:

    "1H NMR: chemical reference (ppm): 7.0 (1H, d, J = 16.0 Hz), 5.5 (1H, dq), 2.1 (3H, s), 1.6 (3H, d)"

    So the peaks ar...
    ```

- **prompt** (truncated): `Which of the following starting material(s) would be used to prepare methyl 2-propyl-1,2,4a,5,6,7,8,8a-octahydronaphthalene-1-carboxylate?

Choices:
A. methyl (2E,4E,10Z)-tetradeca-2,4,10-trienoate
B. methyl (2E,8E,10E)-tetradeca-2,8,10-trienoate
C. Cyclohexene and methyl 2,3-dimethylenehexanoate
D. 1-vinylcyclohex-1-ene and methyl hex-2-ynoate`
- **reference**: `B`
- **extracted answers** (0/1 correct, ✗):
  - `<no answer>`
- **raw completions**:
  - completion 1:
    ```text
    <think>
    Okay, let's try to figure out which starting material is used to prepare methyl 2-propyl-1,2,4a,5,6,7,8,8a-octahydronaphthalene-1-carboxylate. Hmm, first, I need to understand the structure of the compound in question.

    The name given is methyl 2-propyl-1,2,4a,5,6,7,8,8a-octahydronaphthalene-1-carboxylate. Let me break that down. The parent ring is naphthalene, which has two fused benzene rings. The numbering is 1,2,4a,5,6,7,8,8a. So, the structure is a naphthalene ring with eight hydrogen atoms, but here it's substituted with certain groups. The carboxylate group is at position 1, and there are methyl and propyl groups on positions 2 and 3. Wait, the name mentions "2-propyl" so may...
    ```
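Both incorrect samples show `<no answer>` as the extracted answer. The report does not describe the extraction rule, and the completions are truncated, so the cause cannot be read off this file. Purely as an illustration of how such a marker might arise, the sketch below assumes a hypothetical extractor that ignores everything inside the `<think>` block and looks for a final "answer is X" style statement. The function name, the regular expression, and the rule itself are assumptions, not the pipeline's code.

```python
import re

NO_ANSWER = "<no answer>"

# Hypothetical pattern: "answer is B", "Answer: B", "answer (B)".
_ANSWER_RE = re.compile(r"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([A-D])\b")


def extract_choice(completion: str) -> str:
    """Return the last stated choice letter, or NO_ANSWER.

    Assumed rule: reasoning inside <think>...</think> is ignored, and a
    <think> block that is never closed counts as having no final answer.
    """
    if "<think>" in completion and "</think>" not in completion:
        return NO_ANSWER
    final_part = completion.split("</think>")[-1]
    letters = _ANSWER_RE.findall(final_part)
    return letters[-1] if letters else NO_ANSWER


print(extract_choice("<think>Some reasoning.</think>\nThe answer is B."))  # B
print(extract_choice("<think>Okay, let's try to figure out"))  # <no answer>
```

Under this assumed rule, a completion that runs out of tokens while still reasoning would score as incorrect regardless of how close the reasoning was to the reference.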