zechen-nlp committed on
Commit
60416ec
·
verified ·
1 Parent(s): 598f987

Update Automated MNLP evaluation report (2026-05-22)

Files changed (1)
  1. EVAL_REPORT.md +12 -87
EVAL_REPORT.md CHANGED
@@ -2,7 +2,7 @@
2
 
3
  - **Model repo:** [`cs-552-2026-ChatMODS/general_knowledge_model`](https://huggingface.co/cs-552-2026-ChatMODS/general_knowledge_model)
4
  - **Owner(s):** group **ChatMODS**
5
- - **Generated at:** 2026-05-21T03:28:21+00:00 (UTC)
6
  - **Pipeline:** [mnlp-project-ci](https://github.com/eric11eca/mnlp-project-ci)
7
 
8
  _This PR is opened automatically by the course CI. It is **non-blocking** – you do not need to merge it. The next nightly run will refresh this file._
@@ -12,7 +12,7 @@ _This PR is opened automatically by the course CI. It is **non-blocking** – yo
12
  | Benchmark | Accuracy | Status |
13
  |---|---:|---|
14
  | Math | – | not run |
15
- | Knowledge | 0.1900 | ok |
16
  | Multilingual | – | not run |
17
  | Safety | – | not run |
18
 
@@ -24,101 +24,26 @@ _Prompts are intentionally omitted to avoid revealing benchmark contents. For mu
24
 
25
  **Correct** (1 shown)
26
 
27
- - **reference**: `C`
28
  - **overall** (1/1 completions correct)
29
- - **extracted** (✓): `C`
30
  - **completion**:
31
 
32
  ```text
33
- The patient presents with **intermittent incontinence** that occurs **a few days before flights** and **resolves the day after**. This pattern suggests a **stress-related** or **anxiety-induced** condition. The symptoms are **not related to physical causes**, and the **urinalysis and urine culture are normal**.
34
-
35
- ### Key Points:
36
- - **Intermittent incontinence** that is **stress-induced** or **anxiet
37
-
38
- ... [truncated 52 chars] ...
39
-
40
- triggered by travel** and **resolve after the flight**, indicating a **stressor** (flight) is the trigger.
41
- - She has **no other significant medical history**, and **oral contraceptive** is the only medication.
42
- - The **symptoms are not related to physical causes** (no urinary tract infection, structural abnormalities, etc.).
43
-
44
- ### Most Appropriate Next Step:
45
- The most appropriate next step is to **evaluate the role of stress and anxiety** in her symptoms. Since the symptoms are **stress-related**, the best approach is to **recommend behavioral therapy** or **stress management techniques**.
46
-
47
- ### Correct Answer:
48
- **C. Recommend behavioral therapy**
49
-
50
- ### Explanation:
51
- - **Behavioral therapy** (e.g., cognitive-behavioral therapy) is effective for stress-related incontinence.
52
- - **Avoiding stressful activities** (E) is a **management strategy**, but **not the most appropriate next step** in a clinical setting.
53
- - **Stopping oral contraceptives** (H) may help, but it is **not the first-line intervention** unless the incontinence is confirmed to be due to hormonal causes.
54
- - **Psychoanalytic therapy** (I) is **not evidence-based** for this type of incontinence.
55
-
56
- ### Final Answer:
57
- $$
58
- \boxed{C}
59
- $$
60
  ```
61
 
62
  **Incorrect** (1 shown)
63
 
64
- - **reference**: `B`
65
  - **overall** (0/1 completions correct)
66
- - **extracted** (✗): `\text{A}`
67
  - **completion**:
68
 
69
  ```text
70
- We are given a Lagrangian for the coupling of a **bosonic field $ X $** to **fermions**:
71
-
72
- $$
73
- \mathcal{L}(y) = -\lambda f \bar{\psi}_f(y) f(y) (\upsilon + X(y)) \psi_f(y)
74
- $$
75
-
76
- This is a **vector-like** coupling, and the **boson $ X $** is coupled to **fermions** in a way that depends on the **mass** of the boson, $ m_X = 6 \, \text{GeV} $.
77
-
78
- ---
79
-
80
- ### Step 1: Understand the decay modes
81
-
82
- The **kinemati
83
-
84
- ... [truncated 823 chars] ...
85
-
86
- 0–5000 MeV
87
- - **Leptons**: $ e, \mu, \tau $ – masses are in the range of 0.5–100 MeV
88
- - **Baryons**: $ b, s $ β€” masses are in the range of 4–100 GeV
89
-
90
- So, the **boson $ X $** with mass 6 GeV can decay into:
91
-
92
- - **Fermion pairs** with **total mass less than or equal to 6 GeV**
93
-
94
- So, the **allowed decay modes** are:
95
-
96
- - $ X \rightarrow b\bar{b} $ (mass ~4 GeV)
97
- - $ X \rightarrow s\bar{s} $ (mass ~4 GeV)
98
- - $ X \rightarrow u\bar{u} $ (mass ~5 GeV)
99
- - $ X \rightarrow d\bar{d} $ (mass ~5 GeV)
100
- - $ X \rightarrow \tau^+ \tau^- $ (mass ~2000 MeV)
101
- - $ X \rightarrow \mu^+ \mu^- $ (mass ~200 MeV)
102
- - $ X \rightarrow e^+ e^- $ (mass ~100 MeV)
103
-
104
- ---
105
-
106
- ### Step 4: Choose the correct option
107
-
108
- From the choices:
109
-
110
- - **A** includes $ b\bar{b}, s\bar{s}, u\bar{u}, d\bar{d}, \tau^+ \tau^-, \mu^+ \mu^-, e^+ e^- $
111
- - **B** includes $ c\bar{c}, s\bar{s}, u\bar{u}, d\bar{d}, \tau^+ \tau^-, \mu^+ \mu^-, e^+ e^- $
112
- - **C** includes $ c\bar{c}, s\bar{s}, u\bar{u}, d\bar{d}, t\bar{t}, \tau^+ \tau^-, \mu^+ \mu^-, e^+ e^- $
113
- - **D** includes $ b\bar{b}, s\bar{s}, u\bar{u}, d\bar{d}, \tau^+ \tau^-, e^+ e^- $
114
-
115
- Only **A** includes all **allowed decay modes** of $ X $ with mass 6 GeV.
116
-
117
- ---
118
-
119
- ### ✅ Final Answer:
120
-
121
- $$
122
- \boxed{\text{A}}
123
- $$
124
  ```
 
2
 
3
  - **Model repo:** [`cs-552-2026-ChatMODS/general_knowledge_model`](https://huggingface.co/cs-552-2026-ChatMODS/general_knowledge_model)
4
  - **Owner(s):** group **ChatMODS**
5
+ - **Generated at:** 2026-05-22T03:04:12+00:00 (UTC)
6
  - **Pipeline:** [mnlp-project-ci](https://github.com/eric11eca/mnlp-project-ci)
7
 
8
  _This PR is opened automatically by the course CI. It is **non-blocking** – you do not need to merge it. The next nightly run will refresh this file._
 
12
  | Benchmark | Accuracy | Status |
13
  |---|---:|---|
14
  | Math | – | not run |
15
+ | Knowledge | 0.2800 | ok |
16
  | Multilingual | – | not run |
17
  | Safety | – | not run |
18
 
 
24
 
25
  **Correct** (1 shown)
26
 
27
+ - **reference**: `A`
28
  - **overall** (1/1 completions correct)
29
+ - **extracted** (✓): `A`
30
  - **completion**:
31
 
32
  ```text
33
+ Let me analyze each option:
34
+ The correct answer is A. No, the school exercised its appropriate function of trying to restrict the use of vulgarity and offensive discourse in a high school setting where free speech may be limited..
35
+ \boxed{A}
36
  ```
37
 
38
  **Incorrect** (1 shown)
39
 
40
+ - **reference**: `A`
41
  - **overall** (0/1 completions correct)
42
+ - **extracted** (✗): `B`
43
  - **completion**:
44
 
45
  ```text
46
+ Let me analyze each option:
47
+ The correct answer is B.
48
+ \boxed{B}
49
  ```
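
The **extracted** values in the report appear to come from the final `\boxed{...}` in each completion, which is then compared with the **reference** letter. The snippet below is a minimal sketch of that kind of check, not the pipeline's actual code: the function names `extract_boxed` and `is_correct` are hypothetical, and stripping a `\text{...}` wrapper is an assumption (the earlier report shows `\text{A}` left unnormalized).

```python
import re
from typing import Optional

# Matches \boxed{...}, allowing one level of nested braces such as \text{A}.
BOXED_RE = re.compile(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}")
# Matches an answer wrapped entirely in \text{...}.
TEXT_RE = re.compile(r"\\text\{([^{}]*)\}")


def extract_boxed(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a completion, or None.

    Hypothetical helper: the real extractor used by the CI is not shown
    in the report.
    """
    matches = BOXED_RE.findall(completion)
    if not matches:
        return None
    answer = matches[-1].strip()
    # Assumption: unwrap \text{A} to A so it can match a bare reference letter.
    wrapped = TEXT_RE.fullmatch(answer)
    if wrapped:
        answer = wrapped.group(1).strip()
    return answer


def is_correct(completion: str, reference: str) -> bool:
    """Compare the extracted answer with the reference letter."""
    return extract_boxed(completion) == reference


if __name__ == "__main__":
    sample = "Let me analyze each option:\nThe correct answer is B.\n\\boxed{B}"
    print(extract_boxed(sample), is_correct(sample, "A"))
```

Taking the last match rather than the first is a design choice for completions that mention several boxed candidates before settling on a final answer.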