Automated MNLP evaluation report (2026-06-04)

#1
by zechen-nlp - opened
Files changed (1)
  1. EVAL_REPORT.md +87 -0
EVAL_REPORT.md ADDED
@@ -0,0 +1,87 @@
+ # Automated MNLP evaluation report
+
+ - **Model repo:** [`cs-552-2026-barn/multilingual_model`](https://huggingface.co/cs-552-2026-barn/multilingual_model)
+ - **Owner(s):** group **barn**
+ - **Generated at:** 2026-06-04T19:45:28+00:00 (UTC)
+ - **Pipeline:** [mnlp-project-ci](https://github.com/eric11eca/mnlp-project-ci)
+
+ _This PR is opened automatically by the course CI. It is **non-blocking** — you do not need to merge it. The next nightly run will refresh this file._
+
+ ## Summary
+
+ | Benchmark | Accuracy | Status |
+ |---|---:|---|
+ | Math | — | not run |
+ | Knowledge | — | not run |
+ | Multilingual | 0.2300 | ok |
+ | Safety | — | not run |
+
+ ## Sample completions
+
+ _Prompts are intentionally omitted to avoid revealing benchmark contents. For multi-completion problems, only one completion is shown per sample._
+
+ ### Multilingual
+
+ **Correct** (1 shown)
+
+ - **reference**: `A`
+ - **overall** (1/1 completions correct)
+ - **extracted** (✓): `A`
+ - **completion**:
+
+ ```text
+ <think>
+
+ </think>
+
+ A. Квалиметрической шкалой
+
+ **Обоснование:**
+ Тройка формальных объектов — это концепция, которая используется в квалиметрии (науке о количественных характеристиках). Она описывает три основные ступени или уровни, на которых можно оценивать объекты, обычно в рамках шкалирования. Эти ступени обычно включают:
+ 1. **Нулевая** (или **нулевая**),
+ 2. **Положительная**,
+ 3. **Негативная**.
+
+ Эта тройка формальных объектов называется **квалиметрической шкалой**.
+
+ Таким образом, правильный ответ:
+ $\boxed{A}$
+ ```
+
+ **Incorrect** (1 shown)
+
+ - **reference**: `A`
+ - **overall** (0/1 completions correct)
+ - **extracted** (✗): `C`
+ - **completion**:
+
+ ```text
+ <think>
+
+ </think>
+
+ The most likely cause of the patient's condition is **C. जन्मजात महाधमनी स्टेनोसिस (Congenital Patent Ductus Arteriosus, PDA)**.
+
+ ### Explanation:
+
+ - The patient is **32 weeks pregnant** and has **preeclampsia** (high blood pressure and proteinuria), which is a **high-risk pregnancy**.
+ - She has **tachycardia (132 bpm)**, **tachypnea (32 bpm)**, and **tachypnea (32 bpm)**, which
+
+ ... [truncated 453 chars] ...
+
+ vated**, and **tachycardia** is a common finding in **preeclampsia** and **cardiomegaly**.
+ - The **pulsation** and **oxygen saturation** are **normal**, but the **tachycardia and tachypnea** are consistent with **cardiomegaly** and **preeclampsia**.
+ - The **heart sounds** and **heart rate** suggest **increased cardiac workload**, which is commonly seen in **preeclampsia** and **congenital heart disease**.
+
+ ### Key clues:
+ - The **tachycardia** and **tachypnea** are **not** due to **respiratory distress** (e.g., asthma, COPD), but rather due to **cardiac disease** and **preeclampsia**.
+ - The **tachycardia** is **not** due to **anemia** or **hypothyroidism** (the patient is **not** hypothyroid, and **thyroxine** is being used).
+ - The **heart rate** is **not** due to **tachypnea** or **respiratory distress**, but due to **cardiac workload** and **preeclampsia**.
+
+ ### Conclusion:
+ The most likely **cause of the patient's condition** is **C. जन्मजात महाधमनी स्टेनोसिस (Congenital Patent Ductus Arteriosus)**, which is a **congenital heart defect** that can lead to **increased cardiac workload**, **tachycardia**, and **tachypnea** in **pregnant women** with **preeclampsia**.
+
+ $$
+ \boxed{C}
+ $$
+ ```
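
The report lists an **extracted** letter for each completion and an accuracy per benchmark, but it does not show how the pipeline derives either. The sketch below is one plausible reading, not the actual `mnlp-project-ci` logic: it assumes the answer is the letter inside the last `\boxed{...}` of a completion and that accuracy is the fraction of extracted letters that match the reference. The names `BOXED`, `extract_choice`, and `accuracy` are hypothetical and chosen for illustration only.

```python
import re
from typing import Optional, Sequence

# Assumption: the answer is a single capital letter wrapped in \boxed{...}.
# The real pipeline may accept other formats.
BOXED = re.compile(r"\\boxed\{\s*([A-Z])\s*\}")


def extract_choice(completion: str) -> Optional[str]:
    """Return the letter from the last boxed answer, or None if there is none."""
    matches = BOXED.findall(completion)
    return matches[-1] if matches else None


def accuracy(completions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of completions whose extracted letter equals the reference."""
    if not references:
        return 0.0
    correct = sum(
        extract_choice(c) == r for c, r in zip(completions, references)
    )
    return correct / len(references)


# The two final answers shown in the samples above, both with reference "A".
samples = ["... $\\boxed{A}$", "$$\n\\boxed{C}\n$$"]
refs = ["A", "A"]
print(accuracy(samples, refs))  # 0.5
```

Taking the last boxed match is a design choice for completions that mention several options before settling on one; a completion with no boxed answer counts as incorrect under this sketch.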