Commit 0d764d2 (verified) by zechen-nlp · 1 parent: 22faba1

Update Automated MNLP evaluation report (2026-05-03)

Files changed (1)
  1. EVAL_REPORT.md +1 -24
EVAL_REPORT.md CHANGED
@@ -2,7 +2,7 @@
 
  - **Model repo:** [`cs-552-2026-taadmin/group_model`](https://huggingface.co/cs-552-2026-taadmin/group_model)
  - **Owner(s):** group **taadmin**
- - **Generated at:** 2026-05-03T02:10:11+00:00 (UTC)
+ - **Generated at:** 2026-05-03T02:20:59+00:00 (UTC)
  - **Pipeline:** [mnlp-project-ci](https://github.com/eric11eca/mnlp-project-ci)
 
  _This PR is opened automatically by the course CI. It is **non-blocking** — you do not need to merge it. The next nightly run will refresh this file._
@@ -17,29 +17,6 @@ _This PR is opened automatically by the course CI. It is **non-blocking** — yo
  | Safety | `pass@1` | 0.7000 | 20 | ok |
  | **Average** | — | **0.3125** | — | — |
 
- ## Per-source breakdown
-
- ### Math
-
- | Source | Metric | Accuracy | # problems |
- |---|---|---:|---:|
- | `HuggingFaceH4/MATH-500` | `pass@8` | 0.6000 | 5 |
- | `MathArena/aime_2025` | `pass@8` | 0.5000 | 2 |
- | `MathArena/aime_2026` | `pass@8` | 0.0000 | 1 |
- | `MathArena/hmmt_feb_2026` | `pass@8` | 0.0000 | 2 |
-
- ### Knowledge
-
- | Source | Metric | Accuracy | # problems |
- |---|---|---:|---:|
- | `Idavidrein/gpqa/gpqa_diamond` | `pass@1` | 0.1500 | 20 |
-
- ### Safety
-
- | Source | Metric | Accuracy | # problems |
- |---|---|---:|---:|
- | `thu-coai/SafetyBench` | `pass@1` | 0.7000 | 20 |
-
  ## Sample completions
 
  ### Math