lthn committed on
Commit ed2c811 · verified · 1 Parent(s): f2614b2

eval(fingerprint): Global MMLU Lite EN / lemer-mlx-bf16 / 1-round

Per-question full-output fingerprint on CohereForAI/Global-MMLU-Lite config en
test split (400 questions). Single round, mlx_lm greedy, max_tokens=2048.
Full model output preserved per row in parquet column full_model_output.

Scores (n=400):
- strict letter regex: 260/400 = 65.0%
- content-aware: 274/400 = 68.5%
- no-answer: 10/400 = 2.5%

Cultural sensitivity stratification:
- CS (culturally sensitive, 200q): strict 65.5%, content 69.0%
- CA (culturally agnostic, 200q): strict 64.5%, content 68.0%
- Cultural fairness (1-|CS-CA|) = 0.990
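
The fairness figure can be reproduced from the per-stratum counts reported in the Markdown summary (131/200 and 129/200 strict, 138/200 and 136/200 content). The sketch below assumes the formula is applied to accuracies expressed as fractions in [0, 1]; the function name is illustrative and not taken from the evaluation code.

```python
def cultural_fairness(cs_correct, cs_n, ca_correct, ca_n):
    """Return 1 - |CS - CA|, with both accuracies as fractions in [0, 1]."""
    cs_acc = cs_correct / cs_n
    ca_acc = ca_correct / ca_n
    return 1.0 - abs(cs_acc - ca_acc)


# Counts taken from the stratification table in this commit.
strict = cultural_fairness(131, 200, 129, 200)
content = cultural_fairness(138, 200, 136, 200)
print(f"strict={strict:.3f} content={content:.3f}")
```

A one-point accuracy gap between the two strata gives 1 - 0.01 = 0.99 under both parsers.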

NOT 8-PAC consensus — this is fingerprint-purpose disclosure for alignment
auditing. Readers can inspect which questions the model disagrees with gold
and the full reasoning output per case. 8-PAC statistical consensus (8 rounds
paired vs base Gemma 4 E2B IT) is the follow-up.
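
A minimal sketch of the kind of audit described above, assuming each parquet row carries the `agrees_with_gold_strict`, `agrees_with_gold_content` and `full_model_output` columns named in this commit. The `question_id` field and the sample rows are hypothetical, used only to make the example self-contained.

```python
def disagreements(rows, key="agrees_with_gold_strict"):
    """Return the rows where the parsed answer does not match gold."""
    return [r for r in rows if not r[key]]


def rescued_by_content(rows):
    """Rows the strict parser marks wrong but the content fallback marks right."""
    return [
        r for r in rows
        if not r["agrees_with_gold_strict"] and r["agrees_with_gold_content"]
    ]


# Hypothetical rows for illustration. With pandas installed, real rows could
# be loaded with:
#   pd.read_parquet("eval_results/global_mmlu_lite_en.mlx.BF16.parquet").to_dict("records")
rows = [
    {"question_id": 0, "agrees_with_gold_strict": True,
     "agrees_with_gold_content": True, "full_model_output": "... B"},
    {"question_id": 1, "agrees_with_gold_strict": False,
     "agrees_with_gold_content": True, "full_model_output": "... it is Paris."},
    {"question_id": 2, "agrees_with_gold_strict": False,
     "agrees_with_gold_content": False, "full_model_output": "... C"},
]

for row in disagreements(rows):
    # Print the full reasoning for each case that disagrees with gold.
    print(row["question_id"], row["full_model_output"])
```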

Paper reference: §16 (revise benchmark toward model).

eval_results/global_mmlu_lite_en.mlx.BF16.md ADDED
@@ -0,0 +1,31 @@
+ # Global MMLU Lite EN — lemer-mlx-bf16
+
+ > **1-round fingerprint — not 8-PAC consensus.** Per-question full model output preserved for alignment-signature auditing. 8-round statistical consensus is the follow-up run.
+
+ Per-question fingerprint. Full model output preserved per row in the parquet.
+
+ ## Scores (n=400)
+
+ | Metric | Value |
+ |---|---|
+ | Strict letter regex | 260/400 = 65.0% |
+ | Content-aware fallback | 274/400 = 68.5% |
+ | No-answer | 10/400 = 2.5% |
+
+ ## Cultural sensitivity stratification
+
+ | | n | Strict | Content |
+ |---|---|---|---|
+ | CS | 200 | 131/200 = 65.5% | 138/200 = 69.0% |
+ | CA | 200 | 129/200 = 64.5% | 136/200 = 68.0% |
+ | **Cultural fairness** (1−\|CS−CA\|) | — | 0.99 | 0.99 |
+
+ ## Notes
+
+ - No coercion, no retry. Parser is regex over raw output + content-match fallback.
+ - `agrees_with_gold_strict` and `agrees_with_gold_content` are both surfaced — readers can audit.
+ - Dataset: `CohereForAI/Global-MMLU-Lite` config `en` split `test`.
+ - Model: lemer (Gemma 4 E2B + LEK, bf16 MLX reference).
+ - Sampling: mlx_lm greedy, max_tokens 2048.
+ - Timestamp: 2026-04-17T11:56:15.577413+00:00
+ - Runtime: 746s (0.54 q/s).
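
The two-stage parsing described in the Notes (a strict last-letter regex, then a content-match fallback) could look roughly like the sketch below. The actual parser is not included in this commit, so the regex, the function names and the decision to run the fallback only when no letter is found are all assumptions.

```python
import re

# Assumed pattern: a standalone capital A, B, C or D anywhere in the output.
LETTER_RE = re.compile(r"\b([ABCD])\b")


def parse_strict(output):
    """Return the last standalone A/B/C/D letter in the output, or None."""
    matches = LETTER_RE.findall(output)
    return matches[-1] if matches else None


def parse_content(output, options):
    """Fallback: return the letter whose option text is mentioned last.

    `options` maps each letter to its option text. If only one option is
    mentioned, that unique match is returned; if none is, returns None.
    """
    lowered = output.lower()
    positions = {}
    for letter, text in options.items():
        if not text:
            continue  # skip empty option text, which would match anywhere
        pos = lowered.rfind(text.lower())
        if pos >= 0:
            positions[letter] = pos
    if not positions:
        return None
    return max(positions, key=positions.get)


def parse_answer(output, options):
    """Strict letter first, then content fallback; None means no-answer."""
    return parse_strict(output) or parse_content(output, options)


options = {"A": "London", "B": "Paris", "C": "Rome", "D": "Madrid"}
print(parse_answer("The capital is Paris. Answer: B", options))
print(parse_answer("the answer is paris.", options))
print(parse_answer("i do not know", options))
```

A bare capital "A" used as an article would also match the assumed pattern, which is one reason to take the last match rather than the first.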
eval_results/global_mmlu_lite_en.mlx.BF16.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:35f87dde7df9e4c2097cd41485c0c64f703c20fd34fba994c52cdcc7fa021dca
+ size 287879
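
The three lines above are a Git LFS pointer, not the parquet data itself. A downloaded copy of the parquet can be checked against the pointer's size and SHA-256 digest. The helper names below are illustrative, and reading the file from disk is left to the caller.

```python
import hashlib

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:35f87dde7df9e4c2097cd41485c0c64f703c20fd34fba994c52cdcc7fa021dca
size 287879
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of an LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True if the bytes have the size and digest the pointer records."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
# Usage with a local copy (the path is up to the reader):
#   with open("global_mmlu_lite_en.mlx.BF16.parquet", "rb") as fh:
#       print(matches_pointer(fh.read(), pointer))
```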
eval_results/global_mmlu_lite_en.mlx.BF16.yaml ADDED
@@ -0,0 +1,58 @@
+ task: global_mmlu_lite_en
+ dataset:
+   repo: CohereForAI/Global-MMLU-Lite
+   config: en
+   split: test
+   rows: 400
+ model:
+   repo: lthn/lemer-mlx-bf16
+   local_path: /Volumes/Data/lem/models/lemma.1.x.x/v1.0.1/lemer-mlx-bf16
+   backend: mlx_lm
+   quant: BF16
+ sampling:
+   max_tokens: 2048
+   mlx_lm_defaults: greedy
+ prompt_template: Question + A/B/C/D + 'Reason briefly, then end with the single letter
+   answer.'
+ parser: 'strict: last A/B/C/D letter regex. content: fallback to unique/last option-text
+   match.'
+ scores:
+   strict_letter:
+     correct: 260
+     n: 400
+     pct: 65.0
+   content_aware:
+     correct: 274
+     n: 400
+     pct: 68.5
+   no_answer:
+     n: 10
+     pct: 2.5
+   cs:
+     n: 200
+     strict_correct: 131
+     strict_pct: 65.5
+     content_correct: 138
+     content_pct: 69.0
+   ca:
+     n: 200
+     strict_correct: 129
+     strict_pct: 64.5
+     content_correct: 136
+     content_pct: 68.0
+   cultural_fairness_strict: 0.99
+   cultural_fairness_content: 0.99
+ timestamp_utc: '2026-04-17T11:56:15.577413+00:00'
+ runtime_seconds: 745.7
+ throughput_qps: 0.54
+ host: m3-ultra (local)
+ note: "This is a 1-round per-question fingerprint capture. Full model output is preserved\
+   \ per row in the parquet column full_model_output. Not a statistical accuracy claim\
+   \ \u2014 8-PAC consensus (8 rounds paired vs base Gemma 4 E2B IT) is the follow-up.\
+   \ Published here to disclose the model alignment fingerprint: readers can audit\
+   \ which questions the model disagrees with and on what grounds. For the reasoning\
+   \ behind publishing disagreement patterns rather than just accuracy scores, see\
+   \ paper section 16 (revise benchmark toward model)."
+ rounds: 1
+ protocol: single-round fingerprint (not 8-PAC consensus)
+ status: preliminary
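
The figures in the YAML can be cross-checked against one another: each percentage against its count, the CS and CA strata against the totals, and the throughput against the runtime. The sketch below copies the values into a plain dict so it runs without a YAML library. The helper names are illustrative and not part of the evaluation code.

```python
SCORES = {
    "strict_letter": {"correct": 260, "n": 400, "pct": 65.0},
    "content_aware": {"correct": 274, "n": 400, "pct": 68.5},
    "no_answer": {"n": 10, "pct": 2.5},
    "cs": {"n": 200, "strict_correct": 131, "strict_pct": 65.5,
           "content_correct": 138, "content_pct": 69.0},
    "ca": {"n": 200, "strict_correct": 129, "strict_pct": 64.5,
           "content_correct": 136, "content_pct": 68.0},
}
ROWS = 400
RUNTIME_SECONDS = 745.7
THROUGHPUT_QPS = 0.54


def pct(correct, n):
    """Percentage rounded to one decimal place, as reported in the YAML."""
    return round(100.0 * correct / n, 1)


def find_inconsistencies(scores, rows, runtime_seconds, throughput_qps):
    """Return a list of human-readable mismatches; empty if all checks pass."""
    problems = []

    def expect(label, actual, expected):
        if actual != expected:
            problems.append(f"{label}: got {actual}, expected {expected}")

    strict, content = scores["strict_letter"], scores["content_aware"]
    cs, ca = scores["cs"], scores["ca"]

    # Each reported percentage should follow from its count.
    expect("strict pct", pct(strict["correct"], strict["n"]), strict["pct"])
    expect("content pct", pct(content["correct"], content["n"]), content["pct"])
    expect("no-answer pct", pct(scores["no_answer"]["n"], rows),
           scores["no_answer"]["pct"])
    for name, stratum in (("cs", cs), ("ca", ca)):
        expect(f"{name} strict pct",
               pct(stratum["strict_correct"], stratum["n"]), stratum["strict_pct"])
        expect(f"{name} content pct",
               pct(stratum["content_correct"], stratum["n"]), stratum["content_pct"])

    # The two strata should add up to the overall totals.
    expect("stratum n", cs["n"] + ca["n"], rows)
    expect("strict total", cs["strict_correct"] + ca["strict_correct"],
           strict["correct"])
    expect("content total", cs["content_correct"] + ca["content_correct"],
           content["correct"])

    # Throughput is questions per second over the whole run.
    expect("throughput", round(rows / runtime_seconds, 2), throughput_qps)
    return problems


problems = find_inconsistencies(SCORES, ROWS, RUNTIME_SECONDS, THROUGHPUT_QPS)
print(problems)
```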