hevok committed
Commit 005dbfc · verified · 1 Parent(s): 8a932f0

Upload folder using huggingface_hub

lm_eval/meta-llama__Llama-3.2-3B/humaneval_0.4.8_results_2025-03-09T10-55-43.014396.json ADDED
@@ -0,0 +1,131 @@
+ {
+   "results": {
+     "humaneval": {
+       "alias": "humaneval",
+       "pass@1,create_test": 0.2621951219512195,
+       "pass@1_stderr,create_test": 0.034450002891734645
+     }
+   },
+   "group_subtasks": {
+     "humaneval": []
+   },
+   "configs": {
+     "humaneval": {
+       "task": "humaneval",
+       "dataset_path": "openai/openai_humaneval",
+       "test_split": "test",
+       "doc_to_text": "{{prompt}}",
+       "doc_to_target": "{{test}}\ncheck({{entry_point}})",
+       "unsafe_code": true,
+       "description": "",
+       "target_delimiter": " ",
+       "fewshot_delimiter": "\n\n",
+       "num_fewshot": 0,
+       "metric_list": [
+         {
+           "metric": "def pass_at_k(references: list[str], predictions: list[list[str]], k: list[int] = None):\n global compute_\n assert k is not None\n if isinstance(k, int):\n k = [k]\n res = compute_.compute(\n references=references,\n predictions=predictions,\n k=k,\n )\n return res[0]\n",
+           "aggregation": "mean",
+           "higher_is_better": true,
+           "k": [
+             1
+           ]
+         }
+       ],
+       "output_type": "generate_until",
+       "generation_kwargs": {
+         "until": [
+           "\nclass",
+           "\ndef",
+           "\n#",
+           "\nif",
+           "\nprint"
+         ],
+         "max_gen_toks": 1024,
+         "do_sample": false
+       },
+       "repeats": 1,
+       "filter_list": [
+         {
+           "name": "create_test",
+           "filter": [
+             {
+               "function": "custom",
+               "filter_fn": "<function build_predictions at 0x7cae74eed000>"
+             }
+           ]
+         }
+       ],
+       "should_decontaminate": false,
+       "metadata": {
+         "version": 1.0
+       }
+     }
+   },
+   "versions": {
+     "humaneval": 1.0
+   },
+   "n-shot": {
+     "humaneval": 0
+   },
+   "higher_is_better": {
+     "humaneval": {
+       "pass_at_k": true
+     }
+   },
+   "n-samples": {
+     "humaneval": {
+       "original": 164,
+       "effective": 164
+     }
+   },
+   "config": {
+     "model": "hf",
+     "model_args": "pretrained=meta-llama/Llama-3.2-3B,dtype=float32,trust_remote_code=True",
+     "model_num_parameters": 3212749824,
+     "model_dtype": "torch.float32",
+     "model_revision": "main",
+     "model_sha": "13afe5124825b4f3751f836b40dafda64c1ed062",
+     "batch_size": "auto",
+     "batch_sizes": [],
+     "device": "cuda",
+     "use_cache": null,
+     "limit": null,
+     "bootstrap_iters": 100000,
+     "gen_kwargs": null,
+     "random_seed": 0,
+     "numpy_seed": 1234,
+     "torch_seed": 1234,
+     "fewshot_seed": 1234
+   },
+   "git_hash": null,
+   "date": 1741516707.445278,
+ "pretty_env_info": "PyTorch version: 2.5.1+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: 14.0.0-1ubuntu1.1\nCMake version: version 3.31.2\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-6.6.56+-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.140\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: Tesla T4\nGPU 1: Tesla T4\n\nNvidia driver version: 560.35.03\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 46 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 4\nOn-line CPU(s) list: 0-3\nVendor ID: GenuineIntel\nModel name: Intel(R) Xeon(R) CPU @ 2.00GHz\nCPU family: 6\nModel: 85\nThread(s) per core: 2\nCore(s) per socket: 2\nSocket(s): 1\nStepping: 3\nBogoMIPS: 4000.28\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt 
clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities\nHypervisor vendor: KVM\nVirtualization type: full\nL1d cache: 64 KiB (2 instances)\nL1i cache: 64 KiB (2 instances)\nL2 cache: 2 MiB (2 instances)\nL3 cache: 38.5 MiB (1 instance)\nNUMA node(s): 1\nNUMA node0 CPU(s): 0-3\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Mitigation; PTE Inversion\nVulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown\nVulnerability Meltdown: Mitigation; PTI\nVulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Reg file data sampling: Not affected\nVulnerability Retbleed: Mitigation; IBRS\nVulnerability Spec rstack overflow: Not affected\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown\n\nVersions of relevant libraries:\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==1.26.4\n[pip3] onnx==1.17.0\n[pip3] optree==0.13.1\n[pip3] pytorch-ignite==0.5.1\n[pip3] pytorch-lightning==2.5.0.post0\n[pip3] torch==2.5.1+cu121\n[pip3] torchaudio==2.5.1+cu121\n[pip3] torchinfo==1.8.0\n[pip3] torchmetrics==1.6.1\n[pip3] torchsummary==1.5.1\n[pip3] torchtune==0.5.0\n[pip3] torchvision==0.20.1+cu121\n[conda] Could not collect",
+   "transformers_version": "4.47.0",
+   "upper_git_hash": null,
+   "tokenizer_pad_token": [
+     "<|end_of_text|>",
+     "128001"
+   ],
+   "tokenizer_eos_token": [
+     "<|end_of_text|>",
+     "128001"
+   ],
+   "tokenizer_bos_token": [
+     "<|begin_of_text|>",
+     "128000"
+   ],
+   "eot_token_id": 128001,
+   "max_length": 131072,
+   "task_hashes": {},
+   "model_source": "hf",
+   "model_name": "meta-llama/Llama-3.2-3B",
+   "model_name_sanitized": "meta-llama__Llama-3.2-3B",
+   "system_instruction": null,
+   "system_instruction_sha": null,
+   "fewshot_as_multiturn": false,
+   "chat_template": null,
+   "chat_template_sha": null,
+   "start_time": 7383.679866534,
+   "end_time": 8435.206068245,
+   "total_evaluation_time_seconds": "1051.526201710999"
+ }
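As a sanity check on the HumanEval numbers in the file above, the reported standard error can be reproduced from the pass@1 score and the sample count. This is a minimal sketch, assuming the `mean` aggregation reports the sample standard error of the mean over per-problem 0/1 outcomes (with an n - 1 denominator); the helper name `binomial_stderr` is hypothetical and not part of the harness.

```python
import math


def binomial_stderr(p: float, n: int) -> float:
    """Sample standard error of the mean of n 0/1 outcomes whose mean is p.

    Assumption: sample variance uses the n - 1 denominator.
    """
    return math.sqrt(p * (1.0 - p) / (n - 1))


pass_at_1 = 0.2621951219512195   # "pass@1,create_test" in the file above
n_problems = 164                 # "n-samples" -> "effective" in the file above
reported = 0.034450002891734645  # "pass@1_stderr,create_test" in the file above

solved = round(pass_at_1 * n_problems)          # number of problems passed
estimate = binomial_stderr(pass_at_1, n_problems)

print(f"problems solved: {solved} of {n_problems}")
print(f"recomputed stderr: {estimate:.6f} (reported: {reported:.6f})")
```

Under that assumption the recomputed value agrees with the reported one to well within rounding, which suggests the score corresponds to a whole number of solved problems out of 164.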
lm_eval/meta-llama__Llama-3.2-3B/results_2025-03-05T14-23-45.341028.json ADDED
@@ -0,0 +1,123 @@
+ {
+   "results": {
+     "lambada_openai": {
+       "perplexity,none": 3.9425084889549575,
+       "perplexity_stderr,none": 0.08280551645968916,
+       "acc,none": 0.7054143217543178,
+       "acc_stderr,none": 0.006350969451144858,
+       "alias": "lambada_openai"
+     }
+   },
+   "group_subtasks": {
+     "lambada_openai": []
+   },
+   "configs": {
+     "lambada_openai": {
+       "task": "lambada_openai",
+       "group": [
+         "lambada"
+       ],
+       "dataset_path": "EleutherAI/lambada_openai",
+       "dataset_name": "default",
+       "dataset_kwargs": {
+         "trust_remote_code": true
+       },
+       "test_split": "test",
+       "doc_to_text": "{{text.split(' ')[:-1]|join(' ')}}",
+       "doc_to_target": "{{' '+text.split(' ')[-1]}}",
+       "description": "",
+       "target_delimiter": " ",
+       "fewshot_delimiter": "\n\n",
+       "num_fewshot": 0,
+       "metric_list": [
+         {
+           "metric": "perplexity",
+           "aggregation": "perplexity",
+           "higher_is_better": false
+         },
+         {
+           "metric": "acc",
+           "aggregation": "mean",
+           "higher_is_better": true
+         }
+       ],
+       "output_type": "loglikelihood",
+       "repeats": 1,
+       "should_decontaminate": true,
+       "doc_to_decontamination_query": "{{text}}",
+       "metadata": {
+         "version": 1.0
+       }
+     }
+   },
+   "versions": {
+     "lambada_openai": 1.0
+   },
+   "n-shot": {
+     "lambada_openai": 0
+   },
+   "higher_is_better": {
+     "lambada_openai": {
+       "perplexity": false,
+       "acc": true
+     }
+   },
+   "n-samples": {
+     "lambada_openai": {
+       "original": 5153,
+       "effective": 5153
+     }
+   },
+   "config": {
+     "model": "hf",
+     "model_args": "pretrained=meta-llama/Llama-3.2-3B,dtype=float32,trust_remote_code=True",
+     "model_num_parameters": 3212749824,
+     "model_dtype": "torch.float32",
+     "model_revision": "main",
+     "model_sha": "13afe5124825b4f3751f836b40dafda64c1ed062",
+     "batch_size": "auto",
+     "batch_sizes": [
+       8
+     ],
+     "device": "cuda",
+     "use_cache": null,
+     "limit": null,
+     "bootstrap_iters": 100000,
+     "gen_kwargs": null,
+     "random_seed": 0,
+     "numpy_seed": 1234,
+     "torch_seed": 1234,
+     "fewshot_seed": 1234
+   },
+   "git_hash": null,
+   "date": 1741183767.1865582,
+ "pretty_env_info": "PyTorch version: 2.5.1+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: 14.0.0-1ubuntu1.1\nCMake version: version 3.31.2\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-6.6.56+-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.140\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: Tesla T4\nGPU 1: Tesla T4\n\nNvidia driver version: 560.35.03\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 46 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 4\nOn-line CPU(s) list: 0-3\nVendor ID: GenuineIntel\nModel name: Intel(R) Xeon(R) CPU @ 2.00GHz\nCPU family: 6\nModel: 85\nThread(s) per core: 2\nCore(s) per socket: 2\nSocket(s): 1\nStepping: 3\nBogoMIPS: 4000.38\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt 
clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities\nHypervisor vendor: KVM\nVirtualization type: full\nL1d cache: 64 KiB (2 instances)\nL1i cache: 64 KiB (2 instances)\nL2 cache: 2 MiB (2 instances)\nL3 cache: 38.5 MiB (1 instance)\nNUMA node(s): 1\nNUMA node0 CPU(s): 0-3\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Mitigation; PTE Inversion\nVulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown\nVulnerability Meltdown: Mitigation; PTI\nVulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Reg file data sampling: Not affected\nVulnerability Retbleed: Mitigation; IBRS\nVulnerability Spec rstack overflow: Not affected\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown\n\nVersions of relevant libraries:\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==1.26.4\n[pip3] onnx==1.17.0\n[pip3] optree==0.13.1\n[pip3] pytorch-ignite==0.5.1\n[pip3] pytorch-lightning==2.5.0.post0\n[pip3] torch==2.5.1+cu121\n[pip3] torchaudio==2.5.1+cu121\n[pip3] torchinfo==1.8.0\n[pip3] torchmetrics==1.6.1\n[pip3] torchsummary==1.5.1\n[pip3] torchtune==0.5.0\n[pip3] torchvision==0.20.1+cu121\n[conda] Could not collect",
+   "transformers_version": "4.47.0",
+   "upper_git_hash": null,
+   "tokenizer_pad_token": [
+     "<|end_of_text|>",
+     128001
+   ],
+   "tokenizer_eos_token": [
+     "<|end_of_text|>",
+     128001
+   ],
+   "tokenizer_bos_token": [
+     "<|begin_of_text|>",
+     128000
+   ],
+   "eot_token_id": 128001,
+   "max_length": 131072,
+   "task_hashes": {},
+   "model_source": "hf",
+   "model_name": "meta-llama/Llama-3.2-3B",
+   "model_name_sanitized": "meta-llama__Llama-3.2-3B",
+   "system_instruction": null,
+   "system_instruction_sha": null,
+   "fewshot_as_multiturn": false,
+   "chat_template": null,
+   "chat_template_sha": null,
+   "start_time": 15853.628074066,
+   "end_time": 16718.171508502,
+   "total_evaluation_time_seconds": "864.5434344360001"
+ }
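The keys under `"results"` in these files appear to follow a `"<metric>,<filter>"` pattern (for example `"acc,none"` here, or `"pass@1,create_test"` where `create_test` matches the name in `filter_list`), with a `_stderr` suffix on the metric for the error estimate. The sketch below regroups the LAMBADA entries on that reading; the pattern is inferred from the keys in this commit rather than from a documented schema, and `split_metrics` is a hypothetical helper name.

```python
# Values copied from the "lambada_openai" results block above.
results = {
    "perplexity,none": 3.9425084889549575,
    "perplexity_stderr,none": 0.08280551645968916,
    "acc,none": 0.7054143217543178,
    "acc_stderr,none": 0.006350969451144858,
    "alias": "lambada_openai",
}


def split_metrics(task_results: dict) -> dict:
    """Group 'metric,filter' keys into {(metric, filter): {'value', 'stderr'}}."""
    table = {}
    for key, value in task_results.items():
        if "," not in key:
            continue  # skip non-metric entries such as "alias"
        name, filter_name = key.rsplit(",", 1)
        is_stderr = name.endswith("_stderr")
        metric = name[: -len("_stderr")] if is_stderr else name
        entry = table.setdefault((metric, filter_name), {})
        entry["stderr" if is_stderr else "value"] = value
    return table


table = split_metrics(results)
for (metric, filter_name), entry in table.items():
    print(f"{metric} [{filter_name}]: {entry['value']:.4f} +/- {entry['stderr']:.4f}")
```

Keeping the filter name in the lookup key avoids mixing scores from different post-processing filters when several result files are merged.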
lm_eval/meta-llama__Llama-3.2-3B/results_2025-03-05T15-08-14.503599.json ADDED
@@ -0,0 +1,120 @@
+ {
+   "results": {
+     "openbookqa": {
+       "acc,none": 0.312,
+       "acc_stderr,none": 0.020740596536488073,
+       "acc_norm,none": 0.43,
+       "acc_norm_stderr,none": 0.02216263442665284,
+       "alias": "openbookqa"
+     }
+   },
+   "group_subtasks": {
+     "openbookqa": []
+   },
+   "configs": {
+     "openbookqa": {
+       "task": "openbookqa",
+       "dataset_path": "openbookqa",
+       "dataset_name": "main",
+       "training_split": "train",
+       "validation_split": "validation",
+       "test_split": "test",
+       "doc_to_text": "question_stem",
+       "doc_to_target": "{{choices.label.index(answerKey.lstrip())}}",
+       "doc_to_choice": "{{choices.text}}",
+       "description": "",
+       "target_delimiter": " ",
+       "fewshot_delimiter": "\n\n",
+       "num_fewshot": 0,
+       "metric_list": [
+         {
+           "metric": "acc",
+           "aggregation": "mean",
+           "higher_is_better": true
+         },
+         {
+           "metric": "acc_norm",
+           "aggregation": "mean",
+           "higher_is_better": true
+         }
+       ],
+       "output_type": "multiple_choice",
+       "repeats": 1,
+       "should_decontaminate": true,
+       "doc_to_decontamination_query": "question_stem",
+       "metadata": {
+         "version": 1.0
+       }
+     }
+   },
+   "versions": {
+     "openbookqa": 1.0
+   },
+   "n-shot": {
+     "openbookqa": 0
+   },
+   "higher_is_better": {
+     "openbookqa": {
+       "acc": true,
+       "acc_norm": true
+     }
+   },
+   "n-samples": {
+     "openbookqa": {
+       "original": 500,
+       "effective": 500
+     }
+   },
+   "config": {
+     "model": "hf",
+     "model_args": "pretrained=meta-llama/Llama-3.2-3B,dtype=float32,trust_remote_code=True",
+     "model_num_parameters": 3212749824,
+     "model_dtype": "torch.float32",
+     "model_revision": "main",
+     "model_sha": "13afe5124825b4f3751f836b40dafda64c1ed062",
+     "batch_size": "auto",
+     "batch_sizes": [
+       16
+     ],
+     "device": "cuda",
+     "use_cache": null,
+     "limit": null,
+     "bootstrap_iters": 100000,
+     "gen_kwargs": null,
+     "random_seed": 0,
+     "numpy_seed": 1234,
+     "torch_seed": 1234,
+     "fewshot_seed": 1234
+   },
+   "git_hash": null,
+   "date": 1741187187.2261014,
+ "pretty_env_info": "PyTorch version: 2.5.1+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: 14.0.0-1ubuntu1.1\nCMake version: version 3.31.2\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-6.6.56+-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.140\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: Tesla T4\nGPU 1: Tesla T4\n\nNvidia driver version: 560.35.03\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 46 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 4\nOn-line CPU(s) list: 0-3\nVendor ID: GenuineIntel\nModel name: Intel(R) Xeon(R) CPU @ 2.00GHz\nCPU family: 6\nModel: 85\nThread(s) per core: 2\nCore(s) per socket: 2\nSocket(s): 1\nStepping: 3\nBogoMIPS: 4000.38\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt 
clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities\nHypervisor vendor: KVM\nVirtualization type: full\nL1d cache: 64 KiB (2 instances)\nL1i cache: 64 KiB (2 instances)\nL2 cache: 2 MiB (2 instances)\nL3 cache: 38.5 MiB (1 instance)\nNUMA node(s): 1\nNUMA node0 CPU(s): 0-3\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Mitigation; PTE Inversion\nVulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown\nVulnerability Meltdown: Mitigation; PTI\nVulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Reg file data sampling: Not affected\nVulnerability Retbleed: Mitigation; IBRS\nVulnerability Spec rstack overflow: Not affected\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown\n\nVersions of relevant libraries:\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==1.26.4\n[pip3] onnx==1.17.0\n[pip3] optree==0.13.1\n[pip3] pytorch-ignite==0.5.1\n[pip3] pytorch-lightning==2.5.0.post0\n[pip3] torch==2.5.1+cu121\n[pip3] torchaudio==2.5.1+cu121\n[pip3] torchinfo==1.8.0\n[pip3] torchmetrics==1.6.1\n[pip3] torchsummary==1.5.1\n[pip3] torchtune==0.5.0\n[pip3] torchvision==0.20.1+cu121\n[conda] Could not collect",
+   "transformers_version": "4.47.0",
+   "upper_git_hash": null,
+   "tokenizer_pad_token": [
+     "<|end_of_text|>",
+     128001
+   ],
+   "tokenizer_eos_token": [
+     "<|end_of_text|>",
+     128001
+   ],
+   "tokenizer_bos_token": [
+     "<|begin_of_text|>",
+     128000
+   ],
+   "eot_token_id": 128001,
+   "max_length": 131072,
+   "task_hashes": {},
+   "model_source": "hf",
+   "model_name": "meta-llama/Llama-3.2-3B",
+   "model_name_sanitized": "meta-llama__Llama-3.2-3B",
+   "system_instruction": null,
+   "system_instruction_sha": null,
+   "fewshot_as_multiturn": false,
+   "chat_template": null,
+   "chat_template_sha": null,
+   "start_time": 19273.551134288,
+   "end_time": 19387.33433296,
+   "total_evaluation_time_seconds": "113.78319867199752"
+ }
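The timing fields at the end of the OpenBookQA file can be cross-checked with simple arithmetic. This sketch assumes `start_time` and `end_time` are readings of the same clock in seconds; they are far too small to be Unix timestamps like `date`, so they look like a relative counter, but the file does not say so. Note that `total_evaluation_time_seconds` is stored as a string and has to be parsed before comparison.

```python
start_time = 19273.551134288       # "start_time" in the file above
end_time = 19387.33433296          # "end_time" in the file above
reported = "113.78319867199752"    # "total_evaluation_time_seconds" (a JSON string)
n_docs = 500                       # "n-samples" -> "effective" in the file above

elapsed = end_time - start_time    # wall time in seconds, under the stated assumption
seconds_per_doc = elapsed / n_docs

print(f"elapsed: {elapsed:.3f} s (reported: {float(reported):.3f} s)")
print(f"average per question: {seconds_per_doc:.3f} s")
```

The per-question average covers the whole run, so it includes any model loading and batch-size search that happened between the two timestamps, not only the forward passes.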
lm_eval/meta-llama__Llama-3.2-3B/results_2025-03-06T14-21-08.986098.json ADDED
@@ -0,0 +1,2967 @@
+ {
+   "results": {
+     "blimp": {
+       "acc,none": 0.8199552238805967,
+       "acc_stderr,none": 0.0013392933229133912,
+       "alias": "blimp"
+     },
+     "blimp_adjunct_island": {
+       "alias": " - blimp_adjunct_island",
+       "acc,none": 0.872,
+       "acc_stderr,none": 0.010570133761108663
+     },
+     "blimp_anaphor_gender_agreement": {
+       "alias": " - blimp_anaphor_gender_agreement",
+       "acc,none": 0.991,
+       "acc_stderr,none": 0.002987963843142665
+     },
+     "blimp_anaphor_number_agreement": {
+       "alias": " - blimp_anaphor_number_agreement",
+       "acc,none": 0.997,
+       "acc_stderr,none": 0.001730316154346927
+     },
+     "blimp_animate_subject_passive": {
+       "alias": " - blimp_animate_subject_passive",
+       "acc,none": 0.793,
+       "acc_stderr,none": 0.012818553557843984
+     },
+     "blimp_animate_subject_trans": {
+       "alias": " - blimp_animate_subject_trans",
+       "acc,none": 0.925,
+       "acc_stderr,none": 0.008333333333333342
+     },
+     "blimp_causative": {
+       "alias": " - blimp_causative",
+       "acc,none": 0.763,
+       "acc_stderr,none": 0.013454070462577941
+     },
+     "blimp_complex_NP_island": {
+       "alias": " - blimp_complex_NP_island",
+       "acc,none": 0.496,
+       "acc_stderr,none": 0.01581879370351089
+     },
+     "blimp_coordinate_structure_constraint_complex_left_branch": {
+       "alias": " - blimp_coordinate_structure_constraint_complex_left_branch",
+       "acc,none": 0.725,
+       "acc_stderr,none": 0.014127086556490531
+     },
+     "blimp_coordinate_structure_constraint_object_extraction": {
+       "alias": " - blimp_coordinate_structure_constraint_object_extraction",
+       "acc,none": 0.857,
+       "acc_stderr,none": 0.011075814808567038
+     },
+     "blimp_determiner_noun_agreement_1": {
+       "alias": " - blimp_determiner_noun_agreement_1",
+       "acc,none": 0.993,
+       "acc_stderr,none": 0.0026377941462437634
+     },
+     "blimp_determiner_noun_agreement_2": {
+       "alias": " - blimp_determiner_noun_agreement_2",
+       "acc,none": 0.967,
+       "acc_stderr,none": 0.00565180882045237
+     },
+     "blimp_determiner_noun_agreement_irregular_1": {
+       "alias": " - blimp_determiner_noun_agreement_irregular_1",
+       "acc,none": 0.939,
+       "acc_stderr,none": 0.007572076091557426
+     },
+     "blimp_determiner_noun_agreement_irregular_2": {
+       "alias": " - blimp_determiner_noun_agreement_irregular_2",
+       "acc,none": 0.939,
+       "acc_stderr,none": 0.007572076091557422
+     },
+     "blimp_determiner_noun_agreement_with_adj_2": {
+       "alias": " - blimp_determiner_noun_agreement_with_adj_2",
+       "acc,none": 0.926,
+       "acc_stderr,none": 0.008282064512704154
+     },
+     "blimp_determiner_noun_agreement_with_adj_irregular_1": {
+       "alias": " - blimp_determiner_noun_agreement_with_adj_irregular_1",
+       "acc,none": 0.909,
+       "acc_stderr,none": 0.009099549538400243
+     },
+     "blimp_determiner_noun_agreement_with_adj_irregular_2": {
+       "alias": " - blimp_determiner_noun_agreement_with_adj_irregular_2",
+       "acc,none": 0.903,
+       "acc_stderr,none": 0.00936368937324812
+     },
+     "blimp_determiner_noun_agreement_with_adjective_1": {
+       "alias": " - blimp_determiner_noun_agreement_with_adjective_1",
+       "acc,none": 0.966,
+       "acc_stderr,none": 0.005733836139695435
+     },
+     "blimp_distractor_agreement_relational_noun": {
+       "alias": " - blimp_distractor_agreement_relational_noun",
+       "acc,none": 0.869,
+       "acc_stderr,none": 0.010674874844837956
+     },
+     "blimp_distractor_agreement_relative_clause": {
+       "alias": " - blimp_distractor_agreement_relative_clause",
+       "acc,none": 0.706,
+       "acc_stderr,none": 0.014414290540008217
+     },
+     "blimp_drop_argument": {
+       "alias": " - blimp_drop_argument",
+       "acc,none": 0.783,
+       "acc_stderr,none": 0.01304151375727071
+     },
+     "blimp_ellipsis_n_bar_1": {
+       "alias": " - blimp_ellipsis_n_bar_1",
+       "acc,none": 0.786,
+       "acc_stderr,none": 0.01297583802196877
+     },
+     "blimp_ellipsis_n_bar_2": {
+       "alias": " - blimp_ellipsis_n_bar_2",
+       "acc,none": 0.941,
+       "acc_stderr,none": 0.007454835650406728
+     },
+     "blimp_existential_there_object_raising": {
+       "alias": " - blimp_existential_there_object_raising",
+       "acc,none": 0.867,
+       "acc_stderr,none": 0.01074366913239735
+     },
+     "blimp_existential_there_quantifiers_1": {
+       "alias": " - blimp_existential_there_quantifiers_1",
+       "acc,none": 0.97,
+       "acc_stderr,none": 0.005397140829099199
+     },
+     "blimp_existential_there_quantifiers_2": {
+       "alias": " - blimp_existential_there_quantifiers_2",
+       "acc,none": 0.365,
+       "acc_stderr,none": 0.015231776226264898
+     },
+     "blimp_existential_there_subject_raising": {
+       "alias": " - blimp_existential_there_subject_raising",
+       "acc,none": 0.899,
+       "acc_stderr,none": 0.009533618929341
+     },
+     "blimp_expletive_it_object_raising": {
+       "alias": " - blimp_expletive_it_object_raising",
+       "acc,none": 0.8,
+       "acc_stderr,none": 0.012655439943366651
+     },
+     "blimp_inchoative": {
+       "alias": " - blimp_inchoative",
+       "acc,none": 0.67,
+       "acc_stderr,none": 0.01487687202745673
+     },
+     "blimp_intransitive": {
+       "alias": " - blimp_intransitive",
+       "acc,none": 0.831,
+       "acc_stderr,none": 0.011856625977890115
+     },
+     "blimp_irregular_past_participle_adjectives": {
+       "alias": " - blimp_irregular_past_participle_adjectives",
+       "acc,none": 0.986,
+       "acc_stderr,none": 0.0037172325482565595
+     },
+     "blimp_irregular_past_participle_verbs": {
+       "alias": " - blimp_irregular_past_participle_verbs",
+       "acc,none": 0.876,
+       "acc_stderr,none": 0.010427498872343956
+     },
+     "blimp_irregular_plural_subject_verb_agreement_1": {
+       "alias": " - blimp_irregular_plural_subject_verb_agreement_1",
+       "acc,none": 0.917,
+       "acc_stderr,none": 0.00872852720607479
+     },
+     "blimp_irregular_plural_subject_verb_agreement_2": {
+       "alias": " - blimp_irregular_plural_subject_verb_agreement_2",
+       "acc,none": 0.93,
+       "acc_stderr,none": 0.008072494358323494
+     },
+     "blimp_left_branch_island_echo_question": {
+       "alias": " - blimp_left_branch_island_echo_question",
+       "acc,none": 0.673,
+       "acc_stderr,none": 0.014842213153411247
177
+ },
178
+ "blimp_left_branch_island_simple_question": {
179
+ "alias": " - blimp_left_branch_island_simple_question",
180
+ "acc,none": 0.862,
181
+ "acc_stderr,none": 0.010912152632504401
182
+ },
183
+ "blimp_matrix_question_npi_licensor_present": {
184
+ "alias": " - blimp_matrix_question_npi_licensor_present",
185
+ "acc,none": 0.636,
186
+ "acc_stderr,none": 0.015222868840522022
187
+ },
188
+ "blimp_npi_present_1": {
189
+ "alias": " - blimp_npi_present_1",
190
+ "acc,none": 0.617,
191
+ "acc_stderr,none": 0.015380102325652716
192
+ },
193
+ "blimp_npi_present_2": {
194
+ "alias": " - blimp_npi_present_2",
195
+ "acc,none": 0.697,
196
+ "acc_stderr,none": 0.01453968371053525
197
+ },
198
+ "blimp_only_npi_licensor_present": {
199
+ "alias": " - blimp_only_npi_licensor_present",
200
+ "acc,none": 0.922,
201
+ "acc_stderr,none": 0.008484573530118581
202
+ },
203
+ "blimp_only_npi_scope": {
204
+ "alias": " - blimp_only_npi_scope",
205
+ "acc,none": 0.786,
206
+ "acc_stderr,none": 0.01297583802196876
207
+ },
208
+ "blimp_passive_1": {
209
+ "alias": " - blimp_passive_1",
210
+ "acc,none": 0.891,
211
+ "acc_stderr,none": 0.00985982840703719
212
+ },
213
+ "blimp_passive_2": {
214
+ "alias": " - blimp_passive_2",
215
+ "acc,none": 0.916,
216
+ "acc_stderr,none": 0.008776162089491118
217
+ },
218
+ "blimp_principle_A_c_command": {
219
+ "alias": " - blimp_principle_A_c_command",
220
+ "acc,none": 0.741,
221
+ "acc_stderr,none": 0.01386041525752791
222
+ },
223
+ "blimp_principle_A_case_1": {
224
+ "alias": " - blimp_principle_A_case_1",
225
+ "acc,none": 1.0,
226
+ "acc_stderr,none": 0.0
227
+ },
228
+ "blimp_principle_A_case_2": {
229
+ "alias": " - blimp_principle_A_case_2",
230
+ "acc,none": 0.947,
231
+ "acc_stderr,none": 0.0070881056172464405
232
+ },
233
+ "blimp_principle_A_domain_1": {
234
+ "alias": " - blimp_principle_A_domain_1",
235
+ "acc,none": 0.995,
236
+ "acc_stderr,none": 0.0022315868748448804
237
+ },
238
+ "blimp_principle_A_domain_2": {
239
+ "alias": " - blimp_principle_A_domain_2",
240
+ "acc,none": 0.816,
241
+ "acc_stderr,none": 0.012259457340938577
242
+ },
243
+ "blimp_principle_A_domain_3": {
244
+ "alias": " - blimp_principle_A_domain_3",
245
+ "acc,none": 0.686,
246
+ "acc_stderr,none": 0.014683991951087966
247
+ },
248
+ "blimp_principle_A_reconstruction": {
249
+ "alias": " - blimp_principle_A_reconstruction",
250
+ "acc,none": 0.471,
251
+ "acc_stderr,none": 0.0157926694516289
252
+ },
253
+ "blimp_regular_plural_subject_verb_agreement_1": {
254
+ "alias": " - blimp_regular_plural_subject_verb_agreement_1",
255
+ "acc,none": 0.95,
256
+ "acc_stderr,none": 0.006895472974897919
257
+ },
258
+ "blimp_regular_plural_subject_verb_agreement_2": {
259
+ "alias": " - blimp_regular_plural_subject_verb_agreement_2",
260
+ "acc,none": 0.887,
261
+ "acc_stderr,none": 0.010016552866696874
262
+ },
263
+ "blimp_sentential_negation_npi_licensor_present": {
264
+ "alias": " - blimp_sentential_negation_npi_licensor_present",
265
+ "acc,none": 0.997,
266
+ "acc_stderr,none": 0.0017303161543469382
267
+ },
268
+ "blimp_sentential_negation_npi_scope": {
269
+ "alias": " - blimp_sentential_negation_npi_scope",
270
+ "acc,none": 0.735,
271
+ "acc_stderr,none": 0.013963164754809949
272
+ },
273
+ "blimp_sentential_subject_island": {
274
+ "alias": " - blimp_sentential_subject_island",
275
+ "acc,none": 0.417,
276
+ "acc_stderr,none": 0.015599819048769614
277
+ },
278
+ "blimp_superlative_quantifiers_1": {
279
+ "alias": " - blimp_superlative_quantifiers_1",
280
+ "acc,none": 0.93,
281
+ "acc_stderr,none": 0.008072494358323506
282
+ },
283
+ "blimp_superlative_quantifiers_2": {
284
+ "alias": " - blimp_superlative_quantifiers_2",
285
+ "acc,none": 0.977,
286
+ "acc_stderr,none": 0.004742730594656799
287
+ },
288
+ "blimp_tough_vs_raising_1": {
289
+ "alias": " - blimp_tough_vs_raising_1",
290
+ "acc,none": 0.673,
291
+ "acc_stderr,none": 0.014842213153411245
292
+ },
293
+ "blimp_tough_vs_raising_2": {
294
+ "alias": " - blimp_tough_vs_raising_2",
295
+ "acc,none": 0.867,
296
+ "acc_stderr,none": 0.010743669132397328
297
+ },
298
+ "blimp_transitive": {
299
+ "alias": " - blimp_transitive",
300
+ "acc,none": 0.86,
301
+ "acc_stderr,none": 0.010978183844357786
302
+ },
303
+ "blimp_wh_island": {
304
+ "alias": " - blimp_wh_island",
305
+ "acc,none": 0.732,
306
+ "acc_stderr,none": 0.014013292702729486
307
+ },
308
+ "blimp_wh_questions_object_gap": {
309
+ "alias": " - blimp_wh_questions_object_gap",
310
+ "acc,none": 0.782,
311
+ "acc_stderr,none": 0.013063179040595299
312
+ },
313
+ "blimp_wh_questions_subject_gap": {
314
+ "alias": " - blimp_wh_questions_subject_gap",
315
+ "acc,none": 0.894,
316
+ "acc_stderr,none": 0.009739551265785126
317
+ },
318
+ "blimp_wh_questions_subject_gap_long_distance": {
319
+ "alias": " - blimp_wh_questions_subject_gap_long_distance",
320
+ "acc,none": 0.869,
321
+ "acc_stderr,none": 0.010674874844837957
322
+ },
323
+ "blimp_wh_vs_that_no_gap": {
324
+ "alias": " - blimp_wh_vs_that_no_gap",
325
+ "acc,none": 0.949,
326
+ "acc_stderr,none": 0.0069604200625714135
327
+ },
328
+ "blimp_wh_vs_that_no_gap_long_distance": {
329
+ "alias": " - blimp_wh_vs_that_no_gap_long_distance",
330
+ "acc,none": 0.946,
331
+ "acc_stderr,none": 0.007150883521295441
332
+ },
333
+ "blimp_wh_vs_that_with_gap": {
334
+ "alias": " - blimp_wh_vs_that_with_gap",
335
+ "acc,none": 0.349,
336
+ "acc_stderr,none": 0.0150806639915631
337
+ },
338
+ "blimp_wh_vs_that_with_gap_long_distance": {
339
+ "alias": " - blimp_wh_vs_that_with_gap_long_distance",
340
+ "acc,none": 0.31,
341
+ "acc_stderr,none": 0.014632638658632902
342
+ }
343
+ },
344
+ "groups": {
345
+ "blimp": {
346
+ "acc,none": 0.8199552238805967,
347
+ "acc_stderr,none": 0.0013392933229133912,
348
+ "alias": "blimp"
349
+ }
350
+ },
351
+ "group_subtasks": {
352
+ "blimp": [
353
+ "blimp_adjunct_island",
354
+ "blimp_anaphor_gender_agreement",
355
+ "blimp_anaphor_number_agreement",
356
+ "blimp_animate_subject_passive",
357
+ "blimp_animate_subject_trans",
358
+ "blimp_causative",
359
+ "blimp_complex_NP_island",
360
+ "blimp_coordinate_structure_constraint_complex_left_branch",
361
+ "blimp_coordinate_structure_constraint_object_extraction",
362
+ "blimp_determiner_noun_agreement_1",
363
+ "blimp_determiner_noun_agreement_2",
364
+ "blimp_determiner_noun_agreement_irregular_1",
365
+ "blimp_determiner_noun_agreement_irregular_2",
366
+ "blimp_determiner_noun_agreement_with_adj_2",
367
+ "blimp_determiner_noun_agreement_with_adj_irregular_1",
368
+ "blimp_determiner_noun_agreement_with_adj_irregular_2",
369
+ "blimp_determiner_noun_agreement_with_adjective_1",
370
+ "blimp_distractor_agreement_relational_noun",
371
+ "blimp_distractor_agreement_relative_clause",
372
+ "blimp_drop_argument",
373
+ "blimp_ellipsis_n_bar_1",
374
+ "blimp_ellipsis_n_bar_2",
375
+ "blimp_existential_there_object_raising",
376
+ "blimp_existential_there_quantifiers_1",
377
+ "blimp_existential_there_quantifiers_2",
378
+ "blimp_existential_there_subject_raising",
379
+ "blimp_expletive_it_object_raising",
380
+ "blimp_inchoative",
381
+ "blimp_intransitive",
382
+ "blimp_irregular_past_participle_adjectives",
383
+ "blimp_irregular_past_participle_verbs",
384
+ "blimp_irregular_plural_subject_verb_agreement_1",
385
+ "blimp_irregular_plural_subject_verb_agreement_2",
386
+ "blimp_left_branch_island_echo_question",
387
+ "blimp_left_branch_island_simple_question",
388
+ "blimp_matrix_question_npi_licensor_present",
389
+ "blimp_npi_present_1",
390
+ "blimp_npi_present_2",
391
+ "blimp_only_npi_licensor_present",
392
+ "blimp_only_npi_scope",
393
+ "blimp_passive_1",
394
+ "blimp_passive_2",
395
+ "blimp_principle_A_c_command",
396
+ "blimp_principle_A_case_1",
397
+ "blimp_principle_A_case_2",
398
+ "blimp_principle_A_domain_1",
399
+ "blimp_principle_A_domain_2",
400
+ "blimp_principle_A_domain_3",
401
+ "blimp_principle_A_reconstruction",
402
+ "blimp_regular_plural_subject_verb_agreement_1",
403
+ "blimp_regular_plural_subject_verb_agreement_2",
404
+ "blimp_sentential_negation_npi_licensor_present",
405
+ "blimp_sentential_negation_npi_scope",
406
+ "blimp_sentential_subject_island",
407
+ "blimp_superlative_quantifiers_1",
408
+ "blimp_superlative_quantifiers_2",
409
+ "blimp_tough_vs_raising_1",
410
+ "blimp_tough_vs_raising_2",
411
+ "blimp_transitive",
412
+ "blimp_wh_island",
413
+ "blimp_wh_questions_object_gap",
414
+ "blimp_wh_questions_subject_gap",
415
+ "blimp_wh_questions_subject_gap_long_distance",
416
+ "blimp_wh_vs_that_no_gap",
417
+ "blimp_wh_vs_that_no_gap_long_distance",
418
+ "blimp_wh_vs_that_with_gap",
419
+ "blimp_wh_vs_that_with_gap_long_distance"
420
+ ]
421
+ },
422
+ "configs": {
423
+ "blimp_adjunct_island": {
424
+ "task": "blimp_adjunct_island",
425
+ "dataset_path": "blimp",
426
+ "dataset_name": "adjunct_island",
427
+ "validation_split": "train",
428
+ "doc_to_text": "",
429
+ "doc_to_target": 0,
430
+ "unsafe_code": false,
431
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
432
+ "description": "",
433
+ "target_delimiter": " ",
434
+ "fewshot_delimiter": "\n\n",
435
+ "num_fewshot": 0,
436
+ "metric_list": [
437
+ {
438
+ "metric": "acc",
439
+ "aggregation": "mean",
440
+ "higher_is_better": true
441
+ }
442
+ ],
443
+ "output_type": "multiple_choice",
444
+ "repeats": 1,
445
+ "should_decontaminate": true,
446
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
447
+ "metadata": {
448
+ "version": 1.0
449
+ }
450
+ },
451
+ "blimp_anaphor_gender_agreement": {
452
+ "task": "blimp_anaphor_gender_agreement",
453
+ "dataset_path": "blimp",
454
+ "dataset_name": "anaphor_gender_agreement",
455
+ "validation_split": "train",
456
+ "doc_to_text": "",
457
+ "doc_to_target": 0,
458
+ "unsafe_code": false,
459
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
460
+ "description": "",
461
+ "target_delimiter": " ",
462
+ "fewshot_delimiter": "\n\n",
463
+ "num_fewshot": 0,
464
+ "metric_list": [
465
+ {
466
+ "metric": "acc",
467
+ "aggregation": "mean",
468
+ "higher_is_better": true
469
+ }
470
+ ],
471
+ "output_type": "multiple_choice",
472
+ "repeats": 1,
473
+ "should_decontaminate": true,
474
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
475
+ "metadata": {
476
+ "version": 1.0
477
+ }
478
+ },
479
+ "blimp_anaphor_number_agreement": {
480
+ "task": "blimp_anaphor_number_agreement",
481
+ "dataset_path": "blimp",
482
+ "dataset_name": "anaphor_number_agreement",
483
+ "validation_split": "train",
484
+ "doc_to_text": "",
485
+ "doc_to_target": 0,
486
+ "unsafe_code": false,
487
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
488
+ "description": "",
489
+ "target_delimiter": " ",
490
+ "fewshot_delimiter": "\n\n",
491
+ "num_fewshot": 0,
492
+ "metric_list": [
493
+ {
494
+ "metric": "acc",
495
+ "aggregation": "mean",
496
+ "higher_is_better": true
497
+ }
498
+ ],
499
+ "output_type": "multiple_choice",
500
+ "repeats": 1,
501
+ "should_decontaminate": true,
502
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
503
+ "metadata": {
504
+ "version": 1.0
505
+ }
506
+ },
507
+ "blimp_animate_subject_passive": {
508
+ "task": "blimp_animate_subject_passive",
509
+ "dataset_path": "blimp",
510
+ "dataset_name": "animate_subject_passive",
511
+ "validation_split": "train",
512
+ "doc_to_text": "",
513
+ "doc_to_target": 0,
514
+ "unsafe_code": false,
515
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
516
+ "description": "",
517
+ "target_delimiter": " ",
518
+ "fewshot_delimiter": "\n\n",
519
+ "num_fewshot": 0,
520
+ "metric_list": [
521
+ {
522
+ "metric": "acc",
523
+ "aggregation": "mean",
524
+ "higher_is_better": true
525
+ }
526
+ ],
527
+ "output_type": "multiple_choice",
528
+ "repeats": 1,
529
+ "should_decontaminate": true,
530
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
531
+ "metadata": {
532
+ "version": 1.0
533
+ }
534
+ },
535
+ "blimp_animate_subject_trans": {
536
+ "task": "blimp_animate_subject_trans",
537
+ "dataset_path": "blimp",
538
+ "dataset_name": "animate_subject_trans",
539
+ "validation_split": "train",
540
+ "doc_to_text": "",
541
+ "doc_to_target": 0,
542
+ "unsafe_code": false,
543
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
544
+ "description": "",
545
+ "target_delimiter": " ",
546
+ "fewshot_delimiter": "\n\n",
547
+ "num_fewshot": 0,
548
+ "metric_list": [
549
+ {
550
+ "metric": "acc",
551
+ "aggregation": "mean",
552
+ "higher_is_better": true
553
+ }
554
+ ],
555
+ "output_type": "multiple_choice",
556
+ "repeats": 1,
557
+ "should_decontaminate": true,
558
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
559
+ "metadata": {
560
+ "version": 1.0
561
+ }
562
+ },
563
+ "blimp_causative": {
564
+ "task": "blimp_causative",
565
+ "dataset_path": "blimp",
566
+ "dataset_name": "causative",
567
+ "validation_split": "train",
568
+ "doc_to_text": "",
569
+ "doc_to_target": 0,
570
+ "unsafe_code": false,
571
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
572
+ "description": "",
573
+ "target_delimiter": " ",
574
+ "fewshot_delimiter": "\n\n",
575
+ "num_fewshot": 0,
576
+ "metric_list": [
577
+ {
578
+ "metric": "acc",
579
+ "aggregation": "mean",
580
+ "higher_is_better": true
581
+ }
582
+ ],
583
+ "output_type": "multiple_choice",
584
+ "repeats": 1,
585
+ "should_decontaminate": true,
586
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
587
+ "metadata": {
588
+ "version": 1.0
589
+ }
590
+ },
591
+ "blimp_complex_NP_island": {
592
+ "task": "blimp_complex_NP_island",
593
+ "dataset_path": "blimp",
594
+ "dataset_name": "complex_NP_island",
595
+ "validation_split": "train",
596
+ "doc_to_text": "",
597
+ "doc_to_target": 0,
598
+ "unsafe_code": false,
599
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
600
+ "description": "",
601
+ "target_delimiter": " ",
602
+ "fewshot_delimiter": "\n\n",
603
+ "num_fewshot": 0,
604
+ "metric_list": [
605
+ {
606
+ "metric": "acc",
607
+ "aggregation": "mean",
608
+ "higher_is_better": true
609
+ }
610
+ ],
611
+ "output_type": "multiple_choice",
612
+ "repeats": 1,
613
+ "should_decontaminate": true,
614
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
615
+ "metadata": {
616
+ "version": 1.0
617
+ }
618
+ },
619
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
620
+ "task": "blimp_coordinate_structure_constraint_complex_left_branch",
621
+ "dataset_path": "blimp",
622
+ "dataset_name": "coordinate_structure_constraint_complex_left_branch",
623
+ "validation_split": "train",
624
+ "doc_to_text": "",
625
+ "doc_to_target": 0,
626
+ "unsafe_code": false,
627
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
628
+ "description": "",
629
+ "target_delimiter": " ",
630
+ "fewshot_delimiter": "\n\n",
631
+ "num_fewshot": 0,
632
+ "metric_list": [
633
+ {
634
+ "metric": "acc",
635
+ "aggregation": "mean",
636
+ "higher_is_better": true
637
+ }
638
+ ],
639
+ "output_type": "multiple_choice",
640
+ "repeats": 1,
641
+ "should_decontaminate": true,
642
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
643
+ "metadata": {
644
+ "version": 1.0
645
+ }
646
+ },
647
+ "blimp_coordinate_structure_constraint_object_extraction": {
648
+ "task": "blimp_coordinate_structure_constraint_object_extraction",
649
+ "dataset_path": "blimp",
650
+ "dataset_name": "coordinate_structure_constraint_object_extraction",
651
+ "validation_split": "train",
652
+ "doc_to_text": "",
653
+ "doc_to_target": 0,
654
+ "unsafe_code": false,
655
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
656
+ "description": "",
657
+ "target_delimiter": " ",
658
+ "fewshot_delimiter": "\n\n",
659
+ "num_fewshot": 0,
660
+ "metric_list": [
661
+ {
662
+ "metric": "acc",
663
+ "aggregation": "mean",
664
+ "higher_is_better": true
665
+ }
666
+ ],
667
+ "output_type": "multiple_choice",
668
+ "repeats": 1,
669
+ "should_decontaminate": true,
670
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
671
+ "metadata": {
672
+ "version": 1.0
673
+ }
674
+ },
675
+ "blimp_determiner_noun_agreement_1": {
676
+ "task": "blimp_determiner_noun_agreement_1",
677
+ "dataset_path": "blimp",
678
+ "dataset_name": "determiner_noun_agreement_1",
679
+ "validation_split": "train",
680
+ "doc_to_text": "",
681
+ "doc_to_target": 0,
682
+ "unsafe_code": false,
683
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
684
+ "description": "",
685
+ "target_delimiter": " ",
686
+ "fewshot_delimiter": "\n\n",
687
+ "num_fewshot": 0,
688
+ "metric_list": [
689
+ {
690
+ "metric": "acc",
691
+ "aggregation": "mean",
692
+ "higher_is_better": true
693
+ }
694
+ ],
695
+ "output_type": "multiple_choice",
696
+ "repeats": 1,
697
+ "should_decontaminate": true,
698
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
699
+ "metadata": {
700
+ "version": 1.0
701
+ }
702
+ },
703
+ "blimp_determiner_noun_agreement_2": {
704
+ "task": "blimp_determiner_noun_agreement_2",
705
+ "dataset_path": "blimp",
706
+ "dataset_name": "determiner_noun_agreement_2",
707
+ "validation_split": "train",
708
+ "doc_to_text": "",
709
+ "doc_to_target": 0,
710
+ "unsafe_code": false,
711
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
712
+ "description": "",
713
+ "target_delimiter": " ",
714
+ "fewshot_delimiter": "\n\n",
715
+ "num_fewshot": 0,
716
+ "metric_list": [
717
+ {
718
+ "metric": "acc",
719
+ "aggregation": "mean",
720
+ "higher_is_better": true
721
+ }
722
+ ],
723
+ "output_type": "multiple_choice",
724
+ "repeats": 1,
725
+ "should_decontaminate": true,
726
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
727
+ "metadata": {
728
+ "version": 1.0
729
+ }
730
+ },
731
+ "blimp_determiner_noun_agreement_irregular_1": {
732
+ "task": "blimp_determiner_noun_agreement_irregular_1",
733
+ "dataset_path": "blimp",
734
+ "dataset_name": "determiner_noun_agreement_irregular_1",
735
+ "validation_split": "train",
736
+ "doc_to_text": "",
737
+ "doc_to_target": 0,
738
+ "unsafe_code": false,
739
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
740
+ "description": "",
741
+ "target_delimiter": " ",
742
+ "fewshot_delimiter": "\n\n",
743
+ "num_fewshot": 0,
744
+ "metric_list": [
745
+ {
746
+ "metric": "acc",
747
+ "aggregation": "mean",
748
+ "higher_is_better": true
749
+ }
750
+ ],
751
+ "output_type": "multiple_choice",
752
+ "repeats": 1,
753
+ "should_decontaminate": true,
754
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
755
+ "metadata": {
756
+ "version": 1.0
757
+ }
758
+ },
759
+ "blimp_determiner_noun_agreement_irregular_2": {
760
+ "task": "blimp_determiner_noun_agreement_irregular_2",
761
+ "dataset_path": "blimp",
762
+ "dataset_name": "determiner_noun_agreement_irregular_2",
763
+ "validation_split": "train",
764
+ "doc_to_text": "",
765
+ "doc_to_target": 0,
766
+ "unsafe_code": false,
767
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
768
+ "description": "",
769
+ "target_delimiter": " ",
770
+ "fewshot_delimiter": "\n\n",
771
+ "num_fewshot": 0,
772
+ "metric_list": [
773
+ {
774
+ "metric": "acc",
775
+ "aggregation": "mean",
776
+ "higher_is_better": true
777
+ }
778
+ ],
779
+ "output_type": "multiple_choice",
780
+ "repeats": 1,
781
+ "should_decontaminate": true,
782
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
783
+ "metadata": {
784
+ "version": 1.0
785
+ }
786
+ },
787
+ "blimp_determiner_noun_agreement_with_adj_2": {
788
+ "task": "blimp_determiner_noun_agreement_with_adj_2",
789
+ "dataset_path": "blimp",
790
+ "dataset_name": "determiner_noun_agreement_with_adj_2",
791
+ "validation_split": "train",
792
+ "doc_to_text": "",
793
+ "doc_to_target": 0,
794
+ "unsafe_code": false,
795
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
796
+ "description": "",
797
+ "target_delimiter": " ",
798
+ "fewshot_delimiter": "\n\n",
799
+ "num_fewshot": 0,
800
+ "metric_list": [
801
+ {
802
+ "metric": "acc",
803
+ "aggregation": "mean",
804
+ "higher_is_better": true
805
+ }
806
+ ],
807
+ "output_type": "multiple_choice",
808
+ "repeats": 1,
809
+ "should_decontaminate": true,
810
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
811
+ "metadata": {
812
+ "version": 1.0
813
+ }
814
+ },
815
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
816
+ "task": "blimp_determiner_noun_agreement_with_adj_irregular_1",
817
+ "dataset_path": "blimp",
818
+ "dataset_name": "determiner_noun_agreement_with_adj_irregular_1",
819
+ "validation_split": "train",
820
+ "doc_to_text": "",
821
+ "doc_to_target": 0,
822
+ "unsafe_code": false,
823
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
824
+ "description": "",
825
+ "target_delimiter": " ",
826
+ "fewshot_delimiter": "\n\n",
827
+ "num_fewshot": 0,
828
+ "metric_list": [
829
+ {
830
+ "metric": "acc",
831
+ "aggregation": "mean",
832
+ "higher_is_better": true
833
+ }
834
+ ],
835
+ "output_type": "multiple_choice",
836
+ "repeats": 1,
837
+ "should_decontaminate": true,
838
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
839
+ "metadata": {
840
+ "version": 1.0
841
+ }
842
+ },
843
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
844
+ "task": "blimp_determiner_noun_agreement_with_adj_irregular_2",
845
+ "dataset_path": "blimp",
846
+ "dataset_name": "determiner_noun_agreement_with_adj_irregular_2",
847
+ "validation_split": "train",
848
+ "doc_to_text": "",
849
+ "doc_to_target": 0,
850
+ "unsafe_code": false,
851
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
852
+ "description": "",
853
+ "target_delimiter": " ",
854
+ "fewshot_delimiter": "\n\n",
855
+ "num_fewshot": 0,
856
+ "metric_list": [
857
+ {
858
+ "metric": "acc",
859
+ "aggregation": "mean",
860
+ "higher_is_better": true
861
+ }
862
+ ],
863
+ "output_type": "multiple_choice",
864
+ "repeats": 1,
865
+ "should_decontaminate": true,
866
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
867
+ "metadata": {
868
+ "version": 1.0
869
+ }
870
+ },
871
+ "blimp_determiner_noun_agreement_with_adjective_1": {
872
+ "task": "blimp_determiner_noun_agreement_with_adjective_1",
873
+ "dataset_path": "blimp",
874
+ "dataset_name": "determiner_noun_agreement_with_adjective_1",
875
+ "validation_split": "train",
876
+ "doc_to_text": "",
877
+ "doc_to_target": 0,
878
+ "unsafe_code": false,
879
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
880
+ "description": "",
881
+ "target_delimiter": " ",
882
+ "fewshot_delimiter": "\n\n",
883
+ "num_fewshot": 0,
884
+ "metric_list": [
885
+ {
886
+ "metric": "acc",
887
+ "aggregation": "mean",
888
+ "higher_is_better": true
889
+ }
890
+ ],
891
+ "output_type": "multiple_choice",
892
+ "repeats": 1,
893
+ "should_decontaminate": true,
894
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
895
+ "metadata": {
896
+ "version": 1.0
897
+ }
898
+ },
899
+ "blimp_distractor_agreement_relational_noun": {
900
+ "task": "blimp_distractor_agreement_relational_noun",
901
+ "dataset_path": "blimp",
902
+ "dataset_name": "distractor_agreement_relational_noun",
903
+ "validation_split": "train",
904
+ "doc_to_text": "",
905
+ "doc_to_target": 0,
906
+ "unsafe_code": false,
907
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
908
+ "description": "",
909
+ "target_delimiter": " ",
910
+ "fewshot_delimiter": "\n\n",
911
+ "num_fewshot": 0,
912
+ "metric_list": [
913
+ {
914
+ "metric": "acc",
915
+ "aggregation": "mean",
916
+ "higher_is_better": true
917
+ }
918
+ ],
919
+ "output_type": "multiple_choice",
920
+ "repeats": 1,
921
+ "should_decontaminate": true,
922
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
923
+ "metadata": {
924
+ "version": 1.0
925
+ }
926
+ },
927
+ "blimp_distractor_agreement_relative_clause": {
928
+ "task": "blimp_distractor_agreement_relative_clause",
929
+ "dataset_path": "blimp",
930
+ "dataset_name": "distractor_agreement_relative_clause",
931
+ "validation_split": "train",
932
+ "doc_to_text": "",
933
+ "doc_to_target": 0,
934
+ "unsafe_code": false,
935
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
936
+ "description": "",
937
+ "target_delimiter": " ",
938
+ "fewshot_delimiter": "\n\n",
939
+ "num_fewshot": 0,
940
+ "metric_list": [
941
+ {
942
+ "metric": "acc",
943
+ "aggregation": "mean",
944
+ "higher_is_better": true
945
+ }
946
+ ],
947
+ "output_type": "multiple_choice",
948
+ "repeats": 1,
949
+ "should_decontaminate": true,
950
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
951
+ "metadata": {
952
+ "version": 1.0
953
+ }
954
+ },
955
+ "blimp_drop_argument": {
956
+ "task": "blimp_drop_argument",
957
+ "dataset_path": "blimp",
958
+ "dataset_name": "drop_argument",
959
+ "validation_split": "train",
960
+ "doc_to_text": "",
961
+ "doc_to_target": 0,
962
+ "unsafe_code": false,
963
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
964
+ "description": "",
965
+ "target_delimiter": " ",
966
+ "fewshot_delimiter": "\n\n",
967
+ "num_fewshot": 0,
968
+ "metric_list": [
969
+ {
970
+ "metric": "acc",
971
+ "aggregation": "mean",
972
+ "higher_is_better": true
973
+ }
974
+ ],
975
+ "output_type": "multiple_choice",
976
+ "repeats": 1,
977
+ "should_decontaminate": true,
978
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
979
+ "metadata": {
980
+ "version": 1.0
981
+ }
982
+ },
983
+ "blimp_ellipsis_n_bar_1": {
984
+ "task": "blimp_ellipsis_n_bar_1",
985
+ "dataset_path": "blimp",
986
+ "dataset_name": "ellipsis_n_bar_1",
987
+ "validation_split": "train",
988
+ "doc_to_text": "",
989
+ "doc_to_target": 0,
990
+ "unsafe_code": false,
991
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
992
+ "description": "",
993
+ "target_delimiter": " ",
994
+ "fewshot_delimiter": "\n\n",
995
+ "num_fewshot": 0,
996
+ "metric_list": [
997
+ {
998
+ "metric": "acc",
999
+ "aggregation": "mean",
1000
+ "higher_is_better": true
1001
+ }
1002
+ ],
1003
+ "output_type": "multiple_choice",
1004
+ "repeats": 1,
1005
+ "should_decontaminate": true,
1006
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1007
+ "metadata": {
1008
+ "version": 1.0
1009
+ }
1010
+ },
1011
+ "blimp_ellipsis_n_bar_2": {
1012
+ "task": "blimp_ellipsis_n_bar_2",
1013
+ "dataset_path": "blimp",
1014
+ "dataset_name": "ellipsis_n_bar_2",
1015
+ "validation_split": "train",
1016
+ "doc_to_text": "",
1017
+ "doc_to_target": 0,
1018
+ "unsafe_code": false,
1019
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1020
+ "description": "",
1021
+ "target_delimiter": " ",
1022
+ "fewshot_delimiter": "\n\n",
1023
+ "num_fewshot": 0,
1024
+ "metric_list": [
1025
+ {
1026
+ "metric": "acc",
1027
+ "aggregation": "mean",
1028
+ "higher_is_better": true
1029
+ }
1030
+ ],
1031
+ "output_type": "multiple_choice",
1032
+ "repeats": 1,
1033
+ "should_decontaminate": true,
1034
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1035
+ "metadata": {
1036
+ "version": 1.0
1037
+ }
1038
+ },
1039
+ "blimp_existential_there_object_raising": {
1040
+ "task": "blimp_existential_there_object_raising",
1041
+ "dataset_path": "blimp",
1042
+ "dataset_name": "existential_there_object_raising",
1043
+ "validation_split": "train",
1044
+ "doc_to_text": "",
1045
+ "doc_to_target": 0,
1046
+ "unsafe_code": false,
1047
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1048
+ "description": "",
1049
+ "target_delimiter": " ",
1050
+ "fewshot_delimiter": "\n\n",
1051
+ "num_fewshot": 0,
1052
+ "metric_list": [
1053
+ {
1054
+ "metric": "acc",
1055
+ "aggregation": "mean",
1056
+ "higher_is_better": true
1057
+ }
1058
+ ],
1059
+ "output_type": "multiple_choice",
1060
+ "repeats": 1,
1061
+ "should_decontaminate": true,
1062
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1063
+ "metadata": {
1064
+ "version": 1.0
1065
+ }
1066
+ },
1067
+ "blimp_existential_there_quantifiers_1": {
1068
+ "task": "blimp_existential_there_quantifiers_1",
1069
+ "dataset_path": "blimp",
1070
+ "dataset_name": "existential_there_quantifiers_1",
1071
+ "validation_split": "train",
1072
+ "doc_to_text": "",
1073
+ "doc_to_target": 0,
1074
+ "unsafe_code": false,
1075
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1076
+ "description": "",
1077
+ "target_delimiter": " ",
1078
+ "fewshot_delimiter": "\n\n",
1079
+ "num_fewshot": 0,
1080
+ "metric_list": [
1081
+ {
1082
+ "metric": "acc",
1083
+ "aggregation": "mean",
1084
+ "higher_is_better": true
1085
+ }
1086
+ ],
1087
+ "output_type": "multiple_choice",
1088
+ "repeats": 1,
1089
+ "should_decontaminate": true,
1090
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1091
+ "metadata": {
1092
+ "version": 1.0
1093
+ }
1094
+ },
1095
+ "blimp_existential_there_quantifiers_2": {
1096
+ "task": "blimp_existential_there_quantifiers_2",
1097
+ "dataset_path": "blimp",
1098
+ "dataset_name": "existential_there_quantifiers_2",
1099
+ "validation_split": "train",
1100
+ "doc_to_text": "",
1101
+ "doc_to_target": 0,
1102
+ "unsafe_code": false,
1103
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1104
+ "description": "",
1105
+ "target_delimiter": " ",
1106
+ "fewshot_delimiter": "\n\n",
1107
+ "num_fewshot": 0,
1108
+ "metric_list": [
1109
+ {
1110
+ "metric": "acc",
1111
+ "aggregation": "mean",
1112
+ "higher_is_better": true
1113
+ }
1114
+ ],
1115
+ "output_type": "multiple_choice",
1116
+ "repeats": 1,
1117
+ "should_decontaminate": true,
1118
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1119
+ "metadata": {
1120
+ "version": 1.0
1121
+ }
1122
+ },
1123
+ "blimp_existential_there_subject_raising": {
1124
+ "task": "blimp_existential_there_subject_raising",
1125
+ "dataset_path": "blimp",
1126
+ "dataset_name": "existential_there_subject_raising",
1127
+ "validation_split": "train",
1128
+ "doc_to_text": "",
1129
+ "doc_to_target": 0,
1130
+ "unsafe_code": false,
1131
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1132
+ "description": "",
1133
+ "target_delimiter": " ",
1134
+ "fewshot_delimiter": "\n\n",
1135
+ "num_fewshot": 0,
1136
+ "metric_list": [
1137
+ {
1138
+ "metric": "acc",
1139
+ "aggregation": "mean",
1140
+ "higher_is_better": true
1141
+ }
1142
+ ],
1143
+ "output_type": "multiple_choice",
1144
+ "repeats": 1,
1145
+ "should_decontaminate": true,
1146
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1147
+ "metadata": {
1148
+ "version": 1.0
1149
+ }
1150
+ },
1151
+ "blimp_expletive_it_object_raising": {
1152
+ "task": "blimp_expletive_it_object_raising",
1153
+ "dataset_path": "blimp",
1154
+ "dataset_name": "expletive_it_object_raising",
1155
+ "validation_split": "train",
1156
+ "doc_to_text": "",
1157
+ "doc_to_target": 0,
1158
+ "unsafe_code": false,
1159
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1160
+ "description": "",
1161
+ "target_delimiter": " ",
1162
+ "fewshot_delimiter": "\n\n",
1163
+ "num_fewshot": 0,
1164
+ "metric_list": [
1165
+ {
1166
+ "metric": "acc",
1167
+ "aggregation": "mean",
1168
+ "higher_is_better": true
1169
+ }
1170
+ ],
1171
+ "output_type": "multiple_choice",
1172
+ "repeats": 1,
1173
+ "should_decontaminate": true,
1174
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1175
+ "metadata": {
1176
+ "version": 1.0
1177
+ }
1178
+ },
1179
+ "blimp_inchoative": {
1180
+ "task": "blimp_inchoative",
1181
+ "dataset_path": "blimp",
1182
+ "dataset_name": "inchoative",
1183
+ "validation_split": "train",
1184
+ "doc_to_text": "",
1185
+ "doc_to_target": 0,
1186
+ "unsafe_code": false,
1187
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1188
+ "description": "",
1189
+ "target_delimiter": " ",
1190
+ "fewshot_delimiter": "\n\n",
1191
+ "num_fewshot": 0,
1192
+ "metric_list": [
1193
+ {
1194
+ "metric": "acc",
1195
+ "aggregation": "mean",
1196
+ "higher_is_better": true
1197
+ }
1198
+ ],
1199
+ "output_type": "multiple_choice",
1200
+ "repeats": 1,
1201
+ "should_decontaminate": true,
1202
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1203
+ "metadata": {
1204
+ "version": 1.0
1205
+ }
1206
+ },
1207
+ "blimp_intransitive": {
1208
+ "task": "blimp_intransitive",
1209
+ "dataset_path": "blimp",
1210
+ "dataset_name": "intransitive",
1211
+ "validation_split": "train",
1212
+ "doc_to_text": "",
1213
+ "doc_to_target": 0,
1214
+ "unsafe_code": false,
1215
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1216
+ "description": "",
1217
+ "target_delimiter": " ",
1218
+ "fewshot_delimiter": "\n\n",
1219
+ "num_fewshot": 0,
1220
+ "metric_list": [
1221
+ {
1222
+ "metric": "acc",
1223
+ "aggregation": "mean",
1224
+ "higher_is_better": true
1225
+ }
1226
+ ],
1227
+ "output_type": "multiple_choice",
1228
+ "repeats": 1,
1229
+ "should_decontaminate": true,
1230
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1231
+ "metadata": {
1232
+ "version": 1.0
1233
+ }
1234
+ },
1235
+ "blimp_irregular_past_participle_adjectives": {
1236
+ "task": "blimp_irregular_past_participle_adjectives",
1237
+ "dataset_path": "blimp",
1238
+ "dataset_name": "irregular_past_participle_adjectives",
1239
+ "validation_split": "train",
1240
+ "doc_to_text": "",
1241
+ "doc_to_target": 0,
1242
+ "unsafe_code": false,
1243
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1244
+ "description": "",
1245
+ "target_delimiter": " ",
1246
+ "fewshot_delimiter": "\n\n",
1247
+ "num_fewshot": 0,
1248
+ "metric_list": [
1249
+ {
1250
+ "metric": "acc",
1251
+ "aggregation": "mean",
1252
+ "higher_is_better": true
1253
+ }
1254
+ ],
1255
+ "output_type": "multiple_choice",
1256
+ "repeats": 1,
1257
+ "should_decontaminate": true,
1258
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1259
+ "metadata": {
1260
+ "version": 1.0
1261
+ }
1262
+ },
1263
+ "blimp_irregular_past_participle_verbs": {
1264
+ "task": "blimp_irregular_past_participle_verbs",
1265
+ "dataset_path": "blimp",
1266
+ "dataset_name": "irregular_past_participle_verbs",
1267
+ "validation_split": "train",
1268
+ "doc_to_text": "",
1269
+ "doc_to_target": 0,
1270
+ "unsafe_code": false,
1271
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1272
+ "description": "",
1273
+ "target_delimiter": " ",
1274
+ "fewshot_delimiter": "\n\n",
1275
+ "num_fewshot": 0,
1276
+ "metric_list": [
1277
+ {
1278
+ "metric": "acc",
1279
+ "aggregation": "mean",
1280
+ "higher_is_better": true
1281
+ }
1282
+ ],
1283
+ "output_type": "multiple_choice",
1284
+ "repeats": 1,
1285
+ "should_decontaminate": true,
1286
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1287
+ "metadata": {
1288
+ "version": 1.0
1289
+ }
1290
+ },
1291
+ "blimp_irregular_plural_subject_verb_agreement_1": {
1292
+ "task": "blimp_irregular_plural_subject_verb_agreement_1",
1293
+ "dataset_path": "blimp",
1294
+ "dataset_name": "irregular_plural_subject_verb_agreement_1",
1295
+ "validation_split": "train",
1296
+ "doc_to_text": "",
1297
+ "doc_to_target": 0,
1298
+ "unsafe_code": false,
1299
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1300
+ "description": "",
1301
+ "target_delimiter": " ",
1302
+ "fewshot_delimiter": "\n\n",
1303
+ "num_fewshot": 0,
1304
+ "metric_list": [
1305
+ {
1306
+ "metric": "acc",
1307
+ "aggregation": "mean",
1308
+ "higher_is_better": true
1309
+ }
1310
+ ],
1311
+ "output_type": "multiple_choice",
1312
+ "repeats": 1,
1313
+ "should_decontaminate": true,
1314
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1315
+ "metadata": {
1316
+ "version": 1.0
1317
+ }
1318
+ },
1319
+ "blimp_irregular_plural_subject_verb_agreement_2": {
1320
+ "task": "blimp_irregular_plural_subject_verb_agreement_2",
1321
+ "dataset_path": "blimp",
1322
+ "dataset_name": "irregular_plural_subject_verb_agreement_2",
1323
+ "validation_split": "train",
1324
+ "doc_to_text": "",
1325
+ "doc_to_target": 0,
1326
+ "unsafe_code": false,
1327
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1328
+ "description": "",
1329
+ "target_delimiter": " ",
1330
+ "fewshot_delimiter": "\n\n",
1331
+ "num_fewshot": 0,
1332
+ "metric_list": [
1333
+ {
1334
+ "metric": "acc",
1335
+ "aggregation": "mean",
1336
+ "higher_is_better": true
1337
+ }
1338
+ ],
1339
+ "output_type": "multiple_choice",
1340
+ "repeats": 1,
1341
+ "should_decontaminate": true,
1342
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1343
+ "metadata": {
1344
+ "version": 1.0
1345
+ }
1346
+ },
1347
+ "blimp_left_branch_island_echo_question": {
1348
+ "task": "blimp_left_branch_island_echo_question",
1349
+ "dataset_path": "blimp",
1350
+ "dataset_name": "left_branch_island_echo_question",
1351
+ "validation_split": "train",
1352
+ "doc_to_text": "",
1353
+ "doc_to_target": 0,
1354
+ "unsafe_code": false,
1355
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1356
+ "description": "",
1357
+ "target_delimiter": " ",
1358
+ "fewshot_delimiter": "\n\n",
1359
+ "num_fewshot": 0,
1360
+ "metric_list": [
1361
+ {
1362
+ "metric": "acc",
1363
+ "aggregation": "mean",
1364
+ "higher_is_better": true
1365
+ }
1366
+ ],
1367
+ "output_type": "multiple_choice",
1368
+ "repeats": 1,
1369
+ "should_decontaminate": true,
1370
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1371
+ "metadata": {
1372
+ "version": 1.0
1373
+ }
1374
+ },
1375
+ "blimp_left_branch_island_simple_question": {
1376
+ "task": "blimp_left_branch_island_simple_question",
1377
+ "dataset_path": "blimp",
1378
+ "dataset_name": "left_branch_island_simple_question",
1379
+ "validation_split": "train",
1380
+ "doc_to_text": "",
1381
+ "doc_to_target": 0,
1382
+ "unsafe_code": false,
1383
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1384
+ "description": "",
1385
+ "target_delimiter": " ",
1386
+ "fewshot_delimiter": "\n\n",
1387
+ "num_fewshot": 0,
1388
+ "metric_list": [
1389
+ {
1390
+ "metric": "acc",
1391
+ "aggregation": "mean",
1392
+ "higher_is_better": true
1393
+ }
1394
+ ],
1395
+ "output_type": "multiple_choice",
1396
+ "repeats": 1,
1397
+ "should_decontaminate": true,
1398
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1399
+ "metadata": {
1400
+ "version": 1.0
1401
+ }
1402
+ },
1403
+ "blimp_matrix_question_npi_licensor_present": {
1404
+ "task": "blimp_matrix_question_npi_licensor_present",
1405
+ "dataset_path": "blimp",
1406
+ "dataset_name": "matrix_question_npi_licensor_present",
1407
+ "validation_split": "train",
1408
+ "doc_to_text": "",
1409
+ "doc_to_target": 0,
1410
+ "unsafe_code": false,
1411
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1412
+ "description": "",
1413
+ "target_delimiter": " ",
1414
+ "fewshot_delimiter": "\n\n",
1415
+ "num_fewshot": 0,
1416
+ "metric_list": [
1417
+ {
1418
+ "metric": "acc",
1419
+ "aggregation": "mean",
1420
+ "higher_is_better": true
1421
+ }
1422
+ ],
1423
+ "output_type": "multiple_choice",
1424
+ "repeats": 1,
1425
+ "should_decontaminate": true,
1426
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1427
+ "metadata": {
1428
+ "version": 1.0
1429
+ }
1430
+ },
1431
+ "blimp_npi_present_1": {
1432
+ "task": "blimp_npi_present_1",
1433
+ "dataset_path": "blimp",
1434
+ "dataset_name": "npi_present_1",
1435
+ "validation_split": "train",
1436
+ "doc_to_text": "",
1437
+ "doc_to_target": 0,
1438
+ "unsafe_code": false,
1439
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1440
+ "description": "",
1441
+ "target_delimiter": " ",
1442
+ "fewshot_delimiter": "\n\n",
1443
+ "num_fewshot": 0,
1444
+ "metric_list": [
1445
+ {
1446
+ "metric": "acc",
1447
+ "aggregation": "mean",
1448
+ "higher_is_better": true
1449
+ }
1450
+ ],
1451
+ "output_type": "multiple_choice",
1452
+ "repeats": 1,
1453
+ "should_decontaminate": true,
1454
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1455
+ "metadata": {
1456
+ "version": 1.0
1457
+ }
1458
+ },
1459
+ "blimp_npi_present_2": {
1460
+ "task": "blimp_npi_present_2",
1461
+ "dataset_path": "blimp",
1462
+ "dataset_name": "npi_present_2",
1463
+ "validation_split": "train",
1464
+ "doc_to_text": "",
1465
+ "doc_to_target": 0,
1466
+ "unsafe_code": false,
1467
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1468
+ "description": "",
1469
+ "target_delimiter": " ",
1470
+ "fewshot_delimiter": "\n\n",
1471
+ "num_fewshot": 0,
1472
+ "metric_list": [
1473
+ {
1474
+ "metric": "acc",
1475
+ "aggregation": "mean",
1476
+ "higher_is_better": true
1477
+ }
1478
+ ],
1479
+ "output_type": "multiple_choice",
1480
+ "repeats": 1,
1481
+ "should_decontaminate": true,
1482
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1483
+ "metadata": {
1484
+ "version": 1.0
1485
+ }
1486
+ },
1487
+ "blimp_only_npi_licensor_present": {
1488
+ "task": "blimp_only_npi_licensor_present",
1489
+ "dataset_path": "blimp",
1490
+ "dataset_name": "only_npi_licensor_present",
1491
+ "validation_split": "train",
1492
+ "doc_to_text": "",
1493
+ "doc_to_target": 0,
1494
+ "unsafe_code": false,
1495
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1496
+ "description": "",
1497
+ "target_delimiter": " ",
1498
+ "fewshot_delimiter": "\n\n",
1499
+ "num_fewshot": 0,
1500
+ "metric_list": [
1501
+ {
1502
+ "metric": "acc",
1503
+ "aggregation": "mean",
1504
+ "higher_is_better": true
1505
+ }
1506
+ ],
1507
+ "output_type": "multiple_choice",
1508
+ "repeats": 1,
1509
+ "should_decontaminate": true,
1510
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1511
+ "metadata": {
1512
+ "version": 1.0
1513
+ }
1514
+ },
1515
+ "blimp_only_npi_scope": {
1516
+ "task": "blimp_only_npi_scope",
1517
+ "dataset_path": "blimp",
1518
+ "dataset_name": "only_npi_scope",
1519
+ "validation_split": "train",
1520
+ "doc_to_text": "",
1521
+ "doc_to_target": 0,
1522
+ "unsafe_code": false,
1523
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1524
+ "description": "",
1525
+ "target_delimiter": " ",
1526
+ "fewshot_delimiter": "\n\n",
1527
+ "num_fewshot": 0,
1528
+ "metric_list": [
1529
+ {
1530
+ "metric": "acc",
1531
+ "aggregation": "mean",
1532
+ "higher_is_better": true
1533
+ }
1534
+ ],
1535
+ "output_type": "multiple_choice",
1536
+ "repeats": 1,
1537
+ "should_decontaminate": true,
1538
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1539
+ "metadata": {
1540
+ "version": 1.0
1541
+ }
1542
+ },
1543
+ "blimp_passive_1": {
1544
+ "task": "blimp_passive_1",
1545
+ "dataset_path": "blimp",
1546
+ "dataset_name": "passive_1",
1547
+ "validation_split": "train",
1548
+ "doc_to_text": "",
1549
+ "doc_to_target": 0,
1550
+ "unsafe_code": false,
1551
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1552
+ "description": "",
1553
+ "target_delimiter": " ",
1554
+ "fewshot_delimiter": "\n\n",
1555
+ "num_fewshot": 0,
1556
+ "metric_list": [
1557
+ {
1558
+ "metric": "acc",
1559
+ "aggregation": "mean",
1560
+ "higher_is_better": true
1561
+ }
1562
+ ],
1563
+ "output_type": "multiple_choice",
1564
+ "repeats": 1,
1565
+ "should_decontaminate": true,
1566
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1567
+ "metadata": {
1568
+ "version": 1.0
1569
+ }
1570
+ },
1571
+ "blimp_passive_2": {
1572
+ "task": "blimp_passive_2",
1573
+ "dataset_path": "blimp",
1574
+ "dataset_name": "passive_2",
1575
+ "validation_split": "train",
1576
+ "doc_to_text": "",
1577
+ "doc_to_target": 0,
1578
+ "unsafe_code": false,
1579
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1580
+ "description": "",
1581
+ "target_delimiter": " ",
1582
+ "fewshot_delimiter": "\n\n",
1583
+ "num_fewshot": 0,
1584
+ "metric_list": [
1585
+ {
1586
+ "metric": "acc",
1587
+ "aggregation": "mean",
1588
+ "higher_is_better": true
1589
+ }
1590
+ ],
1591
+ "output_type": "multiple_choice",
1592
+ "repeats": 1,
1593
+ "should_decontaminate": true,
1594
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1595
+ "metadata": {
1596
+ "version": 1.0
1597
+ }
1598
+ },
1599
+ "blimp_principle_A_c_command": {
1600
+ "task": "blimp_principle_A_c_command",
1601
+ "dataset_path": "blimp",
1602
+ "dataset_name": "principle_A_c_command",
1603
+ "validation_split": "train",
1604
+ "doc_to_text": "",
1605
+ "doc_to_target": 0,
1606
+ "unsafe_code": false,
1607
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1608
+ "description": "",
1609
+ "target_delimiter": " ",
1610
+ "fewshot_delimiter": "\n\n",
1611
+ "num_fewshot": 0,
1612
+ "metric_list": [
1613
+ {
1614
+ "metric": "acc",
1615
+ "aggregation": "mean",
1616
+ "higher_is_better": true
1617
+ }
1618
+ ],
1619
+ "output_type": "multiple_choice",
1620
+ "repeats": 1,
1621
+ "should_decontaminate": true,
1622
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1623
+ "metadata": {
1624
+ "version": 1.0
1625
+ }
1626
+ },
1627
+ "blimp_principle_A_case_1": {
1628
+ "task": "blimp_principle_A_case_1",
1629
+ "dataset_path": "blimp",
1630
+ "dataset_name": "principle_A_case_1",
1631
+ "validation_split": "train",
1632
+ "doc_to_text": "",
1633
+ "doc_to_target": 0,
1634
+ "unsafe_code": false,
1635
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1636
+ "description": "",
1637
+ "target_delimiter": " ",
1638
+ "fewshot_delimiter": "\n\n",
1639
+ "num_fewshot": 0,
1640
+ "metric_list": [
1641
+ {
1642
+ "metric": "acc",
1643
+ "aggregation": "mean",
1644
+ "higher_is_better": true
1645
+ }
1646
+ ],
1647
+ "output_type": "multiple_choice",
1648
+ "repeats": 1,
1649
+ "should_decontaminate": true,
1650
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1651
+ "metadata": {
1652
+ "version": 1.0
1653
+ }
1654
+ },
1655
+ "blimp_principle_A_case_2": {
1656
+ "task": "blimp_principle_A_case_2",
1657
+ "dataset_path": "blimp",
1658
+ "dataset_name": "principle_A_case_2",
1659
+ "validation_split": "train",
1660
+ "doc_to_text": "",
1661
+ "doc_to_target": 0,
1662
+ "unsafe_code": false,
1663
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1664
+ "description": "",
1665
+ "target_delimiter": " ",
1666
+ "fewshot_delimiter": "\n\n",
1667
+ "num_fewshot": 0,
1668
+ "metric_list": [
1669
+ {
1670
+ "metric": "acc",
1671
+ "aggregation": "mean",
1672
+ "higher_is_better": true
1673
+ }
1674
+ ],
1675
+ "output_type": "multiple_choice",
1676
+ "repeats": 1,
1677
+ "should_decontaminate": true,
1678
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1679
+ "metadata": {
1680
+ "version": 1.0
1681
+ }
1682
+ },
1683
+ "blimp_principle_A_domain_1": {
1684
+ "task": "blimp_principle_A_domain_1",
1685
+ "dataset_path": "blimp",
1686
+ "dataset_name": "principle_A_domain_1",
1687
+ "validation_split": "train",
1688
+ "doc_to_text": "",
1689
+ "doc_to_target": 0,
1690
+ "unsafe_code": false,
1691
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1692
+ "description": "",
1693
+ "target_delimiter": " ",
1694
+ "fewshot_delimiter": "\n\n",
1695
+ "num_fewshot": 0,
1696
+ "metric_list": [
1697
+ {
1698
+ "metric": "acc",
1699
+ "aggregation": "mean",
1700
+ "higher_is_better": true
1701
+ }
1702
+ ],
1703
+ "output_type": "multiple_choice",
1704
+ "repeats": 1,
1705
+ "should_decontaminate": true,
1706
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1707
+ "metadata": {
1708
+ "version": 1.0
1709
+ }
1710
+ },
1711
+ "blimp_principle_A_domain_2": {
1712
+ "task": "blimp_principle_A_domain_2",
1713
+ "dataset_path": "blimp",
1714
+ "dataset_name": "principle_A_domain_2",
1715
+ "validation_split": "train",
1716
+ "doc_to_text": "",
1717
+ "doc_to_target": 0,
1718
+ "unsafe_code": false,
1719
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1720
+ "description": "",
1721
+ "target_delimiter": " ",
1722
+ "fewshot_delimiter": "\n\n",
1723
+ "num_fewshot": 0,
1724
+ "metric_list": [
1725
+ {
1726
+ "metric": "acc",
1727
+ "aggregation": "mean",
1728
+ "higher_is_better": true
1729
+ }
1730
+ ],
1731
+ "output_type": "multiple_choice",
1732
+ "repeats": 1,
1733
+ "should_decontaminate": true,
1734
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1735
+ "metadata": {
1736
+ "version": 1.0
1737
+ }
1738
+ },
1739
+ "blimp_principle_A_domain_3": {
1740
+ "task": "blimp_principle_A_domain_3",
1741
+ "dataset_path": "blimp",
1742
+ "dataset_name": "principle_A_domain_3",
1743
+ "validation_split": "train",
1744
+ "doc_to_text": "",
1745
+ "doc_to_target": 0,
1746
+ "unsafe_code": false,
1747
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1748
+ "description": "",
1749
+ "target_delimiter": " ",
1750
+ "fewshot_delimiter": "\n\n",
1751
+ "num_fewshot": 0,
1752
+ "metric_list": [
1753
+ {
1754
+ "metric": "acc",
1755
+ "aggregation": "mean",
1756
+ "higher_is_better": true
1757
+ }
1758
+ ],
1759
+ "output_type": "multiple_choice",
1760
+ "repeats": 1,
1761
+ "should_decontaminate": true,
1762
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1763
+ "metadata": {
1764
+ "version": 1.0
1765
+ }
1766
+ },
1767
+ "blimp_principle_A_reconstruction": {
1768
+ "task": "blimp_principle_A_reconstruction",
1769
+ "dataset_path": "blimp",
1770
+ "dataset_name": "principle_A_reconstruction",
1771
+ "validation_split": "train",
1772
+ "doc_to_text": "",
1773
+ "doc_to_target": 0,
1774
+ "unsafe_code": false,
1775
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1776
+ "description": "",
1777
+ "target_delimiter": " ",
1778
+ "fewshot_delimiter": "\n\n",
1779
+ "num_fewshot": 0,
1780
+ "metric_list": [
1781
+ {
1782
+ "metric": "acc",
1783
+ "aggregation": "mean",
1784
+ "higher_is_better": true
1785
+ }
1786
+ ],
1787
+ "output_type": "multiple_choice",
1788
+ "repeats": 1,
1789
+ "should_decontaminate": true,
1790
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1791
+ "metadata": {
1792
+ "version": 1.0
1793
+ }
1794
+ },
1795
+ "blimp_regular_plural_subject_verb_agreement_1": {
1796
+ "task": "blimp_regular_plural_subject_verb_agreement_1",
1797
+ "dataset_path": "blimp",
1798
+ "dataset_name": "regular_plural_subject_verb_agreement_1",
1799
+ "validation_split": "train",
1800
+ "doc_to_text": "",
1801
+ "doc_to_target": 0,
1802
+ "unsafe_code": false,
1803
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1804
+ "description": "",
1805
+ "target_delimiter": " ",
1806
+ "fewshot_delimiter": "\n\n",
1807
+ "num_fewshot": 0,
1808
+ "metric_list": [
1809
+ {
1810
+ "metric": "acc",
1811
+ "aggregation": "mean",
1812
+ "higher_is_better": true
1813
+ }
1814
+ ],
1815
+ "output_type": "multiple_choice",
1816
+ "repeats": 1,
1817
+ "should_decontaminate": true,
1818
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1819
+ "metadata": {
1820
+ "version": 1.0
1821
+ }
1822
+ },
1823
+ "blimp_regular_plural_subject_verb_agreement_2": {
1824
+ "task": "blimp_regular_plural_subject_verb_agreement_2",
1825
+ "dataset_path": "blimp",
1826
+ "dataset_name": "regular_plural_subject_verb_agreement_2",
1827
+ "validation_split": "train",
1828
+ "doc_to_text": "",
1829
+ "doc_to_target": 0,
1830
+ "unsafe_code": false,
1831
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1832
+ "description": "",
1833
+ "target_delimiter": " ",
1834
+ "fewshot_delimiter": "\n\n",
1835
+ "num_fewshot": 0,
1836
+ "metric_list": [
1837
+ {
1838
+ "metric": "acc",
1839
+ "aggregation": "mean",
1840
+ "higher_is_better": true
1841
+ }
1842
+ ],
1843
+ "output_type": "multiple_choice",
1844
+ "repeats": 1,
1845
+ "should_decontaminate": true,
1846
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1847
+ "metadata": {
1848
+ "version": 1.0
1849
+ }
1850
+ },
1851
+ "blimp_sentential_negation_npi_licensor_present": {
1852
+ "task": "blimp_sentential_negation_npi_licensor_present",
1853
+ "dataset_path": "blimp",
1854
+ "dataset_name": "sentential_negation_npi_licensor_present",
1855
+ "validation_split": "train",
1856
+ "doc_to_text": "",
1857
+ "doc_to_target": 0,
1858
+ "unsafe_code": false,
1859
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1860
+ "description": "",
1861
+ "target_delimiter": " ",
1862
+ "fewshot_delimiter": "\n\n",
1863
+ "num_fewshot": 0,
1864
+ "metric_list": [
1865
+ {
1866
+ "metric": "acc",
1867
+ "aggregation": "mean",
1868
+ "higher_is_better": true
1869
+ }
1870
+ ],
1871
+ "output_type": "multiple_choice",
1872
+ "repeats": 1,
1873
+ "should_decontaminate": true,
1874
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1875
+ "metadata": {
1876
+ "version": 1.0
1877
+ }
1878
+ },
1879
+ "blimp_sentential_negation_npi_scope": {
1880
+ "task": "blimp_sentential_negation_npi_scope",
1881
+ "dataset_path": "blimp",
1882
+ "dataset_name": "sentential_negation_npi_scope",
1883
+ "validation_split": "train",
1884
+ "doc_to_text": "",
1885
+ "doc_to_target": 0,
1886
+ "unsafe_code": false,
1887
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1888
+ "description": "",
1889
+ "target_delimiter": " ",
1890
+ "fewshot_delimiter": "\n\n",
1891
+ "num_fewshot": 0,
1892
+ "metric_list": [
1893
+ {
1894
+ "metric": "acc",
1895
+ "aggregation": "mean",
1896
+ "higher_is_better": true
1897
+ }
1898
+ ],
1899
+ "output_type": "multiple_choice",
1900
+ "repeats": 1,
1901
+ "should_decontaminate": true,
1902
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1903
+ "metadata": {
1904
+ "version": 1.0
1905
+ }
1906
+ },
1907
+ "blimp_sentential_subject_island": {
1908
+ "task": "blimp_sentential_subject_island",
1909
+ "dataset_path": "blimp",
1910
+ "dataset_name": "sentential_subject_island",
1911
+ "validation_split": "train",
1912
+ "doc_to_text": "",
1913
+ "doc_to_target": 0,
1914
+ "unsafe_code": false,
1915
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1916
+ "description": "",
1917
+ "target_delimiter": " ",
1918
+ "fewshot_delimiter": "\n\n",
1919
+ "num_fewshot": 0,
1920
+ "metric_list": [
1921
+ {
1922
+ "metric": "acc",
1923
+ "aggregation": "mean",
1924
+ "higher_is_better": true
1925
+ }
1926
+ ],
1927
+ "output_type": "multiple_choice",
1928
+ "repeats": 1,
1929
+ "should_decontaminate": true,
1930
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1931
+ "metadata": {
1932
+ "version": 1.0
1933
+ }
1934
+ },
1935
+ "blimp_superlative_quantifiers_1": {
1936
+ "task": "blimp_superlative_quantifiers_1",
1937
+ "dataset_path": "blimp",
1938
+ "dataset_name": "superlative_quantifiers_1",
1939
+ "validation_split": "train",
1940
+ "doc_to_text": "",
1941
+ "doc_to_target": 0,
1942
+ "unsafe_code": false,
1943
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1944
+ "description": "",
1945
+ "target_delimiter": " ",
1946
+ "fewshot_delimiter": "\n\n",
1947
+ "num_fewshot": 0,
1948
+ "metric_list": [
1949
+ {
1950
+ "metric": "acc",
1951
+ "aggregation": "mean",
1952
+ "higher_is_better": true
1953
+ }
1954
+ ],
1955
+ "output_type": "multiple_choice",
1956
+ "repeats": 1,
1957
+ "should_decontaminate": true,
1958
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1959
+ "metadata": {
1960
+ "version": 1.0
1961
+ }
1962
+ },
1963
+ "blimp_superlative_quantifiers_2": {
1964
+ "task": "blimp_superlative_quantifiers_2",
1965
+ "dataset_path": "blimp",
1966
+ "dataset_name": "superlative_quantifiers_2",
1967
+ "validation_split": "train",
1968
+ "doc_to_text": "",
1969
+ "doc_to_target": 0,
1970
+ "unsafe_code": false,
1971
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1972
+ "description": "",
1973
+ "target_delimiter": " ",
1974
+ "fewshot_delimiter": "\n\n",
1975
+ "num_fewshot": 0,
1976
+ "metric_list": [
1977
+ {
1978
+ "metric": "acc",
1979
+ "aggregation": "mean",
1980
+ "higher_is_better": true
1981
+ }
1982
+ ],
1983
+ "output_type": "multiple_choice",
1984
+ "repeats": 1,
1985
+ "should_decontaminate": true,
1986
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1987
+ "metadata": {
1988
+ "version": 1.0
1989
+ }
1990
+ },
1991
+ "blimp_tough_vs_raising_1": {
1992
+ "task": "blimp_tough_vs_raising_1",
1993
+ "dataset_path": "blimp",
1994
+ "dataset_name": "tough_vs_raising_1",
1995
+ "validation_split": "train",
1996
+ "doc_to_text": "",
1997
+ "doc_to_target": 0,
1998
+ "unsafe_code": false,
1999
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2000
+ "description": "",
2001
+ "target_delimiter": " ",
2002
+ "fewshot_delimiter": "\n\n",
2003
+ "num_fewshot": 0,
2004
+ "metric_list": [
2005
+ {
2006
+ "metric": "acc",
2007
+ "aggregation": "mean",
2008
+ "higher_is_better": true
2009
+ }
2010
+ ],
2011
+ "output_type": "multiple_choice",
2012
+ "repeats": 1,
2013
+ "should_decontaminate": true,
2014
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2015
+ "metadata": {
2016
+ "version": 1.0
2017
+ }
2018
+ },
2019
+ "blimp_tough_vs_raising_2": {
2020
+ "task": "blimp_tough_vs_raising_2",
2021
+ "dataset_path": "blimp",
2022
+ "dataset_name": "tough_vs_raising_2",
2023
+ "validation_split": "train",
2024
+ "doc_to_text": "",
2025
+ "doc_to_target": 0,
2026
+ "unsafe_code": false,
2027
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2028
+ "description": "",
2029
+ "target_delimiter": " ",
2030
+ "fewshot_delimiter": "\n\n",
2031
+ "num_fewshot": 0,
2032
+ "metric_list": [
2033
+ {
2034
+ "metric": "acc",
2035
+ "aggregation": "mean",
2036
+ "higher_is_better": true
2037
+ }
2038
+ ],
2039
+ "output_type": "multiple_choice",
2040
+ "repeats": 1,
2041
+ "should_decontaminate": true,
2042
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2043
+ "metadata": {
2044
+ "version": 1.0
2045
+ }
2046
+ },
2047
+ "blimp_transitive": {
2048
+ "task": "blimp_transitive",
2049
+ "dataset_path": "blimp",
2050
+ "dataset_name": "transitive",
2051
+ "validation_split": "train",
2052
+ "doc_to_text": "",
2053
+ "doc_to_target": 0,
2054
+ "unsafe_code": false,
2055
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2056
+ "description": "",
2057
+ "target_delimiter": " ",
2058
+ "fewshot_delimiter": "\n\n",
2059
+ "num_fewshot": 0,
2060
+ "metric_list": [
2061
+ {
2062
+ "metric": "acc",
2063
+ "aggregation": "mean",
2064
+ "higher_is_better": true
2065
+ }
2066
+ ],
2067
+ "output_type": "multiple_choice",
2068
+ "repeats": 1,
2069
+ "should_decontaminate": true,
2070
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2071
+ "metadata": {
2072
+ "version": 1.0
2073
+ }
2074
+ },
2075
+ "blimp_wh_island": {
2076
+ "task": "blimp_wh_island",
2077
+ "dataset_path": "blimp",
2078
+ "dataset_name": "wh_island",
2079
+ "validation_split": "train",
2080
+ "doc_to_text": "",
2081
+ "doc_to_target": 0,
2082
+ "unsafe_code": false,
2083
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2084
+ "description": "",
2085
+ "target_delimiter": " ",
2086
+ "fewshot_delimiter": "\n\n",
2087
+ "num_fewshot": 0,
2088
+ "metric_list": [
2089
+ {
2090
+ "metric": "acc",
2091
+ "aggregation": "mean",
2092
+ "higher_is_better": true
2093
+ }
2094
+ ],
2095
+ "output_type": "multiple_choice",
2096
+ "repeats": 1,
2097
+ "should_decontaminate": true,
2098
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2099
+ "metadata": {
2100
+ "version": 1.0
2101
+ }
2102
+ },
2103
+ "blimp_wh_questions_object_gap": {
2104
+ "task": "blimp_wh_questions_object_gap",
2105
+ "dataset_path": "blimp",
2106
+ "dataset_name": "wh_questions_object_gap",
2107
+ "validation_split": "train",
2108
+ "doc_to_text": "",
2109
+ "doc_to_target": 0,
2110
+ "unsafe_code": false,
2111
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2112
+ "description": "",
2113
+ "target_delimiter": " ",
2114
+ "fewshot_delimiter": "\n\n",
2115
+ "num_fewshot": 0,
2116
+ "metric_list": [
2117
+ {
2118
+ "metric": "acc",
2119
+ "aggregation": "mean",
2120
+ "higher_is_better": true
2121
+ }
2122
+ ],
2123
+ "output_type": "multiple_choice",
2124
+ "repeats": 1,
2125
+ "should_decontaminate": true,
2126
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2127
+ "metadata": {
2128
+ "version": 1.0
2129
+ }
2130
+ },
2131
+ "blimp_wh_questions_subject_gap": {
2132
+ "task": "blimp_wh_questions_subject_gap",
2133
+ "dataset_path": "blimp",
2134
+ "dataset_name": "wh_questions_subject_gap",
2135
+ "validation_split": "train",
2136
+ "doc_to_text": "",
2137
+ "doc_to_target": 0,
2138
+ "unsafe_code": false,
2139
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2140
+ "description": "",
2141
+ "target_delimiter": " ",
2142
+ "fewshot_delimiter": "\n\n",
2143
+ "num_fewshot": 0,
2144
+ "metric_list": [
2145
+ {
2146
+ "metric": "acc",
2147
+ "aggregation": "mean",
2148
+ "higher_is_better": true
2149
+ }
2150
+ ],
2151
+ "output_type": "multiple_choice",
2152
+ "repeats": 1,
2153
+ "should_decontaminate": true,
2154
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2155
+ "metadata": {
2156
+ "version": 1.0
2157
+ }
2158
+ },
2159
+ "blimp_wh_questions_subject_gap_long_distance": {
2160
+ "task": "blimp_wh_questions_subject_gap_long_distance",
2161
+ "dataset_path": "blimp",
2162
+ "dataset_name": "wh_questions_subject_gap_long_distance",
2163
+ "validation_split": "train",
2164
+ "doc_to_text": "",
2165
+ "doc_to_target": 0,
2166
+ "unsafe_code": false,
2167
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2168
+ "description": "",
2169
+ "target_delimiter": " ",
2170
+ "fewshot_delimiter": "\n\n",
2171
+ "num_fewshot": 0,
2172
+ "metric_list": [
2173
+ {
2174
+ "metric": "acc",
2175
+ "aggregation": "mean",
2176
+ "higher_is_better": true
2177
+ }
2178
+ ],
2179
+ "output_type": "multiple_choice",
2180
+ "repeats": 1,
2181
+ "should_decontaminate": true,
2182
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2183
+ "metadata": {
2184
+ "version": 1.0
2185
+ }
2186
+ },
2187
+ "blimp_wh_vs_that_no_gap": {
2188
+ "task": "blimp_wh_vs_that_no_gap",
2189
+ "dataset_path": "blimp",
2190
+ "dataset_name": "wh_vs_that_no_gap",
2191
+ "validation_split": "train",
2192
+ "doc_to_text": "",
2193
+ "doc_to_target": 0,
2194
+ "unsafe_code": false,
2195
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2196
+ "description": "",
2197
+ "target_delimiter": " ",
2198
+ "fewshot_delimiter": "\n\n",
2199
+ "num_fewshot": 0,
2200
+ "metric_list": [
2201
+ {
2202
+ "metric": "acc",
2203
+ "aggregation": "mean",
2204
+ "higher_is_better": true
2205
+ }
2206
+ ],
2207
+ "output_type": "multiple_choice",
2208
+ "repeats": 1,
2209
+ "should_decontaminate": true,
2210
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2211
+ "metadata": {
2212
+ "version": 1.0
2213
+ }
2214
+ },
2215
+ "blimp_wh_vs_that_no_gap_long_distance": {
2216
+ "task": "blimp_wh_vs_that_no_gap_long_distance",
2217
+ "dataset_path": "blimp",
2218
+ "dataset_name": "wh_vs_that_no_gap_long_distance",
2219
+ "validation_split": "train",
2220
+ "doc_to_text": "",
2221
+ "doc_to_target": 0,
2222
+ "unsafe_code": false,
2223
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2224
+ "description": "",
2225
+ "target_delimiter": " ",
2226
+ "fewshot_delimiter": "\n\n",
2227
+ "num_fewshot": 0,
2228
+ "metric_list": [
2229
+ {
2230
+ "metric": "acc",
2231
+ "aggregation": "mean",
2232
+ "higher_is_better": true
2233
+ }
2234
+ ],
2235
+ "output_type": "multiple_choice",
2236
+ "repeats": 1,
2237
+ "should_decontaminate": true,
2238
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2239
+ "metadata": {
2240
+ "version": 1.0
2241
+ }
2242
+ },
2243
+ "blimp_wh_vs_that_with_gap": {
2244
+ "task": "blimp_wh_vs_that_with_gap",
2245
+ "dataset_path": "blimp",
2246
+ "dataset_name": "wh_vs_that_with_gap",
2247
+ "validation_split": "train",
2248
+ "doc_to_text": "",
2249
+ "doc_to_target": 0,
2250
+ "unsafe_code": false,
2251
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2252
+ "description": "",
2253
+ "target_delimiter": " ",
2254
+ "fewshot_delimiter": "\n\n",
2255
+ "num_fewshot": 0,
2256
+ "metric_list": [
2257
+ {
2258
+ "metric": "acc",
2259
+ "aggregation": "mean",
2260
+ "higher_is_better": true
2261
+ }
2262
+ ],
2263
+ "output_type": "multiple_choice",
2264
+ "repeats": 1,
2265
+ "should_decontaminate": true,
2266
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2267
+ "metadata": {
2268
+ "version": 1.0
2269
+ }
2270
+ },
2271
+ "blimp_wh_vs_that_with_gap_long_distance": {
2272
+ "task": "blimp_wh_vs_that_with_gap_long_distance",
2273
+ "dataset_path": "blimp",
2274
+ "dataset_name": "wh_vs_that_with_gap_long_distance",
2275
+ "validation_split": "train",
2276
+ "doc_to_text": "",
2277
+ "doc_to_target": 0,
2278
+ "unsafe_code": false,
2279
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2280
+ "description": "",
2281
+ "target_delimiter": " ",
2282
+ "fewshot_delimiter": "\n\n",
2283
+ "num_fewshot": 0,
2284
+ "metric_list": [
2285
+ {
2286
+ "metric": "acc",
2287
+ "aggregation": "mean",
2288
+ "higher_is_better": true
2289
+ }
2290
+ ],
2291
+ "output_type": "multiple_choice",
2292
+ "repeats": 1,
2293
+ "should_decontaminate": true,
2294
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2295
+ "metadata": {
2296
+ "version": 1.0
2297
+ }
2298
+ }
2299
+ },
2300
+ "versions": {
2301
+ "blimp": 2.0,
2302
+ "blimp_adjunct_island": 1.0,
2303
+ "blimp_anaphor_gender_agreement": 1.0,
2304
+ "blimp_anaphor_number_agreement": 1.0,
2305
+ "blimp_animate_subject_passive": 1.0,
2306
+ "blimp_animate_subject_trans": 1.0,
2307
+ "blimp_causative": 1.0,
2308
+ "blimp_complex_NP_island": 1.0,
2309
+ "blimp_coordinate_structure_constraint_complex_left_branch": 1.0,
2310
+ "blimp_coordinate_structure_constraint_object_extraction": 1.0,
2311
+ "blimp_determiner_noun_agreement_1": 1.0,
2312
+ "blimp_determiner_noun_agreement_2": 1.0,
2313
+ "blimp_determiner_noun_agreement_irregular_1": 1.0,
2314
+ "blimp_determiner_noun_agreement_irregular_2": 1.0,
2315
+ "blimp_determiner_noun_agreement_with_adj_2": 1.0,
2316
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": 1.0,
2317
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": 1.0,
2318
+ "blimp_determiner_noun_agreement_with_adjective_1": 1.0,
2319
+ "blimp_distractor_agreement_relational_noun": 1.0,
2320
+ "blimp_distractor_agreement_relative_clause": 1.0,
2321
+ "blimp_drop_argument": 1.0,
2322
+ "blimp_ellipsis_n_bar_1": 1.0,
2323
+ "blimp_ellipsis_n_bar_2": 1.0,
2324
+ "blimp_existential_there_object_raising": 1.0,
2325
+ "blimp_existential_there_quantifiers_1": 1.0,
2326
+ "blimp_existential_there_quantifiers_2": 1.0,
2327
+ "blimp_existential_there_subject_raising": 1.0,
2328
+ "blimp_expletive_it_object_raising": 1.0,
2329
+ "blimp_inchoative": 1.0,
2330
+ "blimp_intransitive": 1.0,
2331
+ "blimp_irregular_past_participle_adjectives": 1.0,
2332
+ "blimp_irregular_past_participle_verbs": 1.0,
2333
+ "blimp_irregular_plural_subject_verb_agreement_1": 1.0,
2334
+ "blimp_irregular_plural_subject_verb_agreement_2": 1.0,
2335
+ "blimp_left_branch_island_echo_question": 1.0,
2336
+ "blimp_left_branch_island_simple_question": 1.0,
2337
+ "blimp_matrix_question_npi_licensor_present": 1.0,
2338
+ "blimp_npi_present_1": 1.0,
2339
+ "blimp_npi_present_2": 1.0,
2340
+ "blimp_only_npi_licensor_present": 1.0,
2341
+ "blimp_only_npi_scope": 1.0,
2342
+ "blimp_passive_1": 1.0,
2343
+ "blimp_passive_2": 1.0,
2344
+ "blimp_principle_A_c_command": 1.0,
2345
+ "blimp_principle_A_case_1": 1.0,
2346
+ "blimp_principle_A_case_2": 1.0,
2347
+ "blimp_principle_A_domain_1": 1.0,
2348
+ "blimp_principle_A_domain_2": 1.0,
2349
+ "blimp_principle_A_domain_3": 1.0,
2350
+ "blimp_principle_A_reconstruction": 1.0,
2351
+ "blimp_regular_plural_subject_verb_agreement_1": 1.0,
2352
+ "blimp_regular_plural_subject_verb_agreement_2": 1.0,
2353
+ "blimp_sentential_negation_npi_licensor_present": 1.0,
2354
+ "blimp_sentential_negation_npi_scope": 1.0,
2355
+ "blimp_sentential_subject_island": 1.0,
2356
+ "blimp_superlative_quantifiers_1": 1.0,
2357
+ "blimp_superlative_quantifiers_2": 1.0,
2358
+ "blimp_tough_vs_raising_1": 1.0,
2359
+ "blimp_tough_vs_raising_2": 1.0,
2360
+ "blimp_transitive": 1.0,
2361
+ "blimp_wh_island": 1.0,
2362
+ "blimp_wh_questions_object_gap": 1.0,
2363
+ "blimp_wh_questions_subject_gap": 1.0,
2364
+ "blimp_wh_questions_subject_gap_long_distance": 1.0,
2365
+ "blimp_wh_vs_that_no_gap": 1.0,
2366
+ "blimp_wh_vs_that_no_gap_long_distance": 1.0,
2367
+ "blimp_wh_vs_that_with_gap": 1.0,
2368
+ "blimp_wh_vs_that_with_gap_long_distance": 1.0
2369
+ },
2370
+ "n-shot": {
2371
+ "blimp_adjunct_island": 0,
2372
+ "blimp_anaphor_gender_agreement": 0,
2373
+ "blimp_anaphor_number_agreement": 0,
2374
+ "blimp_animate_subject_passive": 0,
2375
+ "blimp_animate_subject_trans": 0,
2376
+ "blimp_causative": 0,
2377
+ "blimp_complex_NP_island": 0,
2378
+ "blimp_coordinate_structure_constraint_complex_left_branch": 0,
2379
+ "blimp_coordinate_structure_constraint_object_extraction": 0,
2380
+ "blimp_determiner_noun_agreement_1": 0,
2381
+ "blimp_determiner_noun_agreement_2": 0,
2382
+ "blimp_determiner_noun_agreement_irregular_1": 0,
2383
+ "blimp_determiner_noun_agreement_irregular_2": 0,
2384
+ "blimp_determiner_noun_agreement_with_adj_2": 0,
2385
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": 0,
2386
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": 0,
2387
+ "blimp_determiner_noun_agreement_with_adjective_1": 0,
2388
+ "blimp_distractor_agreement_relational_noun": 0,
2389
+ "blimp_distractor_agreement_relative_clause": 0,
2390
+ "blimp_drop_argument": 0,
2391
+ "blimp_ellipsis_n_bar_1": 0,
2392
+ "blimp_ellipsis_n_bar_2": 0,
2393
+ "blimp_existential_there_object_raising": 0,
2394
+ "blimp_existential_there_quantifiers_1": 0,
2395
+ "blimp_existential_there_quantifiers_2": 0,
2396
+ "blimp_existential_there_subject_raising": 0,
2397
+ "blimp_expletive_it_object_raising": 0,
2398
+ "blimp_inchoative": 0,
2399
+ "blimp_intransitive": 0,
2400
+ "blimp_irregular_past_participle_adjectives": 0,
2401
+ "blimp_irregular_past_participle_verbs": 0,
2402
+ "blimp_irregular_plural_subject_verb_agreement_1": 0,
2403
+ "blimp_irregular_plural_subject_verb_agreement_2": 0,
2404
+ "blimp_left_branch_island_echo_question": 0,
2405
+ "blimp_left_branch_island_simple_question": 0,
2406
+ "blimp_matrix_question_npi_licensor_present": 0,
2407
+ "blimp_npi_present_1": 0,
2408
+ "blimp_npi_present_2": 0,
2409
+ "blimp_only_npi_licensor_present": 0,
2410
+ "blimp_only_npi_scope": 0,
2411
+ "blimp_passive_1": 0,
2412
+ "blimp_passive_2": 0,
2413
+ "blimp_principle_A_c_command": 0,
2414
+ "blimp_principle_A_case_1": 0,
2415
+ "blimp_principle_A_case_2": 0,
2416
+ "blimp_principle_A_domain_1": 0,
2417
+ "blimp_principle_A_domain_2": 0,
2418
+ "blimp_principle_A_domain_3": 0,
2419
+ "blimp_principle_A_reconstruction": 0,
2420
+ "blimp_regular_plural_subject_verb_agreement_1": 0,
2421
+ "blimp_regular_plural_subject_verb_agreement_2": 0,
2422
+ "blimp_sentential_negation_npi_licensor_present": 0,
2423
+ "blimp_sentential_negation_npi_scope": 0,
2424
+ "blimp_sentential_subject_island": 0,
2425
+ "blimp_superlative_quantifiers_1": 0,
2426
+ "blimp_superlative_quantifiers_2": 0,
2427
+ "blimp_tough_vs_raising_1": 0,
2428
+ "blimp_tough_vs_raising_2": 0,
2429
+ "blimp_transitive": 0,
2430
+ "blimp_wh_island": 0,
2431
+ "blimp_wh_questions_object_gap": 0,
2432
+ "blimp_wh_questions_subject_gap": 0,
2433
+ "blimp_wh_questions_subject_gap_long_distance": 0,
2434
+ "blimp_wh_vs_that_no_gap": 0,
2435
+ "blimp_wh_vs_that_no_gap_long_distance": 0,
2436
+ "blimp_wh_vs_that_with_gap": 0,
2437
+ "blimp_wh_vs_that_with_gap_long_distance": 0
2438
+ },
2439
+ "higher_is_better": {
2440
+ "blimp": {
2441
+ "acc": true
2442
+ },
2443
+ "blimp_adjunct_island": {
2444
+ "acc": true
2445
+ },
2446
+ "blimp_anaphor_gender_agreement": {
2447
+ "acc": true
2448
+ },
2449
+ "blimp_anaphor_number_agreement": {
2450
+ "acc": true
2451
+ },
2452
+ "blimp_animate_subject_passive": {
2453
+ "acc": true
2454
+ },
2455
+ "blimp_animate_subject_trans": {
2456
+ "acc": true
2457
+ },
2458
+ "blimp_causative": {
2459
+ "acc": true
2460
+ },
2461
+ "blimp_complex_NP_island": {
2462
+ "acc": true
2463
+ },
2464
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
2465
+ "acc": true
2466
+ },
2467
+ "blimp_coordinate_structure_constraint_object_extraction": {
2468
+ "acc": true
2469
+ },
2470
+ "blimp_determiner_noun_agreement_1": {
2471
+ "acc": true
2472
+ },
2473
+ "blimp_determiner_noun_agreement_2": {
2474
+ "acc": true
2475
+ },
2476
+ "blimp_determiner_noun_agreement_irregular_1": {
2477
+ "acc": true
2478
+ },
2479
+ "blimp_determiner_noun_agreement_irregular_2": {
2480
+ "acc": true
2481
+ },
2482
+ "blimp_determiner_noun_agreement_with_adj_2": {
2483
+ "acc": true
2484
+ },
2485
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
2486
+ "acc": true
2487
+ },
2488
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
2489
+ "acc": true
2490
+ },
2491
+ "blimp_determiner_noun_agreement_with_adjective_1": {
2492
+ "acc": true
2493
+ },
2494
+ "blimp_distractor_agreement_relational_noun": {
2495
+ "acc": true
2496
+ },
2497
+ "blimp_distractor_agreement_relative_clause": {
2498
+ "acc": true
2499
+ },
2500
+ "blimp_drop_argument": {
2501
+ "acc": true
2502
+ },
2503
+ "blimp_ellipsis_n_bar_1": {
2504
+ "acc": true
2505
+ },
2506
+ "blimp_ellipsis_n_bar_2": {
2507
+ "acc": true
2508
+ },
2509
+ "blimp_existential_there_object_raising": {
2510
+ "acc": true
2511
+ },
2512
+ "blimp_existential_there_quantifiers_1": {
2513
+ "acc": true
2514
+ },
2515
+ "blimp_existential_there_quantifiers_2": {
2516
+ "acc": true
2517
+ },
2518
+ "blimp_existential_there_subject_raising": {
2519
+ "acc": true
2520
+ },
2521
+ "blimp_expletive_it_object_raising": {
2522
+ "acc": true
2523
+ },
2524
+ "blimp_inchoative": {
2525
+ "acc": true
2526
+ },
2527
+ "blimp_intransitive": {
2528
+ "acc": true
2529
+ },
2530
+ "blimp_irregular_past_participle_adjectives": {
2531
+ "acc": true
2532
+ },
2533
+ "blimp_irregular_past_participle_verbs": {
2534
+ "acc": true
2535
+ },
2536
+ "blimp_irregular_plural_subject_verb_agreement_1": {
2537
+ "acc": true
2538
+ },
2539
+ "blimp_irregular_plural_subject_verb_agreement_2": {
2540
+ "acc": true
2541
+ },
2542
+ "blimp_left_branch_island_echo_question": {
2543
+ "acc": true
2544
+ },
2545
+ "blimp_left_branch_island_simple_question": {
2546
+ "acc": true
2547
+ },
2548
+ "blimp_matrix_question_npi_licensor_present": {
2549
+ "acc": true
2550
+ },
2551
+ "blimp_npi_present_1": {
2552
+ "acc": true
2553
+ },
2554
+ "blimp_npi_present_2": {
2555
+ "acc": true
2556
+ },
2557
+ "blimp_only_npi_licensor_present": {
2558
+ "acc": true
2559
+ },
2560
+ "blimp_only_npi_scope": {
2561
+ "acc": true
2562
+ },
2563
+ "blimp_passive_1": {
2564
+ "acc": true
2565
+ },
2566
+ "blimp_passive_2": {
2567
+ "acc": true
2568
+ },
2569
+ "blimp_principle_A_c_command": {
2570
+ "acc": true
2571
+ },
2572
+ "blimp_principle_A_case_1": {
2573
+ "acc": true
2574
+ },
2575
+ "blimp_principle_A_case_2": {
2576
+ "acc": true
2577
+ },
2578
+ "blimp_principle_A_domain_1": {
2579
+ "acc": true
2580
+ },
2581
+ "blimp_principle_A_domain_2": {
2582
+ "acc": true
2583
+ },
2584
+ "blimp_principle_A_domain_3": {
2585
+ "acc": true
2586
+ },
2587
+ "blimp_principle_A_reconstruction": {
2588
+ "acc": true
2589
+ },
2590
+ "blimp_regular_plural_subject_verb_agreement_1": {
2591
+ "acc": true
2592
+ },
2593
+ "blimp_regular_plural_subject_verb_agreement_2": {
2594
+ "acc": true
2595
+ },
2596
+ "blimp_sentential_negation_npi_licensor_present": {
2597
+ "acc": true
2598
+ },
2599
+ "blimp_sentential_negation_npi_scope": {
2600
+ "acc": true
2601
+ },
2602
+ "blimp_sentential_subject_island": {
2603
+ "acc": true
2604
+ },
2605
+ "blimp_superlative_quantifiers_1": {
2606
+ "acc": true
2607
+ },
2608
+ "blimp_superlative_quantifiers_2": {
2609
+ "acc": true
2610
+ },
2611
+ "blimp_tough_vs_raising_1": {
2612
+ "acc": true
2613
+ },
2614
+ "blimp_tough_vs_raising_2": {
2615
+ "acc": true
2616
+ },
2617
+ "blimp_transitive": {
2618
+ "acc": true
2619
+ },
2620
+ "blimp_wh_island": {
2621
+ "acc": true
2622
+ },
2623
+ "blimp_wh_questions_object_gap": {
2624
+ "acc": true
2625
+ },
2626
+ "blimp_wh_questions_subject_gap": {
2627
+ "acc": true
2628
+ },
2629
+ "blimp_wh_questions_subject_gap_long_distance": {
2630
+ "acc": true
2631
+ },
2632
+ "blimp_wh_vs_that_no_gap": {
2633
+ "acc": true
2634
+ },
2635
+ "blimp_wh_vs_that_no_gap_long_distance": {
2636
+ "acc": true
2637
+ },
2638
+ "blimp_wh_vs_that_with_gap": {
2639
+ "acc": true
2640
+ },
2641
+ "blimp_wh_vs_that_with_gap_long_distance": {
2642
+ "acc": true
2643
+ }
2644
+ },
2645
+ "n-samples": {
2646
+ "blimp_adjunct_island": {
2647
+ "original": 1000,
2648
+ "effective": 1000
2649
+ },
2650
+ "blimp_anaphor_gender_agreement": {
2651
+ "original": 1000,
2652
+ "effective": 1000
2653
+ },
2654
+ "blimp_anaphor_number_agreement": {
2655
+ "original": 1000,
2656
+ "effective": 1000
2657
+ },
2658
+ "blimp_animate_subject_passive": {
2659
+ "original": 1000,
2660
+ "effective": 1000
2661
+ },
2662
+ "blimp_animate_subject_trans": {
2663
+ "original": 1000,
2664
+ "effective": 1000
2665
+ },
2666
+ "blimp_causative": {
2667
+ "original": 1000,
2668
+ "effective": 1000
2669
+ },
2670
+ "blimp_complex_NP_island": {
2671
+ "original": 1000,
2672
+ "effective": 1000
2673
+ },
2674
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
2675
+ "original": 1000,
2676
+ "effective": 1000
2677
+ },
2678
+ "blimp_coordinate_structure_constraint_object_extraction": {
2679
+ "original": 1000,
2680
+ "effective": 1000
2681
+ },
2682
+ "blimp_determiner_noun_agreement_1": {
2683
+ "original": 1000,
2684
+ "effective": 1000
2685
+ },
2686
+ "blimp_determiner_noun_agreement_2": {
2687
+ "original": 1000,
2688
+ "effective": 1000
2689
+ },
2690
+ "blimp_determiner_noun_agreement_irregular_1": {
2691
+ "original": 1000,
2692
+ "effective": 1000
2693
+ },
2694
+ "blimp_determiner_noun_agreement_irregular_2": {
2695
+ "original": 1000,
2696
+ "effective": 1000
2697
+ },
2698
+ "blimp_determiner_noun_agreement_with_adj_2": {
2699
+ "original": 1000,
2700
+ "effective": 1000
2701
+ },
2702
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
2703
+ "original": 1000,
2704
+ "effective": 1000
2705
+ },
2706
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
2707
+ "original": 1000,
2708
+ "effective": 1000
2709
+ },
2710
+ "blimp_determiner_noun_agreement_with_adjective_1": {
2711
+ "original": 1000,
2712
+ "effective": 1000
2713
+ },
2714
+ "blimp_distractor_agreement_relational_noun": {
2715
+ "original": 1000,
2716
+ "effective": 1000
2717
+ },
2718
+ "blimp_distractor_agreement_relative_clause": {
2719
+ "original": 1000,
2720
+ "effective": 1000
2721
+ },
2722
+ "blimp_drop_argument": {
2723
+ "original": 1000,
2724
+ "effective": 1000
2725
+ },
2726
+ "blimp_ellipsis_n_bar_1": {
2727
+ "original": 1000,
2728
+ "effective": 1000
2729
+ },
2730
+ "blimp_ellipsis_n_bar_2": {
2731
+ "original": 1000,
2732
+ "effective": 1000
2733
+ },
2734
+ "blimp_existential_there_object_raising": {
2735
+ "original": 1000,
2736
+ "effective": 1000
2737
+ },
2738
+ "blimp_existential_there_quantifiers_1": {
2739
+ "original": 1000,
2740
+ "effective": 1000
2741
+ },
2742
+ "blimp_existential_there_quantifiers_2": {
2743
+ "original": 1000,
2744
+ "effective": 1000
2745
+ },
2746
+ "blimp_existential_there_subject_raising": {
2747
+ "original": 1000,
2748
+ "effective": 1000
2749
+ },
2750
+ "blimp_expletive_it_object_raising": {
2751
+ "original": 1000,
2752
+ "effective": 1000
2753
+ },
2754
+ "blimp_inchoative": {
2755
+ "original": 1000,
2756
+ "effective": 1000
2757
+ },
2758
+ "blimp_intransitive": {
2759
+ "original": 1000,
2760
+ "effective": 1000
2761
+ },
2762
+ "blimp_irregular_past_participle_adjectives": {
2763
+ "original": 1000,
2764
+ "effective": 1000
2765
+ },
2766
+ "blimp_irregular_past_participle_verbs": {
2767
+ "original": 1000,
2768
+ "effective": 1000
2769
+ },
2770
+ "blimp_irregular_plural_subject_verb_agreement_1": {
2771
+ "original": 1000,
2772
+ "effective": 1000
2773
+ },
2774
+ "blimp_irregular_plural_subject_verb_agreement_2": {
2775
+ "original": 1000,
2776
+ "effective": 1000
2777
+ },
2778
+ "blimp_left_branch_island_echo_question": {
2779
+ "original": 1000,
2780
+ "effective": 1000
2781
+ },
2782
+ "blimp_left_branch_island_simple_question": {
2783
+ "original": 1000,
2784
+ "effective": 1000
2785
+ },
2786
+ "blimp_matrix_question_npi_licensor_present": {
2787
+ "original": 1000,
2788
+ "effective": 1000
2789
+ },
2790
+ "blimp_npi_present_1": {
2791
+ "original": 1000,
2792
+ "effective": 1000
2793
+ },
2794
+ "blimp_npi_present_2": {
2795
+ "original": 1000,
2796
+ "effective": 1000
2797
+ },
2798
+ "blimp_only_npi_licensor_present": {
2799
+ "original": 1000,
2800
+ "effective": 1000
2801
+ },
2802
+ "blimp_only_npi_scope": {
2803
+ "original": 1000,
2804
+ "effective": 1000
2805
+ },
2806
+ "blimp_passive_1": {
2807
+ "original": 1000,
2808
+ "effective": 1000
2809
+ },
2810
+ "blimp_passive_2": {
2811
+ "original": 1000,
2812
+ "effective": 1000
2813
+ },
2814
+ "blimp_principle_A_c_command": {
2815
+ "original": 1000,
2816
+ "effective": 1000
2817
+ },
2818
+ "blimp_principle_A_case_1": {
2819
+ "original": 1000,
2820
+ "effective": 1000
2821
+ },
2822
+ "blimp_principle_A_case_2": {
2823
+ "original": 1000,
2824
+ "effective": 1000
2825
+ },
2826
+ "blimp_principle_A_domain_1": {
2827
+ "original": 1000,
2828
+ "effective": 1000
2829
+ },
2830
+ "blimp_principle_A_domain_2": {
2831
+ "original": 1000,
2832
+ "effective": 1000
2833
+ },
2834
+ "blimp_principle_A_domain_3": {
2835
+ "original": 1000,
2836
+ "effective": 1000
2837
+ },
2838
+ "blimp_principle_A_reconstruction": {
2839
+ "original": 1000,
2840
+ "effective": 1000
2841
+ },
2842
+ "blimp_regular_plural_subject_verb_agreement_1": {
2843
+ "original": 1000,
2844
+ "effective": 1000
2845
+ },
2846
+ "blimp_regular_plural_subject_verb_agreement_2": {
2847
+ "original": 1000,
2848
+ "effective": 1000
2849
+ },
2850
+ "blimp_sentential_negation_npi_licensor_present": {
2851
+ "original": 1000,
2852
+ "effective": 1000
2853
+ },
2854
+ "blimp_sentential_negation_npi_scope": {
2855
+ "original": 1000,
2856
+ "effective": 1000
2857
+ },
2858
+ "blimp_sentential_subject_island": {
2859
+ "original": 1000,
2860
+ "effective": 1000
2861
+ },
2862
+ "blimp_superlative_quantifiers_1": {
2863
+ "original": 1000,
2864
+ "effective": 1000
2865
+ },
2866
+ "blimp_superlative_quantifiers_2": {
2867
+ "original": 1000,
2868
+ "effective": 1000
2869
+ },
2870
+ "blimp_tough_vs_raising_1": {
2871
+ "original": 1000,
2872
+ "effective": 1000
2873
+ },
2874
+ "blimp_tough_vs_raising_2": {
2875
+ "original": 1000,
2876
+ "effective": 1000
2877
+ },
2878
+ "blimp_transitive": {
2879
+ "original": 1000,
2880
+ "effective": 1000
2881
+ },
2882
+ "blimp_wh_island": {
2883
+ "original": 1000,
2884
+ "effective": 1000
2885
+ },
2886
+ "blimp_wh_questions_object_gap": {
2887
+ "original": 1000,
2888
+ "effective": 1000
2889
+ },
2890
+ "blimp_wh_questions_subject_gap": {
2891
+ "original": 1000,
2892
+ "effective": 1000
2893
+ },
2894
+ "blimp_wh_questions_subject_gap_long_distance": {
2895
+ "original": 1000,
2896
+ "effective": 1000
2897
+ },
2898
+ "blimp_wh_vs_that_no_gap": {
2899
+ "original": 1000,
2900
+ "effective": 1000
2901
+ },
2902
+ "blimp_wh_vs_that_no_gap_long_distance": {
2903
+ "original": 1000,
2904
+ "effective": 1000
2905
+ },
2906
+ "blimp_wh_vs_that_with_gap": {
2907
+ "original": 1000,
2908
+ "effective": 1000
2909
+ },
2910
+ "blimp_wh_vs_that_with_gap_long_distance": {
2911
+ "original": 1000,
2912
+ "effective": 1000
2913
+ }
2914
+ },
2915
+ "config": {
2916
+ "model": "hf",
2917
+ "model_args": "pretrained=meta-llama/Llama-3.2-3B,dtype=float32,trust_remote_code=True",
2918
+ "model_num_parameters": 3212749824,
2919
+ "model_dtype": "torch.float32",
2920
+ "model_revision": "main",
2921
+ "model_sha": "13afe5124825b4f3751f836b40dafda64c1ed062",
2922
+ "batch_size": "auto",
2923
+ "batch_sizes": [
2924
+ 32
2925
+ ],
2926
+ "device": "cuda",
2927
+ "use_cache": null,
2928
+ "limit": null,
2929
+ "bootstrap_iters": 100000,
2930
+ "gen_kwargs": null,
2931
+ "random_seed": 0,
2932
+ "numpy_seed": 1234,
2933
+ "torch_seed": 1234,
2934
+ "fewshot_seed": 1234
2935
+ },
2936
+ "git_hash": null,
2937
+ "date": 1741267634.9579294,
2938
+ "pretty_env_info": "PyTorch version: 2.5.1+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: 14.0.0-1ubuntu1.1\nCMake version: version 3.31.2\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-6.6.56+-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.140\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: Tesla T4\nGPU 1: Tesla T4\n\nNvidia driver version: 560.35.03\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 46 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 4\nOn-line CPU(s) list: 0-3\nVendor ID: GenuineIntel\nModel name: Intel(R) Xeon(R) CPU @ 2.00GHz\nCPU family: 6\nModel: 85\nThread(s) per core: 2\nCore(s) per socket: 2\nSocket(s): 1\nStepping: 3\nBogoMIPS: 4000.41\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities\nHypervisor vendor: KVM\nVirtualization type: full\nL1d cache: 64 KiB (2 instances)\nL1i cache: 64 KiB (2 instances)\nL2 cache: 2 MiB (2 instances)\nL3 cache: 38.5 MiB (1 instance)\nNUMA node(s): 1\nNUMA node0 CPU(s): 0-3\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Mitigation; PTE Inversion\nVulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown\nVulnerability Meltdown: Mitigation; PTI\nVulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Reg file data sampling: Not affected\nVulnerability Retbleed: Mitigation; IBRS\nVulnerability Spec rstack overflow: Not affected\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown\n\nVersions of relevant libraries:\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==1.26.4\n[pip3] onnx==1.17.0\n[pip3] optree==0.13.1\n[pip3] pytorch-ignite==0.5.1\n[pip3] pytorch-lightning==2.5.0.post0\n[pip3] torch==2.5.1+cu121\n[pip3] torchaudio==2.5.1+cu121\n[pip3] torchinfo==1.8.0\n[pip3] torchmetrics==1.6.1\n[pip3] torchsummary==1.5.1\n[pip3] torchtune==0.5.0\n[pip3] torchvision==0.20.1+cu121\n[conda] Could not collect",
2939
+ "transformers_version": "4.47.0",
2940
+ "upper_git_hash": null,
2941
+ "tokenizer_pad_token": [
2942
+ "<|end_of_text|>",
2943
+ "128001"
2944
+ ],
2945
+ "tokenizer_eos_token": [
2946
+ "<|end_of_text|>",
2947
+ "128001"
2948
+ ],
2949
+ "tokenizer_bos_token": [
2950
+ "<|begin_of_text|>",
2951
+ "128000"
2952
+ ],
2953
+ "eot_token_id": 128001,
2954
+ "max_length": 131072,
2955
+ "task_hashes": {},
2956
+ "model_source": "hf",
2957
+ "model_name": "meta-llama/Llama-3.2-3B",
2958
+ "model_name_sanitized": "meta-llama__Llama-3.2-3B",
2959
+ "system_instruction": null,
2960
+ "system_instruction_sha": null,
2961
+ "fewshot_as_multiturn": false,
2962
+ "chat_template": null,
2963
+ "chat_template_sha": null,
2964
+ "start_time": 5659.99530225,
2965
+ "end_time": 8910.415783908,
2966
+ "total_evaluation_time_seconds": "3250.4204816579995"
2967
+ }