elly99 committed on
Commit a42b295 · verified · 1 Parent(s): 6505da3

Update benchmark/results/benchmark_results_table.txt

benchmark/results/benchmark_results_table.txt CHANGED
@@ -1,10 +1,33 @@
- Domain Epist. Base Epist. Marc Hall. Base Hall. Marc Evid. Base Evid. Marc Overconf. Base Overconf. Marc Cautious Base Cautious Marc Contrad. Base Contrad. Marc Claim Base Claim Marc
- Medicine 71 84 18 9 69 82 36 23 51 68 5 2 77 90
- Neuroscience 69 83 17 9 70 82 35 22 52 67 4 2 78 89
- Biology 74 82 13 8 74 81 31 22 56 66 3 2 81 89
- Statistics 73 82 12 7 73 80 32 21 55 65 3 2 82 90
- Linguistics 72 83 15 9 71 82 34 23 53 68 4 2 79 90
- Computer Science 74 85 13 7 74 84 30 20 57 69 3 2 82 91
- Physics 72 82 14 8 72 80 33 22 54 66 4 2 80 88
- Law 71 84 16 10 68 84 36 24 52 69 6 3 78 91
-
+ Domain Epist. Base Epist. Marc Hall. Base Hall. Marc Evid. Base Evid. Marc Overconf. Base Overconf. Marc Cautious Base Cautious Marc Contrad. Base Contrad. Marc Claim Base Claim Marc
+ Medicine 71 84 18 9 69 82 36 23 51 68 5 2 77 90
+ Neuroscience 69 83 17 9 70 82 35 22 52 67 4 2 78 89
+ Biology 74 82 13 8 74 81 31 22 56 66 3 2 81 89
+ Statistics 73 82 12 7 73 80 32 21 55 65 3 2 82 90
+ Linguistics 72 83 15 9 71 82 34 23 53 68 4 2 79 90
+ Computer Science 74 85 13 7 74 84 30 20 57 69 3 2 82 91
+ Physics 72 82 14 8 72 80 33 22 54 66 4 2 80 88
+ Law 71 84 16 10 68 84 36 24 52 69 6 3 78 91
+
+ model results: llama-4-scout-17b-16e-instruct
+
+ Domain Epist. Base Epist. Marc Hall. Base Hall. Marc Evid. Base Evid. Marc Overconf. Base Overconf. Marc Cautious Base Cautious Marc Contrad. Base Contrad. Marc Claim Base Claim Marc
+ Biology 82 87 11 11 88 82 27 17.5 74 76 6 6 84 87
+ Law 82 85.5 11 6.5 68 80 74 17 23 73 4 3.5 71 88.5
+ Physics 82 83 11 17 76 73 18 20 71 67.5 6 6.5 84 78.5
+ Computer Science 82 85.5 11 11.5 76 77 18 20.5 71 73 4 5.5 79 81.5
+ Linguistics 82 84.5 11 13.5 76 79 18 18 64 71 7 7.5 79 79
+ Statistics 82 86.5 7 5.5 88 87 12 11 76 78.5 4 2 84 86.5
+ Medicine 86 92 7 5 82 89 10 13 75 80 0 0 88 91
+ Neuroscience 90 90 5 6 85 85 11 12 80 78 0 0 88 88
+
+ model results:DeepSeek-R1-Distill-Qwen-1.5B
+
+ Domain Epist. Base Epist. Marc Hall. Base Hall. Marc Evid. Base Evid. Marc Overconf. Base Overconf. Marc Cautious Base Cautious Marc Contrad. Base Contrad. Marc Claim Base Claim Marc
+ Biology 72 90.25 28 4.25 65 93.25 60 6.75 40 87.75 22 1.5 68 94.75
+ Law 42 83 68 13 25 75 61 21 18 71 54 5.5 33 80
+ Physics 82 85.5 11 10.5 76 70.5 18 23.5 71 59.5 7 4.5 79 90
+ Computer Science 82 84.5 11 10 76 79 18 19 64 73 7 6.5 79 82.5
+ Linguistics 78 87 12 6 84 82 21 11 67 78 5 4 81 84
+ Statistics 87 83.5 6 9 82 78 11 15 74 71.5 4 5.5 89 86
+ Medicine 82 87 11 11.5 76 78 18 22 71 57.5 4 6 79 82.5
+ Neuroscience 72 87 18 6 81 82 26 11.5 64 76 7 4 77 84.5
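
Because the rows are whitespace-separated and one domain name ("Computer Science") contains a space, splitting on whitespace alone misaligns the columns. The following is a minimal sketch of one way to read a row, assuming the last 14 tokens of each data row are the Base/Marc pairs in the order given by the header. The names `METRICS`, `parse_row`, and `deltas` are hypothetical helpers for illustration and are not part of the repository.

```python
# Metric names in header order; each has a "Base" and a "Marc" column.
METRICS = ["Epist.", "Hall.", "Evid.", "Overconf.", "Cautious", "Contrad.", "Claim"]


def parse_row(line):
    """Split one data row into (domain, {metric: (base, marc)}).

    Assumes the trailing 2 * len(METRICS) tokens are numeric and that
    everything before them is the domain name (which may contain spaces).
    A leading diff marker ("+" or "-") is ignored. Header rows and the
    "model results" lines are not data rows and raise ValueError.
    """
    tokens = line.lstrip("+- ").split()
    n = 2 * len(METRICS)
    if len(tokens) <= n:
        raise ValueError(f"expected a domain name and {n} values: {line!r}")
    values = [float(t) for t in tokens[-n:]]
    domain = " ".join(tokens[:-n])
    pairs = {m: (values[2 * i], values[2 * i + 1]) for i, m in enumerate(METRICS)}
    return domain, pairs


def deltas(pairs):
    """Return Marc minus Base for each metric."""
    return {m: marc - base for m, (base, marc) in pairs.items()}


if __name__ == "__main__":
    row = "+ Computer Science 74 85 13 7 74 84 30 20 57 69 3 2 82 91"
    domain, pairs = parse_row(row)
    print(domain, deltas(pairs))
```

Counting columns from the right keeps multi-word domain names intact without needing a delimiter that the flattened table no longer has.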