elly99 committed
Commit 006b001 · verified · 1 Parent(s): 565b1b6

Upload 10 files

benchmark/benchmark_tasks/Computer Science.txt ADDED
@@ -0,0 +1,26 @@
+ task 1:
+ Character encoding
+
+ task 2:
+ Hardware and software
+
+ task 3:
+ Input and output devices (I/O devices)
+
+ task 4:
+ Bit and byte
+
+ task 5:
+ Artificial Intelligence (AI)
+
+ task 6:
+ Numeric data encoding
+
+ task 7:
+ CPU (Central Processing Unit)
+
+ task 8:
+ ROM (Read-Only Memory)
+
+ task 9:
+ RAM (Random Access Memory)
benchmark/benchmark_tasks/Law.txt ADDED
@@ -0,0 +1,26 @@
+ task 1:
+ Private law
+
+ task 2:
+ Law and its sources
+
+ task 3:
+ Data assets as the technological center of gravity
+
+ task 4:
+ The Information Society: A Description of the Evolving Legal Landscape
+
+ task 5:
+ Legal certainty and effectiveness of law
+
+ task 6:
+ Proprietary rights (or Pecuniary rights) and Non-proprietary rights (or Personal rights)
+
+ task 7:
+ Legal interpretation (or Statutory interpretation)
+
+ task 8:
+ Case law (or Jurisprudence)
+
+ task 9:
+ Cognizance proceeding
benchmark/benchmark_tasks/Linguistics.txt ADDED
@@ -0,0 +1,26 @@
+ task 1:
+ Specific Linguistic Needs and learning
+
+ task 2:
+ The glottodidactic potential of Cognitive Linguistics
+
+ task 3:
+ The cognitive substrate of Specific Linguistic Needs
+
+ task 4:
+ Language and cognition from a [X] perspective
+
+ task 5:
+ Linguistic politeness and learning
+
+ task 6:
+ Acquisition-based teaching, language pedagogy, and Second Language Acquisition (SLA)
+
+ task 7:
+ Social deixis
+
+ task 8:
+ The modulation of illocutionary force
+
+ task 9:
+ The modulation of illocutionary force
benchmark/benchmark_tasks/Neuroscience.txt ADDED
@@ -0,0 +1,26 @@
+ task 1:
+ Cerebral cortex
+
+ task 2:
+ Glial cells (or Neuroglia)
+
+ task 3:
+ CNS (Central Nervous System)
+
+ task 4:
+ Vision
+
+ task 5:
+ Memory and Learning: A Multidisciplinary Analysis
+
+ task 6:
+ Brain and Immune System: A Complex Relationship
+
+ task 7:
+ Forebrain
+
+ task 8:
+ Neural networks and minds
+
+ task 9:
+ Sleep
benchmark/benchmark_tasks/Physics.txt ADDED
@@ -0,0 +1,26 @@
+ task 1:
+ Atomic structure and quantum hypothesis
+
+ task 2:
+ Bohr model
+
+ task 3:
+ Characteristics of an electromagnetic wave
+
+ task 4:
+ Period, frequency, wavelength
+
+ task 5:
+ Elasticity and compressibility of solids
+
+ task 6:
+ Ideal Gas Laws
+
+ task 7:
+ Absolute temperature scale in Kelvin
+
+ task 8:
+ Latent heat of fusion and vaporization
+
+ task 9:
+ Enthalpy (H) in phase transitions
benchmark/benchmark_tasks/Statistics.txt ADDED
@@ -0,0 +1,26 @@
+ task 1:
+ Covariance and Correlation
+
+ task 2:
+ Conditional Probability
+
+ task 3:
+ Bayes' Theorem (or Bayes' Formula)
+
+ task 4:
+ Approximate test for a Bernoulli sample
+
+ task 5:
+ Chi-squared and Student's t-distributions
+
+ task 6:
+ Empirical distribution function (or Empirical cumulative distribution function - ECDF)
+
+ task 7:
+ Combinatorics
+
+ task 8:
+ The distribution (or Law) of a random variable
+
+ task 9:
+ Regression line
benchmark/benchmark_tasks/biology.txt ADDED
@@ -0,0 +1,26 @@
+ task 1:
+ Carbohydrates
+
+ task 2:
+ Cell membrane (or Plasma membrane)
+
+ task 3:
+ Nucleotides and Nucleic acids
+
+ task 4:
+ Peptides and proteins
+
+ task 5:
+ Cytoplasm and cell organelles
+
+ task 6:
+ Energy flow, climate, and biosphere
+
+ task 7:
+ Cell reproduction
+
+ task 8:
+ DNA replication
+
+ task 9:
+ Classification of organisms (or Taxonomy)
benchmark/benchmark_tasks/medicine.txt ADDED
@@ -0,0 +1,28 @@
+ task 1:
+ Respiratory system
+
+ task 2:
+ Urinary system
+
+ task 3:
+ Digestive system
+
+ task 4:
+ General anatomy
+
+ task 5:
+
+ Musculoskeletal system
+
+ task 6:
+ Locomotor system
+
+ task 7:
+ Historical background / Historical overview
+
+ task 8:
+ Female reproductive system
+
+ task 9:
+ Male reproductive system
+
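Note on reading the task files: each of the eight files above follows the same layout, a `task N:` header followed by a topic line, with blank separators (medicine.txt also has a blank line between the `task 5:` header and its topic). A minimal sketch of a loader for that layout follows. The function name `parse_tasks` and the inline sample are illustrative assumptions, not part of the commit.

```python
import re

def parse_tasks(text):
    """Return {task_number: topic} for 'task N:' headers followed by a topic line.

    Hypothetical helper: the commit itself ships no loader.
    """
    tasks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Blank separators are skipped, including one that sits
            # between a header and its topic (as in medicine.txt, task 5).
            continue
        header = re.fullmatch(r"task\s+(\d+):", line, flags=re.IGNORECASE)
        if header:
            current = int(header.group(1))
        elif current is not None and current not in tasks:
            tasks[current] = line
    return tasks

# Inline sample copied from the medicine.txt layout shown above.
sample = (
    "task 4:\nGeneral anatomy\n\n"
    "task 5:\n\nMusculoskeletal system\n\n"
    "task 6:\nLocomotor system\n"
)
tasks = parse_tasks(sample)
```

Keying by task number rather than by position keeps duplicate topics (such as tasks 8 and 9 in Linguistics.txt) as separate entries.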
benchmark/evaluation_protocol/epistemic_evaluator_prompt.txt ADDED
@@ -0,0 +1,31 @@
+ You are a SKEPTICAL AGENT.
+
+ Your task is to verify whether the claims in the response are supported by the provided documents.
+
+ Instructions:
+ 1. Break the response into individual claims.
+ 2. Compare each claim with the provided documents.
+ 3. If the claim is supported or strongly implied by the documents → VERIFIED.
+ 4. If no support exists → EPISTEMIC FAILURE.
+ 5. Do not evaluate writing style or clarity.
+
+ RESPONSE:
+ {response}
+
+ DOCUMENTS:
+ {source_text}
+
+ Output format:
+
+ CLAIM: "text"
+ STATUS: VERIFIED / EPISTEMIC FAILURE
+ REASON: explanation based on the documents
+
+ Epistemic evaluation is performed by an independent LLM acting as a skeptical agent.
+
+ The evaluator receives:
+ - the generated response
+ - the reference documents used for generation
+
+ It decomposes the response into individual claims and checks whether each claim is supported by the documents.
+ Unsupported claims are marked as "Epistemic Failure".
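Note on using the evaluator prompt: the template exposes two placeholders, `{response}` and `{source_text}`, and asks for CLAIM / STATUS / REASON triples. The sketch below shows one way the placeholders could be filled and the evaluator's reply parsed. The names `build_prompt`, `parse_verdicts`, and `verified_share`, the shortened template, and the sample reply are all assumptions for illustration. The commit does not include this code, and the ratio computed here is not claimed to be how the reported benchmark metrics were derived.

```python
import re

# Shortened stand-in for the prompt file; only the placeholders matter here.
TEMPLATE = "RESPONSE:\n{response}\n\nDOCUMENTS:\n{source_text}\n"

# One CLAIM / STATUS / REASON triple, following the output format in the prompt.
VERDICT_RE = re.compile(
    r'CLAIM:\s*"?(?P<claim>.*?)"?\s*\n'
    r'STATUS:\s*(?P<status>VERIFIED|EPISTEMIC FAILURE)\s*\n'
    r'REASON:\s*(?P<reason>.*)'
)

def build_prompt(template, response, source_text):
    """Fill the two placeholders used by the evaluator prompt."""
    return template.format(response=response, source_text=source_text)

def parse_verdicts(output):
    """Extract every claim verdict from the evaluator's reply."""
    return [
        {"claim": m["claim"], "status": m["status"], "reason": m["reason"].strip()}
        for m in VERDICT_RE.finditer(output)
    ]

def verified_share(verdicts):
    """Fraction of claims marked VERIFIED (0.0 when there are no claims)."""
    if not verdicts:
        return 0.0
    return sum(v["status"] == "VERIFIED" for v in verdicts) / len(verdicts)

prompt = build_prompt(TEMPLATE, "A byte has 8 bits.", "A byte consists of 8 bits.")

# Hypothetical evaluator reply, written by hand for this sketch.
reply = (
    'CLAIM: "A byte has 8 bits"\n'
    "STATUS: VERIFIED\n"
    "REASON: Stated in the document.\n"
    "\n"
    'CLAIM: "RAM is non-volatile"\n'
    "STATUS: EPISTEMIC FAILURE\n"
    "REASON: No support in the documents.\n"
)
verdicts = parse_verdicts(reply)
share = verified_share(verdicts)
```

A strict pattern is used on purpose: a reply that drifts from the requested format yields fewer parsed verdicts, which is easier to detect than silently accepting a malformed status.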
benchmark/results/benchmark_results_table.txt ADDED
@@ -0,0 +1,10 @@
+ Domain            Epist. Base  Epist. Marc  Hall. Base  Hall. Marc  Evid. Base  Evid. Marc  Overconf. Base  Overconf. Marc  Cautious Base  Cautious Marc  Contrad. Base  Contrad. Marc  Claim Base  Claim Marc
+ Medicine          71           84           18          9           69          82          36              23              51             68             5              2              77          90
+ Neuroscience      69           83           17          9           70          82          35              22              52             67             4              2              78          89
+ Biology           74           82           13          8           74          81          31              22              56             66             3              2              81          89
+ Statistics        73           82           12          7           73          80          32              21              55             65             3              2              82          90
+ Linguistics       72           83           15          9           71          82          34              23              53             68             4              2              79          90
+ Computer Science  74           85           13          7           74          84          30              20              57             69             3              2              82          91
+ Physics           72           82           14          8           72          80          33              22              54             66             4              2              80          88
+ Law               71           84           16          10          68          84          36              24              52             69             6              3              78          91
+
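Note on reading the results table: every metric appears as a Base / Marc pair, so the per-metric difference (Marc minus Base) is the quickest way to compare the two columns. The sketch below computes it for two rows copied from the table. The helper name `deltas` is an assumption, and the table does not state the units of the scores, so the differences are in whatever units the table uses.

```python
# Metric order as it appears in the table header (each has a Base and a Marc column).
METRICS = ["Epist.", "Hall.", "Evid.", "Overconf.", "Cautious", "Contrad.", "Claim"]

# Two rows copied verbatim from the table: Base, Marc alternating per metric.
ROWS = {
    "Medicine": [71, 84, 18, 9, 69, 82, 36, 23, 51, 68, 5, 2, 77, 90],
    "Law": [71, 84, 16, 10, 68, 84, 36, 24, 52, 69, 6, 3, 78, 91],
}

def deltas(values):
    """Return {metric: Marc - Base} for one table row (hypothetical helper)."""
    pairs = zip(values[0::2], values[1::2])  # (Base, Marc) per metric
    return {metric: marc - base for metric, (base, marc) in zip(METRICS, pairs)}

medicine = deltas(ROWS["Medicine"])
law = deltas(ROWS["Law"])

for domain, row in (("Medicine", medicine), ("Law", law)):
    print(domain, row)
```

For these two rows the Epist., Evid., Cautious, and Claim columns are higher under Marc, while Hall., Overconf., and Contrad. are lower.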