Upload 10 files
- benchmark/benchmark_tasks/Computer Science.txt +26 -0
- benchmark/benchmark_tasks/Law.txt +26 -0
- benchmark/benchmark_tasks/Linguistics.txt +26 -0
- benchmark/benchmark_tasks/Neuroscience.txt +26 -0
- benchmark/benchmark_tasks/Physics.txt +26 -0
- benchmark/benchmark_tasks/Statistics.txt +26 -0
- benchmark/benchmark_tasks/biology.txt +26 -0
- benchmark/benchmark_tasks/medicine.txt +28 -0
- benchmark/evaluation_protocol/epistemic_evaluator_prompt.txt +31 -0
- benchmark/results/benchmark_results_table.txt +10 -0
benchmark/benchmark_tasks/Computer Science.txt
ADDED
@@ -0,0 +1,26 @@
+task 1:
+Character encoding
+
+task 2:
+Hardware and software
+
+task 3:
+Input and output devices (I/O devices)
+
+task 4:
+Bit and byte
+
+task 5:
+Artificial Intelligence (AI)
+
+task 6:
+Numeric data encoding
+
+task 7:
+CPU (Central Processing Unit)
+
+task 8:
+ROM (Read-Only Memory)
+
+task 9:
+RAM
benchmark/benchmark_tasks/Law.txt
ADDED
@@ -0,0 +1,26 @@
+task 1:
+Private law
+
+task 2:
+Law and its sources
+
+task 3:
+Data assets as the technological center of gravity
+
+task 4:
+The Information Society: A Description of the Evolving Legal Landscape
+
+task 5:
+Legal certainty and effectiveness of law
+
+task 6:
+Proprietary rights (or Pecuniary rights) and Non-proprietary rights (or Personal rights)
+
+task 7:
+Legal interpretation (or Statutory interpretation)
+
+task 8:
+Case law (or Jurisprudence)
+
+task 9:
+Cognizance proceeding
benchmark/benchmark_tasks/Linguistics.txt
ADDED
@@ -0,0 +1,26 @@
+task 1:
+Specific Linguistic Needs and learning
+
+task 2:
+The glottodidactic potential of Cognitive Linguistics
+
+task 3:
+The cognitive substrate of Specific Linguistic Needs
+
+task 4:
+Language and cognition from a [X] perspective
+
+task 5:
+Linguistic politeness and learning
+
+task 6:
+Acquisition-based teaching, language pedagogy, and Second Language Acquisition (SLA)
+
+task 7:
+Social deixis
+
+task 8:
+The modulation of illocutionary force
+
+task 9:
+The modulation of illocutionary force
benchmark/benchmark_tasks/Neuroscience.txt
ADDED
@@ -0,0 +1,26 @@
+task 1:
+Cerebral cortex
+
+task 2:
+Glial cells (or Neuroglia)
+
+task 3:
+CNS (Central Nervous System)
+
+task 4:
+Vision
+
+task 5:
+Memory and Learning: A Multidisciplinary Analysis
+
+task 6:
+Brain and Immune System: A Complex Relationship
+
+task 7:
+Forebrain
+
+task 8:
+Neural networks and minds
+
+task 9:
+Sleep
benchmark/benchmark_tasks/Physics.txt
ADDED
@@ -0,0 +1,26 @@
+task 1:
+Atomic structure and quantum hypothesis
+
+task 2:
+Bohr model
+
+task 3:
+Characteristics of an electromagnetic wave
+
+task 4:
+Period, frequency, wavelength
+
+task 5:
+Elasticity and compressibility of solids
+
+task 6:
+Ideal Gas Laws
+
+task 7:
+Absolute temperature scale in Kelvin
+
+task 8:
+Latent heat of fusion and vaporization
+
+task 9:
+Enthalpy (H) in phase transitions
benchmark/benchmark_tasks/Statistics.txt
ADDED
@@ -0,0 +1,26 @@
+task 1:
+Covariance and Correlation
+
+task 2:
+Conditional Probability
+
+task 3:
+Bayes' Theorem (or Bayes' Formula)
+
+task 4:
+Approximate test for a Bernoulli sample
+
+task 5:
+Chi-squared and Student's t-distributions
+
+task 6:
+Empirical distribution function (or Empirical cumulative distribution function, ECDF)
+
+task 7:
+Combinatorics
+
+task 8:
+The distribution (or Law) of a random variable
+
+task 9:
+Regression line
benchmark/benchmark_tasks/biology.txt
ADDED
@@ -0,0 +1,26 @@
+task 1:
+Carbohydrates
+
+task 2:
+Cell membrane (or Plasma membrane)
+
+task 3:
+Nucleotides and Nucleic acids
+
+task 4:
+Peptides and proteins
+
+task 5:
+Cytoplasm and cell organelles
+
+task 6:
+Energy flow, climate, and biosphere
+
+task 7:
+Cell reproduction
+
+task 8:
+DNA replication
+
+task 9:
+Classification of organisms (or Taxonomy)
benchmark/benchmark_tasks/medicine.txt
ADDED
@@ -0,0 +1,28 @@
+task 1:
+Respiratory system
+
+task 2:
+Urinary system
+
+task 3:
+Digestive system
+
+task 4:
+General anatomy
+
+task 5:
+
+Musculoskeletal system
+
+task 6:
+Locomotor system
+
+task 7:
+Historical background / Historical overview
+
+task 8:
+Female reproductive system
+
+task 9:
+Male reproductive system
+
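The eight task files above share one layout: a `task N:` header, then the topic on the next non-blank line. A minimal sketch of a loader for that layout follows, assuming the files are read as plain text. The names `parse_tasks` and `TASK_HEADER` are hypothetical and do not come from the repository. The sample reproduces the medicine.txt quirk where a blank line separates `task 5:` from its topic.

```python
import re

# Hypothetical helper names; nothing here is taken from the repository.
TASK_HEADER = re.compile(r"^task\s+(\d+):\s*$", re.IGNORECASE)


def parse_tasks(text: str) -> dict[int, str]:
    """Map each task number to the first non-blank line after its header."""
    tasks: dict[int, str] = {}
    pending = None  # number of the header still waiting for its topic line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank separators, even between header and topic
        match = TASK_HEADER.match(line)
        if match:
            pending = int(match.group(1))
        elif pending is not None:
            tasks[pending] = line
            pending = None
    return tasks


# Excerpt in the shape of medicine.txt, including the stray blank line.
sample = """task 4:
General anatomy

task 5:

Musculoskeletal system

task 6:
Locomotor system
"""

tasks = parse_tasks(sample)
print(tasks)
```

Skipping blank lines everywhere, rather than expecting the topic on the line right after the header, is what lets the same function read all eight files.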
benchmark/evaluation_protocol/epistemic_evaluator_prompt.txt
ADDED
@@ -0,0 +1,31 @@
+You are a SKEPTICAL AGENT.
+
+Your task is to verify whether the claims in the response are supported by the provided documents.
+
+Instructions:
+1. Break the response into individual claims.
+2. Compare each claim with the provided documents.
+3. If the claim is supported or strongly implied by the documents → VERIFIED.
+4. If no support exists → EPISTEMIC FAILURE.
+5. Do not evaluate writing style or clarity.
+
+RESPONSE:
+{response}
+
+DOCUMENTS:
+{source_text}
+
+Output format:
+
+CLAIM: "text"
+STATUS: VERIFIED / EPISTEMIC FAILURE
+REASON: explanation based on the documents
+
+Epistemic evaluation is performed by an independent LLM acting as a skeptical agent.
+
+The evaluator receives:
+- the generated response
+- the reference documents used for generation
+
+It decomposes the response into individual claims and checks whether each claim is supported by the documents.
+Unsupported claims are marked as "Epistemic Failure".
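The prompt above has two placeholders, `{response}` and `{source_text}`, and asks for CLAIM / STATUS / REASON triples. A minimal sketch of filling the template and parsing such a reply follows. It makes no model call. The template is abbreviated, `Verdict`, `build_prompt`, `parse_verdicts` and `verified_share` are hypothetical names, and the sample reply is invented for illustration. The commit does not say how its reported scores were derived, so `verified_share` is only one plausible summary of the verdicts.

```python
import re
from dataclasses import dataclass

# Abbreviated copy of the prompt; only the two placeholders matter here.
PROMPT_TEMPLATE = (
    "You are a SKEPTICAL AGENT.\n\n"
    "RESPONSE:\n{response}\n\n"
    "DOCUMENTS:\n{source_text}\n"
)


@dataclass
class Verdict:
    claim: str
    status: str
    reason: str


# One CLAIM / STATUS / REASON triple, as laid out under "Output format".
VERDICT_RE = re.compile(
    r'CLAIM:\s*"?(?P<claim>.*?)"?\s*\n'
    r"STATUS:\s*(?P<status>VERIFIED|EPISTEMIC FAILURE)\s*\n"
    r"REASON:\s*(?P<reason>.*)"
)


def build_prompt(response: str, source_text: str) -> str:
    """Substitute the two placeholders used by the prompt file."""
    return PROMPT_TEMPLATE.format(response=response, source_text=source_text)


def parse_verdicts(evaluator_output: str) -> list[Verdict]:
    """Extract every well-formed triple from the evaluator's reply."""
    return [
        Verdict(m["claim"].strip(), m["status"], m["reason"].strip())
        for m in VERDICT_RE.finditer(evaluator_output)
    ]


def verified_share(verdicts: list[Verdict]) -> float:
    """Fraction of claims marked VERIFIED (0.0 when there are no claims)."""
    if not verdicts:
        return 0.0
    return sum(v.status == "VERIFIED" for v in verdicts) / len(verdicts)


# Invented evaluator reply, used only to exercise the parser.
sample_output = '''CLAIM: "RAM is volatile memory."
STATUS: VERIFIED
REASON: The document states that RAM loses its contents without power.

CLAIM: "RAM was invented in 1950."
STATUS: EPISTEMIC FAILURE
REASON: No date appears in the documents.
'''

prompt = build_prompt("RAM is volatile memory.", "RAM loses its contents without power.")
verdicts = parse_verdicts(sample_output)
print(len(verdicts), verified_share(verdicts))
```

A reply that drifts from the requested layout (a missing STATUS line, an unexpected label) is silently skipped by this pattern, so a real harness would probably want to count the unparsed remainder as well.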
benchmark/results/benchmark_results_table.txt
ADDED
@@ -0,0 +1,10 @@
+Domain            Epist. Base  Epist. Marc  Hall. Base  Hall. Marc  Evid. Base  Evid. Marc  Overconf. Base  Overconf. Marc  Cautious Base  Cautious Marc  Contrad. Base  Contrad. Marc  Claim Base  Claim Marc
+Medicine          71           84           18          9           69          82          36              23              51             68             5              2              77          90
+Neuroscience      69           83           17          9           70          82          35              22              52             67             4              2              78          89
+Biology           74           82           13          8           74          81          31              22              56             66             3              2              81          89
+Statistics        73           82           12          7           73          80          32              21              55             65             3              2              82          90
+Linguistics       72           83           15          9           71          82          34              23              53             68             4              2              79          90
+Computer Science  74           85           13          7           74          84          30              20              57             69             3              2              82          91
+Physics           72           82           14          8           72          80          33              22              54             66             4              2              80          88
+Law               71           84           16          10          68          84          36              24              52             69             6              3              78          91
+
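Each metric in the table appears twice, once under Base and once under Marc, so the natural comparison is the per-metric difference averaged over the eight domains. A minimal sketch of that arithmetic follows, with the rows copied from the table. `parse_rows` and `mean_scores` are hypothetical names, and the file does not state the units of the scores or expand the column abbreviations, so the code treats them as opaque numbers.

```python
from statistics import mean

# Metric labels exactly as abbreviated in the table header.
METRICS = ["Epist.", "Hall.", "Evid.", "Overconf.", "Cautious", "Contrad.", "Claim"]

# Data rows copied from benchmark_results_table.txt (header row omitted).
ROWS = """\
Medicine 71 84 18 9 69 82 36 23 51 68 5 2 77 90
Neuroscience 69 83 17 9 70 82 35 22 52 67 4 2 78 89
Biology 74 82 13 8 74 81 31 22 56 66 3 2 81 89
Statistics 73 82 12 7 73 80 32 21 55 65 3 2 82 90
Linguistics 72 83 15 9 71 82 34 23 53 68 4 2 79 90
Computer Science 74 85 13 7 74 84 30 20 57 69 3 2 82 91
Physics 72 82 14 8 72 80 33 22 54 66 4 2 80 88
Law 71 84 16 10 68 84 36 24 52 69 6 3 78 91
"""


def parse_rows(text):
    """Return {domain: {metric: (base, marc)}} from whitespace-separated rows."""
    table = {}
    for line in text.strip().splitlines():
        # Split from the right so "Computer Science" stays one domain name.
        domain, *values = line.rsplit(maxsplit=2 * len(METRICS))
        numbers = [int(v) for v in values]
        # Columns alternate Base, Marc for each metric.
        table[domain] = {
            metric: (numbers[2 * i], numbers[2 * i + 1])
            for i, metric in enumerate(METRICS)
        }
    return table


def mean_scores(table):
    """Average the Base and Marc columns of every metric across domains."""
    return {
        metric: (
            mean(row[metric][0] for row in table.values()),
            mean(row[metric][1] for row in table.values()),
        )
        for metric in METRICS
    }


table = parse_rows(ROWS)
summary = mean_scores(table)
for metric, (base, marc) in summary.items():
    print(f"{metric:<10} base={base:6.2f} marc={marc:6.2f} diff={marc - base:+6.2f}")
```

For the first metric this gives a Base mean of 72 and a Marc mean of 83.125 across the eight domains. Whether a higher or lower value is better for a given metric is not stated in the file, so the sign of each difference should be read against the metric's meaning.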