Blanca committed
Commit 177251c · verified · 1 parent: 9f30340

Update content.py

Files changed (1): content.py +21 -13
content.py CHANGED
@@ -1,27 +1,31 @@
 TITLE = """
- <div style="text-align: center;">
- <h1 align="center" id="space-title">Critical Questions Generation Leaderboard</h1>
- <img src="logo_st1.svg" alt="Logo" width="20"/>
- </div>
-
 """

 INTRODUCTION_TEXT = """
- <div style="display: flex; align-items: center; gap: 10px;">
- <span style="font-size:25px;">Critical Questions Generation is the task of automatically generating questions that can unmask the assumptions held by the premises of an argumentative text. \nThis leaderboard, aims at benchmarking the capacity of language technology systems to create Critical Questions (CQs). That is, questions that should be asked in order to judge if an argument is acceptable or fallacious.\nThe task consists on generating 3 Useful Critical Questions per argumentative text. \nAll details on the task, the dataset, and the evaluation can be found in the paper [Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models](https://arxiv.org/abs/2505.11341)</span>
- <img src="examples.png" alt="Example" width="50"/>
- </div>

 ## Data

- <p style='font-size:20px;'> The [CQs-Gen dataset](https://huggingface.co/datasets/HiTZ/CQs-Gen) gathers 220 interventions of real debates. And contains:

 - `validation`: which contains 186 interventions and can be used for training or validation, as it contains ~25 reference questions per intervention, already evaluated according to their usefulness (either Useful, Unhelpful or Invalid).
 - `test`: which contains 34 interventions. The reference questions of this set (~70) are kept private to avoid data contamination. The questions generated using the test set are what should be submitted to this leaderboard.
- </p>

 ## Evaluation
- <p style='font-size:20px;'> Evaluation is done by comparing each of the 3 newly generated question to the reference questions of the test set using Semantic Text Similarity, and inheriting the label of the most similar reference given the threshold of 0.65. Questions where no reference is found are considered Invalid. See the evaluation function [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/app.py#L141), or find more details in the [paper](https://arxiv.org/abs/2505.11341). </p>

 ## Leaderboard

 
@@ -29,7 +33,11 @@ INTRODUCTION_TEXT = """

 SUBMISSION_TEXT = """
 ## Submissions
- <p style='font-size:20px;'> Results can be submitted for the test set only. \nWe expect submissions to be json files with the following format: </p>

 ```json
 {
 
 TITLE = """
+ <h1 align="center" id="space-title">Critical Questions Generation Leaderboard</h1>
 """

 INTRODUCTION_TEXT = """
+ Critical Questions Generation is the task of automatically generating questions that can unmask the assumptions held by the premises of an argumentative text.
+
+ This leaderboard aims at benchmarking the capacity of language technology systems to create Critical Questions (CQs), that is, questions that should be asked in order to judge whether an argument is acceptable or fallacious.
+
+ The task consists of generating 3 Useful Critical Questions per argumentative text.
+
+ All details on the task, the dataset, and the evaluation can be found in the paper [Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models](https://arxiv.org/abs/2505.11341).
+
+ """
+
+ DATA_TEXT = """

 ## Data

+ The [CQs-Gen dataset](https://huggingface.co/datasets/HiTZ/CQs-Gen) gathers 220 interventions from real debates and contains:

 - `validation`: which contains 186 interventions and can be used for training or validation, as it contains ~25 reference questions per intervention, already evaluated according to their usefulness (either Useful, Unhelpful or Invalid).
 - `test`: which contains 34 interventions. The reference questions of this set (~70) are kept private to avoid data contamination. The questions generated using the test set are what should be submitted to this leaderboard.
+

 ## Evaluation
+
+ Evaluation is done by comparing each of the 3 newly generated questions to the reference questions of the test set using Semantic Text Similarity, inheriting the label of the most similar reference given a threshold of 0.65. Questions for which no reference is found are considered Invalid. See the evaluation function [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/app.py#L141), or find more details in the [paper](https://arxiv.org/abs/2505.11341).

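The labelling rule described in the evaluation paragraph above can be sketched in a few lines of Python. This is an illustrative sketch, not the official scorer: `label_question` and `toy_similarity` are hypothetical names, and the string-ratio similarity merely stands in for the Semantic Text Similarity model the leaderboard actually uses (see the linked `app.py` for the real implementation).

```python
from difflib import SequenceMatcher

THRESHOLD = 0.65  # similarity threshold stated in the evaluation description


def label_question(generated, references, similarity):
    """Assign a label to one generated critical question.

    references: list of (reference_question, label) pairs, where the label
    is "Useful", "Unhelpful" or "Invalid".
    similarity: any function mapping two strings to a score in [0, 1];
    the leaderboard uses a Semantic Text Similarity model here.
    """
    best_label, best_score = "Invalid", 0.0
    for ref_text, ref_label in references:
        score = similarity(generated, ref_text)
        if score > best_score:
            best_label, best_score = ref_label, score
    # Inherit the most similar reference's label only at or above the
    # threshold; otherwise the generated question counts as Invalid.
    return best_label if best_score >= THRESHOLD else "Invalid"


def toy_similarity(a, b):
    # Crude stand-in for the real STS model, for illustration only.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


refs = [("Is the premise actually supported by evidence?", "Useful"),
        ("Could you repeat the argument?", "Unhelpful")]
print(label_question("Is the premise supported by evidence?", refs, toy_similarity))
```

With the toy similarity, the near-duplicate of the first reference scores well above 0.65 and inherits its "Useful" label, while an unrelated question falls below the threshold and is labelled Invalid.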
 ## Leaderboard

 

 SUBMISSION_TEXT = """
 ## Submissions
+
+
+ Results can be submitted for the test set only.
+
+ We expect submissions to be JSON files with the following format:

 ```json
 {