MCILAB committed · commit ad600ef (verified) · 1 parent: 7f9e409

Update src/about.py

Files changed (1): src/about.py (+28 -5)
src/about.py CHANGED
@@ -23,7 +23,7 @@ NUM_FEWSHOT = 0 # Change with your few shot
 
 
 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">Demo leaderboard</h1>"""
+TITLE = """<h1 align="center" id="space-title">Open Persian LLM Alignment Leaderboard</h1>"""
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
@@ -32,10 +32,33 @@ Intro text
 
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
-## How it works
-
-## Reproducibility
-To reproduce our results, here is the commands you can run:
+## Open Persian LLM Alignment Leaderboard
+
+Developed by MCILAB in collaboration with the Machine Learning Laboratory at Sharif University of Technology, this benchmark presents a comprehensive evaluation framework for assessing the alignment of Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms.
+
+Addressing gaps in existing LLM evaluation frameworks, this benchmark is tailored specifically to Persian linguistic and cultural contexts. It combines three types of Persian-language benchmarks:
+1. Translated datasets (adapted from established English benchmarks)
+2. Synthetically generated data (newly created for Persian LLMs)
+3. Naturally collected data (reflecting indigenous cultural nuances)
+
+### Key Datasets in the Benchmark
+The benchmark integrates the following datasets to ensure a robust evaluation of Persian LLMs:
+
+**Translated datasets**
+- Anthropic-fa
+- AdvBench-fa
+- HarmBench-fa
+- DecodingTrust-fa
+
+**Newly developed Persian datasets**
+- ProhibiBench-fa: evaluates harmful and prohibited content in Persian culture.
+- SafeBench-fa: assesses safety in generated outputs.
+- FairBench-fa: measures bias mitigation in Persian LLMs.
+- SocialBench-fa: evaluates adherence to culturally accepted behaviors.
+
+**Naturally collected Persian dataset**
+- GuardBench-fa: a large-scale dataset designed to align Persian LLMs with local cultural norms.
+
+### A Unified Framework for Persian LLM Evaluation
+By combining these datasets, our work establishes a culturally grounded alignment evaluation framework, enabling systematic assessment across three key aspects:
+- Safety: avoiding harmful or toxic content.
+- Fairness: mitigating biases in model outputs.
+- Social norms: ensuring culturally appropriate behavior.
+
+This benchmark not only fills a critical gap in Persian LLM evaluation but also provides a standardized leaderboard to track progress toward aligned, ethical, and culturally aware Persian language models.
 
 """
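For context, here is a minimal sketch of how the module-level string constants edited in this commit are typically consumed elsewhere in a leaderboard Space. The `build_about_tab` helper is a hypothetical name for illustration, and the Gradio calls mentioned in the comments are an assumption about the surrounding app code, not part of this commit:

```python
# Hypothetical sketch: src/about.py exposes plain string constants that the
# UI layer renders (leaderboard Spaces commonly pass them to gr.HTML(TITLE)
# and gr.Markdown(LLM_BENCHMARKS_TEXT); that wiring is assumed, not shown here).
TITLE = """<h1 align="center" id="space-title">Open Persian LLM Alignment Leaderboard</h1>"""

LLM_BENCHMARKS_TEXT = """
## Open Persian LLM Alignment Leaderboard
Evaluates Persian LLM alignment on safety, fairness, and social norms.
"""

def build_about_tab() -> dict:
    """Collect the static text blocks that the UI layer would render."""
    return {"title": TITLE, "about": LLM_BENCHMARKS_TEXT}

if __name__ == "__main__":
    print(build_about_tab()["title"])
```

Because these are ordinary module constants, the commit changes the displayed title and "about" text without touching any evaluation logic.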