Update src/about.py

src/about.py CHANGED (+28 −5)
@@ -23,7 +23,7 @@ NUM_FEWSHOT = 0 # Change with your few shot
 
 
 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">
+TITLE = """<h1 align="center" id="space-title">Open Persian LLM Alignment Leaderboard</h1>"""
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
@@ -32,10 +32,33 @@ Intro text
 
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
-##
-
-
-
+## Open Persian LLM Alignment Leaderboard
+
+Developed by MCILAB in collaboration with the Machine Learning Laboratory at Sharif University of Technology, this benchmark presents a comprehensive evaluation framework for assessing the alignment of Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms.
+Addressing the gaps in existing LLM evaluation frameworks, this benchmark is specifically tailored to Persian linguistic and cultural contexts. It combines three types of Persian-language benchmarks:
+1. Translated datasets (adapted from established English benchmarks)
+2. Synthetically generated data (newly created for Persian LLMs)
+3. Naturally collected data (reflecting indigenous cultural nuances)
+Key Datasets in the Benchmark
+The benchmark integrates the following datasets to ensure a robust evaluation of Persian LLMs:
+Translated Datasets
+• Anthropic-fa
+• AdvBench-fa
+• HarmBench-fa
+• DecodingTrust-fa
+Newly Developed Persian Datasets
+• ProhibiBench-fa: Evaluates harmful and prohibited content in Persian culture.
+• SafeBench-fa: Assesses safety in generated outputs.
+• FairBench-fa: Measures bias mitigation in Persian LLMs.
+• SocialBench-fa: Evaluates adherence to culturally accepted behaviors.
+Naturally Collected Persian Dataset
+• GuardBench-fa: A large-scale dataset designed to align Persian LLMs with local cultural norms.
+A Unified Framework for Persian LLM Evaluation
+By combining these datasets, our work establishes a culturally grounded alignment evaluation framework, enabling systematic assessment across three key aspects:
+• Safety: Avoiding harmful or toxic content.
+• Fairness: Mitigating biases in model outputs.
+• Social Norms: Ensuring culturally appropriate behavior.
+This benchmark not only fills a critical gap in Persian LLM evaluation but also provides a standardized leaderboard to track progress in developing aligned, ethical, and culturally aware Persian language models.
 
 """
 
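Since `src/about.py` holds only display strings, the change can be sanity-checked without launching the Space. A minimal sketch (the constant values are copied from the diff above, abbreviated; the checks themselves are my own and assume the stock leaderboard template, whose styling targets the `space-title` id):

```python
# Title and benchmark text as set by this commit (LLM_BENCHMARKS_TEXT abbreviated).
# Note that LLM_BENCHMARKS_TEXT is an f-string in the source, so any literal
# braces added to it later would need doubling ("{{" / "}}").
TITLE = """<h1 align="center" id="space-title">Open Persian LLM Alignment Leaderboard</h1>"""

LLM_BENCHMARKS_TEXT = f"""
## Open Persian LLM Alignment Leaderboard
"""

# The template styles the heading via id="space-title"; keep that attribute
# intact when editing the title text.
assert 'id="space-title"' in TITLE
assert TITLE.strip().endswith("</h1>")
```

These checks only guard the markup shell; the visible title text between the `<h1>` tags can change freely.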