Update src/about.py
src/about.py CHANGED +9 -6
@@ -41,26 +41,29 @@ Addressing the gaps in existing LLM evaluation frameworks, this benchmark is spe
 2. Synthetically generated data (newly created for Persian LLMs)
 3. Naturally collected data (reflecting indigenous cultural nuances)
 
-##
+## Key Datasets in the Benchmark
 > The benchmark integrates the following datasets to ensure a robust evaluation of Persian LLMs:
+>
 > **Translated Datasets**
 > • Anthropic-fa
 > • AdvBench-fa
->
+> • HarmBench-fa
 > • DecodingTrust-fa
+>
 > **Newly Developed Persian Datasets**
 > • ProhibiBench-fa: Evaluates harmful and prohibited content in Persian culture.
 > • SafeBench-fa: Assesses safety in generated outputs.
 > • FairBench-fa: Measures bias mitigation in Persian LLMs.
 > • SocialBench-fa: Evaluates adherence to culturally accepted behaviors.
+>
 > **Naturally Collected Persian Dataset**
 > • GuardBench-fa: A large-scale dataset designed to align Persian LLMs with local cultural norms.
 
 ### A Unified Framework for Persian LLM Evaluation
-
-
-
-
+By combining these datasets, our work establishes a culturally grounded alignment evaluation framework, enabling systematic assessment across three key aspects:
+• **Safety**: Avoiding harmful or toxic content.
+• **Fairness**: Mitigating biases in model outputs.
+• **Social Norms**: Ensuring culturally appropriate behavior.
 
 
 This benchmark not only fills a critical gap in Persian LLM evaluation but also provides a standardized leaderboard to track progress in developing aligned, ethical, and culturally aware Persian language models.
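For readers who want the dataset grouping from the updated text in machine-readable form, here is a minimal sketch. The dataset names and the three source groups come from the description above; the `DATASETS` dictionary, its keys, and both helper functions are hypothetical and are not taken from `src/about.py` or any other file in the Space.

```python
# Hypothetical registry: dataset names are from the benchmark description,
# but this structure and the key names are assumptions for illustration.
DATASETS = {
    "translated": [
        "Anthropic-fa",
        "AdvBench-fa",
        "HarmBench-fa",
        "DecodingTrust-fa",
    ],
    "newly_developed": [
        "ProhibiBench-fa",
        "SafeBench-fa",
        "FairBench-fa",
        "SocialBench-fa",
    ],
    "naturally_collected": ["GuardBench-fa"],
}


def all_datasets(registry):
    """Return every dataset name in the registry as one flat list."""
    return [name for names in registry.values() for name in names]


def group_of(registry, dataset):
    """Return the source group a dataset belongs to, or raise KeyError."""
    for group, names in registry.items():
        if dataset in names:
            return group
    raise KeyError(dataset)
```

A structure like this would make it easy to iterate over every dataset when building leaderboard columns, though the actual leaderboard code may organize things differently.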