Update app.py
app.py
CHANGED
@@ -48,11 +48,11 @@ with gr.Blocks(title="LLM Propensity Evaluation Leaderboard") as demo:

 ## Evaluation Details:
 - **Instruction Following Score**: Measures a model's tendency to follow instructions accurately. Measured using the IFEval dataset.
-- **Hallucination Rate**: Evaluates how often a model hallucinates. Measured using a subset of the SimpleQA dataset. We calculated the rate using this formula : (1 - (correct + not_attempted)), where correct = when the model answered a question correctly and not_attempted = when a model admits to not knowing the answer to a question.*
+- **Factual Hallucination Rate**: Evaluates how often a model hallucinates when questioned on facts. Measured using a subset of the SimpleQA dataset, which explicitly asks uncommon facts. We calculated the rate using this formula : (1 - (correct + not_attempted)), where correct = when the model answered a question correctly and not_attempted = when a model admits to not knowing the answer to a question.*

 ## How to Interpret the Scores:
 * Instruction Following Score: Higher scores indicate better adherence to instructions.
-* Hallucination Rate: Lower rates indicate fewer hallucinations.
+* Hallucination Rate: Lower rates indicate fewer hallucinations.

 *Note*: The evaluation metrics are designed to provide insights into the models' behavior in specific contexts. They may not capture all aspects of model performance or alignment.

@@ -80,8 +80,7 @@ with gr.Blocks(title="LLM Propensity Evaluation Leaderboard") as demo:
 # Add footer information
 gr.Markdown("""
 ---
-**Last Updated**:
-**Contact**: <TBD>
+**Last Updated**: November 1, 2025
 """)

 # Launch the app
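
The formula quoted in the changed description, (1 - (correct + not_attempted)), can be sketched in Python as below. This is a minimal illustration, not the leaderboard's actual code: the function name `factual_hallucination_rate` and its parameters are hypothetical, and it assumes `correct` and `not_attempted` are fractions of the graded questions, so raw counts are divided by the total first.

```python
def factual_hallucination_rate(correct: int, not_attempted: int, total: int) -> float:
    """Return 1 - (correct + not_attempted), with both terms taken as fractions.

    Hypothetical helper: the name and the count-based signature are assumptions
    made for illustration; only the formula itself comes from the description.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of graded questions")
    if correct < 0 or not_attempted < 0 or correct + not_attempted > total:
        raise ValueError("counts must be non-negative and sum to at most total")
    # Every answer that is neither correct nor an admitted "don't know"
    # is counted as a hallucination.
    return 1.0 - (correct + not_attempted) / total


# Example: 100 questions, 40 answered correctly, 35 declined.
rate = factual_hallucination_rate(correct=40, not_attempted=35, total=100)
print(f"{rate:.2f}")  # 0.25
```

Under this reading a model lowers its rate either by answering correctly or by declining to answer, so the metric rewards abstention over a wrong guess.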