foo-barrr committed
Commit 4baa2f8 · verified · 1 parent: f400429

Update app.py

Files changed (1): app.py (+3 −4)
app.py CHANGED
@@ -48,11 +48,11 @@ with gr.Blocks(title="LLM Propensity Evaluation Leaderboard") as demo:
 
     ## Evaluation Details:
     - **Instruction Following Score**: Measures a model's tendency to follow instructions accurately. Measured using the IFEval dataset.
-    - **Hallucination Rate**: Evaluates how often a model hallucinates. Measured using a subset of the SimpleQA dataset. We calculated the rate using this formula : (1 - (correct + not_attempted)), where correct = when the model answered a question correctly and not_attempted = when a model admits to not knowing the answer to a question.*
+    - **Factual Hallucination Rate**: Evaluates how often a model hallucinates when questioned on facts. Measured using a subset of the SimpleQA dataset, which explicitly asks uncommon facts. We calculated the rate using this formula : (1 - (correct + not_attempted)), where correct = when the model answered a question correctly and not_attempted = when a model admits to not knowing the answer to a question.*
 
     ## How to Interpret the Scores:
     * Instruction Following Score: Higher scores indicate better adherence to instructions.
-    * Hallucination Rate: Lower rates indicate fewer hallucinations.
+    * Hallucination Rate: Lower rates indicate fewer hallucinations.
 
     *Note*: The evaluation metrics are designed to provide insights into the models' behavior in specific contexts. They may not capture all aspects of model performance or alignment.
 
@@ -80,8 +80,7 @@ with gr.Blocks(title="LLM Propensity Evaluation Leaderboard") as demo:
     # Add footer information
     gr.Markdown("""
    ---
-    **Last Updated**: Sep 11, 2025
-    **Contact**: <TBD>
+    **Last Updated**: November 1, 2025
    """)
 
     # Launch the app
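The formula in the updated description, (1 - (correct + not_attempted)), can be sketched as a small helper. This is a hypothetical illustration of the metric, not code from app.py; the function name and parameters are assumptions, with counts normalized by the total number of questions:

```python
def hallucination_rate(correct: int, not_attempted: int, total: int) -> float:
    """Factual hallucination rate as described in the leaderboard text:
    1 - (fraction answered correctly + fraction the model declined to answer).
    Everything else is treated as a hallucinated (confident but wrong) answer."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - (correct + not_attempted) / total

# Example: of 100 SimpleQA questions, 55 answered correctly, 20 declined.
rate = hallucination_rate(55, 20, 100)
print(rate)  # 0.25 -> 25% of answers were hallucinated
```

Under this reading, lower rates are better, which matches the "How to Interpret the Scores" bullet in the diff.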