Fix None passed to squad metric and update import in tests.
Files changed:
- README.md +43 -5
- tests/test_answer_extraction.py +1 -1
- tests/test_metrics.py +1 -1
- timebench_eval.py +1 -1
README.md
CHANGED

@@ -1,12 +1,50 @@
 ---
 title: Timebench Eval
-…
-…
-…
+datasets:
+- TimeBench
+tags:
+- evaluate
+- metric
+description: "TODO: add a description here"
 sdk: gradio
-sdk_version:
+sdk_version: 3.19.1
 app_file: app.py
 pinned: false
 ---
 
-…
+# Metric Card for Timebench Eval
+
+***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
+
+## Metric Description
+*Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
+
+## How to Use
+*Give a general statement of how to use the metric.*
+
+*Provide the simplest possible example for using the metric.*
+
+### Inputs
+*List all input arguments in the format below.*
+- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
+
+### Output Values
+
+*Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu": 6.02}.*
+
+*State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
+
+#### Values from Popular Papers
+*Give examples, preferably with links to leaderboards or publications, of papers that have reported this metric, along with the values they have reported.*
+
+### Examples
+*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
+
+## Limitations and Bias
+*Note any known limitations or biases that the metric has, with links and references if possible.*
+
+## Citation
+*Cite the source where this metric was introduced.*
+
+## Further References
+*Add any useful further references.*
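The metric card's "How to Use" and "Examples" sections are still placeholders. As a stopgap, here is a minimal usage sketch, assuming the module follows the standard evaluate community-module interface and reports SQuAD-style scores (as the wrapped squad metric in timebench_eval.py below suggests); the load path and the exact output keys are illustrative, not confirmed by this commit.

    # Minimal usage sketch -- assumptions: the module loads like any other
    # community `evaluate` module and returns SQuAD-style scores.
    import evaluate

    timebench = evaluate.load("path/to/timebench_eval")  # local checkout or Hub repo id

    results = timebench.compute(
        predictions=["The answer is 1997."],  # raw model output; the answer is extracted internally
        references=["1997"],                  # gold answer as a plain string
    )
    print(results)  # e.g. {"exact_match": 100.0, "f1": 100.0}

Plain-string predictions and references mirror what the changed loop in timebench_eval.py consumes before reformatting them for the squad metric.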
tests/test_answer_extraction.py
CHANGED

@@ -1,5 +1,5 @@
 import pytest
-from timebench_eval
+from timebench_eval import TimebenchEval
 from conftest import (
     PREDICTION_1,
     PREDICTION_2,
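The corrected import gives the extraction tests direct access to the class. A sketch of the behavior this file presumably covers, assuming (as the fix in timebench_eval.py implies) that _extract_answer returns None when no answer can be parsed; the PREDICTION_1/PREDICTION_2 fixtures in conftest.py are not shown in this commit, so the input below is an invented stand-in.

    # Hypothetical extraction test -- `_extract_answer` returning None for
    # unparseable output is inferred from this commit's fix, not documented.
    from timebench_eval import TimebenchEval

    def test_extract_answer_handles_unparseable_output():
        metric = TimebenchEval()
        assert metric._extract_answer("no recognizable answer here") is None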
tests/test_metrics.py
CHANGED

@@ -1,4 +1,4 @@
-from timebench_eval
+from timebench_eval import TimebenchEval
 import pytest
 from conftest import (
     PREDICTION_1,
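Both test modules now import the class directly rather than the bare package. A minimal sketch of how the updated import is exercised end to end; instantiating the class directly (instead of going through evaluate.load) mirrors what the test files themselves do, and the prediction/reference strings are invented stand-ins.

    # Sketch of an end-to-end metric test using the corrected import.
    from timebench_eval import TimebenchEval

    def test_compute_returns_score_dict():
        metric = TimebenchEval()
        result = metric.compute(predictions=["Answer: 1997"], references=["1997"])
        assert isinstance(result, dict)  # evaluate modules return key/value scores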
timebench_eval.py
CHANGED

@@ -176,7 +176,7 @@ class TimebenchEval(evaluate.Metric):
 
         for i, (pred, ref) in enumerate(zip(predictions, references)):
             formatted_pred = [
-                {"id": "0", "prediction_text": self._extract_answer(pred)}
+                {"id": "0", "prediction_text": self._extract_answer(pred) or ""}
             ]
             formatted_ref = [
                 {"id": "0", "answers": {"text": [ref], "answer_start": [0]}}
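This one-line guard is the substance of the commit: _extract_answer can return None, but the downstream squad metric expects prediction_text to be a string (its answer normalization lower-cases the text, so None raises an AttributeError). Coercing to an empty string instead yields a well-defined zero score for that example. A sketch of the two cases, assuming the standard evaluate squad module is what the hunk delegates to:

    # Demonstrates why the `or ""` guard is needed before calling squad.
    import evaluate

    squad = evaluate.load("squad")
    references = [{"id": "0", "answers": {"text": ["1997"], "answer_start": [0]}}]

    # With an empty string, the example simply scores zero:
    ok = squad.compute(
        predictions=[{"id": "0", "prediction_text": ""}], references=references
    )
    print(ok)  # {"exact_match": 0.0, "f1": 0.0}

    # With None (what an unparseable prediction produced before this fix),
    # squad's string normalization raises instead of scoring.

Defaulting to "" rather than skipping the example keeps the denominator intact, so a model is still penalized for predictions from which no answer could be extracted.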