Spaces: Melady/TemporalBench_Leaderboard

Ray0202 committed on Feb 17

Commit 949c705, 1 parent: 9abc796

update 02.17.2026

Files changed (2)

app.py +16 -3
src/about.py +34 -2

app.py CHANGED

@@ -173,17 +173,30 @@ with demo:
         # Temporarily disabled for performance debugging.
         with gr.TabItem("📤 Submit Results", elem_id="tab-submit", id=2):
             gr.Markdown(
-                "Upload a results file for manual review. Approved results will be merged into the main dataset.",
                 elem_classes="markdown-text",
             )
             gr.Markdown(EXAMPLE_RECORD_MD, elem_classes="markdown-text")
-            submission_file = gr.File(label="Results file (.json or .csv)", file_types=[".json", ".csv"])
             submit_button = gr.Button("Submit for Review")
             submission_status = gr.Markdown()
-            submission_status.value = "Submission is temporarily disabled for performance debugging."
         with gr.TabItem("📝 About", elem_id="tab-about", id=3):
             gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
     # Citation section hidden for now.
     # with gr.Row():

         # Temporarily disabled for performance debugging.
         with gr.TabItem("📤 Submit Results", elem_id="tab-submit", id=2):
             gr.Markdown(
+                (
+                    "Upload submission files for manual review.\n\n"
+                    "Required files:\n"
+                    "1. `results_on_dev_dataset.json`: task-level metrics in leaderboard format.\n"
+                    "2. `results_on_test_dataset.json`: per-example test outputs with at least "
+                    "`id`, `tier`, `source_dataset`, `label`, and `output` "
+                    "(required when the sample contains forecasting).\n\n"
+                    "Please also include model architecture code and LLM/system details for verification."
+                ),
                 elem_classes="markdown-text",
             )
             gr.Markdown(EXAMPLE_RECORD_MD, elem_classes="markdown-text")
+            submission_file = gr.File(
+                label="Submission package (.zip or .rar)",
+                file_types=[".zip", ".rar"],
+            )
             submit_button = gr.Button("Submit for Review")
             submission_status = gr.Markdown()
+            submit_button.click(save_submission, [submission_file], submission_status)
         with gr.TabItem("📝 About", elem_id="tab-about", id=3):
             gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
+            gr.Markdown(f"## Citation\n{CITATION_BUTTON_LABEL}", elem_classes="markdown-text")
+            gr.Markdown(f"```bibtex\n{CITATION_BUTTON_TEXT.strip()}\n```", elem_classes="markdown-text")
     # Citation section hidden for now.
     # with gr.Row():
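
The new `submit_button.click(save_submission, [submission_file], submission_status)` call wires the upload to a handler whose body is not part of this diff. The following is only a minimal sketch of what such a handler could look like, not the Space's actual implementation. It assumes the handler receives the temporary file path as a string (this depends on the Gradio version and the component's `type` setting) and returns Markdown for the status component. `SUBMISSIONS_DIR`, `missing_required_files`, and the status messages are hypothetical names chosen for illustration.

```python
import shutil
import time
import zipfile
from pathlib import Path
from typing import List, Optional

# Hypothetical storage location; the real app's path is not shown in the diff.
SUBMISSIONS_DIR = Path("submissions")
# File names taken from the submission instructions in this commit.
REQUIRED_FILES = ("results_on_dev_dataset.json", "results_on_test_dataset.json")
ALLOWED_SUFFIXES = {".zip", ".rar"}


def missing_required_files(zip_path: Path) -> List[str]:
    """Return the required file names that are absent from a .zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Compare base names so files nested in a top-level folder still count.
        present = {Path(name).name for name in archive.namelist()}
    return [name for name in REQUIRED_FILES if name not in present]


def save_submission(uploaded_path: Optional[str]) -> str:
    """Copy an uploaded package into SUBMISSIONS_DIR and return a Markdown status."""
    if not uploaded_path:
        return "Please attach a `.zip` or `.rar` package before submitting."
    source = Path(uploaded_path)
    suffix = source.suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        return f"Unsupported file type `{suffix}`; expected `.zip` or `.rar`."
    # Only .zip contents can be inspected with the standard library;
    # .rar packages are stored as-is for manual review.
    if suffix == ".zip":
        if not zipfile.is_zipfile(source):
            return "The uploaded `.zip` file could not be read."
        missing = missing_required_files(source)
        if missing:
            return "Missing required file(s): " + ", ".join(f"`{m}`" for m in missing)
    SUBMISSIONS_DIR.mkdir(parents=True, exist_ok=True)
    # Prefix a timestamp so repeated uploads with the same name do not collide.
    target = SUBMISSIONS_DIR / f"{time.strftime('%Y%m%d-%H%M%S')}_{source.name}"
    shutil.copy(source, target)
    return f"Received `{source.name}`. It will be reviewed manually."
```

Returning a string from every branch keeps the handler compatible with a single `gr.Markdown` output, so validation failures and successful uploads are reported in the same place.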

src/about.py CHANGED

@@ -7,6 +7,8 @@ executed here, and no LLM APIs are called.
 """
 LLM_BENCHMARKS_TEXT = """
 ## What this leaderboard shows
 - One row per evaluated agent configuration
@@ -32,8 +34,29 @@ from dataset-level results using question/series counts. Missing values are igno
 ## Submission workflow
-Uploads are stored locally for manual review. Approved results should be merged into
-the main results file to appear on the leaderboard.
 ## Data access
@@ -49,4 +72,13 @@ EVALUATION_QUEUE_TEXT = ""
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
 """

 """
 LLM_BENCHMARKS_TEXT = """
+The paper describing this benchmark is *TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks* (https://arxiv.org/abs/2602.13272). We also maintain a public leaderboard and welcome submissions from state-of-the-art models: https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard
 ## What this leaderboard shows
 - One row per evaluated agent configuration
 ## Submission workflow
+Uploads are stored locally for manual review.
+For a valid submission, please provide **two files**:
+1. `results_on_dev_dataset.json`
+   - This follows the leaderboard metrics format.
+   - It should include task-level metrics only (e.g., T1-T4 and forecasting metrics).
+2. `results_on_test_dataset.json`
+   - This should include per-example outputs on the test split.
+   - For each example, include at least:
+     - `id`
+     - `tier`
+     - `source_dataset`
+     - `label`
+     - `output` (required when the example contains a forecasting task)
+We also strongly encourage including model and system metadata, such as:
+- model architecture code
+- LLM(s) used
+- key implementation details needed for result verification
+Approved submissions should then be merged into the main results file to appear on the leaderboard.
 ## Data access
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+@misc{weng2026temporalbenchbenchmarkevaluatingllmbased,
+      title={TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks},
+      author={Muyan Weng and Defu Cao and Wei Yang and Yashaswi Sharma and Yan Liu},
+      year={2026},
+      eprint={2602.13272},
+      archivePrefix={arXiv},
+      primaryClass={cs.AI},
+      url={https://arxiv.org/abs/2602.13272},
+}
 """
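
The per-example requirements for `results_on_test_dataset.json` described in the submission workflow can be checked before uploading. The sketch below is illustrative only and is not part of this commit. It assumes the file holds a list of JSON objects, and because the commit does not say how a forecasting example is identified, the caller supplies that rule as a predicate. The `task` field and all record values in the usage example are made up.

```python
from typing import Any, Callable, Dict, List

# Fields the submission workflow lists as required for every example.
ALWAYS_REQUIRED = ("id", "tier", "source_dataset", "label")


def find_record_problems(
    records: List[Dict[str, Any]],
    is_forecasting: Callable[[Dict[str, Any]], bool],
) -> List[str]:
    """Return one message per record that lacks a required field."""
    problems = []
    for index, record in enumerate(records):
        missing = [key for key in ALWAYS_REQUIRED if key not in record]
        # `output` is only required when the example contains a forecasting task.
        if is_forecasting(record) and "output" not in record:
            missing.append("output")
        if missing:
            problems.append(f"record {index}: missing {', '.join(missing)}")
    return problems


# Usage with made-up records; the "task" field is a hypothetical marker.
sample = [
    {"id": "ex-1", "tier": "a", "source_dataset": "demo", "label": "A", "task": "qa"},
    {"id": "ex-2", "tier": "b", "source_dataset": "demo", "label": [1.0, 2.0],
     "task": "forecasting"},
]
problems = find_record_problems(sample, lambda r: r.get("task") == "forecasting")
print(problems)  # ['record 1: missing output']
```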