Space: DeepSynthesisTeam/deepsynth-leaderboard


debjitpaul committed on Apr 19

Commit 0466393 (1 parent: 001af0b)

Updated citation font with Markdown for app.py and example submission


Files changed (1)

app.py +44 -6


@@ -418,18 +418,56 @@ def build_app() -> gr.Blocks:
             # -------------------------------------------------------------
             with gr.Tab("📖 About"):
                 gr.Markdown(ABOUT_BLURB)
                 gr.Markdown(
                     "## Metrics\n"
-                    "- **F1 / Precision / Recall** — token-level overlap between predicted "
-                    "and gold answers, averaged over all tasks.\n"
                     "- **Exact Match (EM)** — fraction of tasks where the predicted answer "
-                    "exactly equals the gold answer (strict).\n"
                     "- **LLM Judge** — semantic-equivalence scoring with small numerical "
-                    "tolerance (1–5.5%), evaluated by a strong frozen judge model.\n\n"
                     "## Dataset\n"
                     f"DeepSynth is hosted on 🤗 [`DeepSynthesisTeam/deepsynth-bench`]({DATASET_URL}). "
-                    "The dev set (40 tasks) ships with gold answers for prototyping; the test "
-                    "set (120 tasks) is released questions-only to prevent contamination."
                 )
             # -------------------------------------------------------------

             # -------------------------------------------------------------
             with gr.Tab("📖 About"):
                 gr.Markdown(ABOUT_BLURB)
+                gr.Markdown(
+                    "## The task\n"
+                    "Each DeepSynth task presents a complex, real-world question that cannot "
+                    "be answered by a single web search or a single document lookup. Producing "
+                    "the correct answer requires an agent to **decompose** the question into "
+                    "sub-problems, **gather** evidence from multiple heterogeneous sources "
+                    "(news articles, government statistics, scientific publications, specialized "
+                    "databases), **synthesize** findings into a coherent intermediate state, and "
+                    "**return a structured answer** (typically a JSON object of key-value pairs, "
+                    "a ranked list, or a numerical aggregate).\n\n"
+                    "Tasks span **7 domains** — science, geography, economics, history, culture, "
+                    "politics, and technology — and reference entities across **67 countries**. "
+                    "Expert curators verified that every question has a well-defined answer "
+                    "recoverable from public sources at the time of release, and that answering "
+                    "it requires combining evidence from at least three distinct sources."
+                )
+                gr.Markdown(
+                    "## Splits\n"
+                    "DeepSynth ships as **120 expert-curated tasks** divided into two splits:\n\n"
+                    "- **Dev set — 40 tasks (public, with gold answers).** Each dev task includes "
+                    "the question, the gold answer, a full **decomposition** into sub-problems, "
+                    "and the **intermediate answers** expected at each step. Use this split for "
+                    "prototyping, debugging, and agent development — you can score yourself "
+                    "locally and inspect where your agent's reasoning diverges from the expected "
+                    "trajectory.\n"
+                    "- **Test set — 80 tasks (questions only).** Gold answers and decompositions "
+                    "are held private to prevent contamination and enable clean evaluation. "
+                    "Submit your predictions via the leaderboard and we score them against the "
+                    "held-out answers."
+                )
                 gr.Markdown(
                     "## Metrics\n"
+                    "- **F1 / Precision / Recall** — token-level overlap between predicted and "
+                    "gold answers, averaged over all tasks.\n"
                     "- **Exact Match (EM)** — fraction of tasks where the predicted answer "
+                    "exactly equals the gold answer (strict structured-equality check).\n"
                     "- **LLM Judge** — semantic-equivalence scoring with small numerical "
+                    "tolerance (1–5.5%), evaluated by a strong frozen judge model. Captures "
+                    "cases where the answer is substantively correct but phrased or formatted "
+                    "differently from the gold."
+                )
+                gr.Markdown(
                     "## Dataset\n"
                     f"DeepSynth is hosted on 🤗 [`DeepSynthesisTeam/deepsynth-bench`]({DATASET_URL}). "
+                    "Dev-set gold answers, decompositions, and intermediate-answer JSON schemas "
+                    "are shipped alongside the questions. Test-set release is gated — downloading "
+                    "requires agreeing to the evaluation protocol."
                 )
             # -------------------------------------------------------------
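
Review note: the Metrics text above describes token-level F1 / Precision / Recall and a strict structured-equality Exact Match, but the scoring code itself is not part of this diff. The following is a minimal sketch of what those two metrics could look like, not the leaderboard's actual scorer. The function names `token_prf` and `exact_match`, the lowercased whitespace tokenization, and the assumption that answers arrive as JSON strings are all hypothetical.

```python
import json
from collections import Counter


def token_prf(pred: str, gold: str) -> tuple[float, float, float]:
    """Token-level (precision, recall, F1) between two answer strings.

    Assumption: tokens are lowercased whitespace-separated words; the real
    scorer may normalize differently.
    """
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a perfect match; only one empty counts as zero.
        score = float(pred_tokens == gold_tokens)
        return score, score, score
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(pred_json: str, gold_json: str) -> bool:
    """Strict structured equality: parse both answers as JSON and compare.

    Assumption: answers are JSON strings; unparseable input is a miss.
    """
    try:
        return json.loads(pred_json) == json.loads(gold_json)
    except json.JSONDecodeError:
        return False
```

Comparing parsed JSON rather than raw strings means key order and whitespace do not affect Exact Match, which is one plausible reading of "strict structured-equality check". Averaging `token_prf` over all tasks would then give the leaderboard-style aggregate.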