debjitpaul committed on
Commit ·
0466393
1
Parent(s): 001af0b
Updated citation font with Markdown for app.py and example submission
app.py
CHANGED
|
@@ -418,18 +418,56 @@ def build_app() -> gr.Blocks:
|
|
| 418 |
# -------------------------------------------------------------
|
| 419 |
with gr.Tab("π About"):
|
| 420 |
gr.Markdown(ABOUT_BLURB)
|
| 421 |
gr.Markdown(
|
| 422 |
"## Metrics\n"
|
| 423 |
-
"- **F1 / Precision / Recall** – token-level overlap between predicted "
|
| 424 |
-
"
|
| 425 |
"- **Exact Match (EM)** – fraction of tasks where the predicted answer "
|
| 426 |
-
"exactly equals the gold answer (strict).\n"
|
| 427 |
"- **LLM Judge** – semantic-equivalence scoring with small numerical "
|
| 428 |
-
"tolerance (1–5.5%), evaluated by a strong frozen judge model.
|
| 429 |
"## Dataset\n"
|
| 430 |
f"DeepSynth is hosted on 🤗 [`DeepSynthesisTeam/deepsynth-bench`]({DATASET_URL}). "
|
| 431 |
-
"
|
| 432 |
-
"
|
| 433 |
)
|
| 434 |
|
| 435 |
# -------------------------------------------------------------
|
| 418 |
# -------------------------------------------------------------
|
| 419 |
with gr.Tab("π About"):
|
| 420 |
gr.Markdown(ABOUT_BLURB)
|
| 421 |
+
gr.Markdown(
|
| 422 |
+
"## The task\n"
|
| 423 |
+
"Each DeepSynth task presents a complex, real-world question that cannot "
|
| 424 |
+
"be answered by a single web search or a single document lookup. Producing "
|
| 425 |
+
"the correct answer requires an agent to **decompose** the question into "
|
| 426 |
+
"sub-problems, **gather** evidence from multiple heterogeneous sources "
|
| 427 |
+
"(news articles, government statistics, scientific publications, specialized "
|
| 428 |
+
"databases), **synthesize** findings into a coherent intermediate state, and "
|
| 429 |
+
"**return a structured answer** (typically a JSON object of key-value pairs, "
|
| 430 |
+
"a ranked list, or a numerical aggregate).\n\n"
|
| 431 |
+
"Tasks span **7 domains** – science, geography, economics, history, culture, "
|
| 432 |
+
"politics, and technology – and reference entities across **67 countries**. "
|
| 433 |
+
"Expert curators verified that every question has a well-defined answer "
|
| 434 |
+
"recoverable from public sources at the time of release, and that answering "
|
| 435 |
+
"it requires combining evidence from at least three distinct sources."
|
| 436 |
+
)
|
| 437 |
+
|
| 438 |
+
gr.Markdown(
|
| 439 |
+
"## Splits\n"
|
| 440 |
+
"DeepSynth ships as **120 expert-curated tasks** divided into two splits:\n\n"
|
| 441 |
+
"- **Dev set – 40 tasks (public, with gold answers).** Each dev task includes "
|
| 442 |
+
"the question, the gold answer, a full **decomposition** into sub-problems, "
|
| 443 |
+
"and the **intermediate answers** expected at each step. Use this split for "
|
| 444 |
+
"prototyping, debugging, and agent development – you can score yourself "
|
| 445 |
+
"locally and inspect where your agent's reasoning diverges from the expected "
|
| 446 |
+
"trajectory.\n"
|
| 447 |
+
"- **Test set – 80 tasks (questions only).** Gold answers and decompositions "
|
| 448 |
+
"are held private to prevent contamination and enable clean evaluation. "
|
| 449 |
+
"Submit your predictions via the leaderboard and we score them against the "
|
| 450 |
+
"held-out answers."
|
| 451 |
+
)
|
| 452 |
+
|
| 453 |
gr.Markdown(
|
| 454 |
"## Metrics\n"
|
| 455 |
+
"- **F1 / Precision / Recall** – token-level overlap between predicted and "
|
| 456 |
+
"gold answers, averaged over all tasks.\n"
|
| 457 |
"- **Exact Match (EM)** – fraction of tasks where the predicted answer "
|
| 458 |
+
"exactly equals the gold answer (strict structured-equality check).\n"
|
| 459 |
"- **LLM Judge** – semantic-equivalence scoring with small numerical "
|
| 460 |
+
"tolerance (1–5.5%), evaluated by a strong frozen judge model. Captures "
|
| 461 |
+
"cases where the answer is substantively correct but phrased or formatted "
|
| 462 |
+
"differently from the gold."
|
| 463 |
+
)
|
| 464 |
+
|
| 465 |
+
gr.Markdown(
|
| 466 |
"## Dataset\n"
|
| 467 |
f"DeepSynth is hosted on 🤗 [`DeepSynthesisTeam/deepsynth-bench`]({DATASET_URL}). "
|
| 468 |
+
"Dev-set gold answers, decompositions, and intermediate-answer JSON schemas "
|
| 469 |
+
"are shipped alongside the questions. Test-set release is gated – downloading "
|
| 470 |
+
"requires agreeing to the evaluation protocol."
|
| 471 |
)
|
| 472 |
|
| 473 |
# -------------------------------------------------------------
|
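
The Metrics text added in this commit describes token-level precision, recall, and F1, plus a strict structured-equality Exact Match. The sketch below shows one way those two metrics could be computed for a single task when scoring locally on the dev set. It is not the leaderboard's scorer: the tokenization rule, the JSON flattening, and the names `tokens`, `token_prf`, `exact_match`, and `within_tolerance` are all assumptions made for illustration. The tolerance helper only illustrates what a relative numerical tolerance means; in the benchmark that judgment is made by the LLM judge, and the 5.5% default simply takes the upper bound quoted in the text.

```python
import json
import re
from collections import Counter


def tokens(answer):
    """Flatten an answer into lowercase tokens (hypothetical tokenization)."""
    # Structured answers are serialized with sorted keys so that key order
    # does not affect the token multiset.
    text = answer if isinstance(answer, str) else json.dumps(answer, sort_keys=True)
    # Words and numbers; a decimal part stays attached to its number.
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())


def token_prf(pred, gold):
    """Return (precision, recall, f1) from token multiset overlap."""
    p, g = tokens(pred), tokens(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(pred, gold):
    """Strict structured equality: same keys, same values, same types of container."""
    return pred == gold


def within_tolerance(pred, gold, rel_tol=0.055):
    """Illustrative relative-tolerance check for numeric values (assumed rule)."""
    return abs(pred - gold) <= rel_tol * abs(gold)


if __name__ == "__main__":
    gold = {"France": 10, "Germany": 12}
    pred = {"France": 10, "Germany": 13}
    print(token_prf(pred, gold))
    print(exact_match(pred, gold))
    print(within_tolerance(12.5, 12))
```

Averaging the per-task F1 values over the dev set would give the "averaged over all tasks" figure the text mentions; Exact Match would be the fraction of tasks for which `exact_match` returns `True`.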