debjitpaul committed on
Commit 0466393 · 1 Parent(s): 001af0b

Updated citation font with Markdown for app.py and example submission

Files changed (1)
  1. app.py +44 -6
app.py CHANGED
@@ -418,18 +418,56 @@ def build_app() -> gr.Blocks:
         # -------------------------------------------------------------
         with gr.Tab("📖 About"):
             gr.Markdown(ABOUT_BLURB)
+            gr.Markdown(
+                "## The task\n"
+                "Each DeepSynth task presents a complex, real-world question that cannot "
+                "be answered by a single web search or a single document lookup. Producing "
+                "the correct answer requires an agent to **decompose** the question into "
+                "sub-problems, **gather** evidence from multiple heterogeneous sources "
+                "(news articles, government statistics, scientific publications, specialized "
+                "databases), **synthesize** findings into a coherent intermediate state, and "
+                "**return a structured answer** (typically a JSON object of key-value pairs, "
+                "a ranked list, or a numerical aggregate).\n\n"
+                "Tasks span **7 domains** — science, geography, economics, history, culture, "
+                "politics, and technology — and reference entities across **67 countries**. "
+                "Expert curators verified that every question has a well-defined answer "
+                "recoverable from public sources at the time of release, and that answering "
+                "it requires combining evidence from at least three distinct sources."
+            )
+
+            gr.Markdown(
+                "## Splits\n"
+                "DeepSynth ships as **120 expert-curated tasks** divided into two splits:\n\n"
+                "- **Dev set — 40 tasks (public, with gold answers).** Each dev task includes "
+                "the question, the gold answer, a full **decomposition** into sub-problems, "
+                "and the **intermediate answers** expected at each step. Use this split for "
+                "prototyping, debugging, and agent development — you can score yourself "
+                "locally and inspect where your agent's reasoning diverges from the expected "
+                "trajectory.\n"
+                "- **Test set — 80 tasks (questions only).** Gold answers and decompositions "
+                "are held private to prevent contamination and enable clean evaluation. "
+                "Submit your predictions via the leaderboard and we score them against the "
+                "held-out answers."
+            )
+
             gr.Markdown(
                 "## Metrics\n"
-                "- **F1 / Precision / Recall** — token-level overlap between predicted "
-                "and gold answers, averaged over all tasks.\n"
+                "- **F1 / Precision / Recall** — token-level overlap between predicted and "
+                "gold answers, averaged over all tasks.\n"
                 "- **Exact Match (EM)** — fraction of tasks where the predicted answer "
-                "exactly equals the gold answer (strict).\n"
+                "exactly equals the gold answer (strict structured-equality check).\n"
                 "- **LLM Judge** — semantic-equivalence scoring with small numerical "
-                "tolerance (1–5.5%), evaluated by a strong frozen judge model.\n\n"
+                "tolerance (1–5.5%), evaluated by a strong frozen judge model. Captures "
+                "cases where the answer is substantively correct but phrased or formatted "
+                "differently from the gold."
+            )
+
+            gr.Markdown(
                 "## Dataset\n"
                 f"DeepSynth is hosted on 🤗 [`DeepSynthesisTeam/deepsynth-bench`]({DATASET_URL}). "
-                "The dev set (40 tasks) ships with gold answers for prototyping; the test "
-                "set (120 tasks) is released questions-only to prevent contamination."
+                "Dev-set gold answers, decompositions, and intermediate-answer JSON schemas "
+                "are shipped alongside the questions. Test-set release is gated — downloading "
+                "requires agreeing to the evaluation protocol."
             )
 
         # -------------------------------------------------------------
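
The Metrics text in this diff describes token-level F1 / precision / recall, a strict structured-equality exact match, and a small numerical tolerance. The sketch below illustrates what such checks could look like when scoring dev-set predictions locally. It is not the leaderboard's scorer: the function names (`tokenize`, `token_prf`, `exact_match`, `within_tolerance`), the regex tokenization, and the 5.5% default tolerance are all assumptions made for illustration, and the real LLM Judge is a model, not a numeric rule.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (an assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def token_prf(pred: str, gold: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F1 between two answer strings."""
    pred_toks, gold_toks = tokenize(pred), tokenize(gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(pred: dict, gold: dict) -> bool:
    """Strict structured equality: identical keys and identical values."""
    return pred == gold


def within_tolerance(pred: float, gold: float, rel_tol: float = 0.055) -> bool:
    """Relative-error check against the gold value (hypothetical stand-in
    for the judge's numerical tolerance; 0.055 is an assumed default)."""
    if gold == 0:
        return pred == 0
    return abs(pred - gold) <= rel_tol * abs(gold)


if __name__ == "__main__":
    print(token_prf("France 3 Spain 5", "France 3 Spain 4"))
    print(exact_match({"France": 3}, {"France": 3}))
    print(within_tolerance(103.0, 100.0))
```

Averaging `token_prf` and `exact_match` over all tasks would give split-level numbers in the spirit of the description above; how the actual scorer serializes structured answers before tokenizing is not stated in this commit.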