Creating Benchmark Task Documents

This document defines the policy and template for task-level benchmark documentation under docs/benchmark_tasks/.

Purpose

Each task page should let a reader understand what the retrieval task measures before looking at leaderboard scores. A page should explain the source benchmark, the concrete query and document shapes, the domain, the language, the BM25 baseline behavior, representative examples from the actual Nano tables, and the kind of training data likely to improve the task without leaking evaluation answers.

The pages are public GitHub Markdown. Do not include local paper paths, local Obsidian links, local filesystem paths, private notes, or machine-specific URLs.

Output Location

Write task documents under:

docs/benchmark_tasks/{Nano-set name}/{task name}.md

For collection-level samples where the backing Nano dataset is different from the collection name, include the backing dataset in the file name, for example:

docs/benchmark_tasks/MNanoBEIR/NanoBEIR-ja__NanoMSMARCO.md
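The path rule above can be sketched as a small helper. This is a minimal illustration, not a repository script: `task_doc_path` is a hypothetical name, and the `{backing dataset}__{task name}` ordering is inferred from the single example above.

```python
from pathlib import PurePosixPath
from typing import Optional


def task_doc_path(nano_set: str, task_name: str, backing_dataset: Optional[str] = None) -> str:
    """Return a repository-relative path for a task document (hypothetical helper)."""
    file_stem = task_name
    # Assumption: when the backing dataset differs from the collection name,
    # it is prefixed to the task name with a double underscore.
    if backing_dataset and backing_dataset != nano_set:
        file_stem = f"{backing_dataset}__{task_name}"
    return str(PurePosixPath("docs/benchmark_tasks") / nano_set / f"{file_stem}.md")
```

For example, `task_doc_path("MNanoBEIR", "NanoMSMARCO", "NanoBEIR-ja")` reproduces the collection-level path shown above.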

Source Policy

Prefer task-level source metadata from config/datasets/*.yaml and config/dataset_collections/*.yaml. If task-level metadata is absent, fall back to dataset-level metadata, then to the Hugging Face dataset README, then to upstream benchmark metadata.

For papers, check whether the original paper has an arXiv page. Use the arXiv URL as the first source URL when it exists, even if an ACL Anthology, DOI, publisher, OpenReview, or project page also exists. Still include the DOI or official proceedings URL as secondary metadata when it is useful for citation accuracy. When a paper exists, read the paper PDF or HTML, not only the abstract, and use the paper's dataset construction, related work, limitations, retrieval source, annotation policy, train/dev/test split, and baseline analysis to improve the task explanation.

Source priority:

arXiv page for the original task or benchmark paper.
Official proceedings, DOI, OpenReview, publisher, or ACL Anthology page when no arXiv page exists, or as a secondary URL.
Official dataset card, project page, GitHub repository, or Hugging Face dataset.
Upstream benchmark source metadata.
Blog posts only when they are the canonical source and no stronger source is available.
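As a rough sketch, the priority list above can be treated as a sort order over candidate sources. The `kind` labels and the `order_sources` helper are hypothetical names chosen for illustration; they are not fields of the repository metadata.

```python
# Assumption: each candidate source is a dict with a hypothetical "kind" label.
SOURCE_PRIORITY = [
    "arxiv",              # arXiv page for the original task or benchmark paper
    "proceedings",        # official proceedings, DOI, OpenReview, publisher, ACL Anthology
    "dataset_card",       # dataset card, project page, GitHub repository, Hugging Face dataset
    "upstream_metadata",  # upstream benchmark source metadata
    "blog",               # blog post, only when it is the canonical source
]


def order_sources(sources):
    """Sort sources so the highest-priority kind comes first (stable within a kind)."""
    rank = {kind: i for i, kind in enumerate(SOURCE_PRIORITY)}
    return sorted(sources, key=lambda s: rank.get(s["kind"], len(SOURCE_PRIORITY)))
```

The first element of the result is the candidate for the first source URL; lower-priority entries remain available as secondary URLs.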

For benchmark collections such as BEIR, MTEB, MIRACL, BIRCO, CodeRAG-Bench, or domain-specific benchmark suites, distinguish three paper levels:

task_paper: a paper primarily introducing the exact source task or dataset.
benchmark_paper: a paper introducing the benchmark that includes this task and discusses its construction, split policy, table statistics, task category, evaluation setting, or limitations.
related_paper: a paper about the general task family that does not define the evaluated dataset.

If no standalone task paper is found but a benchmark paper includes a section, appendix entry, dataset table, or construction note for the task, treat that benchmark paper as a source that must be read and reflected in the Details section. Do not write "no paper was confirmed" in a way that implies no paper was used; instead say that no standalone task paper was confirmed and cite the benchmark paper for the available construction details. For example, Quora in BEIR should use the BEIR paper's duplicate-question retrieval category, Quora statistics, split construction, and overlap-removal notes, even though the Quora Question Pairs record is the dataset source.

Do not invent a source paper just to make a task look citable. If only a dataset card or project page is known, list it as a source URL and make that limitation visible. In that case, explicitly say that no task or benchmark paper was confirmed and that the interpretation is based on the official dataset card, Hugging Face dataset, project page, technical article, and observed sample data.

When a paper is used in prose, cite it explicitly in the sentence that relies on it, for example: [MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages](https://arxiv.org/abs/2210.09984) states that .... Do not hide paper usage only in the source list.

When using a benchmark paper, inspect more than the abstract. Look for the task overview figure, dataset statistics table, appendix entry for the task, split construction, overlap or leakage mitigation, evaluation metric discussion, baseline discussion, and limitations. Mention whichever of those items materially changes the interpretation of the Nano task.

OB Wiki Search can be used as background research, but local wiki or paper-note paths must not be written into generated Markdown or into the machine-readable metadata block.

Structured Fields

Place structured reference material after ## Example Data and before ## Machine-Readable Metadata. This keeps the reader-facing flow focused on the task first: overview, interpretation, then examples. The information table should include at least:

Nano set name.
Backing dataset name.
Backing Hugging Face dataset id.
Task or split name.
Language and category.
Query count, document count, and positive qrels row count.
BM25 nDCG@10 computed from the dataset's bm25 candidate column and positive qrels.
BM25 hit@10 when available.
Average query length and document length in characters.

Include positives-per-query rows in the visible information table only when they add signal: for example, when the average is not exactly 1.00, when the minimum/maximum differ from 1, or when any query has multiple positives. If every query has exactly one positive qrel, omit the visible distribution rows. Still keep the full positives-per-query statistics in the final YAML metadata block so index builders and audits can use them.
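The positives-per-query statistics and the visibility rule can be sketched as follows. This assumes qrels are available as `(query_id, doc_id)` pairs containing only positive judgments, and that the percent field is on a 0 to 100 scale; both function names are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query(positive_qrels):
    """Summarize positives per query from (query_id, doc_id) positive pairs."""
    counts = Counter(query_id for query_id, _ in positive_qrels)
    values = list(counts.values())
    multi = sum(1 for v in values if v > 1)
    return {
        "average": mean(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "multi_positive_queries": multi,
        # Assumption: percent is expressed on a 0-100 scale.
        "multi_positive_query_percent": 100.0 * multi / len(values),
    }


def show_distribution_rows(stats):
    """Show the visible rows only when some query departs from exactly one positive."""
    return stats["min"] != 1 or stats["max"] != 1
```

The full dictionary always goes into the YAML metadata block; `show_distribution_rows` only decides whether the visible table repeats it.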

Character averages are sufficient for the current version. Prefer repository-maintained query_text_stats and document_text_stats when present. If they are missing, compute them from the Nano queries and corpus tables.
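For reference, the BM25 and character-length rows can be computed along these lines. This is a sketch under stated assumptions: the bm25 candidate column is taken to be a ranked list of document IDs per query without duplicates, relevance is binary, and the function names are hypothetical rather than the repository's own.

```python
import math
from statistics import mean


def ndcg_at_k(ranked_doc_ids, positive_ids, k=10):
    """Binary-relevance nDCG@k for one query."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 is rank 1, so the discount is log2(rank + 1)
        for position, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in positive_ids
    )
    ideal_hits = min(len(positive_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def hit_at_k(ranked_doc_ids, positive_ids, k=10):
    """1.0 when any positive appears in the top k, otherwise 0.0."""
    return 1.0 if any(doc_id in positive_ids for doc_id in ranked_doc_ids[:k]) else 0.0


def mean_chars(texts):
    """Average length in characters, used when repository text stats are missing."""
    return mean(len(text) for text in texts)
```

The task-level figures are the means of the per-query values over all queries with at least one positive qrel.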

Required Page Structure

Use this structure unless a task needs a clearly better variant:

# {Nano set} / {task} title.
GitHub note, placed immediately after the title, warning that the page was generated by an LLM from source papers, dataset cards, repository metadata, and sampled data, and that it may contain mistakes. Keep it simple and reader-facing:
```
> [!NOTE]
> This page was generated by an LLM using source papers, dataset cards,
> repository metadata, and sampled benchmark data. It may contain mistakes;
> please treat it as a reference aid rather than a definitive source.
```
## Overview: a paper-centered summary of what the benchmark task is. Start from the source paper when one exists: what retrieval problem the paper introduced, how the source data is framed, and what the concrete task asks a model to retrieve. If no source paper is available, summarize the benchmark task itself from the dataset card, official project page, and sampled data. The Overview should be task-specific prose, not a reusable sentence pattern such as "{Task} evaluates ... Queries are ...". Mention Nano packaging only when it changes how the source task is interpreted.
## Details: longer interpretive prose about the original task/data, source-paper findings, observed Nano data tendencies, BM25 difficulty, and why the benchmark differs from adjacent benchmarks.
## Example Data: random query-positive examples from the actual Nano split.
## Dataset Information: a Markdown table for structured facts.
### Public Sources: source papers, official pages, and dataset records.
### Hugging Face Links: the Nano dataset and source Hugging Face datasets when known.
### Source Reference Table: structured source title, year, type, URL.
## Machine-Readable Metadata: final YAML block for index generation.

Example Policy

Show five query-positive examples when possible. Select five queries by deterministic random sampling, not by taking the head of the query table. For each sampled query, use a positive qrel with matching query and corpus records. Use the repository script so regenerated pages stay stable:

uv run python scripts/extract_benchmark_task_examples.py hakari-bench/NanoMMTEB-v2 argu_ana

For bulk refreshes, replace only the ## Example Data sections with:

uv run python scripts/extract_benchmark_task_examples.py --update-docs docs/benchmark_tasks
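Deterministic random sampling can be obtained by seeding a private random generator from stable task identifiers. The sketch below illustrates the idea only; it is not a description of what `scripts/extract_benchmark_task_examples.py` actually does, and the seed derivation is an assumption.

```python
import hashlib
import random


def sample_query_ids(query_ids, dataset_id, task_name, n=5):
    """Pick up to n query IDs reproducibly, independent of input order."""
    # Assumption: the seed is derived from the dataset id and task name.
    seed_material = f"{dataset_id}/{task_name}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(seed_material).digest()[:8], "big")
    rng = random.Random(seed)
    ordered = sorted(query_ids)  # sort first so the table's row order cannot change the sample
    return rng.sample(ordered, min(n, len(ordered)))
```

Because the seed depends only on the identifiers, regenerating a page yields the same sample as long as the query table itself is unchanged.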

Use a Markdown table with exactly two columns by default: Query and Positive document. The visible table should focus on the actual query and positive document text. Omit query/doc IDs, BM25 ranks, and extra count columns unless a task specifically needs them. Append full character counts inline. Truncate long content to the configured visible character limit and show the full pre-truncation length with the compact marker [truncated 225 chars](1258 chars).

| Query | Positive document |
| --- | --- |
| What is ...? (12 chars) | The answer-bearing passage ... [truncated 225 chars](1800 chars) |
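A table cell in this format could be produced roughly as follows. The sketch assumes that the first number in the marker is the visible character limit and the second is the full pre-truncation length; `format_cell` is a hypothetical helper, and the default limit of 225 is taken from the example marker rather than from a known configuration value.

```python
def format_cell(text: str, visible_limit: int = 225) -> str:
    """Format one query or document cell with an inline character count."""
    full_length = len(text)  # count characters before any truncation
    shown = text[:visible_limit]
    # Collapse newlines and escape pipes so the text stays inside one table cell.
    shown = " ".join(shown.split()).replace("|", "\\|")
    if full_length > visible_limit:
        return f"{shown} ... [truncated {visible_limit} chars]({full_length} chars)"
    return f"{shown} ({full_length} chars)"
```

Short text keeps only the `(N chars)` suffix, so the reader can always see the full length whether or not truncation happened.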

For extremely long-context, legal, patent, medical, code, or documentation tasks, use a vertical sample-block format only when the table would be unreadable on GitHub:

### Sample 1

| Field | Value |
| --- | --- |
| Query ID | `q1` |
| Positive Doc ID | `d1` |

**Query**

> Truncated query text ... [truncated from 2400 chars]

**Positive document**

> Truncated positive document text ... [truncated from 18000 chars]

Do not summarize samples. Show the actual query and positive document text from the Nano tables. Long query or document text must be truncated to a readable length, with the original character count visible: use [truncated 225 chars](1258 chars) in the default table format, or [truncated from 20442 chars] in the vertical sample-block format. Even when text is truncated, the reader should be able to tell the full query and positive-document character counts.

Interpretation Policy

The Details section should explain the data itself. Do not spend the section explaining that this is a Nano subset or that the Nano format has query, corpus, qrels, and BM25 tables. Mention Nano sampling only when the observed sampled data changes how readers should interpret the task.

Use these subheadings inside ## Details:

### What the Original Data Measures

### Observed Data Profile

### BM25 Difficulty

### Training Data That May Help

### Synthetic Data Guidance

Discuss:

what the task asks the model to retrieve,
what the original paper or official dataset source says the dataset was built to evaluate,
what the source paper says about dataset construction, annotation, split design, related benchmarks, baseline behavior, limitations, or intended use,
what the actual Nano data looks like: query style, document genre, document length, positives per query, language, and domain,
whether lexical matching is likely to be strong,
whether the task is multilingual, domain-specific, code-oriented, long-document-oriented, or fact/evidence-oriented,
whether qrels are mostly single-positive or multi-positive,
how BM25 nDCG@10 and hit@10 should be read for this task,
what existing non-evaluation training data may help,
what synthetic source documents and synthetic questions would be useful.

Avoid generic filler. Each final task page should include at least one task-specific paragraph grounded in the original paper, dataset card, or benchmark source.

If a source paper exists, cite it in prose. Good detail text should read like:

[CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497)
reports that the benchmark aggregates programming solutions, online tutorials,
library documentation, StackOverflow posts, and GitHub repositories as retrieval
sources. This matters for this task because ...

If no source paper is confirmed, say so plainly:

No source paper was confirmed for this task. The interpretation below is based
on the official Hugging Face dataset card, project metadata, and observed sample
queries and positives.

Training Data That May Help

This subsection should answer which existing datasets or supervised pairs could teach the domain without using evaluation answers.

Cover these cases:

If the original dataset provides a train split or official training data, state that it is the first source to inspect. Do not assume it may be used for a leaderboard unless the benchmark rules allow it.
Always check split provenance. If the Nano task is derived from an upstream dev or test split, state that data likely to overlap with the benchmark, such as that same upstream split, should preferably be excluded from training. Recommend upstream train splits or other source data that are unlikely to overlap with the evaluation task.
For public datasets, warn in practical terms that obvious overlap with the benchmark should be avoided. Detailed ID or text overlap audits are useful for production-quality training pipelines, but the reader-facing guidance should not make the first-pass task brief feel like an implementation checklist.
State that learning the evaluation queries, qrels, or positive passages can inflate benchmark scores. For retrieval tasks, memorizing the answer passage is not the same as learning retrieval behavior.
For code tasks, recommend source-aligned data such as documentation retrieval, DocPrompting-style NL-intent-to-doc pairs, StackOverflow QA, tutorials, migration guides, docstrings, issue-to-fix pairs, and API examples.
For multilingual tasks, recommend native-language supervised pairs and same-language corpora rather than translated English-only pairs.
Keep this subsection concise and technical. It should list the data types that help, not re-explain the entire benchmark.
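For pipeline authors rather than page readers, a first-pass overlap filter might look like the sketch below. It only catches exact matches after whitespace and case normalization; near-duplicates, ID overlap, and title overlap would need a separate audit. The function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences do not hide overlap."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_overlapping_pairs(train_pairs, eval_queries, eval_positives):
    """Split (query, passage) training pairs into kept and dropped lists."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_passages = {normalize(p) for p in eval_positives}
    kept, dropped = [], []
    for query, passage in train_pairs:
        overlaps = (
            normalize(query) in blocked_queries
            or normalize(passage) in blocked_passages
        )
        (dropped if overlaps else kept).append((query, passage))
    return kept, dropped
```

A non-empty `dropped` list is a signal that the training source shares material with the evaluation split and deserves a closer look.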

Synthetic Data Guidance

This subsection should explain what synthetic documents and questions to create. It should be separate from Training Data That May Help.

Cover these cases:

If synthetic data is recommended, specify the document genre, document contents, question style, question intent, and how the generated question should be answerable from the generated or selected document.
Distinguish document-to-question generation from joint document-and-question generation. Document-to-question generation should use non-evaluation source documents. Joint generation should create both realistic source-style documents and questions with explicit answer grounding.
Do not use evaluation split queries or positive passages as seeds for synthetic generation. For example, if a Nano task is derived from MIRACL dev, use MIRACL train or non-overlapping Wikipedia passages, not MIRACL dev/test positives.
For multilingual tasks, prefer native-language synthetic queries and documents over translated English-only data.
For code tasks, synthetic data should preserve executable/API semantics, identifiers, version constraints, stack traces, and realistic developer tasks.
For legal, patent, medical, finance, or scientific tasks, synthetic documents should use realistic domain structure, terminology, citations, measurements, entities, and evidential wording.
For multi-positive tasks, train with multi-positive objectives or listwise / distillation signals rather than reducing the task to one positive per query.

Machine-Readable Metadata

Each task page must end with a fenced YAML block. This block is for future index page generation and should be easy to parse without reading prose.

Use this marker immediately before the YAML block:

<!-- benchmark-task-metadata:v1 -->

The block must be the final content in the Markdown file:

## Machine-Readable Metadata

<!-- benchmark-task-metadata:v1 -->

```yaml
benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoMIRACL
  backing_dataset: NanoMIRACL
  dataset_id: hakari-bench/NanoMIRACL
  task_name: ja
  split_name: ja
  language: ja
  category: natural_language
  document_path: docs/benchmark_tasks/NanoMIRACL/ja.md
  source_research:
    primary_source_type: task_paper
    paper_pdf_or_html_checked: true
    no_paper_note: null
  counts:
    queries: 200
    documents: 1846
    positive_qrels: 200
  positives_per_query:
    average: 1.0
    min: 1
    median: 1.0
    max: 1
    multi_positive_queries: 0
    multi_positive_query_percent: 0.0
  text_stats_chars:
    query_mean: 17.47
    document_mean: 297.912784
  bm25:
    ndcg_at_10: 0.5956231823
    hit_at_10: 0.94
    source: dataset_bm25_column
  learning:
    original_train_split: unknown
    evaluation_split_origin: unknown
    train_eval_overlap_audit: not_audited
    leakage_note: do not train on upstream dev/test queries, qrels, or positive passages
    useful_training_data:
      - official non-overlapping train split
      - native-language question-to-passage retrieval pairs
      - non-overlapping source-corpus passage QA pairs
    synthetic_data:
      document_generation: native-language answer-bearing passages from the source collection style
      question_generation: native-language information needs answerable from those passages
      answerability: questions should be grounded in explicit facts, entities, or relations in the document
    multi_positive_training: single_positive_question_document_focus
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoMIRACL
    source_urls:
      - label: MIRACL unified source dataset
        url: https://huggingface.co/datasets/hotchpotch/miracl-hf-unified
    source_notes: []
  references:
    - title: "MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages"
      url: https://arxiv.org/abs/2210.09984
      year: 2023
      doi: 10.1162/tacl_a_00595
      is_paper: true
      source_confidence: definitive_paper_link
```
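A checker for the "marker, then final YAML block" rule can be sketched with the standard library alone. The regular expression and function name are illustrative assumptions; parsing the extracted text into a mapping would additionally need a YAML library such as PyYAML (`yaml.safe_load`), which is not used here.

```python
import re

MARKER = "<!-- benchmark-task-metadata:v1 -->"
FENCE = "`" * 3  # built programmatically so this example does not contain a literal fence

# The block must follow the marker and be the final content of the file.
BLOCK_RE = re.compile(
    re.escape(MARKER) + r"\s*" + FENCE + r"yaml\n(.*?)\n" + FENCE + r"\s*\Z",
    re.DOTALL,
)


def extract_metadata_yaml(markdown: str):
    """Return the YAML text of the final metadata block, or None when the rule is violated."""
    match = BLOCK_RE.search(markdown)
    return match.group(1) if match else None
```

Returning `None` for pages with trailing content after the block makes the "final content" requirement testable in a batch check.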

Metadata field guidance:

schema_version: increment only for incompatible schema changes.
document_status: use first_pass, reviewed, or needs_review.
document_path: repository-relative path only.
source_research.primary_source_type: use task_paper, benchmark_paper, dataset_card, project_page, technical_article, or sample_inference. Use benchmark_paper when the strongest paper source is the benchmark paper that includes the task, even if no standalone task paper was confirmed.
source_research.paper_pdf_or_html_checked: boolean. Set true only when the paper PDF or HTML was inspected beyond title/abstract metadata.
source_research.no_paper_note: short public note when no paper was confirmed, otherwise null.
bm25.source: must be dataset_bm25_column unless the task explicitly uses a different source.
learning.original_train_split: use available, not_found, or unknown. Leave this as unknown unless the original source split was explicitly audited.
learning.evaluation_split_origin: record the upstream split if known, such as train, dev, test, validation, or unknown.
learning.train_eval_overlap_audit: use passed, failed, not_applicable, or not_audited. Use not_audited until query IDs, document IDs, source titles, and positive text overlap have been checked.
learning.leakage_note: short public warning about what not to train on.
learning.useful_training_data: concise machine-readable list of existing data types that may teach the domain without using evaluation answers.
learning.synthetic_data: concise machine-readable hints for what synthetic documents and questions to generate. These are for index/filter pages and should mirror the prose in ### Synthetic Data Guidance.
learning.multi_positive_training: use multi_positive_objective when the qrels contain multiple positives for a meaningful share of queries, otherwise single_positive_question_document_focus.
references[].url: use the arXiv URL first when one exists.
links.source_urls: structured {label, url} objects with public URLs only. These may include Hugging Face datasets, project pages, or source repositories.
links.source_notes: optional non-URL source notes from README metadata.
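The enumerated values in the guidance above lend themselves to a small validation pass over an already parsed metadata mapping. This is a sketch, not a repository validator: `validate_metadata` is a hypothetical name, and only the fields with a closed value set are checked.

```python
ALLOWED_VALUES = {
    ("document_status",): {"first_pass", "reviewed", "needs_review"},
    ("source_research", "primary_source_type"): {
        "task_paper", "benchmark_paper", "dataset_card",
        "project_page", "technical_article", "sample_inference",
    },
    ("learning", "original_train_split"): {"available", "not_found", "unknown"},
    ("learning", "train_eval_overlap_audit"): {
        "passed", "failed", "not_applicable", "not_audited",
    },
    ("learning", "multi_positive_training"): {
        "multi_positive_objective", "single_positive_question_document_focus",
    },
}


def validate_metadata(meta: dict) -> list:
    """Return human-readable errors for fields whose value is outside the allowed set."""
    errors = []
    for path, allowed in ALLOWED_VALUES.items():
        value = meta
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        if value not in allowed:
            errors.append(f"{'.'.join(path)}: {value!r} is not one of {sorted(allowed)}")
    checked = meta.get("source_research", {}).get("paper_pdf_or_html_checked")
    if not isinstance(checked, bool):
        errors.append("source_research.paper_pdf_or_html_checked: must be a boolean")
    return errors
```

The input is the mapping under `benchmark_task_metadata`; an empty list means every enumerated field used a documented value.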

Document Template

Use this template for new pages:

# {Nano set} / {task name}

> [!NOTE]
> This page was generated by an LLM using source papers, dataset cards,
> repository metadata, and sampled benchmark data. It may contain mistakes;
> please treat it as a reference aid rather than a definitive source.

## Overview

{About 500 English characters summarizing what this benchmark task is. When a
paper exists, ground the overview in the paper: what problem the paper introduced,
how the data was constructed or adapted, what the query and document sides
represent, and what retrieval behavior is being tested. When no paper exists,
summarize the task from the dataset card, official page, repository metadata, and
sampled data. Avoid a fill-in-the-blank pattern such as "`{Task}` evaluates ...
Queries are ..."; the paragraph should contain details that would not fit most
other tasks in the same group.}

## Details

### What the Original Data Measures

{Explain the original data or benchmark from the source paper, official dataset
card, or project page. Focus on what retrieval behavior is being tested, not on
Nano packaging. If a paper was used, cite it directly in prose, e.g.
`[Paper Title](url) reports that ...`. If no paper was confirmed, state that the
interpretation is based on public dataset cards, project pages, and sample data.}

### Observed Data Profile

{Summarize the actual sampled task with useful interpretation, not only counts:
query and document styles, recurring intents, multi-positive clusters, visible
data quirks, and what those imply for retrieval.}

### BM25 Difficulty

{Explain BM25 nDCG@10 and hit@10 for this task. Include concrete patterns from
the Nano BM25 candidate ranking when useful, such as cases where lexical matching
finds the topic but misses intent equivalence.}

### Training Data That May Help

{List concise, technical existing-data recommendations. Mention the official
train split when available, but warn that upstream dev/test sets, or other data
likely to overlap with the benchmark, should preferably be excluded. Keep
detailed overlap-audit mechanics in metadata or implementation notes rather than
the main prose unless the task specifically needs them.}

### Synthetic Data Guidance

{Describe what synthetic source-style documents and questions to create. Separate
document-to-question generation from generating both documents and questions.
State that evaluation split queries and positive passages should not be used as
seeds.}

## Example Data

{Five deterministic random query-positive examples. Generate with
`scripts/extract_benchmark_task_examples.py`. Use a two-column Markdown table,
include full character counts inline, and visibly truncate long content with
`[truncated 225 chars](N chars)`.}

## Dataset Information

| Field | Value |
| --- | --- |
| Nano set | {Nano set} |
| Backing dataset | {Backing dataset} |
| Task / split | {Task or split} |
| Hugging Face dataset | [{dataset_id}](https://huggingface.co/datasets/{dataset_id}) |
| Language | {language} |
| Category | {natural_language or code} |
| Queries | {query_count} |
| Documents | {document_count} |
| Positive qrels | {qrel_count} |
| Avg positives / query | {avg_positives_per_query} {omit this row when all queries have exactly one positive} |
| Positives per query (min / median / max) | {min} / {median} / {max} {omit this row when all queries have exactly one positive} |
| Queries with multiple positives | {count} ({percent}%) {omit this row when all queries have exactly one positive} |
| BM25 nDCG@10 | {bm25_ndcg_at_10} |
| BM25 hit@10 | {bm25_hit_at_10} |
| Query length avg chars | {query_mean_chars} |
| Document length avg chars | {document_mean_chars} |

### Public Sources

- [{Primary paper title, preferably arXiv when available}]({primary_public_url}); {year}; {authors}; DOI: `{doi}`.
- [{Dataset card or project page}]({public_url}).

### Hugging Face Links

- Nano dataset: [{dataset_id}](https://huggingface.co/datasets/{dataset_id})
- Source dataset: [{source_dataset_id}](https://huggingface.co/datasets/{source_dataset_id})

### Source Reference Table

| Title | Year | Type | URL |
| --- | ---: | --- | --- |
| {title} | {year} | paper | {url} |

## Machine-Readable Metadata

<!-- benchmark-task-metadata:v1 -->

```yaml
benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: {Nano set}
  backing_dataset: {Backing dataset}
  dataset_id: {dataset_id}
  task_name: {task_name}
  split_name: {split_name}
  language: {language}
  category: {category}
  document_path: docs/benchmark_tasks/{Nano-set name}/{task name}.md
  source_research:
    primary_source_type: {task_paper|benchmark_paper|dataset_card|project_page|technical_article|sample_inference}
    paper_pdf_or_html_checked: {true|false}
    no_paper_note: {null or public note}
  counts:
    queries: {query_count}
    documents: {document_count}
    positive_qrels: {qrel_count}
  positives_per_query:
    average: {avg_positives_per_query}
    min: {min_positives}
    median: {median_positives}
    max: {max_positives}
    multi_positive_queries: {multi_positive_query_count}
    multi_positive_query_percent: {multi_positive_query_percent}
  text_stats_chars:
    query_mean: {query_mean_chars}
    document_mean: {document_mean_chars}
  bm25:
    ndcg_at_10: {bm25_ndcg_at_10}
    hit_at_10: {bm25_hit_at_10}
    source: dataset_bm25_column
  learning:
    original_train_split: unknown
    evaluation_split_origin: {train|dev|test|validation|unknown}
    train_eval_overlap_audit: not_audited
    leakage_note: {short leakage warning}
    useful_training_data:
      - {existing_training_data_type}
    synthetic_data:
      document_generation: {synthetic_document_generation}
      question_generation: {synthetic_question_generation}
      answerability: {synthetic_answerability}
    multi_positive_training: {multi_positive_training}
  links:
    nano_dataset: https://huggingface.co/datasets/{dataset_id}
    source_urls:
      - label: {source_label}
        url: {source_url}
    source_notes: []
  references:
    - title: {title}
      url: {public_url}
      year: {year}
      doi: {doi}
      is_paper: true
      source_confidence: {source_confidence}
```

Group Index Pages

Before scaling to all 500+ tasks, add group index pages such as:

docs/benchmark_tasks/{Nano-set name}/index.md

Build index pages from the machine-readable metadata blocks. Each group index should include:

task name and document link,
language and category,
query/document/qrels counts,
positives-per-query summary only when the task is not exactly one-positive per query,
BM25 nDCG@10 and hit@10,
average query/document character lengths,
source status and primary paper title,
document status.
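Building one index row from a parsed metadata mapping could look like the sketch below. It covers a subset of the fields listed above, assumes the index page sits in the same directory as the task pages, and uses hypothetical names (`INDEX_HEADER`, `index_row`).

```python
from pathlib import PurePosixPath

INDEX_HEADER = (
    "| Task | Language | Queries | Documents | Positive qrels | "
    "Positives / query | BM25 nDCG@10 | BM25 hit@10 | Status |"
)


def index_row(meta: dict) -> str:
    """Render one Markdown table row from a benchmark_task_metadata mapping."""
    counts, ppq, bm25 = meta["counts"], meta["positives_per_query"], meta["bm25"]
    single_positive = ppq["min"] == 1 and ppq["max"] == 1
    # Leave the summary cell empty when every query has exactly one positive.
    positives = "" if single_positive else f'{ppq["average"]:.2f} ({ppq["min"]}-{ppq["max"]})'
    # Assumption: index.md lives next to the task pages, so the file name is enough.
    link = f'[{meta["task_name"]}]({PurePosixPath(meta["document_path"]).name})'
    cells = [
        link, meta["language"], counts["queries"], counts["documents"],
        counts["positive_qrels"], positives,
        f'{bm25["ndcg_at_10"]:.4f}', f'{bm25["hit_at_10"]:.2f}',
        meta["document_status"],
    ]
    return "| " + " | ".join(str(cell) for cell in cells) + " |"
```

Rounding is applied only in the rendered row; the metadata block keeps the full-precision values.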

Maintenance Checklist

Before publishing a batch:

Confirm every generated page has a Nano dataset link and at least one public source or a visible note that source metadata is missing.
Confirm arXiv was checked for every paper source and is used as the first URL when available.
Confirm that every cited paper was checked beyond the abstract when possible: use the PDF or HTML to inspect dataset construction, source data, splits, related work, baselines, limitations, and task-specific discussion.
If no source paper is confirmed, confirm the page says so and explains that the interpretation is based on official dataset cards, project pages, technical articles, Hugging Face metadata, and observed samples.
Confirm train/dev/test provenance. If the Nano task comes from an upstream dev/test split, the page must warn against training on that split and should recommend only non-overlapping train or source-corpus data.
Confirm BM25 nDCG@10 was computed from the Nano bm25 table, not from a fresh local BM25 run.
Confirm positives-per-query statistics were computed from qrels.
Confirm exactly five random examples come from the selected Nano split when at least five qrel pairs are available.
Confirm sample data shows actual query and positive document text, not a summary, and that long samples are visibly truncated with original character count.
Confirm the final YAML metadata block parses successfully.
Confirm no generated benchmark outputs, caches, local paper paths, local wiki paths, or private scratch artifacts are committed.
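Part of the last check can be automated with a heuristic scan for local references in generated Markdown. The patterns below are illustrative assumptions, not an exhaustive or official list, and a clean result does not replace a manual review.

```python
import re

# Heuristic patterns for machine-specific references (assumed, not exhaustive).
LOCAL_PATTERNS = [
    re.compile(r"file://"),
    re.compile(r"obsidian://"),
    re.compile(r"(?<![\w.])/(?:home|Users)/[^\s)]+"),  # absolute home-directory paths
    re.compile(r"\b[A-Za-z]:\\"),                      # Windows drive paths
    re.compile(r"https?://(?:localhost|127\.0\.0\.1)"),
]


def find_local_references(markdown: str):
    """Return (line_number, line) pairs that look like local or private references."""
    hits = []
    for line_number, line in enumerate(markdown.splitlines(), start=1):
        if any(pattern.search(line) for pattern in LOCAL_PATTERNS):
            hits.append((line_number, line.strip()))
    return hits
```

Running this over `docs/benchmark_tasks` before a batch is published gives a quick list of lines to inspect by hand.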

Final Prose Quality Review

After generating a task page, do a final pass specifically for writing quality and usefulness. The page should not feel like a statistics dump. It should help a reader understand what kind of retrieval behavior the task rewards, why the task is difficult, and what data would plausibly teach the domain.

Use this checklist before considering a generated task page ready:

The overview explains the original benchmark or source dataset first, then explains the concrete Nano task. It describes the task itself, not the Markdown file or the Nano packaging.
The source discussion cites the paper, benchmark paper, dataset card, or project page in the sentences that depend on that source. When no paper was confirmed, the page says so plainly and does not pretend that a paper-backed interpretation exists.
The details section includes at least one paragraph grounded in the source paper or official dataset card: dataset construction, annotation workflow, source corpus, split design, benchmark purpose, or known limitations.
The observed data profile goes beyond counts. It names visible query types, document genres, recurring domains, language-specific issues, entity or terminology patterns, multi-positive clusters when present, and any quirks that affect retrieval.
The BM25 section interprets the score. It should explain what BM25 is doing well, what it fails to distinguish, and include concrete patterns from the dataset-provided BM25 ranking when those patterns are informative.
The task-specific difficulty is explicit. For example, debate tasks should discuss stance and counterargument matching; duplicate-question tasks should discuss intent equivalence and paraphrase clusters; Wikipedia QA retrieval should discuss short fact queries and passage evidence; public-health FAQ retrieval should discuss procedural guidance and action-specific matching.
The training-data section is concise and technical. It recommends existing data types that teach the domain without using likely evaluation answers, and it includes a practical overlap warning for public train/dev/test data.
The synthetic-data section focuses on what documents and questions to generate. It should specify document genre, question style, answerability, and domain details. Do not spend this section on hard negatives.
The examples are actual query-positive text from the Nano split. They should be readable on GitHub, include full character counts, and show truncation clearly when content is shortened.
The page avoids generic filler. Replace broad statements such as "this task requires semantic understanding" with task-specific statements about the exact relation being retrieved.
The writing separates evidence from inference. If a claim comes from a paper, say so with a link. If it comes from inspecting the sampled Nano data, make that clear through wording such as "the sampled data shows" or "the observed BM25 ranking suggests".
The final page has a coherent reader flow: warning note, overview, details, samples, dataset information, public sources, and machine-readable metadata. A user should understand the task before reaching the tables and source lists.