Spaces:
Running
Running
| # Creating Benchmark Task Documents | |
| This document defines the policy and template for task-level benchmark | |
| documentation under `docs/benchmark_tasks/`. | |
| ## Purpose | |
| Each task page should let a reader understand what the retrieval task measures | |
| before looking at leaderboard scores. A page should explain the source benchmark, | |
| the concrete query and document shapes, the domain, the language, the BM25 | |
| baseline behavior, representative examples from the actual Nano tables, and the | |
| kind of training data likely to improve the task without leaking evaluation | |
| answers. | |
| The pages are public GitHub Markdown. Do not include local paper paths, local | |
| Obsidian links, local filesystem paths, private notes, or machine-specific URLs. | |
| ## Output Location | |
| Write task documents under: | |
| ```text | |
| docs/benchmark_tasks/{Nano-set name}/{task name}.md | |
| ``` | |
| For collection-level samples where the backing Nano dataset is different from | |
| the collection name, include the backing dataset in the file name, for example: | |
| ```text | |
| docs/benchmark_tasks/MNanoBEIR/NanoBEIR-ja__NanoMSMARCO.md | |
| ``` | |
| ## Source Policy | |
| Prefer task-level source metadata from `config/datasets/*.yaml` and | |
| `config/dataset_collections/*.yaml`. If task-level metadata is absent, fall back | |
| to dataset-level metadata, then to the Hugging Face dataset README, then to | |
| upstream benchmark metadata. | |
| For papers, check whether the original paper has an arXiv page. Use the arXiv | |
| URL as the first source URL when it exists, even if an ACL Anthology, DOI, | |
| publisher, OpenReview, or project page also exists. Still include the DOI or | |
| official proceedings URL as secondary metadata when it is useful for citation | |
| accuracy. When a paper exists, read the paper PDF or HTML, not only the abstract, | |
| and use the paper's dataset construction, related work, limitations, retrieval | |
| source, annotation policy, train/dev/test split, and baseline analysis to improve | |
| the task explanation. | |
| Source priority: | |
| 1. arXiv page for the original task or benchmark paper. | |
| 2. Official proceedings, DOI, OpenReview, publisher, or ACL Anthology page when | |
| no arXiv page exists, or as a secondary URL. | |
| 3. Official dataset card, project page, GitHub repository, or Hugging Face | |
| dataset. | |
| 4. Upstream benchmark source metadata. | |
| 5. Blog posts only when they are the canonical source and no stronger source is | |
| available. | |
| For benchmark collections such as BEIR, MTEB, MIRACL, BIRCO, CodeRAG-Bench, or | |
| domain-specific benchmark suites, distinguish three paper levels: | |
| - `task_paper`: a paper primarily introducing the exact source task or dataset. | |
| - `benchmark_paper`: a paper introducing the benchmark that includes this task | |
| and discusses its construction, split policy, table statistics, task category, | |
| evaluation setting, or limitations. | |
| - `related_paper`: a paper about the general task family that does not define the | |
| evaluated dataset. | |
| If no standalone task paper is found but a benchmark paper includes a section, | |
| appendix entry, dataset table, or construction note for the task, treat that | |
| benchmark paper as a source that must be read and reflected in the Details | |
| section. Do not write "no paper was confirmed" in a way that implies no paper was | |
| used; instead say that no standalone task paper was confirmed and cite the | |
| benchmark paper for the available construction details. For example, Quora in | |
| BEIR should use the BEIR paper's duplicate-question retrieval category, Quora | |
| statistics, split construction, and overlap-removal notes, even though the Quora | |
| Question Pairs record is the dataset source. | |
| Do not invent a source paper just to make a task look citable. If only a dataset | |
| card or project page is known, list it as a source URL and make that limitation | |
| visible. In that case, explicitly say that no task or benchmark paper was | |
| confirmed and that the interpretation is based on the official dataset card, | |
| Hugging Face dataset, project page, technical article, and observed sample data. | |
| When a paper is used in prose, cite it explicitly in the sentence that relies on | |
| it, for example: `[MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse | |
| Languages](https://arxiv.org/abs/2210.09984) states that ...`. Do not hide paper | |
| usage only in the source list. | |
| When using a benchmark paper, inspect more than the abstract. Look for the task | |
| overview figure, dataset statistics table, appendix entry for the task, split | |
| construction, overlap or leakage mitigation, evaluation metric discussion, | |
| baseline discussion, and limitations. Mention whichever of those items materially | |
| changes the interpretation of the Nano task. | |
| OB Wiki Search can be used as background research, but local wiki or paper-note | |
| paths must not be written into generated Markdown or into the machine-readable | |
| metadata block. | |
| ## Structured Fields | |
| Place structured reference material after `## Example Data` and before | |
| `## Machine-Readable Metadata`. This keeps the reader-facing flow focused on the | |
| task first: overview, interpretation, then examples. The information table | |
| should include at least: | |
| - Nano set name. | |
| - Backing dataset name. | |
| - Backing Hugging Face dataset id. | |
| - Task or split name. | |
| - Language and category. | |
| - Query count, document count, and positive qrels row count. | |
| - BM25 nDCG@10 computed from the dataset's `bm25` candidate column and positive | |
| qrels. | |
| - BM25 hit@10 when available. | |
| - Average query length and document length in characters. | |
| Include positives-per-query rows in the visible information table only when they | |
| add signal: for example, when the average is not exactly `1.00`, when the | |
| minimum/maximum differ from `1`, or when any query has multiple positives. If | |
| every query has exactly one positive qrel, omit the visible distribution rows. | |
| Still keep the full positives-per-query statistics in the final YAML metadata | |
| block so index builders and audits can use them. | |
| Character averages are sufficient for the current version. Prefer | |
| repository-maintained `query_text_stats` and `document_text_stats` when present. | |
| If they are missing, compute them from the Nano `queries` and `corpus` tables. | |
| ## Required Page Structure | |
| Use this structure unless a task needs a clearly better variant: | |
| 1. `# {Nano set} / {task}` title. | |
| 2. GitHub note, placed immediately after the title, warning that the page was | |
| generated by an LLM from source papers, dataset cards, repository metadata, | |
| and sampled data, and that it may contain mistakes. Keep it simple and | |
| reader-facing: | |
| ```markdown | |
| > [!NOTE] | |
| > This page was generated by an LLM using source papers, dataset cards, | |
| > repository metadata, and sampled benchmark data. It may contain mistakes; | |
| > please treat it as a reference aid rather than a definitive source. | |
| ``` | |
| 3. `## Overview`: a paper-centered summary of what the benchmark task is. Start | |
| from the source paper when one exists: what retrieval problem the paper | |
| introduced, how the source data is framed, and what the concrete task asks a | |
| model to retrieve. If no source paper is available, summarize the benchmark | |
| task itself from the dataset card, official project page, and sampled data. | |
| The Overview should be task-specific prose, not a reusable sentence pattern | |
| such as "`{Task}` evaluates ... Queries are ...". Mention Nano packaging only | |
| when it changes how the source task is interpreted. | |
| 4. `## Details`: longer interpretive prose about the original task/data, | |
| source-paper findings, observed Nano data tendencies, BM25 difficulty, and | |
| why the benchmark differs from adjacent benchmarks. | |
| 5. `## Example Data`: random query-positive examples from the actual Nano split. | |
| 6. `## Dataset Information`: a Markdown table for structured facts. | |
| 7. `### Public Sources`: source papers, official pages, and dataset records. | |
| 8. `### Hugging Face Links`: the Nano dataset and source Hugging Face datasets | |
| when known. | |
| 9. `### Source Reference Table`: structured source title, year, type, URL. | |
| 10. `## Machine-Readable Metadata`: final YAML block for index generation. | |
| ## Example Policy | |
| Show five query-positive examples when possible. Select five queries by | |
| deterministic random sampling, not by taking the head of the query table. For | |
| each sampled query, use a positive qrel with matching query and corpus records. | |
| Use the repository script so regenerated pages stay stable: | |
| ```bash | |
| uv run python scripts/extract_benchmark_task_examples.py hakari-bench/NanoMMTEB-v2 argu_ana | |
| ``` | |
| For bulk refreshes, replace only the `## Example Data` sections with: | |
| ```bash | |
| uv run python scripts/extract_benchmark_task_examples.py --update-docs docs/benchmark_tasks | |
| ``` | |
| Use a Markdown table with exactly two columns by default: `Query` and | |
| `Positive document`. The visible table should focus on the actual query and | |
| positive document text. Omit query/doc IDs, BM25 ranks, and extra count columns | |
| unless a task specifically needs them. Append full character counts inline. | |
| Truncate long content to the configured visible character limit and show the | |
| full pre-truncation length with the compact marker | |
| `[truncated 225 chars](1258 chars)`. | |
| ```markdown | |
| | Query | Positive document | | |
| | --- | --- | | |
| | What is ...? (12 chars) | The answer-bearing passage ... [truncated 225 chars](1800 chars) | | |
| ``` | |
| For extremely long-context, legal, patent, medical, code, or documentation tasks, | |
| use a vertical sample-block format only when the table would be unreadable on | |
| GitHub: | |
| ```markdown | |
| ### Sample 1 | |
| | Field | Value | | |
| | --- | --- | | |
| | Query ID | `q1` | | |
| | Positive Doc ID | `d1` | | |
| **Query** | |
| > Truncated query text ... [truncated from 2400 chars] | |
| **Positive document** | |
| > Truncated positive document text ... [truncated from 18000 chars] | |
| ``` | |
| Do not summarize samples. Show the actual query and positive document text from | |
| the Nano tables. Long query or document text must be truncated to a readable | |
| length, with the original character count visible, for example `[truncated from | |
| 20442 chars]`. Even when text is truncated, the reader should be able to tell the full | |
| query and positive-document character counts. | |
| ## Interpretation Policy | |
| The `Details` section should explain the data itself. Do not spend the section | |
| explaining that this is a Nano subset or that the Nano format has query, corpus, | |
| qrels, and BM25 tables. Mention Nano sampling only when the observed sampled | |
| data changes how readers should interpret the task. | |
| Use these subheadings inside `## Details`: | |
| ```markdown | |
| ### What the Original Data Measures | |
| ### Observed Data Profile | |
| ### BM25 Difficulty | |
| ### Training Data That May Help | |
| ### Synthetic Data Guidance | |
| ``` | |
| Discuss: | |
| - what the task asks the model to retrieve, | |
| - what the original paper or official dataset source says the dataset was built | |
| to evaluate, | |
| - what the source paper says about dataset construction, annotation, split | |
| design, related benchmarks, baseline behavior, limitations, or intended use, | |
| - what the actual Nano data looks like: query style, document genre, document | |
| length, positives per query, language, and domain, | |
| - whether lexical matching is likely to be strong, | |
| - whether the task is multilingual, domain-specific, code-oriented, | |
| long-document-oriented, or fact/evidence-oriented, | |
| - whether qrels are mostly single-positive or multi-positive, | |
| - how BM25 nDCG@10 and hit@10 should be read for this task, | |
| - what existing non-evaluation training data may help, | |
| - what synthetic source documents and synthetic questions would be useful. | |
| Avoid generic filler. Each final task page should include at least one | |
| task-specific paragraph grounded in the original paper, dataset card, or | |
| benchmark source. | |
| If a source paper exists, cite it in prose. Good detail text should read like: | |
| ```markdown | |
| [CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497) | |
| reports that the benchmark aggregates programming solutions, online tutorials, | |
| library documentation, StackOverflow posts, and GitHub repositories as retrieval | |
| sources. This matters for this task because ... | |
| ``` | |
| If no source paper is confirmed, say so plainly: | |
| ```markdown | |
| No source paper was confirmed for this task. The interpretation below is based | |
| on the official Hugging Face dataset card, project metadata, and observed sample | |
| queries and positives. | |
| ``` | |
| ### Training Data That May Help | |
| This subsection should answer which existing datasets or supervised pairs could | |
| teach the domain without using evaluation answers. | |
| Cover these cases: | |
| - If the original dataset provides a train split or official training data, | |
| state that it is the first source to inspect. Do not assume it may be used for | |
| a leaderboard unless the benchmark rules allow it. | |
| - Always check split provenance. If the Nano task is derived from an upstream | |
| dev or test split, say that data likely to overlap with the benchmark, such as | |
| the same upstream dev/test split, should preferably be excluded from training. | |
| Recommend upstream train splits or other source data that are unlikely to | |
| overlap with the evaluation task. | |
| - For public datasets, warn in practical terms that obvious overlap with the | |
| benchmark should be avoided. Detailed ID or text overlap audits are useful for | |
| production-quality training pipelines, but the reader-facing guidance should | |
| not make the first-pass task brief feel like an implementation checklist. | |
| - State that learning the evaluation queries, qrels, or positive passages can | |
| inflate benchmark scores. For retrieval tasks, memorizing the answer passage is | |
| not the same as learning retrieval behavior. | |
| - For code tasks, recommend source-aligned data such as documentation retrieval, | |
| DocPrompting-style NL-intent-to-doc pairs, StackOverflow QA, tutorials, | |
| migration guides, docstrings, issue-to-fix pairs, and API examples. | |
| - For multilingual tasks, recommend native-language supervised pairs and | |
| same-language corpora rather than translated English-only pairs. | |
| - Keep this subsection concise and technical. It should list the data types that | |
| help, not re-explain the entire benchmark. | |
| ### Synthetic Data Guidance | |
| This subsection should explain what synthetic documents and questions to create. | |
| It should be separate from `Training Data That May Help`. | |
| Cover these cases: | |
| - If synthetic data is recommended, specify the document genre, document | |
| contents, question style, question intent, and how the generated question | |
| should be answerable from the generated or selected document. | |
| - Distinguish document-to-question generation from joint document-and-question | |
| generation. Document-to-question generation should use non-evaluation source | |
| documents. Joint generation should create both realistic source-style | |
| documents and questions with explicit answer grounding. | |
| - Do not use evaluation split queries or positive passages as seeds for | |
| synthetic generation. For example, if a Nano task is derived from MIRACL dev, | |
| use MIRACL train or non-overlapping Wikipedia passages, not MIRACL dev/test | |
| positives. | |
| - For multilingual tasks, prefer native-language synthetic queries and | |
| documents over translated English-only data. | |
| - For code tasks, synthetic data should preserve executable/API semantics, | |
| identifiers, version constraints, stack traces, and realistic developer | |
| tasks. | |
| - For legal, patent, medical, finance, or scientific tasks, synthetic documents | |
| should use realistic domain structure, terminology, citations, measurements, | |
| entities, and evidential wording. | |
| - For multi-positive tasks, train with multi-positive objectives or listwise / | |
| distillation signals rather than reducing the task to one positive per query. | |
| ## Machine-Readable Metadata | |
| Each task page must end with a fenced YAML block. This block is for future index | |
| page generation and should be easy to parse without reading prose. | |
| Use this marker immediately before the YAML block: | |
| ```markdown | |
| <!-- benchmark-task-metadata:v1 --> | |
| ``` | |
| The block must be the final content in the Markdown file: | |
| ````markdown | |
| ## Machine-Readable Metadata | |
| <!-- benchmark-task-metadata:v1 --> | |
| ```yaml | |
| benchmark_task_metadata: | |
| schema_version: 1 | |
| document_status: first_pass | |
| nano_set: NanoMIRACL | |
| backing_dataset: NanoMIRACL | |
| dataset_id: hakari-bench/NanoMIRACL | |
| task_name: ja | |
| split_name: ja | |
| language: ja | |
| category: natural_language | |
| document_path: docs/benchmark_tasks/NanoMIRACL/ja.md | |
| source_research: | |
| primary_source_type: task_paper | |
| paper_pdf_or_html_checked: true | |
| no_paper_note: null | |
| counts: | |
| queries: 200 | |
| documents: 1846 | |
| positive_qrels: 200 | |
| positives_per_query: | |
| average: 1.0 | |
| min: 1 | |
| median: 1.0 | |
| max: 1 | |
| multi_positive_queries: 0 | |
| multi_positive_query_percent: 0.0 | |
| text_stats_chars: | |
| query_mean: 17.47 | |
| document_mean: 297.912784 | |
| bm25: | |
| ndcg_at_10: 0.5956231823 | |
| hit_at_10: 0.94 | |
| source: dataset_bm25_column | |
| learning: | |
| original_train_split: unknown | |
| evaluation_split_origin: unknown | |
| train_eval_overlap_audit: not_audited | |
| leakage_note: do not train on upstream dev/test queries, qrels, or positive passages | |
| useful_training_data: | |
| - official non-overlapping train split | |
| - native-language question-to-passage retrieval pairs | |
| - non-overlapping source-corpus passage QA pairs | |
| synthetic_data: | |
| document_generation: native-language answer-bearing passages from the source collection style | |
| question_generation: native-language information needs answerable from those passages | |
| answerability: questions should be grounded in explicit facts, entities, or relations in the document | |
| multi_positive_training: single_positive_question_document_focus | |
| links: | |
| nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoMIRACL | |
| source_urls: | |
| - label: MIRACL unified source dataset | |
| url: https://huggingface.co/datasets/hotchpotch/miracl-hf-unified | |
| source_notes: [] | |
| references: | |
| - title: "MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages" | |
| url: https://arxiv.org/abs/2210.09984 | |
| year: 2023 | |
| doi: 10.1162/tacl_a_00595 | |
| is_paper: true | |
| source_confidence: definitive_paper_link | |
| ``` | |
| ```` | |
| Metadata field guidance: | |
| - `schema_version`: increment only for incompatible schema changes. | |
| - `document_status`: use `first_pass`, `reviewed`, or `needs_review`. | |
| - `document_path`: repository-relative path only. | |
| - `source_research.primary_source_type`: use `task_paper`, `benchmark_paper`, | |
| `dataset_card`, `project_page`, `technical_article`, or `sample_inference`. | |
| Use `benchmark_paper` when the strongest paper source is the benchmark paper | |
| that includes the task, even if no standalone task paper was confirmed. | |
| - `source_research.paper_pdf_or_html_checked`: boolean. Set `true` only when | |
| the paper PDF or HTML was inspected beyond title/abstract metadata. | |
| - `source_research.no_paper_note`: short public note when no paper was | |
| confirmed, otherwise `null`. | |
| - `bm25.source`: must be `dataset_bm25_column` unless the task explicitly uses a | |
| different source. | |
| - `learning.original_train_split`: use `available`, `not_found`, or `unknown`. | |
| Leave this as `unknown` unless the original source split was explicitly | |
| audited. | |
| - `learning.evaluation_split_origin`: record the upstream split if known, such | |
| as `train`, `dev`, `test`, `validation`, or `unknown`. | |
| - `learning.train_eval_overlap_audit`: use `passed`, `failed`, | |
| `not_applicable`, or `not_audited`. Use `not_audited` until query IDs, | |
| document IDs, source titles, and positive text overlap were checked. | |
| - `learning.leakage_note`: short public warning about what not to train on. | |
| - `learning.useful_training_data`: concise machine-readable list of existing | |
| data types that may teach the domain without using evaluation answers. | |
| - `learning.synthetic_data`: concise machine-readable hints for what synthetic | |
| documents and questions to generate. These are for index/filter pages and | |
| should mirror the prose in `### Synthetic Data Guidance`. | |
| - `learning.multi_positive_training`: use `multi_positive_objective` when the | |
| qrels contain multiple positives for a meaningful share of queries, otherwise | |
| `single_positive_question_document_focus`. | |
| - `references[].url`: use the arXiv URL first when one exists. | |
| - `links.source_urls`: structured `{label, url}` objects with public URLs only. | |
| These may include Hugging Face datasets, project pages, or source repositories. | |
| - `links.source_notes`: optional non-URL source notes from README metadata. | |
| ## Document Template | |
| Use this template for new pages: | |
| ````markdown | |
| # {Nano set} / {task name} | |
| > [!NOTE] | |
| > This page was generated by an LLM using source papers, dataset cards, | |
| > repository metadata, and sampled benchmark data. It may contain mistakes; | |
| > please treat it as a reference aid rather than a definitive source. | |
| ## Overview | |
| {About 500 English characters summarizing what this benchmark task is. When a | |
| paper exists, ground the overview in the paper: what problem the paper introduced, | |
| how the data was constructed or adapted, what the query and document sides | |
| represent, and what retrieval behavior is being tested. When no paper exists, | |
| summarize the task from the dataset card, official page, repository metadata, and | |
| sampled data. Avoid a fill-in-the-blank pattern such as "`{Task}` evaluates ... | |
| Queries are ..."; the paragraph should contain details that would not fit most | |
| other tasks in the same group.} | |
| ## Details | |
| ### What the Original Data Measures | |
| {Explain the original data or benchmark from the source paper, official dataset | |
| card, or project page. Focus on what retrieval behavior is being tested, not on | |
| Nano packaging. If a paper was used, cite it directly in prose, e.g. | |
| `[Paper Title](url) reports that ...`. If no paper was confirmed, state that the | |
| interpretation is based on public dataset cards, project pages, and sample data.} | |
| ### Observed Data Profile | |
| {Summarize the actual sampled task with useful interpretation, not only counts: | |
| query and document styles, recurring intents, multi-positive clusters, visible | |
| data quirks, and what those imply for retrieval.} | |
| ### BM25 Difficulty | |
| {Explain BM25 nDCG@10 and hit@10 for this task. Include concrete patterns from | |
| the Nano BM25 candidate ranking when useful, such as cases where lexical matching | |
| finds the topic but misses intent equivalence.} | |
| ### Training Data That May Help | |
| {List concise, technical existing-data recommendations. Mention the official | |
| train split when available, but warn that upstream dev/test sets, or other data | |
| likely to overlap with the benchmark, should preferably be excluded. Keep | |
| detailed overlap-audit mechanics in metadata or implementation notes rather than | |
| the main prose unless the task specifically needs them.} | |
| ### Synthetic Data Guidance | |
| {Describe what synthetic source-style documents and questions to create. Separate | |
| document-to-question generation from generating both documents and questions. | |
| State that evaluation split queries and positive passages should not be used as | |
| seeds.} | |
| ## Example Data | |
| {Five deterministic random query-positive examples. Generate with | |
| `scripts/extract_benchmark_task_examples.py`. Use a two-column Markdown table, | |
| include full character counts inline, and visibly truncate long content with | |
| `[truncated 225 chars](N chars)`.} | |
| ## Dataset Information | |
| | Field | Value | | |
| | --- | --- | | |
| | Nano set | {Nano set} | | |
| | Backing dataset | {Backing dataset} | | |
| | Task / split | {Task or split} | | |
| | Hugging Face dataset | [{dataset_id}](https://huggingface.co/datasets/{dataset_id}) | | |
| | Language | {language} | | |
| | Category | {natural_language or code} | | |
| | Queries | {query_count} | | |
| | Documents | {document_count} | | |
| | Positive qrels | {qrel_count} | | |
| | Avg positives / query | {avg_positives_per_query; omit this row when all queries have exactly one positive} | | |
| | Positives per query (min / median / max) | {min} / {median} / {max; omit this row when all queries have exactly one positive} | | |
| | Queries with multiple positives | {count} ({percent}%; omit this row when all queries have exactly one positive} | | |
| | BM25 nDCG@10 | {bm25_ndcg_at_10} | | |
| | BM25 hit@10 | {bm25_hit_at_10} | | |
| | Query length avg chars | {query_mean_chars} | | |
| | Document length avg chars | {document_mean_chars} | | |
| ### Public Sources | |
| - [{Primary paper title, preferably arXiv when available}]({primary_public_url}); {year}; {authors}; DOI: `{doi}`. | |
| - [{Dataset card or project page}]({public_url}). | |
| ### Hugging Face Links | |
| - Nano dataset: [{dataset_id}](https://huggingface.co/datasets/{dataset_id}) | |
| - Source dataset: [{source_dataset_id}](https://huggingface.co/datasets/{source_dataset_id}) | |
| ### Source Reference Table | |
| | Title | Year | Type | URL | | |
| | --- | ---: | --- | --- | | |
| | {title} | {year} | paper | {url} | | |
| ## Machine-Readable Metadata | |
| <!-- benchmark-task-metadata:v1 --> | |
| ```yaml | |
| benchmark_task_metadata: | |
| schema_version: 1 | |
| document_status: first_pass | |
| nano_set: {Nano set} | |
| backing_dataset: {Backing dataset} | |
| dataset_id: {dataset_id} | |
| task_name: {task_name} | |
| split_name: {split_name} | |
| language: {language} | |
| category: {category} | |
| document_path: docs/benchmark_tasks/{Nano-set name}/{task name}.md | |
| source_research: | |
| primary_source_type: {task_paper|benchmark_paper|dataset_card|project_page|technical_article|sample_inference} | |
| paper_pdf_or_html_checked: {true|false} | |
| no_paper_note: {null or public note} | |
| counts: | |
| queries: {query_count} | |
| documents: {document_count} | |
| positive_qrels: {qrel_count} | |
| positives_per_query: | |
| average: {avg_positives_per_query} | |
| min: {min_positives} | |
| median: {median_positives} | |
| max: {max_positives} | |
| multi_positive_queries: {multi_positive_query_count} | |
| multi_positive_query_percent: {multi_positive_query_percent} | |
| text_stats_chars: | |
| query_mean: {query_mean_chars} | |
| document_mean: {document_mean_chars} | |
| bm25: | |
| ndcg_at_10: {bm25_ndcg_at_10} | |
| hit_at_10: {bm25_hit_at_10} | |
| source: dataset_bm25_column | |
| learning: | |
| original_train_split: unknown | |
| evaluation_split_origin: {train|dev|test|validation|unknown} | |
| train_eval_overlap_audit: not_audited | |
| leakage_note: {short leakage warning} | |
| useful_training_data: | |
| - {existing_training_data_type} | |
| synthetic_data: | |
| document_generation: {synthetic_document_generation} | |
| question_generation: {synthetic_question_generation} | |
| answerability: {synthetic_answerability} | |
| multi_positive_training: {multi_positive_training} | |
| links: | |
| nano_dataset: https://huggingface.co/datasets/{dataset_id} | |
| source_urls: | |
| - label: {source_label} | |
| url: {source_url} | |
| source_notes: [] | |
| references: | |
| - title: {title} | |
| url: {public_url} | |
| year: {year} | |
| doi: {doi} | |
| is_paper: true | |
| source_confidence: {source_confidence} | |
| ``` | |
| ```` | |
| ## Group Index Pages | |
| Before scaling to all 500+ tasks, add group index pages such as: | |
| ```text | |
| docs/benchmark_tasks/{Nano-set name}/index.md | |
| ``` | |
| Build index pages from the machine-readable metadata blocks. Each group index | |
| should include: | |
| - task name and document link, | |
| - language and category, | |
| - query/document/qrels counts, | |
| - positives-per-query summary only when the task is not exactly one-positive per | |
| query, | |
| - BM25 nDCG@10 and hit@10, | |
| - average query/document character lengths, | |
| - source status and primary paper title, | |
| - document status. | |
| ## Maintenance Checklist | |
| Before publishing a batch: | |
| 1. Confirm every generated page has a Nano dataset link and at least one public | |
| source or a visible note that source metadata is missing. | |
| 2. Confirm arXiv was checked for every paper source and is used as the first URL | |
| when available. | |
| 3. Confirm that every cited paper was checked beyond the abstract when possible: | |
| use the PDF or HTML to inspect dataset construction, source data, splits, | |
| related work, baselines, limitations, and task-specific discussion. | |
| 4. If no source paper is confirmed, confirm the page says so and explains that | |
| the interpretation is based on official dataset cards, project pages, | |
| technical articles, Hugging Face metadata, and observed samples. | |
| 5. Confirm train/dev/test provenance. If the Nano task comes from an upstream | |
| dev/test split, the page must warn against training on that split and should | |
| recommend only non-overlapping train or source-corpus data. | |
| 6. Confirm BM25 nDCG@10 was computed from the Nano `bm25` table, not from a | |
| fresh local BM25 run. | |
| 7. Confirm positives-per-query statistics were computed from qrels. | |
| 8. Confirm exactly five random examples come from the selected Nano split when | |
| at least five qrel pairs are available. | |
| 9. Confirm sample data shows actual query and positive document text, not a | |
| summary, and that long samples are visibly truncated with original character | |
| count. | |
| 10. Confirm the final YAML metadata block parses successfully. | |
| 11. Confirm no generated benchmark outputs, caches, local paper paths, local wiki | |
| paths, or private scratch artifacts are committed. | |
| ## Final Prose Quality Review | |
| After generating a task page, do a final pass specifically for writing quality | |
| and usefulness. The page should not feel like a statistics dump. It should help a | |
| reader understand what kind of retrieval behavior the task rewards, why the task | |
| is difficult, and what data would plausibly teach the domain. | |
| Use this checklist before considering a generated task page ready: | |
| 1. The overview explains the original benchmark or source dataset first, then | |
| explains the concrete Nano task. It describes the task itself, not the | |
| Markdown file or the Nano packaging. | |
| 2. The source discussion cites the paper, benchmark paper, dataset card, or | |
| project page in the sentences that depend on that source. When no paper was | |
| confirmed, the page says so plainly and does not pretend that a paper-backed | |
| interpretation exists. | |
| 3. The details section includes at least one paragraph grounded in the source | |
| paper or official dataset card: dataset construction, annotation workflow, | |
| source corpus, split design, benchmark purpose, or known limitations. | |
| 4. The observed data profile goes beyond counts. It names visible query types, | |
| document genres, recurring domains, language-specific issues, entity or | |
| terminology patterns, multi-positive clusters when present, and any quirks | |
| that affect retrieval. | |
| 5. The BM25 section interprets the score. It should explain what BM25 is doing | |
| well, what it fails to distinguish, and include concrete patterns from the | |
| dataset-provided BM25 ranking when those patterns are informative. | |
| 6. The task-specific difficulty is explicit. For example, debate tasks should | |
| discuss stance and counterargument matching; duplicate-question tasks should | |
| discuss intent equivalence and paraphrase clusters; Wikipedia QA retrieval | |
| should discuss short fact queries and passage evidence; public-health FAQ | |
| retrieval should discuss procedural guidance and action-specific matching. | |
| 7. The training-data section is concise and technical. It recommends existing | |
| data types that teach the domain without using likely evaluation answers, and | |
| it includes a practical overlap warning for public train/dev/test data. | |
| 8. The synthetic-data section focuses on what documents and questions to | |
| generate. It should specify document genre, question style, answerability, and | |
| domain details. Do not spend this section on hard negatives. | |
| 9. The examples are actual query-positive text from the Nano split. They should | |
| be readable on GitHub, include full character counts, and show truncation | |
| clearly when content is shortened. | |
| 10. The page avoids generic filler. Replace broad statements such as "this task | |
| requires semantic understanding" with task-specific statements about the | |
| exact relation being retrieved. | |
| 11. The writing separates evidence from inference. If a claim comes from a | |
| paper, say so with a link. If it comes from inspecting the sampled Nano data, | |
| make that clear through wording such as "the sampled data shows" or "the | |
| observed BM25 ranking suggests". | |
| 12. The final page has a coherent reader flow: warning note, overview, details, | |
| samples, dataset information, public sources, and machine-readable metadata. | |
| A user should understand the task before reaching the tables and source | |
| lists. | |