# NanoDAPFAM / NanoDAPFAMInTitlAbsToFullText
## Overview
`NanoDAPFAMInTitlAbsToFullText` is a retrieval task: the query is the title and
abstract of a patent family, and the targets are full-text patent-family
records. Positives are DAPFAM citation links labeled IN-domain, meaning the
query and target share an IPC3 class.
## Details
### What the Original Data Measures
[DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval](https://arxiv.org/abs/2506.22141)
reports that domain partitions are central to understanding patent retrieval
difficulty. This split represents the same-domain setting with compact queries
and full target documents.
### Observed Data Profile
The Nano split has 200 queries, 10,000 documents, and 3,072 positive qrels.
Queries average 771.33 characters, and full-text targets average 68,924.34
characters. Positives per query average 15.36.

The query title and abstract usually identify the technical problem and proposed
solution, while full target records contain detailed descriptions and claims
from same-domain patent families.
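The figures above can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers reported in this section; the variable names are illustrative and not part of the dataset.

```python
# Figures copied from the profile above.
queries = 200
documents = 10_000
positive_qrels = 3_072
query_mean_chars = 771.33
document_mean_chars = 68_924.34

# Average number of positives per query.
positives_per_query = positive_qrels / queries

# How much longer an average target is than an average query.
length_ratio = document_mean_chars / query_mean_chars

# Approximate size of the whole corpus in characters.
total_corpus_chars = documents * document_mean_chars

print(f"positives per query: {positives_per_query:.2f}")
print(f"target/query length ratio: {length_ratio:.1f}")
print(f"corpus size: {total_corpus_chars / 1e6:.1f} million chars")
```

The ratio of roughly 89 to 1 between target and query length is the main practical constraint here: a retriever has to match a short summary against documents that may exceed its input window, so truncation or chunking choices are likely to affect scores.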
### BM25 Difficulty
Using the dataset-provided BM25 candidate column and the best-ranked positive
per query, BM25 reaches nDCG@10 = 0.6643 and hit@10 = 0.8500 (170 of the 200
queries). It ranks a positive first for 95 queries and finds a positive in the
top 100 for every query.

The same-domain restriction and long targets make lexical matching effective,
though the task still requires choosing among many related patents.
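As a rough illustration of how such numbers can be computed, the sketch below scores each query by its best-ranked positive only. This is an assumed reading of "best-ranked positive per query": with a single counted positive the ideal DCG is 1, so nDCG@10 reduces to `1 / log2(1 + rank)`. The function names and the toy data are hypothetical and do not reflect the dataset's actual column layout.

```python
import math


def best_positive_rank(candidates, positives):
    """Return the 1-based rank of the first positive in candidates, or None."""
    for rank, doc_id in enumerate(candidates, start=1):
        if doc_id in positives:
            return rank
    return None


def ndcg_best_positive(rank, k=10):
    """nDCG@k when only the best-ranked positive is counted (ideal DCG is 1)."""
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(1 + rank)


def hit(rank, k=10):
    """1.0 if a positive appears in the top k, else 0.0."""
    return 1.0 if rank is not None and rank <= k else 0.0


def evaluate(bm25_candidates, qrels, k=10):
    """Average the per-query scores over all queries that have qrels."""
    ranks = [
        best_positive_rank(bm25_candidates.get(qid, []), positives)
        for qid, positives in qrels.items()
    ]
    n = len(ranks)
    return {
        "ndcg": sum(ndcg_best_positive(r, k) for r in ranks) / n,
        "hit": sum(hit(r, k) for r in ranks) / n,
        "ranked_first": sum(1 for r in ranks if r == 1),
    }


# Toy data: three queries with ranked candidate lists and positive sets.
toy_candidates = {
    "q1": ["d3", "d1", "d2"],  # positive at rank 1
    "q2": ["d9", "d8", "d4"],  # positive at rank 3
    "q3": ["d5", "d6"],        # no positive retrieved
}
toy_qrels = {"q1": {"d3"}, "q2": {"d4", "d7"}, "q3": {"d0"}}

result = evaluate(toy_candidates, toy_qrels)
print(result)
```

On the toy data, the per-query nDCG values are 1.0, 0.5, and 0.0, so the mean is 0.5 and two of three queries count as hits. A full nDCG that credits every positive in the top 10 would give different values, so comparisons against other reports should confirm which variant was used.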
### Training Data That May Help
Useful training data includes same-domain title-abstract prior-art search
against full patent records, family-level citation retrieval, and patent
semantic search. In each case, exclude patent families that appear in the
NanoDAPFAM evaluation set.
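One way to apply that exclusion is to filter candidate training pairs by family ID before training. The sketch below is a minimal example under stated assumptions: the field names `query_family_id` and `target_family_id` are hypothetical, and the evaluation ID set is assumed to cover both query and document families.

```python
def drop_eval_families(training_pairs, eval_family_ids):
    """Keep only pairs where neither side is an evaluation family."""
    blocked = set(eval_family_ids)
    return [
        pair
        for pair in training_pairs
        if pair["query_family_id"] not in blocked
        and pair["target_family_id"] not in blocked
    ]


# Toy data: family IDs are placeholders, not real DAPFAM identifiers.
eval_ids = {"F100", "F200"}
pairs = [
    {"query_family_id": "F001", "target_family_id": "F002"},  # kept
    {"query_family_id": "F100", "target_family_id": "F003"},  # query is in eval
    {"query_family_id": "F004", "target_family_id": "F200"},  # target is in eval
]

clean_pairs = drop_eval_families(pairs, eval_ids)
print(len(clean_pairs), "of", len(pairs), "pairs kept")
```

Filtering on both sides of the pair is the conservative choice: a pair whose target is an evaluation document still exposes that document's text, and a pair whose query is an evaluation query exposes its citation labels.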
### Synthetic Data Guidance
Generate compact source summaries and long same-domain target records. Positives
should share a cited technical mechanism, while hard negatives should have
similar IPC terms but no direct prior-art relation.
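The hard-negative rule above can be sketched as a simple filter. The example assumes that an IPC3 class is the first three characters of an IPC symbol, that citation links are stored as unordered pairs, and that "similar IPC terms" means sharing one or more IPC3 classes; all names and data below are hypothetical.

```python
def ipc3(symbol):
    """Truncate an IPC symbol such as 'E01H 5/04' to its 3-character class."""
    return symbol.strip()[:3].upper()


def hard_negatives(query_id, family_classes, citation_pairs, limit=5):
    """Families that share an IPC3 class with the query but are not cited."""
    query_classes = family_classes[query_id]
    selected = []
    for family_id, classes in family_classes.items():
        if family_id == query_id:
            continue
        # Skip anything linked to the query by a citation in either direction.
        linked = (query_id, family_id) in citation_pairs or (
            family_id,
            query_id,
        ) in citation_pairs
        if linked:
            continue
        if query_classes & classes:
            selected.append(family_id)
    return sorted(selected)[:limit]


# Toy data: each family maps to its set of IPC3 classes.
family_classes = {
    "Q": {ipc3("E01H 5/04")},
    "P1": {"E01", "B60"},  # cited by Q, so it is a positive
    "N1": {"E01"},         # same class, no citation: hard negative
    "N2": {"A24"},         # different class: easy negative, not selected
    "N3": {"B60", "E01"},  # same class, no citation: hard negative
}
citation_pairs = {("Q", "P1")}

negatives = hard_negatives("Q", family_classes, citation_pairs)
print(negatives)
```

Class overlap alone does not guarantee that a negative is topically close, so a real pipeline would probably rank these candidates further, for example by lexical or embedding similarity to the query.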
## Example Data
Each cell shows an excerpt truncated to about 225 characters; the figure in
parentheses is the full length of the original text.
| Query | Positive document |
| --- | --- |
| snow removal equipment with automatic walking function the invention relates to snow removal equipment with an automatic walking function. the snow removal equipment comprises a walking module, a working module and a control ... [truncated 225 chars](988 chars) | multifunctional device for clearing snow an apparatus and method for clearing an accumulation of matter from a surface that includes a blade configured to collect matter upon movement of the apparatus and means to shift the c ... [truncated 225 chars](59310 chars) |
| waste disposal devices waste disposal device including a housing defining a waste compartment for receiving enclosed waste and arranged to removably receive a cartridge containing a length of flexible tubing which operatively ... [truncated 225 chars](891 chars) | cassette for dispensing pleated tubing a cassette for use in dispensing a pleated tubing. the cassette includes an annular body having a generally u shaped housing with an open central cylindrical core. the annular body inclu ... [truncated 225 chars](36269 chars) |
| an article including identification for use in an electrically heated smoking system. there is provided an electrically heated smoking system (101) for receiving a smoking article (115) or cleaning article (205) configured fo ... [truncated 225 chars](1164 chars) | apparatus for generating aerosol from an aerosolisable medium, an article of aerosolisable medium and a method of determining a parameter of an article to provide an apparatus that heats an aerosolizable medium to volatilize ... [truncated 225 chars](52756 chars) |
| low weight carpet and carpet tile and methods of manufacture low weight and non-square carpet tile suitable for use in mass transit vehicles, particularly passenger aircraft. the carpet tile preferably weighs less than about ... [truncated 225 chars](565 chars) | anti-static mats and carpets a novel carpet material or mat which is characterized by an extraordinary ability to quickly and comfortably discharge any build-up of a static electricity charge on a person who has built up such ... [truncated 225 chars](17195 chars) |
| organosilicon precursors for interlayer dielectric films with low dielectric constants a method of forming a low dielectric constant interlayer dielectric film on a substrate by reacting, under chemical vapor deposition condi ... [truncated 225 chars](546 chars) | radiation shield a radiation shield and an assembly and a reactor including the radiation shield are disclosed. the radiation shield can be used to control heat flux from a susceptor heater assembly and thereby enable better ... [truncated 225 chars](48536 chars) |
## Dataset Information
| Field | Value |
| --- | --- |
| Nano set | NanoDAPFAM |
| Backing dataset | NanoDAPFAM |
| Task / split | NanoDAPFAMInTitlAbsToFullText |
| Hugging Face dataset | [hakari-bench/NanoDAPFAM](https://huggingface.co/datasets/hakari-bench/NanoDAPFAM) |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 10000 |
| Positive qrels | 3072 |
| Positives per query | avg 15.36, min 1, median 18.0, max 20 |
| BM25 nDCG@10 | 0.6643 |
| BM25 hit@10 | 0.8500 |
| Query length avg chars | 771.33 |
| Document length avg chars | 68924.34 |
### Public Sources
- [DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval](https://arxiv.org/abs/2506.22141); 2026; Iliass Ayaou, Denis Cavallucci, and Hicham Chibane; DOI: `10.1016/j.array.2026.100720`.
- [DAPFAM DOI record](https://doi.org/10.1016/j.array.2026.100720).
- [datalyes/DAPFAM_patent dataset card](https://huggingface.co/datasets/datalyes/DAPFAM_patent).
### Hugging Face Links
- Nano dataset: [hakari-bench/NanoDAPFAM](https://huggingface.co/datasets/hakari-bench/NanoDAPFAM)
- Source dataset: [datalyes/DAPFAM_patent](https://huggingface.co/datasets/datalyes/DAPFAM_patent)
### Source Reference Table
| Title | Year | Type | URL |
| --- | ---: | --- | --- |
| DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval | 2026 | arXiv paper | https://arxiv.org/abs/2506.22141 |
| DAPFAM DOI record | 2026 | DOI | https://doi.org/10.1016/j.array.2026.100720 |
| datalyes/DAPFAM_patent | 2025 | dataset card | https://huggingface.co/datasets/datalyes/DAPFAM_patent |
## Machine-Readable Metadata
<!-- benchmark-task-metadata:v1 -->
```yaml
benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoDAPFAM
  backing_dataset: NanoDAPFAM
  dataset_id: hakari-bench/NanoDAPFAM
  task_name: NanoDAPFAMInTitlAbsToFullText
  split_name: NanoDAPFAMInTitlAbsToFullText
  language: en
  category: natural_language
  document_path: docs/benchmark_tasks/NanoDAPFAM/NanoDAPFAMInTitlAbsToFullText.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
  counts:
    queries: 200
    documents: 10000
    positive_qrels: 3072
    positives_per_query:
      average: 15.36
      min: 1
      median: 18.0
      max: 20
    multi_positive_queries: 198
  text_stats_chars:
    query_mean: 771.33
    document_mean: 68924.3436
  bm25:
    ndcg_at_10: 0.6642533082
    hit_at_10: 0.85
    source: dataset_bm25_column
  learning:
    original_train_split: not_confirmed
    evaluation_split_origin: DAPFAM IN-domain title-abstract to full-text patent-family retrieval
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoDAPFAM evaluation family IDs and citation labels
    useful_training_data:
      - same-domain title-abstract patent retrieval
      - full-text patent prior-art search
      - family-level citation retrieval
  synthetic_data:
    document_generation: full-text same-domain patent family records
    question_generation: title and abstract source patent summaries
    answerability: positives should be cited same-domain families
    multi_positive_training: citation_family_multi_positive
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoDAPFAM
    source_urls:
      - label: DAPFAM arXiv
        url: https://arxiv.org/abs/2506.22141
      - label: DAPFAM DOI
        url: https://doi.org/10.1016/j.array.2026.100720
      - label: datalyes/DAPFAM_patent
        url: https://huggingface.co/datasets/datalyes/DAPFAM_patent
  references:
    - title: "DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval"
      url: https://arxiv.org/abs/2506.22141
      year: 2026
      doi: 10.1016/j.array.2026.100720
      is_paper: true
      source_confidence: definitive_paper_link
```