# NanoDAPFAM / NanoDAPFAMOutTitlAbsToFullText
## Overview
`NanoDAPFAMOutTitlAbsToFullText` retrieves full-text patent-family records from
title-abstract queries. It is an OUT-domain split: each positive is a
citation-linked family that shares no IPC3 class with the query family.
## Details
### What the Original Data Measures
[DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval](https://arxiv.org/abs/2506.22141)
uses OUT-domain partitions to study cross-domain patent retrieval. This split
tests whether a compact source summary can retrieve long patent records from a
different technical class.
### Observed Data Profile
The Nano split has 200 queries, 10,000 documents, and 1,259 positive qrels.
Queries average 786.64 characters, full-text target documents average 71,902.31
characters, and positives per query average 6.29.
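The per-query statistics above can be recomputed from a qrels mapping. The sketch below is illustrative only: it assumes qrels are available as a plain `dict` from query ID to a set of positive document IDs, and `positives_per_query_stats` and the toy IDs are hypothetical names, not part of the dataset. On the real split the average would be 1,259 / 200 = 6.295.

```python
from statistics import mean, median


def positives_per_query_stats(qrels):
    """Summarize positives per query.

    qrels: dict mapping query_id -> set of positive doc_ids
           (an assumed in-memory layout, not the dataset's storage format).
    """
    counts = [len(positives) for positives in qrels.values()]
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "average": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        # Queries with more than one positive document.
        "multi_positive_queries": sum(c > 1 for c in counts),
    }


# Toy qrels with hypothetical IDs, for illustration only.
toy_qrels = {
    "q1": {"d1", "d2", "d3"},
    "q2": {"d4"},
    "q3": {"d5", "d6"},
}
stats = positives_per_query_stats(toy_qrels)
print(stats)
```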
Queries are roughly 90 times shorter than target documents on average, so the
lexical cross-domain signal is sparse. The passage that establishes the
prior-art relation may sit deep inside a long description or claims section.
### BM25 Difficulty
Using the dataset-provided BM25 candidate column and the best-ranked positive
per query, BM25 reaches nDCG@10 = 0.1102 and hit@10 = 0.2100 (42 of 200
queries). It ranks a positive first for 8 queries and places at least one
positive in the top 100 for every query.
Full-text targets give BM25 more terms to match than title-abstract targets,
but cross-domain terminology still keeps the top-10 hit rate low.
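The exact scoring script behind these numbers is not shown here. As a minimal sketch of one plausible reading of "best-ranked positive per query", the code below scores each query by the rank of its first positive in the candidate list, using a single-relevant-document form of nDCG@10 (gain `1 / log2(rank + 1)`) alongside hit@10. The function names, the dict-based inputs, and the toy IDs are all assumptions for illustration.

```python
import math


def best_positive_rank(ranked_doc_ids, positives):
    """Return the 1-based rank of the first positive, or None if absent."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in positives:
            return rank
    return None


def single_positive_ndcg(rank, k=10):
    """nDCG@k when only the best-ranked positive counts (ideal DCG is 1)."""
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)


def hit(rank, k=10):
    """1.0 if a positive appears within the top k, else 0.0."""
    return 1.0 if rank is not None and rank <= k else 0.0


def evaluate(candidates, qrels, k=10):
    """Average the per-query scores over all queries in qrels."""
    ranks = [best_positive_rank(candidates[q], qrels[q]) for q in qrels]
    n = len(ranks)
    return {
        "ndcg_at_k": sum(single_positive_ndcg(r, k) for r in ranks) / n,
        "hit_at_k": sum(hit(r, k) for r in ranks) / n,
        "rank_1_queries": sum(r == 1 for r in ranks),
    }


# Toy data with hypothetical IDs: positives at rank 2, rank 1, and rank 12.
toy_qrels = {"q1": {"d1", "d2"}, "q2": {"d4"}, "q3": {"d7"}}
toy_candidates = {
    "q1": ["d9", "d1", "d2"],
    "q2": ["d4", "d8"],
    "q3": [f"x{i}" for i in range(11)] + ["d7"],
}
metrics = evaluate(toy_candidates, toy_qrels)
print(metrics)
```

A standard multi-positive nDCG@10 would instead sum gains over every positive in the top 10 and normalize by the ideal ordering, so the two variants can differ for queries with several positives.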
### Training Data That May Help
Helpful training data includes cross-domain title-abstract to full-text patent
retrieval, cross-IPC citations, and prior-art search examples that require
technical analogy.
### Synthetic Data Guidance
Generate short source patent summaries and long target records from different
technical areas. Positives should share a mechanism, material, or prior-art
role despite different domain labels.
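One way to enforce the "different domain labels" constraint on synthetic pairs is to keep only pairs whose IPC3 sets are disjoint, mirroring the OUT-domain definition of this split. The sketch below assumes IPC3 means the first three characters of an IPC code (section plus two-digit class); the helper names, the pair layout, and the IPC codes in the toy data are hypothetical.

```python
def ipc3_classes(ipc_codes):
    """Reduce IPC codes to three-character prefixes (assumed meaning of IPC3)."""
    return {code.strip().upper()[:3] for code in ipc_codes if code.strip()}


def is_out_domain(query_ipc_codes, target_ipc_codes):
    """True when the query and target share no IPC3 class."""
    return ipc3_classes(query_ipc_codes).isdisjoint(ipc3_classes(target_ipc_codes))


def keep_out_domain_pairs(pairs):
    """Filter synthetic (query, target) pairs down to cross-domain ones."""
    return [
        pair for pair in pairs
        if is_out_domain(pair["query_ipc"], pair["target_ipc"])
    ]


# Toy pairs with made-up IPC codes, for illustration only.
toy_pairs = [
    {"id": "p1", "query_ipc": ["B62K 21/26"], "target_ipc": ["D06N 3/00"]},
    {"id": "p2", "query_ipc": ["B62K 21/26"], "target_ipc": ["B62J 1/00", "D06N 3/00"]},
]
kept = keep_out_domain_pairs(toy_pairs)
print([pair["id"] for pair in kept])
```

The second toy pair is dropped because both sides reduce to `B62`, even though the target also carries a class from another area.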
## Example Data
| Query | Positive document |
| --- | --- |
| bicycle handlebar grip a bicycle handlebar grip contains a plastic inner shell having a tubular shape and an outer surface; a fiber layer having an inner surface and an outer surface and includes a plurality of fibers interwe ... [truncated to 225 of 821 chars] | durable flexible membrane and method of making same a flexible membrane having a valuable combination of desirable properties is composed of a generally heavy, dense supporting and reinforcing reticulated base fabric constitu ... [truncated to 225 of 28042 chars] |
| method for improving belt press dewatering a method for increasing the removal of a higher fraction of liquid from the press cake in any belt press is described. specifically, the invention incorporates a series of rollers th ... [truncated to 225 of 620 chars] | artificial human anti-factor b antibody problem to be solved: to provide novel engineered forms of a monoclonal antibody and antigen-binding fragments thereof that bind complement protein factor b and selectively inhibit the ... [truncated to 225 of 108109 chars] |
| stitch distribution control system for tufting machines a stitch distribution control system for a tufting machine for controlling placement of yarns being fed to the needles of the tufting machine by yarn feed mechanisms to ... [truncated to 225 of 647 chars] | method and apparatus for measuring direction or position of weft yarn of fabric the measurement of the pick or stitches course position in continuously moved fabrics involves examining at least one gap-shaped segment in a top ... [truncated to 225 of 24253 chars] |
| low weight carpet and carpet tile and methods of manufacture low weight and non-square carpet tile suitable for use in mass transit vehicles, particularly passenger aircraft. the carpet tile preferably weighs less than about ... [truncated to 225 of 565 chars] | modular floor covering units with built-in lighting an apparatus for guiding the occupants of a structure along a path of travel within the structure is provided. the apparatus is comprised of modular floor covering units whi ... [truncated to 225 of 35319 chars] |
| method and apparatus for the zonal transmission of data using building lighting fixtures this invention relates to the zonal transmission of data by the modulation of the light output of arc lamps (150) or discharge lamps; li ... [truncated to 225 of 969 chars] | shelf tag with ambient light detector the present invention relates to an electronic shelf display device which includes an optical device and an ambient light detector circuitry. the electronic shelf display device includes ... [truncated to 225 of 54320 chars] |
## Dataset Information
| Field | Value |
| --- | --- |
| Nano set | NanoDAPFAM |
| Backing dataset | NanoDAPFAM |
| Task / split | NanoDAPFAMOutTitlAbsToFullText |
| Hugging Face dataset | [hakari-bench/NanoDAPFAM](https://huggingface.co/datasets/hakari-bench/NanoDAPFAM) |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 10000 |
| Positive qrels | 1259 |
| Positives per query | avg 6.29, min 1, median 4.0, max 20 |
| BM25 nDCG@10 | 0.1102 |
| BM25 hit@10 | 0.2100 |
| Query length avg chars | 786.64 |
| Document length avg chars | 71902.31 |
### Public Sources
- [DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval](https://arxiv.org/abs/2506.22141); 2026; Iliass Ayaou, Denis Cavallucci, and Hicham Chibane; DOI: `10.1016/j.array.2026.100720`.
- [DAPFAM DOI record](https://doi.org/10.1016/j.array.2026.100720).
- [datalyes/DAPFAM_patent dataset card](https://huggingface.co/datasets/datalyes/DAPFAM_patent).
### Hugging Face Links
- Nano dataset: [hakari-bench/NanoDAPFAM](https://huggingface.co/datasets/hakari-bench/NanoDAPFAM)
- Source dataset: [datalyes/DAPFAM_patent](https://huggingface.co/datasets/datalyes/DAPFAM_patent)
### Source Reference Table
| Title | Year | Type | URL |
| --- | ---: | --- | --- |
| DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval | 2026 | arXiv paper | https://arxiv.org/abs/2506.22141 |
| DAPFAM DOI record | 2026 | DOI | https://doi.org/10.1016/j.array.2026.100720 |
| datalyes/DAPFAM_patent | 2025 | dataset card | https://huggingface.co/datasets/datalyes/DAPFAM_patent |
## Machine-Readable Metadata
<!-- benchmark-task-metadata:v1 -->
```yaml
benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoDAPFAM
  backing_dataset: NanoDAPFAM
  dataset_id: hakari-bench/NanoDAPFAM
  task_name: NanoDAPFAMOutTitlAbsToFullText
  split_name: NanoDAPFAMOutTitlAbsToFullText
  language: en
  category: natural_language
  document_path: docs/benchmark_tasks/NanoDAPFAM/NanoDAPFAMOutTitlAbsToFullText.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
  counts:
    queries: 200
    documents: 10000
    positive_qrels: 1259
    positives_per_query:
      average: 6.295
      min: 1
      median: 4.0
      max: 20
    multi_positive_queries: 139
  text_stats_chars:
    query_mean: 786.64
    document_mean: 71902.3141
  bm25:
    ndcg_at_10: 0.1101697656
    hit_at_10: 0.21
    source: dataset_bm25_column
  learning:
    original_train_split: not_confirmed
    evaluation_split_origin: DAPFAM OUT-domain title-abstract to full-text patent-family retrieval
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoDAPFAM evaluation family IDs, positives, and qrels
    useful_training_data:
      - cross-domain title-abstract patent retrieval
      - cross-IPC patent citation pairs
      - long-target prior-art search
  synthetic_data:
    document_generation: long full-text patent records from different technical classes
    question_generation: compact source patent title and abstract summaries
    answerability: positives should be cited cross-domain patent families
    multi_positive_training: citation_family_multi_positive
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoDAPFAM
    source_urls:
      - label: DAPFAM arXiv
        url: https://arxiv.org/abs/2506.22141
      - label: DAPFAM DOI
        url: https://doi.org/10.1016/j.array.2026.100720
      - label: datalyes/DAPFAM_patent
        url: https://huggingface.co/datasets/datalyes/DAPFAM_patent
  references:
    - title: "DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval"
      url: https://arxiv.org/abs/2506.22141
      year: 2026
      doi: 10.1016/j.array.2026.100720
      is_paper: true
      source_confidence: definitive_paper_link
```