
NanoDAPFAM / NanoDAPFAMOutTitlAbsToFullText

Overview

NanoDAPFAMOutTitlAbsToFullText retrieves full-text patent-family records from title-abstract queries. It is an OUT-domain split: positives are citation-linked families without shared IPC3 classes.

Details

What the Original Data Measures

The source paper, "DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval", uses OUT-domain partitions to study cross-domain patent retrieval. This split tests whether a compact source summary (title and abstract) can retrieve long full-text patent records from a different technical class.
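The OUT-domain condition can be illustrated with a small check. This is a minimal sketch, assuming that "IPC3" means the first three characters of an IPC code (section plus class, such as `A61`). The helper names `ipc3` and `is_out_domain` are hypothetical and are not taken from the DAPFAM code.

```python
from typing import Iterable


def ipc3(code: str) -> str:
    """Return the assumed IPC3 prefix, e.g. 'A61K 31/00' -> 'A61'."""
    return code.strip().upper()[:3]


def is_out_domain(query_codes: Iterable[str], target_codes: Iterable[str]) -> bool:
    """True when the two families share no IPC3 class (assumed OUT-domain rule)."""
    query_classes = {ipc3(c) for c in query_codes}
    target_classes = {ipc3(c) for c in target_codes}
    return query_classes.isdisjoint(target_classes)


# Illustrative codes only; they are not taken from the dataset.
cross_domain = is_out_domain(["B62K 21/26"], ["D06N 3/00", "B32B 5/02"])
same_domain = is_out_domain(["A61K 31/00"], ["A61P 35/00", "C07D 1/00"])
```

Here `cross_domain` is `True` because `B62` does not appear among the target classes, while `same_domain` is `False` because both sides contain `A61`.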

Observed Data Profile

The Nano split has 200 queries, 10,000 documents, and 1,259 positive qrels. Queries average 786.64 characters, full-text target documents average 71,902.31 characters, and positives per query average 6.29.
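The derived figures follow directly from the counts above. The snippet below is a small arithmetic check using only numbers stated in this document; the variable names are illustrative.

```python
queries = 200
positive_qrels = 1_259
query_mean_chars = 786.64
document_mean_chars = 71_902.31

# Mean number of positives per query (the text reports this as 6.29).
positives_per_query = positive_qrels / queries

# How much longer the average full-text target is than the average query.
length_ratio = document_mean_chars / query_mean_chars

print(f"positives per query: {positives_per_query:.3f}")
print(f"document/query length ratio: {length_ratio:.1f}")
```

The average target document is roughly 91 times longer than the average query, which is the scale of the length mismatch a retriever has to bridge on this split.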

Because queries are short relative to the targets, the cross-domain signal is sparse. The passage that establishes the prior-art relation may sit deep inside a long description or claims section.

BM25 Difficulty

Using the dataset-provided BM25 candidate column and the best-ranked positive per query, BM25 reaches nDCG@10 = 0.1102 and hit@10 = 0.2100. It ranks a positive first for 8 queries and finds a positive in the top 100 for every query.

Full text gives BM25 more chances to match terms than title-abstract targets do, but cross-domain terminology keeps the top-10 hit rate low.
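As a rough illustration of how metrics of this kind can be derived from the best-ranked positive per query, the sketch below computes hit@k and a single-positive nDCG@k. It assumes that each query is scored only on its best-ranked positive, so the ideal DCG is 1. Whether the reported numbers were computed exactly this way is not confirmed by the source, and the rank list is made-up toy data.

```python
import math
from typing import Optional, Sequence


def hit_at_k(best_ranks: Sequence[Optional[int]], k: int = 10) -> float:
    """Fraction of queries whose best positive is ranked within the top k."""
    hits = sum(1 for r in best_ranks if r is not None and r <= k)
    return hits / len(best_ranks)


def single_positive_ndcg_at_k(best_ranks: Sequence[Optional[int]], k: int = 10) -> float:
    """Mean of 1 / log2(rank + 1) for best positives in the top k, else 0."""
    gains = [
        1.0 / math.log2(r + 1) if r is not None and r <= k else 0.0
        for r in best_ranks
    ]
    return sum(gains) / len(gains)


# Toy 1-based ranks of the best positive for four hypothetical queries.
toy_ranks = [1, 3, 12, 50]
toy_hit = hit_at_k(toy_ranks)                    # 2 of 4 queries hit the top 10
toy_ndcg = single_positive_ndcg_at_k(toy_ranks)  # (1.0 + 0.5 + 0 + 0) / 4
```

Under this formulation, a query whose best positive sits at rank 1 contributes 1.0, while one at rank 10 contributes about 0.29, so the metric rewards positives near the very top.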

Training Data That May Help

Helpful training data includes cross-domain title-abstract to full-text patent retrieval, cross-IPC citations, and prior-art search examples that require technical analogy.
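When assembling such training data, the `leakage_note` in the machine-readable metadata asks to exclude NanoDAPFAM evaluation family IDs, positives, and qrels. The following is a minimal sketch of that filter, assuming training pairs are stored as `(query_family_id, target_family_id)` tuples; the function name, the tuple layout, and the IDs are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (query_family_id, target_family_id), assumed layout


def drop_leaky_pairs(train_pairs: Iterable[Pair], eval_family_ids: Iterable[str]) -> List[Pair]:
    """Keep only pairs in which neither side is an evaluation family."""
    blocked = set(eval_family_ids)
    return [
        (query_id, target_id)
        for query_id, target_id in train_pairs
        if query_id not in blocked and target_id not in blocked
    ]


# Made-up family IDs for illustration.
candidate_pairs = [("F1", "F2"), ("F3", "F4"), ("F5", "F1")]
clean_pairs = drop_leaky_pairs(candidate_pairs, eval_family_ids={"F1"})
```

In this toy case only `("F3", "F4")` survives, because `F1` appears once as a query and once as a target in the other pairs.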

Synthetic Data Guidance

Generate short source patent summaries and long target records from different technical areas. Positives should share a mechanism, material, or prior-art role despite different domain labels.
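One way to keep generated pairs aligned with this guidance is to validate each candidate before using it. The sketch below is an assumption-laden example: the `SyntheticPair` fields, the `is_usable_positive` helper, and the `min_length_ratio` threshold are all hypothetical and not part of the dataset or its tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticPair:
    query_text: str      # short title-plus-abstract style summary
    target_text: str     # long full-text style record
    query_domain: str    # domain label of the source, e.g. an IPC3 class
    target_domain: str   # domain label of the target
    shared_aspect: str   # mechanism, material, or prior-art role linking them


def is_usable_positive(pair: SyntheticPair, min_length_ratio: float = 10.0) -> bool:
    """Accept a pair only if it is cross-domain, linked, and short-to-long."""
    if pair.query_domain == pair.target_domain:
        return False  # same domain label, so not an OUT-domain example
    if not pair.shared_aspect.strip():
        return False  # nothing connects the two records
    if not pair.query_text:
        return False  # an empty query cannot retrieve anything
    return len(pair.target_text) >= min_length_ratio * len(pair.query_text)


# Placeholder texts; real generation would supply actual patent-style content.
example = SyntheticPair(
    query_text="woven fiber grip sleeve " * 5,
    target_text="reinforced flexible membrane description " * 200,
    query_domain="B62",
    target_domain="D06",
    shared_aspect="interwoven fiber reinforcement layer",
)
example_ok = is_usable_positive(example)
```

The length-ratio check is only a coarse proxy for the short-query, long-target shape of this split; the threshold should be tuned to the generator being used.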

Example Data

| Query | Positive document |
| --- | --- |
| bicycle handlebar grip a bicycle handlebar grip contains a plastic inner shell having a tubular shape and an outer surface; a fiber layer having an inner surface and an outer surface and includes a plurality of fibers interwe ... [truncated 225 chars](821 chars) | durable flexible membrane and method of making same a flexible membrane having a valuable combination of desirable properties is composed of a generally heavy, dense supporting and reinforcing reticulated base fabric constitu ... [truncated 225 chars](28042 chars) |
| method for improving belt press dewatering a method for increasing the removal of a higher fraction of liquid from the press cake in any belt press is described. specifically, the invention incorporates a series of rollers th ... [truncated 225 chars](620 chars) | artificial human anti-factor b antibody problem to be solved: to provide novel engineered forms of a monoclonal antibody and antigen-binding fragments thereof that bind complement protein factor b and selectively inhibit the ... [truncated 225 chars](108109 chars) |
| stitch distribution control system for tufting machines a stitch distribution control system for a tufting machine for controlling placement of yarns being fed to the needles of the tufting machine by yarn feed mechanisms to ... [truncated 225 chars](647 chars) | method and apparatus for measuring direction or position of weft yarn of fabric the measurement of the pick or stitches course position in continuously moved fabrics involves examining at least one gap-shaped segment in a top ... [truncated 225 chars](24253 chars) |
| low weight carpet and carpet tile and methods of manufacture low weight and non-square carpet tile suitable for use in mass transit vehicles, particularly passenger aircraft. the carpet tile preferably weighs less than about ... [truncated 225 chars](565 chars) | modular floor covering units with built-in lighting an apparatus for guiding the occupants of a structure along a path of travel within the structure is provided. the apparatus is comprised of modular floor covering units whi ... [truncated 225 chars](35319 chars) |
| method and apparatus for the zonal transmission of data using building lighting fixtures this invention relates to the zonal transmission of data by the modulation of the light output of arc lamps (150) or discharge lamps; li ... [truncated 225 chars](969 chars) | shelf tag with ambient light detector the present invention relates to an electronic shelf display device which includes an optical device and an ambient light detector circuitry. the electronic shelf display device includes ... [truncated 225 chars](54320 chars) |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoDAPFAM |
| Backing dataset | NanoDAPFAM |
| Task / split | NanoDAPFAMOutTitlAbsToFullText |
| Hugging Face dataset | hakari-bench/NanoDAPFAM |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 10000 |
| Positive qrels | 1259 |
| Positives per query | avg 6.29, min 1, median 4.0, max 20 |
| BM25 nDCG@10 | 0.1102 |
| BM25 hit@10 | 0.2100 |
| Query length avg chars | 786.64 |
| Document length avg chars | 71902.31 |

Public Sources

Hugging Face Links

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval | 2026 | arXiv paper | https://arxiv.org/abs/2506.22141 |
| DAPFAM DOI record | 2026 | DOI | https://doi.org/10.1016/j.array.2026.100720 |
| datalyes/DAPFAM_patent | 2025 | dataset card | https://huggingface.co/datasets/datalyes/DAPFAM_patent |

Machine-Readable Metadata

benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoDAPFAM
  backing_dataset: NanoDAPFAM
  dataset_id: hakari-bench/NanoDAPFAM
  task_name: NanoDAPFAMOutTitlAbsToFullText
  split_name: NanoDAPFAMOutTitlAbsToFullText
  language: en
  category: natural_language
  document_path: docs/benchmark_tasks/NanoDAPFAM/NanoDAPFAMOutTitlAbsToFullText.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
  counts:
    queries: 200
    documents: 10000
    positive_qrels: 1259
  positives_per_query:
    average: 6.295
    min: 1
    median: 4.0
    max: 20
    multi_positive_queries: 139
  text_stats_chars:
    query_mean: 786.64
    document_mean: 71902.3141
  bm25:
    ndcg_at_10: 0.1101697656
    hit_at_10: 0.21
    source: dataset_bm25_column
  learning:
    original_train_split: not_confirmed
    evaluation_split_origin: DAPFAM OUT-domain title-abstract to full-text patent-family retrieval
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoDAPFAM evaluation family IDs, positives, and qrels
    useful_training_data:
      - cross-domain title-abstract patent retrieval
      - cross-IPC patent citation pairs
      - long-target prior-art search
    synthetic_data:
      document_generation: long full-text patent records from different technical classes
      question_generation: compact source patent title and abstract summaries
      answerability: positives should be cited cross-domain patent families
    multi_positive_training: citation_family_multi_positive
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoDAPFAM
    source_urls:
      - label: DAPFAM arXiv
        url: https://arxiv.org/abs/2506.22141
      - label: DAPFAM DOI
        url: https://doi.org/10.1016/j.array.2026.100720
      - label: datalyes/DAPFAM_patent
        url: https://huggingface.co/datasets/datalyes/DAPFAM_patent
  references:
    - title: "DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval"
      url: https://arxiv.org/abs/2506.22141
      year: 2026
      doi: 10.1016/j.array.2026.100720
      is_paper: true
      source_confidence: definitive_paper_link